THOSE OF US WHO ARE into graphics processors sure have it easy these days. We're addicted to advances in technology, and we seem to get a hit every couple of months or so in the form of a new spin of a GPU or graphics card. What's more, graphics chip companies AMD and Nvidia cook up an entirely new form of drug with startling frequency. Less than two years since the debut of the Radeon X1000 series, the former ATI is now ready to release its new generation of technology, the Radeon HD family. What's more, we've been complaining because the thing is actually late. Nvidia's competing GeForce 8-series technology hit the market last November, and GPU-watchers have been waiting impatiently for the battle to be joined.
Who could blame us, really, for being impatient? The GeForce 8800 is a stunning achievement, and we're eager to see whether AMD can match it. You'll have to forgive the most eager among us, the hollow-eyed Radeon fanboys inhabiting the depths of our forums, wandering aimlessly while carrying their near-empty bottles of X1000-series eye candy and stopping periodically to endure an episode of the shakes. We've all heard the stories about AMD's new GPU, code-named R600, and wondered what manner of chip it might be. We've heard whispers of jaw-dropping potential, but—especially as the delays piled up—doubts crept in, as well.
Happily, R600 is at last ready to roll. The Radeon HD 2900 XT graphics card should hit the shelves of online stores today, and we have spent the past couple of weeks dissecting it. Has AMD managed to deliver the goods? Keep reading for our in-depth review of the Radeon HD 2900 XT.
Into the R600
The R600 is easily the biggest technology leap from the Radeon folks since the release of the seminal R300 GPU in the Radeon 9700, and it's also the first Radeon since that represents a true break from R300 technology. That's due in part to the fact that R600 is designed to work in concert with Microsoft's DirectX 10 graphics programming interface, which modifies the traditional graphics pipeline to unlock more programmability and flexibility. As a state-of-the-art GPU, the R600 is also a tremendously powerful parallel computing engine. We're going to look at some aspects of R600 in some detail, but let's start with an overview of the entire chip, so we have a basis for the rest of our discussion.
The R600's most fundamental innovation is the introduction of a unified shader architecture that can process the three types of graphics programs—pixel shaders, vertex shaders, and geometry shaders—established by DX10's Shader Model 4.0 using a single type of processing unit. This arrangement allows for dynamic load balancing between these three thread types, making it possible for R600 to bring the majority of its processing power to bear on the most urgent computational need at hand during the rendering of a frame. In theory, a unified shader architecture can be vastly more efficient and effective than a GPU with fixed shader types, as all DX9-class (and prior) desktop GPUs were.
A high-level diagram of the R600 architecture like the one above will no doubt invoke memories of ATI's first unified shader architecture, the Xenos GPU inside the Xbox 360. The basic arrangement of functional units looks very similar, but R600 is in fact a new and different design in key respects like shader architecture and thread dispatch. One might also wish to draw parallels to the unified shader architecture of Nvidia's G80 GPU, but the R600 arranges its execution resources quite differently from G80, as well. In its GeForce 8800 GTX incarnation, the G80 has 128 scalar stream processors running at 1.35GHz. The R600 is more parallel and runs at lower frequencies; AMD counts 320 stream processors running at 742MHz on the Radeon HD 2900 XT. That's not an inaccurate portrayal of the GPU's structure, but there's much more to it than that, as we'll discuss briefly.
First, though, let's have a look at the R600 chip itself, because, well, see for yourself.
Like the G80, it's frickin' huge. With the cooler removed, you can see it from space. AMD estimates the chip at 700 million transistors, and TSMC packs those transistors onto a die using an 80nm fab process. I measured the R600 at roughly 21 mm by 20 mm, which works out to 420 mm².
I'd like to give you a side-by-side comparison with the G80, but that chip is typically covered by a metal cap, making pictures and measurements difficult. (Yes, I probably should sacrifice a card for science, but I haven't done it yet.) Nvidia says the G80 has 680 million transistors, and it's produced on a larger 90nm fab process at TSMC. I've seen die size estimates for G80 that range from roughly 420 to 490 mm², although Nvidia won't confirm exact numbers. R600, however, doesn't have to rely on a separate chip to provide display logic, so it's almost certainly smaller overall.
Command processing, setup, and dispatch
I continue to be amazed by the growing amount of disclosure we get from AMD and Nvidia as they introduce ever more complex GPUs, and R600 is no exception on that front. At its R600 media event, AMD chip architect Eric Demers gave the assembled press a whirlwind tour of the GPU, most of which whizzed wholly unimpeded over our heads. I'll try to distill down the bits I caught with as much accuracy as I can.
Our tour of the R600 began, appropriately, with the GPU's command processor. Demers said previous Radeons have also had logic to process the command stream from the graphics driver, but on the R600, this is actually a processor; it has memory, can handle math, and downloads microcode every time it boots up. The reason this command processor is so robust is so it can offload work from the graphics driver. In keeping with a DirectX 10 theme, it's intended to reduce state management overhead. DirectX 9 tends to group work in lots of small batches, creating substantial overhead just to manage all of the objects in a scene. That work typically falls to the graphics driver, burdening the CPU. Demers described the R600 command processor as "somewhat self-aware," snooping to determine and manage state itself. The result? A claimed reduction in CPU overhead of up to 30% in DirectX 9 applications, with even less overhead in DX10.
Next in line beyond the command processor is the setup engine, which prepares data for processing. It has three functions for DX10's three shader program types: vertex assembly (for vertex shaders), geometry assembly (for geometry shaders), and scan conversion and interpolation (for pixel shaders). Each function can submit threads to the dispatch processor.
One item of note near the vertex assembler is a dedicated hardware engine for tessellation. This unit is a bit of secret sauce for AMD, since the G80 doesn't have anything quite like it. The tessellator allows for the use of very high polygon surfaces with a minimal memory footprint by using a form of compression. This hardware takes two inputs—a low-poly model and a mathematical description of a curved surface—and outputs a very detailed, high-poly model. AMD's Natalya Tatarchuk showed a jaw-dropping demo of the tessellator in action, during which I kept thinking to myself, "Man, I wish she'd switch to wireframe mode so I could see what's going on." Until I realized the thing was in wireframe mode, and the almost-solid object I was seeing was comprised of millions of polygons nearly the size of a pixel.
This tessellator may live in a bit of an odd place for this generation of hardware. It's not a part of the DirectX 10 spec, but AMD will expose it via vertex shader calls for developers who wish to use it. We've seen such features go largely unused in the past, but AMD thinks we might see games ported from the Xbox 360 using this hardware since the Xbox 360 GPU has a similar tessellator unit. Also, tessellation capabilities are a part of Microsoft's direction for future incarnations of DirectX, and AMD says it's committed to this feature for the long term (unlike the ill-fated Truform feature that it built into the original Radeon hardware, only to abandon it in the subsequent generation). We'll have to see whether game developers use it.
The setup engine passes data to the R600's threaded dispatch processor. This part of the GPU, as Demers put it "is where the magic is." Its job is to keep all of the shader cores occupied, which it does by managing a large number of threads of three different types (vertex, geometry, and pixel shaders) and switching between them. The R600's dispatch processor keeps track of "hundreds of threads" in flight at any given time, dynamically deciding which ones should execute and which ones should go to sleep depending on the work being queued, the availability of data requested from memory, and the like. By keeping a large number of threads in waiting, it can switch from one to another as needed in order to keep the shader processors busy.
The thread dispatch process involves multiple levels of arbitration between the three thread types in waiting and the work already being done. Each of the R600's four SIMD arrays of shader processors has two arbiter units associated with it, as the diagram shows, and each one of those has a sequencer attached. The arbiter decides which thread to process next based on a range of variables, and the sequencer then determines the best ordering of instructions for execution of that thread. The SIMD arrays are pipelined, and the two arbiters per SIMD allow for execution of two different threads in interleaved fashion. Notice, also, that vertex and texture fetches have their own arbiters, so they can run independently of shader ops.
As you may be gathering, this dispatch processor involves lots of complexity and a good deal of mystery about its exact operation, as well. Robust thread handling is the reason why GPUs are very effective parallel computing devices, because they can keep themselves very well occupied. If a thread has to stop and wait for the retrieval of data from memory, which can take hundreds of GPU cycles, other threads are ready and waiting to execute in the interim. This logic almost has to occupy substantial amounts chip area, since the dispatch processor must keep track of all of the threads in flight and make "smart" decisions about what to do next.
In its shader core, the R600's most basic unit is a stream processing block like the one depicted in the diagram on the right. This unit has five arithmetic logic units (ALUs), arranged together in superscalar fashion—that is, each of the ALUs can execute a different instruction, but the instructions must all be issued together at once. You'll notice that one of the five ALUs is "fat." That's because this ALU's capabilities are a superset of the others'; it can be called on to handle transcendental instructions (like sine and cosine), as well. All four of the others have the same capabilities. Optimally, each of the five ALUs can execute a single multiply-add (MAD) instruction per clock on 32-bit floating-point data. (Like G80, the R600 essentially meets IEEE 754 standards for precision.) The stream processor block also includes a dedicated unit for branch execution, so the stream processors themselves don't have to worry about flow control.
These stream processor blocks are arranged in arrays of 16 on the chip, for a SIMD (single instruction multiple data) arrangement, and are controlled via VLIW (very long instruction word) commands. At a basic level, that means as many as six instructions, five math and one for the branch unit, are grouped into a single instruction word. This one instruction word then controls all 16 execution blocks, which operate in parallel on similar data, be it pixels, vertices, or what have you.
The four SIMD arrays on the chip operate independently, so branch granularity is determined by the width of the SIMD and the depth of the pipeline. For pixel shaders, the effective "width" of the SIMD should typically be 16 pixels, since each stream processor block can process a single four-component pixel (with a fifth slot available for special functions or other tasks). The stream processor units are pipelined with eight cycles of latency, but as we've noted, they always execute two threads at once. That makes the effective instruction latency per thread four cycles, which brings us to 64 pixels of branch granularity for R600. Some other members of the R600 family have smaller SIMD arrays and thus finer branch granularity.
Let's stop and run some numbers so we can address the stream processor count claimed by AMD. Each SIMD on the R600 has 16 of these five-ALU-wide superscalar execution blocks. That's a total of 80 ALUs per SIMD, and the R600 has four of those. Four times 80 is 320, and that's where you get the "320 stream processors" number. Only it's not quite that simple.
The superscalar VLIW design of the R600's stream processor units presents some classic challenges. AMD's compiler—a real-time compiler built into its graphics drivers—will have to work overtime to keep all five of those ALUs busy with work every cycle, if at all possible. That will be a challenge, especially because the chip cannot co-issue instructions when one is dependent on the results of the other. When executing shaders with few components and lots of dependencies, the R600 may operate at much less than its peak capacity. (Cue sounds of crashing metal and human screams alongside images of other VLIW designs like GeForce FX, Itanium, and Crusoe.)
The R600 has many things going for it, however, not least of which is the fact that the machine maps pretty well to graphics workloads, as one might expect. Vertex shader data often has five components and pixel shader data four, although graphics usage models are becoming more diverse as programmable shading takes off. The fact that the shader ALUs all have the same basic set of capabilities should help reduce scheduling complexity, as well.
Still, Nvidia has already begun crowing about how much more efficient and easier to utilize its scalar stream processors in the G80 are. For its part, AMD is talking about potential for big performance gains as its compiler matures. I expect this to be an ongoing rhetorical battle in this generation of GPU technology.
So how does R600's shader power compare to G80? Both AMD and Nvidia like to throw around peak FLOPS numbers when talking about their chips. Mercifully, they both seem to have agreed to count programmable operations from the shader core, bracketing out fixed-function units for graphics-only operations. Nvidia has cited a peak FLOPS capacity for the GeForce 8800 GTX of 518.4 GFLOPS. The G80 can co-issue one MAD and one MUL instruction per clock to each of its 128 scalar SPs. That's three operations (multiply-add and multiply) per cycle at 1.35GHz, or 518.4 GFLOPS. However, the guys at B3D have shown that that extra MUL is not always available, which makes counting it questionable. If you simply count the MAD, you get a peak of 345.6 GFLOPS for G80.
By comparison, the R600's 320 stream processors running at 742MHz give it a peak capacity of 475 GFLOPS. Mike Houston, the GPGPU guru from Stanford, told us he had achieved an observed compute throughput of 470 GFLOPS on R600 with "just a giant MAD kernel." So R600 seems capable of hitting something very near its peak throughput in the right situation. What happens in graphics and games, of course, may vary quite a bit from that.
The best way to solve these shader performance disputes, of course, is to test the chips. We have a few tests that may give us some insight into these matters.
The Radeon 2900 XT comes out looking good in 3DMark's vertex shader tests, solidly ahead of the GeForce 8800 GTX. Oddly, though, we've seen similar or better performance in this test out of a mid-range GeForce 8600 GTS than we see here from the 8800 GTX. The GTX may be limited by other factors here or simply not allocating all of its shader power to vertex processing.
The tables turn in 3DMark's pixel shader test, and the R600 ends up in a virtual dead heat with the GeForce 8800 GTS, a cut-down version of the G80 with only 96 stream processors.
This particle test runs a physics simulation in a shader, using vertex texture fetch to store and access the results. Here, the Radeon 2900 XT is slower than the 8800 GTX, but well ahead of the GTS. The Radeon X1950 XTX can't participate since lacks vertex texture fetch.
Futuremark says the Perlin noise test "computes six octaves of 3-dimensional Perlin simplex noise using a combination of arithmetic instructions and texture lookups." They expect such things for become popular in future games for use in procedural modeling and texturing, although procedural texturing has always been right around the corner and never seems to make its way here. If and when it does, the R600 should be well prepared, because runs this shader quite well.
The best way to solve these shader performance disputes, of course, is to test the chips. We have a few tests that may give us some insight into these matters.
Next up is a series of shaders in ye old ShaderMark, a test that's been around forever but may yet offer some insights.
The Radeon HD 2900 XT lands somewhere north of the GeForce 8800 GTS, but it can't match the full-fledged G80 in ShaderMark generally.
ShaderMark also gives us an intriguing look at image quality by quantifying how closely each graphics cards' output matches that of Microsoft's reference rasterizer for DirectX 9. We can't really quantify image quality, but this does tell us something about the computational precision and adherence to Microsoft's standards in these GPUs.
DirectX 10 has much tighter standards for image quality, and these DX10-class GPUs are remarkably close together, both overall and in individual shaders.
Finally, here's a last-minute addition to our shader tests courtesy of AMD. Apparently already aware of the trash talk going on about the potential scheduling pitfalls of its superscalar shading core, AMD sent out a simple set of DirectX 10 shader tests in order to prove a point. I decided to go ahead and run these tests and present you with the results, although the source of the benchmarks is not exactly an uninterested third party, to say the least. The results are informative, though, because they present some difficult scheduling cases for the R600 shader core. You can make of them what you will. First, the results, and then the test explanations:
The first thing to be said is that G80 again appears to be limited somehow in its vertex shader performance, as we saw with 3DMark's vertex tests. That hasn't yet been an issue for the G80 in real-world games, so I'd say the pixel shader results are the more interesting ones. Here are AMD's explanations of the tests, edited and reformatted for brevity's sake:
1) "float MAD serial" - Dependant Scalar Instructions Basically this test issues a bunch of scalar MAD instructions that are sequentially executed. This way only one out of 5 slot of the super-scalar instruction could be utilized. This is absolutely the worst case that would rarely be seen in the real-world shaders.
The GeForce 8800 GTX is just under three times the speed of the Radeon HD 2900 XT in AMD's own worst-case scenario, the float MAD serial with dependencies preventing superscalar parallelism. From there, the R600 begins to look better. The example of the float4 MAD parallel is impressive, since AMD's compiler does appear to be making good use of the R600's potential when compared to G80. The next two floating-point tests make use of the "fat" ALU in the R600, and so the R600 looks quite good.
2) "float4 MAD parallel" - Vector Instructions This test issues 2 sequences of MAD instructions operating on float4 vectors. The smart compiler in the driver is able to split 4D vectors among multiple instructions to fill all 5 slots. This case represents one of the best utilization cases and is quite representative of instruction chains that would be seen in many shaders. This also demontrates [sic] the flexibility of the architecture where not only trivial case like 3+2 or 4+1 can be handled.
3) "float SQRT serial" - Special Function This is a test that utilizes the 5th "supped up" [sic] scalar instruction slot that can execute regular (ADD, MUL, and etc.) instructions along with transcendental instructions.
4) "float 5-instruction issue" - Non Dependant Scalar Instructions This test has 5 different types of scalar instructions (MUL, MAD, MIN, MAX, SQRT), each with it's own operand data, that are co-issued into one super-scalar instruction. This represents a typical case where in-driver shader compiler is able to co-issue instructions for maximal efficiency. This again shows how efficiently instructions can be combined by the shader compiler.
5) "int MAD serial" - Dependant DX10 Integer Instructions This test shows the worst case scalar instruction issue with sequential execution. This is similar to test 1, but uses integer instructions instead of floating point ones.
6) "int4 MAD parallel" - DX10 Integer Vector Instructions Similar to test 2, however integer instructions are used instead of floating point ones.
We get the point, I think. Computationally, the R600 can be formidable. One worry is that these shaders look to be executing pure math, with no texture lookups. We should probably talk about texturing rather than dwell on these results.