Riders On The Stream

We take more than a peek at the innards of the latest high-horsepower, pixel-grinding top guns as NVIDIA’s G80 squares off against ATI’s R600

It’s been an impressive sight to anyone even loosely following developments in the graphics industry. NVIDIA’s G71 and ATI’s R580 nearly doubled performance over their predecessors in the games of the day, and visuals improved dramatically. There was, however, a limit to what either vendor could do with DirectX 9.0’s Shader Model 3.

Then Windows Vista was announced, along with the new, backwards-compatible DirectX 10. Speculation ran wild as everyone tried their hand at calculated guesses about what the new standard would amount to in gaming terms. More importantly, who would emerge top dog: NVIDIA or ATI, two companies that have been slugging it out since the DX 8.1 days? With both companies now offering DX 10-capable GPUs, much of that speculation has been laid to rest. Some uncertainty remains, though, because we still don’t know how the latest cards perform under DX 10. Here, we hope to clarify things by peeking under the bonnets of ATI’s R600 and NVIDIA’s G80.

A Demystification Of The Basics
Shaders: Colouring at the speed of light!
As we all know, while rendering any 3D scene, texture (pixel) and geometry (vertex) information is sent to the graphics card for processing. Earlier cards had a set of fixed algorithms hard-wired into them to process this data, known as the FFP (Fixed Function Pipeline). Game programmers could only select from the limited functions hard-wired onto those cards. Restrictions on the programmer and limitations in the flexibility of the code itself meant games couldn’t look as realistic as developers wanted.

Shaders were developed in 2001, and their sole purpose was to give programmers more liberty with the code. There are primarily two types of shaders: Pixel and Vertex. On a software level, a shader is in essence a small algorithm that describes mathematically how a particular material or texture is rendered onto an object. The shader algorithm also specifies how light affects the appearance of the object, so a shader actually defines an object’s physical properties in a 3D scene. This shader code is loaded into the graphics card’s memory, and from there fed directly into the graphics pipeline, where it is executed.

On a hardware level, a shader unit is essentially a little engine of its own that can process a particular type of 3D information. Shader units are designed to execute shader code based on a particular API, or compatible with a certain set of APIs, such as OpenGL and Microsoft’s DX. A graphics card contains multiple shader units, enabling it to process large volumes of data in parallel. DX 10 makes programmers’ lives even easier, as there is no distinction between pixel and vertex shaders at the hardware level. Some of the software restrictions have also been removed. This gives coders the liberty to implement new and improved techniques for better visual quality… the Holy Grail for developers and gamers.
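To make the idea concrete, here is a minimal sketch, in plain C++ rather than a real shader language such as HLSL or GLSL, of the kind of per-pixel routine a pixel shader encodes: take a surface colour, a surface normal and a light direction, and work out how bright that pixel should appear. The structure and names are purely illustrative.

    #include <algorithm>
    #include <cstdio>

    struct Vec3 { float x, y, z; };

    // N.L, the workhorse of diffuse lighting.
    float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    // One pixel's colour: the material colour scaled by how directly the surface faces the light.
    // A real pixel shader runs this kind of routine once for every pixel it touches.
    Vec3 shadePixel(const Vec3& surfaceColour, const Vec3& normal, const Vec3& lightDir) {
        float diffuse = std::max(0.0f, dot(normal, lightDir));  // surfaces facing away get no light
        return { surfaceColour.x * diffuse, surfaceColour.y * diffuse, surfaceColour.z * diffuse };
    }

    int main() {
        Vec3 red    = {1.0f, 0.0f, 0.0f};
        Vec3 normal = {0.0f, 1.0f, 0.0f};        // surface pointing straight up
        Vec3 light  = {0.0f, 0.7071f, 0.7071f};  // light shining in at 45 degrees
        Vec3 out    = shadePixel(red, normal, light);
        std::printf("%.2f %.2f %.2f\n", out.x, out.y, out.z);  // about 0.71 0.00 0.00
        return 0;
    }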

Graphics Pipeline: But Where Are The Pipes?
This isn’t a real physical pipe per se. The word pipeline refers to the individual stages required to perform a much larger task. The key thing to remember is that a graphics card (much like a CPU) is rapidly fed with a lot of data to be processed, and this repetitive task demands some streamlining. Each pipeline stage represents a single task to be performed in order to give you that gorgeous display on-screen. Multiple pipelines get fed with data and instructions; they work on their specialised tasks, complete them, and pass the results forward. In a nutshell, the graphics pipeline is tasked with accepting some form of a 3D scene and processing it to deliver a 2D image to your monitor. Keep in mind that this is not a simple process, since every object in a rendered 3D scene is a completely discrete 3D model, referred to as an entity, and involves the following:

  • The Transformers: Transforming objects in part or whole from their original space to the desired space in the 3D scene. This part of the process involves scaling. Size scaling of 3D objects is not easy, as every entity has its own coordinate system (remember Coordinate Geometry in high school maths?), and the entire object has to be accurately rendered from each and every angle while maintaining this scale. This stage of the pipeline is also tasked with rotating objects in the scene, about the X-, Y- or Z-axis, or any combination of them (see the matrix sketch after this list).
  • The Wide Angle: The next step involves the viewer and his perspective, orienting everything in the scene to that view; this is called the camera angle. It also involves scaling, rotation, and even translation (movement along a defined axis). All this is done to match the user’s field of view: nearer objects appear larger than objects far away.

    Matters of the Core
    Parameters                   | 8800 Ultra | 8800GTS | HD2900XT | X1950XTX | 7900GTX
    No. of Transistors (million) | 681        | 681     | 700      | 384      | 278
    Core Speed (MHz)             | 612        | 500     | 740      | 650      | 650
    Number of Shader Units       | 128        | 96      | 320      | 48 + 8   | 24 + 8
    Shader Unit Speeds (MHz)     | 1500       | 1200    | *        | *        | *
    Memory Bandwidth (GB/s)      | 103.2      | 64      | 106      | 51.2     | 64
    No. of TMUs                  | 32         | 24      | 16       | 16       | 24
    No. of ROPs/clock            | 24         | 20      | 16       | 16       | 16
    Number of Z Compare Units    | 192/48     | 160/40  | 16/32    | 16       | 32
    (Shader counts such as 48 + 8 denote pixel + vertex shaders on the DX9 parts; * indicates no separate shader clock.)
  • Exterminatus Triangulus: The third stage involves efficiency management. It’s time to remove all the hidden triangles that uselessly compose invisible objects. There’s no use rendering the back surface of an object, or the detail behind a wall facing the viewer, is there? This stage, called Occlusion Culling, reduces the work the graphics card has to do. In modern games, where transparent objects may themselves have other objects visible through them, this part of the pipeline is very important: fish swimming in a pond, for example, cannot be treated as invisible.
  • Light My Fire: Next, the scene has to be lit (illuminated). This is done keeping in mind the locations of the light sources in the scene and their effects on the objects. The properties of each object, such as its colour, reflectance, and refraction, among other surface properties, in response to the light from a particular source, have to be taken into account. Take into consideration six different light sources of varying intensities and you’ll see how difficult this stage can become! Once again, the triangle-cutting scalpel is applied to speed things up by reducing GPU load: only what appears to the viewer from his point of view needs to be lit.
  • Lose that Z axis: This involves the final conversion from 3D to the 2D of your monitor. All the data held in textures and vertices must now be converted to pixels displayable by your monitor. Then there are the final touches to be added, including Anisotropic Filtering, Anti-Aliasing, shadows, fog, Alpha Blending, shading, filtering, stencils, and depth, among others. This entire stage is called Rasterization.
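To give a flavour of the transformation stages above (the matrix sketch promised earlier), here is how a single vertex is pushed through a 4x4 matrix in C++. This is only an illustration of the underlying maths, not of how either GPU implements it; scaling, rotation, translation and the camera projection are all just different matrices fed to the same multiply.

    #include <cstdio>

    // A vertex position in homogeneous coordinates (x, y, z, w).
    struct Vec4 { float v[4]; };
    // A 4x4 transformation matrix, stored row-major.
    struct Mat4 { float m[4][4]; };

    // Every transform in the pipeline (scaling, rotation, translation, the camera view,
    // the perspective projection) boils down to this multiply.
    Vec4 transform(const Mat4& M, const Vec4& p) {
        Vec4 r{};
        for (int row = 0; row < 4; ++row)
            for (int col = 0; col < 4; ++col)
                r.v[row] += M.m[row][col] * p.v[col];
        return r;
    }

    int main() {
        // Illustrative matrix: scale a vertex to twice its size and shift it 3 units along X.
        Mat4 scaleAndMove = {{{2, 0, 0, 3},
                              {0, 2, 0, 0},
                              {0, 0, 2, 0},
                              {0, 0, 0, 1}}};
        Vec4 vertex = {{1, 1, 1, 1}};
        Vec4 out = transform(scaleAndMove, vertex);
        std::printf("(%.0f, %.0f, %.0f)\n", out.v[0], out.v[1], out.v[2]);  // prints (5, 2, 2)
        return 0;
    }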

With the basics clear, we can move on.

Making Shade(r) Light(er)
TMU: A Texture Mapping Unit fetches texture data (bitmaps) and maps it onto the polygons of a 3D scene, adding surface detail to the rendered image.

Primitive: In 3D animation, this means the most elemental level of detail, the most basic of building blocks. A point, a line, a polygon, a bitmap: all of these are Primitives.

MADD: Stands for “Multiply-Add”. This is a floating-point operation consisting of a multiplication and an addition fused together, mainly because it is more efficient for a processing unit to execute than a separate multiply and a separate add. It is also called a multiply-accumulate, and in terms of algebra is denoted by (a * b) + c, where “c” is added to the product of a and b.
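As a small illustration, here is the same operation in C++ using the standard library’s fused multiply-add. This is just a sketch of the concept; on the GPU, MADD is a hardware instruction executed by a shader unit, not a library call.

    #include <cmath>
    #include <cstdio>

    int main() {
        float a = 2.0f, b = 3.0f, c = 4.0f;

        // Two separate steps: a multiply, then an add (with a rounding step in between).
        float separate = a * b + c;

        // One fused multiply-add: the same (a * b) + c computed as a single operation,
        // which is what a shader unit's MADD instruction does every clock.
        float fused = std::fma(a, b, c);

        std::printf("%.1f %.1f\n", separate, fused);  // both print 10.0
        return 0;
    }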

Transcendental Operations: The term refers to higher-order operations; very simply, operations requiring complex computation to arrive at a result, for example sin(), cos(), log(), and exp(), as opposed to simpler arithmetic operations like power(), multiply(), divide(), and add().

SIMD: The acronym for Single Instruction, Multiple Data. This is a technique where a single instruction performs the same function simultaneously on multiple streams of data. It’s used in image and video processing, where the same operation is applied to large numbers of identical data elements (in this case, pixels).
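Here is a minimal sketch of the idea using x86 SSE intrinsics in C++, chosen purely for illustration (the SIMD blocks inside a GPU are far wider): one instruction performs four additions at once.

    #include <xmmintrin.h>  // SSE intrinsics
    #include <cstdio>

    int main() {
        // Four pixel brightness values processed by ONE instruction (_mm_add_ps).
        __m128 pixels = _mm_set_ps(0.2f, 0.4f, 0.6f, 0.8f);
        __m128 boost  = _mm_set1_ps(0.1f);
        __m128 result = _mm_add_ps(pixels, boost);   // a single instruction, four additions

        float out[4];
        _mm_storeu_ps(out, result);
        std::printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);  // 0.9 0.7 0.5 0.3
        return 0;
    }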

MRT: The acronym for Multiple Render Targets. This is a shader-hardware feature that conserves resources by letting a single rendering pass write its output to several render targets (textures or buffers) at once, instead of re-rendering the scene once for each output.


Unlikely Twins

Let’s take a look at the salient core specifications of the G80 and R600, with the R580 (X1950XTX) and G71 (7900GTX) thrown in for reference.

Some (rare!) common ground, and a very significant addition over all previous-generation hardware, is the Geometry Shader unit, present on both cards. Geometry shaders are an extension of vertex shaders; they can work on many vertices at once (a mesh, in 3D terms), unlike vertex shaders, which work on a single vertex. The unified shaders on the R600 and G80 and all their derivatives can now perform pixel, vertex, or geometry operations. This translates to efficient use of shader units: the load is balanced between pixel, vertex, and geometry work, and none of the shader units is left idle. Traditionally, on a heavily pixel-bound frame, the vertex shader units would sit idle while still powered up (consuming power).

The R600, a.k.a. the HD2900XT, is built on an 80nm fabrication process that propels it to higher clock speeds within acceptable thermal limits, as opposed to the 90nm process of the 8800GTX/Ultra, a.k.a. the G80.

With more than double the shader units and a higher clock, not to mention the 512-bit memory interface, the HD2900XT might seem to bury the 8800GTX, but this isn’t an apples-to-apples comparison; the two companies’ approaches to designing and implementing DX10 parts are quite different. The 320 vs. 128 shader-unit argument is equally pointless, and as you will see, there is much more at play than simple on-paper specs.

Is 320 Equal To 128?
The more shaders the merrier is a general rule of thumb. This is because a single stream processor can do only so much work in a single clock. Pixel operations in particular are complex and multi-threaded, often needing more than one SP working on the same pixel data (called a dependency).

ATI’s 320 Stream Processors:
Is more necessarily better?
ATI has taken pains to explain that most of the Stream Processors (hereon referred to as SPs) on the R600 aren’t capable of special-function operations; that is, they are simpler than the SPs on the G80. If you look at the diagram alongside, you’ll see that five of these SPs are bunched together in a cluster, and only one SP in the block (the large one) can handle either a regular floating-point (FP) operation such as a MADD or a special-function operation such as SIN (sine), COS (cosine), LOG (logarithm), or EXP (exponent). ATI calls this the special function SP. The other four SPs can handle only simpler operations such as ADD and MUL. There is a sixth unit in the cluster, a branch execution unit responsible for flow-control operations like looping, comparisons, and calling subroutines.

Another limiting factor is that each cluster of five SPs can handle only a single thread, be it vertex, pixel, or primitive, in a given cycle. This essentially means that each bunch of five SPs is handling only a single FP operation per cycle, while a single SP in the bunch can do a transcendental operation (read: special operation).

While ATI has taken pains to ensure that each cluster of five SPs is always busy, it is extremely difficult to ensure that each SP in a single cluster is always occupied. The R600 therefore essentially consists of 64 SPs with five ALUs (Arithmetic Logic Units) each, though ATI prefers to quote the total number of SPs as 64 x 5 = 320.

NVIDIA’s 128 Stream Processors:
Parallelism unparalleled?
At the very outset, NVIDIA designed the SPs on the G80 to be independent. No clusters here: just 128 SPs (96 on the 8800GTS), each of which can perform a single operation per cycle, or two if one of them is a simple one. This operation could be either an FP operation (like a MADD) or a special function (SIN, COS, etc.). A single SP can complete a MADD and one simpler operation like a MUL in a single clock cycle. So we can have 128 threads running in parallel in a given clock cycle. However, this doesn’t mean 128 discrete operations, because vertex and pixel data contains multiple threads; for example, eight SPs may be working on a single thread in a single clock! Keep in mind that the G80 has eight SIMD blocks, each containing a cluster of 16 SPs (see diagram).


ATI’s R600: Note the clusters of 5 SPs bunched together

Before getting fed to the SPs, each vertex operation is divided into 16 threads, while pixel and geometry operations are divided into blocks of 32 (16 threads are executed over two clocks).

Which is better, 320 or 128?
Each SP on the G80 can perform two simultaneously-issued scalar operations per cycle, say a MADD plus a MUL. This is in contrast to the R600, where only one SP in a cluster of five can handle such an operation. So the G80 can handle twice the number of threads (128 vs. 64) in parallel.

However, because each thread on the R600 is worked upon by five SPs, more work is potentially getting done on each of its 64 threads than on the 128 threads on the G80. In the best-case scenario, the R600 can execute five parallel operations on each thread, assuming, of course, no dependencies among any of the five operations. The G80 can only handle a single operation per thread (albeit on double the number of threads, that is, 128).

Notice we said “best case” because such a situation is rare in real code; this peak (5x) utilisation is not sustainable and possible only rarely. Even then, the performance advantage is restricted to the difference in the number of SPs: 320 / 128 = 2.5x. Do remember that these scenarios are largely code-dependent. In the worst case, where the operations within a thread depend on one another and only one SP per cluster can be kept busy, we’d see the G80 with a 2x advantage, that is, 128 threads vs. 64. All of the above are hypothetical situations, and we’ll see developers optimising code for one or the other GPU.
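The arithmetic behind those two extremes, as a quick sketch; clock speeds and the G80’s dual-issue ability are deliberately ignored here, so treat the numbers as illustrative only.

    #include <cstdio>

    int main() {
        // Back-of-the-envelope version of the best/worst cases above.
        // Assumes one operation per ALU per clock.
        const int r600Clusters = 64, aluPerCluster = 5;   // 64 x 5 = "320 SPs"
        const int g80SPs = 128;

        int r600Best  = r600Clusters * aluPerCluster;  // every ALU in every cluster busy: 320 ops/clock
        int r600Worst = r600Clusters * 1;              // dependent code keeps only 1 ALU per cluster busy: 64
        int g80Ops    = g80SPs;                        // 128 independent SPs: 128 ops/clock either way

        std::printf("best case : R600 %d vs G80 %d (%.1fx to ATI)\n",
                    r600Best, g80Ops, (float)r600Best / g80Ops);       // 2.5x
        std::printf("worst case: R600 %d vs G80 %d (%.1fx to NVIDIA)\n",
                    r600Worst, g80Ops, (float)g80Ops / r600Worst);     // 2.0x
        return 0;
    }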



NVIDIA G80’s internals: Note the clusters of Stream Processors and Texture Mapping Units

There’s another variable to consider: the G80’s shaders are clocked at much higher speeds (see the table Matters Of The Core). In fact, at roughly double the speed of the R600’s (1500 MHz vs. 740 MHz), the overall throughput should work out to be about the same. This is more or less confirmed by the two companies’ own performance figures: 475 GigaFLOPS for the R600, and around 520 GigaFLOPS for the G80.
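Those GigaFLOPS figures can be roughly reconstructed with some simple arithmetic. The sketch below assumes each R600 ALU retires a MADD (two FLOPs) per clock at the 740 MHz core clock, and each G80 SP a MADD plus a MUL (three FLOPs) per clock; note that the quoted G80 figure lines up with the 8800GTX’s 1.35 GHz shader clock rather than the Ultra’s, so these are ballpark numbers, not official specs.

    #include <cstdio>

    int main() {
        // Peak shader throughput = number of units x FLOPs per unit per clock x clock (GHz).
        double r600 = 320 * 2 * 0.740;   // one MADD (2 FLOPs) per ALU per clock at 740 MHz
        double g80  = 128 * 3 * 1.35;    // one MADD (2 FLOPs) + one MUL (1 FLOP) per SP per clock at 1.35 GHz

        std::printf("R600 peak: ~%.0f GigaFLOPS\n", r600);   // ~474
        std::printf("G80 peak : ~%.0f GigaFLOPS\n", g80);    // ~518
        return 0;
    }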

In our opinion, therefore, the G80 should in practice be significantly faster than the R600 in most cases, unless code has been heavily optimised for the R600, in which case its heavy-duty computing engine comes into play.

Texturing: How Much Is Enough?
There are two methods in vogue for displaying spectacular visual effects in games. The older method employs complex textures, with code written specifically for each texture, while newer games (in general; there are exceptions) use shaders. All GPUs now feature programmable hardware shaders, for which code can easily be developed.

Look at the table Matters Of The Core and you’ll see the Texture Mapping Units (TMUs) on the R600 remain unchanged from the earlier Radeons at 16 (texture filter units). The TMU design on the R600 is nonetheless brand-new, with many improvements over its predecessors, namely improved Anisotropic Filtering, filtering of FP 32-bit textures, and support for larger texture samples in compliance with DX 10.

The TMUs on the R600 are additionally able to work on both unfiltered and filtered textures. While filtered textures mostly represent image-based data like pixel textures, unfiltered textures represent vertex data and other blocks of data not related to images. In true compliance with DX 10, the R600 can display FP formats from 11:11:10 (R:G:B) up to 128-bit.

NVIDIA’s G80 is at a direct advantage with double the number of TMUs (32). Although NVIDIA has double the number of texture filtering units (64), the number of texture address units is only 32. The G80 only works on filtered texture samples, somewhat limiting its capabilities (keeping in mind true unified architecture), but texture filtering should be faster on the G80 because of sheer numbers.

Pixel Mascara: Shading Up To Kill
Pixel operations bring up the last few steps in the rendering pipeline before the output is sent to the screen. ATI’s nomenclature for this part of the core is Render Back Ends, while NVIDIA uses the handle ROP Units (Raster Operation Units). The ROP unit is basically a complex processor responsible for, among other things, compression and decompression of textures, Alpha Blending, frame buffer writes, and Anti-Aliasing.
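As one concrete example of the work a ROP/Render Back End does, here is the standard alpha-blending formula written out in C++. This is a sketch of the maths only; on the card it is done in fixed-function hardware, per pixel, per clock.

    #include <cstdio>

    // Classic "over" alpha blending: mix an incoming (source) pixel with the pixel
    // already in the frame buffer (destination) according to the source's opacity.
    float alphaBlend(float src, float dst, float srcAlpha) {
        return src * srcAlpha + dst * (1.0f - srcAlpha);
    }

    int main() {
        // A 60%-opaque red value of 1.0 drawn over an existing 0.2 gives 0.68.
        std::printf("%.2f\n", alphaBlend(1.0f, 0.2f, 0.6f));
        return 0;
    }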

The R600 has the same number of Render Back Ends as its predecessor: four, one for each SIMD unit. The maximum throughput these four units maintain is 16 pixels per clock (four pixels each). Although ATI has mentioned improvements in efficiency over the previous-generation R580, the huge number of shader units (320 or 64, whichever way you look at it) will be seriously bottlenecked at the ROP stage of the R600’s hardware pipeline.

The G80 has six ROP units, each capable of four operations per clock, for 24 raster operations per clock in all. So while the G80 theoretically has more processing power, remember that the ROPs are clocked at core speed, so the R600 should be able to hold its own here (740 MHz vs. 612 MHz), with maybe a slight edge going to NVIDIA’s hardware. We don’t expect the difference to effectively be more than 10 per cent either way.

Memory Bandwidth: Multi-Lane Traffic Control!
ATI’s Ring Bus was first talked about with the R500 series. This was basically a pair of 256-bit unidirectional buses used to carry blocks of vertex, pixel, and other data around the various subsystems of the core and eventually to memory. The R600 adds two more parallel buses for a total data path width of 1024 (256 x 4) bits, and this Ring Bus is fully duplex. ATI refers to the design as fully distributed, meaning there is no central hub or controller; defying conventional topologies, the controller itself is spread around the entire ring.

The kind of bandwidth the R600 provides may seem gross overkill for today’s games, but when one takes into consideration DX 10 and the additional memory load, this kind of headroom makes sense. ATI has not gone GDDR4 this time, although it is supported by the memory controller.

The G80 uses a memory controller similar to the earlier GeForce 7 series’, except there are now six paths of 64 bits each instead of four, so in essence the 8800GTX and Ultra versions sport a 384-bit memory controller. Incidentally, the lower-end 8800GTS features a 320-bit bus, presumably with one less path than the GTX and Ultra, that is, five instead of six. There are 12 GDDR3 memory chips of 64 MB each aboard the 8800GTX/Ultra, and each gets a discrete 32-bit data path; the 8800GTS has 10 such chips. GDDR4 is supported by the G80’s controller, but we’re yet to see it implemented.
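The bandwidth figures in the Matters Of The Core table follow from a simple formula: bus width in bytes multiplied by the effective memory clock. A quick sketch, assuming the commonly quoted effective memory clocks of 1.6 GHz for the 8800GTS and 1.65 GHz for the HD2900XT (the clocks are our assumption, not from the table):

    #include <cstdio>

    // Bandwidth in GB/s = (bus width in bits / 8) x effective memory clock in GHz.
    double bandwidthGBps(int busBits, double effClockGHz) {
        return busBits / 8.0 * effClockGHz;
    }

    int main() {
        std::printf("8800GTS  (320-bit, 1.60 GHz): %.1f GB/s\n", bandwidthGBps(320, 1.60));  // 64.0
        std::printf("HD2900XT (512-bit, 1.65 GHz): %.1f GB/s\n", bandwidthGBps(512, 1.65));  // 105.6, ~106 in the table
        return 0;
    }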

Different Stream, Same Horizon?
These two are undoubtedly the fastest pieces of graphics hardware on the planet, and they easily outperform all previous-generation cards by at least 200 per cent in DX9 titles. DX10 is the proverbial dark horse here, and just how well both these new architectures fare under it remains to be seen; of course, this is what both these cards were bred for! Sure, DX 9 proves overkill for them, but we know how developers love to bring hardware to its knees. We fully expect this trend to continue, but fervently hope it is achieved through better visual content and not the horribly optimised code we’ve seen in the past. In fact, there are a few DX10 titles on the market that exemplify this point. There have also been flashes of brilliance, games like Far Cry, F.E.A.R., Oblivion, and Company Of Heroes being shining examples.

Which of the two architectures is faster is still debatable. We’re trying hard to get our hands on both cards to end the suspense once and for all with a full-fledged test. ATI has already launched a 1 GB version of its HD2900XT, with which they hope to combat the 8800 GTX/Ultra. NVIDIA is expected to launch something to the tune of an 8900GTS/GTX, but details are sketchy, and the company itself tight-lipped as of now. We can also expect to see an 8950GX2 (remember the two-GPU, two-PCB 7950GX2?).

One thing’s for sure: which architecture does better will also largely depend on which side of the fence game developers choose to sit on, optimising their code for one or the other GPU architecture. NVIDIA, with its “The Way It’s Meant To Be Played” programme, and ATI, with its “Get In The Game”, are beckoning developers with promises of title promotion and support. It’s just a matter of developers going red or green… As for us, our verdict is reserved until both contestants grace our PCI-Express slots!

Michael Browne