Insight: On Parallel Streams

By Team Digit | Published: 01 Aug 2008 | Last updated: 01 Aug 2008

Can ATI’s new architecture do what the HD2xxx and HD3xxx families of GPUs couldn’t?


Michael Browne

Not long ago, both ATI and NVIDIA claimed that their latest 3D accelerators were a little more environment friendly (read: green) than before in terms of power consumption. The top priority, as we know (and one that conflicts with anything green), is a GPU that is more powerful so that games can run faster and look better. Performance and efficiency are two goals that rarely coexist. Looking at NVIDIA’s GTX2xx architecture last month (July 2008), we see something more powerful than their earlier G80/G92, and therefore more power hungry. We rejoice about the former; after all, a boost of 30 to 75 per cent is no small matter. Complain as we may about the latter, there is very little power saving one can expect when the transistor count nearly doubles (from 754 million to some 1.4 billion). Power issues aside and visual nirvana firmly in sight, ATI was quick to offer the 48xx series a little after the GTX2xx launch. No surprise, this: ATI has had its back firmly against the wall since November 2006 and the G80 (GeForce 8800GTX), an architecture to which ATI had absolutely no retort. Its floundering position in the GPU market, with failures like the HD2900XT and pseudo-failures like the HD3870 while 8800GTXs sold like hot cakes, is ample testament to this. ATI did have a fan following in the mid-range segment, but NVIDIA completely dominated the flagship segment, and the performance war gave way to a reign of peace, with the mighty G80 trampling ATI’s offerings beneath its 128 shader units, a.k.a. stream processors (SPs).

With multithreading in mind, ATI’s new Radeon HD4870 and HD4850 feature a staggering 800 SPs each, unheard of in any previous GPU. There are also plans to launch an X2 version, which is simply a larger PCB housing two HD4870 GPUs, taking the SP count to a colossal 1,600. In terms of fabrication, ATI has a slight advantage, its new series having migrated to a 55-nm manufacturing process much earlier. NVIDIA’s newer GPUs like the GeForce 9800GTX and the GeForce GTX280/260 are manufactured on a 65-nm process. All else being equal, a 65-nm core will tend to run hotter than a 55-nm core, not to mention costing more, as more silicon is used per GPU. Note that NVIDIA also plans a die shrink to 55 nm with its (yet unseen) GeForce 9800GTX+.

Codenamed the RV770, the GPU behind the Radeon HD48xx series has also seen tremendous growth in transistor count, up to 956 million from ATI’s previous high of 666 million. Of course, the GTX2xx is still the most populous GPU transistor-wise with close to 1.4 billion, but the RV770 is bigger than anything we’ve seen from ATI. With a smaller die and a smaller fabrication process than the GTX2xx, the RV770 is also the cheaper of the two to manufacture.

Do note that we aren’t doing any in-depth comparisons between the two companies’ offerings, and this is strictly a look at ATI’s new wonder kid. Correlations, if any, are made simply for the sake of better understanding. Last year (August 2007 to be exact) we took a look at both NVIDIA’s and ATI’s new architectures, the G80 and the R600, in our insight feature Riders On The Stream. It was interesting to note that the two companies’ approaches to SPs were radically different. Instead of the rigid architecture that NVIDIA followed by fixing the number of SPs in relation to the number of texture processors and memory channel controllers, ATI used a more dynamic approach. While NVIDIA had 128 scalar SPs on its G80, ATI used 64 SPs. We say 64 SPs and not 320 SPs since each of these SPs consists of five ALUs (Arithmetic Logic Units), which is how ATI arrived at the magic figure of 320 SPs, that is, 64 x 5. By the proper definition of an SP, ATI’s 320 units on the RV670 (Radeon HD38xx) cannot be termed SPs, any more than the HD48xx can claim to have 800 SPs, as they aren’t really independent scalar processors.

In essence, ATI’s 64 SPs are capable of working on more complex (read multithreaded) operations than NVIDIA’s simpler SPs. In ATI’s case, it’s also true that each cluster of five ALUs (one SP) can only work on a single thread at a time. In case of a complex, multithreaded operation, this setup would be brilliant, as one SP from ATI is more powerful than one NVIDIA SP. But in case of simple operations that can be handled by a single ALU, the other four ALUs in the SP would be unoccupied or possibly under-occupied. In such a case, NVIDIA nullifies the advantage ATI has with a more powerful SP, and it comes down to sheer parallelism, where 128 SPs would outperform 64 SPs. This scenario has occurred often enough in the past, and was a reason for the HD3870’s failure against the 8800GTX.
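The trade-off described above can be put into a rough toy model. This is only a sketch under loud assumptions: it counts peak ALU operations issued per clock, ignores clock speeds, scheduling and memory effects entirely, and the function names are hypothetical, not anything from ATI or NVIDIA.

```python
# Toy model of peak ALU operations issued per clock.
# Assumption: clock speeds, scheduling and memory effects are ignored.

ALUS_PER_ATI_SP = 5  # each ATI SP is a cluster of five ALUs


def ati_ops_per_clock(sp_count, independent_ops):
    """Ops per clock when each thread offers `independent_ops` operations
    that can be issued together; one SP works on one thread at a time."""
    used_alus = min(independent_ops, ALUS_PER_ATI_SP)
    return sp_count * used_alus


def nvidia_ops_per_clock(sp_count):
    """Each scalar SP issues one operation per clock in this model."""
    return sp_count


for ops in range(1, ALUS_PER_ATI_SP + 1):
    ati = ati_ops_per_clock(64, ops)
    nv = nvidia_ops_per_clock(128)
    print(f"{ops} op(s) per thread: ATI 64 SPs -> {ati}, NVIDIA 128 SPs -> {nv}")
```

In this simplified model, 64 five-wide SPs only draw level with 128 scalar SPs once every thread keeps two ALUs busy, and pull ahead beyond that. Real performance also depends on the factors the model leaves out.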

ATI’s underlying architecture hasn’t changed, so 800 SPs on the new HD48xx is really equal to 160 SPs (800/5). NVIDIA too has retained the same scalar architecture. So what has ATI done differently? In its press releases, ATI claims that the execution units on the HD48xx are 40 per cent more efficient than those on the earlier HD3870. There is no way to validate this, but the sheer number of SPs, that is, 160, is definitely enough to make the 48xx series far faster than anything else that ATI has marketed before, the previous best being the HD3870 with 64 SPs.

Both AMD and NVIDIA are guilty of using unnecessarily confusing nomenclature and designations for the various parts that their GPUs consist of. Both ATI GPUs, the RV670 and the RV770, are built from SIMD cores (see figure 5), each of which consists of 16 SPs grouped together. While the older RV670 had four such SIMD units, the new RV770 drops in an additional six, taking the total to ten SIMD units. Apart from this increase in the number of SIMD units (and therefore SPs), the RV770 is identical to the RV670.
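The unit counts quoted so far follow directly from this grouping. As a quick check, using only the figures given in the article (the helper name is made up for illustration):

```python
# Unit counts derived from the SIMD grouping described in the article.
SPS_PER_SIMD = 16  # SPs grouped into one SIMD core
ALUS_PER_SP = 5    # ALUs inside each ATI SP


def unit_counts(simd_units):
    """Return (SPs, ALUs) for a GPU with the given number of SIMD units."""
    sps = simd_units * SPS_PER_SIMD
    alus = sps * ALUS_PER_SP
    return sps, alus


for name, simd_units in (("RV670", 4), ("RV770", 10)):
    sps, alus = unit_counts(simd_units)
    print(f"{name}: {simd_units} SIMD units -> {sps} SPs -> {alus} ALUs")
```

The ALU totals are the figures ATI markets as "SPs": 320 for the RV670 and 800 for the RV770.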


NVIDIA groups eight SPs and two SFUs (Special Function Units) into a cluster called a Streaming Multiprocessor (SM). These SMs are in turn grouped into a TPC, or Texture / Processor Cluster (see figure 4). In terms of nomenclature, NVIDIA’s references to SPs are more accurate, since each of its SPs is an independent, pipelined microprocessor capable of working on a single thread.

While the two have different methods of grouping their respective SPs, it is interesting to note that each SP cluster on the Radeon HD48xx has one complex ALU and four simple ALUs, and this complex ALU is basically what NVIDIA calls an SFU. ATI obviously has more SFUs, but whether this is significant or not is anyone’s guess. Another difference is that ATI clubs its Texture Units and Texture Cache along with its SP clusters, while NVIDIA pipelines its SPs to its Texture Units and Texture Cache.

For anyone looking at the RV770, it’s obvious that ATI has taken a huge jump in the total SP count and gone with a more brute-force approach, something like what NVIDIA did in 2006. In fact, the jump from 64 SPs to 160 SPs is more impressive from a manufacturing standpoint than NVIDIA’s jump from 128 to 240 SPs. When you take into account that each of the 160 SPs on the RV770 has five ALUs that handle complex computations, ATI’s HD48xx series seems even more impressive.
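To put numbers on the two jumps, here is a small sketch that computes the relative increase in SP count for each vendor. It uses only the SP counts quoted in the article; the function name is illustrative.

```python
def percent_increase(old, new):
    """Relative growth from `old` to `new`, in per cent."""
    return (new - old) / old * 100


ati_jump = percent_increase(64, 160)      # RV670 -> RV770
nvidia_jump = percent_increase(128, 240)  # G80 -> GTX280

print(f"ATI:    64 -> 160 SPs is a {ati_jump:.1f} per cent increase")
print(f"NVIDIA: 128 -> 240 SPs is a {nvidia_jump:.1f} per cent increase")
```

Keep in mind that the two SP counts are not like-for-like, since an ATI SP bundles five ALUs while an NVIDIA SP is a single scalar unit.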

Another first for ATI and the industry in general is the move to GDDR5 memory on the HD4870, although the cheaper HD4850 cards will still utilise GDDR3 memory. ATI’s memory subsystem on the HD4870 offers nearly the same bandwidth as the GeForce GTX 260, but with a much simpler (and cheaper to manufacture) 256-bit bus. NVIDIA uses 448-bit and 512-bit buses, which complicate the PCB design and push costs up. GDDR5 memory on the HD4870 runs at 900 MHz, but because it has two parallel data paths the effective data rate is quadrupled, as opposed to GDDR3 or GDDR4, which can only double it. Therefore, for a clock of 900 MHz, the effective data rate becomes 3.6 Gbps per pin (3,600 MHz effective). We’re told that GDDR5 can do as much as 1.2 GHz, which equates to 4.8 Gbps per pin. GDDR5 offers ATI more bang for the buck, as the costs involved in producing 3,600 MHz, 256-bit GDDR5 memory are much lower than the costs NVIDIA has to bear for 2,200 MHz, 512-bit memory.
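The memory figures above can be turned into total bandwidth with simple arithmetic: bus width in bits, multiplied by the effective transfer rate, divided by eight bits per byte. The sketch below uses only the clocks and bus widths quoted in the article; the function names are illustrative, and the results are theoretical peaks, not measured throughput.

```python
def effective_rate_mtps(clock_mhz, transfers_per_clock):
    """Effective data rate in million transfers per second, per pin."""
    return clock_mhz * transfers_per_clock


def bandwidth_gb_per_s(bus_width_bits, rate_mtps):
    """Theoretical peak bandwidth in gigabytes per second."""
    return bus_width_bits * rate_mtps / 8 / 1000


# GDDR5 moves four transfers per clock, per the article.
hd4870 = bandwidth_gb_per_s(256, effective_rate_mtps(900, 4))
gddr5_max = bandwidth_gb_per_s(256, effective_rate_mtps(1200, 4))
# The article's 2,200 MHz figure for NVIDIA is already an effective rate.
nvidia_512 = bandwidth_gb_per_s(512, 2200)

print(f"HD4870, 256-bit GDDR5 at 900 MHz:   {hd4870:.1f} GB/s")
print(f"256-bit GDDR5 at 1.2 GHz:           {gddr5_max:.1f} GB/s")
print(f"512-bit memory at 2,200 MHz (eff.): {nvidia_512:.1f} GB/s")
```

On these numbers, a 256-bit GDDR5 bus at the 1.2 GHz ceiling would overtake a 512-bit bus running at 2,200 MHz effective, which is the cost argument in a nutshell.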

With DX9 and DX10 and future standards (DX10.1 and DX11), we’re seeing a move from texturing to shader-intensive operations across all game titles. Most games released over the past couple of years already utilise shaders for certain details, as a shader is much more flexible and makes far fewer demands on hardware than a texture does for similar levels of detail in a given scene. While a texture requires a reference image that is used for creating elements in a scene, a shader relies on program code. The best possible showcases for shaders are games like Crysis, S.T.A.L.K.E.R., Oblivion, Splinter Cell Double Agent and UT3. ATI has geared the RV770 to be a shader-heavy beast, and while the number of texture units goes up from 16 to 40, the ratio of SPs to texture units is 4:1 (160:40). NVIDIA has also woken up to this fact, and the GTX280 has hardly upped its texture unit count from the G80 / G92 days (from 64 to 80 units). For NVIDIA this ratio is 3:1 (240:80), as opposed to 2:1 (128:64) with the G80.
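The SP-to-texture-unit ratios quoted above can be reduced mechanically. A short sketch using the article's unit counts follows; the RV670 entry is derived from the 64 SPs and 16 texture units mentioned earlier, and the helper name is made up for illustration.

```python
from math import gcd


def simplify_ratio(sps, texture_units):
    """Reduce an SP : texture-unit ratio to its lowest terms."""
    divisor = gcd(sps, texture_units)
    return sps // divisor, texture_units // divisor


# (SPs, texture units) as quoted in the article
chips = {
    "RV670": (64, 16),
    "RV770": (160, 40),
    "G80": (128, 64),
    "GTX280": (240, 80),
}

for name, (sps, tex) in chips.items():
    a, b = simplify_ratio(sps, tex)
    print(f"{name}: {sps}:{tex} -> {a}:{b}")
```

By this count, ATI's ratio stays at 4:1 from the RV670 to the RV770, while NVIDIA's moves from 2:1 to 3:1.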

Whether or not the Radeon RV770 is the better performer in real-world games isn’t clear. What we cannot dispute, however, is that ATI has come back with all guns blazing. The HD4870 and HD4850 do have one very significant advantage over the competition: they are less monolithic in nature, and therefore ATI has more room to manoeuvre on prices as well as to scale performance up or down. We know that NVIDIA can’t really up the performance of its GTX 280 without a die shrink. The fact is that ATI has finally built a tremendous card, just as NVIDIA did with the G80 and the new GTX2xx, and what’s more important is that ATI has learnt well from its mistakes with the R600 and RV670. We’re told that the successors to both the RV770 and the GTX2xx are also ready, waiting in the wings to come on after two fine competitors have slugged it out.

With upcoming titles like Far Cry 2, Fallout 3, Crysis Warhead, Dragon Age and Dawn Of War 2 on the way, it seems we’ll need all the crunching horsepower we can get. The best part of the RV770’s existence is that at last, after a year and a half, we’re seeing competition in the high-end and mid-range graphics card segments. We’re seeing tremendous price drops, which is good for users as well as the industry because it speeds up adoption of such graphics solutions.

