There are war clouds on the horizon. To our regular readers, the ongoing battle of words and rattling of silicon sabres between NVIDIA and Intel will not come as news. Intel is set to unleash its Larrabee processors in the near future, pitting them against the graphics accelerators of today as manufactured by NVIDIA and AMD/ATI. Meanwhile, the other side has been adding General Purpose to its GPUs for a while now; the so-called GP-GPUs even today are all set to take over tasks traditionally executed by the CPU. On a more general level, the entire industry is moving towards massive parallelization—instead of discrete processors performing specific tasks, we are looking at processors that can wear multiple hats and churn through everything from physics to sound to graphics.
But let’s limit the battleground to the CPU vs. GPU scene, and as a subset to the diametrically opposite directions NVIDIA and ATI are taking with their new generation hardware.
Meet The New Boss
The G80/G92 processors from NVIDIA have gone unchallenged, both in pure performance and in price-to-performance ratio, for the past couple of years. There has simply been no better card to buy than one based on either the 8800 series or the 9800 series. The G80 has enjoyed a nice stint on the throne.
NVIDIA would like to repeat that success with its just-released GTX 200 hardware. In many ways, the new graphics chip from the green team is a consolidation of its past successes and its future aspirations: the GTX 200 is essentially a bigger, more efficient version of the G80/G92 core. At 1.4 billion transistors on a 65nm fabrication process, the new core is a monster, and NVIDIA claims it is the largest and most complicated piece of silicon the company has yet produced.
NVIDIA’s ambitions for the GTX 200 chips are just as large: not only does it plan to retain its crown in the graphics industry, it also hopes to make substantial inroads into the traditional CPU market with this new chip. Somewhat amusingly, NVIDIA terms these twin goals Gaming Beyond and Beyond Gaming. Marketing monikers aside, let’s take a close look at the changes wrought by this new chip by dissecting its internals.
(Almost) Same As The Old Boss
The heart of the GTX 200 beats with the same chewy goodness that made the G80 core so popular, but it’s stronger and faster. Before we delve deeper into its anatomy, it’s interesting to note that the core has been designed to fulfil dual roles. Thanks to its massively parallel design, the GTX 200 can work in a graphics mode, in which it acts as a traditional programmable GPU, or in a “compute” mode, in which the parts dedicated to graphics processing are turned off and the chip acts as a non-traditional general processor—capable of churning through massively parallel tasks and floating-point calculations such as those needed to simulate high-level physics interactions. This is where the Beyond Gaming / Gaming Beyond roles arise from.
Internally, this processor is all about threads. The core is mostly comprised of ten clusters called Thread Processing Clusters, or TPCs. Each of these clusters is in turn made up of units called Streaming Multiprocessors (SMs): there are three SMs in each TPC, for a total of 30 SMs in each GTX 200 core. Each SM is further comprised of Streaming Processors, or SPs. There are eight SPs in each SM, making for a total of 240 streaming processors on the entire chip. In comparison, the GeForce 8 and 9 series of cards had a total of 128 streaming processors—just one example of how the GTX 200 is a souped-up version of the older G80/G92 core.
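The hierarchy arithmetic is easy to verify with a quick back-of-the-envelope check, using only the figures quoted above:

```python
# Back-of-the-envelope check of the GTX 200 processor hierarchy,
# using the figures quoted in the article.
TPCS_PER_CHIP = 10  # Thread Processing Clusters
SMS_PER_TPC = 3     # Streaming Multiprocessors per TPC
SPS_PER_SM = 8      # Streaming Processors per SM

sms_per_chip = TPCS_PER_CHIP * SMS_PER_TPC
sps_per_chip = sms_per_chip * SPS_PER_SM

print(sms_per_chip)  # 30 SMs
print(sps_per_chip)  # 240 SPs, vs. 128 on the G80/G92
```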
This array of 240 streaming processors is where the chip gets most of its parallel processing strength. As we mentioned earlier, the chip uses two different processing models. Across the TPCs, the architecture is MIMD (multiple instruction, multiple data); within each SM, it is SIMT (single instruction, multiple thread). SIMD machines, such as the vector units of a traditional CPU, operate at reduced capacity when the input is narrower than the SIMD width; SIMT ensures the processing cores are fully utilised at all times.
Each SM can handle up to 32 groups of 32 threads, for a total of 1,024 threads per SM. Since the chip has 30 SMs, it can handle up to 30,720 threads in hardware. Comparing this with the 12,288 maximum concurrent threads that a GeForce 8800 GTX can manage should give you an indication of the thread processing capability of the current design versus the older generation.
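The same sort of arithmetic shows how large the generational jump in hardware thread capacity is:

```python
# Maximum hardware thread capacity, per the figures in the article.
GROUPS_PER_SM = 32
THREADS_PER_GROUP = 32
SMS = 30

threads_per_sm = GROUPS_PER_SM * THREADS_PER_GROUP  # 1,024
threads_per_chip = threads_per_sm * SMS             # 30,720

G80_MAX_THREADS = 12288  # GeForce 8800 GTX
print(threads_per_chip / G80_MAX_THREADS)  # 2.5x the older design
```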
The GTX 200 breaks everything down into threads and then leverages its inherent capabilities to crunch them in parallel. The chip’s design lends itself to complex integer and floating-point math, memory operations, and logic operations. Each core is a hardware-multithreaded processor with multiple pipeline stages that executes an instruction for each thread every clock. Threads come in various types (pixel, vertex, geometry, and compute), which help the chip with both graphics and traditional CPU tasks. Programmatically, then, every task is executed as a series of threads. From the programmer’s perspective, SIMT allows each thread to take its own path: since branching is handled by the hardware, there is no need to manually manage it within the vector width.
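A toy emulation may make the SIMT idea concrete. In the sketch below (pure Python, with names invented for illustration), a “warp” of threads executes a single instruction stream; when threads branch differently, the hardware-style solution is to run both paths under a per-thread active mask, rather than asking the programmer to vectorise the branch by hand:

```python
# Toy SIMT emulation: one instruction stream, many threads, with
# hardware-style masking on divergent branches. Purely illustrative.
def simt_branch(values, cond, if_true, if_false):
    """Apply a data-dependent branch across a 'warp' of values.

    Both sides of the branch are executed; a per-thread mask picks
    which result each thread keeps - no manual vector management.
    """
    mask = [cond(v) for v in values]
    taken = [if_true(v) for v in values]       # executed under mask
    not_taken = [if_false(v) for v in values]  # executed under ~mask
    return [t if m else n for m, t, n in zip(mask, taken, not_taken)]

warp = [1, -2, 3, -4]  # a 4-thread 'warp' (real warps are 32 wide)
result = simt_branch(warp, lambda v: v > 0,
                     lambda v: v * 10,  # path taken by positive values
                     lambda v: -v)      # path taken by the rest
print(result)  # [10, 2, 30, 4]
```

Each thread got the result of its own path, yet the “hardware” only ever issued one instruction stream.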
According to NVIDIA, “GPU processing is centered on computation and throughput, where CPUs focus heavily on reducing latency and keeping their pipelines busy”. All GeForce GTX 200 GPUs include a substantial portion of die area dedicated to processing, unlike CPUs where a majority of die area is dedicated to onboard cache memory. Rough estimates show 20 per cent of the transistors of a CPU are dedicated to computation, compared to 80 per cent of GPU transistors.
With a thread-driven architecture, keeping the transistors fed with data becomes very important. While a CPU reduces latency using onboard cache memory (which occupies a significant share of a CPU’s die estate), the GTX 200 dedicates its die area to computing and thread scheduling units. Hardware thread scheduling ensures all processing cores attain nearly 100 per cent utilisation. The GPU keeps itself fed and latency down by shuffling threads across its TPCs: if a particular thread is waiting on a memory access, the GPU can perform zero-cost, hardware-based context switching and immediately switch to another thread.
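The latency-hiding idea can be sketched as a tiny scheduler simulation (invented names and a much-simplified model): whenever one thread stalls on memory, another resident thread is swapped in at no cost, so the core stays busy as long as enough threads are in flight:

```python
# Toy model of zero-cost context switching: a core cycles through
# resident threads, skipping any that are stalled on memory.
def busy_cycles(threads, total_cycles):
    """threads: list of dicts with a 'stalled_until' cycle marker."""
    busy = 0
    for cycle in range(total_cycles):
        ready = [t for t in threads if t["stalled_until"] <= cycle]
        if ready:
            busy += 1  # some ready thread executes this cycle
    return busy

# A lone thread stalled for 80 of 100 cycles leaves the core mostly idle.
single = [{"stalled_until": 80}]
# Add a second resident thread that is always ready, and utilisation
# stays at 100 per cent despite the same stall.
pair = [{"stalled_until": 80}, {"stalled_until": 0}]

print(busy_cycles(single, 100))  # 20
print(busy_cycles(pair, 100))    # 100
```

On the real chip the same effect is achieved with thousands of resident threads rather than two, which is why the enormous thread capacity matters.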
The TPCs are not the only transistors helping with graphics processing. We can consider the TPC structures a common tool that the chip employs for both compute and graphics tasks. Graphics processing, though, also brings in some dedicated processing units, such as the Raster Operations Processors, or ROPs. The ROPs can be considered a painter’s workshop: the vertices transformed and computed in the TPCs come here to be filled with colour and rendered into the pixels that make up the final image. This final scene is assembled from the colour and position data generated for each pixel. Anti-aliasing and blending into the framebuffer (where the final image is drawn) are also done inside the ROPs.
The chip has eight ROP partitions in total versus the prior generation’s six. The G80 core could output 24 pixels per clock and blend 12 pixels per clock; the GTX 200 chip can output and blend 32 pixels per clock.
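Those per-clock figures translate into a peak fill rate once a core clock is assumed. The 602 MHz clock below is our assumption for the GTX 280 part; the article itself quotes only the per-clock throughput:

```python
# Peak pixel fill rate implied by the ROP figures. The 602 MHz core
# clock is an assumed value for the GTX 280; only the per-clock
# throughput comes from the article.
PIXELS_PER_CLOCK = 32      # GTX 200 output/blend rate
G80_PIXELS_PER_CLOCK = 24  # older generation, for comparison
CORE_CLOCK_MHZ = 602       # assumed

fill_rate_gpix = PIXELS_PER_CLOCK * CORE_CLOCK_MHZ / 1000
print(round(fill_rate_gpix, 1))  # ~19.3 gigapixels per second
```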
The ROP structures are not leveraged for compute calculations; in compute mode, a thread scheduler manages the threads. Texture caches are also used to combine memory accesses for more efficient read/write operations. Furthermore, a small cache of memory is located within each SM. This local memory is used to share data between the eight SPs that constitute each SM without calling out to an external memory subsystem, further increasing computational efficiency.
Other Structural Changes
An important addition to the internals is support for double-precision, 64-bit floating-point computation. This will benefit mathematically intensive applications in high-end scientific, engineering, and financial fields. Each SM incorporates one double-precision 64-bit floating-point math unit, for a total of 30 double-precision processing cores. Each of these units performs a fused multiply-add operation that is fully compliant with the IEEE 754R floating-point specification. The overall double-precision performance of the 10 TPCs of a GeForce GTX 200 GPU is roughly equivalent to that of an eight-core Xeon CPU, yielding up to 90 gigaflops.
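The arithmetic behind the headline figure is straightforward: a fused multiply-add counts as two floating-point operations per clock, and 30 units at a shader clock of roughly 1.5 GHz (our assumed figure; retail parts may clock somewhat lower) gives the quoted peak:

```python
# Arithmetic behind the 'up to 90 gigaflops' double-precision claim.
# A fused multiply-add counts as two floating-point operations; the
# ~1.5 GHz shader clock is an assumption chosen to match the claim.
DP_UNITS = 30           # one double-precision unit per SM
FLOPS_PER_FMA = 2       # multiply + add, fused into one operation
SHADER_CLOCK_GHZ = 1.5  # assumed

dp_gflops = DP_UNITS * FLOPS_PER_FMA * SHADER_CLOCK_GHZ
print(dp_gflops)  # 90.0
```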
This new chip also adds to the framebuffer—the video memory of a graphics card. While SLI implementations of previous-generation chips carried up to 1GB of memory, that gigabyte was shared across the chips comprising the multi-core setup; in practice, the memory available to each core was around 512MB. The GTX 200 chip, however, makes full use of a 1GB framebuffer. This is of huge benefit to current and future games, especially since the variety of textures used by modern games is only increasing: normal maps to enhance surface realism, cubemaps for reflections, and high-resolution perspective shadow maps for soft shadows. Much more memory is thus needed to render a single scene than in classic rendering, which relied mainly on the base texture. Modern engines also make use of advanced techniques such as deferred rendering, which benefits from the larger framebuffer as well; deferred rendering in particular consumes an immense amount of video memory and memory bandwidth, especially when used in conjunction with antialiasing.

In addition to the increase in memory, the path to the memory has also widened inside the GTX 200 chips: the memory interface is expanded to 512 bits, up from 384 bits in previous-generation GPUs. This has been achieved using eight 64-bit-wide frame buffer interface units, and the wider interface means greater memory bandwidth.
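How much bandwidth the 512-bit interface buys can be estimated once a memory clock is assumed. The 1107 MHz GDDR3 clock below is our assumption for the GTX 280; GDDR3 transfers data on both clock edges:

```python
# Memory bandwidth implied by the 512-bit interface. The 1107 MHz
# GDDR3 clock is an assumed value for the GTX 280; the bus width
# (eight 64-bit frame buffer interface units) comes from the article.
BUS_WIDTH_BITS = 512
MEM_CLOCK_MHZ = 1107     # assumed
TRANSFERS_PER_CLOCK = 2  # double data rate

bytes_per_transfer = BUS_WIDTH_BITS / 8  # 64 bytes across the bus
bandwidth_gbs = bytes_per_transfer * MEM_CLOCK_MHZ * TRANSFERS_PER_CLOCK / 1000
print(round(bandwidth_gbs, 1))  # ~141.7 GB/s
```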
Brains To The Muscle
With its 65nm fabrication and 1.4 billion transistors, the GTX 200 is expected to run hot. Indeed, the TDP of the chip can get as high as 236 Watts! However, NVIDIA has taken a holistic approach to power management, and the actual thermal profile of the chip will vary greatly depending on what it is used for.
Broadly, the chip operates under four different thermal envelopes:
- Idle/2D power mode (approx 25 W)
- Blu-ray DVD playback mode (approx 35 W)
- Full 3D performance mode (up to 236 W)
- HybridPower mode (effectively 0 W)
The HybridPower mode is enabled when a GeForce card is used in conjunction with special nForce motherboards: in this case, the onboard graphics chipset is used for 2D tasks and the dedicated graphics card is effectively switched off.
While HybridPower might be considered a feature outside the GTX 200, the chip itself has several tricks up its sleeve to reduce power draw. For one, when running 3D applications, the driver can seamlessly switch between power modes based on GPU utilisation. These new chips come with activity monitors that act as watchdogs over the traffic inside the GPU. Based on the activity reported, the GPU driver can dynamically throttle the performance mode by changing the clock speed and voltage level, all transparently to the user. This behind-the-scenes throttling goes a long way towards supplying just the juice the GPU needs and keeping temperature and power draw down. Apart from this throttling, the GPU also has clock-gating circuitry, which effectively shuts down blocks of the GPU that are not in use at a particular time, further reducing power during periods of non-peak GPU utilisation.
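In spirit, the driver’s decision logic resembles the sketch below. The thresholds and mode names are invented for illustration; the real heuristics are NVIDIA’s own and are not public:

```python
# Simplified sketch of utilisation-driven power-mode selection.
# Thresholds and mode names are invented; only the wattage envelopes
# come from the article.
def pick_power_mode(gpu_utilisation):
    """Map a utilisation reading (0.0-1.0) to a power mode."""
    if gpu_utilisation < 0.05:
        return "idle/2D"         # approx 25 W envelope
    if gpu_utilisation < 0.40:
        return "video playback"  # approx 35 W envelope
    return "full 3D"             # up to 236 W envelope

print(pick_power_mode(0.02))  # idle/2D
print(pick_power_mode(0.90))  # full 3D
```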
NVIDIA wants to take its GPU beyond the graphics processing threshold. It has already taken steps to that end, the largest being the introduction of its C-like CUDA interface for programmers. Where NVIDIA (and ATI too, as a matter of fact) would like to see its GPUs is in more general computing. To this end, some programs have already been showcased: traditional tasks either greatly accelerated by, or completely executed on, the GPU. Adobe, for example, showcased a GPU-accelerated version of its Photoshop application that quickly loaded and edited a 442-megapixel, 2GB file. The edits were rudimentary (zooming in and out, rotation and so on), but as a taste of things to come it was an impressive display, given how slow Photoshop can get while manipulating large images. Another piece of software used the GPU to encode video files faster than real-time playback; yet another was a ray-tracing application that made use of the GPU. Other tasks the GPU is being called on for include protein folding via a Folding@Home client and physics simulations for liquids, soft bodies, and fabrics.
There are to be two variants based on the GTX 200 chip: the GeForce GTX 280 will be the higher-end part (priced at around $650), while the GeForce GTX 260 will lose some of its components and be priced lower (at about $399). Early benchmarks show the GTX 280 on par with the 9800 GX2 cards in some tests and ahead in others. Going on the data that’s out there, NVIDIA might just lose its price-to-performance crown to its rival’s graphics chipset. ATI/AMD has taken a completely different approach to designing its graphics part. Not enough technical information is available for an objective comparison between the two approaches, but from a holistic point of view: while NVIDIA has taken the bigger, more powerful approach, ATI has gone down the smaller, scalable route. ATI’s new chip, the RV770, seems designed to run efficiently and in pairs. Early tests put the RV770-based single-chip solution, the Radeon HD 4850, within range of the 9800 GTX, at $200 or less; two 4850s in CrossFire configuration get within touching distance of the GeForce GTX 280, at $100 less. AMD/ATI will also release a $300 card, the Radeon HD 4870, which pairs the RV770 with GDDR5 memory—about 80 per cent faster than the GDDR3 used in the GeForce cards.
One thing’s for sure: things are heating up in the graphics market. At one end you have the ATI and NVIDIA rivalry, which has already led to price drops from the NVIDIA camp; at the other, you have the upcoming challenger, Intel. Who knows what that will bring to the scene? Cheaper prices, for sure. These are exciting times for us consumers indeed!