Big Blue Is Coming!

Published Date: 01 - Nov - 2006 | Last Updated: 01 - Nov - 2006
 
With all the hype surrounding Sony's upcoming PlayStation 3 (PS3), you've no doubt heard that its heart and soul will be the Sony-Toshiba-IBM Cell processor (the Cell Broadband Engine, to be precise), a nine-core monstrosity that promises buttery-smooth frame rates even in the most crowded scenarios, and a wholesome gaming experience for all. So what is it that makes the Cell such a formidable force? Naturally, it's the nine processor cores on the chip. But if it were as simple as that, this article would be a waste of space, wouldn't it? Before we go into its innards, however, we need to take a look at why the Cell exists in the first place.

Please Sir, I Want Some More
In the Old Days, it was office applications that drove processor speeds; after all, what other use did we have for PCs anyway? However, as processors got faster, it wasn't just for the thrill of faster word processing that we bought them. A vicious cycle had begun: the more powerful processors got, the more we wanted them to do for us. Games and multimedia became more demanding, and soon it was these applications that had manufacturers sweating to bring out the next ultra-fast processor.

To get these applications performing at their best, it all boiled down to just one thing: CPUs needed to execute more instructions per second. The easiest way of doing this, of course, was to kick up the CPU's clock frequency, which triggered the "MegaHertz (and later, the GigaHertz) Wars." This was all very well for a while, but there was also the matter of executing more instructions in each cycle of that clock signal. Naturally, the way to do this is to have instructions run simultaneously. The concept is called Instruction Level Parallelism (ILP), and it goes on in your CPU right now: the processor collects instructions, sees which it can run in parallel, and then does so.

A look at the way the Cell is built
Now, your average computer program is a single-threaded application. Put simply, it churns out a single stream of instructions, oblivious to the fact that there is a processor under the hood that's trying to figure out which of those instructions it can run simultaneously to increase performance. The result? Every new generation of processors delivered only a 10 to 20 per cent boost in performance: quite distressing.

But things are changing. The GigaHertz wars are over, and everyone has realised that the only way to squeeze out more performance is to have multiple program threads running at the same time, an approach known as Thread Level Parallelism (TLP). The way to do this is to increase the number of processors doing the work: first with the powerful dual-processor workstations, and now with today's dual-core, tomorrow's quad-core, and in five years, eighty-core processors (at least, that's what Intel tells us).

While Intel and AMD have chosen to have identical cores on their chips, IBM, as usual, decided to do things a little differently.

The Package
The Cell consists of a single PowerPC Processing Element (PPE) and eight identical Synergistic Processing Elements (SPEs). The PPE is a simpler version of IBM's previous PowerPC processors (more on this in a bit), and basically functions as the general-purpose core in this collection. The eight SPEs are specialised SIMD (Single Instruction, Multiple Data; see the box "SIMD") processors, designed specifically to make quick work of complex mathematical problems. With the SPEs, IBM has removed the cache, substituting it with 256 KB of local storage. The name has changed, but the purpose remains much the same; read on to find out why it's not really cache.

The Cell has been optimised to work with parallel workloads, and few things can be broken into as many parallel processes as games. Rendering frames requires processing millions of pixels, and this task can be broken into as many parallel tasks as one wants. Ditto physics and AI calculations. The Cell's obvious leaning towards better physics and AI means game scenarios can get more crowded (the more objects you place in a scene, the more complex physics calculations become) without any loss in performance. In fact, the Cell is even capable of taking up the job of the GPU, though industry experts suggest that it won't be replacing the dedicated GPU just yet.
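To make the "break the frame into parallel tasks" idea concrete, here is a minimal sketch in Python. It is only an illustration of the decomposition, not Cell code: the function names, the brightening job, and the use of threads are all assumptions made for the example. The one number taken from the article is the eight workers, one per SPE.

```python
from concurrent.futures import ThreadPoolExecutor

WORKERS = 8  # one worker per SPE, the figure given in the article


def brighten(strip):
    """Hypothetical per-pixel job: brighten each grey value, capped at 255."""
    return [min(p + 40, 255) for p in strip]


def split(pixels, parts):
    """Cut the frame into at most `parts` contiguous strips."""
    size = -(-len(pixels) // parts)  # ceiling division
    return [pixels[i:i + size] for i in range(0, len(pixels), size)]


def render(pixels, workers=WORKERS):
    """Process every strip independently, then stitch the frame back together."""
    if not pixels:
        return []
    strips = split(pixels, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = list(pool.map(brighten, strips))  # results keep strip order
    return [p for strip in done for p in strip]
```

No strip needs to know anything about its neighbours, which is what makes this kind of work easy to spread over many cores. In standard CPython the threads would not actually speed up this arithmetic; the sketch shows how the work divides, not how fast it runs.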

The Cell's main purpose in the PS3 is to ensure that there are no bottlenecks whatsoever, so obscene amounts of memory bandwidth are the order of the day. You no doubt remember Rambus, the people who came up with RDRAM, a technology that was undoubtedly superior to DDR-SDRAM but never really took off. Cell sports Rambus' new XDR memory controller, which gives the chip a memory bandwidth of 25.6 GBps, more than twice that of any PC processor. This also comes very close to the 32 GBps that GPUs get from their memory controllers, so the Cell won't bottleneck the GPU the way PC processors do.

But what's so special about these processors that makes them so fast? The secret to Cell's performance is a...

...Return To Innocence
The Cell's cores are all in-order cores, a design that hasn't been around since the days of the old Pentium processors. Processors since that time have been out-of-order processors (no, not in the "doesn't work" sense), and in the world of general-purpose processing (office applications and so on), it's these out-of-order processors that deliver better performance. So why did IBM take Cell back into the Dark Ages? Let's understand the difference between in-order and out-of-order execution first. Consider these instructions:

1.    A = B + C
2.    D = A + E
3.    X = Y + Z

You'll notice that instruction 2 has to be executed after 1, because it depends on the resulting value of A. Instruction 3, however, is independent of the other two. An in-order processor, as the name suggests, will execute these instructions in the order that it receives them. If all the data it needs is readily available in the processor's cache, then the instructions proceed swiftly, and all is well.

Suppose, now, that the value of B isn't in the cache, but C is. It's only a matter of four or five CPU clock cycles for the processor to fetch C, but to get to B, it first has to search the cache and encounter a "cache miss," following which it has to access the system's main memory to get the data, resulting in a delay of a couple of hundred CPU cycles.

All this time, instruction 3 has to wait its turn, even though the idle CPU cycles wasted hunting down B could have been used to process it and get the job over with. This is where out-of-order processors come in. Out-of-order processors use an "Instruction Window" (something like a buffer where incoming instructions are stored), within which they look for instructions that can be run independently, and process them in parallel, so a cache miss isn't such a disaster.
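The difference can be put into rough numbers with a toy model of the three instructions above. This is a sketch under loudly stated assumptions, not a description of any real core: a cache hit costs 4 cycles and a miss 200 (the article's "four or five" and "a couple of hundred"), the arithmetic itself costs 1 cycle, and the out-of-order machine is idealised so that its instruction window holds all three instructions at once. All names are made up for the example.

```python
from dataclasses import dataclass

CACHE_HIT = 4     # cycles; the article's "four or five"
CACHE_MISS = 200  # cycles; the article's "a couple of hundred"
EXECUTE = 1       # assumed cost of the arithmetic itself


@dataclass(frozen=True)
class Instr:
    dest: str
    srcs: tuple


# The three instructions from the article: only (2) depends on (1).
PROGRAM = [
    Instr("A", ("B", "C")),
    Instr("D", ("A", "E")),
    Instr("X", ("Y", "Z")),
]


def fetch_cost(name, missing):
    return CACHE_MISS if name in missing else CACHE_HIT


def run_in_order(program, missing):
    """Each instruction starts only after the previous one has finished."""
    clock, ready_at, finish = 0, {}, {}
    for ins in program:
        start = clock
        operands_ready = max(
            ready_at[s] if s in ready_at else start + fetch_cost(s, missing)
            for s in ins.srcs
        )
        clock = max(start, operands_ready) + EXECUTE
        ready_at[ins.dest] = finish[ins.dest] = clock
    return finish


def run_out_of_order(program, missing):
    """Idealised window: every fetch begins at cycle 0, and an instruction
    executes as soon as its own operands are available."""
    ready_at, finish = {}, {}
    for ins in program:  # program order suffices to resolve dependencies here
        operands_ready = max(
            ready_at[s] if s in ready_at else fetch_cost(s, missing)
            for s in ins.srcs
        )
        ready_at[ins.dest] = finish[ins.dest] = operands_ready + EXECUTE
    return finish
```

Calling `run_in_order(PROGRAM, {"B"})` has X finishing at cycle 211, stuck behind the miss on B, while `run_out_of_order(PROGRAM, {"B"})` has it done by cycle 5. Real processors have limited windows and issue widths, so the gap is smaller in practice, but the shape of the problem is the same.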

A microscopic view of the Cell: the black "bands" you see on either side are the SPEs; the PPE occupies the top left

Put simply, the difference is the same as that between shopping for items in the order they're written on your list, and picking up whichever items on the list you can see, irrespective of order. Out-of-order processors have been immensely successful in multi-tasking environments (which is the most common scenario for PC use) because they receive instructions from multiple programs, all independent of each other.

Coming back to our original question: why does the Cell forgo such obvious advantages to go with the in-order approach?

Firstly, apart from the dire consequences of a cache miss, in-order cores perform quite well. Secondly, and more importantly, in-order processors are simple. Without the added circuitry that enables out-of-order execution, the transistor count of an in-order core is low, which is how IBM is able to fit nine cores on that little chip. To get around the cache-miss hassle, IBM has resorted to a neat trick.

We mentioned before that each SPE has its own local storage rather than a cache. This is because, unlike cache memory, which has its caching logic hard-wired into it, this local store is accessible to the programmer. What this means is that the onus is now on the programmer (or rather, the compiler) to ensure that any data the SPE needs is available in this local store exactly when it needs it, minimising delays, or at the very least, making them predictable. Note that this applies only to the SPEs. The PPE still has a traditional cache hierarchy, and gets its performance from the speeds that the Cell will run at: between 3.2 and 4 GHz.
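One common way to meet that responsibility is double buffering: ask for the next chunk of data before starting work on the current one, so the data is already waiting when it is needed. The sketch below only mimics the pattern in Python. `dma_get` is a hypothetical stand-in for an asynchronous transfer into the local store, not a real Cell SDK call, and the buffer sizing (half the 256 KB store set aside for code and stack, 32-bit values) is an assumption made for the example.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local store size given in the article
ITEM_BYTES = 4                  # assumption: 32-bit values
# Assumption: half the store is kept for code and stack, and the rest is
# split into two data buffers, one being filled while the other is processed.
CHUNK_ITEMS = LOCAL_STORE_BYTES // 2 // 2 // ITEM_BYTES


def dma_get(main_memory, start, count):
    """Hypothetical stand-in for a transfer from main memory to local store."""
    return main_memory[start:start + count]


def sum_of_squares(main_memory, chunk_items=CHUNK_ITEMS):
    total = 0
    # Prime the pipeline: request the first chunk before any computing starts.
    next_buf = dma_get(main_memory, 0, chunk_items)
    offset = chunk_items
    while next_buf:
        current = next_buf
        # Request the following chunk before working on the current one, so
        # that on real hardware the transfer would overlap the computation.
        next_buf = dma_get(main_memory, offset, chunk_items)
        offset += chunk_items
        for value in current:  # touches only data already in the "local store"
            total += value * value
    return total
```

The inner loop never reaches out to main memory, so it never stalls on a miss; all the waiting is pushed into transfers whose timing the programmer (or compiler) controls. In this Python version the copy is of course synchronous, so it shows the ordering of the steps rather than any real overlap.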

Raw speed isn't the only thing behind the Cell, though. It does pack another really

SIMD
The reason the Cell's SPEs are able to demolish mathematical calculations the way they do is a concept called SIMD: Single Instruction, Multiple Data. The name is suggestive enough, but here's an easy way to understand how this works. If your average, everyday processor were to add a set of ten numbers to another set of ten, this is how the instructions would flow:

Take the first number from here, take the first number from there, add them.

Take the second number from here, take the second number from there, add them.
... and so on.

Such processors are called scalar processors, and are generally suited for everyday applications. However, scientists found this approach painfully slow for complex calculations, which led to the development of the vector processor. Instead of operating on only one number at a time, vector processors operate on sets of numbers at a time: one single instruction, but multiple data. So for the Cell's SPE, the above task becomes as simple as:

Take all these numbers, take all those numbers, add them.

This technique lends itself perfectly to processing pixel data for rendering a scene, and more importantly, to the physics and AI processing that will be the Cell's responsibility.
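The saving is easiest to see by counting instructions. The following Python sketch is purely illustrative: it assumes a vector unit that handles four values per instruction (four 32-bit numbers in a 128-bit register), and the function names are made up for the example. Python itself runs all of this one number at a time; only the instruction counts model the hardware.

```python
LANES = 4  # assumption: four 32-bit values fit in one 128-bit register


def scalar_add(xs, ys):
    """One add instruction per pair of numbers."""
    out, instructions = [], 0
    for x, y in zip(xs, ys):
        out.append(x + y)
        instructions += 1
    return out, instructions


def simd_add(xs, ys, lanes=LANES):
    """One add instruction per group of `lanes` pairs."""
    out, instructions = [], 0
    for i in range(0, len(xs), lanes):
        # A single "instruction" adds a whole group of lanes at once.
        out.extend(x + y for x, y in zip(xs[i:i + lanes], ys[i:i + lanes]))
        instructions += 1
    return out, instructions
```

For the two sets of ten numbers in the example above, `scalar_add` issues ten adds while `simd_add` issues three, with the last one only partly filled. Both return the same sums.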

Maximum Security Cell
With Sony, known for its fervent efforts to protect its intellectual property, being one of the driving forces behind Cell, it's no surprise that the anti-piracy angle has been well thought out. The Cell will feature application security at the hardware level. Cracking software involves reading what it puts into the system memory, and thus far, it's been the operating system that prevented hackers from doing this. Compromise the OS kernel, however, and it all comes to naught. The Cell doesn't let even the OS kernel see what's inside the SPEs' local memory, so the question of hackers peeking into application memory doesn't come up. Voila! Automatic anti-piracy!

Big Picture
Cell isn't just a processor; 'tis but a mere piece in the game for World Domination (TM). Cell's software is compiled into little "apulets," which will distribute themselves to all available Cell processors, be it on the same board, the same LAN, or even over the Internet. The result is a massive grid computer, where every idle Cell you're connected to becomes a possible candidate to offload computing onto. Imagine gaming on your PS3, while your Cell-enabled HDTV and PDA crunch numbers for a research lab in one corner of the Earth trying to find a cure for cancer!

So will we ever see the Cell in our desktops? Quite possibly, but there are a few things that might throw a spanner or two into the machinery. Firstly, there's the PPE. While it's way ahead of the competition for gaming, it will still lose out to current processors when it comes to general-purpose applications. However, throwing in more PPEs on the chip could well turn that around. Secondly, the Cell absolves itself of a lot of the responsibilities of regular processors, offloading them instead onto the developers who will be writing the compilers for it. The insane task of writing a compiler that can intelligently exploit the Cell and all its features might just put programmers off, and that would be the end of the Cell's PC prospects.

The only people willing to code to that level are game developers, so game consoles, at least, will see a lot more of the Cell. For now, Sony and Toshiba plan to release Cell-based HDTVs and PDAs in the near future, and the Cell will probably conquer the living room before it does the desktop.




Team Digit

All of us are better than one of us.