Cool Code

Published Date: 01 - May - 2007 | Last Updated: 01 - May - 2007
Our lives are now practically infested with mobile devices. Whether it's your old cell phone, a shiny new PDA, or the upcoming Ultra-mobile PCs (UMPC), the quest now is to pack as much PC-like functionality into these gadgets as possible. The challenges come in when you realise that the processors for these devices don't have the luxury of living in a capacious cabinet with monster-sized heat-sinks to keep them cool. With the walls closing in on them, they must deal with computation in a completely different way.

Nearly 75 per cent of all embedded processors are based on the ARM Architecture (Advanced RISC Machine, or Acorn RISC Machine), because it meets such requirements beautifully: it's half the size of an old 486 CPU, is almost as powerful (with the potential for more), doesn't heat up even under load, and consumes but a smidgen of battery power. How does it do it, and why don't we have such features in our desktop processors?

History Lesson, I: Less Is More
Today's mobile processors owe their architecture to a division in processor design philosophy that started somewhere in the 1970s. Flashback (in black-and-white, if you like) to a dark age where no compilers existed, and programs were hand-coded in each processor's own assembly language. These instruction sets were quite simple, and something as basic as comparing two numbers could take ten instructions to execute. Imagine trying to write an operating system with that! But that was just half the problem.

When you gave an instruction to a processor, you did so as a hexadecimal op-code, which was stored in a register on the CPU itself before being executed. This register (a pipeline of many registers in today's processors) is an expensive resource to waste, and the number of instructions required for the simplest operations made that painfully obvious, more so when processors were getting powerful enough to do so much more. The goal, then, was to create instructions that were the same length (in bits, that is), but led to the execution of many micro-operations. For example, on the 8086 a single CMP (Compare) instruction can fetch a number from the system's memory, subtract it from another and record the outcome in the processor's flags, after which a JE (Jump If Equal) instruction jumps to a new instruction if the two were indeed equal. Contrast this with a plain JMP instruction, as found on its predecessor, the 8085, which simply makes the processor jump to the specified memory location.

This was the birth of Complex Instruction Set Computing (CISC): simple instructions that performed complex tasks. It didn't waste CPU resources as much, and gave programmers relief while they waited for the rise of higher-level languages and their compilers.

History Lesson, II: Enough Is Too Much
With the arrival of languages like C, programming trends started shifting away from the ungainly assembly languages to the more elegant (by those standards, anyway) higher-level languages, which used compilers that translated code to assembly language. As it turned out, these compilers weren't exploiting the potential that CISC had to offer, mostly because compiler designers were at a loss to figure out which instruction to use when. The 8086 had around 32 different "Jump If..." mnemonics to choose from, and more were added with every new processor. The time and effort that went into designing CISC processors started to seem... futile.

The processor design community split into two schools of thought. One concentrated on developing better compilers for CISC processors (notably the Intel x86, which is all-pervasive today), while the other believed that CISC processors were grossly overdesigned, and that a simpler instruction set, friendlier to compilers and programmers alike, was in order. The idea came to be known as Reduced Instruction Set Computing (RISC), and came with its own set of advantages and disadvantages.

Most RISC instructions are designed to execute in a single clock cycle, so the CPU rarely has to wait for complex instructions to finish executing; the time saved is even more prominent when instructions are sent alternately to multiple processors. Most operations would take the same time on both RISC and CISC processors regardless, but the hardware required to decode RISC instructions is less complex, which makes it easier and cheaper to implement. The most significant disadvantage of the RISC architecture, however, is that it relies quite heavily on software: a badly written compiler can have a devastating effect on performance, as can badly written programs.

In a nutshell, the simplicity of implementing a RISC machine makes it perfect for mobile processors, so that's precisely what Acorn decided to use when they came up with the ARM.

At The Drawing Board
The biggest thing to consider when designing a mobile processor is power consumption. Every single decision that follows (architecture, clock speed and so on) is driven by the need to use as little power as possible, and naturally so. How would you like it if you had to keep recharging your new PDA phone every three hours? Keeping power consumption low also ensures that less of it is dissipated as heat, which is good, because you don't want to be saddled with a heat-sink in your pocket. Or worse.

It's also important that these processors cost very little, which in turn means that they should be easy to design.

So in 1983, Acorn Computers set out to design the ARM (Acorn RISC Machine), a 32-bit RISC processor which had a third as many transistors as the Motorola 68000 (a chip six years older than the ARM) and could still actually do stuff.

Taking More RISCs
Consider this bit of code that calculates the GCD (Greatest Common Divisor) of the numbers i and j:

while (i != j)
    if (i > j)
        i -= j;
    else
        j -= i;
return i;


The premise is that if the lower of i and j is repeatedly subtracted from the greater until the two are finally equal, the number they end up at is the GCD.
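The loop above can be fleshed out into a complete function. This is only a minimal sketch: the name gcd_subtract is ours, and it assumes both inputs are positive (with a zero input the loop would never end).

```c
/* Subtraction-based GCD, as described in the text.
   Assumes i > 0 and j > 0. */
unsigned gcd_subtract(unsigned i, unsigned j)
{
    while (i != j) {
        if (i > j)
            i -= j;   /* i is the greater: subtract j from it */
        else
            j -= i;   /* j is the greater: subtract i from it */
    }
    return i;         /* i == j here, and that value is the GCD */
}
```

Calling gcd_subtract(12, 18) walks through 12 and 18, then 12 and 6, then 6 and 6, and returns 6.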

When this code is compiled to x86 assembly, you'll have instructions that:
1. Check the equality of i and j; depending on the result, the CPU will either carry on with the code that follows or start executing code from another (specified) memory location. This is called branching: there are two possible outcomes of the same situation.

2. Once inside the loop, check which of i and j is greater and subtract the lower from it, resulting in another possible branch.

The x86 processor has an instruction pipeline, which loads instructions from the system's memory and keeps them ready for the CPU to execute. In the above case, let's say that while the CPU is performing the first comparison, the code for the rest of the loop is already in the pipeline. If i and j are equal, however, all that code is unnecessary, so the pipeline has to be cleared, and instructions need to be loaded from a new memory location all over again. This wastes both the pipeline and CPU time; as we've mentioned before, these are expensive resources, so waste is unacceptable. The x86 architecture compensates for this with a branch predictor, which does exactly what the name implies. Today's predictors are accurate around 99 per cent of the time, and the overall gains offset the losses caused by the one per cent.
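To get a feel for why prediction pays off, here is a toy model in C. It uses a 2-bit saturating counter, a textbook scheme chosen purely for illustration; it is not how any particular x86 predictor works, and the names predictor, predict_and_update and gcd_with_predictor are our own.

```c
#include <stdbool.h>

/* Hypothetical 2-bit saturating counter: state 0..3,
   predicts "taken" when state >= 2. */
typedef struct {
    unsigned state;   /* 0 = strongly not taken ... 3 = strongly taken */
    unsigned hits;    /* correct guesses */
    unsigned misses;  /* wrong guesses (each would flush the pipeline) */
} predictor;

static void predict_and_update(predictor *p, bool taken)
{
    bool guess = p->state >= 2;
    if (guess == taken)
        p->hits++;
    else
        p->misses++;
    if (taken && p->state < 3)
        p->state++;           /* nudge towards "taken" */
    if (!taken && p->state > 0)
        p->state--;           /* nudge towards "not taken" */
}

/* Runs the GCD loop and feeds every outcome of the loop test (i != j)
   to the predictor. Assumes i > 0 and j > 0. */
unsigned gcd_with_predictor(unsigned i, unsigned j, predictor *p)
{
    for (;;) {
        bool loop_again = (i != j);
        predict_and_update(p, loop_again);
        if (!loop_again)
            break;
        if (i > j)
            i -= j;
        else
            j -= i;
    }
    return i;
}
```

Starting from state 0 with i = 100 and j = 1, the loop test goes the same way 99 times before finally going the other way, so this toy predictor guesses wrong only three times in 100: twice while warming up and once at the exit.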

ARM, however, would have none of this. Branch predictors add complexity, so Acorn decided to use the top four bits of every instruction as a condition code, which sidesteps the branching issue altogether, for smaller loops at least. The ARM assembly code for the same operation goes thus:

loop    CMP    Ri, Rj        ; set condition "NE (not equal)" if (i != j),
                             ;               "GT (greater than)" if (i > j),
                             ;            or "LT (less than)" if (i < j)
        SUBGT  Ri, Ri, Rj    ; if "GT", i = i - j;
        SUBLT  Rj, Rj, Ri    ; if "LT", j = j - i;
        BNE    loop          ; if "NE", then loop

The result of the CMP (Compare) instruction is stored as condition flags in the Current Program Status Register (CPSR) on the CPU for later reference.

Of the 32 bits that form an ARM instruction, 28 are used for the actual instruction, and the remaining four carry the condition code. This is ARM's own way of implementing RISC.
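In C terms, splitting an instruction word this way is a shift and a mask. The sketch below assumes the classic 32-bit ARM encoding, in which the condition field sits in bits 31 to 28; the helper names arm_cond and arm_operation are made up for this example.

```c
#include <stdint.h>

/* A few condition field values from the 32-bit ARM encoding. */
enum { COND_NE = 0x1, COND_LT = 0xB, COND_GT = 0xC, COND_AL = 0xE };

/* The top four bits of the instruction word select the condition. */
static inline unsigned arm_cond(uint32_t insn)
{
    return insn >> 28;
}

/* The remaining 28 bits describe the operation itself. */
static inline uint32_t arm_operation(uint32_t insn)
{
    return insn & 0x0FFFFFFFu;
}
```

For instance, the word 0xC0400001 (SUBGT r0, r0, r1) differs from 0xE0400001 (an unconditional SUB r0, r0, r1) only in its top four bits.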

So when the SUBGT (Subtract if Greater Than) instruction is sent to the CPU, it isn't going to waste CPU cycles loading the condition flags and then checking them with separate instructions. The flags are right there, and depending on their contents, the instruction either executes or does nothing.

The final instruction tells the ARM to branch back to the label "loop" if i and j aren't equal. Notice that inside the loop the ARM never has to jump over one of the subtractions to a different memory location; the only branch left is the one back to the top, which saves precious nanoseconds.
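The behaviour of the ARM listing can be modelled in C to make the role of the flags explicit. This is a rough sketch, not a cycle-accurate emulation: the flags struct and the helper names are our own, the C if statements merely stand in for instructions that the hardware would execute or skip without branching, and the inputs are assumed to be positive values that fit in a signed 32-bit integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition flags as set by CMP (named after ARM's N, Z, C and V). */
typedef struct { bool n, z, c, v; } flags;

/* CMP a, b: compute a - b and keep only the flags. */
static flags cmp(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    flags f;
    f.n = (r >> 31) & 1;                     /* result negative (signed) */
    f.z = (r == 0);                          /* operands equal */
    f.c = (a >= b);                          /* no borrow occurred */
    f.v = (((a ^ b) & (a ^ r)) >> 31) & 1;   /* signed overflow */
    return f;
}

static bool cond_gt(flags f) { return !f.z && f.n == f.v; }
static bool cond_lt(flags f) { return f.n != f.v; }
static bool cond_ne(flags f) { return !f.z; }

/* Mirrors the listing: CMP, SUBGT, SUBLT, BNE loop. */
uint32_t gcd_conditional(uint32_t ri, uint32_t rj)
{
    flags f;
    do {
        f = cmp(ri, rj);              /* CMP   Ri, Rj     */
        if (cond_gt(f)) ri -= rj;     /* SUBGT Ri, Ri, Rj */
        if (cond_lt(f)) rj -= ri;     /* SUBLT Rj, Rj, Ri */
    } while (cond_ne(f));             /* BNE   loop       */
    return ri;
}
```

Note that the two subtractions do not touch the flags, so all three conditional instructions in one pass act on the result of the same CMP.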

This clean, efficient way of going about things means that the ARM doesn't have to fight a Megahertz war. It can get more done per clock cycle, so it can run at a lower clock frequency without too much loss. Lower clock frequencies translate to less heat generated within the chip, so the only thing heating up your ARM is probably the sunlight that you're basking in.
    Jargon Buster

    Registers
    Every processor has a number of registers built right on to the chip. These serve as temporary storage for data or instructions that come into the processor, or for storing important information like the processor's current state, or the memory address that it needs to access next. This is the fastest possible storage in your PC, so wasting it is a processor design no-no.

    Assembly Language
    When you give instructions to a microprocessor, you do so in Assembly Language. The instructions involve very granular operations in the processor, such as moving a byte of data from the main memory to the processor's internal registers. You write assembly code in the form of mnemonics (MOV, ADD, DIV and so on), which are then converted to hexadecimal operation codes or op-codes, which are finally given to the processor.

    Microprograms (written in microcode) take op-codes and convert them into real processor actions. The ADD instruction, for instance, will bring numbers from the system's memory to the processor's registers before adding them, and it's the microcode that implements this part. Microcode is hard-wired into the processor when it's designed, and optimising it for performance is right up there with the designers' biggest headaches of all time.

    Coffee With Jazelle

    We've no doubt talked about how Java works before, but here's a refresher anyway: when you compile a program written in Java, it doesn't get converted to the assembly code for the processor you're working on. Instead, it's translated to Java bytecode, which runs on the Java Virtual Machine (JVM). The JVM then translates the bytecode to assembly language for the processor. The advantage here is that unlike other languages, where you write different code for different platforms (very tedious), you write code just for the JVM. Now you have code that will run on any platform that has a JVM built for it!

    The problem with Java, as should be fairly obvious, is that the time taken for the JVM to translate Java bytecode to assembly causes it to take a huge performance hit; this will not do.
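    A toy interpreter shows where that time goes. The sketch below is nothing like a real JVM: it handles only four opcodes (the numeric values for bipush, iadd, isub and ireturn are the real ones, everything else is our own simplification), but it illustrates how every single bytecode costs a fetch, a decode and a handful of native instructions when done in software.

```c
#include <stddef.h>
#include <stdint.h>

/* A few real JVM opcode values; the rest of this file is a toy. */
enum { OP_BIPUSH = 0x10, OP_IADD = 0x60, OP_ISUB = 0x64, OP_IRETURN = 0xAC };

/* Interprets a tiny subset of bytecode using a software operand stack.
   Returns the value handed to ireturn, or 0 for anything unexpected. */
int32_t interpret(const uint8_t *code, size_t len)
{
    int32_t stack[16];
    size_t sp = 0;

    for (size_t pc = 0; pc < len; ) {
        uint8_t op = code[pc++];            /* fetch */
        switch (op) {                       /* decode, in software */
        case OP_BIPUSH:                     /* push the next byte */
            if (pc >= len || sp >= 16) return 0;
            stack[sp++] = (int8_t)code[pc++];
            break;
        case OP_IADD:                       /* pop two, push the sum */
            if (sp < 2) return 0;
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_ISUB:                       /* pop two, push the difference */
            if (sp < 2) return 0;
            sp--;
            stack[sp - 1] -= stack[sp];
            break;
        case OP_IRETURN:                    /* return the top of the stack */
            return sp ? stack[sp - 1] : 0;
        default:
            return 0;                       /* unsupported opcode */
        }
    }
    return 0;
}
```

    Feeding it the six bytes 0x10, 18, 0x10, 12, 0x64, 0xAC (push 18, push 12, subtract, return) yields 6, at the cost of four trips around the fetch-and-decode loop.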

    Enter Jazelle, ARM's way of combating this performance lag. Jazelle Direct Bytecode eXecution (DBX) takes over much of the JVM's translation work: Jazelle-enabled processors in the ARM7, ARM10, and ARM11 families can now run the most common Java bytecodes directly in hardware, leaving only the rest to software! If you've ever experienced the tedium of using a Java application on your mobile phone, this is a ray of hope. To see the difference that Jazelle makes, watch the video at products/esd/jazelle_home.html.

    Bells And Whistles
    RISC works wonderfully with data that comes in long, continuous streams (video and music, for example), making it a natural choice for any portable multimedia player (PMP). The iPod, among many others, sits on an ARM7-based processor, which is built specifically for multimedia. It's tiny, fast, and lasts hours on a single battery charge. Among the other licensees of the ARM, Freescale Semiconductor builds the i.MX, based on the ARM9 architecture. These processors feature on-board Multimedia Card (MMC) and Memory Stick controllers, a Multimedia Accelerator chip and a Bluetooth accelerator. The architecture supports Windows CE, Linux and Symbian, and you'll be seeing devices based on this in the next few months.

    Pocket Powerplants
    With the world going mobile and the new emphasis on performance per watt, we've started seeing new developments in processor design, which now permit even the good old x86 processor to be plonked into a mobile device. Samsung's Q1 Ultra runs a low-voltage Intel Core 2 Duo, and runs Vista Home Premium with no problems at all! The three-and-a-half hour battery life is disappointing, but the fact that a desktop-class processor can run in this environment is an indication of things to come.

    But does that mean that the ARM will become obsolete? For a long time, no. There's hardly a competitor in the battery life department, and even when it is muscled out of the PDA / UMPC market, portable media players and smartphones will ensure that the ARMs still have punch.  

    Team Digit
