A year ago, I wrote a story hypothesizing that AMD’s long-term success might hinge on future Bobcat-based processors rather than on Bulldozer. I wrote it before we learned that Krishna and Wichita, the two 28nm follow-ups to Brazos built at GlobalFoundries, had been canceled. AMD eventually admitted this and put two new Brazos-based designs on the roadmap: Kabini and Temash. Kabini targets netbook/notebook form factors, with Temash planned as the follow-up to AMD’s first tablet SoC, codenamed Hondo.
To date, AMD’s tablet APUs have found very little market, though the company claims it’ll show off multiple design wins at CES. After spending some time with both Surface and Samsung’s Ativ Smart PC, I think AMD has a real opportunity to win back market share in 2013 — provided that Kabini can ship on time. The 28nm laptop chip is expected by Q2 of next year; Temash, the tablet part, will probably launch in the back half of 2013.
A bit of history
AMD’s Bobcat CPU was designed to compete with Intel’s Atom at the upper end of that CPU’s performance and power curve. Every microprocessor design can be thought of as a balance between power consumption, performance, and manufacturing difficulty. Of the three new chips AMD delivered in 2011, Bobcat is the only one that hit all three. Bulldozer missed its power consumption and performance targets; Llano hit both of these, but was difficult to manufacture.
Brazos (that’s the APU) showed up with its game face on, right as netbook sales began to slump. It’s still an important part of AMD’s sales, but the spotlight has mostly been on AMD’s big-core x86 hardware. With Atom, Intel has focused on improving power consumption and moving to SoCs rather than driving raw performance (the first out-of-order Atom, Valleyview, arrives in 2014). That means 28nm Kabini/Temash has a chance to reignite a performance battle in this segment of the market.
AMD disclosed a significant amount of information on Jaguar at Hot Chips last August. The new core refines and polishes much of what made Brazos successful, without significantly changing much of the underlying hardware. From a high-level perspective, the two are nearly identical.
Block diagrams: Bobcat (above) and Jaguar (below).
Almost — but not quite. And that’s actually encouraging. CPU architecture analyst Agner Fog describes Bobcat [PDF, page 168] as having “a well balanced pipeline design with no obvious bottlenecks.” When it comes to CPU design, most changes are evolutionary and iterative.
One front-end improvement to Jaguar is the addition of four 32-byte loop buffers. Loop buffers are used to hold a small number of already-decoded instructions. This is useful when the CPU is executing tight loops; it ensures that the decoders aren’t tasked with decoding the same instructions repeatedly. This saves power and speeds overall execution.
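To make the benefit concrete, here is a toy model that counts how many instruction decodes a front end performs on a tight loop, with and without a buffer of already-decoded instructions. It is only a sketch under loud assumptions: capacity is expressed as an instruction count rather than bytes, the function name `count_decodes` and its parameters are hypothetical, and nothing here reflects how Jaguar's front end is actually built.

```python
def count_decodes(loop_body_len, iterations, buffer_capacity):
    """Return how many instruction decodes a toy front end performs.

    Assumption (not from AMD): if the whole loop body fits in the buffer,
    it is decoded once on the first iteration and replayed from the buffer
    afterwards. Otherwise every iteration is decoded from scratch.
    """
    if iterations <= 0:
        return 0
    if loop_body_len <= buffer_capacity:
        return loop_body_len
    return loop_body_len * iterations


# A six-instruction loop run 1,000 times, without and with a buffer.
no_buffer = count_decodes(loop_body_len=6, iterations=1000, buffer_capacity=0)
with_buffer = count_decodes(loop_body_len=6, iterations=1000, buffer_capacity=8)
print(no_buffer, with_buffer)  # 6000 6
```

In this simplified picture the decoders sit idle for 999 of the 1,000 iterations, which is where the power saving would come from.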
Jaguar adds a pipeline stage to increase frequency but keeps the two-issue decoder from Bobcat. On the integer side, the core picks up Llano’s hardware divider unit. Previously, integer division was handled via the floating point unit, which caused a significant delay. Jaguar also adds support for SSE4.2 and AVX and features a larger reorder buffer (ROB).
The biggest changes between Jaguar and Bobcat are on the FPU side of the equation. The floating-point units are now 128 bits wide, compared to 64 bits on Bobcat. The chip supports 256-bit AVX by breaking the operations into a pair of 128-bit uops, just like Bulldozer and Piledriver do. FPU performance won’t match Trinity (Jaguar can only decode two instructions per clock cycle, compared to four for the larger core), but it should substantially improve over Bobcat.
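The idea of splitting a 256-bit operation into two 128-bit halves can be sketched in a few lines. This is purely illustrative: vector lanes are modeled as Python floats, eight 32-bit lanes stand in for a 256-bit register, and the helper names `add_lanes` and `avx256_add_as_two_uops` are made up for this example rather than taken from any AMD documentation.

```python
def add_lanes(a, b):
    """Element-wise add of two equal-length lists (one uop in this model)."""
    return [x + y for x, y in zip(a, b)]


def avx256_add_as_two_uops(a, b):
    """Add two 8-lane vectors (8 x 32-bit = 256 bits) as two 4-lane halves."""
    assert len(a) == len(b) == 8
    low = add_lanes(a[:4], b[:4])    # first 128-bit uop
    high = add_lanes(a[4:], b[4:])   # second 128-bit uop
    return low + high                # concatenate the two halves


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
print(avx256_add_as_two_uops(a, b))
```

The result is identical to a single full-width add; the cost is that one instruction occupies the 128-bit hardware twice.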
Next up are the L1/L2 caches. The L1 improvements listed here are all designed to reduce latency penalties and improve FPU bandwidth. Like Bobcat, Jaguar’s L1 is split into a 32K instruction cache and 32K data cache and is two-way set associative.
The L2 cache is a bit different.
Each Bobcat core had a 512K L2 directly attached, clocked at half CPU speed. With Jaguar, AMD has opted to attach a single shared cache to the CPUs. This cache pool is connected via an L2 interface unit, running at full processor speed. The L2 cache itself still runs at 50% core clock.
Going this route has several advantages for AMD. First, it makes more total L2 available to any single core in a single-threaded program. Second, the total number of supported cores rises to four (Bobcat was strictly a dual-core design), and the shared design simplifies the chip’s layout. Data lookups and L2 cache misses should both improve with the new design.
AMD is projecting a 15% IPC gain as well as a 10% frequency gain for the new part. That puts the core in a very interesting position.
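For a back-of-the-envelope sense of what those two figures mean together, the snippet below multiplies them out. It assumes that performance scales as IPC times clock frequency and that both projected gains are fully realized at once; real workloads, especially memory-bound ones, will not scale this cleanly. The function name `combined_gain` is just a label for this example.

```python
def combined_gain(ipc_gain, freq_gain):
    """Combine two fractional gains multiplicatively (perf ~ IPC x frequency)."""
    return (1 + ipc_gain) * (1 + freq_gain) - 1


# AMD's projected figures: 15% IPC, 10% frequency.
total = combined_gain(0.15, 0.10)
print(f"{total:.1%}")  # 26.5%
```

Under those assumptions, the two gains compound to roughly 26.5% over Bobcat rather than simply adding up to 25%.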
AMD is talking about Jaguar/Kabini solely as a quad-core part, but we expect the company will release a dual-core SKU. It’s a sensible way to boost yield and improve availability.
AMD has said virtually nothing about Kabini’s GPU, but it’s easy to predict. The only real question is whether it’ll be based on Northern Islands/Cayman (like Trinity) or on Southern Islands/Graphics Core Next. I’m betting GCN. Kabini is being built at TSMC, which is where AMD already built the Radeon 7000 series. AMD did port Northern Islands to 32nm, but that work was done at GlobalFoundries. It makes far more sense to base Kabini’s GPU on the 28nm IP already in place than to respin Cayman yet again.
We’d bet on a GPU with at least 64 GCN cores, eight texture units (TMUs), and four render output units (ROPs). This configuration would mirror what Sunnyvale deployed with Brazos. GCN’s cores are markedly more efficient than the VLIW5 hardware inside Brazos; overall performance should increase, even if the total number of shader cores drops to 64. More than 64 is a definite possibility; again the company could fuse off parts of the GPU array to hit power consumption targets.
AMD is claiming that Kabini will target the 9-25W TDP range. The slide implies that all Kabini parts will be quad cores with at least an 11.6-inch screen, but this slide is from June. Sunnyvale will probably be flexible on these points and launch Kabini SKUs that match the most popular consumer form factors as closely as possible. It’s pointless to compare against Atom’s TDP, since Intel and AMD don’t count the same way, but I think Sunnyvale will do its damnedest to leverage GCN against the SGX545 GPU found in Atom.
I’ll be surprised if AMD ships a quad-core Kabini at 9W. 15-19W makes more sense for an introductory quad-core part, especially if the company plans to drive a price/performance wedge between Clover Trail and Ivy Bridge. Higher speed dual-cores will debut around the same power envelope — a 1.4GHz quad-core at 19W might share space with a 1.8GHz dual-core. Any 25W part would obviously be a quad-core, possibly with an enhanced graphics core.
Remember, Kaveri — the red bar — isn’t going to hit until the tail end of 2013, with early 2014 being a far more likely launch date. That means there will be overlap between Trinity’s 17W chips and Kabini in the 15-25W segment. Right now, AMD’s two 17W APUs are dual-cores at 1.9 and 2.1GHz. Their integrated graphics parts have 192 and 256 cores respectively. Could a quad-core Jaguar with a GCN-based GPU outperform Trinity-based hardware in the ultra-thin-and-light segment? I suspect the answer is yes, especially if Kabini integrates more than 64 graphics cores.
Positioning & timing are key
The PC market is currently in an enormous state of flux. Reports on Windows 8 sales are spotty, and no one knows for certain which combination of features and product capabilities is going to catch on with customers. Qualcomm is prepping ARM devices, Microsoft has Surface RT, and Intel has bracketed the x86 market with Clover Trail and upcoming Core i5 options.
Kabini and Temash could drive a Goldilocks position between Atom and Ivy Bridge by offering better performance than the former while weighing less than the Core i5. Will it happen? We don’t know. We don’t know what yields look like or how platform TDP compares to Clover Trail in real-world scenarios. But it’s clearly the company’s best chance to create momentum for itself in 2013.
Intel will never support AMD, but Chipzilla likes its high margins. If AMD can ship a product that forms its own bulwark against ARM’s encroachment at the bottom of the market, Santa Clara will likely let it stand. The marginal gains from slashing Clover Trail prices or increasing clock speed to drive up performance (and power consumption) are tiny, and Valleyview is already on the roadmap.
The flip side to this is that Kabini and Temash have an optimum window that’s 12 months wide at most. Tegra 4 and Cortex-A15 designs will increase pressure from the ARM segment, while Haswell (and eventually Valleyview) narrow the x86 opportunity. Maximum effectiveness means getting these parts into hardware as soon as possible.