New AMD low power X86 core, enter the Jaguar

Stars/Phenom L1D does not have better associativity - quite the contrary, it's only 2-way (which does seem low indeed), but of course it's much bigger (64 kB). And it was exclusive and write-back, of course.
K8/K10 also had dedicated ECC protection for the L1D, whereas BD omitted this feature, so now on every single-bit error the corresponding cache line must be reloaded from the L2.
 
K8/K10 also had dedicated ECC protection for the L1D, whereas BD omitted this feature, so now on every single-bit error the corresponding cache line must be reloaded from the L2.
Unless you're living very close to the sun, I don't think you'd notice that it's slower due to that happening once every other year...
 
Unless you're living very close to the sun, I don't think you'd notice that it's slower due to that happening once every other year...

I don't think it's so low. The susceptibility goes up as voltage and feature size go down. Check out the soft-error rate models used in this publication: http://www.cse.psu.edu/~mdl/paper/lin-islped04.pdf

For "normal" voltage they had one error every 10 million cycles, and that was way back on 130 nm. So I'd expect the number to be in the seconds, not years.
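To put that on a time scale (purely back-of-the-envelope arithmetic; the 2 GHz clock here is an assumption for illustration, not a figure from the paper): 10^7 cycles / (2 x 10^9 cycles per second) = 5 ms between errors, so even a rate several orders of magnitude lower than that old 130 nm figure would still land in seconds or minutes rather than years.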

Nonetheless, write-through caches have much lower error rates.
 
I have to disagree with this one. I have never seen heavy inner loops where more than 75% of the instructions are MAD/FMA. Even the simplest pure SoA-style dot product loop has 3x FMA + 1x MUL per four dot products (75% FMA). Our inner view culling loop (mainly 5 dot products per viewport) has some vector float compares and splats in addition to those (around 50% FMA). 4x4 matrix multiply is another FMA-heavy operation, but that is 16 splats, 4 MULs, 12 FMAs (37.5% FMA). If you consider the fact that splats go to the MMX pipeline on BD, the percentage of FMA entering the float vector pipeline is again 75% (common for pure dot product based operations). It's very hard to create a function with more than 75% FMA.

Ok, you have small vectors in 3D graphics. I was thinking more in the direction of signal processing and scientific workloads.

But even in that 4-wide case where there are 3 FMA + 1 MUL, 2 FMA units can give a throughput of one iteration per 2 cycles(*), but one adder + one multiplier can only give a throughput of one iteration per 4 cycles (the 3 FMAs decompose into 3 MULs + 3 ADDs, so the 4 multiplies serialize on the single multiplier).
So FMA still gives twice the throughput.

(*) (assuming we can parallelize the code so that the serialization of the FMAs does not become a bottleneck, for example by running multiple totally independent work items in parallel in the same SIMD lane)
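For reference, here is a minimal sketch (not from any poster; the function and array names are made up) of the SoA-style dot product loop discussed above, written with FMA3 intrinsics, showing the 3x FMA + 1x MUL mix per four dot products:

```c
// Four dot products of 4-component vectors per iteration, SoA layout.
// Compile with FMA support, e.g. gcc -O2 -mfma. n is assumed to be a multiple of 4.
#include <immintrin.h>

void dot4_soa(const float *ax, const float *ay, const float *az, const float *aw,
              const float *bx, const float *by, const float *bz, const float *bw,
              float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 d = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i)); // 1x MUL
        d = _mm_fmadd_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i), d);   // 3x FMA
        d = _mm_fmadd_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i), d);
        d = _mm_fmadd_ps(_mm_loadu_ps(aw + i), _mm_loadu_ps(bw + i), d);
        _mm_storeu_ps(out + i, d); // four dot products stored per iteration
    }
}
```

The throughput argument falls straight out of the op counts: on two FMA-capable pipes the 4 arithmetic ops need 2 cycles per iteration, whereas on one multiplier plus one adder the 3 FMAs decompose into 3 MUL + 3 ADD, so the 4 multiplies serialize into 4 cycles per iteration - provided independent iterations are interleaved so the FMA dependency chain doesn't dominate, as the footnote says.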
 
I am intrigued by the fact that AMD chose to implement a twice as large and twice as associative L1D cache for the (low end) Bobcat as they did for Bulldozer. Did they have to compromise the cache design in Bulldozer to reach high clocks? Current max BD turbo is at 4.5 GHz (and without heat limits BD can overclock up to 8 GHz with LN2). They seem to have lots of extra clock headroom that they cannot use (even on desktops) because of TDP/heat constraints.

Intel seems to be using its extra clock headroom to improve IPC (clocks haven't improved lately but IPC has). The same seems to be true for Haswell. This kind of development seems to better fit the idea of using down-clocked high end parts in 10W/17W ultra portables. BD/PD at 1.6 GHz (AMD A8-4555M) isn't in any way an optimal use of the hardware (it has so many needless extra transistors dedicated to reaching higher clocks).
But even in that 4-wide case where there are 3 FMA + 1 MUL, 2 FMA units can give a throughput of one iteration per 2 cycles(*), but one adder + one multiplier can only give a throughput of one iteration per 4 cycles. So FMA still gives twice the throughput.
Yes. But BD has only two FMA units per module (2 cores), while Bobcat/Jaguar have two adders and two multipliers. So both reach the same throughput.
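Working that out with the same op counts as above (using the unit counts exactly as stated in these two posts):

BD module (2 cores): 3 FMA + 1 MUL = 4 ops on 2 FMA units -> 2 cycles per iteration
Bobcat/Jaguar (2 cores): 4 MUL on 2 multipliers, 3 ADD on 2 adders -> 2 cycles per iteration

so the per-two-core throughput really does come out the same.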
 
I am intrigued by the fact that AMD chose to implement a twice as large and twice as associative L1D cache for the (low end) Bobcat as they did for Bulldozer. Did they have to compromise the cache design in Bulldozer to reach high clocks? Current max BD turbo is at 4.5 GHz (and without heat limits BD can overclock up to 8 GHz with LN2). They seem to have lots of extra clock headroom that they cannot use (even on desktops) because of TDP/heat constraints.

That has to be the case; BD's L1D cache is really small by all comparable standards. Even most ARM chips have been using 32 kB L1 for a while now, although Cortex-A15 is going to 2-way associativity, regrettably. Still, you can see that Bobcat's decisions are not out of place; rather it's BD that looks odd.

I'm guessing AMD chose the quite wide associativity in Bobcat's L1D at least partially so they could use VIPT without aliasing problems. The L1I could be PIPT, or maybe they don't mind aliasing flushes as much there (at least they didn't on BD).
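For anyone following along, the aliasing argument is just arithmetic (cache figures from memory, so take them with a grain of salt):

VIPT without aliasing requires: cache size / associativity <= page size (4 kB on x86)
Bobcat L1D: 32 kB / 8 ways = 4 kB per way -> all index bits come from the page offset, so the virtual and physical index are identical
BD L1I: 64 kB / 2 ways = 32 kB per way -> some index bits are virtual, so aliasing (and the flushes mentioned above) is possible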
 
That has to be the case; BD's L1D cache is really small by all comparable standards. Even most ARM chips have been using 32 kB L1 for a while now, although Cortex-A15 is going to 2-way associativity, regrettably. Still, you can see that Bobcat's decisions are not out of place; rather it's BD that looks odd.
Yeah, the only other "modern-ish" x86 CPU featuring such a small L1D cache is... the P4. Actually, up to Northwood it was only 8 kB (4-way), but Prescott bumped it to 16 kB (8-way).
I dunno though, sacrificing cache size/associativity so you can reach higher clock speeds which you actually can't reach in practice anyway sounds like a colossal mistake to me. The P4 at least could actually hit higher clock speeds in practice (not that it helped it, mind you, but the small L1D cache was probably the smallest of its problems).
Though Atom isn't that far ahead of BD there with its 24 kB/6-way L1D cache :).

I'm guessing AMD chose the quite wide associativity in Bobcat's L1D at least partially so they could use VIPT without aliasing problems. The L1I could be PIPT, or maybe they don't mind aliasing flushes as much there (at least they didn't on BD).
That Bobcat paper mentions the L1 ITLB and cache are accessed in parallel, which would imply virtual indexing. It also mentions, though, that the ITLB isn't actually accessed if the fetch is to the same page as the previous one, hence it shouldn't really matter for performance.
 
Forgive me, but what do VIPT and PIPT stand for? Once I get what they stand for, I can just look it up, so y'all don't need to explain the whole thing.

Thanks in advance.
 
Basically my Jaguar vs Trinity/PD investigation began because I wanted to get deeper insight into how a 17W (2 module, 4 core) low clocked Trinity would compare to the new Jaguar-core-based APU (both have four cores, similar clock rate, similar TDP and can sustain 2 uops/cycle/core). After the trade show event (in January), there has been zero news about the 17W Trinity, and it has been eight months since. I am just wondering if AMD is going to replace it with a Jaguar-based APU. A 1.815 GHz (=1.65*1.1) Jaguar-based APU should be very close in performance to the 17W Trinity running at the rumored 1.5-1.6 GHz clocks. That's why IPC comparisons make sense. I am just trying to figure out how they would compare in a TDP constrained setting.
FWIW, looks like the 2-module ULV Trinity part (A8-4555M) has been released. Unlike the 1-module version (A6-4455M, released ages ago) it didn't quite make it to 17W though; instead it's now a 19W part. Clocks are 1.6 GHz/2.4 GHz - so the turbo clock should be higher than Jaguar's, but I don't know how often it's actually able to clock up that much. In any case the clocks are quite a bit lower compared to the 25W part (A10-4655M - 2 GHz/2.8 GHz) - though a 1.8 GHz quad-core Kabini might also need 25W. I couldn't find information about the A8-4555M GPU other than that it's called 7600G (it could be either a 4 SIMD part or a 6 SIMD part with low clocks). In any case a 4 CU GCN part should compare quite favorably to that, though Trinity ULV should still be faster there because of dual-channel memory.
But a better comparison for quad-core Kabini would probably be against a ULV 2-module Kaveri.
 
Damn! That's pretty swanky...

I'm looking to buy a long-lived Win8 Pro dockable tablet for my wife this year; I'd love to get one of those Jaguar cores over the Atom options that would otherwise fit the bill.
 
The L2 interface takes up about as much space as an entire core.
There's a whitepaper out there talking about the power optimization tech used for Jaguar, and the addition of that interface added a large number of active flops to the design relative to other components.

I wonder what other scalability measures it has besides allowing for a shared 4-bank L2.
 
I think I found the whitepaper you are referring to over on Calypto. I'll give it a read later tonight...
 
The L2 interface takes up about as much space as an entire core.
It's more like a fifth "dedicated" core among the rest, with a lot of active logic and power control functions besides facilitating the interface arbitration.

Well, the bus interface unit in a BD module is not small either. ;)
 
As far as I know, and subject to change, we'll probably see Kabini (desktop Jag) around May-ish. Lisa Su had a tentative roadmap in her CES 2013 talk. Take it with an adequate grain of salt though; AMD's roadmaps are fluid.
 
Damn! That's pretty swanky...

I'm looking to buy a long-lived Win8 Pro dockable tablet for my wife this year; I'd love to get one of those Jaguar cores over the Atom options that would otherwise fit the bill.

There are rumors that the next MS Surface Pro will be Kabini-based, in addition to a higher-end Haswell model if I recall correctly. For whatever that's worth.
 