New AMD low-power x86 core: enter the Jaguar

Jaguar: Double the FPU width (and datapaths), larger ROB, and higher frequency (with the same power consumption). I'd like to know how far down they can scale this power-wise.

Wrt. Steamroller: It seems they've looked at all the issues we've discussed here. Reduced I$ misses, wider dispatch to cores, better branch prediction and "major improvements to store handling". It'll be interesting to see what that adds up to.

Cheers
 
Jaguar: Double the FPU width (and datapaths), larger ROB, and higher frequency (with the same power consumption). I'd like to know how far down they can scale this power-wise.
Well it seems that they also came with a new cache hierarchy. The L2 is inclusive and shared.

They haven't disclosed any TDP for those CPUs, but I would assume they aim as low as their predecessors (and the scrapped ones), so 4.5 watts for the slowest SKU.
They definitely can't go into the phone realm as Intel is doing with its Atom. They may do well in the tablet market and above; still, I feel they might be too power hungry for the low-end tablets.
The problem is they have to get those parts out fast; the roadmap states 2013 and they haven't provided more details. That's bothersome, as Intel is to launch their OoO Atom, most likely with way better power characteristics, in Q4 2013. AMD's execution problems are going to push them off the cliff for good at this rate. They are missing the Windows 8 launch; that's bad, as they would have moved a wagon of those cores.

By the way, what's with the blackout on those CPU cores in the mainstream media? I can't find a proper article in English, be it Anandtech, Tom's Hardware, or TechReport.

Wrt. Steamroller: It seems they've looked at all the issues we've discussed here. Reduced I$ misses, wider dispatch to cores, better branch prediction and "major improvements to store handling". It'll be interesting to see what that adds up to.

Cheers
Sadly, when Steamroller is released, so will be Haswell... They may just get past their Phenom II parts in single-thread performance with those cores. It's bad. There are still plenty of issues unfixed (as Anand and hardware.fr point out in their articles: L3 cache, FP/SIMD scheduler).
With further tweaking, fixing, and their new high-density library, Excavator may be when Bulldozer comes together. The whole issue is that that's in 2014.
If Haswell is a hit like Conroe and Nehalem were, they are in a shitty situation no matter what.

I wonder if AMD sort of acknowledges the intrinsic issues in BD and they are fixing plenty of things so they look committed to this architecture (they don't have a choice anyway, as I guess that even reusing the innards of BD to make a "standard" core would take 2 years or more to come up with something new... 2 years without new products, when their previous architecture has already given everything it could, is not an option).
If they survive, I would not be that surprised if at that point they split their module into 2 plain cores and forget about the lot of headaches the module introduces. They may at that point also come up with a new cache hierarchy.
 
Jaguar seems to be a really big improvement over Bobcat. It has slightly reduced power consumption... and still AMD has managed to double the core count, double the total L2 cache size, improve clocks by 10%, improve general x86 IPC by 15% and vector IPC by 100%.

If my math is correct, we should see around 2.53x (2.0*1.10*1.15) performance in generic (four thread) x86 software and around 5.06x (previous * 2.0) performance in vector processing (integer/float) compared to Bobcat. Not bad at all :)
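The arithmetic above can be sketched as a quick back-of-the-envelope script (the 2.0/1.10/1.15 factors are the marketing numbers quoted in this post, not measurements):

```python
# Rough Jaguar-vs-Bobcat scaling estimate from the factors quoted above.
cores = 2.0   # 2x core count
clock = 1.10  # ~10% higher clocks
ipc   = 1.15  # ~15% general x86 IPC gain
simd  = 2.0   # doubled vector width

generic = cores * clock * ipc  # multithreaded x86 estimate
vector  = generic * simd       # naive peak-vector estimate

print(f"generic: {generic:.2f}x")  # generic: 2.53x
print(f"vector:  {vector:.2f}x")   # vector:  5.06x
```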
 
If my math is correct, we should see around 2.53x (2.0*1.10*1.15) performance in generic (four thread) x86 software and around 5.06x (previous * 2.0) performance in vector processing (integer/float) compared to Bobcat. Not bad at all :)

The wider SIMD paths will help in some general-purpose workloads too; SSE is widely used for copying, string searching, etc.

Seems like a good fit for a W8 tablet SOC.

Cheers
 
Jaguar IPC shouldn't be far away from Bulldozer's.

The original Bobcat introduction marketing slides described Bobcat's goal as having 90% of the IPC of their high-end desktop CPUs. If I remember correctly, Bobcat was compared to Phenom at that time. If we just calculate IPC from their marketing department's numbers, Jaguar should have on average pretty much identical IPC to Phenom (0.9*1.15 = 1.035). Phenom II has around 5%-25% higher IPC than Phenom (benchmarks: http://www.anandtech.com/show/2702/9). Bulldozer on the other hand has slightly worse IPC than Phenom II. All things combined, Jaguar's IPC shouldn't be that far away from Bulldozer's. Of course Bulldozer scales to much higher clocks, and both Piledriver and Steamroller improve the IPC further. Still, the IPC is very respectable for such a low-power CPU.
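Chaining those relative-IPC claims together (all of them marketing numbers and benchmark ballparks from this post, nothing measured; the Phenom II range is the 5%-25% spread mentioned above):

```python
# Relative-IPC chain from the marketing claims discussed above.
bobcat_vs_phenom = 0.90  # "90% of high-end IPC" Bobcat claim
jaguar_gain      = 1.15  # claimed Jaguar IPC gain over Bobcat

jaguar_vs_phenom = bobcat_vs_phenom * jaguar_gain
print(f"Jaguar vs Phenom IPC: {jaguar_vs_phenom:.3f}")  # 1.035

# Phenom II is roughly 5%-25% faster than Phenom per the linked benchmarks,
# so Jaguar vs Phenom II lands somewhere in this range:
low  = jaguar_vs_phenom / 1.25
high = jaguar_vs_phenom / 1.05
print(f"Jaguar vs Phenom II: {low:.2f}..{high:.2f}")
```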
 
The original Bobcat introduction marketing slides described Bobcat goal was to have 90% of the IPC compared to their high end desktop CPUs. If I remember correctly Bobcat was compared to Phenom at that time.

I think that they never solidified the claim, and my hunch (based on how Bobcat fared in practice) was that they were actually comparing to an older K8 or something. I don't think there's any workload where Bobcat gets reasonably close to Phenom (at equal clocks).
 
I think that they never solidified the claim, and my hunch (based on how Bobcat fared in practice) was that they were actually comparing to an older K8 or something. I don't think there's any workload where Bobcat gets reasonably close to Phenom (at equal clocks).

http://www.xtremesystems.org/forums...Core-performance-analysis-Bobcat-vs-K10-vs-K8

[attached chart: benchfin2.png]



Things could look quite bad for AMD when Jaguar beats Trinity in non-FMA FP... lol.
 
http://www.xtremesystems.org/forums...Core-performance-analysis-Bobcat-vs-K10-vs-K8

[attached chart: benchfin2.png]



Things could look quite bad for AMD when Jaguar beats Trinity in non-FMA FP... lol.
Thanks for that data :)

I remember reading an article (Anandtech) stating that Bobcat was in between Atom and K10. Now that I see your results, I realize that I didn't pay attention to something important: the results are not normalized for clock speed. AMD came close to their 90% claim.

EDIT: you posted that chart in the "predict next gen etc" forum. There seem to be some concerns about the perf of those parts.
 
A lot more detail from the actual talk here:

http://www.theregister.co.uk/2012/08/29/amd_jaguar_core_design/

All of these components are spread out on the Jaguar chip in an "amoeba-like" floor plan that Rupley says "took a lot of blood, sweat, and tears" to come up with and that was created using tools developed by the ATI side of the house to build AMD's GPUs. "We had some initial floor plans that were really terrible," admits Rupley, as the CPU designers learned to use the GPU tools better.

The Bobcat core weighs in at 4.9 square millimeters in area using the 40 nanometer process at TSMC, and if Jaguar were implemented in the same process it would have about 10 per cent more area, according to Rupley. But lucky for AMD and its customers, Jaguar cores will be implemented in 28 nanometer processes and will only need 3.1 square millimeters of space.

It's interesting that the layout tools from the ATI side are now in full deployment on CPU core designs. They don't mention whether Jaguar will have a unified CPU/GPU memory space; it seems like something they could do since they're pairing the CPU with a GCN-based GPU. Also, I guess when each core is that small, it doesn't make much sense to invest in sizable circuitry to power them each down individually.
 
A lot more detail from the actual talk here:

http://www.theregister.co.uk/2012/08/29/amd_jaguar_core_design/



It's interesting that the layout tools from the ATI side are now in full deployment on CPU core designs. They don't mention whether Jaguar will have a unified CPU/GPU memory space; it seems like something they could do since they're pairing the CPU with a GCN-based GPU. Also, I guess when each core is that small, it doesn't make much sense to invest in sizable circuitry to power them each down individually.
Does anybody know if those "layout tools" are the same as the "high density libraries" AMD plans to use on the Excavator cores?
 
So slightly better budget notebooks, tablets-with-fans, and curious little ITX boards are on the way? ;)
 
Does anybody know if those "layout tools" are the same as the "high density libraries" AMD plans to use on the Excavator cores?

They probably are one and the same, and it's likely that the same libraries are responsible for the dramatic increase in density from ATI's 38XX series to the 48XX series; I certainly wasn't expecting a leap from 320 shaders to 800 in one generation!

It's interesting that on the BD fpu, there's a large regular striped area on the left that doesn't look like cache:

http://images.anandtech.com/doci/6201/Screen Shot 2012-08-28 at 4.38.31 PM.png

Maybe those are ALUs? I need Hans De Vries. :) The density is very nice for die space and power but Anand says these denser units don't clock as well. I'll settle for two of these denser FPUs per module over a single fast clocked FPU.
 
They probably are one and the same, and it's likely that the same libraries are responsible for the dramatic increase in density from ATI's 38XX series to the 48XX series; I certainly wasn't expecting a leap from 320 shaders to 800 in one generation!
That seems a reasonable guess to make.
It's interesting that on the BD fpu, there's a large regular striped area on the left that doesn't look like cache:

http://images.anandtech.com/doci/6201/Screen Shot 2012-08-28 at 4.38.31 PM.png

Maybe those are ALUs? I need Hans De Vries. :) The density is very nice for die space and power but Anand says these denser units don't clock as well. I'll settle for two of these denser FPUs per module over a single fast clocked FPU.
I guess both pictures are simulations; the blank part is just blank space. I guess it helps to highlight the gain in area.
Wrt clock speed, AMD itself stated that those libraries best serve power-constrained designs. Everything is power constrained nowadays... lol.
But I think it's going to have an impact on OC, though it could prove a far lesser evil than falling even further behind Intel with regard to transistor density and power characteristics.

------------------
By the way, there are inaccuracies in The Register's article: they state that each core has 512KB of L2, which is incorrect; the 2MB of cache is shared by the 4 cores.
Overall I like hardware.fr's article better (more info/slides and fewer inaccuracies), go Frenchies / Belgians :)
 
I guess both pictures are simulations; the blank part is just blank space.

I meant that region w/ the multi-colored, vertical striped pieces on the left of the connected, non white-space region present in both panels. It's pretty clear what features are homologous across the two panels, and I'm pretty sure the unmoved boxes are cache or memory cells of some sort. The striped region is a mystery; it looks much more regular than the blobby regions, and its topology seems less affected by the optimization.

By the way, there are inaccuracies in The Register's article: they state that each core has 512KB of L2, which is incorrect; the 2MB of cache is shared by the 4 cores.
Overall I like hardware.fr's article better (more info/slides and fewer inaccuracies), go Frenchies / Belgians :)

I'm wondering what difference in power having a full-speed cache would have made. The shared cache is definitely a nice single-thread feature that probably consolidates some circuitry across the cores and possibly the GPU as well.

Jaguar looks like a pretty good upgrade over Bobcat, but it's not enough reason for me to get a new netbook. I'd be much more convinced if they dramatically improved the displays they put in those things, and I'm glad Apple's pushing the industry toward high density IPS screens.
 
Jaguar seems to be a really big improvement over Bobcat. It has slightly reduced power consumption... and still AMD has managed to double the core count, double the total L2 cache size, improve clocks by 10%, improve general x86 IPC by 15% and vector IPC by 100%.

If my math is correct, we should see around 2.53x (2.0*1.10*1.15) performance in generic (four thread) x86 software and around 5.06x (previous * 2.0) performance in vector processing (integer/float) compared to Bobcat. Not bad at all :)

Your math is not correct.

You can't just multiply the "overall IPC increase" and SIMD width numbers.

When the vector width doubles, in order for performance to really double, the performance of all the other parts of the chip that are needed for those FP calculations (instruction fetch, decode, memory bandwidth, etc.) should also double. But those other parts are only getting a 15% increase in performance. So the real performance increase is somewhere between 1.15x and 2x, not 1.15*2.

Though they also widened the L1D<->FP datapaths, so that L1D bandwidth to the FPU is also doubled; that won't become a worse bottleneck.
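The "somewhere between 1.15x and 2x" bound can be illustrated with a simple Amdahl-style mix model. This is just a sketch: the fraction f of cycles that are vector-execution-bound is a made-up knob, not a measured workload property.

```python
# Amdahl-style bound: vector-execution-bound cycles speed up 2x (wider SIMD),
# everything else (fetch, decode, memory, etc.) speeds up ~1.15x.
def jaguar_speedup(f_vector_bound: float) -> float:
    return 1.0 / ((1.0 - f_vector_bound) / 1.15 + f_vector_bound / 2.0)

for f in (0.0, 0.5, 1.0):
    print(f"f={f:.1f}: {jaguar_speedup(f):.2f}x")
# prints 1.15x, 1.46x, 2.00x for f = 0, 0.5, 1
```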
 
You can't just multiply the "overall IPC increase" and SIMD width numbers.
That wasn't intentional :oops:. The correct peak vector (int/float) performance should be 1.1*2.0*2.0 = 4.4x. Fortunately they also doubled the datapaths and cache sizes (the double-rate parts will go through the caches twice as fast). Unfortunately memory bandwidth is still unknown, as is the GPU paired with the core. You need more memory bandwidth to be able to reach 4.4x performance. Hopefully AMD reveals the other parts of the APU soon. If they intend to change the GPU to GCN (as most reviewers seem to believe), I expect a big bump in GPU performance as well, and a better GPU would make them bandwidth constrained (all Trinity APUs are also bandwidth constrained; memory overclocks give large benefits to performance).

Another thing to discuss is how they have implemented the 256-bit vector (AVX) support for the chip, and how Bobcat handled 128-bit SSE instructions in its half-width 64-bit vector pipelines. Did Bobcat split 128-bit vector ops in the decoders into two 64-bit uops? Does Jaguar do the same for 256-bit ops (split into 128-bit uops)? If both architectures split the large vectors into two uops, the Jaguar frontend/icaches/etc. should be able to process the same number of 128/256-bit instructions as Bobcat processed 64/128-bit instructions. This would mean that the frontend/icaches/etc. are also able to sustain the 2x performance. If the new architecture can populate the pipelines better (better frontend/etc.), we might even see slightly more than 2x vector processing performance (real performance, not peak). But this of course assumes we are not bottlenecked elsewhere (and as you said, we likely are).
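The uop-splitting hypothesis above can be put in toy numbers: if each generation cracks full-width vector ops into two half-width uops, decode cost per instruction stays constant and peak vector throughput simply tracks datapath width. The pipe widths below restate this post's assumption (64-bit Bobcat pipes, 128-bit Jaguar pipes); they are illustrative, not confirmed specs.

```python
# Toy model: peak vector bits retired per cycle = pipe width * pipe count.
# If wide ops are split into two half-width uops, decode pressure per
# instruction is unchanged, so throughput scales with pipe width alone.
def vector_bits_per_cycle(pipe_width_bits: int, pipes: int) -> int:
    return pipe_width_bits * pipes

bobcat = vector_bits_per_cycle(64, 2)   # assumed two 64-bit pipes
jaguar = vector_bits_per_cycle(128, 2)  # assumed two 128-bit pipes
print(jaguar / bobcat)  # 2.0 -> the "2x vector throughput" scaling
```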
 
I think that they never solidified the claim, and my hunch (based on how Bobcat fared in practice) was that they were actually comparing to an older K8 or something. I don't think there's any workload where Bobcat gets reasonably close to Phenom (at equal clocks).

It is true: http://www.anandtech.com/bench/Product/328?vs=116

K10 based on Deneb is, I think, about 15-20% faster, so compared to Bobcat the difference is probably 25-35%. That's not small.
 