New AMD low power X86 core, enter the Jaguar

They could have included 4 CUs instead and clocked them to half, and had exactly the same performance, but at a lower TDP.
Obviously die cost would go up in this case.

Power may not necessarily be lower: although dynamic power may be equal (or even better, thanks to a lower voltage), static (leakage) power may well increase, so it could end up as a tradeoff in peak power scenarios. Idle power will surely increase, so this, coupled with increased die cost, will dictate that smaller is better.
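
To put rough numbers on that tradeoff, here is a minimal sketch. It assumes dynamic power scales as units x V^2 x f and that leakage grows with the number of units and with voltage. The function name, the 0.8 V figure and the leakage constant are all made up for illustration, not measured Kabini data.

```python
def power_breakdown(units, rel_clock, volts, leak_per_unit=0.15):
    """Return (dynamic, static) power in arbitrary units.

    Dynamic power is modelled as units * V^2 * f; static power as a
    per-unit leakage term scaled by voltage. All constants are made up.
    """
    dynamic = units * volts ** 2 * rel_clock
    static = units * leak_per_unit * volts
    return dynamic, static


# Same nominal throughput (units * clock = 2.0) in both configurations.
narrow = power_breakdown(units=2, rel_clock=1.0, volts=1.0)
wide = power_breakdown(units=4, rel_clock=0.5, volts=0.8)

for name, (dynamic, static) in (("2 CUs, full clock", narrow),
                                ("4 CUs, half clock", wide)):
    total = dynamic + static
    print(f"{name}: dynamic={dynamic:.2f} static={static:.2f} total={total:.2f}")
```

In this toy model the wider configuration always has lower dynamic power and higher static power. Which one wins on total depends entirely on the assumed leakage constant, and that is the open question here.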
 
They could have included 4 CUs instead and clocked them to half, and had exactly the same performance, but at a lower TDP.

Die size? Kabini is a bit smaller than an IVB dual-core, although the latter isn't an SOC.
It's between the Apple A5 and A6 in size.
 
SNB/IVB would be 4, BD is also only 2 (per int core, though it can dispatch/retire more but decode is 4 shared by 2 int cores).
But I remember reading that the theoretical IPC of Core 2 (Conroe) is 4 too! Does that mean the theoretical maximum has not improved since then?

Also, did you count the supposed fused micro-ops? Or is that just a theoretical niche too, unattainable in practice?
After a long search, it seems Core 2 and Nehalem can both decode 4 instructions, 3 simple and 1 complex fused micro-ops.

In Sandy Bridge, Intel claims 5 instructions can be decoded (probably made possible by the new uop cache): 3 simple, 1 fused and another 1 macro-fused. My source here is Realworldtech's article; although it states 5 instructions, the diagrams show only 4!
http://www.realworldtech.com/sandy-bridge/4/

In Haswell it's basically the same as Sandy Bridge: 5 instructions, 3 simple and 2 fused.

In Bulldozer it is as you said.
 
But I remember reading that the theoretical IPC of Core 2 (Conroe) is 4 too! Does that mean the theoretical maximum has not improved since then?

Also, did you count the supposed fused micro-ops? Or is that just a theoretical niche too, unattainable in practice?
Well, that "4" wasn't entirely correct. Even Core 2 could theoretically, thanks to macro-op fusion, decode 5 instructions per clock (4 "normal" and 1 fused). And since those 5 decoded instructions would only be 4 uops, the rest of the chip can handle them easily as well (4 uops can go into the ROB per clock, and be retired as well). I guess since Sandy Bridge (where the uops can come from the uop cache) those 4 uops could theoretically represent more than 5 x86 instructions, but I don't think it would actually be possible to execute them at the same time (because macro-op fusion is mostly compare+jump, and Core 2 through Ivy Bridge cannot execute more than one such instruction per clock). With Haswell it could possibly work (as it should be able to handle two branches per clock): throughput would still be 4 uops per clock, but those could possibly represent 6 x86 instructions.
But anyway, this is a highly theoretical value. The idea behind a CPU design is to increase real-world IPC; any idiot can build an 8-wide in-order architecture (ok, not quite idiot-proof with x86 due to complex decoding) with a theoretical IPC of 8 that achieves 0.1 in practice, just burning power on all the unused parts of the chip. Intel did lots of things to increase real-world IPC since Core 2 while not really making the design wider (with the exception of Haswell: while still restricted to 4 uops per cycle, there are now 8 execution ports instead of 6).
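
The counting above can be written out as a tiny sketch. It only assumes what the post says: a fixed budget of 4 uops per clock, and each macro-fused pair (e.g. compare+jump) packing two x86 instructions into one uop. The function name is hypothetical.

```python
def x86_instructions_per_clock(uops_per_clock, fused_pairs):
    """Peak x86 instructions represented by a fixed uop budget.

    Each macro-fused pair occupies one uop slot but stands for two
    x86 instructions, so every fused pair adds one instruction.
    """
    if not 0 <= fused_pairs <= uops_per_clock:
        raise ValueError("fused_pairs must fit within the uop budget")
    return uops_per_clock + fused_pairs


# No fusion, one fused branch per clock (Core 2 through Ivy Bridge, per the
# post), and two fused branches per clock (the speculated Haswell case).
for pairs in (0, 1, 2):
    print(pairs, "fused pair(s):", x86_instructions_per_clock(4, pairs))
```

This is purely the theoretical peak; as noted, it says nothing about sustained real-world IPC.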
 
:oops: I have the netbook based on a 1GHz mP6 :p, turned into a SoC with a 1GHz CPU, 256K full-speed L2, 2D graphics and sound. It's a 1.2 watt x86 SoC; the laptop is called Gecko Edubook and dates from 2009, but it has to be repaired (if possible).

I thought it was kind of a 486. Not strictly equal to the mP6 surely; for one thing the Rise mP6 boasts about MMX, but the derived SoCs at best have compatibility at extremely reduced performance.
 
Obviously die cost would go up in this case.

Power may not necessarily be lower: although dynamic power may be equal (or even better, thanks to a lower voltage), static (leakage) power may well increase, so it could end up as a tradeoff in peak power scenarios. Idle power will surely increase, so this, coupled with increased die cost, will dictate that smaller is better.

Isn't the GPU power-gated in Kabini? I know that even power-gated units still leak a bit, that is the gate itself leaks some power, and the bigger the unit, the bigger the gate; the bigger the gate, the bigger the gate leakage. But is it significant enough to matter?

After all, Apple seems to be pretty happy with that kind of trade-off (obviously on different designs).
 
Isn't the GPU power-gated in Kabini? I know that even power-gated units still leak a bit, that is the gate itself leaks some power, and the bigger the unit, the bigger the gate; the bigger the gate, the bigger the gate leakage. But is it significant enough to matter?

After all, Apple seems to be pretty happy with that kind of trade-off (obviously on different designs).
Don't forget the clock of the 3.9W Temash part is already _very_ low (Apple easily exceeds that frequency, though I know frequency alone doesn't tell you much); I doubt you could gain anything at all by decreasing it further and using more units instead. Even the other 8/9W parts all don't exceed 300MHz (ok, one does with turbo), which is still very low for an architecture which apparently is designed to reach 1GHz (ok, maybe on a slightly different process?).
So more CUs would probably only start being slightly helpful at 15W and certainly at 25W.
 
I wonder if it would be possible to get a single strong core at 15w TDP.
It's a piece of cake for IVB, I have a desktop Pentium 2020 at 2.9 GHz that uses <18W with both cores running Prime 95 torture testing and ~15.5W with Cinebench 11.5.
 
Don't forget the clock of the 3.9W Temash part is already _very_ low (Apple easily exceeds that frequency, though I know frequency alone doesn't tell you much); I doubt you could gain anything at all by decreasing it further and using more units instead. Even the other 8/9W parts all don't exceed 300MHz (ok, one does with turbo), which is still very low for an architecture which apparently is designed to reach 1GHz (ok, maybe on a slightly different process?).
So more CUs would probably only start being slightly helpful at 15W and certainly at 25W.
I agree that 2 CUs was the correct choice for the Temash tablet APU. My critique was aimed at the Kabini APU.

The Kabini notebook APU is just a (2x) overclocked Temash (no extra cores, and no extra CUs). If they had spent more resources to create/validate a separate SoC configuration (with 4 CUs) for Kabini (instead of just upping the clocks), they could have created an APU with both (slightly) lower TDP and (slightly) higher GPU performance. The manufacturing cost would of course have been slightly higher as well (+2 CUs require a small amount of extra die space).
 
You would then need dual channel memory to increase that GPU performance. That'd be starting to be another class of system.
If you want a low watt notebook with a faster GPU then you should probably look for an underclocked Richland.
 
I agree that 2 CUs was the correct choice for the Temash tablet APU. My critique was aimed at the Kabini APU.

The Kabini notebook APU is just a (2x) overclocked Temash (no extra cores, and no extra CUs). If they had spent more resources to create/validate a separate SoC configuration (with 4 CUs) for Kabini (instead of just upping the clocks), they could have created an APU with both (slightly) lower TDP and (slightly) higher GPU performance. The manufacturing cost would of course have been slightly higher as well (+2 CUs require a small amount of extra die space).
Ok, for Kabini only it probably would make sense. I guess, though, AMD didn't feel like manufacturing separate dies (or just always going with a bigger die); the gains might not have been worth it.

You would then need dual channel memory to increase that GPU performance. That'd be starting to be another class of system.
If you want a low watt notebook with a faster GPU then you should probably look for an underclocked Richland.
Not really. We're not talking about doubling GPU performance, just something like 4 CUs at 350MHz instead of 2 CUs at 500MHz. Also, you could theoretically go to slightly higher-clocked DDR3L-1866 if more CUs at lower clocks manage to save you some power.
I wonder, though, if 4 CUs at low clocks would really be faster. I'm not quite sure whether lowering clocks and adding more CUs wouldn't significantly shift bottlenecks elsewhere in the GPU (i.e. that one quad-ROP block now looks a bit underspecced, same for setup, which can only do 1/4 prim/clock), which might need more significant rearchitecting to make this worthwhile.
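
A quick back-of-the-envelope sketch of both points. It assumes shader throughput scales linearly with CUs x clock, and that the ROP and setup hardware stays unchanged so its throughput scales with clock only. The function name is hypothetical and real scaling will be less clean.

```python
def relative_throughput(cus, clock_mhz, base_cus=2, base_clock_mhz=500):
    """Return (shader, fixed_function) throughput relative to the baseline.

    Shader throughput is assumed proportional to CUs * clock; the
    fixed-function blocks (ROPs, setup) are assumed to keep the same
    width, so they only follow the clock.
    """
    shader = (cus * clock_mhz) / (base_cus * base_clock_mhz)
    fixed_function = clock_mhz / base_clock_mhz
    return shader, fixed_function


shader, fixed_function = relative_throughput(cus=4, clock_mhz=350)
print(f"shader: {shader:.2f}x, ROP/setup: {fixed_function:.2f}x")
```

Under those assumptions the ALUs gain 40% while the ROP and setup rates drop by 30%, which is the bottleneck shift in question.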
 
Not really. We're not talking about doubling GPU performance, just something like 4 CUs at 350MHz instead of 2 CUs at 500MHz. Also, you could theoretically go to slightly higher-clocked DDR3L-1866 if more CUs at lower clocks manage to save you some power.

I get the feeling from both Richland and Kabini products that they didn't want to go for too high-end, expensive and potentially more power-hungry memory. Richland Mobile was meant to launch with a DDR3-1866 capable part, but they scrapped it and upped the GPU clock instead.
 
I wrote a blogpost (with test data) about Jaguar vs Llano at same clocks (1.5GHz). I tested Llano myself while I sourced Jaguar data from TR, Anandtech as well as PCPer.
Thanks for the comparison. That's 80% IPC of Stars (K10).

Some clock normalized scores of Trinity compared to K10 (http://www.tomshardware.com/reviews/a10-5800k-a8-5600k-a6-5400k,3224-14.html):
7-Zip 77%, Sandra Dhrystone 81%, Whetstone 97%. It's disheartening to notice that K10 is now 6 years old, and it is still the AMD chip with the best IPC.

It would be great if someone had the time and energy to test K10 + Bulldozer + Piledriver + Jaguar at identical clocks (2.0 GHz). That would give us a clearer view of how AMD's IPC has progressed (or regressed) over the last few years.
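
For anyone running such a test at non-identical clocks, clock normalization is just score per GHz relative to a reference chip. A minimal sketch, assuming the benchmark scales linearly with clock (not always true, especially when memory-bound); the function name and the example scores are hypothetical, not taken from any of the reviews mentioned.

```python
def relative_ipc(score, clock_ghz, ref_score, ref_clock_ghz):
    """Per-clock performance of a chip relative to a reference chip.

    Assumes the benchmark score scales linearly with clock speed.
    """
    return (score / clock_ghz) / (ref_score / ref_clock_ghz)


# Hypothetical scores: both chips at 1.5 GHz, then a 4.0 GHz chip
# compared against a 3.0 GHz reference.
same_clock = relative_ipc(score=80, clock_ghz=1.5, ref_score=100, ref_clock_ghz=1.5)
diff_clock = relative_ipc(score=100, clock_ghz=4.0, ref_score=90, ref_clock_ghz=3.0)
print(f"{same_clock:.0%} and {diff_clock:.0%} of the reference chip's per-clock score")
```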
 
I can do Deneb, but my Piledriver box runs ESXi, so the test would be run on a guest. When I tested it after first getting it, though, results were within 1-2% of non-virtualized machines.
 
It's disheartening to notice that K10 is now 6 years old, and it is still the AMD chip with the best IPC.
At the same time, it's really impressive how much performance they retained in Jaguar while keeping it small. Anandtech says each core is only 3.1 mm² (excluding L2). Techreport shows it approaching the IPC of a single-channel ULV i3 (though I suspect that platform was gimped in some other way).
 