New AMD low power X86 core, enter the Jaguar

I'd be much more interested if they bundled ~16 Jaguar cores without the GPU on a single chip, set up as independent systems, each with its own single-channel memory controller and 1 Gbit/s Ethernet, plus a shared I/O system. That could actually be pretty good value as a web server, and would fit the SeaMicro stuff well.
A server SKU would need some additional RAS features to the micro-architecture and the memory pipeline, but otherwise Jaguar is very well suited for a Niagara-type of high-throughput server SoC with large variety of misc. I/O and dedicated HW blocks.
 
A server SKU would need some additional RAS features to the micro-architecture and the memory pipeline, but otherwise Jaguar is very well suited for a Niagara-type of high-throughput server SoC with large variety of misc. I/O and dedicated HW blocks.

At least Jaguar now supports ECC, so they have made a first step toward bringing it into the server world.
 
A server SKU would need some additional RAS features to the micro-architecture and the memory pipeline, but otherwise Jaguar is very well suited for a Niagara-type of high-throughput server SoC with large variety of misc. I/O and dedicated HW blocks.

I would not say "very well suited for high-throughput" for a core that has absolutely no multi-threading support. OOE only hides relatively short stalls, the cores would be idling when a longer stall occurs.

And when there are _lots_ of threads, OOE is just not needed, multi-threading gets better performance for cheaper/less power.
 
I would not say "very well suited for high-throughput" for a core that has absolutely no multi-threading support. OOE only hides relatively short stalls, the cores would be idling when a longer stall occurs.

It would have a similar issue rate: 4x2 instructions/cycle @1.6GHz vs 1x4 instructions/cycle @3.2GHz. Similar OOOe capabilities: 4x48 ROB slots vs 2x96 ROB slots for Haswell in SMT mode (1x192 in non-SMT mode). Similar LS bandwidth per nanosecond. It would have larger internal execution width and larger branch resolve capabilities.
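The comparison above is easy to check with a few lines of arithmetic. This is only a sketch using the figures quoted in this post (peak issue width, clock, ROB entries); the helper names are made up for illustration, and peak issue rate says nothing about sustained throughput.

```python
# Peak aggregate figures, using only the numbers quoted in the post above.
# Helper names are hypothetical; these are theoretical peaks, not measurements.

def peak_issue_rate(cores, width, ghz):
    """Peak instructions per nanosecond summed over all cores."""
    return cores * width * ghz

def total_rob_slots(partitions, slots_each):
    """Total reorder-buffer entries across cores (or SMT threads)."""
    return partitions * slots_each

jaguar_issue = peak_issue_rate(cores=4, width=2, ghz=1.6)
big_core_issue = peak_issue_rate(cores=1, width=4, ghz=3.2)

jaguar_rob = total_rob_slots(4, 48)    # four Jaguar cores
haswell_rob = total_rob_slots(2, 96)   # one Haswell core in SMT mode

print(f"4x Jaguar:   {jaguar_issue:.1f} inst/ns peak, {jaguar_rob} ROB slots")
print(f"1x big core: {big_core_issue:.1f} inst/ns peak, {haswell_rob} ROB slots")
```

Both configurations come out at the same peak issue rate and the same total ROB capacity, which is the point being made.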

If you can stuff four times the cores into the same die area as a single Ivy Bridge or Piledriver core and keep the power envelope lower, I'd say it is well suited.

And when there are _lots_ of threads, OOOe is just not needed, multi-threading gets better performance for cheaper/less power.

Oracle (was: Sun) added OOOe to their T4 design, increasing per socket performance compared to T3 while halving the number of cores.

Cheers
 
And they're going back to 16 cores with the T5, keeping the T4 core design. I read an article on it that I stumbled on by chance. There are eight threads per core, so they're really selling a 128-thread OOOe CPU. They've added lots of NUMA links so that 8-socket builds are first-class citizens too, giving you a computer with up to 1024 threads :runaway:
It will be a tad more expensive than a console, PC or tablet though (and you have to buy it from Oracle?)


I've seen it mentioned more than once on this forum, though, that not all OOOe is equal: the kind found in e.g. the PowerPC G3/GameCube/Wii was said to be much simpler than what's in the Pentium Pro/II/III, for instance.
 
And when there are _lots_ of threads, OOE is just not needed, multi-threading gets better performance for cheaper/less power.
That's true until you start to thrash your caches. If for a given IPC improvement you pay a similar cost for OoOE or multi-threading, you always want to go with the former.
 
Acer Aspire V5 benchmarks (ultraportable with Temash A6-1450 tablet chip):
http://ultrabooknews.com/2013/05/10/live-now-acer-aspire-v5-and-amd-temash-testing/

For the CPU alone they measured a 3W TDP (four cores, 1.4 GHz).

The A6-1450 scores 1.23 in multithreaded Cinebench (11.5). In comparison, the 17W ULV Sandy Bridge i5-2537M (1.4/2.3 GHz) scores 1.34 (Samsung Series 9), and the 16W ULV Ivy Bridge i7-3517U (1.8/2.9 GHz) scores 2.79 (ASUS Zenbook Prime UX21A).

If you directly extrapolate the Cinebench result to 2.0 GHz (Kabini at max clock) the score would be 1.76. That's not enough to compete with Ivy Bridge (or Haswell), but it should beat all old Sandy Bridge based ULV models (in multithreaded SIMD code - it of course loses badly in single threaded code).
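The extrapolation above is a straight linear scaling with clock. A minimal sketch, assuming the score scales perfectly with frequency (which real chips rarely manage once memory bandwidth becomes a factor); the function name is made up for illustration:

```python
def scale_by_clock(score, measured_ghz, target_ghz):
    """Linear extrapolation: assumes the score scales perfectly with clock."""
    return score * target_ghz / measured_ghz

a6_1450_score = 1.23   # Cinebench 11.5 multithreaded at 1.4 GHz, from the post
estimate = scale_by_clock(a6_1450_score, measured_ghz=1.4, target_ghz=2.0)

print(f"Estimated score at 2.0 GHz: {estimate:.2f}")
```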
 
For the CPU alone they measured a 3W TDP (four cores, 1.4 GHz).

That's not TDP; the actual TDP should be somewhat higher.

So let's compare it to Silvermont.

A 1.8GHz Atom Z2760 "Clover Trail" gets 0.5 points in 3DMark11.

If we assume double the score from going quad core and a 1.5x perf/clock increase, that would mean a 1.8GHz Silvermont should get 1.5 points. It probably won't scale linearly, so it may need 2GHz to do so.

Looks like Jaguar may have about a 15% advantage per clock over Silvermont, and over Bobcat as well.
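For anyone who wants to redo the estimate with different assumptions, here is a rough sketch. The 2x core-count factor and 1.5x perf/clock factor are the assumptions stated in this post, the Jaguar figure is the Temash result quoted earlier in the thread, and the function name is made up. The per-clock gap depends heavily on which Silvermont clock you assume.

```python
clover_trail_score = 0.5   # 1.8 GHz Atom Z2760, as quoted in the post
core_count_factor = 2.0    # assumption from the post: dual core -> quad core
per_clock_factor = 1.5     # assumption from the post: perf/clock gain

silvermont_estimate = clover_trail_score * core_count_factor * per_clock_factor

jaguar_per_ghz = 1.23 / 1.4   # A6-1450 score per GHz, from earlier in the thread

def jaguar_per_clock_advantage(silvermont_ghz):
    """Jaguar's score-per-GHz advantage if Silvermont needs this clock for 1.5 points."""
    silvermont_per_ghz = silvermont_estimate / silvermont_ghz
    return jaguar_per_ghz / silvermont_per_ghz - 1.0

for ghz in (1.8, 2.0):
    print(f"Silvermont at {ghz} GHz: Jaguar per-clock advantage "
          f"{jaguar_per_clock_advantage(ghz):+.0%}")
```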
 
You mean 0.5 points in Cinebench 11.5 yes?

>15% IPC is what AMD has said all along and they appear to have hit that target at least.

Sebbbi - remember none of Intel's big core chips have the southbridge on die yet, so that'll need more power. Based on the leaks of Haswell going around I'm expecting a step backwards in x86 performance in order to pay for the GPU increase and integrated southbridge on the ULV parts.
 
You mean 0.5 points in Cinebench 11.5 yes?

Yes, thanks for pointing that out. :D

Based on the leaks of Haswell going around I'm expecting a step backwards in x86 performance in order to pay for the GPU increase and integrated southbridge on the ULV parts.

ULT version of Haswell is said to be 15% faster than the predecessor, which indicates somewhat higher clock speeds.
 
overall it looks like a great improvement over bobcat, higher ST performance and much higher MT performance....

but still, looking at some of the tests it makes me wonder if it's really valid to go with such low ST performance; there are occasions where you are limited mostly by ST performance in basic usage, so I don't think the MT benchmarks represent well the difference between using an i3 ULV and this CPU.

looking at this

cb11-multi_0.png


povray_0.png


handbrake_0.png


am I wrong in thinking a single Ivy/Sandy Bridge core (with HT) at 3GHz with 2-3MB of L3 would be able to compete with the 1.5GHz quad-core Jaguar for MT? and for ST it would be on a totally different level. I wonder if it would be possible to get a single strong core at a 15W TDP.
 
Remember, Kabini is a complete SoC; your IVB isn't. At low power, ignoring this when comparing performance within a TDP skews the results.


As for a 3GHz IVB in 15 watts, it would all come down to voltage. Small bumps in voltage have a large effect on power consumption.
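To put a number on the voltage point: the usual first-order model for CMOS dynamic power is P ~ C * V^2 * f, so power grows with the square of voltage. A minimal sketch using that model; it ignores leakage, and the 10%/20% figures are made-up illustration values, not measurements of any chip.

```python
def relative_dynamic_power(voltage_ratio, frequency_ratio):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f (leakage ignored)."""
    return voltage_ratio ** 2 * frequency_ratio

# Hypothetical case: a 10% voltage bump is needed to reach a 20% higher clock.
bumped = relative_dynamic_power(voltage_ratio=1.10, frequency_ratio=1.20)
print(f"+10% voltage, +20% clock -> {bumped:.2f}x dynamic power")
```

Under those assumed numbers, a 20% clock gain costs about 45% more dynamic power.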
 
So what is the IPC for a single Jaguar core? I read it's less than 1!!
Also, for comparison purposes, I would like to know the IPC for a single Sandy Bridge Core i7 core and a Bulldozer core.
 
An IVB core in 15W at 3GHz should be quite doable. After all, Intel has 17W TDP parts out there which reach 3.3GHz with Turbo, though I don't know if they can hold that clock with one core at full load; maybe they can, but almost certainly the IGP must be idle to do it. And the clock goes down pretty rapidly with smaller TDPs: the 13W parts (granted, the chip was never meant to get that low) do not exceed 2.6GHz.
So if you have a multithreaded load you would really rather have 2 cores at 1.5GHz than 1 at 3GHz, even for IVB, as that will be much better for power consumption. Don't forget those Kabini power measurements show the 4 cores consuming something like 6W at full load (the GPU taking up the rest of the 15W TDP), whereas that single IVB core would probably draw roughly twice that for the same (multithreaded) performance.

So what is the IPC for a single Jaguar core? I read it's less than 1!!
Also, for comparison purposes, I would like to know the IPC for a single Sandy Bridge Core i7 core and a Bulldozer core.
The theoretical max sustained IPC is 2 for Jaguar, same as it was for Bobcat (it can decode/dispatch/retire 2 ops per cycle, after all). SNB/IVB would be 4, BD is also only 2 (per int core, though it can dispatch/retire more but decode is 4 shared by 2 int cores). OK, that's x86 IPC; the picture gets more complicated if you look at executed ops in the core (which is what's used past the decode stage).
Now the IPC you get in practice is something else entirely and will depend on a LOT more factors, though it will definitely be lower, and increasing it for a CPU implementation is HARD (the higher your theoretical IPC, the more trouble you have achieving any real-world IPC improvement; this is, after all, the reason lots of low-power cores are restricted to two-wide). It will also vary wildly depending on the code. Last time I did some profiling for some code I had trouble getting it to exceed 1.0 (on a K8 though, which could in theory sustain 3 but in practice is probably very close to Jaguar). Looking at the benchmarks I guess IVB has about 1.6 times higher IPC in the real world (for a single thread) compared to Jaguar, which given the two times higher theoretical performance is a very good result.
 
I really think they needed a turbo core feature in all models, not just the top end tablet chip. It really sounds like they have turbo sorted out in that one chip though, I guess tests will show for sure.
 
SNB/IVB would be 4, BD is also only 2 (per int core, though it can dispatch/retire more but decode is 4 shared by 2 int cores).
But I remember reading that the theoretical IPC of Core 2 (Conroe) is 4 too! Does that mean the theoretical maximum has not improved since then?

Also, did you count the supposed fused micro-ops? Or is that just a theoretical niche too, unattainable in practice?
 
So what is the IPC for a single Jaguar core? I read it's less than 1!!
Also, for comparison purposes, I would like to know the IPC for a single Sandy Bridge Core i7 core and a Bulldozer core.
I remember reading some IPC comparison article/benchmark years ago, but I can't find it anymore (it might have been from Realworldtech or Anandtech). If I remember correctly, in the general purpose integer test Bulldozer average IPC was 1.1, Sandy was 1.7, and Bobcat was 0.8. According to the Jaguar (Kabini A4-5000) benchmarks, 1.5 GHz Jaguar beats 1.6 GHz Bobcat by 22% (23.46% IPC increase). Extrapolating from the Bobcat IPC score, Jaguar average IPC should be very close to 1.0 in that general purpose integer test. However average IPC of each architecture might be completely different in another test case. If I remember correctly, Sandy's IPC in a mixed SIMD+integer code test was 2.9 (in the same benchmark article). But unfortunately I can't remember how the other chips performed in that test (and if I remember correctly they also used hyperthreading in that test to fill the Sandy Bridge core better = gain better IPC).
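The extrapolation from Bobcat to Jaguar can be sketched as a simple clock normalization. This is only an illustration with the rounded figures from this post (22% speedup, 1.5 vs 1.6 GHz, Bobcat IPC of 0.8 recalled from memory); the function name is made up. Note that plain normalization of the rounded 22% figure gives a somewhat larger per-clock gain than the 23.46% quoted, so that figure was presumably derived from different inputs. Either way the estimate lands close to 1.0.

```python
def per_clock_ratio(speedup, new_ghz, old_ghz):
    """Per-clock performance ratio implied by a measured speedup at different clocks."""
    return speedup * old_ghz / new_ghz

bobcat_ipc = 0.8   # recalled from memory in the post, general purpose integer test
ratio = per_clock_ratio(speedup=1.22, new_ghz=1.5, old_ghz=1.6)
jaguar_ipc_estimate = bobcat_ipc * ratio

print(f"Per-clock gain: {ratio - 1:.1%}")
print(f"Estimated Jaguar IPC: {jaguar_ipc_estimate:.2f}")
```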

An IPC of 1.0 is actually quite good if you compare Jaguar to the chips it's going to replace and compete against. In-order PPC CPUs (current gen consoles) have an average IPC of around 0.2, Atom has an average IPC of around 0.5 and Bobcat around 0.8. In recent benchmarks a 1.5 GHz Jaguar beat a 1.7 GHz Cortex A15, indicating it has higher IPC than the top of the line ARM CPU. I don't think the IPC is a problem for Jaguar.

Jaguar would badly need dynamic CPU clocking (turbo). Intel radically improved their dynamic clocking for Ivy Bridge, and thus the 17W parts can clock up to 3.0 GHz (single threaded tasks / boost burst performance). Haswell improved this further by shortening the idle<->alive transition time to around 1 ms. ARM of course has focused on dynamic clocking / turning off chip parts since day one (as mobile/integrated chips are their main business area).

One thing I do not understand in AMD's Kabini/Temash SoC configurations is the GPU. It has only 2 CUs. They could have included 4 CUs instead and clocked them at half the frequency, and had exactly the same performance, but at a lower TDP. Or even better... they could have created a dynamic GPU clocking system similar to Intel's, and had both lower power consumption in normal usage and much higher performance on demand. 17W Ivy Bridge parts have a 350 MHz nominal GPU frequency, and turbo up to 1200 MHz (roughly 3.4x boost). This is something AMD needs badly if they want to conquer the tablet/ultraportable market.
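The wide-and-slow argument for the GPU can be sketched roughly like this. It assumes throughput scales with CUs x clock and dynamic power with CUs x V^2 x f; the 0.85 voltage ratio at half clock is a made-up illustration value, and leakage and the extra die area are ignored.

```python
def gpu_config(cus, clock_ratio, voltage_ratio):
    """Return (relative throughput, relative dynamic power) vs. one CU at baseline."""
    throughput = cus * clock_ratio
    power = cus * voltage_ratio ** 2 * clock_ratio
    return throughput, power

narrow_fast = gpu_config(cus=2, clock_ratio=1.0, voltage_ratio=1.0)
# Hypothetical: halving the clock lets the voltage drop to 85% of baseline.
wide_slow = gpu_config(cus=4, clock_ratio=0.5, voltage_ratio=0.85)

print(f"2 CUs @ full clock: throughput {narrow_fast[0]:.2f}, power {narrow_fast[1]:.2f}")
print(f"4 CUs @ half clock: throughput {wide_slow[0]:.2f}, power {wide_slow[1]:.2f}")
```

With equal throughput, any voltage reduction at the lower clock shows up directly as lower dynamic power; how much depends entirely on how far the voltage can really drop.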
The perf/Watt improvement over Brazos is astounding.
Yeah, it surpasses even P4->Core2 :)
 
They could make some laptops with simply amazing battery life - whether anyone actually makes a quality AMD laptop is another matter. They'll probably instead see it as a good opportunity to save money on the battery and/or overvolt the thing for no reason. *sigh*
 