New AMD low power X86 core, enter the Jaguar

The first link to an overview of AMD's power optimizations is on this page. There's also an EE Times article linked there.
I haven't yet read the whole whitepaper, however, as that requires filling out additional personal data.
For some reason, the images on that page aren't loading for me now, but one was a graph of the number of active flops, showing clock gating efficiency increasing over the development period. A lot of existing blocks improved a little, but the new block corresponding to the L2 interface has a massively higher total. It improves quite a bit, but even afterwards it's still a big chunk of active flops.

http://calypto.com/en/blog/2013/02/04/rtl-clock-gating-analysis-cuts-power-by-20-in-amd-chip/
 
Thanks for the link. I don't feel like putting that much personal info in; a search for the same terms yielded the following eetimes article:

http://www.eetimes.com/design/eda-d...rocessor-core-with-RTL-clock-gating-analysis-

Try opening it in an incognito window if it complains. They have a system called PowerPro that they ran over the weekend to give designers suggestions for optimizing their blocks. I wonder what block 3 in the diagrams on page 3 is; maybe something related to the cache system? It seems to have low efficiency during halt.
 
The personal information on the Calypto website doesn't have to be real. In any case, interesting information.
 
Yeah I tried doing that with my spam account, and it didn't seem to work; ah well.

I wonder if AMD's basic 4-core + GPU Jaguar block pairs up into the ones MSFT and Sony are using for their consoles, and whether or not this stacks higher. It would be especially nice if PCs or tablets could leverage the nice bandwidth figures from the embedded GDDR5 configuration of PS4, or even the more modest ones of Durango, which still dwarf the PC APU bandwidths commonly seen today.

Edit: Looks like a more complete slide deck is available here:

http://extrahardware.cnews.cz/amd-na-issc-2013-hovorilo-o-architekture-jaguar-mame-kompletni-slajdy
 
Question is what kind of deal they made with Sony ... do they have the right to leverage that effort for PC APUs? I don't think Sony would like to see a laptop APU which could run an almost 1:1 port of PS4 games ... AMD might very well be forced to leave the PC side stuck with slow DRAM standards designed for expansion sockets.
 
A more exciting proposition might be an 8-core Jaguar for SeaMicro servers, imo.
 
It would be especially nice if PCs or tablets could leverage the nice bandwidth figures from the embedded GDDR5 configuration of PS4, or even the more modest ones of Durango, which still dwarf the PC APU bandwidths commonly seen today.

For tablets I don't think the PS4 method would work. Is there low power GDDR5?

For PCs it would be very interesting. Cue the rumors of soldered-on CPUs/APUs, which would make this a lot more feasible when combined with soldered-on memory.

Imagine the jump in supercomputing as well once you remove the PCIe bottleneck for GPGPU tasks.

Regards,
SB
 
Temash with ULV i3 level of performance

 
4C/4T Jaguar narrowly beating 2C/4T Sandy Bridge at the same clock speed in a highly parallel test isn't very shocking.
 
It is shocking to me to be honest because Bobcat was nowhere near it. This would be like Atom suddenly being on par with the i3 in multithreading too.

The E-350 scores 0.63 in Cinebench 11.5, and that's a 1.6 GHz dual core. I guess a quad Bobcat at the same 1.4 GHz clocks would score ~1.0?

Yeah, this is a huge increase in IPC...it has to be.
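
A back-of-the-envelope version of that estimate, as a minimal sketch: it assumes the Cinebench score scales linearly with clock speed, and `core_scaling` is a hypothetical efficiency factor for the added cores (1.0 would be perfect scaling), not a measured value.

```python
def estimate_score(base_score, base_clock_ghz, base_cores,
                   target_clock_ghz, target_cores, core_scaling=1.0):
    """Scale a multithreaded benchmark score to another clock and core count.

    Assumes the score is linear in clock speed. core_scaling (hypothetical)
    is the fraction of ideal speedup obtained from each added core.
    """
    per_core_per_ghz = base_score / (base_cores * base_clock_ghz)
    extra_cores = target_cores - base_cores
    effective_cores = base_cores + extra_cores * core_scaling
    return per_core_per_ghz * effective_cores * target_clock_ghz


# E-350 figures from the post: 0.63 points, dual core, 1.6 GHz.
ideal = estimate_score(0.63, 1.6, 2, 1.4, 4)          # perfect core scaling
derated = estimate_score(0.63, 1.6, 2, 1.4, 4, 0.8)   # assumed 80% efficiency on the added cores
print(f"ideal: {ideal:.2f}, derated: {derated:.2f}")
```

Under these assumptions, perfect scaling gives about 1.10, and the ~1.0 guess corresponds to roughly 80% efficiency on the two added cores.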
 
The CB results look good considering the TDP...
I think the lowest-performing Sandy Bridge part is the Celeron 847 (1.1 GHz, 17W), and it scores something like 0.42 for a single core; Jaguar does 0.35 using a lot less power, I think.
 
The E-350 scores 0.63 in Cinebench 11.5, and that's a 1.6 GHz dual core. I guess a quad Bobcat at the same 1.4 GHz clocks would score ~1.0?

Yeah, this is a huge increase in IPC...it has to be.
I think Cinebench 11.5 is largely dominated by SSE2 code. I'm not sure, though, whether it actually uses any packed instructions or just scalar ones. In the former case, a roughly 30% IPC improvement would really be on the low side of expectations; otherwise, that would be very good indeed.

The CB results look good considering the TDP...
I think the lowest-performing Sandy Bridge part is the Celeron 847 (1.1 GHz, 17W), and it scores something like 0.42 for a single core; Jaguar does 0.35 using a lot less power, I think.
I think that's not really a fair comparison (for power), since that Celeron is a power-deoptimized incarnation of Sandy Bridge (compared to the other ULV chips).

And don't forget, especially for single-threaded tasks, integer IPC is more important than FPU IPC, and you really can't expect that much improvement there (well, at least I wouldn't think AMD stated those 15% for nothing...).
 
It is shocking to me to be honest because Bobcat was nowhere near it. This would be like Atom suddenly being on par with the i3 in multithreading too.

But Bobcat's perf/MHz was way ahead of Atom's, I don't think it'd really be like that..

You have to consider that the boost you get from Sandy Bridge's HT in Cinebench 11.5 is barely over 20% (at least going from 4T to 8T, don't know about 2 to 4), not really a strong showing for HT. The benchmark probably favors the extra raw FP pipes you get from real cores. Also why BD and PD don't do amazingly well.

The E-350 scores 0.63 in Cinebench 11.5, and that's a 1.6 GHz dual core. I guess a quad Bobcat at the same 1.4 GHz clocks would score ~1.0?

Yeah, this is a huge increase in IPC...it has to be.

Not sure if your scaling ratio of under 60% going from 2 to 4 cores is really what should be expected, but I don't really know how well Cinebench 11.5 scales.

Sure, it gets a big boost here because of the 128-bit FPU. Probably a larger boost than it'd get in non-FP stuff.

And don't forget, especially for single-threaded tasks, integer IPC is more important than FPU IPC, and you really can't expect that much improvement there (well, at least I wouldn't think AMD stated those 15% for nothing...).

On that point, I believe that 15% number was an average over everything, which would include at least some SIMD FP heavy benches. That'd mean the average for scalar/integer-only would probably be less than 15%.
 
IPC of all threads is up 15%.

IPC of a single thread is up more than 15% because it gets the full 2 MB of L2 plus much more of the L2 predictor/prefetcher time.
 
I am glad to see that Jaguar actually performs close to the expectations I had. It matches (beats by a small margin) Sandy Bridge and Piledriver in multithreaded vectorized workloads (Cinebench), when all CPUs are running at similar TDP and clock range (below 17W, around 1.6 GHz clocks). This was an expected result (even though some didn't believe it could happen).

This also clearly explains why Sony chose Jaguar for their next console. Consoles have limited TDPs (especially for the CPU, as the GPU is more important), and Jaguar likely provided the best raw multithreaded vector math performance for the allocated TDP. For a console it's more important to have good (multithreaded) throughput than good single-threaded performance, since all games are specially coded for the closed hardware specs and will utilize all the CPU cores (and vector math pipelines) available.
FWIW, it looks like the 2-module ULV Trinity part (A8-4555M) has been released. Unlike the 1-module version (A6-4455M, released ages ago), it didn't quite make it to 17W, though; instead it's now a 19W part. Clocks are 1.6 GHz/2.4 GHz, so the turbo clock should be higher than Jaguar's, but I don't know how often it's actually able to clock up that much.
It's going to be interesting to see how the 19W 1.6 GHz quad-core (two-module) Piledriver stacks up against Jaguar (15W Kabini) in general-purpose code. According to the first benchmarks, Jaguar seems to be slightly ahead in vector throughput when all four cores are used (and the clocks are normalized). Factor in the clock difference: 1.65 GHz * 1.1 = 1.815 GHz for Jaguar (AMD slides say it will have 10% higher clocks compared to Bobcat) vs 1.6 GHz for Piledriver (when all cores are taxed the turbo will be off). I would estimate that the multithreaded (general-purpose) performance will be pretty close (because Jaguar has slightly higher clocks). In single-threaded code Piledriver will likely beat Jaguar handily thanks to the 2.4 GHz turbo (Jaguar has no turbo to match that clock increase). The module architecture will also help Piledriver in 1-2 thread scenarios, since each module will only run one thread and have exclusive access to all the shared resources (such as the L1 instruction cache and decode). Piledriver will likely win many application benchmarks, while Jaguar should be better in some games and CPU-heavy software... assuming the GPU performance is identical...

AMD hasn't yet spilled the beans about the GPUs they are going to put into the different Jaguars (Temash for tablets and Kabini for ultraportable laptops). I would personally love to have an ultraportable with a 1.84 TFLOPS GPU and 8 GB of fast GDDR5... The current AMD APUs are limited too much by bandwidth. 176 GB/s would definitely be a nice improvement over ~15 GB/s. That would even make Apple happy (they love fast GPUs in their products) :)
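
For anyone who wants to play with the numbers above, here is a minimal sketch of the clock and bandwidth arithmetic. All inputs are the figures quoted in this post (1.65 GHz base, 10% uplift, 1.6 GHz Piledriver with turbo off, 176 GB/s vs ~15 GB/s); the function and variable names are made up for illustration.

```python
def projected_clock(base_ghz, uplift):
    """Apply a fractional clock uplift (0.10 means +10%) to a base clock."""
    return base_ghz * (1.0 + uplift)


# Figures taken from the post; nothing here is measured.
jaguar_ghz = projected_clock(1.65, 0.10)     # Bobcat-era clock plus the 10% from AMD's slides
piledriver_ghz = 1.6                         # all cores loaded, turbo assumed off

clock_advantage = jaguar_ghz / piledriver_ghz - 1.0

gddr5_gb_s = 176.0                           # PS4-style GDDR5 figure
pc_apu_gb_s = 15.0                           # rough current PC APU figure
bandwidth_ratio = gddr5_gb_s / pc_apu_gb_s

print(f"Jaguar clock: {jaguar_ghz:.3f} GHz")
print(f"Clock advantage over Piledriver: {clock_advantage:.1%}")
print(f"Bandwidth ratio: {bandwidth_ratio:.1f}x")
```

That works out to about 1.815 GHz, a clock advantage of roughly 13%, and a bandwidth gap of nearly 12x, which is why the multithreaded comparison looks close while the bandwidth comparison does not.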
 