Thoughts on next gen consoles CPU: 8x1.6Ghz Jaguar cores

gongo



I find it woefully underpowered...and a potential bottleneck..? Is there even a turbo mode in this thing, 'cos 1.6GHz is W-T-F? How is the memory controller for it? How did AMD convince both Sony and MS to take it...? Flashback to IBM convincing them the in-order, castrated PPE was a good idea...

I don't know how it compares in a closed box...so here is a bench graph from AT comparing its predecessor to the venerable P4
http://www.anandtech.com/bench/Prod....31.32.33.34.35.36.37.38.39.40.41.42.43.45.46

Mumblings from the developer community are that the Jaguar CPU is a side/downgrade from the last-gen console CPUs...

Further thoughts?
 
thoughts,

1. 4x the number of cores, 2x the throughput per core of your P4 comparison.
2. How is 8 threads at ~1-1.x IPC @ 1.6GHz "sideways" from 6 threads at ~0.2 IPC?
3. Where are these mumblings?
4. All the things the core does many times better than Xenon/PPE: branch prediction, moves between registers, prefetchers, etc.
5. There is a balance of cost/power/performance; within a given TDP (say 25-30 watts), what would you do for a CPU?
6. A Jaguar CU doesn't have a memory controller; that is an SoC-level unit.


this thread is just as crap as 95% of the new threads popping up in this subsection.
 
The new consoles are most likely to use a heterogeneous processor architecture, which AMD calls "HSA". What's special about HSA is that different kinds of cores, in this case x86 Jaguar cores and GCN stream processors, can be utilized to work together as a combined processor. AMD calls such a processor an "APU".

x86 Jaguar cores are pretty smart, but you only have very few of them. On the other hand, GCN stream processors are extremely dumb, but you have literally hundreds of them in your APU. So you want your smart Jaguars to do runtime-intensive tasks and your GCN stream processors to do parallelizable tasks. This video explains it in a very easy-to-understand way: HSA explained

It's a little bit like the Cell in the PS3, which is also a heterogeneous processor. Heavily abstracted, one could say that instead of the PPE you will have the 8 Jaguars, and instead of the eight SPEs you will have a couple of GCN stream processors.
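The division of labour described above can be sketched in a few lines. This is only an illustration of the work split, not real HSA code: the branchy, latency-sensitive function stands in for a Jaguar core, and NumPy's vectorized arithmetic stands in for hundreds of stream processors applying the same operation across a big array. All names here (`cpu_task`, `gpu_task`) are made up for the example.

```python
# Hedged sketch of the heterogeneous split: serial, branchy logic on the
# "smart" CPU cores; wide data-parallel work on the "dumb but many" stream
# processors (NumPy vectorization is a stand-in for the GPU side here).
import numpy as np

def cpu_task(state):
    # Latency-sensitive, branchy game logic: suited to a Jaguar-style core.
    if state["hp"] <= 0:
        return "respawn"
    return "fight" if state["enemy_near"] else "explore"

def gpu_task(positions, velocities, dt):
    # One simple operation applied to thousands of elements at once:
    # suited to hundreds of stream processors working in lockstep.
    return positions + velocities * dt

state = {"hp": 42, "enemy_near": True}
positions = np.zeros((10000, 3))
velocities = np.ones((10000, 3))

decision = cpu_task(state)                              # serial path
positions = gpu_task(positions, velocities, 1.0 / 60)   # parallel path
```

The point of HSA is that both halves share the same memory, so handing `positions` back and forth is cheap compared to a discrete GPU.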
 
Now now...let's not go into the name-calling bits...the purpose of this thread is to invite experienced console developers to share their views on the new CPUs....we all know what AMD has said about HSA and whatnot....reading off their spec sheet is nice and all...but the perspective here is narrowed down to specifics...
 
I find it woefully underpowered...and a potential bottleneck..? Is there even a turbo mode in this thing, 'cos 1.6GHz is W-T-F? How is the memory controller for it? How did AMD convince both Sony and MS to take it...? Flashback to IBM convincing them the in-order, castrated PPE was a good idea...
If you wanted to know what people think, you would do something like:

So I don't really understand the choice of Jaguar as a console CPU; from what I can see, performance per clock seems on the low side and it isn't designed to clock high. The large number of cores could be useful, but how likely are they to be utilised, and how much extra developer effort is there in getting good performance out of 8 weaker, lower-clocked threads than, say, 4 more powerful, higher-clocked threads?

Post something like that and you might get the response you're pretending you're after.
 
I agree 1.6GHz seems a little low (was expecting ~2GHz), but I don't know what other choice you were expecting them to make.

Their options were basically:

X64: Jaguar

ARM: Cortex-A15

I thought MS might go with ARM & possibly even design their own core like Apple, but that didn't happen.

I suppose they could have waited for Cortex-A57 or Denver, but that would have pushed the launch even further back.
 
W-T-F....someone seems cranky today..it could just be me being exposed to faster consumer-oriented CPUs today...but now now...however you put it, let's stay on subject...

Are there better alternatives..? How about Piledriver/Vishera from AMD...? That was the early rumoured PS4 CPU..an Intel hex-core was rumoured to be in the 720 at one point...but in the end both went with stock low-power Jaguar cores....no hint of special CPU sauce, sadly...

Another question, how many and what devices are powered by Bobcat today?
 
Are there better alternatives..? How about Piledriver/Vishera from AMD...?

How would you address the added 100W heat output in a small closed box if the CPU was switched to Piledriver? Take half of the GPU CUs away and bitch about the weak GPU? :devilish:
 
thoughts,

1. 4x the number of cores, 2x the throughput per core of your P4 comparison.

Maybe I'm misinterpreting your meaning here but Jaguar does not have 2x the throughput per core of Bobcat. It has an optimistic 15% improvement going off AMD's own slides.

Obviously, when you add in that 15% and account for the fact that there are 4x as many cores in the consoles, the comparison with the P4 doesn't look so great for the P4: the best case scenario based on those benchmarks (assuming linear scaling, which won't be the case, but whatever) is ~4.5x faster than a single-core P4 at 3.6GHz.

That in itself doesn't really come across as great but a more interesting comparison IMO is this one with the A10-5800K:

http://www.anandtech.com/bench/Product/328?vs=675

In most cases even 4x the cores and a 15% improvement on top wouldn't quite be enough to match a pair of Piledriver modules (albeit at 3.8GHz). And obviously the single-thread performance vs Piledriver is pretty dire, at around 1/3rd.

That certainly raises interesting questions around the decision to use Jaguar rather than the upcoming Steamroller cores that will feature in Kaveri.

A high-end Kaveri in the PC space will feature 2 Steamroller modules which, while comparable to the console CPUs overall, will still be noticeably faster in multithreaded scenarios and vastly faster in single-threaded code. But then consider that the console CPUs double the standard PC Jaguar configurations, so if the same would have held true for Steamroller....
 
Another question, how many and what devices are powered by Bobcat today?

About 40 million devices: HTPCs, webtops, low-end desktops, etc. The point is Jaguar is a significant leap above Bobcat. An 8-core Jaguar should have about 4x the performance of Xenon in what was by far its strongest suit, let alone its weakest one. If code were only vectors then you might be disappointed, but it isn't, and no one has done a good job of comparing integer performance between the two.

Maybe I'm misinterpreting your meaning here but Jaguar does not have 2x the throughput per core of Bobcat. It has an optimistic 15% improvement going off AMD's own slides.

64-bit FP ALUs vs 128-bit, so really it's 2.3 times the throughput.
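One plausible reading of that 2.3x figure (an assumption on my part, not something the poster spells out) is that it combines the doubled SIMD width (64-bit Bobcat FP datapath to 128-bit in Jaguar, a 2x factor) with AMD's claimed ~15% IPC uplift. The arithmetic checks out:

```python
# Sanity check of the quoted "2.3x" figure, assuming it multiplies the
# doubled FP datapath width by the ~15% IPC uplift AMD claims for Jaguar.
bobcat_fp_width_bits = 64
jaguar_fp_width_bits = 128
ipc_uplift = 1.15  # AMD's own Jaguar-over-Bobcat figure

width_factor = jaguar_fp_width_bits / bobcat_fp_width_bits  # 2.0
combined = width_factor * ipc_uplift                        # ~2.3
print(combined)
```

Whether those two factors may legitimately be multiplied together is exactly what the rest of the thread argues about.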
 
How would you adress the added 100W heat output in a small closed box if CPU was switched to piledriver? Take half of the GPU CUs away and bitch about weak GPU? :devilish:

Where do you get 100W from? The entire Trinity APU comes in at around that, so there's no way the delta between 8 Jaguar cores and 2 Piledriver modules is 100W.
 
Where do you get 100W from? The entire Trinity APU comes in at around that, so there's no way the delta between 8 Jaguar cores and 2 Piledriver modules is 100W.

Two Piledriver modules have a total of 4 128-bit FMA units. 8 Jaguar cores have 8 128-bit ADDs and 8 128-bit MULs. Piledriver would have double the clock, but unless all your code is FMA, FP throughput would likely be lower. That is my guess as to why both Sony and MS picked Jaguar over PD/SR.
 
64-bit FP ALUs vs 128-bit, so really it's 2.3 times the throughput.

In SIMD capability, yes, but that's obviously not a particularly key measure of overall CPU performance, especially considering the HSA nature of the consoles. IPC is 15% higher, as AMD themselves say.

And I don't think you can combine the IPC and SIMD improvements in that way. The reality is that doubling the width of the SIMD units isn't going to directly translate into twice the SIMD performance anyway, since other elements of the core which haven't doubled will also factor into the real-world output.
 
I was thinking about the clock and turbo too.
To have a system with predictable performance you can't let it adjust the frequency on its own based on thermal headroom, but maybe they can provide an API to let the developer run one or more cores at max frequency only in a particular section of the code.
And btw, if at least one core is dedicated to the OS, during game sections it could stay at 800MHz or an even lower power state, leaving more room to increase the other cores' frequency.
 
Two Piledriver modules have a total of 4 128-bit FMA units. 8 Jaguar cores have 8 128-bit ADDs and 8 128-bit MULs. Piledriver would have double the clock, but unless all your code is FMA, FP throughput would likely be lower. That is my guess as to why both Sony and MS picked Jaguar over PD/SR.

At 3.8GHz (A10-5800K speed) a pair of Piledriver modules has a peak of 121.6 GFLOPS compared to the 102.4 GFLOPS of the 8 Jaguar cores in the consoles running at 1.6GHz.
 
In SIMD capability, yes, but that's obviously not a particularly key measure of overall CPU performance, especially considering the HSA nature of the consoles. IPC is 15% higher, as AMD themselves say.

And I don't think you can combine the IPC and SIMD improvements in that way. The reality is that doubling the width of the SIMD units isn't going to directly translate into twice the SIMD performance anyway, since other elements of the core which haven't doubled will also factor into the real-world output.

So what needs to double that hasn't, given that at Hot Chips they said they doubled everything they needed to in order to double throughput? Also, unless games are very FPU-heavy then Xenon and Cell would have been useless, so I'm going to guess they are FP-heavy :LOL: . Why would the 15% IPC improvement only apply to int code but not FPU?

At 3.8GHz (A10-5800K speed) a pair of Piledriver modules has a peak of 121.6 GFLOPS compared to the 102.4 GFLOPS of the 8 Jaguar cores in the consoles running at 1.6GHz.

Yes, which is pretty much what I said, but go look at benchmarks that compare FMA vs just AVX/SSE4: nothing is seeing a performance doubling.
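There is a simple structural reason FMA rarely delivers its paper 2x: it only fuses a multiply with a dependent add, so the gain depends on how much of the instruction stream pairs up that way. A toy op-counting sketch (my own illustration, not a benchmark) makes the point:

```python
# Toy instruction counting: FMA fuses one multiply with one add, so the
# reduction in instruction count depends on how well muls and adds pair up.

def ops_without_fma(muls, adds):
    return muls + adds

def ops_with_fma(muls, adds):
    fused = min(muls, adds)                    # each FMA retires one mul + one add
    return fused + (muls - fused) + (adds - fused)

n = 1024
# Dot product: n muls + n adds collapse into n FMAs -> the ideal 2x.
print(ops_with_fma(n, n), "vs", ops_without_fma(n, n))   # 1024 vs 2048
# Sum reduction: n adds, no muls -> FMA buys nothing at all.
print(ops_with_fma(0, n), "vs", ops_without_fma(0, n))   # 1024 vs 1024
```

Real code sits somewhere between those two extremes, which is consistent with the benchmarks mentioned above showing well under a doubling.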
 
Also, unless games are very FPU-heavy then Xenon and Cell would have been useless, so I'm going to guess they are FP-heavy.
Games use whatever resources you have available. If the system architecture is FP strong, you use FP-heavy code. If it's branchy and memory strong, you use branchy and memory hungry code.
 
gongo said:
How about Piledriver/Vishera from AMD...?
The consoles are working with a finite power and heat budget. Every additional watt a Piledriver CPU would have used would have been one not available for the GPU or bandwidth.

Jaguar supports SSE4.2 and AVX, so it is about as feature-complete as any x64 chip available today.
 
So what needs to double that hasn't, given that at Hot Chips they said they doubled everything they needed to in order to double throughput?

Fair enough, let's say it can sustain double the peak SIMD throughput for the sake of argument.

Also, unless games are very FPU-heavy then Xenon and Cell would have been useless, so I'm going to guess they are FP-heavy :LOL: . Why would the 15% IPC improvement only apply to int code but not FPU?

Cell and Xenon were fairly useless. Okay, that's not true, they were decent SIMD performers, but if you pay any attention to developer posts on these boards you will know that is far from the main or only driver of CPU performance. You simply cannot just look at how many flops a CPU can push out and conclude its overall power from that. In fact that would be a fairly ludicrous thing to do. If that were the case, Haswell would be jumping off the blocks with twice the general performance of Ivy Bridge regardless of the benchmark, which we know is obviously not the case.

The 15% IPC will come from tweaks to the overall pipeline, changes in the cache implementation and, of course, from the doubling in width of the SIMD units themselves. To say you first double the SIMD units and then add an extra 15% for improved IPC is basically counting the same thing twice.

Yes, which is pretty much what I said, but go look at benchmarks that compare FMA vs just AVX/SSE4: nothing is seeing a performance doubling.

Probably because those benchmarks aren't using FMA instructions. In consoles, however, if that's what's available then that's what developers will use where possible.

Yes, in a console Jaguar will have nearly twice the real-world SIMD throughput of Bobcat, but that's not the only way to measure CPU performance, and the fact is, in the benchmarks originally posted, the doubled SIMD performance wouldn't have made much difference to the overall comparison. You wouldn't have been seeing double Bobcat performance; you'd have been seeing 15% (at best) more performance.
 
To say you first double the SIMD units and then add an extra 15% for improved IPC is basically counting the same thing twice.

No it isn't; again, watch the Hot Chips presentation. IPC comes from a more aggressive front end (additional L2 predictor/fetch, more aggressive core prefetch/predictor), improved scheduling and a big improvement in OoO load and store capabilities. They even go as far as giving the overall IPC improvement for each area.



Probably because those benchmarks aren't using FMA instructions. In consoles however if that's what's available then that's what developers will use where possible.
No, I'm talking about benchmarks which compare Bulldozer AVX vs FMA-compiled code. Of course, in an SR console devs would take every chance to write code that could be FMA'd, but really, how often is that going to be feasible?


Yes, in a console, Jaguar will have near twice the real world SIMD throughput of Bobcat
Yes it will, and devs will try to use every last drop of it.

the doubled SIMD performance wouldn't have made much difference in the overall comparison. You wouldn't have been seeing double Bobcat performance, you'd have been seeing 15% (at best) more performance.
And yet games developed for a Jaguar console won't be "normal" applications.


Edit: the other thing that was said is that Jaguar would get quite a bit more than a 15% IPC increase on single-threaded workloads, as a single core can use all the L2 and the L2 predictor/prefetcher would in effect be dedicated.
 