Thoughts on next-gen console CPUs: 8x 1.6GHz Jaguar cores

That sounds a bit optimistic to me (assuming a dual-core Sandybridge at that).

Taken across the whole range of benchmarks I think Trinity isn't far behind a 2C/4T Sandybridge, but it does admittedly fall behind in some areas and many games. Richland boosts clocks nicely while improving thermals, and the 65W model should stack up well against the very fastest 100W Trinity and 65W 2C/4T Sandybridges (and bear in mind that the AMD GPU is probably contributing a much bigger chunk of the TDP at standard clocks).

Richland looks to be a fantastic HTPC and budget desktop chip - better-than-PS360 gaming in a tiny 65W box with two sticks of common-or-garden RAM. If AMD shipped processors in the volume that Intel did, they could probably bin for much faster 65W processors too.

Poor AMD.
 

Yeah Richland does look great. I didn't realise AMD were that close to Intel at the moment but good on them if that's the case.

I'm actually quite excited about AMD APUs from a high-end gaming perspective if their intentions of using the GPU as a dedicated GPGPU processor come to fruition. Certainly, given the game-altering GPGPU capabilities of the new consoles, that may become an essential feature for PC gaming to remain competitive.
 

Well ... some of their products are close to where Intel was two years ago ... but the FX line isn't feeling the love. There's no Vishera refresh in sight, and for whatever reason Vishera isn't using the resonant clock mesh stuff that seems to have been good for Trinity.

If you look at IVB then the perf/watt starts to look disappointing again, even for Richland, although for common uses like laptop web browsing it will probably still hold up well in battery life.

The one area where AMD can really take Intel on is the gap between Atom and IVB, which is exactly where Jaguar is heading. The Xbox 360 had a power-guzzling CPU, and that's largely why the 360 still needs a pretty beefy heatsink 7 years on. Jaguar seems like a very friendly CPU for whatever long-term plans MSony have.


I wonder if there's another possible outcome for PC - the possible birth of gaming APUs? Given what Richland can do on 65W with a 128-bit bus, I think a 140W chip with 256-bit DDR4 on 22nm could really provide an amazing experience. It would also be friendly to small form factor PCs as the heat would be located in one spot. It probably wouldn't satisfy the PC hardware enthusiast, but it would fill the gap between mainstream APU and enthusiast discrete GPU while allowing all that fast HSA GPGPU gubbins.
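As a rough sanity check on that idea, the bandwidth arithmetic looks like the sketch below. The DDR3-1866 and DDR4-2133 transfer rates are my own assumptions for illustration, not anything announced:

```c
/* Back-of-envelope bandwidth arithmetic for the hypothetical gaming APU.
   DDR3-1866 and DDR4-2133 are assumed transfer rates, nothing confirmed. */
#include <stdio.h>

int main(void) {
    /* GB/s = transfer rate (MT/s) * bus width (bytes) / 1000 */
    double apu_128bit_ddr3 = 1866.0 * 16.0 / 1000.0; /* today's Richland-style setup */
    double apu_256bit_ddr4 = 2133.0 * 32.0 / 1000.0; /* the speculated gaming APU    */

    printf("128-bit DDR3-1866: %.1f GB/s\n", apu_128bit_ddr3); /* ~29.9 GB/s */
    printf("256-bit DDR4-2133: %.1f GB/s\n", apu_256bit_ddr4); /* ~68.3 GB/s */
    return 0;
}
```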
 
But pair up Trinity with a discrete 7970 and then benchmark that against a 7970 running with a top-end Ivybridge at reasonable resolutions and you'll probably find yourself CPU-limited.
What is a reasonable resolution, sub-1080p? And I guess several people have said already that the API overhead is usually significantly lower on closed console systems, enabling quite a few more draw calls with the same computing resources. And even if you are right, the devs are not stupid. If there is GPU power to spare (that is also what CPU-limited means), they will crank up the effects, the AA, whatever. Typically nobody cares about how much higher than 60 fps you could run on consoles.
But Piledriver's FPU is twice as wide and runs at a much faster clockspeed, so it balances out. This is what ExtremeTech say about the FPU in Jaguar compared with Piledriver:

http://www.extremetech.com/gaming/142163-amds-next-gen-bobcat-apu-could-win-big-in-notebooks-and-tablets-if-it-launches-on-time
Well, ExtremeTech is simply wrong on this. In fact, if you run solely FP SIMD instructions on all cores, the effective width of Piledriver is lower than Jaguar's: a BD or PD core can only execute a single 128-bit floating point instruction per cycle, while Jaguar can pull off two per cycle under these circumstances.
And what they write about the decoding is also only half true. While Jaguar can indeed decode only two instructions per cycle and core, PD's 4 instructions are per module. Each core will get 4 instructions only every other cycle. That means on average PD is also stuck with 2 instructions per cycle per core.
There is a set of circumstances where PD can pull ahead: the modules are only half loaded (just a single core is active), only one of the cores uses FPU instructions heavily, or there is heavy use of loads (BD/PD has twice the load bandwidth to the L1 cache; store bandwidth is the same).

But outside of that, even FMA brings BD/PD just to parity with Jaguar in a pure FP-throughput-per-clock scenario (what may be attainable is an SMT-like speedup of 20% or so through better usage of the available resources). PD gets 16 flops per module, i.e. 8 flops per core peak with FMA, but only 8 flops per module or 4 flops per core with separate MULs and ADDs (though flexible in all relative abundances). Jaguar gets 8 flops per core with MULs/ADDs (a 50:50 mix) and dips to 4 flops per core for solely MULs or ADDs. In any case, throughput per clock is basically never lower on Jaguar than on PD, save for the exceptions mentioned above.

Of course one could argue that PD clocks higher. But what is left of that if you have to fit 8 PD cores in 25W? Or do you really get 4 PD cores in 25W to clock more than twice as high as Jaguar? What does it cost in die size in comparison? Can you port PD easily to the 28nm process of your choice? How does this affect the clocks and power consumption?
 
Hi,

I've got off my lurker's stool again as this thread seems to be missing what IMO (as a console developer) is the real point about CPU performance going forward. Yes, we got stung a little bit with current consoles having no out-of-order execution, LHS stalls etc., but now we've optimised and have codebases which are pretty tight in the main; at least the usual heavy-lifting parts of our engines are. If it came down to a straight FLOPS race on a single core, Jaguar just isn't running that fast compared to a 360 core, and it might just lose. My prediction is our existing tightly optimised engine code won't derive nearly as much benefit from clever cores as non-optimised code / running things like scripting languages will. On balance this is good news for developers.

The big problem however is not how many FLOPS you can throw around or instruction decodes per cycle, it's how you keep all the cores fed. It was pretty easy to saturate the L2 fetch on Xenon, making it difficult to max out all three cores. I'm far more interested in effective speculative L2 prefetch, the maximum number of in-flight cache fetches and total bandwidth.

Achievable bandwidth with all cores busy is absolutely key.
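For readers less familiar with the problem, here's a minimal sketch of helping keep a core fed via software prefetch, using the GCC/Clang __builtin_prefetch intrinsic. The lookahead distance is an arbitrary illustrative value; in practice you tune it against fetch latency and the number of in-flight cache fetches the hardware can sustain:

```c
#include <stddef.h>

/* Sum an array while hinting the cache hierarchy to fetch ahead of use.
   PREFETCH_AHEAD is illustrative; tune it to the core's memory latency. */
#define PREFETCH_AHEAD 256  /* floats, i.e. 16 cache lines of 64 bytes */

float sum_with_prefetch(const float *data, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* args: address, rw (0 = read), locality (0 = streaming data) */
        __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 0);
        sum += data[i];
    }
    return sum;
}
```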
 
Hi... I am quite happy with this gen's CPUs because, for the first time in history, every major console will have an out-of-order CPU.

In my opinion, console power is relative; these 8 years of IPC improvements alone must do plenty to nullify any frequency disadvantage compared to previous-generation consoles on the CPU side, never mind that they are OoOE this time around and use more *real*, actual cores.

The big advantage IMHO of previous-gen CPUs is that, working on closed hardware, there is more than one way to achieve the performance you desire.

But AMD's OoOE engine is a big advantage these days, although you can also make a pretty respectable in-order CPU, of course.
 
OoOE nets you no advantage if your code is highly optimised and running in-order as a result. With both consoles having the same CPU architecture, middleware engines can invest in optimisation for benefits across all consoles, so the value of OoOE may even be less this gen than it would have been last. This all depends on the code base being used, though. I'm not as optimistic as GraphicsCodeMonkey that the overall codebase this gen is highly optimised, and I think plenty of games will see significant performance advantages from more forgiving CPUs.
 
IMHO the real deal with these CPUs is that at last x86-based vector units will be pushed to their knees. And of course, the same with x64 code for the first time. (Wasn't the next Battlefield going to be native 64-bit code?)
 
It's certainly true that 90% of the codebase in a typical game will benefit to a greater or lesser extent from OOE. The areas of the code which we've optimised for in-order execution, SIMD etc. are however the areas where the CPU spends the majority of its time: rendering, physics, animation, pathing, particles etc.

I'm in no way against OOE units, however in my hardware wishlist it's low down compared to an achievable bandwidth sufficient to keep all cores busy. The better the core is at tearing through instructions, the better the bandwidth needs to be; there's no point in having amazing cores data-starved half the time.

x64 is a double-edged sword: pointers will take up even more valuable cache line space than they do already. Of course we're all trying to use indices instead of pointers already, right? ;)
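To illustrate the indices-versus-pointers point (the node layout here is a made-up example, not from any actual engine): on x64 every pointer costs 8 bytes of cache line, while a 32-bit index into a pool costs 4, so link-heavy structures can halve their cache footprint:

```c
#include <stdint.h>
#include <stdio.h>

/* Pointer-based links: each costs 8 bytes on x64. */
struct node_ptr {
    struct node_ptr *parent;
    struct node_ptr *first_child;
    struct node_ptr *next_sibling;
};  /* 24 bytes of links per node */

/* The same links as 32-bit indices into a node pool. */
struct node_idx {
    uint32_t parent;
    uint32_t first_child;
    uint32_t next_sibling;
};  /* 12 bytes, regardless of pointer width */

int main(void) {
    printf("pointer links: %zu bytes, index links: %zu bytes\n",
           sizeof(struct node_ptr), sizeof(struct node_idx));
    return 0;
}
```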
 
OoOE nets you no advantage if your code is highly optimised and running in-order as a result.

Out-of-order execution brings benefits no in-order can hope to achieve: higher tolerance of data dependency latency, lower effective memory latency and better I-cache behaviour, amongst other things. On top of that, the 360 CPU and PS3 PPU are really crappy in-order cores.

Cheers
 
OoOE nets you no advantage if your code is highly optimised and running in-order as a result. With both consoles having the same CPU architecture, middleware engines can invest in optimisation for benefits across all consoles, so the value of OoOE may even be less this gen than it would have been last. This all depends on the code base being used, though. I'm not as optimistic as GraphicsCodeMonkey that the overall codebase this gen is highly optimised, and I think plenty of games will see significant performance advantages from more forgiving CPUs.

I think the idea now is that the hardware has enough power and flexibility on the GPU side for the type of workload that benefits from in-order optimisations, so the CPU will be optimised for tasks that benefit from out-of-order ...

Some types of tasks, though, apparently always benefit from out-of-order. If I remember correctly, sebbbi and/or nAo, among others, posted examples where OoOE performed 50% better, especially on a dual-threaded CPU core. Wasn't one of the examples zip (de)compression? Can't remember.
 
Lossless data decompression will always perform much better on cores with OOE due to the nature of the task; anything which has lots of if/then/else in it will perform an awful lot better.

We've spent a lot of time in the PS3/360 generation working out how to write things without branches.
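For anyone wondering what that looks like in practice, here's the classic flavour of branch-free rewrite (a minimal sketch, not code from any actual engine): an if/else select replaced with mask arithmetic, so there's nothing for a weak branch predictor to mispredict.

```c
#include <stdint.h>

/* Branchy version: a mispredict on unpredictable data costs dearly on
   cores like Xenon with a high mispredict penalty. */
int32_t min_branchy(int32_t x, int32_t y) {
    if (x < y) return x;
    return y;
}

/* Branchless version: (x < y) evaluates to 0 or 1, so -(x < y) is a mask
   of all-zeros or all-ones, selecting x or y with pure arithmetic. */
int32_t min_branchless(int32_t x, int32_t y) {
    int32_t mask = -(int32_t)(x < y);
    return (x & mask) | (y & ~mask);
}
```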
 

People seem to use the in-order nature of this gen's processors as the reason for all of their problems. OoO doesn't really directly help with branches, and the big reasons these processors suck with branches are that they have weak branch prediction, a high mispredict penalty, and a huge fetch bubble even on successfully predicted taken branches.

Not that Jaguar has these problems either, but you can get in-order processors that do much better in these cases as well. On the flip side, there are out-of-order processors that have some of these problems (though not nearly as badly as Xenon/Cell PPE do).
 
What is a reasonable resolution, sub-1080p?
I'd say anything up to and including 1920x1200. Beyond that can be considered "enthusiast" resolution IMO.

And I guess several people have said already that the API overhead is usually significantly lower on closed console systems, enabling quite a few more draw calls with the same computing resources.

True, it's obviously not all about draw calls, but I take on board what you're saying there - i.e. in a console the ratio of a particular CPU to GPU will be a little more in favour of the CPU than those same components in a PC, thus for a "balanced" system you don't necessarily need as much CPU power as you would in a PC.

And even if you are right, the devs are not stupid. If there is GPU power to spare (that is also what CPU-limited means), they will crank up the effects, the AA, whatever. Typically nobody cares about how much higher than 60 fps you could run on consoles.

I completely agree. I've never really thought much of the balanced system concept anyway, to be honest. If you have an 'excess' of CPU power in a console then the CPU will be given more work to do, like helping out with graphics or additional physics etc., and the same for the GPU. Even on the PC the concept doesn't hold water, since you can pile on image quality, resolution, 3D etc.

But outside of that, even FMA brings BD/PD just to parity with Jaguar in a pure FP-throughput-per-clock scenario (what may be attainable is an SMT-like speedup of 20% or so through better usage of the available resources). PD gets 16 flops per module, i.e. 8 flops per core peak with FMA, but only 8 flops per module or 4 flops per core with separate MULs and ADDs (though flexible in all relative abundances). Jaguar gets 8 flops per core with MULs/ADDs (a 50:50 mix) and dips to 4 flops per core for solely MULs or ADDs. In any case, throughput per clock is basically never lower on Jaguar than on PD, save for the exceptions mentioned above.

Thanks for the improved understanding. So assuming 2 PD/SR modules running at 3.2GHz and 8 Jaguar cores running at 1.6GHz, you effectively have the following extreme scenarios:

All ADD or All MUL code: Even
All FMADD code: Even
A perfect 50-50 split between ADD and MUL: Jaguar has twice the performance
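A quick sanity check of those three scenarios (a minimal sketch plugging Gipsel's per-clock figures from above into peak-GFLOPS arithmetic):

```c
#include <stdio.h>

/* Peak GFLOPS = flops per cycle * units * GHz, using the per-clock figures
   from Gipsel's post above. */
int main(void) {
    const double pd_modules = 2.0, pd_ghz = 3.2;  /* 2 PD/SR modules @ 3.2GHz */
    const double jag_cores  = 8.0, jag_ghz = 1.6; /* 8 Jaguar cores @ 1.6GHz  */

    /* all ADD or all MUL: PD 8 flops/module, Jaguar 4 flops/core */
    printf("all ADD/MUL: PD %5.1f vs Jaguar %5.1f\n",
           8 * pd_modules * pd_ghz, 4 * jag_cores * jag_ghz);   /* 51.2 vs 51.2   */

    /* all FMADD: PD 16 flops/module; Jaguar issues the MUL and ADD
       separately but still sustains 8 flops/core on the resulting mix */
    printf("all FMADD  : PD %5.1f vs Jaguar %5.1f\n",
           16 * pd_modules * pd_ghz, 8 * jag_cores * jag_ghz);  /* 102.4 vs 102.4 */

    /* 50:50 ADD/MUL: PD stays at 8 flops/module, Jaguar hits 8 flops/core */
    printf("50:50 mix  : PD %5.1f vs Jaguar %5.1f\n",
           8 * pd_modules * pd_ghz, 8 * jag_cores * jag_ghz);   /* 51.2 vs 102.4  */
    return 0;
}
```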

The reality is of course going to be a mix of the above. Starting from the extreme of the Jaguar-based CPU being twice as fast, adding FMADDs into the codebase will close the gap, then a lack of balance between ADDs and MULs in the rest of the code will further close the gap. Now consider that PD in Richland is running at 4.1GHz, so we could assume Steamroller, which would have been the console equivalent, to run at a similar speed (given that the expected total power draw of Kaveri at this speed with 8 CUs is still only 100W, that seems reasonable). That's a 28% speed increase over the baseline position of 3.2GHz.

When you add it all up it looks like a 2-module Steamroller could have been fairly comparable in SIMD throughput while still being faster in scalar code and much faster in any kind of single-threaded code.

Then consider that the console configuration of Jaguar doubles the PC APU configuration (4 -> 8 cores); if they had done the same for Steamroller we'd have been looking at 4 modules rather than 2. And there just wouldn't have been a comparison in that case, even if those 4 modules were running at a much slower clock speed than the desktop variants to save power (say 3.2GHz).

Of course one could argue that PD clocks higher. But what is left of that if you have to fit 8 PD cores in 25W? Or do you really get 4 PD cores in 25W to clock more than twice as high as Jaguar? What does it cost in die size in comparison? Can you port PD easily to the 28nm process of your choice? How does this affect the clocks and power consumption?

This is obviously the key decision point - could you get 2 or 4 PD/SR modules into the APU within an acceptable power and die-size envelope? I honestly don't know much about the relative die size or power draw of each core type, other than to assume PD/SR would be much larger and hotter. Although as I mentioned above, 2 modules fit nicely on a 32nm APU at 4.1GHz with 6 CUs running at 844MHz within 100W, so there certainly seems to be plenty of power budget there for an additional 12 CUs all running at 800MHz on a 28nm APU. Adding another 3 SR modules might be pushing things a bit though.

Note that Steamroller is the 28nm version of Piledriver, so we should really be talking in SR terms rather than PD.
 
People seem to use the in-order nature of this gen's processors as the reason for all of their problems. OoO doesn't really directly help with branches, and the big reasons these processors suck with branches are that they have weak branch prediction, a high mispredict penalty, and a huge fetch bubble even on successfully predicted taken branches.

Not that Jaguar has these problems either, but you can get in-order processors that do much better in these cases as well. On the flip side, there are out-of-order processors that have some of these problems (though not nearly as badly as Xenon/Cell PPE do).

I think people's initial experience with the 360 CPU + PS3 PPE was that the in-orderness really hurt. On 360, however, this has been replaced by the problem that if you use all the cores it gets really easy to saturate the data cache, so it's data fetch that's the problem rather than the core itself. If a developer is only using one core they probably still bemoan the lack of OOE. Similarly, on the PS3 the lack of OOE doesn't hurt that much once you've started using the SPUs and there's heavy contention for the main bus.

I definitely do look forward to having proper OOE in the next generation of consoles, but it's going to help people with single-threaded code that hasn't been optimised for in-order execution more than those who already have optimised multithreaded codebases.
 
Now that we have some developers in here, would you see AVX2 being added to the Jaguar core for either console as value added, or are other things much more important in your mind?
 
I'm pretty sure that the Durango CPU at least will have customisations and the cores won't be vanilla Jaguar.
As bkilian pointed out, they got IBM to add VMX128 to Xenon so it would not be out of the ordinary to see similar customisations in Durango.

Whether that's AVX2 or something else who knows.

But the possibility of significant customisations means we'll likely have to wait for more CPU details to judge its merits.
 
I think people's initial experience with the 360 CPU + PS3 PPE was that the in-orderness really hurt. On 360, however, this has been replaced by the problem that if you use all the cores it gets really easy to saturate the data cache, so it's data fetch that's the problem rather than the core itself. If a developer is only using one core they probably still bemoan the lack of OOE. Similarly, on the PS3 the lack of OOE doesn't hurt that much once you've started using the SPUs and there's heavy contention for the main bus.

I definitely do look forward to having proper OOE in the next generation of consoles, but it's going to help people with single-threaded code that hasn't been optimised for in-order execution more than those who already have optimised multithreaded codebases.

I think what you really mean to say is that people had certain expectations of performance that were not met, and they blamed it entirely on the processor being in-order. I don't think anyone simulated a PS3 or Xbox 360 with some level of OoO added and determined that it solved all their problems. Sorry, but you haven't done anything to convince me with a technical argument here. Of course I'm not saying OoO doesn't help - the level of improvement is proportional to exactly how much and what kind of reordering it can do; it's not a binary feature - but the Cell PPE has several other problems that would still grossly hurt performance even if it had aggressive OoO.
 
I'm pretty sure that the Durango CPU at least will have customisations and the cores won't be vanilla Jaguar.
As bkilian pointed out, they got IBM to add VMX128 to Xenon so it would not be out of the ordinary to see similar customisations in Durango.

Whether that's AVX2 or something else who knows.

But the possibility of significant customisations means we'll likely have to wait for more CPU details to judge its merits.
That was several years ago, and it was IBM, not AMD. What does AVX2 add which could be of interest for games? 256-bit integer instructions? Some bit manipulation or vector shifts? Don't think so. Gather support is very unlikely, as is FMA, as the whole load/store architecture is probably too weak to get any tangible performance benefit in most cases, so one can save the effort. 256-bit SIMD units are completely out of the question (that would imply changing core parts of the design).

AMD has Jaguar just taking its first steps. I really doubt they had much time to fiddle with changes for MS or Sony. If there are customizations beyond getting an 8-core version to run, they will be small. The customization is done at the SoC level, not in the cores themselves.
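For reference, this is what AVX2 gather looks like on the PC side - one instruction loading eight floats from eight arbitrary offsets, replacing eight scalar loads. A sketch using the desktop intrinsics; per the above, nothing like this is confirmed for the console Jaguars:

```c
#include <immintrin.h>
#include <stdint.h>

/* Gather eight floats from table at eight 32-bit indices in one go.
   scale = 4 because the elements are 4-byte floats. */
void gather8(const float *table, const int32_t *indices, float *out) {
    __m256i idx = _mm256_loadu_si256((const __m256i *)indices);
    __m256  v   = _mm256_i32gather_ps(table, idx, 4);
    _mm256_storeu_ps(out, v);
}
```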
 

Sounds reasonable.
 