Future console CPUs: will they go back to OoOE, and other questions.

Well, I can tell you how 1 Xenon core compares to 1 P4 3GHz core. I wrote this little benchmark with some typical STL usage and the result is pretty shocking. Basically the P4 is TWICE as fast. The code generated for the PPC actually looks better than the x86 code, yet it fails to perform as fast.
Of course your mileage may vary.

That's worse than I'd expect; I always thought that Xenon got around 2/3rds the performance of a P4 at the same speed. Rough guesswork based on Xenon games (assuming 1 core used only) and their PC equivalents and the types of framerates achieved with both processors (so pretty much irrelevant on anything but a game-by-game basis, but it seems to hold up well across several games). Did you use 2 threads on the Xenon core?
 
Well, I can tell you how 1 Xenon core compares to 1 P4 3GHz core. I wrote this little benchmark with some typical STL usage and the result is pretty shocking. Basically the P4 is TWICE as fast. The code generated for the PPC actually looks better than the x86 code, yet it fails to perform as fast.
Of course your mileage may vary.
This is really shocking. This could potentially mean a 60% degradation.

Could you explain more, please? Is it a pathological case? FSB-dependent?

Could other developers say how bad the degradation is?
 
That's worse than I'd expect; I always thought that Xenon got around 2/3rds the performance of a P4 at the same speed. Rough guesswork based on Xenon games (assuming 1 core used only) and their PC equivalents and the types of framerates achieved with both processors (so pretty much irrelevant on anything but a game-by-game basis, but it seems to hold up well across several games). Did you use 2 threads on the Xenon core?

How heavily is that game code optimised for the respective architectures? Assuming equally, I guess that would put Xenon at around the level of an X2 3800+ in gaming code. That's actually better than I would have expected.
 
The problem is, when someone says to you "Hey, we will give you a 3.2GHz RISC", you expect more, and with higher expectations comes higher frustration. Developers probably had higher expectations.

Then a 4.5GHz Xenon CPU core is more like a ~3GHz one (both on a 65nm process).

But when you need SIMD and flops you will have a different situation :)

There's really no inherent benefit to using an alternative instruction set at this point, aside from maybe more GPRs. Even if you think the RISC concept still has something to offer (despite the fact that most RISC implementations no longer provide leading performance), PowerPC actually strays from many of those basic design tenets. Of course many console developers are probably experienced with more traditional RISC architectures like the R3000, R4300, and R5900.
 
To be honest I don't know what MS are providing, but IBM are relying on GCC for the PPE, which is essentially the same CPU. I don't think you can compile the vast quantities of stuff that make up a complete desktop operating system with GCC for an architecture like that and expect blanket decent performance without investing a lot of developer effort in optimization. Still - I admit this is arm waving on my part - I don't have much experience of GCC outside of x86. Also, RISC has been waiting for compilers to take the place of silicon for twenty years now, and both Transmeta's stuff and IA64 seem to argue that we haven't reached that point yet...

GCC on x86 seems pretty decent and I imagine that, like most open source projects, this is because it has a lot of development support on the x86 platform. On less common platforms, like SPARC, GCC performance is a good deal below the vendor supplied compilers.
 
This is really shocking. This could potentially mean a 60% degradation.

Could you explain more, please? Is it a pathological case? FSB-dependent?

Could other developers say how bad the degradation is?

Both versions were compiled with VS2005 maximum optimizations including LTCG. It's not a pathological case; it's more that the P4's bigger caches and OOOE seem to cope better with memory latencies. The benchmark was strictly single-threaded, nothing fancy, just typical usage of lists, vectors, maps, strings, shared pointers, allocations, etc.
 
There's really no inherent benefit to using an alternative instruction set at this point, aside from maybe more GPRs. Even if you think the RISC concept still has something to offer (despite the fact that most RISC implementations no longer provide leading performance), PowerPC actually strays from many of those basic design tenets. Of course many console developers are probably experienced with more traditional RISC architectures like the R3000, R4300, and R5900.

Honestly, I prefer MIPS over PPC, but if I could get something new I'd want one huge unified register file and a flexible, orthogonal SIMD instruction set. Just like the SPE.
PowerPC is showing its age already. Separate register files are a major pain and an obstacle to high performance.
 
There's really no inherent benefit to using an alternative instruction set at this point, aside from maybe more GPRs. Even if you think the RISC concept still has something to offer (despite the fact that most RISC implementations no longer provide leading performance), PowerPC actually strays from many of those basic design tenets. Of course many console developers are probably experienced with more traditional RISC architectures like the R3000, R4300, and R5900.
I have a feeling that the RISC concept still has something to offer, making the logic surrounding the execution units simpler and easier to design, and potentially faster.

The CISC x86-based CPUs have the advantage of a large-scale market which can channel or direct lots of design, fabrication, and software resources, hiding the architecture's disadvantages.

edited: also, the RISC concept should be re-evaluated against current needs and resources.
 
Both versions were compiled with VS2005 maximum optimizations including LTCG. It's not a pathological case; it's more that the P4's bigger caches and OOOE seem to cope better with memory latencies. The benchmark was strictly single-threaded, nothing fancy, just typical usage of lists, vectors, maps, strings, shared pointers, allocations, etc.
And did you have any specific indication of where this degradation came from? Was it a mixed benchmark, or a series of independent routines more like actual code?
 
This is really shocking. This could potentially mean a 60% degradation.

Could you explain more, please? Is it a pathological case? FSB-dependent?

Could other developers say how bad the degradation is?

How is it shocking? What did you expect from a 2-issue in-order core?
 
Honestly, I prefer MIPS over PPC, but if I could get something new I'd want one huge unified register file and a flexible, orthogonal SIMD instruction set. Just like the SPE.
PowerPC is showing its age already. Separate register files are a major pain and an obstacle to high performance.

I think MAJC might be what you're looking for, but then it didn't go very far.
 
Out-of-order loads won't help with pointer chasing, because you can't issue a load before its target address is calculated. When a chip is pointer chasing, the target address can't be calculated until the prior load has completed.

It is true in general, but I don't think it's the whole story. If data structures are organized for cache locality, then prefetching will often bring the pointer destinations into the cache. If the structures are allocated regularly enough, one can even use a form of loop induction to predict the pointer addresses to be fetched.

Combined with a programming language that ensures locality (e.g. heap-compacting generational garbage collection), it's a big win.
 
I'd expect performance more like a P4 2.4 Northwood from a 3.2GHz IOE PPC.

I think this needs further/deeper investigation.

Well, bear in mind that the Northwood has 2 double-pumped ALUs and runs at the same base clock speed as the Cell/Xenon, in addition to having OOOE (enabling it to do useful work during cache misses and better exploit ILP), and a better branch predictor.

Quite frankly, the reason we haven't seen these kind of in-order superscalar architectures since the Pentium (P5x) is because they're slow. Now, if you are going to throw a ton of simple cores on a chip with lots of threading to run code with inherently low ILP (i.e. like Niagara), then you are going to have something of a niche. But this is hardly a general case or even the common case.
 
Is the PPC really the exemplar of what we're discussing here, though? I think if you look at the present day, then basically what you want is an architecture like Cell, sans the PPU and with something more 'robust' in its place. ...and Cell itself is not some hypothetical 'what if' situation either; although most of the comparisons have brought it back to the PC, remember that this is a real architecture with real design wins outside of the console realm.

The Energy Department wouldn't be putting out for a 16,000 Cell/16,000 Opteron system if 32,000 Opterons would do the job just as well, after all...

So again, I think most of us can accept that the SPEs have definite real-world advantages - across a number of tasks - over much larger, more expensive, and more complex OOE cores. The weak spot remains the IOE Power core associated with the project, and a number of different design choices in this regard could have made for a more 'complete' package by bringing single-threaded GP performance up to acceptable standards.

There were compromises at the time that had to be made for power draw, complexity, die size, etc... (plus just plain IBM politics), but maybe future generations of the chip will tweak the 'central' core to bring some added functionality, since I believe the SPEs themselves have a fairly decent start and a future in their own right.
 
The Energy Department wouldn't be putting out for a 16,000 Cell/16,000 Opteron system if 32,000 Opterons would do the job just as well, after all...

The flip side is: why didn't they buy 32,000 Cells? (I realize you basically answered this, but I wanted to stick it out all obvious-like) ;)
 
This flip side is why didn't they buy 32,000 Cells?

Well right - and I think if Cell's single-threaded performance were up to par (aka something other than the PPU), maybe they would have.

EDIT: Ok, your own edit sort of confirms we're on the same path here thought-wise, but yeah... :cool:
 
Is the PPC really the exemplar of what we're discussing here, though? I think if you look at the present day, then basically what you want is an architecture like Cell, sans the PPU and with something more 'robust' in its place. ...and Cell itself is not some hypothetical 'what if' situation either; although most of the comparisons have brought it back to the PC, remember that this is a real architecture with real design wins outside of the console realm.

The Energy Department wouldn't be putting out for a 16,000 Cell/16,000 Opteron system if 32,000 Opterons would do the job just as well, after all...

So again, I think most of us can accept that the SPEs have definite real-world advantages - across a number of tasks - over much larger, more expensive, and more complex OOE cores. The weak spot remains the IOE Power core associated with the project, and a number of different design choices in this regard could have made for a more 'complete' package by bringing single-threaded GP performance up to acceptable standards.

There were compromises at the time that had to be made for power draw, complexity, die size, etc... (plus just plain IBM politics), but maybe future generations of the chip will tweak the 'central' core to bring some added functionality, since I believe the SPEs themselves have a fairly decent start and a future in their own right.


Frankly the SPEs are fine for what they are, but you wouldn't want one as your only processor.

Work out how many instructions you need to read an unaligned byte, or compare SPE program size to almost anything else for the same task.
 
How heavily is that game code optimised for the respective architectures? Assuming equally, I guess that would put Xenon at around the level of an X2 3800+ in gaming code. That's actually better than I would have expected.

No idea; comparing performance subjectively has a whole slew of other problems to worry about before you even focus on what the code looks like. Considering they're PC-console ports/cross-developments and early in the 360's life cycle, I doubt they're very well optimized for the 360. Does the 360 really stand to gain that much performance, besides expanding code to take advantage of all 3 cores + dual threads each?

It's not a pathological case; it's more that the P4's bigger caches and OOOE seem to cope better with memory latencies. The benchmark was strictly single-threaded, nothing fancy, just typical usage of lists, vectors, maps, strings, shared pointers, allocations, etc.

How do the 360's memory latencies compare to a PC's? About the same? Significantly worse? Better?
 
Frankly the SPEs are fine for what they are, but you wouldn't want one as your only processor.

Work out how many instructions you need to read an unaligned byte, or compare SPE program size to almost anything else for the same task.

And that's totally fair. I'm not saying you'd want SPEs alone per se, rather just that in the present Cell architecture, it's the PPU that's the weak link. If they had gone with... well Barbarian's example of a Conroe-based central core let's say... I mean who could not love that chip?
 