Future console CPUs: will they go back to OoOE, and other questions.

OK, I know that the Xenon CPU's PPE cores are in-order execution, and so is the PPE in CELL.

In-order CPUs are simpler than out-of-order execution CPUs. Transistors and complexity are saved, but there is a major price in performance.

Originally, the Xenon design team at MS wanted the CPU cores to be out-of-order execution, but as IBM got to work they had to make them in-order. It was disappointing for the team, from what I read in The Xbox 360 Uncloaked. Actually, here it is:

Another setback was that IBM had also decided that it couldn’t do out-of-order execution. This was a modern technique that enabled microprocessors to dispatch a number of tasks in parallel. A sequence of instructions was broken into parallel paths without regard to order so they could be executed quickly, and then put back into the proper sequence upon completion.
Instead, IBM had to make the cores execute with the simpler, but more primitive, in-order execution. Out-of-order consumed more space on the chip, potentially driving up the costs and raising the risks. When Microsoft’s Jeff Andrews went to Jon Thomason and told him the news, it was like a bombshell. One by one, many of the Mountain View group’s biggest technological dreams were falling by the wayside.
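To make the quote concrete, here is a small, purely illustrative C sketch (the function and variable names are made up): the two chains in the loop body don't depend on each other, so an out-of-order core can keep issuing the work on b[i] while a cache miss on a[i] is still outstanding and retire everything in program order afterwards, whereas an in-order core stalls at the first instruction that needs the missing data.

```c
#include <stddef.h>

/* Illustrative only: two independent chains of work in one loop body.
 * On an OoOE core, if the load of a[i] misses in cache, the instructions
 * that operate on b[i] can still issue and execute, and the results are
 * put back in program order at retirement. An in-order core must wait for
 * the a[i] load before it reaches the b[i] work that follows it. */
void scale_two_arrays(float *a, float *b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x = a[i] * k;   /* chain 1: depends only on a[i] */
        float y = b[i] * k;   /* chain 2: independent of chain 1 */
        a[i] = x + 1.0f;
        b[i] = y + 1.0f;
    }
}
```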


Okay, the CPU cores in the last generation: the R5900 variant in the Emotion Engine,
Gekko in the GameCube, and the Intel PIII/Celeron in the Xbox were all out-of-order execution CPUs, am I right? (BTW, what about the Dreamcast's SH-4?)

...which would make the Nintendo Wii the only console of this new generation to have an OoOE CPU (Broadway), if Gekko was OoOE.

OK, do you guys think that Xenon's successor and next-gen CELL processors will use OoOE cores or not?
 
This gen, Sony and MS could implement a big chip with several processors on it, and those processors could have either a lot of OoOE or a lot of execution units, but not both at once.

As semiconductor tech improves, I'm sure we'll indeed see both at once at some point in the future, quite possibly as soon as next gen. I'm also sure that we'll never see a gen of consoles where either Sony or MS drops execution units for more OoOE, though. That's old-fashioned these days, and won't lead the way to our glorious computing-revolution future! ;)
 
Doubt it. IMHO, for Sony it's pointing to a Cell 2 that significantly ups the LS of its SPUs. I read in some IBM articles that the algorithms they simulated would hit a sweet spot with 4 MB of LS. The PPE could get some OoOE, but I think the transistors will be used for more execution units or cache.
 
Not all OoOE is the same, though. Perhaps future console CPUs will have limited forms of it. Spending much die space on the feature is not necessarily smart for a console CPU.
 
Not all OoOE is the same, though. Perhaps future console CPUs will have limited forms of it. Spending much die space on the feature is not necessarily smart for a console CPU.
Perhaps to the level that Itanium has, but I don't know how far you really want to push this lest history repeat itself. On the PC, we've basically needed something like an 11x increase in transistor budget since the last of the in-order CPUs just to eke out around double the IPC. Granted, this is more a result of new IPC-boosting additions constantly snowballing on top of each other, but that's generally how it plays out. If you instead scaled up the number of cores, that's massively more bang for the buck on the hardware side.

What is still a thorn in that design path is how software will scale. The idea of being not necessarily "fast" but "high-throughput" is not the most straightforward problem to solve -- particularly for games, where there are linear dependencies and time spent is everything.
 
As long as you don't have a lot of legacy code that you want to increase the IPC for, I think going for in-order execution with more lean CPU cores, instead of a few fat OOO cores, will be the alternative of choice -- and consoles don't have that kind of legacy code.
 
Perhaps to the level that Itanium has, but I don't know how far you really want to push this lest history repeat itself. On the PC, we've basically needed something like an 11x increase in transistor budget since the last of the in-order CPUs just to eke out around double the IPC. Granted, this is more a result of new IPC-boosting additions constantly snowballing on top of each other, but that's generally how it plays out. If you instead scaled up the number of cores, that's massively more bang for the buck on the hardware side.

As transistor budgets continue to scale with smaller process geometries, I would expect cores to increase in complexity. Eventually, you reach a point of diminishing returns as you pursue increasingly complex methods of boosting IPC, but these in-order console CPUs (PPE/X360) haven't even come close to that yet.
 
As transistor budgets continue to scale with smaller process geometries, I would expect cores to increase in complexity. Eventually, you reach a point of diminishing returns as you pursue increasingly complex methods of boosting IPC, but these in-order console CPUs (PPE/X360) haven't even come close to that yet.

I agree with this. There will be a sweet spot of core complexity vs core count, and I don't think the current console CPUs have hit it because of the limited overall transistor budget they have.
 
As transistor budgets continue to scale with smaller process geometries, I would expect cores to increase in complexity. Eventually, you reach a point of diminishing returns as you pursue increasingly complex methods of boosting IPC, but these in-order console CPUs (PPE/X360) haven't even come close to that yet.
Sure. Which is kind of what I was getting at with the Itanium example in that it was intended to be completely in-order, but over time and reworks, it's gotten sprinklings of self-scheduling and OOO. I think there's still a lot of research to be done yet, but trial and error and field testing is about the only way you're going to find that optimal condition where scaling core complexity and scaling the number of cores have about the same impact per transistor.
 
What no one has mentioned is power consumption: adding OOO would send the power consumption of the PPE / Xenon cores spiralling upwards. I don't think having a 150 W CPU in a console is a terribly good idea.

Intel's Woodcrest runs at the same clock frequency, but the cores consume close to 40 W each, and that would be even higher on the 90 nm process Xenon is made on.

To save power, clock speed could be reduced, but that would remove most if not all of the boost given by OOO.

However, OOO isn't a magic bullet: while it boosts performance, it doesn't do so for all types of code, and in fact it reduces the performance of some due to the lower clock speed. OOO is better at control-type operations, or "branchy integer" code.

If you look at what games do most of the time, it's lots of repetitive calculations, and for those types of operations an in-order, high-clocked processor is better.
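To make the distinction concrete, here are two made-up C fragments, one of each kind: the tree walk is the "branchy integer" case where dynamic scheduling and branch prediction earn their keep, while the straight-line loop is the repetitive numeric work that a high-clocked in-order core (plus a decent compiler) handles well.

```c
#include <stddef.h>

struct node { int key; struct node *left, *right; };

/* "Branchy integer" code: pointer chasing and data-dependent branches,
 * where each step's latency and direction depend on the previous load.
 * This is the kind of code that benefits most from OOO and good
 * branch prediction. */
int tree_contains(const struct node *n, int key)
{
    while (n) {
        if (key == n->key)
            return 1;
        n = (key < n->key) ? n->left : n->right;
    }
    return 0;
}

/* Repetitive calculation typical of game loops: no branches in the body
 * and a predictable access pattern, so a compiler can schedule and unroll
 * it well for a high-clocked in-order core. */
void integrate(float *pos, const float *vel, float dt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pos[i] += vel[i] * dt;
}
```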

I do agree, though, that there is a middle ground which hasn't been found yet. IBM's next server chip, POWER6, may point in that direction, as it's reducing (but not removing) OOO capabilities in return for a higher clock speed (it's expected at 4-5 GHz), i.e. it gets the best of both worlds.

I'd like to see a Cell with a POWER6-like core replacing the PPE, and enhanced SPEs (better double precision and bigger LS). I don't think OOO will ever be added to the SPEs, as it would only reduce their performance.
 
I think the answer to this will depend much more on: how next gen performs in sales (especially Wii vs PS3/360); how SW, OoOE and other advancements progress (I remember seeing some HW that could help a lot with parallel SW); how much performance gets extracted in the top games; and the benefits of having much more raw power versus having a better dev environment...

Personally, I believe they will go the way of lower price, ease of development and extra features (e.g. EyeToy 2, speech recognition, video chat, but nothing that high-tech), although with a nice boost in power (but not a 35x-the-flops kind of jump).
 
There is a big difference between consoles and general purpose computers.

To achieve reasonable performance, in-order processors need optimising compilers to check dependencies and re-order code. By doing this in the compiler before execution (and why shouldn't you, if you can), you can get higher overall performance by spending the transistors saved on OOO hardware on more cores.
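As a purely illustrative sketch (made-up function, and written out by hand rather than left to the compiler), this is the kind of re-ordering a scheduling compiler performs for an in-order target: both loads are issued before either result is consumed, so the second load's issue slot helps cover the first load's latency -- ordering an OoOE core would otherwise discover for itself at runtime.

```c
/* Hypothetical example of static scheduling for an in-order core.
 * Naive order would be: load a[i], add it, load a[i+1], add it, which
 * makes the pipeline stall on every load. Here the two loads are issued
 * first and consumed afterwards, and two accumulators keep the adds
 * independent of each other. (Any odd trailing element is ignored for
 * brevity.) */
float sum_pairs(const float *a, int n)
{
    float sum0 = 0.0f, sum1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        float x0 = a[i];        /* issue load 0 */
        float x1 = a[i + 1];    /* issue load 1 while load 0 is in flight */
        sum0 += x0;             /* consume load 0 */
        sum1 += x1;             /* consume load 1 */
    }
    return sum0 + sum1;
}
```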

The problem with in-order is that the optimising compiler will only produce optimised code for one target CPU type, e.g. the Intel P4 but not the AMD Athlon, etc. Operating systems like Windows ship binary code compiled for i486, i586, etc. as well as today's AMD and Intel 32- and 64-bit chips, and have to run on several different processors. Therefore optimising compilers don't work well, because there is no single target for the compiler to optimise for. Hence OOO will always be required for mainstream Windows computers, and in-order will always produce poor performance there.

For consoles, on the other hand, where the code can be compiled and optimised for just one specific target processor, in-order execution cores may well give the predicted performance-per-transistor advantage. It may also work for Linux on the PS3, if compiled specifically for the PS3. With an OS like Linux, where the source code is available for recompilation, it is quite easy to do this.
 
I personally think that the multi-core and multi-chip era has started off well. If you look at what keeps current PCs busiest (lots of media encoding and decoding work), the Cell chip definitely is the way of the future. Many chip experts, I think, agree that this is the way to go, and the only reason desktop chips haven't already gone there is legacy software.

Very quickly, however, I think we will see legacy software make way for virtualisation, which has made enormous strides lately. Virtual servers are all the rage, the technology has become very mature and widely used, and soon it will be used on the desktop to run legacy software efficiently on multi-core/multi-chip systems. OoOE will not be an important feature in the next generation of consoles, I'll wager.
 
Sure. Which is kind of what I was getting at with the Itanium example in that it was intended to be completely in-order, but over time and reworks, it's gotten sprinklings of self-scheduling and OOO. I think there's still a lot of research to be done yet, but trial and error and field testing is about the only way you're going to find that optimal condition where scaling core complexity and scaling the number of cores have about the same impact per transistor.

Niagara is a good example of this. The architecture remains very much focused on throughput, but Niagara-2 has substantial per-core improvements to boost IPC.
 
Intel's Woodcrest runs at the same clock frequency, but the cores consume close to 40 W each, and that would be even higher on the 90 nm process Xenon is made on.

Which has zero significance; power per MHz is 100% pointless to compare.

In power/performance, Woodcrest wins hands down.

To save power, clock speed could be reduced, but that would remove most if not all of the boost given by OOO.

I disagree. OOO saves power. To get anywhere near the performance of an OOO core, an in-order core would have to:
1. Be wider (or faster).
2. Spend significant resources (power) to make critical data-dependency latencies smaller, like L1 load-to-use latency.
3. Increase the size of on-die caches, because inlining and loop unrolling will bloat the code (see the unrolling sketch below).
If in-order CPUs had any kind of power/performance edge over OOO CPUs we'd see laptops with in-order CPUs in them.

However, OOO isn't a magic bullet: while it boosts performance, it doesn't do so for all types of code, and in fact it reduces the performance of some due to the lower clock speed. OOO is better at control-type operations, or "branchy integer" code.

Clock frequency is more a function of how deeply a CPU is pipelined than anything else. The P4, Athlons and POWER5 all run (clock) faster than Itanium 2 and UltraSPARC.

And stating that "OOO is better at control-type operations or 'branchy integer' code" is also incorrect. OOO is used to let the CPU schedule instructions around data dependencies, not control dependencies. Of course, a state-of-the-art CPU (with OOO) is more likely to have a state-of-the-art branch predictor and thereby be better at resolving branches correctly.

I do agree, though, that there is a middle ground which hasn't been found yet. IBM's next server chip, POWER6, may point in that direction, as it's reducing (but not removing) OOO capabilities in return for a higher clock speed (it's expected at 4-5 GHz), i.e. it gets the best of both worlds.
Nothing indicates that POWER6 is dumping OOO; they are going for narrower instruction fetch + decode + retire. But that kind of makes sense: a whole bunch of server workloads have an IPC (instructions per cycle) throughput below 1, and almost all are below 2, so it makes little sense to be able to sustain 4 or more (like POWER4/5).

EDIT: The only thing an in-order CPU has going for it is reduced design complexity and the inherent time-to-market advantage that may give.

Cheers
 
The problem with in-order is that the optimising compiler will only produce optimised code for one target CPU type, e.g. the Intel P4 but not the AMD Athlon, etc. Operating systems like Windows ship binary code compiled for i486, i586, etc. as well as today's AMD and Intel 32- and 64-bit chips, and have to run on several different processors. Therefore optimising compilers don't work well, because there is no single target for the compiler to optimise for. Hence OOO will always be required for mainstream Windows computers, and in-order will always produce poor performance there.

Though I'd generally agree with you about mandatory OOO for Wintel desktops and the like (i.e. CPU segments virtually devoid of precise compilation targets), I'm not sure you aren't giving too much credit to order-optimising compilers and in-order CPUs.

The problem is the dynamic nature of pipeline/unit stalls at runtime. Whereas the compiler can do its best to discover data (in)dependencies at compile time, an OOO CPU can do way more - it can keep dynamically re-scheduling ops and doing work at some stations while other stations are blocked/stalled due to events virtually unforeseeable at compile time.
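A hypothetical illustration (made-up function): with an indirect, data-dependent access pattern, the compiler cannot know which loads will hit or miss, so any static schedule is a guess, while an OOO core simply keeps issuing the later, independent loads while whichever one happens to miss is outstanding.

```c
#include <stddef.h>

/* Illustrative only: a gather through an index table. Whether table[idx[i]]
 * is in cache depends on the data, not on anything visible at compile time,
 * so the compiler cannot reliably hoist the "slow" loads early. An OOO core
 * can run ahead and issue the loads for later iterations while a missing
 * one is still in flight. */
float gather_sum(const float *table, const int *idx, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += table[idx[i]];   /* load latency is data-dependent */
    return sum;
}
```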
 
darkblu said:
whereas the compiler can do its best to discover data (in)dependencies at compile time
Let's be honest though - compiler schedulers aren't exactly state of the art at this point. All too often they need manual labour to prod them in the right direction.

As for dynamic scheduling - profile-based optimizers could often help there (not to say they are a solution for everything).
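For what it's worth, a minimal sketch of that flow using GCC's real -fprofile-generate / -fprofile-use options (the file name, function and training run are made up for the example):

```c
/* hot.c -- toy example of a branch whose bias only shows up at runtime.
 *
 * Hypothetical profile-guided build with GCC:
 *   gcc -O2 -fprofile-generate hot.c -o hot
 *   ./hot                 # training run; writes *.gcda profile data
 *   gcc -O2 -fprofile-use hot.c -o hot
 *
 * With the profile, the compiler knows how the branch usually resolves and
 * can lay out and schedule the common path accordingly. */
#include <stdio.h>

static int classify(int x)
{
    if (x < 0)      /* the profile records how often this is taken */
        return -1;
    return x & 1;
}

int main(void)
{
    long sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += classify(i);
    printf("%ld\n", sum);
    return 0;
}
```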
 
Let's be honest though - compiler schedulers aren't exactly state of the art at this point. All too often they need manual labour to prod them in the right direction.

Yes, you are quite right, unfortunately.

As for dynamic scheduling - profile-based optimizers could often help there (not to say they are a solution for everything).

Doh, I have to admit I had totally forgotten about those. It must have something to do with the fact that I use such stuff once in a blue moon. But I guess on a closed platform they could prove fairly useful.
 
If in-order CPUs had any kind of power/performance edge over OOO CPUs we'd see laptops with in-order CPUs in them.
I don't think that's true. All PCs need legacy support and have to run any old code, and in-order cores can crawl on unoptimized code. If the laptop were a closed system like a console, with its own software and devs who had to target in-order, then it'd be a choice worth considering.
EDIT: The only thing an in-order CPU has going for it is reduced design complexity and the inherent time-to-market advantage that may give.
And smaller cores, meaning you can cram more onto a chip, and simpler cores, meaning you can clock them higher. Why else would both Sony and MS go with in-order multicore in their desire for high-performance processors, if a similarly sized, far-easier-to-develop-for OOO core would give better performance? In-order means more execution units, which in turn provides greater peak performance potential, shifting the concern of efficient instruction usage onto the developer. The real-world gains are evident in the likes of Mercury's medical imaging work. The existing OOO cores aren't a patch on the slimmed-down multicores of Cell, and if the SPEs were OOO, and sized up because of that, there wouldn't be as many of them, resulting in lower performance.

As consoles provide a closed-box environment with no need for legacy support of existing 'unoptimized for in-order' programs, there's no reason to switch to OOO on future consoles except to give devs an easier time at the cost of your hardware's peak performance. As I expect the multi-core paradigm to become better understood and better supported by development tools over time, this generation is probably going to be the hardest for in-order development; future consoles will have it easier, making the use of in-order less of a restriction. Especially if courses are set up to give programmers in-order coverage, and universities don't just stick to knocking out C on x86 boxes and consider every student who manages that to be suitable for console development.
 