Future console CPUs: will they go back to OoOE, and other questions.

OK, I know that the Xenon CPU's PPE cores are in-order execution, and so is the PPE in CELL.

In-order CPUs are simpler than out-of-order execution CPUs. Transistors and complexity are saved, but there is a major price in performance.

Originally, the Xenon design team at MS wanted the CPU cores to be out-of-order execution, but as IBM got to work they had to make them in-order. It was disappointing for the team, from what I read in The Xbox 360 Uncloaked. Actually, here it is:

Another setback was that IBM had also decided that it couldn’t do out-of-order execution. This was a modern technique that enabled microprocessors to dispatch a number of tasks in parallel. A sequence of instructions was broken into parallel paths without regard to order so they could be executed quickly, and then put back into the proper sequence upon completion.
Instead, IBM had to make the cores execute with the simpler, but more primitive, in-order execution. Out-of-order consumed more space on the chip, potentially driving up the costs and raising the risks. When Microsoft’s Jeff Andrews went to Jon Thomason and told him the news, it was like a bombshell. One by one, many of the Mountain View group’s biggest technological dreams were falling by the wayside.
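To make the quote concrete, here is a small, purely illustrative C sketch (the function and variable names are made up): the two chains in the loop body don't depend on each other, so an out-of-order core can keep issuing the work on b[i] while a cache miss on a[i] is still outstanding and retire everything in program order afterwards, whereas an in-order core stalls at the first instruction that needs the missing data.

```c
#include <stddef.h>

/* Illustrative only: two independent chains of work in one loop body.
 * On an OoOE core, if the load of a[i] misses in cache, the instructions
 * that operate on b[i] can still issue and execute, and the results are
 * put back in program order at retirement. An in-order core must wait for
 * the a[i] load before it reaches the b[i] work that follows it. */
void scale_two_arrays(float *a, float *b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x = a[i] * k;   /* chain 1: depends only on a[i] */
        float y = b[i] * k;   /* chain 2: independent of chain 1 */
        a[i] = x + 1.0f;
        b[i] = y + 1.0f;
    }
}
```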


Okay, the CPU cores in the last generation: the R5900 variant in the Emotion Engine,
Gekko in the GameCube, and the Intel PIII/Celeron in the Xbox were all out-of-order execution CPUs, am I right? (BTW, what about the Dreamcast's SH-4?)

...which would make the Nintendo Wii the only console of this new generation to have an OoOE CPU (Broadway), if Gekko was OoOE.

OK, do you guys think that Xenon's successor and next-gen CELL processors will use OoOE cores or not?
 
This gen, Sony and MS could implement a big chip with several processors on it, and those processors could have either a lot of OoOE or a lot of execution units, but not both at once.

As semiconductor tech improves, I'm sure we'll indeed see both at once at some point in the future, quite possibly as soon as next gen. I'm also sure that we'll never see a gen of consoles where either Sony or MS drops execution units for more OoOE, though. That's old-fashioned these days, and won't lead the way to our glorious computing-revolution future! ;)
 
Doubt it. IMHO, for Sony it's pointing to a Cell 2 that significantly ups the LS of its SPUs. I read in some IBM articles that the algorithms they simulated would hit a sweet spot with 4 MB of LS. The PPE could get some OoOE, but I think the transistors will be used for more execution units or cache.
 
Not all OoOE is the same, though. Perhaps future console CPUs will have limited forms of it. Spending much die space on the feature is not necessarily smart for a console CPU.
 
Not all OoOE is the same, though. Perhaps future console CPUs will have limited forms of it. Spending much die space on the feature is not necessarily smart for a console CPU.
Perhaps to the level that Itanium has, but I don't know how far you really want to push this lest history repeat itself. On the PC, we've basically needed something like an 11x increase in transistor budget since the last of the in-order CPUs just to eke out around double the IPC. Granted, this is more a result of new IPC-boosting additions constantly snowballing on top of each other, but that's generally how it plays out. If you instead scaled up the number of cores, that's massively more bang for the buck on the hardware side.

What is still a thorn in that design path is how software will scale. The idea of being not necessarily "fast" but "high-throughput" is not the most straightforward problem to solve -- particularly for games, where there are linear dependencies and time spent is everything.
 
As long as you don't have a lot of legacy code that you want to increase the IPC for, I think going for in-order execution with more lean CPU cores, instead of a few fat OOO cores, will be the alternative of choice -- and consoles don't have that kind of legacy code.
 
Perhaps to the level that Itanium has, but I don't know how far you really want to push this lest history repeat itself. On the PC, we've basically needed something like an 11x increase in transistor budget since the last of the in-order CPUs just to eke out around double the IPC. Granted, this is more a result of new IPC-boosting additions constantly snowballing on top of each other, but that's generally how it plays out. If you instead scaled up the number of cores, that's massively more bang for the buck on the hardware side.

As transistor budgets continue to scale with smaller process geometries, I would expect cores to increase in complexity. Eventually, you reach a point of diminishing returns as you pursue increasingly complex methods of boosting IPC, but these in-order console CPUs (PPE/X360) haven't even come close to that yet.
 
As transistor budgets continue to scale with smaller process geometries, I would expect cores to increase in complexity. Eventually, you reach a point of diminishing returns as you pursue increasingly complex methods of boosting IPC, but these in-order console CPUs (PPE/X360) haven't even come close to that yet.

I agree with this. There will be a sweet spot of core complexity vs core count, and I don't think the current console CPUs have hit it because of the limited overall transistor budget they have.
 
As transistor budgets continue to scale with smaller process geometries, I would expect cores to increase in complexity. Eventually, you reach a point of diminishing returns as you pursue increasingly complex methods of boosting IPC, but these in-order console CPUs (PPE/X360) haven't even come close to that yet.
Sure. Which is kind of what I was getting at with the Itanium example in that it was intended to be completely in-order, but over time and reworks, it's gotten sprinklings of self-scheduling and OOO. I think there's still a lot of research to be done yet, but trial and error and field testing is about the only way you're going to find that optimal condition where scaling core complexity and scaling the number of cores have about the same impact per transistor.
 
What no one has mentioned is power consumption: adding OOO would send the power consumption of the PPE / Xenon cores spiralling upwards. I don't think having a 150 W CPU in a console is a terribly good idea.

Intel's Woodcrest runs at the same clock frequency, but the cores consume close to 40 W each, and that would be even higher on the 90 nm process Xenon is made on.

To save power, clock speed could be reduced, but that would remove most if not all of the boost given by OOO.

However, OOO isn't a magic bullet: while it boosts performance, it doesn't do so for all types of code, and in fact it reduces the performance of some due to the lower clock speed. OOO is better at control-type operations, or "branchy integer" code.

If you look at what games do most of the time, it's lots of repetitive calculations, and for those types of operations an in-order, high-clocked processor is better.
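To make the distinction concrete, here are two made-up C fragments, one of each kind: the tree walk is the "branchy integer" case where dynamic scheduling and branch prediction earn their keep, while the straight-line loop is the repetitive numeric work that a high-clocked in-order core (plus a decent compiler) handles well.

```c
#include <stddef.h>

struct node { int key; struct node *left, *right; };

/* "Branchy integer" code: pointer chasing and data-dependent branches,
 * where each step's latency and direction depend on the previous load.
 * This is the kind of code that benefits most from OOO and good
 * branch prediction. */
int tree_contains(const struct node *n, int key)
{
    while (n) {
        if (key == n->key)
            return 1;
        n = (key < n->key) ? n->left : n->right;
    }
    return 0;
}

/* Repetitive calculation typical of game loops: no branches in the body
 * and a predictable access pattern, so a compiler can schedule and unroll
 * it well for a high-clocked in-order core. */
void integrate(float *pos, const float *vel, float dt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pos[i] += vel[i] * dt;
}
```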

I do agree, though, that there is a middle ground which hasn't been found yet. IBM's next server chip, POWER6, may point in that direction, as it's reducing (but not removing) OOO capabilities in return for a higher clock speed (it's expected at 4-5 GHz), i.e. it gets the best of both worlds.

I'd like to see a Cell with a POWER6-like core replacing the PPE, and enhanced SPEs (better double precision and bigger LS). I don't think OOO will ever be added to the SPEs, as it would only reduce their performance.
 
I think the answer to this will depend much more on: how next gen performs in sales (especially Wii vs PS3/360); how SW, OoOE and other advancements progress (I remember seeing some HW that could help a lot with parallel SW); how much performance gets extracted in the top games; and the benefits of having much more raw power versus having a better dev environment...

Personally, I believe they will go the way of lower price, ease of development and extra features (e.g. EyeToy 2, speech recognition, video chat, but nothing that high-tech), although with a nice boost in power (but not a 35x-the-flops kind of jump).
 
There is a big difference between consoles and general purpose computers.

To achieve reasonable performance, in-order processors need optimising compilers to check dependencies and re-order code. By doing this in the compiler before execution (and why shouldn't you, if you can), you can get higher overall performance by spending the transistors saved on OOO hardware on more cores.
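As a purely illustrative sketch (made-up function, and written out by hand rather than left to the compiler), this is the kind of re-ordering a scheduling compiler performs for an in-order target: both loads are issued before either result is consumed, so the second load's issue slot helps cover the first load's latency -- ordering an OoOE core would otherwise discover for itself at runtime.

```c
/* Hypothetical example of static scheduling for an in-order core.
 * Naive order would be: load a[i], add it, load a[i+1], add it, which
 * makes the pipeline stall on every load. Here the two loads are issued
 * first and consumed afterwards, and two accumulators keep the adds
 * independent of each other. (Any odd trailing element is ignored for
 * brevity.) */
float sum_pairs(const float *a, int n)
{
    float sum0 = 0.0f, sum1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        float x0 = a[i];        /* issue load 0 */
        float x1 = a[i + 1];    /* issue load 1 while load 0 is in flight */
        sum0 += x0;             /* consume load 0 */
        sum1 += x1;             /* consume load 1 */
    }
    return sum0 + sum1;
}
```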

The problem with in-order is that the optimising compiler will only produce optimised code for one target CPU type, e.g. the Intel P4 but not the AMD Athlon, etc. Operating systems like Windows ship binary code compiled for i486, i586, etc. as well as today's AMD and Intel 32- and 64-bit chips, and have to run on several different processors. Therefore optimising compilers don't work well, because there is no single target for the compiler to optimise for. Hence OOO will always be required for mainstream Windows computers, and in-order will always produce poor performance there.

For consoles, on the other hand, where the code can be compiled and optimised for just one specific target processor, in-order execution cores may well give the predicted performance-per-transistor advantage. It may also work for Linux on the PS3, if compiled specifically for the PS3. With an OS like Linux, where the source code is available for recompilation, it is quite easy to do this.
 
I personally think that the multi-core and multi-chip era has started off well. If you look at what keeps current PCs busiest (lots of media encoding and decoding work), the Cell chip definitely is the way of the future. Many chip experts, I think, agree that this is the way to go, and the only reason desktop chips haven't already gone there is legacy software.

Very quickly, however, I think we will see legacy software make way for virtualisation, which has made enormous strides lately. Virtual servers are all the rage, the technology has become very mature and widely used, and soon it will be used on the desktop to run legacy software efficiently on multi-core/multi-chip systems. OoOE will not be an important feature in the next generation of consoles, I'll wager.
 
Sure. Which is kind of what I was getting at with the Itanium example in that it was intended to be completely in-order, but over time and reworks, it's gotten sprinklings of self-scheduling and OOO. I think there's still a lot of research to be done yet, but trial and error and field testing is about the only way you're going to find that optimal condition where scaling core complexity and scaling the number of cores have about the same impact per transistor.

Niagara is a good example of this. The architecture remains very much focused on throughput, but Niagara-2 has substantial per-core improvements to boost IPC.
 
Intel's Woodcrest runs at the same clock frequency, but the cores consume close to 40 W each, and that would be even higher on the 90 nm process Xenon is made on.

Which has zero significance; power per MHz is 100% pointless to compare.

In power/performance, Woodcrest wins hands down.

To save power, clock speed could be reduced, but that would remove most if not all of the boost given by OOO.

I disagree. OOO saves power. To get anywhere near the performance of an OOO core, an in-order core would have to:
1. Be wider (or faster).
2. Spend significant resources (power) to make critical data-dependency latencies smaller, like L1 load-to-use latency.
3. Increase the size of on-die caches, because inlining and loop unrolling will bloat the code (see the unrolling sketch below).
If in-order CPUs had any kind of power/performance edge over OOO CPUs we'd see laptops with in-order CPUs in them.

However, OOO isn't a magic bullet: while it boosts performance, it doesn't do so for all types of code, and in fact it reduces the performance of some due to the lower clock speed. OOO is better at control-type operations, or "branchy integer" code.

Clock frequency is more a function of how deeply a CPU is pipelined than anything else. The P4, Athlons and POWER5 all run (clock) faster than Itanium 2 and UltraSPARC.

And stating that "OOO is better at control-type operations or 'branchy integer' code" is also incorrect. OOO is used to let the CPU schedule instructions around data dependencies, not control dependencies. Of course, a state-of-the-art CPU (with OOO) is more likely to have a state-of-the-art branch predictor and thereby be better at resolving branches correctly.

I do agree, though, that there is a middle ground which hasn't been found yet. IBM's next server chip, POWER6, may point in that direction, as it's reducing (but not removing) OOO capabilities in return for a higher clock speed (it's expected at 4-5 GHz), i.e. it gets the best of both worlds.
Nothing indicates that POWER6 is dumping OOO; they are going for narrower instruction fetch + decode + retire. But that kind of makes sense: a whole bunch of server workloads have an IPC (instructions per cycle) throughput below 1, and almost all are below 2, so it makes little sense to be able to sustain 4 or more (like POWER4/5).

EDIT: The only thing an in-order CPU has going for it is reduced design complexity and the inherent time-to-market advantage that may give.

Cheers
 
The problem with in-order is that the optimising compiler will only produce optimised code for one target CPU type, e.g. the Intel P4 but not the AMD Athlon, etc. Operating systems like Windows ship binary code compiled for i486, i586, etc. as well as today's AMD and Intel 32- and 64-bit chips, and have to run on several different processors. Therefore optimising compilers don't work well, because there is no single target for the compiler to optimise for. Hence OOO will always be required for mainstream Windows computers, and in-order will always produce poor performance there.

Though I'd generally agree with you about mandatory OOO for Wintel desktops and the like (i.e. CPU segments virtually devoid of precise compilation targets), I'm not sure you aren't giving too much credit to order-optimising compilers and in-order CPUs.

The problem is the dynamic nature of pipeline/unit stalls at runtime. Whereas the compiler can do its best to discover data (in)dependencies at compile time, an OOO CPU can do way more - it can keep dynamically re-scheduling ops and doing work at some stations while other stations are blocked/stalled due to events virtually unforeseeable at compile time.
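A hypothetical illustration (made-up function): with an indirect, data-dependent access pattern, the compiler cannot know which loads will hit or miss, so any static schedule is a guess, while an OOO core simply keeps issuing the later, independent loads while whichever one happens to miss is outstanding.

```c
#include <stddef.h>

/* Illustrative only: a gather through an index table. Whether table[idx[i]]
 * is in cache depends on the data, not on anything visible at compile time,
 * so the compiler cannot reliably hoist the "slow" loads early. An OOO core
 * can run ahead and issue the loads for later iterations while a missing
 * one is still in flight. */
float gather_sum(const float *table, const int *idx, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += table[idx[i]];   /* load latency is data-dependent */
    return sum;
}
```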
 
darkblu said:
whereas the compiler can do its best to discover data (in)dependencies at compile time
Let's be honest though - compiler schedulers aren't exactly state of the art at this point. All too often they need manual labour to prod them in the right direction.

As for dynamic scheduling - profile-based optimizers could often help there (not to say they are a solution for everything).
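For what it's worth, a minimal sketch of that flow using GCC's real -fprofile-generate / -fprofile-use options (the file name, function and training run are made up for the example):

```c
/* hot.c -- toy example of a branch whose bias only shows up at runtime.
 *
 * Hypothetical profile-guided build with GCC:
 *   gcc -O2 -fprofile-generate hot.c -o hot
 *   ./hot                 # training run; writes *.gcda profile data
 *   gcc -O2 -fprofile-use hot.c -o hot
 *
 * With the profile, the compiler knows how the branch usually resolves and
 * can lay out and schedule the common path accordingly. */
#include <stdio.h>

static int classify(int x)
{
    if (x < 0)      /* the profile records how often this is taken */
        return -1;
    return x & 1;
}

int main(void)
{
    long sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += classify(i);
    printf("%ld\n", sum);
    return 0;
}
```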
 
Let's be honest though - compiler schedulers aren't exactly state of the art at this point. All too often they need manual labour to prod them in the right direction.

Yes, you are quite right, unfortunately.

As for dynamic scheduling - profile-based optimizers could often help there (not to say they are a solution for everything).

Doh, I have to admit I had totally forgotten about those. It must have something to do with the fact that I use such stuff once in a blue moon. But I guess on a closed platform they could prove fairly useful.
 
If in-order CPUs had any kind of power/performance edge over OOO CPUs we'd see laptops with in-order CPUs in them.
I don't think that's true. All PCs need legacy support and have to run any old code, and in-order cores can crawl on unoptimized code. If the laptop were a closed system like a console, with its own software and devs who had to target in-order, then it'd be a choice worth considering.
EDIT: The only thing an in-order CPU has going for it is reduced design complexity and the inherent time-to-market advantage that may give.
And smaller cores, meaning you can cram more onto a chip, and simpler cores, meaning you can clock them higher. Why else would both Sony and MS go with in-order multicore in their desire for high-performance processors, if a similarly sized, far-easier-to-develop-for OOO core would give better performance? In-order means more execution units, which in turn provides greater peak performance potential, shifting the concern of efficient instruction usage onto the developer. The real-world gains are evident in the likes of Mercury's medical imaging work. The existing OOO cores aren't a patch on the slimmed-down multicores of Cell, and if the SPEs were OOO, and sized up because of that, there wouldn't be as many of them, resulting in lower performance.

As consoles provide a closed-box environment with no need for legacy support of existing 'unoptimized for in-order' programs, there's no reason to switch to OOO on future consoles except to give devs an easier time at the cost of your hardware's peak performance. As I expect the multi-core paradigm to become better understood and better supported by development tools over time, this generation is probably going to be the hardest for in-order development; future consoles will have it easier, making the use of in-order less of a restriction. Especially if courses are set up to give programmers in-order coverage, and universities don't just stick to knocking out C on x86 boxes and consider every student who manages that to be suitable for console development.
 