Future console CPUs: will they go back to OoOE, and other questions.

If in-order CPUs had any kind of power/performance edge over OOO CPUs we'd see laptops with in-order CPUs in them.
The fact that laptops have to support a lot of unparallelized legacy code is also a driving force keeping OOO in PCs. A closed environment is a closed environment: on a console you can split your code into the optimal number of tasks to wring the most out of every clock cycle. Single-threaded performance will be king on PCs for a very long time.

EDIT: Shifty beat me. :)
 
And smaller cores meaning you can cram more onto a chip, and simpler cores meaning you can clock them higher. Why else would both Sony and MS go with IO multicore in their desire for high-performance processors, if a similarly sized and far easier to develop for OOO core would give better performance?

It was rumoured that MS really wanted an OOO CPU in the X360, but IBM cut them short to meet the deadlines.

In-order means more execution units, which in turn provides greater peak performance potential, shifting the concern of efficient instruction usage to the developer. The real-world gains are evident in the likes of Mercury's medical imaging. The existing OOO cores aren't a patch on the slimmed-down multicores of Cell, and if the SPEs were OOO and sized up because of it, there wouldn't be as many of them, resulting in lower performance.

The savings from going in-order are vastly exaggerated: your physical register file will have to be the same size as a renaming register file, and instruction caches will have to be bigger to get the same hit rates/performance because of inlining and unrolling bloat. That leaves the ROB, which was <10% of the core in PPRO, P3 and Pentium M, and is less than 8% in K8 (all of the schedulers combined). I'm guessing it's a lot less in Core 2, since the branch prediction, load-store and SSE units are all beefed up (so even though the scheduler supports wider execution and has more slots to cover more cycles, its relative size is smaller than in PPRO).
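
To make the unrolling bloat concrete, here's a minimal hypothetical sketch (plain C, invented for illustration): the rolled loop leaves the latency-hiding to the hardware, which an OOO core does by reordering; the unrolled version does the same overlap statically with four independent accumulators, at roughly four times the instruction-cache footprint for the same work.

[CODE]
/* Hypothetical sketch: the rolled loop relies on the hardware to
 * overlap the multiply/add latency; the 4x-unrolled version does
 * that overlap statically with four independent accumulators, at
 * roughly 4x the instruction-cache footprint for the same work. */
float dot(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];            /* each add waits on the previous */
    return s;
}

float dot_unrolled(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];   /* four independent chains keep */
        s1 += a[i + 1] * b[i + 1];   /* an in-order pipeline busy    */
        s2 += a[i + 2] * b[i + 2];   /* while earlier results are    */
        s3 += a[i + 3] * b[i + 3];   /* still in flight              */
    }
    for (; i < n; i++)
        s0 += a[i] * b[i];           /* remainder */
    return (s0 + s1) + (s2 + s3);
}
[/CODE]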

The reason the CPU in the X360 is in-order is that it would never have been completed in time otherwise.

As consoles provide a closed-box environment with no need for legacy support of existing 'unoptimized for IO' programs, there's no reason to switch to OOO on future consoles except to give devs an easier time at the cost of your hardware's peak performance. As I expect the multi-core paradigm to be better understood and supported in development tools over time, this generation is probably going to be the hardest for IO development; future consoles will have it easier, making the use of IO less of a restriction.
Especially if courses are set up to provide IO coverage for programmers, and universities don't just stick to knocking out C on x86 boxes and considering every student who manages that to be suitable for console development.

I really have 2 problems with the above:
1. The sentiment that OOO is so developers can get away with sloppy (or legacy) code. That is not the case at all! There's a whole host of performance hoisting that can only be done on an OOO CPU, and most of it is of the really important kind: hiding latency from the memory subsystem. OOO CPUs don't just run sloppy code best; they also run highly optimized code best (and in many cases spectacularly so).

2. The sentiment that developers should rewrite most of their existing codebase to "take advantage" (really: to work around the pitfalls) of these architectures. With game engines reaching multi-million lines of code, this simply becomes unfeasible for most (both time and money). We're not just talking about a recompile here; we're talking about explicitly (vertically) multi-threading your code to avoid the insane load-to-use and other execution latencies of these CPUs, and jumping through all kinds of other hoops.
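
To illustrate one such hoop with a made-up toy loop (not from any real engine): hand software-pipelining, so each value is loaded one iteration before it's used, covering an in-order core's load-to-use latency. An OOO core would find this overlap by itself in the naive version.

[CODE]
/* Naive form: the load feeds the add immediately, so an in-order
 * core eats the full load-to-use latency every iteration. */
int sum_naive(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Hand-pipelined form: the load for iteration i+1 is issued while
 * iteration i's value is being consumed, covering the latency. */
int sum_pipelined(const int *a, int n)
{
    if (n == 0)
        return 0;
    int sum = 0;
    int cur = a[0];              /* prologue: first load in flight */
    for (int i = 1; i < n; i++) {
        int next = a[i];         /* next iteration's load, early   */
        sum += cur;              /* consume last iteration's value */
        cur = next;
    }
    return sum + cur;            /* epilogue: last loaded value    */
}
[/CODE]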

Cheers
 
Perhaps to the level that Itanium has, but I don't know how far you really want to push this, lest history repeat itself. On the PC, we've basically needed something like an 11x increase in transistor budget since the last of the in-order CPUs just to eke out around double the IPC. Granted, this is more a result of new IPC-boosting additions constantly snowballing on top of each other, but that's generally how it plays out. If you instead scaled up the number of cores, that's massively more bang for the buck on the hardware side.
Your statement is misleading. You imply that IPC for an in-order PC CPU would have remained constant at current clock speeds. That would not be the case, because memory latencies (in cycles) have increased significantly, so the IPC of an in-order CPU would be lower than it was 10 years ago. My guess is that the IPC of modern OOO x86 CPUs is something like 3-5x that of a comparable in-order x86 CPU, maybe even more for memory-intensive applications.
 
It was rumoured that MS really wanted a OOO CPU in the X360, but IBM cut them short to meet the deadlines.
...
The reason the CPU in X360 is an in-order is that it would never have been completed in time otherwise.
This sounds very strange, as IBM has plenty of OOO cores to pick from if OOO was really that important to MS. My guess is that IBM convinced MS that if they wanted the most overall IPC, multiple dual-threaded in-order cores would give them the most performance per transistor.

If you say development time was crucial to Xenon ending up in-order, how do you explain the fact that the PPE core in Cell ended up in-order? Development time was hardly a constraint in that case.
 
This sounds very strange, as IBM has plenty of OOO cores to pick from if OOO was really that important to MS. My guess is that IBM convinced MS that if they wanted the most overall IPC, multiple dual-threaded in-order cores would give them the most performance per transistor.

They really only had one, the PPC970/Power 4 derivative.

If you say development time was crucial to Xenon ending up in-order, how do you explain the fact that the PPE core in Cell ended up in-order? Development time was hardly a constraint in that case.

Man-years, they may have had enough years, but not enough men.

Cheers
 
I don't think that's true. All PCs need legacy support and have to run any old code, and in-order cores can crawl on unoptimized code. If the laptop were a closed system like a console, with its own software and devs who had to target in-order, then it'd be a choice worth considering.

Historically, almost all if not all in-order cores have been outperformed by OOOE equivalents.

And smaller cores meaning you can cram more onto a chip, and simpler cores meaning you can clock them higher. Why else would both Sony and MS go with IO multicore in their desire for high-performance processors, if a similarly sized and far easier to develop for OOO core would give better performance?

This one is easy. Both Microsoft and Sony sourced their CPU from the same company and essentially got the same core.

In-order means more execution units, which in turn provides greater peak performance potential, shifting the concern of efficient instruction usage to the developer. The real-world gains are evident in the likes of Mercury's medical imaging. The existing OOO cores aren't a patch on the slimmed-down multicores of Cell, and if the SPEs were OOO and sized up because of it, there wouldn't be as many of them, resulting in lower performance.

The PPE is a 2-issue core. With an in-order CPU, you generally are going to want a narrower, faster core because your IPC is going to be inherently lower anyway. I don't think your Mercury example is entirely relevant because the performance there is strictly based on CELL's asymmetric architecture. There's no reason why IBM couldn't have traded a few SPEs for a slightly larger and more powerful out-of-order PPE. And I think that would have been a beneficial tradeoff, even in game code.

As consoles provide a closed-box environment with no need for legacy support of existing 'unoptimized for IO' programs, there's no reason to switch to OOO on future consoles except to give devs an easier time at the cost of your hardware's peak performance. As I expect the multi-core paradigm to be better understood and supported in development tools over time, this generation is probably going to be the hardest for IO development; future consoles will have it easier, making the use of IO less of a restriction. Especially if courses are set up to provide IO coverage for programmers, and universities don't just stick to knocking out C on x86 boxes and considering every student who manages that to be suitable for console development.

Or instead of slapping on ever more low-performance cores, they could increase the complexity of the cores and boost overall performance and efficiency. As others have pointed out, compilers really aren't to the point of compensating for the difference, even with profile-guided optimizations and this is unlikely to change. The advantage of an asymmetric design like CELL is that you can cater to both highly parallel throughput-oriented workloads as well as high-speed single-threaded workloads. The best way to do this is to at least boost the IPC of the PPE.
 
and most of it is of the really important kind: hiding latency from the memory subsystem. OOO CPUs don't just run sloppy code best; they also run highly optimized code best (and in many cases spectacularly so).

This is an excellent point. Even with the best OOOE chips out there, you are basically covering cache latency. With an in-order CPU, those are just dead cycles, lots of them.
 
There's no reason why IBM couldn't have traded a few SPEs for a slightly larger and more powerful out-of-order PPE. And I think that would have been a beneficial tradeoff, even in game code.
I don't know about that. Even individually, the SPEs are really quite impressive if you know what you're doing, and the improvement you'd see on the PPE versus the amount lost by sacrificing a few SPEs is of questionable worth in my book. Granted, it would be sizable for the codebases here and now, especially since the PPE isn't really that great in comparison -- the XeCPU cores individually will outperform Cell's PPE rather handily, even though they're equal under perfectly ideal conditions. But 2 or 3 years down the road, I think people will be wishing they had more than 10 SPEs. Granted, that may be accelerated by the fact that they don't have much choice but to move everything down there, but they are magic nonetheless.

Actually, if the PPE just had better dynamic SMT scheduling, its IPC would probably go up a good 50%, possibly double in high-CPU-load single-threaded cases, and that would be a minor change in terms of die area. Here's to hoping all the DD3.x talk will mean something.

In any case, I do see it as an ultimately useful and valuable change, but I think it's more suitable for further generations of the hardware. For now, it's basically a SIMD throughput machine.
 
Gubbi said:
They really only had one, the PPC970/Power 4 derivative.
Of course they had more cores; what do you mean? Was that the latest at the time, or what?
Gubbi said:
Man-years, they may have had enough years, but not enough men.
Cell is one hell of an expensive chip; 400-500 engineers worked on it for a few years, but obviously they didn't find OOO to be the name of the game for some reason.

ban25 said:
Historically, almost all if not all in-order cores have been outperformed by OOOE equivalents.
I cannot see anyone arguing against that; for single-threaded performance OOO is best, hands down. You are kicking in open doors.

ban25 said:
This one is easy. Both Microsoft and Sony sourced their CPU from the same company and essentially got the same core.
From what I've read, Xenon and Cell were developed by entirely separate teams.

ban25 said:
There's no reason why IBM couldn't have traded a few SPEs for a slightly larger and more powerful out-of-order PPE. And I think that would have been a beneficial tradeoff, even in game code.
There are plenty of reasons why IBM didn't trade a few SPEs for a slightly faster PPE core. You are completely missing the point of the SPEs. Their primary benefit is that they target the memory wall that common CPU designs keep running into. By running code out of local store and handling input/output data via triple buffering or something similar, they essentially never stall, and the relatively simple cores reach an extremely high degree of utilisation. It would be insane to trade some SPEs for a 20-40% faster PPE core. Just to exemplify: an SPE is capable of about 25 GFlops. What would it take to increase the performance of the PPE by that amount, and what would the consequences be with regard to memory handling etc.? Then do the exercise one more time and try to add the floating-point capacity of two SPEs.
To add some more spice to the task, try to maintain the transistor budget. Good luck!
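
For the curious, here's a minimal double-buffered sketch of that overlap (real schemes often triple-buffer; double shows the idea). It assumes the Cell SDK's spu_mfcio.h MFC interface; CHUNK and process() are placeholder names of my own, not SDK ones.

[CODE]
/* Minimal double-buffered SPE sketch, assuming the Cell SDK's
 * spu_mfcio.h MFC interface. While the SPU computes on one
 * local-store buffer, the MFC streams the next chunk in, so the
 * core itself (almost) never waits on main memory. */
#include <spu_mfcio.h>

#define CHUNK 4096
static volatile float buf[2][CHUNK / sizeof(float)]
    __attribute__((aligned(128)));

extern void process(volatile float *data, int nfloats); /* placeholder */

void stream(uint64_t ea, int nchunks)
{
    if (nchunks <= 0)
        return;
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prime first chunk */
    for (int i = 1; i <= nchunks; i++) {
        int nxt = cur ^ 1;
        if (i < nchunks)                             /* start next DMA   */
            mfc_get(buf[nxt], ea + (uint64_t)i * CHUNK, CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                /* wait for current */
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK / sizeof(float));    /* compute overlaps */
        cur = nxt;                                   /* the next DMA     */
    }
}
[/CODE]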

ban25 said:
This is an excellent point. Even with the best OOOE chips out there, you are basically covering cache latency. With an in-order CPU, those are just dead cycles, lots of them.
The hardware threads of both Xenon and the Cell PPE help with this to some degree. And don't forget that cache misses carry basically the same price for OOO and in-order, though the hardware threads help somewhat in that case as well.
 
The hardware threads of both Xenon and the Cell PPE help with this to some degree

yes. unfortunately, both these cpus have only two hw contexts. an OOOE cpu is significantly more potent at countering arbitrary stalls, due to the fact that it has way finer and usually wider control over re-scheduling than an in-order 2-way SMT. OOOE usually works at the level of individual units (ALUs, etc.); in-order SMT works at the level of a context.

and don't forget that cache misses carry basically the same price for OOO and in-order, though the hardware threads help somewhat in that case as well.

nope. cache misses (and any other kind of stall) are generally handled better by OOOE units. then you get OOOE-with-SMT, which in theory should be all fine and dandy as long as you don't have a nasty resource bottleneck, i.e. as long as you have really high unit redundancy.

all in all, OOOE and SMT are not mutually exclusive, as they tackle similar problems but from fairly distant perspectives - one being the fine-grained information the CPU has about its actual momentary state, and the other the programmer/os-scheduler's knowledge that certain tasks can be carried out concurrently.
 
Of course they had more cores; what do you mean? Was that the latest at the time, or what?

G3s or G4s. Considering how Apple's machines using these chips were creamed by ordinary PCs, that would have been a hard sell.

From what I've read Xenon and Cell was developed by entirely separate teams.

From this (my emphasis):
The CPU was designed uniquely for Microsoft and for use in the Xbox 360 using the system architecture specifically defined around customer requirements......

We used existing PowerPC processor and subsystem technology and designs as a foundation to jump-start the development.

There are plenty of reasons why IBM didn't trade a few SPEs for a slightly faster PPE core. You are completely missing the point of the SPEs. Their primary benefit is that they target the memory wall that common CPU designs keep running into. By running code out of local store and handling input/output data via triple buffering or something similar, they essentially never stall, and the relatively simple cores reach an extremely high degree of utilisation.

This has always been a bollocks argument. The memory wall concerns two things: bandwidth and access latency, the latter being significantly harder to do anything about than the former.

CELL does nothing to alleviate latency; in fact it adds to access latency, because the SPEs have to explicitly set up DMA for memory transfers (but to be fair, it does this rather fast).

What it does do is force a programming model where the developer is tasked with turning latency-bound problems into bandwidth-bound ones. This is just a losing proposition IMO (at least in the long run).

And just about any algorithm that is a good fit for local stores is a good fit for caches.

It would be insane to trade some SPEs for a 20-40% faster PPE core. Just to exemplify: an SPE is capable of about 25 GFlops. What would it take to increase the performance of the PPE by that amount, and what would the consequences be with regard to memory handling etc.? Then do the exercise one more time and try to add the floating-point capacity of two SPEs.
To add some more spice to the task, try to maintain the transistor budget. Good luck!

Yeah, it would be insane if CELL in its current form were running anywhere near peak FLOPS. Do you think this is the case?

<snip>...and don't forget that cache misses carry basically the same price for OOO and in-order, though the hardware threads help somewhat in that case as well.

Well, a modest-sized ROB could cover the L1-miss-L2-hit latency of twenty-something cycles. If you have an L1 hit rate of 90% and one in three instructions is a load or store, you miss L1 every 30 instructions on average (every 15 cycles at an IPC of 2), and then stall for 20 cycles, netting an overall efficiency below 50%. L1 hit rate becomes very important on an in-order. An OOO would just chug along, because it can schedule around these latencies.
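
Spelling that arithmetic out (all figures are the assumptions above, nothing measured):

[CODE]
/* Back-of-the-envelope version of the numbers above, under the
 * stated assumptions: 2 IPC when not stalled, 1 in 3 instructions
 * a load/store, 90% L1 hit rate, 20-cycle L1-miss-L2-hit penalty
 * that an in-order core eats in full. */
#include <stdio.h>

int main(void)
{
    double ipc = 2.0, ls_frac = 1.0 / 3.0, l1_hit = 0.90;
    double miss_penalty = 20.0;

    double insns_per_miss  = 1.0 / (ls_frac * (1.0 - l1_hit)); /* 30 */
    double cycles_per_miss = insns_per_miss / ipc;             /* 15 */
    double efficiency = cycles_per_miss / (cycles_per_miss + miss_penalty);

    printf("one L1 miss every %.0f instructions (%.0f cycles)\n",
           insns_per_miss, cycles_per_miss);
    printf("in-order efficiency: %.0f%%\n", efficiency * 100.0); /* ~43% */
    return 0;
}
[/CODE]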

Cheers
 
Which has zero significance. Power/megahurtz is 100% pointless to compare.

In Power/Performance Woodcrest wins hands down.

I wouldn't be so sure about that; in floating-point SIMD I'd expect a Xenon to demolish even a Woodcrest.

I disagree. OOO saves power. To get anywhere near the performance of an OOO core, an in-order would have to:
1. Be wider (or faster).
2. Spend significant resources (power) to make critical data-dependency latencies smaller, like L1 load-to-use latency.
3. Increase the size of on-die caches, because inlining and loop unrolling will bloat code.

Xenon's SIMD engines have 128 registers, and while it doesn't have a huge cache, it gets around the added latency by using 2 threads.

The cache isn't big, considering it's being accessed by 4 processors (the 3 cores and the GPU) and can do things like cache locking.

If in-order CPUs had any kind of power/performance edge over OOO CPUs we'd see laptops with in-order CPUs in them.

But really low-power devices like phones and PDAs all use in-order processors. High-throughput chips like GPUs (which completely destroy CPUs at what they do) are also in-order.

As I said, OOO is useful for certain types of workload.

What complicates matters is that OOO is also a useful band-aid for designs with small numbers of registers, e.g. x86. Without OOO you'd lose all the rename registers, and performance would likely plummet (see the VIA C3 benchmarks). In that case OOO probably does save power, since it's boosting performance so much.

However, PowerPC has always had 32 registers, so it doesn't need OOO quite so much and OOO doesn't have as big an effect: according to IBM's figures, OOO only boosts performance by 30-40%.

That said the PPE is NOT a pure in-order machine, it does OOO loads...

Historically, almost all if not all in-order cores have been outperformed by OOOE equivalents.

If history goes back to 1998, yes. Before that, the in-order Alpha was outgunning the out-of-order PA-RISC. Before the Alpha, the fastest CPUs were all huge multi-chip things, the fastest of the fast being Cray's machines: all high-clocked, in-order designs, all beating the living S**t out of IBM's OOO mainframes.

The savings from going in-order are vastly exaggerated: your physical register file will have to be the same size as a renaming register file, and instruction caches will have to be bigger to get the same hit rates/performance because of inlining and unrolling bloat. That leaves the ROB, which was <10% of the core in PPRO, P3 and Pentium M, and is less than 8% in K8 (all of the schedulers combined).

It's also incredibly complex and needs to run very fast, i.e. it gets hot. 8% of a die may not sound like much, but consider that more than half of the die is taken up by cache, which only uses a few percent of the CPU's power budget. Being small doesn't mean it's not a potential problem.

Your statement is misleading. You imply that IPC for an in-order PC CPU would have remained constant at current clock speeds. That would not be the case, because memory latencies (in cycles) have increased significantly, so the IPC of an in-order CPU would be lower than it was 10 years ago. My guess is that the IPC of modern OOO x86 CPUs is something like 3-5x that of a comparable in-order x86 CPU, maybe even more for memory-intensive applications.

IPC is generally limited by code, not the hardware; the average IPC you can extract from code is around 2, exactly what the PPE and SPEs were designed for. In reality, however, IPC is usually lower.

This sounds very strange, as IBM has plenty of OOO cores to pick from if OOO was really that important to MS.

440, 750, 970, POWER5 and probably several others besides; if OOO was that important, they would have got it.

OOO was dropped because of space and power concerns, and because the workload (SIMD floating point) doesn't benefit from it much, if at all.

From what I've read, Xenon and Cell were developed by entirely separate teams.

Yes, but the PPE/Xenon's integer core was from an older project and was a plug-in they could both use.
 
This sounds very strange, as IBM has plenty of OOO cores to pick from if OOO was really that important to MS. My guess is that IBM convinced MS that if they wanted the most overall IPC, multiple dual-threaded in-order cores would give them the most performance per transistor.

If you say development time was crucial to Xenon ending up in-order, how do you explain the fact that the PPE core in Cell ended up in-order? Development time was hardly a constraint in that case.
It was confirmed to me by a Microsoft employee that schedule was the reason IBM could not provide out-of-order execution. It was the fault of the business people for taking so long to get a contract signed.
 
It was rumoured that MS really wanted an OOO CPU in the X360, but IBM cut them short to meet the deadlines.

Not a rumor.

Dean Takahashi's book, p. 149:

IBM knew that it could make a derivative of the efficient PowerPC core that it had created for Sony without a huge redesign effort. It anticipated that it would be able to include a feature known as out-of-order execution. With this feature, a processor could run faster because it could take instructions and reorder them for the most efficient processing. The drawback was that it took up more space on a chip than the simpler, in-order execution of earlier processors.

Using a low-power Power PC core, IBM expected that it would create a chip that ran at a clock rate of 3.5 gigahertz and put three processing cores on a single chip. Each of the cores would also be capable of running two programs, or threads, at the same time. In terms of performance, the machine would be capable of running six times the number of threads on the original Xbox. And it would run four times faster in terms of megahertz. The cores would also be small, meaning that they wouldn’t be extremely costly.

Dean Takahashi's book, p. 263:

“We made the trade-offs together,” Spillinger said. “It started with communication between two teams, and then it expanded so that they talked to any of our engineers.”

A couple of the trade-offs were big ones. During 2003, IBM realized it had to scale back. Instead of hitting 3.5 gigahertz, IBM decided that it could only target 3.2 gigahertz speeds. (Sony had the same problem; it said its Cell chips would run at 4 gigahertz, but had to settle for 3.2 gigahertz). Otherwise, the yields on its chips might be too low, driving the costs up for both IBM and Microsoft.

Another setback was that IBM had also decided that it couldn’t do out-of-order execution. This was a modern technique that enabled microprocessors to dispatch a number of tasks in parallel. A sequence of instructions was broken into parallel paths without regard to order so they could be executed quickly, and then put back into the proper sequence upon completion.

Instead, IBM had to make the cores execute with the simpler, but more primitive, in-order execution. Out-of-order consumed more space on the chip, potentially driving up the costs and raising the risks. When Microsoft’s Jeff Andrews went to Jon Thomason and told him the news, it was like a bombshell. One by one, many of the Mountain View group’s biggest technological dreams were falling by the wayside.

“You always shoot for the best you can do, and then reality kicks in,” said Nick Baker. “You go through iterations and sometimes you get nasty surprises.”

So the story pretty much was: both IBM and MS wanted OOOe, IBM thought they could do it, but they ran out of time or transistor budget, and so MS had to settle for in-order.
 
This has always been a bollocks argument. The memory wall concerns two things: bandwidth and access latency, the latter being significantly harder to do anything about than the former.

CELL does nothing to alleviate latency; in fact it adds to access latency, because the SPEs have to explicitly set up DMA for memory transfers (but to be fair, it does this rather fast).

And you don't think running code and data from the local store rather than slow external memory reduces latency?

Yeah, it would be insane if CELL in its current form were running anywhere near peak FLOPS. Do you think this is the case?

If you run code and data from the local store, you CAN get close to the theoretical peak, unlike a conventional cached processor reading program and data out of slow external memory.

As for in-order and out-of-order execution: as I said before, Windows is very different from consoles. In-order was a deliberate choice by IBM and Microsoft, since the current IBM PowerPC processors (like the one used in the Wii) are OOO. Both IBM and Microsoft have the tools to do extensive performance simulations of code before actually creating any silicon, and have no doubt done so before making the decision to drop OOO for in-order. Also, they - particularly IBM - have some real experts on processor performance, far more knowledgeable than us amateur processor performance pundits on this forum. Now why do you think they both went for in-order cores? How many cores do you think Xenon would have if it was OOO? One? One and a half?
 
Yeah, it would be insane if CELL in its current form were running anywhere near peak FLOPS. Do you think this is the case?

I don't know, but I do know that a lot of the big technical articles I've read on Cell mentioned that the Cell processor might be one of those rare CPUs that may actually perform very close to its theoretical maximum (i.e. 98% or better).

I personally have limited knowledge of and experience with processors, but I do think that streaming is by far the most important technology of the current computer era, and the SPEs are designed extremely efficiently around this problem, with their well-balanced parallel data reading and processing. A big help in maximising their performance is also their simplicity: it may not seem that way at first, but their relative simplicity and autonomy make optimising tasks for them a lot easier. Everywhere I read that the Cell design is basically the right way to do it and the design to follow in the future, and that the only reason current desktop processors aren't going there faster, or aren't there yet, is that they have to deal with legacy stuff and don't want to make too big a shift at once; nevertheless, the systems will go there, and go there fast. There is only so much you can do with a single core, and as soon as you start going multi-core, the rules change.
 
And you don't think running code and data from the local store rather than slow external memory reduces latency? How peculiar!

No, like Gubbi said, all LS does is move the problem from being one the processor deals with, to one the developer deals with. A developer is way more clever than a processor, but a developer is also typically a much more expensive and limited resource than a processor.

Both LS and cache are designed to take advantage of coherency in an algorithm. All LS does is force the developer to manually schedule memory accesses, while a cache is largely controlled by the processor.

If a human takes long enough to optimize the code, LS may get you higher peak performance, but a cache will get you better average performance when the human doesn't have time to look at and hand-optimize everything in your application.
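
A small hypothetical sketch of that division of labour, in plain C with GCC's __builtin_prefetch (PREFETCH_DIST is a made-up tuning value, not measured): with a cache the staging happens automatically and the hint is optional, whereas on a local-store design the equivalent staging is mandatory, explicit DMA (see the SPE sketch earlier in the thread).

[CODE]
/* With a cache, the hardware stages data by itself; the programmer
 * can *optionally* help with a hint. The loop is correct with or
 * without the hint - the cache fills either way - which is exactly
 * the "better average performance with less hand-holding" tradeoff. */
#define PREFETCH_DIST 16   /* tuning guess, workload-dependent */

float sum_hinted(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST]); /* optional hint */
        s += a[i];
    }
    return s;
}
[/CODE]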
 
In-order was a deliberate choice by IBM and Microsoft, since the current IBM PowerPC processors (like the one used in the Wii) are OOO. Both IBM and Microsoft have the tools to do extensive performance simulations of code before actually creating any silicon, and have no doubt done so before making the decision to drop OOO for in-order.

Yes, it was a deliberate choice, and it was forced because IBM was unable to put OOOe in, not because they tested it and decided in-order would result in better performance. See Dean Takahashi's book.
 
Historically, almost all if not all in-order cores have been outperformed by OOOE equivalents.
Sure .. so?

Desktop processors are fast because of a million area- and power-eating features, each providing its few % of speedup, with issue width being a major factor... OOOE in isolation, on a dual-issue core? I doubt it would make even a 10% difference on average.
 
Sure .. so?

Desktop processors are fast because of a million area- and power-eating features, each providing its few % of speedup, with issue width being a major factor... OOOE in isolation, on a dual-issue core? I doubt it would make even a 10% difference on average.

I'd say you're selling it a bit short, but as I mentioned already, the narrow issue width is probably partially a result of the decision to use an in-order core.
 