Future console CPUs: will they go back to OoOE, and other questions.

nope. cache misses (and any other kind of stall) are generally handled better by OOOE units. then you get OOOE-with-SMT, which in theory should be all fine and dandy, as long as you don't have a nasty resource bottleneck, i.e. as long as you have really high unit redundancy.
Please explain how an OOO core can access system memory significantly faster than an in-order core when a cache miss occurs, I don't get that from your description.

darkblu said:
all in all, OOOE and SMT are not mutually-exclusive, as they tackle similar problems but from fairly distant perspectives - one being the fine-grained information of the CPU about its actual momentarily state, and the other - the programmer/os-scheduler's knowledge that certain tasks can be carried concurrently.
Yes, that's a good description.

gubbi said:
What it does do, is force a programming model where the developer is tasked with turning a latency bound problems into bandwidth bound ones. This is just a losing proposition IMO (at least in the long run).
I think it will be a winning proposition in the long run as it puts the programmer in control. He does not have to depend on some code running in parallel not messing up the cache, and by relying on triple, quadruple, etc. I/O buffers he is pretty insensitive to temporary congestion on the memory bus, which may allow the memory bus to be used close to the theoretical maximum without stalling the programs.
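To make the multi-buffering point concrete, here is a minimal sketch of the double-buffered streaming pattern it describes, assuming the Cell SDK's spu_mfcio.h DMA intrinsics on an SPE. The chunk size, the process() kernel and the effective address are invented for illustration:

```c
/* A minimal sketch of the multi-buffered streaming pattern described above,
 * assuming the Cell SDK's spu_mfcio.h DMA intrinsics on an SPE.  The chunk
 * size, the process() kernel and the effective address are invented for
 * illustration; ea and the buffers are assumed 128-byte aligned. */
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 4096                               /* bytes per DMA transfer */

static char buf[2][CHUNK] __attribute__((aligned(128)));

static void process(char *data, int n)           /* placeholder compute kernel */
{
    for (int i = 0; i < n; i++)
        data[i] += 1;
}

void stream(uint64_t ea, int nchunks)
{
    int cur = 0;

    /* Kick off the first transfer, using the buffer index as the DMA tag. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;

        /* Start fetching the next chunk into the other buffer before
         * touching the current one, so the DMA engine runs in parallel
         * with the compute below. */
        if (i + 1 < nchunks)
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

        /* Wait only for the current buffer's tag, then work on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);

        cur = nxt;
    }
}
```

Adding a third or fourth buffer is the same pattern with a deeper rotation, which is what gives the slack against temporary bus congestion mentioned above.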

gubbi said:
Yeah it would be insane if CELL in it's current form were running anywhere near peak FLOPS. Do you think this is the case ?
I remember IBM running some benchmark code at about 150 GFlops compared to the theoretical max that was about 220 something GFlops. And I am sure there will be short moments when highly optimised game code will reach similar figures. That is close enough for me.
 
Whatever happened to aaronspink?

He worked on a bunch of real chips, including one that in principle looks a lot like CELL, and would have a lot to say about this.
 
Not a rumor.
Dean Takahashi's book, p149:
Dean Takahashi's book, p263:

So the story pretty much was, both IBM and MS wanted OOOe, IBM thought they could do it, but ran out of time or transistor budget, and so MS had to settle for in-order.

From the text it seems like the transistor budget was the driving force IMO. They preferred to cut down OOO instead of the number of cores for some reason.
 
No, like Gubbi said, all LS does is move the problem from being one the processor deals with, to one the developer deals with. A developer is way more clever than a processor, but a developer is also typically a much more expensive and limited resource than a processor.

Both LS and cache are designed to take advantage of coherency in an algorithm. All LS does is force the developer to manually schedule memory accesses, while a cache is largely controlled by the processor.

If a human takes long enough to optimize the code, LS may get you higher peak performance, but a cache will get you better average performance when the human doesn't have time to look at and hand optimize everything in your application.
Actually what a LS does is take away a lot of the dynamic behaviour, to ensure loads/stores are predictable. OOO is useful if you don't know how long a load might take; it can then dynamically decide what to do next, but it is still limited to a small set of information (CPUs only have a short time to make decisions).
On a SPE, you can set a compiler working for hours on a piece of code and add information from profiling runs, without doing the hard work yourself - in other words the compiler knows a lot more about the code than an OOO algorithm and has a lot more time to make decisions.

That, of course, doesn't take into account branches and loads via DMA, but as long as a SPE has a reasonable branch-less algorithm and big enough buffers, a compiler alone should bring it closer to its peak than a CPU with caches & OOO would.
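As a toy illustration of that point (the clamp example is invented; real SPU code would operate on vector types and use the spu_sel intrinsic, but plain scalar C shows the idea):

```c
/* Toy illustration of the "branch-less algorithm" point above.  The clamp
 * example is invented; real SPU code would operate on vector types and use
 * the spu_sel intrinsic, but plain scalar C shows the idea. */

/* Branchy version: control flow depends on the data, so a compile-time
 * schedule has to assume the worst case for every iteration. */
void clamp_branchy(float *x, int n, float lo, float hi)
{
    for (int i = 0; i < n; i++) {
        if (x[i] < lo)      x[i] = lo;
        else if (x[i] > hi) x[i] = hi;
    }
}

/* Branch-free version: the same result expressed as selects.  With no
 * data-dependent branches in the loop body, the compiler is free to unroll
 * and software-pipeline it ahead of time - the static scheduling an SPE
 * relies on instead of out-of-order hardware. */
void clamp_branchless(float *x, int n, float lo, float hi)
{
    for (int i = 0; i < n; i++) {
        float v = x[i];
        v = (v < lo) ? lo : v;   /* usually compiled to a select, not a branch */
        v = (v > hi) ? hi : v;
        x[i] = v;
    }
}
```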
 
I wouldn't be so sure about that, in floating point SIMD I'd expect a Xenon to demolish even a Woodcrest.

Number of ops is about the same, 8 per cycle, at comparable clock speeds. Woodcrest has an advantage in that muls and adds can be scheduled independently (and of course it's OOO).

Xenon's SIMD engines have 128 registers, and while it doesn't have a huge cache, it gets around the added latency by using 2 threads.
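A hedged sketch of why a large register file helps here (the dot product is an invented example in plain scalar C; on Xenon this would be VMX vectors and vec_madd, but the principle is the same):

```c
/* Hedged sketch of why a large register file helps an in-order SIMD core.
 * The dot product is an invented example in plain scalar C; on Xenon this
 * would be VMX vectors and vec_madd, but the principle is the same. */
float dot_unrolled(const float *a, const float *b, int n)
{
    /* Four independent accumulator chains: while one multiply-add is still
     * in the pipeline, the next iteration can issue into a different chain
     * instead of stalling on the previous result. */
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)           /* leftover elements */
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Every extra independent chain needs its own register, so a 128-entry VMX file lets the compiler unroll much deeper before it runs out of names and has to spill; SMT gets a similar effect by interleaving a second thread's chains.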

The cache isn't big because it's being accessed by 4 processors (the 3 cores and the GPU) and can do things like cache locking.
Having two threads helps keep utilization up, but in effect turns your 3.2GHz CPU into two 1.6GHz CPUs.

The caches aren't big, because the die diet ate the rest.

But really low power devices like Phones and PDAs all use in-order processors. High throughput chips like GPUs (which completely destroy CPUs) are also in-order.

They are also scalar CPUs with low operating frequencies, using only a few (1-4) mm^2 of die space each. You wouldn't want to stick 100 ARM7 cores on a die, would you? :)

As I said, OOO is useful for certain types of workload.

100% agree.

What complicates matters is that OOO is also a useful band-aid for designs which have small numbers of registers - e.g. x86. Without OOO you'd lose all the rename registers and performance would likely plummet (see VIA C3 benchmarks). In that case OOO probably does save power, since it's boosting performance so much.

However, PowerPC has always had 32 registers, so it doesn't need OOO quite so much and OOO doesn't have as much of an effect; according to IBM's figures OOO only boosts performance by 30-40%.

You have a good point there: x86 benefits a lot more from an aggressive microarchitecture than more straight-up RISC architectures do. But 30-40% is still an astonishing amount of performance, especially for the modest amount of silicon real estate you have to spend to get it.

That said the PPE is NOT a pure in-order machine, it does OOO loads...
Non-blocking loads have been in in-order CPUs for a long time. The PPE/360 CPU also does delayed execution, which is a limited form of OOO, in the FP (+SIMD) pipeline, similar to the G3/G4s.

If history only goes back to 1998, yes; before that the in-order Alpha was outgunning the out-of-order PA-RISC. Before the Alpha the fastest CPUs were all huge multi-chip things, the fastest of the fast being Cray's machines, all high-clocked in-order designs, all beating the living S**t out of IBM's OOO mainframes.

You make it sound so dramatic, "outgunning". :) At the end of 1996, 400MHz Alpha 21164s scored 10.1 SpecINT95 (baseline) vs. 9.43 for 160MHz PA-RISC 8000s. In the same time frame 200MHz R10000s (also OOO) scored 10.7 - in my world that is neck and neck.

Heck, when the Pentium Pro debuted it was king of the hill in SpecINT for two months, faster than the Alphas of the day; that should tell you something about the usefulness of OOO.

But that was back then. Cycle times today are a decimal order of magnitude smaller, while memory latency is lower by a factor of two.

It's also incredibly complex and needs to run very fast, i.e. it gets hot. 8% of a die may not sound like that much, but consider that more than half of the die is taken up by cache, and that cache only uses a few percent of the CPU's power budget. Being small doesn't mean it's not a potential problem.
It's not 8% of a die, it's 8% of a core (level 1 caches are counted as part of the core). For a K8 die with 1MB L2 cache the core only takes up 30% of the die area. So it's 8% of 30%!!!

IPC is generally limited by the code, not the hardware; the average IPC you can extract from code is around 2 - exactly what the PPE and SPEs were designed for. In reality, however, IPC is usually lower.
IPC is almost always limited by the memory subsystem these days, be it bandwidth or latency.

440, 750, 970, POWER5 and probably several others besides; if OOO was that important they would have got it.
The 440, G3 and G4 have a limited form of OOO; it's the same as what the X360 CPU and PPE have in their FP units. It's called delayed execution and can only handle the latencies incurred by those execution units (6-7 cycles tops).

Power4 and the derived PPC 970 were the first really aggressive OOO superscalars IBM designed.

OOO was dropped because of space and power concerns, and because the workload (SIMD floating point) doesn't benefit from it much, if at all.
Do you have a source for that ? Because other than FFTs and single precision Linpack I haven't seen any workload where FP throughput is anywhere near peak.

Yes, but the PPE/Xenon's integer core was from an older project and was a plug-in they could both use.

Erh, so now you're saying that they _are_ the same ?

Cheers
 
Sure .. so?

Desktop processors are fast because of a million area- and power-eating features, each providing its few % of speedup, with issue width being a major factor ... OOOE in isolation on a dual-issue core? I doubt it would make even a 10% difference on average.

There are several ways to skin a cat. You could make it really wide and slow, or pipeline the bejesus out of it and make it clock really fast and really narrow. Throughput would be the same! But IPC would be a lot lower for the fast and narrow one.

Why would it matter what speed (instructions/cycle) the front end can fetch, decode and inject instructions into - and the retirement stage remove instructions from - the ROB, as long as it is higher than the average IPC measured over the number of instructions in the ROB (80+)?

IMO it would make a lot of sense to make a narrow CPU with OOO execution. The inherent data dependencies aren't going away just because you have a narrow front/back end.

Cheers
 
Please explain how an OOO core can access system memory significantly faster than an in-order core when a cache miss occurs, I don't get that from your description.

no, it can't. it just manages to still get some work done in the meantime. and that's not just limited to out-of-L1$ experiences. that does not mean i don't believe in the programmer's/compiler's ability to handle the situation at the medium-to-macro levels - actually i do. i just don't think either the programmer or the compiler tends to get the job well done at a fine level. or, put in other words, i do believe that OOOE is a good tool to fill in the cracks left by the coder's/compiler's masonry. i also like the fact that from the coder's perspective OOOE is free ; )
 
Why would it matter what speed (instructions/cycle) the front end can fetch, decode and inject instructions into - and the retirement stage remove instructions from - the ROB, as long as it is higher than the average IPC measured over the number of instructions in the ROB (80+)?
The more instructions in flight the sooner you will stall with in order, the greater relative benefit to OOOE.
 
Gubbi said:
And just about any algorithm that is a good fit for local stores is a good fit for caches.
While this is true, the end results speak for themselves. Local-store optimizations to data layout will improve performance of the algorithm on cache-based architectures as well - but in my experience, LS will nevertheless always outperform it (usually by a significant margin).
Maybe I've just had the misfortune to work with all the wrong CPUs (I've done relevant work only with MIPS, x86 and PPC, so there's many architectures I can only guess about).

What doesn't help matters at all is that prefetch on in-order designs is all but useless (at least on all those I've seen to date), hence tilting the balance in LS's favour even more.

Yeah it would be insane if CELL in it's current form were running anywhere near peak FLOPS. Do you think this is the case ?
Well given the nature of SIMD, I'd say that's even theoretically impossible in most general algorithms. :p

Well a modest sized ROB could cover the L1-miss-L2-hit latency of twenty something cycles.
What about when the L2 hit latency is not so modest, 40+? Anyway, I think there are things that would do more for performance than OOOe with the console PPCs in their current state.
 
Yes it was a deliberate choice, and it was forced because IBM was unable to put OOOe in, not because they tested it and decided in-order would result in better performance. See Dean Takahashi's book.

You mean they looked at ooo and realised that they had a choice between a single-core ooo chip or a triple-core in-order chip, and realised that for the same silicon in-order was faster. Don't forget that dual-core ooo Intel chips, for example, came out later than the cut-off date for Xbox 360 release, and they cost far too much to make them viable in a console.

If Microsoft had selected ooo, they would have had a Wii workalike. It is a question of which will perform faster, a Wii chip or a Xenon. Evidently Microsoft decided Xenon would perform faster.
 
No, like Gubbi said, all LS does is move the problem from being one the processor deals with, to one the developer deals with. A developer is way more clever than a processor, but a developer is also typically a much more expensive and limited resource than a processor.

Both LS and cache are designed to take advantage of coherency in an algorithm. All LS does is force the developer to manually schedule memory accesses, while a cache is largely controlled by the processor.

If a human takes long enough to optimize the code, LS may get you higher peak performance, but a cache will get you better average performance when the human doesn't have time to look at and hand optimize everything in your application.

Regardless, if coded as it should be, LS dramatically reduces latency and eliminates cache misses and bus contention. That is the bottom line.

Regarding your comment about clever developers: to use the SPEs, you either code to use the local store or you don't use the SPEs at all. This is the way SPEs work, and if you want to use them you have to use the local stores. If you want to compare Cell performance with other processors, you therefore shouldn't compare code written for other processors against Cell code that only uses the PPE and then conclude that the SPEs can't achieve peak performance in real situations. This may happen in practice for general-purpose applications on PS3 Linux, where programmers may not be bothered to recode to use the SPEs, but it certainly won't be true for multi-media players, library functions and games, where the SPEs will definitely be utilised. Incidentally, games and multi-media library functions are precisely the types of applications that require acceleration on the typical desktop PC. Few people, if any, require a triple-core CPU just to accelerate MS Word, so I am not sure that a processor that runs a little slower on office applications, but is massively faster on screen re-draws, media playing and games, would be a bad thing.
 
I think if you look at Niagara or Azul Systems TLP chips, they pack in way more performance per watt without OOOe because the workloads they run are heavily threaded. GPUs have orders of magnitude better performance per watt than CPUs attempting to match the same workload performance, because GPUs are highly threaded. (consider how many CPU cores would be needed to match a single G70 in rendering, now look at the power differences)

If you look at game workloads, they consist of a balance of highly threaded pieces and serialized pieces. Thus, the optimal CPU architecture IMHO is one containing 1 or 2 cores with OOOe coupled with a large number of functional units and the ability to run many, many threads on co-processors. Sort of a Core 2 Duo combined with 16-64 SPE-like units, each of which can have 8-16 thread contexts to hide I/O, memory, and branch latency. Games aren't 100% serial, so for example, if the workload is 40% serial and 60% fully threadable, I don't necessarily want to run all the threadable code on Core 2 equivalents where it is less efficient per unit of silicon and power consumed.

Today, if you wish, you can already view CPU+GPU in this manner, with the highly threaded rasterization logic being handed off to specialized mega-threaded shader pipelines. However, even DX10 pipelines aren't as general purpose as SPEs, so there may still be some advantage to locating these mega-threaded dispatch units on the same core as the OOOe CPU, not the least of which is the possibility of a better cache bus design.
 
the XeCPU cores individually will outperform Cell's PPE rather handily, though they're equal under perfectly ideal conditions.

VMX-128 processing aside... how is this possible? If by individually you mean that you take a single core accessing the whole 1 MB L2 cache then I can understand (2x the cache), but isn't the PPE (since DD2 onwards) at least as large as, if not quite a bit larger than (IIRC it is the latter), the PPX core, without counting the die space used by the L2 cache in either the PPE or the PPX?

Considering that both designs are dual-threaded SMT designs, that the PPX core has 4x the number of VMX registers (32*2 vs 128*2 [2 hardware SMT contexts taken into account]) and a more complex VMX unit... how did IBM manage to make the PPE slower than each PPX core (someone should try an L2-cache-independent benchmark, trying to fit everything in the L1 cache)?
 
The more instructions in flight the sooner you will stall with in order, the greater relative benefit to OOOE.

But pipelining is just one way of extracting instruction level parallelism. So having a CPU with issue width m and n pipestages is equivalent to having a CPU with issue width m/2 and 2n pipestages; in both cases you have n*m instructions in flight.

Being able to schedule around L1-miss-L2-hit events could be very useful. Faf mentioned L2 latencies of 40+ cycles (ouch!), which equates to a ~80 instruction ROB. Quite feasible, and being able to schedule around L2 latencies could give a big boost to performance.
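As a back-of-the-envelope check on those numbers (a rough model, not a statement about any specific core):

```latex
\text{instructions in flight} \approx m \cdot n
\qquad\text{and}\qquad
\text{ROB size to hide an L2 hit} \approx \text{issue width} \times t_{L2}
\approx 2 \times 40 = 80 \text{ instructions}
```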

Cheers
 
While this is true, the end results speak for themselves. Local-store optimizations to data layout will improve performance of the algorithm on cache-based architectures as well - but in my experience, LS will nevertheless always outperform it (usually by a significant margin).
Maybe I've just had the misfortune to work with all the wrong CPUs (I've done relevant work only with MIPS, x86 and PPC, so there's many architectures I can only guess about).

I'm taking your word for it.

But does this have more to do with the memory architecture (core+LS) than with the fact that the SPEs are a clean-slate design with a particular workload in mind?

What limits PPEs compared to SPEs ?

For example, how does SPE SIMD compare to PPE VMX? Are they equivalent? Or is SPE SIMD in general better (performance-wise, easier to use, etc.)? Is it the load/store system of the PPE that is inadequate (I'd imagine the 6-cycle load-to-use latency of the L1 would be a problem)?

Cheers
 
If you look at game workloads, they consist of a balance of highly threaded pieces and serialized pieces. Thus, the optimal CPU architecture IMHO is one containing 1 or 2 cores with OOOe coupled with a large number of functional units and the ability to run many, many threads on co-processors. Sort of a Core 2 Duo combined with 16-64 SPE-like units, each of which can have 8-16 thread contexts to hide I/O, memory, and branch latency. Games aren't 100% serial, so for example, if the workload is 40% serial and 60% fully threadable, I don't necessarily want to run all the threadable code on Core 2 equivalents where it is less efficient per unit of silicon and power consumed.
I basically agree. But with heavily multi-cored microprocessors we'll see a massive speed-up of the parallel parts of our workload. In your 40/60 example above, the 60% can be asymptotically reduced to zero by adding an infinite number of cores, which equates to a 2.5 times speedup. You'll be entirely limited by the serial part of your workload.

That's Mr. Amdahl hitting us with the clue bat.
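For reference, that 2.5x figure is just Amdahl's law with a parallel fraction p = 0.6:

```latex
S(N) = \frac{1}{(1-p) + p/N}
\quad\xrightarrow{\;N \to \infty\;}\quad
\frac{1}{1-p} = \frac{1}{0.4} = 2.5
```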

We're already there: with a total of 8 cores, CELL is already limited by the serial parts of most workloads.

Am I the only one who wondered why the recent Havok demonstration only scaled to 4 SPEs? Why not go all the way and scale to 7 and see total execution time go down to 2.5ms?

Speculation: I'm guessing that they are doing some sort of front-end processing on the PPE, then submitting tasks to the SPEs - and 4 SPEs is enough to saturate the PPE completely.

Today, if you wish, you can already view CPU+GPU in this manner, with the highly threaded rasterization logic being handed off to specialized mega-threaded shader pipelines. However, even DX10 pipelines aren't as general purpose as SPEs, so there may still be some advantage to locating these mega-threaded dispatch units on the same core as the OOOe CPU, not the least of which is the possibility of a better cache bus design.

I just don't think CELL is the way to go.

The biggest problem is that they are hard to virtualize, so it would be really hard to multiplex threads on them in an efficient manner the way we do today on all preemptive OSs. Not a problem for the PS3, which has a fixed number of cores (1 PPE + 7 SPEs) and will only run one app at a time.

But to make it in the rest of the world they'd need to support different SKUs (price points).

Cheers
 
1. The sentiment that OOO is so developers can get away with sloppy (or legacy) code. That is not the case at all!!!!
I wasn't saying it was! That's why a laptop PC has to have OOOe. A console runs specialist software designed just for it, so OOOe becomes an option. For sure it has performance benefits, and in complex cores with lots of execution units it can add considerably, but for something like Cell the benefit from OOOe becomes diluted due to the workload and the ability of developers to better target the processor. In Cell's case, dual-threading takes up the responsibility of keeping the PPE's fewer execution units busy.
2. The sentiment that developers should rewrite most of their existing codebase to "take advantage" (really: to work around pitfalls) of these architectures.
That's true in both ways though. Monolithic OOOe is a pitfall in that it doesn't provide the internal BW needed to feed lots of execution units.
With game engines reaching multi-million lines of code this simply becomes unfeasible for most (in both time and money). We're not just talking about a re-compile here; we're talking about explicitly (vertically) multi-threading your code to avoid the insane load-to-use and other execution latencies of these CPUs, and jumping through all kinds of other hoops.
Which is the price you have to pay to get loads of throughput from a processor. A honking great OOOe core with 8 VMX units is going to be BW starved if the rest of the processor remains the same. As you start thinking about how to provide BW to those units, and schedule effectively across them, the idea of breaking them off as little cores makes a lot of sense. It is a new programming and design paradigm, but a necessary one IMO. Developers may now be jumping through hoops to get it to work, but once they've had these sorts of processors for 5 years, it'll be second nature to design for them effectively.

The biggest problem is that they are hard to virtualize, so it would be really hard to multiplex threads on them in an efficient manner the way we do today on all preemptive OSs. Not a problem for the PS3, which has a fixed number of cores (1 PPE + 7 SPEs) and will only run one app at a time.

But to make it in the rest of the world they'd need to support different SKUs (price points).
This is the wrong argument though. The question was 'Future console CPUs:...'. OOOe on PCs and other processors makes sense where you're looking at different activities and different code, but the environment of the console wants maximum performance for minimal cost. If it costs developers extra effort to get that, but in turn produces a platform that looks better than its rivals and so becomes popular and establishes a large user base to pay for all that development, it's a good choice. OOOe would have been a nicety, and may well appear in later Cells' PPEs, in which case maybe the answer to the OP question is yes, depending on what they are counting as 'going back'. However, in-order will always be there IMO, because future workloads fit streamed architectures better than general purpose architectures, and if you're working on streaming data and processors then OOOe won't be much help. As I understand it, that answers the question 'Will they go back to OOOe?' with a no, because they will stick with IOe.

I think you've been missing the argument that this is about consoles, so referencing benefits in PC space with laptops etc. is non-applicable here. For PC devs to have to rewrite all their code to run on IOe cores... ouch! :eek: For IOe to have to cope with large preemptive OSs and multiple varied applications at once... ooch! For IOe to have to cope with highly optimized console games and activities designed from the ground up for it, that's not a problem, any more than writing console software ever is. They've never been particularly nice machines to develop for when you've been concerned with getting the most from them.
 
I basically agree. But with heavily multi-cored microprocessors we'll see a massive speed-up of the parallel parts of our workload. In your 40/60 example above, the 60% can be asymptotically reduced to zero by adding an infinite number of cores, which equates to a 2.5 times speedup. You'll be entirely limited by the serial part of your workload.

That's Mr. Amdahl hitting us with the clue bat.

Yes, but Amdahl's law doesn't help the case of multicore either. If you've got a well-known workload (games) with an embarrassingly parallel section that can be reduced to zero, and a fixed serial section that can't, then you're better off coupling the fastest serial single core you can find with as many parallel co-processors as needed to reduce the parallel section to nigh zero. You'll have a smaller chip than sticking N copies of a Core 2 on a die, and a 2.5x speedup is nothing to shake a stick at. It is for this very reason that 3D acceleration "works", Amdahl's law or no; the speedup thanks to accelerators has been exponential in the amount of workload processed, and has increased much faster than the rate at which the main CPU has increased the speed at which it can process serial sections.


I just don't think CELL is the way to go.

I don't think homogeneous cores are the way to go. IMHO, the GPU proves the use case alone. As for virtualization, you don't necessarily need to virtualize single SPEs any more than you need to virtualize GPUs. We have survived for a long period of time with GPUs which can't easily be shared amongst concurrent applications; users seem to be OK with the cooperative multitasking-style expensive "preemption" of alt-tabbing applications which pretty much "own" the GPU while they are in the foreground. Yeah, Vista is going to change this, but I don't think it's a big selling point to users of full-screen apps. It's mostly for accelerated desktop UI compositing anyway.

I'm not saying the CELL PPE and SPE units themselves are the best design, but an asymmetric heterogeneous processor architecture may be. Serial processors are going to hit a wall in the future, and more and more of the workloads are going to be hamstrung by highly thread-oriented code. For I/O-bound server applications, it's a slam dunk. For games, I think you'll see a few serial loops that will benefit from fast ILP/OOOE execution, which justifies having a handful of traditional cores, but you'll also see a huge amount of embarrassingly parallel code.

Maybe you'll see AMD just lift the non-ROP sections of ATI GPUs and put them directly into the CPU core, and leave the ROPs as an XBOX360-like co-processor, giving them multicore traditional plus 64-256 ALUs. Who knows. All I know is, I've seen problems which I was pretty close to certain were non-parallelizable and some smartass succeeded in a GPGPU port that smoked the fastest serial processors. I am also skeptical of the frequent assertions of non-parallelizable code which doom such architectures, because the measurements are usually taken by looking at algorithms before anyone has put any real R&D into parallelizing them.

The real problem may not be that they are non-parallelizable, but that it's cheaper to use a traditional core than to hire a bunch of guys to spend a year doing postdoc-level research to figure out the right technique. In other words, CELL-like architectures may have to wait until computer science and tools catch up a little bit.

But for fixed platforms (consoles, supercomputers in research), one wants to achieve maximum performance for minimum cost and power, and development difficulty, at least in the supercomputer field, is not something they are afraid of. They'd rather have 2.5x the performance if it meant 10x the software effort rather than build a machine 2.5x as big.
 
I just don't think CELL is the way to go.

The biggest problem is that they are hard to virtualize, so it would be really hard to multiplex threads on them in an efficient manner the way we do today on all preemptive OSs. Not a problem for the PS3, which has a fixed number of cores (1 PPE + 7 SPEs) and will only run one app at a time.
Cheers

I think your concept of SPE use in a general purpose OS is completely wrong. The way the SPEs would be used in a general purpose OS is similar to the way GPUs are used on those same OSes - to accelerate specific things.

If you are running a general purpose OS such as Linux on the PS3, you would use the PPE to run the existing code and applications, and only use the SPEs to accelerate speed-critical code, or standard libraries or media drivers requiring acceleration. This really isn't too hard to do, since this code is separate from the main program code (no need to rewrite every program to accelerate it - just rewrite the critical libraries and drivers). The code for drivers or library API calls tends to be short and compact, and media player and driver code tends to be hand-optimised anyway.

Multi-tasking SPE code is easy and efficient so long as you are not saving and reloading the code in the SPE local store (i.e. you are only multi-tasking between a set of the same or similar tasks on a particular SPE). You just allocate an SPE to certain tasks, e.g. sound processing and mixing, and when tasks are switched you just save and reload the data that would be overwritten or is specific to the individual task. For the sort of thing that an SPE would be used for on a general purpose preemptive multi-tasking OS - media acceleration, API/driver acceleration - this would be the natural way to do it anyway.
 