Future console CPUs: will they go back to OoOE, and other questions.

Well, maybe not completely wrong - even if they aren't always the most efficient way to do it, the SPEs can still run almost all kinds of code, and even if they don't do it faster than the PPE, or even do it a tad slower, they can still take care of a part of the workload.

However, the quickest gain would be to write and rewrite libraries optimised for SPEs where possible, obviously. Even XML parsers, for instance, seem to run well on SPEs.

I personally really hope that Linux on the PS3 is going to be included as promised, and good enough to develop Cell stuff on. Because that will be the best way to find out how well the Cell can perform in a more common environment.
 
But pipelining is just one way of extracting instruction-level parallelism. Having a CPU with issue width m and n pipestages is equivalent to having a CPU with issue width m/2 and 2n pipestages: in both cases you have n*m instructions in flight.
Who said anything about changing pipeline depth though? I merely said that a narrower superscalar processor (i.e. smaller issue width) benefits relatively less from OOOE.
Being able to schedule around L1-miss-L2-hit events could be very useful. Faf mentioned L2 latencies of 40+ cycles (ouch!); that equates to a ~80-instruction ROB. Quite feasible, and being able to schedule around L2 latencies could give a big boost to performance.
And I'd still be surprised if, in isolation, you would get more than a 10% speedup on average (on SPECint).
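
A rough sanity check on that ROB figure (my own back-of-the-envelope, assuming a dual-issue core - that width isn't stated above):

instructions in flight ~= issue width x L2 latency = 2 x 40 = 80

which is where a ~80-entry ROB comes from if you want to cover an L2 hit.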
 
If history goes back to 1998, yes; before that the in-order Alpha was outgunning the out-of-order PA-RISC. Before the Alpha, the fastest CPUs were all huge multi-chip things, the fastest of the fast being Cray's machines - all high-clocked in-order designs, all beating the living s**t out of IBM's OOO mainframes.

And yet the ~500 MHz 21264 had 50% higher performance than the 600 MHz 21164.
 
Considering that both designs are dual-threaded SMT designs, that the PPX core has 4x the number of VMX registers (32*2 vs 128*2, with the two hardware SMT contexts taken into account) and a more complex VMX unit... how did IBM manage to make the PPE slower than each PPX core? (Someone should try an L2-cache-independent benchmark, trying to fit everything in the L1 cache.)
Maybe worth trying, but I do believe others have run a larger variety of benchmarks than I have, and a lot of benchmarks thus far have shown Cell's PPE to be quite a bit behind one PPX core. In general, there's not a whole lot about the core functional components themselves that is weaker. The general impression I get is that a given context simply isn't getting as many cycles to use as it would on a PPX core -- of course, it's just as easy to blame the compiler as something else like thread scheduling, but the standard assumption is that Microsoft's compiler is the lesser one, as it's had the least amount of man-hours put into it (not counting the man-hours that went in 10+ years ago). I should read up on it more, but Cell/PS3 is not my most immediate concern.

I'm not saying the CELL PPE and SPE units themselves are the best design, but an asymmetric heterogeneous processor architecture may be. Serial processors are going to hit a wall in the future, and more and more of the workloads are going to be hamstrung by highly thread-oriented code. For I/O-bound server applications, it's a slam dunk. For games, I think you'll see a few serial loops that will benefit from fast ILP/OOOE execution, which justifies having a handful of traditional cores, but you'll also see a huge amount of embarrassingly parallel code.
I think TLP can only scale so far for consoles, though. Unlike the server example, you're basically taking care of tasks for a single app at a time, more or less (as opposed to tens or hundreds of clients). ILP is going to matter; it's just that constantly trying to scale up on extractable ILP alone is useless, just as scaling up on usable TLP alone isn't going to mean much for a single app. Which is no real smear against the idea of asymmetric multiprocessing and/or CELL. Just that CELL is a point along the way.

Sure if people want to move towards the whole idea of rendering on the CPU, I can see the need to scale the TLP up to some 200+ SPEs, but even then, you've got a chip with loads of processors that only make a measurably large difference to one sub-problem.
 
Rasterization isn't the only embarrassingly parallel problem you're going to see; it just happens to be one of the most highly visible ones, because increasing its workload makes a very recognizable impact (you can tell the difference in graphics). Physics is at least partially parallelizable, fluid dynamics and n-body problems even more so (great for weather and particle system effects). There hasn't been much work done on the 3D equivalent of audio "rendering" since Aureal folded. We're still taking prerecorded sounds and applying box effects from a library (reverb, etc.) instead of the sounds being affected in a physically correct way by the geometry and the types of material the level is made up of. About all I've seen on this front is MPEG-4 SASL, which allows a programmable approach to synthesizing sounds (without regard to environment). Then there's bio-dynamics and stuff like Natural Motion/Euphoria, where the parallelizable component of a single actor is known, but having to handle a few dozen onscreen personalities may be very parallelizable. AI is a diverse field, and there are lots of AI algorithms that are parallel (and many that aren't), and of course focus in these forums usually falls on the ones that are less so -- and even then, I suspect not enough research has been done. Take route planning: IBM has an embarrassingly parallel solution; who knows if it will be adaptable for game use.
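
To make the n-body point concrete, here's a minimal sketch of why that kind of problem parallelizes so well (my own illustration, not taken from any engine or paper mentioned here) - every body's acceleration depends only on the shared, read-only positions, so the outer loop can be sliced across however many cores or SPEs you have with no locking:

Code:
#include <math.h>
#include <stddef.h>

typedef struct { float x, y, z, mass; } Body;

/* Gravitational acceleration on body i from all other bodies. Reads only
   the shared position array and writes only its own output slot. */
static void accel_on(const Body *b, size_t n, size_t i, float out[3])
{
    float ax = 0.0f, ay = 0.0f, az = 0.0f;
    for (size_t j = 0; j < n; j++) {
        if (j == i) continue;
        float dx = b[j].x - b[i].x;
        float dy = b[j].y - b[i].y;
        float dz = b[j].z - b[i].z;
        float r2 = dx*dx + dy*dy + dz*dz + 1e-9f;  /* softening term */
        float inv_r3 = 1.0f / (r2 * sqrtf(r2));
        ax += b[j].mass * dx * inv_r3;
        ay += b[j].mass * dy * inv_r3;
        az += b[j].mass * dz * inv_r3;
    }
    out[0] = ax; out[1] = ay; out[2] = az;
}

/* The embarrassingly parallel part: give each core a slice [begin, end)
   of the bodies and let it run independently. */
void accel_slice(const Body *b, size_t n, size_t begin, size_t end,
                 float (*out)[3])
{
    for (size_t i = begin; i < end; i++)
        accel_on(b, n, i, out[i]);
}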

But developers seem to be so pressed for time to shovel games onto the market within narrow timelines that their thinking, even for far-future generations, seems bound by architectures which won't require them to sit around and think for a few months about an optimal algorithmic solution. Which is a shame, because a lot of exciting work is being done in academia, and people need to take a close look at it and see what can be adapted and used. Everyone today does this with graphics papers, looking for new effects and new strategies that can be adopted for real time, but much less so in other parts of the game engine, where they seem to delegate this research to a few middleware providers.
 
Your statement is misleading. You imply that the IPC of an in-order PC CPU would have remained constant at current clock speeds. That would not be the case, because memory latencies have increased significantly. So the IPC of an in-order CPU would be lower than it was 10 years ago. My guess is that the IPC of modern OOO x86 CPUs is something like 3-5x that of a comparable in-order x86 CPU, maybe even more for memory-intensive applications.

Doesn't VIA still make in-order CPUs? Their CPUs get around 1/3rd the performance in many tasks (including optimized benchmarks) of similarly clocked Intel processors. (I've seen benchmarks where VIA's 1 GHz chips perform around the level of a 300 MHz Pentium II, and a modern Pentium M might even have higher IPC than a Pentium II, so the gap could be even wider per MHz there.) Though I don't think the designs VIA's CPUs are based on ever really matched even Intel's in-order CPUs in performance per MHz; Cyrix was somewhat in the same ballpark, though.

I've also heard people spout that in an extreme case, an OOO CPU can be 10x faster than an in-order CPU.

You make it sound so dramatic, "outgunning". At the end of 1996, 400 MHz Alpha 21164s scored 10.1 SPECint95 (baseline) vs. 9.43 for 160 MHz PA-RISC 8000s. In the same time frame, 200 MHz R10000s (also OOO) scored 10.7 - in my world that is neck and neck.

I wouldn't be surprised that world-class engineers, all approaching the same problem but with different designs, come up with comparable solutions. Which one was cheaper to produce, though?

If Microsoft had selected OOO, they would have had a Wii workalike. It is a question of which will perform faster: a Wii chip or a Xenon. Evidently Microsoft decided Xenon would perform faster.

I really doubt Wii has a processor as fast as even a G5, and I doubt it's what IBM would have pitched to Microsoft.
 
I really doubt Wii has a processor as fast as even a G5, and I doubt it's what IBM would have pitched to Microsoft.

Exactly, according to Dean Takahashi, IBM pitched a 3-core, 6-thread OOOe PowerPC @ 3.5+ GHz, and by the time they realized they couldn't pull it off, it was too late to do anything else except cut features to ship.

If IBM and MS had decided that an in-order processor would have been faster, that's what they would have initially aimed for, rather than ending up with the tri-core in-order CPU they produced as a fallback.

---

On another point, I tend to agree with the people who say that eventually what you'll see is a few big OOOe cores optimized for serial execution, surrounded by simpler cores tuned for specific parallel tasks.

However one thing I expect will happen is that all the cores will share the same basic instruction set and differ only in execution resources and optimizations for specific tasks. The important thing is that the basic programming model for all the cores be the same, so the developer/OS has flexibility to schedule and load balance tasks anywhere.

I also think LS is temporary and is mostly the result of limited transistor budget. We'll end up going back to lockable caches with prefetch and full views of system memory when the transistor budget allows.
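
For what it's worth, the difference in miniature - explicit DMA into local store versus a cached view of system memory with prefetch hints. This is just a sketch I haven't compiled (it assumes the Cell SDK's spu_mfcio.h intrinsics on the SPE side and GCC's __builtin_prefetch on the other; the two halves target different compilers):

Code:
/* SPE side: explicitly DMA a chunk of main memory into the 256KB local
   store, wait on the transfer tag, then work on it at local latency.
   Assumes ea and buf are suitably (128-byte) aligned. */
#include <spu_mfcio.h>

#define CHUNK 4096
static char buf[CHUNK] __attribute__((aligned(128)));

void process_chunk(unsigned long long ea)      /* main-memory address */
{
    const unsigned int tag = 1;
    mfc_get(buf, ea, CHUNK, tag, 0, 0);        /* start DMA: memory -> LS */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                 /* block until DMA completes */
    /* ... work on buf[] ... */
}

/* Cache side: the same streaming pattern with a full view of system
   memory, hinting lines toward the cache ahead of use instead of copying. */
void process_stream(const char *data, unsigned long n)
{
    for (unsigned long i = 0; i < n; i += 64) {
        __builtin_prefetch(data + i + 512);    /* pull a future line in early */
        /* ... work on data[i .. i+63] ... */
    }
}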
 
Meh, you could just as easily blame their inability to realise their initial plan on OOOE ...
 
On another point, I tend to agree with the people who say that eventually what you'll see is a few big OOOe cores optimized for serial execution, surrounded by simpler cores tuned for specific parallel tasks.

if i read you right, we actually and largely agree for once ; )
 
Having two threads helps keep utilization up, but in effect turns your 3.2GHz CPU into 2 1.6GHz CPUs.
I could say this is bollocks, but I believe you've trademarked that expression. :D

Anyway, your assumption is faulty in the case of the Cell. I highly recommend you read the Cell Handbook; there is a lot of interesting information there.

for example:



You can have a boost of between 10-30% from hardware multithreading, depending on what kind of work you distribute to the parallel tasks. If you keep FP-intensive calculations in one task and integer-intensive calculations in the other, you will achieve more parallel execution.
That means in the good case you could turn your 3.2 GHz CPU into two 2.1 GHz CPUs.
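
To spell out the arithmetic behind that best case (using the 30% figure above):

3.2 GHz x 1.30 = 4.16 GHz of aggregate throughput, or about 2.08 GHz per hardware thread

hence the "two 2.1 GHz CPUs".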

Gubbi said:
Reason #1: Paper FLOPS
I think you underestimate Sony and Microsoft.

darkblu said:
no, it can't. it just manages to still get some work done in the meantime. and that's not just limited to out-of-L1$ experiences. that does not mean i don't believe in the programmer's/compiler's ability to handle the situation at medium-to-macro levels - actually i do. i just don't think either the programmer or the compiler tends to get the job well done at a fine level. or put in other words, i do believe that OOOe is a good tool to fill in the cracks left by the coder's/compiler's masonry. i also like the fact that from the coder's perspective OOOE is free ; )
I still don't get what kind of significant work the OOO core can get done when it hits an L2 cache miss on a branch, besides TLP, where in-order has the same benefits. Can you explain further?
 
You can have a boost of between 10-30% from hardware multithreading, depending on what kind of work you distribute to the parallel tasks. If you keep FP-intensive calculations in one task and integer-intensive calculations in the other, you will achieve more parallel execution.
That means in the good case you could turn your 3.2 GHz CPU into two 2.1 GHz CPUs.
Well, the Cell handbook omits to say that you can even have the case where there is a performance loss with multithreading (for example, when the two threads throw each other's data out of the cache). And the more optimized your code is, the smaller the performance gain. So the average case will probably be closer to a 10-15% performance gain.

But let's get to the topic of the thread: my prediction is that future console CPUs will go back to OOOe. When the next generation of consoles comes out, die space will have increased significantly, and so will memory latencies. The die cost of OOOe will remain pretty constant, but caches will get larger and take up more die space, meaning that the OOOe logic will take up only a small portion of the die. So including OOOe would be the smart thing to do.

The interesting question is: what will a PlayStation 4 Cell look like?
 
Doesn't VIA still make in-order CPUs? Their CPUs get around 1/3rd the performance in many tasks (including optimized benchmarks) of similarly clocked Intel processors. (I've seen benchmarks where VIA's 1 GHz chips perform around the level of a 300 MHz Pentium II, and a modern Pentium M might even have higher IPC than a Pentium II, so the gap could be even wider per MHz there.) Though I don't think the designs VIA's CPUs are based on ever really matched even Intel's in-order CPUs in performance per MHz; Cyrix was somewhat in the same ballpark, though.

VIA makes low-power, fanless CPUs intended for embedded applications like firewalls and set-top boxes. They go with VIA's mini-ITX and single-board computer chipsets. They are intended to be low power and cheap, not to compete with Intel's and AMD's power processors.

http://www.mini-itx.com/store/?c=27
 
Exactly, according to Dean Takahashi, IBM pitched a 3-core, 6-thread OOOe PowerPC @ 3.5+ GHz, and by the time they realized they couldn't pull it off, it was too late to do anything else except cut features to ship.

If IBM and MS had decided that an in-order processor would have been faster, that's what they would have initially aimed for, rather than ending up with the tri-core in-order CPU they produced as a fallback.
I don't think anyone can pull it off even now. Look at the price of current dual-core Intel CPUs - they are still far too expensive for use in a console. It is about getting the best performance per transistor, and OOOe lost on that basis.

---

On another point, I tend to agree with the people who say that eventually what you'll see is a few big OOOe cores optimized for serial execution, surrounded by simpler cores tuned for specific parallel tasks.

The trend for consoles will probably be to put a minimal level of OOOe on a single PPE-like core and surround it with small SPE-like cores. A combination of good performance and low transistor count/cost is important in consoles. This will allow an optimising compiler to do its job, and a minimal level of OOOe could help deal with issues that can't be predicted until run time, e.g. conditional branching or pre-emptive multi-threading.

However one thing I expect will happen is that all the cores will share the same basic instruction set and differ only in execution resources and optimizations for specific tasks. The important thing is that the basic programming model for all the cores be the same, so the developer/OS has flexibility to schedule and load balance tasks anywhere.

This was discussed before on another thread. Automatic pre-emptive scheduling and load balancing by the OS is only useful, and only used, for general-purpose operating systems. High-performance parallel processing doesn't make use of pre-emptive scheduling for load balancing, even though suitable SMP-like technologies are widely available, and neither do games. For games and high-performance computing, the scheduling is explicit, and so the need to run on any core, including the PPE, is not important. Also, one core (the PPE) will be used to control the others, which means you already have a suitable implementation on Cell: the SPE code can run on any of the 8 SPEs.
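
To make "the scheduling is explicit" concrete, here's a minimal sketch of the kind of job queue games use instead of OS load balancing (my own illustration in plain C with pthreads standing in for SPE workers - none of these names come from the Cell SDK):

Code:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_WORKERS 8            /* stand-in for 8 SPEs */
#define NUM_JOBS    64

typedef struct { int id; } Job;

static Job jobs[NUM_JOBS];
static atomic_int next_job;      /* shared cursor into the job array */

/* Each worker pulls jobs until the queue is drained. No pre-emption or
   OS-level balancing: whichever worker is free grabs the next job. */
static void *worker(void *arg)
{
    long wid = (long)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_job, 1);
        if (i >= NUM_JOBS)
            break;
        /* ... process jobs[i] ... */
        printf("worker %ld took job %d\n", wid, jobs[i].id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];
    for (int i = 0; i < NUM_JOBS; i++)
        jobs[i].id = i;
    for (long w = 0; w < NUM_WORKERS; w++)
        pthread_create(&t[w], NULL, worker, (void *)w);
    for (int w = 0; w < NUM_WORKERS; w++)
        pthread_join(t[w], NULL);
    return 0;
}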

I also think LS is temporary and is mostly the result of limited transistor budget. We'll end up going back to lockable caches with prefetch and full views of system memory when the transistor budget allows.

The transistor budget is always the primary factor in performance and chip implementations, and the LS isn't going to go away just because you have an increased transistor budget. For games consoles in particular, where there is always the pursuit of raw performance, chip manufacturers will probably use that extra transistor budget to provide more cores and more LS, to again boost performance beyond what is possible for a given transistor budget using the cache/symmetric-core/OOOe approach - although the latter is what will happen in the PC market.
 
VIA makes low-power, fanless CPUs intended for embedded applications like firewalls and set-top boxes. They go with VIA's mini-ITX and single-board computer chipsets. They are intended to be low power and cheap, not to compete with Intel's and AMD's power processors.

http://www.mini-itx.com/store/?c=27
You are missing the point here. Fox5 mentioned the VIA chips in order to back up my claim that current x86 CPUs with OOOe are more than 100% faster than OOOe-less x86 CPUs, because OOOe-less CPUs have dropped in IPC over the last 10 years due to increased memory latencies. BTW, if my memory serves me right, VIA added some (light) OOOe in the latest generation of its x86 CPUs.
Something like the original patent suggested, IMO.
Could you elaborate on this a bit, please? I haven't read the patent.
 
Yes, but Amdahl's law doesn't help the case of multicore either. If you've got a well-known workload (games) with an embarrassingly parallel section that can be reduced to zero, and a fixed serial section that can't, then you're better off coupling the fastest serial single core you can find with as many parallel co-processors as needed to reduce the parallel section to nigh zero.

Hey, stop arguing my point. :)

I could have been clearer though. CELL has 8 cores in total (ignore the PPE/SPE heterogeneous distinction for a minute). This gives it great potential to reduce the parallel parts of your workload. In doing so, the sequential part, to a larger degree, defines the upper bound of the performance achievable by your application. It therefore makes little sense giving your main/host CPU craptacular performance in such a system (that's my opinion).

Which would be faster: a 4-core device, or an 8-core device with 30% lower per-core performance? It would depend on the workload, but it would be a lot easier to get good performance from the former.
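
To put rough numbers on that (a back-of-the-envelope Amdahl comparison of my own, ignoring memory effects and assuming the parallel part scales perfectly): with a serial fraction s of the work,

4 fast cores:                T = s + (1 - s)/4
8 slower cores (0.7x each):  T = s/0.7 + (1 - s)/(8 x 0.7)

The two are equal at s ~= 0.14, so once much more than a seventh of the work is serial, the four full-speed cores win despite the lower aggregate throughput.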

You're right about there being special cases that have absurd amounts of parallelism, graphics being one already well established. The reason we have GPUs today is because computer graphics has matured to the point where the main functionality could be concentrated into a subsystem communicated with through an API. Gone are the days where every 3D game rolled its own rendering subsystem; gone are the semi-deferred software renderers, the raycasting renderers, the voxel engines. Today all use hardware rasterization. We've traded in flexibility for a massive jump in performance.

We may see the same for physics and AI and whatnot. But these technologies are still in their infancy, so the flexibility is needed for now.

I'm not saying the CELL PPE and SPE units themselves are the best design, but an asymmetric heterogeneous processor architecture may be. Serial processors are going to hit a wall in the future, and more and more of the workloads are going to be hamstrung by highly thread-oriented code. For I/O-bound server applications, it's a slam dunk. For games, I think you'll see a few serial loops that will benefit from fast ILP/OOOE execution, which justifies having a handful of traditional cores, but you'll also see a huge amount of embarrassingly parallel code.

The problem with heterogeneous processors is that the overall structure (distribution) is cast in iron (well, silicon). So if your workload is such that you can't take advantage of your SPE-like auxiliary cores, you're wasting (expensive) silicon real estate. To not waste them, they have to be flexible enough to run any code or fast enough on specific sub-tasks to justify their existence. Changing the ISA (and programming model) is a pretty heavy blow to flexibility.

If some workloads become common enough that they could benefit from such an auxiliary core, such a core would make economic sense (the way GPUs do). AMD has already detailed how encryption and XML-parsing co-processors will be embedded in their future Opterons, but these are examples of very special-case helper CPUs.

Cheers
 
A honking great OOOe core with 8 VMX units is going to be BW starved if the rest of the processor remains the same. As you start thinking about how to provide BW to those units, and schedule effectively across them, the idea of breaking them off as little cores makes a lot of sense.

Sorry if I have given the impression that I think multi-core is a bad thing. It isn't; it's the only way to let performance increase exponentially the way it has done for the past 4 decades.

I just think it's a bad idea to give up on sequential performance. Amdahl's law will f*ck over those that do.

Cheers
 
You are missing the point here. Fox5 mentioned the VIA chips in order to back up my claim that current x86 CPUs with OOOe are more than 100% faster than OOOe-less x86 CPUs, because OOOe-less CPUs have dropped in IPC over the last 10 years due to increased memory latencies. BTW, if my memory serves me right, VIA added some (light) OOOe in the latest generation of its x86 CPUs.

VIA CPUs are low-power, fanless chips intended for low-power, silent operation in embedded systems. You are not comparing like for like. It is as stupid as comparing an in-order ARM chip in a PDA with a dual-core OOOe AMD64 chip and claiming that the dual-core AMD64 is 100 times faster because of OOOe.

I am not disagreeing with anyone that OOOe chips are faster than the same chip without the OOOe logic. Nor am I disagreeing with the fact that, for general-purpose operating systems where you have to run code distributed as binaries which are not compiled for a specific target, OOOe is absolutely necessary for good performance and gives higher performance per transistor. What I am saying is that where you can use smart compilers to optimise the code order at compile time (e.g. on games consoles), even though some ordering is only predictable at run time, you can probably get better performance per transistor using an in-order CPU, because the performance increase due to OOOe after compiler optimisation is less than the performance increase you would get by spending the additional transistors that OOOe would require on additional cores. That was the basis on which in-order processors were used in the Xbox 360 and PS3. Xenon would have been a single-core chip if OOOe had gone in.
 
Xenon would have been a single-core chip if OOOe had gone in.

Bollocks!

An entire K8 (Athlon 64) core (L2 cache, northbridge and I/O not included) takes up a whopping 31mm^2 in 90nm. You could fit three of those and 1MB of L2 RAM in about the same space as the XeCPU.

Cheers
 