Future console CPUs: will they go back to OoOE, and other questions.

I just think it's a bad idea to give up sequential performance. Amdahl's law will f*ck over those that do.
How much is sequential performance demand going to increase versus parallelised demand? From the design criteria for Cell, the idea was to create a processor from scratch to deal with data processing that current 'sequential' processors aren't ideally suited for, and that makes sense to me. I think a lot of computer theory has evolved around the evolving CPU architectures, and it's good to step back and think 'if we consider the problems we'll have to tackle, what sort of a processor would be best suited for that?' without concerning yourself too much with current programming theory and the algorithms we run today.

At the moment I'm sure there are a lot of devs looking at the code they've been writing to date, thinking 'there's no way this algorithm is going to be spread across multiple cores' and wanting one core that can really eke out the performance of its silicon, but parallelism across the board is in its infancy. We're already hearing about problems regarded as a bad fit for parallelism being reworked to fit very well, both on Cell and GPGPU. Personally, using my Tea Leaves of Infallibility, I think the way problems are tackled is going to shift towards parallelism, and the need for the serialised execution thread is going to remain limited. At the end of the day, perhaps the serial core in a multicore architecture won't need to progress far beyond something like an A64, and the focus for processor advancement will be on both the supplementary cores and the memory interfacing, both intra-core and to try and get past that damned slow main memory!

I'll also note that there'll likely remain different designs for server processors and home/workstation/console processors, due to the fundamentally different workloads.
 
VIA CPUs are low power fanless chips intended for low power silent operation in embedded systems. You are not comparing like for like. It is as stupid as comparing an in-order ARM chip in a PDA with a dual core oooe AMD64 chip and claiming that the dual core AMD64 is 100 times faster because of oooe.
I didn't compare at all. Please read properly before hitting the reply button. I stated that IPC for in-order CPUs goes down when memory latencies go up, while OOOe CPUs are still able to increase IPC despite increasing memory latencies. You won't deny that, will you?
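To put that claim in a form you can poke at, here's a back-of-envelope model (my own toy sketch, not vendor data, and every name in it is made up):
Code:
// Toy model: how exposed miss latency turns into CPI. Not real figures for any CPU.
double effective_cpi(double base_cpi, double misses_per_instr,
                     double miss_latency, double cycles_hidden)
{
    double exposed = miss_latency - cycles_hidden;   // latency the core actually eats
    if (exposed < 0.0) exposed = 0.0;
    return base_cpi + misses_per_instr * exposed;    // IPC = 1.0 / CPI
}
// In-order: cycles_hidden is only whatever the compiler scheduled into the load
// shadow, typically a handful of cycles. OOOe: roughly the scheduling window,
// a few tens of cycles, plus any extra misses it manages to overlap.
As miss_latency grows both curves get worse, but the in-order one hits the wall first - which is all the IPC point above amounts to.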

I am not disagreeing with anyone that oooe chips are faster than the same chip without the oooe logic. Nor am I disagreeing with the fact that for general purpose operating systems, where you have to run code distributed as binaries which are not compiled for a specific target, oooe is absolutely necessary for good performance and gives higher performance per transistor. What I am saying is that where you can use smart compilers to optimise the code order at compile time (eg. on games consoles), even though some ordering is only predictable at run time, you can probably get better performance per transistor using an in-order CPU, because the performance increase due to oooe after compiler optimisation is less than the performance increase you would get if you spent the additional transistor count that oooe would require on additional cores. This was the basis on which in-order processors were used on the Xbox 360 and PS3. Xenon would have been a single core chip if oooe had gone in.
:nope: You grossly overestimate the die space needed for OOOe and grossly overestimate the power of compilers. An OOOe Xenon would have been a dual core with a little more cache, maybe a triple core without SMT and a bit less cache. The problem with OOOe is not die space, it's the complexity it introduces, leading to longer development times.
Regarding compilers: the problem compilers for in-order CPUs have is that they (instead of the CPU) need to apply the reordering to the instruction stream, meaning they have to schedule the instructions so that there are as few stalls as possible. There are many problems with this. Anticipating memory latency is one thing, branches are another. Look at Xenon, where you have three cores with possibly six threads. Memory access patterns will be pretty random there. So what should the compiler do? Optimize for L1 cache latency and risk a lot of stalls? Optimize for L2 cache latency and possibly sacrifice a little performance or have some stalls? Or optimize for main memory latency and possibly sacrifice a lot of performance? These are questions that the compiler cannot answer a priori. Sure, profiling helps, but you still have a lot of unknown quantities.
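A contrived C++ loop shows the dilemma (made up purely for illustration, nothing from a real codebase):
Code:
// The compiler tries to move independent work into the shadow of a load.
int sum_scaled(const int* a, const int* b, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        int x = a[i];             // load issued here
        int y = b[i] * 3;         // independent work hoisted into the
        int z = y ^ (y >> 2);     // load-to-use gap
        acc += x + z;             // first real use of x, as late as possible
    }
    return acc;
}
// If a[i] hits L1, a few hoisted instructions cover the latency completely.
// If it misses to L2 you'd need tens of independent instructions, and for a
// DRAM miss hundreds - which simply don't exist in most loops. The compiler
// has to pick one assumption at build time; the OOOe core decides per access
// at run time.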
 
I still don't get what kind of significant work the OOO core can get done when it hits an L2 cache miss on a branch, besides TLP where in-order has the same benefits. Can you explain further?

an L2$ miss on a branch is about the worst thing that can happen to a cpu these days, that's why the fetch'n'decode units run miles ahead of the pipeline, and OOOe cpus usually employ branch predictors and speculative execution. outside of that, though, there's little a cpu can do but take the stall head on. of course, now we could speculate and argue what percentage of in-order cpus have decent branch predictors/speculative execution, but that'd be rather moot.
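to make it concrete, here's a made-up c++ fragment of exactly that nasty case - a branch whose condition comes straight out of a load that may well miss (types and names invented for the example):
Code:
// tree search: every branch depends on a freshly loaded key, and for a big
// tree most of those loads miss. the predictor has little to learn from, so a
// mispredict lands on top of the miss and throws away whatever got speculated
// past it.
struct Node { int key; Node* left; Node* right; };

Node* find(Node* n, int key)
{
    while (n) {
        if (key < n->key)        // branch depends on the load of n->key
            n = n->left;
        else if (key > n->key)
            n = n->right;
        else
            return n;
    }
    return nullptr;
}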

anyway, i stand corrected - an OOOe cpu may not handle any given stall better than an in-order one, but it would still perform statistically better on the variety of stalls you'd encounter in reasonably hairy modern general-purpose code. there, i said it : )
 
The main push for these increasingly parallel CPUs has been from the server side of things. Multiprocessor systems have existed in that space for a long time and it's also where the first multi-core processor systems started to appear. And while the inherent parallelism present in any large multi-user server application makes multi-thread/core/socket machines easy to take advantage of, this is not something that can be so simply or broadly applied in the desktop and embedded/entertainment spaces.

There are plenty of games in which latency is of prime importance (most of them, in fact) and this is not a problem more cores can fix. Even though games will become more parallel with time to take advantage of the hardware that's there, CMP is by no means a panacea for gaming, and you will continue to see plenty of games with the vast majority of their logic in a single thread.

It's a foregone conclusion that chips will continue to get wider as transistor budgets increase - that's easy - but if they don't also get faster, then we're all going to be in a lot of trouble.

 
Could you elaborate this a bit, please? I haven't read the patent.
I'm sure others recall it much better than I, but the gist of it was a group of 8 Cells (each 1 PPE + 8 SPEs) connected together. Or maybe it was 4 Cells.
 
if i read you right, we actually largely agree for once ; )


Now you just have to work out if it was good to go with simpler cores now, rather than one OOOe one. Which element is more important?

Look at the price of current dual core Intel CPUs - they are still far too expensive for use in a console.

The lowest Pentium D goes for like $90, which probably means cost to produce is within reasonable levels for a console. I'd expect Core Duo and Core 2 Duo don't cost much more to produce (if anything), and are just sold at a significant profit.

VIA CPUs are low power fanless chips intended for low power silent operation in embedded systems. You are not comparing like for like. It is as stupid as comparing an in-order ARM chip in a PDA with a dual core oooe AMD64 chip and claiming that the dual core AMD64 is 100 times faster because of oooe.

I don't think it's too unfair to compare a current non-OOOE VIA chip at 1GHz to a Pentium 2 at 300MHz that will actually beat or match the VIA chip in quite a few benchmarks. I believe what the chips are theoretically capable of is pretty close (they both follow the x86 standard pretty closely, right?) and I think Intel even produces fanless Pentium 2s to this day.
If you assume that the design of VIA's current chips is at least on par with the old Cyrix chip designs they purchased, then their designs scale very poorly. The Cyrix chips also targeted the integrated market, but were usually within 10% to 20% of the performance of a Pentium 2 at the same MHz. Well, except in cache dependent benchmarks, where the Pentium 2 appears to have blown all competitors away. The VIA chips don't always have poor performance though; iirc their integer performance is far better, though still not comparable to any OOOE Intel processor at the same speed. Though if the current VIA processors have nothing in common with the old Cyrix Pentium 2 competitors, then it is a worthless comparison. If they're the same architecture, then the VIA CPUs have very poor scaling in many tasks. (I'm also not sure if the Pentium 2s I've seen compared to modern VIA chips are the same as back in the day; the ones I've seen in benchmarks now may have their L2 cache integrated on die.)
 
Now you just have to work out if it was good to go with simpler cores now, rather than one OOOe one. Which element is more important?

my personal position is that even though console manufacturers are more in a position to afford in-order designs, and design execution timings are usually of higher importance to them (another inclination toward in-order), OOOe is what any performance-sensitive design should be after. i won't comment on xenon (i personally find it a rather mediocre design) but re cell - it would've been even more interesting if (1) the ppe was a fully-potent OOOe core running at a lower clock, say 2GHz, and (2) the SPEs were the way they are now but running on a separate, possibly higher clock, say 4GHz. i know such a design would've been even more challenging, but potentially more rewarding too, IMO, of course.
 
I still don't get what kind of significant work the OOO core can get done when it hits an L2 cache miss on a branch, besides TLP where in-order has the same benefits. Can you explain further?

Well, today's scheduling windows can only (just barely) cover the level 2 cache access latency. So no, there is no direct benefit from OOO in handling main memory accesses if you look at one isolated miss and the consequences it has on the execution of instructions.

But!

OOO allows the CPU to proceed further, past the point where the equivalent in-order CPU would have stalled. In doing so it might encounter other cache misses, which will then start being serviced, etc. This means that an OOO core has a higher chance of starting more main memory transactions, and thereby lowering the apparent latency of the subsequent cache misses/memory accesses.

This is commonly confused with speculation. It is not. Speculation only occurs when a control dependency (branch) is guessed. Pure data dependencies like the above are always executed once all producing instructions have results for the consuming ones.
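A small made-up C++ example of that effect (the structure and names are invented; it's only there to show the dependence pattern):
Code:
struct ListNode { int value; ListNode* next; };

// Two independent list walks. An in-order core stalls at the first use of
// a->value and doesn't reach b->value until that miss returns; an OOO core
// issues the load of b->value while a's miss is still outstanding, so the
// two memory latencies largely overlap. No speculation involved - both loads
// are on the committed path, they are simply independent of each other.
int sum_two_lists(const ListNode* a, const ListNode* b)
{
    int s = 0;
    while (a && b) {
        s += a->value;   // miss #1
        s += b->value;   // miss #2, independent of #1
        a = a->next;
        b = b->next;
    }
    return s;
}
A compiler can of course hoist both loads above both adds for the in-order core too, but only as far as its static view of the code allows; the OOO core does it for every pair of independent misses it can reach.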

Cheers
 
OOO isn't the magic bullet some of you might think:

The Alpha 21264 was faster than the 21164, just like many OOO designs were faster than their predecessors, but cache sizes and memory interfaces also improved at the same time; OOO is part of the improvement, not all of it.

Same thing for the VIA chips: they are slower in part due to being in-order, but the fact that they run at a slow clock, have a tiny cache and likely an unexciting memory system also hurts. The VIA chips also share units rather than having separate ones, so things which normally go in parallel on other processors will go in serial.

There may even be ways to get around the relative problems of in-order designs in software. Sun's upcoming "Rock" processor uses a technique called "run forward" (or something similar): a separate thread reads through the instruction stream, fetching data before it's needed. OOO's greatest advantage is that it can issue reads before they're necessary.
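As a rough software analogue of the idea (to be clear: this is not how Rock does it - Rock scouts in hardware - and every name below is invented), you could run a helper thread a fixed distance ahead of the worker:
Code:
#include <atomic>
#include <cstddef>
#include <vector>

// The worker does the real sum; the scout runs ahead and touches the data the
// worker will need, so the worker's loads hit warm cache lines. You'd pin the
// two on hardware threads that share a cache.
void worker(const std::vector<int>& data, const std::vector<int>& idx,
            std::atomic<std::size_t>& cursor, long long& out)
{
    long long sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        cursor.store(i, std::memory_order_relaxed);   // tell the scout where we are
        sum += data[idx[i]];                          // data-dependent (gather) load
    }
    out = sum;
}

void scout(const std::vector<int>& data, const std::vector<int>& idx,
           const std::atomic<std::size_t>& cursor, const std::atomic<bool>& done)
{
    const std::size_t lead = 64;                      // how far ahead to run
    while (!done.load(std::memory_order_relaxed)) {
        std::size_t i = cursor.load(std::memory_order_relaxed) + lead;
        if (i < idx.size())
            __builtin_prefetch(&data[idx[i]]);        // GCC/Clang builtin
    }
}
Note this only works because idx[] is itself plain, predictable data the scout can read ahead of time; for a pointer chase the scout has to do the same chase as the worker and gains next to nothing.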

I think in Xenon's and especially Cell's case the performance of the PPE is being hurt by the cache. In Cell every memory access has to be checked against the PPE cache, it is in effect being shared by 9 processors.

It's a foregone conclusion that chips will continue to get wider as transistor budgets increase - that's easy - but if they don't also get faster, then we're all going to be in a lot of trouble.

Wider in number of cores or SIMD, but not wider in instruction issue. Clock speed will increase but it'll be at a snail's pace. As more and more cores are added you can expect cache sizes to decrease (see AMD K8L) and single-threaded performance to drop as cache coherence problems start to bite - keeping multiple caches in sync with each other is going to be complex.

The days of ever more aggressive OOO cores are over; instead we'll see Cell-like designs with control cores and simpler cores for high speed computation. Both Intel and AMD are going in this direction.

The reason is heat, a complex core at a high clock takes more power than multiple cores at a slightly lower clock.

Both Xenon and Cell are signs of things to come for the desktop market.
 
Bollocks!

An entire K8 (Athlon 64) core (L2 cache, northbridge and I/O not included) takes up a whopping 31mm^2 in 90nm. You could fit three of those and 1MB of L2 RAM in about the same space as the XeCPU.
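For scale (the 31 mm^2 figure is from above; the XCPU die size is from memory, so treat it as a ballpark):
Code:
3 x 31 mm^2 (K8 core, 90nm)          =  93 mm^2
90nm XCPU die (commonly quoted)      ~ 170 mm^2
left over for 1MB L2, bus and glue   ~  75 mm^2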

Cheers

It would also need about 300W.
 
OOO isn't the magic bullet some of you might think:

The Alpha 21264 was faster than the 21164, just like many OOO designs were faster than their predecessors, but cache sizes and memory interfaces also improved at the same time; OOO is part of the improvement, not all of it.

21264 was not faster, it was a lot faster.
Code:
|  CPU  | MHz | SpecInt2K |  SpecFP2K | 
| 21164 | 533 |     176   |     176   |
| 21264 | 500 |     300   |     383   |

Cache systems were quite different. 21164: 2x8KB L1 + 96KB L2 (on-die); 21264: 2x64KB L1, with external L2 cache. I'm guessing the L1s were that tiny on the 21164 to keep latency low.
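Normalising the table above per clock makes the gap even clearer (straight division of the numbers quoted, nothing else):
Code:
SpecInt2K/MHz:  21164: 176/533 = 0.33    21264: 300/500 = 0.60   (~1.8x per clock)
SpecFP2K/MHz:   21164: 176/533 = 0.33    21264: 383/500 = 0.77   (~2.3x per clock)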

There may even be ways to get around the relative problems of in-order designs in software. Sun's upcoming "Rock" processor uses a technique called "run forward" (or something similar): a separate thread reads through the instruction stream, fetching data before it's needed. OOO's greatest advantage is that it can issue reads before they're necessary.

Nothing concrete about this approach has materialized. How does this scouting thread know where to stream in from? It's exactly these data-dependent loads that are the problem and that OOO is so good at scheduling around.

Needing revolutionary breakthroughs in software to get good performance on your new core - sorry, but I'm a sceptic.

The days of ever more aggressive OOO cores are over; instead we'll see Cell-like designs with control cores and simpler cores for high speed computation. Both Intel and AMD are going in this direction.

Don't know if they are quite over yet, Intel's new microarchitecture CPUs are king of the hill by a large margin.

The only thing Intel has shown so far is a slide with photoshopped mini-cores on it; no concrete plans have been detailed.

AMD are planning to integrate application-specific accelerators (similarly to what Sun has done with encryption on Niagara). That's because they take up a minuscule amount of die space, and can give one or two full decimal orders of magnitude boosts to the tasks they are supposed to accelerate, not the factor of 2 we would see from SPE-like DSPs.

Both AMD and Intel would be better off putting as many full blown cores on a die as possible for reasons already discussed previously in this thread.

The reason is heat, a complex core at a high clock takes more power than multiple cores at a slightly lower clock.

And OOO leads in-order designs in performance/power at all performance points, all the way down to where it gets completely uninteresting.

Cheers
 
Cache systems were quite different. 21164: 2x8KB L1 + 96KB L2 (on-die); 21264: 2x64KB L1, with external L2 cache. I'm guessing the L1s were that tiny on the 21164 to keep latency low.

Yes, 1 cycle access. It didn't work very well though; they found a slower but bigger cache worked better. The instruction issue on the 21164 was also quite limited, and the 21264 improved on that too.

Don't know if they are quite over yet, Intel's new microarchitecture CPUs are king of the hill by a large margin.

There's no doubt they are impressive processors, but much of the gains come from the faster FSB and the fact that they are built at 65nm, giving huge caches and a higher clock. Once AMD gets up and running properly at 65nm I expect them to be able to up performance by a decent margin; it might not close the gap but it'll likely shrink somewhat.

The only thing Intel has shown so far is a slide with photoshopped mini-cores on it; no concrete plans have been detailed.

Look up the "platform 2015" doc.

AMD are planning to integrate application-specific accelerators (similarly to what Sun has done with encryption on Niagara). That's because they take up a minuscule amount of die space, and can give one or two full decimal orders of magnitude boosts to the tasks they are supposed to accelerate, not the factor of 2 we would see from SPE-like DSPs.

According to their presentations that's true, but other less specific cores are also mentioned, e.g. "vector floating point" or "media processing".

It actually looks like there'll be a variety of designs the makeup of which will depend on the target market.

And OOO leads in-order designs in performance/power at all performance points, all the way down to where it gets completely uninteresting.

On "general purpose" (to which I read "branchy integer") stuff yes, but on SIMD FP I doubt any desktop processor will touch Xenon and certainly not Cell. On heavily threaded server stuff nothing touches Niagara.

BTW You asked in a previous post if I thought the PPE and Xenon cores were the same.
If you look at the die photos it's pretty obvious the front end (control / integer processing) is the same. However there appear to have been some changes to the PPE in Cell DD3 onwards (I can't compare as they've never released a DD3 image).

I doubt they added OOO but I expect a "lite" version will show up in a later Cell.
 
It would also need about 300W.

I think AMD's fastest dual cores only use about 120W at max load, with around 90W being typical for their non-Fx chips.

There's no doubt they are impressive processors but much of the gains come from the faster FSB and the fact they are built at 65nm giving huge caches and a higher clock.

Err, slower FSBs don't have a large performance hit on the Core 2 Duos, nor does halving the L2 cache from 4MB to 2MB, and I'd say even at the lower speeds (~2GHz) they'd probably whoop Xenon.

Look up the "platform 2015" doc.

Which is even more likely to change than Intel's plans to hit 10GHz. That's a bit too far in the future to say it will definitely happen.
 
Bollocks!

An entire K8 (Athlon 64) core (L2 cache, northbridge and I/O not included) takes up a whopping 31mm^2 in 90nm. You could fit three of those and 1MB of L2 RAM in about the same space as the XeCPU.

Cheers

I don't know where you get your figures from, but let me remind you of the current state of the art with regard to oooe multi-core processors, which is a dual core Athlon 64 or Conroe. These currently cost $160 plus. For consoles you are looking at a cost of around $60, so even at this very moment putting a dual core oooe processor in a console is not feasible, let alone a triple core. A triple core in-order chip clearly is feasible, as demonstrated by the Xbox 360 CPU, available a little more than a year ago. The bottom line is that if Microsoft was going for an oooe core, the best they could do for $60 is a single core.

It is of course possible that Microsoft made a mistake in employing incompetent IBM engineers to do the Xenon chip design and they should have come to you instead, but I very much doubt it.
 
I don't know where you get your figures from, but let me remind you of the current state of the art with regard to oooe multi-core processors, which is a dual core Athlon 64 or Conroe. These currently cost $160 plus. For consoles you are looking at a cost of around $60, so even at this very moment putting a dual core oooe processor in a console is not feasible, let alone a triple core. A triple core in-order chip clearly is feasible, as demonstrated by the Xbox 360 CPU, available a little more than a year ago. The bottom line is that if Microsoft was going for an oooe core, the best they could do for $60 is a single core.
The real question is how much does it cost to produce the A64X2 or Conroe. I think they will be comfortably below $60.
Edit: The next question is why would IBM have pitched an OOOe design in the first place and why would Microsoft have wanted one?
It is of course possible that Microsoft made a mistake in employing incompetent IBM engineers to do the Xenon chip design and they should have come to you instead, but I very much doubt it.
Please quit getting personal, that's simply bad style.
 
I don't think it's too unfair to compare a current non-OOOE VIA chip at 1GHz to a Pentium 2 at 300MHz that will actually beat or match the VIA chip in quite a few benchmarks. I believe what the chips are theoretically capable of is pretty close (they both follow the x86 standard pretty closely, right?) and I think Intel even produces fanless Pentium 2s to this day.

The comparison with the Pentium II is appropriate, but the VIA chips, like the AMD Geode chips, are designed for cheap integration into single board computers rather than performance, and so have a lot more non-CPU circuitry crammed onto them than the PII.

The point I am making is not that oooe cpus are not faster than in-order cpus - they are. What I am saying is that in certain specific circumstances, when you can optimise code ordering at compile time for a specific architecture, you can get better performance per transistor with an in-order processor. The choice of in-order cpus for both Cell and Xenon was based on this premise, and that is the issue being debated here.

Also, as I said earlier, benchmarks based on code which is not compiled with optimisation for the specific in-order cpu running it cannot be used to argue the case in this discussion, simply because the performance argument is based on code being optimised at compile time for in-order versus at run time for out-of-order. For this reason Windows and generic Linux distribution benchmarks cannot be used here to prove things one way or the other: in-order will always perform badly when running generic non-optimised code, that is not in any doubt.
 
The real question is how much does it cost to produce the A64X2 or Conroe. I think they will be comfortably below $60.

You think? Based on what?

Edit: The next question is why would IBM have pitched an OOOe design in the first place and why would Microsoft have wanted one?

Possibly because the Power PC from which it was derived was oooe, and so it would have involved less work. I suspect a three core oooe was not technically feasible, a two core oooe was not viable due to cost grounds and deadline issues, and a single core oooe chip was rejected on performance grounds.

Please quit getting personal, that's simply bad style.

No offence was intended, but give the engineers at IBM and Microsoft some respect as well - they are not complete idiots. The Fa*b0y mentality that you know everything and everyone else is a complete idiot is annoying to say the least.
 
You think? Based on what?

http://www.xbitlabs.com/news/cpu/display/20050913222050.html

Possibly because the Power PC from which it was derived was oooe, and so it would have involved less work. I suspect a three core oooe was not technically feasible, a two core oooe was not viable due to cost grounds and deadline issues, and a single core oooe chip was rejected on performance grounds.

I think it was mostly a schedule thing. IBM has a not-so-good track record when it comes to schedules and OOOe CPUs. Ask Apple about it. I don't think a dual core OOOe would have been too expensive. If I read correctly, Microsoft themselves decided to go with the triple core in-order design, and I guess they did so for schedule reasons, not die cost / performance reasons.
No offence was intended, but give the engineers at IBM and Microsoft some respect as well - they are not complete idiots. The Fa*b0y mentality that you know everything and everyone else is a complete idiot is annoying to say the least.
Of course they are not idiots, but they are human and they make mistakes. Maybe IBM simply underestimated development time. Happens all the time. Or maybe they simply came to the conclusion that their margin on a triple core OOOe CPU would have been too small (due to yield issues). I think that IBM simply had some internal technical problems and Xenon in its current form may have been the best compromise with the given resources and time. But I don't think they would have chosen that design if everything had gone to plan.
 
I don't know where you get your figures from, but let me remind you of the current state of the art with regard to oooe multi-core processors, which is a dual core Athlon 64 or Conroe. These currently cost $160 plus. For consoles you are looking at a cost of around $60, so even at this very moment putting a dual core oooe processor in a console is not feasible, let alone a triple core. A triple core in-order chip clearly is feasible, as demonstrated by the Xbox 360 CPU, available a little more than a year ago. The bottom line is that if Microsoft was going for an oooe core, the best they could do for $60 is a single core.

It is of course possible that Microsoft made a mistake in employing incompetent IBM engineers to do the Xenon chip design and they should have come to you instead, but I very much doubt it.

I've seen sales that bring the cheapest Athlon X2 down to $110, though $150 is the normal retail price. It's ridiculous for you to think that this is the cost of the chip though; that likely includes a large profit and a heatsink. Not to mention that the 733MHz mobile P3 the original Xbox used was probably over $100 when it was used as well.
And don't forget the Pentium Ds sell for as low as $90, and I believe they have an even larger die size than an Athlon X2.

Not to mention that a dual core chip with significantly cut down L2 cache could be made in the same die space as some current single core chips.
 