CELL from GDC

"Chris Hecker: OK, he was the best guy I could find in like, three seconds in the WiFi network out in the lobby. All right. But how do we get there? Well, I'm going to take a little diversion here. I'm a programmer, so, I have two technical slides, really one technical slide. And that's about it. All right, ready? So there are two kinds of code in a game basically. There's gameplay code and engine code. Engine code, like graphics and physics, takes really giant data structures of homogenous data. I mean, it's all the same, like a lot of vertices are all a big matrix, or whatever, but usually floating point data structures these days. And you have a single small, relatively small hour that grinds away on that. This code is like, wow, it has a lot of math in it, it has to be optimized for super scalar, blah, blah, blah. It's just not actually that hard to write, right? It's pretty well defined what this code does.

The second kind of code we have is AI and gameplay code. Lots of little exceptions. Even if you're doing a simulation-y kind of game, there's tons of tunable parameters, [it's got a lot of interactions], it's a mess. I mean, this code--you look at the gameplay code in the game, and it's crap compared to, like, my elegant physics simulator or whatever. But this is the code that actually makes the game feel different. This is the kind of code we want to be easy to write so we can do more experimental stuff. Here is the terrifying realization about the next generation of consoles. I'm about to break about a zillion NDAs, but I didn't sign any NDAs so that's totally cool!

I'm actually a pretty good programmer and mathematician but my real talent is getting people to tell me stuff that they're not supposed to tell me. There we go. Gameplay code will get slower and harder to write on the next generation of consoles. Why is this? Here's our technical slide. Modern CPUs, like the Intel Pentium 4, blah, blah, blah, Pentium [indiscernible] or laptop, whatever is in your desktop, and all the modern PowerPCs, use what's called 'out of order' execution. Basically, out of order execution is there to make really crappy code run fast.

So, they basically--when out of order execution came out on the P6 [indiscernible] not the Pentium 5, the original Pentium, but the one after that--the Pentium Pro, I think they called it--it basically annoyed a whole bunch of low level assembly coders, because now all of a sudden, like, the crappiest-ass C code, that like, Joe junior programmer could write, is running as fast as their Assembly, and there's nothing they can do about it. Because the CPU, behind their back, is like, reordering that guy's crappy-ass C code to run really well and utilize all the parts of the processor. While this annoyed a whole bunch of people in Scandinavia, it actually…

[laughter]

And this is a great change from the bad old days of 'in order execution,' where you had to be an Assembly language wizard to actually get your CPU to do anything. You were always stalling on the cache, you needed to like--it was crazy. It was a lot of fun to write that code. It wasn't exactly the most productive way of doing experimental programming.

The Xenon and the Cell are both in order chips. What does this mean? The reason they did this is that it's cheaper for them to do this. They can drop a lot of cores--you know--one out of order core is about the size of three to four in order cores. So, they can make a lot of in order cores and drop them on a chip, and keep the power down, and sell it for cheap--what does this do to our code?

Well, it makes--it's totally fine for grinding, like, symmetric algorithms on floating point numbers, but for lots of 'if' statements and indirections, it totally sucks. How do we quantify 'totally sucks?' "Rumors," which happen to be from people who are actually working on these chips, are that straight-line gameplay code runs at 1/3 to 1/10 the speed at the same clock rate on an in order core as on an out of order core.

This means that your new fancy 2-plus gigahertz CPU in the Xenon is going to run code as slow as or slower than the 733 megahertz CPU in the Xbox 1. The PS3 will be even worse.

This sucks!

"

http://www.gamespot.com/news/2005/03/18/news_6120449.html
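
To make Hecker's engine-code/gameplay-code split concrete, here is a minimal, purely illustrative sketch (not from the talk; the types and field names are made up):

```cpp
// Illustrative only -- none of this is from the talk; the types and fields
// are made up to show the contrast between the two kinds of code.
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };

// "Engine" code: one small loop grinding over a big homogeneous array of
// floats.  No branches, trivially pipelined, happy on an in-order core.
void scale_vertices(std::vector<Vertex>& verts, float s) {
    for (std::size_t i = 0; i < verts.size(); ++i) {
        verts[i].x *= s;
        verts[i].y *= s;
        verts[i].z *= s;
    }
}

struct Actor { int state; float health, fear, distToPlayer; bool hasAmmo; };

// "Gameplay" code: lots of little exceptions and tunable parameters.  Every
// 'if' is a potential mispredict and every field touched is a potential
// cache miss -- exactly what an in-order core handles worst.
void update_actor(Actor& a) {
    if (a.health < 20.0f && a.fear > 0.5f)        a.state = 2;  // flee
    else if (a.hasAmmo && a.distToPlayer < 30.0f) a.state = 1;  // attack
    else                                          a.state = 0;  // wander
}
```

The first loop is the homogeneous, branch-free work an in-order core handles fine; the second is the branchy, flag-heavy code Hecker argues will suffer.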
 
This means that your new fancy 2-plus gigahertz CPU in the Xenon is going to run code as slow as or slower than the 733 megahertz CPU in the Xbox 1. The PS3 will be even worse.
Yeah..sure..fine..whatever!
Maybe someone should tell Chris Hecker a thing or two about in-order-almost-cacheless CPUs, last generation and the PS2 :LOL:
Will in order CPUs be slower than their out of order counterparts? Yes!
Will they run deadly slow? No!
We'll have a couple of threads per core to hide some latency, and then we'll have much more control over L2 behaviour...
Things will not be as bad as Mr. Hecker believes.
Obviously lame programmers will produce much slower code than before, well, that's their fault ;)
 
Jaws said:
I think I know now what Pana's referring to then: is it caching all the SPEs' DMA lists in the PPE's L2 cache? If so, I thought that was already possible IIRC, from one of the IBM patents...
Patents are patents, what made it into actual implementation is another thing.
Anyway, the majority of DMA requests won't need caching, so if we get this, it will still be selectable, not just applied to all memory lookups arbitrarily.

nAo said:
Obviously lame programmers will produce much slower code than before, well, that's their fault
Besides, lame code tends to be slow to begin with, OOE or no OOE.
 
Is there a reason why an in-order core cannot be used for fast-math applications and an additional Out of Order core (chip) can be used for code on the same motherboard? Figuring that each would execute its portion of code as fast as possible, there shouldn't be much latency.
 
blakjedi said:
Is there a reason why an in-order core cannot be used for fast-math applications and an additional Out of Order core (chip) can be used for code on the same motherboard? Figuring that each would execute its portion of code as fast as possible, there shouldn't be much latency.

The in-order core will only be slower if the math app doesn't lend itself as well to thread-level parallelism as to ILP.
ILP + OOO is a big advantage when the app uses a lot of memory, so the CPU has something else to do while data is fetched from main memory.
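
As a rough illustration of that point (hypothetical code, not from the post): in the loop below the loads are independent of one another, which is exactly the situation where out-of-order execution can hide memory latency by running ahead.

```cpp
// Hypothetical example: the loads of a[i] and b[i] are independent of each
// other and of earlier iterations, so an out-of-order core can keep several
// loads in flight and do useful multiplies while older ones are still waiting
// on the cache.  An in-order core stalls at the first use of a value that
// hasn't arrived yet, unless the compiler manages to schedule around it.
float dot(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];   // loads can overlap; the running sum is the only chain
    return sum;
}
```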
 
Sure, it can execute an extra dozen instructions ... and still spend 100s of cycles waiting for that fetch.
 
MfA said:
Sure, it can execute an extra dozen instructions ... and still spend 100s of cycles waiting for that fetch.

To be fair, OOOE is more about hiding L2 cache latency, which is a significant saving. It does nothing for an L2 miss; no architecture is going to help much with a 500 cycle wait on memory.

FWIW, at a high level I agree with Chris' premise: game code needs to be easier to write. Programmers should be challenged to produce a great game, not to fight the hardware.

I do think he's being overly pessimistic on performance projections though.
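
And a minimal sketch of the case ERP is describing, where nothing can hide the latency because every load depends on the previous one (hypothetical code):

```cpp
// Sketch of the case ERP describes: every load depends on the one before it,
// so there is no independent work for the hardware to reorder.  When 'next'
// misses both caches, the core -- in order or out of order -- simply waits
// out the full trip to main memory on every single hop.
struct Node { Node* next; int value; };

int sum_list(const Node* head) {
    int total = 0;
    for (const Node* p = head; p != 0; p = p->next)   // dependent load chain
        total += p->value;
    return total;
}
```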
 
ERP said:
MfA said:
Sure, it can execute an extra dozen instructions ... and still spend 100s of cycles waiting for that fetch.

To be fair, OOOE is more about hiding L2 cache latency, which is a significant saving. It does nothing for an L2 miss; no architecture is going to help much with a 500 cycle wait on memory.

Right.

L2 today is where main memory was 10 years ago, latency-wise. While the absolute latency in nanoseconds has gone down, the apparent latency, as seen from the CPU (measured in cycles), has gone up.

PPro cycle times were 5-6 ns, main memory ~200 ns away, or in other words ~40 CPU cycles.

Similarly with the PS2's R59xx and VUs: cycle times of ~3 ns and main memory ~100 ns away, or ~30 CPU cycles.

Modern CPUs have L2 latency in the 20-30 cycle range. The next process generation will increase this apparent latency.

Cheers
Gubbi
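
For reference, the arithmetic behind those figures is just absolute latency divided by cycle time; a trivial sketch using the ballpark numbers quoted in the post:

```cpp
// Back-of-the-envelope version of the numbers above:
//   apparent latency (cycles) = absolute latency (ns) / cycle time (ns)
double apparent_cycles(double latency_ns, double cycle_time_ns) {
    return latency_ns / cycle_time_ns;
}
// PPro era:   apparent_cycles(200, 5) ~= 40 cycles to main memory
// PS2 era:    apparent_cycles(100, 3) ~= 33 cycles to main memory
// Modern L2:  20-30 cycles -- roughly where main memory sat ten years earlier
```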
 
It doesn't matter; on a superscalar processor with a moderate issue width I have never seen any simulations suggesting you could get even close to a 10 times performance difference. A factor of 3 would be near the top, not the bottom ... something like 50% is an optimistic average improvement for out of order execution.

In the case of the PPE the difference would be even smaller if there were 2 threads (with SMT the advantages of out of order execution are decimated as long as you have enough threads). Whether they could have pushed the clock as aggressively as they have with in order execution is also a big question mark.

I'm not saying the performance ratios he quoted are not true, but blaming them on in order execution is jumping to unfounded conclusions IMO.
 
Mfa said:
Whether they could have pushed the clock as aggressively as they have with in order execution is also a big question mark.
Which begs the question: assuming the same thermal characteristics for two different chips, would an OOE design be faster at all, or would it fall behind due to lower clock speed?
 
Fafalada said:
Mfa said:
Whether they could have pushed the clock as aggressively as they have with in order execution is also a big question mark.
Which begs the question: assuming the same thermal characteristics for two different chips, would an OOE design be faster at all, or would it fall behind due to lower clock speed?

Faster as in higher clock or faster as in higher real world performance?

The first is impossible, the latter reasonably possible, IMO.

If you design your OOOE core to only handle L2 latencies, you can make a reasonably small ROB.

Look at the PPro/P2/P3. The OOOE enabling machinery in there takes up approximately 10-15% of the total die area. The units are: the reorder buffer (ROB), the register alias table (RAT) and the reservation stations (RS).

For an SPU-type CPU you only need 3 issue ports: the ALU/FPU, the permute unit and the load/store unit. You'd want to handle more instructions in flight though, 60 for tolerating 30 dual-issue cycles of latency.

And they can be made fast too, look at the P4.

As for Chris's performance predictions: he's completely off his rocker. MfA predicting a 50% gain from OOOE is reasonable I think (which is still a lot of performance).

Cheers
Gubbi
 
Gubbi said:
For an SPU-type CPU you only need 3 issue ports: the ALU/FPU, the permute unit and the load/store unit. You'd want to handle more instructions in flight though, 60 for tolerating 30 dual-issue cycles of latency.
Well, the question here is what would be the use of having OOE on SPUs though. The 6 cycles of local latency isn't exactly a lot - and random external fetches are supposed to be really rare.
Although - for DMA lookups that are L2 cached, we should be looking at sub-100-cycle latencies, so OOE could potentially make the SPE into a much stronger general purpose performer.

Still, I would think it's the PPU that needs OOE more.
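
For context, the way an SPE is meant to cope without OOOE (or a cache) is to hide latency explicitly, e.g. with double-buffered DMA. A minimal sketch, assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the chunk size and the process() function are made up:

```cpp
// Minimal double-buffered DMA sketch for an SPU, assuming the Cell SDK's
// spu_mfcio.h intrinsics.  Instead of hoping OOOE hides memory latency, the
// SPU explicitly fetches chunk i+1 while it works on chunk i.
#include <spu_mfcio.h>

enum { CHUNK = 4096 };  // bytes per DMA transfer (multiple of 16, max 16 KB)

static char buf[2][CHUNK] __attribute__((aligned(128)));

// Placeholder for whatever per-chunk work we do (made up).
static void process(char* /*data*/, unsigned /*size*/) {}

void stream(unsigned long long ea, unsigned chunks) {
    if (chunks == 0) return;
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);                  // kick off the first fetch
    for (unsigned i = 0; i < chunks; ++i) {
        int nxt = cur ^ 1;
        if (i + 1 < chunks)                                   // start fetching chunk i+1...
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                         // ...then wait for chunk i
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);                             // compute overlaps the DMA
        cur = nxt;
    }
}
```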
 
Fafalada said:
Gubbi said:
For an SPU-type CPU you only need 3 issue ports: the ALU/FPU, the permute unit and the load/store unit. You'd want to handle more instructions in flight though, 60 for tolerating 30 dual-issue cycles of latency.
Well, the question here is what would be the use of having OOE on SPUs though. The 6 cycles of local latency isn't exactly a lot - and random external fetches are supposed to be really rare.
Although - for DMA lookups that are L2 cached, we should be looking at sub-100-cycle latencies, so OOE could potentially make the SPE into a much stronger general purpose performer.

Still, I would think it's the PPU that needs OOE more.

Well, I should have been more precise. I meant a real CPU but with the size/execution capabilities of an SPU (i.e. trying to keep it the same size).

It's absolutely true that OOOE makes little sense for the SPUs. Kernels that run well on SPUs are likely to be well behaved (i.e. control flow is easily predicted). If your control flow is erratic enough to cause problems with instruction scheduling, you're likely to have much bigger problems (like the 20 cycle branch bubble).

Cheers
Gubbi
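
A generic illustration of the usual workaround for that branch bubble, "compute both sides and select" (plain scalar C++ with made-up function names; on an SPU this would be the vector select intrinsic):

```cpp
// Branchy version: each mispredicted 'if' costs a full pipeline flush
// (the ~20 cycle bubble mentioned above on an SPU-like pipeline).
int damage_branchy(int base, int critical) {
    if (critical)
        return base * 2;
    return base;
}

// Branch-free version: compute both results, build an all-ones/all-zeros
// mask from the condition, and blend -- no control flow at all.  This is
// what a SIMD/SPU select instruction does across a whole vector.
int damage_select(int base, int critical) {
    int mask = -(critical != 0);          // all bits set if critical, else 0
    int doubled = base * 2;
    return (doubled & mask) | (base & ~mask);
}
```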
 
MfA said:
The opportunity cost isn't just area, it's also pipeline depth.

But not by that much.

Page 5 of this shows that an SPU spends 7 cycles prior to executing an instruction. So even if you had to spend 3-4 cycles on scheduling, you're not increasing the total pipeline length (and hence branch penalty) by that much (25-45%).

Edit: BTW, the above paper also discloses that they have a C compiler for SPUs with limited C++ support, and debuggers. It also says that the programming model is not pre-determined (i.e. no tools), which I find discouraging (understatement!!)


Cheers
Gubbi
 
SCEA has put up the slides from GDC, but most have already been seen via PC Watch.


Anyway,

http://www.research.scea.com/research/html/CellGDC05/index.html

[Image: slide 55 from the SCEA Cell GDC 2005 deck linked above]
 
Personally, I find the contention that the world would have been better off if the PPE had been an out of order core to be silly.
IBM had perfectly good PPC OOO cores on the shelf. They could have utilized that IP. It sure is small enough, the 970FX is 60 mm² at 90 nm including 512 KB cache and all IO. They didn't. So apparently their analysis pointed towards the PPE core as an overall win in this context. And as far as I can see it should be as well.

Am I the only one who reads the quote above, and finds it a bit pathetic?
There are GOOD reasons to go with the PPE. Reminiscing about the good old OOO P6 days just shows lack of perspective both backwards and going forward. Complaining that "I can't write code as sloppy as I used to, because the old hardware used to fix some of those problems and the new one doesn't" just implies that you were never able to write good code in the first place.
You damn well deserve to be left behind.
 
I'm on a limited quota bandwidth service :(. Can someone provide a summary of any important points, especially code examples?
 