PS3's Cell implementation is gimped?

Butta

I've been reading a thread on PS3forums where a poster with significant exposure to the PS3's Cell claims that it is gimped due to the disabled SPU and the hypervisor. Here are a few quotes. (Any thoughts?)

Link: http://ps3forums.com/showthread.php?t=22858&page=32

I'm mostly on this forum because I'm interested in Cell development, which I write about professionally. I do a lot of programming across a lot of architectures, and I've been published (like, "paying the mortgage" kind of money involved, not just posting to a blog) writing about both Cell and Xenon.

I'm making a technical claim, which is that the PS3's Cell is gimped in ways that make it very hard to approach its theoretical performance.

As I've said, repeatedly, even what you can get to on the gimped Cell runs rings around Xenon.

Thus, no amount of showing how Cell is outperforming Xenon contradicts my point -- which is that the PS3's Cell will not perform as well as the 8-SPE versions being used in supercomputing apps. It can't -- and it's not just that one SPE is disabled and one is taken over by the hypervisor; the unpredictability means you have to stay away from the top part of the performance envelope, or your game will have unpredictable performance problems on some people's PS3s but not on others.

Which is fine; the performance you can get to is unequivocally better than anything I expect to see out of the 360 any time soon.

I think the problem here is that I made a claim purely about the comparative merits of different Cell variants, and also comparing their ease of use to EE, and you're assuming that this somehow ought to be reflected by Xenon being faster than Cell or something. But that's silly. Xenon was rushed to market to try to get the first-to-launch advantage over the PS3. It's not otherwise particularly technically impressive.

But the thing you're trying to rebut isn't something I said; it's also not something I implied, suggested, hinted at, or intended. It's just something you assumed I'd say, because you've got me pigeon-holed as some kind of weird PS3-hater who doesn't believe the PS3 is powerful. Which is dumb, given that I've paid the bills on a number of occasions specifically by writing about how powerful Cell is and how to get more performance out of it!

A rough ballpark estimate: Start with a full-featured 8-SPE Cell, with support for SPE affinity. Now, take off one SPE. You've lost about 1/9 of your power; maybe a bit over, maybe a bit under... But you've also lost support for affinity, which kills your ability to push the system to the edge. Now, you have to design everything so that it doesn't rely on saturating EIB, and doesn't make any assumptions about being able to assign an adjacent pair of SPEs to a particular task, or about whether a given SPE is close to the PPE -- which matters, because the rings in EIB can do up to three non-overlapping transactions at once.

Now add in the hypervisor. You lose another SPE -- you've now lost a quarter of your heavy-duty vector processing. Furthermore, any and all attempts at affinity are totally shot. Worse, you cannot anticipate or plan for the hypervisor's bandwidth usage.

What that means is that you have to leave even MORE headroom, or have random, unpredictable losses of performance that are totally outside your code's control. A new system update could kill your performance by eating into EIB, up to as much bandwidth as the hypervisor can use... And the SPEs are pretty powerful, and can use a lot of bandwidth.

Putting it in terms of your notion of estimating cars, I'd expect about a 20% loss over what a developer could do on an RSX+Cell system without the hypervisor and without the missing SPE. Some of the damage is done by the hypervisor, some by the unpredictable missing SPE.

The thing is, if you really want to get the best performance out of Cell, you need affinity. You need to be able to allocate adjacent SPEs, and you need to know what they're adjacent to when allocating your workload. Without that, you are going to be well short of the theoretical capacity of the machine.

What that means is that, not only are you missing two SPEs (remember, that's 1/4 the total gruntwork power of the machine), but that you're actually a lot worse off than you would be on a hypervisor-free machine that simply had exactly six SPEs to begin with, and had them in a predictable topology.

So, while it's 25% off just the SPEs, not off the PPE, it's also a substantial hit to the optimization techniques you need if you're going to take full advantage of EIB. EIB's an absolutely gob-smacking technical achievement, and Sony crippled it. I know they had business reasons, but it's still a crying shame.
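For concreteness, this is roughly what SPE affinity looks like through libspe2 on the platforms that support it (the SDK documents that it isn't available on the PS3). A minimal sketch, with error handling mostly trimmed; exact flags vary by SDK version:

```c
#include <stdio.h>
#include <libspe2.h>

int main(void)
{
    /* A gang groups contexts so the OS can try to place them together. */
    spe_gang_context_ptr_t gang = spe_gang_context_create(0);
    if (!gang) { perror("spe_gang_context_create"); return 1; }

    /* SPE_AFFINITY_MEMORY asks for an SPE close to the memory
     * interface controller (MIC). */
    spe_context_ptr_t first =
        spe_context_create_affinity(SPE_AFFINITY_MEMORY, NULL, gang);

    /* Second context: request the SPE physically adjacent to `first`,
     * so the pair can stream over a short stretch of EIB. */
    spe_context_ptr_t second =
        spe_context_create_affinity(0, first, gang);

    if (!first || !second) {
        fprintf(stderr, "affinity placement not available here\n");
        return 1;
    }

    /* ... spe_program_load() and spe_context_run() as usual ... */

    spe_context_destroy(second);
    spe_context_destroy(first);
    spe_gang_context_destroy(gang);
    return 0;
}
```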
 
Ok, so they made sacrifices in order to make the chip easier to manufacture and to assist the Game OS+HV, but these sacrifices mean that the final game console chip would not be as fast/efficient as the 8-SPE variant used in CELL blades for HPC workloads?

I am shocked...
 
I think that what you mention is a given... but what seems more concerning to me is the mention that the PS3 has an unpredictable performance ceiling and even a simple OS update could throw performance off. Also, the fact that he mentions that performance may not be equal on all PS3s seems a little weird (if not unbelievable) to me.
 
Both the 360 and PS3 have some sort of OS/kernel overhead on their CPUs. E.g., we already know about the extra memory reserved by the PS3 OS. So I don't think this is anything new.

As for the "unpredictable" missing SPU (I suppose he means the portion of SPU time used by the hypervisor and the Game OS from time to time), the developers can optimize their code on the remaining SPUs... leaving some slack for the OS. This is no different from other CPUs like Xenon that have to give time to other tasks, or contend for resources (hence operating at a lower level too). The unpredictability is just a side effect of concurrency, and is common knowledge.

All in all, the Cell still has more bandwidth and high-performing cores (SPUs) to deal out damage.
 
Mod Note: I changed the thread title to be less ZOMG :runaway:, to something that can possibly generate some sort of meaningful discussion. Some sort.
 
Quick summary of the "unpredictable" part. Draw a map of the Cell:

http://www.ibm.com/developerworks/power/library/pa-fpfeib/

PPE <-> SPE0 <-> SPE1 <-> SPE2 <-> SPE3 <-> I/O
MIC <-> SPE5 <-> SPE6 <-> SPE7 <-> SPE8 <-> BIO

(There are also vertical arrows between PPE/MIC and IO/BIO.)

There are four "rings" for data transfer; two clockwise, two counterclockwise. Each ring is independent, and each ring can move up to three transactions AT A TIME... But only if they don't overlap.

So, for instance, you could have data simultaneously moving:

PPE->SPE1 SPE3->BIO SPE8->SPE6

all three on a single one of the four rings. (Each ring can do about 100GB/sec.)
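The non-overlap rule is easy to model: treat one ring as twelve positions, a transfer as the arc of hops from source to destination in the ring's direction, and allow transfers to share the ring only if their arcs are disjoint. A toy sketch -- the position numbering is illustrative, not taken from IBM's documentation:

```c
#include <stdio.h>
#include <stdbool.h>

#define RING_NODES 12

/* Does a transfer from s to d (moving "clockwise") occupy hop h? */
static bool occupies(int s, int d, int h)
{
    for (int p = s; p != d; p = (p + 1) % RING_NODES)
        if (p == h)
            return true;
    return false;
}

/* Can n transfers share one ring at the same time? */
static bool compatible(const int src[], const int dst[], int n)
{
    for (int h = 0; h < RING_NODES; h++) {
        int users = 0;
        for (int i = 0; i < n; i++)
            if (occupies(src[i], dst[i], h))
                users++;
        if (users > 1)
            return false;  /* two arcs overlap at hop h */
    }
    return true;
}

int main(void)
{
    /* One plausible clockwise ordering: PPE=0, SPE0..3=1..4, I/O=5,
     * BIO=6, SPE8=7, SPE7=8, SPE6=9, SPE5=10, MIC=11.
     * The example above: PPE->SPE1, SPE3->BIO, SPE8->SPE6. */
    int src[] = { 0, 4, 7 };
    int dst[] = { 2, 6, 9 };
    printf("fits on one ring: %s\n",
           compatible(src, dst, 3) ? "yes" : "no");
    return 0;
}
```

Run it and the three transfers above come back as compatible; shift any source or destination a few positions and they stop fitting on one ring.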

Now, here's why performance might differ between two systems: It is not consistent which SPE is disabled. Since the disabled SPE is a workaround for manufacturing flaws, a Cell where SPE2 failed validation might look different from one where SPE6 failed validation:

PPE <-> SPE0 <-> SPE1 <-> SPE2 <-> SPE3 <-> I/O
MIC <-> SPE5 <-> XXXX <-> SPE7 <-> SPE8 <-> BIO

PPE <-> SPE0 <-> SPE1 <-> XXXX <-> SPE3 <-> I/O
MIC <-> SPE5 <-> SPE6 <-> SPE7 <-> SPE8 <-> BIO

Where does the hypervisor go? What happens when you try to get a pair of "adjacent" SPEs to work on a task, to reduce their effective impact on EIB?

You can't just say "I'll always put the hypervisor on SPE8" -- SPE8 may be the one that's disabled. The hypervisor's transfers can compete for EIB bandwidth with anyone else's, and there's no guarantee that there are three adjacent SPEs available that are neither disabled nor running the hypervisor. (You could make such a guarantee -- but you'd reduce the number of adjacent two-SPE pairs...)

The net result: my game might run great on the machine where SPE2 is disabled, because my streaming 3-SPE workload ends up grabbing data from MIC, passing it through SPEs 5, 6, and 7, and dumping it through the BIO port to RSX. The same game may run like crap on the machine where SPE6 is disabled, because if the hypervisor is down on the bottom row too, one of my SPEs has to be on the top row. The data then gets shoved around most of the ring instead of staying in one place, and my dedicated streaming is suddenly competing for bandwidth with, say, the physics engine running on another SPE -- potential bandwidth starvation. (I think RSX is off BIO; I could be wrong, though; in any event, it's somewhere, and the same argument applies regardless.)
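A toy way to see the "no guaranteed adjacent pair" problem: knock the failed SPE and the hypervisor's SPE out of one row and count the longest run of adjacent usable SPEs left. Row labels follow the simplified diagram above; which SPEs get knocked out is purely illustrative.

```c
#include <stdio.h>

#define ROW 4  /* SPEs per row in the simplified diagram */

/* Longest run of consecutive usable SPEs in one row. */
static int longest_free_run(const int usable[], int n)
{
    int best = 0, run = 0;
    for (int i = 0; i < n; i++) {
        run = usable[i] ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}

int main(void)
{
    /* Bottom row SPE5..SPE8 (labels as in the diagram above).
     * Case A: SPE6 failed validation, hypervisor sits on SPE8. */
    int caseA[ROW] = { 1, 0, 1, 0 };
    /* Case B: SPE2 failed (top row, so this row untouched), HV on SPE8. */
    int caseB[ROW] = { 1, 1, 1, 0 };

    printf("case A: longest adjacent run = %d\n", longest_free_run(caseA, ROW));
    printf("case B: longest adjacent run = %d\n", longest_free_run(caseB, ROW));
    return 0;
}
```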
 
I think what he is saying is that since you cannot predict WHICH SPU is disabled, you cannot fully optimize, or rely on the performance numbers being the same on every PS3's Cell.

I think the key point is that the latency of memory transfers depends on which SPU is requesting the data, because of the architecture of the EIB ring. E.g., not all SPUs have a direct connection to the memory controller; some need more 'hops' than others to get at their data. Because of this you cannot predict which SPU will be SPU #1 or SPU #6. It might be 1 hop away, it might be more. If this is true then you cannot predict the bandwidth saturation (because you don't know the begin and end points) or the latency.
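Since the rings run in both directions, the hop count between two positions is just the shorter way around. A tiny sketch, reusing the illustrative position numbering from the ring model earlier in the thread:

```c
#include <stdio.h>

#define RING_NODES 12

/* Shorter of the clockwise and counterclockwise distances. */
static int hops(int a, int b)
{
    int cw  = (b - a + RING_NODES) % RING_NODES;
    int ccw = (a - b + RING_NODES) % RING_NODES;
    return cw < ccw ? cw : ccw;
}

int main(void)
{
    int mic = 11;  /* MIC position in the illustrative numbering */
    printf("adjacent SPE -> MIC: %d hop(s)\n", hops(10, mic));
    printf("distant  SPE -> MIC: %d hop(s)\n", hops(6, mic));
    return 0;
}
```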

MoH
 
Should be easy to test this - PS3 Linux runs on PPE+6xSPEs, so just benchmark some heavy code on that compared to the IBM platform.
Most workloads tend to be data parallel, so all SPEs are fetching data from XDR rather than communicating between each other. However, even if you had heavy communication, the EIB is very unlikely to be fully saturated with SPE -> SPE traffic. (And if you did have issues, it's not difficult to benchmark and map a topology dynamically :) )
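On the "benchmark and map a topology dynamically" point, a hedged sketch of the mapping half: given pairwise SPE-to-SPE transfer timings you've already measured (the numbers below are invented for illustration), pick each SPE's cheapest partner. On real hardware you'd fill the matrix by timing repeated local-store-to-local-store DMAs between each pair.

```c
#include <stdio.h>

#define NSPE 6

int main(void)
{
    /* cost[i][j]: measured mean transfer time (ns) from SPE i to j.
     * These values are made up purely for illustration. */
    double cost[NSPE][NSPE] = {
        {   0, 210, 260, 320, 330, 280 },
        { 210,   0, 205, 270, 340, 335 },
        { 260, 205,   0, 215, 290, 345 },
        { 320, 270, 215,   0, 220, 295 },
        { 330, 340, 290, 220,   0, 225 },
        { 280, 335, 345, 295, 225,   0 },
    };

    /* For each SPE, report the partner with the cheapest transfers;
     * pairing cheap partners approximates physical adjacency. */
    for (int i = 0; i < NSPE; i++) {
        int best = -1;
        for (int j = 0; j < NSPE; j++) {
            if (j == i) continue;
            if (best < 0 || cost[i][j] < cost[i][best]) best = j;
        }
        printf("SPE %d: cheapest partner is SPE %d (%.0f ns)\n",
               i, best, cost[i][best]);
    }
    return 0;
}
```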
 
Someone brought up this point before. How bad is the performance hit due to the possible extra hops? E.g., since DMA is async, can't the developer take the worst-case latency number to make it predictable (where it matters)?
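The standard answer is double buffering, sketched here for the SPU side (built with spu-gcc; process_block() is a placeholder): kick off the next DMA before waiting on the current one, and if the prefetch distance covers the worst-case latency, the variance stops mattering.

```c
#include <spu_mfcio.h>

#define BLOCK 16384  /* max single MFC DMA is 16KB */
static char buf[2][BLOCK] __attribute__((aligned(128)));

/* Placeholder for the real per-block work. */
static void process_block(char *data, int n) { (void)data; (void)n; }

static void stream(unsigned long long ea, int nblocks)
{
    int cur = 0;
    /* Kick off the first transfer with tag 0. */
    mfc_get(buf[0], ea, BLOCK, 0, 0, 0);

    for (int i = 0; i < nblocks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nblocks)  /* start the next DMA before waiting */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * BLOCK,
                    BLOCK, nxt, 0, 0);

        /* Wait only for the buffer we are about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_block(buf[cur], BLOCK);
        cur = nxt;
    }
}

int main(unsigned long long spe, unsigned long long argp,
         unsigned long long envp)
{
    (void)spe; (void)envp;
    stream(argp, 64);  /* PPE passes the source effective address in argp */
    return 0;
}
```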
 
Should be easy to test this - PS3 Linux runs on PPE+6xSPEs, so just benchmark some heavy code on that compared to the IBM platform.
Most workloads tend to be data parallel, so all SPEs are fetching data from XDR rather than communicating between each other. However, even if you had heavy communication, the EIB is very unlikely to be fully saturated with SPE -> SPE traffic. (And if you did have issues, it's not difficult to benchmark and map a topology dynamically :) )

It's not really that the CELL in the PS3 is less powerful than the CELL of the IBM workstations. (I think that's obvious.) The only disturbing thing is that the performance might be different across different PS3 machines.

MoH
 
If it's a known thing, the developers will just take the worst-case number to optimize against. In cases where it doesn't matter, the devs can adopt a more aggressive approach.
 
Should be easy to test this - PS3 Linux runs on PPE+6xSPEs, so just benchmark some heavy code on that compared to the IBM platform.
Most workloads tend to be data parallel, so all SPEs are fetching data from XDR rather than communicating between each other. However, even if you had heavy communication, the EIB is very unlikely to be fully saturated with SPE -> SPE traffic. (And if you did have issues, it's not difficult to benchmark and map a topology dynamically :) )

AFAICR, it was done (although this is a single example and covers DP-FP-heavy processing):

http://www.cs.berkeley.edu/~samw/research/papers/sc07.pdf

It shows benchmarks run on the blade with 1 CELL BE (8 SPEs) vs the PS3 (6 SPEs).
 
Those numbers are pretty close to what I was guessing, I'd say. The one wildcard is that the hypervisor doesn't just remove one core -- it makes that core do stuff, which may or may not soak additional EIB bandwidth.

There are some excellent tech demos floating around in the SDK and in some of the published papers showing ways to take advantage of SPE affinity to improve the effective available EIB bandwidth (and reduce latency), and those are all difficult at best on the PS3.

It's not horrible, but there's a pretty noticeable gap when comparing "Cell" performance numbers to what actual code on a PS3 can do.
 
Those numbers are pretty close to what I was guessing, I'd say. The one wildcard is that the hypervisor doesn't just remove one core -- it makes that core do stuff, which may or may not soak additional EIB bandwidth.

Which makes those tests still relevant, as the HV is running and probably still giving some non-trivial work to that SPE.
 
I suspect the HV does a lot less work during a purely computational load under Linux than it does, say, in a game that may be streaming data from Blu-ray.

BTW, yes, I did make an obvious error in the diagram above; it's SPE 4-7, not 5-8, on the bottom. Sorry!
 
Yes, with another entity running on the same bus, the performance may be lower due to contention... just like any other concurrent system.

But for the unpredictable part... is there any reason why one cannot assume a worst-case framework? I assume the hypervisor will only kick in based on the application's or user's request? If required, can't the developers detect the SPU # on the fly and organize accordingly (only for critical work)?
 
Yes, with another entity running on the same bus, the performance may be lower due to contention... just like any other concurrent system.

Yes.

But for the unpredictable part... is there any reason why one cannot assume a worst-case framework? I assume the hypervisor will only kick in based on the application's or user's request? If required, can't the developers detect the SPU # on the fly and organize accordingly (only for critical work)?

I don't know that you're given access to let you detect the SPU# or organize on the fly; it looks like you can't in Linux, although I haven't explored very deeply. The Cell SDK simply documents that affinity isn't available on the PS3 platform.

You can, to some extent, develop for the worst-case scenario -- but that could be noticeably worse than a reasonably average case, which is why I think it's a bad thing.

There are two separate questions here:
1. Impact of losing two SPEs. Fairly predictable.
2. Impact of having it not always be the same two. Harder to guess.

I think that, if the PS3 ALWAYS shipped with, say, SPE3 disabled, and SPE7 running the hypervisor, you would see slightly better performance than you will on systems in the wild now.
 
Given that you know which SPE has been disabled in terms of hardware, can't you balance things out again, at least to some extent, by cleverly choosing which of the SPEs is going to run the hypervisor?

Other than that, I must say I'm not convinced that the performance gain from a theoretical situation where you have full mastery of the full Cell, vs. just having 1 PPE and 6 SPEs available, is significant compared to the benefits of having the hypervisor in the first place and, of course, being able to produce the chips at a lower cost (higher yields).

You should be able to figure this out in more detail, though, simply by running some tests on several different Linux PS3s. The test linked above isn't as useful, because it compares one PS3 to one Blade Cell chip, whereas what you/we are looking for in this case is performance differences between different PS3s.
 
You can, to some extent, develop for the worst-case scenario -- but that could be noticeably worse than a reasonably average case, which is why I think it's a bad thing.

Yes, it means you may have to use other mechanisms to achieve speedups.

EDIT: e.g., I remember DeanoC has some EIB cache hack that shares a small amount of info across the SPUs, but my memory is vague now. Of course this is not a general solution.

I don't know that you're given access to let you detect the SPU# or organize on the fly; it looks like you can't in Linux, although I haven't explored very deeply. The Cell SDK simply documents that affinity isn't available on the PS3 platform.

I think that, if the PS3 ALWAYS shipped with, say, SPE3 disabled, and SPE7 running the hypervisor, you would see slightly better performance than you will on systems in the wild now.

I thought the PPU could set up the entire SPU environment at will? They may not need to fix the SPU #... just their relative positions within the pictured framework above.
 
Given that you know which SPE has been disabled in terms of hardware, can't you balance things out again, at least to some extent, by cleverly choosing which of the SPEs is going to run the hypervisor?

Maybe. You could present people with a consistent "3 up, 3 down" view of the Cell, but with some unpredictable (to them) bus contention.

Other than that, I must say I'm not convinced that the performance gain from a theoretical situation where you have full mastery of the full Cell, vs. just having 1 PPE and 6 SPEs available, is significant compared to the benefits of having the hypervisor in the first place and, of course, being able to produce the chips at a lower cost (higher yields).

Well, the hypervisor is a tradeoff between politics and performance. The hypervisor's there to keep you from stealing stuff. If it weren't all about control, it could just be an OS bit that didn't need to be a hypervisor.

I agree that there's also a price/performance tradeoff with the disabled SPE, and it may even be a necessary one -- but it does mean that you really can't assume that "Cell" benchmarks tell you what "a PS3" can do.

You should be able to figure this out in more detail, though, simply by running some tests on several different Linux PS3s. The test linked above isn't as useful, because it compares one PS3 to one Blade Cell chip, whereas what you/we are looking for in this case is performance differences between different PS3s.

There are three interesting cases:

1. 6 SPEs, but you control affinity. If you had a blade server, you could compare 6-SPE and 8-SPE configurations, effectively, by targeting specific SPEs, then simply ignoring two of them.
2. 6 SPEs, no control over affinity.
3. A different 6 SPEs, no control over affinity.

My guess is that there would be noticeable but minor variance in performance between cases 2 and 3 -- but it would be hard to reproduce from one test to another. I'd guess that they'd both be marginally worse than case #1 when a workload was built for it.

Consoles, though, are in a way hard realtime; it is perfectly fine to have every frame ready 0.2ms before you need it, but dropping frames gets you dinged points in reviews, so if performance is even SLIGHTLY unpredictable, you have to leave larger margins.
 