PS3's Cell implementation is gimped?

Discussion in 'Console Technology' started by Butta, Nov 29, 2007.

  1. Butta

    Regular

    Joined:
    Jan 18, 2007
    Messages:
    361
    Likes Received:
    2
    I've been reading a thread on PS3forums where a poster with significant exposure to the PS3's Cell is claiming that it is gimped due to the disabled SPU and the Hypervisor. Here are a few quotes: (Any thoughts?)

    Link: http://ps3forums.com/showthread.php?t=22858&page=32

     
    #1 Butta, Nov 29, 2007
    Last edited by a moderator: Nov 29, 2007
  2. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Ok, so they made sacrifices in order to make the chip easier to manufacture and to assist the Game OS+HV, but these sacrifices mean that the final game console chip would not be as fast/efficient as the CELL blades' 8-SPE variant used for HPC workloads ?

    I am shocked...
     
  3. Butta

    Regular

    Joined:
    Jan 18, 2007
    Messages:
    361
    Likes Received:
    2
    I think that what you mention is a given... but what seems more concerning to me is the claim that the PS3 has an unpredictable performance ceiling, and that even a simple OS update could throw performance off. Also, the claim that performance may not be equal on all PS3s seems a little weird (if not unbelievable) to me.
     
  4. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Both the 360 and PS3 have some sort of OS/kernel overhead on their CPUs; e.g., we already know about the extra memory reserved by the PS3 OS. So I don't think this is anything new.

    As for the "unpredictable" missing SPU (I suppose he meant the portion of an SPU used by the hypervisor and the Game OS from time to time), the developers can optimize their code on the remaining SPUs... leaving some slack for the OS. This is no different from other CPUs like Xenon, which have to give time to other tasks or contend for resources (and hence also operate at a lower effective level). The unpredictability is just a side effect of concurrency, and is common knowledge.

    All in all, the Cell still has more bandwidth and high-performing cores (SPUs) to deal out damage.
     
    #4 patsu, Nov 29, 2007
    Last edited by a moderator: Nov 29, 2007
  5. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    Mod Note: I changed the thread title to be less ZOMG :runaway: to something that can possibly generate some sort of meaningful discussion. Some sort.
     
    Goodtwin likes this.
  6. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    Quick summary of the "unpredictable" part. Draw a map of the Cell:

    http://www.ibm.com/developerworks/power/library/pa-fpfeib/

    PPE <-> SPE0 <-> SPE1 <-> SPE2 <-> SPE3 <-> I/O
    MIC <-> SPE5 <-> SPE6 <-> SPE7 <-> SPE8 <-> BIO

    (there are also vertical arrows between PPE and MIC, and between I/O and BIO).

    There are four "rings" for data transfer; two clockwise, two counterclockwise. Each ring is independent, and each ring can move up to three transactions AT A TIME... But only if they don't overlap.

    So, for instance, you could have data simultaneously moving:

    PPE->SPE1 SPE3->BIO SPE8->SPE6

    all three on a single one of the four rings. (Each ring can do about 100GB/sec.)

    Now, here's why performance might differ between two systems: It is not consistent which SPE is disabled. Since the disabled SPE is a workaround for manufacturing flaws, a Cell where SPE2 failed validation might look different from one where SPE6 failed validation:

    PPE <-> SPE0 <-> SPE1 <-> SPE2 <-> SPE3 <-> I/O
    MIC <-> SPE5 <-> XXXX <-> SPE7 <-> SPE8 <-> BIO

    PPE <-> SPE0 <-> SPE1 <-> XXXX <-> SPE3 <-> I/O
    MIC <-> SPE5 <-> SPE6 <-> SPE7 <-> SPE8 <-> BIO

    Where does the hypervisor go? What happens when you try to get a pair of "adjacent" SPEs to work on a task, to reduce their effective impact on EIB?

    You can't just say "I'll always put the hypervisor on SPE8" -- SPE8 may be the one that's disabled. The hypervisor's transfers can compete for EIB bandwidth with anyone else's, and there's no guarantee that there are three adjacent SPEs available that are neither the disabled SPE nor the hypervisor's. (You could make such a guarantee -- but you'd reduce the number of two-SPE pairs...)

    The net result: my game might run great on the machine where SPE2 is disabled, because my streaming 3-SPE workload ends up grabbing data from MIC, passing it through SPEs 5, 6, and 7, and dumping it through the BIO port to RSX. The same game may run like crap on the machine where SPE6 is disabled: if the hypervisor is down on the bottom row too, one of my SPEs has to be on the top, so the data gets shoved around most of the ring instead of staying down in one place. That introduces potential bandwidth starvation, because my dedicated streaming is suddenly competing for bandwidth with the physics engine running on another SPE! (I think RSX is off BIO; I could be wrong, though; in any event, it's somewhere, and the same argument applies regardless.)
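
    To put rough numbers on the "hops" part, here's a tiny, purely illustrative C model of the ring as sketched above (the node order and the overlap rule are my assumptions taken from the diagram, not the real EIB arbiter):

    /* Toy model of the ring sketched above: 12 nodes, shortest-path hops,
     * and a check for whether two same-direction transfers share segments. */
    #include <stdio.h>

    #define RING_NODES 12

    static const char *node_name[RING_NODES] = {
        "PPE", "SPE0", "SPE1", "SPE2", "SPE3", "I/O",   /* top row, left to right    */
        "BIO", "SPE8", "SPE7", "SPE6", "SPE5", "MIC"    /* bottom row, right to left */
    };

    /* Hops travelled going clockwise from src to dst. */
    static int hops_cw(int src, int dst)
    {
        return (dst - src + RING_NODES) % RING_NODES;
    }

    /* Shortest distance in either direction. */
    static int hops_min(int src, int dst)
    {
        int cw = hops_cw(src, dst);
        return cw < RING_NODES - cw ? cw : RING_NODES - cw;
    }

    /* Two clockwise transfers can share a ring concurrently only if the arcs
     * they occupy don't overlap. */
    static int arcs_overlap_cw(int src_a, int dst_a, int src_b, int dst_b)
    {
        return hops_cw(src_a, src_b) < hops_cw(src_a, dst_a)
            || hops_cw(src_b, src_a) < hops_cw(src_b, dst_b);
    }

    int main(void)
    {
        int mic = 11, spe5 = 10, spe1 = 2, ppe = 0, spe2 = 3;

        /* Streaming from MIC: a neighbouring SPE vs. one across the ring. */
        printf("MIC -> %s: %d hops\n", node_name[spe5], hops_min(mic, spe5));
        printf("MIC -> %s: %d hops\n", node_name[spe1], hops_min(mic, spe1));

        /* The long route is also more likely to collide with other traffic. */
        printf("MIC->SPE1 overlaps PPE->SPE2 (same direction)? %s\n",
               arcs_overlap_cw(mic, spe1, ppe, spe2) ? "yes" : "no");
        return 0;
    }

    The exact numbers don't matter; the point is that a transfer which could have stayed in one corner of the ring now crosses most of it and starts sharing segments with everyone else's traffic.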
     
  7. MoHonRi

    Newcomer

    Joined:
    May 23, 2005
    Messages:
    64
    Likes Received:
    0
    I think what he is saying is that since you cannot predict WHICH SPU is disabled, you cannot fully optimize, or rely on performance numbers being the same on every PS3's Cell.

    I think the key point is that the latency of memory transfers depends on which SPU is requesting the data, because of the architecture of the EIB ring. E.g. not all SPUs have a direct connection to the memory controller; some have more 'hops' than others to get at their data. Because of this you cannot predict whether your SPU will end up as SPU #1 or SPU #6. It might be 1 hop away, it might be more. If this is true then you cannot predict the bandwidth saturation (because you don't know the begin and end points) or the latency.

    MoH
     
  8. Crazyace

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    333
    Likes Received:
    6
    Should be easy to test this - PS3 Linux runs on the PPE + 6 SPEs, so just benchmark some heavy code on that compared to the IBM platform.
    Most workloads tend to be data parallel, so all SPEs are fetching data from XDR rather than communicating with each other. However, even if you had heavy communication, the EIB is very unlikely to be fully saturated with SPE -> SPE traffic. (And if you did have issues, it's not difficult to benchmark and map the topology dynamically :) )
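
    A rough PPE-side harness for that kind of comparison might look like this (libspe2; "eib_stress_spu" is a made-up name for whatever DMA-heavy SPE program you want to time) - run it on a few different PS3s and on a blade and compare:

    /* Time repeated runs of the same SPE program (hypothetical binary name). */
    #include <stdio.h>
    #include <time.h>
    #include <libspe2.h>

    int main(void)
    {
        spe_program_handle_t *prog;
        spe_context_ptr_t ctx;
        struct timespec t0, t1;
        int i;

        prog = spe_image_open("eib_stress_spu");
        if (!prog) { perror("spe_image_open"); return 1; }

        ctx = spe_context_create(0, NULL);
        if (!ctx) { perror("spe_context_create"); return 1; }
        if (spe_program_load(ctx, prog)) { perror("spe_program_load"); return 1; }

        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (i = 0; i < 100; i++) {
            /* Reset the entry point each run; static data in local store is
             * not reloaded, so the SPE kernel should not depend on it. */
            unsigned int entry = SPE_DEFAULT_ENTRY;
            spe_stop_info_t stop;
            if (spe_context_run(ctx, &entry, 0, NULL, NULL, &stop) < 0) {
                perror("spe_context_run");
                return 1;
            }
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("100 runs: %.3f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

        spe_context_destroy(ctx);
        spe_image_close(prog);
        return 0;
    }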
     
  9. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Someone brought up this point before. How bad is the performance hit due to the possible extra hops ? e.g., Since DMA is async, can't the developer take the worst latency number to make it predictable (where it matters) ?
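
    e.g., with double buffering, the DMA latency -- however many hops it ends up being -- can be overlapped with computation. A rough SPU-side sketch (process(), the chunk count and the effective address passed in argp are all hypothetical):

    /* Double-buffered streaming: fetch chunk i+1 while working on chunk i,
     * so the per-transfer latency is hidden whatever the hop count is. */
    #include <spu_mfcio.h>

    #define CHUNK 16384   /* 16 KB per DMA, the MFC maximum */

    static volatile unsigned char buf[2][CHUNK] __attribute__((aligned(128)));

    /* Hypothetical per-chunk work; a real kernel would go here. */
    static void process(volatile unsigned char *data, unsigned int size)
    {
        (void)data; (void)size;
    }

    static void stream(unsigned long long ea_base, unsigned int num_chunks)
    {
        unsigned int cur = 0, next, i;

        if (num_chunks == 0)
            return;

        /* Kick off the first transfer with tag 0. */
        mfc_get(buf[cur], ea_base, CHUNK, cur, 0, 0);

        for (i = 0; i < num_chunks; i++) {
            next = cur ^ 1;

            /* Start fetching chunk i+1 while we still work on chunk i. */
            if (i + 1 < num_chunks)
                mfc_get(buf[next], ea_base + (i + 1) * (unsigned long long)CHUNK,
                        CHUNK, next, 0, 0);

            /* Wait only for the buffer we are about to use. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();

            process(buf[cur], CHUNK);
            cur = next;
        }
    }

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        (void)speid; (void)envp;
        stream(argp, 64);   /* argp carries the effective address from the PPE */
        return 0;
    }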
     
  10. MoHonRi

    Newcomer

    Joined:
    May 23, 2005
    Messages:
    64
    Likes Received:
    0
    It's not really that the CELL in the PS3 is less powerful than the CELL of the IBM workstations (I think that's obvious). The only disturbing thing is that the performance might be different across different PS3 machines.

    MoH
     
  11. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    If it's a known thing, the developers will just take the worst case number to optimize against. In cases where it doesn't matter, the devs can adopt a more aggressive approach.
     
  12. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    AFAICR, it was done (although this is a single example and covers DP FP heavy processing):

    http://www.cs.berkeley.edu/~samw/research/papers/sc07.pdf

    It shows benchmarks run on the blade with 1 CELL BE (8 SPEs) vs the PS3 (6 SPEs).
     
  13. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    Those numbers are pretty close to what I was guessing, I'd say. The one wildcard is that the hypervisor doesn't just remove one core -- it makes that core do stuff, which may or may not soak additional EIB bandwidth.

    There are some excellent tech demos floating around in the SDK work and some of the published papers showing ways to take advantage of SPE affinity to improve the effective available EIB bandwidth (and reduce latency), and those are all difficult at best on the PS3.

    It's not horrible, but there's a pretty noticeable gap when comparing "Cell" performance numbers to what actual code on a PS3 can do.
     
  14. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    Which makes those tests still relevant, as the HV is running and probably still giving some non-trivial work to that SPE.
     
  15. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    I suspect the HV does a lot less work during a purely computational load under Linux than it does, say, in a game that may be streaming data from blu-ray.

    BTW, yes, I did make an obvious error in the diagram above; it's SPE 4-7, not 5-8, on the bottom. Sorry!
     
  16. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Yes, with another entity running on the same bus, the performance may be lower due to contention... just like in any other concurrent system.

    But for the unpredictable part... is there any reason why one cannot assume a worst case framework ? I assume the hypervisor will only kick in based on the application's or the user's requests ? If required, can't the developers detect the SPU # on-the-fly and organize accordingly (only for critical work) ?
     
  17. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    Yes.

    I don't know that you're given access to let you detect the SPU# or organize on the fly; it looks like you can't in Linux, although I haven't explored very deeply. The Cell SDK simply documents that affinity isn't available on the PS3 platform.
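
    The closest thing I know of is a simple count; e.g., under Linux, libspe2 will tell you how many SPEs are usable, but nothing about where they sit on the EIB or which one is fused off (quick sketch):

    /* PPE side, Linux + libspe2: counts only, no physical placement info. */
    #include <stdio.h>
    #include <libspe2.h>

    int main(void)
    {
        int physical = spe_cpu_info_get(SPE_COUNT_PHYSICAL_SPES, -1);
        int usable   = spe_cpu_info_get(SPE_COUNT_USABLE_SPES, -1);

        /* On a PS3 under Linux this typically reports 6 usable SPEs; it says
         * nothing about which EIB slot they occupy or where the disabled one
         * sits. */
        printf("physical SPEs: %d, usable SPEs: %d\n", physical, usable);
        return 0;
    }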

    You can, to some extent, develop for the worst-case scenario -- but that could be noticeably worse than a reasonably average case, which is why I think it's a bad thing.

    There are two separate questions here:
    1. Impact of losing two SPEs. Fairly predictable.
    2. Impact of having it not always be the same two. Harder to guess.

    I think that, if the PS3 ALWAYS shipped with, say, SPE3 disabled, and SPE7 running the hypervisor, you would see slightly better performance than you will on systems in the wild now.
     
  18. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,762
    Likes Received:
    2,639
    Location:
    Maastricht, The Netherlands
    Given that you know which SPE has been disabled in terms of hardware, can't you balance things out again, at least to some extent, by cleverly choosing which of the SPEs is going to run the hypervisor?

    Other than that, I must say that I'm not convinced that the performance gains of the theoretical situation where you have full mastery of the full Cell, vs just having 1 PPE and 6 SPEs available, are significant compared to the benefits of having the hypervisor in the first place and, of course, being able to produce the chips at a lower cost (higher yields).

    You should be able to figure this out in more detail simply by running some tests on several different Linux PS3s. The test linked above isn't as useful, because it compares one PS3 to one Blade Cell chip, whereas what you/we are looking for in this case is performance differences between different PS3s.
     
  19. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Yes, it means you may have to use other mechanisms to achieve a speed-up.

    EDIT: e.g., I remember DeanoC has some EIB cache hack that shares a small amount of info across the SPUs, but my memory is vague now. Of course this is not a general solution.

    I thought the PPU could set up the entire SPU environment at will ? They may not need to fix the SPU #... just their relative position within the pictured framework above.
     
    #19 patsu, Nov 29, 2007
    Last edited by a moderator: Nov 29, 2007
  20. seebs

    Newcomer

    Joined:
    Nov 29, 2007
    Messages:
    44
    Likes Received:
    0
    Location:
    Minnesota
    Maybe. You could present people with a consistent "3 up, 3 down" view of the Cell, but with some unpredictable (to them) bus contention.

    Well, the hypervisor is a tradeoff between politics and performance. The hypervisor's there to keep you from stealing stuff. If it weren't all about control, it could just be an OS bit that didn't need to be a hypervisor.

    I agree that there's also a price/performance tradeoff with the disabled SPE, and it may even be a necessary one -- but it does mean that you really can't assume that "Cell" benchmarks tell you what "a PS3" can do.

    There are three interesting cases:

    1. 6 SPEs, but you control affinity. If you had a blade server, you could compare 6-SPE and 8-SPE configurations, effectively, by targeting specific SPEs, then simply ignoring two of them.
    2. 6 SPEs, no control over affinity.
    3. A different 6 SPEs, no control over affinity.

    My guess is that there would be noticeable but minor variance in performance between cases 2 and 3 -- but it would be hard to reproduce from one test to another. I'd guess that they'd both be marginally worse than case #1 when a workload was built for it.

    Consoles, though, are in a way hard realtime; it is perfectly fine to have every frame ready .2ms before you need it, but dropping frames gets you dinged points in reviews, so if performance is even SLIGHTLY unpredictable, you have to leave larger margins.
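
    Just to put rough numbers on "larger margins" (the variance figure here is completely made up for illustration):

    #include <stdio.h>

    int main(void)
    {
        const double frame_ms = 1000.0 / 60.0;  /* 16.67 ms per frame at 60 fps          */
        const double variance = 0.05;           /* assumed worst-case machine-to-machine spread */

        /* If a frame can take up to 5% longer on some consoles, the budget you
         * can actually plan against shrinks accordingly. */
        printf("frame: %.2f ms, plannable budget: %.2f ms\n",
               frame_ms, frame_ms * (1.0 - variance));
        return 0;
    }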
     