IBM on CELL as online games server (do physics simulation).

Discussion in 'Console Technology' started by cho, Jan 15, 2006.

  1. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
    #1 cho, Jan 15, 2006
    Last edited by a moderator: Jan 16, 2006
  2. tema

    Banned

    Joined:
    Dec 15, 2005
    Messages:
    115
    Likes Received:
    2
    XCPU (A)0.2 x 3 = 0.6 (B) 0.18 x 3 = 0.54 ?
     
  3. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Didn't get a chance to read the whole paper - it looks very interesting though - but that old analogy of the PPE being the conductor and the SPEs the orchestra springs to mind..

    It sounds like they're going to pursue it further, hopefully they'll update us again :)

    It's worth noting with the performance comparisons that they were benching a 3GHz P4 against a 2.4GHz 6-SPE Cell, and the code was also originally Wintel (which makes it a pretty interesting case study).

    3 PPEs wouldn't necessarily scale linearly. And the PPE and Xenon cores aren't exactly the same anyway.
     
    #3 Titanio, Jan 15, 2006
    Last edited by a moderator: Jan 15, 2006
  4. BenQ

    Newcomer

    Joined:
    Jun 28, 2005
    Messages:
    216
    Likes Received:
    0
    I don't get it.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Interesting - even if I did skip over the maths involved in the physics modelling.

    Version 2 for the win :!:

    It's quite clear from this that the dev team were shooting in the dark, solving the wrong problem first (integration) and only finding out too late that they'd built in a hideous bottleneck.

    It seems that if they'd moved a lot of the PPE's work onto one or two SPEs and also re-jigged the datastructures/DMA techniques to obviate the conductor having to do anything for the orchestra, they'd have a demo that truly lived up to the promise of Cell.

    But that's the nature of R&D, so nothing to criticise them for.

    It's also interesting that integration, being so compute-intensive, obviated double-buffered work-unit storage on each SPE. The compute time was 182x the DMA time.
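
A rough sketch of the arithmetic behind that point, assuming a simple model in which each work unit costs one DMA transfer plus one compute phase and double buffering hides the transfer perfectly behind compute. Only the 182x ratio comes from the paper; the function name and time units are illustrative.

```python
def double_buffer_speedup(compute_time: float, dma_time: float) -> float:
    """Best-case speedup from overlapping DMA with compute.

    Single-buffered: each work unit pays dma_time + compute_time.
    Double-buffered (ideal overlap): each unit pays max(dma_time, compute_time).
    """
    serial = dma_time + compute_time
    overlapped = max(dma_time, compute_time)
    return serial / overlapped


# Ratio quoted in the thread: compute is 182x the DMA time.
# The absolute unit is arbitrary; only the ratio matters.
gain = double_buffer_speedup(compute_time=182.0, dma_time=1.0)
print(f"best-case gain from double buffering: {(gain - 1) * 100:.2f}%")
```

Under this model the best case is 183/182, roughly half a percent, which is why the extra buffer would buy almost nothing for the integration step.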

    Overall, I guess this is why Havok and Ageia have a business.

    Jawed
     
  6. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Well, they took the obvious route. Integration was the most intensive part of the Wintel code, so it made sense to farm that out to the SPEs. They did also intend to put some of the collision detection on the SPEs, but time prevented them from doing so. They just didn't bank on the PPE being slower than the P4 on the rest of the work.

    Their experience with the SPEs is very promising though. And they seem to have a clear idea of where to go next, so hopefully they'll be given the opportunity (and a DD3 3.2Ghz Cell ;)).
     
  7. rounin

    Veteran

    Joined:
    Sep 21, 2005
    Messages:
    1,251
    Likes Received:
    20
    Why didn't they bench against AMD stuff? Doesn't it make more sense to run benchmarks against, say, a Dual-Core AMD chip?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Which was a little naive (they admit this), given that the Cell implementation of the "game" has a significant computational overhead in order to split the workload across SPEs.

    If you put that overhead on the PPE, then you only make the bottleneck even worse. They were surprised by how bad the bottleneck was - but there's no doubt they were expecting it.

    Pity they didn't have twice as long...

    Jawed
     
  9. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    They mention the game server was a CBEA prototype board running at 2.4 GHz with 6 SPEs... it seems like they were using the DD1 revision. In this case you would not only be 800 MHz short of the final speed achieved by the CBEA processor in, say, PLAYSTATION 3, but you would also have a decisively slower PPE implementation (the PPE grew 2x going from DD1 to DD2).
     
  10. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,509
    Likes Received:
    839
    Still, isn't this more or less as expected?

    The compute-bound parts of the physics engine are offloaded to SPEs and see a great speed-up; the remaining pointer-chasing-bound part of it (collision detection) is not, because it isn't as straightforward.

    Still, the craptacular performance of the PPE is quite surprising IMO. The fact that CELL just barely outperforms a P4 in a real game situation is as well.

    Cheers
    Gubbi
     
    #10 Gubbi, Jan 15, 2006
    Last edited by a moderator: Jan 15, 2006
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Gubbi, I think they were misdirected in their approach and so the craptacular results are more a reflection of the dead ends they encountered rather than Cell intrinsically.

    Jawed
     
  12. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,509
    Likes Received:
    839
    Well, the multicycle latency for the schedule-execute loop for instructions, and a high load-to-use latency (which are probably the biggest culprits in the detrimental collision detection performance) of the memory arrays are intrinsic to CELL, so I disagree.

    To get solid performance they would have to be able to distribute collision detection too. We've discussed this in other threads, it is not a workload that fits the SPEs well.

    Cheers
     
  13. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    They said they would with more time (collision detection, or the "narrow phase" part of it at least), and move other things too..

    It's also worth considering that if you scaled the results according to the clockspeed difference, there would be a better result. But also, in the second benchmark, if you considered it as a game, you may as well free 3 of the SPEs, since you're not seeing any further improvement beyond 3 SPEs, and use those for other things. In other words, you're getting better performance than the P4, which is clocked higher, and could still do more than it elsewhere.

    It sounds like it is a going concern, so we might get a further update later. If it was a DD1 Cell, it'd be interesting to even just see it running as is on a DD2 or DD3, to see if the obvious changes to the PPE in those iterations would alone have an impact.


    The main culprits they identify (collision detection, data packing) are all things that could be changed and improved, which the authors also agree upon.

    On collision detection, the method they discuss would be quite well suited to it. You do your bounding volume hierarchy traversal on the PPE, but then when you've gone down as far as you can go there, you offload the final check to an SPE (that's the approach they're suggesting). But whole collision detection on SPEs... that is itself a very arguable issue.
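
A minimal sketch of that split, assuming axis-aligned bounding boxes for the hierarchy and spheres for the final check. All class and function names are hypothetical, and the "narrow phase" here is just a plain function standing in for whatever would be batched off to an SPE; the paper's actual data structures are not reproduced.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, ...]


@dataclass
class Sphere:
    center: Vec3
    radius: float


@dataclass
class Node:
    lo: Vec3                        # AABB minimum corner
    hi: Vec3                        # AABB maximum corner
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    body: Optional[Sphere] = None   # set only on leaves


def leaf(s: Sphere) -> Node:
    return Node(tuple(x - s.radius for x in s.center),
                tuple(x + s.radius for x in s.center), body=s)


def parent(l: Node, r: Node) -> Node:
    return Node(tuple(min(a, b) for a, b in zip(l.lo, r.lo)),
                tuple(max(a, b) for a, b in zip(l.hi, r.hi)), left=l, right=r)


def boxes_overlap(a: Node, b: Node) -> bool:
    return all(a.lo[i] <= b.hi[i] and b.lo[i] <= a.hi[i] for i in range(3))


def broad_phase(a: Node, b: Node, out: List[Tuple[Sphere, Sphere]]) -> None:
    """Pointer-chasing tree walk (the part kept on the PPE in this scheme)."""
    if not boxes_overlap(a, b):
        return
    if a.body is not None and b.body is not None:
        out.append((a.body, b.body))   # candidate pair for the narrow phase
    elif a.body is None:
        broad_phase(a.left, b, out)
        broad_phase(a.right, b, out)
    else:
        broad_phase(a, b.left, out)
        broad_phase(a, b.right, out)


def narrow_phase(pairs: List[Tuple[Sphere, Sphere]]) -> List[Tuple[Sphere, Sphere]]:
    """Flat arithmetic over a batch of pairs (the SPE-friendly part)."""
    hits = []
    for s, t in pairs:
        d2 = sum((p - q) ** 2 for p, q in zip(s.center, t.center))
        if d2 <= (s.radius + t.radius) ** 2:
            hits.append((s, t))
    return hits


# Two small hierarchies with made-up geometry.
tree_a = parent(leaf(Sphere((0.0, 0.0, 0.0), 1.0)), leaf(Sphere((5.0, 0.0, 0.0), 1.0)))
tree_b = parent(leaf(Sphere((1.5, 0.0, 0.0), 1.0)), leaf(Sphere((1.8, 1.8, 0.0), 1.0)))

candidates: List[Tuple[Sphere, Sphere]] = []
broad_phase(tree_a, tree_b, candidates)
hits = narrow_phase(candidates)
print(len(candidates), "candidate pairs,", len(hits), "actual contact(s)")
```

The tree walk is branchy and data-dependent, while the final loop is a flat batch of independent distance checks, which is the kind of work that is easier to hand off in bulk.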
     
    #13 Titanio, Jan 15, 2006
    Last edited by a moderator: Jan 15, 2006
  14. Lord Darkblade

    Newcomer

    Joined:
    Jun 2, 2005
    Messages:
    49
    Likes Received:
    0
    It's an interesting read; however, as they note, a lot of their slowdown was due to their approach to the problem rather than to Cell. They noted that the data structures were bad for the SPEs and were packaged and sent from the PPE (surely running some DMA here and accessing the actual data structures themselves would have been a better plan: let the SPEs do the fetching rather than the PPE?). The code on the PPE was hurt severely by its need to organise the SPEs heavily; a more light-handed approach would likely have given better results. As noted, a lot of their collision detection could also be easily vectorised, which would make it a more suitable task for the SPEs.

    Overall it's not a bad performance: with 2 SPEs the city demo was looking at a 1.3x speed-up (ish), leaving a theoretical 4 more SPEs for other tasks (assuming that the load on the PPE to co-ordinate other tasks would not be that much greater). This was also a 2.4GHz rather than a 3.2 or 3.0GHz Cell, with only 6 active SPEs (were the server blades not dual 1:6 2.4GHz systems?), so a PS3 running this would still perform substantially better than a P4 (1.5ish speed-up?). So even though it's bad there is still light at the end of the tunnel. What Cell is offering perhaps isn't the best solution to every problem without substantial revisions to the code, but it is a solution that may give some advantage.
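
For reference, the naive clock-scaling arithmetic behind an estimate like this can be sketched as below. It assumes performance scales linearly with clock speed, which is optimistic (memory latency does not shrink with the core clock), so the result is better read as an upper bound than a prediction. The function name is illustrative; the 1.3x, 2.4GHz and 3.2GHz figures are the ones quoted in the thread.

```python
def scale_by_clock(speedup: float, measured_ghz: float, target_ghz: float) -> float:
    """Naive upper bound: assumes performance scales linearly with clock."""
    return speedup * target_ghz / measured_ghz


# 1.3x over the P4 measured at 2.4GHz, projected to a 3.2GHz part.
projected = scale_by_clock(1.3, measured_ghz=2.4, target_ghz=3.2)
print(f"projected speed-up at 3.2GHz: {projected:.2f}x")
```

That comes out at about 1.73x, so a guess of around 1.5x sits comfortably below the linear-scaling ceiling.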
     
  15. blakjedi

    Veteran

    Joined:
    Nov 20, 2004
    Messages:
    2,975
    Likes Received:
    79
    Location:
    20001
    this seems pretty unspectacular considering the fact you are pitting 7 cores versus a single processor... 4 times speedup is NOT impressive imho... where is the order-of-magnitude speedup...? how would it compare next to one of the new dual core athlons/p4s?
     
  16. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Well, note that you're over 3.5x already with 4 SPEs. You may as well stop throwing SPEs at the problem at that point, as beyond that you're obviously not really being bound by the integration speed.
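
To illustrate why the curve flattens, here is a minimal Amdahl's-law sketch. The 80% parallel fraction is a made-up number for illustration, not a figure from the paper; it stands for the share of frame time spent in integration, with everything else staying serial on the PPE.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup over one worker when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical split: 80% integration (spread across SPEs),
# 20% serial work left on the PPE.
for spes in range(1, 7):
    print(spes, "SPEs:", round(amdahl_speedup(0.8, spes), 2))
```

With that split the speedup can never exceed 1 / 0.2 = 5x no matter how many SPEs are added, and each extra SPE contributes less than the one before it.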

    How would a DD2 3.2Ghz Cell compare? :) We can only work with what we're given.

    I think it's safe to say this is a less than perfect implementation. Cell would suffer more for that than a P4 though, for sure, if you want to look at it that way. But conversely, with a really ideal implementation (on both), you'd probably see it stretch its legs vs the P4 more than is exhibited here.
     
  17. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    I could just point out that for its order of magnitude (1000%) performance difference in FPU terms, Cell returned a 30% speed improvement in the "realistic game scenario" :p
    And everything after the second SPU was wasted.

    But it's more interesting to compare the performance in the artificial test to the performance in a real scenario. Because it demonstrates how misleading artificial tests can be.
     
  18. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    Only misleading if you take them out of context ;)

    edit -

    There's also another Cell article online from that journal:

    MPI microtask for programming the Cell Broadband Engine™ processor

    http://www.research.ibm.com/journal/sj/451/ohara.html

     
    #18 Titanio, Jan 15, 2006
    Last edited by a moderator: Jan 15, 2006
  19. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    If it's a CEB 20x0 series, it's a DD2+.
    They said it's a 2.4GHz Cell with 512MB of system RAM. If that's a CEB, it's a 2030 (by the amount of RAM) and then it would be a DD2 or greater; if, however, it's not a CEB, it could be any revision.
     
    #19 DeanoC, Jan 15, 2006
    Last edited by a moderator: Jan 15, 2006
  20. Carl B

    Carl B Friends call me xbd
    Moderator Legend

    Joined:
    Feb 20, 2005
    Messages:
    6,266
    Likes Received:
    63
    This last section of the article stood out as offering some of the research team's more condensed/cohesive insights into utilizing the SPEs to greater effect in future attempts:

    I think overall, though they seemed to fumble some things initially, Cell if nothing else shows promise for the intended tasks. The paper does go on to mention that code-porting to the SPEs will be a 'daunting' task, and one that hopefully future applications will make easier. But of course we've known all along that getting code onto the SPEs is Cell's Achilles Heel in a sense.

    I walk away generally pleased though with Cell's potential. Truthfully I don't think a team at IBM would be my #1 pick for developing an MMOG engine for Cell straight-off. Given knowledge of the Cell, I'd rather see what some experienced PS2 devs could create.
     
    #20 Carl B, Jan 15, 2006
    Last edited by a moderator: Jan 15, 2006