A new programming paradigm for the Cell

Discussion in 'CellPerformance@B3D' started by homy, Mar 21, 2007.

  1. homy

    Banned

    Joined:
    Jan 20, 2007
    Messages:
    136
    Likes Received:
    4
    First of all, I'm not a game developer, but coming from a distributed-programming background I would suggest that the current programming style on the Cell is questionable.
    The Cell is a parallel multi-processor, so it works best when data and computation are shared among its sub-processors (cores).
    This has been demonstrated by the following paper:
    http://hpc.pnl.gov/people/fabrizio/papers/ipdps07-graphs.pdf
    Two Cells beat the performance of a 128-CPU BlueGene/L and are pretty much neck and neck with a 256-CPU BlueGene/L.

    I have seen many PS3 games that do not use the distributed programming approach, even though it looks quite attractive.
    For example, HS uses 2 SPUs for AI and a few others for physics...
    This style of dedicating sub-processors to specific tasks is not efficient.
    At school, our department professor tested his algorithm: on 2 SPUs it ran 2.2 times faster than on 1 SPU, but when he distributed it across 5 SPUs it ran 16 times faster than on 1 SPU.
    So the question to developers is: should you design your code so that it is distributed across all SPUs?
    Suppose you want 60 fps, which means a frame takes 16.67 milliseconds. Out of those 16.67, say a third is dedicated to the graphics card and the rest goes to the Cell for other tasks such as geometry, animation or physics... So geometry would take x, animation y and physics z out of the 11.11 milliseconds available on all SPUs, as in the rough sketch below.
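
    Here is that arithmetic as a small standalone C program; the one-third GPU share and the six-SPU count are my own assumptions for illustration:

    Code:
    /* Frame-budget arithmetic from the post above as a small standalone program.
     * The one-third GPU share and the six-SPU count are assumptions for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const double frame_ms = 1000.0 / 60.0;     /* 16.67 ms per frame at 60 fps */
        const double gpu_ms   = frame_ms / 3.0;    /* ~5.56 ms assumed for feeding the GPU */
        const double cell_ms  = frame_ms - gpu_ms; /* ~11.11 ms left for Cell-side work */
        const int    spus     = 6;                 /* assumed SPUs available to the game */

        /* If every SPU can work on every stage, the per-frame SPU budget is pooled. */
        printf("frame %.2f ms, gpu %.2f ms, cell %.2f ms\n", frame_ms, gpu_ms, cell_ms);
        printf("pooled SPU budget: %.1f SPU-ms per frame\n", cell_ms * spus);
        printf("geometry (x) + animation (y) + physics (z) must fit in that pool\n");
        return 0;
    }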
     
    #1 homy, Mar 21, 2007
    Last edited by a moderator: Mar 21, 2007
  2. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,823
    Likes Received:
    153
    Location:
    Minato-ku, Tokyo
    SPURS (SPU Runtime System) is available for PS3 developers AFAIK. In SPURS, SPEs are main processors and the PPE is merely a service co-processor invoked only when absolutely necessary.
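
    Roughly, the model is inverted like this; a minimal sketch with portable C threads standing in for SPE contexts (this is not the actual SPURS API):

    Code:
    /* Sketch of the SPURS-style inversion: worker threads (stand-ins for SPEs)
     * pull jobs themselves from a shared list; the main thread (stand-in for
     * the PPE) only sets the work up and waits. Portable C, not the SPURS API. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define JOB_COUNT    16
    #define WORKER_COUNT 6          /* assumed number of SPE-like workers */

    typedef struct {
        void (*run)(int arg);
        int arg;
    } Job;

    static void do_physics(int i)   { printf("physics job %d\n", i); }
    static void do_animation(int i) { printf("animation job %d\n", i); }

    static Job jobs[JOB_COUNT];
    static atomic_size_t next_job = 0;   /* shared "job queue" head */

    static void *worker(void *unused)
    {
        (void)unused;
        for (;;) {
            size_t idx = atomic_fetch_add(&next_job, 1);  /* grab the next job */
            if (idx >= JOB_COUNT)
                return NULL;              /* no fixed task-to-worker mapping */
            jobs[idx].run(jobs[idx].arg);
        }
    }

    int main(void)
    {
        pthread_t workers[WORKER_COUNT];
        int i;

        for (i = 0; i < JOB_COUNT; ++i) {
            jobs[i].run = (i % 2) ? do_physics : do_animation;
            jobs[i].arg = i;
        }
        for (i = 0; i < WORKER_COUNT; ++i)
            pthread_create(&workers[i], NULL, worker, NULL);
        for (i = 0; i < WORKER_COUNT; ++i)
            pthread_join(workers[i], NULL);  /* "PPE" only waits; it does no job work */
        return 0;
    }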
     
  3. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    one, do you remember if the PPE is involved in any way when an SPU accesses main memory? My current impression is that the PPE is needed during the setup (memory-map) stage. Thereafter, the SPU should be able to access main memory or another local store without any external help. There was also some mention of the SPU interacting with I/O devices (via the PPE?), but I can't remember where I read it anymore.
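
    If that impression is right, the SPU side of a transfer would look roughly like this once an effective address is known; the use of argp to carry that address and the buffer size are just assumptions for illustration:

    Code:
    /* SPU-side sketch (spu-gcc, spu_mfcio.h): once the PPE has passed in an
     * effective address (here assumed to arrive via argp), the SPU fetches main
     * memory itself with a DMA get; no PPE involvement per transfer. */
    #include <spu_mfcio.h>

    static volatile char buffer[16384] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        (void)speid; (void)envp;
        unsigned int tag = 1;

        /* Pull 16 KB from main memory at effective address 'argp' into local store. */
        mfc_get(buffer, argp, sizeof(buffer), tag, 0, 0);

        /* Block until that tag group's DMA completes. */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        /* ... work on 'buffer' entirely out of local store ... */
        return 0;
    }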
     
  4. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,823
    Likes Received:
    153
    Location:
    Minato-ku, Tokyo
    I don't remember the exact details, but the ideal workload configuration is that the PPE does only the tasks that the PPE alone can do (setup, SPE booting, system calls).

    Also, the SPU does lock-free synchronization through the atomic cache in the SPE and doesn't use mutexes in shared memory or those provided by the OS API, since all OS function calls are costly remote procedure calls via the PPU, and the scheduling of PPU threads is independent of the SPEs. In SPURS Jobs, DMAs are automatically overlapped and pipelined.
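
    Roughly, that lock-free pattern looks like this on the SPU side; a sketch using the lock-line reservation commands from spu_mfcio.h, with the 128-byte-aligned counter layout and the use of argp assumed for illustration (this is not code from SPURS itself):

    Code:
    /* SPU-side sketch of a lock-free atomic add on a counter in main memory,
     * using the lock-line reservation commands (getllar / putllc), so no
     * PPE-side mutex or system call is involved. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    /* One 128-byte lock line in local store, holding the shared counter in word 0. */
    static volatile uint32_t line[32] __attribute__((aligned(128)));

    /* 'counter_ea' is assumed to be the 128-byte-aligned effective address of the counter. */
    static uint32_t atomic_add(uint64_t counter_ea, uint32_t amount)
    {
        uint32_t old;
        do {
            mfc_getllar(line, counter_ea, 0, 0);  /* load the line and set a reservation */
            (void)mfc_read_atomic_status();       /* wait for the getllar to complete */
            old = line[0];
            line[0] = old + amount;
            mfc_putllc(line, counter_ea, 0, 0);   /* store only if the reservation still holds */
        } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);  /* retry if it was lost */
        return old;
    }

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        (void)speid; (void)envp;
        atomic_add(argp, 1);   /* argp assumed to carry the counter's effective address */
        return 0;
    }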
     
  5. inefficient

    Veteran

    Joined:
    May 5, 2004
    Messages:
    2,121
    Likes Received:
    53
    Location:
    Tokyo
    That information is old/incorrect. HS and most of the non-launch-window games coming out do not use the SPUs in that fashion. They use a more robust job/task system; if not SPURS, then something in a similar vein.
     
  6. homy

    Banned

    Joined:
    Jan 20, 2007
    Messages:
    136
    Likes Received:
    4
    I hope so because if they want to see a leap in performance they must do it in a distributed way.
     
  7. rendezvous

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    347
    Likes Received:
    12
    Location:
    Lund, Sweden
    I think this sounds a bit fishy; I have yet to see a performance increase that isn't sublinear in the number of SPUs. I could imagine scenarios where it could happen, when you are limited by the amount of local RAM and SPE-to-SPE communication and computation aren't the limiting factors.

    Could you please shed some light on what kind of algorithm he was testing, preferably with information on how he could reach such impressive numbers?
     
  8. Shompola

    Newcomer

    Joined:
    Nov 14, 2005
    Messages:
    142
    Likes Received:
    1
    This is definitely an issue of memory usage: the data (chunks etc.) is large enough that it fits the combined local stores much better than a single one (5 x 256 KB = 1.25 MB versus 256 KB), which reduces accesses to the main RAM pool significantly. What happens if he increases the data size? The speed-up factor should decrease.
     
  9. inefficient

    Veteran

    Joined:
    May 5, 2004
    Messages:
    2,121
    Likes Received:
    53
    Location:
    Tokyo
    According to data from this presentation (link), the actual latency for one SPU reading from another SPU is still a whopping 200 cycles. In comparison, a DMA from XDR to LS is 500 cycles.

    Not to be a skeptic, but given those numbers I don't see that quoted 16x performance increase coming simply from utilizing the combined LSs for better memory performance.

    To me it sounds more likely that there was just a misunderstanding. Something like: professor X ran a fairly unoptimized simulation on 1 SPU and benchmarked it; then, later down the road, by a combination of exploiting instruction-level parallelism and multiprocessor parallelism, he was able to optimize it to the point of a 16x speed-up. I think this is the most realistic scenario.

    Unless it was a proof of concept where all code and data fit easily into the 5 x 256 KB combined memory space. But that kind of example is not really that useful in the real world.
     
  10. Datasegment

    Newcomer

    Joined:
    Feb 7, 2007
    Messages:
    9
    Likes Received:
    2

    Actually, although the initiation of communications between two SPEs is relatively slow, the actual data throughput is phenomenally fast, around the 200 GB/s mark (4 rings * 25.6 GB/s in each direction) - AND this data traffic happens on-chip, with no hit to the main memory subsystem at all. Coupled with the fact that these transfers can be performed while the SPEs simultaneously operate on other data, this means the effective transfer penalty can be reduced to almost zero (setup time and handshaking are still required).
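
    In practice that overlap is plain double buffering; a rough SPU-side sketch, with the chunk size, the element type and the amount of work assumed for illustration:

    Code:
    /* SPU-side sketch of the overlap described above: double-buffered DMA, where
     * the MFC streams the next chunk into one buffer while the SPU works on the
     * other, hiding most of the transfer cost. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096   /* bytes per DMA; assumed chunk size, 128-byte aligned */

    static volatile uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

    static void process(volatile uint8_t *data, unsigned int n)
    {
        /* ... real work on the chunk would go here ... */
        (void)data; (void)n;
    }

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        (void)speid; (void)envp;
        uint64_t ea = argp;              /* start of a large array in main memory (assumed) */
        const unsigned int chunks = 64;  /* assumed amount of work */
        unsigned int i, cur = 0;

        /* Prime the pipeline: start fetching chunk 0 into buffer 0 (tag 0). */
        mfc_get(buf[0], ea, CHUNK, 0, 0, 0);

        for (i = 0; i < chunks; ++i) {
            unsigned int nxt = cur ^ 1;

            /* Kick off the next transfer before touching the current buffer. */
            if (i + 1 < chunks)
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

            /* Wait only for the current buffer's tag group, then compute on it
             * while the other transfer proceeds in the background. */
            mfc_write_tag_mask(1u << cur);
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);

            cur = nxt;
        }
        return 0;
    }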
     
  12. homy

    Banned

    Joined:
    Jan 20, 2007
    Messages:
    136
    Likes Received:
    4
    He's submitting his paper to a conference. Once his paper is published I'll post it here.
     
  13. Shompola

    Newcomer

    Joined:
    Nov 14, 2005
    Messages:
    142
    Likes Received:
    1
    That might take up to a few months, no?
     
  15. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Answering my own question...

    Came across a PS3 developer's post on Slashdot.
    http://games.slashdot.org/comments....mmentsort=0&mode=thread&pid=18466573#18466779

    More here... http://games.slashdot.org/comments....mmentsort=0&mode=thread&pid=18465477#18466339
     
    #15 patsu, Mar 24, 2007
    Last edited by a moderator: Mar 25, 2007
  16. homy

    Banned

    Joined:
    Jan 20, 2007
    Messages:
    136
    Likes Received:
    4
    The following code supports my argument:

    Same source as above post.
     
  17. Xenon

    Banned

    Joined:
    Jun 29, 2007
    Messages:
    42
    Likes Received:
    0
    The fixed allocation of SPEs to specific processing tasks is inherently inefficient due to the large amounts of idle time it causes. I just find it odd that a game would even contemplate using this design pattern, given its obvious flaws.
     
  18. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Even though Cell can be a building block for a supercomputer, it doesn't mean we always have to use it that way. In supercomputing, the problems are large and compute-intensive, but people may not need the answers right away, so it is best to solve one problem using all the available nodes and hope the answer comes out as soon as possible. The system is usually optimized for efficiency (so it can scale to larger problems with reasonable speed-up).

    I remember Kutaragi mentioned that Cell is also suitable for running interactive, real-time applications. The needs are different, although they "prefer" similar CPU traits.

    Like gaming, a real-time application has strict timing requirements even under heavy load and many concurrent activities. Because it has more cores, Cell can afford to allocate different ones to separate tasks to meet multiple (stringent) schedules. It is useless to achieve a 100x speed-up on large problem sizes if the answer is always late, even for small problem sizes.
     
    #18 patsu, Aug 5, 2007
    Last edited by a moderator: Aug 5, 2007
  19. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,681
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    Because it's easier to implement in an existing single-core-based game design?
     