A very interesting article about CELL programming

Discussion in 'Console Technology' started by chris1515, Mar 12, 2007.

  1. chris1515

    Legend Regular

    Joined:
    Jul 24, 2005
    Messages:
    5,174
    Likes Received:
    4,482
    Location:
    Barcelona Spain
    It is about the Breadth-First Search (BFS) algorithm on a graph. They finished optimising the algorithm, and it is now 22 times faster on CELL than the PC implementation on a 3.4 GHz Xeon with HyperThreading. I remember that someone posted something about this algorithm on B3D before.

    http://www.ddj.com/dept/64bit/197801624

    It is very interesting: they describe every optimisation step by step, and you can view some of the code. The authors think the final code is a very good example of optimised CELL code.

    It is fast, but the program is much longer on CELL: 60 lines of code for the original implementation... and 1200 lines for CELL.

    They compare CELL programming to DOS programming :grin: and it seems the SPUs with their local stores are fast.

    And I like the new B3D design.

    EDIT:
    I find it was patsu:
    http://forum.beyond3d.com/showpost.php?p=901916&postcount=22

    I find this example very interesting. It is not easy to rethink the problem for CELL, and for a problem with random access and almost no computation, I find the performance very impressive.
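
    For reference, the ~60-line original version is basically the textbook queue-based BFS. Here is a minimal sketch in C of what such a scalar implementation looks like (my own illustration, not the authors' code from the article):

    /* Minimal queue-based BFS over an adjacency-list graph.
     * offset[v]..offset[v+1] index the neighbours of vertex v in adj[].
     * level[] receives the BFS distance from root, or -1 if unreachable. */
    #include <stdlib.h>

    void bfs(int n, const int *adj, const int *offset, int root, int *level)
    {
        int *queue = malloc(n * sizeof(int));
        int head = 0, tail = 0;

        for (int i = 0; i < n; i++)
            level[i] = -1;

        level[root] = 0;
        queue[tail++] = root;

        while (head < tail) {
            int v = queue[head++];
            for (int e = offset[v]; e < offset[v + 1]; e++) {
                int w = adj[e];
                if (level[w] < 0) {          /* first visit */
                    level[w] = level[v] + 1;
                    queue[tail++] = w;
                }
            }
        }
        free(queue);
    }

    Note how every adj[e] lookup is a random access into a big array with almost no arithmetic around it, which is exactly why this is a hard case for any memory system.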
     
    #1 chris1515, Mar 12, 2007
    Last edited by a moderator: Mar 12, 2007
  2. almighty

    Banned

    Joined:
    Dec 17, 2006
    Messages:
    2,469
    Likes Received:
    5
    Very interesting, great find :)
     
  3. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,070
    Likes Received:
    1,682
    Location:
    Guess...
    Would it not be possible to perform a similar level of very low-level optimisation on the Xeon to produce significant performance increases as well? Perhaps not to the level of Cell, but at least enough to close the gap to far less than a 22x difference?

    Or does this difference mostly come from the speed of the LS compared to normal PC RAM?
     
  4. chris1515

    Legend Regular

    Joined:
    Jul 24, 2005
    Messages:
    5,174
    Likes Received:
    4,482
    Location:
    Barcelona Spain
    Maybe, but the local store latency versus cache latency seems to be only one part of the Cell advantage.

    And it is not an article from an STI employee; the goal is not to downplay the Xeon. And most of the time the IBM benchmarks are fair.

     
    #4 chris1515, Mar 12, 2007
    Last edited by a moderator: Mar 12, 2007
  5. Rainbow Man

    Veteran

    Joined:
    Dec 8, 2006
    Messages:
    1,063
    Likes Received:
    4
    Location:
    In front of the PC.
    Perhaps a large part of it also comes from Cell being as many as 9 processors on one chip, while a Xeon is just one...
    Peace.
     
  6. rounin

    Veteran

    Joined:
    Sep 21, 2005
    Messages:
    1,251
    Likes Received:
    20
    Nothing short of amazing.
     
  7. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    It's all the factors together. If you look at the slides I linked to (second last page), you can see that for this task, Cell is on par with the 128-CPU Blue Gene.
     
  8. chris1515

    Legend Regular

    Joined:
    Jul 24, 2005
    Messages:
    5,174
    Likes Received:
    4,482
    Location:
    Barcelona Spain
    And now it is faster.

     
  9. woundingchaney

    Regular

    Joined:
    Jan 29, 2006
    Messages:
    799
    Likes Received:
    1
    Location:
    Terre Haute, IN
    Good find, Chris. Impressive figures for Cell. But why compare Cell to a Pentium 4 at this point in time? Is it possible that there was no original coding done for the Pentium 4 and they simply used archival information?


    Also, is it just coincidence that their 22-times increase in performance mirrors their 20-times increase in code size (60 vs. 1200 lines)?
     
  10. chris1515

    Legend Regular

    Joined:
    Jul 24, 2005
    Messages:
    5,174
    Likes Received:
    4,482
    Location:
    Barcelona Spain
    It is probably an algorithm they use at work on x86; it seems very useful. And they compare the performance results with Blue Gene and other multicore chips in the first document found by Patsu. It is probably just a coincidence that the code increase mirrors the performance increase.

     
    #10 chris1515, Mar 13, 2007
    Last edited by a moderator: Mar 13, 2007
  11. woundingchaney

    Regular

    Joined:
    Jan 29, 2006
    Messages:
    799
    Likes Received:
    1
    Location:
    Terre Haute, IN
    Thx kindly, didn't see patsu's contribution.
     
  12. [maven]

    Regular

    Joined:
    Apr 3, 2003
    Messages:
    645
    Likes Received:
    16
    Location:
    DE
    As is quite common with these sorts of comparisons, the speed-up is from a naive (single-threaded) C implementation to a heavily optimised CELL implementation, using very different data structures.
    Looking at the complete source code from the Dobb's site, I can tell you which version of the code I'd rather maintain... ;)

    I'll sooner or later get around to installing Linux on my PS3, though, and then I'll try to optimise an interesting bit of code (maybe parts of my wavelet compression library?) for the fun of it... :)
     
  13. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    BFS/DFS are algorithms fundamental to computer science, used so often they're practically ubiquitous.

    The key here is they are comparing the classic implementation on a P4 (simple) to a highly efficient cell implementation. Yes, you could probably optimize the P4 implementation some, but that's defeating the purpose. The whole point of a general purpose processor like the P4 with good cache, branching, and OoOE is that you can do things simply, and still get reasonable performance.

    If you have to start seriously optimizing BFSes on a P4, you've kinda thrown away some of the HW you paid for, which was there to increase your coding productivity.

    Most C programmers on x86 are simply not going to spend uber amounts of time on a custom BFS; many will use quick-and-dirty, obvious hand-rolled implementations or a library version.

    I think what has been shown by this example is that a) CELL is a lot of work BUT b) since you have to do the work anyway, you can get better performance if done right.

    It's also shown that all of the fears of branch problems and random I/O can be overcome. Remember all the "CELL will suck at physics, traversing trees, branching randomly on collisions, etc"?
     
  14. chris1515

    Legend Regular

    Joined:
    Jul 24, 2005
    Messages:
    5,174
    Likes Received:
    4,482
    Location:
    Barcelona Spain
    I see the CELL version is very difficult to read. :shock:

    You can probably optimize the x86 version, but I'm sure the Xeon will not be as fast as CELL.
     
    #14 chris1515, Mar 13, 2007
    Last edited by a moderator: Mar 13, 2007
  15. chris1515

    Legend Regular

    Joined:
    Jul 24, 2005
    Messages:
    5,174
    Likes Received:
    4,482
    Location:
    Barcelona Spain
    Yes, I think that is the point of the example.
     
  16. inefficient

    Veteran

    Joined:
    May 5, 2004
    Messages:
    2,121
    Likes Received:
    53
    Location:
    Tokyo
    It is definitely a coincidence. Though a funny one. ;)


    Anyone find the source code? I can't seem to find it on the site just yet. I want to see what they spent the 1200 lines on before I comment.
     
  17. [maven]

    Regular

    Joined:
    Apr 3, 2003
    Messages:
    645
    Likes Received:
    16
    Location:
    DE
    ftp://66.77.27.238/sourcecode/ddj/2007/0704.zip

    (cell.zip inside that archive)
     
  18. chris1515

    Legend Regular

    Joined:
    Jul 24, 2005
    Messages:
    5,174
    Likes Received:
    4,482
    Location:
    Barcelona Spain
    #18 chris1515, Mar 13, 2007
    Last edited by a moderator: Mar 13, 2007
  19. inefficient

    Veteran

    Joined:
    May 5, 2004
    Messages:
    2,121
    Likes Received:
    53
    Location:
    Tokyo
    Thanks.


    It looks like nearly half of the code is just dealing with moving data to and from the Local Stores via DMA. And then a good chunk of the other half is the manual loop unrolling and SIMDification to process 16 words per iteration.
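
    Just to give an idea of why the DMA half costs so many lines: every block of the adjacency list has to be explicitly pulled into the 256 KB local store before the SPU can touch it. A rough sketch of the basic (non-double-buffered) pattern using the standard MFC intrinsics; the buffer size and function name here are my own, not taken from the DDJ source:

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096   /* bytes per transfer; must be a multiple of 16 */

    /* DMA buffers want 128-byte alignment for full-speed transfers. */
    static uint32_t buf[CHUNK / sizeof(uint32_t)] __attribute__((aligned(128)));

    static void fetch_chunk(uint64_t ea, unsigned int tag)
    {
        /* Pull CHUNK bytes from effective address 'ea' in main memory
         * into the local-store buffer. */
        mfc_get(buf, ea, CHUNK, tag, 0, 0);

        /* Wait for the transfer with this tag to complete before the
         * compute loop (the unrolled/SIMD part) reads buf[]. */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }

    The real code double-buffers this (fetching chunk N+1 while processing chunk N) and writes results back with mfc_put, which is where the line count explodes.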

    He does say the difference between the deeply optimized implementation of step 5 and the simple, straightforward version was a speedup of 3.68 times. So all that work does seem to be worth it. The Cell has such deep pockets for pulling out performance with optimization.
     
  20. [maven]

    Regular

    Joined:
    Apr 3, 2003
    Messages:
    645
    Likes Received:
    16
    Location:
    DE
    An alternative, slightly more provocative formulation would be: CELL is such a complicated architecture that getting good performance out of it takes great effort / redesign.

    And on that note, I would be quite interested in the speed difference between running the straightforward C version on the P4 and on CELL.
     
    Simon F likes this.