First Cell Benchmarks

Discussion in 'CellPerformance@B3D' started by Supernatural, Nov 25, 2006.

  1. Nesh

    Nesh Double Agent
    Legend

    Joined:
    Oct 2, 2005
    Messages:
    11,117
    Likes Received:
    1,685
What's a Xeon processor?
     
  2. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
  3. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    The fastest processors Intel makes - i.e. high end versions of the Core 2 chips.
     
  4. Neb

    Neb Iron "BEAST" Man
    Legend

    Joined:
    Mar 16, 2007
    Messages:
    8,391
    Likes Received:
    3
    Location:
    NGC2264
Not necessarily the fastest Intel CPU, since they say "A single Cell SPE outperforms a 3.0 GHz single-core Xeon processor". Is that Xeon based on P4 tech or Core 2 tech?
     
  5. sevanig

    Regular

    Joined:
    Jul 28, 2005
    Messages:
    254
    Likes Received:
    0
    Location:
    Sydney, Australia
    maybe it means a single core from a core 2?
     
  6. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    IBM just published a paper about a financial app on Cell, their end result was a Dual Cell blade beating a dual Xeon (Woodcrest, quad core) by 11X on single precision, 3X on double precision.

    That's 8 high-end Core 2 cores vs 16 SPEs - but the margin is far bigger than the difference in core count.
     
  7. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    I bet the SPUs would beat a GPGPU at compiling :)
     
  8. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    One quad core at 3 GHz theoretically achieves more than double the DP FLOPS of Cell, so we're talking a >6x gap between theoretical and "reality" there.

    Something sounds off to me.

    What it's essentially saying is that a single SPU is 50% faster than a 3 GHz Core 2 at DP FLOPS, despite the fact that the Core 2 has about 4-5x higher theoretical peak and a similar transistor-count advantage. Sounds a bit ludicrous to me.
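    For reference, the back-of-envelope peak arithmetic can be sketched like this. The per-cycle figures are my own assumptions (Core 2 sustaining one 2-wide SSE2 DP add plus one DP mul per cycle per core; the original Cell SPE issuing a 2-wide DP FMA roughly once every 7 cycles), not numbers from the IBM paper:

    ```c
    #include <stdio.h>

    int main(void) {
        /* Assumed peak figures, not measured results:
           Core 2: 4 DP flops/cycle/core (2-wide SSE2 add + mul)
           Cell SPE: 2-wide DP FMA (4 flops) every ~7 cycles */
        double core2_gflops = 4 /*cores*/ * 4.0 /*flops/cycle*/ * 3.0 /*GHz*/;
        double spe_gflops   = 8 /*SPEs*/  * (4.0 / 7.0)         * 3.2 /*GHz*/;

        printf("quad Core 2 peak DP: %.1f GFLOPS\n", core2_gflops);
        printf("8 SPEs peak DP:      %.1f GFLOPS\n", spe_gflops);
        return 0;
    }
    ```

    On those assumptions the quad Core 2 comes out around 48 GFLOPS DP against roughly 14-15 GFLOPS for the eight SPEs, which is where the "more than double" claim comes from.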
     
    #268 pjbliverpool, Jun 28, 2007
    Last edited by a moderator: Jun 28, 2007
  9. Npl

    Npl
    Veteran

    Joined:
    Dec 19, 2004
    Messages:
    1,905
    Likes Received:
    6
    Flops aren't everything. Cell also has huge internal bandwidth - dunno how many times higher than a Core 2's.
     
  10. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    Internal bandwidth to what, though? Between cores? Between cores and LS? How does that compare with the equivalent on a Core 2? I'm not doubting the results - some architectural trait of Cell was obviously exploited to attain them. I am doubting the fairness of the comparison given that it was compiled by IBM, though. I.e. either the Core 2 was not fully taken advantage of, or the workloads were not exactly the same on both architectures. Or perhaps this is a very special case where a slightly altered input factor would completely change the results.
     
  11. Npl

    Npl
    Veteran

    Joined:
    Dec 19, 2004
    Messages:
    1,905
    Likes Received:
    6
    I can't speak for that specific test, but Cell's architecture allows it to get closer to its theoretical peak than other processors - provided you can fit the code to suit it.
    You have an LS that's highly predictable: you (or the compiler) know how long reads/writes will take and - in the optimal case - can effectively remove all stalls caused by memory latency. Further, having separate local stores lets the SPUs operate independently of each other - no shared cache that hinders scaling across multiple cores.
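    The usual way that latency-hiding is done is double buffering: fetch the next chunk while computing on the current one. A minimal generic C sketch of the pattern - `dma_get` here is just a stand-in stub for the real MFC DMA intrinsics (`mfc_get` plus a tag-status wait), so the example runs anywhere:

    ```c
    #include <stdio.h>
    #include <string.h>

    #define CHUNK 4

    /* Stand-in for an async SPU DMA transfer; real code would issue the
       DMA, keep computing, and wait on the tag just before use. */
    static void dma_get(float *dst, const float *src, int n) {
        memcpy(dst, src, n * sizeof *dst);
    }

    int main(void) {
        float input[16];
        for (int i = 0; i < 16; i++) input[i] = (float)i;

        float buf[2][CHUNK];                       /* two LS-resident buffers */
        float sum = 0.0f;

        dma_get(buf[0], input, CHUNK);             /* prefetch first chunk */
        for (int c = 0; c < 16 / CHUNK; c++) {
            int cur = c & 1, nxt = cur ^ 1;
            if (c + 1 < 16 / CHUNK)                /* fetch ahead into the */
                dma_get(buf[nxt], input + (c + 1) * CHUNK, CHUNK); /* idle buffer */
            for (int i = 0; i < CHUNK; i++)        /* compute on current chunk */
                sum += buf[cur][i];
        }
        printf("sum = %.0f\n", sum);
        return 0;
    }
    ```

    Because LS access times are deterministic, the compiler (or you) can schedule the fetch-ahead so the compute loop never waits - which is exactly the stall removal described above.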
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    Woodcrest has an FSB that provides ~10 GB/sec.
    Cell has an on board memory controller that provides ~25 GB/sec.

    Woodcrest is already considered FSB-limited in many workloads.

    It's not enough to explain the great disparity between Cell and Woodcrest, but it likely explains part of it.

    Cell is designed to work with a memory subsystem that isn't as limiting on scalability as it is for Intel.
    Coherency may also be an issue, but there are system and program details that can make it worse or make it less of an issue.
     
  13. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
    Perhaps it's the interoperability between cores? SPUs can communicate with each other directly over the internal (extremely high bandwidth) EIB, something that "normal" SMP/cache architectures like Woodcrest just don't have.
     
  14. Kryton

    Regular

    Joined:
    Oct 26, 2005
    Messages:
    273
    Likes Received:
    8
    What about comparisons with HyperTransport-based systems? A MOESI cache protocol allows this if you manage line ownership sensibly (i.e. avoid many shared/owned lines).
     
  15. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    It could simply be that the act of porting to the Cell architecture forced improvements in the algorithm:
    ...sorting traversals for cache coherency, keeping each core's working set separate, and tweaking so that the working set really does fit in LS/L1, given that you control it programmatically.
    I really like the clarity Cell gives you on these issues.
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    Any algorithm running on x86 would try to optimize on-chip memory usage.

    The downfall for broadcast coherency is that even when such usage is optimal, coherency still generates traffic and adds latency.

    In the case of the 2-Cell blade versus the 2 Woodcrests, it's a matter of 2 coherent caches on the Cell side serving the usually non-critical PPEs, versus a more complex set of caches - on average looking like something between 4 and 8 caches - serving the performance-critical cores on the x86 platform.
     
  17. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    Point taken.

    Where my comment is coming from: most programmers I know won't find as much coherence as is really possible when they've got an L2 to lean on; they (unfairly) cite the lack of L2 as a serious drawback to the SPUs' potential.
    (OK, so I'm getting bitter as I sit here trawling through someone else's code trying to trace all the damn pointers...)

    Agreed, the intra-local-store transfers are the biggest potential physical difference that IBM would be likely to want to exploit in a benchmark designed to show Cell's superiority.
    That still involves some latency though, doesn't it (and manual management...)?

    Woodcrest... that does at least share L2 between its two cores, doesn't it? So at least it's not totally stuffed sharing results between cores on the same die.
    I suppose there's OoOE, which can hide some latency too.
     
  18. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    I think this is one of, if not *the*, major reasons Cell can perform so well.
    When you are programming a conventional CPU, the hardware issues are all hidden away from you. Processors don't like reading small chunks of memory randomly - they're very slow at it. OoO and large caches do a remarkably good job of hiding this from the programmer, but they can't hide it from the CPU itself, which ends up waiting on memory much of the time.

    On the SPEs nothing is hidden. SPE development forces you to think about these issues, and you'll soon find out if the way you're reading from memory isn't efficient. The result is new code which runs a great deal faster than before.
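    A typical instance of that restructuring is switching from array-of-structs to struct-of-arrays, so field reads become dense sequential streams instead of strided accesses - the shape a DMA-fed local store wants. An illustrative C sketch (names and the particle example are mine, not from the post):

    ```c
    #include <stdio.h>

    #define N 8

    struct particle { float x, mass; };           /* AoS: fields interleaved */

    int main(void) {
        struct particle aos[N];
        float xs[N], mass[N];                     /* SoA: each field contiguous */

        for (int i = 0; i < N; i++) {
            aos[i].x = (float)i; aos[i].mass = 2.0f;
            xs[i]    = (float)i; mass[i]     = 2.0f;
        }

        float m_aos = 0.0f, m_soa = 0.0f;
        for (int i = 0; i < N; i++)
            m_aos += aos[i].x * aos[i].mass;      /* strided struct reads */
        for (int i = 0; i < N; i++)
            m_soa += xs[i] * mass[i];             /* two dense streams */

        printf("%.0f %.0f\n", m_aos, m_soa);      /* same answer either way */
        return 0;
    }
    ```

    Both loops compute the same result; the difference is purely the memory layout, which is invisible behind a big cache but immediately obvious when you have to DMA the data into LS yourself.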
     
  19. MarketingGuy

    Newcomer

    Joined:
    Jul 6, 2007
    Messages:
    5
    Likes Received:
    0
    I think a lot of folks may be missing the obvious. Ask yourself, "What's the difference from a software point of view between a peer core and a synergistic core?" :idea: The right answer will tell you not only why the peer cores have a harder time achieving peak performance but also why the completion time of running on peer cores has so much more variability. I'm just MarketingGuy -- I leave it to you smart technical people to figure this out. :wink:
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    In the duel of marketing buzzwords, synergistic has more letters than peer, and more letters=better performance. ;)

    There are a lot of reasons, spanning from different design constraints, a more specialized platform, greater explicit control of resources, restricted coherence, restricted application space, and other variables as to why the given examples perform so well on Cell.

    Other designs decades ago did the same thing, long before someone put the word synergistic to Cell's cores.
     