First Cell Benchmarks

Discussion in 'CellPerformance@B3D' started by Supernatural, Nov 25, 2006.

  1. MarketingGuy

    Newcomer

    Joined:
    Jul 6, 2007
    Messages:
    5
    Likes Received:
    0
    All good possibilities. And I agree that more letters is better (although I would have chosen something better than SPU for an abbreviation). :wink:

    But that's the possible. Synthesize it down to a single probable. There is a simple answer here that has to do with the difference between the stack on top of a quad core and the stack on top of a Cell. (And, no, it's not that 9 > 4.)
     
  2. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    Ah, OK - in this example - a 2-Cell blade - there's still going to be off-chip communication going on,
    so it's not necessarily internal bandwidth. It's still having to divide up into multiple independent working sets, with main memory being the final point of sharing.

    On the 9 vs. 4 issue - what's the count of execution units (and the functionality per unit)?

    Well, the biggest difference from a software point of view is the manual async cache management / "distributed memory"?

    In porting regular code to DMA you're having to encode not just the locality, but also assumptions about object ownership per thread.
    It's almost like the concept of 'syntactic salt' at the program-design level - a hoop you have to jump through to prove you know what's going on.

    You've done this work at design/compile time, so the processor doesn't need to waste silicon/watts on guessing it at run time.

    Isn't it <20M transistors for an SPE, vs. approx 50M for a Xenon core or PPE? (And god knows how many for a Core 2... but I suppose you have to count execution units × clock etc. to make a comparison.) So that's where the benefit of the approach shows up...

    I know the Cell also gets a benefit over other processors from its clean SIMD-based instruction set... I've no idea how much that contributes here.

    The question some of my colleagues have is "but how much of this can you do with decent cache-control instructions?"
    - A properly shared L2 should be able to do the job of inter-LS transfers?
    - Prefetch + (hyperthreading?) could help deal with cache misses (without the expense of OoOE)?
    - Maybe increase the line size to get the effect of the larger DMA transfers?
    - The difference is the wasted control logic, right?

    When I say "I like the clarity" - I like the fact that you appear to be able to take more implementation decisions based on reasoning (one big factor being code coherence - you actually know the size of each module earlier) rather than on measuring random cache effects :) That seems to point to something being very 'right' about the Cell.

    What's Larrabee going to do? I had heard that they definitely won't do Cell-style DMA, "but for high throughput they will definitely extend the memory model; you won't be far off if you think of the Cell". Locked cache lines?
     
    #282 ebola, Jul 9, 2007
    Last edited by a moderator: Jul 9, 2007
  3. MarketingGuy

    Newcomer

    Joined:
    Jul 6, 2007
    Messages:
    5
    Likes Received:
    0

    Getting close. Think about management of misses a bit more... again in the context of stacks on peer cores versus a stack on Cell.

    There really is a simple "a-ha" here... so simple that even the Marketing Guy understands it. :grin:
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    Can you enlighten us as to this simple truth?

    I'd like to know which "ahas" have not yet been listed in this thread.
     
  5. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    So I was thinking about the alternative to Cell - "massive threading" to hide latencies - à la GPU or Niagara (I'm almost including OoOE as a type of threading here)...

    ...all those approaches are going to waste a lot of fast, close-to-ALU, on-chip memory on execution states that are actually idle, waiting to be re-activated when the requested data turns up? (Or be swapping them in and out...)

    By 'on-chip memory' I'm including (rename) registers or whatever. (GPUs have a lot of internal register renaming going on, don't they?)


    So - is that "the answer" you're getting at? Or is there something else?
    OK, this seems to be why I personally find the Cell more exciting than all the alternative devices - the large working set. 256 KB of fast memory seems very exciting in terms of scope for algorithms.
    And I suppose you could state this as 'Cell maximizes utilization of on-chip ALUs AND on-chip memory', which does sound like it's already been mentioned.

    With the Cell, of course, this 'management' is moved into sorting tasks by data. But that works just fine for game engines.

    Normal processors have prefetch, but you can't go prefetching entire chunks of a BSP tree or whatever like you can with the MFC.
     
    #285 ebola, Jul 12, 2007
    Last edited by a moderator: Jul 12, 2007
  6. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    I think you're referring to the manual control of data on Cell. On a traditional CPU memory reads are abstracted, you have no idea if data is in cache or in RAM.

    On Cell it's all explicit, you know what is in LS and what isn't. Because data transfer is explicit you can manage it so that processing is going on while data is being transferred, this allows you to completely hide RAM latency.

    Doing this on a traditional processor is possible, but with nowhere near the same degree of certainty - you can use cache-control instructions, but you don't actually know if the data gets to cache in time, or whether another task is going to switch in and flush the cache... On Cell you know what is in the LS; if you had pre-emptive task switching (actually you do, to a degree, now), your app will return with the LS in the same state it was in before the switch - you still know with certainty what is in the LS.

    This feeds into the point I made above - once you know things like this you realise how completely inefficient normal CPUs and programming are, it becomes natural to write optimised apps on Cell because doing it any other way is completely pointless.
     
  7. MarketingGuy

    Newcomer

    Joined:
    Jul 6, 2007
    Messages:
    5
    Likes Received:
    0
    OK, OK -- I just wanted to see how people are thinking about this. The answer I was looking for is just this:
    The OS runs on each of the peer threads and none of the synergistic threads.

    This leads to the observed effect that performance variability is a lot higher on a peer-type processor -- you never know when a computational thread will be interrupted to go serve a page fault for some other process. Sometimes you're lucky; sometimes you're not. Variability.

    Of course, it's possible not to run the OS on the peer cores, but neither the chips nor the OSes tend to be architected that way. Instead, the assumption of "peer" carries right through to having a full stack on every thread. (I had an intentional "slip" in my previous post, trying to give that away.:wink: ) Cell and the Linux support for Cell explicitly make the opposite assumption: while you could run an OS on an SPU thread, you probably don't want to.

    But, hey, I'm just a Marketing Guy. Maybe the above is just assumed by you all and is only news to me.
     
  8. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,672
    Likes Received:
    1,192
    Location:
    Maastricht, The Netherlands
    Not something I had thought about before. There'll probably be some people here on the forum who can confirm whether or not that has a real impact on the issue at hand, though - I wouldn't be surprised if it wasn't really applicable, but maybe it is?
     
  9. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    Yeah, pretty much. Hehe. Well, it was fun getting all the Cell's strengths listed in one thread...

    In the context of console games programming you're most likely to be comparing the PS3/Cell to its rival, the Xbox 360.

    For that machine we'd implemented a simple co-operative task manager rather than relying on system threads... and wrapped allocations in various specialized pools so that they're not hitting the page table - avoiding all that horrible variability. All standard stuff for games... so it's very unlikely that the OS is going to impact worker threads there.

    But clock for clock / core for core we are still seeing speedups when porting most code over (in varying states of optimization, very little 'fully tuned' due to time constraints). The explanations I have are [1] the hassle of porting forces you to create more coherence (and you exploit the large local store to avoid writing insane software caching), and [2] the ISA is very efficient, particularly at conditionals (I don't mean branches) and at mixing data types... (I think it's ironic that it gets knocked as harder to program - harder to port generic code, sure.)

    Actually, I suppose "not running the OS on the synergistic threads" could be considered as showing up in the simplification of the load/store pipeline - no TLB or caching logic to contend with.
     
    #289 ebola, Jul 13, 2007
    Last edited by a moderator: Jul 14, 2007