SPU and its little Atomic Cache Unit

Discussion in 'CellPerformance@B3D' started by patsu, May 22, 2007.

  1. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Ok... I was searching for PS3 DLNA information and came across the article. Decided to post it here after a quick read.

    Here's DeanoC hard at work on his blog... :)

    Make life easier? On PS3? Hmppphhh! :p

    P.S. Free beer from me next time you guys (or any of the pushing-the-envelope guys) stop by the Bay Area.

    EDIT: Holy Sh*t! Why didn't anyone highlight this before? It will make a huge difference.

     
  2. idsn6

    Regular

    Joined:
    Apr 14, 2006
    Messages:
    475
    Likes Received:
    117
    Wait, SPUs have caches? Little snoopy caches for the main memory address space?
    That's actually really cool.
     
  3. Kryton

    Regular

    Joined:
    Oct 26, 2005
    Messages:
    273
    Likes Received:
    8
    Yeah, remember you've got the EIB (a ring bus), which can push data between SPUs far faster than consulting either memory pool.
     
  4. idsn6

    Regular

    Joined:
    Apr 14, 2006
    Messages:
    475
    Likes Received:
    117
    Yeah, a snooping protocol actually makes a lot of sense for the EIB (high bandwidth for the coherence traffic, broadcasting to all SPUs).
     
  5. one

    one Unruly Member
    Veteran

    Joined:
    Jul 26, 2004
    Messages:
    4,823
    Likes Received:
    153
    Location:
    Minato-ku, Tokyo
  6. DeanA

    Newcomer

    Joined:
    Oct 26, 2005
    Messages:
    244
    Likes Received:
    36
    Location:
    Cambridge, UK
    I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

    So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS, as that simply doesn't happen in the case of LS->LS DMA.
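
    Roughly, a direct LS-to-LS transfer might look something like this with the Cell SDK's spu_mfcio.h intrinsics (a sketch only; the buffer and function names are illustrative, and the remote LS effective address is assumed to have been handed over by the PPE, e.g. via libspe2's spe_ls_area_get()):

        #include <spu_mfcio.h>
        #include <stdint.h>

        /* Source buffer in this SPU's local store, 128-byte aligned. */
        static volatile uint8_t out_buf[4096] __attribute__((aligned(128)));

        /* remote_ls_ea: effective address of the destination SPU's local store,
           assumed to be supplied by the PPE (e.g. via libspe2's spe_ls_area_get()). */
        void send_to_other_spu(uint64_t remote_ls_ea, uint32_t remote_offset,
                               uint32_t size, uint32_t tag)
        {
            /* DMA straight from this LS into the other SPU's LS; the transfer
               travels over the EIB rather than round-tripping through main memory. */
            mfc_put(out_buf, remote_ls_ea + remote_offset, size, tag, 0, 0);

            /* Block until the tag group completes before reusing out_buf. */
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();
        }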

    And surely using the SPU cache in this way pretty much requires that, in order to run at full speed, the other SPUs you're communicating with aren't evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.

    Cheers,
    Dean
     
  7. idsn6

    Regular

    Joined:
    Apr 14, 2006
    Messages:
    475
    Likes Received:
    117
    To my naive mind, LS->LS DMA still requires more programmer synchronization between SPUs than the ACU does in some cases, so it seems more a matter of convenience than speed... though if, as you say, DMAs to and from main memory can evict lines in the cache, then that is a bit of a bummer. I assumed that the ACU locked the atomic lines in while other DMAs went directly between LS and main memory, though I guess I had no reason to think that.
     
  8. DeanA

    Newcomer

    Joined:
    Oct 26, 2005
    Messages:
    244
    Likes Received:
    36
    Location:
    Cambridge, UK
    Hmm... I thought that the ACU shares some bits with the DMA subsystem, but hey... irrespective of this, if other SPUs are doing things (unrelated to the stats update), then it would be possible for entries to become evicted.

    Probably wouldn't affect things too much though, to be honest...

    Dean
     
  9. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    'Cos it's very hard to do LS->LS DMA in real-world usage (you need a static memory layout and synchronised tasks). In practice you do an LS->EA on one SPU and an EA->LS on another. If you're lucky these occur at the same time, so it gets short-circuited; otherwise it goes back into the main cache/memory system. Though atomic put/get is higher priority, so it should be faster for 128 bytes than an LS->LS DMA anyway...

    The ACU cache gives you a place to leave the data, effectively on the ring bus, for a while without knowing any details of the destination. It's partly LRU and AFAICT doesn't get evicted by a normal DMA get, though a put does evict it. It's also a high-speed ring-bus op, faster than normal ring-bus movement. So it should always be better than, or the same as, a normal get.

    It's not perfect, but it does appear to be better than the alternatives 'most' of the time. Which is true of all caches, really.
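
    For the shared-counter case described here, the usual SPU-side pattern is a getllar/putllc retry loop over a single 128-byte line. A minimal sketch, assuming the Cell SDK's spu_mfcio.h intrinsics and a hypothetical 128-byte-aligned effective address counter_ea for the shared counter:

        #include <spu_mfcio.h>
        #include <stdint.h>

        /* Local copy of the 128-byte lock line; must itself be 128-byte aligned. */
        static volatile uint8_t line[128] __attribute__((aligned(128)));

        /* Atomically add delta to a 32-bit counter at the start of the 128-byte
           line at counter_ea (a hypothetical shared effective address). */
        uint32_t atomic_add(uint64_t counter_ea, uint32_t delta)
        {
            uint32_t old;
            do {
                /* Get the lock line and set a reservation (served by the ACU). */
                mfc_getllar(line, counter_ea, 0, 0);
                (void)mfc_read_atomic_status();   /* wait for the getllar to complete */

                old = *(volatile uint32_t *)line;
                *(volatile uint32_t *)line = old + delta;

                /* Conditionally put the line back; this fails if someone else
                   touched the line and stole the reservation in the meantime. */
                mfc_putllc(line, counter_ea, 0, 0);
            } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);
            return old;
        }

    If DeanoC's description above is right, when the line is still hot in an ACU the whole exchange stays on the EIB instead of going out to main memory, which is where the win comes from.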
     
  10. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    LOL. I vaguely remember your reply but I didn't quite grasp it last time.

    Ok cool... I have confirmation about efficient LS<->LS DMA (without PPE or other external subsystem involvement). The gain from the cache would be relatively smaller if so.

    What is the time saved with an atomic cache write/read (cache hit) versus an LS atomic store/read (cache miss) for multiple SPUs?

    Yes, it seems so. The algorithm in question should be pretty regular/predictable (some globally shared data structure needs to be consulted/updated "every time"). In DeanoC's case it looks to be the death/alive counter.

    EDIT: Ah ! DeanoC replied with more juicy details. :D
     
  11. homy

    Banned

    Joined:
    Jan 20, 2007
    Messages:
    136
    Likes Received:
    4
    From http://blog.deanoc.com/

    An interesting read.

    This atomic cache implementation has been used in parallel processors for a while, but this is the first time it's been used in a consumer product.
     
  12. mech

    Regular

    Joined:
    Feb 12, 2002
    Messages:
    535
    Likes Received:
    0
    Nice find and an interesting read, thanks!
     
  13. archie4oz

    archie4oz ea_spouse is H4WT!
    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    1,608
    Likes Received:
    30
    Location:
    53:4F:4E:59
    Well not much of a "find" considering Deano links to his blog in his sig... :p
     
  14. StefanS

    StefanS meandering Velosoph
    Veteran

    Joined:
    Apr 20, 2002
    Messages:
    3,608
    Likes Received:
    75
    Location:
    Vienna
    Since this has been posted in the Heavenly Sword thread already and there have been some really nice responses (thanks DeanA, DeanoC, etc.), I've decided to copy the posts over here, since some might miss them otherwise.
     
  15. LunchBox

    Regular

    Joined:
    Mar 13, 2002
    Messages:
    901
    Likes Received:
    8
    Location:
    California
    WOW! I learned something new today :) Thanks for the article, it's really a good read. And thanks for making it into a thread, 'cos I rarely go to the HS thread; there are too many posts to skim through just to get to the juicy parts :)
     
  16. ban25

    Veteran

    Joined:
    Apr 7, 2002
    Messages:
    1,380
    Likes Received:
    6
    Location:
    San Francisco, CA
    This reminds me, any fellow game developers in the Bay Area may want to check out the SF game dev meetup. It used to be at Thirsty Bear, but now it's at the Metreon. The next meeting should be around mid-June (check out the site for details). There are lots of local developers in attendance. It's an informal get-together, so please, no solicitors (i.e. people trying to sell middleware) or journalists.
     
  17. mech

    Regular

    Joined:
    Feb 12, 2002
    Messages:
    535
    Likes Received:
    0
    Hah, yeah, I figured I phrased that wrong after I wrote it, but couldn't be bothered editing it :) It's been a while since I've been on Beyond3D so I haven't seen what Deano's been up to lately...
     
  18. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    This ACU: does it get used for successive non-128-byte-aligned DMAs, or just atomic ops?

    E.g. let's say you're streaming through a list of 96-byte objects*: do the crossover cache lines get buffered instead of adding main-memory accesses for the overlap?
    Up until now I've been thinking in terms of manually buffering this sort of data with larger 128-byte-aligned loads (i.e. to get multiple misaligned objects in together, back to back).
     
  19. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    Just atomic ops; normal DMA gets don't go through it (though a normal DMA put will clear it).

    The atomic ops have to be 128-byte aligned as well; when the SPU does a <128-byte atomic it grabs the whole line and then masks out the bit you want.

    So I suspect it's not going to be very helpful in the 96-byte case.
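
    So the manual batching ebola mentions still looks like the way to go for that layout. A rough sketch, assuming the Cell SDK's mfc_get and a hypothetical 128-byte-aligned array of 96-byte objects at ea_base; fetching several objects per aligned DMA avoids issuing unaligned per-object transfers:

        #include <spu_mfcio.h>
        #include <stdint.h>

        #define OBJ_SIZE        96
        #define OBJS_PER_BATCH  8                      /* 8 * 96 = 768, a multiple of 128 */
        #define BATCH_BYTES     (OBJ_SIZE * OBJS_PER_BATCH)

        /* LS staging buffer for one batch, 128-byte aligned. */
        static volatile uint8_t batch_buf[BATCH_BYTES] __attribute__((aligned(128)));

        /* Pull in a whole batch of 96-byte objects with one aligned DMA.
           ea_base is a hypothetical 128-byte-aligned effective address of the array. */
        void fetch_batch(uint64_t ea_base, uint32_t batch_index, uint32_t tag)
        {
            uint64_t ea = ea_base + (uint64_t)batch_index * BATCH_BYTES;

            mfc_get(batch_buf, ea, BATCH_BYTES, tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();

            /* Object i of this batch now starts at &batch_buf[i * OBJ_SIZE]. */
        }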
     