Old 22-May-2007, 07:20   #1
patsu
Regular
 
Join Date: Jun 2005
Posts: 15,103
SPU and its little Atomic Cache Unit

Ok... I was searching for PS3 DLNA information and came across the article. Decided to post it here after a quick read.

Here's DeanoC hard at work on his blog...

Quote:
Atomic Cache Unit
Never heard of it right?

Well, it's one of my favorite things on the PS3 and gets little love cos it's one of those tiny features that make life so much nicer.
Make life easier ? on PS3 ? Hmppphhh !

P.S. Free beer from me next time you guys (or any of the pushing-the-envelope guys) stop by Bay Area.

EDIT: Holy Sh*t ! Why didn't anyone highlight this before ? It will make a huge difference.

Quote:
The ACU(s) are a part of each SPU that allows atomic updates to occur very quickly. It appears fairly simple: each SPU has 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache). The 512 bytes are divided into four 128-byte lines. The MFC (DMA) unit can bring a cache line in from memory and mark it reserved… if another processor writes to the same bit of memory, the reservation is lost and you know to repeat the read/modify/write cycle to guarantee atomicity.

All good, but what's really clever is how it's implemented. If another SPU asks for that bit of memory, it gets it from another SPU's cache, if it's in there. So you effectively have a fast SHARED 512 bytes. When an SPU writes, the other SPUs only have to read from the writing SPU's cache rather than DMAing it back to main memory and DMAing it into LS. Cuts down a lot of memory traffic. I even abuse it and just use it as a conventional cache and communication channel between SPUs. You can push a lot of data around with a fast 128-byte path.
Old 22-May-2007, 08:16   #2
idsn6
Member
 
Join Date: Apr 2006
Posts: 215

Quote:
Originally Posted by patsu View Post
EDIT: Holy Sh*t ! Why didn't anyone highlight this before ? It will make a huge difference.
Wait, SPUs have caches? Little snoopy caches for main memory address space?
That's actually really cool.
Old 22-May-2007, 08:37   #3
Kryton
Member
 
Join Date: Oct 2005
Posts: 273

Quote:
Originally Posted by idsn6 View Post
Wait, SPUs have caches? Little snoopy caches for main memory address space?
That's actually really cool.
Yeah, remember you've got the EIB (a token ring bus) which can push data between SPUs far faster than consulting either memory pool.
Old 22-May-2007, 08:41   #4
idsn6
Member
 
Join Date: Apr 2006
Posts: 215

Quote:
Originally Posted by Kryton View Post
Yeah, remember you've got the EIB (a token ring bus) which can push data between SPUs far faster than consulting either memory pool.
Yeah, a snooping protocol actually makes a lot of sense for the EIB (high bandwidth for the coherence traffic, broadcasting to all SPUs).
Old 22-May-2007, 09:10   #5
one
Unruly Member
 
Join Date: Jul 2004
Location: Bunkyo-ku
Posts: 4,694

Quote:
Originally Posted by patsu View Post
Holy Sh*t ! Why didn't anyone highlight this before ? It will make a huge difference.
Hehehe
http://forum.beyond3d.com/showpost.p...89&postcount=4
Old 22-May-2007, 09:13   #6
DeanA
Member
 
Join Date: Oct 2005
Location: Cambridge, UK
Posts: 236

Quote:
Originally Posted by patsu View Post
It will make a huge difference.
I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS. As that simply doesn't happen in the case of LS->LS DMA.

And surely the utilisation of the SPU cache in this way pretty much requires that in order to run at full speed the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.

Cheers,
Dean
__________________
Opinions I share here are my own, and should not be incorrectly interpreted as the views of SCEE, SCE, or Sony Corporation.
Old 22-May-2007, 09:49   #7
idsn6
Member
 
Join Date: Apr 2006
Posts: 215

Quote:
Originally Posted by DeanA View Post
I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS. As that simply doesn't happen in the case of LS->LS DMA.

And surely the utilisation of the SPU cache in this way pretty much requires that in order to run at full speed the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.

Cheers,
Dean
LS->LS DMA still requires more programmer synchronization between SPUs than the ACU to my naive mind for some cases, so it seems more a matter of convenience than speed...though if, as you say, DMAs with memory can erase lines in the cache then that is a bit of a bummer. I assumed that the ACU locked the atomic lines in while other DMAs went directly between LS and main memory, though I guess I had no reason to think this.
Old 22-May-2007, 10:07   #8
DeanA
Member
 
Join Date: Oct 2005
Location: Cambridge, UK
Posts: 236

Quote:
Originally Posted by idsn6 View Post
I assumed that the ACU locked the atomic lines in while other DMAs went directly between LS and main memory, though I guess I had no reason to think this.
Hmm.. I thought that the ACU shares some bits with the DMA subsystem, but hey.. irrespective of this, if other SPUs are doing things (unrelated to stats update), then it would be possible for entries to become evicted.

Probably wouldn't affect things too much though, to be honest..

Dean
Old 22-May-2007, 10:50   #9
DeanoC
Senior Member
 
Join Date: Feb 2003
Location: Cambridge, UK
Posts: 1,396

Quote:
Originally Posted by DeanA View Post
I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS. As that simply doesn't happen in the case of LS->LS DMA.

And surely the utilisation of the SPU cache in this way pretty much requires that in order to run at full speed the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.

Cheers,
Dean
Cos it's very hard to do LS->LS DMA in real-world usage (you need a static memory layout and synchronised tasks). In practice you do an LS->EA on one SPU and an EA->LS on another. If you're lucky this occurs at the same time so it's shortcut, else it goes back into the main cache/memory system. Tho atomic put/get is higher priority so should be faster for 128 bytes than an LS->LS DMA anyway...

The ACU cache gives you a place to leave the data effectively on the ring bus for a while without knowing any details of the destination. It's partly LRU and AFAICT doesn't get evicted via normal DMA get, tho put does. It's also a high-speed ring bus op, faster than normal ring bus movement. So it should always be better than or the same as a normal get.

It's not perfect but it does appear to be better than the alternatives 'most' of the time. Which is true of all caches really.
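
To make the eviction behaviour discussed above a bit more concrete, here's a toy model in Python. It's only a sketch under assumptions: four 128-byte lines per SPU (from the blog post), plain LRU replacement (the real thing is described as only "partly" LRU), normal DMA gets bypassing the cache, and a normal DMA put dropping the matching line. The class and method names are made up for illustration; none of this is the real MFC interface.

```python
from collections import OrderedDict

LINES = 4        # four lines per SPU, per the blog post
LINE_SIZE = 128  # bytes per line


class ToyAtomicCache:
    """Toy model of one SPU's atomic cache (assumed pure LRU)."""

    def __init__(self):
        # line base address -> None, ordered from least to most recently used
        self.lines = OrderedDict()

    def atomic_access(self, addr):
        """Return True on a hit. A miss fills a line, evicting the LRU one."""
        line = addr - addr % LINE_SIZE
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            if len(self.lines) == LINES:
                self.lines.popitem(last=False)  # drop least recently used
            self.lines[line] = None
        return hit

    def normal_get(self, addr):
        """Assumption: a normal DMA get never touches the atomic cache."""

    def normal_put(self, addr):
        """Assumption: a normal DMA put drops the matching line, if cached."""
        self.lines.pop(addr - addr % LINE_SIZE, None)
```

With this model, a shared line stays resident as long as an SPU touches no more than four distinct lines atomically and nobody does a normal put over it. A fifth atomic line pushes the oldest one out, which is the "needs to be fairly static" concern raised earlier in the thread.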
__________________
Spouter of random rubbish @ http://blog.deanoc.com
Old 22-May-2007, 11:31   #10
patsu
Regular
 
Join Date: Jun 2005
Posts: 15,103

Quote:
Originally Posted by one View Post
LOL. I vaguely remember your reply but I didn't quite grasp it last time.

Quote:
Originally Posted by DeanA
I'm not sure it'll make a huge difference... in fact, I'm interested as to why you think it would! Even without keeping this data in the 4-entry cache, it's my understanding that full LS-to-LS DMAs stay on the EIB... they don't go via main memory.

So bearing that in mind, I'm not sure why Deano is describing a system where data goes out from LS, to main memory, and back to LS. As that simply doesn't happen in the case of LS->LS DMA.
Ok cool... I have confirmation about efficient LS<->LS DMA (without PPE or other external subsystem involvement). The gain from the cache would be relatively smaller if so.

What is the time saved between an atomic cache write/read (cache hit) versus a LS atomic store/read (cache miss) for multiple SPUs ?

Quote:
And surely the utilisation of the SPU cache in this way pretty much requires that in order to run at full speed the other SPUs you're communicating with are not evicting cache contents by performing other DMAs? So your system needs to be pretty static in terms of DMA usage to reap the full benefit of what is described.
Yes, it seems so. The algorithm in question should be pretty regular/predictable (some globally shared data structure needs to be consulted/updated "every time"). In DeanoC's case it looks to be the dead/alive counter.

EDIT: Ah ! DeanoC replied with more juicy details.
Old 23-May-2007, 05:08   #11
homy
Naughty Boy!
 
Join Date: Jan 2007
Posts: 136
SPU and its little Atomic Cache Unit

Quote:
Never heard of it right?

Well, it's one of my favorite things on the PS3 and gets little love cos it's one of those tiny features that make life so much nicer.

Atomic ops refer to the most important principle in multithreading. It says that a single processor must read/modify/write without another thread interrupting (hence atomic). Without atomicity, multi-core systems are much harder (if not near impossible).

The ACU(s) are a part of each SPU that allows atomic updates to occur very quickly. It appears fairly simple: each SPU has 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache). The 512 bytes are divided into four 128-byte lines. The MFC (DMA) unit can bring a cache line in from memory and mark it reserved… if another processor writes to the same bit of memory, the reservation is lost and you know to repeat the read/modify/write cycle to guarantee atomicity.

All good, but what's really clever is how it's implemented. If another SPU asks for that bit of memory, it gets it from another SPU's cache, if it's in there. So you effectively have a fast SHARED 512 bytes. When an SPU writes, the other SPUs only have to read from the writing SPU's cache rather than DMAing it back to main memory and DMAing it into LS. Cuts down a lot of memory traffic. I even abuse it and just use it as a conventional cache and communication channel between SPUs. You can push a lot of data around with a fast 128-byte path.
And the nicest thing about it… It just works… All the cache snooping, routing etc. all just happens magically inside the chip. You say ATOMIC_GET and ATOMIC_SET and treat yourself to a 512-byte shared cache.
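
For anyone who wants the reserve/retry cycle spelled out, here's a minimal Python sketch. It's a toy model, not the real hardware or SDK: `ToyMemory`, `get_reserved` and `put_conditional` are hypothetical names standing in for the ATOMIC_GET and ATOMIC_SET mentioned above, and the counter is assumed to live in the first four bytes of the line as a big-endian integer.

```python
LINE_SIZE = 128  # bytes per line, per the post


class ToyMemory:
    """Main memory plus per-SPU reservations (a toy, not the real MFC)."""

    def __init__(self):
        self.lines = {}         # line address -> 128 bytes of data
        self.reservations = {}  # spu id -> line address it has reserved

    def get_reserved(self, spu, addr):
        """Hypothetical stand-in for ATOMIC_GET: read a line and reserve it."""
        self.reservations[spu] = addr
        return self.lines.get(addr, bytes(LINE_SIZE))

    def put_conditional(self, spu, addr, data):
        """Hypothetical stand-in for ATOMIC_SET: fails if the reservation was lost."""
        assert len(data) == LINE_SIZE
        if self.reservations.get(spu) != addr:
            return False
        self.lines[addr] = data
        # A successful write kills every reservation held on this line.
        self.reservations = {
            s: a for s, a in self.reservations.items() if a != addr
        }
        return True


def atomic_add(mem, spu, addr, amount, interfere=None):
    """Read/modify/write, repeated until the conditional put succeeds.

    Returns the number of attempts it took.
    """
    attempts = 0
    while True:
        attempts += 1
        line = mem.get_reserved(spu, addr)
        value = int.from_bytes(line[:4], "big") + amount
        if interfere is not None:
            interfere()       # simulate another SPU writing in between
            interfere = None  # only interfere on the first attempt
        new_line = value.to_bytes(4, "big") + line[4:]
        if mem.put_conditional(spu, addr, new_line):
            return attempts
```

If a second SPU sneaks a write in between the get and the put, the first SPU's put fails, it re-reads the fresh line and tries again, so no update is lost.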

So, for example, for some of the army stuff, I need statistics to be kept (things like how many dead, KO'ed etc.)

These are 128-byte structures that each SPU reads/writes as necessary. When you first look at it, it seems like it would be really slow if not for the ACU. All those 128-byte DMAs: every time I need to add a number I'm doing two 128-byte DMAs (one read, one write), but because the line is sitting inside an SPU cache most of the time, it ends up just being EIB ring traffic. And that's fast, really really fast.

I just have a shared counter statistics system that all works even tho I can be making 100s of atomic updates per frame.
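
As a rough illustration of what such a 128-byte stats line could look like, here's a small Python sketch. The layout is purely an assumption (32 big-endian 32-bit counters filling one line), and the slot names `dead` and `ko` are hypothetical; the blog post doesn't give the actual structure.

```python
import struct

LINE_SIZE = 128
STATS_FORMAT = ">32I"          # assumed: 32 big-endian uint32 counters
FIELDS = {"dead": 0, "ko": 1}  # hypothetical slot assignment


def empty_stats():
    """A zeroed stats line."""
    return bytes(LINE_SIZE)


def bump(line, field, amount=1):
    """Return a new 128-byte line with one counter increased."""
    counters = list(struct.unpack(STATS_FORMAT, line))
    counters[FIELDS[field]] += amount
    return struct.pack(STATS_FORMAT, *counters)


def read(line, field):
    """Read one counter out of a stats line."""
    return struct.unpack(STATS_FORMAT, line)[FIELDS[field]]


def bytes_moved(updates):
    """Each update is one 128-byte read plus one 128-byte write."""
    return updates * 2 * LINE_SIZE
```

By that arithmetic, a few hundred updates per frame is only tens of kilobytes of transfers, and per the post most of those stay on the EIB rather than going out to main memory.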

Nice one… Whoever at STI added that bit of hardware deserves a beer from me.
from http://blog.deanoc.com/


An interesting read

This kind of atomic cache implementation has been used in parallel processors for a while, but this is the first time it's been used in a consumer product.
Old 23-May-2007, 05:20   #12
mech
Member
 
Join Date: Feb 2002
Posts: 532

Nice find and an interesting read, thanks!
Old 23-May-2007, 06:46   #13
archie4oz
ea_spouse is H4WT!
 
Join Date: Feb 2002
Location: 53:4F:4E:59
Posts: 1,566

Well not much of a "find" considering Deano links to his blog in his sig... :P
__________________
"The sooner someone gets sued by Intel for violation, the sooner the patent can be revoked from orbit for gratuitous and wanton disregard for prior art and obviousness." ~TomF
Old 23-May-2007, 06:52   #14
StefanS
meandering Velosoph
 
Join Date: Apr 2002
Location: Vienna
Posts: 3,677

Since this has already been posted in the Heavenly Sword thread and there have been some really nice responses (thanks DeanA, DeanoC, etc.), I've decided to copy the posts over here, since some might miss them.
__________________
"Anybody can be a glutton, but only a true cyclist is a bottomless pit." - Ken Kifer (R.I.P.)

"I think you'll find the improved video is a part of Sony's integration of the cutting edge Placebo technology. They've integrated it into all firmwares and this fabulous system provides all sorts of minor upgrades at very little developer cost. Great stuff!" - Shifty Geezer
Old 23-May-2007, 09:56   #15
LunchBox
Member
 
Join Date: Mar 2002
Location: California
Posts: 884

WOW! I learned something new today. Thanks for the article, it's a really good read. And thanks for making it into a thread, coz I rarely go to the HS thread; there are too many posts to skim through just to get to the juicy parts.
__________________
"the anonymity of the internet gives people the ability to grow e-balls..."
Old 23-May-2007, 19:17   #16
ban25
Senior Member
 
Join Date: Apr 2002
Location: San Francisco, CA
Posts: 1,224

Quote:
Originally Posted by patsu View Post
P.S. Free beer from me next time you guys (or any of the pushing-the-envelope guys) stop by Bay Area.
This reminds me, any fellow game developers in the Bay Area may want to check out the SF game dev meetup. It used to be at Thirsty Bear, but now it's at the Metreon. The next meeting should be around mid-June (check out the site for details). There are lots of local developers in attendance. It's an informal get together, so please, no solicitors (i.e. people trying to sell middleware) or journalists.
Old 24-May-2007, 00:25   #17
mech
Member
 
Join Date: Feb 2002
Posts: 532

Quote:
Originally Posted by archie4oz View Post
Well not much of a "find" considering Deano links to his blog in his sig... :P
Hah, yeah, I figured I phrased that wrong after I wrote it, but couldn't be bothered editing it. It's been a while since I've been on Beyond3D so I haven't seen what Deano's been up to lately...
Old 28-May-2007, 22:30   #18
ebola
Junior Member
 
Join Date: Dec 2006
Posts: 55

This ACU - does it get used for successive non-128-byte-aligned DMAs, or just atomic ops?

e.g. let's say you're streaming through a list of 96-byte objects*: do the crossover cache lines get buffered instead of adding main-memory accesses for the overlap?
Up until now I've been thinking in terms of manually buffering this sort of data with larger 128-byte-aligned loads (i.e. to get multiple misaligned objects in together, back to back).
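
The manual buffering idea can be sketched with a bit of address arithmetic. This is just an illustration in Python with made-up helper names (`aligned_window`, `straddlers`), assuming 128-byte lines and objects packed back to back:

```python
LINE_SIZE = 128


def aligned_window(start, obj_size, count):
    """Smallest 128-byte-aligned [begin, end) range covering `count` objects."""
    begin = start - start % LINE_SIZE
    end_raw = start + obj_size * count
    end = -(-end_raw // LINE_SIZE) * LINE_SIZE  # round up to a line boundary
    return begin, end


def straddlers(obj_size, count, start=0):
    """How many of the objects cross a 128-byte line boundary."""
    n = 0
    for i in range(count):
        first_line = (start + i * obj_size) // LINE_SIZE
        last_line = (start + (i + 1) * obj_size - 1) // LINE_SIZE
        n += first_line != last_line
    return n
```

For 96-byte objects starting on a line boundary, two out of every four objects straddle two lines, which is why fetching one aligned window for a whole batch looks more attractive than fetching each object separately.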
Old 30-May-2007, 10:40   #19
DeanoC
Senior Member
 
Join Date: Feb 2003
Location: Cambridge, UK
Posts: 1,396

Quote:
Originally Posted by ebola View Post
This ACU - does it get used for successive non-128-byte-aligned DMAs, or just atomic ops?

e.g. let's say you're streaming through a list of 96-byte objects*: do the crossover cache lines get buffered instead of adding main-memory accesses for the overlap?
Up until now I've been thinking in terms of manually buffering this sort of data with larger 128-byte-aligned loads (i.e. to get multiple misaligned objects in together, back to back).
Just atomic ops; normal DMA gets don't go through it (tho a normal DMA put will clear it).

The atomic ops have to be 128-byte aligned as well. When the SPU does a <128-byte atomic, it grabs the whole line and then masks out the bit you want.

So I suspect it's not going to be very helpful in the 96-byte case.
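
Here's a small Python sketch of the "grab the whole line, mask out the bit you want" step. It's an illustration only; `split_address` and `extract` are hypothetical helpers, not SDK calls, and the one hard assumption is the 128-byte line size from the posts above.

```python
LINE_SIZE = 128


def split_address(ea):
    """Split an effective address into (line base, byte offset in the line)."""
    return ea & ~(LINE_SIZE - 1), ea & (LINE_SIZE - 1)


def extract(line, ea, size):
    """Pick `size` bytes for address `ea` out of an already-fetched line."""
    _, offset = split_address(ea)
    if offset + size > LINE_SIZE:
        raise ValueError("value crosses a 128-byte line boundary")
    return line[offset:offset + size]
```

A 4-byte counter anywhere inside a line works fine, but a 96-byte object starting at offset 96 spills into the next line, so a single atomic line can't cover it.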