Was Cell any good? *spawn

Status
Not open for further replies.
Oh my... he is a God for me :oops: Pretty curious to know whether FXAA would be possible on the PS3 through the SPUs, and how it would work.

Highly doubt there'd be any advantage to doing the texture ops, branching etc. that FXAA employs, and even so, you're talking about trying to beat 1-2 ms of GPU time. Plus, you'd still need the frame texture for input, just like MLAA does on SPUs...
 
Hmm... it should be common for SPUs to tear through datasets larger than 128K by streaming/staggering the data via DMA. The main issue is random access data, or data with too many dependencies (can't fetch early enough or in parallel). Most graphics jobs are highly parallelizable.
Yes, of course. But the traditional cache based CPUs also perform very well if data access pattern is predictable (assuming proper manual cache hints). SPU should be faster when doing random access to 128K dataset, since that doesn't fit inside the PPC 32KB L1d. However when random data access to a bigger data set is needed, PPC cores should outperform SPU by a wide margin (L2 access is considerably faster than constantly transferring random data blocks from/to memory).

Oh my... he is a God for me :oops: Pretty curious to know whether FXAA would be possible on the PS3 through the SPUs, and how it would work.
FXAA is filled with bilinear filtering hacks (fetch+blend 2x2 pixels at once). It has been purposefully designed for fast GPU execution. You could run it on a CPU, but the GPU version is hard to beat (it's only 1 millisecond on current consoles, and a fraction of that on high end hardware).
 
However when random data access to a bigger data set is needed, PPC cores should outperform SPU by a wide margin (L2 access is considerably faster than transferring data from/to memory).

To make this more interesting, could you give a typical example of that type of large-dataset random data access? I sometimes have a suspicion that there often are efficient streaming alternatives, but at the same time it is very likely that there are some cases where PPUs win easily, and I would like to have a better understanding of what type of work that is.
 
Yes, of course. But the traditional cache based CPUs also perform very well if data access pattern is predictable (assuming proper manual cache hints). SPU should be faster when doing random access to 128K dataset, since that doesn't fit inside the PPC 32KB L1d. However when random data access to a bigger data set is needed, PPC cores should outperform SPU by a wide margin (L2 access is considerably faster than constantly transferring random data blocks from/to memory).

If the cache is shared by multiple cores, it will impact performance overall once you lock (part of) it for the "co-processors".

Bigger dataset is not the problem if the access pattern is predictable. The data will just be streamed in while the SPU gets busy with the other half of data already in the Local Store.
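The streaming pattern described above can be sketched in plain C. This is a host-side mock, not real SPU code: `dma_get` stands in (synchronously) for an asynchronous `mfc_get`, and the names `dma_get`/`process_stream` are made up for illustration.

```c
#include <string.h>

#define CHUNK 64  /* elements per local buffer; a real LS buffer is far larger */

/* Mock "DMA get": on a real SPU this would be an asynchronous, tagged
 * mfc_get transfer; here a synchronous memcpy stands in for it. */
void dma_get(float *local, const float *main_mem, int n) {
    memcpy(local, main_mem, (size_t)n * sizeof(float));
}

/* Sum a stream of `total` floats (assumed a multiple of CHUNK),
 * ping-ponging between two local buffers the way an SPU kernel would:
 * while one half is being processed, the other half is being filled. */
float process_stream(const float *data, int total) {
    float buf[2][CHUNK];
    float sum = 0.0f;
    int nchunks = total / CHUNK;

    dma_get(buf[0], data, CHUNK);                /* prime buffer 0 */
    for (int c = 0; c < nchunks; c++) {
        if (c + 1 < nchunks)                     /* start the next transfer early */
            dma_get(buf[(c + 1) & 1], data + (c + 1) * CHUNK, CHUNK);
        const float *cur = buf[c & 1];           /* process the buffer that's ready */
        for (int i = 0; i < CHUNK; i++)
            sum += cur[i];
    }
    return sum;
}
```

On real hardware the two `dma_get` calls would use different tag groups, with a tag-status wait before touching each buffer, so the transfer genuinely overlaps the compute.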

If data is accessed randomly, cache hits may be rare too. For problems like that, the devs may have to rely more on the larger number of cores for a speedup. But there should still be opportunities to batch the data.

Cell was used in IBM's Roadrunner supercomputer to solve supercomputing applications in place of regular CPUs (with vector engines) during that time period. They don't deal with small datasets there.
 
I don't seem to be good at short posts.

Going by elf sizes, around 20% of my game is SPU code.
Thanks for sharing that! Do you know how much that is in LOC, and how much of that is code you wrote vs. stuff from the SDK? 20% is a lot, but I really don't know how elf sizes compare.

I'm going to disagree with both your take on it and sebbi's.
That's cool. I may have failed my reading comprehension check, but I'm not sure which parts you disagree with and what your opinion on them is. Care to elaborate?

Add dynamic branching? add integer units?
Both already in there. It doesn't have a branch predictor or an integer divide instruction, but that's about it. Instead of the predictor, there are branch hints, so you don't usually need to take a hit for a branch. This can be tricky to get right, but mostly in cases where a predictor would be completely lost.

Agreed. SPUs should perform better when the data set required for each pixel is larger than L1d (32 KB), but smaller than local store size (or half of it minus code = ~128 KB, since you need double buffering to load next data while processing old one). But once the data set is larger than local store size, the VMX128 processing would simply start using the 1 MB L2 cache instead of the faster L1d, while the SPU code would need to frantically swap data to/from main memory (slowing it down to a crawl).

I'm still scratching my head a bit at the 128KB LS overhead, but I'm willing to accept that for some people this is how it works out. As we're sort of talking about cool hardware capabilities, let me say a word or two about double buffering and SPUs.
The way memory transfers down to main memory and VRAM work is that the MFC takes care of them. The SPU controls its MFC through a channel interface which is part of the ODD pipe (which is roughly the same as VXU Type 2 for you), meaning that you can issue commands to the MFC as ODD instructions.
If you are so inclined, you can put these channel commands into your regular processing loop. The MFC can queue up 16 DMAs, so this effectively gives an extremely controlled prefetch system, at the cost of some ODD pressure. Most people have ODD to spare, so it's more a case of working out the latencies and carefully adding the commands, as you described it for VMX. It's really very much the same thing, save for the added control and things like DMA lists.
What that means is that there is no need to do just simple double buffering. You can do pretty sophisticated loops if you need to. I've personally never had a situation where I needed to roll the channel commands in with the assembly, as this is really something you only need when you have sparse (i.e. random) memory accesses. People tell me that it works rather well, however.
I don't actually know how MFC DMAs compare to L2 fetches from memory in terms of latency, so it might come out even.
Programmers who are not used to doing these things often leave quite a bit of performance on the table. You're probably used to seeing the same thing with people who don't use the prefetch instruction on 360.

It's interesting to compare different vector architectures.
Absolutely. They are usually so much cleaner designs than the super-scalar OoO cores, where you can't really predict what's going to happen. :)

So he'll be working at a lower level on SPUs than anyone being in the luxurious position of just having to develop technologies without product deadlines to worry about [...]
It's not lower level than what other developers do. Every PS3 developer can get as low level as they want. Also what is this luxurious position without product deadlines you're talking about? I'm intrigued. ;)

Oh my... he is a God for me
:oops:

I did notice a recent lack of offerings at my shrine...

Pretty curious to know whether FXAA would be possible on the PS3 through the SPUs, and how it would work.

I'm pretty sure it could be done, although it may not be worth it. Last time I checked it was algorithmically a lot simpler than MLAA but very tailored to GPU ISAs. This is what makes it possible to do this nice and simple integration for it and what makes it run well on GPUs. Full blown MLAA is extremely hard to do on GPUs, which is why nobody is doing it. :)

This actually harkens back to what we talked about earlier, as MLAA is one of those pesky algorithms that needs the entire scanline more than just once.
 
What that means is that there is no need to do just simple double buffering. You can do pretty sophisticated loops if you need to. I've personally never had a situation where I needed to roll the channel commands in with the assembly, as this is really something you only need when you have sparse (i.e. random) memory accesses. People tell me that it works rather well, however.
That's good to hear. I thought the SPU couldn't request data fully on its own. If the SPU can fetch (small) blocks of memory efficiently, relatively random (but predictable) accesses shouldn't be that much of a problem either. Without L2 you of course need to go directly to main memory every time you swap information (and this wastes the main memory bandwidth that's shared between all computing units).

CPU (full) cache misses can be around 500 cycles. I would like to know how long the latency (in cycles) is to fetch an aligned 128 byte (or larger) data block from main memory to SPUs local store (if that's public information). Are those comparable? Can you use standard bucketed data structures (such as bucketed lists*) efficiently by SPU without loading the whole list to local store first?

*) Bucket contains data + pointer to next bucket (or cache line index to next bucket if buckets are pooled). Buckets are aligned to cache lines. Every time you start processing a bucket you first cache hint the next bucket (or on SPU you would start loading the next bucket).
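The footnoted bucket layout can be sketched in plain C. All names here are hypothetical, and the prefetch/DMA call is left as a comment since it differs per platform (`__builtin_prefetch` on a cache-based CPU, a tagged `mfc_get` on an SPU):

```c
#include <stddef.h>

#define VALUES_PER_BUCKET 14   /* payload sized so one bucket roughly fills a 128-byte line */

typedef struct Bucket {
    struct Bucket *next;       /* link kept in the first words so it can be fetched early */
    int count;                 /* number of valid entries in values[] */
    float values[VALUES_PER_BUCKET];
} Bucket;

/* Walk the list, summing payloads. At the top of each iteration you would
 * issue the prefetch (CPU) or kick off the DMA (SPU) for b->next, then
 * process the current bucket while that transfer is in flight. */
float sum_buckets(const Bucket *b) {
    float sum = 0.0f;
    while (b != NULL) {
        /* prefetch/DMA of b->next would go here */
        for (int i = 0; i < b->count; i++)
            sum += b->values[i];
        b = b->next;
    }
    return sum;
}
```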
 
Programmers who are not used to doing these things often leave quite a bit of performance on the table.
First of all, thanks for leaving a bit of your knowledge in this thread. Secondly, can you give a ballpark estimate of how much performance programmers could be leaving behind by not taking advantage of the MFC options available to them? Of course, the answer would vary, but I'm hoping you might be able to give a wide range.

*leaves offering with head low and backing away*
 
That's good to hear. I thought the SPU couldn't request data fully on its own. If the SPU can fetch (small) blocks of memory efficiently, relatively random (but predictable) accesses shouldn't be that much of a problem either. Without L2 you of course need to go directly to main memory every time you swap information (and this wastes the main memory bandwidth that's shared between all computing units).

Let me quote section 19.2.1 from the Handbook
CBEA Programming Handbook said:
Regardless of the initiator (SPU, PPE, or other device), DMA transfers move up to 16 KB of data between an LS and main storage. An MFC supports naturally aligned DMA transfer sizes of 1, 2, 4, 8, and 16 bytes and multiples of 16 bytes. For naturally aligned 1, 2, 4, and 8-byte transfers, the source and destination addresses must have the same 4 least significant bits.

The performance of a DMA transfer can be improved when the source and destination addresses have the same quadword offsets within a 128-byte cache line. Quadword-offset-aligned transfers generate full cache-line bus requests for every cache line, except possibly the first and last. Transfers that start or end in the middle of a cache line transfer a partial cache line in the first or last bus request, respectively.

The performance of a DMA data transfer when the source and destination addresses have different quadword offsets within a cache line is approximately half that of quadword-aligned transfers, because every bus request is a partial cache-line transfer; in effect, there are two bus requests for each cache line of data.

Peak performance is achieved for transfers in which both the EA and the LSA are 128-byte aligned and the size of the transfer is a multiple of 128 bytes.

So unlike the L2, which always needs to fetch a 128B-aligned cache line, you can actually fetch less than that, making more efficient use of LS if your data is sparse. Of course, you may pay for this with reduced bandwidth if you mess up alignment. Apart from that, as the text states, an aligned 128B MFC transfer and an L2 cacheline fetch are pretty much the same.
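The alignment rules quoted from the Handbook reduce to a couple of bit tests. A small sketch (helper names are mine, not from the Handbook):

```c
#include <stdint.h>
#include <stdbool.h>

/* Quadword (16-byte) offset of an address within its 128-byte cache line:
 * bits 4..6 of the address. */
uint32_t quadword_offset(uint64_t addr) {
    return (uint32_t)((addr >> 4) & 0x7);
}

/* Matching quadword offsets for the effective address (ea) and local-store
 * address (lsa) give full cache-line bus requests (the fast path). */
bool same_quadword_offset(uint64_t ea, uint64_t lsa) {
    return quadword_offset(ea) == quadword_offset(lsa);
}

/* Peak-performance condition from the Handbook: both addresses 128-byte
 * aligned and the transfer size a non-zero multiple of 128 bytes. */
bool peak_dma_transfer(uint64_t ea, uint64_t lsa, uint32_t size) {
    return (ea & 127) == 0 && (lsa & 127) == 0
        && size > 0 && (size & 127) == 0;
}
```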

sebbbi said:
CPU (full) cache misses can be around 500 cycles. I would like to know how long the latency (in cycles) is to fetch an aligned 128 byte (or larger) data block from main memory to SPUs local store (if that's public information). Are those comparable? Can you use standard bucketed data structures (such as bucketed lists*) efficiently by SPU without loading the whole list to local store first?

As you can probably tell, I'm playing the citation game to make sure I'm not breaking any NDAs. I did however find something here, which might give an idea.
IBM said:
The latency of DMA operations between an LS and main memory is quite high, approximately in the order of 100-200 SPU cycles [11]. However, for consecutive DMA operations, it is possible to overlap the latency for the second operation with the DMA transfer of the first, as the MFC can process and queue multiple DMA requests before they are issued to the EIB.

I actually found that reference 11 as well (it's an article by IBM engineers published in IEEE Micro), and they get a total time to memory of 290 cycles. I only skimmed the article, so I don't know what kind of memory that was (other than "main"). LS to LS is 140 cycles.

So that sounds comparable to L2. This is just literature search and not personal experience, but I tend to believe the IBM engineers when it comes to their chips. ;)

Does that answer your question?

First of all, thanks for leaving a bit of your knowledge in this thread.

No worries. :) Thank STI for publishing all this stuff. You'd kind of expect that given the interest in Cell, this would be more common knowledge.

Secondly, can you give a ballpark estimate of how much performance programmers could be leaving behind by not taking advantage of the MFC options available to them? Of course, the answer would vary, but I'm hoping you might be able to give a wide range.

I really can't. But at times the MFC will allow you to write radically more efficient algorithms; at times it's just a stupid data fetcher. So not every problem can benefit from it in the same way (or at all).
Also, this being the internet means that there is a real chance that someone would take whatever I write and the next time they are unhappy with some PS3 title, they will loudly proclaim that a developer "doesn't properly MFC" and I will get beaten up next Siggraph.
I'd like to avoid that.

*leaves offering with head low and backing away*
You guys are weird, you know that? :)
 
Peak performance is achieved for transfers in which both the EA and the LSA are 128-byte aligned and the size of the transfer is a multiple of 128 bytes.
Thanks for the information. The memory fetch latencies (290 cycles) also sound very good.

Actually it seems that our data structures (bucketed cache line aligned structures) would be pretty compatible for SPU execution. After reading so many developers complaining how they had to restructure their engines completely, I thought the latency would have been considerably higher (thousands of cycles) and efficient transfers would have to be considerably larger (tens of kilobytes). With only 290 cycle latency and good efficiency of (small) 128 byte transfers, SPU doesn't feel considerably harder to use in performance critical code compared to traditional CPUs with manual cache hints (and SIMD intrinsics). Without L2 of course some algorithms are a bit harder to implement, but on the other hand local store is much larger than L1d... It's a difficult system to judge without any personal experience of programming it. I would probably love it (if we for some reason decided to do something for it), but the C++ game programmers wouldn't likely agree with me :) . Technology alone is sadly not enough to make a good game...
 
Thanks for the information. The memory fetch latencies (290 cycles) also sound very good.

Thanks for reading all that!

I suspect that the real number is closer to an L2 miss. Conceptually, I'd expect the L2 miss time to be roughly (L2 hit time) + (memory latency). So judging from the IEEE Micro article, I'd estimate the actual latency to be (L2 miss time) - (L2 hit time) + 150cy.
In any case, the PPE and SPEs are on the same (crazy fast) bus, so there should not be too much difference.


Actually it seems that our data structures (bucketed cache line aligned structures) would be pretty compatible for SPU execution.

I think your bucketed lists may actually be a pretty SPU friendly structure, especially if you massage it a bit. For example, right now you have the next pointer in the first cache line of each bucket, since that allows you to trigger an early prefetch. You could externalize the entire list structure and have the loose buckets separately. Then you can turn the externalized list into DMA list form (basically 2 words, [size, address]), which would then allow you to load the DMA list directly into LS and execute bits of it, without ever needing to modify it. The cool thing about this is that a DMA list does automatic gather/scatter. So if you have a DMA list of 10 elements, you can tell MFC to just fetch e.g. 3 elements starting at offset 2 (representing 3 of your buckets) and put them at address X for LS. The MFC will then gather the data and write your three buckets into a linear piece of LS, starting at X. Your actual processing code only ever sees linear memory.
You can then use the same list to scatter the data back into main memory.
Since the LS offset and the number of list elements to process are channel commands and not part of the list, you'll never have to modify the DMA list unless you add/remove buckets.

If you want fine grained synchronization so you can process the list as soon at data arrives, you can use the tag mechanism. Every dma command can have a tag (there's 32 tags) that can be used to sync it against other in-flight DMAs, or to check on the SPU if it was completed.

I suppose having the next pointers externalized wouldn't even make the PPU too unhappy, even if you use 8 bytes instead of 4 per element.
Caveat: The maximum transfer size per DMA list element is 16K, so you may need to split buckets.
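The DMA-list gather described above can be mocked in portable C to show the shape of the idea. `memcpy` stands in for the MFC, the element struct is only a mock of the real 2-word format, and the 16 KB-per-element limit is not enforced:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Mock of the 2-word DMA-list element: transfer size and effective address. */
typedef struct {
    uint32_t size;
    const void *addr;
} DmaListElem;

/* Emulated list gather: walk `count` elements starting at index `first`
 * and pack the pieces linearly into `ls`, the way the MFC writes gathered
 * data into local store. The processing code then only ever sees linear
 * memory. Returns the number of bytes written. */
size_t dma_list_gather(const DmaListElem *list, int first, int count, void *ls) {
    size_t off = 0;
    for (int i = first; i < first + count; i++) {
        memcpy((char *)ls + off, list[i].addr, list[i].size);
        off += list[i].size;
    }
    return off;
}
```

The scatter direction is the mirror image: walk the same list and copy from the linear LS buffer back out to each element's address.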

With only 290 cycle latency and good efficiency of (small) 128 byte transfers, SPU doesn't feel considerably harder to use in performance critical code compared to traditional CPUs with manual cache hints (and SIMD intrinsics). Without L2 of course some algorithms are a bit harder to implement, but on the other hand local store is much larger than L1d...

Exactly. It's a convenience vs. performance tradeoff. Or as some would call it: A convenience vs. fun tradeoff. ;)

It's a difficult system to judge without any personal experience of programming it. I would probably love it (if we for some reason decided to do something for it), but the C++ game programmers wouldn't likely agree with me :) . Technology alone is sadly not enough to make a good game...

There's a good point to be made about productizing engine code better, so that the gameplay guys don't even see that there's some funky bit of SPU code underneath it. To a degree, I suspect "I give you 10 times the raycasts per frame, but they will be asynchronous." is something they will find hard to refuse, considering their love for ray-casts. ;)
If you ever get your hands on the SPUs, tell me how it went. I'm sure you'll have a blast.
 
Actually it seems that our data structures (bucketed cache line aligned structures) would be pretty compatible for SPU execution...
With the recent acquisition, is there a reasonable chance you'll be working on a PS3 title (a Trials port or something new) in the near future to get some first-hand Cell experience? Ubisoft's interest in your engine suggests to me they'll want a port, although I understand if you can't even talk about that. Would be good to get your opinions going from theoretical to practical experience of Cell though! ;)
 
With the recent acquisition, is there a reasonable chance you'll be working on a PS3 title
If I had experience on the platform, I wouldn't be asking silly SPU questions here, would I? :)

We have developed for many platforms in the past, but I don't know yet what the future brings. Trials Evolution (for Xbox 360) is currently the most important thing for us. We are focusing all our effort on making it as good as possible.
 
Then you can turn the externalized list into DMA list form (basically 2 words, [size, address]), which would then allow you to load the DMA list directly into LS and execute bits of it, without ever needing to modify it. The cool thing about this is that a DMA list does automatic gather/scatter. So if you have a DMA list of 10 elements, you can tell MFC to just fetch e.g. 3 elements starting at offset 2 (representing 3 of your buckets) and put them at address X for LS. The MFC will then gather the data and write your three buckets into a linear piece of LS, starting at X. Your actual processing code only ever sees linear memory.
You can then use the same list to scatter the data back into main memory.
Since the LS offset and the number of list elements to process are channel commands and not part of the list, you'll never have to modify the DMA list unless you add/remove buckets.
That sounds really efficient.
There's a good point to be made about productizing engine code better, so that the gameplay guys don't even see that there's some funky bit of SPU code underneath it.
Agreed. That's the best way to do things. Our first cross-platform game was a Warhammer 40K game for the Sony PSP and Nintendo DS, which also had native Linux (OpenGL) and Windows (DirectX) clients (for debugging purposes). Our network programmers favored Linux (and at that time Valgrind was the only good tool to track memory issues). That was the first game where we had a single game codebase that compiled directly (without any modifications) to all four platforms. The lowest-level technology code under the hood was of course very different (but invisible to game programmers).
"I give you 10 times the raycasts per frame, but they will be asynchronous." is something they will find hard to refuse, considering their love for ray-casts. ;)
Yeah, game programmers absolutely love raycasts :). Our game programmers have implemented a smart camera system in our game that detects the rough shapes of forthcoming obstacles and adjusts the camera accordingly (by ray casts, of course). Most players do not even notice that it's there, but when you play the recent copycat versions (on mobile phones) you notice you do not always have enough time to react to obstacles (and you fail miserably). A smart camera would be even more important on those small screens. In the end, technology is there to provide game programmers/designers a way to implement their vision, and for the graphics artists to make the game look the way they want.
 
If I had experience on the platform, I wouldn't be asking silly SPU questions here, would I? :)
No, I was asking if you will be working on PS3, and so will get to have a hands-on look yourself and compare. ;) Although I can well imagine that at the moment there are no considerations beyond the current project.
 
If you exclude dual issue, both will run at more or less the same speed, VMX vs. SPU. Dual issue is a factor, but realistically you can't always get a perfect mix of float/int instructions to hit the magical 0.5 cycles/instruction.

VMX can't be faster than SPU even in theory.
It may be on par only in the case of read-only, totally latency-insensitive FPU calculations running from L1 cache in crazy unrolled loops.
PPU has anemic memory BW (even prefetched) and ridiculous latencies and stalls.
Throw VMX code on an SPU even without modifications and you quadruple the performance.
That can be true even for scalar PPU code.
In SPU code I frequently did table lookups / texture fetches (to accelerate RSX rendering). I've no idea how to do it in VMX without LHS or loop duplication.

DMA to LS not only utilises the available memory BW and usually eliminates the entire memory access cost, it also allows you to reformat data to simplify the SIMD code.
 
sebbbi
Besides the CELL Handbook, there are a number of measurements and EIB/MFC details spread over the IBM site.
As I remember there should be an article with MFC latency measurements, but I was unable to find it.
Still you can check some links
http://www.ibm.com/developerworks/forums/message.jspa?messageID=13950126#13950126
http://www.ibm.com/developerworks/power/library/pa-expert9/
http://www.ibm.com/developerworks/power/library/pa-cellperf/
http://www.ibm.com/developerworks/power/library/pa-qsmemperf/index.html
http://www.ibm.com/developerworks/library/pa-celldmas/
http://www.ibm.com/developerworks/power/library/pa-celldmas2/index.html

I simply assume a 1000-cycle MFC DMA latency. I've never actually run into latency-bound situations on SPU.
 
It may be on par only in the case of read-only, totally latency-insensitive FPU calculations running from L1 cache in crazy unrolled loops.
PPU has anemic memory BW (even prefetched) and ridiculous latencies and stalls.
Throw VMX code on an SPU even without modifications and you quadruple the performance.
We were talking about running a perfectly optimized simple post-process filter loop on SPU vs. VMX. It's reasonable to assume the loop is unrolled as much as needed. My input was basically about VMX128, and since it has 128 vector registers it doesn't have all the same problems as the basic VMX. The vector unpack/pack instructions (4xfloat16 / 10-10-10-2 / 8888 / etc. <--> 4xfloat32) also help in the post-process pixel loop (and can be used to reduce required BW in other algorithms as well).
In SPU code I frequently did table lookups / texture fetches (to accelerate RSX rendering). I've no idea how to do it in VMX without LHS or loop duplication.
Yeah, you cannot index memory directly with vector register contents (the same limit as SSE/AVX and other general-purpose CPU vector sets). It sometimes makes life very difficult (LHS stalls, as you have to transfer the load address through memory/cache). I have always liked GPU programming because vectors are first-class citizens (data indexing/addressing is possible using vector/float contents, not just scalar integer registers like on general-purpose CPUs). Intel AVX2 will finally bring proper (gather) loads to general-purpose CPUs as well (a vector register is used as four memory indices). SPU doesn't seem to have gather load support, but it can load complete vector registers using the first 32 bits of a vector register as the address (that's how I understood it from reading Naughty Dog's SPU optimization guides)? But that's still very good... something I would kill to have :)
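The gather load being discussed is easy to state as scalar code. A minimal emulation (hypothetical helper, not an actual intrinsic): each 32-bit lane of an index vector addresses the table independently, which is what AVX2's gather buys in one instruction and what the SPU trick above does lane by lane.

```c
/* Scalar emulation of a 4-lane gather load: out[lane] = table[idx[lane]].
 * No bounds checking; idx entries are assumed valid table indices. */
void gather4(const float *table, const int idx[4], float out[4]) {
    for (int lane = 0; lane < 4; lane++)
        out[lane] = table[idx[lane]];
}
```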

Update: Thanks for the links. I was writing my response simultaneously.
 
We were talking about running a perfectly optimized simple post-process filter loop on SPU vs. VMX
Then you're most likely already bottlenecked by memory B/W. CELL has twice the real B/W of the X360's theoretical B/W :D

SPU doesn't seem to have gather load support, but it can load complete vector registers using the first 32 bits of a vector register as the address
Exactly.
 
I am a weakling at technical matters, but I am following the discussion and I wonder if what happens in this video can be considered ray-casts:


I always thought that characters in a videogame don't have eyes as a sense, that their eyes are just there for physical accuracy and to depict the natural proportions of living beings.

I mean, in a game, theoretically, they should be just a body, and "their eyes" could be anywhere on their body without making a difference; since you can't picture what non-playable characters and creatures see, nobody would bother to simulate the sense of sight. Developers could just draw a person correctly and get away with it, because it seems enough to treat a body as a whole, without distinguishing between the eyes and the rest of the body. In this case it's obvious that an object placed at eye height is occluding the non-playable characters' vision, and it looks realistic.
 