Was Cell any good? *spawn

Discussion in 'Console Technology' started by Shifty Geezer, Oct 13, 2011.

  1. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,737
    Likes Received:
    7,403
    Location:
    ಠ_ಠ
    Highly doubt there'd be any advantage to doing the texture ops, branching, etc. that FXAA employs, and even so, you're talking about trying to beat 1-2 ms on the GPU. Plus, you'd still need the frame texture for input, just like MLAA does on SPUs...
     
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Yes, of course. But traditional cache-based CPUs also perform very well if the data access pattern is predictable (assuming proper manual cache hints). The SPU should be faster when doing random accesses to a 128 KB dataset, since that doesn't fit inside the PPC's 32 KB L1d. However, when random access to a bigger dataset is needed, PPC cores should outperform the SPU by a wide margin (L2 access is considerably faster than constantly transferring random data blocks from/to memory).

    FXAA is filled with bilinear filtering hacks (fetch+blend 2x2 pixels at once). It has been purposefully designed for fast GPU execution. You could run it on a CPU, but the GPU version is hard to beat (it's only about 1 millisecond on current consoles, and a fraction of that on high-end hardware).
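    For illustration, a purely hypothetical software model of that bilinear trick: with texel centers at integer coordinates, sampling exactly halfway between four texel centers makes the bilinear filter weight the 2x2 block at 0.25 each, so one fetch returns the average of four texels. On a GPU the texture unit does this for free; this C sketch just shows the arithmetic.

    ```c
    #include <stdio.h>

    static float texel(const float *img, int w, int h, int x, int y)
    {
        x = x < 0 ? 0 : (x >= w ? w - 1 : x);    /* clamp addressing */
        y = y < 0 ? 0 : (y >= h ? h - 1 : y);
        return img[y * w + x];
    }

    /* Bilinear sample at floating-point texel coordinates (u, v). */
    static float bilinear(const float *img, int w, int h, float u, float v)
    {
        int   x0 = (int)u,  y0 = (int)v;
        float fx = u - x0,  fy = v - y0;
        float top = texel(img, w, h, x0, y0)         * (1 - fx)
                  + texel(img, w, h, x0 + 1, y0)     * fx;
        float bot = texel(img, w, h, x0, y0 + 1)     * (1 - fx)
                  + texel(img, w, h, x0 + 1, y0 + 1) * fx;
        return top * (1 - fy) + bot * fy;
    }

    int main(void)
    {
        float img[4] = { 1, 2, 3, 4 };               /* a 2x2 image */
        /* prints 2.5, the mean of all four texels, from one sample */
        printf("%f\n", bilinear(img, 2, 2, 0.5f, 0.5f));
        return 0;
    }
    ```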
     
    #142 sebbbi, Nov 10, 2011
    Last edited by a moderator: Nov 10, 2011
  3. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,065
    Likes Received:
    1,662
    Location:
    Maastricht, The Netherlands
    To make this more interesting, could you give a typical example of that type of large-dataset random data access? I sometimes have a suspicion that there are often efficient streaming alternatives, but at the same time it is very likely that there are some cases where PPUs win easily, and I would like to have a better understanding of what type of work that is.
     
  4. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    If the cache is shared by multiple cores, it will impact overall performance once you lock (part of) it for the "co-processors".

    A bigger dataset is not a problem if the access pattern is predictable. The data will just be streamed in while the SPU gets busy with the other half of the data already in the Local Store.

    If data is accessed randomly, cache hits may be rare too. For problems like that, devs may have to rely more on the larger number of cores for speedup. But there should still be opportunities to batch the data.

    Cell was used in IBM's Roadrunner supercomputer to run applications in place of regular CPUs (with vector engines) during that time period. They don't deal with small datasets there.
     
  5. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    I don't seem to be good at short posts.

    Thanks for sharing that! Do you know how much that is in LOC, and how much of that is code you wrote vs. stuff from the SDK? 20% is a lot, but I really don't know how ELF sizes compare.

    That's cool. I may have failed my reading comprehension check, but I'm not sure which parts you disagree with and what your opinion on them is. Care to elaborate?

    Both already in there. It doesn't have a branch predictor or an integer divide instruction, but that's about it. Instead of the predictor, there are branch hints, so you don't usually need to take a hit for a branch. This can be tricky to get right, but mostly in cases where a predictor would be completely lost.

    I'm still scratching my head a bit at the 128KB LS overhead, but I'm willing to accept that for some people this is how it works out. As we're sort of talking about cool hardware capabilities, let me say a word or two about double buffering and SPUs.
    Memory transfers to and from main memory and VRAM are handled by the MFC. The SPU controls its MFC through a channel interface that is part of the ODD pipe (roughly the same as VXU Type 2 for you), meaning that you can issue commands to the MFC as ODD instructions.
    If you are so inclined, you can put these channel commands into your regular processing loop. The MFC can queue up 16 DMAs, so this effectively gives an extremely controlled prefetch system, at the cost of some ODD pressure. Most people have ODD to spare, so it's more a case of working out the latencies and carefully adding the commands, as you described it for VMX. It's really very much the same thing, save for the added control and things like DMA lists.
    What that means is that there is no need to stop at simple double buffering. You can do pretty sophisticated loops, if you need to. I've personally never had a situation where I needed to roll the channel commands in with the assembly, as this is really something you only need when you have sparse (i.e. random) memory accesses. People tell me that it works rather well, however.
    I don't actually know how MFC DMAs compare to L2 fetches from memory in terms of latency, so it might come out even.
    Programmers who are not used to doing these things often leave quite a bit of performance on the table. You're probably used to seeing the same thing with people who don't use the prefetch instruction on 360.
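    To make that concrete, here is a minimal sketch of such a streaming loop in C, using the Cell SDK's MFC intrinsics from spu_mfcio.h. The process() kernel, the chunk size, and the bare double-buffer shape are illustrative assumptions; a real loop could keep many more of the 16 queue slots busy.

    ```c
    #include <stdint.h>
    #include <spu_mfcio.h>                 /* Cell SDK MFC intrinsics */

    #define CHUNK 16384                    /* 16 KB: the max size of one DMA */

    extern void process(char *data, int n);      /* hypothetical kernel */

    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    /* Stream n_chunks CHUNK-byte blocks from effective address ea_in,
       overlapping the DMA of block i+1 with the processing of block i.
       Tag 0 tracks buffer 0, tag 1 tracks buffer 1. */
    void stream(uint64_t ea_in, int n_chunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea_in, CHUNK, cur, 0, 0);    /* prime buffer 0 */

        for (int i = 0; i < n_chunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < n_chunks)                      /* kick off next DMA */
                mfc_get(buf[nxt], ea_in + (uint64_t)(i + 1) * CHUNK,
                        CHUNK, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);              /* wait only for the */
            mfc_read_tag_status_all();                 /* current buffer    */

            process((char *)buf[cur], CHUNK);
            cur = nxt;
        }
    }
    ```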

    Absolutely. They are usually so much cleaner designs than the super-scalar OoO cores, where you can't really predict what's going to happen. :)

    It's not lower level than what other developers do. Every PS3 developer can get as low level as they want. Also what is this luxurious position without product deadlines you're talking about? I'm intrigued. ;)

    :shock:

    I did notice a recent lack of offerings at my shrine...

    I'm pretty sure it could be done, although it may not be worth it. Last time I checked it was algorithmically a lot simpler than MLAA but very tailored to GPU ISAs. This is what makes it possible to do this nice and simple integration for it and what makes it run well on GPUs. Full blown MLAA is extremely hard to do on GPUs, which is why nobody is doing it. :)

    This actually harkens back to what we talked about earlier, as MLAA is one of those pesky algorithms that needs the entire scanline more than just once.
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    That's good to hear. I thought the SPU couldn't request data fully on its own. If the SPU can fetch (small) blocks of memory efficiently, relatively random (but predictable) accesses shouldn't be that much of a problem either. Without L2, you of course need to go directly to main memory every time you swap information (and this wastes the main memory bandwidth that's shared between all computing units).

    CPU (full) cache misses can be around 500 cycles. I would like to know how long the latency (in cycles) is to fetch an aligned 128-byte (or larger) data block from main memory to the SPU's local store (if that's public information). Are those comparable? Can you use standard bucketed data structures (such as bucketed lists*) efficiently on the SPU without loading the whole list to local store first?

    *) A bucket contains data + a pointer to the next bucket (or the cache-line index of the next bucket if buckets are pooled). Buckets are aligned to cache lines. Every time you start processing a bucket, you first cache-hint the next bucket (or on the SPU you would start loading the next bucket).
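    On a conventional cached CPU, that walk might look like the following sketch (the bucket layout and payload size are made up; on the SPU the hint would instead become an mfc_get of the next bucket into a second local store buffer):

    ```c
    #include <stddef.h>

    /* One cache-line-aligned bucket: a next pointer followed by payload,
       padded so the whole struct is a multiple of the 128-byte line. */
    typedef struct Bucket {
        struct Bucket *next;
        float data[30];
    } __attribute__((aligned(128))) Bucket;

    /* Walk the list, hinting the next bucket into cache before touching
       the current payload, so the miss latency overlaps the processing. */
    float sum_buckets(const Bucket *b)
    {
        float sum = 0.0f;
        while (b != NULL) {
            if (b->next)
                __builtin_prefetch(b->next);   /* GCC cache hint */
            for (int i = 0; i < 30; i++)
                sum += b->data[i];
            b = b->next;
        }
        return sum;
    }
    ```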
     
    #146 sebbbi, Nov 10, 2011
    Last edited by a moderator: Nov 11, 2011
  7. Lucid_Dreamer

    Veteran

    Joined:
    Mar 28, 2008
    Messages:
    1,210
    Likes Received:
    3
    First of all, thanks for leaving a bit of your knowledge in this thread. Secondly, can you give a ballpark estimate of how much performance programmers could be leaving behind by not taking advantage of the MFC options available to them? Of course, the answer would vary, but I'm hoping you might be able to give a wide range.

    *leaves offering with head low and backing away*
     
  8. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    Let me quote section 19.2.1 from the Handbook
    So unlike the L2, which always needs to fetch a 128B-aligned cache line, you can actually fetch less than that, making more efficient use of LS if your data is sparse. Of course, you may pay for this with reduced bandwidth if you mess up alignment. Apart from that, as the text states, an aligned 128B MFC transfer and an L2 cache-line fetch are pretty much the same.

    As you can probably tell, I'm playing the citation game to make sure I'm not breaking any NDAs. I did however find something here, which might give an idea.
    I actually found that reference 11 as well (it's an article by IBM engineers published in IEEE Micro) and they get a total time to memory of 290 cycles. I only skimmed the article, so I don't know what kind of memory that was (other than "main"). LS to LS is 140cy.

    So that sounds comparable to L2. This is just literature search and not personal experience, but I tend to believe the IBM engineers when it comes to their chips. ;)

    Does that answer your question?

    No worries. :) Thank STI for publishing all this stuff. You'd kind of expect that given the interest in Cell, this would be more common knowledge.

    I really can't. At times the MFC will allow you to write a radically more efficient algorithm; at times it's just a stupid data fetcher. So not every problem can benefit from it in the same way (or at all).
    Also, this being the internet means that there is a real chance that someone would take whatever I write and the next time they are unhappy with some PS3 title, they will loudly proclaim that a developer "doesn't properly MFC" and I will get beaten up next Siggraph.
    I'd like to avoid that.

    You guys are weird, you know that? :)
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Thanks for the information. The memory fetch latencies (290 cycles) also sound very good.

    Actually it seems that our data structures (bucketed, cache-line-aligned structures) would be pretty compatible with SPU execution. After reading so many developers complaining about how they had to restructure their engines completely, I thought the latency would have been considerably higher (thousands of cycles) and that efficient transfers would have to be considerably larger (tens of kilobytes). With only 290 cycles of latency and good efficiency for (small) 128-byte transfers, the SPU doesn't feel considerably harder to use in performance-critical code than traditional CPUs with manual cache hints (and SIMD intrinsics). Without L2, some algorithms are of course a bit harder to implement, but on the other hand the local store is much larger than an L1d... It's a difficult system to judge without any personal experience of programming it. I would probably love it (if we for some reason decided to do something for it), but the C++ game programmers wouldn't likely agree with me :) . Technology alone is sadly not enough to make a good game...
     
    #149 sebbbi, Nov 11, 2011
    Last edited by a moderator: Nov 11, 2011
  10. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    Thanks for reading all that!

    I suspect that the real number is closer to an L2 miss. Conceptually, I'd expect the L2 miss time to be roughly (L2 hit time) + (memory latency). So judging from the IEEE Micro article, I'd estimate the actual latency to be (L2 miss time) - (L2 hit time) + 150cy.
    In any case, the PPE and SPEs are on the same (crazy fast) bus, so there should not be too much difference.


    I think your bucketed lists may actually be a pretty SPU-friendly structure, especially if you massage them a bit. For example, right now you have the next pointer in the first cache line of each bucket, since that allows you to trigger an early prefetch. You could externalize the entire list structure and keep the loose buckets separate. Then you can turn the externalized list into DMA list form (basically 2 words, [size, address]), which would allow you to load the DMA list directly into LS and execute bits of it, without ever needing to modify it. The cool thing about this is that a DMA list does automatic gather/scatter. So if you have a DMA list of 10 elements, you can tell the MFC to fetch e.g. 3 elements starting at offset 2 (representing 3 of your buckets) and put them at address X in LS. The MFC will then gather the data and write your three buckets into a linear piece of LS, starting at X. Your actual processing code only ever sees linear memory.
    You can then use the same list to scatter the data back into main memory.
    Since the LS offset and the number of list elements to process are channel commands and not part of the list, you'll never have to modify the DMA list unless you add/remove buckets.

    If you want fine-grained synchronization so you can process the list as soon as data arrives, you can use the tag mechanism. Every DMA command can have a tag (there are 32 tags) that can be used to sync it against other in-flight DMAs, or to check on the SPU whether it has completed.

    I suppose having the next pointers externalized wouldn't even make the PPU too unhappy, even if you use 8 bytes instead of 4 per element.
    Caveat: The maximum transfer size per DMA list element is 16K, so you may need to split buckets.
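    A sketch of that gather, assuming the SDK's mfc_getl() intrinsic and its list-element layout; the names, bucket size, and tag number are illustrative:

    ```c
    #include <stdint.h>
    #include <spu_mfcio.h>                /* mfc_getl(), mfc_list_element_t */

    #define NBUCKETS     10
    #define BUCKET_BYTES 1024             /* <= 16K per list element */

    /* Externalized list: one [size, address] element per bucket.
       DMA lists live in LS and must be 8-byte aligned. */
    static volatile mfc_list_element_t list[NBUCKETS]
        __attribute__((aligned(8)));
    static volatile char ls_buf[3 * BUCKET_BYTES]
        __attribute__((aligned(128)));

    /* Build the list from the low 32 bits of each bucket's effective
       address, then gather 3 buckets starting at list offset 2 into a
       linear run of LS. Processing code only ever sees ls_buf. */
    void gather_buckets(const uint32_t bucket_ea_lo[NBUCKETS])
    {
        for (int i = 0; i < NBUCKETS; i++) {
            list[i].notify = 0;           /* no stall-and-notify */
            list[i].size   = BUCKET_BYTES;
            list[i].eal    = bucket_ea_lo[i];
        }

        /* The LS address and element count are DMA parameters, not part
           of the list, so the list itself never changes between runs. */
        mfc_getl(ls_buf, 0 /* high 32 bits of EA */, &list[2],
                 3 * sizeof(mfc_list_element_t), 5 /* tag */, 0, 0);

        mfc_write_tag_mask(1 << 5);       /* block until the gather lands */
        mfc_read_tag_status_all();
    }
    ```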

    Exactly. It's a convenience vs. performance tradeoff. Or as some would call it: A convenience vs. fun tradeoff. ;)

    There's a good point to be made about productizing engine code better, so that the gameplay guys don't even see that there's some funky bit of SPU code underneath it. To a degree, I suspect "I give you 10 times the raycasts per frame, but they will be asynchronous." is something they will find hard to refuse, considering their love for raycasts. ;)
    If you ever get your hands on the SPUs, tell me how it went. I'm sure you'll have a blast.
     
  11. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    With the recent acquisition, is there a reasonable chance you'll be working on a PS3 title (a Trials port or something new) in the near future to get some first-hand Cell experience? Ubisoft's interest in your engine suggests to me they'll want a port, although I understand if you can't even talk about that. Would be good to get your opinions going from theoretical to practical experience of Cell though! ;)
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    If I had experience on the platform, I wouldn't be asking silly SPU questions here, would I? :)

    We have developed for many platforms in the past, but I don't yet know what the future brings. Trials Evolution (for Xbox 360) is currently the most important thing for us. We are focusing all our effort on making it as good as possible.
     
  13. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    The questions are not silly. It's the attitude and desire to seek knowledge that count. ^_^
     
  14. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    That sounds really efficient.
    Agreed. That's the best way to do things. Our first cross-platform game was a Warhammer 40K game for Sony PSP and Nintendo DS, but it also had native Linux (OpenGL) and Windows (DirectX) clients (for debugging purposes). Our network programmers favored Linux (and at that time Valgrind was the only good tool for tracking memory issues). That was the first game where we had a single game codebase that compiled directly (without any modifications) for all four platforms. The lowest-level technology code under the hood was of course very different (but invisible to the game programmers).
    Yeah, game programmers absolutely love raycasts :). Our game programmers have implemented a smart camera system in our game that detects the rough shapes of forthcoming obstacles and adjusts the camera accordingly (by raycasts, of course). Most players do not even notice that it's there, but when you play the recent copycat versions (on mobile phones) you notice you do not always have enough time to react to obstacles (and you fail miserably). A smart camera would be even more important on those small screens. In the end, technology is there to give game programmers/designers a way to implement their vision, and for the graphics artists to make the game look the way they want.
     
  15. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    No, I was asking if you will be working on PS3, and so will get to have a hands-on look yourself and compare. ;) Although I can well imagine that at the moment there are no considerations beyond the current project.
     
  16. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    VMX can't be faster than the SPU even in theory.
    It may be on par only in the case of read-only, totally latency-insensitive FPU calculations running from L1 cache in crazy unrolled loops.
    The PPU has anemic memory BW (even prefetched) and ridiculous latencies and stalls.
    Throw VMX code onto an SPU even without modifications and you quadruple the performance.
    That can be true even for scalar PPU code.
    In SPU code I frequently did table lookups / texture fetches (to accelerate RSX rendering). I've no idea how to do that in VMX without LHS stalls or loop duplication.

    DMA to LS not only utilises the available memory BW and usually eliminates the entire memory access cost, but also allows you to reformat data to simplify SIMD code.
     
  17. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    sebbbi
    Besides the CELL Handbook, there are a number of measurements and EIB/MFC details spread over the IBM site.
    As I remember, there should be an article with MFC latency measurements, but I was unable to find it.
    Still, you can check some links:
    http://www.ibm.com/developerworks/forums/message.jspa?messageID=13950126#13950126
    http://www.ibm.com/developerworks/power/library/pa-expert9/
    http://www.ibm.com/developerworks/power/library/pa-cellperf/
    http://www.ibm.com/developerworks/power/library/pa-qsmemperf/index.html
    http://www.ibm.com/developerworks/library/pa-celldmas/
    http://www.ibm.com/developerworks/power/library/pa-celldmas2/index.html

    I simply assume a 1000-cycle MFC DMA latency. I've never actually faced a latency-bound situation on the SPU.
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    We were talking about running a perfectly optimized simple post-process filter loop on SPU vs. VMX. It's reasonable to assume the loop is unrolled as much as needed. My input was basically about VMX128, and since it has 128 vector registers it doesn't have all the same problems as the basic VMX. The vector unpack/pack instructions (4xfloat16 / 10-10-10-2 / 8888 / etc. <--> 4xfloat32) also help in the post-process pixel loop (and can be used to reduce required BW in other algorithms as well).
    Yeah, you cannot index memory directly with vector register contents (the same limit as with SSE/AVX and other general-purpose CPU vector sets). It sometimes makes life very difficult (LHS stalls, as you have to transfer the load address through memory/cache). I have always liked GPU programming because vectors are first-class citizens (data indexing/addressing is possible using vector/float contents, not just scalar integer registers as on general-purpose CPUs). Intel AVX2 will finally bring proper (gather) loads to general-purpose CPUs as well (a vector register is used as four memory indices). The SPU doesn't seem to have gather load support, but it can load complete vector registers using the first 32 bits of a vector register as the address (that's how I understood it from reading Naughty Dog's SPU optimization guides)? But that's still very good... something I would kill to have :)
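    For reference, a minimal sketch of that AVX2 gather (a hypothetical wrapper; it needs an AVX2-capable CPU and, with GCC/Clang, -mavx2):

    ```c
    #include <immintrin.h>

    /* Four 32-bit indices held in a vector register drive one gather
       instruction: lane i receives base[idx[i]]. The scale argument (4)
       is the byte stride per index, so idx = {0, 2, 4, 6} would pick
       every other float from base. */
    __m128 gather4(const float *base, __m128i idx)
    {
        return _mm_i32gather_ps(base, idx, 4);
    }
    ```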

    Update: Thanks for the links. I was writing my response simultaneously.
     
  19. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    Then you are most likely already bottlenecked by memory B/W. CELL has twice the real B/W of the X360's theoretical B/W :grin:

    Exactly.
     
  20. Cyan

    Cyan orange
    Legend Veteran

    Joined:
    Apr 24, 2007
    Messages:
    9,063
    Likes Received:
    2,674
    I am a weakling at technical matters, but I am following the discussion and I wonder if what happens in this video can be considered ray-casts:



    I always thought that characters in a videogame don't have eyes as a sense; their eyes are just there for physical accuracy, to depict the natural proportions of living beings.

    I mean, in a game, theoretically, they should be just a body, and "their eyes" could be anywhere on their body without making a difference; since you can't picture what non-playable characters and creatures see, no one would bother to simulate the sense of sight. Developers could just draw a person correctly and get away with it, because it seems enough to treat a body as a whole, without distinguishing between the eyes and the rest of the body. In this case it's obvious that an object placed at eye height is occluding the non-playable character's vision, and it looks realistic.
     