Was Cell any good? *spawn

Discussion in 'Console Technology' started by Shifty Geezer, Oct 13, 2011.

Thread Status:
Not open for further replies.
  1. kagemaru

    Veteran

    Joined:
    Aug 23, 2010
    Messages:
    1,358
    Likes Received:
    10
    Location:
    Ohio
    I don't believe only one person has all the answers, but I do believe I'll take the word of a developer over a forum poster with an agenda. If other devs wish to discuss such matters with Joker, great, I'd love to read it, but that isn't what's going on here. :wink:
     
  2. joker454

    Veteran

    Joined:
    Dec 28, 2006
    Messages:
    3,819
    Likes Received:
    139
    Location:
    So. Cal.
    I agree that you need registers to help hide latency on vmx....but are you sure the above is correct? The 360's implementation of vmx is slightly modified to where it's improved over the ps3 version. From what I recall they upped the register count to 128 and added some extra instructions.
     
  3. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    Yes, that's true, but the original comparison was just PPU VMX, and comparisons with other processors will be with smaller register sets, unless they're all sporting 128-register vector units these days! It's actually quite an important consideration what Xenon's large VMX units bring to the table, if anything. Sadly we don't get in-depth feedback on those, so we can't say how massive register sets would fare on a future processor.
     
  4. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    I was comparing the CBEA components, as the docs are openly available. Xenon is a lot less publicly documented.
    However, IBM did talk about it a bit here, so yes, 128 registers with some restrictions and additions to the base VMX. They added a few AoS instructions and D3D format conversions.

    The L2 is also increased to 1MB, but it's now shared over the three cores, giving you effectively less L2 per unit. You win some, you lose some.

    All in all this makes VMX128 more capable than VMX32 in small-dataset floating point loops, but it's still not in the same league as an SPE, primarily due to the higher instruction latency.
     
    #124 T.B., Nov 5, 2011
    Last edited by a moderator: Nov 5, 2011
  5. joker454

    Veteran

    Joined:
    Dec 28, 2006
    Messages:
    3,819
    Likes Received:
    139
    Location:
    So. Cal.
    1MB of L2 shared three ways sounds like a hit, but comparatively you only realistically get around ~110k or so of usable local store on each spu after you double buffer data and account for code + stack.

    Personally I'd say vmx128 is much more capable than vmx32 because vmx is very sensitive to latency without a large register set, and because it's more subject to The Intern Effect (tm) than spu's are. So going from 32 to 128 is a huge improvement! Also, generally speaking you are always dealing with small data sets on both machines because of the local store size on spu and the register count on vmx. Well ok, more like large data sets processed in really small chunks, but the net result is basically the same: latency becomes much more manageable on vmx with the way data is churned through on both consoles. For the stuff sebbbi mentions that is large data sets with random access, you shift those to the gpu.
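
    To illustrate the point (a plain-C sketch, not code from either console; the function names and the unroll factor are invented): with a single accumulator every multiply-add waits on the previous one, while enough registers for several independent accumulators lets the pipeline stay busy.

    #include <stddef.h>

    /* Latency-bound version: one accumulator, so every madd depends on the
     * previous one and the loop stalls for the full pipeline latency. */
    float dot_one_acc(const float *a, const float *b, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i)
            acc += a[i] * b[i];
        return acc;
    }

    /* Unrolled version: four independent accumulators. This only works if
     * there are enough registers to keep all of them (plus the loads) live
     * at once -- which is what going from 32 to 128 vector registers buys
     * you when each element is a 4-wide vector instead of a scalar. */
    float dot_four_acc(const float *a, const float *b, size_t n)
    {
        float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            acc0 += a[i + 0] * b[i + 0];
            acc1 += a[i + 1] * b[i + 1];
            acc2 += a[i + 2] * b[i + 2];
            acc3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i)   /* remainder */
            acc0 += a[i] * b[i];
        return (acc0 + acc1) + (acc2 + acc3);
    }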


    Don't sweat it daddio, come January it will be 3 years since I touched console code so my memory is starting to get a bit hazy on some stuff. Feel free to treat me like one of the guys at this point :)
     
  6. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    I'd disagree with this on two levels. First, the equivalent to LS on a VMX in my mind is L1, not L2. L2 is far, far away in terms of latency. If all your data fits into L1 and is nicely laid out, I'd argue that VMX128 and the SPE are on more or less equal footing, assuming there is no second thread interfering with your VMX code and polluting its cache. If you're sufficiently hardcore, an SPE will still win in many cases due to funky things you can do with the MFC, the IMO more powerful ODD vs Type 2 and the less restrictive ISA, but that's a level of engineering you'll rarely ever see.
    Of course, the assumption of dropping SMT and fitting everything into 32KB means that we're probably talking about significantly more engineering effort in the VMX case than the SPE, paradoxically. Does anyone run only one thread on a core to maximize performance? I don't think I've ever seen that.

    The second part I disagree with is the 110k. Let's talk actual numbers again. Edge MLAA is less than 40k of code all in all and - depending on the exact configuration - north of 200k worth of data. Some of that data will be in flight at any given time and some of it will be actively processed. This is not different from a cache, which also has some data in flight and some actively useable. Actually, you can use the MFC to do a really tight DMA loop where only the minimum amount of data is in flight at any given time. This is much, much harder to do with a prefetcher, since the control is much more indirect.
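
    For illustration, a minimal sketch of that kind of double-buffered DMA loop (assuming the CBE SDK's spu_mfcio.h intrinsics; CHUNK, stream() and process() are invented for the example, and alignment/error handling is omitted):

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096  /* bytes per transfer; illustrative only */

    /* Two staging buffers in local store, 128-byte aligned for DMA. */
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(char *data, unsigned size);  /* hypothetical kernel */

    /* Stream 'total' bytes (assumed a multiple of CHUNK) from effective
     * address 'ea', so that at most one chunk is in flight while the
     * previous one is being processed. */
    void stream(uint64_t ea, unsigned total)
    {
        unsigned cur = 0;

        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prime the first chunk */
        for (unsigned off = 0; off < total; off += CHUNK) {
            if (off + CHUNK < total)                    /* kick off the next fetch */
                mfc_get(buf[cur ^ 1], ea + off + CHUNK, CHUNK, cur ^ 1, 0, 0);

            mfc_write_tag_mask(1 << cur);               /* wait for the current chunk */
            mfc_read_tag_status_all();

            process(buf[cur], CHUNK);
            cur ^= 1;                                   /* swap buffers */
        }
    }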

    Absolutely. My point was not that VMX128 is weak. It isn't. But it's not as powerful as the SPE. And quite frankly, if IBM could not have engineered a new specialized core (the SPE) that beats an extended version of one of their old cores at a very specialized job, they would not be some of the best processor designers in the business.
    Looking at it from the other side, some of the VMX128 extensions allow it to beat SPEs at some very specialized tasks by a fair margin as well. But I'd say those cases are rarer.

    The comparison to GPUs is an interesting one, because GPUs are a whole different class of processors. VMX and SPE are sort of designed for the same thing, with pretty different parameters, but they are still very comparable. If you don't care about 25% or even 100% difference core-for-core, then yes, VMX and SPE look a lot alike. They are both 4 wide dual issue SIMD units attached to a bit of fast memory and clocked at 3.2GHz. Within that class however, they do differ in the details and those details are exploitable, if you are willing to spend the engineering effort, which is not always a sound investment for all teams.
     
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    (I have to admit, I was comparing SPU to VMX128 instead of the less powerful VMX32)

    A well programmed loop would never access main memory more than once for reading each pixel and once for writing each pixel. If you have fat pixels, you might need more than L1d and 128 VMX registers to hide pipeline latency, so your algorithm might sometimes hit the 1MB L2. But if the post process algorithm requires more than 1 MB of memory to hide the pipeline latency, there's something badly wrong in the code (as the data access pattern is very cache friendly).

    Both SPU and the VMX do exactly the same amount of loads and stores to main memory. Each pixel is read once and written once. You do not even need to add any manual cache control instructions to reach this on VMX. However if you do add manual cache control instructions, you are pretty much guaranteed to always hit L1d (assuming 8888 format pixels, you can have 8192 of them simultaneously in the 32KB L1d), since the post process loop doesn't have branches (it's very easy to predict how long it's going to execute on an in-order CPU, so you can put the cache prefetch instructions in ideal places).
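
    A rough illustration of that kind of loop in plain C, with GCC's generic __builtin_prefetch standing in for the PPC dcbt-style cache touch (the prefetch distance and the per-pixel work are made up):

    #include <stdint.h>

    #define PREFETCH_AHEAD 512   /* bytes to run ahead of the loop; needs tuning */

    /* Streaming post-process over 8888 pixels: each pixel is read once and
     * written once.  Because the loop has no branches, the prefetch can sit
     * a fixed distance ahead and will essentially always be timely. */
    void tint(uint32_t *dst, const uint32_t *src, unsigned count)
    {
        for (unsigned i = 0; i < count; ++i) {
            __builtin_prefetch(src + i + PREFETCH_AHEAD / 4, 0);    /* read stream  */
            __builtin_prefetch(dst + i + PREFETCH_AHEAD / 4, 1);    /* write stream */

            uint32_t p = src[i];
            uint32_t r = (p >> 16) & 0xff;            /* placeholder per-pixel work: */
            p = (p & 0xff00ffffu) | ((r >> 1) << 16); /* halve the red channel       */
            dst[i] = p;
        }
    }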

    Also VMX128 includes 3d/4d dot products. This helps if the input/output data is in AoS layout. Without it, you need to either have the pixel data interleaved in SoA layout (difficult as the data comes from the GPU) or transform it to SoA layout (more instructions). Also, when calculating dot products in SoA layout, you do four at a time, and thus need more registers. AoS dot products can relieve the register pressure. VMX128 also has fast (low latency) float16->float32->float16 conversion/packing (and conversion/packing to other pixel formats as well), so it's pretty well capable of processing pixel data in all of the currently used LDR and HDR formats. Many other vector processing algorithms also benefit from fast loading/storing of values as 16 bit floats in memory (halves the cache/memory footprint compared to 32 bit float vectors). In many cases you do not need more precision.
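
    A scalar-C sketch of the layout difference (illustrative only; on the real hardware these would be vector registers and madds, not arrays):

    /* SoA layout: x, y, z, w of many vectors stored in separate arrays.
     * One pass computes four dot products "for the price of one" on a
     * 4-wide SIMD unit, but needs eight inputs plus four accumulators
     * live at once -- hence the register pressure. */
    void dot4_soa(float out[4],
                  const float ax[4], const float ay[4],
                  const float az[4], const float aw[4],
                  const float bx[4], const float by[4],
                  const float bz[4], const float bw[4])
    {
        for (int i = 0; i < 4; ++i)
            out[i] = ax[i] * bx[i] + ay[i] * by[i]
                   + az[i] * bz[i] + aw[i] * bw[i];
    }

    /* AoS layout: one vector is one contiguous xyzw quad.  With a dedicated
     * dot-product instruction (as on VMX128) this is roughly one op per
     * vector with little register pressure; without one (as on the SPE) it
     * costs extra shuffles, or a transpose to SoA first. */
    float dot_aos(const float a[4], const float b[4])
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
    }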

    Naughty Dog's lighting stuff:
    http://www.naughtydog.com/docs/gdc2010/intro-spu-optimizations-part-2.pdf
    On SPU: AoS = 10.75 cycles, SoA = 7.75 cycles. Not having AoS dot products hurts the SPUs a bit when you have to process AoS data. Working on SoA layout is often the preferred way, but that's not always possible.

    With longer VMX instruction latencies, getting the pipelines 100% utilized is of course a harder task, especially when doing it by hand like Naughty Dog does in their SPU lighting code. But it's not impossible, you just basically need to manually interleave the processing of a few pixels. The positive thing however is that code hot spots are often self contained, and optimizing the short inner loops is often enough to get good performance. With a Cell-like architecture, the whole game program needs to be adapted and optimized to suit the system, or the performance will be really poor. With a more traditional cache based UMA system, you only need to optimize the hotspots (= less than 1% of the whole code), and the CPU automatically runs the rest of the code well enough.
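
    The kind of manual interleaving meant here, shown in scalar C for clarity (the luma weights and the two-pixel interleave are just placeholders; real VMX code would interleave vector work the same way, and more deeply):

    #include <stdint.h>

    static inline float luma(uint32_t p)
    {
        float r = (float)((p >> 16) & 0xff);
        float g = (float)((p >>  8) & 0xff);
        float b = (float)( p        & 0xff);
        return 0.299f * r + 0.587f * g + 0.114f * b;
    }

    /* Two pixels per iteration: the work on pixel B is independent of the
     * work on pixel A, so it can issue while A's results are still in
     * flight instead of stalling on the instruction latency. */
    void luma_pass(float *out, const uint32_t *src, unsigned count /* even */)
    {
        for (unsigned i = 0; i < count; i += 2) {
            uint32_t pa = src[i];
            uint32_t pb = src[i + 1];
            float la = luma(pa);   /* independent of lb ...               */
            float lb = luma(pb);   /* ... so both can overlap in the pipe */
            out[i]     = la;
            out[i + 1] = lb;
        }
    }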
     
    #127 sebbbi, Nov 8, 2011
    Last edited by a moderator: Nov 8, 2011
  8. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,065
    Likes Received:
    1,660
    Location:
    Maastricht, The Netherlands
    How about chaining? E.g. if I understood it correctly (I did read most of the manual, even if I never really worked with it beyond running sample code back when the PS3 still had Linux), you can chain SPUs with no additional delay to the pipeline. E.g. you could assign the task to one SPU, which does some basic work on the data, divides up the work and passes it to two other SPUs. Can you do something similar with the three VMX128s in the 360's CPU?
     
  9. T.B.

    Newcomer

    Joined:
    Mar 11, 2008
    Messages:
    156
    Likes Received:
    0
    Seems as if we agree for the most part. :)

    Consider a scatter algorithm, like IIR gaussian approximation. You'll need to do a forward and a backward pass over each row and column and the intermediate values need to be float precision if you want extreme blurs. 1280*3*sizeof(float) = 15360B, and that's assuming you somehow got rid of the alpha channel.
    Even if everything is gather based, you can have more data to gather in LS than in L1.
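
    A toy version of one such pass (a first-order recurrence only, purely illustrative; a real gaussian IIR approximation uses higher-order filters with tuned coefficients):

    #include <stddef.h>

    /* One row of a recursive blur.  Every output depends on the previous
     * output, so the whole row of float intermediates stays live between
     * the forward and backward passes: 1280 px * 3 channels * 4 bytes =
     * 15360 B per row, a big chunk of a 32 KB L1 before counting source
     * and destination lines, but easy to stage in a 256 KB local store. */
    void iir_blur_row(float *row, size_t width, float a /* feedback, 0..1 */)
    {
        if (width < 2)
            return;
        for (size_t x = 1; x < width; ++x)          /* forward pass  */
            row[x] += a * (row[x - 1] - row[x]);
        for (size_t x = width - 1; x-- > 0; )       /* backward pass */
            row[x] += a * (row[x + 1] - row[x]);
    }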

    If your effect has nicely independent pixels, then of course, as I stated earlier, it's a pretty even playing field. And if you do, say, a tonemapping, VMX128 should be a good chunk faster. Again, I'm not saying VMX128 is bad (or even VMX32 is bad) or that it doesn't have cases where it can be faster than an SPE.

    This is more a case of SPEs being able to efficiently run a wider class of algorithms at high utilization. I'll need to think about whether there is an interesting class of algorithms at which the VMX will be significantly faster for architectural reasons. The L2 cache lines are 128B, so that's a pretty DMA-able size...

    If this has been your experience with writing PS3 games, then kudos to you guys for going all the way. This is not usually how it works. :)
    I really don't think a lot of games have significantly more than 1% of their codebase on the SPUs, but that's just a gut feeling.
    In any case, I can't really talk too much about ease of development, since I've not done a whole lot of VMX128 coding. So I'll stick to commenting on chip design, where I actually might know what I'm talking about. :)
     
  10. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,582
    Likes Received:
    198
    Do you work for IBM? If so, give us a hint... Will we see VMX units in the next gen? Improved SPUs?
     
  11. forumaccount

    Newcomer

    Joined:
    Jan 30, 2009
    Messages:
    140
    Likes Received:
    86
    Going by ELF sizes, around 20% of my game is SPU code. So my estimate would be... probably a lot higher than 1% for the typical game. I definitely know of titles that used 0% SPU, but they didn't have much going on, and that was at the start of the console gen.

    I don't have internal knowledge of more than 3 or 4 titles that shipped in the last 2 years, but I feel like I have a pretty good grasp on what can be done with the PPU alone... I'm going to disagree with both your take on it and sebbbi's.
     
  12. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    T.B. is one of the (2?) devs who wrote the God of War 3 MLAA. The module is written such that developers can plonk in the MLAA code easily (if they have spare SPU cycles).
     
  13. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,582
    Likes Received:
    198
    Well, then he could answer those questions as well ;).

    Now seriously, I have a real, answerable question for him or anyone else in the know. What could be done to the SPUs to make them more flexible and easier to program in an improved Cell version?

    Add dynamic branching? OoO capabilities? Add integer units? Increase local storage, or replace it with a cache?
     
  14. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Agreed. SPUs should perform better when the data set required for each pixel is larger than L1d (32 KB), but smaller than the local store size (or half of it minus code = ~128 KB, since you need double buffering to load the next data while processing the old). But once the data set is larger than the local store, the VMX128 processing would simply start using the 1 MB L2 cache instead of the faster L1d, while the SPU code would need to frantically swap data to/from main memory (slowing it down to a crawl). But all this is pretty much academic debate, since you only have one PPC core on PS3, and six SPUs. Using the only general purpose CPU core for post process pixel processing would be quite an inefficient approach :)

    I haven't written code for PS3, just for Xbox 360 (and older Sony consoles). But I have of course followed PS3 game development quite closely. SPU programming articles (like that Naughty Dog one) are a very interesting read for me, since I do most of our low level vector optimizations (and all our GPGPU stuff). It's interesting to compare different vector architectures.
     
  15. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,582
    Likes Received:
    198
    What would you modify in the SPUs to make them easy to program, and not a pain in the ass, while maintaining their capabilities? Could modifications be made that would let a new design do away with the need for a PPU?
     
    #135 Love_In_Rio, Nov 10, 2011
    Last edited by a moderator: Nov 10, 2011
  16. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Hmm... it should be common for SPUs to tear through datasets larger than 128K by streaming/staggering the data via DMA. The main issue is random access data, or data with too many dependencies (can't fetch early enough or in parallel). Most graphics jobs are highly parallelizable.

    Developers can also combine similar jobs together (both code and data) to make a good/bigger batch size.

    Once they are satisfied with a single SPU implementation, they will have more cores to distribute the workload in a predictable way.
     
  17. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    He was also involved in Sacred 2 development before moving to Sony, or am I confused?
    I know there are two members in the games industry who use two-letter pseudonyms; I tend to confuse them from time to time :???:
     
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    T.B. worked on Sacred 2 and moved to Sony's ATG in Cambridge. So he'll be working at a lower level on SPUs than just about anyone, being in the luxurious position of developing technologies without product deadlines to worry about, but perhaps hasn't the same experience with something like VMX128 (I don't think he programmed XB360 during Sacred 2). Whereas Sebbbi is all XB360 and no SPU experience. :D
     
  19. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,065
    Likes Received:
    1,660
    Location:
    Maastricht, The Netherlands
    If I were to loosely summarise what I have heard from some developers (and I could be very wrong), they would ideally have the local stores be a unified cache addressable in the main address space. I imagine you could then still lock parts of that memory space so that it works like local store in terms of predictability, but you have more flexibility, making it easier to use for those who do not code at a low level.
     
  20. assurdum

    Veteran

    Joined:
    Oct 31, 2008
    Messages:
    1,568
    Likes Received:
    0
    Oh my... he is a god to me :oops: Pretty curious to know whether FXAA would be possible on the PS3 through the SPUs & how it would work.
     