Looking for feedback: SPU Shaders

Discussion in 'CellPerformance@B3D' started by Mike Acton, Sep 26, 2007.

  1. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    Hey guys!

    Now that we're wrapping up Ratchet and Clank Future and getting that ready to ship, I can finally take a breath and come over for a visit!

    So if you haven't seen it, a little while ago I posted a small article on what we're calling "SPU Shaders".

    It's nothing really revolutionary -- and admittedly a very simple concept -- but what we really needed was something to bridge the "Cell gap". To make things more approachable without adding unnecessary complexity. And this was the result.

    The short of the idea is this: Have the people who know the Cell and can really focus on SPU programming build the large systems. (Same as before.) But leave a gap. Allow little fragments of SPU code to be injected into the system so that programmers who are more comfortable with scalar, sequential code can still get what they need on the SPUs (e.g. custom logic code).

    The key, I think, is not to concentrate on making these fragments completely optimal, but rather to concentrate on making them small, making them deal with more than one input at a time, and leaving complete control over decision making (e.g. memory transfers) in the hands of the fragment. This makes them optimizable while still being approachable.

    In this way we can concentrate on the "big" performance issues of the systems (lots of data, scheduling within the frame, etc.) and just dynamically load in the bits of branchy-scalar code on-demand.
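
    Purely as a sketch (made-up names, not our actual interface), the shape of a fragment's entry point might be something like:

    // Hypothetical sketch only. The system hands the fragment a batch of
    // inputs already in local store, plus one opaque user value; everything
    // else (including any further DMA) is the fragment's job.
    #include <stdint.h>

    struct SpuShaderArgs
    {
        void*    inputs;  // batch of instances, already DMA'd into local store
        int      count;   // number of instances in the batch
        uint64_t user;    // opaque user value (big enough for a main-ram pointer)
    };

    typedef void (*SpuShaderFn)(SpuShaderArgs* args);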

    So while the idea looks like it has legs for us -- and we've already reworked a couple of systems to support this concept -- I'm curious what everyone else thinks.

    Mike.
     
  2. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    I personally was thinking along the lines of functional programming ideas... the ability to plug the pieces you refer to as 'shaders' into framework code, i.e. glorified, special cases of external iterators - filter/map/reduce in various combinations, with various assumptions about how to get the data and what common workspace is required (e.g. indexed arrays vs pointer lists, and how to group data elements into DMA batches)... put another way, take the thinking behind Google's MapReduce as inspiration.

    I've already done this for threading on the 360, e.g. something like a ForEach that runs a functor on elements of a collection in parallel.
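
    Roughly this shape, as a minimal sketch (plain std::thread here rather than the 360 job API, and the names are mine):

    // Minimal sketch of a parallel ForEach: split a range into chunks and
    // run a functor over each chunk on its own thread.
    #include <algorithm>
    #include <thread>
    #include <vector>

    template <typename T, typename Fn>
    void Par_ForEach(std::vector<T>& items, Fn fn, unsigned workers = 4)
    {
        std::vector<std::thread> pool;
        const size_t chunk = (items.size() + workers - 1) / workers;
        for (unsigned w = 0; w < workers; ++w)
        {
            const size_t begin = w * chunk;
            const size_t end   = std::min(items.size(), begin + chunk);
            if (begin >= end) break;
            pool.emplace_back([&items, fn, begin, end] {
                for (size_t i = begin; i < end; ++i)
                    fn(items[i]);  // the plugged-in 'shader' piece
            });
        }
        for (std::thread& t : pool) t.join();
    }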

    Am I thinking along similar lines? You mention "leave memory transfers in the hands of the fragment", but I very specifically think those would be done in my helper code, as that seems to be the sticking point in added complexity when using SPUs.

    Definitely seems like the way to go to me, and also simplifies writing cross-platform code.

    Documenting a library of these things seemed very difficult. I think we'd end up with a myriad of combinations with very convoluted names - only a piece of equivalent sample code can document how each works - so how do you make it easy for newcomers to find what they need? (I don't have a solution.)

    The memory side complicates it - for example, where stl "foreach" junkies can just use a functor, we must provide "a function", "a function + constant helper data", "a function + an array to accumulate into", etc.

    I was looking to go back over 360 code and port to the SPU's more intelligently by re-factoring to fit these cases.

    The 'shader' analogy sounds like you're selling it to people who grew up on GPUs .. Ironic as it's almost the reverse, GPGPU is emulating what SPUs are designed for :)
     
    #2 ebola, Sep 26, 2007
    Last edited by a moderator: Sep 26, 2007
  3. minty

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    3
    Likes Received:
    0
    Mike,

    The article was interesting and inspired me to consider changes to the approach I took on a multi-core DSP design, particularly if we port to the Cell.

    Our DSP infrastructure design dealt with reusable DSP algorithm building blocks (e.g. FIR filter) and let the higher level code arrange algorithm sequences. Things like dynamically adding filtering effects for a video processing pipeline work well this way and are much more flexible than a statically placed 'streamable' sequence when load is highly variable.

    The biggest win is when the processing load is determined by the data or events on the system, akin I guess to variable numbers of collisions to process etc. Being able to dynamically react and make use of all processors at all times is a big win over pre-allocated 'cpu X does A', 'cpu Y does B' when load is highly variable.

    As well as allowing a low-level function interface, for some types of applications we also provide a lightweight wrapper and a more generic interface that is used in some applications with a higher level 'fragment sequence' and 'wiring' definition to plug data sources to sinks on the fly.

    This 'generic' enabling layer needs to be lightweight and frankly its overhead makes very small fragments less attractive. If something is going to run for say less than 100us the overhead can be significant.

    The original design had a core designated as the controller, which spooned the data out to each processing core while handling 'wiring' and data synchronisation. This was designed so that relatively dumb FPGA-based co-processors could co-exist with the CPUs and be spoon-fed over a very simple interface, but the attractiveness of FPGAs has waned since they are much more difficult to support long-term than code!

    We've been exploring much the same switch around as you describe with the individual cores driving what they do next without the need for controller involvement.

    In some applications we still need additional features of our controller such as multi-box co-ordination, automatic data persistence but you've inspired me to consider further splitting this out.

    ----------------
    Anyway back on topic :) ....

    I was curious to know what sort of inter-fragment overhead there is over, say, inline code. Is there a pragmatic minimum limit you'd set for a fragment, e.g. not efficient unless batching up at least X µs of work?
    I suspect you may begin to be guided more by the maintenance overhead of dealing with large sets of small fragments than by any performance overhead?

    My other curiosity was how you manage data availability, in so far as making sure data is available for a fragment to operate on. Do you only offer up a fragment when its data dependencies have all been satisfied (e.g. guided by a controller such as the PPU), or, in cases where data might not be guaranteed to be available, do you just explicitly put that synchronisation check in your fragment code?
     
  4. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    This isn't about hiding the SPUs...

    Ah that's a major point. I think it's really, really important that we don't do that. This is the path of "solve the 'problem' for the user so they don't have to think about it".

    That's not what we want at all. We want programmers to understand the issues. We want programmers to be able to take full advantage of the SPUs -- which means having the freedom to program for them in a way that's specific to their context (in this case, the shader).

    Trying to "help" the user by managing the data for them, etc. is a slippery slope to a lot of extra complexity (and more bugs). Which is exactly what we're trying to avoid.

    SPU shaders are just about putting the programmer in the right place at the right time -- i.e. Here's where you are in the pipeline, here's the data we have so far, go ahead and do what you need to do.

    We can't hide the fact that this code is on the SPU. We shouldn't. We don't want to. All we're doing here is giving the shader programmer a context for their code (and data) -- so it doesn't have to be about "where do I start" but more about "what unique transformation do I want to do here".

    I know it may seem counter-intuitive, but the idea is to make it easier for programmers to add stuff to the SPUs, not to add stuff to a software layer.

    Just like GPU shaders, you still need to understand the hardware. And just like GPU shaders, not much more is done for you except to hand you the context within which you need to work -- what you do with it is up to you.

    Mike.
     
  5. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    Well proven ground...

    True enough, but the key insight here (if you can call it that) is that the GPU shader concept worked. It allowed a whole lot of people to write code on a streaming processor in an approachable way. It also allowed enough freedom for people to optimize within a shader as they understood the hardware better.

    But first and foremost, it got people writing GPU code. And that was a good thing -- so I can only hope that a like approach will be successful on the SPUs.
     
  6. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    Giving data to the fragment...

    Basically there are two cases:

    1. System-dependent -- whatever the context of the system is at the point the shader is called, it will need to act on some data (multiple inputs, hopefully). The system has already processed this data to this point and, while it was processing it, started loading whatever fragment needs to be called in this context -- so the data is already in local store before the fragment gets called. This data comes in on the argument list. It's akin to interpolated vertex values being passed into a fragment shader.

    2. Shader-specific -- this is the data that only the shader knows about. It's specific to the actual unique transform or whatever this shader wants to do (maybe it needs a table from main ram? Maybe it needs to know the position of some unique character?). The calling system doesn't know anything about this data, so the shader will DMA up (or down) whatever it needs and manage that internally. Given multiple inputs, the cost of DMAing data that's global for every instance is amortized over the number of inputs being processed. All the system does is pass one user value (big enough for a pointer) to the shader, and whether or what that value represents is up to the shader.
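
    As a rough sketch of case 2 (hypothetical table/instance types and kernel, reusing the hypothetical SpuShaderArgs from my earlier sketch; the DMA calls are the ones from the Cell SDK's spu_mfcio.h):

    // Hypothetical shader: only this fragment knows that the user value is
    // really the effective address of a table in main ram. The one table
    // DMA is amortized over the whole batch of inputs.
    #include <spu_mfcio.h>

    enum { kDmaTag = 5 };  // any free tag 0..31

    static MyTable s_table __attribute__((aligned(128)));

    void my_shader(SpuShaderArgs* args)
    {
        mfc_get(&s_table, args->user, sizeof(s_table), kDmaTag, 0, 0);
        mfc_write_tag_mask(1 << kDmaTag);
        mfc_read_tag_status_all();               // wait for the table to land

        Instance* in = (Instance*)args->inputs;  // already in local store
        for (int i = 0; i < args->count; ++i)
            transform(&in[i], &s_table);         // shared DMA cost amortized over N
    }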

    Mike.
     
  7. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Hi Mike, I enjoyed reading the article. I think the premises make (perfect) sense given Cell's design principles.

    I get the idea that this is some sort of Cell-specific design pattern.

    In the meantime, my only question is:

    How can the system not be aware of the changes or dependencies on gameplay data? Doesn't it need to know at least when the data is safe to read/write, for concurrency control purposes?

    Or are you saying there are other policies in play ? Like...

    By the time the data/memory area is delivered to an SPU Shader, it has already been processed and may not change during this "logical shading period".

    Otherwise, it seems that the Shader would implicitly know when to exercise some concurrency control mechanics while handling the data, and when not to.

    EDIT: :Oops: I re-read the paragraph I quoted and realized that I missed the last statement: all of that is contained in the shader.
    Silly me; in that case... yes, it seems that the final system would be fully distributed (as much as it can be, anyway).
     
    #7 patsu, Sep 27, 2007
    Last edited by a moderator: Sep 28, 2007
  8. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,743
    Likes Received:
    11,225
    Location:
    Under my bridge
    I really don't have much to contribute, but I would suggest a change in the name. The term 'shader' isn't really applicable in any sense, as it's referring to generic code not dealing with 'shading' on any level. It's more a code fragment, or SPUlet, or somesuch. Overloading the term 'shader' is going to make it harder to talk about the systems: "I've an AI shader on Cell and a geometry displacement shader on RSX", where the two things are fundamentally different types of code dealing with very different content.
     
  9. LordOfThePing

    Newcomer

    Joined:
    Sep 30, 2007
    Messages:
    4
    Likes Received:
    0
    Location:
    Fair Oaks, CA
    Thank you for the paper Mr. Acton.

    The last project I worked on was for the CBE and we used a method very similar to what you describe. The overhead associated means it is no "magic bullet" but it can be very useful in some cases.

    Admittedly, I'm no expert, but having that "bit" built in is good for a few key reasons. Mainly, fragments are independently optimizable. To me that just seems like a smart design choice. A secondary benefit that I've seen is that it gets different parts of the development team talking. From my observations, getting all groups to understand where the others are can lead to some very insightful conversations.

    I'd say you're barking up the right tree Mr. Acton.
     
  10. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Welcome to the forum, LordOfThePing. Are you at liberty to discuss your previous project ;-)? ...like, what does it do?
     
  11. LordOfThePing

    Newcomer

    Joined:
    Sep 30, 2007
    Messages:
    4
    Likes Received:
    0
    Location:
    Fair Oaks, CA
    Thank you for the welcome.

    In the broad strokes, it was a platform to turn vehicles into mobile sensor platforms.
    The Powerblock200 offers a lot of computational ability in a very small form factor. That opened a lot of doors for us. Being able to process and encode video in real time, send it over the network, and still have clock time left over for range finding and/or GPS awareness simply can't be done on anything else right now.

    Forgive me, I'm still very limited in what I can say. At least I can finally be "non-specific". :???:
     
  12. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    Overloading "Shader"

    Yes, we're overloading the word shader. But as pointed out above, the GPGPU community has already been using shaders for non-shading tasks for quite a while, so I don't think we're introducing anything new here.

    Also, I'm inclined to disagree that it isn't applicable "in any sense" -- I believe it is, in all the most important ones: a small fragment of code, optimizable within a larger framework, decision making and additional memory management left up to the fragment, part of a larger pipeline, etc.

    And anyway -- The main value is that it evokes a certain idea and expectation of simplicity in the mind of the user and that makes it compelling.

    Mike.
     
  13. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    Ok, looks like I'm trying to solve a slightly different problem, i.e. cross-platform development... trying to create systems that can run data-parallel on both the 360 (5-6 threads) and the SPUs.
    Sure, I agree hiding too much from the user is a bad thing... and I still have this problem of naming all the permutations of my helper code. Reduce complexity somewhere and it pops back up elsewhere. It might be that what I want to do is impossible to document :(
     
  14. LordOfThePing

    Newcomer

    Joined:
    Sep 30, 2007
    Messages:
    4
    Likes Received:
    0
    Location:
    Fair Oaks, CA
    Ouch, I can only imagine the issues that would arise under those conditions. Regardless, having an intimate understanding of the hardware can make the task less of a mountain, maybe a mole-hill. In the games sector it seems multi-platform titles are middleware "heavy". While, I guess, that makes sense from a production standpoint, one is limited by "how efficient is the middleware".

    Forgive me for going OT. Perhaps that question is worthy of its own topic.

    Mr. Acton, in your paper, and in general regarding "dynamic SPU code", you mention reducing synchronization points between the on-chip assets. I was curious, would you be able to expand on that?

    You'll have to forgive me, I am pretty ignorant about developing games on the PS3 platform. From my experience "System-dependent" executions may require a synchronous state to be "cost effective". Any thoughts about the advantages/disadvantages between the two cases you mentioned?
     
  15. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    Cross-platform issues

    Absolutely, I agree - these are different problems - I consider myself very lucky to have the opportunity to focus on just the PS3 as a platform.

    However, I'm not sure why this, in particular, is a cross-platform issue.

    Re: memory management in the shaders

    You can still leave responsibility for managing (local) memory up to the shaders. In a cross-platform environment, certainly the shader writers would want a separate library or set of functions to abstract things where they want to be cross-platform. So apart from the main system, the shaders can still use cross-platform code, so that remains a cost/time issue on the application side. (Just like deciding to use "generic" Cg or HLSL instead of card-specific options or microcode.)

    Re: naming fragments

    I'm not clear on why this is an issue. This is the same kind of thing as with GPU shaders, is it not? Especially in this case, where code fragments are permutations of some settings, why would we care what the name is? Convert the settings into an index and just look up the code location in a table. -- You can do this statically too and just grab the addresses from the linker (assuming this is part of a single monolithic executable), right? Generate arbitrary names in this case, since they won't be used directly.
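
    A trivial sketch of what I mean (made-up names):

    // Made-up sketch: pack the permutation settings into a bitfield and use
    // it directly as the table index -- the fragment's human-readable name
    // never matters at runtime.
    enum { kSrcIndexed = 1 << 0, kDstIndexed = 1 << 1, kSrcPtrList = 1 << 2 };

    struct FragmentEntry { const void* code; unsigned size; };

    static FragmentEntry s_fragments[8];  // filled from linker-generated addresses

    const FragmentEntry* find_fragment(unsigned settings)
    {
        return &s_fragments[settings & 7];  // the settings bits are the index
    }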
     
  16. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    The short answer?

    This question deserves a well-thought-out answer, but you'll have to make do with what I have to say on the subject. ;)

    Probably (in the context of not being familiar with game code) the most significant bit of information you're missing is the concept of the game loop. Depending on the framerate, a typical game will loop roughly every 16ms (for 60fps) or 32ms (for 30fps). The vertical sync (vsync) is a natural synchronization point for the entire game. So, although not really the whole picture, the gist of it is that a game is really a big digital signal generator (and the clock is the vsync). So in that context, when I talk about synchronization points, usually (but confusingly not always!) I'm referring to the number of sync points within a frame, or the number of sync points every so many frames. And SPU systems usually (but not always!) run during a specific section of the frame (scheduled manually, preferably) -- e.g. a system might run for 4ms every 16ms across two SPUs.
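
    In the crudest possible sketch (purely illustrative names):

    // Purely illustrative: vsync is the global clock, and SPU systems run
    // in fixed, manually scheduled slots within the ~16ms frame budget.
    void game_main_loop(void)
    {
        for (;;)
        {
            wait_for_vsync();     // the natural whole-game sync point
            kick_spu_systems();   // each system runs in its own slot in the frame
            update_simulation();  // PPU work overlaps the SPU slots
            kick_render();
        }
    }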

    Now, there are quite a lot of ways that sync points can be reduced here (which is necessary for good pipelining).

    But specifically a simple example:

    --- CASE 1 ---
    For each instance of data (assuming each has custom code that must be run), I read the data (let's say on the PPU), determine the type, and ready the appropriate code and data for the SPU.

    The SPU recognizes that this data has been added (to some queue? - which would most often be implemented through some atomic synchronization method), processes it and sends the results back to main ram, marking (somewhere) that it is ready.

    At some further point, the PPU checks that the result is ready (probably sync'ing at this point - assuming that it will be ready "most" of the time) and then uses the final result however it will.

    This method treats the SPU like a co-processor and, when scaled to large amounts of data and many systems, introduces a lot of sync points and a lot of stalling, and is a pain to schedule.

    --- CASE 2 ---
    The data is already sorted by type, where type corresponds to the fragment of code that will need to be run.

    The SPU system is started once (at some point within the frame), walks through all the data, dynamically loading the appropriate code and the fragments themselves determine where to DMA the results.

    But let's say the results do need some post-processing by some PPU function (as in CASE 1). In this case the SPU shader could DMA the results back into an intermediate buffer (which could be some scratch memory used by multiple systems which do not overlap each other in time).

    At some further point the PPU would have a single sync point to check when the entire system of data has been processed. And at that point it could process the entire result queue at once in a loop and do whatever post-processing was needed.

    In this case the sync points were reduced from many per frame to one per frame (which is much easier to schedule).
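
    The PPU side of CASE 2 might look something like this (hypothetical names, obviously simplified):

    // Hypothetical shape of CASE 2 on the PPU: kick the system once, then a
    // single per-frame sync point before the whole result queue is consumed.
    #include <stdint.h>

    extern volatile uint32_t g_system_done;  // set by the SPU side via DMA

    void ppu_frame_update(void)
    {
        kick_spu_system();         // fragments stream through the sorted data

        do_other_ppu_work();       // overlaps the SPU system's slot

        while (!g_system_done) {}  // the one sync point for the whole batch

        post_process_results();    // walk the entire result queue in one loop
    }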

    ----

    Now granted, these examples aren't ideal either way, but they are relatively typical. Ideally, in this case, the post-processing itself would also be done in the shader (or in a secondary one) and no PPU synchronization would be needed at all - it would typically either just always use the most recent data or double-buffer the values (which would then be "flipped" by the SPU system - or, even more likely, be globally flipped by the vsync).

    This is just one example - reducing synchronization points is just like optimizing memory or code or anything like that - every case has its own issues, and the solution has to fit the system and how it's used in the context of the rest of the game.

    I'm specifically arguing against the one-size-fits-all uber-solutions so I'm definitely not going to say that this is the pattern one would use for everything. But it's not unusual.

    I hope that at least answers your question in part.

    Mike.
     
  17. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    Our earlier quick & dirty ports from 360 -> SPU (keeping the same code path compiling on both) have used smart pointers (not as bad as it sounds - mainly a syntactic salt system, just inefficient from lacking double-buffering).

    By making an 'external iterator' do the transfers, I figure it's easier to write re-usable double-buffering schemes.
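
    e.g. roughly this shape (hypothetical Elem/BATCH; the DMA calls are the spu_mfcio.h ones):

    // Sketch of what the iterator could hide: two local-store buffers
    // ping-pong on DMA tags 0 and 1, so the kernel never sees a transfer.
    #include <spu_mfcio.h>
    #include <stdint.h>

    static Elem s_buf[2][BATCH] __attribute__((aligned(128)));

    void iterate(uint64_t ea, int batches, void (*kernel)(Elem*, int))
    {
        if (batches <= 0) return;
        mfc_get(s_buf[0], ea, sizeof(s_buf[0]), 0, 0, 0);  // prefetch batch 0
        for (int b = 0; b < batches; ++b)
        {
            const int cur = b & 1;
            if (b + 1 < batches)                           // prefetch the next batch
                mfc_get(s_buf[cur ^ 1],
                        ea + (uint64_t)(b + 1) * sizeof(s_buf[0]),
                        sizeof(s_buf[0]), cur ^ 1, 0, 0);
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();                     // wait on the current buffer only
            kernel(s_buf[cur], BATCH);                     // user code, no DMA in sight
        }
    }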

    So what I mean by this is naming my external iterators, not the back-end code fragments:

    On the 360, my "Par_ForEach" can take wrappers for collection classes and hence handle vectors, pointer-lists, indexed data.. functions or functors with context data..

    .. whereas in SPU land I'll need to write:

    template<typename SRC, typename DST>
    void Transform(SRC* srclist, DST* dstlist, int count,
                   void (*kernel)(SRC*, DST*))
    {
        // applies 'kernel' to fill dstlist with data created from srclist
    }
    then
    Transform_SrcIndexed(SRC* srclist, U16* srcIndices, DST* dstlist, int count, ...)
    Transform_SrcPtrList(SRC** srclist, DST* dstlist, int count, ...)
    Transform_DstIndexed(SRC* srclist, U16* dstIndices, DST* dstlist, int count, ...)
    etc etc..

    I might be able to make some template policies to shove in as arguments... but the complexity re-appears, in that heavily nested systems of templates can be a nightmare for anyone other than the original author to use.


    Hehe, in a multiplatform team I don't think you'll find the PS3 programmer saying he's lucky...
    We all agree that, given the time, we'd like to architect for the SPUs first and then work back... giving us cache-friendly algorithms by design :)

    We've just gone from primarily Xbox-exclusive to multiplatform (a driving game), so a lot of sheltered 360 programmers are in for a shock...
     
    #17 ebola, Oct 3, 2007
    Last edited by a moderator: Oct 3, 2007
  18. LordOfThePing

    Newcomer

    Joined:
    Sep 30, 2007
    Messages:
    4
    Likes Received:
    0
    Location:
    Fair Oaks, CA
    Thank you for the reply Mr. Acton.
    You would be correct. I failed to realize the cyclical nature of game code. It may get a laugh, but I had a small eureka moment when you mentioned the display rate. In light of that, I think the method you describe has even more merit - particularly in information systems with a regular synchronous state, such as games.

    I probably should retract my statement about any methods I'm familiar with being similar to the "SPU Shader" method. Analogous now seems like a more fitting term. Almost all of our dynamic code is SPU initiated with minimal PPU involvement. Leaving the PPU open for function calls over the network and "general housekeeping" was a prime need for the system.

    :lol: I'm inclined to agree with you. On all counts...


    Thank you for your time, sir.
     
  19. Mike Acton

    Mike Acton CellPerformance
    Newcomer

    Joined:
    Jun 6, 2006
    Messages:
    47
    Likes Received:
    2
    Location:
    Burbank, CA
    Update

    Hey guys,

    I've posted another talk I gave on the SPU Shader concept a while back (after the feedback on this thread).

    [More on SPU Shaders]

    Mike.
     
  20. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    Hi Mike,

    I finally got round to reading this excellent presentation in full. As always, thanks for sharing. I already had the impression that SPURS wasn't going to be all that effective, but boy, you guys really didn't like it. :D

    I'm also impressed by how well you've developed the concept. The slideshow presents the genesis and development of the idea very clearly, so that you can mentally evolve toward it, and then gives a very clear and evidently well-tested manual on how to implement it, with a nice example too. It's one thing to develop something useful; it's another to effectively share the knowledge!

    Great job, and a must-read for anyone developing on the PS3/Cell. I have a feeling that you can implement this on a great number of levels, and it appears to me to be very useful to multi-core processing in general.

    This is particularly required reading for the participants in this thread (hint: credits ... ;) )
     