Three new DirectX 11 demos

Discussion in 'Rendering Technology and APIs' started by Andrew Lauritzen, Jul 31, 2010.

  1. Andrew Lauritzen
    Hey all,

    Just wanted to throw out the links for three new DX11 demos from the "Advances in Real-Time Rendering in 3D Graphics and Games" and "Beyond Programmable Shading" courses at SIGGRAPH (slides will be posted soon). For now, you can run the demos (if you have a DX11 card) and check out the source at:

    http://visual-computing.intel-research.net/art/publications/deferred_rendering/
    http://visual-computing.intel-research.net/art/publications/sdsm/
    http://visual-computing.intel-research.net/art/publications/avsm/

    Enjoy!
     
  2. Florin
    Had to get the latest DX11 redist to get d3dcompiler_43.dll, and had to install the 32-bit version of the VC2010 redist (running Win7 x64).

    Those shadows in SDSM are really pretty. I wonder if there's a way to get AA on the non-shadow polygons too (without driver control panel overrides, that is).

    Also, I noticed weird dark spots over basic textures using Nvidia 3D glasses.

    And sadly avsm.exe crashes for me:
    Problem signature:
    Problem Event Name: APPCRASH
    Application Name: AVSM.exe
    Application Version: 0.0.0.0
    Application Timestamp: 4c4bb4a6
    Fault Module Name: nvwgf2um.dll
    Fault Module Version: 8.17.12.5721
    Fault Module Timestamp: 4c0d6d9b
    Exception Code: c0000005
    Exception Offset: 0030d49a
    OS Version: 6.1.7600.2.0.0.768.3
    Locale ID: 1033
    Additional Information 1: 0a9e
    Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
    Additional Information 3: 0a9e
    Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

    Cheers for posting :)
     
  3. Andrew Lauritzen
    Yup it mentions those in the readme :) Glad that you figured it out though and thanks for linking them directly in the thread (note that AVSM may require the VC2008 SP1 runtime files if you don't already have them - many games install them). FWIW all of the project files support native 64-bit builds but we just bundled 32-bit executables for client compatibility.

    Yeah, also mentioned in the readme... it uses deferred shading, so MSAA takes a bit of work (assuming you mean just normal polygon MSAA). However it is quite possible, as the deferred shading demo specifically shows :) I just haven't ported that over to the SDSM app yet.

    Yeah no idea what the 3D stereo stuff is messing with in the app and no way for me to test ;)

    Also noted in the readme: it crashes the shader compiler on current NVIDIA cards (works on ATI). They know about it though and I imagine they'll have it fixed soon.
     
  4. Florin
    Ooh readme.txt, right...radical concept :eek:
     
  5. ShootMyMonkey
    Cool. Hope the course materials go up soon as well. I was off at the OSL presentation, so I missed your part of the course and only got back in time to catch the tail end of Marco's bit. I was kind of saddened by the Uncharted piece -- not so much because it was a bad talk, but because he painted such a depressing picture of what he had to work with.
     
  6. Humus
    Very interesting stuff. Haven't had the time to look deeper into it yet, but the demos look nice.
     
  7. rpg.314
    The stuff is already up.
     
  8. nAo
    AVSM and SDSM presentations from the "Advances in real-time rendering course" will be available at the end of this week.
     
  9. Andrew Lauritzen
    The remaining presentations (SDSM and AVSM) are now available at the above links as well as a video for SDSM. Enjoy!
     
  10. Jawed
    In the SDSM presentation, slide 24, the histogram algorithm performs vastly slower on GTX480 than on HD 5870, 7.2ms versus 1.4ms. What the hell?

    This is only a shared memory atomic operation, isn't it? Not a global atomic? (Notes on that slide imply it is shared.)

    One of the peculiar things about the atomics in OpenCL (1.0 and 1.1) is that they always return a value, whereas the D3D11 atomics make a return value optional. (I presume the SDSM algorithm ignores the return value.)

    http://msdn.microsoft.com/en-us/library/ff471406(v=VS.85).aspx
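
    For concreteness, the two HLSL forms look something like this (the histogram and bin names here are just invented for illustration, not taken from the demo):

        groupshared uint gHistogram[1024];   // size is purely illustrative

        void CountSample(uint bin)           // bin = some computed bin index
        {
            // No-return form: can be treated as fire-and-forget.
            InterlockedAdd(gHistogram[bin], 1);

            // Form with the optional third argument: it returns the pre-add value,
            // which forces a result to be routed back to the issuing thread.
            uint previousValue;
            InterlockedAdd(gHistogram[bin], 1, previousValue);
        }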

    Why not offer both in OpenCL?

    Do NVidia atomics always return a value in hardware? Is that the motivation for the OpenCL spec making the return value non-optional?

    Is the performance problem really due to atomics on NVidia?
     
  11. Andrew Lauritzen
    Yeah that was my reaction as well :) Incidentally this is what particularly motivated writing the fast reduction path.

    Correct. These are very fast on ATI (almost full speed if they don't conflict!) but apparently less so on NVIDIA. Indeed, they are also the atomics that do not return values, so no latency hiding is needed and they can theoretically be fully pipelined.

    Good question - unless they assume that compilers will detect when the return value is unused.

    I can only assume so as that shader isn't doing much else. Check out the source code if you want the details but it is literally just a pile of local atomics followed by a few global atomics (these turn out to be faster than writing out all the local histograms and then reducing them). I played with removing the global atomics on NVIDIA but it made very little performance difference... it appears to be the local atomics in particular that are much slower than on ATI.
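
    For anyone reading along without the source handy, the shape of that shader is roughly the following - a minimal sketch; the bin count, tile size, group size, resource names and helper functions are all made up here rather than copied from the demo:

        #define NUM_BINS    64
        #define TILE_DIM    32
        #define TILE_PIXELS (TILE_DIM * TILE_DIM)
        #define GROUP_SIZE  256

        Texture2D<float>         gDepth      : register(t0);
        RWStructuredBuffer<uint> gGlobalHist : register(u0);

        groupshared uint gLocalHist[NUM_BINS];

        // Hypothetical helpers: map a flat index to a pixel in this group's tile,
        // and quantize a [0,1) depth value to a bin index.
        uint2 TilePixelCoord(uint2 tile, uint s)
        {
            return tile * TILE_DIM + uint2(s % TILE_DIM, s / TILE_DIM);
        }
        uint BinFromDepth(float z)
        {
            return min(uint(saturate(z) * NUM_BINS), uint(NUM_BINS - 1));
        }

        [numthreads(GROUP_SIZE, 1, 1)]
        void HistogramCS(uint3 groupId : SV_GroupID, uint gtid : SV_GroupIndex)
        {
            // Clear this group's local histogram.
            for (uint i = gtid; i < NUM_BINS; i += GROUP_SIZE)
                gLocalHist[i] = 0;
            GroupMemoryBarrierWithGroupSync();

            // The "pile of local atomics": every sample in the tile increments a
            // groupshared bin, and typical depth data makes most of these collide.
            for (uint s = gtid; s < TILE_PIXELS; s += GROUP_SIZE)
            {
                float z = gDepth[TilePixelCoord(groupId.xy, s)];
                InterlockedAdd(gLocalHist[BinFromDepth(z)], 1);
            }
            GroupMemoryBarrierWithGroupSync();

            // The "few global atomics": merge this group's result into the global
            // histogram rather than writing it all out and reducing afterwards.
            for (uint b = gtid; b < NUM_BINS; b += GROUP_SIZE)
                InterlockedAdd(gGlobalHist[b], gLocalHist[b]);
        }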

    You can imagine that this and a number of other issues that are now exposed by these fairly low-level compute models complicate writing performance-portable code a lot...
     
  12. Jawed
    Pretty lucky with how it worked out in the end...

    I guess NVidia is doing local atomics as math in the ALUs, which would have the same speed for return/no-return value.

    What I don't get is why there's such a severe slowdown when shared memory banking should mean that non-colliding addresses are often full speed (purely a question of bank conflicts - though Fermi architecture is more susceptible to bank conflicts than GT200).

    Ah yes, that would work.

    Back here:

    http://forum.beyond3d.com/showthread.php?p=1305337#post1305337

    when we had quite a bit of fun with the treacherous "Dickens word count" histogram, you were alluding to non-random distributions of data causing severe problems for GPU algorithms that depend on some form of software-managed cache.

    I suppose this is that application: depth is generally not random at all. I imagine your algorithm lops bits of precision off Z so that there aren't too many bins, resulting in a huge collision rate and a disastrous slowdown on NVidia.
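
    (Just guessing at what that coarsening might look like - a made-up example, not the demo's code: take a [0,1) depth value, drop it to fixed point and keep only the top bits, e.g.

        uint zBits   = uint(saturate(z) * 16777215.0);  // 24-bit fixed-point depth
        uint zCoarse = zBits >> 14;                      // keep the top 10 bits -> up to 1024 bins

    so wildly different depths still fall into a relatively small number of bins.)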

    So this poor atomic performance might be something of a corner case in that respect. Overall, atomics are still a little too new on GPUs to really know...

    Yeah it's a great example. Actually, it's the kind of thing that could persuade developers to ignore atomics entirely, which seems like quite a shame.

    Jawed
     
  13. Andrew Lauritzen
    Yup although it was really the combination of that and finding the robustness issues with the more complex schemes (that require the full histogram) that made the reduce path the clear winner.

    They can't be doing lane collision analysis in the ALUs, can they? That would be crazy-expensive! Alternatively are you implying that they can somehow "lock" memory banks over multi-instruction sequences and then loop in the core to serialize conflicts?

    Yes, precisely. In this case there are two problems:
    1) Collisions. They are super-common with this data set. It's a big win if you can lane-swizzle in your SIMD and do some cheap address comparisons to handle the majority of these cases and serialize the rest as appropriate. Unfortunately none of the current programming models have a lane-swizzling mechanism because they all try to pretend it's not SIMD ;)
    2) Bin distribution for a spatial region. Each core is handling a tile of the screen, and in that tile (or sometimes even the whole distribution) only a small number of bins are touched. Statically allocating local memory for the whole local histogram - and incidentally reducing the number of wavefronts in flight as a result - is just a bad model... you want a proper cache. I would implement this by relying on Fermi's L1$ but I don't believe their global atomics are fully cached... i.e. just the switch from local to global atomics would probably slow it down a ton, even if there's no data sharing between cores at all.

    Note that this was disastrously worse when we had to use vertex scatter + ROPs back in the pre-DX11 days... collisions in that pipeline were ridiculously expensive since they amounted to global atomics even with some clever use of multiple accumulation bins + reductions. Local atomics are the only thing that make this sort of algorithm feasible at all on GPUs but we still have a ways to go in the hardware.

    Yup definitely possible as it is a fairly special data set, but not an abnormal one for graphics ;)

    Yup we'll see. I would obviously prefer that people rise to the bar that ATI has set though :)
     
  14. nAo
    If your programming model doesn't explicitly expose horizontal ops and you have to do interlane/interthread communication vertically, your hardware and/or compiler had better be very good at local atomics.
     
  15. Jawed
    Well, it is slow...

    I guess someone could analyse this stuff in detail. Different patterns exploiting either address conflicts or bank conflicts, in varying counts per hardware thread, would enable a decent characterisation.

    I don't think that's strictly true. Shared memory enables lane-swizzling, though obviously with latency.

    GPU matrix multiplication is a brilliant example of a very harsh lesson: it looks insanely, almost-trivially, parallelisable but it's taken years for optimal algorithms to appear on NVidia and ATI - and the algorithms are quite different, too.

    I'm not saying there's a way to make atomics vastly faster on GF100 for this kind of data/bin configuration - merely that it's early days.

    Before, we saw that there was a huge speed-up on x86 with the Dickens histogram by carefully fine-tuning for the cache architecture. Having a "proper cache" didn't obviate this optimisation at all...

    Yeah. It does seem to me that Microsoft (or whoever) had its head screwed on with the optional return value thing.
     
  16. Andrew Lauritzen
    Not only latency but it requires a barrier which can't be used inside "potentially divergent" control flow. It's not the same.

    Sure, but this points to the fact that the value of a portable abstraction layer (which carries penalties) is diminishing rapidly if you need to use different algorithms on different hardware anyway. This makes it all the more stupid that our languages are handcuffing our ability to write optimized code in the name of so-called "portability".

    On a separate note, I'd be really happy if someone could come up with a better algorithm to solve the SDSM histogram problem. It's not like I just tried the one thing... I tried the full gamut of different ways to do it and all were as slow or slower. Hopefully a faster algorithm will "emerge", but that's not doing me a lot of good today :)

    Sure, and it wouldn't be a problem if ATI hadn't already demonstrated that it can be done very well :) It's all relative.

    Not sure what you're trying to say exactly... yes obviously caches need tuning just like everything else but the static allocation/partitioning of resources model that GPUs use right now is simply not flexible enough for cases like this that have a small but *data-dependent* working set.
     
  17. Jawed
    Not all hardware has that programming restriction. Yet another dislocation between the reality of hardware and the language. (And if you get stuck in an infinite loop it's your own fault!)


    I think there are two conflicting things going on here:
    1. Learning by IHVs and API bodies - what works, what do developers want, and when does the transistor budget allow it at a useful performance level
    2. Abstraction - to stand a chance of using it
    One of the advantages of the current abstractions is that they aren't imposing a low-level schema that we'll be stuck with forever after. Currently we're not witnessing a build-up akin to x86 cruft in the hardware.

    The price for these exciting times (accompanied by fame for originating the cool stuff) is a hell of a lot of bootstrapping. SDSM looks fuck-off cool, by the way.

    If one does turn up it'll be annoyingly obvious looking :razz:

    If it turns out NVidia is not using the ALUs to do this, then I have to say I've not got the foggiest why it's so slow. Shared memory has the bank count and bandwidth in raw terms to be in the same ball park as ATI. That's what's so disturbing.

    Well, at least NVidia has taken a step in this direction with the L1$. Global atomics are nearly useless on GT200, but the ball is rolling, etc.

    Do you have some other histogram-based algorithm that's really slow on ATI because it's a poor fit for the architecture? You must be working on something else... Guess you can't say.
     
  18. Andrew Lauritzen
    Sure, but that's my point: it's precisely due to the languages trying to "abstract" over SIMD widths. Obviously if you had a purely MIMD machine then lane swizzling *wouldn't* be free or even make much sense. However the reality is we can probably put a reasonable lower bound on the SIMD widths of these GPUs... 4 is practically guaranteed and even 16 is probably reasonable.

    Similar programming model problems arise when trying to implement persistent threads in current APIs.

    Debatable... we're already seeing legacy problems with code compiled to static shared memory sizes. Similarly, the most efficient code chooses block sizes tied to the number of cores on the GPU. These are significant problems going forward.

    Glad you like it :) To be clear, I'm not saying that the GPU models are a failure or anything - obviously this stuff would only be possible with them! I'm more commenting on the things that keep coming up and appear to be problematic moving forward... the things I think we should be focusing our innovation and research efforts on improving. I definitely do not think the current GPU computing models are near an end state, and I think most people would agree with that.

    I sure hope so! In my presentation I unsubtly hinted to the audience to find me a faster implementation so maybe with all those smart heads there and a touch of motivation/competition we'll get something :)

    Agreed. I think caches are going to be absolutely necessary going forward and I imagine they will be better-integrated with atomics in the future as well. It's certainly quite possible to do well in hardware.

    Not really doing any other histogram-related stuff right now... I didn't mean to imply that I was. I've been doing mostly deferred shading stuff lately (as per the other demo) which brings out its own interesting hardware and software puzzles :) Check my presentation and demo above for the initial batch, although there's nothing quite as drastic as the 4x perf difference in the SDSM histogram path (although frankly I do find the MSAA scheduling results to be pretty interesting in terms of programming models and hardware scheduling moving forward).
     
  19. Jawed
    LRBni doesn't provide arbitrary lane swizzle, does it? Swizzles are restricted to quad-lane neighbourhoods. Scatter/gather via a cache line is required.

    As for SIMD width, 16 looks like a safe minimum for a few years yet.

    I dare say that seems likely to be an explicit feature of D3D12. Fingers-crossed.

    I was referring to cruft in the hardware implementation.

    Cruft in the software is an on-going problem. The size of shared memory is just one variable out of many that results in "over-optimisation" for today's hardware. Cache-line size, L1$ size, SIMD width, register file size, DDR burst length and DDR banking are some others.

    NVidia, with CUDA, has attempted to obfuscate some parameters to prevent "over-optimisation". Guaranteeing warp size of 32 for the foreseeable future is good, though shared memory in Fermi (or at least GF100) has new bank conflict issues.

    Even when hardware parameters are obfuscated developers are liable to dig, resulting in potential "over-optimisation".

    I dare say game developers are mostly still trying to catch up with D3D10(.1) and so their input on what's needed for D3D12 is limited. Obviously there are people at the cutting edge and maybe it's best that there's only a few of you stirring the pot.


    I still haven't looked at your code, but these are three generic ideas:
    1. hash by SIMD lane - this is similar to the SR (globally shared register) technique that I referred to before on ATI.

      e.g. with a maximum of 2048 bins and a shared memory capacity of 32KB, you can hash by a factor of 16. The obvious key is (absolute work item ID & 15), so you can do (ZCoarse << 4) + (WorkItemID & 15) to generate a collision-free atomic address (sketched just after this list).

      Obviously with more bins you'd hash by a smaller factor and so would suffer some collisions.

    2. tile by work item - generally you should be fetching multiple samples per work item in a coherent tile, between atomics.

      There are two benefits here. First the hardware prefers to fetch coherently (e.g. gather4) and ATI likes to fetch wodges (i.e. 16x gather4) and has the register file capacity to do so. Second by tiling like this you automatically serialise ZBuckets that, being neighbours in 2D, are likely to collide.

      e.g. each work item fetches an 8x8 tile.

    3. scatter work items - to improve the serialisation of ZCoarse (if 1 and 2 don't eliminate collisions), you can make each work item discontiguously sample, reducing the chances of collisions amongst neighbouring work items.

      e.g. if you are doing 8x8 tiles per work item from technique 2. then work item 1 is offset from work item 0 along the diagonal, starting at 8,8, work item 2 starting at 16,16 etc.
    Undoubtedly the different cards will want different tunings for techniques 2 and 3. 2 and 3 need to be balanced against each other, with 2 improving cache coherency while 3 spoils it.
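
    To make idea 1 concrete, here's roughly how the hashed layout might look in a compute shader. The bin count and hash factor are picked arbitrarily, and it tracks a single counter per bin rather than whatever the demo's bins actually hold:

        #define NUM_BINS  128
        #define HASH_WAYS 16    // replicate each bin 16x across SIMD lanes

        groupshared uint gHashedHist[NUM_BINS * HASH_WAYS];

        // Accumulation: neighbouring work items land in different replicas of the
        // same bin, so same-ZCoarse increments no longer collide on one address.
        void AccumulateHashed(uint zCoarse, uint workItemId)
        {
            InterlockedAdd(gHashedHist[(zCoarse << 4) + (workItemId & (HASH_WAYS - 1))], 1);
        }

        // Final pass: fold the replicas back down into one histogram.
        void ReduceHashed(uint gtid, uint groupSize)
        {
            for (uint bin = gtid; bin < NUM_BINS; bin += groupSize)
            {
                uint total = 0;
                for (uint way = 0; way < HASH_WAYS; ++way)
                    total += gHashedHist[(bin << 4) + way];
                // ... combine 'total' into the global histogram here
            }
        }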

    I've only had a quick look at that...
     
  20. Andrew Lauritzen
    Swizzles within the quads are free, and beyond that you can shuffle the 128-bit blocks using an instruction. That's all you need for horizontal reductions/address comparisons. Furthermore, even if it did not, any swizzle neighbourhood (even 4) is better than none.

    I'm not as convinced that there's quite enough buy-in on this yet, but maybe OpenCL's first pass at a task system will motivate some innovation in that space.

    Sure, although it's worth noting that increasing cache sizes behaves better with legacy code than increasing scratch pad memory (which just goes unused).

    I can buy the "over-optimization" argument on CPUs where you're talking about single-digit % increases in a lot of cases (sometimes more, but on balance) but on GPUs you're often talking about an order of magnitude... that's too much performance to leave on the floor for "portability".

    Right but the bins aren't just sums - they are 7 32-bit values each! There's not enough shared memory to amplify them even 2x at the moment. You might be able to pack a few things into half-floats and get a 2x spread but you're definitely not going to get to one per SIMD lane!
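
    (For scale: with, say, 1024 bins - an assumed number, not the demo's - that's 1024 x 7 x 4 bytes = 28KB, which on its own nearly fills a 32KB local store, so even a 2x spread doesn't fit, let alone 16x.)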

    Of course the large majority of these bins don't get touched for a given tile so if you only had a cache... ;)

    Yes mostly already done. Tile sizes are decoupled from the compute domain and chosen so that there are ~ as many tiles as required to just fill the GPU. This is important to minimize global writes/atomics traffic at the end of each work group.

    I played with strided vs "linear" lookups across the thread group but the latter were generally faster. If NVIDIA's coalescing logic remains the same then the latter will definitely be faster. I haven't played with using gather4 explicitly though... it's quite an annoying programming model - if they want the accesses like that they have free rein to reorganize the work items in the group.

    The lookup from the Z-buffer is *definitely* not the bottleneck though, so I'm hesitant to try and optimize this much more.

    Yup I played with more random sampling to reduce collisions and it does work but with one huge problem: you can't have a massive performance falloff in the case where EVERY pixel on the screen collides. Put another way, worst case performance is what matters, not making the easier cases faster. This is a key point for game developers and one that I've heard often. Thus I haven't put a lot of effort into making the fast cases faster :)
     