Three new DirectX 11 demos

Andrew Lauritzen

Hey all,

Just wanted to throw out the links for three new DX11 demos from the "Advances in Real-Time Rendering in 3D Graphics and Games" and "Beyond Programmable Shading" courses at SIGGRAPH (slides will be posted soon). For now, you can run the demos (if you have a DX11 card) and check out the source at:

http://visual-computing.intel-research.net/art/publications/deferred_rendering/
http://visual-computing.intel-research.net/art/publications/sdsm/
http://visual-computing.intel-research.net/art/publications/avsm/

Enjoy!
 
Had to get the latest DX11 redist to get d3dcompiler_43.dll, and had to install the 32-bit version of the VC2010 redist (running Win7 x64).

Those shadows on sdsm are really pretty. I wonder if there's a way to get AA on the non-shadow polygons too (without CP overrides that is).

Also, I noticed weird dark spots over basic textures using Nvidia 3D glasses.

And sadly avsm.exe crashes for me:
Problem signature:
Problem Event Name: APPCRASH
Application Name: AVSM.exe
Application Version: 0.0.0.0
Application Timestamp: 4c4bb4a6
Fault Module Name: nvwgf2um.dll
Fault Module Version: 8.17.12.5721
Fault Module Timestamp: 4c0d6d9b
Exception Code: c0000005
Exception Offset: 0030d49a
OS Version: 6.1.7600.2.0.0.768.3
Locale ID: 1033
Additional Information 1: 0a9e
Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
Additional Information 3: 0a9e
Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

Cheers for posting :)
 
Had to download [..]
Yup, it mentions those in the readme :) Glad you figured it out, and thanks for linking them directly in the thread (note that AVSM may require the VC2008 SP1 runtime files if you don't already have them - many games install them). FWIW all of the project files support native 64-bit builds but we just bundled 32-bit executables for client compatibility.

Those shadows on sdsm are really pretty. I wonder if there's a way to get AA on the non-shadow polygons too (without CP overrides that is).
Yeah also mentioned in the readme... it uses deferred shading so MSAA takes a bit of work (assuming you mean just normal polygon MSAA). However it is quite possible as the specific deferred shading demo shows :) I just haven't ported that to the SDSM app yet.

Also, I noticed weird dark spots over basic textures using Nvidia 3D glasses.
Yeah no idea what the 3D stereo stuff is messing with in the app and no way for me to test ;)

And sadly avsm.exe crashes for me:
Also noted in the readme: it crashes the shader compiler on current NVIDIA cards (works on ATI). They know about it though and I imagine will have it fixed soon.
 
Cool. Hope the course materials go up soon as well. I was off at the OSL presentation, so I missed your part of the course and only got in around the time to see the tail end of Marco's bit. I was kind of saddened by the Uncharted piece -- not so much because it was a bad talk, but that he painted such a depressing picture of what he had to work with.
 
Very interesting stuff. Haven't had the time to look deeper into it yet, but the demos look nice.
 
Cool. Hope the course materials go up soon as well. I was off at the OSL presentation, so I missed your part of the course and only got in around the time to see the tail end of Marco's bit. I was kind of saddened by the Uncharted piece -- not so much because it was a bad talk, but that he painted such a depressing picture of what he had to work with.

The stuff is already up.
 
AVSM and SDSM presentations from the "Advances in Real-Time Rendering" course will be available at the end of this week.
 
In the SDSM presentation, slide 24, the histogram algorithm performs vastly slower on GTX480 than on HD 5870, 7.2ms versus 1.4ms. What the hell?

This is only a shared memory atomic operation, isn't it? Not a global atomic? (Notes on that slide imply it is shared.)

One of the peculiar things about the atomics in OpenCL (1.0 and 1.1) is that they always return a value, whereas the D3D11 atomics make a return value optional. (I presume the SDSM algorithm ignores the return value.)

http://msdn.microsoft.com/en-us/library/ff471406(v=VS.85).aspx
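
For illustration, a minimal HLSL (Shader Model 5) sketch of the two forms - the names are mine, not code from the demos:

// Sketch only: contrasting the no-return and return-value forms of InterlockedAdd.
RWStructuredBuffer<uint> gOutput : register(u0);

groupshared uint sCounter;

[numthreads(64, 1, 1)]
void CSMain(uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
        sCounter = 0;
    GroupMemoryBarrierWithGroupSync();

    // No return value requested: the old value never has to come back to the lane.
    InterlockedAdd(sCounter, 1u);

    // Optional third argument: the pre-add value is returned, so the lane must wait for it.
    uint originalValue;
    InterlockedAdd(sCounter, 1u, originalValue);

    GroupMemoryBarrierWithGroupSync();
    if (groupIndex == 0)
        gOutput[0] = sCounter;
}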

Why not offer both in OpenCL?

Do NVidia atomics always return a value in hardware? Is that the motivation for the OpenCL spec making the return value non-optional?

Is the performance problem really due to atomics on NVidia?
 
In the SDSM presentation, slide 24, the histogram algorithm performs vastly slower on GTX480 than on HD 5870, 7.2ms versus 1.4ms. What the hell?
Yeah that was my reaction as well :) Incidentally this is what particularly motivated writing the fast reduction path.

This is only a shared memory atomic operation, isn't it? Not a global atomic?
Correct. These are very fast on ATI (almost full-speed if they don't conflict!) but apparently less-so on NVIDIA. Indeed they are also the atomics that do not return values so no latency hiding is needed and they can theoretically be fully pipelined.

Why not offer both in OpenCL?
Good question, unless they assume that compilers will detect when the return value is unused.

Is the performance problem really due to atomics on NVidia?
I can only assume so as that shader isn't doing much else. Check out the source code if you want the details but it is literally just a pile of local atomics followed by a few global atomics (these turn out to be faster than writing out all the local histograms and then reducing them). I played with removing the global atomics on NVIDIA but it made very little performance difference... it appears to be the local atomics in particular that are much slower than on ATI.
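
For anyone who doesn't want to dig through the source, the structure is roughly the following sketch (simplified to one counter per bin; the names and bin count are illustrative, not the demo's actual code):

// Sketch: per-group (local) atomics into a groupshared histogram,
// then a few global atomics per group to merge the results.
#define NUM_BINS 64   // illustrative bin count

Texture2D<float> gDepth : register(t0);
RWStructuredBuffer<uint> gHistogram : register(u0);   // NUM_BINS entries

groupshared uint sLocalHistogram[NUM_BINS];

[numthreads(16, 16, 1)]
void HistogramCS(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    if (gi < NUM_BINS)
        sLocalHistogram[gi] = 0;
    GroupMemoryBarrierWithGroupSync();

    // Quantize depth to a bin and bump it with a local atomic (no return value needed).
    // Screen bounds checks omitted for brevity.
    float z = gDepth[dtid.xy];
    uint bin = min(uint(z * NUM_BINS), NUM_BINS - 1u);
    InterlockedAdd(sLocalHistogram[bin], 1u);
    GroupMemoryBarrierWithGroupSync();

    // Merge into the global histogram: NUM_BINS global atomics per group.
    if (gi < NUM_BINS)
        InterlockedAdd(gHistogram[gi], sLocalHistogram[gi]);
}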

You can imagine that this and a number of other issues that are now exposed by these fairly low-level compute models complicate writing performance-portable code a lot...
 
Yeah that was my reaction as well :) Incidentally this is what particularly motivated writing the fast reduction path.
Pretty lucky with how it worked out in the end...

Correct. These are very fast on ATI (almost full-speed if they don't conflict!) but apparently less-so on NVIDIA. Indeed they are also the atomics that do not return values so no latency hiding is needed and they can theoretically be fully pipelined.
I guess NVidia is doing local atomics as math in the ALUs, which would have the same speed for return/no-return value.

What I don't get is why there's such a severe slowdown when shared memory banking should mean that non-colliding addresses are often full speed (purely a question of bank conflicts - though Fermi architecture is more susceptible to bank conflicts than GT200).

Good question unless they assume that the compilers will detect when the return value is unused.
Ah yes, that would work.

I can only assume so as that shader isn't doing much else. Check out the source code if you want the details but it is literally just a pile of local atomics followed by a few global atomics (these turn out to be faster than writing out all the local histograms and then reducing them). I played with removing the global atomics on NVIDIA but it made very little performance difference... it appears to be the local atomics in particular that are much slower than on ATI.
Back here:

http://forum.beyond3d.com/showthread.php?p=1305337#post1305337

when we had quite a bit of fun with the treacherous "Dickens word count" histogram, you were alluding to non-random distributions of data causing severe problems for GPU algorithms that depend on some form of software-managed cache.

I suppose this is that application: depth is generally not random at all. I suppose your algorithm lops bits of precision off Z so that there aren't too many bins, resulting in a huge collision rate, and disastrous slow down on NVidia.

So this poor atomic performance might be something of a corner case in that respect. Overall, atomics are still a little too new on GPUs to really know...

You can imagine that this and a number of other issues that are now exposed by these fairly low-level compute models complicate writing performance-portable code a lot...
Yeah it's a great example. Actually, it's the kind of thing that could persuade developers to ignore atomics entirely, which seems like quite a shame.

Jawed
 
Pretty lucky with how it worked out in the end...
Yup although it was really the combination of that and finding the robustness issues with the more complex schemes (that require the full histogram) that made the reduce path the clear winner.

I guess NVidia is doing local atomics as math in the ALUs, which would have the same speed for return/no-return value.
They can't be doing lane collision analysis in the ALUs, can they? That would be crazy-expensive! Alternatively are you implying that they can somehow "lock" memory banks over multi-instruction sequences and then loop in the core to serialize conflicts?

I suppose this is that application: depth is generally not random at all. I suppose your algorithm lops bits of precision off Z so that there aren't too many bins, resulting in a huge collision rate, and disastrous slow down on NVidia.
Yes, precisely. In this case there's two problems:
1) Collisions. They are super-common with this data set. It's a big win if you can lane-swizzle in your SIMD and do some cheap address comparisons to handle the majority of these cases and serialize the rest as appropriate. Unfortunately none of the current programming models have a lane swizzling mechanism because they all try to pretend it's not SIMD ;)
2) Bin distribution for a spatial region. Each core is handling a tile of the screen and in that tile (or even the whole distribution sometimes) only a small number of bins are touched. Statically allocating local memory for the whole local histogram here and incidentally reducing the number of wavefronts in flight due to this is just a bad model... you want a proper cache. I would implement this by relying on Fermi's L1$ but I don't believe their global atomics are fully cached... i.e. just the switch from local to global atomics would probably slow it down a ton, even if there's no data-sharing between cores at all.

Note that this was disastrously worse when we had to use vertex scatter + ROPs back in the pre-DX11 days... collisions in that pipeline were ridiculously expensive since they amounted to global atomics even with some clever use of multiple accumulation bins + reductions. Local atomics are the only thing that make this sort of algorithm feasible at all on GPUs but we still have a ways to go in the hardware.

So this poor atomic performance might be something of a corner case in that respect. Overall, atomics are still a little too new on GPUs to really know...
Yup definitely possible as it is a fairly special data set, but not an abnormal one for graphics ;)

Yeah it's a great example. Actually, it's the kind of thing that could persuade developers to ignore atomics entirely, which seems like quite a shame.
Yup we'll see. I would obviously prefer that people rise to the bar that ATI has set though :)
 
If your programming model doesn't explicitly expose horizontal ops and you have to do interlane/interthread communication vertically, your hardware and/or compiler had better be very good at local atomics.
 
They can't be doing lane collision analysis in the ALUs, can they? That would be crazy-expensive! Alternatively are you implying that they can somehow "lock" memory banks over multi-instruction sequences and then loop in the core to serialize conflicts?
Well, it is slow...

I guess someone could analyse this stuff in detail. Different patterns exploiting either address conflicts or bank conflicts, in varying counts per hardware thread, would enable a decent characterisation.

Yes, precisely. In this case there's two problems:
1) Collisions. They are super-common with this data set. It's a big win if you can lane-swizzle in your SIMD and do some cheap address comparisons to handle the majority of these cases and serialize the rest as appropriate. Unfortunately none of the current programming models have a lane swizzling mechanism because they all try to pretend it's not SIMD ;)
I don't think that's strictly true. Shared memory enables lane-swizzling, though obviously with latency.

2) Bin distribution for a spatial region. Each core is handling a tile of the screen and in that tile (or even the whole distribution sometimes) only a small number of bins are touched. Statically allocating local memory for the whole local histogram here and incidentally reducing the number of wavefronts in flight due to this is just a bad model... you want a proper cache. I would implement this by relying on Fermi's L1$ but I don't believe their global atomics are fully cached... i.e. just the switch from local to global atomics would probably slow it down a ton, even if there's no data-sharing between cores at all.
GPU matrix multiplication is a brilliant example of a very harsh lesson: it looks insanely, almost-trivially, parallelisable but it's taken years for optimal algorithms to appear on NVidia and ATI - and the algorithms are quite different, too.

I'm not saying there's a way to make atomics vastly faster on GF100 for this kind of data/bin configuration - merely that it's early days.

Before, we saw that there was a huge speed-up on x86 with the Dickens histogram by carefully fine-tuning for the cache architecture. Having a "proper cache" didn't obviate this optimisation at all...

Yup definitely possible as it is a fairly special data set, but not an abnormal one for graphics ;)
Yeah. It does seem to me that Microsoft (or whoever) had its head screwed on with the optional return value thing.
 
I don't think that's strictly true. Shared memory enables lane-swizzling, though obviously with latency.
Not only latency but it requires a barrier which can't be used inside "potentially divergent" control flow. It's not the same.

GPU matrix multiplication is a brilliant example of a very harsh lesson: it looks insanely, almost-trivially, parallelisable but it's taken years for optimal algorithms to appear on NVidia and ATI - and the algorithms are quite different, too.
Sure, but this points to the fact that the value of a portable abstraction layer (which carries penalties) is diminishing rapidly if you need to use different algorithms on different hardware. This makes it all-the-more stupid that our languages are handcuffing our ability to write optimized code based on so-called "portability".

On a separate note, I'd be really happy if someone could come up with a better algorithm to solve the SDSM histogram problem. It's not like I just tried the one thing... I tried the full gamut of different ways to do it and all were as slow or slower. Hopefully a faster algorithm will "emerge", but that's not doing me a lot of good today :)

I'm not saying there's a way to make atomics vastly faster on GF100 for this kind of data/bin configuration - merely that it's early days.
Sure, and it wouldn't be a problem if ATI hadn't already demonstrated that it can be done very well :) It's all relative.

Before, we saw that there was a huge speed-up on x86 with the Dickens histogram by carefully fine-tuning for the cache architecture. Having a "proper cache" didn't obviate this optimisation at all...
Not sure what you're trying to say exactly... yes obviously caches need tuning just like everything else but the static allocation/partitioning of resources model that GPUs use right now is simply not flexible enough for cases like this that have a small but *data-dependent* working set.
 
Not only latency but it requires a barrier which can't be used inside "potentially divergent" control flow. It's not the same.
Not all hardware has that programming restriction. Yet another dislocation between the reality of hardware and the language. (And if you get stuck in an infinite loop it's your own fault!)

Sure, but this points to the fact that the value of a portable abstraction layer (which carries penalties) is diminishing rapidly if you need to use different algorithms on different hardware. This makes it all-the-more stupid that our languages are handcuffing our ability to write optimized code based on so-called "portability".


I think there are two conflicting things going on here:
  1. Learning by IHVs and API bodies - what works, what do developers want, and when does the transistor budget allow it at a useful performance level
  2. Abstraction - to stand a chance of using it
One of the advantages of the current abstractions is that they aren't imposing a low-level schema that we'll be stuck with for ever after. Currently we're not witnessing a build up akin to x86 cruft in the hardware.

The price for these exciting times (accompanied by fame for originating the cool stuff) is a hell of a lot of bootstrapping. SDSM looks fuck-off cool, by the way.

On a separate note, I'd be really happy if someone could come up with a better algorithm to solve the SDSM histogram problem. It's not like I just tried the one thing... I tried the full gamut of different ways to do it and all were as slow or slower. Hopefully a faster algorithm will "emerge", but that's not doing me a lot of good today :)
If one does turn up it'll be annoyingly obvious looking :p

Sure, and it wouldn't be a problem if ATI hadn't already demonstrated that it can be done very well :) It's all relative.
If it turns out NVidia is not using the ALUs to do this, then I have to say I've not got the foggiest why it's so slow. Shared memory has the bank count and bandwidth in raw terms to be in the same ball park as ATI. That's what's so disturbing.

Not sure what you're trying to say exactly... yes obviously caches need tuning just like everything else but the static allocation/partitioning of resources model that GPUs use right now is simply not flexible enough for cases like this that have a small but *data-dependent* working set.
Well at least NVidia has taken a step in this direction with L1$. Global atomics are nearly useless in GT200, but the ball is rolling. etc.

Do you have some other histogram-based algorithm that's really slow on ATI because it's a poor fit for the architecture? You must be working on something else... Guess you can't say.
 
Not all hardware has that programming restriction. Yet another dislocation between the reality of hardware and the language. (And if you get stuck in an infinite loop it's your own fault!)
Sure, but that's my point: it's precisely due to the languages trying to "abstract" over SIMD widths. Obviously if you had a purely MIMD machine then lane swizzling *wouldn't* be free or even make much sense. However the reality is we can probably put a reasonably minimal bound on the minimum SIMD widths of these GPUs... 4 is practically guaranteed and even 16 is probably reasonable.

Similar programming model problems arise when trying to implement persistent threads in current APIs.

One of the advantages of the current abstractions is that they aren't imposing a low-level schema that we'll be stuck with for ever after. Currently we're not witnessing a build up akin to x86 cruft in the hardware.
Debatable... we're already seeing legacy problems with code compiled to static shared memory sizes. Equivalently the most efficient code chooses block sizes related to the number of cores on the GPU. These are significant problems going forward.

The price for these exciting times (accompanied by fame for originating the cool stuff) is a hell of a lot of bootstrapping. SDSM looks fuck-off cool, by the way.
Glad you like it :) To be clear, I'm not saying that the GPU models are a failure or anything - obviously this stuff would only be possible with them! I'm more commenting on the things that keep coming up and appear to be problematic moving forward... the things I still think we should be focusing our innovation and research efforts on improving. I definitely do not think the current GPU computing models are near an end state and I think most people would agree with that.

If one does turn up it'll be annoyingly obvious looking :p
I sure hope so! In my presentation I unsubtly hinted to the audience to find me a faster implementation so maybe with all those smart heads there and a touch of motivation/competition we'll get something :)

Well at least NVidia has taken a step in this direction with L1$. Global atomics are nearly useless in GT200, but the ball is rolling. etc.
Agreed. I think caches are going to be absolutely necessary going forward and I imagine they will be better-integrated with atomics in the future as well. It's certainly quite possible to do well in hardware.

Do you have some other histogram-based algorithm that's really slow on ATI because it's a poor fit for the architecture?
Not really doing any other histogram-related stuff right now... I didn't mean to imply that I was. I've been doing mostly deferred shading stuff lately (as per the other demo) which brings out its own interesting hardware and software puzzles :) Check my presentation and demo above for the initial batch, although there's nothing quite as drastic as the 4x perf difference in the SDSM histogram path (although frankly I do find the MSAA scheduling results to be pretty interesting in terms of programming models and hardware scheduling moving forward).
 
Sure, but that's my point: it's precisely due to the languages trying to "abstract" over SIMD widths. Obviously if you had a purely MIMD machine then lane swizzling *wouldn't* be free or even make much sense. However the reality is we can probably put a reasonably minimal bound on the minimum SIMD widths of these GPUs... 4 is practically guaranteed and even 16 is probably reasonable.
LRBni doesn't provide arbitrary lane swizzle, does it? Swizzles are restricted to quad-lane neighbourhoods. Scatter/gather via a cache line is required.

As for SIMD width, 16 looks like a safe minimum for a few years yet.

Similar programming model problems arise when trying to implement persistent threads in current APIs.
I dare say that seems likely to be an explicit feature of D3D12. Fingers-crossed.

Debatable... we're already seeing legacy problems with code compiled to static shared memory sizes. Equivalently the most efficient code chooses block sizes related to the number of cores on the GPU. These are significant problems going forward.
I was referring to cruft in the hardware implementation.

Cruft in the software is an on-going problem. The size of shared memory is just one variable out of many that results in "over-optimisation" for today's hardware. Cache-line size, L1$ size, SIMD width, register file size, DDR burst length and DDR banking are some others.

NVidia, with CUDA, has attempted to obfuscate some parameters to prevent "over-optimisation". Guaranteeing warp size of 32 for the foreseeable future is good, though shared memory in Fermi (or at least GF100) has new bank conflict issues.

Even when hardware parameters are obfuscated developers are liable to dig, resulting in potential "over-optimisation".

Glad you like it :) To be clear, I'm not saying that the GPU models are a failure or anything - obviously this stuff would only be possible with them! I'm more commenting on the things that keep coming up and appear to be problematic moving forward... the things I still think we should be focusing our innovation and research efforts on improving. I definitely do not think the current GPU computing models are near an end state and I think most people would agree with that.
I dare say game developers are mostly still trying to catch up with D3D10(.1) and so their input on what's needed for D3D12 is limited. Obviously there are people at the cutting edge and maybe it's best that there's only a few of you stirring the pot.

I sure hope so! In my presentation I unsubtly hinted to the audience to find me a faster implementation so maybe with all those smart heads there and a touch of motivation/competition we'll get something :)


I still haven't looked at your code, but these are three generic ideas:
  1. hash by SIMD lane - this is similar to the SR (globally shared register) technique that I referred to before on ATI.

    e.g. with a maximum of 2048 bins and shared memory capacity of 32KB, you can hash by a factor of 16. The obvious key is (absolute work item ID & 15). So you can do (ZCoarse << 4) + (WorkItemID & 15) to generate a collision-free atomic address (see the sketch after this list).

    Obviously with more bins you'd hash by a smaller factor and so would suffer some collisions.

  2. tile by work item - generally you should be fetching multiple samples per work item in a coherent tile, between atomics.

    There are two benefits here. First the hardware prefers to fetch coherently (e.g. gather4) and ATI likes to fetch wodges (i.e. 16x gather4) and has the register file capacity to do so. Second by tiling like this you automatically serialise ZBuckets that, being neighbours in 2D, are likely to collide.

    e.g. each work item fetches an 8x8 tile.

  3. scatter work items - to improve the serialisation of ZCoarse (if 1 and 2 don't eliminate collisions), you can make each work item discontiguously sample, reducing the chances of collisions amongst neighbouring work items.

    e.g. if you are doing 8x8 tiles per work item from technique 2. then work item 1 is offset from work item 0 along the diagonal, starting at 8,8, work item 2 starting at 16,16 etc.
Undoubtedly the different cards will want different tunings for techniques 2 and 3. 2 and 3 need to be balanced against each other, with 2 improving cache coherency while 3 spoils it.
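
Here's the sort of thing I have in mind for technique 1 - just a rough sketch with made-up names, a made-up bin count, and single-counter bins:

// Sketch of idea 1: spread each bin over 16 lane-keyed slots so neighbouring
// work items that land in the same ZCoarse bin hit different addresses.
#define NUM_BINS   128   // assumed bin count, small enough for a 16x spread in 32KB
#define HASH_WIDTH 16

Texture2D<float> gDepth : register(t0);
RWStructuredBuffer<uint> gHistogram : register(u0);   // NUM_BINS entries

groupshared uint sHashed[NUM_BINS * HASH_WIDTH];      // 8KB at these sizes

[numthreads(256, 1, 1)]
void HashedHistogramCS(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    for (uint i = gi; i < NUM_BINS * HASH_WIDTH; i += 256)
        sHashed[i] = 0;
    GroupMemoryBarrierWithGroupSync();

    // (ZCoarse << 4) + (WorkItemID & 15): collision-free as long as lanes map to
    // consecutive work item IDs. Bounds checks omitted.
    float z = gDepth[dtid.xy];   // dispatch as (width/256, height, 1) groups
    uint zCoarse = min(uint(z * NUM_BINS), NUM_BINS - 1u);
    uint slot = (zCoarse << 4) + (gi & (HASH_WIDTH - 1));
    InterlockedAdd(sHashed[slot], 1u);
    GroupMemoryBarrierWithGroupSync();

    // Sum each bin's 16 slots back into a single count, then merge globally.
    if (gi < NUM_BINS)
    {
        uint sum = 0;
        [unroll]
        for (uint j = 0; j < HASH_WIDTH; ++j)
            sum += sHashed[(gi << 4) + j];
        InterlockedAdd(gHistogram[gi], sum);
    }
}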

Not really doing any other histogram-related stuff right now... I didn't mean to imply that I was. I've been doing mostly deferred shading stuff lately (as per the other demo) which brings out its own interesting hardware and software puzzles :) Check my presentation and demo above for the initial batch, although there's nothing quite as drastic as the 4x perf difference in the SDSM histogram path (although frankly I do find the MSAA scheduling results to be pretty interesting in terms of programming models and hardware scheduling moving forward).
I've only had a quick look at that...
 
LRBni doesn't provide arbitrary lane swizzle, does it? Swizzles are restricted to quad-lane neighbourhoods.
Swizzles to the quads are free, but you can shuffle the 128-bit blocks using an instruction. That's all you need for horizontal reductions/address comparisons. Furthermore even if it did not, any swizzle neighbourhood (even 4) is better than none.

I dare say that seems likely to be an explicit feature of D3D12. Fingers-crossed.
I'm not as convinced that there's quite enough buy-in on this yet, but maybe OpenCL's first pass at a task system will motivate some innovation in that space.

I was referring to cruft in the hardware implementation.
Sure, although it's worth noting that increasing cache sizes behaves better with legacy code than scratch pad memory (which just goes unused).

Even when hardware parameters are obfuscated developers are liable to dig, resulting in potential "over-optimisation".
I can buy the "over-optimization" argument on CPUs where you're talking about single-digit % increases in a lot of cases (sometimes more, but on balance) but on GPUs you're often talking about an order of magnitude... that's too much performance to leave on the floor for "portability".

e.g. with a maximum of 2048 bins and shared memory capacity of 32KB, you can hash by a factor of 16. The obvious key is (absolute work item ID & 15). So you can do (ZCoarse << 4) + (WorkItemID & 15) to generate a collision-free atomic.
Right but the bins aren't just sums - they are 7 32-bit values each! There's not enough shared memory to amplify them even 2x at the moment. You might be able to pack a few things into half-floats and get a 2x spread but you're definitely not going to get to one per SIMD lane!

Of course the large majority of these bins don't get touched for a given tile so if you only had a cache... ;)

tile by work item - generally you should be fetching multiple samples per work item in a coherent tile, between atomics.
Yes mostly already done. Tile sizes are decoupled from the compute domain and chosen so that there are ~ as many tiles as required to just fill the GPU. This is important to minimize global writes/atomics traffic at the end of each work group.
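
In sketch form (names and numbers are mine, not the actual demo code), the decoupling just means each group walks a screen tile larger than the group itself:

// Sketch: a 16x16 group strides over a larger tile chosen per-GPU, so the number
// of groups (and hence the global merges at the end) stays roughly constant.
#define GROUP_DIM 16
#define NUM_BINS  64   // illustrative, single-counter bins

cbuffer HistogramConstants : register(b0)
{
    uint2 gTileSize;     // pixels per tile, independent of GROUP_DIM
    uint2 gScreenSize;
};

Texture2D<float> gDepth : register(t0);
RWStructuredBuffer<uint> gHistogram : register(u0);

groupshared uint sHist[NUM_BINS];

[numthreads(GROUP_DIM, GROUP_DIM, 1)]
void TiledHistogramCS(uint3 groupId : SV_GroupID, uint3 gtid : SV_GroupThreadID,
                      uint gi : SV_GroupIndex)
{
    if (gi < NUM_BINS) sHist[gi] = 0;
    GroupMemoryBarrierWithGroupSync();

    // Each thread touches several pixels of the tile between local atomics.
    uint2 tileOrigin = groupId.xy * gTileSize;
    for (uint y = gtid.y; y < gTileSize.y; y += GROUP_DIM)
        for (uint x = gtid.x; x < gTileSize.x; x += GROUP_DIM)
        {
            uint2 p = tileOrigin + uint2(x, y);
            if (all(p < gScreenSize))
            {
                float z = gDepth[p];
                InterlockedAdd(sHist[min(uint(z * NUM_BINS), NUM_BINS - 1u)], 1u);
            }
        }
    GroupMemoryBarrierWithGroupSync();

    // One batch of global atomics per group, regardless of tile size.
    if (gi < NUM_BINS)
        InterlockedAdd(gHistogram[gi], sHist[gi]);
}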

I played with strided vs "linear" lookups across the thread group but the latter were generally faster. If NVIDIA's coalescing logic remains the same then the latter will definitely be faster. I haven't played with using gather4 explicitly though... it's quite an annoying programming model - if they want the access like that they have free rein to reorganize the work items in the group.

The lookup from the Z-buffer is *definitely* not the bottleneck though, so I'm hesitant to optimize this much more.

scatter work items - to improve the serialisation of ZCoarse (if 1 and 2 don't eliminate collisions), you can make each work item discontiguously sample, reducing the chances of collisions amongst neighbouring work items.
Yup I played with more random sampling to reduce collisions and it does work but with one huge problem: you can't have a massive performance falloff in the case where EVERY pixel on the screen collides. Put another way, worst case performance is what matters, not making the easier cases faster. This is a key point for game developers and one that I've heard often. Thus I haven't put a lot of effort into making the fast cases faster :)
 