Three new DirectX 11 demos

Swizzles to the quads are free, but you can shuffle the 128-bit blocks using an instruction. That's all you need for horizontal reductions/address comparisons. Furthermore even if it did not, any swizzle neighbourhood (even 4) is better than none.
Sigh I totally forgot about shuffle - I remember being excited about that, once :LOL:

I'm not as convinced that there's quite enough buy-in on this yet, but maybe OpenCL's first pass at a task system will motivate some innovation in that space.
Well both GPUs support concurrent compute kernels now (still pretty fuzzy on what that really means on ATI, though) so if it isn't in 12, well there's always LRB...

Sure, although it's worth noting that increasing cache sizes behaves better with legacy code than scratch pad memory (which just goes unused).
I agree generally.

I can buy the "over-optimization" argument on CPUs where you're talking about single-digit % increases in a lot of cases (sometimes more, but on balance) but on GPUs you're often talking about an order of magnitude... that's too much performance to leave on the floor for "portability".
The learning curve is still too steep even for the hardware designers (local atomics and tessellation being great examples, currently), so I think this needs repeated revisiting. (Then there's the physics of what's buildable, which I think is why tessellation is poor in ATI currently - not sure what the deal is with local atomics in NVidia.)

Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.

So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.

Right but the bins aren't just sums - they are 7 32-bit values each!
:p Well my posting this morning was the summation of my thoughts as I fell asleep last night before having a proper look at your code.

There's not enough shared memory to amplify them even 2x at the moment. You might be able to pack a few things into half-floats and get a 2x spread but you're definitely not going to get to one per SIMD lane!
You reported about 1.4ms to compute the histogram on ATI, as I understand it (I've no idea what kind of overhead that has...). That's about 380 million cycles. For about 2 million samples? (Or is this 4xMSAA, 8 million samples?)

So around 200 cycles (or 50 for 4xMSAA?) per sample?

The inner loop is 31 cycles according to GPU Shader Analyzer.

Yes mostly already done. Tile sizes are decoupled from the compute domain and chosen so that there are ~ as many tiles as required to just fill the GPU. This is important to minimize global writes/atomics traffic at the end of each work group.
Yeah, I was assuming a "persistent" kernel.

I played with strided vs "linear" lookups across the thread group but the latter were generally faster. If NVIDIA's coalescing logic remains the same then the latter will definitely be faster. I haven't played with using gather4 explicitly though... it's quite an annoying programming model - if they want the access like that they have free rein to reorganize the work items in the group.
As I haven't done this stuff for real I don't understand the issue here. But I have a feeling I'm seeing coherent fetches where there aren't any :???:

The lookup from the Z-buffer is *definitely* not the bottleneck though so I'm hesitant to try and optimize this much more.
I made that point partly to defend against the cache thrashing that technique 3 engenders. On ATI clause switches cost significant cycles, so a single clause of 16 TEX is preferable to 16 clauses of 1 TEX - the latency of the latter and the increase in cache thrashing it directly causes would hurt 3.

Yup I played with more random sampling to reduce collisions and it does work but with one huge problem: you can't have a massive performance falloff in the case where EVERY pixel on the screen collides. Put another way, worst case performance is what matters, not making the easier cases faster. This is a key point for game developers and one that I've heard often. Thus I haven't put a lot of effort into making the fast cases faster :)
Does the entire histogram need to be generated each frame?

As the camera translates the histogram is, in general, shifting coherently, isn't it?
 
Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.

If you are sharing active data between hundreds and thousands of cores, it doesn't matter if you are coherent or not. Which is what all the "coherency is bad" people apparently don't get. What coherency buys you is significantly increased flexibility and dynamic scalability. The actual cost of the coherency, when done correctly, is fairly low in the actual hardware. In a lot of ways it's a software/hardware trade-off that has been going on for decades. Every single time, coherency has won, simply because it is more flexible and thus requires orders of magnitude less software complexity as you scale.

So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.

It means you don't have to track every bit of data and can instead optimize only for the data that has a performance impact.
 
Sigh I totally forgot about shuffle - I remember being excited about that, once :LOL:
Yeah it's nice although packstore is still the niftiest instruction IMHO :)

Well both GPUs support concurrent compute kernels now (still pretty fuzzy on what that really means on ATI, though) so if it isn't in 12, well there's always LRB...
Right although "concurrent" doesn't necessarily mean "can be launched by the GPU" or "guaranteed parallel execution" or anything like that. It's unclear how restricted the GPU schedulers are here.

Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.
I hear that vocal section but TBH they're mostly software people making broad claims about hardware engineering based on generalized logic at best (of course this doesn't apply to everyone, but I've met a lot of these people :)). When I ask the hardware people though they say that it scales fine and I'm inclined to trust them more...

I'll accept some argument over requiring full coherence but I really don't see how we can get away without proper caches in the future. (As an aside, I'm also told that once you have a cache, coherence isn't very hard or expensive and scales fine too...) Unless people come up with something equivalently clever it will relegate GPUs to operate only on the fairly small subset of regular problems that they can already handle now. I think Fermi made a definite step in the right direction on this one and I'm excited to see what comes next.

So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.
What do you mean by "local cache"? Fermi's L1$ counts IMHO, and that's all I'm really asking for. If they get it fully pipelined with atomics and the like then we're 90% there.

For about 2 million samples? (Or is this 4xMSAA, 8 million samples?)
No MSAA at the moment so yeah, ~2 million samples.

As I haven't done this stuff for real I don't understand the issue here. But I have a feeling I'm seeing coherent fetches where there aren't any :???:
The "gather" fetch in the shading languages is sort of like bilinear in that you give it a float texture coordinate and get back the four surrounding elements. Not only does this mean having to do a silly int->float in this case but it also messes with the typical parallel domain since now each work item is handling those 4 elements in its kernel. Not a huge deal and definitely possible to do but not a trivial change to the code either.

I also don't get why they don't just rearrange the compute domain into gather4-style blocks if their hardware benefits from this... DirectCompute leaves that completely up in the air and implementation-defined. I must admit to not completely understanding the fast path here on ATI and how the gather4 stuff relates to linear memory loads.

Does the entire histogram need to be generated each frame?
For complete correctness, yes. It's like occlusion culling... even a minor shift from one frame to the next could reveal a new object (or occlude an old one). You can start to play games but you immediately lose the guarantee that every screen-space sample will have a shadow map sample. How important this is depends on stuff like the frequency of geometry in the frame and the speed of movement of the camera and objects.
 
I'll accept some argument over requiring full coherence but I really don't see how we can get away without proper caches in the future. (As an aside, I'm also told that once you have a cache, coherence isn't very hard or expensive and scales fine too...) Unless people come up with something equivalently clever it will relegate GPUs to operate only on the fairly small subset of regular problems that they can already handle now. I think Fermi made a definite step in the right direction on this one and I'm excited to see what comes next.

What do you mean by "local cache"? Fermi's L1$ counts IMHO, and that's all I'm really asking for. If they get it fully pipelined with atomics and the like then we're 90% there.
Yes, I think Fermi L1$ fits the bill.

No MSAA at the moment so yeah, ~2 million samples.
Well, that's pretty disturbing then, because on ATI histogram generation is running at ~1/6th throughput: theoretical 31 cycles for the inner loop takes ~200 cycles.

The "gather" fetch in the shading languages is sort of like bilinear in that you give it a float texture coordinate and get back the four surrounding elements. Not only does this mean having to do a silly int->float in this case but it also messes with the typical parallel domain since now each work item is handling those 4 elements in its kernel. Not a huge deal and definitely possible to do but not a trivial change to the code either.
I have to admit after seeing the projection/unprojection stuff in the code I'm lost. I think you are sampling raw depth, so should be able to sample contiguous areas from depth coherently (and fast), but...

I also don't get why they don't just rearrange the compute domain into gather4-style blocks if their hardware benefits from this...
The mapping from compute domain to data should be totally under the programmer's control, ultimately because it's a balancing act. e.g. it can be optimal to do 16 gather4s or only 1. That decision is the programmer's. On NVidia the optimisation profile is different...

DirectCompute leaves that completely up in the air and implementation-defined. I must admit to not completely understanding the fast path here on ATI and how the gather4 stuff relates to linear memory loads.
Every fetch of a mere 32 bits in your code results in the 128-bit pipe from L1 to registers being 25% utilised (per core it's a 512-bit pipe). You've only got ~1TB/s of that bandwidth in Cypress (54GB/s per core).

In the end ALU:TEX is so high that Z fetches shouldn't be in the picture.

What is the worst-case with 100% collision rate?
 
Well, that's pretty disturbing then, because on ATI histogram generation is running at ~1/6th throughput: theoretical 31 cycles for the inner loop takes ~200 cycles.
But you're assuming no collisions there right? There are *lots* of collisions! These need to be serialized across the SIMD lanes. For the 7 scatters and assuming 1 cycle/colliding SIMD lane at best...

I think you are sampling raw depth, so should be able to sample contiguous areas from depth coherently (and fast), but...
Yes it's pure 1:1 streaming. Is there a faster way than to just write the obvious code that maps work items to pixels and loops over chunks of the screen? All the code I've seen uses this style and I've never seen explicit "gather4"s for performance or anything.

The mapping from compute domain to data should be totally under the programmer's control, ultimately because it's a balancing act. e.g. it can be optimal to do 16 gather4s or only 1. That decision is the programmer's. On NVidia the optimisation profile is different...
Right but in all the current APIs it is implementation defined. NVIDIA "sort of" defines it as row major and I believe ATI tends to use the same convention but none of the APIs guarantee it. The best you can do is just use a 1-D domain, assume the SIMD packing is linear in this domain and do the addressing math yourself.
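In sketch form, something like this (invented names, not the actual demo code):

Code:
// Sketch of the 1-D domain approach: dispatch a flat grid and derive the 2-D pixel
// address yourself, so that the assumed linear SIMD packing maps consecutive lanes
// to consecutive pixels. gDepth/gDimensions are placeholder names.
#define GROUP_SIZE 256

Texture2D<float> gDepth : register(t0);
cbuffer HistogramConstants : register(b0)
{
    uint2 gDimensions;   // depth buffer width/height
};

[numthreads(GROUP_SIZE, 1, 1)]
void HistogramCS(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    uint linearIndex = dispatchThreadId.x;                 // flattened 1-D index
    uint2 coords = uint2(linearIndex % gDimensions.x,      // row-major mapping done
                         linearIndex / gDimensions.x);     // explicitly in the shader
    if (coords.y >= gDimensions.y)
        return;

    float z = gDepth.Load(int3(coords, 0));                // adjacent lanes, adjacent pixels
    // ... bin the sample as usual ...
}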

Every fetch of a mere 32 bits in your code results in the 128-bit pipe from L1 to registers being 25% utilised (per core it's a 512-bit pipe). You've only got ~1TB/s of that bandwidth in Cypress (54GB/s per core).
Right but how does this interact with the work-item packing? Can it not collect four of those fetches from adjacent work items and do a 128-bit fetch similar to NVIDIA's coalescing?

What is the worst-case with 100% collision rate?
It doesn't get much slower, which is why this implementation is good. You can test yourself by just moving the camera and walking up to a wall :) That case absolutely destroys the vertex scatter implementations but this one remains pretty fast due to the local atomics. Of course the SIMD lanes still get serialized but there is really no avoiding that.
 
But you're assuming no collisions there right? There are *lots* of collisions! These need to be serialized across the SIMD lanes. For the 7 scatters and assuming 1 cycle/colliding SIMD lane at best...
Yeah, that's why I asked about the worst case.

But now that we've discussed it more, it seems the algorithm needs to be optimal for what is currently causing 100% collision rate. i.e. it would be preferable if it was collision-free, or at least to minimise the cost of the worst-case.

Yes it's pure 1:1 streaming. Is there a faster way than to just write the obvious code that maps work items to pixels and loops over chunks of the screen? All the code I've seen uses this style and I've never seen explicit "gather4"s for performance or anything.
In general, if bandwidth/latency-hiding is a bottleneck, then it's faster. I think the DC sample universe is a bit new as yet.

Sampling each element only once means that bandwidth is capped by memory controllers, so the potential gain from gather4 is then mostly about defending against cache thrashing (since multiple hardware threads share a core) and, in the case of ATI, minimising systematic latency caused by clause switching (I think NVidia is mostly immune here).

Obviously if sampling is truly stochastic then elements will end up being fetched multiple times simply due to the banking/burstiness of DDR.

But your kernel is far away from being Z-fetch bandwidth bound so that isn't a motivation for gather4.

Right but in all the current APIs it is implementation defined. NVIDIA "sort of" defines it as row major and I believe ATI tends to use the same convention but none of the APIs guarantee it. The best you can do is just use a 1-D domain, assume the SIMD packing is linear in this domain and do the addressing math yourself.
Hmm, I was under the impression it's strictly defined row-major for work-items and work groups in all APIs.

http://msdn.microsoft.com/en-us/library/ff471570(v=VS.85).aspx

Right but how does this interact with the work-item packing? Can it not collect four of those fetches from adjacent work items and do a 128-bit fetch similar to NVIDIA's coalescing?
If work items address contiguous areas of samples, then yes you'll get coalesced memory access and nice cache usage.

But bandwidth twixt TEX L1 and registers will go unused if each fetch is less than 128-bits (on ATI - who knows what it is on NVidia?)

Bear in mind that a filtered texel's components are delivered to the pixel shader as floats, i.e. in graphics a 4-component texel is 128-bits in the register file, in general. So 128-bits of data in the register file are expanded from, say, 4 bits of compressed texel.

So gather4 was created as it's common to have a single-component resource, but single fetches of 32-bits leaves a lot of L1->register-file bandwidth on the table.

In CUDA the coalescing that's often discussed is for fetches that don't go through the texturing hardware. Texturing does, too, but texturing hardware is not used much in CUDA.

In DC shaders it's unclear to me how NVidia uses the available hardware - e.g. in your shader I have no idea whether the texturing hardware is being used or whether the load/store path is being used. These two paths are distinct and parallel, each with a dedicated L1. Fermi documentation suggests that load/store is the fastest path for compute. So there's a decent chance that 32-bit fetches are actually optimal on GF100.

In ATI you can see from GPU Shader Analyzer that it's using the texturing hardware:

Code:
        14 TEX: ADDR(274) CNT(1) 
             27  LD R0.x___, R2.xy0w, t2, s0
and that 96 bits of bandwidth are lost to the three "_"s.

It doesn't get much slower, which is why this implementation is good. You can test yourself by just moving the camera and walking up to a wall :) That case absolutely destroys the vertex scatter implementations but this one remains pretty fast due to the local atomics. Of course the SIMD lanes still get serialized but there is really no avoiding that.
The advantage of fetching say four samples per work item is that you can test if all four samples have the same bin. In that case you can use a single atomic to increment by 4 instead of four distinct atomics of 1, so this should be 4x faster for the worst case.

Similarly you can aggregate min and max for the light if all four samples hit the same bin, for another 4x speed-up.

Then you can ask: is it also worth testing if any 3 samples have the same bin? What about any 2?

Now, the fun bit is what's the optimal count of samples per work item? It might be 2. Or 8. etc.

It would be amusing if 2 was optimal. Though I suspect 2 won't help make this approach viable on NVidia, it appears you would need 8.
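To make it concrete, something like this (invented names, and only the count part of the bin; the min/max components would get the same treatment):

Code:
#define NUM_BINS 64                                    // placeholder bin count
groupshared uint gHistogram[NUM_BINS];                 // placeholder per-group histogram

uint ComputeBin(float z)                               // stand-in for the real binning
{
    return min(uint(z * NUM_BINS), NUM_BINS - 1);
}

void AccumulateQuad(float4 quad)                       // four depth samples per work item
{
    uint4 bins = uint4(ComputeBin(quad.x), ComputeBin(quad.y),
                       ComputeBin(quad.z), ComputeBin(quad.w));

    if (all(bins.xyz == bins.www))
    {
        // All four samples hit the same bin: one atomic with an aggregated increment.
        InterlockedAdd(gHistogram[bins.w], 4);
    }
    else
    {
        // Fall back to per-sample atomics (pairs could be tested here too).
        InterlockedAdd(gHistogram[bins.x], 1);
        InterlockedAdd(gHistogram[bins.y], 1);
        InterlockedAdd(gHistogram[bins.z], 1);
        InterlockedAdd(gHistogram[bins.w], 1);
    }
}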

---

On NVidia, because it seems atomics are running at "return-value rate" and since a sample's light coordinate can't be both min and max, it might be worth making min/max atomics mutually exclusive. i.e. use the return value from max to decide if the min atomic is even worth executing.

Still, it seems unlikely this will be enough of a performance gain.

Might be worth giving this memory, i.e. each work item remembers binID and min/max for the prior sample. Potentially have several bins memorised like this. Anything to avoid a collision...
 
Hmm, I was under the impression it's strictly defined row-major for work-items and work groups in all APIs.
That is only in terms of the indices that you get into your shader. There's no discussion/guarantees about how those indices get packed into a SIMD block/warp/wavefront since they still like to pretend that these machines might not be SIMD :) Implementations could pack indices [0,0]-[16,0] into a SIMD warp, [0,0]-[0,16], [0,0]-[4,4], etc. Obviously each of these changes the memory access pattern of the kernel.


In DC shaders it's unclear to me how NVidia uses the available hardware - e.g. in your shader I have no idea whether the texturing hardware is being used or whether the load/store path is being used.
It seems like they use the load/store path for everything but real texture lookups (not entirely sure what they do with "Load", but definitely for the fetches that require no data or address conversions, etc). These seem to go through the regular path and get cached in the core L1.

In ATI you can see from GPU Shader Analyzer that it's using the texturing hardware:
Ah I see, interesting. I guess that's probably because it's from a Z-buffer which is most likely swizzled, etc. That actually begs the question as to whether or not NVIDIA has to go through the TEX unit to get the address swizzling as well... quite possibly yes.

The next question of course is how gather4 works (or doesn't?) with MSAAed surfaces since this code technically references a sample index as well (although it basically gets compiled out when MSAA_SAMPLES=1).

The advantage of fetching say four samples per work item is that you can test if all four samples have the same bin. In that case you can use a single atomic to increment by 4 instead of four distinct atomics of 1, so this should be 4x faster for the worst case.
Sure, and this is precisely my argument for lane swizzling! :) But it should be noted that it reduces some of the overall parallelism of the kernel (turns it into ILP effectively) which with these high-end ATI GPUs is becoming problematic... they're so wide and require so much latency hiding that it's hard to fill them, particularly at lower resolutions.

Similarly you can aggregate min and max for the light if all four samples hit the same bin, for another 4x speed-up.
Sure. With lane swizzling I do the address math across the SIMD lanes and conservatively combine ones that match. I don't test all of the possibilities but I do a pow-2 style paired reduction and only combine addresses that match. This catches the most important case of "all the same" and does pretty well in the average case too. With Gather4 I could certainly do something similar for the 4-wide vector.

On NVidia, because it seems atomics are running at "return-value rate" and since a sample's light coordinate can't be both min and max, it might be worth making min/max atomics mutually exclusive. i.e. use the return value from max to decide if the min atomic is even worth executing.
Well, it can be both min and max if it's the first/only sample in the bin :)

Might be worth giving this memory, i.e. each work item remembers binID and min/max for the prior sample. Potentially have several bins memorised like this. Anything to avoid a collision...
Software cache style? I assume it would be best then to reorganize the threads so that each one traverses a contiguous range rather than each iteration of the warp handling a block. This would sabotage memory coalescing on the load but I guess you could technically always load it in and reorganize in shared memory... if that worked out to be worth it.

Definitely an interesting suggestion... I'll see if I can get some time to play with it!
 
That is only in terms of the indices that you get into your shader. There's no discussion/guarantees about how those indices get packed into a SIMD block/warp/wavefront since they still like to pretend that these machines might not be SIMD. Implementations could pack indices [0,0]-[16,0] into a SIMD warp, [0,0]-[0,16], [0,0]-[4,4], etc. Obviously each of these changes the memory access pattern of the kernel.
With the dual constraints of local memory (shared) and workgroup synchronisation the hardware threads that constitute a workgroup share a core.

Underlying that is a 1D execution domain and one or more 1D buffers, regardless of what's declared.

It seems like they use the load/store path for everything but real texture lookups (not entirely sure what they do with "Load", but definitely for the fetches that require no data or address conversions, etc). These seem to go through the regular path and get cached in the core L1.
Maybe NVidia's latest tools provide some insight?

Ah I see, interesting. I guess that's probably because it's from a Z-buffer which is most likely swizzled, etc. That actually begs the question as to whether or not NVIDIA has to go through the TEX unit to get the address swizzling as well... quite possibly yes.
Well, I don't have any insights on what NVidia prefers, and I'm not much good at D3D either...

The next question of course is how gather4 works (or doesn't?) with MSAAed surfaces since this code technically references a sample index as well (although it basically gets compiled out when MSAA_SAMPLES=1).
Not sure what happens with 8xMSAA, is 4xMSAA a special case with a nice swizzle and everything else is linear?...

Sure, and this is precisely my argument for lane swizzling!
Which I think is viable on the GPUs, through local memory as a communication medium.
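e.g. a rough groupshared emulation of the paired reduction you describe (hypothetical names; the barriers are obviously the cost):

Code:
#define NUM_BINS   64
#define GROUP_SIZE 256
groupshared uint gHistogram[NUM_BINS];                 // placeholder shared histogram
groupshared uint gBin[GROUP_SIZE];                     // each lane's bin address
groupshared uint gCount[GROUP_SIZE];                   // pending increment per lane

void CombineMatchingBins(uint gi, uint myBin)          // gi = SV_GroupIndex
{
    gBin[gi]   = myBin;
    gCount[gi] = 1;
    GroupMemoryBarrierWithGroupSync();

    // Pow-2 paired reduction that only merges a pair when both lanes hit the same bin.
    for (uint stride = 1; stride < GROUP_SIZE; stride *= 2)
    {
        if ((gi % (2 * stride)) == 0 && gBin[gi] == gBin[gi + stride])
        {
            gCount[gi] += gCount[gi + stride];         // absorb the partner's increment
            gCount[gi + stride] = 0;
        }
        GroupMemoryBarrierWithGroupSync();
    }

    // Surviving lanes issue one atomic each with an aggregated increment.
    if (gCount[gi] != 0)
        InterlockedAdd(gHistogram[gBin[gi]], gCount[gi]);
}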

But it should be noted that it reduces some of the overall parallelism of the kernel (turns it into ILP effectively) which with these high-end ATI GPUs is becoming problematic... they're so wide and require so much latency hiding that it's hard to fill them, particularly at lower resolutions.
And what happens with 1000 cores?

Actually, of course, 1000 cores means you've got "1000 atomics per clock" - so that would be pretty useful for this kernel. Ironically Larrabee has the highest core count of all these competing chips currently.

Sure. With lane swizzling I do the address math across the SIMD lanes and conservatively combine ones that match. I don't test all of the possibilities but I do a pow-2 style paired reduction and only combine addresses that match. This catches the most important case of "all the same" and does pretty well in the average case too. With Gather4 I could certainly do something similar for the 4-wide vector.
So is the Larrabee version faster than 1.4ms? What's the worst-case like?

Is the worst case uniform Z? I suspect not - I guess uniform Z is the fastest and more random Z is slower.

Well, it can be both min and max if it's the first/only sample in the bin
Ha, well that comes out in the wash in the end, if the minimum never budges from its initial, dummy, value.

Software cache style? I assume it would be best then to reorganize the threads so that each one traverses a contiguous range rather than each iteration of the warp handling a block. This would sabotage memory coalescing on the load but I guess you could technically always load it in and reorganize in shared memory... if that worked out to be worth it.

Definitely an interesting suggestion... I'll see if I can get some time to play with it!
Yeah some kind of software cache approach.

Quite apart from the variations in GPUs out there, even given a single chip there's so many algorithm choices. It's like trying to find the single cut through a 12-dimensional cake that gets the best bits.
 
Underlying that is a 1D execution domain and one or more 1D buffers, regardless of what's declared.
Yes for sure, but my point is that you need to know that mapping of "declared" domain to "real" domain to manipulate the memory access patterns.

Which I think is viable on the GPUs, through local memory as a communication medium.
Fine, but like I said it's not reasonable in the current programming model due to syncs.

And what happens with 1000 cores?
Amdahl's law :) Seriously though we need to put some hardware into more cleverness now... we're reaching the limit of just "more" IMHO.

So is the Larrabee version faster than 1.4ms? What's the worst-case like?
Can't really talk specifics unfortunately.

Is the worst case uniform Z? I suspect not - I guess uniform Z is the fastest and more random Z is slower.
For any scatter-based algorithm uniform is going to be the worst case... ideally that scatter/atomics taper nicely into log(n) reduction as data becomes more uniform but in practice it tends towards linear at best for a non-adaptive algorithm.

Ha, well that comes out in the wash in the end, if the minimum never budges from its initial, dummy, value.
Sure, but you'd still have to do a separate branch for each of the 3 min's... not sure that would be worth it but I guess you could try. It would also necessitate adding a return value but as you note maybe on NVIDIA you're already paying for that.

Quite apart from the variations in GPUs out there, even given a single chip there's so many algorithm choices. It's like trying to find the single cut through a 12-dimensional cake that gets the best bits.
Indeed and I have tried a variety of approaches but in my past experience software caches that actually involve any address math are far too slow. Maybe a single-element "cache" could be a win though for this data set. Again, having to go so wide to fill the machine means there's a limited number of work-item iterations that can actually benefit from this, and I imagine trying to do this in shared memory would be prohibitively slow compared to just the atomics.
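A sketch of what I mean by a single-element cache (placeholder names for the binning and the fetch; counts only - the min/max bounds would need the same treatment):

Code:
#define NUM_BINS 64
groupshared uint gHistogram[NUM_BINS];                 // placeholder shared histogram
Texture2D<float> gDepth : register(t0);

uint ComputeBin(float z) { return min(uint(z * NUM_BINS), NUM_BINS - 1); }  // placeholder

void AccumulateRun(uint2 rowStart, uint sampleCount)
{
    uint cachedBin   = 0xffffffff;                     // sentinel: nothing pending
    uint cachedCount = 0;

    // Each work item walks a contiguous run of pixels, so consecutive samples often
    // hit the same bin and collapse into a single atomic at the end of the run.
    for (uint i = 0; i < sampleCount; ++i)
    {
        uint bin = ComputeBin(gDepth.Load(int3(rowStart.x + i, rowStart.y, 0)));
        if (bin == cachedBin)
        {
            ++cachedCount;                             // cache hit: no atomic at all
        }
        else
        {
            if (cachedCount > 0)
                InterlockedAdd(gHistogram[cachedBin], cachedCount);  // flush previous run
            cachedBin   = bin;
            cachedCount = 1;
        }
    }
    if (cachedCount > 0)
        InterlockedAdd(gHistogram[cachedBin], cachedCount);          // final flush
}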
 
Yes for sure, but my point is that you need to know that mapping of "declared" domain to "real" domain to manipulate the memory access patterns.
OK, being more explicit: the mapping is linear on the GPUs. The developer doesn't have to unwind a space-filling curve or some other abomination.

Of course the developer might decide that it's worth implementing a space-filling curve for their algorithm's data-access. prunedtree's matrix multiplication algorithm does this, I think - effectively simulating the pixel shader access pattern in a compute shader. A newer variant of the algorithm, which uses a pixel shader as a compute shader (i.e. it doesn't use pixel outputs at all), gets the same performance without manipulating addresses as prunedtree does.

For any scatter-based algorithm uniform is going to be the worst case... ideally that scatter/atomics taper nicely into log(n) reduction as data becomes more uniform but in practice it tends towards linear at best for a non-adaptive algorithm.
Hmm, but your algorithm is adaptive on Larrabee, which is why it seems like it should be fastest for the uniform case, since that's a single write to memory whereas any other Z distribution incurs more atomics.
 
OK, being more explicit: the mapping is linear on the GPUs. The developer doesn't have to unwind a space-filling curve or some other abomination.
That's true in practice, but that is not guaranteed by the APIs/spec! That's my point.

Hmm, but your algorithm is adaptive on Larrabee, which is why it seems like it should be fastest for the uniform case, since that's a single write to memory whereas any other Z distribution incurs more atomics.
Sure, but they are only "local" memory "atomics" so it's not a big deal relatively speaking. The SIMD width is smaller than on GPUs and scalar throughput is better as well. In practice there is only moderate speed variation depending on the data collision pattern, and in any real graphics scene the range is between "very locally coherent -> completely locally coherent" :) You never see highly random patterns even with foliage and the like.
 
Andrew (and Marco), thanks again for great research. If you have tried to port your algorithm to console hardware, I would like to hear about your experiments.

I am planning to use SDSM in our next console project (we are already using EVSM filtering and PSSM), so I started formulating the min/max z and the cascade min/max xyz operations for hardware without compute shader synchronization features or atomics.

The first draft (z min/max from 1280x720 depth texture --> min/max in two floats in a constant buffer):

- One PS pixel samples a 4x4 grid (16 tex instructions), and outputs the min & max value (to a 16f-16f RT pixel)
- Rendering to 4x4 rendertarget (16f-16f), one triangle covers the whole rendertarget (4x4 = 16 pixel triangles - enough for proper utilization)
- In total one (4x4 pixel) rendered triangle processes 4x4 * 4x4 = 16x16 = 256 input pixels
- Render (1280/16 x 720/16 = 80x45 =) 3600 triangles (4x4 pixels each) on top of each other with max blend in a single draw call.
- Resolve the 4x4 render from EDRAM to a texture.
- Read the 4x4 texture (16 samples) with one PS instance, get min/max from the 16 samples and memexport the min/max z values to a constant buffer.

Blender provides the atomic max operation (and negative values with max blend = atomic min). And blending to EDRAM is bandwidth free. There is no need for a vertex buffer (vertex index is enough to calculate the "full screen" triangles with same vertex positions and different texcoords each).
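Roughly, the per-pixel part in (PC-style) HLSL would look something like this (EDRAM resolve, memexport and the triangle setup omitted; names are just placeholders):

Code:
Texture2D<float> gDepth : register(t0);                // full resolution depth

// Outputs (max z, max -z) for one 4x4 block; with MAX blending on the 16f-16f target
// the overlapping triangles accumulate the global max in .x and -min in .y.
float2 MinMaxBlockPS(float4 pos : SV_Position,
                     float2 blockTopLeft : TEXCOORD0) : SV_Target
{
    float maxZ    =  0.0f;                             // depth assumed in [0, 1]
    float maxNegZ = -1.0f;

    [unroll] for (int y = 0; y < 4; ++y)
    {
        [unroll] for (int x = 0; x < 4; ++x)
        {
            float z = gDepth.Load(int3(int2(blockTopLeft) + int2(x, y), 0));
            maxZ    = max(maxZ, z);
            maxNegZ = max(maxNegZ, -z);
        }
    }
    return float2(maxZ, maxNegZ);
}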

Rendering to a 8x8 RT would drop the triangle count to 900, and 16x16 would drop it to 225. But both require one additional small downsample operation (read 16x16 -> output min/max to 4x4).

Similar algorithm would likely be usable for finding the min/max xyz values for all the four cascades, but it would require four rendertargets (of 4x4 size each). And the final memexport step could be done in parallel for all the four cascades... saves a few cycles :).

Alternatively I have been toying around with an algorithm doing recursive min/max downsampling and memexporting the result xyz maximums/minimums of each block. Too bad there are no atomic min/max memexports, it would make things much easier :)
 
That's true in practice, but that is not guaranteed by the APIs/spec! That's my point.

At least in CUDA, the mapping is guaranteed and explicitly set forth in the documentation.
See pages 8 and 81 of the current CUDA programming guide:
The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Section 2.2 describes how thread IDs relate to thread indices in the block.
The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y*Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y*Dx + z*Dx*Dy).

http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf
 
If you have tried to port your algorithm to console hardware, I would like to hear about your experiments.
Given the nature of our current jobs, we have not, but I'm really interested in hearing about your experiences doing it and happy to share any experience we do have with the algorithm!

I am planning to use SDSM in our next console project (we are already using EVSM filtering and PSSM)
Awesome! Curiously, 16-bit EVSM? Impressive that that works well enough if so.

The first draft (z min/max from 1280x720 depth texture --> min/max in two floats in a constant buffer):
Note as well that we toyed with the idea of undersampling the depth buffer (i.e. only look at every other value in each dimension for instance)... you lose the guarantee of having a shadow sample for every pixel but it should be rare that you hit really bad cases. Might be something to play with if performance is a concern.

- One PS pixel samples a 4x4 grid (16 tex instructions), and outputs the min & max value (to a 16f-16f RT pixel)
- Rendering to 4x4 rendertarget (16f-16f), one triangle covers the whole rendertarget (4x4 = 16 pixel triangles - enough for proper utilization)
- In total one (4x4 pixel) rendered triangle processes 4x4 * 4x4 = 16x16 = 256 input pixels
- Render (1280/16 x 720/16 = 80x45 =) 3600 triangles (4x4 pixels each) on top of each other with max blend in a single draw call.
- Resolve the 4x4 render from EDRAM to a texture.
- Read the 4x4 texture (16 samples) with one PS instance, get min/max from the 16 samples and memexport the min/max z values to a constant buffer.
Clever - yeah definitely makes sense to use/abuse the EDRAM/ROP setup in the 360 :) The typical solution FWIW is to do an O(log n) pass reduction with ping-ponged textures where each pass renders a primitive that covers ~N pixels where N is around how many SIMD lanes you need to fill the machine. Then in the pixel shader it loops over the relevant block (resolution/N) and accumulates the reduction in registers, writing the final result out. You usually only need a few passes of this given current GPU core counts. It's usually worth doing the lookups "strided" for a given pixel invocation as that makes the lookups across pixels in the SIMD batch coherent (consecutive) and avoids bank conflicts.

That's the implementation we used in DX9/10 and it's quite fast. Yours seems even more clever on the 360 though as long as the rendering of 3600 triangles doesn't bottleneck too much on primitive setup or otherwise. I'm interested to see your results.
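Each pass of that reduction looks roughly like this (a sketch with made-up names; the first pass reads the depth buffer directly instead of a min/max texture):

Code:
// One ping-pong pass: each pixel of the small destination reduces a strided block of
// the source, so adjacent pixels in a SIMD batch fetch adjacent texels each iteration.
Texture2D<float2> gSrcMinMax : register(t0);           // .x = min z, .y = max z
cbuffer ReductionCB : register(b0)
{
    uint2 gSrcDim;                                     // source dimensions
    uint2 gDstDim;                                     // destination (reduced) dimensions
};

float2 ReduceMinMaxPS(float4 pos : SV_Position) : SV_Target
{
    uint2 dst = uint2(pos.xy);
    float2 result = float2(1.0f, 0.0f);                // (min, max) identity for [0,1] depth

    // Strided walk: output pixel (x, y) reads source texels (x + i*gDstDim.x, y + j*gDstDim.y).
    for (uint y = dst.y; y < gSrcDim.y; y += gDstDim.y)
    {
        for (uint x = dst.x; x < gSrcDim.x; x += gDstDim.x)
        {
            float2 v = gSrcMinMax.Load(int3(x, y, 0));
            result.x = min(result.x, v.x);
            result.y = max(result.y, v.y);
        }
    }
    return result;
}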

Blender provides the atomic max operation (and negative values with max blend = atomic min).
Yup we do the same.

Rendering to a 8x8 RT would drop the triangle count to 900, and 16x16 would drop it to 225. But both require one additional small downsample operation (read 16x16 -> output min/max to 4x4).
True, although are you planning to run the SDSM partition algorithm on the GPU or CPU? If the latter, reading back the last level of the reduction can actually be faster than doing it on the GPU, since GPUs are actually pretty bad at the "last few levels" of these reductions (which are basically serial work).

Similar algorithm would likely be usable for finding the min/max xyz values for all the four cascades, but it would require four rendertargets (of 4x4 size each). And the final memexport step could be done in parallel for all the four cascades... saves a few cycles :).
Yes for RTs or four passes, whichever is faster.

Alternatively I have been toying around with an algorithm doing recursive min/max downsampling and memexporting the result xyz maximums/minimums of each block. Too bad there are no atomic min/max memexports, it would make things much easier :)
Yes that's similar to the pixel shader implementation that I described above. It should work well enough.

Very interested to see your results and I'm glad you're trying it out!
 
Yes for RTs or four passes, whichever is faster.
The fill rate deficit for using four render targets should be pretty much masked by the 16 tex instructions. So four RTs should be faster (in my algorithm draft at least).

For 2x2 recursive downsampling, it would likely be best to render the first pass to 4 render targets (first pass source data contains pixels that belong to all cascades), and the later passes to single render targets (pixel data for each cascade is already separated in the first pass to the four render targets).

Actually the xyz min/max (= 2*3 channels) for four cascades would take 2*3*4 = 24 channels. So that would require six render targets on pre DX10 hardware (as pre DX10 hardware doesn't have integer registers and integer bit operations, so packing 2x16 bit values to a single 32 bit integer output is not possible). And DX9 didn't support more than 4 simultaneous render targets, so doing it this way requires at least two passes (3 rt:s each).

Recursive output with memexport would allow me to write all the 24 values (xyz min/max for four cascades) in adjacent memory locations from a single shader. But I would lose the blender atomic min/max operations, so it would be more like a traditional recursive downsampling that way. Seems I have to implement both and do some profiling to see the bottlenecks.

Awesome! Curiously, 16-bit EVSM? Impressive that that works well enough if so.
Actually there are some visible graining artifacts, as I currently use 16-16 integer buffers. I use separate bias and scale for both channels, as they have different value range. Did you do experiments with integer formats when you developed the algorithm? (on consoles filtering from integer buffers is generally faster)
 
The fill rate deficit for using four render targets should be pretty much masked by the 16 tex instructions. So four RTs should be faster (in my algorithm draft at least).
Makes sense. You may even be able to early-out of certain tiles on the screen, as it's common for sub-tiles to belong entirely to one partition (and fairly easy to detect). That said this falls into "optimizing for the easy case" which I typically try to avoid in game algorithms.

Actually the xyz min/max (= 2*3 channels) for four cascades would take 2*3*4 = 24 channels.
Right although if you have a small enough scene you may be able to ignore the z bounds... they really do help the light bleeding and precision with any filterable representation though.

And DX9 didn't support more than 4 simultaneous render targets, so doing it this way requires at least two passes (3 rt:s each).
Hmm yeah... unfortunate sequence of missing features :(

Seems I have to implement both and do some profiling to see the bottlenecks.
Yeah unfortunately I don't have good intuition for the 360 memexport/EDRAM trade-offs. I'm curious about the results though if you do implement and profile one or both!

Actually there are some visible graining artifacts, as I currently use 16-16 integer buffers. I use separate bias and scale for both channels, as they have different value range. Did you do experiments with integer formats when you developed the algorithm? (on consoles filtering from integer buffers is generally faster)
Interesting, and you manage to make that filter correctly? Impressive :) No I haven't actually played with fixed-point math with EVSMs since they so perfectly fit the floating point representation. Indeed they have better useful precision with float formats than conventional VSMs which only effectively make use of the mantissa (although of course this works well for fixed-point formats).

On PC as you probably know there's no support for 32-bit fixed-point/integer filtering which makes 32-bit float (which is filterable on all released DX10+ HW luckily even though not required to be) the most viable solution. I briefly looked into whether EVSMs made 16-bit float more viable and while it does improve the situation over VSMs slightly it's not really enough for large scenes. This was all before the SDSM work though so there's a possibility that the aggressive depth range clamping helps a lot there.

Again, very interested in your experiences. I'm glad to see you taking this stuff to the next level already!
 
Right although if you have a small enough scene you may be able to ignore the z bounds... they really do help the light bleeding and precision with any filterable representation though.
---
Interesting, and you manage to make that filter correctly? Impressive :) No I haven't actually played with fixed-point math with EVSMs since they so perfectly fit the floating point representation. Indeed they have better useful precision with float formats than conventional VSMs which only effectively make use of the mantissa (although of course this works well for fixed-point formats).
Optimal depth bounds are very important because of the EVSM filtering, and that's why I want to have the tight SDSM z-bounds as well. A tight near clip allows us to clamp the objects before near clip very close to the receivers (like you said, this reduces light leaking considerably), and a tight far plane in combination with tight near plane brings the best possible precision to the 16 bit fixed point.

The 16 bit EVSM works quite well. The filter quality is very good for the closest cascades, but there's visible graining in the furthest cascades (we have around 2 kilometer view distance). The quality (with four cascade PSSM) is not yet something I would release as a final product. But with tighter SDSM bounds and some fine tuning, the 16 bits are likely enough (at least I hope so).

I plan to use the z-buffer we render for the virtual texture page fetching to feed the SDSM system. It's 4x4 smaller than the game resolution, so the downsampling performance should be better.

Currently we render the virtual texture page fetch frame by using a crude camera movement approximation (future frame). But I could modify it to render the next frame almost perfectly: perfect next-frame camera matrix and near perfect object movement approximation (position & rotation added with velocity & angular velocity). This would allow me to lock the result next frame (prevent lock stalls) and use the information for SDSM (it should provide a more correct result than just using last frame's min/max xyz).

The downsampled depth buffer is really useful for occlusion culling as well (we plan to have a similar real-time occlusion culling system to the one in CryEngine 3)... Two weeks ago I refactored our tile-based deferred renderer away, but if we get near perfect tile near/far depth values (without lock stalls), it might change my plans a bit :)
 