Reading SV_Depth and SV_Target in pixel shader (hardware implementation issues?)

It's super-useful! Fully programmable blending, z, and stencil are just the tip of the iceberg... what about building up a K-buffer in a single pass? How about building a deep shadow map in a single pass? It's useful in any situation where you need to build up a data structure from all of the fragments that hit a given pixel.
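
To make the K-buffer idea concrete, here's a minimal CUDA sketch of building a small per-pixel fragment list with an atomic counter. All names here (Fragment, kMaxFragments, insert_fragment) are hypothetical, not from any real API:

```cuda
// Hypothetical CUDA sketch: building a small per-pixel fragment list
// (a K-buffer) in a single pass.
struct Fragment {
    float depth;
    unsigned int color;  // packed RGBA8
};

const int kMaxFragments = 8;  // K entries per pixel

__device__ void insert_fragment(Fragment* lists, int* counts,
                                int pixel, Fragment f)
{
    // Reserve a slot in this pixel's list; fragments hitting the same
    // pixel serialize only on the counter, not on the whole list.
    int slot = atomicAdd(&counts[pixel], 1);
    if (slot < kMaxFragments)
        lists[pixel * kMaxFragments + slot] = f;
    // A real K-buffer would merge or evict the furthest fragment on
    // overflow; omitted here for brevity.
}
```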

And on a tiled renderer like Larrabee, it goes even beyond that: all of the render target data is sitting close to the processors, so you can happily work away on this data in a R/W fashion and only write out the final results. Deferred shading w/ MSAA and tone-mapping, with only the final, resolved 32-bit RGBA buffer ever leaving the local cache? Yes please :)

Exactly. And it also allows completely new algorithms to be implemented, ones that nobody has tried on the GPU before because of this limitation. IMHO this is a much more important feature than geometry shaders or GPU tessellation (and it likely consumes less die space if implemented correctly).
 
So, out of curiosity, what sorts of divergence are you expecting?

Full divergence in some cases. I'd like to have divergence for framebuffer R/W (also with atomic access), similar to texture access. With respect to atomic access, I can see cases where many of the threads access the same address as well (i.e. have collisions). The cases I can think of would have good data locality.
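
As a rough illustration of what divergent framebuffer R/W with atomics might look like, a hypothetical CUDA-style additive "splat" where colliding threads still produce a correct result:

```cuda
// Hypothetical sketch: divergent framebuffer R/W via an atomic additive
// splat. Threads may collide on the same pixel; the atomic serializes
// the colliders but still yields a correct sum.
__device__ void blend_add(float* framebuffer, int pixel, float value)
{
    atomicAdd(&framebuffer[pixel], value);
}
```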

I'll toss in another example, one that I've been working on. As triangles get very small, i.e. close to pixel-sized, triangle rendering stops making sense (IMO). At that point one could go to points and draw verts directly into a tile (possibly keeping track of a sub-pixel offset per pixel as well). Do a hole-filling pass on the tile to correct for any gaps, then use reprojection of the previous frame to help remove the aliasing problems and deal with filtering (possibly mixing in temporal jitter with the reprojection). There are a lot of ways to do this. A proper level-of-detail and occlusion system, such as a wide hierarchical tree of point batches, is needed, and this can currently be done in a highly parallel way by having each tile manage its own display traversal and data structures. This is the sort of thing I think Larrabee would be great at... and it needs fast divergent vector scatter.
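
A rough CUDA sketch of that "draw verts directly into a tile" step, under some stated assumptions (tile entries initialized to ~0ULL, depth in [0,1], sm_35+ for the 64-bit atomicMin); all names are hypothetical:

```cuda
// Illustrative sketch: depth goes in the high 32 bits and a
// color/sub-pixel payload in the low 32 bits, so a single 64-bit
// atomicMin keeps the nearest point. For depth in [0,1], the float
// bit pattern orders correctly as an unsigned int.
__device__ void splat_point(unsigned long long* tile, int tileWidth,
                            float x, float y, float depth,
                            unsigned int payload)
{
    int idx = (int)y * tileWidth + (int)x;
    unsigned long long key =
        ((unsigned long long)__float_as_uint(depth) << 32) | payload;
    atomicMin(&tile[idx], key);  // nearest fragment wins
}
```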
 
While we're moving away a bit from the original topic, the question of "divergent" (I assume this means relatively incoherent here?) scatters is an interesting one. There's a crossover point somewhere, related to the density of scattered items and collision policies, after which you're better off just sorting/binning the data up front, or even converting it into a fairly pure gather, rather than designing fancy hardware that effectively ends up doing the same thing. Having hardware that can asynchronously scatter to incoherent addresses is certainly pretty useful, but once you start hitting the stage where it's trying to buffer up different requests, coalesce/cache big chunks of data, etc., you're often better off just handling that in software IMHO. It'll be interesting to see how the various hardware evolves :)
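
For reference, the "sort/bin up front and turn the scatter into a gather" approach can be as simple as a sort by destination. A sketch using Thrust for brevity (illustrative names, not a tuned implementation):

```cuda
// Sort items by destination bin so each consumer reads a contiguous,
// fully coherent range instead of the producers scattering.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void bin_items(thrust::device_vector<int>& binIds,    // destination per item
               thrust::device_vector<float>& payload) // item data
{
    thrust::sort_by_key(binIds.begin(), binIds.end(), payload.begin());
}
```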

Still, I'll point to geometry shaders as a good example... for quite a while - and I haven't run the test again recently - it was faster to do the "pack" for amplification/de-amplification manually in software than to let the GS handle it. As always it's the constant question of throughput vs. latency... if you're willing to eat some latency, you can almost always do a better job by re-sorting/batching these sorts of operations over larger chunks and better scheduling them in parallel, rather than trying to dynamically load-balance as the data comes in.
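
That software "pack" is just stream compaction: a parallel prefix sum over per-element keep flags gives each surviving element its output slot. A minimal sketch using Thrust, which does that scan internally (the cull test here is a stand-in):

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct IsVisible {
    __host__ __device__ bool operator()(int primId) const {
        return primId >= 0;  // stand-in for a real visibility test
    }
};

// Compact the surviving primitives into a dense output array.
thrust::device_vector<int> pack(const thrust::device_vector<int>& prims)
{
    thrust::device_vector<int> out(prims.size());
    auto end = thrust::copy_if(prims.begin(), prims.end(),
                               out.begin(), IsVisible());
    out.resize(end - out.begin());
    return out;
}
```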

Anyways, that's getting a bit off-topic, so I'll stop :)
 
While we're moving away a bit from the original topic, the question of "divergent" (I assume this means relatively incoherent here?) scatters is an interesting one.

Divergent meaning that, given a vector register of addresses, more than one memory transaction would be needed to service the full request (assuming, say on Larrabee, that a cache line is the same size as the vector register). The number of memory transactions required to service the request would be a function of how "divergent" the addresses are. In the way I was describing divergence, you could have a lot of divergent vector scatter but still have relatively good data locality.
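
A toy worked example of that definition: count the distinct cache lines touched by a 16-wide scatter, i.e. the number of transactions needed, assuming a 64-byte line to match a 16 x 4-byte vector register (plain host-side C++ for illustration):

```cuda
#include <cstdio>
#include <cstdint>
#include <set>

int transactions(const uint32_t addrs[16], uint32_t lineBytes = 64)
{
    std::set<uint32_t> lines;
    for (int i = 0; i < 16; ++i)
        lines.insert(addrs[i] / lineBytes);  // one transaction per line
    return (int)lines.size();
}

int main()
{
    uint32_t coherent[16], divergent[16];
    for (int i = 0; i < 16; ++i) {
        coherent[i]  = i * 4;    // one cache line  -> 1 transaction
        divergent[i] = i * 256;  // one line apiece -> 16 transactions
    }
    printf("%d %d\n", transactions(coherent), transactions(divergent));
    return 0;
}
```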

For those who aren't aware, on newer NVIDIA hardware like the GTX 260/280, the hardware is now smart enough to do a minimal number of memory transactions for a divergent vector scatter even when the addresses are not sequential in the vector. But there is still a minimum memory transaction size of 32 bytes, which is quite a bit larger than 4 bytes (one vector element), so the worst case for a scatter of floats is an 8x bandwidth waste (32/4). CUDA supports scatter of float4 (16 bytes), which is only a 2x bandwidth waste (32/16) in the worst case.

What I was hoping with a "surface cache" was that this bandwidth waste would only be a waste of cache bandwidth instead of main memory bandwidth.
 
What I was hoping with a "surface cache" was that this bandwidth waste would only be a waste of cache bandwidth instead of main memory bandwidth.
Ah, I see. Well in the case of Larrabee, where your render targets are sitting in cache, that's definitely the case :) It would be nice if that could also be implemented on modern NVIDIA/AMD GPUs fairly cheaply, though.
 