Reading SV_Depth and SV_Target in pixel shader (hardware implementation issues?)

sebbbi

Veteran
Currently the SV_Target (render target pixel color) and SV_Depth (render target pixel depth) system-value semantics are write-only in DX10 / DX10.1 (SM 4.1) pixel shaders. This limitation has been in place since the first (SM 1.0) shader model.

Since the DX8 launch I have waited for this limitation to be lifted, as it would allow a lot of new algorithms to be developed on graphics hardware (without rendering every triangle separately or ping-ponging between two buffers), and as a side effect it would also make blending fully programmable (read SV_Target, do any custom blending, write the result to SV_Target). Reading from SV_Depth would potentially speed up many post-process effects (depth of field, volumetric particles, etc.) and allow the programmer to implement custom depth comparison modes (making many new algorithms possible).
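To make the custom blending part concrete, here is a rough sketch of the kind of pixel shader this would allow. The dst : SV_Target input is purely hypothetical (it is not valid in any current shader model), and the texture / sampler names are just placeholders:

```hlsl
// Hypothetical HLSL sketch (not valid SM 4.x): what programmable blending could
// look like if the destination pixel were readable. The "dst : SV_Target" input
// does not exist today; texture/sampler names are placeholders.
Texture2D    gDiffuseMap    : register(t0);
SamplerState gLinearSampler : register(s0);

float4 PSMain(float4 pos : SV_Position,
              float2 uv  : TEXCOORD0,
              float4 dst : SV_Target) : SV_Target   // hypothetical render target read
{
    float4 src = gDiffuseMap.Sample(gLinearSampler, uv);

    // "Overlay" blend: branches on the destination colour per channel, so it
    // cannot be expressed with the fixed-function blend units.
    float3 lo  = 2.0 * src.rgb * dst.rgb;
    float3 hi  = 1.0 - 2.0 * (1.0 - src.rgb) * (1.0 - dst.rgb);
    float3 rgb = lerp(lo, hi, step(0.5, dst.rgb));

    return float4(rgb, src.a);
}
```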

I am not sure if I understand the GPU inner workings 100% properly, but I have always assumed that the GPU only sets up / renders a single triangle at a time (all pixel shaders are processing the same triangle). There is no possibility of the same pixel being processed simultaneously by two separate threads from the same triangle (as triangles are planar surfaces), so there should be no thread synchronization issues caused by sampling the render target while rendering to it. However, I doubt my one-triangle-at-a-time assumption holds on any newer hardware, as it would cause a lot of idle cycles when rendering small triangles (unless the remaining unified shader units are processing the next vertices during rasterization). Could anyone with more hardware knowledge enlighten me on this issue?

With things simplified a bit:

If you read SV_Target or SV_Depth in your shaders, the GPU knows this in advance and can slightly adjust the pixel rendering procedure (writing to SV_Depth is handled similarly on current hardware). Basically the only new thing the GPU would need to do is fetch the existing render target color and depth value before running the pixel shader, instead of after the pixel shader as current hardware does (during depth testing and blending). No extra render target depth / color fetches or bandwidth would be needed, as only the order of operations differs. This would not even affect hierarchical Z, as it culls pixels before the pixel shader (skipping the render target and Z data fetch in both cases). The GPU knows the affected screen pixels right after triangle setup, and could prefetch the affected pixel color and depth values from the render target without much extra logic (by comparison, for normal texture fetches the texture coordinate needs to be calculated before the fetch can even start). This should not cost more than a simple unfiltered texture fetch (and could be optimized to be faster with a little bit of fixed-function logic).

I am pretty sure that there is at least one major hardware issue I have overlooked, since this feature has been rumoured to be included in the next DirectX version for a long time, but the major hardware manufacturers have not yet implemented it on any of their chips. Custom writes to SV_Depth were made possible in DX9, and they do cost some performance (hierarchical Z needs to be disabled at least). Reading from SV_Depth and SV_Target would theoretically be almost free (as both color and depth data still need to be fetched for blending and depth testing after pixel shading). Even if reading from SV_Target / SV_Depth cost as much as a single (unfiltered) texture fetch each, it would still be a really widely used feature in future games. I doubt this feature would require as many transistors as, for example, geometry shaders (a much less important feature currently, imho).

Just my thoughts as a long-time graphics engine programmer with only a little bit of hardware implementation knowledge :). Any information about this issue from real graphics hardware engineers would be more than welcome (I have been pondering this for a long time)!
 
According to this thread (http://forum.beyond3d.com/showthread.php?t=50894), triangles are rendered one after another by the hardware. If that is how the hardware actually works, the possible thread synchronization issues with reading SV_Depth and SV_Target should be more limited.
 
I believe having the option to read a pixel value from the active render buffer would greatly increase the amount of in-flight pixels to cover latency.

In the current arrangement, since you know that the texture value is static, you don't need to worry about read-after-write conditions, so you can issue the read right as it appears in the instruction stream and have a reasonable expectation that the value will return relatively quickly (due to texture cache). The read-after-write hazard resolution for blending happens further downstream.

If you're going to read from an active buffer, you need to make sure that there are no writes in flight for the same pixel. Since you can have many hundreds of pixels in-flight for a single pixel shader unit, you'd need to keep track of all of those, which not only costs area, but is also harder wrt scheduling. If you detect that a write is still in-flight, you also need to schedule a different thread, so your latency hiding capacity needs to go up.
 
This patent implies that multiple triangles can be in progress, in parallel, within the same batch:

Increased scalability in the fragment shading pipeline

Interestingly this patent application:

Rendering pipeline

relates to using screen space tiling to enforce temporal locality for render target reading/writing. This is much like what Intel is planning with Larrabee. I haven't read this at all closely, but it's a big deal if it's what I think it is :p

Essentially D3D11 requires the pixel shader to be able to arbitrarily read and write the render target. So, not long to wait...

Jawed
 
I don't believe D3D11 makes any guarantees about ordering within a draw call on so-called "unordered access views" :), but I may be remembering incorrectly there.

Render target read would indeed be really useful though, and it's something people have been wanting for a long time. On typical GPUs it would involve expanding the scoreboarding duration to cover the entire time from the first read to the last write (presumably by the fragment itself), whereas right now it technically only has to cover the ROP phase, as long as shader invocations remain ordered. On a tiled renderer like Larrabee it would be even easier/cheaper, since the render target data sits much closer to the processor (in cache).
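For reference, a rough sketch of what that D3D11-style unordered access looks like in HLSL terms; the resource slots and names are assumptions, and as noted above there is no ordering guarantee between overlapping fragments:

```hlsl
// Sketch only (names, slots and layout are assumptions): a pixel shader doing
// its own "over" blend through a UAV-bound color buffer instead of a render
// target. Ordering between overlapping fragments is NOT guaranteed.
RWStructuredBuffer<float4> gColorBuf : register(u1); // one float4 per screen pixel

static const uint SCREEN_WIDTH = 1280;               // placeholder

void PSMain(float4 pos : SV_Position, float4 src : COLOR0)
{
    uint2 p   = uint2(pos.xy);
    uint  idx = p.y * SCREEN_WIDTH + p.x;

    float4 dst = gColorBuf[idx];                        // read whatever is there now
    gColorBuf[idx] = src * src.a + dst * (1.0 - src.a); // manual, unordered blend
}
```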
 
I believe having the option to read a pixel value from the active render buffer would greatly increase the amount of in-flight pixels to cover latency.
Only if you want to make immediate reuse efficient.

You can just let it stall if the developer reuses stuff too fast ... hell, you can just make it non-deterministic (without a flush).
 
If the render target pixel write hasn't completed before a read to the same pixel, couldn't the hardware just stall the thread and switch the ALU to another (non-stalled) thread, like it does whenever a texture cache miss happens? This situation is not that common after all, as fragments from the same polygon cannot land on the same pixel, and it's rare that the next polygon renders to the same pixels as the previous one did. The developer could also reorder the mesh vertices (polygon drawing order) to minimize these stalls if needed.
 
Sebbbi, the CUDA PTX docs describe this functionality in a .surf memory space (but this is not implemented currently). Access was to be via special surface functions (like R/W texture access, but not using texture samplers probably because of cache coherency issues between texture cache and ROP or surface cache). Perhaps some day we will see this.

I forget if Larrabee's CPU and texture caches are coherent with each other or not..
 
Where are we with "set the current render target as a source texture" hacks? Does this still work in DX10, or does the runtime physically deny it? If you can smuggle this through, you get some sort of support for such functionality that might, in some weird cases on some cards, even work.
 
The D3D runtime prevents such things outright now :)

TimothyFarrar: I believe what sebbbi is asking for is just simply the ability to have the current framebuffer pixel as an input to the pixel shader. That sort of functionality wouldn't have to go through texture units.
 
If the render target pixel write hasn't completed before a read to the same pixel, couldn't the hardware just stall the thread and switch the ALU to another (non-stalled) thread, like it does whenever a texture cache miss happens? This situation is not that common after all, as fragments from the same polygon cannot land on the same pixel, and it's rare that the next polygon renders to the same pixels as the previous one did. The developer could also reorder the mesh vertices (polygon drawing order) to minimize these stalls if needed.
With everything GPU becoming more general and orthogonal, I think just being able to read the same pixel is too restrictive: as soon as you introduce the ability to read one pixel, everyone will start asking for random-access gather from the render target. (Probably for good reason...) So maybe ATI/Nvidia/Intel/Microsoft will simply not bother with the special case and go all out (or not).

It's also not just a matter of writing the same pixels: it's sufficient that you're rewriting pixels within the same pixel group (e.g. as defined by the burst length of the memory controller, or for some other architectural reason such as compression). Since most triangles in a mesh share an edge with their predecessor, the chances of having a RaW collision could be pretty high.
 
I dunno, I think the "just the same pixel" case has a strong argument in the language design, in that it doesn't interfere with the ability of a specific implementation to split up pixel processing any way it sees fit. Providing access to "neighboring" data would introduce discontinuities into the execution space and ruin the "independence" guarantees of a pixel shader.

Being able to access just the current pixel is super-useful though as it allows one to build data structures over all of the primitives that hit a given pixel. It also generalizes the depth and stencil buffers, allowing for things like single-pass depth peeling and so forth.
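As a hypothetical illustration of the single-pass depth peeling point: with a readable current pixel, the two nearest depth layers could be kept in a single two-channel render target in one geometry pass. The layers : SV_Target input below is not valid HLSL today:

```hlsl
// Hypothetical sketch: two-layer depth peeling in a single pass, keeping the
// nearest and second-nearest depths in an RG float render target (cleared to
// 1.0). The "layers : SV_Target" input is not valid in current shader models.
float2 PSDepthPeel2(float4 pos    : SV_Position,
                    float2 layers : SV_Target) : SV_Target // x = nearest, y = 2nd nearest
{
    float z = pos.z;   // incoming fragment depth

    // Insert z into the tiny per-pixel sorted list held in the render target.
    if (z < layers.x)
        layers = float2(z, layers.x);
    else if (z < layers.y)
        layers.y = z;

    return layers;
}
```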
 
If the render target pixel write hasn't completed before a read to the same pixel, couldn't the hardware just stall the thread and switch the ALU to another (non-stalled) thread, like it does whenever a texture cache miss happens?
It's possible to design hardware this way, but it's not cheap. Stalling for a texture access is simple because ordering doesn't matter, and it's just a read. What you're describing not only requires waiting for the previous pixel to finish (and that covers its entire time in flight!) before the read can even start, but you also have to keep track of the order in which pixels were sent and prioritize the waiting. We could be looking at tens of thousands of cycles of latency with only a few pixels on top of one another. Given that you can have 30k pixels in flight nowadays without any guarantee of screen locality, it's not a trivial task to even do the ordering, let alone handle the latency. None of this complexity is there for texture reads.

As for the rarity of the situation, think about particle effects where lots of small primitives are rendered on top of each other.

I would love this ability as much as Andy, but if it costs a lot of die space then I'm not interested. IMO, offloading ROP math to the shader will absolutely pale in comparison to the cost of the memory controller complexity needed to implement this functionality. Even doing this in Larrabee's tiled software renderer will be tough and/or slow, despite being orders of magnitude easier than with immediate mode rendering.

However, if you ignore ordering, it's pretty easy. I guess it's better than ping-pong, but not overwhelmingly so.
 
Being able to access just the current pixel is super-useful though as it allows one to build data structures over all of the primitives that hit a given pixel. It also generalizes the depth and stencil buffers, allowing for things like single-pass depth peeling and so forth.

Yup. From a software perspective this feature is very desirable. It solves a wide range of rendering problems. The limited fixed-function blending that's available is one of the few remaining very significant limitations in the GPU programming model. I suspect Intel will make a big deal about being able to do this on Larrabee. Larrabee is unlikely to be competitive on performance, so they will almost certainly focus on things that Larrabee can do that traditional GPUs can't. I think it would be wise for ATI and Nvidia to finally tackle this problem. Ultimately it'll have to be solved one way or another anyway. It's the biggest roadblock to GPU programmability as it stands now. With this limitation removed, Larrabee's advantage in programmability would be mostly academic and quite frankly largely uninteresting.
 
I would love this ability as much as Andy, but if it costs a lot of die space then I'm not interested.

I'd be willing to sacrifice a quite significant chunk of the die area for this. I'd rather have 640 shader cores with the ability to read the active render target than 800 shader cores without that ability.
If it ends up being near impossible to solve, then at least give us a short blend shader. Say 16 ALU instructions (with swizzles, dammit!), even with no textures or constant buffers; that would allow for a lot of magic that's just impossible right now.
 
I don't see any good reason why one would want to complicate the texture cache for read-back of recently written data. I think a separate surface cache makes more sense, in that read-only texture performance is not compromised.

I'm also not convinced that just having access to current pixel values in the framebuffer would be greatly useful. What do you do in the case of multisampling?

I can think of a bunch of cases, however, where having a smaller extra surface cache (i.e. accessed through special surface fetch/store operations) would make lots of sense, such as when you want to emulate something more complex than the traditional Z buffer in software. There are times when you want to use binning to solve a problem in better than log time. This means that individual threads need atomic load/store from divergent addresses, which seems too bandwidth-unfriendly right now (say with CUDA). Having a general-purpose surface cache would enable this to be bandwidth friendly. Also, I'm fine with high latency on the surface cache; we already have high latency on texture reads.

Seems like DX11 is going the route of forcing ATI and Intel to get up to speed/functionality with the CUDA model. I don't see anything in DX11 that hints at something like a fast surface cache (enabling fast divergent atomic read/write from a resource). Could be wrong here. Perhaps this functionality will just get folded into compute shaders, with hardware becoming more friendly to divergent global memory access. There was a big jump from the NVidia 8000/9000 series to the 2xx series in terms of divergent access performance, so perhaps something big is planned for the next major arch revision?
 
Even doing this in Larrabee's tiled software renderer will be tough and/or slow, [...]
Why would you think that curiously?

Larrabee is unlikely to be competitive on performance, so they will almost certainly focus on things that Larrabee can do that traditional GPUs can't.
In the end it's all a performance question though :) If having render target read allows you to use an algorithm that's 10x more efficient than what you'd have to do on the competition, that's a big win :) It's really just where you want to pay the sorting and bandwidth costs, and there are suitable algorithms for each cost model.

I'm also not convinced that just having access to current pixel values in the framebuffer would be greatly useful.
It's super-useful! Fully programmable blending, Z and stencil are just the tip of the iceberg... what about building up a K-buffer in a single pass? How about building a deep shadow map in a single pass? It's useful in precisely any situation where you need to build up a data structure based on all of the fragments that hit a given pixel.

And on a tiled renderer like Larrabee it goes even beyond that: all of the render target data is sitting close to the processors, so you can happily work away on this data in a R/W fashion and only write out the final results. Deferred shading with MSAA and tone mapping, with only the final, resolved 32-bit RGBA buffer ever leaving the local cache? Yes please :)

What do you do in the case of multisampling?
Same thing you do in DX10.1/11: treat a render target read input as being at sample frequency. There would even be flexibility to decide that dynamically, which is a somewhat neat feature of the current HLSL semantics.

All you're doing here is treating your render targets as little per-pixel scratch pads (or in some cases, you can write out the whole data structure), but it's a very powerful model and it fits nicely with the current pipeline. Honestly I'd argue that it's *more* useful to have a little chunk of R/W data per pixel that's persistent across the pixel shader invocations on it than to have any access to neighboring pixels. The former also doesn't impose nearly as much implementation burden, particularly if medium-to-long latencies are tolerated (as with texturing).
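To illustrate the sample-frequency point: declaring SV_SampleIndex as an input is what forces per-sample execution in DX10.1/11 today, and a per-sample render target read could slot into the same mechanism. Only the dst input in this sketch is hypothetical; the MSAA texture load is existing HLSL:

```hlsl
// Sketch: per-sample shading via SV_SampleIndex (real in DX10.1/11) combined
// with a hypothetical per-sample render target read ("dst"). The max() blend
// is just an arbitrary example of a programmable per-sample blend op.
Texture2DMS<float4> gSceneMS : register(t0);

float4 PSPerSample(float4 pos  : SV_Position,
                   uint   sIdx : SV_SampleIndex,         // forces per-sample execution
                   float4 dst  : SV_Target) : SV_Target  // hypothetical per-sample read
{
    float4 src = gSceneMS.Load(int2(pos.xy), sIdx);      // fetch the matching sample
    return max(src, dst);
}
```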
 
Andrew, actually I agree that R/W access in fragment shaders is going to be huge.

My other comment about just having access to the current pixel values was in response to your comment, "I believe what sebbbi is asking for is just simply the ability to have the current framebuffer pixel as an input to the pixel shader. That sort of functionality wouldn't have to go through texture units."

What I was saying is that I think just having the full MRT pixel data as inputs to the fragment shader isn't general enough, and might also be bandwidth wasting in many cases. For example, say you wanted to have some kind of deep framebuffer (i.e. multiple Z values and colors per screen pixel) which you were simulating with multiple render targets. Getting all of those render targets as inputs to the fragment shader might be overkill, when really you just need to read and then update a single index value to know where to store a single fragment's data (instead of doing R/W access on all the MRT data), the index value being a count of the number of fragments previously written into the deep framebuffer for that pixel.

Continuing this deep framebuffer example: in the CUDA or DX11 compute shader case, fetching and storing the index value would be coalesced global memory accesses (fast), whereas storing the pixel data would likely be divergent (high latency is OK, but wasting bandwidth is not). Actually, divergence would probably be limited to polygon edges, so with larger triangles this might not be so bad...
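A rough sketch of that index scheme in D3D11-style HLSL (the resource names, screen width and MAX_LAYERS limit are all just placeholder assumptions): each fragment atomically bumps a per-pixel counter and scatters its data into a fixed-capacity deep framebuffer:

```hlsl
// Sketch only: per-pixel fragment counter plus scattered store into a
// fixed-capacity "deep framebuffer". Ordering between fragments is not
// guaranteed; overflow fragments are simply dropped.
RWTexture2D<uint>          gFragCount : register(u1);  // per-pixel counter, cleared to 0
RWStructuredBuffer<float4> gDeepBuf   : register(u2);  // SCREEN_WIDTH * height * MAX_LAYERS

static const uint MAX_LAYERS   = 8;
static const uint SCREEN_WIDTH = 1280;                 // placeholder

void PSDeepStore(float4 pos : SV_Position, float4 color : COLOR0)
{
    uint2 p = uint2(pos.xy);

    uint layer;
    InterlockedAdd(gFragCount[p], 1, layer);            // read-then-update the index atomically

    if (layer < MAX_LAYERS)
    {
        uint base = (p.y * SCREEN_WIDTH + p.x) * MAX_LAYERS;
        gDeepBuf[base + layer] = color;                 // likely-divergent scatter
    }
}
```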

Now with something like a surface cache I'd expect the divergent access to still be high latency (which is fully hidden by GPU thread model), but not bandwidth wasting assuming good data locality. So win all around.

With Larrabee, I'm expecting low latency and no main memory bandwidth (cache access to the tile in L2), but given the divergence (i.e. vector scatter) of this deep framebuffer pixel write, will enough outstanding vector scatters be possible to avoid stalling and losing compute capacity? If so, all sorts of crazy stuff might be possible (like massive point rendering, skipping rasterization altogether).
 
What I was saying is that I think just having the full MRT pixel data as inputs to the fragment shader isn't general enough, and might also be bandwidth wasting in many cases.
Ah yes, certainly I'd like to see dynamically indexed inputs/outputs, similar to indexed temporaries. That would address the ability to selectively read/write portions of the data structure, as it were. Still, I'd be happy even without that ability to start with ;)

So, out of curiosity, what sorts of divergences are you expecting? Certainly something like conditionally inserting into a sorted list or similar would be expected to diverge, depending on the state of the data structure, but as you note it would also depend on the coherence of the triangles. In the K-buffer case, for instance, you'd get some divergence wherever there was a polygon edge anywhere in the deep framebuffer, but you get the same thing with depth peeling and so forth, so I'm not sure it's necessarily any worse than the alternatives. Plus it at least gives the programmer the power to deal with the scheduling/divergence as appropriate with domain-specific knowledge... this is particularly relevant in the deferred shading case, as a key benefit there is allowing more data-driven scheduling of the visible fragments.

Anyways there are quite a number of exciting possibilities really... it's one of the more interesting potential graphics API features IMHO actually, even more so than most of the stuff added in DX11.
 
Why would you think that curiously?
Hmm. I guess I was thinking along the lines of what sebbbi was suggesting, where you put aside an in-flight fragment, just like with texture accesses. I forgot that with Larrabee the polygons are already scan converted and stored that way in the bins, so it can skip over qquads and come back to them later, before even starting to shade them, if any underlying pixels are busy.
 