Currently the SV_Target (render target pixel color) and SV_Depth (render target pixel depth) system-value semantics are write-only in DX10 / DX10.1 (SM 4.1) pixel shaders. This limitation has been in place since the first shader model (SM 1.0).
Ever since the DX8 launch I have waited for this limitation to be lifted, as it would allow many new algorithms to be implemented on the graphics hardware (without rendering every triangle separately or ping-ponging between two buffers), and as a side effect it would also make blending fully programmable (read SV_Target, do any custom blending, write the result to SV_Target). Reading from SV_Depth would potentially speed up many post-process effects (depth of field, volumetric particles, etc.) and would let the programmer implement custom depth comparison modes (making many new algorithms possible).
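To make the blending point concrete, here is a hypothetical HLSL sketch. This is NOT valid SM 4.x code (SV_Target cannot be declared as a pixel shader input today); the dst parameter and the blend formula are made up purely for illustration:

    // Hypothetical SM 4.x pixel shader: SV_Target used as an INPUT.
    float4 PSCustomBlend(float4 pos : SV_Position,
                         float4 src : COLOR0,        // color computed for this fragment
                         float4 dst : SV_Target)     // current render target color (hypothetical input)
                         : SV_Target
    {
        // A blend that no fixed function blend mode can express:
        // the blend factor depends non-linearly on the destination color.
        float luma = dot(dst.rgb, float3(0.299f, 0.587f, 0.114f));
        return lerp(dst, src, saturate(src.a * (1.0f - luma)));
    }

Today a custom blend like this requires rendering into a second buffer and ping-ponging between the two render targets.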
I am not sure if I understand the GPU's inner workings 100% properly, but I have always assumed that the GPU only sets up / renders a single triangle at a time (all pixel shaders are processing the same triangle). There is no possibility that the same pixel is being processed simultaneously by two separate threads of the same triangle (as triangles are planar surfaces), so there should be no thread synchronization issues caused by sampling the render target while rendering to it. However, I doubt my one-triangle-at-a-time assumption holds on any newer hardware, as it would cause a lot of idle cycles when rendering small triangles (unless the remaining unified shader units are processing the next vertices during rasterization). Could anyone with more hardware knowledge enlighten me on this issue?
With things simplified a bit:
If you read SV_Target or SV_Depth in your shader, the GPU knows this in advance and can slightly adjust the pixel rendering procedure (writing to SV_Depth is handled similarly on current hardware). Basically, the only new thing the GPU would need to do is fetch the existing render target color and depth value before running the pixel shader, instead of after the pixel shader as current hardware does (during the depth test and blending). No extra render target depth / color fetches or bandwidth would be needed, as only the order of operations differs. This would not even affect hierarchical Z, as it culls pixels before the pixel shader (skipping the render target & Z data fetch in both cases). The GPU knows the affected screen pixels right after triangle setup, and could prefetch the affected pixel color & depth values from the render target without much extra logic (by comparison, a normal texture fetch has to wait for the texture coordinate to be calculated first). This should not cost more than a simple unfiltered texture fetch (and could be optimized to be faster with a little fixed function logic).
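For comparison, this is roughly what a shader has to do today to read the depth already stored at its own pixel (all names here are hypothetical): the depth buffer is first copied / resolved into a regular texture, and then read back with an explicit texture fetch:

    // Valid SM 4.x code, but it requires an extra copy of the depth buffer
    // bound as a shader resource (gSceneDepth). The proposed SV_Depth read
    // would remove both the copy and the explicit fetch.
    Texture2D<float> gSceneDepth;   // hypothetical: depth buffer copied into a texture

    float4 PSSoftParticle(float4 pos : SV_Position,
                          float2 uv  : TEXCOORD0) : SV_Target
    {
        // Load the stored depth at this screen pixel (no filtering needed).
        float sceneDepth = gSceneDepth.Load(int3(pos.xy, 0));

        // Custom "depth comparison": fade the particle out near scene geometry.
        float fade = saturate((sceneDepth - pos.z) * 500.0f);
        return float4(1.0f, 1.0f, 1.0f, 0.25f * fade);
    }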
I am pretty sure there is at least one major hardware issue I have overlooked, since this feature has been rumoured to be included in the next DirectX version for a long time, but the major hardware manufacturers have not yet implemented it on any of their chips. Custom writes to SV_Depth were made possible in DX9, and they do cost some performance (hierarchical Z needs to be disabled at least). Reading from SV_Depth and SV_Target would theoretically be almost free (as both color and depth data still need to be fetched for blending and depth testing after pixel shading). Even if reading from SV_Target / SV_Depth cost as much as a single (unfiltered) texture fetch (each), it would still be a really widely used feature in future games. I doubt this feature would require as many transistors as, for example, geometry shaders (a much less important feature currently, imho).
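As a reference point, custom depth output already works today (DEPTH in SM 3.0, SV_Depth in SM 4.x); a minimal sketch, with an arbitrary bias value chosen just for illustration:

    // Valid SM 4.x code: SV_Depth as an OUTPUT. Works today, but the
    // hardware has to give up hierarchical / early Z for the draw call.
    struct PSOutput
    {
        float4 color : SV_Target;
        float  depth : SV_Depth;
    };

    PSOutput PSDepthBias(float4 pos : SV_Position, float4 albedo : COLOR0)
    {
        PSOutput o;
        o.color = albedo;
        o.depth = pos.z - 0.0005f;   // push the surface slightly towards the camera (arbitrary bias)
        return o;
    }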
Just my thoughts as a long-time graphics engine programmer with just a little bit of hardware implementation knowledge. All information about this issue from real graphics hardware engineers would be more than welcome (I have been pondering this issue for a long time)!