Depth / Stencil / Fragment Depth operation order question.

Graham

Hello :-)
Here is a question for those more knowledgeable on the inner workings of the graphics pipe...

When using Pixel Shader Depth Output, the depth test must then occur after the fragment shader has run; no Early-Z, Hi-Z, etc.

I was wondering when stencil testing occurs on modern video cards.
Given Stencil is so closely linked to depth, and stencil operations can depend on the depth value, I wonder if the actual stencil test still occurs before the fragment is shaded (+ Hi-Stencil)?

I'm currently toying with an idea that will require fragment shader depth output, and the performance hit from losing early Z is far too high (it's very high fillrate and overdraw, without blending).

I was thinking that instead of the current depth fill pass for each object, I'd use a stencil fill pass with the vertices offset by the maximum amount I expect the pixel depth to be offset (so a z-fail there would not update the stencil), if that makes sense... It wouldn't be as good, but it'd help a lot.
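Something like this is the state setup I have in mind (a D3D9 sketch, untested; `dev` is a hypothetical IDirect3DDevice9*):

[code]
// Needs <d3d9.h>; dev is an IDirect3DDevice9* (hypothetical).

// Pass 1: prime the stencil with the offset geometry.
// Depth test on, so occluded pixels z-fail and leave the stencil at 0;
// no colour or depth writes.
dev->SetRenderState(D3DRS_ZENABLE, TRUE);
dev->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
dev->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
dev->SetRenderState(D3DRS_STENCILENABLE, TRUE);
dev->SetRenderState(D3DRS_STENCILREF, 1);
dev->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_ALWAYS);
dev->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_REPLACE); // visible -> mark 1
dev->SetRenderState(D3DRS_STENCILZFAIL, D3DSTENCILOP_KEEP);   // occluded -> stays 0
dev->SetRenderState(D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP);
// ... draw the geometry offset by the maximum expected depth offset ...

// Pass 2: the expensive depth-output shader, only where stencil == 1.
dev->SetRenderState(D3DRS_COLORWRITEENABLE, 0x0F);
dev->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
dev->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_EQUAL);
dev->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_KEEP);
// ... draw with the pixel shader that outputs the DEPTH semantic ...
[/code]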

I want to know, because if the stencil test occurs after the fragment shader, then this will all be a waste of time. :(

(Note I will be using KEEP stencil ops for all geometry drawn with the depth shader)
(Also note this is targeted at DX9, so no DX11 min/max depth for me..)

Anywho. Thankeo :mrgreen:
 
As for stencil operations, on some of the DX10-generation hardware (mostly NVidia cards I have experience with), I've had great luck with drawing the transparent pass sorted front to back (frontmost drawn first, so blending with destination alpha accumulates coverage) and using stencil to limit overdraw. The stencil is incremented after each pixel drawn, and is also used to clip pixels once the stencil goes over a set overdraw limit per pixel. It seems to me (I'd need to check nvperflib to be sure) that early stencil cull (i.e. coarse raster cull) works on newer hardware even when writing to stencil, at least in my case of not changing the stencil rules. On some older hardware, writing to stencil can disable fast early stencil cull in the coarse raster step.
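In D3D9 terms, the overdraw clamp is roughly this (a sketch; MAX_OVERDRAW is whatever per-pixel limit you pick):

[code]
// Transparent pass, sorted front to back.
// The stencil test passes while ref > stencil value, i.e. while fewer
// than MAX_OVERDRAW layers have landed on this pixel so far.
const DWORD MAX_OVERDRAW = 4; // hypothetical limit
dev->SetRenderState(D3DRS_STENCILENABLE, TRUE);
dev->SetRenderState(D3DRS_STENCILREF, MAX_OVERDRAW);
dev->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_GREATER);       // ref > stencil
dev->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_INCRSAT); // count this layer
dev->SetRenderState(D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP);
dev->SetRenderState(D3DRS_STENCILZFAIL, D3DSTENCILOP_KEEP);
[/code]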

I'd just profile to be sure on all the hardware you need to support, since there are lots of sometimes undocumented (in PC space) rules as to what things affect depth or stencil cull performance. The ideal case is to ensure that coarse cull stays fast. Even things like fragment discard might affect your cull performance, since the coarse raster step cannot assume that all pixels get written.

BTW, from my understanding, writing depth on DX11 hardware can be made fast if the program can ensure the written depth is farther away than the triangle surface. It will be awesome if this can actually be used!
 
I've had great luck with drawing transparent pass sorted front to back

Doesn't transparency need to be sorted back to front for correct composition? Or are you only using that with purely additive/modulative blend modes where order doesn't matter? Still, order matters when mixing additive/modulative and lerp blend modes, so I don't see how this can be applied in a general purpose way to translucency.
 
Doesn't transparency need to be sorted back to front for correct composition? Or are you only using that with purely additive/modulative blend modes where order doesn't matter? Still, order matters when mixing additive/modulative and lerp blend modes, so I don't see how this can be applied in a general purpose way to translucency.


If you use pre-multiplied alpha, you can combine additive and normal blending in one operation, and at the same time you can composite correctly front to back by using the destination alpha.
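That is, with premultiplied-alpha source colours, the front-to-back "under" blend is a single fixed-function blend state in D3D9 (a sketch; it needs a destination-alpha-capable target such as D3DFMT_A8R8G8B8, with destination alpha cleared to 0):

[code]
// Front-to-back "under" compositing with premultiplied alpha:
//   colour_out = colour_dst + (1 - alpha_dst) * colour_src
//   alpha_out  = alpha_dst  + (1 - alpha_dst) * alpha_src
// Destination alpha accumulates coverage, so once a pixel is fully
// covered, further layers contribute nothing.
dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
dev->SetRenderState(D3DRS_SRCBLEND, D3DBLEND_INVDESTALPHA);
dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
[/code]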
 
There's an early stencil test that happens before fragment shading. How effective it is depends on several factors. I wrote this document while I was at ATI; it touches on the subject a little:
http://developer.amd.com/media/gpu_assets/Depth_in-depth.pdf

Excellent paper. As is the case on Xbox 360, if you use any of the FAIL or ZFAIL operations to do work, HiZ/HiStencil gets switched off.
It makes sense: to support correct operation, the hardware would have to apply the stencil op in parallel across many pixels (usually a whole pixel group, 8x8 or more) in a single cycle per rejection.
 
There's an early stencil test that happens before fragment shading. How effective it is depends on several factors. I wrote this document while I was at ATI; it touches on the subject a little:
http://developer.amd.com/media/gpu_assets/Depth_in-depth.pdf

That is an excellent article; I'll probably be redesigning some of our rendering, I think.
However, it doesn't appear to answer my exact question (either that or I'm not thinking today).

From your reply, can I take it that with shader depth output, early stencil is separate from early Z and is still performed before shading?
I was mainly concerned that the two would always be done at the same time, which would seem logical given the way stencil/depth is stored, etc.

I'm not so concerned about the loss of HiZ / early-Z, as the very expensive shader I'm trying to avoid is a much bigger bottleneck :yes:

Thanks
 
Well.

I tried a quick hack test. Loads of passes, very expensive shader with depth output, and a stencil function that always fails.
No performance difference with or without the stencil test, so on this ATI card at least (an X1800), the stencil test is performed after shading.
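(The "always fails" part was simply the stencil compare set to never pass; roughly this:)

[code]
// Every fragment fails the stencil test; if early stencil ran before
// shading, the expensive depth-output shader should never execute.
dev->SetRenderState(D3DRS_STENCILENABLE, TRUE);
dev->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_NEVER);
dev->SetRenderState(D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP);
[/code]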

Shame.
 
I am assuming that you are doing an expensive raycast for one of the many forms of relief mapping? Probably long past the point of false economy, but what about the following variation on "depth restore":

(1.) Render with depth test on, depth write off, stencil test off, but stencil write on for pixels that pass; then render your expensive shader, but write depth into a separate render target instead of the actual depth buffer. This perhaps means extra ALU work to pack Z into RGBA8. Render your geometry with the vertices offset back, as you described before (see the state sketch after step 2).

(2.) Render with depth write on, depth test off, stencil test on, stencil write off, then render a shader which writes actual depth by fetching it from the other render target.
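A rough D3D9 state sketch of the two steps (the render target name zRT is hypothetical; details depend on your packing):

[code]
// Step 1: mark visible pixels in stencil; write ray-cast depth into a
// colour render target (zRT) instead of the depth buffer.
dev->SetRenderState(D3DRS_ZENABLE, TRUE);
dev->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
dev->SetRenderState(D3DRS_STENCILENABLE, TRUE);
dev->SetRenderState(D3DRS_STENCILREF, 1);
dev->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_ALWAYS);
dev->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_REPLACE); // depth-pass pixels marked
dev->SetRenderState(D3DRS_STENCILZFAIL, D3DSTENCILOP_KEEP);
// ... bind zRT, run the expensive shader, pack Z into the colour output,
//     geometry offset back as described ...

// Step 2: restore real depth only where the stencil was marked.
// "Depth test off, write on" in D3D9 means ZENABLE on with ZFUNC ALWAYS,
// since disabling ZENABLE also disables depth writes.
dev->SetRenderState(D3DRS_ZFUNC, D3DCMP_ALWAYS);
dev->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
dev->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_EQUAL);
dev->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_KEEP);
// ... cheap shader samples zRT and outputs that value via the DEPTH semantic ...
[/code]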

Lots of variations are possible, like writing depth only into a single-channel FP32 target with the blend mode set to MAX, if you need to render overlapping stuff that writes depth (for step 1 of the above method). You would have to look up that depth later to do proper shading, or split into non-overlapping passes on the CPU side.
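For the MAX variation, in D3D9 terms (a sketch; blend factors are ignored for D3DBLENDOP_MAX, the result is simply max(src, dest), and FP32 blending support varies by hardware, so check the format caps first):

[code]
// Single-channel FP32 target (D3DFMT_R32F) with MAX blending, so
// overlapping depth writes keep the larger value.
dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
dev->SetRenderState(D3DRS_BLENDOP, D3DBLENDOP_MAX);
[/code]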
 