Faking dynamic branching - technical discussion

Since this topic seems to be rather interesting, I've decided to share my thoughts on it too.

First of all - thanks to Humus. He managed to use an old technique (stencil culling) in a rather new way. It's not as easy to do as some may think; this is a real invention.

Second: I wouldn't call it dynamic branching (it is not :D ), but rather a stencil emulation of it.

Third: the lack of any performance boost on NV cards points to missing early stencil discard. It's a disadvantage of Nvidia's hardware, although not a very important one. In this case Nvidia fails, but then I've never seen methods like this used commonly.

Fourth, and most important: where does the 2-4x performance boost on Radeons come from? Humus's demo performs dynamic stencil culling of lit fragments based on the attenuation limit. But he uses several lights, each with a rather small radius, so each pixel is usually lit by no more than 1-2 lights. That is where the performance boost comes from. I predict that using SM3.0 for this would give a much bigger boost, because it lets you avoid state changes and avoid sending the geometry more than once. Could someone port this demo to SM3.0 and see if I'm right (I own only a crappy FX5600 :cry: )? If my thoughts are right, the stencil emulation will gain nothing or very little in real games like Far Cry, because of the different environment: in Far Cry, each fragment is usually lit by all the lights. NV4x gets some boost mainly from the one-pass solution (avoiding state changes). That's the reason why I think this optimisation is rather useless for real applications. It could, however, find use in scenes with lots of small lights - though sadly it would still be an ATI-only optimisation.
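To illustrate the SM3.0 point, here is a rough sketch of what I mean by a single-pass version (not a port of the demo - the light count, names and attenuation model are all my own assumptions):

Code:
// Hypothetical ps_3_0 shader: all lights in one pass, no state changes,
// geometry sent once. The dynamic branch skips the expensive lighting
// for fragments outside a light's range.
#define NUM_LIGHTS 4                  // made-up count

float3 lightPos[NUM_LIGHTS];          // world-space light positions
float  lightRange[NUM_LIGHTS];        // attenuation radius per light

float4 main(float3 worldPos : TEXCOORD0,
            float3 normal   : TEXCOORD1) : COLOR
{
    float3 total = 0;
    for (int i = 0; i < NUM_LIGHTS; i++)
    {
        float3 lv = lightPos[i] - worldPos;
        if (dot(lv, lv) < lightRange[i] * lightRange[i])   // in range?
            total += saturate(dot(normal, normalize(lv))); // expensive shading would go here
    }
    return float4(total, 1);
}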
 
Zengar said:
points to missing early stencil discard. It's a disadvantage of Nvidia's hardware, although not a very important one. In this case Nvidia fails, but then I've never seen methods like this used commonly.
Umh.. NV2A supports early stencil culling. I verified it, and almost two years ago on the XBOX I used a trick very similar to what Humus proposed, to perform a post-processing effect that ran a full-length 1.0 pixel shader only on certain pixels of the image (it improved performance quite a lot). Others have confirmed that early stencil culling works on other nVidia GPUs as well.

ciao,
Marco
 
Zengar said:
Third: the lack of any performance boost on NV cards points to missing early stencil discard. It's a disadvantage of Nvidia's hardware, although not a very important one. In this case Nvidia fails, but then I've never seen methods like this used commonly.

!


NV stencil culls as long as you don't write to the stencil buffer. Modding the code accordingly gives a big speed-up.
 
I've updated the demo to include an option to choose between the two methods. Clearing the full stencil buffer is unnecessary work on ATI cards, so on ATI hardware it will still zero on pass instead of doing a full clear; it's a bit faster that way. Everyone else gets the full clear by default, but you can switch between the two and compare.
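In rough D3D9 terms, the difference between the two modes boils down to this (a simplified sketch, not the code from the demo):

Code:
// Simplified sketch of the two stencil modes. Assume the lighting pass
// runs where stencil == 0.
dev->SetRenderState(D3DRS_STENCILENABLE, TRUE);
dev->SetRenderState(D3DRS_STENCILFUNC,   D3DCMP_EQUAL);
dev->SetRenderState(D3DRS_STENCILREF,    0);

// "Zero on pass" (ATI): the lighting pass resets the stencil as it goes,
// so no separate clear is needed before the next light.
dev->SetRenderState(D3DRS_STENCILPASS,   D3DSTENCILOP_ZERO);

// "Full clear" (everyone else): keep the stencil untouched so early
// stencil culling stays active, and clear the whole buffer between lights.
dev->SetRenderState(D3DRS_STENCILPASS,   D3DSTENCILOP_KEEP);
dev->Clear(0, NULL, D3DCLEAR_STENCIL, 0, 1.0f, 0);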

http://esprit.campus.luth.se/~humus/
 
Hyp-X said:
Mintmaster said:
- For attenuated lights, however, this could work very well. Use the same stencil buffer, and just increment the stencil of all pixels outside the light's range. Might be good for Doom3.

Well, it might turn out that it isn't much of a win.

Shadow volumes are closed - they are basically capped at their far end.
This is actually required for Carmack's reverse method to work correctly.

So it's practical to cap the shadow volume at the light's range, so that most of the out-of-range pixels are already discarded.
I might be missing something, but after you do the stencil pass, don't you have to draw the entire scene for a lighting pass? Even if your shadow volumes are shortened, it's still not going to cull pixels beyond the light range in the lighting pass, right?

If during the stencil pass (or in a subsequent pass) you incremented the stencil of all pixels outside the light's range, then the expensive lighting pass could have a vastly reduced pixel count. Pretty much exactly what Humus' demo is doing.
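To be concrete, the extra pass could be as cheap as this (a sketch with made-up names - the stencil op does the real work):

Code:
// Hypothetical "range cull" pass: draw the scene (or the light's bounding
// volume) with the stencil op set to INCR and colour writes disabled.
// In-range pixels are killed by the clip, so only out-of-range pixels
// survive and get their stencil raised; the expensive lighting pass then
// stencil-tests them away.
float3 lightPos;    // light position in world space (hypothetical)
float  lightRange;  // attenuation radius (hypothetical)

float4 main(float3 worldPos : TEXCOORD0) : COLOR
{
    float3 lv = lightPos - worldPos;
    clip(dot(lv, lv) - lightRange * lightRange);  // kill if inside the range
    return 0;
}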
 
pocketmoon66 said:
NV stencil culls as long as you don't write to the stencil buffer. Modding the code accordingly gives a big speed-up.
Could you please post all of your results? I take it most of the quads on the screen will either be fully lit or fully out of range, so it should be an ideal situation.
 
digitalwanderer said:
I was under the impression that Nalu was just one big horking shader and was only possible on the nV4x. I guess I'm just looking for clarification, and I'll try like hell to stay out of this thread and just keep me trap shut from now on, I promise!
From what I remember of 991060's post about the Nalu shader, it just used a texture to determine whether a pixel was scale, skin, or a blend of the two. This stencil method could separate the three branches into multiple passes. Each branch may require multiple passes, though, because NV4x can do 2 more textures per pass (I think).

All in all, it'll require a lot of passes, but I think it can be done - unless they decided to use vertex texturing for some reason, which I doubt, since it wouldn't do very much and there seems to be a high vertex count. I doubt multipassing will affect performance much, because each pass will probably contain a lengthy shader.

This is all speculation, though, because I haven't seen any code yet...
 
Mintmaster said:
pocketmoon66 said:
NV stencil culls as long as you don't write to the stencil buffer. Modding the code accordingly gives a big speed-up.
Could you please post all of your results? I take it most of the quads on the screen will either be fully lit or fully out of range, so it should be an ideal situation.
Here's what he wrote:
pocketmoon66 said:
Found what was hurting NV cards:

changing
dev->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_ZERO); // writes stencil on every passing pixel
to
dev->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_KEEP); // leaves the stencil buffer untouched

(Early stencil kill doesn't work if you're still writing to stencil??)


Oops! Missed a bit out - I wasn't clearing the stencil buffer between passes.
Demo now runs:


1280x960
FALSE: 51
TRUE: 185 ish <- better
DB PS2 (cmp): 54
DB PS3 (if/then/else): 65 ish

I'll work on combining the individual light passes into a single SM2/3 pass and see how that works out...
http://www.beyond3d.com/forum/viewtopic.php?t=13716&postdays=0&postorder=asc&start=300
 
Humus said:
Mintmaster said:
- I was thinking how the final stencil buffer (with all shadows flagged by a 1) could be modified to also flag pixels where N dot L < 0, using bump maps. It might not be worth the extra pass, though, because the original volumes would already knock off most of the dark regions.

It's more doubtful whether there will be any gain here; the cost may be higher than the benefit. What is beneficial, though, and wasn't included in this demo, is to also cull pixels that face away from the light. This can be packed into the same dot-product operation: it just changes the dot3 to a dot4, so it's free. I tried this at work today, and it gave another ~10% performance increase.
Hehe, I incorrectly assumed your demo was doing this already. Several of the NVidia papers mentioning dynamic branching suggest using N dot L as a branch condition to knock off dark pixels, which is why I made the suggestion in the first place.

For stencil-shadowed games, e.g. Doom3, anything backfacing the light is already inside a shadow volume, right? So it wouldn't help there, I guess, unless the bump maps created a lot of dark regions.
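For reference, the backface cull would go in the stencil-pass shader, something like this (my reconstruction - I don't know exactly how the dot3-to-dot4 packing works, so this writes the two tests out separately):

Code:
// Hypothetical stencil-pass fragment (all names are mine). A pixel is
// killed if it is beyond the light's range OR faces away from the light;
// survivors get their stencil marked as before. clip() discards when any
// component is negative, so one call covers both tests.
float3 lightPos;    // light position in world space (hypothetical)
float  lightRange;  // attenuation radius (hypothetical)

float4 main(float3 worldPos : TEXCOORD0,
            float3 normal   : TEXCOORD1) : COLOR
{
    float3 lv = lightPos - worldPos;
    clip(float2(lightRange * lightRange - dot(lv, lv),  // still in range?
                dot(normal, lv)));                      // facing the light?
    return 0;
}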
 
Mintmaster said:
I might be missing something, but after you do the stencil pass, don't you have to draw the entire scene for a lighting pass? Even if your shadow volumes are shortened, it's still not going to cull pixels beyond the light range in the lighting pass, right?

If during the stencil pass (or in a subsequent pass) you incremented the stencil of all pixels outside the light's range, then the expensive lighting pass could have a vastly reduced pixel count. Pretty much exactly what Humus' demo is doing.

Hehe.

Forget what I wrote, it's complete BS. :oops:
You are the first one to realize that!

Actually, capping the shadow volumes would make things worse instead of better (although it might reduce the fill-rate requirement of drawing the volume, but that's a completely different thing).
 
Mintmaster said:
From what I remember of 991060's post about the Nalu shader, it just used a texture to determine whether a pixel was scale, skin, or a blend of the two. This stencil method could separate the three branches into multiple passes. Each branch may require multiple passes, though, because NV4x can do 2 more textures per pass (I think).
I somewhat doubt it. From the interviews I've read, the branching used was coherent branching based on data calculated in the vertex shader.
 
Hyp-X said:
Hehe.

Forget what I wrote, it's complete BS. :oops:
You are the first one to realize that!

Actually, capping the shadow volumes would make things worse instead of better (although it might reduce the fill-rate requirement of drawing the volume, but that's a completely different thing).
I had a completely different image in mind when you wrote that statement: that the "capping" would entail drawing an outer bound to the shadow volume, effectively defining everything outside some specific range to be in shadow. That would increase the fill-rate requirement of drawing the volume, but would automatically cull pixels outside that range.

For performance reasons, though, you'd want to think carefully about what bounding shape to use for the "cap."
 
Ostsol said:
Drak said:
The penalty of dynamic branching on the 6800, beyond the execution cost of the if statement itself, occurs under incoherent branching within the four pixel pipelines of a quad. The quads are still massive SIMD units rather than four independent pixel pipelines. If the condition of the if statement doesn't evaluate to the same result in most 2x2 pixel neighbourhoods of your image (e.g. with random noise), don't expect to have quads anymore.
So, in such a case where dynamic branching is used, will the execution time of the shader program on that quad be limited by the "slowest pixel"? Or is there another penalty associated with breaking coherency among pixels?
They probably fall back to predication and execute every instruction when the pixels in the SIMD don't all take the same path. That's the simplest option I can think of at the moment.
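To illustrate what that fallback means (a hypothetical fragment, all names mine - the real paths would be much longer):

Code:
// Under predication the hardware evaluates both sides of the branch for
// every pixel and then does a per-pixel select, so each pixel pays for
// the long path plus the short one.
float4 main(float3 normal : TEXCOORD0, float3 lightDir : TEXCOORD1) : COLOR
{
    float NdotL = dot(normalize(normal), normalize(lightDir));
    float3 lit  = saturate(NdotL) * float3(1.0, 0.9, 0.8); // stand-in for the expensive branch
    float3 dark = float3(0.1, 0.1, 0.1);                   // stand-in for the cheap branch
    return float4(NdotL > 0 ? lit : dark, 1);              // select, not a jump
}

So at best the quad runs at the speed of its slowest pixel, and when coherency breaks it pays for both branches.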
 
Yeah, you'd either have to do that or break up the quad, sending the pixels through separately.
 
Chalnoth said:
Mintmaster said:
From what I remember of 991060's post about the Nalu shader, it just used a texture to determine whether a pixel was scale, skin, or a blend of the two. This stencil method could separate the three branches into multiple passes. Each branch may require multiple passes, though, because NV4x can do 2 more textures per pass (I think).
I somewhat doubt it. From the interviews I've read, the branching used was coherent branching based on data calculated in the vertex shader.
The following depends on the validity of 991060's information, but here's the shader from his post:
Code:
float4 main(...SNIP(cut out parameter declarations)...
                  ) : COLOR 
{ 
//  f2fConnector f2f; 
  half3 Skin, Scales; 
  Skin = Scales = 0; 

  half4 diffmap = h4tex2D(c_diffuseCol, v2f.c_texCoord);    

  if (diffmap.w < 1.) 
    Skin =  SkinShader(v2f, 
                  g_softshadows, 
                  diffmap.rgb, 
                  c_specular, 
                  c_bumpCol, 
                  subsurface_blurred, 
                  shadowDepthMap, 
                  g_caustics, 
                  g_caustics_z, 
                  g_fillcolor, 
                  g_eyePointLight0Pos, 
                  g_LColor); 

  if (diffmap.w > 0.) 
   Scales = ScalesShader(v2f, 
                  c_scaleCol, 
                  c_scaleBump, 
                  c_scaleMasks,                  
                  c_irid2DCol,    // common to all scales 
                  c_iridCube,    // common to all scales                  
                  subsurface_blurred, 
                  g_skintoneblend, 
                  g_caustics, 
                  g_caustics_z, 
                  g_fillcolor 
                  ); 
  
  float4 COL; 
  COL.rgb  = lerp(Skin.xyz, Scales.xyz, diffmap.w); 
  COL.w    = 0; 
    
  return COL; 
}

I cut out SkinShader, ScalesShader, and some other stuff. You'll see that it's doing exactly what I said. Anyway, even if the branching comparison data came from the vertex shader, it would be just as doable with this technique.
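Just to sketch how the stencil method would split it (the pass structure and thresholds are my guesses, reusing names from the shader above):

Code:
// Hypothetical marking pass for the SkinShader branch: kill the pure-
// scale pixels (diffmap.w == 1) so the stencil op only tags pixels where
// skin contributes; the full SkinShader then runs stencil-tested.
half4 main(float2 texCoord : TEXCOORD0,
           uniform sampler2D c_diffuseCol) : COLOR
{
    half4 diffmap = h4tex2D(c_diffuseCol, texCoord);
    clip(0.999 - diffmap.w);  // survives only where diffmap.w < 1
    return 0;
}
// A second marking pass would use clip(diffmap.w - 0.001) for the
// ScalesShader branch; pixels with 0 < w < 1 take both lighting passes,
// blended by diffmap.w as in the lerp above.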

So, Colourless, are you willing to make your OpenGL wrapper framework available to those of us willing to tackle this demo? Pretty please? :D
 
3dcgi said:
Ostsol said:
Drak said:
The penalty of dynamic branching on the 6800, beyond the execution cost of the if statement itself, occurs under incoherent branching within the four pixel pipelines of a quad. The quads are still massive SIMD units rather than four independent pixel pipelines. If the condition of the if statement doesn't evaluate to the same result in most 2x2 pixel neighbourhoods of your image (e.g. with random noise), don't expect to have quads anymore.
So, in such a case where dynamic branching is used, will the execution time of the shader program on that quad be limited by the "slowest pixel"? Or is there another penalty associated with breaking coherency among pixels?
They probably fall back to predication and execute every instruction when the pixels in the SIMD don't all take the same path. That's the simplest option I can think of at the moment.

It seems to work like that, although the branching cost is still there.

I did some tests on dynamic branching back in June. I got interesting but strange results; I'm still waiting for an explanation from NVIDIA.

It has been said that when the pixels of a given quad don't all take the same branch, both branches are computed for those pixels. Does that mean that only the required branch is computed in the other case? My results say NO. They show another limitation. However, I'm still wondering whether it's a hardware limitation or one coming from the current drivers.
 
Tridam,

I have got some information that the program control on all CineFX chips (NV3X/NV4X) is based on batches of quads. This means there is only one instruction pointer per batch, so all the quads in a batch have to run the same instructions.

I am not sure about the size of a batch, but it looks like it is not fixed at all. On the other hand, there are indications that the number of batches that can run at the same time is limited. For branching, short batches look better, but each batch has an overhead of one clock per pipeline pass. As an example: if a batch contains 99 quads, you need 100 clocks to execute one instruction (VLIW) for the whole batch. Shorter batches burn proportionally more clocks on this overhead - a batch of 4 quads would need 5 clocks per instruction, a 25% overhead.
 
Demirug said:
Tridam,

I have got some information that the program control on all CineFX chips (NV3X/NV4X) is based on batches of quads. This means there is only one instruction pointer per batch, so all the quads in a batch have to run the same instructions.

Yes, that's what my results showed.

The size of the batch is huge, at least with current drivers: below 1024 quads, both branches are always computed. The batch size seems to be at least 1024 quads, but it also seems to expand to the number of quads that take the same branch. That means that if even a single pixel among those 1024 quads takes a different branch, both branches are computed for all of those pixels.

I've tried different triangle sizes, but the size of the batch doesn't seem to change. I hope we'll see some improvements with future drivers.
 
Tridam said:
It has been said that when the pixels of a given quad don't all take the same branch, both branches are computed for those pixels. Does that mean that only the required branch is computed in the other case? My results say NO. They show another limitation. However, I'm still wondering whether it's a hardware limitation or one coming from the current drivers.
The limitation you're wondering about - are you referring to computing both branches, or to the other, unspecified one?
 