Faking dynamic branching - technical discussion

Mintmaster

Veteran
I want to start a purely technical thread about the method in Humus' latest demo. No flames or comments on the merits of ATI's/NVIDIA's architectural decisions, no financial bickering, no "I hate company X", etc.

Couple of discussion starting points:

1. Stencil shadowing
- I was thinking how the final stencil buffer (with all shadows flagged by a 1) could be modified to also flag pixels where N dot L < 0, using bump maps. It might not be worth the extra pass, though, because the original volumes would already knock off most of the dark regions.
- For attenuated lights, however, this could work very well. Use the same stencil buffer, and just increment the stencil of all pixels outside the light's range. Might be good for Doom3.
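To make the attenuation idea concrete, here is a minimal sketch in Python that simulates the stencil logic on the CPU (all names here are hypothetical; on real hardware this would be a cheap pre-pass that writes stencil, followed by a stencil-tested lighting pass):

```python
def mark_out_of_range(pixels, light_pos, light_range, stencil):
    """Pre-pass: increment the stencil of every pixel outside the
    light's range, so the expensive lighting pass can skip it."""
    r2 = light_range ** 2
    for i, p in enumerate(pixels):
        dx = p[0] - light_pos[0]
        dy = p[1] - light_pos[1]
        dz = p[2] - light_pos[2]
        if dx * dx + dy * dy + dz * dz > r2:
            stencil[i] += 1  # flagged: out of range
    return stencil

def lighting_pass(pixels, stencil, shade):
    """Lighting pass: only shade pixels whose stencil is still zero
    (in range and not in shadow)."""
    return [shade(p) if s == 0 else None for p, s in zip(pixels, stencil)]
```

The point is that the range test is far cheaper than the full lighting shader, so rejecting those pixels via stencil before shading saves fill rate.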

2. NV40 - which method is better?
- pocketmoon66, could you summarize all your findings?
- so, what is the actual penalty of dynamic branching on NV40?
- why did NV40 have that restriction on early stencil out?
- how much do you think a future "hierarchical stencil buffer" would improve performance? I'm thinking ATI might implement such a feature eventually.
- Anyone think we'll have zero-penalty dynamic branching in the next gen?

3. Region bounding
- I was thinking about neighbourhood transfer for spherical harmonic lighting. That would let one object cast shadows and reflect light onto another. 3D textures could hold the SH coefficients for a small volume around an object.
- You could flag all the pixels around the object causing the radiance transfer, and subtract/add light to neighbouring objects in subsequent passes. The fill-rate cost should be a lot lower this way.

4. Nalu wrapper, anyone?
- This method should handle the scales and skin border just fine
- Anyone know when we'll have access to it?

So, what do you guys think?
 
Mintmaster said:
4. Nalu wrapper, anyone?
- This method should handle the scales and skin border just fine
- Anyone know when we'll have access to it?
Do you mean access to the demo? I can give a copy to any interested coders.
 
Mintmaster said:
- I was thinking how the final stencil buffer (with all shadows flagged by a 1) could be modified to also flag pixels where N dot L < 0, using bump maps. It might not be worth the extra pass, though, because the original volumes would already knock off most of the dark regions.

It's more doubtful whether there would be any gain there; the cost may be higher than the benefit. What is beneficial, though, and wasn't included in this demo, is to also cull pixels that backface the light. This can be packed into the same dot-product operation: it just changes the dot3 to a dot4, so it's essentially free. I tried this at work today, and it gave another ~10% performance increase.
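The combined cull test described here can be sketched as a per-pixel predicate (Python, hypothetical names; the two conditions are shown separately for clarity, whereas the demo reportedly folds them into a single dot product):

```python
def cull_pixel(normal, light_vec, light_range):
    """Return True if the pixel can be skipped before the expensive
    lighting shader: it is either outside the light's range or it
    backfaces the light.  (Sketch of the test described above.)"""
    # squared distance to the light (a dot3 of the light vector with itself)
    d2 = sum(c * c for c in light_vec)
    if d2 > light_range ** 2:
        return True              # out of range
    # unnormalized N.L: only the sign matters for the backface test
    n_dot_l = sum(n * l for n, l in zip(normal, light_vec))
    return n_dot_l < 0.0         # facing away from the light
```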

Mintmaster said:
- how much do you think a future "hierarchical stencil buffer" would improve performance? I'm thinking ATI might implement such a feature eventually.

Since the cost is on the fully lit and shaded pixels, rather than the pixels that get culled, I don't think it will improve performance dramatically. Probably < 10%. Especially if other things like scissor rectangles and clip planes are used together with this technique.
 
I hope this doesn't break the imposed flames/fans rules by commenting just a tad off-tech, but are you serious when you imply that the Nalu demo could be 'converted' to run on R3xx and NV3x cards?

I was under the impression that Nalu was just one big horking shader and only possible on NV4x. I guess I'm just looking for clarification, and I'll try like hell to stay out of this thread and keep my trap shut from now on, I promise!
 
The penalty of dynamic branching on the 6800, beyond the execution cost of the if statement itself, occurs under incoherent branching within the four pixel pipelines of a quad. The quads are still essentially SIMD units rather than four independent pixel pipelines. If the condition of the if statement does not evaluate to the same result in most 2x2 pixel neighbourhoods of your image (e.g. when it depends on random noise), don't expect your quads to stay coherent.
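A toy cost model of that quad-level SIMD behaviour (Python; the cost numbers are made up, not measured): a 2x2 quad whose pixels disagree on the condition pays for both paths, while a coherent quad pays for only one.

```python
def quad_cost(conditions, then_cost, else_cost, quad_size=4):
    """Model the quad SIMD penalty: pixels are grouped into 2x2 quads,
    and a quad whose pixels disagree on the branch condition must
    execute BOTH paths.  Hypothetical cost model, not hardware data."""
    total = 0
    for i in range(0, len(conditions), quad_size):
        quad = conditions[i:i + quad_size]
        if all(quad):
            total += then_cost              # whole quad takes 'then'
        elif not any(quad):
            total += else_cost              # whole quad takes 'else'
        else:
            total += then_cost + else_cost  # divergent quad pays for both
    return total
```

With a noisy condition, most quads hit the divergent case, which matches the warning above about random-noise conditions.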
 
Mintmaster said:
- For attenuated lights, however, this could work very well. Use the same stencil buffer, and just increment the stencil of all pixels outside the lights range. Might be good for Doom3.

Well, it might turn out that this isn't much of a win.

Shadow volumes are closed - they are basically capped at their far end.
This is actually required for Carmack's reverse method to work correctly.

So it's practical to cap the shadow volume at the light's range, so most of the out-of-range pixels are already discarded.
 
Hyp-X said:
So it's practical to cap the shadow volume at the light's range so most of the out-of-range pixels are already discarded.
I've seen many devs use shadow volumes capped at infinity for directional lights (to save CPU cycles in shadow volume extraction and to reduce the polygon count of the shadow volume), so at some view angles with respect to the light the idea Mintmaster proposed would be quite useful in certain cases.

ciao,
Marco
 
Moved the answer here...

Humus said:
Sigma said:
It is a hack. This does not replace if statements at all. One thing it does is mess up and complicate the renderer...

The solution you presented for shadow volumes may lead to problems if the stencil passes need more than the available bits. Then you would have to use only 3 bits for if statements, or just 2, etc...

Mess? If you consider adding about 20 lines of code a mess, then I don't know what isn't messy.
No, I don't think it will ever lead to any problems. Do you really expect an overdraw factor above 16 while at the same time having more than 16-level-deep nesting? I would say that in 99% of cases you're fine with one or two bits for the ifs and 3 bits for shadows.
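The bit split suggested here can be illustrated with stencil-style masks (Python sketch; the 3-bits-for-shadows layout follows the numbers above, everything else is hypothetical):

```python
SHADOW_BITS = 3
SHADOW_MASK = (1 << SHADOW_BITS) - 1   # 0b00000111: shadow counter bits
IF_MASK     = 0xFF & ~SHADOW_MASK      # 0b11111000: bits left for if-flags

def set_if_flag(stencil, flag_index):
    """Set one 'if' flag without disturbing the shadow counter,
    as a stencil write mask would on hardware."""
    return stencil | (1 << (SHADOW_BITS + flag_index))

def in_shadow(stencil):
    """Pixel is shadowed if its shadow counter is non-zero."""
    return (stencil & SHADOW_MASK) != 0

def if_flag(stencil, flag_index):
    """Read back one 'if' flag from the high bits."""
    return bool(stencil & (1 << (SHADOW_BITS + flag_index)))
```

On hardware the same partitioning would be done with the stencil write mask, so the shadow passes and the branch-flag passes never clobber each other's bits.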

Yes, I consider it a mess, especially if shaders keep getting bigger and bigger, if you only use a 16-bit z-buffer, or if you want to save the stencil for something else and have to worry about how many bits to reserve. The technique is good for now, because ATI doesn't support PS3.0, shaders are fairly simple, and dynamic branching, being in its early stages, does have a slight impact... But since there aren't a lot of PS3.0 machines out there, why bother now? PS3.0 is for stuff coming out a year from now...

Humus said:
Sigma said:
:oops: Sorry! Occlusion. I mean occlusion...
Elaborate?

You're assuming (correctly) that the stencil test is performed before the shading, which is obviously what makes this technique fast: the pixel is killed before it enters the pixel shader. So not using it in the dynamic branching path makes the comparison a bit biased towards the stencil way. Performing some occlusion culling, which every card supports, should help a lot, especially the dynamic path, putting the two on more of an even workload...
 
This discussion isn't about the why's or wherefores - keep that to the other one. Technical only please.
 
Drak said:
The penalty of dynamic branching on the 6800, beyond the execution cost of the if statement itself, occurs under incoherent branching within the four pixel pipelines of a quad. The quads are still essentially SIMD units rather than four independent pixel pipelines. If the condition of the if statement does not evaluate to the same result in most 2x2 pixel neighbourhoods of your image (e.g. when it depends on random noise), don't expect your quads to stay coherent.
So, in such a case where dynamic branching is used, will the execution time of the shader program on that quad be limited to the "slowest pixel"? Or is there another penalty associated with the breaking of coherency amongst pixels?
 
Ostsol said:
So, in such a case where dynamic branching is used, will the execution time of the shader program on that quad be limited to the "slowest pixel"? Or is there another penalty associated with the breaking of coherency amongst pixels?

Details are very sketchy on this one.

Mark Harris from GPGPU.org believes that the branching is SIMD, and yes, the execution time in that quad will be that of the slowest pipeline.

Another site (a French one) says that quads that need both paths have to execute the "then" path first and then the "else" path.

The pixel shader quads of the NV40 are still described as SIMD, and they're definitely not fully MIMD like the vertex shaders. But the pixel pipelines can co-issue and dual-issue, so they're not strictly SIMD either.

I'll have to wait for a 6800U to try it and get some hard numbers. It's so difficult to get hold of one at the moment.
 
You may have already seen it, but at anandtech there is an article that analyses the new FarCry patch: http://www.anandtech.com/video/showdoc.html?i=2102

This patch adds an SM 3.0 shader path, which basically does the same lighting/shading operations as the SM 2.0 path, but in fewer passes, by using dynamic branching.
Apparently dynamic branching is fast enough to increase performance in this game.
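A rough way to see why fewer passes can win (a toy cost model in Python, with made-up cost parameters, not measured data): the multipass path repeats the per-pixel setup work once per light, while a single branching pass pays it only once.

```python
def multipass_cost(num_lights, pixels, light_cost, pixel_setup):
    """SM 2.0-style: one full pass per light; the per-pixel setup
    (interpolant fetches, normal decode, etc.) is repeated each pass."""
    return num_lights * pixels * (pixel_setup + light_cost)

def singlepass_cost(num_lights, pixels, light_cost, pixel_setup, branch_cost):
    """SM 3.0-style: one pass that loops/branches over the lights,
    paying the per-pixel setup once plus a small branch overhead."""
    return pixels * (pixel_setup + num_lights * (light_cost + branch_cost))
```

With, say, 4 lights, a setup cost of 5 and a branch cost of 1 per light, the single pass comes out well ahead; the break-even point depends entirely on how expensive the shared setup work is relative to the branch overhead.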
 
Slightly OT;

tb, the developer of the ShaderMark benchmarks, has experimented with dynamic branching. The results are IMHO strange, because the overhead of dynamic branching seems to be very high.

The thread with the results and the shaders he used can be found here: German 3DCenter.de forum, page 4 ff.

All tests were made with an NV40


Manfred
 
Scali said:
You may have already seen it, but at anandtech there is an article that analyses the new FarCry patch: http://www.anandtech.com/video/showdoc.html?i=2102

This patch adds an SM 3.0 shader path, which basically does the same lighting/shading operations as the SM 2.0 path, but in fewer passes, by using dynamic branching.
Apparently dynamic branching is fast enough to increase performance in this game.

This is funny. I'd really like to know what Crytek has told AnandTech. But in any case, this is not the first time AnandTech has gotten something wrong.

I wouldn't bet a cent that patch 1.2 uses branching in the pixel shader.

Because I'm not sure anyone will believe me, I won't bore you with the details. Anyway, in a few days everyone will be able to see for themselves.
 
Demirug said:
Because I'm not sure anyone will believe me, I won't bore you with the details. Anyway, in a few days everyone will be able to see for themselves.
Well, I'm not bored by details, so please share your thoughts with us.
Are you implying that dynamic branches in pixel shaders (on the NV40) are so costly as to be almost a non-option?

ciao,
Marco
 
nAo said:
Demirug said:
Because I'm not sure anyone will believe me, I won't bore you with the details. Anyway, in a few days everyone will be able to see for themselves.
Well, I'm not bored by details, so please share your thoughts with us.
Are you implying that dynamic branches in pixel shaders (on the NV40) are so costly as to be almost a non-option?

ciao,
Marco

I was talking about details of the FarCry 1.2 patch.

Dynamic branching on the NV40 is a much more complicated matter. The reason for this is the internal working principle of the CineFX pipelines: NVIDIA uses an unusual approach that works on batches of quads. But IMHO this would drive the thread too far off-topic.
 