Dynamic branching on X1800GTO vs Go7800

mikegi

Newcomer
My app has a volume renderer that uses axis-aligned stacks of 2d textures. I recently updated the pixel shaders to ps_3_0 and added dynamic branching to skip unnecessary gradient and lighting calculations. This results in a big win on the ATI X1800GTO in my dev machine (a 30% increase in fps on a busy volume, higher on volumes with more empty space) but a slight *decrease* in performance on the Go7800 in my Dell 9400 laptop. That makes no sense at all to me. The Go7800 must be executing both paths of the branch, as if it didn't support dynamic branching.

Is this something particular to the Go7800 or should I expect the same on all Nvidia 7x00 GPUs?

Another interesting performance item: replacing an opacity-correction equation in the pixel shader ( 1 - pow(1-alpha,delta) ) with a simple 2d texture lookup resulted in a 35% increase in fps on the X1800GTO.
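For reference, the lookup texture is just the correction curve baked into a 2D table. Here's a quick sketch of how one might generate it on the CPU before upload; the table size and delta range are made-up values, not the ones my app actually uses:

```python
# Table size and delta range are assumptions for illustration only.
SIZE = 256
DELTA_MAX = 4.0  # assumed maximum sample-spacing ratio

def build_opacity_table(size=SIZE, delta_max=DELTA_MAX):
    """Precompute 1 - (1-alpha)^delta as a 2D table (row = alpha, col = delta),
    ready to be uploaded as a 2D texture and sampled instead of calling pow()."""
    table = []
    for i in range(size):
        alpha = i / (size - 1)
        row = [1.0 - (1.0 - alpha) ** (j * delta_max / (size - 1))
               for j in range(size)]
        table.append(row)
    return table
```

The shader then replaces the pow() evaluation with a single tex2D lookup using (alpha, delta/DELTA_MAX) as coordinates.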

I'm using Direct3D 9.0c and HLSL in an effect file to display a 96x96x64 scalar volume. The volume renderer is pixel-shader intensive, rendering a filled 640x480 window at ~65 fps on the X1800GTO. The Go7800 only manages 30 fps in the same test.

Anyway, I'd appreciate hearing anyone's thoughts on the perf weirdness I'm seeing.

Thanks,
Mike
 
I think that might just be an issue of the known poor dynamic branching performance in all G70-derived GPUs.
 
Thanks. Pretty shocking threads. Why does Nvidia even claim to support dynamic branching? Is the entire 7x00 family this way? I'd like to tell my customers which cards to avoid. I'll have to add a special checkbox to my app to allow users to fall back to the ps_2_0 path on their Nvidia cards. What a hassle.

Mike
 
Why does Nvidia even claim to support dynamic branching?
Maybe because they do? Dynamic branching works quite well on G7x (and NV4x) GPUs. You may not like the performance characteristics, but that does not mean the feature does not work correctly.

Incidentally, try branching around more instructions. Around 20-50 skipped instructions should do the trick and show a noticeable performance improvement. If you have fewer instructions to skip, the compiler is likely to just use predication (among other things) to avoid the overhead of branching.

Or you can get yourself a G80, which does better branching than any other GPU out there.
 
Interesting, you're one of the first people, if not the first, I've seen state that dynamic branching works well on G7x GPUs, Bob.
 
Maybe because they do? Dynamic branching works quite well on G7x (and NV4x) GPUs. You may not like the performance characteristics, but that does not mean the feature does not work correctly.
I'm using this feature (dynamic branching) to increase performance. On Nvidia cards it decreases performance. That's a bug, imho.

Incidentally, try branching around more instructions. Around 20-50 skipped instructions should do the trick and show a noticeable performance improvement. If you have fewer instructions to skip, the compiler is likely to just use predication (among other things) to avoid the overhead of branching.
My shader doesn't have that many instructions. What I'm trying to skip are a couple of texture accesses and lighting calcs. This all works fine on ATI.

Or you can get yourself a G80, which does better branching than any other GPU out there.
Sorry, my customers have their own hardware. I've notified my users that they should avoid Nvidia hardware (anything less than the new 8x00 family).
 
I'm using this feature (dynamic branching) to increase performance. On Nvidia cards it decreases performance. That's a bug, imho.
Not really... if you use it for more coherent branching, it will be a performance win. The only difference between the NV 7x00 and ATI X1x00 series is the granularity of the threading, meaning that ATI can do better with *less* coherent data. Note that making it as coherent as possible is important on *both* cards (and all foreseeable future cards with similar threading architectures).

My shader doesn't have that many instructions. What I'm trying to skip are a couple of texture accesses and lighting calcs. This all works fine on ATI.
It seems to "work fine" on NVIDIA as well if the performance characteristics are changing. I could play devil's advocate and say that it could even be that NVIDIA's "texture accesses and lighting calcs" are so much faster than ATI's that you lose overall with a dynamic branch.

Now of course you're almost certainly getting bitten by the large branch granularity of the NV 7x series, but that's pretty well known (ATI had a field day about it at the launch of the X1800 series). I think it's unfair to say that the DB is "broken". It's fair to say that it's inefficient in your case.
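To make the granularity point concrete, here's a toy cost model. The GPU shades pixels in batches, and a whole batch runs the expensive path if *any* pixel in it needs to; the batch sizes and per-path costs below are illustrative, not real hardware figures:

```python
# Toy model of branch granularity. Batch sizes and per-path instruction
# costs are illustrative, not hardware specs.
def avg_pixel_cost(taken, batch_size, cheap=10, expensive=50):
    """taken: one bool per pixel, True = pixel needs the expensive path.
    Every pixel in a batch pays the expensive cost if any pixel in that
    batch takes the branch; returns the average per-pixel cost."""
    cost = 0
    for start in range(0, len(taken), batch_size):
        batch = taken[start:start + batch_size]
        per_pixel = expensive if any(batch) else cheap
        cost += per_pixel * len(batch)
    return cost / len(taken)
```

With expensive pixels scattered sparsely (say one in 64), a small batch still skips work most of the time, while a batch of 1024 pixels almost always contains at least one taken pixel and pays full price everywhere.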

If you really care about performance on NVIDIA cards, there are ways to make it better. If you're happy to just use ATI X1x00 or NVIDIA 8x00, that's great too!
 
Not really... if you use it for more coherent branching, it will be a performance win. The only difference between the NV 7x00 and ATI X1x00 series is the granularity of the threading, meaning that ATI can do better with *less* coherent data. Note that making it as coherent as possible is important on *both* cards (and all foreseeable future cards with similar threading architectures).
I don't have control over the data; it's real-time volumetric radar scans. Here's an example animation (a tornado):

http://www.grlevelx.com/gr2analyst/ktlx_19990503_2331_tornado.gif

Users have complete control over the opacity of the data.

If you really care about performance on NVIDIA cards, there are ways to make it better. If you're happy to just use ATI X1x00 or NVIDIA 8x00, that's great too!
I'm perfectly happy with the X1800GTO. Unfortunately, my customers already have hardware and I need to work around their existing systems. Adding the DB shader slows Nvidia systems so I will need to provide a checkbox that reverts back to ps_2_0 on their systems.

Mike
 
I'm using this feature (dynamic branching) to increase performance. On Nvidia cards it decreases performance. That's a bug, imho.
I hope you realize that even on CPUs, branching is not always a win. Try branching around a simple assignment, for example, then look at the resulting disassembly. You might be surprised what the compiler does with it.

You do need to skip around some non-trivial amount of code for branching to be a win. Back in GPU land, the number of instructions you need to skip is different for different GPUs. It happens to be more on G7x than on R5xx, but less on G80 than either of those.
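A back-of-the-envelope model of that trade-off (all numbers here are hypothetical instruction-slot costs, not measured figures for any GPU):

```python
# Simplified model of when a dynamic branch beats predication/flattening.
# Predication executes the guarded code unconditionally; a real branch pays
# a fixed overhead but only runs the body when taken. Costs are hypothetical.
def branch_wins(body_instrs, taken_fraction, branch_overhead):
    predicated = body_instrs                               # always executed
    branched = branch_overhead + taken_fraction * body_instrs
    return branched < predicated
```

With a hypothetical 20-slot overhead, skipping a 10-instruction body half the time loses (20 + 5 > 10) while a 50-instruction body wins (20 + 25 < 50); drop the overhead to 2 slots and even the small body wins. Different overheads per GPU shift the break-even body size, which is the point above.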

This is independent of branch coherence, which is another issue that needs to be considered.
 
I hope you realize that even on CPUs, branching is not always a win. Try branching around a simple assignment, for example, then look at the resulting disassembly. You might be surprised what the compiler does with it.
I've written thousands of lines of x86 asm in operating systems, graphics drivers, libraries, etc. I'm well aware of compiler output and optimization.

None of that matters because the bottom line is this: Nvidia GPUs are slower when dynamic branching is used. My apps will need to offer users a way around this problem.
 
Here's an example animation (a tornado):
That's cool! I've worked with some similar volumetric rendering stuff actually and the results are usually pretty neat :)

Adding the DB shader slows Nvidia systems so I will need to provide a checkbox that reverts back to ps_2_0 on their systems.
That seems reasonable to me. If you're using HLSL, it can probably do all of the heavy lifting for you - compiling for a ps_2_0 target should flatten the dynamic branches (and unroll loops) automatically.
 
Another interesting performance item: replacing an opacity-correction equation in the pixel shader ( 1 - pow(1-alpha,delta) ) with a simple 2d texture lookup resulted in a 35% increase in fps on the X1800GTO.

Is that a scalar op or vector? If it's a float4 op something like that should result in 9-12 instructions just for the pow function (depending on whether delta is also float4 or not). If it's a scalar op I find such an increase for replacing with texture very surprising. Have you tried using GPU Shader Analyzer (discussed here)? It's invaluable in situations like this. It has helped me a lot in my work.
 
Maybe because they do? Dynamic branching works quite well on G7x (and NV4x) GPUs. You may not like the performance characteristics, but that does not mean the feature does not work correctly.

Incidentally, try branching around more instructions. Around 20-50 skipped instructions should do the trick and show a noticible performance improvement. If you have less instructions to skip, the compiler is likely to just use predication (among other things) to avoid the overhead of branching.

Or you can get yourself a G80, which does better branching than any other GPU out there.
I wouldn't expect G80 to branch better than R520.
 
Not really... if you use it for more coherent branching, it will be a performance win. The only difference between NV 7x00 and ATI 1x00 series' is the granularity of the threading, meaning that ATI can do better with *less* coherent data. Note that making it as coherent as possible is important on *both* cards (and all foreseeable future cards with similar threading architectures).
I think it's actually more than just the granularity. On G7x, the branching instructions actually take cycles, but on R5xx and G8x they're free. The granularity is of course a major performance factor, but if you have something like a while loop over a small amount of code (e.g. variable step raytracing) then the overhead really adds up.
 