static branching - what is it good for?

ram

Newcomer
VS 2.0 offers a rather primitive way of branching. You can test 16 static boolean flags, which can be set at triangle level at the finest.

Can someone explain to me what this can be used for in practice?

I guess you could use the same VS program for multiple models, for example, and enable/disable certain VS paths per model. But couldn't you do the same without static branching by sending different VS programs to the GPU for the different triangles? If so, does doing it with static branching give a performance advantage here?
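
To make it concrete, here is roughly the kind of thing I have in mind -- a sketch only, with made-up register assignments and lighting maths, assuming an already-created IDirect3DDevice9 and the d3dx9 assembler:

```cpp
// Rough sketch of a vs_2_0 static branch (not tested): one program, and the
// boolean constant b0 picks the path per draw call. Register assignments and
// the tiny lighting calculation are invented for illustration.
#include <d3d9.h>
#include <d3dx9.h>

static const char g_vsSource[] =
    "vs_2_0\n"
    "dcl_position v0\n"
    "dcl_normal   v1\n"
    "m4x4 oPos, v0, c0\n"     // c0-c3: world-view-projection matrix
    "if b0\n"                 // static branch on boolean constant b0
    "  dp3 r0, v1, c4\n"      // c4: light direction
    "  mul oD0, r0.x, c5\n"   // c5: light colour
    "else\n"
    "  mov oD0, c5\n"         // unlit path: constant colour
    "endif\n";

IDirect3DVertexShader9* CreateUberShader(IDirect3DDevice9* dev)
{
    ID3DXBuffer* code = NULL;
    D3DXAssembleShader(g_vsSource, sizeof(g_vsSource) - 1,
                       NULL, NULL, 0, &code, NULL);
    IDirect3DVertexShader9* vs = NULL;
    dev->CreateVertexShader((const DWORD*)code->GetBufferPointer(), &vs);
    code->Release();
    return vs;
}

void SetLightingPath(IDirect3DDevice9* dev, BOOL lit)
{
    dev->SetVertexShaderConstantB(0, &lit, 1);   // flips b0; no shader rebind
}
```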
 
Yes, static branching is there as a possible performance optimisation.

Each VS instruction is large - certainly more than 32 bits, maybe many times that. The whole program could be hundreds of bytes, and takes time to upload. In contrast, a single VS constant is just 32 bits.

Also, changing programs on some architectures could involve idling the shader under some circumstances. A constant is less likely to cause the same issue (although I can still envisage scenarios in which it might).

It is of course possible to translate static-branch programs into linear programs and upload them separately as well, should the hardware not support static branches.
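
As a toy picture of that fallback: the driver could keep one pre-flattened linear variant per combination of boolean constants and pick the right one at bind time. The types and names below are invented, and a real driver would key only on the booleans the program actually reads.

```cpp
// Toy model of the "no hardware support" fallback: one flattened linear
// program per boolean-constant combination, selected at bind time.
#include <cstdint>
#include <unordered_map>
#include <vector>

using LinearProgram = std::vector<uint32_t>;   // already-flattened token stream

struct StaticBranchShader {
    // variant index = bit mask built from the boolean constants b0..b15
    std::unordered_map<uint16_t, LinearProgram> variants;

    const LinearProgram& Select(const bool b[16]) const {
        uint16_t mask = 0;
        for (int i = 0; i < 16; ++i)
            if (b[i]) mask |= uint16_t(1u << i);
        return variants.at(mask);              // this one gets uploaded
    }
};
```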
 
Think of it the same way you use #define/#ifdef in C. It's something that doesn't even need to be supported in hardware. The driver can resolve the branch at bind time (e.g. turn it into a NOP if it is not taken, or force it to be taken if it is).
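
Roughly what I mean by resolving it at bind time, as a toy model with invented opcodes and instruction layout (and assuming a well-formed program):

```cpp
// Walk the token stream, evaluate each if-bool against the currently bound
// boolean constants, and NOP out the branch instructions plus the untaken side.
#include <cstdint>
#include <vector>

enum Op : uint8_t { OP_ALU, OP_IF_BOOL, OP_ELSE, OP_ENDIF, OP_NOP };
struct Instr { Op op; uint8_t boolReg; };      // other fields elided

void ResolveStaticBranches(std::vector<Instr>& prog, const bool b[16])
{
    struct Level { bool parentActive; bool condTrue; };
    std::vector<Level> stack;
    bool active = true;                        // are we in a taken path?
    for (Instr& ins : prog) {
        switch (ins.op) {
        case OP_IF_BOOL:
            stack.push_back({active, b[ins.boolReg]});
            active = active && b[ins.boolReg];
            ins.op = OP_NOP;                   // the branch itself goes away
            break;
        case OP_ELSE:
            active = stack.back().parentActive && !stack.back().condTrue;
            ins.op = OP_NOP;
            break;
        case OP_ENDIF:
            active = stack.back().parentActive;
            stack.pop_back();
            ins.op = OP_NOP;
            break;
        default:
            if (!active) ins.op = OP_NOP;      // untaken path becomes NOPs
            break;
        }
    }
}
```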

Basically, it exists because some vendors' hardware doesn't actually have the capability of doing a conditional branch. I see little reason to waste silicon actually implementing a static branch based on a constant, since it can be preprocessed.

It would be nice if the developer could say "fall back to software vertex processing for this vertex stream if this vertex program uses a dynamic branch".
 
DemoCoder said:
I see little reason to waste silicon actually implementing a static branch based on a constant, since it can be preprocessed.

Well, as switching shaders (which is basically what happens if it's just being preprocessed by the driver) has a very high performance penalty, it can prove advantageous in many situations. I don't think there's much silicon needed to support this feature either.
 
What state change would be needed that isn't already happening? Seems to me that the driver could simply poke a byte into the memory the shader is stored in and flush the cache line containing that instruction.

I simply doubt that there are any real performance benefits to putting it in silicon. First of all, it's going to be a rarely used feature. Secondly, you're already changing state (you have to rebind the constant register and send down the individual primitives which need that particular constant, making the batches smaller). Third, it seems there is a straightforward way to do it by "self-modifying" the shader code in the driver.

I'm just doubtful of the need to do it in hardware. I mean, if you have a branching instruction that can read a constant register, why not just allow it to read any register? Seems like a poor restriction with no real urgent use case, whereas allowing data dependent branching has plenty of uses.
 
DemoCoder said:
I mean, if you have a branching instruction that can read a constant register, why not just allow it to read any register? Seems like a poor restriction with no real urgent use case, whereas allowing data dependent branching has plenty of uses.

It's very different because of the sensible way that Vertex Shaders are implemented. A constant is a constant over a large batch of vertices. If you allow "any" register to be read, the branch is NOT constant over a large number of vertices; the branch might be different for every vertex, resulting in different program paths for different vertices in the same call. If you have SIMD hardware this is not trivial. You might reason: if I can branch based on a constant, why not branch on any value... trouble is, this completely depends on how the hardware is designed.

Vertex programs have to be uploaded to the hardware and this costs cycles; if you can write one larger shader, upload it once, and use it for all the objects in your scene, you're going to gain performance. This is a very valid use IMHO.
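
For example (a D3D9-style sketch -- the Object struct and the b0/b1 flag assignments are made up, and the device and shader are assumed to exist already):

```cpp
// One larger "uber" vertex shader stays bound for the whole scene; each
// object just flips the boolean constants that select its paths.
#include <d3d9.h>
#include <vector>

struct Object { BOOL skinned; BOOL lit; /* streams, materials, ... */ };

void DrawScene(IDirect3DDevice9* dev, IDirect3DVertexShader9* uberVS,
               const std::vector<Object>& objects)
{
    dev->SetVertexShader(uberVS);                    // uploaded/bound once
    for (const Object& obj : objects) {
        BOOL flags[2] = { obj.skinned, obj.lit };    // b0, b1
        dev->SetVertexShaderConstantB(0, flags, 2);  // cheap per-object change
        // ... set per-object float constants, streams, DrawIndexedPrimitive ...
    }
}
```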

K-
 
Well, but why can't you use the "modify code in place" trick that I specified? Surely this can be done and the only special hardware needed would be the ability to invalidate a cache line. This eliminates the upload cycles needed.

The X-Box already uses one trick to rapidly switch shaders. You simply concatenate multiple shader programs together in memory (e.g. upload them to video RAM once), and then all you need to change to "switch" is the initial Program Counter (e.g. point it at the right shader).
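
Purely to illustrate that idea, nothing X-Box specific (the offsets and "instructions" below are made up):

```cpp
// All programs live in one instruction buffer that is uploaded once;
// "switching shaders" is just picking a different start offset for the PC.
#include <cstdio>
#include <vector>

struct Instr { int opcode; };                 // stand-in for real VS tokens

int main()
{
    std::vector<Instr> instructionMemory;     // uploaded to video RAM once

    size_t startA = instructionMemory.size();
    instructionMemory.insert(instructionMemory.end(), {{1}, {2}, {3}});  // program A

    size_t startB = instructionMemory.size();
    instructionMemory.insert(instructionMemory.end(), {{4}, {5}});       // program B

    size_t pc = startB;                       // "bind" program B: move the PC
    std::printf("A starts at %zu, B starts at %zu, current PC = %zu\n",
                startA, startB, pc);
    return 0;
}
```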


Again, I'd like to see a real compelling use case for this "constant branching" shader. It doesn't seem like a big win to me compared to simply preloading two shaders into RAM, or using the "modify in place" trick I talked about.
 
ram said:
VS 2.0 offers a rather primitive way of branching. You can test 16 static boolean flags, which can be set at triangle level at the finest.

Can someone explain to me what this can be used for in practice?

I guess you could use the same VS program for multiple models, for example, and enable/disable certain VS paths per model. But couldn't you do the same without static branching by sending different VS programs to the GPU for the different triangles? If so, does doing it with static branching give a performance advantage here?

I see several advantages of an architecture supporting static branching natively:
1. Some architectures may need a pipeline flush when changing programs, but not when changing the branch constant.
2. It's a way of 'compressing' programs. The program will have a common part plus some object-specific parts which are toggled via constant branches. That way you reduce the thrashing of the program storage.
3. Switching to a different program may not be as straightforward as you think: the program could be state-dependent and may require recompiling. Don't assume that VS syntax is the assembly language of the graphics chip; that may be true for one graphics vendor, but not for the rest (on all the other architectures a full compilation phase may be needed).
 
Some games like to stitch together vertex shaders out of modular pieces -- a bone piece, a transformation piece, a lighting piece, a texture generation piece, and so on. Constant registers and loop registers allow the developer to specify a whole family of related shaders at once.

Depending on how many different effects are being used, a game can have thousands, or even millions, of different shaders. Far too many to fit into the GPU's instruction memory.
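
A back-of-the-envelope illustration of why the numbers blow up, with invented counts for each modular piece:

```cpp
// Each modular piece multiplies the number of distinct stitched shaders,
// while one program parameterised by static branches / loop registers can
// cover the whole family. The per-piece counts here are made up.
#include <cstdio>

int main()
{
    const int bonePaths   = 5;   // 0-4 bones per vertex
    const int lightCounts = 9;   // 0-8 lights
    const int texgenModes = 4;
    const int fogModes    = 3;
    const int shadowModes = 2;

    std::printf("distinct stitched shaders: %d\n",
                bonePaths * lightCounts * texgenModes * fogModes * shadowModes); // 1080
    std::printf("parameterised uber-shaders needed: 1\n");
    return 0;
}
```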

With DX 8 you can stitch together shaders on the fly, but this has performance implications. Some GPU architectures benefit from allowing the driver to reorder vertex shader instructions to increase parallelism and hide latency. On those architectures, using constant registers and looping registers may allow you to optimize a whole class of shaders at once, rather than having to re-run the optimizer for each combination.
 