Dany - Shader 3.0 and Branching

Dave Baumann

Dany - you mentioned in your Shader 3.0 post that you think the dynamic branching in PS3.0 will both speed things up and be very handy. What uses do you see for this, and how do you see things being sped up? There is, after all, a school of thought that says branching shouldn't even be in a graphics pipeline.
 
Dynamic branching performance

From a performance standpoint:
Assuming you have only one ALU to do your math and you decide to branch, you will have to wait until you know the outcome of the conditional code before jumping. This is especially annoying on CPUs when the pipeline is very long (branch prediction was invented to avoid having to wait all the time). On graphics hardware, however, the same operation is applied to every pixel of a triangle (and generally to several triangles in the same batch). Therefore, it's very simple for HW architects to allow massive multithreading (if you would assume that 1 thread = 1 pixel). In that respect, waiting for the result of the conditional code doesn't really matter, because you can work on other pixels while you wait for the conditional code to complete and then come back to this pixel thread.
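As an illustration of that latency-hiding idea, here is a minimal C++ toy model (the numbers and the round-robin scheme are assumptions, not a description of any particular chip): while one pixel thread waits for its branch condition, the scheduler simply issues work from other pixel threads.

```cpp
#include <cstdio>
#include <vector>

// Toy latency-hiding model: every pixel thread issues a branch, must wait
// BRANCH_LATENCY cycles for the condition to resolve, then issues one
// dependent instruction.  With enough pixel threads in flight the wait is
// hidden, because the scheduler simply picks some other pixel each cycle.
int main() {
    const int BRANCH_LATENCY = 8;    // hypothetical latency, in cycles
    const int NUM_THREADS    = 16;   // pixels in flight (1 thread = 1 pixel)

    struct Thread { int state = 0; int readyAt = 0; };  // 0=issue branch, 1=waiting, 2=done
    std::vector<Thread> threads(NUM_THREADS);

    int cycle = 0, issued = 0, stalls = 0, done = 0;
    while (done < NUM_THREADS) {
        bool didIssue = false;
        for (auto& t : threads) {
            if (t.state == 0) {                       // issue the branch itself
                t.state = 1;
                t.readyAt = cycle + BRANCH_LATENCY;
                didIssue = true;
                break;                                // one issue slot per cycle
            }
            if (t.state == 1 && cycle >= t.readyAt) { // outcome known: continue this pixel
                t.state = 2;
                ++done;
                didIssue = true;
                break;
            }
        }
        if (didIssue) ++issued; else ++stalls;        // a stall means nothing was ready
        ++cycle;
    }
    std::printf("%d instructions issued, %d stall cycles\n", issued, stalls);
}
```

With 16 threads and an 8-cycle wait the toy run reports zero stall cycles; drop the thread count below the latency and stalls appear, which is the whole point of the multithreading.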

Depending on how many ALUs you have in your engine and how complex your shader is, you're unlikely to want to execute long shaders when they make little or no difference to the visual output. Without dynamic branching, you may have no choice but to execute complex code per pixel to make sure you render the accurate output. I won't go into specific usage details today (confidential), but I'll let you guess what it could be.
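As a purely hypothetical example of that kind of saving (not the confidential case being hinted at), a per-pixel early-out might look something like this C-style sketch: when a light's contribution is negligible, the expensive path is skipped entirely rather than computed and multiplied by nearly zero.

```cpp
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Hypothetical per-pixel routine: without dynamic branching the expensive
// part runs for every pixel and its result is multiplied by roughly zero;
// with it, pixels outside the light's influence skip the work entirely.
Vec3 shadePixel(Vec3 albedo, Vec3 normal, Vec3 lightDir, float attenuation) {
    if (attenuation < 0.001f)       // dynamic branch: contribution is invisible
        return Vec3{0.0f, 0.0f, 0.0f};

    // "Expensive" path: in a real shader this might be dozens of ALU ops
    // and several texture fetches (shadow maps, BRDF terms, and so on).
    float ndotl = std::fmax(dot(normal, lightDir), 0.0f);
    float s = ndotl * attenuation;
    return Vec3{albedo.x * s, albedo.y * s, albedo.z * s};
}

int main() {
    Vec3 c = shadePixel({1, 0, 0}, {0, 0, 1}, {0, 0, 1}, 0.5f);
    std::printf("lit pixel: %.2f %.2f %.2f\n", c.x, c.y, c.z);
}
```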
 
Re: Dynamic branching performance

Dany Lepage said:
Therefore, it's very simple for HW architects to allow massive multithreading (if you would assume that 1 thread = 1 pixel).
This must be a new definition of the expression "very simple" of which, up until now, I'd been unaware. :D
 
The main issue that I can see with this is that contemporary hardware doesn't work at a pixel level, but at a quad level. Certain functions within DX have called for this to be the case, such as dx/dy, so I don't see any manufacturers suddenly moving away from quads, at least not in the near term (i.e. shader 3.0 parts). You are then left with the situation where, because of branching, one pixel in your quad may take a different path than the rest of the pixels – how can that be resolved whilst still speeding things up? In this situation I can only see that you would need to calculate the quad multiple times, once for each separate pixel branch within that quad (and hope that there aren't further sub-branches as well).
 
In a SIMD architecture you might send all 4 pixels of the quad to the pixel shader along with a pixel mask to indicate which pixels are valid. Even though each pixel has the same instructions (program) they execute on different processors so they branch individually. Then just sync them at the output if the pixels need to stay together at that point. I'm not sure how the dx/dy instruction would be implemented though so maybe it's not done this way.
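One common way to realise the pixel-mask idea in lock-step SIMD, sketched here as a C++ toy (an assumption about how such hardware could behave, not a statement about any shipping part): the whole quad walks both sides of the branch, and the per-pixel mask decides which lanes actually commit their results.

```cpp
#include <array>
#include <cstdio>

// Toy 2x2 quad executing in lock-step.  Both branch paths are evaluated;
// a per-pixel mask disables writes for pixels on the "other" path.
int main() {
    std::array<float, 4> input  = {0.2f, 0.9f, 0.4f, 0.7f};
    std::array<float, 4> result = {0, 0, 0, 0};

    // Per-pixel condition -> write mask for the "then" path.
    std::array<bool, 4> mask;
    for (int i = 0; i < 4; ++i) mask[i] = input[i] > 0.5f;

    // "then" path: the whole quad steps through it, only masked lanes commit.
    for (int i = 0; i < 4; ++i)
        if (mask[i]) result[i] = input[i] * 2.0f;

    // "else" path: same instructions for the whole quad, inverted mask.
    for (int i = 0; i < 4; ++i)
        if (!mask[i]) result[i] = input[i] + 1.0f;

    for (int i = 0; i < 4; ++i)
        std::printf("pixel %d: %.2f\n", i, result[i]);
}
```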
 
3dcgi said:
I'm not sure how the dx/dy instruction would be implemented though so maybe it's not done this way.
IIRC, the dx/dy instruction is deemed to be invalid if the pixels have branched.
 
Simon F said:
3dcgi said:
I'm not sure how the dx/dy instruction would be implemented though so maybe it's not done this way.
IIRC, the dx/dy instruction is deemed to be invalid if the pixels have branched.

No it isn't.

The rate of change computed from the source register is an approximation on the contents of the same register in adjacent pixel(s) running the pixel shader in lock-step with the current pixel. This is designed to work even if adjacent pixels follow different paths due to flow control, because the hardware is required to run a group of lock-step pixel shaders, disabling writes as necessary when flow control goes down a path that a particular pixel does not take.
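That lock-step requirement suggests a simple mental model for the derivative instructions: within a 2x2 quad, ddx/ddy are just differences between the same register in horizontally or vertically adjacent pixels. A rough sketch, with the quad layout being my own assumption:

```cpp
#include <array>
#include <cstdio>

// A 2x2 quad laid out (by assumption) as:  0 1
//                                          2 3
// The derivative of a register is approximated from the neighbour running
// the same shader in lock-step, even if that neighbour took another branch
// (its register still holds whatever it last wrote).
struct Quad {
    std::array<float, 4> reg;   // one "source register" value per pixel

    float ddx(int pixel) const {    // horizontal neighbour: flip bit 0
        int other = pixel ^ 1;
        return (pixel & 1) ? reg[pixel] - reg[other] : reg[other] - reg[pixel];
    }
    float ddy(int pixel) const {    // vertical neighbour: flip bit 1
        int other = pixel ^ 2;
        return (pixel & 2) ? reg[pixel] - reg[other] : reg[other] - reg[pixel];
    }
};

int main() {
    Quad q{{{1.0f, 1.5f, 2.0f, 2.5f}}};
    std::printf("pixel 0: ddx=%.2f ddy=%.2f\n", q.ddx(0), q.ddy(0));
}
```

This also makes clear why results get questionable after divergence: the neighbour's register may simply hold stale data from whatever path it did take.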
 
Demirug,
That's all well and good but, reading how it's meant to work from the [TLA] spec, it's not going to produce valid results unless the programmer really knows what they're doing and, AFAICS, knows the hardware+compiler inside out!
 
Further note:
I've just double-checked and there are a lot of restrictions on where and when you can use the ddx|ddy instructions.
 
The main advantage of branching is skipping over unnecessary instructions and potentially reducing the number of shader variations (i.e. one shader instead of lots of different ones). The counter is that several things work against it:

1) In a SIMD architecture, if different pixels take different paths, you can end up with only one pixel effectively executing per cycle (a toy worst-case model is sketched after this post). Constant branches are not an issue, just dynamic ones.

2) Optimizers cannot optimize code across branches, and you can lose quite a bit of performance in those cases. The same applies to code inside a loop that could otherwise be optimized. Loops can be unrolled in some cases, but not in a general way.

3) It requires an increase in GPRs, since you need to have GPRs, per pixel, for both sides of a branch. Some optimizations are possible in some cases, but not in a general way.

So it's a reasonable thing, but I don't expect (or hope) that shaders will start executing branches as frequently or as casually as CPUs do.
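To make point 1 concrete, here is a toy worst-case model (the SIMD width and the equal-path-length assumption are mine): with lock-step execution, every distinct path taken by someone in the batch has to be walked by the whole batch, so utilization falls roughly as 1/paths.

```cpp
#include <cstdio>

// Worst case for dynamic branching on a SIMD batch: with lock-step
// execution, every distinct path taken by someone in the batch has to be
// walked by the whole batch, with the other pixels masked off meanwhile.
// Assuming equal-length paths, utilization falls roughly as 1/paths.
int main() {
    const int batchSize = 16;   // hypothetical SIMD width
    for (int distinctPaths = 1; distinctPaths <= batchSize; distinctPaths *= 2) {
        double utilization = 1.0 / distinctPaths;
        std::printf("%2d distinct paths -> %5.1f%% of peak throughput\n",
                    distinctPaths, utilization * 100.0);
    }
}
```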
 
The obvious disadvantage is that by devoting transistors to working around these problems, you either have fewer transistors working on shader throughput, or a more expensive chip.

If developers start using branching for the heck of it when they could instead somehow separate branches in software, there will be a fair amount of inefficiency.

I personally think true dynamic branching in the shader is a bit of a bad idea for these reasons. But I don't have a good idea of what these costs are, so it's more or less an uninformed opinion.

Oh, BTW, LOD is another problem. Most cards use the texture coordinates from 4 pixels to determine the anisotropic footprint, mipmap level, etc. But that's related to the ddy and ddx arguments anyway. With all this hardware devoted to AA, severe texture aliasing is not going to be well received.
 
True branching will become great for longer shaders. Provided developers are careful about when to use it and/or drivers unroll small branches, it'll never be a hindrance to performance.

As shaders get longer, and a dynamic branch ends up skipping perhaps 100 instructions, you've got a great performance saver.

What wouldn't be good, for example, would be a while loop that includes a single instruction. That would utterly kill performance. But a while loop that includes ~10 instructions or so should be just fine. 100 instructions? Even better.

Remember that branching of this form will be done instead of executing all possible instructions, and selecting the final result at the end. If the scheduling issues are less of a performance hit than doing the added instructions that don't end up having any effect on the final output, then branching is a win for performance.
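A back-of-the-envelope version of that trade-off, with all the numbers invented: compare executing both sides and selecting the result ("predication") against a dynamic branch that pays some fixed scheduling overhead but skips the untaken side.

```cpp
#include <cstdio>

// Toy cost model: predication ("execute both sides, select at the end")
// always pays for both paths; a dynamic branch pays some scheduling
// overhead plus only the path actually taken.  All numbers are invented.
int main() {
    const int branchOverhead = 4;                   // assumed cycles per branch
    const int takenLen       = 5;                   // the short path we keep
    const int skippablePathLengths[] = {1, 10, 100};

    for (int pathLen : skippablePathLengths) {
        int predicated = takenLen + pathLen;        // both sides, then select
        int branched   = takenLen + branchOverhead; // skip the long side
        std::printf("skippable path of %3d ops: predicated=%3d, branched=%d -> %s\n",
                    pathLen, predicated, branched,
                    branched < predicated ? "branch wins" : "predication wins");
    }
}
```

Under these made-up numbers the single-instruction case loses to predication, while the 10- and 100-instruction cases win, which mirrors the while-loop examples above.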

Static branching will be even better for performance, as it shouldn't incur any performance hit, but would instead prevent pipeline stalls. After all, if you can roll what today takes many shaders into one larger one, the pipelines will have an easier time rendering continuously.
 
It's not nearly as simple as saying "if the branch skips 100 instructions, it's a performance win". In fact, if the shader has quite a few texture lookups, then the extra GPRs defined in the instructions that you skip can completely kill performance. You could end up with 1/20th the performance (we have internal examples of that), just because the extra GPRs used will not allow for texture latency hiding. Also, given that it's pretty much as easy to change a shader as to load a constant, static shaders will always be faster.

The main first use of dynamic branching will be to reduce shader count and complexity (this is what the GDC papers were pushing with "dynamic lighting"). It will not be a performance advantage at all; quite the opposite. Later, when shaders are almost pure ALU, there will be a performance advantage (you need tens of ALU instructions per texture fetch, or perhaps hundreds, depending on the architecture).
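A rough model of the GPR/latency interaction described above, with every figure a placeholder rather than real hardware data: the register file caps how many threads fit in flight, and a texture fetch's latency is hidden only while there are enough other threads to switch to.

```cpp
#include <cstdio>

// Rough model: the register file caps how many threads fit in flight, and
// a texture fetch's latency is hidden only while the other threads have
// enough ALU work to fill the wait.  All figures are placeholders.
int main() {
    const int registerFileSize = 256;   // assumed GPRs available per SIMD
    const int textureLatency   = 100;   // assumed cycles per fetch
    const int aluOpsPerFetch   = 4;     // ALU work each thread offers between fetches

    const int gprOptions[] = {4, 8, 16, 32};
    for (int gprsPerThread : gprOptions) {
        int threadsInFlight = registerFileSize / gprsPerThread;
        // While one thread waits on its fetch, the others supply ALU work.
        int cyclesCovered = (threadsInFlight - 1) * aluOpsPerFetch;
        std::printf("%2d GPRs/thread -> %2d threads -> %3d cycles covered (%s)\n",
                    gprsPerThread, threadsInFlight, cyclesCovered,
                    cyclesCovered >= textureLatency ? "latency hidden" : "stalls");
    }
}
```

In this toy, doubling the per-thread GPR count (as a skipped-but-allocated branch side can do) halves the threads in flight and eventually tips the shader from "latency hidden" into "stalls".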
 
That's true, and it's a challenge to overcome in dynamic branching optimization. A great synthetic benchmark would be one that tests how inefficient an architecture is for the case you outlined.

As for static branching, it could well be a performance benefit if it prevents state changes. Of course, that would depend upon the circumstances.
 
How often is it that texture fetches are conditional? Most of the dynamic branch usages I've seen don't do data-dependent conditional texture loads. E.g. the vast majority of RenderMan shaders with dynamic branches are ALU only. In fact, I couldn't find any RM shaders with a data-dependent branched texture fetch. Of those with loops, most are calling noise() (which doesn't necessarily have to be a texture fetch).

Can't we assume that, like with DSX/DSY, developers will be informed and will understand the conditions under which branches will be fast versus those under which they will be slow? No one is advocating that your whole scene should be rendered with dynamic branches everywhere, and of course the pathological conditions you have outlined will be present.

Instead, I can imagine that there are some very specific shaders which would benefit from dynamic branches (such as physics simulations in the pixel shader) and which would be pathologically worse without them. Moreover, there are some shaders that are not possible without branches, because one of the branches contains instructions with side effects (e.g. fragment discard).

Can I assume NVidia is going to evangelize exactly the cases where these are a win, and ATI is going to keep stonewalling against dynamic branches until you implement them? I mean, it's one thing to say that the developer has to be aware of specific conditions with respect to implementing PS3.0 dynamic branches. That's no different from many aspects of 3D programming, which is why Richard Huddy has to travel around giving speeches on how best to extract maximum performance. It's quite another to start suggesting that there are no scenarios under which it will be a performance benefit. I can't ignore the interest that ATI would naturally have in downplaying these features.

I would rather contend that it's an unknown right now, regardless of your internal testing, because we don't know how developers are going to take advantage of these features over the next year or so. Serendipity.
 
Only one more comment from me on this, given the obvious bad attitude:

I've not said that branching is not warranted. There are cases where it is (and there are more, I'm sure, that I cannot foresee). The primary one right now is shader combinatorial reduction; this will not bring a performance improvement, just a code reduction.

But all I stated is that there are some non-obvious counters to its usage. It's NOT as simple as on CPUs. Developers aware of this can certainly judge for themselves what is best for them. Anyone thinking that it's free, or that it can be used everywhere they want, will be sorely disappointed.
 
Well, I think we agree. I'm glad you brought the issue of GPRs into the thread; it's something I overlooked. I got directed to this thread by the one on Huddy's leaked PowerPoint with the comments embedded. I guess I was a little more hotheaded than I should have been.

I guess the question is, would some form of branch prediction help? What if you can deduce that 90% of the time a branch with a texture load in it would execute, and 10% of the time it wouldn't. Could the driver/GPU make use of such hints to reduce the penalty?
 
I would guess that given the previous suggestions that GPU pipelines are hundreds of stages deep, you'd want much better branch prediction than that for optimal performance.

I would tend to think that GPU branch prediction would be much more conservative (if it is used at all), and designed such that a mispredicted branch would not force a flush of the pipelines (i.e. by not throwing away the data needed to take the branch the hardware doesn't expect to use).
 
Branch prediction in a CPU is very important, since the CPU is mostly single threaded.

In a GPU, there are many pixels in flight, and you can consider each, or a group of each, to be a thread. When you hit a branch or a texture fetch, you can sleep that thread and go work on another batch of pixels.

Consequently, branch prediction in VPUs is not useful at all. What is much more important is usage of resources per thread. If each thread uses too many resources, then you can't actually put a thread to sleep when a branch or texture fetch is hit, since you've run out of resources to start another thread. Then you have to stall and wait for something. The higher the latency of items you are waiting for, the longer the wait. That then drops your efficiency. You can virtualize resources and move them into local memory, but then you still are hit by the memory latency.

There are numerous resources defined per pixel, but probably the main one is GPR usage. In a branching case, a compiler cannot know which branch will be taken, consequently a thread must be allocated all resources required for both possible branches. Some union of resources can be done, but it's only on simple true/false branches, and can't be used on loops and other constructs. Consequently, if you skip 100 instructions, it looks good. But if those 100 instructions use lots of resources, you might not be able to start a new thread. Then efficiency drops.
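A simplified sketch of that budgeting rule (my own formulation, not a vendor's allocator): values live across the branch always need registers, the two exclusive sides of a plain if/else can share theirs, and constructs like loops generally can't.

```cpp
#include <algorithm>
#include <cstdio>

// Simplified budgeting rule: values live across the branch always need
// registers; the two exclusive sides of a plain if/else can share ("union")
// theirs, since only one side actually runs for a given pixel.
int gprsForIfElse(int liveAcross, int thenOnly, int elseOnly) {
    return liveAcross + std::max(thenOnly, elseOnly);
}

// Without that sharing (e.g. for loops and other constructs where the
// compiler cannot prove exclusivity), the thread has to be budgeted for
// everything at once.
int gprsWithoutSharing(int liveAcross, int thenOnly, int elseOnly) {
    return liveAcross + thenOnly + elseOnly;
}

int main() {
    // 6 registers live across the branch, 4 used only by one side, 10 only by the other.
    std::printf("with union: %d GPRs, without: %d GPRs\n",
                gprsForIfElse(6, 4, 10), gprsWithoutSharing(6, 4, 10));
}
```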

As well, most VPUs are SIMD, so if a branch is taken differently for a pixel vs. its neighbors, you again lose on the efficiency side.

Finally, optimizing across branches is very difficult. In general, it's very suboptimal. That can kill efficiency too.

If enough efficiency is lost because you cannot operate SIMD or cannot be multi-threaded, then you become a simple single threaded single pixel operation. Given that a CPU runs 5~10x faster, then you've just killed your VPU and are much better running purely on the CPU. The main advantage of the VPU is its ability to run multiple things in parallel and maintain high throughput by having lots of different things to do. If you take that away, you lose.
 
In a recent leaked PDF, ATi suggested not using dynamic branching before the R5XX. Does that mean all of the problems you mentioned will be resolved in the R5XX? Since I don't see any fundamental solution to the conflict between long shaders and limited resources, I'd like to hear, if possible, how you guys at ATi found a way around it. ;)
 