Performance penalty for executing all possible branches?

The NV30 could supposedly execute all possible branches, drop the impossible ones and choose the best-suited path. Would this impose a performance penalty on the processor? Isn't there a limited number of ALUs per pipe, each having to compute the possible branches (assuming more than one ALU), or is there some sort of parallel execution?
 
Yeah, either it'll have to execute both paths of a branch at half the speed or it'll require more computational units, which may be costly hardwarewise and in terms of power consumption.
 
Hm, I seem to remember that while the venerable 68040 is a non-superscalar design, it does have hardware in at least the initial pipeline stages to process both sides of a branch until it has determined which of the two is correct, and then invisibly continues down that path. How much of a saving would that approach be in today's probably considerably deeper pipes?

*G*
 
There is no branching in pixel shaders; conditionals are handled via predication.

if(y < 10)
{
x = x + y;
}
else
{
x = x + w;
}

is not handled by skipping over the body of the if or else. In fact, it is translated to something like this:

x1 = x + y;
x2 = x + w;
x = y < 10 ? x1 : x2;

That is, every line of code in a pixel shader is executed, but only the final correct value of each conditional is propagated to the next expression.
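
To make the two forms concrete, here is a minimal C sketch of the same conditional written both ways. The function names `shade_branch` and `shade_predicated` are made up for illustration, and plain C only mimics what the shader hardware would do per instruction.

```c
/* Branching form: only one of the two additions is executed. */
float shade_branch(float x, float y, float w)
{
    if (y < 10.0f) {
        x = x + y;
    } else {
        x = x + w;
    }
    return x;
}

/* Predicated form: both additions are always executed, and the
   comparison only selects which result is propagated. */
float shade_predicated(float x, float y, float w)
{
    float x1 = x + y;
    float x2 = x + w;
    return (y < 10.0f) ? x1 : x2;
}
```

Both functions return the same value for any input; the difference is only in how much work is done to get there.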

As for whether it is faster: well, I bet it is faster than doing another pass to handle the conditional.
 
In your trivial case, predication would be faster than (or at least as fast as) doing a branch.

One way the compiler could handle your code would be
SGE select, Y, 10
ADD tmp0, X, Y
ADD tmp1, X, W
LERP X, tmp0, tmp1, select

Another way (using condition codes) would be
SUBC tmp0, Y, 10
ADD tmp0, X, Y
ADD tmp1, X, W
MOV X (GE), tmp1
MOV X (LT), tmp0
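
For anyone who wants to check the first listing by hand, here is a rough C emulation of it. The helper names `sge_f` and `lerp_f` are hypothetical, and the operand order of LERP is an assumption (a selector of 0 picks the first source, 1 picks the second), chosen so that the result matches the original if/else; a real instruction set may order its operands differently.

```c
/* SGE: set-on-greater-or-equal, yields 1.0 or 0.0. */
float sge_f(float a, float b)
{
    return (a >= b) ? 1.0f : 0.0f;
}

/* LERP as assumed here: t = 0 gives a, t = 1 gives b. */
float lerp_f(float a, float b, float t)
{
    return (1.0f - t) * a + t * b;
}

/* Mirrors the SGE / ADD / ADD / LERP sequence, one line per instruction. */
float shade_select(float x, float y, float w)
{
    float select = sge_f(y, 10.0f);   /* 1.0 when y >= 10, else 0.0 */
    float tmp0 = x + y;               /* result for the "if" side    */
    float tmp1 = x + w;               /* result for the "else" side  */
    return lerp_f(tmp0, tmp1, select);
}
```

All four instructions run regardless of the value of y; the selector only decides which sum survives.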

Either of these cases should be faster than actually doing branching on the input. Even for larger cases, predication can be faster. It really depends on the program's characteristics whether predication or branching will be faster. For branching to be faster, you need to be able to pay off the (relatively high) constant cost for doing a comparison and branch (potentially invalidating any data in the cache) by skipping many instructions. If the branch doesn't cull many instructions, predication may (and frequently will) be faster.
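
The pay-off argument can be put into a toy cost model. This is only a sketch: the costs are counted in abstract instruction slots, the function names are made up, and the branch overhead is a parameter you would have to guess or measure for real hardware.

```c
/* Branching: pay a fixed overhead, then run only the taken side. */
int cost_branch(int taken_len, int branch_overhead)
{
    return branch_overhead + taken_len;
}

/* Predication: run both sides, plus one slot for the final select. */
int cost_predicated(int then_len, int else_len)
{
    return then_len + else_len + 1;
}

/* Returns 1 when branching is cheaper under this toy model. */
int branch_wins(int then_len, int else_len, int taken_then, int branch_overhead)
{
    int taken_len = taken_then ? then_len : else_len;
    return cost_branch(taken_len, branch_overhead)
         < cost_predicated(then_len, else_len);
}
```

With one instruction per side and an assumed overhead of 8 slots, predication wins easily; with 40 instructions per side, the branch pays for itself.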
 
The performance of a naive implementation of eager execution of branches deteriorates exponentially. Imagine an execution pipeline of n stages; now imagine that you execute a branch on every cycle. With both sides of every unresolved branch in flight, only 1 in 2^n of the work is on the correct path, so you get 1/2^n of the performance. That's why modern processors spend a lot of transistors on branch prediction.
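
A small C sketch of that arithmetic, assuming every branch is two-way and none is resolved until it leaves the pipeline. The function names are made up for illustration.

```c
/* Number of paths in flight after `depth` unresolved two-way branches.
   Assumes depth < 32 so the shift cannot overflow. */
unsigned long paths_in_flight(unsigned depth)
{
    return 1UL << depth;
}

/* Fraction of the executed work that lies on the correct path. */
double useful_fraction(unsigned depth)
{
    return 1.0 / (double)paths_in_flight(depth);
}
```

Under those assumptions, even a short pipeline with three unresolved branches has eight paths in flight and does useful work only one eighth of the time.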

Eager execution may prove useful in a few cases. But I suspect most of these can benefit from either conditional execution or just a conditional assignment operation.

Cheers
Gubbi
 
It would be a win if the branch stops an expensive recursive or iterative process. Since there aren't likely to be very many of those cases in most pixel shaders, branching won't be a big win. In the rare cases where you need to do this, you can use the multipass stencil buffer trick.

A statistical analysis of renderman shader libraries would probably yield factual evidence.
 
I think a simple way of looking at why conditional execution is avoided is that whenever the path of execution, or the pattern of memory reads/writes, is not immediately obvious, the hardware can have a very hard time keeping the pipelines moving.

In essence, it's a tradeoff between either having the pipelines always working, though sometimes on pointless instructions, or having the pipelines spend a lot of time doing nothing.
 
Predication is probably faster when all the instructions the branch would skip are calculations.
Branching is more important when skipping memory read/write (texture access in case of PS).

I think the computing power of GPUs will be increasing faster than the available memory bandwidth, so this will be more and more true.

On the other hand, NV30 introduces partial derivative functions that are likely implemented by accessing neighbouring pipelines. This allows mipmapping and anisotropic filtering of dependent reads, a big win in image quality.
This technique is not very dynamic branching friendly...
 
Do the ddx and ddy instructions have to depend on neighboring pixel pipelines? Can't these instructions be run in isolation in one pixel pipe?
 
You can in theory compute partial derivatives (ddx/ddy) analytically in one pixel pipe as long as you perform only analytic calculations, like + - * /, dot3, sqrt, sin, cos, exp, log; this requires that the derivatives are carefully maintained through each computation, which may be expensive. For potentially discontinuous data, like the result of a conditional assignment or a texture lookup, analytic computation is not possible, and a more empirical method - taking differences on a grid of sample points - is needed. The neighboring pixel pipe method supplies such a grid for free; developing such a grid within a pixel pipe is not impossible as long as there are no conditional jumps in the pixel shader, but still prohibitively expensive (4x or more transistor cost).
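
The two approaches can be sketched in C as follows. This is an illustration only: the `Deriv` and `Quad` types, the helper names and the 2x2 pixel layout are all assumptions, and nothing here is taken from how NV30 actually implements ddx/ddy.

```c
/* Analytic route: carry the value and both screen-space derivatives
   through every operation (hypothetical type and helpers). */
typedef struct { float val, dx, dy; } Deriv;

Deriv d_add(Deriv a, Deriv b)
{
    Deriv r = { a.val + b.val, a.dx + b.dx, a.dy + b.dy };
    return r;
}

Deriv d_mul(Deriv a, Deriv b)
{
    /* Product rule applied to each derivative. */
    Deriv r = { a.val * b.val,
                a.dx * b.val + a.val * b.dx,
                a.dy * b.val + a.val * b.dy };
    return r;
}

/* Empirical route: one quantity sampled at a 2x2 block of neighbouring
   pixels (assumed layout):
   v[0] = (x, y), v[1] = (x+1, y), v[2] = (x, y+1), v[3] = (x+1, y+1). */
typedef struct { float v[4]; } Quad;

float quad_ddx(const Quad *q) { return q->v[1] - q->v[0]; }
float quad_ddy(const Quad *q) { return q->v[2] - q->v[0]; }
```

The analytic route triples the state and adds work to every operation, while the differencing route needs nothing beyond values the neighbouring pixels have already computed. It also works on texture lookups and conditional results, where there is no formula to differentiate.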
 