SM3 vs dynamic branching

Chalnoth said:
There's no way that DX10 specifies that each pixel needs its own instruction counter and whatnot. This is a hardware-specific implementation detail. I don't expect any graphics hardware to have single-pixel granularity on branches, ever.
I was talking about the direction things are going, not stating that DX10 will be there already.

What do you think about it? Sure, it might be hard to get it into hardware, as an evolution from the current chips. But what direction should it go?
 
Well, there's got to be a good middle ground as far as dynamic branching performance is concerned. I think that future hardware will have no overhead on branching instructions themselves (overhead that the NV4x currently has), and will converge on somewhere in the range of 16-32 pixel granularity.

Regardless, though, due to the way the hardware works, there's just no realistic way to drop the granularity below the quad level (because of texturing). So 4-pixel granularity is the absolute minimum. Precisely where the granularity ends up in the future depends upon both the hardware and the software of the time. If you have smaller granularity, you will have better dynamic branching performance, but will have to sacrifice transistors elsewhere, dropping overall performance. If you have larger granularity, then you will have worse dynamic branching performance, but will be able to spend more transistors elsewhere to improve overall performance.

So, IHVs are just going to have to go through in-depth examinations of dynamic branching usage to determine the best balance.
 
Chalnoth said:
Regardless, though, due to the way the hardware works, there's just no realistic way to drop the granularity below the quad level (because of texturing). So 4-pixel granularity is the absolute minimum.
While there are reasons to keep granularity at quad-level or above (which are mostly related to instruction fetch/decode and thread management), texturing isn't one. DX9 doesn't allow the calculation of gradients inside branches, and I think there is no reason for D3D10 to change that. It makes no sense to take both branches for pixels just to calculate a gradient that is meaningless. Taking random gradients would be just as accurate.
That flow control will never have one-pixel granularity is a pretty bold statement IMO.
 
Xmas said:
It makes no sense to take both branches for pixels just to calculate a gradient that is meaningless.
There will be many branching situations where the gradient will not be meaningless. Just consider a quad where two pixels are occluded by another surface. Many branches could be thought of in a similar way: you're performing one pixel shader up to some boundary, then performing some different shader.
 
Xmas said:
While there are reasons to keep granularity at quad-level or above (which are mostly related to instruction fetch/decode and thread management), texturing isn't one.
How do you aniso then? If you don't have information from neighbouring fragments, you can only compute derivatives analytically. Although that sounds appealing at first, you need to realize that this breaks down when you have dependent texture reads, or when you perform some computation on texture coordinates before the texture fetch.

One alternative would be to allow pixel engines to run independently, but then wait for their neighbours on each texture instruction. However, this means that the pixels will be interlocked most of the time. It also means that you effectively need neighbouring pixels to run in neighbouring pixel engines. Thus all that extra hardware you added for independent instruction control/issue is effectively wasted.

(And I'm not even going to go into the extra difficulty of register file design, scheduling, serialization, flow control, etc.)
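To illustrate the dependent-read point with a hedged HLSL sketch (the sampler names are invented): the coordinates of the second fetch come out of the first texture, so no compile-time analysis can know their screen-space derivatives; only differencing across neighbouring pixels recovers them.
Code:
sampler2D indirectionMap;
sampler2D detailMap;

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    // First fetch: derivatives of uv are known (interpolated input)
    float2 indirect = tex2D(indirectionMap, uv).xy;

    // Dependent fetch: d(indirect)/dx and d(indirect)/dy depend on the
    // *contents* of indirectionMap, so they cannot be derived
    // analytically; the HW gets them by differencing across the quad
    return tex2D(detailMap, indirect);
}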
 
I'm pretty sure it would be more effective to cache texture fetches in blocks in any case, as it's very likely that the texels around the addressed one will be needed as well. And for AA, you need those buffers too.
 
Just thinking out loud, but here's another alternative:

Given a shader program that shades a single fragment, you produce a shader program that shades 4 fragments (i.e. you move all the HW logic for interlocking the individual pixels of a quad on any kind of gradient instruction into SW).

At the expense of code bloat, some (don't really know how much) software overhead, and probably extra threads to feed the execution units, you can get single-fragment branching granularity.

The idea is to expose the fact that a shader operates on a quad of "adjacent" inputs to SW, thereby punting on the HW complexity of thread communication completely, and giving programmers free rein as to how to intermix computation inside of a quad. For example, some quantities (like texture LOD) could be computed per-quad, some per pixel, and, if the appropriate output options are available, some could be computed on a sub-pixel (e.g. per sample) basis.
 
If you can get pixel-percentage coverage (sub-pixel granularity) with unified shaders, you essentially also have REYES.
 
Bob said:
How do you aniso then? If you don't have information from neighbouring fragments, you can only compute derivatives analytically. Although that sounds appealing at first, you need to realize that this breaks down when you have dependent texture reads, or when you perform some computation on texture coordinates before the texture fetch.
In DX9 you simply can't calculate derivatives in a branch. If you need them, you have to calculate them beforehand for every pixel and use texldd in the branch. I guess it is possible to let the compiler automatically do this so you don't have to worry about it when you write HLSL code, but then it might happen that the savings from branching are smaller than you expect.

As long as you know the derivative of the function you're applying to the texture coordinates, you can still compute derivatives analytically.
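As a minimal HLSL sketch of that pattern (the samplers and inputs are invented for illustration): the gradients are computed before the branch, where every pixel in the quad still runs in lockstep, and tex2Dgrad (the HLSL counterpart of texldd) consumes them inside the branch.
Code:
sampler2D texA;
sampler2D texB;

float4 main(float2 coordsA : TEXCOORD0,
            float2 coordsB : TEXCOORD1,
            float  select  : TEXCOORD2) : COLOR
{
    // Derivatives taken outside any flow control, for every pixel
    float2 dxA = ddx(coordsA), dyA = ddy(coordsA);
    float2 dxB = ddx(coordsB), dyB = ddy(coordsB);

    [branch]
    if (select > 0.5)   // per-pixel condition
        return tex2Dgrad(texA, coordsA, dxA, dyA);
    else
        return tex2Dgrad(texB, coordsB, dxB, dyB);
}
Note how both pairs of gradients are paid for on every pixel, which is exactly why the savings from branching can end up smaller than expected.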
 
psurge said:
The idea is to expose the fact that a shader operates on a quad of "adjacent" inputs to SW, thereby punting on the HW complexity of thread communication completely, and giving programmers free rein as to how to intermix computation inside of a quad. For example, some quantities (like texture LOD) could be computed per-quad, some per pixel, and, if the appropriate output options are available, some could be computed on a sub-pixel (e.g. per sample) basis.
Ah, but what about dependencies? You'd have to be careful that all calculations feeding into any per-quad operation are themselves per-quad.
 
Chalnoth - could you clarify? The point of the idea was to move the burden of running things in the right order from the HW designer to the compiler (preferably), but maybe also to the shader author.

I.e. instead of this (grossly simplified) model:
Code:
struct InputForOneFragment
{
    int x, y, vface;
    vec4 whatever;
    vec4 andSoOn;
    //...
};

struct OutputForOneFragment
{
    bool valid;
    vec4 aResult;
    //...
};

OutputForOneFragment out = ShaderProgram(InputForOneFragment in);
you would have
Code:
OutputForOneFragment out[N] = ShaderProgram(InputForOneFragment in[M]);
with most probably N=M=4
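To make that concrete, here is a hedged pseudocode sketch of the compiled 4-wide shader (the uv field, sampleLod(), someTexture and the quad layout are all invented for illustration): the per-quad gradient becomes an explicit difference between neighbouring fragments' inputs, after which each fragment is free to branch on its own.
Code:
// Hypothetical compiled form, N = M = 4
// Assumed layout: in[0] = top-left, in[1] = top-right,
//                 in[2] = bottom-left, in[3] = bottom-right
void ShaderProgram(InputForOneFragment in[4], OutputForOneFragment out[4])
{
    // Per-quad: screen-space derivatives as explicit differences,
    // folded into a single LOD (this replaces the HW quad interlock)
    vec2 ddx_uv = in[1].uv - in[0].uv;
    vec2 ddy_uv = in[2].uv - in[0].uv;
    float lod = log2(max(length(ddx_uv), length(ddy_uv)));

    // Per-pixel: no cross-fragment communication below this point,
    // so every fragment can take its own branch
    for (int i = 0; i < 4; i++)
    {
        if (in[i].whatever.x > 0.0)
            out[i].aResult = sampleLod(someTexture, in[i].uv, lod);
        else
            out[i].aResult = in[i].andSoOn;
        out[i].valid = true;
    }
}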

I agree that this would definitely broaden the available arsenal for shooting oneself in the foot...
 
Xmas said:
In DX9 you simply can't calculate derivatives in a branch. If you need them, you have to calculate them beforehand for every pixel and use texldd in the branch. I guess it is possible to let the compiler automatically do this so you don't have to worry about it when you write HLSL code, but then it might happen that the savings from branching are smaller than you expect.

As long as you know the derivative of the function you're applying to the texture coordinates, you can still compute derivatives analytically.

Yes, the compiler moves instructions that are not valid inside a branch into the code before the branch.
 
Chalnoth said:
There will be many branching situations where the gradient will not be meaningless. Just consider a quad where two pixels are occluded by another surface. Many branches could be thought of in a similar way: you're performing one pixel shader up to some boundary, then performing some different shader.
I have to admit there's truth to that. It has some serious implications for which branching behaviour is faster.

Consider the following pseudocode snippets:
Code:
# first
if some_condition:
    tempcoords = coordfunction1(coords1)
    temp = sample(tempcoords)
    ret = somefunction1(temp)
else:
    tempcoords = coordfunction2(coords2)
    temp = sample(tempcoords)
    ret = somefunction2(temp)
return ret

# second
tempcoords1 = coordfunction1(coords1)
tempcoords2 = coordfunction2(coords2)
gradients1 = deriv(tempcoords1)
gradients2 = deriv(tempcoords2)
if some_condition:
    temp = samplegrad(tempcoords1, gradients1)
    ret = somefunction1(temp)
else:
    temp = samplegrad(tempcoords2, gradients2)
    ret = somefunction2(temp)
return ret
Both do the same thing. However, the second has a bigger constant cost and saves less by taking only one branch, while the first has to execute both branches for all pixels in a quad if the branch condition is not constant throughout the quad.

So if the branch condition is usually constant over a large area, the first could potentially be much faster, while the second will be faster if the condition varies a lot and you have per-pixel branching.
 