Loop unrolling in NVIDIA/ATI drivers?

tcchiu · Mar 24, 2005

Does the ATI or nVidia driver unroll the loops (SM 2.0 flow control)?

Simon F · Mar 24, 2005

Re: Loop unrolling

tcchiu said:
Does the ATI or nVidia driver unroll the loops (SM 2.0 flow control)?

IIRC, given the simplicity of SM2.0 looping, it may not be necessary, perhaps apart from very small loops.

In fact, it may be counter-productive since those loops are controlled by constants which could be changed frequently. Unrolling the loop would mean you may end up reloading new code rather than just tweaking a constant.

KimB · Mar 24, 2005

Re: Loop unrolling

tcchiu said:
Does the ATI or nVidia driver unroll the loops (SM 2.0 flow control)?

If the loops can be unrolled, they are. Basically, any loop that doesn't depend upon per-vertex or per-pixel information is unrolled. This is simply because it is typically assumed that many pixels and vertices will be drawn per branch, and the branching itself will incur a performance hit. Thus it makes more sense to just eat that little bit of extra data swapping that is needed when the new shader is loaded, instead of eating some constant performance hit for each pixel/vertex.

Since hardware like the R3xx and R4xx don't support any branching at all, this means that all loops must be unrolled. Similarly, the NV3x doesn't support pixel shader branching, and any pixel shader loops must therefore be unrolled.

DemoCoder · Mar 24, 2005

It is the HLSL compiler in SM2.0 which unrolls the loops, not the driver. There is no loop instruction in SM2.0 assembly.

Ostsol · Mar 24, 2005

DemoCoder said:
It is the HLSL compiler in SM2.0 which unrolls the loops, not the driver. There is no loop instruction in SM2.0 assembly.

Well, not for pixel shaders, but for vertex shaders there seems to be. . .

VS 2.0 Instructions

MfA · Mar 24, 2005

Re: Loop unrolling

Zero-overhead looping isn't exactly rocket science; there is really no need to unroll loops on a sane architecture.

DemoCoder · Mar 24, 2005

Ostsol said:
DemoCoder said:

It is the HLSL compiler in SM2.0 which unrolls the loops, not the driver. There is no loop instruction in SM2.0 assembly.

Well, not for pixel shaders, but for vertex shaders there seems to be. . .

VS 2.0 Instructions

Yeah, I make the common mistake of using SM2.0 interchangeably with PS2.0. For me, the most interesting changes in the shader model are in the PS. Vertex texturing was nice, but until geometry shading is added, the vertex stuff is pretty boring, since I find vertex lighting boring and skinning is commodity stuff now.

Remage · Feb 8, 2007

How to disable loop unrolling (HLSL, SM3.0)?

Is there a way to disable loop unrolling (in the HLSL compiler, SM3.0)?
I'd like to achieve the shortest possible compiled shader binaries, for a 4 KB intro/demo. At build time I compile the shaders with a tool that uses D3DXCompileShaderFromFile(), and store the result.
The vertex/pixel shaders contain 2-3 loops, for a fixed number of lights, texture stages, procedural texcoord generators, etc. The HLSL compiler just unrolls the whole thing, producing a "huge" shader...
D3DXSHADER_PREFER_FLOW_CONTROL doesn't help.

mhouston · Feb 8, 2007

Make the loop bounds dynamic so it can't unroll and pass in the loop bounds via constants.

For example, instead of this for 8 lights:

for(i=0; i<8; i++)
{
...
}

use this:

for(i=0; i<lights; i++)
{
...
}

and pass in 'lights' as a shader constant at shader bind.
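
A slightly fuller sketch of the constant-driven loop, under some assumptions: it is written as a vertex shader because vertex shaders can index a constant array with the loop counter, and the names LightCount, LightDirs, MAX_LIGHTS and WorldViewProj are made up for illustration. Whether the compiler emits a real loop may still depend on the target profile and SDK version.

```hlsl
// Hypothetical names throughout; none of these come from the original posts.
#define MAX_LIGHTS 8

float4x4 WorldViewProj;
int      LightCount;              // loop bound, set by the application at bind time
float3   LightDirs[MAX_LIGHTS];   // normalized light directions

struct VS_OUT
{
    float4 pos   : POSITION;
    float4 color : COLOR0;
};

VS_OUT main(float4 pos : POSITION, float3 normal : NORMAL)
{
    VS_OUT o;
    o.pos = mul(pos, WorldViewProj);

    float diffuse = 0;

    // The trip count comes from a constant, so it is unknown at compile
    // time and the HLSL compiler cannot fully unroll the loop.
    for (int i = 0; i < LightCount; i++)
    {
        diffuse += saturate(dot(normal, LightDirs[i]));
    }

    o.color = float4(diffuse.xxx, 1);
    return o;
}
```

In D3D9 the application would typically upload LightCount as an integer constant (for example with SetVertexShaderConstantI) before drawing.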

Humus · Feb 8, 2007

Use the [loop] attribute on loops and [branch] on branches.

Code:

float4 main(float4 texCoord: TEXCOORD0) : COLOR {
    [loop]
    for (int i = 0; i < 8; i++){
        texCoord += 3.7 * texCoord.wzyx;
    }
    return texCoord;
}

Before:

Code:

ps_3_0
def c0, 3.70000005, 0, 0, 0
dcl_texcoord v0
mad r0, v0.wzyx, c0.x, v0
mad r0, r0.wzyx, c0.x, r0
mad r0, r0.wzyx, c0.x, r0
mad r0, r0.wzyx, c0.x, r0
mad r0, r0.wzyx, c0.x, r0
mad r0, r0.wzyx, c0.x, r0
mad r0, r0.wzyx, c0.x, r0
mad oC0, r0.wzyx, c0.x, r0

After:

Code:

ps_3_0
def c0, 0, 3.70000005, 0, 0
defi i0, 8, 0, 0, 0
dcl_texcoord v0
mov r0, v0
rep i0
 mad r0, r0.wzyx, c0.y, r0
endrep
mov oC0, r0

I think you'll have to use a recent version of the SDK though for the compiler to support these [] attributes.

mhouston · Feb 8, 2007

Yes, but the vendor compilers will sometimes unroll the loop themselves if it's statically analyzable... Making it dynamic will avoid this, unless they recompile on parameter changes.

Dee.cz · Feb 8, 2007

tcchiu said:
Does the ATI or nVidia driver unroll the loops (SM 2.0 flow control)?

IMHO, GLSL fixed-length loops that access array elements were broken on ATI for a long time, and possibly still are, so it's necessary to unroll them manually in source code.

JohnH · Feb 8, 2007

mhouston said:
Yes, but the vendor compilers will sometimes unroll the loop themselves if it's statically analyzable... Making it dynamic will avoid this, unless they recompile on parameter changes.

Making a loop dynamic just to avoid unrolling will prevent HW that actually supports zero cost static looping from taking advantage of it. Basically not a good thing to do.

John.

mhouston · Feb 8, 2007

I agree, but the original question was how to prevent loop unrolling. In general, I prefer the compilers, be it fxc or the vendor compilers, to unroll the code and make their own decisions on predication vs. branching for their own hardware. Sadly, both Nvidia and ATI routinely have compiler bugs when optimizing long shaders for their own hardware, which sometimes forces us to figure out workarounds like this to get correctness. Granted, we do tend to give them massive shaders. FXC also has tons of bugs and performance issues when you start pushing the limit. For example, we have a few shaders that take *hours* to compile with fxc.

Remage · Feb 8, 2007

Some progress...

Humus said:
Use the [loop] attribute on loops and [branch] on branches.
I think you'll have to use a recent version of the SDK though for the compiler to support these [] attributes.

Thanks, that kinda helps... It seems to work, my test shader has a real loop now.
However, it doesn't work with my original shader (which compiled nicely with the December 2006 SDK, but had unrolled loops); now I get a lot of strange error messages from the HLSL compiler...

It seems "PixelShader" and "VertexShader" became some kind of keywords (shouldn't these be case sensitive?), at least the compiler gives an "error X3000: syntax error: unexpected token 'VertexShader'" message when I name my functions like that.

Another one: "error X5300: Invalid register number: 11. Max allowed for v# register is 9.", on the line where the halfvector is calculated (pixel shader code cropped; I can quote the whole shader if that helps):

Code:

	float diffuse_light = 0;
	float specular_light = 0;
	float3 view_normal = normalize( input.ViewNormal );
	[loop]
	for ( int i = 0; i < 2; i++ )
	{
		float3 view_light = normalize( input.ViewLights[ i ] );
		diffuse_light += saturate( dot( view_normal, view_light ));
		float3 halfvector = normalize( view_light - normalize( input.Position.xyz ));
		specular_light += pow( dot( view_normal, halfvector ), 64.0 );
	}

The compilation stops here, so I can't check what the resulting shader would be.

Any ideas why I get this error message?

Humus · Feb 9, 2007

mhouston said:
Yes, but the vendor compilers will sometimes unroll the loop themselves if it's statically analyzable... Making it dynamic will avoid this, unless they recompile on parameter changes.

Well, the way I understood the problem was that he wanted to keep the loops to keep the binary shader small for a 4KB demo. What the vendor compiler does would thus not be relevant.

Remage said:
Any ideas why I get this error message?

No, but if you post the whole shader I might be able to help.

Remage · Feb 9, 2007

The shader

The basic idea is to mix axis-aligned-mapped 2D textures, so any arbitrary generated geometry (organic-like: metaballs, or 4D julia sets) will have a nice material.

Code:

struct VS_INPUT
{
	float4 Position		: POSITION0;
	float3 Normal		: NORMAL0;
};

struct VS_OUTPUT
{
	float4 ProjPosition		: POSITION;
	float4 Position			: TEXCOORD0;
	float3 Normal			: TEXCOORD1;
	float3 ViewNormal		: TEXCOORD2;
	float3 ViewLights[ 2 ]	: TEXCOORD3;
	float2 TexCoords[ 3 ]	: TEXCOORD5;
};

struct PS_OUTPUT
{
	float4 Color		: COLOR0;
};

float4x4 Projection			: register( c0 );
float4x4 WorldViewTransform	: register( c4 );

float4 Material_Color1		: register( c8 );
float4 Material_Color2		: register( c9 );
float4 Material_Specular	: register( c10 );

float3 LightVectors[ 2 ]	: register( c11 );

float4 Reg13				: register( c13 );
#define Time				Reg13.x
#define Global_Ambient		Reg13.y
#define Global_Diffuse		Reg13.z
#define Global_Specular		Reg13.w

float4 TexScale				: register( c14 );

// --------------------------------------------------------------------------------------------------------------------
//	VertexShader
// --------------------------------------------------------------------------------------------------------------------

#ifdef VERTEXSHADER

VS_OUTPUT __VertexShader( VS_INPUT input )
{
	VS_OUTPUT output;

	// Position & normal

	output.Position = mul( input.Position, WorldViewTransform );
	output.ProjPosition = mul( output.Position, Projection );

	output.Normal = input.Normal;
	output.ViewNormal = mul( input.Normal, WorldViewTransform );

	// Light sources

	[unroll]
	for ( int i = 0; i < 2; i++ )
	{
		output.ViewLights[ i ] = mul( LightVectors[ i ], WorldViewTransform );
	}

	// Generate texcoords

	[unroll] // FIXME: Indexing of l-values are not supported?
	for ( int i = 0; i < 3; i++ )
	{
		output.TexCoords[ i ] = TexScale.xy * input.Position.xy;
		output.TexCoords[ i ] += TexScale.w * ( Time + input.Position.z );
		input.Position.xyz = input.Position.yzx;
		TexScale.xyz = TexScale.yzx;
	}

	return output;
}

#endif

// --------------------------------------------------------------------------------------------------------------------
//	PixelShader
// --------------------------------------------------------------------------------------------------------------------

#ifdef PIXELSHADER

sampler2D Texture1;
sampler2D Texture2;

PS_OUTPUT __PixelShader( VS_OUTPUT input )
{
	PS_OUTPUT output;

	// Light sources

	float diffuse_light = 0;
	float specular_light = 0;

	float3 normal = normalize( input.Normal );
	float3 view_normal = normalize( input.ViewNormal );

	[loop]
	for ( int i = 0; i < 2; i++ )
	{
		float3 view_light = normalize( input.ViewLights[ i ] );

		float3 halfvector = normalize( view_light - normalize( input.Position.xyz ));
		specular_light += pow( dot( view_normal, halfvector ), 64.0 );
		diffuse_light += saturate( dot( view_normal, view_light ));
	}

	// Material: texture & colors

	float4 texture_map = 0;

	[loop]
	for ( int i = 0; i < 3; i++ )
	{
		float4 texture_1 = tex2D( Texture1, input.TexCoords[ i ] );
		float4 texture_2 = tex2D( Texture2, input.TexCoords[ i ] );
		texture_map += texture_1 * texture_2 * abs( input.Normal.z );
		normal.xyz = normal.yzx;
	}

	float4 diffuse_color =
		lerp( Material_Color1, Material_Color2, texture_map );

	float4 specular_color =
		Material_Specular * texture_map * saturate( -view_normal.z );

	// Compute final color

	output.Color = 
		Global_Ambient * diffuse_color +
		Global_Diffuse * diffuse_color * diffuse_light + 
		Global_Specular * specular_color * specular_light;

	// Per-pixel distance-fog

	output.Color = lerp( output.Color, 0.933333333, saturate( abs( input.Position.z ) * 0.02f ));

	return output;
}

#endif

Humus · Feb 10, 2007

I made some attempts to get that working yesterday, but really struggled. This looks like a D3D compiler bug: it's trying to use more interpolators than are available.
