Problems to make optimized logic to work?

51mon · Feb 18, 2009

Problems to make optimized logic to work? [SOLVED]

I solved the problem by changing tex2D to tex2Dproj. This behaviour seems slightly undefined to me (tex2D worked fine with the if-else approach), but the problem is solved, so I'm happy.

Hi
I have implemented a CSM (cascaded shadow maps) system and I'm having a problem getting the logic to work in the pixel shader. I'm using DX9. This is the code I intended to implement:

Code:

	float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
	int index = comparision.x + comparision.y + comparision.z;

	shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[index]);

But it doesn't work (although the application runs). Strangely enough, the following code works, even though it is logically equivalent and much less efficient:

Code:

	float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
	int index = comparision.x + comparision.y + comparision.z;
	
	if( !index)
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[0]);
	else if( index == 1)
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[1]);
	else if( index == 2)
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[2]);
	else
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[3]);

The input to the pixel shader is defined as:

Code:

typedef float4 lightspacepos[CSM_NUM_SPLITS];
struct VS_OUTPUT
{
	float4 Position				: POSITION;
	lightspacepos LightSpacePos		: TEXCOORD0;
	float4 TextureUV			: TEXCOORD4;
};

I tried to solve the logic with different approaches, such as switch-case, but without success. I stepped through the code in PIX and everything seemed all right. I see the same behaviour on both NVIDIA and ATI cards. Does anybody know why this happens and how to solve it?

Thank you very much.

sebbbi · Feb 18, 2009

Selecting the cascade by range does not yield the best result. A better way is to first transform the coordinate by the most detailed CSM matrix and check whether the texture coordinate is inside the [0,1] bounds. For these pixels you can use the most detailed cascade. For this test I simply use "any(floor(texCoord.xyz))". You get a result of zero if the texture coordinate is inside the bounds. This way you can always use the most detailed cascade for each pixel (improving the quality nicely).

Using four different matrices is not the best way either. Better is to store the coordinate shifts and scales for other cascades as pixel shader constants. It's easy to add and scale the result texcoord (one madd) and then do the "any(floor(texCoord.xyz))" test again.

And the third way to optimize this would be to use a single larger texture map instead of a sampler array (1024x4096, for example, for four 1024x1024 cascades). Instead of selecting the sampler, you offset texcoord.y by 0.25 depending on the texcoord tests.
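
A minimal sketch of the bounds test and the single multiply-add remap described above, limited to two cascades. The names g_vecCascade1Scale, g_vecCascade1Offset, IsOutsideCascade and SelectCascadeCoord are hypothetical and do not come from the thread. The sketch also assumes an orthographic light projection, so that cascade 1's coordinates can be reached from cascade 0's by a scale and an offset.

```hlsl
// Hypothetical pixel shader constants (not from the thread): scale and offset
// that take cascade-0 shadow coordinates to cascade-1 shadow coordinates.
float3 g_vecCascade1Scale;
float3 g_vecCascade1Offset;

// floor() is zero for components in [0,1) and non-zero everywhere else,
// so any() is true exactly when at least one component is out of bounds.
bool IsOutsideCascade( float3 texCoord)
{
	return any( floor( texCoord));
}

// Uses cascade 0 when the pixel is inside it and falls back to cascade 1 otherwise.
float3 SelectCascadeCoord( float3 coord0)
{
	// One multiply-add reaches the coarser cascade; no second matrix is needed.
	float3 coord1 = coord0 * g_vecCascade1Scale + g_vecCascade1Offset;

	return IsOutsideCascade( coord0) ? coord1 : coord0;
}
```

Note that a component of exactly 1.0 counts as out of bounds with this test, since floor(1.0) is 1.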

51mon · Feb 19, 2009

sebbbi said:
Selecting the cascade by range does not yield the best result. A better way is to first transform the coordinate by the most detailed CSM matrix and check whether the texture coordinate is inside the [0,1] bounds. For these pixels you can use the most detailed cascade. For this test I simply use "any(floor(texCoord.xyz))". You get a result of zero if the texture coordinate is inside the bounds. This way you can always use the most detailed cascade for each pixel (improving the quality nicely).

Using four different matrices is not the best way either. Better is to store the coordinate shifts and scales for other cascades as pixel shader constants. It's easy to add and scale the result texcoord (one madd) and then do the "any(floor(texCoord.xyz))" test again.

And the third way to optimize this would be to use a single larger texture map instead of a sampler array (1024x4096, for example, for four 1024x1024 cascades). Instead of selecting the sampler, you offset texcoord.y by 0.25 depending on the texcoord tests.

Hmm, interesting input. I will try to utilise the whole map. I'm already using a single texture containing 4 levels of CSM. The code looks like this:

Code:

typedef float4 lightspacepos[CSM_NUM_SPLITS];
struct VS_OUTPUT
{
	float4 Position				: POSITION;
	lightspacepos LightSpacePos		: TEXCOORD0;
	float4 TextureUV			: TEXCOORD4;
};

VS_OUTPUT RenderSceneVS( float4 vecPos : POSITION, 
			float3 vecNormal : NORMAL,
			float2 vecTexCoord0 : TEXCOORD0)
{
	VS_OUTPUT Output;

	Output.Position = mul( vecPos, g_mtxWorldViewProjection);

	Output.TextureUV.xy = vecTexCoord0;
	Output.TextureUV.z = saturate( dot( vecNormal, g_vecSunDir));
	Output.TextureUV.w = Output.Position.w;
	
	Output.LightSpacePos[0] = mul( vecPos, mtxLightReadback[0]);
	Output.LightSpacePos[1] = mul( vecPos, mtxLightReadback[1]);
	Output.LightSpacePos[2] = mul( vecPos, mtxLightReadback[2]);
	Output.LightSpacePos[3] = mul( vecPos, mtxLightReadback[3]);

	return Output;
}


struct PS_OUTPUT
{
	float4 RGBColor : COLOR0;
};

PS_OUTPUT RenderSceneNvidiaPS( VS_OUTPUT In) 
{
	PS_OUTPUT Output;
	float shadow;
	
	float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
	int index = comparision.x + comparision.y + comparision.z;
	
	// The lookup doesn't work even though index contains the correct value
	shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[index]);

	float4 colourRead = tex2D( MeshTextureSampler, In.TextureUV.xy);
	Output.RGBColor.rgb = shadow * colourRead * In.TextureUV.z;
    
	Output.RGBColor.a = shadow * In.TextureUV.z;
	
	return Output;
}

I don't get why the indexing doesn't work?

Humus · Feb 19, 2009

51mon said:
I don't get why the indexing doesn't work?

SM3.0 doesn't support arbitrary indexing in the pixel shader. The only dynamic indexing supported is looking up interpolators from a loop counter. You're indexing with a computed value. So it'll expand to a bunch of if-statements anyway. Should still work though, but you could probably find a better implementation yourself than relying on compiler generated emulation of indexing.

nAo · Feb 20, 2009

If your target hw supports depth bounds test you can do this:

http://pixelstoomany.wordpress.com/...r-filtering-on-deferred-cascaded-shadow-maps/

51mon · Feb 20, 2009

Humus said:
SM3.0 doesn't support arbitrary indexing in the pixel shader. The only dynamic indexing supported is looking up interpolators from a loop counter. You're indexing with a computed value. So it'll expand to a bunch of if-statements anyway. Should still work though, but you could probably find a better implementation yourself than relying on compiler generated emulation of indexing.

I did an alternative implementation by creating a mask and then doing a matrix operation on the 4 texture coordinate sets to obtain the right one. Later I compiled the fx file into assembly (with fxc) in order to compare the "new" and "old" code, and I found that the compiler was emulating the indexing with the same method. This is what the compiler produced:

Code:

add r0.xyz, c0, -v4.w		// float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
cmp r0.xyz, r0, c1.x, c1.y
add r0.x, r0.y, r0.x		// int index = comparision.x + comparision.y + comparision.z;
add r0.x, r0.z, r0.x
add r0, r0.x, -c1		// float4 negIndex = index.xxxx - int4( 0, 1, 2, 3);
cmp r0, -r0_abs, c1.y, c1.x	// float4 mask = -abs( negIndex) >= 0.0f ? 1 : 0;

dp4 r1.x, v0, r0		// float4 shadowTexCoord = mul( mask, In.LightSpacePos);
dp4 r1.y, v1, r0
dp4 r1.z, v2, r0
dp4 r1.w, v3, r0
texldp r0, r1, s0		// shadow = tex2Dproj( ShadowBufferSampler, shadowTexCoord);

As long as this is the code that actually runs, I think it's fairly efficient.
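
For reference, the mask approach described above could be written in HLSL source roughly as follows. This is a sketch, not the poster's actual code: the function name SampleCascade and the declared types of ShadowBufferSampler and g_vecCsmSplitDistances are assumptions, and the cascade positions are taken as a float4 array (as in the VS_OUTPUT shown earlier), so the selection is written as a weighted sum.

```hlsl
#define CSM_NUM_SPLITS 4

// Names taken from the thread; the declared types are assumptions.
sampler2D ShadowBufferSampler;
float4 g_vecCsmSplitDistances;

// Hypothetical helper: picks the cascade coordinate without dynamic indexing.
float4 SampleCascade( float4 lightSpacePos[CSM_NUM_SPLITS], float viewDepth)
{
	// 1.0 for every split distance that lies nearer than the pixel.
	float3 comparison = viewDepth.xxx > g_vecCsmSplitDistances.xyz;
	float index = comparison.x + comparison.y + comparison.z;

	// One-hot mask: 1.0 in the component that matches index, 0.0 elsewhere.
	float4 mask = ( index.xxxx == float4( 0.0f, 1.0f, 2.0f, 3.0f));

	// Weighted sum using constant indices only, so nothing is indexed dynamically.
	float4 coord = mask.x * lightSpacePos[0]
	             + mask.y * lightSpacePos[1]
	             + mask.z * lightSpacePos[2]
	             + mask.w * lightSpacePos[3];

	// tex2Dproj divides the coordinate by coord.w before the lookup.
	return tex2Dproj( ShadowBufferSampler, coord);
}
```

In the pixel shader this would be called as SampleCascade( In.LightSpacePos, In.TextureUV.w). Unlike tex2D, which uses the xy components as they are, tex2Dproj performs the perspective divide, which corresponds to the texldp instruction in the listing.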

nAo said:
If your target hw supports depth bounds test you can do this:

http://pixelstoomany.wordpress.com/...r-filtering-on-deferred-cascaded-shadow-maps/

Thanks for the link. With the method described, you wouldn't be able to utilize the whole shadow map at every level as described in the first reply, right?

nAo · Feb 21, 2009

51mon said:
Thanks for the link. With the method described, you wouldn't be able to utilize the whole shadow map at every level as described in the first reply, right?

Right, if you want to use the depth bounds test, your shadow partition planes must be orthogonal to the view vector.

sebbbi · Feb 21, 2009

When you render all cascades to the same larger shadow map (by changing the viewport, or by resolving the render target to a different texture rectangle) you do not need indexable samplers. With the shift+scale orthogonal projection trick you do not need to index constants (matrices). With a couple of "any" + "mad" instructions to calculate the texture coordinate you do not need any dynamic branching either. There is no multipassing (no extra vertex load), and just a few extra ALU instructions in the shader (no extra sampling at all). Because of this, the technique is suitable even for the lowest-end SM2.0 cards. The technique described by nAo is not suitable for all DX9 cards, and only provides a performance advantage if you are sampling a different number of texels on each split (a feature that would otherwise require dynamic branching). With ESM or VSM filtering you only need one sample per pixel (the shadow map itself is blurred).
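
Putting these pieces together for four cascades might look like the sketch below. All names here (g_vecCascadeScale, g_vecCascadeOffset, ShadowAtlasSampler, SampleCascadeAtlas) are hypothetical, and the atlas layout (four cascades stacked along v, each taking a quarter of the height) is an assumption based on the 1024x4096 example given earlier in the thread.

```hlsl
// Hypothetical constants (not from the thread): per-cascade scale and offset
// relative to cascade 0. Entry 0 is unused and kept only so that the array
// index matches the cascade number.
float3 g_vecCascadeScale[4];
float3 g_vecCascadeOffset[4];

// Assumed atlas: four cascades stacked along v, a quarter of the height each.
sampler2D ShadowAtlasSampler;

float4 SampleCascadeAtlas( float3 coord0)
{
	// One multiply-add per coarser cascade instead of one matrix per cascade.
	float3 coord1 = coord0 * g_vecCascadeScale[1] + g_vecCascadeOffset[1];
	float3 coord2 = coord0 * g_vecCascadeScale[2] + g_vecCascadeOffset[2];
	float3 coord3 = coord0 * g_vecCascadeScale[3] + g_vecCascadeOffset[3];

	// 1.0 when the pixel misses the cascade, 0.0 when it is inside.
	float miss0 = any( floor( coord0));
	float miss1 = any( floor( coord1));
	float miss2 = any( floor( coord2));

	// Arithmetic select of the first cascade that contains the pixel;
	// cascade 3 is the fallback when cascades 0..2 all miss.
	float3 coord = lerp( coord0, lerp( coord1, lerp( coord2, coord3, miss2), miss1), miss0);
	float index = miss0 * ( 1.0f + miss1 * ( 1.0f + miss2));

	// Squeeze v into the selected quarter of the atlas.
	float2 atlasUV = float2( coord.x, ( coord.y + index) * 0.25f);
	return tex2D( ShadowAtlasSampler, atlasUV);
}
```

The selection uses only constant array indices and a single sampler, so it needs neither dynamic indexing nor dynamic branching. The depth comparison (or the ESM/VSM evaluation) against coord.z is left out of the sketch.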
