Problems to make optimized logic to work?

51mon

Newcomer
Problems to make optimized logic to work? [SOLVED]

I solved the problem by changing tex2D to tex2Dproj. The behaviour seems inconsistent to me (tex2D worked fine with the if-else approach), but the problem is solved, so I'm happy :)
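
For reference, a minimal sketch of what that switch changes: tex2Dproj divides the coordinate by its w component before the lookup, while tex2D uses the coordinate as given. The helper name below is hypothetical, and whether this divide is what made the indexed version work is not established in this thread.

```hlsl
// Hypothetical helper, not part of the original shader.
// Samples s at t.xy / t.w, which is the lookup tex2Dproj( s, t) performs.
float4 SampleProjected( sampler2D s, float4 t)
{
	return tex2D( s, t.xy / t.w);
}
```

If w is 1 for every light-space position (for example with a purely orthographic light matrix), the two lookups read the same texel.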

Hi
I have implemented a cascaded shadow map (CSM) system and have a problem getting the logic to work in the pixel shader. The target is DX9. This is the code I intended to use:
Code:
	float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
	int index = comparision.x + comparision.y + comparision.z;

	shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[index]);

It doesn't work, although the application runs. Strangely enough, the following code, which is logically equivalent but much less efficient, does work:
Code:
	float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
	int index = comparision.x + comparision.y + comparision.z;
	
	if( !index)
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[0]);
	else if( index == 1)
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[1]);
	else if( index == 2)
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[2]);
	else
		shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[3]);

The input to the pixel shader is defined as:
Code:
typedef float4 lightspacepos[CSM_NUM_SPLITS];
struct VS_OUTPUT
{
	float4 Position				: POSITION;
	lightspacepos LightSpacePos		: TEXCOORD0;
	float4 TextureUV			: TEXCOORD4;
};

I tried other approaches to the logic, such as switch-case, without success. I stepped through the code in PIX and everything seemed all right. I see the same behaviour on both NVIDIA and ATI cards. Does anybody know why this happens and how to solve it?

Thank you very much.
 
Selecting the cascade by range does not give the best result. A better way is to transform the coordinate by the most detailed cascade's matrix first and check whether the resulting texture coordinate lies inside the [0,1] bounds. For those pixels you can use the most detailed cascade. For this test I simply use "any(floor(texCoord.xyz))", which evaluates to zero when the texture coordinate is inside the bounds. This way each pixel always uses the most detailed cascade that covers it, which improves the quality nicely.

Using four different matrices is not the best way either. It is better to store the coordinate shifts and scales for the other cascades as pixel shader constants. Scaling and shifting the resulting texcoord is cheap (one mad), and then you do the "any(floor(texCoord.xyz))" test again.

The third optimization is to use a single larger texture map instead of a sampler array (for example 1024x4096 for four 1024x1024 cascades). Instead of selecting a sampler, you offset texcoord.y in steps of 0.25 depending on the texcoord tests.
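
The three suggestions above could be combined roughly as follows. This is only a sketch under stated assumptions: the constant names, the function name and the atlas layout (four cascades stacked along v) are hypothetical, and the input is assumed to be already transformed by the most detailed cascade's matrix into texture space.

```hlsl
// Hypothetical constants, not from the thread: scale and shift that map a
// coordinate in the most detailed cascade into cascades 1..3.
float3 g_vecCascadeScale[3];
float3 g_vecCascadeShift[3];

// Returns the coordinate to sample in a shadow atlas that stacks four
// cascades along v (e.g. 1024x4096); z carries the depth to compare.
float3 SelectCascadeCoord( float3 coord0)
{
	// One mad per extra cascade instead of one matrix transform each.
	float3 coord1 = coord0 * g_vecCascadeScale[0] + g_vecCascadeShift[0];
	float3 coord2 = coord0 * g_vecCascadeScale[1] + g_vecCascadeShift[1];
	float3 coord3 = coord0 * g_vecCascadeScale[2] + g_vecCascadeShift[2];

	// Start from the coarsest cascade and let each more detailed cascade
	// override it when the point lies inside that cascade's [0,1) bounds.
	// any( floor( c)) is false only if every component of c is in [0,1).
	float3 coord = coord3;
	float slice = 3.0f;
	if( !any( floor( coord2))) { coord = coord2; slice = 2.0f; }
	if( !any( floor( coord1))) { coord = coord1; slice = 1.0f; }
	if( !any( floor( coord0))) { coord = coord0; slice = 0.0f; }

	// Squeeze v into the quarter of the atlas that belongs to the cascade.
	coord.y = ( coord.y + slice) * 0.25f;
	return coord;
}
```

Conditional assignments this simple are typically flattened by the compiler into select instructions rather than real branches. The 0.25 scale and offset could also be folded into the per-cascade constants so that the last line disappears.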
 
Hmm, interesting input. I will try to utilise the whole map. I'm already using a single texture containing the 4 CSM levels. The code looks like this:

Code:
typedef float4 lightspacepos[CSM_NUM_SPLITS];
struct VS_OUTPUT
{
	float4 Position				: POSITION;
	lightspacepos LightSpacePos		: TEXCOORD0;
	float4 TextureUV			: TEXCOORD4;
};

VS_OUTPUT RenderSceneVS( float4 vecPos : POSITION, 
			float3 vecNormal : NORMAL,
			float2 vecTexCoord0 : TEXCOORD0)
{
	VS_OUTPUT Output;

	Output.Position = mul( vecPos, g_mtxWorldViewProjection);

	Output.TextureUV.xy = vecTexCoord0;
	Output.TextureUV.z = saturate( dot( vecNormal, g_vecSunDir));
	Output.TextureUV.w = Output.Position.w;
	
	Output.LightSpacePos[0] = mul( vecPos, mtxLightReadback[0]);
	Output.LightSpacePos[1] = mul( vecPos, mtxLightReadback[1]);
	Output.LightSpacePos[2] = mul( vecPos, mtxLightReadback[2]);
	Output.LightSpacePos[3] = mul( vecPos, mtxLightReadback[3]);

	return Output;
}


struct PS_OUTPUT
{
	float4 RGBColor : COLOR0;
};

PS_OUTPUT RenderSceneNvidiaPS( VS_OUTPUT In) 
{
	PS_OUTPUT Output;
	float shadow;
	
	float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
	int index = comparision.x + comparision.y + comparision.z;
	
	// The lookup doesn't work even though index contains the correct value
	shadow = tex2D( ShadowBufferSampler, In.LightSpacePos[index]);

	float4 colourRead = tex2D( MeshTextureSampler, In.TextureUV.xy);
	Output.RGBColor.rgb = shadow * colourRead * In.TextureUV.z;
    
	Output.RGBColor.a = shadow * In.TextureUV.z;
	
	return Output;
}

I don't get why the indexing doesn't work?
 
SM3.0 doesn't support arbitrary indexing in the pixel shader. The only dynamic indexing supported is looking up interpolators from a loop counter. You're indexing with a computed value. So it'll expand to a bunch of if-statements anyway. Should still work though, but you could probably find a better implementation yourself than relying on compiler generated emulation of indexing.
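
One hand-written alternative to the computed index, as a sketch: since the comparison results are already 0 or 1, the interpolator can be picked with a chain of lerps in which every array index is a literal. The function name is hypothetical, and the code assumes the split distances increase, so the comparison vector is always (0,0,0), (1,0,0), (1,1,0) or (1,1,1).

```hlsl
// Hypothetical helper, not from the thread.
// passed holds 1.0 where the view depth is beyond a split distance and
// 0.0 otherwise, like the "comparision" variable in the original shader.
float4 SelectLightSpacePos( float4 pos[4], float3 passed)
{
	// Each lerp either keeps the current choice (weight 0) or replaces it
	// with the next cascade (weight 1); no dynamic indexing is needed.
	float4 result = pos[0];
	result = lerp( result, pos[1], passed.x);
	result = lerp( result, pos[2], passed.y);
	result = lerp( result, pos[3], passed.z);
	return result;
}
```

In the pixel shader this would be used as SelectLightSpacePos( In.LightSpacePos, comparision) in place of In.LightSpacePos[index].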
 
I wrote an alternative implementation that builds a mask and then uses a matrix operation on the 4 texture coordinate sets to pick the right one. When I later compiled the fx file to assembly with fxc to compare the "new" and "old" code, I found that the compiler emulates the indexing with the same method. This is what the compiler produced:

Code:
add r0.xyz, c0, -v4.w		// float3 comparision = In.TextureUV.www > g_vecCsmSplitDistances.xyz;
cmp r0.xyz, r0, c1.x, c1.y
add r0.x, r0.y, r0.x		// int index = comparision.x + comparision.y + comparision.z;
add r0.x, r0.z, r0.x
add r0, r0.x, -c1		// float4 negIndex = index.xxxx - int4( 0, 1, 2, 3);
cmp r0, -r0_abs, c1.y, c1.x	// float4 mask = -abs( negIndex) >= 0.0f ? 1 : 0;

dp4 r1.x, v0, r0		// float4 shadowTexCoord = mul( mask, In.LightSpacePos);
dp4 r1.y, v1, r0
dp4 r1.z, v2, r0
dp4 r1.w, v3, r0
texldp r0, r1, s0		// shadow = tex2Dproj( ShadowBufferSampler, shadowTexCoord);

As long as this is the code that actually runs, I think it's fairly efficient.
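
A minimal HLSL sketch of the mask idea, reconstructed from the comments in the listing rather than taken from the actual source. The function name is hypothetical, and the selection is written as an explicit weighted sum so that it is unambiguous that a whole float4 is picked.

```hlsl
// Hypothetical helper, not the original implementation.
// index is expected to hold a whole number in the range 0..3.
float4 SelectByMask( float4 pos[4], float index)
{
	// 1.0 in the component whose position equals index, 0.0 elsewhere:
	// the distance to the matching slot is 0, to every other slot >= 1.
	float4 mask = 1.0f - saturate( abs( index - float4( 0, 1, 2, 3)));

	// Weighted sum of the four candidates; only one weight is non-zero.
	return mask.x * pos[0] + mask.y * pos[1]
	     + mask.z * pos[2] + mask.w * pos[3];
}
```

The result would then be passed to tex2Dproj( ShadowBufferSampler, ...) as in the last line of the listing.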



Thanks for the link. With the method described there, you wouldn't be able to utilize the whole shadow map in every level as described in the first reply, right?
 
Right, if you want to use the depth bounds test, your shadow partition planes must be orthogonal to the view vector.
 
When you render all cascades into the same larger shadow map (by changing the viewport, or by resolving the render target to a different texture rectangle) you do not need indexable samplers. With the shift+scale orthogonal projection trick you do not need to index constants (matrices). With a couple of "any" + "mad" instructions to calculate the texture coordinate you do not need any dynamic branching either. There is no multipassing (no extra vertex load), just a few extra ALU instructions in the shader and no extra sampling at all. Because of this, the technique is suitable even for the lowest-end SM2.0 cards. The technique described by nAo is not suitable for all DX9 cards, and it only provides a performance advantage if you sample a different number of texels in each split (a feature that would otherwise require dynamic branching). With ESM or VSM filtering you only need one sample per pixel (the shadow map itself is blurred).
 