ShaderX5: CSM and pixel shader register indexing

Hi all,

I'm trying to optimize my CSM implementation and I have a question about indexing constant registers in the pixel shader. In the ShaderX5 CSM article, Wolf shows how to do the shadowing pass for all splits (max. 4) at once by indexing the proper light MVP matrix in the pixel shader:

float4x4 lmvp[N]; // N is the number of splits

float4 zGreater = (startShadows < CamDistance); // one 0/1 component per split
float mapToUse = dot(zGreater,1.0);
float4 shadowCoord = mul( lmvp[(int)(mapToUse-1)], positionLocalSpace);

The problem is that constant register indexing in the pixel shader does not compile to 1 assembler instruction but to 5 instructions per matrix row (4 cmp, used to copy the proper constant, and 1 dp4 for the matrix row by vector multiply). This doubles the number of arithmetic instructions in my pixel shader (from 19 to 38).
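
For reference, the selection arithmetic in the snippet above can be checked on the CPU. This is only a minimal Python sketch, not the article's code: the function name `split_index` and the split distances are made up for illustration. It mimics the comparison producing a 0/1 vector and the dot product with 1.0 counting how many splits start in front of the pixel.

```python
def split_index(start_shadows, cam_distance):
    """Emulate: zGreater = startShadows < CamDistance; mapToUse = dot(zGreater, 1.0)."""
    # Per-component comparison: 1.0 where the split starts before the pixel.
    z_greater = [1.0 if s < cam_distance else 0.0 for s in start_shadows]
    # Dot with a vector of ones counts the splits already entered.
    map_to_use = sum(z * 1.0 for z in z_greater)
    # The shader indexes lmvp with (int)(mapToUse - 1).
    return int(map_to_use - 1)

# Hypothetical split start distances (the first split starts at distance 0).
starts = [0.0, 10.0, 30.0, 80.0]
indices = [split_index(starts, d) for d in (5.0, 10.5, 45.0, 200.0)]
print(indices)
```

The index itself is cheap to compute this way; the cost described above comes from using it to address the constant array.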

I've been thinking of an approach to reduce the number of instructions in the pixel shader:

1. Compute all N lmvp*positionLocalSpace products in the vertex shader and write the results to N vertex shader output streams.
2. Then index the proper pixel shader input stream in the pixel shader.

My questions are:

1. Is there any way to reduce the number of arithmetic instructions in the pixel shader?
2. Is it possible to index pixel shader input streams?

Thanks in advance,

Sergi
 
Perhaps yes and no.

The input streams cannot be indexed, but you can still use cmp to go over all the streams and pick the right one.

I don't know why you have to do the transform per pixel. If you can do the job per vertex, the pixel shader length could be reduced significantly.
 
Because a single triangle could span potentially *every* single shadow map split, and each portion would need to have a different transformation applied. This can be done fairly efficiently with a geometry shader, but doing it per-pixel isn't actually that bad in my experience (a few dots are nothing nowadays).

I'm curious as to why indexing a constant array is producing so many operations, but even so, is it actually performing poorly? I wouldn't get too hung up on instruction counts until after you've profiled and found hot spots.
 
Because a single triangle could span potentially *every* single shadow map split, and each portion would need to have a different transformation applied. This can be done fairly efficiently with a geometry shader, but doing it per-pixel isn't actually that bad in my experience (a few dots are nothing nowadays).

I'm not very familiar with the parallel split algorithm, but my thought is that, since the transform matrices for each split are globally constant, the transforms can be done in the vertex shader and the pixel shader only has to select the right result for each portion. Please correct me if I'm wrong :)

I'm curious as to why indexing a constant array is producing so many operations, but even so, is it actually performing poorly? I wouldn't get too hung up on instruction counts until after you've profiled and found hot spots.

Indexing into constant registers is not supported in ps_3_0, so the compiler has to generate many inefficient instructions to emulate it, and selecting a matrix ends up much more costly than the transformation itself. Real indexing into constant buffers should be supported in D3D10.
 
I'm not very familiar with the parallel split algorithm, but my thought is that, since the transform matrices for each split are globally constant, the transforms can be done in the vertex shader and the pixel shader only has to select the right result for each portion. Please correct me if I'm wrong :)
Yes you can do it like that, except you have to transform and interpolate the projections for *all* splits, so that the proper one can be selected in the fragment shader. Whether this is faster or slower than just doing it in the fragment shader will depend on a lot of things of course, but it'd probably be faster in DX9 (probably not in DX10 is my guess).

Indexing into constant registers is not supported in ps_3_0, so the compiler has to generate many inefficient instructions to emulate it, and selecting a matrix ends up much more costly than the transformation itself. Real indexing into constant buffers should be supported in D3D10.
Interesting... however I don't know how such a feature can be "emulated" given the number of constant registers available! 4 cmp's certainly wouldn't seem to be enough at first glance.
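
The per-vertex idea discussed above relies on the light transform being linear: transforming the vertices and interpolating gives the same result as interpolating the position and transforming per pixel. Below is a small Python sketch of that check. It assumes the interpolators blend linearly across the triangle, and the matrix, triangle and weights are made-up example values.

```python
def transform(m, p):
    """Multiply a 4x4 matrix (list of rows) by a column vector (x, y, z, 1)."""
    return [sum(m[r][c] * p[c] for c in range(4)) for r in range(4)]

def lerp3(a, b, c, w):
    """Barycentric blend of three vectors with weights w (summing to 1)."""
    return [w[0] * x + w[1] * y + w[2] * z for x, y, z in zip(a, b, c)]

# Hypothetical orthographic light matrix for one split (scale plus offset).
light = [[0.5, 0.0, 0.0, 0.25],
         [0.0, 0.5, 0.0, 0.25],
         [0.0, 0.0, 0.1, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
tri = [[0.0, 0.0, 1.0, 1.0], [4.0, 0.0, 2.0, 1.0], [0.0, 4.0, 3.0, 1.0]]
weights = (0.2, 0.3, 0.5)

# Path 1: transform each vertex, then interpolate (vertex shader approach).
per_vertex = lerp3(*[transform(light, v) for v in tri], weights)
# Path 2: interpolate the position, then transform (pixel shader approach).
per_pixel = transform(light, lerp3(*tri, weights))

max_error = max(abs(a - b) for a, b in zip(per_vertex, per_pixel))
print(max_error)
```

The two paths should agree up to floating-point rounding, so the remaining cost is one interpolated result per split rather than any loss of correctness.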
 
We ran into the same problem of having many instructions in the pixel shader (19 to calculate the right texcoord, if I remember correctly).

I *believe* (I haven't implemented it yet, but I think it's doable) that if you work with a directional light (and orthographic projections) you can use 3 TexCoords:

(x1/w1, x2/w2-x1/w1, x3/w3 - x2/w2, x4/w4-x3/w3), same for y and z,
xi, yi, zi, and wi being the coordinates of the vertex in light space i.

Then one cmp and 3 dp4 should allow you to have the Light space coordinates you're looking for.

Of course you need to have at least 2 TexCoords available (or 3 for point lights / lights needing a perspective projection). We do :)

Ben
 
Err, I meant 2 more TexCoords than with "classic" Shadow Mapping :)
 
Thanks for the replies, and especially to pthiben for sharing his method :LOL: . The code is working with a pixel shader of 28 instructions (with normal mapping, PSSM + variance shadow maps and fade-out shadows).

The code for indexing is the following:

// vertex shader

// pssm_rows[0] = row 0 of the LMVP of the first split
// pssm_rows[1] = row 0 of the LMVP of the second split
// ...
// pssm_rows[4] = row 1 of the LMVP of the first split
// pssm_rows[5] = row 1 of the LMVP of the second split

float4 pp[3];
pp[0].x = dot(pssm_rows[ 0],tposition);
pp[0].y = dot(pssm_rows[ 1],tposition);
pp[0].z = dot(pssm_rows[ 2],tposition);
pp[0].w = dot(pssm_rows[ 3],tposition);
pp[1].x = dot(pssm_rows[ 4],tposition);
pp[1].y = dot(pssm_rows[ 5],tposition);
pp[1].z = dot(pssm_rows[ 6],tposition);
pp[1].w = dot(pssm_rows[ 7],tposition);
pp[2].x = dot(pssm_rows[ 8],tposition);
pp[2].y = dot(pssm_rows[ 9],tposition);
pp[2].z = dot(pssm_rows[10],tposition);
pp[2].w = dot(pssm_rows[11],tposition);

pp[0].yzw -= pp[0].xyz;
pp[1].yzw -= pp[1].xyz;
pp[2].yzw -= pp[2].xyz;

o.positionLightSpaceX = pp[0];
o.positionLightSpaceY = pp[1];
o.positionLightSpaceZ = pp[2];

// pixel shader

float4 zGreater = (-pssm_start+camDistance) > 0.0;
float3 posLightSpace;
posLightSpace.x = dot(zGreater,i.positionLightSpaceX);
posLightSpace.y = dot(zGreater,i.positionLightSpaceY);
posLightSpace.z = dot(zGreater,i.positionLightSpaceZ);
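
To see why the subtraction in the vertex shader and the dot product in the pixel shader recover the right split, here is a minimal Python sketch of the same arithmetic for a single coordinate. The helper names `encode_deltas` and `select` and all the numbers are hypothetical, chosen only for illustration.

```python
def encode_deltas(values):
    """Vertex shader side: keep the first value, then store successive differences."""
    return [values[0]] + [values[i] - values[i - 1] for i in range(1, len(values))]

def select(deltas, pssm_start, cam_distance):
    """Pixel shader side: dot the 0/1 step mask with the stored deltas."""
    mask = [1.0 if cam_distance - s > 0.0 else 0.0 for s in pssm_start]
    return sum(m * d for m, d in zip(mask, deltas))

# Hypothetical light-space x coordinate of one point in each of the 4 splits.
x_per_split = [0.25, 0.5, 0.75, 0.125]
# Hypothetical split start distances; the first one is 0 so its mask bit is always set.
pssm_start = [0.0, 10.0, 30.0, 80.0]

deltas = encode_deltas(x_per_split)
# One camera distance inside each split.
recovered = [select(deltas, pssm_start, d) for d in (5.0, 20.0, 50.0, 100.0)]
print(recovered)
```

The mask has ones for every split up to the active one, so the sum telescopes to that split's coordinate. This only works if the first split start lies below every shaded distance, otherwise the mask is all zeros and the result is 0.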

Cheers,

Sergi
 
Good to hear it's working. I guess I'll give it a try when I have time to optimize our shadow method :p
 