hlsl flowcontrol & texture lookups

hunter3738

Newcomer
Hi

I'm using HLSL with DX9 and ps_3_0 and have a problem with my pixel shader when I turn on flow control: the tex2Dlod function doesn't return correct values. The code looks something like this:

float4 pedata = tex2D(texPeDataSampler, float2(colpos, rowpos));
for (int i = 0; i < pedata[3] && i < 6; i++)
{
// following call goes OK
dendata = tex2Dlod(texDenSampler, float4(float2(dencolpos, denrowpos),0,0));
// here's where it starts failing
if (srcreadtime1 == 0){ srcpedata = tex2Dlod(samparr[0],float4(float2(srcpecolpos, srcperowpos), 0, 0)); }
if (srcreadtime1 == 1){ srcpedata = tex2Dlod(samparr[1],float4(float2(srcpecolpos, srcperowpos), 0, 0)); }
.
.
.
if (srcreadtime1 == 13){ srcpedata = tex2Dlod(samparr[13],float4(float2(srcpecolpos, srcperowpos), 0, 0)); }
}

The code compiles fine and the problem disappears when I turn off flow control (making my app 10x slower). The values returned are always [infinity, 0, 0, 0], and I tested this on two SM3.0 cards (NVIDIA and ATI). I also tried writing it all with slightly different syntax, without success.

My questions:
- Is there a better way of doing this? How?
- Is there any hope this will work with dx10 & sm4.0?
- Should I delve into the asm to solve this or are there other options?

any help and comments are highly appreciated...
 
My knowledge is pretty sketchy, but I think this is an allowed limitation in DX (assuming this refers to the derivative instruction that takes the deltas across 2x2 pixel groups). It's probably best to start by looking at the DirectX docs.
 
In particular I believe this is the page you should be most interested in.

Read both the nesting depth sections and the "Interaction of Per-Pixel Flow Control With Screen Gradients" section at the bottom.
 
I don't see why screen gradients should be of any relevance, given that all the conditional texture lookups in the code example use 'tex2Dlod' (which takes the LOD as an explicit argument instead of auto-computing it); other than that, I am a bit at a loss to see why the code should fail.

As for improvements to the code:
Code:
for (int i = 0; i < pedata[3] && i < 6; i++)
has a somewhat heavy exit condition; try
Code:
int iters = min( pedata[3], 6 );
for(int i = 0; i < iters; i++)
(which may or may not make a difference; good compilers may perform this kind of optimization automatically, however, the presence or absence of compiler optimizations is in general not something I would want to rely on.)

Also, I would think it would be possible to replace samparr[] with a single 3D texture; this should work just fine under SM3.0 and should be a whole lot faster than doing 14 conditional lookups.
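A rough sketch of what that single lookup could look like, assuming the 14 textures have somehow been packed as depth slices of one volume texture. The names texPeVolume, texPeVolumeSampler and SampleSlice are made up for this example, and getting render-target data into the volume is a separate problem that this does not address.

```hlsl
// Hypothetical volume texture holding the 14 source textures as depth slices.
texture texPeVolume;
sampler3D texPeVolumeSampler = sampler_state
{
    Texture   = <texPeVolume>;
    MagFilter = LINEAR;
    MinFilter = LINEAR;
    MipFilter = NONE;
    AddressU  = MIRROR;
    AddressV  = MIRROR;
    AddressW  = CLAMP;
};

// One fetch instead of 14 conditional ones.
// slice is the index (0..13) that srcreadtime1 holds in the original code.
float4 SampleSlice(float2 uv, float slice)
{
    // Aim at the centre of the slice so that linear filtering along w
    // does not blend in the neighbouring slices.
    float w = (slice + 0.5) / 14.0;
    return tex3Dlod(texPeVolumeSampler, float4(uv, w, 0));
}
```

Inside the loop this would be called as srcpedata = SampleSlice(float2(srcpecolpos, srcperowpos), srcreadtime1);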
 
Definitely use a 2D texture array (in DX10), a 3D texture with various "slices", or even a texture atlas for something like this!

That said, I'd make sure that your samplers have aniso disabled (samples = 1) when using them with tex2Dlod. It shouldn't matter, but at this point it's something to try. Do your textures even have mipmaps, by the way? If not, try a standard instruction (tex2D), although the compiler might get "clever" and make that a derivative texture lookup for you. Just to help figure out what's going on, you may also want to try a derivative texture lookup with explicit derivatives.
 
Thanks for all the replies! I'll try to provide some more information and answer some questions:

- srcreadtime1 is an integer that determines which texture should be sampled, since it's a variable i have to use the if structures to get to the sampler[].
- if i use tex2D all works well but the compiler unrolls the loop and effectively removes all flow control statements (making it really slow). Andy TX, how should i do this with explicit derivatives (i assume you're referring to tex2Dproj)? What values should i pass in for the derivatives?
- the samplers are declared as:

sampler samparr[14] : register(s0) = {sampler_state { texture = <texPe0> ;magfilter = LINEAR; minfilter = LINEAR; mipfilter=LINEAR; AddressU = mirror; AddressV = mirror;}, ... };

any alternatives here?

- it has something to do with the textures i'm using, only rendertargets with flow control (tex2Dlod) seem to cause the problem. Here's how i create my textures (the ones that get sampled by the sampler[]):

petexs = new Texture(DXManager.device, mdimpes, mdimpes, 1, Usage.RenderTarget, Format.A32B32G32R32F, Pool.Default);

- i don't think i can use 3D textures, the textures should be used as rendertargets and in subsequent passes their content should be read back and used as input (ping-pong technique), or is there a way to set one slice of a 3D-texture as the render target?

- here's the asm output of the compiled code:

//
// Generated by Microsoft (R) HLSL Shader Compiler 9.19.949.2111
//
// fxc /T ps_3_0 /E OurFirstPixelShader /Fc test.fx0 test.fx
//
//
// Parameters:
//
// sampler2D samparr[14];
// float src0;
// float src1;
// sampler2D texDenSampler;
// sampler2D texPeDataSampler;
// int time;
//
//
// Registers:
//
// Name Reg Size
// ---------------- ----- ----
// time c0 1
// src0 c1 1
// src1 c2 1
// samparr s0 14
// texPeDataSampler s14 1
// texDenSampler s15 1
//
//
// Default values:
//
// time
// c0 = { 0, 0, 0, 0 };
//
// src0
// c1 = { 0, 0, 0, 0 };
//
// src1
// c2 = { 0, 0, 0, 0 };
//

ps_3_0
def c3, 1, -1, 0.00390625, 1.00390625
def c4, 1, 0, 14, -13
def c5, -1, -2, -3, -4
def c6, -5, -6, -7, -8
def c7, -9, -10, -11, -12
def c8, -34, 0.001953125, 0, 0
defi i0, 6, 0, 0, 0
dcl_texcoord v0.xy
dcl_2d s0
dcl_2d s1
dcl_2d s2
dcl_2d s3
dcl_2d s4
dcl_2d s5
dcl_2d s6
dcl_2d s7
dcl_2d s8
dcl_2d s9
dcl_2d s10
dcl_2d s11
dcl_2d s12
dcl_2d s13
dcl_2d s14
dcl_2d s15
mad r0.xy, v0, c3, c3.zwzw
texld r0, r0, s14
add r0.z, r0.z, c8.x
mov r1.y, c8.y
mov r2.xyz, c8.w
mov r1.z, c8.w
mov r3.xy, r0
mov r1.w, c8.w
rep i0
mov r2.w, r0.w
break_ge r1.w, r2.w
mul r4.xz, r3.y, c4.xyyw
mov r4.y, r3.x
texldl r5, r4.xyzz, s15
cmp r2.w, -r5.z, c4.y, c4.x
if_ne r2.w, -r2.w
frc r2.w, r5.y
add r3.z, r5.y, -r2.w
cmp r3.w, r5.y, c4.y, c4.x
cmp r2.w, -r2.w, c4.y, c4.x
mad r2.w, r3.w, r2.w, r3.z
add r2.w, -r2.w, c0.x
add r3.z, r2.w, c4.z
cmp r2.x, r2.w, r2.w, r3.z
cmp r2.w, -r2_abs.x, c4.x, c4.y
if_ne r2.w, -r2.w
mul r6, r5.wzww, c4.xxyy
texldl r6, r6, s0
mov r2.yz, r6.xzww
else
mov r2.yz, c8.w
mov r6.x, c8.w
endif
add r7, r2.x, c5
cmp r7, -r7_abs, c4.x, c4.y
if_ne r7.x, -r7.x
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s1
mov r2.yz, r6.xzww
endif
if_ne r7.y, -r7.y
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s2
mov r2.yz, r6.xzww
endif
if_ne r7.z, -r7.z
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s3
mov r2.yz, r6.xzww
endif
if_ne r7.w, -r7.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s4
mov r2.yz, r6.xzww
endif
add r7, r2.x, c6
cmp r7, -r7_abs, c4.x, c4.y
if_ne r7.x, -r7.x
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s5
mov r2.yz, r6.xzww
endif
if_ne r7.y, -r7.y
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s6
mov r2.yz, r6.xzww
endif
if_ne r7.z, -r7.z
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s7
mov r2.yz, r6.xzww
endif
if_ne r7.w, -r7.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s8
mov r2.yz, r6.xzww
endif
add r7, r2.x, c7
cmp r7, -r7_abs, c4.x, c4.y
if_ne r7.x, -r7.x
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s9
mov r2.yz, r6.xzww
endif
if_ne r7.y, -r7.y
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s10
mov r2.yz, r6.xzww
endif
if_ne r7.z, -r7.z
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s11
mov r2.yz, r6.xzww
endif
if_ne r7.w, -r7.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s12
mov r2.yz, r6.xzww
endif
add r2.w, r2.x, c4.w
cmp r2.w, -r2_abs.w, c4.x, c4.y
if_ne r2.w, -r2.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s13
mov r2.yz, r6.xzww
endif
else
mov r2.w, c8.w
cmp r2.w, -r5_abs.w, c1.x, r2.w
add r3.z, r5.w, c3.y
cmp r6.x, -r3_abs.z, c2.x, r2.w
endif
mul r2.w, r5.x, r6.x
mad r2.w, r2.w, r2.w, r1.z
mad r3.z, r6.x, r5.x, r1.z
add r4.w, r3.y, c3.z
add r3.w, -r4.w, c3.x
add r1.x, r3.x, c3.z
cmp r3.xy, r3.w, r4.ywzw, r1
add r1.w, r1.w, c3.x
cmp r1.z, -r0_abs.z, r2.w, r3.z
endrep
mov oC0.yzw, r2.xxyz
mov oC0.x, r1.z

// approximately 174 instruction slots used (31 texture, 143 arithmetic)

thanks again for the help so far, if i don't get this working i can also make a small app to reproduce the problem...
 
Andy TX, how should i do this with explicit derivatives (i assume you're referring to tex2Dproj), what values to pass in for the derivatives?
Actually I'm referring to tex2Dgrad. You should be able to pass something small (or zero) for the derivatives to force mipmap level 0. Note that tex2Dgrad is slightly more expensive than tex2D or tex2Dlod because I believe it has to operate at the pixel level rather than the quad level, but it's something to try.
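For reference, a minimal sketch of such a lookup, with zero derivatives passed in so that the top mip level should be selected. SampleTopLevel is a hypothetical helper name, not something from the original shader.

```hlsl
// Explicit-gradient lookup. Zero derivatives describe a zero-sized
// footprint, so the sampler should pick mip level 0.
float4 SampleTopLevel(sampler2D s, float2 uv)
{
    return tex2Dgrad(s, uv, float2(0, 0), float2(0, 0));
}
```

With the names from the original post, a call would look like srcpedata = SampleTopLevel(samparr[0], float2(srcpecolpos, srcperowpos));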

sampler samparr[14] : register(s0) = {sampler_state { texture = <texPe0> ;magfilter = LINEAR; minfilter = LINEAR; mipfilter=LINEAR; AddressU = mirror; AddressV = mirror;}, ... };
That's trilinear filtering. Since you're not using mipmaps, just use simple bilinear (mipfilter = NONE or similar I believe was valid in DX9).
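Something along these lines, shown for a single sampler rather than the whole array. The sampler name sampPe0 is made up for the example; texPe0 and the other states are taken from the declaration quoted above.

```hlsl
texture texPe0;
sampler2D sampPe0 = sampler_state
{
    Texture   = <texPe0>;
    MagFilter = LINEAR;
    MinFilter = LINEAR;
    MipFilter = NONE;    // bilinear only: no blending between mip levels
    AddressU  = MIRROR;
    AddressV  = MIRROR;
};
```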

- i don't think i can use 3D textures, the textures should be used as rendertargets and in subsequent passes their content should be read back and used as input (ping-pong technique), or is there a way to set one slice of a 3D-texture as the render target?
In DX10 you can definitely render to slices of 3D texture and I'd be surprised if you couldn't in DX9. There should be some way to specify the "z slice" of the target 3D texture when you're creating a render target... then you just create one render target per slice. The compiler may complain about RW hazards if you have the 3D texture bound while rendering to a slice of it... you can potentially use one 2D offscreen texture for ping-ponging to avoid this at the cost of an additional copy.
 
I found it, some parameters were swapped (!) at every iteration of the for loop. I changed the code to calculate the coordinates at the beginning of the loop and it finally works now (yay).

I'll look into the 3D texture rendering now, i'm sure it'll be way more efficient (if i can get it to work with managed DX9).

thx all for the time and trouble of looking into this...
 
I think the key here is to get rid of all these flow control statements while keeping performance up, i.e. performing a single texture fetch for each loop iteration.
As mentioned above a 3D texture should fit this requirement for DX9 while texture arrays can be used in DX10.
You could also implement a "virtual" texture array by simply packing your 14 textures (or render targets) into a larger texture. E.g. if each of your 14 textures is 50x50 in size, then you could allocate a single 700x50 texture and use u coordinate offsetting to decide which one to sample. The code would look like the following (assuming the same filter mode is used on all 14 textures):

// Size of one tile in texels (xy). The packed texture is 14 tiles wide:
// (14 * TextureSize.x) by TextureSize.y.
float4 TextureSize;

for (int i = 0; i < pedata[3] && i < 6; i++)
{
// following call goes OK
dendata = tex2Dlod(texDenSampler, float4(float2(dencolpos, denrowpos),0,0));

// srcreadtime1 selects the tile: u is squeezed into a 1/14-wide strip and offset, v is unchanged
srcpedata = tex2Dlod(mysinglelargetexture, float4((srcpecolpos + srcreadtime1) / 14.0, srcperowpos, 0, 0));
}

If your textures are all render targets you can still apply this trick; you only need to make sure your viewports are set up correctly when rendering into your large virtual texture (render target) array. With modern hardware supporting fairly large render target sizes (4kx4k or even 8kx8k), this should hopefully be enough to accommodate your 14 textures. You may need to use both X and Y resolution for optimal packing though.
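As a sketch of the packing in both X and Y, here is one way the coordinate remapping could be written for a 4x4 grid of tiles (16 slots, 14 used). The grid size and the names TILES_X, TILES_Y and AtlasUV are assumptions for illustration. It also assumes uv is already in [0, 1]: mirror addressing no longer applies per tile, and linear filtering can bleed across tile borders.

```hlsl
// Hypothetical layout: tiles stored row by row in a 4x4 grid.
static const float TILES_X = 4.0;
static const float TILES_Y = 4.0;

// Maps a per-tile uv in [0, 1] to the packed texture.
// index is the tile number (0..13), e.g. srcreadtime1.
float2 AtlasUV(float2 uv, float index)
{
    float row = floor(index / TILES_X);
    float col = index - row * TILES_X;
    return (float2(col, row) + uv) / float2(TILES_X, TILES_Y);
}
```

The lookup then becomes srcpedata = tex2Dlod(mysinglelargetexture, float4(AtlasUV(float2(srcpecolpos, srcperowpos), srcreadtime1), 0, 0));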

Btw there is no point using tex2Dgrad if your textures are non-MIPMapped. Tex2Dlod will do (and is faster).

Hope this helps!
 
Supercow, i didn't think it was possible to read from the same texture as the one i'm rendering on (i already got some bsods trying to do just that), but i'll give it another try and let you know how it goes...
 
Supercow, i didn't think it was possible to read from the same texture as the one i'm rendering on (i already got some bsods trying to do just that), but i'll give it another try and let you know how it goes...
You're right, the D3D API doesn't allow that (write hazards). What made you think I was recommending this method though? You'd have to write to your render targets in one pass, and then read from them in another using the technique mentioned (separate passes are fine since you're not writing to and reading from the same texture at the same time).
 
ok, 3D texture rendering seems to be a dead end, there's no 'getsurfacelevel' method in the volumetexture class. I don't think the 'virtual' texture array is an option either since i need to be able to both read and write from the same texture. The code is now 300% faster than the cpu equivalent but i was hoping for something like 1000-4000%. Guess i'll have to rethink my logic and get rid of all the loops/ifs to push it further...
 