hlsl flowcontrol & texture lookups

hunter3738 · Mar 7, 2008

Hi

I'm using HLSL with DX9 and ps_3_0 and have a problem with my pixel shader when I turn on flow control: the tex2Dlod function doesn't return correct values. The code looks something like this:

float4 pedata = tex2D(texPeDataSampler, float2(colpos, rowpos));
for (int i = 0; i < pedata[3] && i < 6; i++)
{
    // following call goes OK
    dendata = tex2Dlod(texDenSampler, float4(float2(dencolpos, denrowpos), 0, 0));
    // here's where it starts failing
    if (srcreadtime1 == 0) { srcpedata = tex2Dlod(samparr[0], float4(float2(srcpecolpos, srcperowpos), 0, 0)); }
    if (srcreadtime1 == 1) { srcpedata = tex2Dlod(samparr[1], float4(float2(srcpecolpos, srcperowpos), 0, 0)); }
    .
    .
    .
    if (srcreadtime1 == 13) { srcpedata = tex2Dlod(samparr[13], float4(float2(srcpecolpos, srcperowpos), 0, 0)); }
}

The code compiles fine and the problem disappears when I turn off flow control (making my app 10x slower). The values returned are always [infinity, 0, 0, 0], and I tested this on two SM3.0 cards (NVIDIA and ATI). I also tried writing it all with slightly different syntax, without success.

My questions:
- Is there a better way of doing this? How?
- Is there any hope this will work with dx10 & sm4.0?
- Should I delve into the asm to solve this or are there other options?

any help and comments are highly appreciated...

Simon F · Mar 9, 2008

My knowledge is pretty sketchy, but I think this is an allowed limitation in DX (assuming this refers to the derivative instruction that takes the deltas across 2x2 pixel groups). It's probably best to start by looking in the DirectX docs.

Ilfirin · Mar 9, 2008

In particular I believe this is the page you should be most interested in.

Read both the nesting depth sections and the "Interaction of Per-Pixel Flow Control With Screen Gradients" section at the bottom.

Xmas · Mar 9, 2008

hunter3738 said:
- Is there a better way of doing this? How?

Likely, but it depends on what srcreadtime1 is.

- Is there any hope this will work with dx10 & sm4.0?

With DX10 you could use texture arrays instead of arrays of samplers (as long as all the textures are the same size).
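
A minimal SM4.0 sketch of the texture-array idea, assuming all 14 textures have the same size and format. The names texPeArray, linearMirror and FetchLayer are made up for illustration, and the sampler's filter and address modes are assumed to be set by the application:

```hlsl
// Sketch only: texPeArray, linearMirror and FetchLayer are hypothetical names.
Texture2DArray texPeArray;   // assumed: the 14 same-sized textures stored as array slices
SamplerState   linearMirror; // assumed: filter/address state is set by the application

// layer plays the role of srcreadtime1 (an integer-valued index 0..13).
float4 FetchLayer(float2 uv, float layer)
{
    // SampleLevel takes an explicit LOD, like tex2Dlod, so the fetch
    // does not depend on screen-space gradients inside flow control.
    return texPeArray.SampleLevel(linearMirror, float3(uv, layer), 0);
}

float4 main(float2 uv : TEXCOORD0) : SV_Target
{
    float layer = 3; // placeholder for the per-pixel index
    return FetchLayer(uv, layer);
}
```

One indexed fetch like this would replace the whole chain of 14 if statements.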

arjan de lumens · Mar 9, 2008

I don't see why screen gradients should be of any relevance, given that all the conditional texture lookups in the code example use 'tex2Dlod' (which takes LOD as an explicit argument instead of auto-computing it); other than that, I am a bit at a loss to see why the code should fail.

As for improvements to the code:

Code:

for (int i = 0; i < pedata[3] && i < 6; i++)

has a somewhat heavy exit condition; try

Code:

int iters = min( pedata[3], 6 );
for(int i = 0; i < iters; i++)

(which may or may not make a difference; good compilers may perform this kind of optimization automatically, however, the presence or absence of compiler optimizations is in general not something I would want to rely on.)

Also, I would think it would be possible to replace samparr[] with a single 3D texture; this should work just fine under SM3.0 and should be a whole lot faster than doing 14 conditional lookups.
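
A rough ps_3_0 sketch of the 3D-texture idea, assuming the 14 textures can be stored as slices of one volume texture. texPeVolumeSampler, NUM_SLICES and FetchSlice are invented names; some DX9 hardware may require a power-of-two depth, in which case NUM_SLICES would have to be 16:

```hlsl
// Sketch only: texPeVolumeSampler, NUM_SLICES and FetchSlice are hypothetical names.
sampler3D texPeVolumeSampler;           // assumed: one slice per original 2D texture
static const float NUM_SLICES = 14.0;   // assumed depth of the volume texture

// slice plays the role of srcreadtime1 (an integer-valued index 0..13).
float4 FetchSlice(float2 uv, float slice)
{
    // Address the centre of the slice so that linear filtering along w
    // does not blend in the neighbouring slices.
    float w = (slice + 0.5) / NUM_SLICES;
    return tex3Dlod(texPeVolumeSampler, float4(uv, w, 0));
}

float4 main(float2 uv : TEXCOORD0) : COLOR0
{
    float slice = 3; // placeholder for the per-pixel index
    return FetchSlice(uv, slice);
}
```

This only covers the read side; whether the slices can also be used as render targets is a separate question.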

Andrew Lauritzen · Mar 10, 2008

Definitely use a 2D texture array (in DX10), a 3D texture with various "slices", or even a texture atlas for something like this!

That said, I'd make sure that your samplers have aniso disabled (samples = 1) when using them with tex2Dlod. It shouldn't matter, but at this point it's something to try. Do your textures even have mipmaps by the way? If not, try a standard instruction (tex2D) although the compiler might get "clever" and make that a derivative texture lookup for you. Just to help figure out what's going on, you may also want to try a derivative texture lookup with explicit derivatives actually.

hunter3738 · Mar 10, 2008

Thanks for all the replies! I'll try to provide some more information and answer some questions:

- srcreadtime1 is an integer that determines which texture should be sampled, since it's a variable i have to use the if structures to get to the sampler[].
- if i use tex2D all works well but the compiler unrolls the loop and effectively removes all flow control statements (making it really slow). Andy TX, how should i do this with explicit derivatives (i assume you're referring to tex2Dproj), what values to pass in for the derivatives?
- the samplers are declared as:

sampler samparr[14] : register(s0) = {sampler_state { texture = <texPe0> ;magfilter = LINEAR; minfilter = LINEAR; mipfilter=LINEAR; AddressU = mirror; AddressV = mirror;}, ... };

any alternatives here?

- it has something to do with the textures i'm using, only rendertargets with flow control (tex2Dlod) seem to cause the problem. Here's how i create my textures (the ones that get sampled by the sampler[]):

petexs = new Texture(DXManager.device, mdimpes, mdimpes, 1, Usage.RenderTarget, Format.A32B32G32R32F, Pool.Default);

- i don't think i can use 3D textures, the textures should be used as rendertargets and in subsequent passes their content should be read back and used as input (ping-pong technique), or is there a way to set one slice of a 3D-texture as the render target?

- here's the asm output of the compiled code:

//
// Generated by Microsoft (R) HLSL Shader Compiler 9.19.949.2111
//
// fxc /T ps_3_0 /E OurFirstPixelShader /Fc test.fx0 test.fx
//
//
// Parameters:
//
// sampler2D samparr[14];
// float src0;
// float src1;
// sampler2D texDenSampler;
// sampler2D texPeDataSampler;
// int time;
//
//
// Registers:
//
// Name Reg Size
// ---------------- ----- ----
// time c0 1
// src0 c1 1
// src1 c2 1
// samparr s0 14
// texPeDataSampler s14 1
// texDenSampler s15 1
//
//
// Default values:
//
// time
// c0 = { 0, 0, 0, 0 };
//
// src0
// c1 = { 0, 0, 0, 0 };
//
// src1
// c2 = { 0, 0, 0, 0 };
//

ps_3_0
def c3, 1, -1, 0.00390625, 1.00390625
def c4, 1, 0, 14, -13
def c5, -1, -2, -3, -4
def c6, -5, -6, -7, -8
def c7, -9, -10, -11, -12
def c8, -34, 0.001953125, 0, 0
defi i0, 6, 0, 0, 0
dcl_texcoord v0.xy
dcl_2d s0
dcl_2d s1
dcl_2d s2
dcl_2d s3
dcl_2d s4
dcl_2d s5
dcl_2d s6
dcl_2d s7
dcl_2d s8
dcl_2d s9
dcl_2d s10
dcl_2d s11
dcl_2d s12
dcl_2d s13
dcl_2d s14
dcl_2d s15
mad r0.xy, v0, c3, c3.zwzw
texld r0, r0, s14
add r0.z, r0.z, c8.x
mov r1.y, c8.y
mov r2.xyz, c8.w
mov r1.z, c8.w
mov r3.xy, r0
mov r1.w, c8.w
rep i0
mov r2.w, r0.w
break_ge r1.w, r2.w
mul r4.xz, r3.y, c4.xyyw
mov r4.y, r3.x
texldl r5, r4.xyzz, s15
cmp r2.w, -r5.z, c4.y, c4.x
if_ne r2.w, -r2.w
frc r2.w, r5.y
add r3.z, r5.y, -r2.w
cmp r3.w, r5.y, c4.y, c4.x
cmp r2.w, -r2.w, c4.y, c4.x
mad r2.w, r3.w, r2.w, r3.z
add r2.w, -r2.w, c0.x
add r3.z, r2.w, c4.z
cmp r2.x, r2.w, r2.w, r3.z
cmp r2.w, -r2_abs.x, c4.x, c4.y
if_ne r2.w, -r2.w
mul r6, r5.wzww, c4.xxyy
texldl r6, r6, s0
mov r2.yz, r6.xzww
else
mov r2.yz, c8.w
mov r6.x, c8.w
endif
add r7, r2.x, c5
cmp r7, -r7_abs, c4.x, c4.y
if_ne r7.x, -r7.x
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s1
mov r2.yz, r6.xzww
endif
if_ne r7.y, -r7.y
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s2
mov r2.yz, r6.xzww
endif
if_ne r7.z, -r7.z
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s3
mov r2.yz, r6.xzww
endif
if_ne r7.w, -r7.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s4
mov r2.yz, r6.xzww
endif
add r7, r2.x, c6
cmp r7, -r7_abs, c4.x, c4.y
if_ne r7.x, -r7.x
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s5
mov r2.yz, r6.xzww
endif
if_ne r7.y, -r7.y
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s6
mov r2.yz, r6.xzww
endif
if_ne r7.z, -r7.z
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s7
mov r2.yz, r6.xzww
endif
if_ne r7.w, -r7.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s8
mov r2.yz, r6.xzww
endif
add r7, r2.x, c7
cmp r7, -r7_abs, c4.x, c4.y
if_ne r7.x, -r7.x
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s9
mov r2.yz, r6.xzww
endif
if_ne r7.y, -r7.y
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s10
mov r2.yz, r6.xzww
endif
if_ne r7.z, -r7.z
mul r8, r5.wzww, c4.xxyy
texldl r6, r8, s11
mov r2.yz, r6.xzww
endif
if_ne r7.w, -r7.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s12
mov r2.yz, r6.xzww
endif
add r2.w, r2.x, c4.w
cmp r2.w, -r2_abs.w, c4.x, c4.y
if_ne r2.w, -r2.w
mul r7, r5.wzww, c4.xxyy
texldl r6, r7, s13
mov r2.yz, r6.xzww
endif
else
mov r2.w, c8.w
cmp r2.w, -r5_abs.w, c1.x, r2.w
add r3.z, r5.w, c3.y
cmp r6.x, -r3_abs.z, c2.x, r2.w
endif
mul r2.w, r5.x, r6.x
mad r2.w, r2.w, r2.w, r1.z
mad r3.z, r6.x, r5.x, r1.z
add r4.w, r3.y, c3.z
add r3.w, -r4.w, c3.x
add r1.x, r3.x, c3.z
cmp r3.xy, r3.w, r4.ywzw, r1
add r1.w, r1.w, c3.x
cmp r1.z, -r0_abs.z, r2.w, r3.z
endrep
mov oC0.yzw, r2.xxyz
mov oC0.x, r1.z

// approximately 174 instruction slots used (31 texture, 143 arithmetic)

thanks again for the help so far, if i don't get this working i can also make a small app to reproduce the problem...

Andrew Lauritzen · Mar 10, 2008

hunter3738 said:
Andy TX, how should i do this with explicit derivatives (i assume you're referring to tex2Dproj), what values to pass in for the derivatives?

Actually I'm referring to tex2Dgrad. You should be able to pass something small (or zero) for the derivatives to force mipmap level 0. Note that tex2Dgrad is slightly more expensive than tex2D or tex2Dlod because I believe it has to operate at the pixel level rather than the quad level, but it's something to try.

hunter3738 said:
sampler samparr[14] : register(s0) = {sampler_state { texture = <texPe0> ;magfilter = LINEAR; minfilter = LINEAR; mipfilter=LINEAR; AddressU = mirror; AddressV = mirror;}, ... };

That's trilinear filtering. Since you're not using mipmaps, just use simple bilinear (mipfilter = NONE or similar I believe was valid in DX9).

hunter3738 said:
- i don't think i can use 3D textures, the textures should be used as rendertargets and in subsequent passes their content should be read back and used as input (ping-pong technique), or is there a way to set one slice of a 3D-texture as the render target?

In DX10 you can definitely render to slices of 3D texture and I'd be surprised if you couldn't in DX9. There should be some way to specify the "z slice" of the target 3D texture when you're creating a render target... then you just create one render target per slice. The runtime may complain about RW hazards if you have the 3D texture bound while rendering to a slice of it... you can potentially use one 2D offscreen texture for ping-ponging to avoid this at the cost of an additional copy.
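
The tex2Dgrad and bilinear-filter suggestions above could be sketched like this in a DX9 effect file. texPe0 is the texture name from the quoted sampler declaration; texPe0Sampler and main are names made up for the example:

```hlsl
// Sketch only: texPe0Sampler is a hypothetical name; texPe0 comes from the quoted post.
texture texPe0;

sampler2D texPe0Sampler = sampler_state
{
    Texture   = <texPe0>;
    MagFilter = LINEAR;
    MinFilter = LINEAR;
    MipFilter = NONE;    // bilinear only: the render target has a single mip level
    AddressU  = MIRROR;
    AddressV  = MIRROR;
};

float4 main(float2 uv : TEXCOORD0) : COLOR0
{
    // Zero gradients select the most detailed mip level, so the fetch
    // does not rely on derivatives computed across the 2x2 pixel quad.
    return tex2Dgrad(texPe0Sampler, uv, float2(0, 0), float2(0, 0));
}
```

This is meant as a diagnostic to compare against tex2Dlod, not as a replacement for it.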

hunter3738 · Mar 10, 2008

I found it, some parameters were swapped (!) at every iteration of the for loop. I changed the code to calculate the coordinates at the beginning of the loop and it finally works now (yay).

I'll look into the 3D texture rendering now, i'm sure it'll be way more efficient (if i can get it to work with managed DX9).

thx all for the time and trouble of looking into this...

SuperCow · Mar 12, 2008

I think the key here is to get rid of all these flow control statements while keeping performance up, i.e. performing a single texture fetch for each loop iteration.
As mentioned above a 3D texture should fit this requirement for DX9 while texture arrays can be used in DX10.
You could also implement a "virtual" texture array by simply packing your 14 textures (or render targets) into a larger texture. E.g. if each of your 14 textures is 50x50 in size then you could allocate a single 700x50 texture and use u coordinate offsetting to actually decide on which one to sample. The code would look like the following (assuming the same filter mode is used on all 14 textures):

float2 TextureSize; // size of one of the 14 textures, in texels
// the virtual array texture is then (14 * TextureSize.x) x TextureSize.y texels

for (int i = 0; i < pedata[3] && i < 6; i++)
{
    // following call goes OK
    dendata = tex2Dlod(texDenSampler, float4(float2(dencolpos, denrowpos), 0, 0));

    // select the sub-texture by offsetting u; v is unchanged because the 14 textures sit side by side
    srcpedata = tex2Dlod(mysinglelargetexture, float4((srcpecolpos + srcreadtime1) / 14.0, srcperowpos, 0, 0));
}

If your textures are all render targets you can still apply this trick; you only need to make sure your viewports are set up correctly when rendering into your large virtual texture (render target) array. With modern hardware supporting fairly large render target sizes (4kx4k or even 8kx8k) this should hopefully be enough to accommodate your 14 textures. You may need to use both X and Y resolution for optimal packing though.

Btw there is no point using tex2Dgrad if your textures are non-MIPMapped. Tex2Dlod will do (and is faster).

Hope this helps!
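
For the case where both X and Y are used for packing, the addressing might look roughly like this. It assumes a 4x4 grid of equally sized tiles (14 of 16 used), filled row by row; atlasSampler, GRID and AtlasUV are hypothetical names:

```hlsl
// Sketch only: atlasSampler, GRID and AtlasUV are hypothetical names.
sampler2D atlasSampler;                        // assumed: one large texture holding all tiles
static const float2 GRID = float2(4.0, 4.0);   // assumed layout: 4 columns x 4 rows

// index plays the role of srcreadtime1 (an integer-valued index 0..13).
float2 AtlasUV(float2 uv, float index)
{
    float row = floor(index / GRID.x);   // which row of tiles
    float col = index - row * GRID.x;    // which column within that row
    // saturate keeps uv inside the tile; mirror addressing is NOT emulated here.
    return (float2(col, row) + saturate(uv)) / GRID;
}

float4 main(float2 uv : TEXCOORD0) : COLOR0
{
    float index = 5; // placeholder for the per-pixel index
    return tex2Dlod(atlasSampler, float4(AtlasUV(uv, index), 0, 0));
}
```

With linear filtering, texels at a tile border can still pick up values from the neighbouring tile, so a border of padding texels or point filtering may be needed.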

Xmas · Mar 12, 2008

SuperCow said:
Btw there is no point using tex2Dgrad if your textures are non-MIPMapped. Tex2Dlod will do (and is faster).

That's not entirely true, you might want to set the gradients for anisotropic filtering.

SuperCow · Mar 12, 2008

It would be unconventional to use aniso without MIPMap but the point you made is certainly true.

hunter3738 · Mar 17, 2008

Supercow, i didn't think it was possible to read from the same texture as the one i'm rendering on (i already got some bsods trying to do just that), but i'll give it another try and let you know how it goes...

SuperCow · Mar 17, 2008

hunter3738 said:
Supercow, i didn't think it was possible to read from the same texture as the one i'm rendering on (i already got some bsods trying to do just that), but i'll give it another try and let you know how it goes...

You're right, the D3D API doesn't allow you to do that (write hazards). What made you think I was recommending this method though? You'd have to write to your render targets in one pass, and then read from them in another using the technique mentioned (separate passes are fine since you're not writing to and reading from the same texture at the same time).

hunter3738 · Mar 21, 2008

ok, 3D texture rendering seems to be a dead end, there's no 'getsurfacelevel' method in the volumetexture class. I don't think the 'virtual' texture array is an option either since i need to be able to both read and write from the same texture. The code is now 300% faster than the cpu equivalent but i was hoping for something like 1000-4000%. Guess i'll have to rethink my logic and get rid of all the loops/ifs to push it further...
