Dsx and dsy

Nick

Veteran
What exactly do the ps 2.0+ dsx and dsy instructions compute?

The DirectX 9 SDK documentation is a bit shady about it:
The rate of change computed from the source register is an approximation on the contents of the same register in adjacent pixel(s) running the pixel shader in lock-step with the current pixel.
So does it work on any register or just texture registers? To compute du/dx we need:

du/dx = d(u'/w)/dx
du/dx = 1/w² * (w * du'/dx - u' * dw/dx)

with u' the homogeneous texture coordinate. Of course it can also be rewritten as a function of the position.
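
Writing it out, this is just the quotient rule applied to the perspective division (and, if I'm not mistaken, the interpolated homogeneous quantities are linear in screen space, so their x and y derivatives are per-triangle constants):

Code:
\frac{du}{dx} = \frac{d}{dx}\left(\frac{u'}{w}\right)
              = \frac{1}{w^2}\left(w\,\frac{du'}{dx} - u'\,\frac{dw}{dx}\right)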

Is this correct? If it is then what is the use of it? I know that it is required to compute the mipmap level but that's already done automatically. And what does the "running in lock-step" mean?
The exact formula used to compute the gradient varies depending on the hardware but should be consistent with the way the hardware does the same operations as part of the LOD calculation process for texture sampling.
Does somebody here understand Zulu? I believe this confirms that it is related to mipmap level computations, but it isn't very concrete. How could the above formula be implementation-dependent?

I just would like to know what these instructions are for. In the whole ps 2.0+ instruction set they are the most obscure and undocumented of all...
 
dsx/dsy work on any register, not just data from the texture coordinate iterators (where the derivatives could be deduced from the formulas you gave). They are normally implemented by comparing the register's value between adjacent pixels and just taking the difference values that result (most modern GPUs work on pixels in 2x2 pixel blocks at a time, so for each pixel, one adjacent pixel is always available along both X and Y axes). Texture mappers generally implicitly use functionality similar to this to compute mipmap levels - unlike other methods for computing mipmap levels, it works with dependent texturing.
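
As a rough sketch (no particular vendor's implementation), the difference scheme amounts to something like this, with all four pixels of the block executing in lock-step so each one can read its neighbors' values:

Code:
// Illustrative only: quad-based gradients via forward differences.
// quad.v[0..3] hold one shader register's value for the 2x2 block,
// laid out as:  0 1
//               2 3
struct Quad { float v[4]; };

// Coarse gradients: every pixel in the block gets the same result,
// taken across the top edge (dsx) and left edge (dsy) of the quad.
float dsx(const Quad &q) { return q.v[1] - q.v[0]; }
float dsy(const Quad &q) { return q.v[2] - q.v[0]; }

Real hardware may use finer per-pixel differences, which is presumably why the exact formula is left implementation-dependent.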

I suppose the dsx/dsy instructions can be used to reduce aliasing in procedurally-generated texture maps, but I haven't seen code examples actually doing this yet.

Does anyone know how dsx/dsy interact with the dynamic flow control of PS3.0, or what happens if some pixels in the 2x2 block fall outside the current polygon?
 
arjan de lumens said:
Does anyone know how dsx/dsy interact with the dynamic flow control of PS3.0, or what happens if some pixels in the 2x2 block fall outside the current polygon?

Two separate things here...
If a pixel in the 2x2 block falls outside of the current polygon, then it is assumed that the value is correct, as the plane eqn for the poly is still valid, so the rate of change remains valid.

If one or more of the pipelines end up executing a different dynamic condition path, then the result is really undefined; for ps3.0 it is stated that in these circumstances dsx/dsy should return 0. Basically you really need to know what you're doing if you're mixing these with dynamic flow control!

A similar problem exists for texldd...

John.
 
JohnH said:
Two separate things here...
If a pixel in the 2x2 block falls outside of the current polygon, then it is assumed that the value is correct, as the plane eqn for the poly is still valid, so the rate of change remains valid.

If one or more of the pipelines end up executing a different dynamic condition path, then the result is really undefined; for ps3.0 it is stated that in these circumstances dsx/dsy should return 0. Basically you really need to know what you're doing if you're mixing these with dynamic flow control!

A similar problem exists for texldd...

John.
Thanks. Been wondering about those for a while now. I presume that when combining dsx/dsy and dynamic flow control, you just have to do all dsx/dsy's before the first flow control instruction is reached?
 
arjan de lumens said:
Thanks. Been wondering about those for a while now. I presume that when combining dsx/dsy and dynamic flow control, you just have to do all dsx/dsy's before the first flow control instruction is reached?
I would have thought that you'd just need to have all the ds[xy] in a 2x2 block either enabled or disabled. I would expect you could also have them after conditionals or perhaps even inside provided all 2x2 pixels were 'executing'.
 
If you are going to have dsx/dsy after a conditional branch, it would seem to me that you would need to signal some sort of synchronization point/barrier in the pixel shader, as the 2x2 pipelines would likely be out of sync after the branch otherwise. This would be needed for any mipmapped texture lookup also, as far as I can see. Are there provisions for this in the PS3.0 standard?
 
Thanks arjan, that clarified a lot!

But it sucks. Dsx and dsy put limitations on the implementation and architecture. This could bring lots of difficulties for new chip designs, for example with new anti-aliasing schemes.

My actual problem is that I would like to implement ps 3.0 support for swShader (see sig). It does one pixel at a time, so there's no such thing as 2x2 blocks and pipelines. For texture coordinates I can use the above formula, right? But for temporary registers I see no efficient solution.

One idea is to run the shader twice. At every dsx and dsy instruction I could store the current register value in a buffer, and just continue with 0. Then in a second pass I can use the data in the buffer to have the correct outcome of the dsx and dsy instructions. Inefficient, but consistent and relatively easy to implement.
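
In code the idea would be something like this (hypothetical names, not actual swShader interfaces; a real implementation needs one buffer slot per dsx/dsy site, and the right/bottom neighbor pixels must also have run in the first pass):

Code:
#include <vector>

// Pass 1: record the source operand of every dsx/dsy per pixel and
// substitute 0 so execution can continue. Pass 2: the neighbors'
// recorded operands are available, so return the true difference.
struct GradientBuffer {
    int width;
    std::vector<float> recorded; // one slot per pixel (single dsx site)
    float &at(int x, int y) { return recorded[y * width + x]; }
};

float execDsx(bool firstPass, GradientBuffer &buf, int x, int y, float src) {
    if (firstPass) {
        buf.at(x, y) = src; // remember the operand, fake a 0 result
        return 0.0f;
    }
    return buf.at(x + 1, y) - buf.at(x, y); // real difference in pass 2
}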

Another idea is to compute 2x2 pixels at once. Aside from the very difficult design changes, this also brings another practical problem. I only have 8 SSE registers. When rendering four pixels in parallel this means that I can only store two shader registers in SSE registers. That's not even enough to do one three-argument instruction. Theoretically it's no problem because swShader has an automatic register allocator, but it would produce a lot of extra memory load/store instructions.

These instructions are quite frustrating. JohnH, where did you read that it should return 0 when pixels are in different control paths? Is there maybe a specification that mentions architectures that do not use the 2x2 pixel method? I'd rather just throw these instructions out of my renderer, but then I can't claim DirectX 9 compatibility...
 
I think I understand now :eek:

When using non-linear texture coordinate interpolation, i.e. when you compute the coordinates yourself using temporary registers instead of using texture registers, you don't have any gradients for mipmap selection yet. With texldd you can specify the texture gradients yourself, but to avoid long computations you approximate them by looking at the texture coordinates of the surrounding pixels.

That's quite clever but at the same time gives us these implications and limitations...
 
This is a frequent question, believe it or not.

The "shady" (pun intended?) documentation of dsx and dsy in shader/shading documentation continues a tradition.


FWIW, "The OpenGL Shading Langauge" Specification (Kessenich, Baldwin and Rost) documents dFdx and dFdy more thoroughly.

See Section 8.8, pp 60-61. (Disclosure - I contributed to that section, will be happy to take questions/comments/arrows.)


Another (historically) interesting document is "BMRT: A Global Illumination Implementation of the RenderMan Standard" (Gritz and Hahn).

Also, "The RenderMan Shading Langauge" Specification very briefly documents the functions.

Both "The RenderMan Companion" (Upstill) and "Advanced RenderMan" (Apodaca and Gritz) discuss them briefly but, more importantly, disclose some of the limitations.

Finally, there was some discussion of this at opengl.org forums last year.

http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/006806.html
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/007330.html

-mr. bill
 
Nick said:
Thanks arjan, that clarified a lot!

But it sucks. Dsx and dsy put limitations on the implementation and architecture. This could bring lots of difficulties for new chip designs, for example with new anti-aliasing schemes.
Yes indeed. It also makes it hard to scale chips down to <4 pipelines, and you cannot decouple the pipelines for higher performance either, etc.
My actual problem is that I would like to implement ps 3.0 support for swShader (see sig). It does one pixel at a time, so there's no such thing as 2x2 blocks and pipelines. For texture coordinates I can use the above formula, right? But for temporary registers I see no efficient solution.
The formula you gave above should work for texture coordinates straight out of the iterators, but not otherwise. In general, the alternative to comparing adjacent pixels is to compute derivative values analytically through each instruction, such as:
Code:
a = b*c
da/dx = c*db/dx + b*dc/dx
which is straightforward (just apply standard differentiation rules; it may be annoyingly expensive, however) for arithmetic operations, but becomes rather hairy once you try to do it on the results of a texture lookup - I guess you could try to multiply the rate of change of the texture coordinates with the rate of change of the data within the texture itself, but this requires you to do a lot more in your texture mapper code than just applying bi/trilinear interpolation.
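
In effect each register would carry its screen-space derivatives along with its value, updated instruction by instruction. A sketch of the idea (illustrative, not swShader code):

Code:
// Forward-mode differentiation: propagate (value, d/dx, d/dy)
// through every arithmetic instruction.
struct Grad {
    float v;      // register value
    float dx, dy; // screen-space rates of change
};

Grad mul(const Grad &b, const Grad &c) { // a = b*c
    return { b.v * c.v,
             c.v * b.dx + b.v * c.dx,    // product rule, x direction
             c.v * b.dy + b.v * c.dy };  // product rule, y direction
}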
One idea is to run the shader twice. At every dsx and dsy instruction I could store the current register value in a buffer, and just continue with 0. Then in a second pass I can use the data in the buffer to have the correct outcome of the dsx and dsy instructions. Inefficient, but consistent and relatively easy to implement.
Could work, in the case where a dsx/dsy instruction never depends on another one in any way. While running dsx on the result of a dsx will give you 0, there may be valid reasons to run dsy on the result of a dsx or vice versa.
Another idea is to compute 2x2 pixels at once. Aside from the very difficult design changes, this also brings another practical problem. I only have 8 SSE registers. When rendering four pixels in parallel this means that I can only store two shader registers in SSE registers. That's not even enough to do one three-argument instruction. Theoretically it's no problem because swShader has an automatic register allocator, but it would produce a lot of extra memory load/store instructions.
If you do choose to work on 2x2 pixel blocks, you should be able to get better SSE register utilization than that: except for the dsx/dsy instructions, just don't schedule pixel operations in lockstep - schedule as many instructions as you can for pixel 1, then for pixel 2, then for pixel 3, and finally for pixel 4.
These instructions are quite frustrating. JohnH, where did you read that it should return 0 when pixels are in different control paths? Is there maybe a specification that mentions architectures that do not use the 2x2 pixel method? I'd rather just throw these instructions out of my renderer, but then I can't claim DirectX 9 compatibility...
 
From DX9 docs:

The rate of change computed from the source register is an approximation on the contents of the same register in adjacent pixel(s) running the pixel shader in lock-step with the current pixel. This is designed to work even if adjacent pixels follow different paths due to flow control, because the hardware is required to run a group of lock-step pixel shaders, disabling writes as necessary when flow control goes down a path that a particular pixel does not take.

The dsx, dsy instructions compute their result by looking at the current contents of the source register (per component) for the various pixels in the local area executing in the lock-step. The exact formula used to compute the gradient varies depending on the hardware but should be consistent with the way the hardware does the same operations as part of the LOD calculation process for texture sampling.

Not a very clear definition, and even the code in refrast doesn't seem to adhere to this totally, as it sets the gradient to 0 if either of the used pixels (it uses the 2x2 block to calculate the gradient, subtracting the top-left pixel from the top-right or the bottom-right one) is inactive due to flow control.

Comment in the source:

// If any of the pixels being used to compute the gradient are not currently active
// (due to dynamic flow control), set the gradient to 0

Pixels outside the polygon in the 2x2 block are calculated anyway and the results discarded later, so the ones inside the polygon get somewhat valid gradients, depending on how the input values behave outside the polygon.
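
Restated as code, the behaviour described above would look roughly like this (a paraphrase of the comment, not the actual refrast source):

Code:
// Paraphrase of the described refrast behaviour.
// v[0..3]: register values for the 2x2 block (0 = top-left,
// 1 = top-right, 3 = bottom-right); active[i] is false for pixels
// disabled by dynamic flow control.
float gradient(const float v[4], const bool active[4], int a, int b) {
    // If any of the pixels being used to compute the gradient are not
    // currently active, set the gradient to 0.
    if (!active[a] || !active[b])
        return 0.0f;
    return v[b] - v[a];
}
// dsx: gradient(v, active, 0, 1);  dsy: gradient(v, active, 0, 3)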
 
Nick said:
Another idea is to do compute 2x2 pixels at once. Aside from the very difficult design changes, this also brings another practical problem. I only have 8 SSE registers. When rendering four pixels in parallel this means that I can only store two shader registers in SSE registers. That's not even enough to do one three-argument instruction. Theoretically it's no problem because swShader has an automatic register allocator, but it would poduce a lot of extra memory load/store instructions.
Firstly, it is much more efficient to use the load-operate forms of instructions from your register file (so you don't actually need any registers allocated for things that haven't been modified since they were last in your register file). You will find that load hoisting means the loads really are free assuming the target is in the cache (which it is likely to be for your register file and constant store) and load-store forwarding is in operation.

Secondly, stores are very very nearly free in SSE assuming the targets hit the cache.

It's far more efficient to process a 2x2 block anyway, because you certainly don't want to have XYZW in one SSE register; you should have XXXX, YYYY, ZZZZ and WWWW. It will be well over twice as fast!
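
To illustrate: in the SoA layout one SSE instruction applies one shader component across all four pixels of the block at once, and even a dot product needs no shuffles (a sketch with SSE intrinsics, names made up):

Code:
#include <xmmintrin.h>

// SoA: each __m128 holds one component for the 4 pixels of a 2x2 block.
struct Reg { __m128 x, y, z, w; };

// dp3 for four pixels at once - no shuffles, unlike the AoS layout.
__m128 dp3(const Reg &a, const Reg &b) {
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(a.x, b.x),
                                 _mm_mul_ps(a.y, b.y)),
                      _mm_mul_ps(a.z, b.z));
}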
 
arjan de lumens said:
If you are going to have dsx/dsy after a conditional branch, it would seem to me that you would need to signal some sort of synchronization point/barrier in the pixel shader, as the 2x2 pipelines would likely be out of sync after the branch otherwise. This would be needed for any mipmapped texture lookup also, as far as I can see. Are there provisions for this in the PS3.0 standard?

It is extremely unlikely (at least in first-gen 3.0 parts) that HW will do anything other than execute all required paths across all pixels in a 2x2 block, disabling register write-backs where code is invalid. So doing a dsx/dsy after a dynamic conditional "bracket" will probably work.

Generally the SDK doc does need cleaning up in this area...

John.
 
JohnH said:
If a pixel in the 2x2 block falls outside of the current polygon, then it is assumed that the value is correct, as the plane eqn for the poly is still valid, so the rate of change remains valid.

so dsx/dsy work with interpolated regs only? sorry, but i'm still trying to figure that out.


Nick said:
One idea is to run the shader twice. At every dsx and dsy instruction I could store the current register value in a buffer, and just continue with 0. Then in a second pass I can use the data in the buffer to have the correct outcome of the dsx and dsy instructions. Inefficient, but consistent and relatively easy to implement.

i assume your rasterizer goes by scanlines, no? if so then you can discriminate between dsx and dsy:
* for dsx you can simply keep the values of the regs you're interested in from the previous pixel (starting with a dummy start_x - 1 pixel)
* for dsy you can run a dummy span at y +/- 1 and fill in a buffer with the values of the regs of interest, as you've originally suggested.
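
something along these lines (made-up names, just to show the bookkeeping):

Code:
#include <vector>

// per-scanline gradient bookkeeping for a one-pixel-at-a-time renderer
struct ScanlineState {
    float prevX;                 // register value at the previous pixel,
                                 // seeded by the dummy start_x - 1 pixel
    std::vector<float> rowAbove; // register values from the dummy span at y - 1

    float dsx(float cur) { float d = cur - prevX; prevX = cur; return d; }
    float dsy(float cur, int x) { return cur - rowAbove[x]; }
};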
 
darkblu said:
JohnH said:
If a pixel in the 2x2 block falls outside of the current polygon, then it is assumed that the value is correct, as the plane eqn for the poly is still valid, so the rate of change remains valid.

so dsx/dsy work with interpolated regs only? sorry, but i'm still trying to figure that out.

Temp registers are OK as well, as they will be derived from either a constant or an iterated value (texcoord or colour) that was valid.

John.
 
Dio said:
Firstly, it is much more efficient to use the load-operate forms of instructions from your register file (so you don't actually need any registers allocated for things that haven't been modified since they were last in your register file). You will find that load hoisting means the loads really are free assuming the target is in the cache (which it is likely to be for your register file and constant store) and load-store forwarding is in operation.
Certainly I can use load-operate instructions. That's what I already mostly do. But still, only being able to store two ps 3.0 registers in SSE registers is really very little. In my current implementation there is really a very limited amount of spilling code, and a priority method is used to make sure that frequently used registers are spilled last. Extra load/store operations indeed don't take much time thanks to forwarding, but they have side effects that use resources that could have been used more efficiently. On a Pentium 4 they break up into a fair number of micro-instructions that fill the pipeline when you need it the most...
It's far more efficient to process a 2x2 block anyway, because you certainly don't want to have XYZW in one SSE register; you should have XXXX, YYYY, ZZZZ and WWWW. It will be well over twice as fast!
I don't believe so. Many pixel shader instructions can be done efficiently using SSE. Only operations like the dot product are an exception, but this will change with PNI, and shuffle operations aren't high-latency anyway. So using the SoA system won't give much of an advantage. And with the limited number of registers it certainly doesn't win us any performance. Implementing it just so that dsx and dsy are easier isn't worth it.

Besides, this would imply that every pixel takes the same control path. My code is compiled dynamically but I can't simply change from 'lock-step' to independent execution...
 
darkblu said:
i assume your rasterizer goes by scanlines, no? if so then you can discriminate between dsx and dsy:
* for dsx you can simply keep the values of the regs you're interested in from the previous pixel (starting with a dummy start_x - 1 pixel)
* for dsy you can run a dummy span at y +/- 1 and fill in a buffer with the values of the regs of interest, as you've originally suggested.
Yes, currently I simply do pixel per pixel, scanline per scanline.

So your suggestion is to make the polygon one pixel bigger 'on the top' and 'on the left'? That seems pretty simple to implement, but it's still quite inelegant.

Isn't there a mathematical way to solve this? Like when computing the cosine of the u coordinate you also directly compute the 'cosine of its derivative' like this:

d(cos(u))/dx = d(cos(u))/du * du/dx = -sin(u) * du/dx

Since you compute sincos anyway this seems reasonably fast. I think this is the only 'correct' method anyway. Dsx and dsy are nice approximations, but why not let the user determine how to approximate the gradients? Chances are that they can be approximated very nicely with a linear equation.
 
Nick said:
On a Pentium 4 they break up into a fair number of micro-instructions that fill the pipeline when you need it the most...
If you profile it, you'll find it's not significant. On a P4 the execution time of any code is almost exactly determined by the longest chain of ALU operations. Count the latencies, and check it yourself - you'll see the execution time's almost exactly the same. Even the first load and last store go away because of loop-wrap-around effects.

I don't believe so. Many pixel shader instructions can be done efficiently using SSE. Only operations like dot product are an exception, but this will change with PNI and shuffle operations aren't high-latency anyway. So using the SoA system won't give much of an advantage.
I thought the same thing as well. I absolutely refused to believe that SoA could be faster, for exactly the same reasons, and then someone proved it to me. I can't give you the proof, but I suggest you trust me. It is faster :).
 
Dio said:
If you profile it, you'll find it's not significant. On a P4 the execution time of any code is almost exactly determined by the longest chain of ALU operations. Count the latencies, and check it yourself - you'll see the execution time's almost exactly the same. Even the first load and last store go away because of loop-wrap-around effects.
I took the test... I wrote a chain of 4 dependent addps instructions, using two registers. Before and after these instructions the registers respectively load and store their results to memory. A loop then executes this 1 billion times. The result on my brother's P4 2400 with 533 FSB was 14 seconds. Removing the load and store instructions resulted in an execution time of 7 seconds. So it's a fact that instruction count and memory operations do have a big influence.
I thought the same thing as well. I absolutely refused to believe that SoA could be faster, for exactly the same reasons, and then someone proved it to me. I can't give you the proof, but I suggest you trust me. It is faster :).
SoA can be much faster than AoS, but only with an application that is suited for it. A well known example is vertex transformation. Doing them one-by-one means doing dot products which require some shuffle operations. With the SoA method these shuffle operations can be eliminated. On the other hand, SoA requires reorganization (ironically using shuffle operations) of data and lots of registers.
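
For four vertices at a time each matrix element is simply broadcast once, and the dot products fall out of plain component-wise multiplies and adds (a hypothetical sketch, assuming w = 1):

Code:
#include <xmmintrin.h>

// SoA transform of 4 vertices at once: xs/ys/zs hold one coordinate
// each for 4 vertices; m is a row-major 4x4 matrix. No shuffles in
// the inner math - each matrix element is broadcast with _mm_set1_ps.
void transform4(__m128 xs, __m128 ys, __m128 zs,
                const float m[16], __m128 out[4]) {
    for (int row = 0; row < 4; ++row) {
        __m128 r = _mm_set1_ps(m[row * 4 + 3]);                      // * w (= 1)
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m[row * 4 + 0]), xs));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m[row * 4 + 1]), ys));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m[row * 4 + 2]), zs));
        out[row] = r;
    }
}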

But a pixel shader is not a matrix transformation. Many instructions work component-wise, so no shuffles are required. So SoA would mean a performance loss, because data reorganization and the limited number of registers require extra load/store operations, which I verified are slow.
 
Nick said:
I took the test... I wrote a chain of 4 dependent addps instructions, using two registers. Before and after these instructions the registers respectively load and store their results to memory.
Can you paste the tested loop in?

But a pixel shader is not a matrix transformation. Many instructions work component-wise, so no shuffles are required.
You are right that swizzles are rare in pixel shaders. But how many operations actually need all the components? Typically, there are some calculations which are scalar, and many which don't require the alpha component. That's 3/4 and 1/4 of the SSE ALU (respectively) that's doing nothing useful...
 