Dsx and dsy

Dio said:
Can you paste the tested loop in?
I don't see how I could have been any more explicit. Anyway...
Code:
#include <xmmintrin.h>

void main()
{
    __m128 x;
    __m128 y;

    for(int i = 0; i < 1000000000; i++)
    {
        __asm
        {
            movaps xmm0, x
            movaps xmm1, y
            addps xmm0, xmm1
            addps xmm1, xmm0
            addps xmm0, xmm1
            addps xmm1, xmm0
            movaps x, xmm0
            movaps y, xmm1
        }        
    }
}
You are right that swizzles are rare in pixel shaders. But how many operations actually need all the components? Typically, there are some calculations which are scalar, and many which don't require the alpha component. That's 3/4 and 1/4 of the SSE ALU, respectively, doing nothing useful...
I scanned through swShader/Shader/ps20Assembler.cpp and found 272 *ps instructions and 93 *ss instructions. So I think it's reasonable to say that on average I use 80% of the ALU ((272×4 + 93×1)/((272 + 93)×4) ≈ 81%). Ok, that could have been 100% by using SoA, but that 20% isn't going to make up for the data reordering and register spilling.
 
Hmmm... that code doesn't explicitly initialise X and Y to 0 before the loop. Since X and Y are on the stack, eventually they may become denormal, infinity, or maybe some NaN. Without FTZ and DAZ set, denormals cause severe performance penalties (I have seen 10x). Of course, it is possible (maybe likely) that these are zero anyway, and therefore the result is always in range.
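(For reference, FTZ and DAZ live in MXCSR and can be set from the xmmintrin.h header the test already includes - a sketch only; the DAZ bit exists only on DAZ-capable CPUs:)
Code:
#include <xmmintrin.h>

// FTZ is MXCSR bit 15 (portable macro below); DAZ is MXCSR bit 6.
void EnableFtzDaz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  // flush denormal results to zero
    _mm_setcsr(_mm_getcsr() | 0x0040);           // treat denormal inputs as zero (DAZ-capable CPUs only)
}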

It does look like the latency chain of that code is right. 4 adds are about 16 cycles (cf. your measured 17) and 4 adds + 1 store + 1 load + 2 cycles (I think that's the dependency chain here) is about 30 cycles (cf. your measured 34). So if results are being skewed, they probably aren't being skewed much.

Perhaps that's irrelevant. The point I'm looking for is this: it doesn't matter anyway. In my experience, when I had to spill a register to memory, it was so I could perform some other ALU operation - usually at least three or four ALU operations. The latency of those hides the store/load latencies. I'd be surprised if you can examine the code and find stores and loads closer than 12 cycles apart.

All I can say is: I was convinced AoS was faster for exactly your reasons. I did it AoS. Someone else showed me SoA could be faster. I redid it SoA. And SoA was never slower and usually 2-3x faster. It wasn't a close call.

Of course, your mileage may vary...
 
Nick said:
I scanned through swShader/Shader/ps20Assembler.cpp and found 272 *ps instructions and 93 *ss instructions. So I think it's reasonable to say that on average I use 80% of the ALU ((272×4 + 93×1)/((272 + 93)×4) ≈ 81%).
My experience is that in AoS form packed instructions typically operate only on 3D vectors (the alpha component being rarely used except as a scalar), which would imply 62% utilisation ((272×3 + 93×1)/((272 + 93)×4) ≈ 62%).

How does the frequency of occurrence in this source file correlate with the frequency of occurrence in the assembled output?
 
Nick said:
darkblu said:
i assume your rasterizer goes by scanlines, no? if so then you can discriminate between dsx and dsy:
* for dsx you can simply keep the values of the regs you're interested in from the previous pixel (starting with a dummy start_x - 1 pixel)
* for dsy you can run a dummy span at y +/- 1 and fill in a buffer with the values of the regs of interest, as you've originally suggested.
Yes, currently I simply go pixel by pixel, scanline by scanline.

So your suggestion is to make the polygon one pixel bigger 'on the top' and 'on the left'? That seems pretty simple to implement, but it still seems quite inelegant.

well, not exactly. one pixel bigger on the left is correct, but one pixel bigger on the top does not suit you in the general case. the problem is that the span above the one you're trying to perform dsy's on may not be long enough, i.e. may not cover the needed, well err, span. BTW, if you have time you may look at the dudy/dvdy calculations for the mipmap LOD selection in THURP (see sig), in the Span4UV_*_nearest_*() routines found in the rendSpan.cpp file -- those calcs are implemented in the way i'm suggesting you do the dsy, but w/o storage (as dudy/dvdy can be interpolated on the spot).
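
to make that concrete, here's a minimal sketch of the scheme (hypothetical names, not actual THURP code; triangle coverage at the dummy locations is ignored):
Code:
struct ShadeRegs { float u, v; };            // whatever per-pixel values we want gradients of

ShadeRegs Shade(int x, int y)                // stand-in for running the shader up to the dsx/dsy point
{
    ShadeRegs r = { x * 0.1f, y * 0.1f };    // pretend u and v vary linearly
    return r;
}

void ShadeSpan(int y, int x0, int x1, ShadeRegs* rowAbove)  // rowAbove indexed by x, sized by the caller
{
    for(int x = x0; x <= x1; x++)            // dummy span at y - 1, buffered per x, for dsy
        rowAbove[x] = Shade(x, y - 1);

    ShadeRegs prev = Shade(x0 - 1, y);       // dummy pixel at start_x - 1 seeds dsx
    for(int x = x0; x <= x1; x++)
    {
        ShadeRegs cur = Shade(x, y);
        float dudx = cur.u - prev.u;         // dsx: difference from the previous pixel
        float dudy = cur.u - rowAbove[x].u;  // dsy: difference from the dummy span
        // ... use dudx/dudy here ...
        prev = cur;
    }
}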

Isn't there a mathematical way to solve this? Like when computing the cosine of the u coordinate you also directly compute the 'cosine of its derivative' like this:

d(cos(u))/dx = d(cos(u))/du * du/dx = -sin(u) * du/dx

Since you compute sincos anyway this seems reasonably fast. I think this is the only 'correct' method anyway. Dsx and dsy are nice approximations, but why not let the user determine how to approximate the gradients? Chances are that they can be approximated very nicely with a linear equation.
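
To illustrate the idea, a minimal sketch (hypothetical names; a real sincos would produce both values at once):
Code:
#include <cmath>

// compute cos(u) together with its screen-space derivative analytically,
// using the chain rule: d(cos(u))/dx = -sin(u) * du/dx
void CosWithDerivative(float u, float dudx, float* out, float* outDx)
{
    float s = std::sin(u);   // sincos would give sin and cos in one go
    float c = std::cos(u);
    *out   = c;
    *outDx = -s * dudx;      // chain rule
}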

i, for one, can't think of any analytical way to solve this.

JohnH said:
Temp registers are OK as well as they will be derived from either a constant or an iterated value (texcoord or colour) that was valid.

derived from iterated/constants they would be, but the derivation may be such that the result may no longer represent a gradient-derivable quantity. ex: linear_qty * linear_qty != linear_qty
or did i misunderstand you?
 
darkblu said:
JohnH said:
Temp registers are OK as well as they will be derived from either a constant or an iterated value (texcoord or colour) that was valid.

derived from iterated/constants they would be, but the derivation may be such that the result may no longer represent a gradient-derivable quantity. ex: linear_qty * linear_qty != linear_qty
or did i misunderstand you?

The code executed to produce a value contained in a temp register can apply any function you like. However, dsx and dsy are simply calculating the rate of change across a 2x2 pixel block, so the nature of the input function doesn't really matter. If you don't want a rate of change that is "sampled" in this manner, then your only choice is to directly implement the derivative functions, but generally dsx and dsy are going to be good enough.
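
As a minimal sketch of that 2x2 sampling (hypothetical pixel layout and names; actual HW details vary):
Code:
// quad[0]=top-left, quad[1]=top-right, quad[2]=bottom-left, quad[3]=bottom-right.
// Whatever function produced the values, dsx/dsy are just neighbour differences.
void QuadGradients(const float quad[4], float* dsx, float* dsy)
{
    *dsx = quad[1] - quad[0];   // right minus left
    *dsy = quad[2] - quad[0];   // bottom minus top
}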

John.
 
JohnH said:
The code executed to produce a value contained in a temp register can apply any function you like. However, dsx and dsy are simply calculating the rate of change across a 2x2 pixel block, so the nature of the input function doesn't really matter. If you don't want a rate of change that is "sampled" in this manner, then your only choice is to directly implement the derivative functions, but generally dsx and dsy are going to be good enough.

then how am i to interpret 'rate of change remains valid' in your original post?

JohnH said:
arjan de lumens said:
what happens if some pixels in the 2x2 block fall outside the current polygon?

If a pixel in the 2x2 block falls outside of the current polygon then it is assumed that the value is correct as the plane eqn for the poly is still valid so rate of change remains valid.

[ed] doh, i realized what you meant right after submitting this post. you meant that the shader code still gets executed even at locations outside the triangle. exactly that execution was the thing that was escaping me. my bad.
 
Dio said:
Hmmm... that code doesn't explicitly initialise X and Y to 0 before the loop. Since X and Y are on the stack, eventually they may become denormal, infinity, or maybe some NaN. Without FTZ and DAZ set, denormals cause severe performance penalties (I have seen 10x). Of course, it is possible (maybe likely) that these are zero anyway, and therefore the result is always in range.
Ok, I took the time to initialize them. The measurements were exactly the same. And even if denormals had occurred, my measurements would still clearly indicate that the load/store operations have about the same latency as the additions.
It does look like the latency chain of that code is right. 4 adds are about 16 cycles (cf. your measured 17) and 4 adds + 1 store + 1 load + 2 cycles (I think that's the dependency chain here) is about 30 cycles (cf. your measured 34). So if results are being skewed, they probably aren't being skewed much.
To test the effect of dependency chains I unrolled the loop 10 times. The results were 16 seconds and 7 seconds. To remove the dependencies in the load/store operations I used xmm2/xmm3 instead of xmm0/xmm1. The result was now 13 seconds. So no matter what you do, load/store operations have a significant latency!
Perhaps that's irrelevant. The point I'm looking for is this: it doesn't matter anyway. In my experience, when I had to spill a register to memory, it was so I could perform some other ALU operation - usually at least three or four ALU operations. The latency of those hides the store/load latencies. I'd be surprised if you can examine the code and find stores and loads closer than 12 cycles apart.
If you execute the swShader demo you will find that it generates a listing file named Shader.asm. It also documents where registers are allocated and spilled. I'm counting 70 memory accesses there. Performance measurements show that the code executes in roughly 200 clock cycles. So there's not much 'hiding' of these operations behind the ALU operations.
All I can say is: I was convinced AoS was faster for exactly your reasons. I did it AoS. Someone else showed me SoA could be faster. I redid it SoA. And SoA was never slower and usually 2-3x faster. It wasn't a close call.
What kind of application are you talking about?
 
Dio said:
My experience is that in AoS form packed instructions typically operate only on 3D vectors (the alpha component being rarely used except as a scalar), which would imply 62% utilisation ((272×3 + 93×1)/((272 + 93)×4) ≈ 62%).
Ok, let's suppose that's closer to reality...

Now tell me, how can I use SoA for ps 3.0 shaders with branches?
 
darkblu said:
well, not exactly. one pixel bigger on the left is correct, but one pixel bigger on the top does not suit you in the general case. the problem is that the span above the one you're trying to perform dsy's on may not be long enough, i.e. may not cover the needed, well err, span. BTW, if you have time you may look at the dudy/dvdy calculations for the mipmap LOD selection in THURP (see sig), in the Span4UV_*_nearest_*() routines found in the rendSpan.cpp file -- those calcs are implemented in the way i'm suggesting you do the dsy, but w/o storage (as dudy/dvdy can be interpolated on the spot).
Thanks! Unfortunately the link in your signature didn't work... Besides, my mipmapping code is working correctly, and dsx/dsy don't work that way. But thanks anyway.
 
Nick said:
darkblu said:
well, not exactly. one pixel bigger on the left is correct, but one pixel bigger on the top does not suit you in the general case. the problem is that the span above the one you're trying to perform dsy's on may not be long enough, i.e. may not cover the needed, well err, span. BTW, if you have time you may look at the dudy/dvdy calculations for the mipmap LOD selection in THURP (see sig), in the Span4UV_*_nearest_*() routines found in the rendSpan.cpp file -- those calcs are implemented in the way i'm suggesting you do the dsy, but w/o storage (as dudy/dvdy can be interpolated on the spot).
Thanks! Unfortunately the link in your signature didn't work... Besides, my mipmapping code is working correctly, and dsx/dsy don't work that way. But thanks anyway.

/* strange thing about the link, works fine for me. anyway, gonna change hosts by this weekend, hope this time things work as expected */

i suppose your mipmapping code does the du/dv derivations analytically. but those derivations can be done non-analytically - in the same way as dsx/dsy (supposedly) provide their results - through calculating the difference from the direct closure. and that's how it's done in THURP, hence my referencing you to it.
 
darkblu said:
/* strange thing about the link, works fine for me. anyway, gonna change hosts by this weekend, hope this time things work as expected */
I get a "Server unreachable" message...
i suppose your mipmapping code does the du/dv derivations analytically. but those derivations can be done non-analytically - in the same way as dsx/dsy (supposedly) provide their results - through calculating the difference from the direct closure. and that's how it's done in THURP, hence my referencing you to it.
Yes, I do it analytically. It only requires one multiplication, rounding to integer and a bsr instruction, so it's really fast. Currently only nearest-point mipmapping is supported, but I think it's not too hard to extend it to linear mipmapping.
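
Roughly like this (a sketch only, with hypothetical names; the actual swShader code may differ):
Code:
// one multiply brings the texture-space step into integer range; bsr then
// gives the index of the highest set bit, i.e. floor(log2), as the mip level.
inline int MipLevel(float step, float texSize)
{
    int fixed = (int)(step * texSize + 0.5f);  // one multiplication + round to integer
    if(fixed <= 0) return 0;                   // magnification: stay on the base level
    int level;
    __asm
    {
        bsr eax, fixed                         // floor(log2(fixed))
        mov level, eax
    }
    return level;
}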

As far as I know using du/dx etc. requires several multiplications and the correct formula even requires a square root. It is probably acceptable to approximate it by using max operations but it still seems more expensive. Or am I wrong?

So do you use the method of rendering an extra row/column of pixels? Can it be implemented elegantly and efficiently? Thanks!
 
darkblu said:
JohnH said:
The code executed to produce a value contained in a temp register can apply any function you like. However, dsx and dsy are simply calculating the rate of change across a 2x2 pixel block, so the nature of the input function doesn't really matter. If you don't want a rate of change that is "sampled" in this manner, then your only choice is to directly implement the derivative functions, but generally dsx and dsy are going to be good enough.

then how am i to interpret 'rate of change remains valid' in your original post?

JohnH said:
arjan de lumens said:
what happens if some pixels in the 2x2 block fall outside the current polygon?

If a pixel in the 2x2 block falls outside of the current polygon then it is assumed that the value is correct as the plane eqn for the poly is still valid so rate of change remains valid.

[ed] doh, i realized what you meant right after submitting this post. you meant that the shader code still gets executed even at locations outside the triangle. exactly that execution was the thing that was escaping me. my bad.

No probs.

As mentioned elsewhere, this does all fall to pieces when you throw dynamic flow control into the mix...

John.
 
JohnH said:
As mentioned elsewhere, this does all fall to pieces when you throw dynamic flow control into the mix...
John.

But it really doesn't fall to pieces, since the partial derivative of a function in the neighborhood of a singularity is undefined. A varying "if" is (usually) a singularity.

Shader writers over the past decade have learned that it hurts when you do certain things, so they don't do them. Derivatives within varying conditionals, texture fetches within varying conditionals, and derivatives of derivatives are all undefined; therefore, don't do them.


BTW, sampling neighbor pixels outside of a primitive when calculating a derivative is a permitted (and even common) implementation.

But it is still less desirable than sampling inside the primitive.

For a simple example of why:

Think of texture coordinates across a triangle ranging from zero to one. Think of a function that is only defined between zero and one, and undefined below zero or above one. Sampling outside the triangle might lead to unexpected results. They can end up sticking out like a sore pixel.

For a software implementation you can choose. Pick the neighbor east or west that is *inside* the triangle when calculating DSX, if possible. (It's not always possible.) (Same thing north or south with DSY.)
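
A minimal sketch of that selection for DSX (hypothetical names; the flags say whether each neighbor is inside the triangle):
Code:
// prefer a difference that stays inside the triangle; fall back otherwise
float Dsx(float cur, float west, bool westInside, float east, bool eastInside)
{
    if(eastInside) return east - cur;   // forward difference into the triangle
    if(westInside) return cur - west;   // backward difference into the triangle
    return east - cur;                  // both outside: no inside neighbor exists
}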

Or you could decide to KISS, since real-time shaders will probably evolve quickly not to run into trouble at the border.

-mr. bill
 
Nick said:
Ok, I took the time to initialize them. The measurements were exactly the same.
As expected, but it was still worth checking.
To remove the dependencies in the load/store operations I used xmm2/xmm3 instead of xmm0/xmm1. The result was now 13 seconds.
Very interesting. So the code was now load xmm2+3, add xmm0+1 twice, store xmm2+3 back to the load location, but it's still taking 31 cycles per loop iteration?
If you execute the swShader demo you will find that it generates a listing file named Shader.asm. It also documents where registers are allocated and spilled. I'm counting 70 memory accesses there. Performance measurements show that the code executes in roughly 200 clock cycles. So there's not much 'hiding' of these operations behind the ALU operations.
Hmm. The figures from your experiments seem to show that a load or store takes at least 4-5 cycles - arguably as much as 8 cycles - and that they operate reasonably serially. That would imply that 70 memory accesses can't occur in less than around 280 cycles. So somewhere there is something wrong...

You've got me really interested now :). If I can find time I'll try to write some code to work out what's going on.
What kind of application are you talking about?
Something reasonably similar.
 
Nick said:
Now tell me, how can I use SoA for ps 3.0 shaders with branches?
Ah. Now, that's the big question, isn't it? It's a comparable problem to 'how on earth are we going to do this in hardware?' Unfortunately one of the side effects of that is that I probably can't talk about it.
 
OK, this loop
Code:
movaps xmm2, x
movaps xmm3, y
addps xmm0, xmm1
addps xmm1, xmm0
addps xmm0, xmm1
addps xmm1, xmm0
movaps x, xmm2
movaps y, xmm3
executes in 17.4 cycles/run, cf. this loop
Code:
movaps xmm0, x
movaps xmm1, y
addps xmm0, xmm1
addps xmm1, xmm0
addps xmm0, xmm1
addps xmm1, xmm0
movaps x, xmm0
movaps y, xmm1
which executes in 34.0 clocks per run, and this loop
Code:
addps xmm0, xmm1
addps xmm1, xmm0
addps xmm0, xmm1
addps xmm1, xmm0
which executes in 16.0 clocks per run.

This confirms what I claimed originally, which is that the stores and loads parallelise completely inside the add latencies if there is no dependency chain, and explains why the 70 memory accesses don't cost 300-400 cycles minimum.

One interesting point is that the release build on MSVC was 3 clocks slower (consistently) than the debug build. This is because the release build uses dec for the loop counter while the debug build uses add eax, 1. dec is bad for performance here because it causes a partial flag-register resolve and is therefore a loop-serialising operation.
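
For illustration, the two counter idioms side by side (a sketch; on these cores dec leaves the carry flag untouched, while add/sub write the full flags register):
Code:
; debug build counts up:
add eax, 1            ; writes all arithmetic flags - no partial resolve
; release build counts down:
dec eax               ; leaves CF untouched - can force a partial flag resolve
sub eax, 1            ; same decrement, but writes the full flags register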
 
Nick said:
darkblu said:
/* strange thing about the link, works fine for me. anyway, gonna change hosts by this weekend, hope this time things work as expected */
I get a "Server unreachable" message...

there, got a new host already.

As far as I know using du/dx etc. requires several multiplications and the correct formula even requires a square root. It is probably acceptable to approximate it by using max operations but it still seems more expensive. Or am I wrong?

you aren't. although it's not the tex. coord derivative computation, it's the 'rho' computation (i.e. the texture local scale factor). i used its 'max' approximation in THURP until i decided i needed a higher reference value for the output and switched to the real equation (the one with a square root).
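
for reference, the two variants look roughly like this (a sketch with hypothetical names; derivatives taken in texel units):
Code:
#include <cmath>
#include <algorithm>

// the real equation: the larger of the two axis footprints, each a square root
float RhoExact(float dudx, float dvdx, float dudy, float dvdy)
{
    float lx = std::sqrt(dudx * dudx + dvdx * dvdx);
    float ly = std::sqrt(dudy * dudy + dvdy * dvdy);
    return std::max(lx, ly);
}

// the cheap 'max' approximation: no multiplies or square roots at all
float RhoApprox(float dudx, float dvdx, float dudy, float dvdy)
{
    return std::max(std::max(std::fabs(dudx), std::fabs(dvdx)),
                    std::max(std::fabs(dudy), std::fabs(dvdy)));
}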

So do you use the method of rendering an extra row/column of pixels? Can it be implemented elegantly and efficiently? Thanks!

well, as i told you, since the tex. coord derivatives can be done analytically it's not really a big deal, but the approach can be used for any per-pixel quantity - you can store it somewhere and use it later for calculating its 1st derivative.
 
mrbill said:
JohnH said:
As mentioned elsewhere, this does all fall to pieces when you throw dynamic flow control into the mix...
John.

But it really doesn't fall to pieces, since the partial derivative of a function in the neighborhood of a singularity is undefined. A varying "if" is (usually) a singularity.

Shader writers over the past decade have learned that it hurts when you do certain things, so they don't do them. Derivatives within varying conditionals, texture fetches within varying conditionals, and derivatives of derivatives are all undefined; therefore, don't do them.

Although it's another new thing to learn, and it's an area where different HW might behave differently, e.g. post-conditional dsx/dsy may or may not work. Actually, if you watch the MS DirectXDev forum at all you'll see that the simplest of things come up over and over again, and this stuff is becoming a lot more complicated than the things many have problems with now...
Anyway, these types of things can be fixed by tightening up the API and validation, so hopefully it may turn into a non-issue.

BTW, sampling neighbor pixels outside of a primitive when calculating a derivative is a permitted (and even common) implementation.

But it is still less desirable than sampling inside the primitive.

For a simple example of why:

Think of texture coordinates across a triangle ranging from zero to one. Think of a function that is only defined between zero and one, and undefined below zero or above one. Sampling outside the triangle might lead to unexpected results. They can end up sticking out like a sore pixel.
This is all true, e.g. see similar issues with MSAA. But anything else is unlikely to happen in HW.

For a software implementation you can choose. Pick the neighbor east or west that is *inside* the triangle when calculating DSX, if possible. (It's not always possible.) (Same thing north or south with DSY.)

That will introduce inconsistencies in the results, as you are moving where you sample for DSX and DSY depending on proximity to an edge; so although it's an option, it's not necessarily any better, and could be worse in some circumstances.

Later,
John.
 
Dio said:
Hmm. The figures from your experiments seem to show that a load or store takes at least 4-5 cycles - arguably as much as 8 cycles - and that they operate reasonably serially. That would imply that 70 memory accesses can't occur in less than around 280 cycles. So somewhere there is something wrong...
Most of those 70 memory accesses are read operations, with very few dependencies. In the experiment there is a constant chain of dependencies. I think that's where the difference lies. If you put RAW (read-after-write) operations far enough from each other, they probably don't have any impact.

So I still think that the excessive register spilling caused by SoA will have a significant performance impact, because for almost every shader instruction you would have to load/store one or more ps registers from memory.
 