I don't see how I could have been any more explicit. Anyway...

Code:

    #include <xmmintrin.h>

    void main()
    {
        __m128 x;
        __m128 y;

        for(int i = 0; i < 1000000000; i++)
        {
            __asm
            {
                movaps xmm0, x
                movaps xmm1, y
                addps xmm0, xmm1
                addps xmm1, xmm0
                addps xmm0, xmm1
                addps xmm1, xmm0
                movaps x, xmm0
                movaps y, xmm1
            }
        }
    }

I scanned through swShader/Shader/ps20Assembler.cpp and found 272 *ps instructions and 93 *ss instructions. So I think it's reasonable to say that on average I use 80% of the ALU. OK, that could have been 100% by using SoA, but that 20% isn't going to make up for the data reordering and register spilling.
Hmmm... that code doesn't explicitly initialise x and y to 0 before the loop. Since x and y are on the stack, eventually they may become denormal, infinity, or maybe some NaN. Without FTZ and DAZ set, denormals cause severe performance penalties (I have seen 10x). Of course, it is possible (maybe likely) that these are zero anyway, and therefore the result is always in range. It does look like the dependency chain of that code is about right: 4 adds are about 16 cycles (cf. your measured 17), and 4 adds + 1 store + 1 load + 2 cycles (I think that's the dependency chain here) is about 30 cycles (cf. your measured 34). So if the results are being skewed, they probably aren't being skewed much. Perhaps that's irrelevant. I think the point I'm looking for is this: it doesn't matter anyway. In my experience, when I had to spill a register to memory, it was so I could perform some other ALU operation - usually at least three or four ALU operations. The latency of those hides the store/load latencies. I'd be surprised if you can examine the code and find stores and loads closer than 12 cycles apart. All I can say is: I was convinced AoS was faster for exactly your reasons. I did it AoS. Someone else showed me SoA could be faster. I redid it SoA. And SoA was never slower and usually 2-3x faster. It wasn't a close call. Of course, your mileage may vary...
My experience is that in AoS form, packed instructions typically operate only on 3D vectors (the alpha component being rarely used except as a scalar), which would imply 62% utilisation. How does the frequency of occurrence in this source file correlate with the frequency of occurrence in the assembled output?
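The two layouts under discussion can be sketched like this (a minimal illustration with names of my own, not swShader code). In AoS each {x, y, z, pad} vector occupies one 4-wide SIMD slot, so a packed add does useful work in only 3 of 4 lanes; in SoA the same component of several vectors shares a slot, so every lane is useful:

```c
#define N 8   /* number of 3-component vectors */

/* AoS: v[i] = { x, y, z, pad } - the pad lane is wasted by packed ops */
static void add_aos(float a[N][4], float b[N][4], float out[N][4])
{
    for(int i = 0; i < N; i++)
        for(int c = 0; c < 4; c++)     /* the 4th lane is wasted work */
            out[i][c] = a[i][c] + b[i][c];
}

/* SoA: v[c][i] - one array per component, four vectors per SIMD slot */
static void add_soa(float a[3][N], float b[3][N], float out[3][N])
{
    for(int c = 0; c < 3; c++)
        for(int i = 0; i < N; i++)     /* every lane does useful work */
            out[c][i] = a[c][i] + b[c][i];
}
```

The trade-off in the thread is exactly this: SoA keeps the lanes full but forces the data to be transposed into component arrays first, and may need more live registers.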
well, not exactly. one pixel bigger on the left is correct, but one pixel bigger on the top does not suit you in the general case. the problem is that the span above the one you're trying to perform dsy's on may not be long enough, i.e. may not cover the needed, well, err, span. BTW, if you have time you may look at the dudy/dvdy calculations for the mipmap LOD selection in THURP (see sig), in the Span4UV_*_nearest_*() routines found in the rendSpan.cpp file -- those calcs are implemented in the way i'm suggesting you do the dsy, but w/o storage (as dudy/dvdy can be interpolated on the spot). i, for one, can't think of any analytical way to solve this. derived from iterated values/constants they would be, but the derivation may be such that the result no longer represents a gradient-divisible quantity. ex: linear_qty * linear_qty != linear_qty. or did i misunderstand you?
The code executed to produce a value contained in a temp register can apply any function you like. However, dsx and dsy are simply calculating the rate of change across a 2x2 pixel block, so the nature of the input function doesn't really matter. If you don't want a rate of change that is "sampled" in this manner, then your only choice is to directly implement the derivative functions, but generally dsx and dsy are going to be good enough. John.
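The point about the input function not mattering can be sketched like this (semantics assumed from the discussion, not taken from any particular implementation): however the value v was computed at each pixel, the derivative over a 2x2 quad is just the difference between neighbouring pixels.

```c
/* One 2x2 pixel block; v[y][x] is whatever the shader computed there. */
typedef struct { float v[2][2]; } Quad;

/* dsx: horizontal difference within the row containing pixel (x, y) */
static float dsx(const Quad *q, int y) { return q->v[y][1] - q->v[y][0]; }

/* dsy: vertical difference within the column containing pixel (x, y) */
static float dsy(const Quad *q, int x) { return q->v[1][x] - q->v[0][x]; }
```

Nothing about how v was produced appears here, which is exactly why the nature of the input function is irrelevant to dsx/dsy themselves.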
then how am i to interpret 'rate of change remains valid' in your original post? [ed] doh, i realized what you meant right after submitting this post. you meant that the shader code still gets executed even at locations outside the triangle. exactly that execution was the thing that was escaping me. my bad.
Ok, I took the time to initialize them. The measurements were exactly the same. And even if denormalization happened, my measurements would still clearly indicate that the load/store operations have about the same latency as the additions. To test the effect of dependency chains I unrolled the loop 10 times. The results were 16 seconds and 7 seconds. To remove the dependencies on the load/store operations I used xmm2/xmm3 instead of xmm0/xmm1. The result was now 13 seconds. So no matter what you do, load/store operations have a significant latency! If you execute the swShader demo you will find that it generates a listing file named Shader.asm. It also documents where registers are allocated and spilled. I'm counting 70 memory accesses there. Performance measurements show that the code executes in roughly 200 clock cycles. So there's not much 'hiding' of these operations behind the ALU operations. What kind of application are you talking about?
Ok, let's suppose that's closer to reality... Now tell me, how can I use SoA for ps 3.0 shaders with branches?
Thanks! Unfortunately the link in your signature didn't work... Besides, my mipmapping code is working correctly, and dsx/dsy doesn't work that way. But thanks anyway.
/* strange thing about the link, works fine for me. anyway, gonna change hosts by this weekend, hope this time things work as expected */ i suppose your mipmapping code does the du/dv derivations analytically. but those derivations can be done non-analytically - in the same way as dsx/dsy (supposedly) provide their results - through calculating the difference from the direct closure. and that's how it's done in THURP, hence my referring you to it.
I get a "Server unreachable" message... Yes, I do it analytically. It only requires one multiplication, rounding to integer and a bsr instruction, so it's really fast. Currently only nearest-point mipmapping is supported, but I think it's not too hard to extend it to linear mipmapping. As far as I know, using du/dx etc. requires several multiplications, and the correct formula even requires a square root. It is probably acceptable to approximate it by using max operations, but it still seems more expensive. Or am I wrong? So do you use the method of rendering an extra row/column of pixels? Can it be implemented elegantly and efficiently? Thanks!
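The multiply/round/bsr scheme I mean looks roughly like this (my own stand-in, not the swShader source; `scale` stands for the analytically derived texture scale factor, and a portable shift loop replaces the actual bsr instruction):

```c
/* Hypothetical sketch: nearest-point mip level as floor(log2(scale)).
 * 'scale' is assumed to come from one multiplication elsewhere. */
static int mip_level(float scale)
{
    int r = (int)(scale + 0.5f);   /* round to integer */
    if(r <= 1)
        return 0;                  /* magnification: use the base level */
    int level = 0;
    while(r >>= 1)                 /* stand-in for bsr: index of the */
        level++;                   /* highest set bit of r           */
    return level;
}
```

The bit scan turns the expensive log2 into a single instruction, which is why the analytic route stays cheap compared to the square-root formula for rho.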
No probs. As mentioned elsewhere, this does all fall to pieces when you throw dynamic flow control into the mix... John.
But it really doesn't fall to pieces, since the partial derivative of a function in the neighborhood of a singularity is undefined, and a varying "if" is (usually) a singularity. Shader writers over the past decade have learned that it hurts when you do certain things, so they don't do them. Derivatives within varying conditionals, texture fetches within varying conditionals, and derivatives of derivatives are all undefined; therefore, don't do them. BTW, sampling neighbor pixels outside of a primitive when calculating a derivative is a permitted (and even common) implementation. But it is still less desirable than sampling inside the primitive. For a simple example of why: think of texture coordinates across a triangle ranging from zero to one, and a function that is only defined between zero and one and undefined less than zero and greater than one. Sampling outside the triangle might lead to unexpected results. They can end up sticking out like a sore pixel. For a software implementation you can choose: pick the neighbor east or west that is *inside* the triangle when calculating DSX, if possible. (It's not always possible.) (Same thing north or south with DSY.) Or you could decide to KISS, since real-time shaders will probably evolve quickly enough not to run into trouble at the border. -mr. bill
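The east/west suggestion above could be sketched like this (names and the coverage representation are mine): `val[]` holds the shaded quantity per pixel along a scanline and `inside[]` flags triangle coverage.

```c
/* Prefer the in-triangle neighbour for DSX: use the east difference when
 * the east neighbour is covered, else fall back to the west difference.
 * As noted in the post, neither neighbour may be inside - a real
 * implementation needs a further fallback for that case. */
static float dsx_prefer_inside(const float *val, const int *inside, int x)
{
    if(inside[x + 1])
        return val[x + 1] - val[x];   /* east neighbour is inside */
    return val[x] - val[x - 1];       /* else difference to the west */
}
```

The same selection applies north/south for DSY. The cost, as John notes below, is that the sampling position now shifts near edges, which can itself introduce inconsistencies.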
As expected, but it was still worth checking. Very interesting. So the code was now: load xmm2+3, add xmm0+1 twice, store xmm2+3 back to the load location - but it's still taking 31 cycles per loop iteration? Hmm. The figures from your experiments seem to show that a load or store takes at least 4-5 cycles - arguably, as much as 8 cycles - and that they operate reasonably serially. That would imply that 70 memory accesses can't occur in less than around 280 cycles. So somewhere there is something wrong... You've got me really interested now. If I can find time I'll try to write some reasonably similar code to work out what's going on.
Ah. Now, that's the big question, isn't it? It's a comparable problem to 'how on earth are we going to do this in hardware?' Unfortunately one of the side effects of that is that I probably can't talk about it.
OK, this loop

Code:

    movaps xmm2, x
    movaps xmm3, y
    addps xmm0, xmm1
    addps xmm1, xmm0
    addps xmm0, xmm1
    addps xmm1, xmm0
    movaps x, xmm2
    movaps y, xmm3

executes in 17.4 cycles/run, cf. this loop

Code:

    movaps xmm0, x
    movaps xmm1, y
    addps xmm0, xmm1
    addps xmm1, xmm0
    addps xmm0, xmm1
    addps xmm1, xmm0
    movaps x, xmm0
    movaps y, xmm1

which executes in 34.0 clocks per run, and this loop

Code:

    addps xmm0, xmm1
    addps xmm1, xmm0
    addps xmm0, xmm1
    addps xmm1, xmm0

which executes in 16.0 clocks per run. This confirms what I claimed originally: the stores and loads parallelise completely inside the add latencies if there is no dependent chain, which explains why the 70 memory accesses don't cost 300-400 cycles minimum. One interesting point is that the release build on MSVC was consistently 3 clocks slower than the debug build. This is because the release build uses a dec for the loop counter while the debug build uses add eax, 1. dec is bad for performance because it causes a partial flag resolve and is therefore a loop-serialising operation.
there, got a new host already. no, you aren't wrong. although it's not the tex. coord derivative computation, it's the 'rho' computation (i.e. the texture local scale factor). i used its 'max' approximation in THURP until i decided i needed a higher-quality reference value of the output and switched to the real equation (the one with the square root). well, as i told you, since the tex. coord derivatives can be done analytically it's not really a big deal, but the approach can be used for any per-pixel quantity - you can store it somewhere and use it later for calculating its 1st derivative.
Although it's another new thing to learn, and it's an area where different HW might behave differently, e.g. post-conditional dsx/dsy may or may not work. Actually, if you watch the MS DirectXDev forum at all you'll see that the simplest of things come up over and over again, and this stuff is becoming a lot more complicated than the things many have problems with now... Anyway, these kinds of things can be fixed by tightening up the API and validation, so hopefully this may turn into a non-issue. This is all true, e.g. see similar issues with MSAA. But anything else is unlikely to happen in HW. That will introduce inconsistencies in the results, as you are moving where you are sampling for DSX and DSY depending on proximity to an edge; so although it's an option it's not necessarily any better, and could be worse in some circumstances. Later, John.
Most of those 70 memory accesses are read operations, with very few dependencies. In the experiment there is a constant chain of dependencies. I think that's where the difference lies. If you put RAW operations far enough from each other they probably don't have any impact. So I still think that the excessive register spilling caused by SoA will have a significant performance impact, because for almost every shader instruction you would have to load/store one or more ps registers from memory.