Dsx and dsy

Dio said:
OK, this loop
...
executes in 17.4 clocks per run, cf. this loop
...
which executes in 34.0 clocks per run, and this loop
...
which executes in 16.0 clocks per run.

This confirms what I claimed originally, which is that the stores and loads parallelise completely inside the add latencies if there is no dependent chain, and explains why the 70 memory accesses don't cost 300-400 cycles minimum.
Is that with the 10 times unrolling? Otherwise you're measuring significantly different values from what I have...
Just to be clear: timing is by rdtsc, on a P4 2.66.
I used my mobile phone's stopwatch... But I'm certain it's accurate to within half a second. My Game Boy thumb has good reflexes ;)
 
Unrolling it ten times made it 37.0 clocks per run for the slow case and 22.0 clocks per run for the fast case. I'm baffled as to why it's slower; I would guess some strange interaction with the architecture.

One other interesting one: doing the same thing twice on two independent values (adding a w and z and repeating the same code - even using the same registers, because of register renaming) adds only 4 clocks - parallelism really works on the P4. Again, this is a reason not to be afraid of loads and stores - without them, you don't get the bonuses of register renaming!
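
For illustration, the difference between a dependent chain and independent work looks something like this in C (sketches only - not the actual loops measured above):

Code:
// sketch only. One long dependent add chain:
float dependentSum(const float* a, int n)
{
    float s = 0.0f;
    for(int i = 0; i < n; i++)
        s += a[i];          // every add must wait for the previous one
    return s;
}

// Two independent chains: the adds (and their loads) overlap,
// hiding the add latency.
float independentSums(const float* a, int n)
{
    float s0 = 0.0f, s1 = 0.0f;
    for(int i = 0; i < n - 1; i += 2)
    {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}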

rdtsc is so easy to use - it makes collecting the timing trivial and very accurate, and you can use only a few million loop iterations and collect the data much faster - but I agree that over ten seconds or more the difference is insignificant. I can hand you this snippet:
Code:
__inline __int64 GetTSC(void)
{
    __int64 res;
    int i;

    __asm {
        lock sub [i], 1     // locked read-modify-write: serialise, or at least try to
        rdtsc               // read the time-stamp counter into edx:eax
        lea edi, res
        mov [edi], eax      // low dword of the 64-bit result
        mov [edi+4], edx    // high dword
    }
    return res;
}
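
Typical use, for completeness (the loop body and iteration count are of course whatever you're measuring):

Code:
__int64 t0 = GetTSC();

for(int n = 0; n < 1000000; n++)
{
    // code under test
}

double clocksPerRun = (double)(GetTSC() - t0) / 1000000;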
 
darkblu said:
there, got a new host already.
I hate to tell you this, but now I get a DNS error...
you aren't. although it's not the tex. coord derivative computation, it's the 'rho' computation (i.e. the texture local scale factor). i used its 'max' approximation in THURP until i decided i needed a higher reference value for the output, and switched to the real equation (the one with the square root).
I'm trying to stick to DirectX 9 specifications, so max is good enough. But I think it would be too slow. My tex instructions are already the longest, and this method would require four memory loads, four subtractions, three max instructions and one bsr. That's four times slower than what I currently use, and is probably less accurate.
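Spelled out, the per-pixel cost I'm counting would look something like this (a hypothetical sketch with fixed-point coordinates and made-up names, not actual code from either renderer; _BitScanReverse is the MSVC intrinsic for bsr):

Code:
#include <stdlib.h>     // abs
#include <intrin.h>     // _BitScanReverse

static int maxApproxLod(int u, int v,   // current pixel's coords, already in registers
                        const int* uRow, const int* vRow, int x, int pitch)
{
    int dudx = abs(uRow[x + 1]     - u);    // four loads, four subtractions
    int dvdx = abs(vRow[x + 1]     - v);
    int dudy = abs(uRow[x + pitch] - u);
    int dvdy = abs(vRow[x + pitch] - v);

    int t0  = dudx > dvdx ? dudx : dvdx;    // three max operations
    int t1  = dudy > dvdy ? dudy : dvdy;
    int rho = t0 > t1 ? t0 : t1;

    unsigned long lod;                      // one bsr: floor(log2(rho))
    _BitScanReverse(&lod, (unsigned long)rho | 1);
    return (int)lod;
}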
well, as i told you, since the tex. coord derivatives can be done analytically it's not really a big deal, but the approach can be used for any per-pixel quantity - you can store it somewhere and use it later for calculating its 1st derivative.
How efficient is this "storing somewhere"? BTW, how do you deal with depth occlusion? With that method you can't have early-z.

THURP is, as the name implies, not meant to be real-time, but swShader is. So I'm afraid we're a bit on a different wavelength here...
 
Nick said:
darkblu said:
there, got a new host already.
I hate to tell you this, but now I get a DNS error...

:? i'll look into it tonight

I'm trying to stick to DirectX 9 specifications, so max is good enough. But I think it would be too slow. My tex instructions are already the longest, and this method would require four memory loads, four subtractions, three max instructions and one bsr. That's four times slower than what I currently use, and is probably less accurate.

most likely. though i don't recall seeing any significant differences in the output when switching from 'max' to the full-fledged distance calculation.

How efficient is this "storing somewhere"?

can't tell you right away, as i don't have a ready scheme at hand (THURP does not try storing the u/v's, i referred to it because it uses that same difference-from-neighbours approach dsx and dsy use).

BTW, how do you deal with depth occlusion? With that method you can't have early-z.

yes. early-z may preclude you from actually obtaining anything to store. but generally, for a scanline i, you'd pursue something along these lines:

Code:
scanline i - 1 :    ......xxxxxx........
scanline i     :    oooooooooooooooooooo

where 'x's get produced as part of the previous scanline span, and '.'s get deliberately calculated for the current-scanline span. in this case a max-span-long array may do well as a storage structure. so at the beginning of a new span, to provide your last-scanline span quantities, you go like:

1) calculate data for the '.'-runs from last scanline
2) do the current-scanline span

this logic may also expand to incorporate runs from the previous scanline which were not produced due to early-z rejection.
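
something like this, as a minimal sketch (names are illustrative, not actual THURP code; step (1) above - deliberately filling the missing '.' runs - is glossed over):

Code:
#define MAX_SPAN 2048

/* one per-pixel quantity (say, u) from scanline i-1 */
static float prevRow[MAX_SPAN];

extern float computeQuantity(int x, int y);   /* whatever gets interpolated */

void spanDerivY(int x0, int x1, int y)
{
    int x;
    for(x = x0; x < x1; x++)
    {
        float q    = computeQuantity(x, y);
        float dqdy = q - prevRow[x];   /* a single subtraction per derivative */
        prevRow[x] = q;                /* becomes the 'previous' row for scanline i+1 */
        /* ... use dqdy ... */
    }
}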

THURP is, as the name implies, not meant to be real-time, but swShader is. So I'm afraid we're a bit on a different wavelength here...

not completely. although swShader is following a strict real-world, real-time implementation, and THURP is more of a classic (pre-shaders) rasterizer abstraction, it, too, does try not to be annoyingly slow. AAMOF, its structural framework is pretty much real-time-targeted, just that the span scan-converters are plain, non-SIMD, heavily-float-utilizing C code.
 
darkblu said:
THURP does not try storing the u/v's, i referred to it because it uses that same difference-from-neighbours approach dsx and dsy use.
I think I finally understand. This comment in your source says it all:
Code:
// calculate the uv-mapping of a fictitious pixel above this one
I know THURP is not meant to be optimized but that's got to be the least efficient way to calculate mipmapping levels. No wonder software rendering has a reputation of being slow. :rolleyes:

You can algebraically rewrite the x-compression formula to look like this:

Mx = 1/w²*sqrt((Cu*y+Ux)²+(Cv*y+Vx)²)

Where Cu, Ux, Cv and Vx are constant per polygon. Then you also compute My and use rho = max(Mx, My). So for the stuff in the square roots you only need two additions per pixel (y is constant per scanline, so the Mx terms are constant there; only the x-dependent terms inside My change per pixel). The 1/w² can be computed with a single multiplication because you already have 1/w for perspective correction.
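
For reference, those constants fall straight out of differentiating the perspective mapping. Assuming u = (au*x + bu*y + cu) / w with w = aw*x + bw*y + cw (symbol names picked just for this sketch):

du/dx = (au*w - aw*(au*x + bu*y + cu)) / w²
      = ((au*bw - aw*bu)*y + (au*cw - aw*cu)) / w²

so Cu = au*bw - aw*bu and Ux = au*cw - aw*cu, and analogously for Cv, Vx and the y-direction.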

The only approximation I use is to compute rho only per vertex and interpolate it (using 1/w²). The difference is truly unnoticeable, and the final cost per pixel is four operations (reduced to three because I interpolate rho in parallel with the texture coordinates).
 
back to topic after a seaside weekend.

Nick said:
I know THURP is not meant to be optimized but that's got to be the least efficient way to calculate mipmapping levels. No wonder software rendering has a reputation of being slow. :rolleyes:

You can algebraically rewrite the x-compression formula to look like this:

Mx = 1/w²*sqrt((Cu*y+Ux)²+(Cv*y+Vx)²)

Where Cu, Ux, Cv and Vx are constant per polygon. Then you also compute My and use rho = max(Mx, My). So for the stuff in the square roots you only need two additions per pixel (y is constant per scanline, so the Mx terms are constant there; only the x-dependent terms inside My change per pixel). The 1/w² can be computed with a single multiplication because you already have 1/w for perspective correction.

ok, a sidenote first, so we can avoid unnecessary use of rollies in the future.
the idea behind THURP is to have a classic, multi-pass-based rasterizer which (a) generally targets max precision, yet (b) has an execution flow which does not prevent the lib from getting subsequently re-targeted to real-time performance w/o the need for radical changes. otherwise many a calculation in THURP is performance-suboptimal as of present; the mere fact that all calculations are carried out in floats regardless of targeted dynamic range is indicative, i'd presume.

now, on the dsx/dsy approach vs the analytical approach.

although you're right in saying that getting the derivatives analytically is faster compared to what THURP presently does, in the general case getting them as a difference from the neighbours is a) more robust (can be applied to any sort of quantity, including dependent texturing), b) can be real fast if the y-direction derivatives are implemented by the scanline scheme i mentioned in the prev post. after all, you can hardly beat a single subtraction per derivative, can you? now, the fact that presently THURP's mipmapping code does not implement such a scheme does not imply such is not intended for later on.

The only approximation I use is to compute rho only per vertex and interpolate it (using 1/w²). The difference is truly unnoticeable, and the final cost per pixel is four operations (reduced to three because I interpolate rho in parallel with the texture coordinates).

interpolating rho is not an option for the mainstream version of THURP. might be an option for a more performance-targeted branch, though. anyway, by-interpolation optimizations are not on the present tasklist (covering functionality to perform per-pixel phong is)
 
darkblu said:
in the general case getting them as a difference from the neighbours is a) more robust (can be applied to any sort of quantity, including dependent texturing)
I'm not very sure about this definition of 'robust'. The 'derivative' of four samples of an arbitrary texture is unlikely to be meaningful. Of course, if the texture implements a function, then the derivative may or may not be reasonable (for example, if the value crosses a singularity). However, it is on-spec and one hopes that developers are not daft enough to think this has more meaning than its strict definition (i.e. it's a comparison with the corresponding register's value in surrounding pixels).

I note that the DX9 specification seems to actually specify an implementation. An algorithmic method does not meet that implementation - therefore, if you want to meet spec, you have to go with that implementation.
 
darkblu said:
back to topic after a seaside weekend.
I hope it was relaxing! That reminds me it's been four years or so since I've seen the sea, even though it's only 50 km away... Bah, got to study for re-examinations.
the idea behind THURP is to have a classic, multi-pass-based rasterizer which (a) generally targets max precision, yet (b) has an execution flow which does not prevent the lib from getting subsequently re-targeted to real-time performance w/o the need for radical changes. otherwise many a calculation in THURP is performance-suboptimal as of present; the mere fact that all calculations are carried out in floats regardless of targeted dynamic range is indicative, i'd presume.
But for maximum precision and consistency we already have the reference rasterizer, no? And there are other libraries for the really 'scientific' rendering. So, although software rendering is always a cool project, I don't fully understand the purpose of THURP yet. I chose to just leave the highest quality rendering to specialized software and hardware, and focus on supporting as many features as possible at acceptable precision and performance. Image quality is of importance to me, but if I have an approximation which doesn't cause artifacts then I'm happy with it. What are your future plans with THURP?
although you're right in saying that getting the derivatives analytically is faster compared to what THURP presently does, in the general case getting them as a difference from the neighbours is a) more robust (can be applied to any sort of quantity, including dependent texturing), b) can be real fast if the y-direction derivatives are implemented by the scanline scheme i mentioned in the prev post. after all, you can hardly beat a single subtraction per derivative, can you? now, the fact that presently THURP's mipmapping code does not implement such a scheme does not imply such is not intended for later on.
What worried me most performance-wise was not the subtractions, but the perspective divides. This is one of the slowest things when texturing, even with SSE optimizations. But with your method you do it three times per pixel. Of course, the method used by hardware with 2x2 blocks is not necessarily slower, and it's certainly more robust than the analytical method.

So, I'm really very interested in the dsx/dsy method, but it's impractical with my current idea for the ps 3.0 design. The dynamic flow control makes it nearly impossible and I have no idea how to deal with it. Does anyone know exactly how the hardware gets the mipmap LOD for conditionally executed texld instructions? Anything the hardware can do, I can do too, even if I have to take a different approach...

I have one new idea. I can render 2x2 blocks, but every pixel sequentially. For every dsx/dsy instruction (or tex instruction for that matter), I can split the shader. Temporary registers get stored in memory, together with the texture coordinates. This is easy with my built-in automatic register allocator. When all four pixels have finished the first part of the shader, and the texture coordinates are known, I can continue with the second part, etc. The only possible performance losses are the extra pixels being calculated at the edges, and the saving/restoring of the registers. But if Dio is to be believed, the latter won't make a difference ;)
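
In rough pseudo-C++ the idea would look like this (everything here is a hypothetical sketch, not actual swShader code):

Code:
// per-pixel state spilled between shader phases
struct PixelState
{
    float temps[8][4];     // live temporary registers (x,y,z,w)
    float texCoord[4];     // texture coordinate feeding the next texld
};

PixelState quad[4];        // the four pixels of a 2x2 block

void runPhase0(PixelState& p) { /* shader up to the first dsx/dsy/texld */ }
void runPhase1(PixelState& p, float dsx, float dsy) { /* rest of the shader */ }

void runQuad()
{
    for(int p = 0; p < 4; p++)
        runPhase0(quad[p]);    // all four pixels reach the sync point

    // with all four texture coordinates known, the derivatives are
    // plain differences across the quad
    float dsx = quad[1].texCoord[0] - quad[0].texCoord[0];
    float dsy = quad[2].texCoord[0] - quad[0].texCoord[0];

    for(int p = 0; p < 4; p++)
        runPhase1(quad[p], dsx, dsy);
}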

interpolating rho is not an option for the mainstream version of THURP. might be an option for a more performance-targeted branch, though. anyway, by-interpolation optimizations are not on the present tasklist (covering functionality to perform per-pixel phong is)

That brings me to another deficiency of THURP. Your rendSpan.cpp file is huge, and nearly unmanageable. Every time you add a new feature you'll have to rename your functions and the file will double in size. The latter problem can be solved elegantly by using templates:
Code:
template<bool phongEnabled>
void renderSpan()
{
    // general setup
    
    if(phongEnabled)
    {
        // phong implementation
    }
    else
    {
        // gouraud
    }

    // generic code
}
You get the idea... But it doesn't solve the problem of typing all the function names, and you'll need a gigantic switch statement (even bigger than what you're already using). And even if that was manageable by using macros and such, adding even more rendering options will grow your executable exponentially, not to mention the compilation time.
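
To make that concrete, the run-time dispatch to the template instantiations would look something like this (hypothetical names; every render-state bool you add doubles the table):

Code:
template<bool phongEnabled, bool mipmapEnabled>
void renderSpan() { /* as above */ }

typedef void (*SpanFunc)();

SpanFunc selectSpanner(bool phong, bool mipmap)
{
    // 2 bools -> 4 instantiations; n bools -> 2^n
    static const SpanFunc table[2][2] =
    {
        { &renderSpan<false, false>, &renderSpan<false, true> },
        { &renderSpan<true,  false>, &renderSpan<true,  true> }
    };
    return table[phong ? 1 : 0][mipmap ? 1 : 0];
}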

Run-time compilation looks quite similar to the templated code, but is fully flexible. In my case I use my own run-time assembler SoftWire, but you can just as well let the C++ compiler do the job. In that case it's also referred to as 'code stitching'. This is rather easy to implement by using 'naked' functions (__declspec(naked) in Visual C++). Just remember only to use static data, so you can relocate code without much trouble. SoftWire is more efficient because it can do automatic register allocation and peephole optimizations (and I'm planning a scheduler), but for THURP a stitcher would be perfect...
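
A bare-bones stitcher core might look like this (purely illustrative: it assumes you already know each fragment's address and size and that the fragments are position-independent, which is the hard part in practice):

Code:
#include <windows.h>
#include <string.h>

// a relocatable code fragment: start address and size in bytes
struct Fragment
{
    const void* code;
    size_t size;
};

typedef void (*StitchedFunc)();

StitchedFunc stitch(const Fragment* frags, int count)
{
    // executable buffer for the stitched routine
    unsigned char* buffer = (unsigned char*)VirtualAlloc(
        0, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    size_t pos = 0;

    for(int i = 0; i < count; i++)
    {
        memcpy(buffer + pos, frags[i].code, frags[i].size);   // append fragment
        pos += frags[i].size;
    }

    buffer[pos] = 0xC3;   // terminating 'ret'
    return (StitchedFunc)buffer;
}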
 
Dio said:
darkblu said:
in the general case getting them as a difference from the neighbours is a) more robust (can be applied to any sort of quantity, including dependent texturing)
I'm not very sure about this definition of 'robust'. The 'derivative' of four samples of an arbitrary texture is unlikely to be meaningful. Of course, if the texture implements a function, then the derivative may or may not be reasonable (for example, if the value crosses a singularity). However, it is on-spec and one hopes that developers are not daft enough to think this has more meaning than its strict definition (i.e. it's a comparison with the corresponding register's value in surrounding pixels).

maybe 'robust' is not the best word to be used in such a context, but anyhow, i meant the capability to always get the local differences of a given quantity and interpret those as 'derivatives' whenever sensible.
oh, and of course developers are never daft enough (as in 'they can always be more daft') :)

I note that the DX9 specification seems to actually specify an implementation. An algorithmic method does not meet that implementation - therefore, if you want to meet spec, you have to go with that implementation.

failing to follow you here. maybe you meant an 'analytic method'?
 
Nick said:
So, I'm really very interested in the dsx/dsy method, but it's impractical with my current idea for the ps 3.0 design. The dynamic flow control makes it nearly impossible and I have no idea how to deal with it.
That's why it's specced that way for ps 3.0 - so it works with flow control. If you want ps 3.0 compatibility, you really have to do it this way.

darkblu said:
failing to follow you here. maybe you meant an 'analytic method'?
Probably...
 
Nick said:
darkblu said:
back to topic after a seaside weekend.
I hope it was relaxing! That reminds me it's been four years or so since I've seen the sea, even though it's only 50 km away... Bah, got to study for re-examinations.

it was relaxing indeed. although i generally prefer high altitudes to sea level, this one sea vacation was very nice. take your exams and head over to the sea (make sure you're in good company too)

the idea behind THURP is to have a classic, multi-pass-based rasterizer which (a) generally targets max precision, yet (b) has an execution flow which does not prevent the lib from getting subsequently re-targeted to real-time performance w/o the need for radical changes. otherwise many a calculation in THURP is performance-suboptimal as of present; the mere fact that all calculations are carried out in floats regardless of targeted dynamic range is indicative, i'd presume.
But for maximum precision and consistency we already have the reference rasterizer, no? And there are other libraries for the really 'scientific' rendering. So, although software rendering is always a cool project, I don't fully understand the purpose of THURP yet.

we surely have dx's refrast, and then a myriad of software renderers and rasterizers. and those will keep growing in numbers in the future - the key is in the variety - there's a principle in evolution that calls for redundancy. of course the authors of that sw don't do it to fulfill some divine cosmic principle - no, they usually do it for the fun of it. something you know very well ;) as for THURP, aside from the fun, it exists so i (and others) can try new multi-pass rendering techniques in action w/o the need for actual graphics hw or some adoption by an API (so it would one day show up in its 'refrast').

I chose to just leave the highest quality rendering to specialized software and hardware, and focus on supporting as many features as possible at acceptable precision and performance. Image quality is of importance to me, but if I have an approximation which doesn't cause artifacts then I'm happy with it. What are your future plans with THURP?

for the time being, bringing it to a state where it would be able to carry out per-pixel phong (as in Carmack's new engine). i don't have further plans for it. yet.

although you're right in saying that getting the derivatives analytically is faster compared to what THURP presently does, in the general case getting them as a difference from the neighbours is a) more robust (can be applied to any sort of quantity, including dependent texturing), b) can be real fast if the y-direction derivatives are implemented by the scanline scheme i mentioned in the prev post. after all, you can hardly beat a single subtraction per derivative, can you? now, the fact that presently THURP's mipmapping code does not implement such a scheme does not imply such is not intended for later on.
What worried me most performance-wise was not the subtractions, but the perspective divides. This is one of the slowest things when texturing, even with SSE optimizations. But with your method you do it three times per pixel. Of course, the method used by hardware with 2x2 blocks is not necessarily slower, and it's certainly more robust than the analytical method.

the excess of perspective divides in THURP's mipmapping is due to the fact that it does not yet have that particular scanline storage scheme we talked about. when it has it in place one day, dudy & dvdy will cost a subtraction op each (just as dudx & dvdx cost a single subtraction op now)

So, I'm really very interested in the dsx/dsy method, but it's impractical with my current idea for the ps 3.0 design. The dynamic flow control makes it nearly impossible and I have no idea how to deal with it. Does anyone know exactly how the hardware gets the mipmap LOD for conditionally executed texld instructions? Anything the hardware can do, I can do too, even if I have to take a different approach...

honestly, the dynamic flow control is beyond my scope FTTB, as THURP is a classic, multi-stage rasterizer, and as such does not have the shaders' flow control. but of course the question you ask is interesting in itself. i believe mr.bill posted some account on the matter earlier in this thread.

I have one new idea. I can render 2x2 blocks, but every pixel sequentially. For every dsx/dsy instruction (or tex instruction for that matter), I can split the shader. Temporary registers get stored in memory, together with the texture coordinates. This is easy with my built-in automatic register allocator. When all four pixels have finished the first part of the shader, and the texture coordinates are known, I can continue with the second part, etc. The only possible performance losses are the extra pixels being calculated at the edges, and the saving/restoring of the registers. But if Dio is to be believed, the latter won't make a difference ;)

yep, your idea is quite viable. a dsx/dsy op would serve as a sync point across the pixels in a 2x2 block. hey, actually that's even multi-threadable! ;)

interpolating rho is not an option for the mainstream version of THURP. might be an option for a more performance-targeted branch, though. anyway, by-interpolation optimizations are not on the present tasklist (covering functionality to perform per-pixel phong is)

That brings me to another deficiency of THURP. Your rendSpan.cpp file is huge, and nearly unmanageable. Every time you add a new feature you'll have to rename your functions and the file will double in size. The latter problem can be solved elegantly by using templates:
<snip>
You get the idea... But it doesn't solve the problem of typing all the function names, and you'll need a gigantic switch statement (even bigger than what you're already using). And even if that was manageable by using macros and such, adding even more rendering options will grow your executable exponentially, not to mention the compilation time.

yes, that's an issue with the approach taken in THURP. and it's not so much in the size of the spanners' source file and its compile time (as i can split that file in portions and do a multi-process compile on my SMP machine) as in the size of the top-level control switch statements and the extremely low code reuse across the spanner routines. but actually it's a matter of human manageability vs machine manageability - the compiler handles gigantic switch statements very well, so i've made the choice to leave that task to the cc. the approach you suggest does not quite solve the problem, as the templatization is a compile-time action, whereas the render-state control flow is run-time.

Run-time compilation looks quite similar to the templated code, but is fully flexible. In my case I use my own run-time assembler SoftWire, but you can just as well let the C++ compiler do the job. In that case it's also referred to as 'code stitching'. This is rather easy to implement by using 'naked' functions (__declspec(naked) in Visual C++). Just remember only to use static data, so you can relocate code without much trouble. SoftWire is more efficient because it can do automatic register allocation and peephole optimizations (and I'm planning a scheduler), but for THURP a stitcher would be perfect...

i find the swShader approach extremely smart with its use of run-time assembly. and yes, i may consider implementing stitching for THURP in a later phase.
 
we surely have dx's refrast, and then a myriad of software renderers and rasterizers. and those will keep growing in numbers in the future - the key is in the variety - there's a principle in evolution that calls for redundancy. of course the authors of that sw don't do it to fulfill some divine cosmic principle - no, they usually do it for the fun of it. something you know very well ;) as for THURP, aside from the fun, it exists so i (and others) can try new multi-pass rendering techniques in action w/o the need for actual graphics hw or some adoption by an API (so it would one day show up in its 'refrast').
Ah yes, the possibility to do any kind of experiment is an excellent reason for the existence of a software renderer. That was once one of my goals too, but I soon realized that all my ideas already existed or were simply not the best solution. Still trying though...
for the time being, bringing it to a state where it would be able to carry out per-pixel phong (as in Carmack's new engine). i don't have further plans for it. yet.
My latest plan for swShader is to compile my own d3d9.dll. I already managed to build the 'skeleton' and now I have to fill in the blank spots.
the excess of perspective divides in THURP's mipmapping is due to the fact that it does not yet have that particular scanline storage scheme we talked about. when it has it in place one day, dudy & dvdy will cost a subtraction op each (just as dudx & dvdx cost a single subtraction op now)
I see. Let me know if it works!
honestly, the dynamic flow control is beyond my scope FTTB, as THURP is a classic, multi-stage rasterizer, and as such does not have the shaders' flow control. but of course the question you ask is interesting in itself. i believe mr.bill posted some account on the matter earlier in this thread.
I recently read part of an article discussing the possibility of translating RenderMan shaders to multi-pass rendering. The idea was to go operation by operation over the whole polygon, instead of doing all operations at once per pixel. It was slower because of the increased memory bandwidth needs, but still interesting...
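In other words, the register file becomes a set of buffers with one entry per covered pixel, and the outer loop is over instructions instead of pixels. A toy sketch with made-up names (the bandwidth cost shows up because every op streams whole buffers through memory):

Code:
const int MAX_PIXELS = 4096;

struct Op { int opcode, dst, src0, src1; };

float regs[16][MAX_PIXELS];   // one 'lane' per covered pixel

void execute(const Op* program, int numOps, int numPixels)
{
    for(int i = 0; i < numOps; i++)            // per instruction...
    {
        const Op& op = program[i];
        for(int p = 0; p < numPixels; p++)     // ...visit every pixel
        {
            switch(op.opcode)
            {
            case 0: regs[op.dst][p] = regs[op.src0][p] + regs[op.src1][p]; break;  // add
            case 1: regs[op.dst][p] = regs[op.src0][p] * regs[op.src1][p]; break;  // mul
            }
        }
    }
}
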
yep, your idea is quite viable. a dsx/dsy op would serve as a sync point across the pixels in a 2x2 block. hey, actually that's even multi-threadable! ;)
Yeah, every thread could handle a 2x2 block. I guess I'll have to wait for Extended Hyper-Threading then. ;)
yes, that's an issue with the approach taken in THURP. and it's not so much in the size of the spanners' source file and its compile time (as i can split that file in portions and do a multi-process compile on my SMP machine) as in the size of the top-level control switch statements and the extremely low code reuse across the spanner routines. but actually it's a matter of human manageability vs machine manageability - the compiler handles gigantic switch statements very well, so i've made the choice to leave that task to the cc. the approach you suggest does not quite solve the problem, as the templatization is a compile-time action, whereas the render-state control flow is run-time.
Templates do solve the code reuse and manageability problems of the otherwise gigantic file...
 
Nick said:
yes, that's an issue with the approach taken in THURP. and it's not so much in the size of the spanners' source file and its compile time (as i can split that file in portions and do a multi-process compile on my SMP machine) as in the size of the top-level control switch statements and the extremely low code reuse across the spanner routines. but actually it's a matter of human manageability vs machine manageability - the compiler handles gigantic switch statements very well, so i've made the choice to leave that task to the cc. the approach you suggest does not quite solve the problem, as the templatization is a compile-time action, whereas the render-state control flow is run-time.
Templates do solve the code reuse and manageability problems of the otherwise gigantic file...

yes, sorry, i realized i had misinterpreted your point shortly after posting my reply, but i was already away from the pc.
the reason no spanners are currently templatized in THURP is quite a dull one -- the compiler does not allow me to. see, i use gcc 2.9, whose template support is not quite full-blown -- it does not support template-templates in particular, which prevents me from writing

Code:
template <template <bool PARAM_T> void SPANNER_T(vertex, vertex)>
void f()

i'm expecting to have gcc 3.2 available under beos soon, which would fix the problem.
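
(for completeness: standard C++ only accepts class templates as template-template parameters anyway, so even a newer gcc would want the spanner wrapped - a hypothetical sketch:)

Code:
struct vertex { /* ... */ };

// wrap the templated spanner in a class template so it can be passed
// as a template-template parameter (names are illustrative)
template <bool PARAM_T>
struct Spanner
{
    static void run(vertex a, vertex b) { /* span code */ }
};

template <template <bool> class SPANNER_T>
void f(vertex a, vertex b)
{
    SPANNER_T<true>::run(a, b);   // choose the compile-time variant
}

// usage: f<Spanner>(v0, v1);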
 