Anyone got a NV40...?

Xmas

... to check the execution time of this little PS snippet? :D
Code:
 ps_2_x

dcl_2d s0
dcl_2d s1
dcl t0

texld r2, t0, s0
dsx r0.xy, t0
dsy r1.xy, t0
add r2.xy, t0, r2
texldd r0, r2, s1, r0, r1
mov oC0, r0
NVShaderPerf says 6 cycles for CineFX1 and 5 for CineFX2, but unfortunately it doesn't provide numbers for NV40 yet. My guess is that it should take either three or four cycles, but I'm not sure of that.


edit: added write mask to add.
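For readers less familiar with the instructions involved, here is a rough Python sketch of what dsx and dsy compute: per-pixel differences of a value across a 2x2 pixel quad. The shader takes these differences from the unperturbed t0 and hands them to texldd, which samples with caller-supplied gradients. The quad layout, the sample coordinates and the function bodies are assumptions made for illustration; this is not how any particular chip implements it.

```python
# Illustrative model only: texture coordinates for one 2x2 pixel quad,
# laid out as quad[row][col] = (u, v). The values are made up.
quad = [
    [(0.00, 0.00), (0.25, 0.00)],
    [(0.00, 0.50), (0.25, 0.50)],
]


def dsx(quad, row):
    """Horizontal difference: right pixel minus left pixel in one row."""
    left, right = quad[row]
    return (right[0] - left[0], right[1] - left[1])


def dsy(quad, col):
    """Vertical difference: bottom pixel minus top pixel in one column."""
    top, bottom = quad[0][col], quad[1][col]
    return (bottom[0] - top[0], bottom[1] - top[1])


# Gradients of the unperturbed coordinate, as the shader feeds them to texldd.
ddx = dsx(quad, 0)
ddy = dsy(quad, 0)
print("dsx:", ddx, "dsy:", ddy)
```

Each difference only needs two components here, which is why the shader writes just .xy for dsx and dsy.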
 
Well, if it can co-issue the dsx/dsy instructions, and execute them at the same time as that first texld instruction, it seems like it should take 2 cycles. Otherwise I would expect 3 cycles.

Anyway, I've been keeping my eyes open for a 6800, and I'm hoping to get one in just over a week.
 
I'm pretty sure it can co-issue two 2D-dsx/dsy, but I'd expect them to take up SU1. The add takes place in SU1, too, so texldd starts in cycle 3. And texldd takes two cycles on NV3x.
 
Ah... In that case it should take 3 cycles, from what I understand from the info released.

1. tex, vec2 -- vec2 coissue
2. add
3. tex, mov
 
Xmas said:
I'm pretty sure it can co-issue two 2D-dsx/dsy, but I'd expect them to take up SU1. The add takes place in SU1, too, so texldd starts in cycle 3. And texldd takes two cycles on NV3x.
I had thought that the first shader unit was the special function unit, and the second one is the one that does the adds. Anyway, it will be something to test.
 
I modified the code to:
Code:
 ps_2_x 

dcl_2d s0 
dcl t0 

texld r2, t0, s0 
dsx r0.xy, t0 
dsy r1.xy, t0 
add r2.xy, t0, r2 
texldd r0, r2, s0, r0, r1 
mov oC0, r0
because the fillrate tester my friend used doesn't have 2 textures.

The result he obtained showed that the code needs roughly 11 clocks to finish!!! This is kinda weird.
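For anyone wondering how a fillrate tester turns into a clock count, the arithmetic is roughly the following. This is a sketch under stated assumptions: the core clock, pipeline count and measured fillrate below are hypothetical numbers picked for illustration, not the figures from the test above, and the formula assumes every pixel pipeline is kept busy.

```python
def clocks_per_pixel(core_clock_mhz, pipelines, fillrate_mpixels_s):
    """Average clocks one pipeline spends per pixel.

    Assumes all pipelines are busy, so the chip retires
    core_clock_mhz * pipelines pipeline-clocks per microsecond.
    """
    return core_clock_mhz * pipelines / fillrate_mpixels_s


# Hypothetical inputs, for illustration only.
core_clock_mhz = 400.0   # assumed core clock
pipelines = 16           # assumed number of pixel pipelines
measured_fillrate = 582.0  # assumed tester output, in Mpixels/s

clocks = clocks_per_pixel(core_clock_mhz, pipelines, measured_fillrate)
print(round(clocks, 1))  # roughly 11 with these made-up inputs
```

Because the result is an average over many pixels, it includes stalls such as texture latency that a static per-instruction count would not show.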
 
That would be a result of the hardware not being able to hide the latency of the dependent texture read. What hardware took 11 clocks?
 
What apps/tools are you guys using to test these shaders?

Would you be able to post a link to the program(s) here? I would like to test a few myself.
 
Xmas said:
I'm pretty sure it can co-issue two 2D-dsx/dsy, but I'd expect them to take up SU1. The add takes place in SU1, too, so texldd starts in cycle 3. And texldd takes two cycles on NV3x.

That doesn't seem to be the case. It seems that only one 2D dsx/dsy can be done per cycle, and that it requires both ALUs.
 
Xmas said:
... to check the execution time of this little PS snippet? :D
Code:
 ps_2_x

dcl_2d s0
dcl_2d s1
dcl t0

texld r2, t0, s0
dsx r0.xy, t0
dsy r1.xy, t0
add r2.xy, t0, r2
texldd r0, r2, s1, r0, r1
mov oC0, r0
NVShaderPerf says 6 cycles for CineFX1 and 5 for CineFX2, but unfortunately it doesn't provide numbers for NV40 yet. My guess is that it should take either three or four cycles, but I'm not sure of that.


edit: added write mask to add.

Here is what seems to be done:

Code:
cycle 1:
dsx r0.xy, t0

cycle 2:
dsy r1.xy, t0

cycle 3:
texld r2, t0, s0
add r2.xy, t0, r2

cycles 4-12:
texldd r0, r2, s1, r0, r1

I think that there will be some improvements with a better compiler.
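To make the breakdown above easier to compare with the earlier estimates, here is a small Python tally of it. It is only bookkeeping over the cycle ranges as listed; the tuple layout and the helper name are mine, and the final mov is left out because the listing does not place it.

```python
# Cycle ranges transcribed from the listing above:
# (first cycle, last cycle, instructions issued in that range)
schedule = [
    (1, 1, ["dsx"]),
    (2, 2, ["dsy"]),
    (3, 3, ["texld", "add"]),
    (4, 12, ["texldd"]),
]


def cycles_used(first, last):
    """Number of cycles covered by an inclusive range."""
    return last - first + 1


total = max(last for _, last, _ in schedule)
for first, last, ops in schedule:
    print(", ".join(ops), "->", cycles_used(first, last), "cycle(s)")
print("total:", total)
```

On this reading almost all of the time goes to texldd, while the remaining instructions fit in the three cycles guessed earlier in the thread.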
 
Chalnoth said:
How did you test that, Tridam?

I've run some variations of this shader. No single variation's result says much on its own, but all the results together give a clearer picture of what's going on.
 
Chalnoth said:
Xmas said:
I'm pretty sure it can co-issue two 2D-dsx/dsy, but I'd expect them to take up SU1. The add takes place in SU1, too, so texldd starts in cycle 3. And texldd takes two cycles on NV3x.
I had thought that the first shader unit was the special function unit, and the second one is the one that does the adds. Anyway, it will be something to test.
Yes, the first unit is SF/MUL/TEX, but that would be SU0 ;). dsx/dsy, being a SUB, should run in SU1.

Tridam said:
Here is what seems to be done:
Code:
cycles 4-12
texldd r0, r2, s1, r0, r1

I think that there will be some improvements with a better compiler.
:oops: :oops:
 
By the way, as far as the compiler is concerned, I really wouldn't expect much benefit for short shaders like this one. The benefit will be in longer, more complex shaders where latency hiding can be done.
 