X800 vs GF6800 fp texture/buffer performance

nAo said:
I'm confused..who is getting this thing right? :?:
Is a NV40 pixel pipeline able to execute pure ALU ops while it's waiting for a texture fetch?
It might depend on the shader compiler. If the ALU ops require data from the texture fetch, then it's impossible to execute them. Depending on the shader, the compiler should be able to find some ALU ops that use data unrelated to the texture fetch. Of course, there's also the possibility that the pipeline stalls until the texture fetch is complete. . .
 
From some previous post it seemed the NV40 architecture wasn't able to execute even non-dependent instructions while a texture fetch was still in flight.
From some other post now it seems that's indeed possible (to execute non-dependent instructions!).
Where's the truth? ;)
 
nAo said:
From some previous post it seemed the NV40 architecture wasn't able to execute even non-dependent instructions while a texture fetch was still in flight.
From some other post now it seems that's indeed possible (to execute non-dependent instructions!).
Where's the truth? ;)
Damien's results seem to prove that independent instructions can be executed in parallel. The GPUbench results show what happens if you only have dependent instructions.
So the whole pipeline stalls if the ALU tries to get a result from the TMU which is not ready.
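
To make the difference between the two cases concrete, here is a minimal Python sketch of an in-order pipeline. It is a toy model, not a description of the real NV40: the one-instruction-per-cycle issue rate, the one-cycle ALU latency, and the register names are all assumptions made for illustration.

```python
# Toy in-order pipeline, NOT a description of real NV40 hardware.
# Assumptions: one instruction issues per cycle, ALU results are ready one
# cycle after issue, and an instruction stalls until all its sources are ready.

def total_cycles(program, fetch_latency):
    """program: list of (dest, sources, is_texture_fetch) in issue order."""
    ready_at = {}  # register -> cycle at which its value becomes available
    clock = 0      # earliest cycle at which the next instruction may issue
    for dest, sources, is_fetch in program:
        # Registers never written by the program (inputs, constants) are ready at 0.
        start = max([clock] + [ready_at.get(src, 0) for src in sources])
        ready_at[dest] = start + (fetch_latency if is_fetch else 1)
        clock = start + 1
    return max(ready_at.values())

# Every ALU op hangs off the fetch result, directly or indirectly.
dependent = [
    ("t0", ["uv"], True),
    ("r0", ["t0"], False),
    ("r1", ["r0"], False),
    ("r2", ["r1"], False),
    ("r3", ["r2"], False),
]

# Three ALU ops work on a constant; only the last one needs the texel.
independent = [
    ("t0", ["uv"], True),
    ("r0", ["c0"], False),
    ("r1", ["r0"], False),
    ("r2", ["r1"], False),
    ("r3", ["t0", "r2"], False),
]

for latency in (3, 14):
    print(latency, total_cycles(dependent, latency), total_cycles(independent, latency))
```

In this model the independent version finishes earlier for the same instruction count, because its first three ALU ops overlap with the fetch. The dependent version pays the full fetch latency before any math can start.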


Tridam said:
BTW, the sampling latency for different FP32 texture sizes :

2 : 3 cycles
...
512 : 3 cycles
1024 : 14 cycles
2048 : 91 cycles

Textures drawn on 2 triangles at 1600x1200.

If the sampling needs more than 3 cycles because of a memory/cache bottleneck, the shader compiler/scheduler can't know that in advance, so only 4 (3+1) math instructions can be executed during the sampling. NVIDIA can probably tweak that in the driver for specific cases.

ATI doesn't have this problem, of course.
91 cycles? :oops:
Ok, 2048x2048 mapped to 1600x1200 isn't particularly cache-friendly, but 14 cycles for a fetch from a 1024² texture plain sucks.

The compiler certainly tries to put as many independent instructions after the fetch as possible, so trilinear and AF have less impact. But with short shaders, there might be no independent ops. Maybe that's a reason why AF tends to cost less on ATI chips.
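
A rough way to put numbers on this, as a sketch only: take the "4 (3+1)" figure from the quoted post as a fixed budget of math instructions that can overlap a fetch, assume each math instruction takes one cycle (my assumption), and see how many cycles of the measured latency stay exposed. The function and constant names are made up for illustration.

```python
# Budget of math instructions the scheduler places behind a fetch,
# taken from the "4 (3+1)" figure in the quoted post.
HIDDEN_MATH_OPS = 4

def exposed_stall(measured_latency, independent_ops, budget=HIDDEN_MATH_OPS):
    """Cycles of fetch latency NOT covered by independent math.

    Assumes one cycle per math instruction (an assumption, not a measured fact).
    """
    covered = min(independent_ops, budget)
    return max(0, measured_latency - covered)

# FP32 point-sampling latencies quoted above, keyed by texture size.
fp32_latency = {512: 3, 1024: 14, 2048: 91}

for size, latency in fp32_latency.items():
    # Long shader (budget fully used) versus short shader (nothing to overlap).
    print(size, exposed_stall(latency, 4), exposed_stall(latency, 0))
```

With a short shader there is nothing to overlap, so the whole latency shows up as a stall. With the budget fully used, only the cached case is completely hidden under these assumptions.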
 
I was thinking that it should be possible to find out by using a low level shading language, so as to eliminate the compiler as a factor.
 
Ostsol said:
I was thinking that it should be possible to find out by using a low level shading language, so as to eliminate the compiler as a factor.

I'm using ASM shaders but there is also a compiler/scheduler in the driver.
 
Some more texture sampling latency results (rgb) :

6800GT

INT8 bilinear
2 : 1 cycle
...
512 : 1 cycle
1024 : 1.5 cycles
2048 : 3 cycles
4096 : 11 cycles

FP16 point
2 : 2 cycles
...
512 : 2 cycles
1024 : 6 cycles
2048 : 25 cycles

FP16 bilinear
2 : 2 cycles
...
512 : 2 cycles
1024 : 6.5 cycles
2048 : 28 cycles

FP32 point
2 : 3 cycles
...
512 : 3 cycles
1024 : 14 cycles
2048 : 91 cycles


X800XT

INT8 bilinear
2 : 1.5 cycles
...
256 : 1.5 cycles
512 : 2 cycles
1024 : 3 cycles
2048 : 7 cycles

FP16 point
2 : 2 cycles
...
128 : 2 cycles
256 : 2.5 cycles
512 : 3 cycles
1024 : 4.5 cycles
2048 : 9 cycles

FP32 point
2 : 4 cycles
...
128 : 4 cycles
256 : 4.5 cycles
512 : 5.5 cycles
1024 : 7.5 cycles
2048 : texture corruption
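
For a quick side-by-side of the larger sizes, here is a small Python sketch that divides the 6800GT cycle counts by the X800XT ones. The numbers are copied from the lists above. The ratio is only a rough indicator, since cycles are counted per chip and the sketch ignores any difference in clock speed.

```python
# Cycle counts copied from the lists above; None marks the corrupted X800XT run.
RESULTS = {
    "INT8 bilinear": {"6800GT": {512: 1, 1024: 1.5, 2048: 3},
                      "X800XT": {512: 2, 1024: 3, 2048: 7}},
    "FP16 point":    {"6800GT": {512: 2, 1024: 6, 2048: 25},
                      "X800XT": {512: 3, 1024: 4.5, 2048: 9}},
    "FP32 point":    {"6800GT": {512: 3, 1024: 14, 2048: 91},
                      "X800XT": {512: 5.5, 1024: 7.5, 2048: None}},
}

def ratio(fmt, size):
    """6800GT cycles divided by X800XT cycles, or None if a result is missing."""
    nv = RESULTS[fmt]["6800GT"][size]
    ati = RESULTS[fmt]["X800XT"][size]
    if nv is None or ati is None:
        return None
    return nv / ati

for fmt, chips in RESULTS.items():
    for size in sorted(chips["6800GT"]):
        r = ratio(fmt, size)
        print(fmt, size, "n/a" if r is None else round(r, 2))
```

A ratio below 1 means the 6800GT needs fewer cycles per sample for that format and size, and a ratio above 1 means the X800XT does.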
 
The 91 cycles (NV40, FP32, 2048^2) seem to indicate that every texture instruction is hitting main memory.
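
Some back-of-the-envelope arithmetic on why the big textures are so unfriendly to a cache, as a Python sketch. It assumes uncompressed textures without mipmaps and 4 stored components per texel. Those are my assumptions, not something confirmed for the test app.

```python
def footprint_mib(size, components, bytes_per_component):
    """Size in MiB of a square, uncompressed, non-mipmapped texture."""
    return size * size * components * bytes_per_component / (1024 * 1024)

def texel_step(texture_size, screen_width, screen_height):
    """Texels advanced per screen pixel when the texture spans the full screen."""
    return texture_size / screen_width, texture_size / screen_height

for size in (512, 1024, 2048):
    fp16 = footprint_mib(size, 4, 2)   # 4 x 16-bit floats per texel
    fp32 = footprint_mib(size, 4, 4)   # 4 x 32-bit floats per texel
    print(size, fp16, fp32)

# A 2048^2 texture stretched over 1600x1200 skips texels in both directions.
print(texel_step(2048, 1600, 1200))
```

Under these assumptions a 2048^2 FP32 texture occupies 64 MiB at 16 bytes per texel, and each screen pixel steps more than one texel in both directions, so neighbouring pixels never share a texel when point sampling.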

Are the results being obtained by rendering screen aligned quads? If so, what happens if you render a bunch of smaller squares that tile the screen (say 16x16 or 32x32 pixels)? I'm wondering because doing this might cause quads to be produced and processed in a more cache friendly sequence...
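
Something like the following Python sketch is what I have in mind for generating the tiles. The function name and the rectangle layout are hypothetical, and each rectangle would still have to be turned into two triangles by the app.

```python
def screen_tiles(width, height, tile):
    """Yield (x0, y0, x1, y1) rectangles that cover a width x height screen."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            # Clamp the last row/column so no tile extends past the screen edge.
            yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))

for tile in (16, 32):
    rects = list(screen_tiles(1600, 1200, tile))
    covered = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in rects)
    # Number of squares, and a check that they cover the screen exactly once.
    print(tile, len(rects), covered == 1600 * 1200)
```

At 1600x1200 the 32-pixel case ends with a clamped half-height row, since 1200 is not a multiple of 32.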
 
Damien, could you please do those same tests for 1, 2 and 4-channel textures? (not just 1 or 2 channels used via write mask)
 
Xmas said:
Damien, could you please do those same tests for 1, 2 and 4-channel textures? (not just 1 or 2 channels used via write mask)

NVIDIA doesn't support G32R32F.

NV40 FP32 2048x2048

3 components : 91 cycles (a 4-component texture is used, but because I don't use the alpha channel the driver automatically disables its sampling)
1 component : 7.5 cycles
2x 1 component : 11 cycles
3x 1 component : 14.5 cycles (!)

NV40 FP16 2048x2048

3 components : 28 cycles
2 components : 3.5 cycles
2x 2 components : 5.5 cycles (!)
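
To quantify the "(!)" cases, a few lines of Python on the numbers above. It is plain arithmetic on the measured cycle counts with no model behind it, and the variable names are just for illustration.

```python
# NV40, 2048x2048, cycle counts from the lists above.
fp32 = {"3 components": 91, "1 component": 7.5,
        "2x 1 component": 11, "3x 1 component": 14.5}
fp16 = {"3 components": 28, "2 components": 3.5, "2x 2 components": 5.5}

# How much cheaper is it to fetch the same data from several narrow textures?
fp32_gain = fp32["3 components"] / fp32["3x 1 component"]
fp16_gain = fp16["3 components"] / fp16["2x 2 components"]

# Extra cycles for each additional 1-component FP32 fetch.
fp32_steps = (fp32["2x 1 component"] - fp32["1 component"],
              fp32["3x 1 component"] - fp32["2x 1 component"])

print(round(fp32_gain, 2), round(fp16_gain, 2), fp32_steps)
```

Fetching three 1-component FP32 textures comes out more than six times cheaper than one 3-component fetch, and each extra 1-component fetch adds the same 3.5 cycles.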
 
psurge said:
The 91 cycles (NV40, FP32, 2048^2) seem to indicate that every texture instruction is hitting main memory.

Are the results being obtained by rendering screen aligned quads? If so, what happens if you render a bunch of smaller squares that tile the screen (say 16x16 or 32x32 pixels)? I'm wondering because doing this might cause quads to be produced and processed in a more cache friendly sequence...

Only 2 triangles are drawn on the screen, so the rendering is done with 64xXX tiles. Unfortunately, I don't have enough free time to make my app render more triangles to force rendering on smaller tiles.
 