X800 vs GF6800 fp texture/buffer performance

Tridam said:
It doesn't make sense :?
You mean that if a FP32 texture read needs 3-4 cycles, you can't do math during these cycles ???

Yes, and this gets even worse if you need AF with many samples. The reason for this is that the TMU is in full sync with the ALU/FPU. I am not sure if this is an nVidia design or if they used the old Rampage design.

The NV3X has the same problem but, to make it more complex, only if you use PS >= 2.0 (maybe 1.4). If you use 1.1 shaders you can hide the read cycles with math.

Tridam said:
FP32 texture reads could be cut into 3-4 single component texture reads.

Yes but only if you don't need the values for the next math instructions.
 
Demirug said:
Tridam said:
It doesn't make sense :?
You mean that if a FP32 texture read needs 3-4 cycles, you can't do math during these cycles ???

Yes, and this gets even worse if you need AF with many samples. The reason for this is that the TMU is in full sync with the ALU/FPU. I am not sure if this is an nVidia design or if they used the old Rampage design.

The NV3X has the same problem but, to make it more complex, only if you use PS >= 2.0 (maybe 1.4). If you use 1.1 shaders you can hide the read cycles with math.
I disagree (at least in the case of FP textures). Here is a quick example with a 4-component FP32 texture:

Code:
texld r0, t0, s0
and
Code:
texld r0, t0, s0
mul r1, c0, c0
mad r1, r1, c1, c1
mad r1, r1, c0, c0
mad r0, r0, r1, c0

Both examples take the same number of cycles. Math is free because of the texture sampling latency.
However, the driver compiler/scheduler could be inefficient if it doesn't know the texture read details.


Demirug said:
Tridam said:
FP32 texture reads could be cut into 3-4 single component texture reads.

Yes but only if you don't need the values for the next math instructions.
Of course.
 
Tridam said:
I disagree (at least in the case of FP textures). Here is a quick example with a 4-component FP32 texture:

Code:
texld r0, t0, s0
and
Code:
texld r0, t0, s0
mul r1, c0, c0
mad r1, r1, c1, c1
mad r1, r1, c0, c0
mad r0, r0, r1, c0

Both examples take the same number of cycles. Math is free because of the texture sampling latency.
However, the driver compiler/scheduler could be inefficient if it doesn't know the texture read details.
Bad example.
Code:
texld r0, t0, s0
mad r0, r0, c2, c0
is single pipeline pass,
with c2 = c0 * (c0 * c0 * c1 + c1 + 1).
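The folding above can be checked numerically. The following Python sketch is not from the thread: it treats every register as a single scalar component, the function names are made up for illustration, and it only confirms that the four math instructions collapse into one mad with a precomputed constant.

```python
import math
import random

def original_shader(tex, c0, c1):
    """Scalar stand-in for the four math instructions after the texld."""
    r1 = c0 * c0          # mul r1, c0, c0
    r1 = r1 * c1 + c1     # mad r1, r1, c1, c1
    r1 = r1 * c0 + c0     # mad r1, r1, c0, c0
    return tex * r1 + c0  # mad r0, r0, r1, c0

def folded_shader(tex, c0, c1):
    """Same result with the constant-only math folded into c2."""
    c2 = c0 * (c0 * c0 * c1 + c1 + 1)
    return tex * c2 + c0  # mad r0, r0, c2, c0

def shaders_agree(samples=1000, seed=0):
    """Compare both versions on random positive inputs (assumed range [0, 1])."""
    rng = random.Random(seed)
    for _ in range(samples):
        tex, c0, c1 = (rng.uniform(0.0, 1.0) for _ in range(3))
        if not math.isclose(original_shader(tex, c0, c1),
                            folded_shader(tex, c0, c1), rel_tol=1e-9):
            return False
    return True

if __name__ == "__main__":
    print("folded form matches:", shaders_agree())
```

Because three of the four instructions only touch constants, a driver could fold them away, which is why the example cannot show whether math is hidden behind the fetch.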
 
You're right. I should not write ASM shaders when eating a pizza :oops:
However I knew what the results should be so I just checked with a quick example.

The number of cycles is of course the same with input registers instead of constant ones.
 
Tridam said:
You're right. I should not write ASM shaders when eating a pizza :oops:
However I knew what the results should be so I just checked with a quick example.

The number of cycles is of course the same with input registers instead of constant ones.
You sure aren't bandwidth limited? Because that's neither in line with the GPUbench results I posted earlier nor with what I would expect from such a pipeline architecture.
 
Xmas said:
Tridam said:
You're right. I should not write ASM shaders when eating a pizza :oops:
However I knew what the results should be so I just checked with a quick example.

The number of cycles is of course the same with input registers instead of constant ones.
You sure aren't bandwidth limited? Because that's neither in line with the GPUbench results I posted earlier nor with what I would expect from such a pipeline architecture.

If I'm bandwidth limited, the only reason could be the texture read. So it increases the texture read latency even more.

I always got similar results even the first time I checked that point (one year ago when reviewing the 6800). However I never tried it with GLSL. I always write ASM shaders for synthetic tests. Maybe NVIDIA's GLSL compiler is not very smart yet.
 
I added the tests we talked about earlier. The result on my m11 isn't very surprising:
[Image: 14ldtest.png]


Hopefully someone will test an nv40 based card soon.
 
I left my workplace until next week so I can't test. All I have here with me is Intel Extremelyslow graphics :p
 
Xmas said:
Tridam said:
Xmas said:
I doubt it's an API issue. What cycle counts do you get for those shaders?

14 cycles with a 1024x1024 texture.
That's a lot for a single texture fetch that's supposed to take 4 cycles.

The texture is big. Adding the 4 math instructions doesn't change the number of cycles but adding a single int texture read adds an extra cycle. It's clear to me that the texture sampling is the bottleneck and that math instructions can be executed during the texture sampling.

I'll do the same test next week with a very small FP32 texture. It'll be clearer.
 
PeterT said:
Hopefully someone will test an nv40 based card soon.

6800GT@Ultra AXP-M 2.5Ghz - have you added the new tests to ORC as yet?

Code:
Results for BufferCreateINT16: msecs: 109 || ms/i: 18.1667 || i/s: 55.0459
Results for BufferCreateFP16: msecs: 63 || ms/i: 10.5 || i/s: 95.2381
Results for BufferCreateFP32: msecs: 62 || ms/i: 10.3333 || i/s: 96.7742
Results for JustCopy: msecs: 172 || ms/i: 0.086 || i/s: 11627.9
Results for SimpleSmooth: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for TexNoise: msecs: 219 || ms/i: 0.1095 || i/s: 9132.42
Results for 3x3Conv: msecs: 156 || ms/i: 0.156 || i/s: 6410.26
Results for TEncode: msecs: 125 || ms/i: 0.125 || i/s: 8000
Results for TDecode: msecs: 140 || ms/i: 0.14 || i/s: 7142.86
Results for LinDiffINT: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for LinDiffINT16: msecs: 204 || ms/i: 0.102 || i/s: 9803.92
Results for LinDiffFP16: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for LinDiffFP32: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for LD_INT->FP16: msecs: 109 || ms/i: 0.109 || i/s: 9174.31
Results for LD_INT->FP32: msecs: 94 || ms/i: 0.094 || i/s: 10638.3
Results for LD_FP16->INT: msecs: 94 || ms/i: 0.094 || i/s: 10638.3
Results for LD_FP32->INT: msecs: 109 || ms/i: 0.109 || i/s: 9174.31
Results for PMTEncoded: msecs: 375 || ms/i: 0.375 || i/s: 2666.67
Results for PMStandard: msecs: 328 || ms/i: 0.328 || i/s: 3048.78
Results for PMBuffered: msecs: 47 || ms/i: 0.094 || i/s: 10638.3

Testing 64x64 image:
Results for BufferCreateINT: msecs: 78 || ms/i: 13 || i/s: 76.9231
Results for BufferCreateINT16: msecs: 94 || ms/i: 15.6667 || i/s: 63.8298
Results for BufferCreateFP16: msecs: 63 || ms/i: 10.5 || i/s: 95.2381
Results for BufferCreateFP32: msecs: 62 || ms/i: 10.3333 || i/s: 96.7742
Results for JustCopy: msecs: 157 || ms/i: 0.0785 || i/s: 12738.9
Results for SimpleSmooth: msecs: 172 || ms/i: 0.086 || i/s: 11627.9
Results for TexNoise: msecs: 156 || ms/i: 0.078 || i/s: 12820.5
Results for 3x3Conv: msecs: 94 || ms/i: 0.094 || i/s: 10638.3
Results for TEncode: msecs: 78 || ms/i: 0.078 || i/s: 12820.5
Results for TDecode: msecs: 94 || ms/i: 0.094 || i/s: 10638.3
Results for LinDiffINT: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for LinDiffINT16: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for LinDiffFP16: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for LinDiffFP32: msecs: 203 || ms/i: 0.1015 || i/s: 9852.22
Results for LD_INT->FP16: msecs: 94 || ms/i: 0.094 || i/s: 10638.3
Results for LD_INT->FP32: msecs: 93 || ms/i: 0.093 || i/s: 10752.7
Results for LD_FP16->INT: msecs: 110 || ms/i: 0.11 || i/s: 9090.91
Results for LD_FP32->INT: msecs: 93 || ms/i: 0.093 || i/s: 10752.7
Results for PMTEncoded: msecs: 328 || ms/i: 0.328 || i/s: 3048.78
Results for PMStandard: msecs: 328 || ms/i: 0.328 || i/s: 3048.78
Results for PMBuffered: msecs: 47 || ms/i: 0.094 || i/s: 10638.3

Testing 128x128 image:
Results for BufferCreateINT: msecs: 78 || ms/i: 13 || i/s: 76.9231
Results for BufferCreateINT16: msecs: 94 || ms/i: 15.6667 || i/s: 63.8298
Results for BufferCreateFP16: msecs: 63 || ms/i: 10.5 || i/s: 95.2381
Results for BufferCreateFP32: msecs: 62 || ms/i: 10.3333 || i/s: 96.7742
Results for JustCopy: msecs: 157 || ms/i: 0.0785 || i/s: 12738.9
Results for SimpleSmooth: msecs: 156 || ms/i: 0.078 || i/s: 12820.5
Results for TexNoise: msecs: 156 || ms/i: 0.078 || i/s: 12820.5
Results for 3x3Conv: msecs: 79 || ms/i: 0.079 || i/s: 12658.2
Results for TEncode: msecs: 78 || ms/i: 0.078 || i/s: 12820.5
Results for TDecode: msecs: 94 || ms/i: 0.094 || i/s: 10638.3
Results for LinDiffINT: msecs: 235 || ms/i: 0.1175 || i/s: 8510.64
Results for LinDiffINT16: msecs: 234 || ms/i: 0.117 || i/s: 8547.01
Results for LinDiffFP16: msecs: 219 || ms/i: 0.1095 || i/s: 9132.42
Results for LinDiffFP32: msecs: 422 || ms/i: 0.211 || i/s: 4739.34
Results for LD_INT->FP16: msecs: 109 || ms/i: 0.109 || i/s: 9174.31
Results for LD_INT->FP32: msecs: 109 || ms/i: 0.109 || i/s: 9174.31
Results for LD_FP16->INT: msecs: 94 || ms/i: 0.094 || i/s: 10638.3
Results for LD_FP32->INT: msecs: 203 || ms/i: 0.203 || i/s: 4926.11
Results for PMTEncoded: msecs: 313 || ms/i: 0.313 || i/s: 3194.89
Results for PMStandard: msecs: 625 || ms/i: 0.625 || i/s: 1600
Results for PMBuffered: msecs: 125 || ms/i: 0.25 || i/s: 4000

Testing 256x256 image:
Results for BufferCreateINT: msecs: 62 || ms/i: 10.3333 || i/s: 96.7742
Results for BufferCreateINT16: msecs: 110 || ms/i: 18.3333 || i/s: 54.5455
Results for BufferCreateFP16: msecs: 62 || ms/i: 10.3333 || i/s: 96.7742
Results for BufferCreateFP32: msecs: 63 || ms/i: 10.5 || i/s: 95.2381
Results for JustCopy: msecs: 156 || ms/i: 0.078 || i/s: 12820.5
Results for SimpleSmooth: msecs: 172 || ms/i: 0.086 || i/s: 11627.9
Results for TexNoise: msecs: 188 || ms/i: 0.094 || i/s: 10638.3
Results for 3x3Conv: msecs: 218 || ms/i: 0.218 || i/s: 4587.16
Results for TEncode: msecs: 78 || ms/i: 0.078 || i/s: 12820.5
Results for TDecode: msecs: 140 || ms/i: 0.14 || i/s: 7142.86
Results for LinDiffINT: msecs: 219 || ms/i: 0.1095 || i/s: 9132.42
Results for LinDiffINT16: msecs: 484 || ms/i: 0.242 || i/s: 4132.23
Results for LinDiffFP16: msecs: 500 || ms/i: 0.25 || i/s: 4000
Results for LinDiffFP32: msecs: 1563 || ms/i: 0.7815 || i/s: 1279.59
Results for LD_INT->FP16: msecs: 110 || ms/i: 0.11 || i/s: 9090.91
Results for LD_INT->FP32: msecs: 110 || ms/i: 0.11 || i/s: 9090.91
Results for LD_FP16->INT: msecs: 234 || ms/i: 0.234 || i/s: 4273.5
Results for LD_FP32->INT: msecs: 735 || ms/i: 0.735 || i/s: 1360.54
Results for PMTEncoded: msecs: 735 || ms/i: 0.735 || i/s: 1360.54
Results for PMStandard: msecs: 2437 || ms/i: 2.437 || i/s: 410.341
Results for PMBuffered: msecs: 688 || ms/i: 1.376 || i/s: 726.744

Testing 512x512 image:
Results for BufferCreateINT: msecs: 78 || ms/i: 13 || i/s: 76.9231
Results for BufferCreateINT16: msecs: 94 || ms/i: 15.6667 || i/s: 63.8298
Results for BufferCreateFP16: msecs: 62 || ms/i: 10.3333 || i/s: 96.7742
Results for BufferCreateFP32: msecs: 63 || ms/i: 10.5 || i/s: 95.2381
Results for JustCopy: msecs: 187 || ms/i: 0.187 || i/s: 5347.59
Results for SimpleSmooth: msecs: 266 || ms/i: 0.266 || i/s: 3759.4
Results for TexNoise: msecs: 250 || ms/i: 0.25 || i/s: 4000
Results for 3x3Conv: msecs: 359 || ms/i: 0.718 || i/s: 1392.76
Results for TEncode: msecs: 94 || ms/i: 0.188 || i/s: 5319.15
Results for TDecode: msecs: 235 || ms/i: 0.47 || i/s: 2127.66
Results for LinDiffINT: msecs: 329 || ms/i: 0.329 || i/s: 3039.51
Results for LinDiffINT16: msecs: 859 || ms/i: 0.859 || i/s: 1164.14
Results for LinDiffFP16: msecs: 860 || ms/i: 0.86 || i/s: 1162.79
Results for LinDiffFP32: msecs: 2813 || ms/i: 2.813 || i/s: 355.492
Results for LD_INT->FP16: msecs: 156 || ms/i: 0.312 || i/s: 3205.13
Results for LD_INT->FP32: msecs: 188 || ms/i: 0.376 || i/s: 2659.57
Results for LD_FP16->INT: msecs: 391 || ms/i: 0.782 || i/s: 1278.77
Results for LD_FP32->INT: msecs: 1328 || ms/i: 2.656 || i/s: 376.506
Results for PMTEncoded: msecs: 1359 || ms/i: 2.718 || i/s: 367.918
Results for PMStandard: msecs: 4468 || ms/i: 8.936 || i/s: 111.907
Results for PMBuffered: msecs: 687 || ms/i: 2.748 || i/s: 363.901

Testing 1024x1024 image:
Results for BufferCreateINT: msecs: 62 || ms/i: 10.3333 || i/s: 96.7742
Results for BufferCreateINT16: msecs: 94 || ms/i: 15.6667 || i/s: 63.8298
Results for BufferCreateFP16: msecs: 78 || ms/i: 13 || i/s: 76.9231
Results for BufferCreateFP32: msecs: 63 || ms/i: 10.5 || i/s: 95.2381
Results for JustCopy: msecs: 719 || ms/i: 0.719 || i/s: 1390.82
Results for SimpleSmooth: msecs: 984 || ms/i: 0.984 || i/s: 1016.26
Results for TexNoise: msecs: 875 || ms/i: 0.875 || i/s: 1142.86
Results for 3x3Conv: msecs: 1359 || ms/i: 2.718 || i/s: 367.918
Results for TEncode: msecs: 360 || ms/i: 0.72 || i/s: 1388.89
Results for TDecode: msecs: 891 || ms/i: 1.782 || i/s: 561.167
Results for LinDiffINT: msecs: 1250 || ms/i: 1.25 || i/s: 800
Results for LinDiffINT16: msecs: 3344 || ms/i: 3.344 || i/s: 299.043
Results for LinDiffFP16: msecs: 3375 || ms/i: 3.375 || i/s: 296.296
Results for LinDiffFP32: msecs: 11234 || ms/i: 11.234 || i/s: 89.0155
Results for LD_INT->FP16: msecs: 625 || ms/i: 1.25 || i/s: 800
Results for LD_INT->FP32: msecs: 750 || ms/i: 1.5 || i/s: 666.667
Results for LD_FP16->INT: msecs: 1547 || ms/i: 3.094 || i/s: 323.206
Results for LD_FP32->INT: msecs: 5250 || ms/i: 10.5 || i/s: 95.2381
Results for PMTEncoded: msecs: 5188 || ms/i: 10.376 || i/s: 96.3763
Results for PMStandard: msecs: 17703 || ms/i: 35.406 || i/s: 28.2438
Results for PMBuffered: msecs: 395047 || ms/i: 1580.19 || i/s: 0.632836
 
trinibwoy said:
PeterT said:
Hopefully someone will test an nv40 based card soon.

6800GT@Ultra AXP-M 2.5Ghz - have you added the new tests to ORC as yet?
I have added the ability to show the new test types (and to show the benchmark version used in generating a result), but I have not yet added any test results. I'll release a new ORC version with an updated results.db as soon as I finish that.


Anyway, with the new results in the other threads and this one, it looks like both fetching from and writing to higher accuracy buffers slows down nv40 based cards considerably, with the reason for the bad results in the FP based LinDiff tests mostly being the texture fetches.

On ATI cards the only slowdown seems to be caused by a lack of memory bandwidth in the most extreme (many FP32 reads/writes) cases.
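To put a number on the nv40 slowdown, here is a small Python sketch that parses the four LinDiff lines from the 1024x1024 run posted above (6800GT@Ultra) and expresses each as a multiple of the integer case. The regular expression and function names are my own; only the four result lines come from the thread.

```python
import re

# LinDiff lines copied from the 1024x1024 results posted in this thread.
RESULTS_1024 = """\
Results for LinDiffINT: msecs: 1250 || ms/i: 1.25 || i/s: 800
Results for LinDiffINT16: msecs: 3344 || ms/i: 3.344 || i/s: 299.043
Results for LinDiffFP16: msecs: 3375 || ms/i: 3.375 || i/s: 296.296
Results for LinDiffFP32: msecs: 11234 || ms/i: 11.234 || i/s: 89.0155
"""

LINE = re.compile(
    r"Results for (\S+): msecs: (\d+) \|\| ms/i: ([\d.]+) \|\| i/s: ([\d.]+)"
)

def parse_ms_per_iteration(text):
    """Map test name -> milliseconds per iteration."""
    return {m.group(1): float(m.group(3)) for m in LINE.finditer(text)}

def slowdown(ms_per_iteration, baseline="LinDiffINT"):
    """Express every test as a multiple of the baseline test's time."""
    base = ms_per_iteration[baseline]
    return {name: ms / base for name, ms in ms_per_iteration.items()}

if __name__ == "__main__":
    for name, factor in slowdown(parse_ms_per_iteration(RESULTS_1024)).items():
        print(f"{name}: {factor:.2f}x")
```

On these figures the 16-bit formats cost roughly 2.7 times the integer case and FP32 roughly 9 times.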
 
PeterT said:
I have added the ability to show the new test types (and to show the benchmark version used in generating a result), but I have not yet added any test results. I'll release a new ORC version with an updated results.db as soon as I finish that.

Is the link the same one from the other thread? I tried it and it still says version 0.1 and doesn't show the new tests in the list.

PeterT said:
Anyway, with the new results in the other threads and this one, it looks like both fetching from and writing to higher accuracy buffers slows down nv40 based cards considerably, with the reason for the bad results in the FP based LinDiff tests mostly being the texture fetches.

Any clue on the architectural differences that explain Nvidia's poor performance?
 
trinibwoy said:
Is the link the same one from the other thread? I tried it and it still says version 0.1 and doesn't show the new tests in the list.
It's there now, even has some documentation.

PeterT said:
Any clue on the architectural differences that explain Nvidia's poor performance?
Well, not really, at least not before I started this thread. That's why I opened it in the first place.

Now, after reading the informed guesses/speculation of people like Tridam, Xmas and Demirug, and after running the additional tests, I'd say that one of the following is true:
- Reading RGBA FP textures (i.e. a multi-cycle fetch) causes a latency-related stall on nv4x-based cards that cannot be hidden with math instructions
- It can, but the NV driver/GLSL compiler is not smart enough to do so yet

This also seems to correlate nicely with the Stanford GPUBench results.
 
Xmas said:
Tridam said:
Xmas said:
I doubt it's an API issue. What cycle counts do you get for those shaders?

14 cycles with a 1024x1024 texture.
That's a lot for a single texture fetch that's supposed to take 4 cycles.

I've done the same test with a 2x2 FP32 texture. The sampling needs 3 cycles (without alpha) as expected and 3 math instructions can be executed at the same time. Of course only the last instruction can use the sampled data.
 
BTW, here is the sampling latency for different FP32 texture sizes:

2 : 3 cycles
...
512 : 3 cycles
1024 : 14 cycles
2048 : 91 cycles

Textures drawn on 2 triangles at 1600x1200.
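For a rough sense of scale, the sketch below works out the memory footprint of each texture size and how many texels land on one screen pixel when the texture is stretched over the full 1600x1200 target. It assumes 16 bytes per texel (four FP32 components) and a full-screen mapping; neither the storage format nor the mapping is confirmed for this test, and the function names are made up.

```python
# Assumptions (not confirmed in the thread): RGBA FP32 storage at 16 bytes
# per texel, and the texture stretched exactly over a 1600x1200 target.
BYTES_PER_TEXEL = 16
SCREEN_PIXELS = 1600 * 1200

def footprint_bytes(size):
    """Memory used by a square size x size texture."""
    return size * size * BYTES_PER_TEXEL

def texels_per_pixel(size):
    """Average number of distinct texels covered by one screen pixel."""
    return size * size / SCREEN_PIXELS

if __name__ == "__main__":
    for size in (2, 512, 1024, 2048):
        mib = footprint_bytes(size) / (1024 * 1024)
        print(f"{size}: {mib:.4f} MiB, {texels_per_pixel(size):.3f} texels/pixel")
```

Under these assumptions the 2048 texture is the only one with more than one texel per pixel, so it is the only case where neighbouring pixels cannot share texels at all.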

If the sampling needs more than 3 cycles because of a memory/cache bottleneck, the shader compiler/scheduler can't know that in advance, so only 4 (3+1) math instructions can be executed during the sampling. NVIDIA can probably tweak that in the driver for specific cases.

ATI doesn't have this problem, of course.
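The scheduling limit described above can be written as a toy cost model. This Python sketch is a deliberate simplification and not a description of the real hardware: `hidden_slots` is a hypothetical parameter for how many math instructions the scheduler places in the shadow of the fetch, and any further math instruction is assumed to cost one extra cycle.

```python
def total_cycles(actual_tex_cycles, math_instructions, hidden_slots):
    """Toy model: math beyond the hidden slots is paid for after the fetch."""
    exposed = max(0, math_instructions - hidden_slots)
    return actual_tex_cycles + exposed

if __name__ == "__main__":
    tex = 14  # measured latency for the 1024x1024 FP32 texture in this thread
    # Hypothesis A: TMU in lockstep with the ALU, nothing is hidden.
    print("lockstep, 4 math:   ", total_cycles(tex, 4, hidden_slots=0))
    # Hypothesis B: scheduler assumes a 3-cycle fetch, hides 3+1 instructions.
    print("scheduled, 4 math:  ", total_cycles(tex, 4, hidden_slots=4))
    print("scheduled, 10 math: ", total_cycles(tex, 10, hidden_slots=4))
    # Ideal case: a scheduler that knew the real latency could hide more.
    print("ideal, 10 math:     ", total_cycles(tex, 10, hidden_slots=tex + 1))
```

With 4 math instructions, hypothesis B gives 14 cycles, matching the observation that adding the math did not change the count, while hypothesis A would predict 18. The 10-instruction rows show what is lost when the scheduler plans for 3 cycles but the fetch really takes 14.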
 
What exactly goes into a texture fetch and how does that process scale with texture resolution? Are there any guides around that delve into this stuff a bit more without being overly technical?
 
I'm confused... who is getting this right? :?:
Is an NV40 pixel pipeline able to execute pure ALU ops while it's waiting for a texture fetch?
 