arjan de lumens said:
As for the NV30 4x2 vs 8x1: 8 shading units could very well be grouped in a 4x2 fashion, just like the texture units apparently are. The only way to tell the difference is to test the performance falloff with an increasing number of shading instructions: going from e.g. 5 to 6 instructions would have no performance hit with 4x2 but a ~17% performance hit with 8x1. If they are organized as 4x2, and the 2 shaders in each pipeline are connected in series, then during any given clock cycle (assuming shader latency >= 1 cycle) the 8 shading units could indeed be working on 8 different pixels, just like Nvidia is claiming.
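To make the 5-to-6-instruction argument in the quote concrete, here is a minimal sketch of the two idealized models. It assumes perfect scheduling, no texture or memory stalls, and that an 8x1 design can keep all 8 units busy; the function names are mine, not anything from NVidia.

```python
import math

def throughput_8x1(n_instr):
    # Idealized 8x1: 8 independent units, each retiring one instruction per clock,
    # so fragments per clock falls off smoothly as 8 / n.
    return 8 / n_instr

def throughput_4x2(n_instr):
    # Idealized 4x2: 4 pipelines, each with 2 units in series, so a fragment
    # needs ceil(n / 2) trips through its pipeline.
    return 4 / math.ceil(n_instr / 2)

def falloff(model, n_from, n_to):
    # Fractional throughput loss when the program grows from n_from to n_to instructions.
    return 1 - model(n_to) / model(n_from)

if __name__ == "__main__":
    for n in range(1, 8):
        print(n, round(throughput_8x1(n), 2), round(throughput_4x2(n), 2))
    print("8x1 falloff 5->6:", round(falloff(throughput_8x1, 5, 6), 3))
    print("4x2 falloff 5->6:", round(falloff(throughput_4x2, 5, 6), 3))
```

Under these assumptions the 8x1 model loses 1/6 of its throughput (the ~17% in the quote) between 5 and 6 instructions, while the 4x2 model loses nothing because both counts round up to 3 trips.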
FP32 additions (FP16 is same speed)
3.98 fragm/cycle 0.25 cycle/fragm: 1add-FP32
1.90 fragm/cycle 0.53 cycle/fragm: 2add-FP32
1.26 fragm/cycle 0.79 cycle/fragm: 3add-FP32
0.95 fragm/cycle 1.06 cycle/fragm: 4add-FP32
0.76 fragm/cycle 1.32 cycle/fragm: 5add-FP32
0.63 fragm/cycle 1.59 cycle/fragm: 6add-FP32
0.54 fragm/cycle 1.85 cycle/fragm: 7add-FP32
FX12 additions
3.98 fragm/cycle 0.25 cycle/fragm: 1add-FX12
3.98 fragm/cycle 0.25 cycle/fragm: 2add-FX12
3.98 fragm/cycle 0.25 cycle/fragm: 3add-FX12
1.90 fragm/cycle 0.53 cycle/fragm: 4add-FX12
1.88 fragm/cycle 0.53 cycle/fragm: 5add-FX12
1.88 fragm/cycle 0.53 cycle/fragm: 6add-FX12
1.26 fragm/cycle 0.79 cycle/fragm: 7add-FX12
Texture loads with two paired texture fetches followed by FX12 adds
3.98 fragm/cycle 0.25 cycle/fragm: 1tex
3.97 fragm/cycle 0.25 cycle/fragm: 2tex-paired
1.88 fragm/cycle 0.53 cycle/fragm: 3tex-paired
1.88 fragm/cycle 0.53 cycle/fragm: 4tex-paired
1.25 fragm/cycle 0.80 cycle/fragm: 5tex-paired
1.25 fragm/cycle 0.80 cycle/fragm: 6tex-paired
0.94 fragm/cycle 1.06 cycle/fragm: 7tex-paired
Texture loads with individual texture fetches followed by FX12 add
3.98 fragm/cycle 0.25 cycle/fragm: 1tex
1.88 fragm/cycle 0.53 cycle/fragm: 2tex-nonpaired
1.26 fragm/cycle 0.80 cycle/fragm: 3tex-nonpaired
0.94 fragm/cycle 1.06 cycle/fragm: 4tex-nonpaired
0.75 fragm/cycle 1.33 cycle/fragm: 5tex-nonpaired
0.63 fragm/cycle 1.59 cycle/fragm: 6tex-nonpaired
0.54 fragm/cycle 1.86 cycle/fragm: 7tex-nonpaired
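One way to read the four tables above is as a step function: each "pass" costs roughly 0.265 cycle/fragm (about 4 fragments per clock), and a pass can hold 1 FP32 add, up to 3 FX12 adds, or up to 2 paired texture fetches. This is only a fit to the measured numbers, not a confirmed description of the hardware; the constant and the per-pass capacities below are assumptions read off the data.

```python
import math

# Assumed cost of one pass in cycles per fragment, fitted to the tables (1.06 / 4).
CYCLES_PER_PASS = 0.265

# name -> (assumed ops per pass, measured cycle/fragm for 1..7 ops, copied from the tables)
MEASURED = {
    "add-FP32":      (1, [0.25, 0.53, 0.79, 1.06, 1.32, 1.59, 1.85]),
    "add-FX12":      (3, [0.25, 0.25, 0.25, 0.53, 0.53, 0.53, 0.79]),
    "tex-paired":    (2, [0.25, 0.25, 0.53, 0.53, 0.80, 0.80, 1.06]),
    "tex-nonpaired": (1, [0.25, 0.53, 0.80, 1.06, 1.33, 1.59, 1.86]),
}

def predicted_cycles(n_ops, ops_per_pass):
    # Number of passes needed, times the assumed cost per pass.
    return math.ceil(n_ops / ops_per_pass) * CYCLES_PER_PASS

def max_error(name):
    # Largest absolute difference between the model and the measurements.
    ops_per_pass, measured = MEASURED[name]
    return max(abs(predicted_cycles(n, ops_per_pass) - m)
               for n, m in enumerate(measured, start=1))

if __name__ == "__main__":
    for name in MEASURED:
        print(name, "max error:", round(max_error(name), 3))
```

With these assumed capacities the model stays within 0.02 cycle/fragm of every measured row; the largest gap is at one pass, where the measurement is 0.25 rather than 0.265.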
Program details (only some listed, the rest are similar):
"1add-FP32",
"ADD o[COLR],R0,R0;",
"7add-FP32",
"ADD R0,R0,R0;",
"ADD R0,R0,R0;",
"ADD R0,R0,R0;",
"ADD R0,R0,R0;",
"ADD R0,R0,R0;",
"ADD R0,R0,R0;",
"ADD o[COLR],R0,R0;",
"1add-FX12",
"ADDX o[COLH],H0,H0;",
"7add-FX12",
"ADDX H0,H0,H0;",
"ADDX H0,H0,H0;",
"ADDX H0,H0,H0;",
"ADDX H0,H0,H0;",
"ADDX H0,H0,H0;",
"ADDX H0,H0,H0;",
"ADDX o[COLH],H0,H0;",
"1tex",
"TEX o[COLH],f[TEX0],TEX0,2D;",
"2tex-paired",
"TEX H0,f[TEX0],TEX0,2D;",
"TEX H1,f[TEX1],TEX0,2D;",
"ADDX o[COLH],H0,H1;",
"7tex-paired",
"TEX H0,f[TEX0],TEX0,2D;",
"TEX H1,f[TEX1],TEX0,2D;",
"ADDX H2,H2,H0;",
"ADDX H2,H2,H1;",
"TEX H0,f[TEX2],TEX0,2D;",
"TEX H1,f[TEX3],TEX0,2D;",
"ADDX H2,H2,H0;",
"ADDX H2,H2,H1;",
"TEX H0,f[TEX4],TEX0,2D;",
"TEX H1,f[TEX5],TEX0,2D;",
"ADDX H2,H2,H0;",
"ADDX H2,H2,H1;",
"TEX H0,f[TEX6],TEX0,2D;",
"ADDX o[COLH],H2,H0;",
"2tex-nonpaired",
"TEX H0,f[TEX0],TEX0,2D;",
"ADDX H1,H1,H0;",
"TEX H0,f[TEX1],TEX0,2D;",
"ADDX o[COLH],H1,H0;",
"7tex-nonpaired",
"TEX H0,f[TEX0],TEX0,2D;",
"ADDX H1,H1,H0;",
"TEX H0,f[TEX1],TEX0,2D;",
"ADDX H1,H1,H0;",
"TEX H0,f[TEX2],TEX0,2D;",
"ADDX H1,H1,H0;",
"TEX H0,f[TEX3],TEX0,2D;",
"ADDX H1,H1,H0;",
"TEX H0,f[TEX4],TEX0,2D;",
"ADDX H1,H1,H0;",
"TEX H0,f[TEX5],TEX0,2D;",
"ADDX H1,H1,H0;",
"TEX H0,f[TEX6],TEX0,2D;",
"ADDX o[COLH],H1,H0;",
Chalnoth said:
I believe that the current "PS 2.0" functionality takes the place of the old pixel shader that was in the NV2x. This fragment shader always executes floating-point ops, and part of nearly any pixel shader program will be executed in this portion of the processor.
Fixed-point operations can be executed between texture fetches, and the result is faster than using FP16:
0.95 fragm/cycle 1.06 cycle/fragm: tex+madfx12+dep.tex+add cwrite [10:d1] [b0 x0]
0.63 fragm/cycle 1.59 cycle/fragm: tex+madfp16+dep.tex+add cwrite [11:d0] [b0 x0]
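As a quick arithmetic check on the two result lines above, the sketch below just divides the measured cycle counts. The helper names are mine, and the only inputs are the two cycle/fragm figures as printed.

```python
# Measured cycles per fragment, copied from the two result lines.
FX12_CYCLES = 1.06  # tex+madfx12+dep.tex+add
FP16_CYCLES = 1.59  # tex+madfp16+dep.tex+add

def speedup(slow_cycles, fast_cycles):
    # How many times faster the cheaper program runs.
    return slow_cycles / fast_cycles

def extra_cost(slow_cycles, fast_cycles):
    # Additional cycles per fragment paid by the slower program.
    return slow_cycles - fast_cycles

if __name__ == "__main__":
    print("speedup:", round(speedup(FP16_CYCLES, FX12_CYCLES), 2))
    print("extra cycle/fragm:", round(extra_cost(FP16_CYCLES, FX12_CYCLES), 2))
```

The FX12 version comes out about 1.5 times faster, and the FP16 version pays about 0.53 cycle/fragm more, which is about what the 2add-FP32 program cost in the tables further up.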
"tex+madfx12+dep.tex+add",
"TEX H0,f[TEX0],TEX0,2D;",
"TEX H1,f[TEX1],TEX0,2D;",
"MADX H0,H0,H0,H1;",
"MADX H0,H0,H1,H0;",
"TEX H0,H0,TEX0,2D;",
"TEX H1,H1,TEX0,2D;",
"ADD o[COLH],H0,H1;",
"tex+madfp16+dep.tex+add",
"TEX H0,f[TEX0],TEX0,2D;",
"TEX H1,f[TEX1],TEX0,2D;",
"MAD H0,H0,H0,H1;",
"MAD H0,H0,H1,H0;",
"TEX H0,H0,TEX0,2D;",
"TEX H1,H1,TEX0,2D;",
"ADD o[COLH],H0,H1;",
Chalnoth said:
The NV3x also has at least four additional shader processors, called register combiners in OpenGL. These work at 12-bit fixed point, and are for doing calculations only after all texturing operations are completed (I think...I really don't know if this is the way it works precisely, this is all speculation, but it seems to make sense...).
I believe the register combiners are still 9-bit (the NVidia docs describe them as just like those of NV2x). Unfortunately, it is not possible to test fragment shaders and register combiners at the same time to see whether they use any shared resources, as NVidia removed support for this (it was originally documented to exist).