According to further tests, the NV30 fragment shader pipeline looks like this: one FLOAT/TEXTURE unit followed by two INTEGER units, with a loopback or output stage at the end (see the pipeline diagram further down).
There are 4 of these pipelines, or 8 pipelines that output every other cycle. Take your pick. The only case where 8 non-integer operations per cycle have been observed is with PS1.1-style non-dependent texture fetches.
Unit description:
The FLOAT/TEXTURE unit can handle any instruction with any format of input or output. All instructions execute in one cycle, except for LRP, RSQ, LIT and POW, which take 2 cycles, and RFL, which takes 4.
The INTEGER unit can perform 1 generic integer operation or 2 parallel multiplies. There are some limitations on when the parallel multiplies can be done (see detailed tests at end).
When textures are fetched, nothing else can be done in the FLOAT/TEXTURE unit. The following INTEGER unit can use the fetched texture result freely. The FLOAT/TEXTURE unit can perform one fetch if the coordinates come from a previous calculation (meaning 4 textures/cycle). If the coordinates come directly from inputs as in PS1.1, the unit can perform a pair of texture fetches (meaning 8 textures/cycle).
The total integer performance of one pipeline can be calculated by combining the FLOAT/TEXTURE unit (1 operation) and the two INTEGER units (1 generic operation or 2 multiplies each). This gives 5 multiplications or 3 generic integer operations per cycle.
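As a rough sanity check of the per-cycle figures above, here is a small Python sketch that just adds up the units. The constant and function names are made up for illustration, and the chip-wide totals assume the 4-pipeline reading rather than the 8-pipeline one.

```python
# Illustrative arithmetic only; names are hypothetical, figures come from the post.
PIPELINES = 4        # assumed: the "4 pipelines" interpretation
INTEGER_UNITS = 2    # INTEGER units that follow the FLOAT/TEXTURE unit


def peak_per_pipeline():
    """Peak operations per cycle for a single pipeline."""
    float_unit_ops = 1  # the FLOAT/TEXTURE unit issues one op of any format
    return {
        # one generic FX12 op in each INTEGER unit, plus one in the float unit
        "generic_int": float_unit_ops + INTEGER_UNITS * 1,
        # two parallel multiplies in each INTEGER unit, plus one in the float unit
        "multiplies": float_unit_ops + INTEGER_UNITS * 2,
        # one fetch with computed coordinates, or a pair with direct inputs
        "dependent_fetches": 1,
        "direct_fetches": 2,
    }


def peak_chip_wide():
    """Scale the per-pipeline peaks by the assumed pipeline count."""
    return {name: count * PIPELINES for name, count in peak_per_pipeline().items()}


if __name__ == "__main__":
    print(peak_per_pipeline())
    print(peak_chip_wide())
```

The per-pipeline numbers reproduce the 5 multiplications and 3 generic integer operations, and the chip-wide fetch numbers reproduce the 4 and 8 textures/cycle. The chip-wide integer totals are only a derived upper bound, since the parallel multiplies have restrictions.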
One can also speculate that each INTEGER unit is able to perform one PS1.1-PS1.3 register combiner operation. It seems to have the required number of MUL/ADD units.
Registers and performance:
The number of registers used affects performance. For maximum performance, it seems you can use only 2 FP32 registers. Every two additional registers slow things down (see the cycles/pixel listing by register count further down).
One can fit two FP16 registers into one FP32 register, so using FP16 doubles the number of registers you can use without lowering performance. With two registers as in these tests, the performance for both is identical.
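To make the register accounting concrete, here is a minimal Python sketch. It uses the cycles/pixel figures measured in this post for the 16 add + 1 mov test. The function names are hypothetical, and two things are assumptions rather than measured facts: that FP32 and FP16 registers can be mixed with FP16 simply packing two per slot, and that odd register counts without a measurement cost the same as the next even count.

```python
import math

# Measured cycles/pixel from the register test in this post
# (16 add + 1 mov instructions), keyed by number of FP32 registers used.
MEASURED_CYCLES = {
    1: 4.23, 2: 4.23, 3: 4.66, 4: 4.66, 5: 6.08, 6: 6.08,
    8: 8.52, 10: 13.67, 12: 14.36, 14: 19.74, 16: 20.64,
}


def fp32_footprint(fp32_regs, fp16_regs=0):
    """Register space in FP32-sized slots; two FP16 registers share one slot."""
    return fp32_regs + math.ceil(fp16_regs / 2)


def relative_cost(fp32_regs, fp16_regs=0):
    """Cycles/pixel relative to the 2-register case, for the measured test only."""
    slots = max(1, fp32_footprint(fp32_regs, fp16_regs))
    # Assumption: an unmeasured odd count behaves like the next even count,
    # following the "every two registers" pattern in the measurements.
    if slots not in MEASURED_CYCLES:
        slots += 1
    if slots not in MEASURED_CYCLES:
        raise ValueError("no measurement for that many registers")
    return MEASURED_CYCLES[slots] / MEASURED_CYCLES[2]


if __name__ == "__main__":
    print(relative_cost(0, 4))   # four FP16 registers fit in two FP32 slots
    print(relative_cost(4, 0))   # four FP32 registers
```

The ratios only describe this particular 17-instruction test; other programs may scale differently.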
Finally, results for individual programs:
Test method: a large number of full-window quads were drawn and timed. Z-buffer disabled, color writes enabled. The result is cycles/pixel, which has been scaled by 4 (assuming 4 pipelines) and 0.947 (an invented efficiency factor that makes all numbers very near integers).
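The scaling can be sketched in a couple of lines of Python. This assumes that "scaled by" means plain multiplication by both factors, which the post does not spell out. The function name and the sample input are hypothetical, not measured values.

```python
PIPELINES = 4        # assumed pipeline count from the post
EFFICIENCY = 0.947   # the post's invented efficiency factor


def rounds_from_cycles(cycles_per_pixel, pipelines=PIPELINES, efficiency=EFFICIENCY):
    """Convert a measured cycles/pixel figure into pipeline rounds.

    Assumption: both scale factors are applied by multiplication.
    """
    return cycles_per_pixel * pipelines * efficiency


if __name__ == "__main__":
    # Hypothetical raw timing, chosen only to illustrate the conversion.
    print(round(rounds_from_cycles(0.264), 2))
```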
Driver 43.45, GeForce FX 5800 Ultra.
This gives the number of rounds the pixel has to make in the above architecture, assuming a new pixel enters the pipeline every cycle and that there are 4 pipelines. The operations assumed to be performed in each round are shown in parentheses.
The first number tells how many rounds the program takes when every instruction depends on the previous one. The second number corresponds to the case with paired instructions, where every pair depends on the previous pair (the two can differ only when there are multiply instructions).
In the program description string each letter corresponds to one instruction:
a:FX12 add (same results for multiply-accumulate, not shown)
m:FX12 mul
A:FP16 add (FP32 has same speed)
M:FP16 mul (FP32 has same speed)
T:texture fetch with coordinates from register (dependent fetch)
S:texture fetch with coordinates directly from pixel inputs
More test cases were used to come to these conclusions (especially for the organization of the FX12 units). The second column in the second results list below (the programs starting with A) shows the performance for different cases in the integer combiners.
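The groupings shown in parentheses in the results lists suggest a simple in-order packing rule for the fully dependent case (the first number in each row). Below is a minimal Python sketch of that rule. It is one reading of the listings, not a confirmed description of the hardware. The function name is hypothetical, and the sketch does not try to model the second column, where parallel multiplies come into play.

```python
FLOAT_ONLY_OPS = set("AMT")  # FP add, FP mul, dependent texture fetch
INT_OPS = set("am")          # FX12 add, FX12 mul


def rounds_dependent(prog):
    """Count pipeline rounds for a program string of fully dependent ops.

    Each round: one slot in the FLOAT/TEXTURE unit (any single op, or a
    pair of S fetches), then up to one FX12 op in each of two INTEGER units.
    """
    i, rounds, n = 0, 0, len(prog)
    while i < n:
        rounds += 1
        op = prog[i]
        if op == "S":
            i += 1
            # a second direct-coordinate fetch can share the same slot
            if i < n and prog[i] == "S":
                i += 1
        elif op in FLOAT_ONLY_OPS or op in INT_OPS:
            i += 1
        else:
            raise ValueError(f"unknown instruction letter: {op!r}")
        # the two INTEGER units each take at most one following FX12 op
        for _ in range(2):
            if i < n and prog[i] in INT_OPS:
                i += 1
    return rounds


if __name__ == "__main__":
    for prog in ["aaaa", "AAaa", "SSaaa", "TTmmm", "MMmmmmmm"]:
        print(prog, rounds_dependent(prog))
```

For the programs listed in this post, the count matches the rounded first column, for example 2 for Aaaa, 3 for TTmmm and 4 for MMmmmmmm.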
There are probably errors in these tests, and there may be cases where performance is lower due to some limitation not observed. There could also be cases where 8 float operations/cycle are possible, but at least with common operations no such cases were found.
In any case, the overall performance picture should hopefully be accurate.
Pipeline diagram:
FLOAT/TEXTURE-UNIT (handles FP16, FP32 and texture)
|
INTEGER-UNIT (handles FX12, 1-2 ops in parallel)
|
INTEGER-UNIT (handles FX12, 1-2 ops in parallel)
|
(loopback or output)
Cycles/pixel by number of FP32 registers used:
4.23 cycles/pixel: 1 regs, 16 add instr, 1 mov instr
4.23 cycles/pixel: 2 regs, 16 add instr, 1 mov instr
4.66 cycles/pixel: 3 regs, 16 add instr, 1 mov instr
4.66 cycles/pixel: 4 regs, 16 add instr, 1 mov instr
6.08 cycles/pixel: 5 regs, 16 add instr, 1 mov instr
6.08 cycles/pixel: 6 regs, 16 add instr, 1 mov instr
8.52 cycles/pixel: 8 regs, 16 add instr, 1 mov instr
13.67 cycles/pixel: 10 regs, 16 add instr, 1 mov instr
14.36 cycles/pixel: 12 regs, 16 add instr, 1 mov instr
19.74 cycles/pixel: 14 regs, 16 add instr, 1 mov instr
20.64 cycles/pixel: 16 regs, 16 add instr, 1 mov instr
Results for individual programs (first list):
rounds: 1.00 1.01 prog: a (1:a )
rounds: 1.00 1.00 prog: aa (1:aa )
rounds: 1.00 1.00 prog: aaa (1:aaa )
rounds: 2.00 2.00 prog: aaaa (2:aaa,a )
rounds: 2.00 2.00 prog: aaaaa (2:aaa,aa )
rounds: 2.01 2.01 prog: aaaaaa (2:aaa,aaa )
rounds: 3.01 3.01 prog: aaaaaaa (3:aaa,aaa,a )
rounds: 1.01 1.00 prog: A (1:A )
rounds: 1.00 1.00 prog: Aa (1:Aa )
rounds: 1.01 1.00 prog: Aaa (1:Aaa )
rounds: 2.00 2.01 prog: Aaaa (2:Aaa,a )
rounds: 2.00 2.01 prog: AA (2:A,A )
rounds: 2.01 2.01 prog: AAaa (2:A,Aaa )
rounds: 1.00 1.00 prog: T (1:T )
rounds: 1.00 1.00 prog: Ta (1:Ta )
rounds: 1.00 1.00 prog: Taa (1:Taa )
rounds: 2.00 2.00 prog: Taaa (2:Taa,a )
rounds: 2.00 2.00 prog: TT (2:T,T )
rounds: 2.01 2.01 prog: TTaa (2:T,Taa )
rounds: 1.00 1.00 prog: S (1:S )
rounds: 1.00 1.00 prog: Sa (1:Sa )
rounds: 1.00 1.00 prog: Saa (1:Saa )
rounds: 2.00 2.00 prog: Saaa (2:Saa,a )
rounds: 1.00 1.00 prog: SS (1:SS )
rounds: 1.00 1.00 prog: SSa (1:SSa )
rounds: 1.00 1.01 prog: SSaa (1:SSaa )
rounds: 2.00 2.00 prog: SSaaa (2:SSaa,a )
rounds: 1.00 1.00 prog: m (1:m )
rounds: 1.00 1.00 prog: mm (1:mm )
rounds: 1.01 1.00 prog: mmm (1:mmm )
rounds: 2.00 1.00 prog: mmmm (2:mmm,m )
rounds: 2.00 1.01 prog: mmmmm (2:mmm,mm )
rounds: 2.01 2.01 prog: mmmmmm (2:mmm,mmm )
rounds: 3.01 2.00 prog: mmmmmmm (3:mmm,mmm,m )
rounds: 1.00 1.00 prog: M (1:M )
rounds: 2.01 1.00 prog: Mmmm (2:Mmm,m )
rounds: 2.01 1.00 prog: Mmmmm (2:Mmm,mm )
rounds: 2.01 2.00 prog: Mmmmmm (2:Mmm,mmm )
rounds: 3.01 2.00 prog: Mmmmmmm (3:Mmm,mmm,m )
rounds: 2.00 2.00 prog: MM (2:M,M )
rounds: 2.01 2.01 prog: MMmm (2:M,Mmm )
rounds: 3.01 2.01 prog: MMmmmm (3:M,Mmm,mm )
rounds: 3.01 3.01 prog: MMmmmmm (3:M,Mmm,mmm )
rounds: 4.02 3.01 prog: MMmmmmmm (4:M,Mmm,mmm,m)
rounds: 1.00 1.00 prog: T (1:T )
rounds: 1.01 1.00 prog: Tmm (1:Tmm )
rounds: 2.01 1.00 prog: Tmmm (2:Tmm,m )
rounds: 2.00 1.01 prog: Tmmmm (2:Tmm,mm )
rounds: 2.01 2.00 prog: Tmmmmm (2:Tmm,mmm )
rounds: 2.00 2.00 prog: TT (2:T,T )
rounds: 2.01 2.01 prog: TTmm (2:T,Tmm )
rounds: 3.01 2.01 prog: TTmmm (3:T,Tmm,m )
rounds: 3.01 2.01 prog: TTmmmm (3:T,Tmm,mm )
rounds: 3.01 3.01 prog: TTmmmmm (3:T,Tmm,mmm )
rounds: 1.01 1.01 prog: S (1:S )
rounds: 1.00 1.00 prog: Smm (1:Smm )
rounds: 2.00 1.00 prog: Smmm (2:Smm,m )
rounds: 2.00 1.00 prog: Smmmm (2:Smm,mm )
rounds: 2.01 2.01 prog: Smmmmm (2:Smm,mmm )
rounds: 1.00 1.00 prog: SS (1:SS )
rounds: 1.00 1.00 prog: SSmm (1:SSmm )
rounds: 2.01 1.01 prog: SSmmm (2:SSmm,m )
rounds: 2.01 1.00 prog: SSmmmm (2:SSmm,mm )
rounds: 2.01 2.00 prog: SSmmmmm (2:SSmm,mmm )
Results for individual programs (second list, integer combiner cases):
rounds: 1.00 1.00 prog: Amm (1:Amm )
rounds: 1.01 1.00 prog: Aam (1:Aam )
rounds: 1.00 1.00 prog: Ama (1:Ama )
rounds: 1.00 1.00 prog: Aaa (1:Aaa )
rounds: 2.00 1.00 prog: Ammm (2:Amm,m )
rounds: 2.00 1.00 prog: Aamm (2:Aam,m )
rounds: 2.00 2.00 prog: Amam (2:Ama,m )
rounds: 2.00 2.00 prog: Aaam (2:Aaa,m )
rounds: 2.00 1.01 prog: Amma (2:Amm,a )
rounds: 2.00 2.00 prog: Aama (2:Aam,a )
rounds: 2.00 2.01 prog: Amaa (2:Ama,a )
rounds: 2.00 2.01 prog: Aaaa (2:Aaa,a )
rounds: 2.00 1.00 prog: Ammmm (2:Amm,mm )
rounds: 2.00 2.00 prog: Aammm (2:Aam,mm )
rounds: 2.00 2.00 prog: Amamm (2:Ama,mm )
rounds: 2.00 2.00 prog: Aaamm (2:Aaa,mm )
rounds: 2.00 2.00 prog: Ammam (2:Amm,am )
rounds: 2.00 2.01 prog: Aamam (2:Aam,am )
rounds: 2.00 2.00 prog: Amaam (2:Ama,am )
rounds: 2.00 2.00 prog: Aaaam (2:Aaa,am )
rounds: 2.01 2.00 prog: Ammma (2:Amm,ma )
rounds: 2.00 2.01 prog: Aamma (2:Aam,ma )
rounds: 2.00 2.00 prog: Amama (2:Ama,ma )
rounds: 2.00 2.01 prog: Aaama (2:Aaa,ma )
rounds: 2.00 2.00 prog: Ammaa (2:Amm,aa )
rounds: 2.00 2.00 prog: Aamaa (2:Aam,aa )
rounds: 2.00 2.00 prog: Amaaa (2:Ama,aa )
rounds: 2.01 2.01 prog: Aaaaa (2:Aaa,aa )