NV30 fragment processor test results

thepkrl · Apr 1, 2003

According to further tests, this is what the NV30 fragment shader architecture looks like:

Code:

FLOAT/TEXTURE-UNIT (handles FP16, FP32 and texture)
  |
INTEGER-UNIT (handles FX12, 1-2 ops in parallel)
  |
INTEGER-UNIT (handles FX12, 1-2 ops in parallel)
  |
(loopback or output)

There are 4 of these pipelines, or 8 pipelines that output every other cycle. Take your pick. The only case where 8 non-integer operations per cycle have been observed is with PS1.1 style non-dependent texture fetches.

Unit description:

The FLOAT/TEXTURE unit can handle any instruction with any format of input or output. All instructions execute in one cycle, except for LRP,RSQ,LIT,POW which take 2 and RFL which takes 4.

The INTEGER unit can perform 1 generic integer operation or 2 parallel multiplies. There are some limitations on when the parallel multiplies can be done (see detailed tests at end).

When textures are fetched, nothing else can be done in the FLOAT/TEXTURE unit. The following INTEGER unit can use the fetched texture result freely. FLOAT/TEXTURE unit can perform one fetch if the coordinates come from a previous calculation (meaning 4 textures/cycle). If the coordinates come directly from inputs as in PS1.1, the unit can perform a pair of texture fetches (meaning 8 textures/cycle).

The total integer performance can be calculated by combining the FLOAT/TEXTURE unit and the two INTEGER units. This gives 5 multiplications or 3 generic integer operations per cycle.
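The arithmetic behind those two figures can be written out as a small sketch. The unit counts come straight from the diagram above; the function names and the per-chip multiplication by 4 pipelines are only an illustration, not something measured separately.

```python
# Peak FX12 throughput per cycle, derived from the unit description above.
# Assumption: the FLOAT/TEXTURE unit contributes exactly one operation per
# cycle, and each INTEGER unit contributes 1 generic op or 2 multiplies.
PIPELINES = 4        # "4 of these pipelines"
INTEGER_UNITS = 2    # two INTEGER units follow the FLOAT/TEXTURE unit


def peak_per_pipeline():
    """Return peak operations per cycle for one pipeline."""
    generic_int = 1 + INTEGER_UNITS * 1   # one op in every unit
    fx12_mul = 1 + INTEGER_UNITS * 2      # paired multiplies in INTEGER units
    return {"generic_int": generic_int, "fx12_mul": fx12_mul, "float": 1}


def peak_per_chip():
    """Scale the per-pipeline peak by the assumed pipeline count."""
    return {name: ops * PIPELINES for name, ops in peak_per_pipeline().items()}


if __name__ == "__main__":
    print(peak_per_pipeline())
    print(peak_per_chip())
```

These are upper bounds only; the limitations on parallel multiplies mentioned above mean real programs may not reach them.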

One can also speculate that the INTEGER unit is able to perform one PS1.1-PS1.3 register combiner operation, since it seems to have the required number of MUL/ADD units.

Registers and performance:

The number of registers used affects performance. For maximum performance, it seems you can only use 2 FP32 registers. Every two additional registers slow things down:

Code:

   4.23 cycles/pixel:  1 regs, 16 add instr, 1 mov instr
   4.23 cycles/pixel:  2 regs, 16 add instr, 1 mov instr
   4.66 cycles/pixel:  3 regs, 16 add instr, 1 mov instr
   4.66 cycles/pixel:  4 regs, 16 add instr, 1 mov instr
   6.08 cycles/pixel:  5 regs, 16 add instr, 1 mov instr
   6.08 cycles/pixel:  6 regs, 16 add instr, 1 mov instr
   8.52 cycles/pixel:  8 regs, 16 add instr, 1 mov instr
  13.67 cycles/pixel: 10 regs, 16 add instr, 1 mov instr
  14.36 cycles/pixel: 12 regs, 16 add instr, 1 mov instr
  19.74 cycles/pixel: 14 regs, 16 add instr, 1 mov instr
  20.64 cycles/pixel: 16 regs, 16 add instr, 1 mov instr

One can fit two FP16 registers into one FP32 register, so using FP16 doubles the number of registers you can use without lowering performance. With two registers as in these tests, the performance for both is identical.
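To make the size of the penalty easier to read, here is a minimal sketch that turns the table above into slowdown factors relative to the 2-register case. The numbers are copied from the table; the helper names are hypothetical, and the FP16 conversion simply applies the "two FP16 registers per FP32 register" rule stated above.

```python
# cycles/pixel by number of FP32 registers, copied from the table above.
MEASURED = {
    1: 4.23, 2: 4.23, 3: 4.66, 4: 4.66, 5: 6.08, 6: 6.08,
    8: 8.52, 10: 13.67, 12: 14.36, 14: 19.74, 16: 20.64,
}


def slowdown(regs, baseline_regs=2):
    """Slowdown factor relative to the baseline register count."""
    return MEASURED[regs] / MEASURED[baseline_regs]


def fp32_equivalent(fp16_regs):
    """FP32 register slots needed for a number of FP16 registers (rounded up)."""
    return (fp16_regs + 1) // 2


if __name__ == "__main__":
    for regs in sorted(MEASURED):
        print(f"{regs:2d} regs: {slowdown(regs):.2f}x")
```

Read this way, 4 registers cost roughly 10% and 10 registers a bit more than 3x, at least for this particular ADD-only test.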

Finally results for individual programs:

Test method: a large number of full-window quads were drawn and timed, with the Z-buffer disabled and color writes enabled. The result is cycles/pixel, which has been scaled by 4 (assuming 4 pipelines) and 0.947 (invented efficiency factor that makes all numbers very near integers).

Driver 43.45, Geforce FX5800 Ultra.

This gives the number of rounds the pixel has to make in the above architecture, assuming a new pixel goes to the pipeline every cycle, and assuming there are 4 pipelines. The operations assumed to be performed in each round are shown in parentheses.
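As a sketch of that scaling, assuming "scaled by" means plain multiplication by both factors (the post does not spell this out) and using a hypothetical function name:

```python
# Convert a measured cycles/pixel figure into "rounds" as described above.
# Assumption: the scaling is a multiplication by the pipeline count and by
# the empirical efficiency factor quoted in the test method.
PIPELINES = 4
EFFICIENCY = 0.947   # "invented efficiency factor" from the post


def rounds(raw_cycles_per_pixel):
    """Rounds one pixel makes through a pipeline, per the stated scaling."""
    return raw_cycles_per_pixel * PIPELINES * EFFICIENCY


if __name__ == "__main__":
    print(f"{rounds(4.23):.2f}")
```

If the register table is in unscaled cycles/pixel (also an assumption), its 4.23 figure comes out at about 16.02 rounds, which would line up with the 16 ADD instructions in that test.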

The first number tells how many micro-ops the program takes when all instructions depend on the previous one. The second number corresponds to the case with paired instructions, where every pair is dependent on the previous pair (there can be a difference only when there are multiply instructions).

In the program description string each letter corresponds to one instruction:
a:FX12 add (same results for multiply-accumulate, not shown)
m:FX12 mul
A:FP16 add (FP32 has same speed)
M:FP16 mul (FP32 has same speed)
T:texture fetch with coordinates from register (dependent fetch)
S:texture fetch with coordinates directly from pixel inputs

Code:

rounds:  1.00  1.01  prog: a         (1:a          )
rounds:  1.00  1.00  prog: aa        (1:aa         )
rounds:  1.00  1.00  prog: aaa       (1:aaa        )
rounds:  2.00  2.00  prog: aaaa      (2:aaa,a      )
rounds:  2.00  2.00  prog: aaaaa     (2:aaa,aa     )
rounds:  2.01  2.01  prog: aaaaaa    (2:aaa,aaa    )
rounds:  3.01  3.01  prog: aaaaaaa   (3:aaa,aaa,a  )
rounds:  1.01  1.00  prog: A         (1:A          )
rounds:  1.00  1.00  prog: Aa        (1:Aa         )
rounds:  1.01  1.00  prog: Aaa       (1:Aaa        )
rounds:  2.00  2.01  prog: Aaaa      (2:Aaa,a      )
rounds:  2.00  2.01  prog: AA        (2:A,A        )
rounds:  2.01  2.01  prog: AAaa      (2:A,Aaa      )
rounds:  1.00  1.00  prog: T         (1:T          )
rounds:  1.00  1.00  prog: Ta        (1:Ta         )
rounds:  1.00  1.00  prog: Taa       (1:Taa        )
rounds:  2.00  2.00  prog: Taaa      (2:Taa,a      )
rounds:  2.00  2.00  prog: TT        (2:T,T        )
rounds:  2.01  2.01  prog: TTaa      (2:T,Taa      )
rounds:  1.00  1.00  prog: S         (1:S          )
rounds:  1.00  1.00  prog: Sa        (1:Sa         )
rounds:  1.00  1.00  prog: Saa       (1:Saa        )
rounds:  2.00  2.00  prog: Saaa      (2:Saa,a      )
rounds:  1.00  1.00  prog: SS        (1:SS         )
rounds:  1.00  1.00  prog: SSa       (1:SSa        )
rounds:  1.00  1.01  prog: SSaa      (1:SSaa       )
rounds:  2.00  2.00  prog: SSaaa     (2:SSaa,a     )
rounds:  1.00  1.00  prog: m         (1:m          )
rounds:  1.00  1.00  prog: mm        (1:mm         )
rounds:  1.01  1.00  prog: mmm       (1:mmm        )
rounds:  2.00  1.00  prog: mmmm      (2:mmm,m      )
rounds:  2.00  1.01  prog: mmmmm     (2:mmm,mm     )
rounds:  2.01  2.01  prog: mmmmmm    (2:mmm,mmm    )
rounds:  3.01  2.00  prog: mmmmmmm   (3:mmm,mmm,m  )
rounds:  1.00  1.00  prog: M         (1:M          )
rounds:  2.01  1.00  prog: Mmmm      (2:Mmm,m      )
rounds:  2.01  1.00  prog: Mmmmm     (2:Mmm,mm     )
rounds:  2.01  2.00  prog: Mmmmmm    (2:Mmm,mmm    )
rounds:  3.01  2.00  prog: Mmmmmmm   (3:Mmm,mmm,m  )
rounds:  2.00  2.00  prog: MM        (2:M,M        )
rounds:  2.01  2.01  prog: MMmm      (2:M,Mmm      )
rounds:  3.01  2.01  prog: MMmmmm    (3:M,Mmm,mm   )
rounds:  3.01  3.01  prog: MMmmmmm   (3:M,Mmm,mmm  )
rounds:  4.02  3.01  prog: MMmmmmmm  (4:M,Mmm,mmm,m)
rounds:  1.00  1.00  prog: T         (1:T          )
rounds:  1.01  1.00  prog: Tmm       (1:Tmm        )
rounds:  2.01  1.00  prog: Tmmm      (2:Tmm,m      )
rounds:  2.00  1.01  prog: Tmmmm     (2:Tmm,mm     )
rounds:  2.01  2.00  prog: Tmmmmm    (2:Tmm,mmm    )
rounds:  2.00  2.00  prog: TT        (2:T,T        )
rounds:  2.01  2.01  prog: TTmm      (2:T,Tmm      )
rounds:  3.01  2.01  prog: TTmmm     (3:T,Tmm,m    )
rounds:  3.01  2.01  prog: TTmmmm    (3:T,Tmm,mm   )
rounds:  3.01  3.01  prog: TTmmmmm   (3:T,Tmm,mmm  )
rounds:  1.01  1.01  prog: S         (1:S          )
rounds:  1.00  1.00  prog: Smm       (1:Smm        )
rounds:  2.00  1.00  prog: Smmm      (2:Smm,m      )
rounds:  2.00  1.00  prog: Smmmm     (2:Smm,mm     )
rounds:  2.01  2.01  prog: Smmmmm    (2:Smm,mmm    )
rounds:  1.00  1.00  prog: SS        (1:SS         )
rounds:  1.00  1.00  prog: SSmm      (1:SSmm       )
rounds:  2.01  1.01  prog: SSmmm     (2:SSmm,m     )
rounds:  2.01  1.00  prog: SSmmmm    (2:SSmm,mm    )
rounds:  2.01  2.00  prog: SSmmmmm   (2:SSmm,mmm   )

Code:

rounds:  1.00  1.00  prog: Amm       (1:Amm        )
rounds:  1.01  1.00  prog: Aam       (1:Aam        )
rounds:  1.00  1.00  prog: Ama       (1:Ama        )
rounds:  1.00  1.00  prog: Aaa       (1:Aaa        )
rounds:  2.00  1.00  prog: Ammm      (2:Amm,m      )
rounds:  2.00  1.00  prog: Aamm      (2:Aam,m      )
rounds:  2.00  2.00  prog: Amam      (2:Ama,m      )
rounds:  2.00  2.00  prog: Aaam      (2:Aaa,m      )
rounds:  2.00  1.01  prog: Amma      (2:Amm,a      )
rounds:  2.00  2.00  prog: Aama      (2:Aam,a      )
rounds:  2.00  2.01  prog: Amaa      (2:Ama,a      )
rounds:  2.00  2.01  prog: Aaaa      (2:Aaa,a      )
rounds:  2.00  1.00  prog: Ammmm     (2:Amm,mm     )
rounds:  2.00  2.00  prog: Aammm     (2:Aam,mm     )
rounds:  2.00  2.00  prog: Amamm     (2:Ama,mm     )
rounds:  2.00  2.00  prog: Aaamm     (2:Aaa,mm     )
rounds:  2.00  2.00  prog: Ammam     (2:Amm,am     )
rounds:  2.00  2.01  prog: Aamam     (2:Aam,am     )
rounds:  2.00  2.00  prog: Amaam     (2:Ama,am     )
rounds:  2.00  2.00  prog: Aaaam     (2:Aaa,am     )
rounds:  2.01  2.00  prog: Ammma     (2:Amm,ma     )
rounds:  2.00  2.01  prog: Aamma     (2:Aam,ma     )
rounds:  2.00  2.00  prog: Amama     (2:Ama,ma     )
rounds:  2.00  2.01  prog: Aaama     (2:Aaa,ma     )
rounds:  2.00  2.00  prog: Ammaa     (2:Amm,aa     )
rounds:  2.00  2.00  prog: Aamaa     (2:Aam,aa     )
rounds:  2.00  2.00  prog: Amaaa     (2:Ama,aa     )
rounds:  2.01  2.01  prog: Aaaaa     (2:Aaa,aa     )

More test cases were used to come to these conclusions (especially for the organization of the FX12 units). The second column in the second list above shows the performance for different cases in the integer combiners.
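The round assignments shown in parentheses in both lists follow a simple greedy packing, which can be sketched as below. This is only a model inferred from the tables, not a description of the hardware: it reproduces the first (fully dependent) column and does not attempt the paired-multiply behaviour of the second column. The function name is hypothetical.

```python
# Greedy model of the first "rounds" column: each round is one
# FLOAT/TEXTURE slot followed by two INTEGER slots, filled in program order.
INT_OPS = set("am")          # FX12 add / FX12 mul
FLOAT_TEX_OPS = set("AMTS")  # FP add, FP mul, dependent fetch, direct fetch
SLOTS_PER_ROUND = 3


def count_rounds(prog):
    """Count rounds for a program string such as 'MMmmmm'."""
    rounds = 0
    slot = SLOTS_PER_ROUND  # "full", so the first instruction opens a round
    i = 0
    while i < len(prog):
        op = prog[i]
        if op in FLOAT_TEX_OPS:
            # Float and texture ops only fit the first unit: open a new round.
            rounds += 1
            slot = 1
            # Two consecutive direct fetches share the FLOAT/TEXTURE slot.
            if op == "S" and i + 1 < len(prog) and prog[i + 1] == "S":
                i += 1
        elif op in INT_OPS:
            # FX12 ops can run in any unit, including the first one.
            if slot >= SLOTS_PER_ROUND:
                rounds += 1
                slot = 0
            slot += 1
        else:
            raise ValueError(f"unknown instruction letter: {op!r}")
        i += 1
    return rounds


if __name__ == "__main__":
    for prog in ("aaaa", "AAaa", "SSaaa", "MMmmmmmm", "TTmmmmm"):
        print(prog, count_rounds(prog))
```

For example, "MMmmmmmm" packs as M, Mmm, mmm, m, matching the 4 rounds listed above.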

There are probably errors in these tests, and there may be cases where performance is lower due to some limitation not observed. There could also be cases where 8 float-operations/cycle were possible, but at least with common operations such cases were not found.

In any case the overall performance should hopefully be accurate.

LeStoffer · Apr 1, 2003

thepkrl said:
The FLOAT/TEXTURE unit can handle any instruction with any format of input or output. All instructions execute in one cycle, except for LRP,RSQ,LIT,POW which take 2 and RFL which takes 4.

...

When textures are fetched, nothing else can be done in the FLOAT/TEXTURE unit. The following INTEGER unit can use the fetched texture result freely.

Okay, I might be a bit dense today: So you are saying that the relatively low FP shader performance on NV30 stems from the FP ALU unit sharing crucial resources with the FP texturing unit?

It sounds like an odd bottleneck in a CineFX architecture (unless they still regard those Register Combiners as having state of the art shading ability).

Hyp-X · Apr 1, 2003

LeStoffer said:
Okay, I might be a bit dense today: So you are saying that the relatively low FP shader performance on NV30 stems from the FP ALU unit sharing crucial resources with the FP texturing unit?

There are four reasons:

1. NV30 has 4 FP units - R300 has 8
2. NV30 cannot execute an FP instruction in the cycle it fetches textures - R300 can
3. NV30 doesn't seem to be able to pack a vec3 and a float in the same cycle - R300 can
4. NV30 has penalty when using more than 2 FP32 registers or 4 FP16 registers - R300 has no such penalty

LeStoffer · Apr 1, 2003

Hyp-X said:
There are four reasons:

1. NV30 has 4 FP units - R300 has 8
2. NV30 cannot execute an FP instruction in the cycle it fetches textures - R300 can
3. NV30 doesn't seem to be able to pack a vec3 and a float in the same cycle - R300 can
4. NV30 has penalty when using more than 2 FP32 registers or 4 FP16 registers - R300 has no such penalty

Thanks, Hyp-X! :idea:

We knew about #1, but I'm still a bit surprised by #2 & #4. It seems that the performance delta between running at FP16 or FP32 isn't in itself as dramatic as previously assumed.

Tagrineth · Apr 1, 2003

Looks like NV30's "sweet spot" for FP shaders is 4 registers - barely slower than 2.

But it's still ridiculous. nVidia has made many mistakes with this architecture... $5 say AT MOST half of those will be fixed in NV35.

McElvis · Apr 1, 2003

$5 say AT MOST half of those will be fixed in NV35

Only $5....

antlers · Apr 2, 2003

Speed improvements, it seems, come not from going from FP32 to FP16 but from using FX12. Too bad neither the PS2.0 nor the OpenGL shader specs were designed with that in mind.

Thanks for doing that work, thepkrl.

Tagrineth · Apr 2, 2003

McElvis said:
$5 say AT MOST half of those will be fixed in NV35

Only $5....

Just an expression.

If I was placing a bet, I would have said so.

demalion · Apr 2, 2003

Ok, one question for clarification:

If 8 texture ops can be done per clock cycle, how are we doing 8 with 4 units? Maybe it is just something simple I'm missing.

The below is associated with speculation from the above...if I'm missing something, the questions may not be pertinent:

How many components are being assumed in the texture op? Can the nv30 do 8 cube map texture address ops per cycle? 4? 2? If so, wouldn't that require the same register space as an fp32 op for each one?

A related question (to me): Microsoft specifies minimum precision for each texture op component as 24 bit IIRC...is there any reason, like some specific texture compression idea (if that would make sense at all), that hardware might be optimized for less precision for texture addressing?

If it can do 8 cube map texture ops per cycle, I'd conclude slightly differently than has been done here ('some of' my nv35 guess in the other thread was based on this). With the transistor count, I'd expect this to be the case, but maybe I'm missing something in the benchmark results.

If it can do 2, I think it matches what has been said here.

If it can do 4, I'm puzzled as to why it can't ever do 4 fp32 ops per clock cycle, as that would mean the register space is there...it should at least be able to do something like the texm3x3 instruction at 4 per clock, shouldn't it?

The 4 and 8 cases match my expectations for "GF 4 + 4 fp32 units", depending on how many cube map ops per cycle the GF 4 could do.

thepkrl · Apr 2, 2003

demalion said:
If 8 texture ops can be done per clock cycle, how are we doing 8 with 4 units? Maybe it is just something simple I'm missing.

The performance numbers don't really tell how many units are involved, just how many results we get each cycle. There could be four texture units that can do one trilinear or two bilinear. Or there could be 8 texture units, two connected to each fragment processor. The actual hardware wiring is probably much more complicated than the simple diagram.

From my tests, it seems that to get 8 texture fetches per cycle, the fetches have to be paired and nondependent. One can do integer operations after the pair as if the pair were a single FP operation, which suggests they are done in parallel, at least from the fragment processor's perspective.

The number of cycles a texture fetch takes is variable depending on how many samples are needed. For example you can even get 8 trilinear or anisotropic samples/cycle, if they happen to require just magnification (meaning they are effectively bilinear).

demalion said:
How many components are being assumed in the texture op? Can the nv30 do 8 cube map texture address ops per cycle? 4? 2? If so, wouldn't that require the same register space as an fp32 op for each one?

The texture used is a mipmapped 16x16 RGBA texture (4 components). It is small enough to fit into caches and give constant performance.

What do you mean by components and register space? Perspective textures and cubemaps have the same number of input coordinates (3) and the same number of outputs (4 for RGBA texture).

When I mentioned register limitations, I meant temporary registers (R1,R2,...) which you use to store intermediate results. In addition there are input and output registers (colors, texture coordinates, result), using which doesn't add extra slowdown. However, there seem to be limitations on where input registers can be efficiently used. Using a texcoord or color in fp-calculation costs an extra cycle. This makes the register connections look something like:

Code:

temporary registers (R0,R1,..,H0,H1,..)
  |
FLOAT (perhaps does DDX/DDY for dependent fetches)
  | \
  | TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
  | /
INTEGER <-- f[COL0],f[COL1]
  |
INTEGER <-- f[COL0],f[COL1]
  |
(loopback to temporary registers or output)

If input regs are used in the unit they are connected to, using them is free. If they are used for FLOAT/TEXTURE coords, an extra round is needed to first store them into a temp register. For example "ADD R0,f[TEX0],f[TEX0]" takes two rounds.
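The rule can be written down as a tiny cost model. This is a sketch of the cases actually described above and nothing more: the wiring table mirrors the diagram, the function name is made up, and combinations the post does not cover (for example an INTEGER op reading f[TEX0]) are deliberately left unanswered.

```python
# Cost model for reading fragment input registers, per the diagram above.
# Assumption: one instruction normally takes one round, and a FLOAT
# calculation that reads any f[...] input needs one extra round.
WIRED_INPUTS = {"TEXTURE": "f[TEX", "INTEGER": "f[COL"}


def rounds_for(unit, sources):
    """Rounds for one instruction in `unit` reading the given source operands."""
    inputs = [s for s in sources if s.startswith("f[")]
    if not inputs:
        return 1                      # only temporaries: no penalty
    if unit == "FLOAT":
        return 2                      # input must first be copied to a temp
    prefix = WIRED_INPUTS[unit]       # KeyError for a unit not in the diagram
    if all(s.startswith(prefix) for s in inputs):
        return 1                      # input is wired to this unit: free
    raise ValueError("combination not described in the post")


if __name__ == "__main__":
    print(rounds_for("FLOAT", ["f[TEX0]", "f[TEX0]"]))
```

The FLOAT case with two f[TEX0] sources corresponds to the "ADD R0,f[TEX0],f[TEX0]" example, which takes two rounds.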

I tried cube maps, and they are as fast as normal textures, meaning NV30 can fetch 8 cube-maps/cycle if coordinates come from inputs, or 4 cube-maps/cycle if coordinates come from temp registers.

demalion said:
A related question (to me): Microsoft specifies minimum precision for each texture op component as 24 bit IIRC...is there any reason, like some specific texture compression idea (if that would make sense at all), that hardware might be optimized for less precision for texture addressing?

Older NV_fragment_program specs say (this line is no longer present in latest specs though): "Fragment program instructions using the fragment attribute registers f[FOGC] or f[TEX0] through f[TEX7] will be carried out at full fp32 precision, regardless of the precision specified by the instruction."

From my tests it seems texture coordinates are handled as FP32 and the result can be stored to FP32 or FP16 with same performance.

demalion · Apr 2, 2003

Units:

Transistor count and listed performance lead me to lean more towards 8 fp units.

This only matters significantly for my nv35 speculation, and for insight into the nature of the register limitations.

Coordinates:

Ok, perhaps component is a bad choice of wording if it implies color component by usage...I'm not talking about the output of a texture look up, I'm just trying to establish whether the register demands for the stated texture op throughput exceeded what I understood to be mentioned already.

With your all-caps words being effective units, your diagram seems to indicate this, with the dedicated texture register capacity allowing that throughput, and temporary registers hindering using it beyond what is indicated. What I think is that this setup offers the flexibility to greatly benefit nv35 fp shader performance with register improvements/reorganization (i.e., within the speculated transistor budget) in specific situations (long computations without texture ops).

Registers:

AFAICS, your observation answers the above questions, and establishes that the nv35 could likely offer the possibility I mentioned.

However, I think your register path should meet just before the INTEGER parts and have another path bypassing them like you did for the TEXTURE part...moreso because of precision issues. Make sense?

Precision:

Well, I was approaching my question from two directions...the other direction seems to have provided the answer. Had a full set of speculation lined up for 16 bit texture coordinates, though.

KimB · Apr 2, 2003

I'd just like to say that those performance figures definitely make it look like something serious is up. In particular, I think the most damaging aspect is the extra performance hit from using the extra texture registers.

As a side note, has anybody else wondered that if nVidia didn't significantly improve the FP power of the NV30 over the NV2x (texture ops always used FP), what were the 120 million transistors used for?

LeStoffer · Apr 2, 2003

Chalnoth said:
As a side note, has anybody else wondered that if nVidia didn't significantly improve the FP power of the NV30 over the NV2x (texture ops always used FP), what were the 120 million transistors used for?

Good question. One thing could be having both Register Combiners (apparently without much change over GF4) and the new FP32 ALU in there at the same time. Maybe the more flexible Vertex Shader array takes up more silicon too. But I agree it seems a bit excessive when you think about both the R300 and NV25.

Any Easter Eggs in there?

KimB · Apr 2, 2003

Oh, and here's something I didn't think of previously.

How about a comparison with significantly different fragment program lengths?

That is, can it be shown that there's a performance hit from using above X number of instructions on the FX?

Additionally, can the performance hit from using too many registers be lessened by spreading the use of those registers out over more instructions? Quick example:

1-16 use registers x, y, z
17-32 use registers a, b, c
33-48 use z, c

And one more thing. As an alternative to the above, can it be shown that inserting extra move instructions so that the actual calculations are always done on no more than, say, 4 registers at once actually improves performance? Or, alternatively, that doing so does not change performance (that is, that the added move instructions incur no performance hit...which may indicate that the drivers are currently doing such moves, meaning only a few registers are actually available at once to each arithmetic unit).

BoardBonobo · Apr 2, 2003

Chalnoth said:
As a side note, has anybody else wondered that if nVidia didn't significantly improve the FP power of the NV30 over the NV2x (texture ops always used FP), what were the 120 million transistors used for?

Sounds like a case for Sherlock Holmes, we have all these transistors... but what do they do??

Maybe the DDRII interface needs lots of transistors and would that number include transistors used by cache?

KimB · Apr 2, 2003

BoardBonobo said:
Maybe the DDRII interface needs lots of transistors and would that number include transistors used by cache?

I don't think any memory interface could take up that many transistors.

KimB · Apr 2, 2003

Oh, and there's one more thing.

Were your assembly commands strictly serial? That is, did you try "pairing up" commands that could be executed in parallel?

Quick example; instead of:
c=a+b
a=b+c
b=c+a

...and so on, do:
c=a+b
e=a+d
a=b+c

...and so on.

Note that you could use constants to avoid using too many temporary registers.

thepkrl · Apr 4, 2003

demalion said:
However, I think your register path should meet just before the INTEGER parts and have another path bypassing them like you did for the TEXTURE part...moreso because of precision issues. Make sense?

Yes you are right. Integer units can be bypassed or the float data just flows through unchanged.

Chalnoth said:
How about a comparison with significantly different fragment program lengths?

Program length doesn't seem to matter. The program is probably loaded from memory, and long programs would only be slower in bandwidth-limited cases. I also tried texture fetches and they went the same speed (with a small texture that fits in caches).

Code:

     0.95 rounds: 1 add cwrite
     2.00 rounds: 2 add cwrite
     4.00 rounds: 4 add cwrite
     8.01 rounds: 8 add cwrite
    16.03 rounds: 16 add cwrite
    32.06 rounds: 32 add cwrite
    64.14 rounds: 64 add cwrite
   128.23 rounds: 128 add cwrite
   256.40 rounds: 256 add cwrite
   512.81 rounds: 512 add cwrite
  1002.50 rounds: 1000 add cwrite
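A quick way to read the table above is to divide each result by the instruction count. The sketch below does that with the figures copied from the table; the helper name is hypothetical.

```python
# (number of ADD instructions, measured rounds), copied from the table above.
LENGTH_TEST = [
    (1, 0.95), (2, 2.00), (4, 4.00), (8, 8.01), (16, 16.03), (32, 32.06),
    (64, 64.14), (128, 128.23), (256, 256.40), (512, 512.81), (1000, 1002.50),
]


def rounds_per_instruction(samples):
    """Return (instruction count, rounds per instruction) pairs."""
    return [(n, measured / n) for n, measured in samples]


if __name__ == "__main__":
    for n, ratio in rounds_per_instruction(LENGTH_TEST):
        print(f"{n:5d} instr: {ratio:.4f} rounds/instr")
```

From 2 to 1000 instructions the cost stays within about 0.3% of one round per ADD, which is what "program length doesn't seem to matter" amounts to here; only the single-instruction case (0.95) falls below 1.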

Chalnoth said:
Additionally, can the performance hit from using too many registers be lessened by spreading the use of those registers out over more instructions?

Register usage order doesn't seem to matter either, just the number of individual registers you use. In the tests below, there were 240 "ADD R0,R0,R0" instructions at the start, and then 16 instructions that push the number of registers used to the specified number. Using a different number of instructions (tried 32 and 256) gives the same performance ratio between the cases.

Code:

 256.39 rounds: 2 regs, 256 instr
 277.92 rounds: 4 regs, 256 instr
 357.52 rounds: 6 regs, 256 instr
 476.16 rounds: 8 regs, 256 instr
 727.96 rounds: 10 regs, 256 instr

Chalnoth said:
And one more thing. As an alternative to the above, can it be shown that inserting extra move instructions

It seems the compiler optimizes MOV instructions away. Adding them doesn't slow things, unless they move between different types of registers (int to float for example).

Chalnoth said:
Were your assembly commands strictly serial? That is, did you try "pairing up" commands that could be executed in parallel?

Quick example; instead of:
c=a+b
a=b+c
b=c+a

...and so on, do:
c=a+b
e=a+d
a=b+c

For all operations I tried two variants (example for ADD):

dependent:
ADD R0,R0,R0
ADD R0,R0,R0
ADD R0,R0,R0
ADD R0,R0,R0
...
ADD o[COL0],R0,R0

paired:
ADD R0,R0,R0
ADD R1,R1,R1
ADD R0,R0,R0
ADD R1,R1,R1
...
ADD o[COL0],R0,R1

The performance was always the same, except for FX12 multiplies. In the end I attributed this to having the capability for two parallel multiplications in the integer unit.
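For anyone wanting to rebuild the two variants, a generator along these lines would produce the instruction lists. It is only a sketch: the function names and the default count are made up, and the output mirrors the listing as posted (no program header or statement terminators), so it is not a complete fragment program on its own.

```python
# Generate the "dependent" and "paired" instruction sequences shown above.
# Assumption: `op` is any three-operand instruction mnemonic; ADD is the
# example used in the post.


def dependent_program(op="ADD", count=16):
    """Every instruction reads the result of the previous one (R0 only)."""
    lines = [f"{op} R0,R0,R0" for _ in range(count)]
    lines.append(f"{op} o[COL0],R0,R0")
    return lines


def paired_program(op="ADD", count=16):
    """Two independent chains (R0 and R1) interleaved instruction by instruction."""
    lines = [f"{op} R{i % 2},R{i % 2},R{i % 2}" for i in range(count)]
    lines.append(f"{op} o[COL0],R0,R1")
    return lines


if __name__ == "__main__":
    print("\n".join(paired_program(count=4)))
```

Swapping `op` for a multiply mnemonic gives the case where the two variants were seen to differ.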

Even though the above programs are trivial (initial value of all registers is 0 and so the result is 0) the driver doesn't seem to optimize them. Of course writing code like the above doesn't make sense, so why spend time optimizing it. The driver does optimize away calculations whose results are never used and seems to reorder instructions (moving FX12 instructions to between FP instructions).

I tried separately to load some data (texture coordinates/colors) into registers before some tests, but it always just added 1-2 cycles to the total which was consistent with a few extra load instructions. I also tried instruction triples for multiplies, but that didn't go faster than pairs.

Arun · Apr 6, 2003

Very interesting stuff, thepkrl! Thanks for the info!
BTW, could we get numbers for 32 registers? You've only given us till 16 registers...

Okay, so basically...

- The NV30 is a *4* pipeline architecture, with *native 1 cycle FP32 support* on a vec3.
The only advantage of FP16 over FP32 is lower register usage.

- Register usage is critical to performance: for example, if you use 10 FP32 registers or 20 FP16 registers, performance is nearly three times lower than theoretical.

- The NV30 has two additional integer units per pipeline, for a total of 8, which are FX12.

- The NV30 can do 1 dependent fetch/cycle per pipeline, or 2 when it's not dependent.

My guess for the register usage thing is that, for one reason or another, nVidia fetches the information of ALL registers which are currently in use.
Since they are FP32 4-component and the performance hit is only every two registers, you'd conclude they can fetch 256 bits/cycle.
But why doesn't it become 2 times slower?
Well, one explanation would be that there's no true dedicated units for that. You'd be using units which would otherwise be idle at that very moment. So, since that takes more time, you can put more of those units to it. I don't know what those units are, really - maybe it's the Z units - and maybe it isn't a good explanation, either.
I could very well be just plain stupid...
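The 256 bits/cycle figure in that guess is just register-width arithmetic, sketched below. The 256-bit budget is the guess from this post, not a known hardware figure, and the function names are hypothetical.

```python
# Register-width arithmetic behind the "256 bits/cycle" guess above.
# Assumption: every register holds 4 components of equal width.
COMPONENTS = 4


def register_bits(bits_per_component=32):
    """Width in bits of one 4-component register."""
    return COMPONENTS * bits_per_component


def free_registers(budget_bits=256, bits_per_component=32):
    """How many registers fit in the guessed per-cycle fetch budget."""
    return budget_bits // register_bits(bits_per_component)


if __name__ == "__main__":
    print(free_registers(bits_per_component=32))
    print(free_registers(bits_per_component=16))
```

Under that guess the budget holds 2 FP32 or 4 FP16 registers, the same thresholds reported earlier in the thread.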

The NV35 ( which is being announced at E3 BTW, see http://www.notforidiots.com/GPURW.php ) might thus be the following:
<speculation>
True 8 pipelines, with 1 FP32 unit and 1 INT12 unit per pipeline.
One dependent fetch/cycle, or 2/cycle when it's not dependent.
More "register walking" power, at least two times more.
</speculation>

Okay, so the question would be how the heck they'd get this done with 130M transistors, if they needed 125M transistors for the NV30...

There are reports that there's much bugged silicon, but that obviously can't be more than 5M or 10M - and they probably fixed part of it, maybe saving only 5M.
Also, the NV35 probably doesn't have more transistors for Vertex Shading - maybe a few optimizations though, who knows.

They'd really have to do an *amazing* optimization job... It would be very impressive, IMO.

Uttar

BRiT · Apr 6, 2003

My gut feeling about the situation is that the NV30 has much more than 5M to 10M of bugged transistors, or they were in the key critical performance sections. They obviously needed to release something to fill the product gaps, so they performed incredible hack jobs in hardware and drivers just to get the thing to work at all. The NV30 is nothing but a terribly failed chip that was released just to say "Hey, we have a product too". My take on things is the NV35 has just a few minor tweaks from what the NV30 was supposed to have been [read: The current productized NV30 is nothing like the theoretical/designed NV30.]

Anyways, certainly interesting findings.
