NV30 fragment processor test results

Arun · Apr 7, 2003

Dawn's skin shader:

Code:

!!FP1.0
# NV_fragment_program generated by NVIDIA Cg compiler
# cgc version 1.5.0001, build date Oct 15 2002  01:04:08
# command line args: -profile fp30
#vendor NVIDIA Corporation
#version 1.0.1
#profile fp30
#program main
#semantic main.skinColor_frontSpec
#semantic main.skin_norm_sideSpec
#semantic main.g_specular_colorShift
#semantic main.g_blood_texture
#semantic main.g_transmission_terms
#semantic main.g_diffuse_Cube
#semantic main.g_specular_Cube
#semantic main.g_nrmalize_Cube
#semantic main.g_dappleProjection
#semantic main.g_hilight_Cube
#semantic main.g_oiliness
#var float4 v2f.skinColor_frontSpec : $vin.TEXCOORD0 : TEXCOORD0 : 0 : 1
#var float3 v2f.worldEyeDir	   : $vin.TEXCOORD2 : TEXCOORD2 : 0 : 1
#var float3 v2f.worldTanMatrixX 	   : $vin.TEXCOORD5 : TEXCOORD5 : 0 : 1
#var float3 v2f.worldTanMatrixY 	   : $vin.TEXCOORD6 : TEXCOORD6 : 0 : 1
#var float3 v2f.worldTanMatrixZ 	   : $vin.TEXCOORD7 : TEXCOORD7 : 0 : 1
#var float4 v2f.SkinSilouetteVec    : $vin.TEXCOORD3 : TEXCOORD3 : 0 : 1
#var samplerRECT skinColor_frontSpec :  : texunit 0 : 1 : 1
#var samplerRECT skin_norm_sideSpec  :  : texunit 1 : 2 : 1
#var samplerRECT g_specular_colorShift :  : texunit 2 : 3 : 1
#var samplerRECT g_blood_texture 	     :  : texunit 3 : 4 : 1
#var samplerRECT g_transmission_terms  :  : texunit 4 : 5 : 1
#var samplerCUBE g_diffuse_Cube 	     :  : texunit 5 : 6 : 1
#var samplerCUBE g_specular_Cube       :  : texunit 6 : 7 : 1
#var samplerCUBE g_nrmalize_Cube       :  : texunit 7 : 8 : 1
#var samplerRECT g_dappleProjection    :  : texunit 8 : 9 : 1
#var samplerCUBE g_hilight_Cube        :  : texunit 9 : 10 : 1
#var float2 g_oiliness :  :  : 11 : 1
#var half4 COL : $vout.COLOR : COLOR : -1 : 1

DEFINE LUMINANCE = {0.299, 0.587, 0.114, 0.0};
DECLARE g_oiliness;

############################################################################################
#  These two blocks slow code down 10%                                                     #
############################################################################################
TEX H0, f[TEX0], TEX1, 2D;                  # store range-compressed normal (and side spec) in H1
TEX H1, f[TEX0], TEX0, 2D;                  # store skin color in H3

MOVH H2.x, g_oiliness.y;
MULX H0.xy, H0, H2.x;
TEX H0, H0, TEX7, CUBE;

MOVH H3, f[TEX5];
DP3X H2.x, H0, H3;
MOVH H3, f[TEX6];
DP3X H2.y, H0, H3;
MOVH H3, f[TEX7];
DP3X H2.z, H0, H3;                          # H2 now contains worldNormal - extinguish

MOVH H0.xyz, f[TEX2];                       # store v2f.world_V in H1
DP3X H2.w, H0, H2;                          # H2.w = dot(Normal, View)
MULX H2.xyz,-H2, -2;                        # twice the normal
MADX H3.xyz, -H2, H2.w, H0;                 # H4 = -2*dot(H0, H2)*H2 + H0  = reflection vector.  Normal doesn't need to be fixed, because it's uniformly scaled

TEX H0.xyz, H2, TEX5, CUBE;                 # diffuse lighting
TEX H2, H2, TEX6, CUBE;                     # side specular
MULX H0.xyz, H0, H1;                        # skin diffuse*diffuse color
#MULX H2, H2, H0.w;                         # side_spec*side_spec term (H0.w is the side_spec term)
MULX H2, H2, H1.w;                          # side_spec*side_spec term (H0.w is the side_spec term)

MOVH H1.xyz, f[TEX3];                       # copy ndotv (H1.w is the spec map)
MULX H0, H0, H1.x;
MADX H0, H2, H1.y, H0;                      # diffuse + side_spec (H2 avail)

TEX H2, f[TEX2], TEX9, CUBE;                # fetch hilight
TEX H3, H3, TEX6, CUBE;                     # fetch direct specular
MULX H2.xyz, H2, H1.y;                      # hilight by facing ratio
MULX H2.xyz, H2, H1.x;
MULX H3, H3, H1.w;                          # spec*spec_map
MULX H3, H3, 0.02;                          # scale specular
MOVH H3.w, g_oiliness.x;
MULH H3, H3, H3.w;		            # scale the oiliness
MADX H2.xyz, H2, 0.7, H3;                   # 0.7*hilight + direct spec

ADDX H0, H2, H0;                            # diffuse specs and hilight
#MADX o[COLH], H0, H1.x, H2.w;              # add haze
ADDX o[COLH], H0, H2.w;                     # add haze
END


# End of program

Also, note how only H0->H3 is used. Sounds like nVidia knows about the register performance

Okay, so in that program, there are:
ADDX: 2
MULX: 9
MADX: 3
DP3X: 4
TEX: 7
MULH: 1
MOVH: 7

-> 5a, 13m, 3S, 4T, 1M and 7MOVs
All MOVs seem to do FP->FP
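
The per-opcode tally above can be double-checked mechanically. A minimal sketch in Python, with the instruction lines copied from the listing (trailing comments dropped); `count_opcodes` is a made-up helper for illustration, not part of any tool used in this thread:

```python
from collections import Counter

# Instruction lines from the Dawn skin shader listing (comments stripped).
SHADER = """
TEX H0, f[TEX0], TEX1, 2D;
TEX H1, f[TEX0], TEX0, 2D;
MOVH H2.x, g_oiliness.y;
MULX H0.xy, H0, H2.x;
TEX H0, H0, TEX7, CUBE;
MOVH H3, f[TEX5];
DP3X H2.x, H0, H3;
MOVH H3, f[TEX6];
DP3X H2.y, H0, H3;
MOVH H3, f[TEX7];
DP3X H2.z, H0, H3;
MOVH H0.xyz, f[TEX2];
DP3X H2.w, H0, H2;
MULX H2.xyz,-H2, -2;
MADX H3.xyz, -H2, H2.w, H0;
TEX H0.xyz, H2, TEX5, CUBE;
TEX H2, H2, TEX6, CUBE;
MULX H0.xyz, H0, H1;
#MULX H2, H2, H0.w;
MULX H2, H2, H1.w;
MOVH H1.xyz, f[TEX3];
MULX H0, H0, H1.x;
MADX H0, H2, H1.y, H0;
TEX H2, f[TEX2], TEX9, CUBE;
TEX H3, H3, TEX6, CUBE;
MULX H2.xyz, H2, H1.y;
MULX H2.xyz, H2, H1.x;
MULX H3, H3, H1.w;
MULX H3, H3, 0.02;
MOVH H3.w, g_oiliness.x;
MULH H3, H3, H3.w;
MADX H2.xyz, H2, 0.7, H3;
ADDX H0, H2, H0;
#MADX o[COLH], H0, H1.x, H2.w;
ADDX o[COLH], H0, H2.w;
"""

def count_opcodes(source):
    """Count the leading opcode of every non-empty, non-comment line."""
    counts = Counter()
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out instructions
        counts[line.split()[0]] += 1
    return counts

counts = count_opcodes(SHADER)
for opcode, n in sorted(counts.items()):
    print(f"{opcode}: {n}")
print("total:", sum(counts.values()))
```

The two commented-out instructions are skipped, so only instructions that actually execute are counted.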

According to thepkrl's data, the float/texture part would take 7 cycles ( 1.5+4+1 = 6.5 -> 7 )
I didn't check whether "m"s are dependent.
If they aren't, then it's possible to do the INT part in 7 cycles too, in parallel to all of the FP/Tex stuff:
12/3 + 6/2 = 4 + 3 = 7

Does that make sense?

BTW, I'd like to know how many rounds SM, SA, TM and TA take. If they only take one, then the FP/Tex part would only take 6 cycles...

Uttar

thepkrl · Apr 8, 2003

Uttar said:
BTW, could we get numbers for 32 registers? You've only given us till 16 registers...

More numbers below. Looking at them more closely, it seems there is more structure to the slowdown than just every two registers slowing things. I've grouped the numbers a bit to show this.

Code:

   32.12 rounds   8.48 cycle/fragm:  1 regs, 32 instr
   32.05 rounds   8.46 cycle/fragm:  2 regs, 32 instr

   34.29 rounds   9.05 cycle/fragm:  3 regs, 32 instr
   34.27 rounds   9.05 cycle/fragm:  4 regs, 32 instr

   43.62 rounds  11.52 cycle/fragm:  5 regs, 32 instr
   43.64 rounds  11.52 cycle/fragm:  6 regs, 32 instr

   59.60 rounds  15.73 cycle/fragm:  7 regs, 32 instr
   59.61 rounds  15.74 cycle/fragm:  8 regs, 32 instr

   93.42 rounds  24.66 cycle/fragm:  9 regs, 32 instr
   93.36 rounds  24.65 cycle/fragm: 10 regs, 32 instr
   95.97 rounds  25.33 cycle/fragm: 11 regs, 32 instr
   95.97 rounds  25.33 cycle/fragm: 12 regs, 32 instr

  129.35 rounds  34.15 cycle/fragm: 13 regs, 32 instr
  129.33 rounds  34.14 cycle/fragm: 14 regs, 32 instr
  132.74 rounds  35.04 cycle/fragm: 15 regs, 32 instr
  132.80 rounds  35.06 cycle/fragm: 16 regs, 32 instr

  207.44 rounds  54.76 cycle/fragm: 17 regs, 32 instr
  207.42 rounds  54.76 cycle/fragm: 18 regs, 32 instr
  212.65 rounds  56.14 cycle/fragm: 19 regs, 32 instr
  212.64 rounds  56.14 cycle/fragm: 20 regs, 32 instr
  217.83 rounds  57.50 cycle/fragm: 21 regs, 32 instr
  217.85 rounds  57.51 cycle/fragm: 22 regs, 32 instr
  223.03 rounds  58.88 cycle/fragm: 23 regs, 32 instr
  223.05 rounds  58.88 cycle/fragm: 24 regs, 32 instr

  299.50 rounds  79.07 cycle/fragm: 25 regs, 32 instr
  299.49 rounds  79.06 cycle/fragm: 26 regs, 32 instr
  306.32 rounds  80.87 cycle/fragm: 27 regs, 32 instr
  306.32 rounds  80.87 cycle/fragm: 28 regs, 32 instr
  313.12 rounds  82.66 cycle/fragm: 29 regs, 32 instr
  313.15 rounds  82.67 cycle/fragm: 30 regs, 32 instr
  319.91 rounds  84.45 cycle/fragm: 31 regs, 32 instr
  319.95 rounds  84.46 cycle/fragm: 32 regs, 32 instr

One guess to explain this is that the more registers are used, the fewer fragments are active in the pipeline. This would lower efficiency but would increase the memory available to a single fragment. This might be a sensible strategy, as using many registers means long programs, which might have internal parallelism that could be exploited. This doesn't seem to happen with current drivers though (programs with many registers do not perform paired FP-ops any faster).

I tried more tests with complicated programs using 8 registers. It seems the numbers above are the best case, and when registers are used in more complicated patterns, things get a bit slower (25% in one case).
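
The grouping in the table can be recovered from the numbers themselves. A rough sketch in Python, using the cycle/fragm column copied from the table; the 5% jump threshold and the helper name `group_starts` are assumptions chosen for illustration, not anything measured:

```python
# cycles/fragment for 1..32 registers, copied from the table above
CYCLES = [
    8.48, 8.46, 9.05, 9.05, 11.52, 11.52, 15.73, 15.74,
    24.66, 24.65, 25.33, 25.33, 34.15, 34.14, 35.04, 35.06,
    54.76, 54.76, 56.14, 56.14, 57.50, 57.51, 58.88, 58.88,
    79.07, 79.06, 80.87, 80.87, 82.66, 82.67, 84.45, 84.46,
]

def group_starts(cycles, threshold=1.05):
    """Register counts whose cost exceeds the previous count's by more than `threshold`x."""
    return [i + 1 for i in range(1, len(cycles))
            if cycles[i] / cycles[i - 1] > threshold]

def slowdown(cycles, regs):
    """Cost at `regs` registers relative to the 1-register case."""
    return cycles[regs - 1] / cycles[0]

starts = group_starts(CYCLES)
print("big jumps at register counts:", starts)
for regs in (2, 4, 8, 16, 32):
    print(f"{regs:2d} regs: {slowdown(CYCLES, regs):.2f}x")
```

With this threshold the jumps land at 3, 5, 7, 9, 13, 17 and 25 registers, which is where the blank lines in the table fall; the smaller steps every two registers inside the later groups stay below 5%.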

Uttar said:
Also, note how only H0->H3 is used. Sounds like nVidia knows about the register performance

Good observation.

Looking at the Dawn skin shader you showed, there is an interesting code fragment with extra MOV instructions:

Code:

MOVH H3, f[TEX5];
DP3X H2.x, H0, H3;
MOVH H3, f[TEX6];
DP3X H2.y, H0, H3;
...

I tried a similar code segment to see if moving a texture coordinate f[TEX] to a temp reg first is faster than using it directly. And it is! As mentioned earlier, using a f[TEX]-register in arithmetic ops seems to cost one extra cycle (for both FP16 and FP32). However, the extra move in the FP16 case makes the operation complete in one cycle, including the MOV which becomes free.

This might be related to the NV program specification, which says instructions with f[...] are always performed with FP32 accuracy. So using it directly forces the operation into FP32 which is always 2 cycles in this case, but adding a dummy move that changes the type makes it legal for the driver to ignore the rule.

The whole Dawn shader takes 17 rounds/pixel (4.25 cycles/pixel) in my testbench. There is dependency between instructions which lowers efficiency quite a bit from the theoretical maximums you calculated. I tried to manually divide the shader to rounds, and got the same number of stages as the tests showed, so this might be close:

Code:

TEX H0, f[TEX0], TEX1, 2D;
TEX H1, f[TEX0], TEX0, 2D;
--
MOVH H2.x, g_oiliness.y;
MULX H0.xy, H0, H2.x;
--
TEX H0, H0, TEX7, CUBE;
--
MOVH H3, f[TEX5];
DP3X H2.x, H0, H3;
--
MOVH H3, f[TEX6];
DP3X H2.y, H0, H3;
--
MOVH H3, f[TEX7];
DP3X H2.z, H0, H3;
--
MOVH H0.xyz, f[TEX2];
DP3X H2.w, H0, H2;
--
MULX H2.xyz,-H2, -2;
MADX H3.xyz, -H2, H2.w, H0;
--
TEX H0.xyz, H2, TEX5, CUBE;
--
TEX H2, H2, TEX6, CUBE;
--
MULX H0.xyz, H0, H1;
MULX H2, H2, H1.w;
--
MOVH H1.xyz, f[TEX3];
MULX H0, H0, H1.x;
MADX H0, H2, H1.y, H0;
--
TEX H2, f[TEX2], TEX9, CUBE;
 MULX H2.xyz, H2, H1.y;
 MULX H2.xyz, H2, H1.x;
--
TEX H3, H3, TEX6, CUBE;
 MULX H3, H3, H1.w;
 MULX H3, H3, 0.02;
--
MOVH H3.w, g_oiliness.x;
MULH H3, H3, H3.w;
--
MADX H2.xyz, H2, 0.7, H3;
ADDX H0, H2, H0;
--
ADDX o[COLH], H0, H2.w;
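
As a sanity check on the manual division above, the groups can be counted. A small sketch in Python with one line per round and only the opcodes kept; the divisor of 4 is inferred from the two figures in the post (17 rounds, 4.25 cycles) and is an assumption, not a stated rule:

```python
# One line per round, opcodes only, in the order of the listing above.
ROUNDS = """
TEX TEX
MOVH MULX
TEX
MOVH DP3X
MOVH DP3X
MOVH DP3X
MOVH DP3X
MULX MADX
TEX
TEX
MULX MULX
MOVH MULX MADX
TEX MULX MULX
TEX MULX MULX
MOVH MULH
MADX ADDX
ADDX
"""

rounds = [line.split() for line in ROUNDS.strip().splitlines()]
num_rounds = len(rounds)
num_instructions = sum(len(r) for r in rounds)

print("rounds:", num_rounds)
print("instructions:", num_instructions)
print("instructions per round:", round(num_instructions / num_rounds, 2))
print("rounds / 4:", num_rounds / 4)
```

The count comes out at 17 rounds for 33 instructions, so a little under two instructions are issued per round on average.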

Uttar said:
BTW, I'd like to know how many rounds SM, SA, TM and TA take.

Texture does seem to take the whole round and the Add or Mul goes to the next round.

Code:

  1.00 rounds, prog: T
  2.02 rounds, prog: TA
  3.01 rounds, prog: TAA
  1.00 rounds, prog: S
  2.01 rounds, prog: SA
  3.01 rounds, prog: SAA
  1.01 rounds, prog: T
  2.00 rounds, prog: TM
  3.03 rounds, prog: TMM
  1.02 rounds, prog: S
  2.03 rounds, prog: SM
  3.04 rounds, prog: SMM
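
Read this way, each op in these chains costs one full round. A tiny sketch in Python comparing that one-round-per-op model with the measurements above; `predicted_rounds` is a hypothetical name, and the model is only meant for these short chains:

```python
# (program, measured rounds) pairs copied from the table above
MEASURED = [
    ("T", 1.00), ("TA", 2.02), ("TAA", 3.01),
    ("S", 1.00), ("SA", 2.01), ("SAA", 3.01),
    ("T", 1.01), ("TM", 2.00), ("TMM", 3.03),
    ("S", 1.02), ("SM", 2.03), ("SMM", 3.04),
]

def predicted_rounds(prog):
    """Assumed model: every op in the chain takes one round."""
    return len(prog)

worst = max(abs(measured - predicted_rounds(prog)) for prog, measured in MEASURED)
for prog, measured in MEASURED:
    print(f"{prog:4s} predicted {predicted_rounds(prog)}  measured {measured:.2f}")
print(f"largest deviation: {worst:.2f} rounds")
```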

KimB · Apr 11, 2003

Oh, I'd like to post a quick update. I finally found the quote I was looking for.

With 4 FP ops and 8 Int ops per clock from these tests, the total number of functional units just does not add up.

From Extremetech:

Pipes don't mean as much as they used to. In the [dual-pipeline] TNT2 days you used to be able to do two pixels in one clock if they were single textured, or one dual-textured pixel per pipe in every two clocks, it could operate in either of those two modes. We've now taken that to an extreme. Some things happen at sixteen pixels per clock. Some things happen at eight. Some things happen at four, and a lot of things happen in a bunch of clock cycles four pixels at a time. For instance, if you're doing sixteen textures, it's four pixels per clock, but it takes more than one clock. There are really 32 functional units that can do things in various multiples. We don't have the ability in NV30 to actually draw more than eight pixels per cycle. It's going to be a less meaningful question as we move forward...[GeForceFX] isn't really a texture lookup and blending pipeline with stages and maybe loop back anymore. It's a processor, and texture lookups are decoupled from this hard-wired pipe.

(This was a quote by David Kirk)

Now, if the GeForce FX indeed does have 32 functional units, what are they doing? We only see 12 in use with these benches.

Luminescent · Apr 11, 2003

Well remember, each fp shader pipeline contains 4 fp subunits (perhaps mads/general fp processors) which compute the color for each individual color component (RGBA). So there are 4 32-bit units per shader pipeline. If there are 4 pipes, this means 16 units. This is only for the fragment shader. What about the 12 integer units?

Edit:
I do agree that there is some mystery behind the 125 million transistor count of the NV30; something is not performing. With the NV35 rumored to sport only 5 million more, something smells fishy. Does anyone think NV35 will be able to execute texture ops and fragment shaders concurrently? Why isn't NV30 able to do this? According to thepkrl, the NV30 uses the fragment shader only for dependent texturing. Therefore, why is it not performing alongside the texture units for independent texturing?

KimB · Apr 11, 2003

But if you're going to calculate things like that, you'd get 48 functional units, which also doesn't add up.

My first guess would be that there's a second FP unit per pipe that isn't made available in current drivers. For what reason, I'm not certain (though parallelism concerns seem to be the most likely issue...possibly with register sharing problems). These 16 functional units may be counted as 32 if each is a MAD unit (multiply+add).

demalion · Apr 11, 2003

OK, MAD can be viewed as 2 units for your purposes?

If you'll look back here, you'll see the argument I made for 8 FP units and subsequent discussion about it (if you look at the other thread, I think you'll see why I consider that important for the NV35 when not performing texture ops for long sequences of instructions).

From that and the other detailed discussion thepkrl provided, 8 * (FP MUL + FP ADD) + 4 * (2 MUL + 2 ADD) = 32.
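
The competing unit counts in this discussion are easy to mix up, so here they are side by side. A short sketch in Python; the function and variable names are made up for illustration, and the numbers are only the ones quoted in the posts:

```python
def demalion_count(fp_units=8, int_groups=4):
    """8 * (FP MUL + FP ADD) + 4 * (2 MUL + 2 ADD), as written in the post."""
    fp = fp_units * (1 + 1)        # each FP unit counted as one MUL plus one ADD
    fixed = int_groups * (2 + 2)   # each group counted as two MULs plus two ADDs
    return fp + fixed

# 4 FP ops + 8 int ops per clock seen in the benchmarks (KimB's post)
observed_ops_per_clock = 4 + 8

# 4 pipes x 4 per-component FP subunits (Luminescent's post)
luminescent_fp_units = 4 * 4

print("demalion:", demalion_count())
print("observed ops per clock:", observed_ops_per_clock)
print("Luminescent FP units:", luminescent_fp_units)
```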

OK, starting from this, and pointing out the flaws in it and my past discussion, what is your point Chalnoth? I'm not trying to flame, I just don't see how your discussion of a 32 count calls the earlier discussion into question.

antlers · Apr 11, 2003

I don't think there are any "missing" calc units there. David Kirk was just in total marketing mode. I mean, he said that they can never write more than 8 pixels per cycle, when the truth of the matter is they can never write more than four.

Saem · Apr 11, 2003

But it's still ridiculous. nVidia has made many mistakes with this architecture.

Argh, there is a significant difference between architecture and design. I'm guessing you mean mistakes in the design. The architecture is likely quite good, just this implementation (Read: design) sucks when compared to the R300.

KimB · Apr 12, 2003

I don't think it sucks compared to the R300 (speaking solely about the shader implementation).

The only problem is, you just need to make use of integer calculations whenever possible for the NV30 architecture to have high performance. If enough of your calculations can use integer precision (2/3 of all calculations), then the NV30 can, going by the benchmarks in this thread, have performance as much as 50% above the R300 on a per-clock basis.

Now, this certainly makes the NV30 a little bit harder to program for, but this is probably why nVidia made Cg. With optimized compiling, the vast majority of the quirkiness of the NV30 architecture need not be made visible to the programmer. Still, Direct3D remains a problem for the FX.

Now, if you really want to say that the NV30's shader design is fundamentally inferior to the R300's, you will need to have more information about specific shaders that would require higher than integer precision all throughout the pipeline. And remember, even if a piece of software has some shaders that require floating-point precision all throughout the shader, the NV30 may still end up with higher performance if enough shaders need mostly integer precision.

demalion · Apr 12, 2003

Are you including that one has 8 pipes and another has 4 for that "50%" higher per clock figure?

KimB · Apr 12, 2003

demalion said:
Are you including that one has 8 pipes and another has 4 for that "50%" higher per clock figure?

Well, isn't the Radeon 9700 capable of one FP shader op per clock per pixel pipeline? That's 8 fp ops per clock.

The GeForce FX is, according to the benches in this thread, capable of 4 fp ops and 8 int ops per clock. That's 50% more pixel shading power, if enough ops use integer precision.

Of course, if lots of scalar ops are issued, the R300 can do even better...but I'm just trying to say that the FX shader architecture is not fundamentally inferior to the R300 shader architecture (Of course, this depends on how many calculations can run at integer precision...but current information points to the FX as generally being better if as many ops as possible use integer precision).
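
The 50% figure can be reproduced with a back-of-the-envelope model. A minimal sketch in Python, assuming the per-clock rates quoted in this thread (4 FP ops and 8 int ops for the NV30, 8 FP ops for the R300), perfect overlap between the FP and int units, and no texture or scalar ops; `nv30_vs_r300` is a made-up name:

```python
from fractions import Fraction

NV30_FP_PER_CLOCK = 4    # from the benchmarks in this thread
NV30_INT_PER_CLOCK = 8   # from the benchmarks in this thread
R300_FP_PER_CLOCK = 8    # one FP op per clock on each of 8 pipes

def nv30_vs_r300(int_fraction):
    """Per-clock throughput of NV30 relative to R300 when `int_fraction`
    of all ops can run at integer precision (assumed model, see above)."""
    f = Fraction(int_fraction)
    # Clocks per op on NV30: the slower of the FP and int streams dominates.
    nv30_clocks = max((1 - f) / NV30_FP_PER_CLOCK, f / NV30_INT_PER_CLOCK)
    r300_clocks = Fraction(1, R300_FP_PER_CLOCK)
    return r300_clocks / nv30_clocks

for f in (Fraction(0), Fraction(1, 2), Fraction(2, 3), Fraction(1)):
    print(f"int fraction {f}: NV30/R300 = {float(nv30_vs_r300(f)):.2f}")
```

Under these assumptions the ratio peaks at 1.5 when exactly 2/3 of the ops are integer, and drops to 0.5 when every op needs floating point.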

MDolenc · Apr 12, 2003

No matter how you twist it: ps_2_x and ARB_fragment_program DO NOT expose anything to support integer calculations. Only NV_fragment_program exposes such functionality and only there Cg can help GeForce FX. Generally you ARE NOT ALLOWED to run below fp16 in ps_2_x and ARB_fragment_program, meaning that NVIDIA drivers will have to hack around trying to figure out if shader actually needs (full) float precision for that instruction or not.
I don't know what NVIDIA was actually trying to do with such an approach. They are definitely not making their driver writers' lives any easier. They also don't seem to save any transistors. So what the hell are they saving?

LeStoffer · Apr 12, 2003

Chalnoth said:
The only problem is, you just need to make use of integer calculations whenever possible for the NV30 architecture to have high performance.
...
Now, this certainly makes the NV30 a little bit harder to program for, but this is probably why nVidia made Cg. With optimized compiling, the vast majority of the quirkiness of the NV30 architecture need not be made visible to the programmer.

Yes, this very high dependence on int12 ins is obviously key to the NV30 shader performance. I would also agree that this is in part why they created Cg, but it seems increasingly clear to me that the hardware architecture guys were too much influenced by those within nVidia who primarily wanted a strong NV30GL (Quadro FX) for the professional GL apps.

That market probably loves that the good old register combiners stayed in there, and they should like the fact that they can still work with the NV extensions they have been used to. A more capable FP16/32 shader glue on top is just the way a relatively conservative market segment likes it.

This line of thinking is IMHO why nVidia never considered the R300 approach.

MDolenc said:
No matter how you twist it: ps_2_x and ARB_fragment_program DO NOT expose anything to support integer calculations. Only NV_fragment_program exposes such functionality and only there Cg can help GeForce FX.

Yes, and I see this as partial evidence for my line of thinking.

MDolenc said:
Generally you ARE NOT ALLOWED to run below fp16 in ps_2_x and ARB_fragment_program, meaning that NVIDIA drivers will have to hack around trying to figure out if shader actually needs (full) float precision for that instruction or not.

Good point ... and how the hell do you figure that out? I had the impression that the register combiners can only receive a fragment from the fragment program but can't send it back for further FP ins. Is this right?

MDolenc · Apr 12, 2003

LeStoffer said:
Good point ... and how the hell do you figure that out? I had the impression that the register combiners can only receive a fragment from the fragment program but can't send it back for further FP ins. Is this right?

You can freely mix int, fp16 and fp32 instructions in NV_fragment_program. If you look back at Dawn's skin shader you'll notice two kinds of instructions, *X and *H. *X are integer (fiXed point), *H are half float, and there are also *R instructions, which are full float.

Arun · Apr 12, 2003

Chalnoth: At the link you've given, in a later page, David Kirk actually suggests that you can have 32 pixels in flight...

For a number of generations now, since GeForce, the state information is pipelined, so that when you do a state change it doesn't stall the pipeline. The state change flows down the pipeline behind the last commands, and doesn't overwrite the relevant register until it can't affect anything downstream. The same is true with a shader. Let's say you have 32 pixels in flight finishing a polygon. There's then a state change or program change and then another polygon. That stage or program change follows the data in to the pipe, flows along with it, and at some point you'll have some pixels from one polygon, some state change information, and some pixels running in the new state flowing through the processor pipe.

If you actually had to have dedicated registers for 32 units, it would explain the transistor count for sure.
And if you didn't have all of the 32 FP32 registers in each of the 32 units, it would explain the register slowdown.

Another possibility is that the NV30 architecture is fundamentally different from what we think ( and that's what I'd bet on )
32 units, which can do either:
1FP in 8 cycles
1TEX in 4 cycles ( or 8 cycles, if dependent )
1FX ADD in 4 cycles
1FX MUL in 2 cycles

( note that those numbers are speculative, and assume 32 functional units. If some units were disabled for thermal issues or because they were buggy, those numbers might have to be divided by 2 )

You'd then have a register pool, which would be bigger than the 32 FP32 limit. And you'd have an instruction pool, which would be shared among all pipelines.

So, this is my guess for the whole NV30 disaster: There was so darn much headroom and parallelism they had to disable some registers, some units, and stuff just to be able to survive the thermal problems, and it requires Flow FX and a ridiculous amount of power... At the same time, GDDR-2 also produces more heat than expected.

All nVidia would have to do with the NV35 is reduce headroom, increase per-clock efficiency by reactivating the units which were disabled for thermal reasons, and finally use DDR-I or cooler GDDR-II.
Sounds good, doesn't it?

Uttar

LeStoffer · Apr 12, 2003

MDolenc said:
You can freely mix int, fp16 and fp32 instructions in NV_fragment_program. If you look back at Dawn's skin shader you'll notice two kinds of instructions, *X and *H. *X are integer (fiXed point), *H are half float, and there are also *R instructions, which are full float.

Mix, yes, but can you within the same shader do a bit of FP (in the fragment processor), then some int (in the register combiners) and then again some FP (back in the fragment processor) going back and forth freely?

I'm asking because the NV30 OpenGL specs only explain how you should output from the fragment processor to the register combiners (using registers TEX0 - TEX3) but not the other way back. I don't know whether this is an issue at all, but it would imply a limitation to me if you can't 'go back' to FP ins after using an int ins.

MDolenc · Apr 12, 2003

You don't even have to jump out of NV_fragment_program at all to use ints.

Edit: But of course you can't jump from NV_fragment_program to register combiners and back.

Edit 2:

Code:

!!FP1.0
MOVH H0, f[TEX3];
DP4H H0.x, f[TEX2], H0;
MULH H0, f[TEX2], H0.x;
MOVX H1, f[COL1];
DP4X H1.x, f[COL0], H1;
MULH H1, f[TEX2], H1.x;
MOVR R0, f[TEX1];
DP4R R0.x, f[TEX0], R0;
MADR R0, f[TEX2], R0.x, H0;
ADDR o[COLR], R0, H1;

This is a completely valid NV_fragment_program shader that uses int (*X), half float (*H) and full float (*R) instructions. No register combiners.
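
To make the suffix convention concrete, here is a small Python sketch that sorts the opcodes of the shader above by precision. It only knows the handful of opcodes that appear in this thread and treats TEX as a texture op rather than a fixed-point one; `precision_of` and the two lookup tables are illustrative assumptions, not a full NV_fragment_program parser:

```python
from collections import Counter

# Precision suffixes: R = fp32, H = fp16, X = fx12 (fixed point)
PRECISION = {"R": "fp32", "H": "fp16", "X": "fx12"}

# Only the arithmetic opcodes seen in this thread (assumed, not exhaustive).
ARITH_BASES = {"ADD", "MUL", "MAD", "DP3", "DP4", "MOV"}

def precision_of(opcode):
    """Classify an opcode by its suffix; TEX ends in X but is a texture fetch."""
    base, suffix = opcode[:3], opcode[3:]
    if base == "TEX":
        return "texture"
    if base in ARITH_BASES and suffix in PRECISION:
        return PRECISION[suffix]
    return "unknown"

# Opcodes of the mixed-precision shader above, in order.
OPCODES = ["MOVH", "DP4H", "MULH", "MOVX", "DP4X",
           "MULH", "MOVR", "DP4R", "MADR", "ADDR"]

by_precision = Counter(precision_of(op) for op in OPCODES)
print(dict(by_precision))
```

Looking only at the last letter would misclassify TEX, which is why the sketch splits off the three-letter base first.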

demalion · Apr 12, 2003

Chalnoth said:
demalion said:

Are you including that one has 8 pipes and another has 4 for that "50%" higher per clock figure?

Well, isn't the Radeon 9700 capable of one FP shader op per clock per pixel pipeline? That's 8 fp ops per clock.

Hmm...I got confused by your implying shader functionality parity (aside from integer processing), and didn't realize you were excluding texture ops and complex operations.

The GeForce FX is, according to the benches in this thread, capable of 4 fp ops and 8 int ops per clock. That's 50% more pixel shading power, if enough ops use integer precision.

That's ps 1.3 functionality with no dependent texture reads or complex instructions occurring, but you're right that's a 50% per clock advantage in that circumstance. For what length of a shader is that a realistic set of limitations? Where is the data for these operations coming from?

Of course, if lots of scalar ops are issued, the R300 can do even better...but I'm just trying to say that the FX shader architecture is not fundamentally inferior to the R300 shader architecture (Of course, this depends on how many calculations can run at integer precision...but current information points to the FX as generally being better if as many ops as possible use integer precision).

When running with less functionality than R300 shaders, using integer precision, using multiply operations, not using any texture ops, and using 4 component vectors alone. Did I forget instructions that allow the nv30 to exhibit similar execution per clock characteristics?

I would characterize the nv30 design as being either fundamentally inferior in features offered or fundamentally inferior in speed with regards to shaders. I concur with your 50% clock speed advantage special case, but don't see how it refutes this description.

EDIT: error.

Arun · Apr 12, 2003

The NV30 is superior to the R300 if the following is true:
1. Both INT & FP are used in the same program
2. Few registers are used
3. There are few scalar ops

It's not THAT hard, now, is it?
2 is Cg's job.
3 is true in many cases.

1 is the big problem, because DX9 doesn't support INT. Not even through extensions. I'm sure nVidia would gladly pay them $25M "under the table", or maybe even more, to get full integer support in DX9.1 extensions and the right to use FP16 registers for most operations...

Uttar

Luminescent · Apr 12, 2003

But why is Nvidia so stuck on int and not fp!!
