Since many of us are still trying to explain or justify the performance of the NV35 with words or theories, I have set out to do so numerically. Maybe this will give some of you a better idea of how the NV35 and R350 perform relative to each other (as far as I understand it).
To do this, I will compare the "effective" number of fragment shader fp flops each architecture can achieve, in an attempt to generate life-like, real-world performance expectations. Consequently, we need to take shader code into account; only then will we have "effective" results.
With the following statement in mind, we may proceed:
"...the more non-Vec4 operations are used the more efficiencies will be gained – 3Dlabs have suggested that up to 30% of instructions even in a standard OpenGL Transformation pipeline may not be Vec4 instructions."
This is how the calculation pans out:
To determine the number of "effective" flops, I sum the maximum vec4 flop rate, multiplied by its weight (70%), with the maximum scalar rate, multiplied by its weight (30%), and divide by the total weight.
Scalar=.3 or 30%; Vector=.7 or 70% of the average fragment instruction, so the total number of scalar flops available in each processor is multiplied by .3 and the total number of vector flops available by .7.
Since R350 has 8 fp fragment shader pipelines (each capable of simultaneously processing a vec4 and a scalar op), we obtain a maximum flop capability of 380 (clock speed in MHz) * 8 (number of fp fragment shader units) * 8 (flops per vec4 instruction, i.e. mad), which yields 24.320 gflops. On scalar ops, the number becomes 380*8*1, which is 3.040 gflops.
NV35 contains 12 fp fragment shader pipelines, each capable of processing either a vec4 or a scalar op. At a clock speed of 450 MHz, the maximum flop capability of NV35 with vec4 ops is 450*12*8 (on mad instructions) or 43.200 gflops. With a scalar op, the NV35 is capable of 450*12*2 or 10.800 gflops. This holds true for both fp32 and fp16 precision.
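For anyone who wants to check the peak numbers, here is a minimal Python sketch of the clock * units * flops-per-op arithmetic. The function and variable names are my own for illustration, and the per-op flop counts are simply the multipliers used in this post, not measured values.

```python
def peak_gflops(clock_mhz: float, units: int, flops_per_op: int) -> float:
    """MHz * units * flops per op gives mflops; dividing by 1000 gives gflops."""
    return clock_mhz * units * flops_per_op / 1000.0

# Multipliers as used in the post: 8 for a vec4 mad,
# 1 for an R350 scalar op, 2 for an NV35 scalar op.
r350_vec4 = peak_gflops(380, 8, 8)
r350_scalar = peak_gflops(380, 8, 1)
nv35_vec4 = peak_gflops(450, 12, 8)
nv35_scalar = peak_gflops(450, 12, 2)

print(f"R350: vec4 {r350_vec4:.3f} gflops, scalar {r350_scalar:.3f} gflops")
print(f"NV35: vec4 {nv35_vec4:.3f} gflops, scalar {nv35_scalar:.3f} gflops")
```

Changing the clock or the unit count is a quick way to see how sensitive the peak figures are to each assumption.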
Because NV35 can execute either scalars or vectors, and not the two concurrently, the possible "effective" flop number is derived from a straight average of the weighted scalar and vector flop performances.
NV35: (.7*43.200+.3*10.800)/2=(30.240+3.240)/2=16.740 gflops
R350: (.7*24.320+.3*3.040)/1=(17.024+0.912)/1=17.936 gflops
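The weighting step can be written as a small Python sketch. The names are hypothetical, and the divisor of 2 for NV35 versus 1 for R350 is just this post's way of modelling "either/or" versus "both at once" issue, not anything documented by either vendor.

```python
VEC_WEIGHT = 0.7     # assumed share of vec4 instructions
SCALAR_WEIGHT = 0.3  # assumed share of scalar instructions (the 3Dlabs figure)

def effective_gflops(vec4_gflops: float, scalar_gflops: float, divisor: int) -> float:
    """Weighted sum of the vec4 and scalar peak rates, divided as in the post."""
    weighted = VEC_WEIGHT * vec4_gflops + SCALAR_WEIGHT * scalar_gflops
    return weighted / divisor

# NV35 issues a vec4 OR a scalar op per unit, so the post halves the sum.
nv35_eff = effective_gflops(43.2, 10.8, divisor=2)
# R350 issues a vec4 AND a scalar op per unit, so the sum is kept whole.
r350_eff = effective_gflops(24.32, 3.04, divisor=1)

print(f"NV35: {nv35_eff:.3f} gflops")
print(f"R350: {r350_eff:.3f} gflops")
```

Playing with the two weights shows how much the outcome depends on the 70/30 split.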
With 6 registers in use for 3 fp units per pipeline (an average of 2 registers per fp fragment shader, assuming the performance penalties of NV30 and fp32 precision) NV35's number of effective flops becomes:
16.740 (maximum available gflops) /1.52 (clock cycles per instruction with 6 registers enabled)=11.01 gflops
Note: the 1.52 figure (1/1.52 is roughly 2/3 of peak throughput) comes from thepkrl's NV30 pipeline results thread and their performance analysis here.
Since R350 suffers no performance degradation when fewer than 32 registers are in use, the number of effective flops available on the NV35 in comparison to R350 is:
11.01 vs. 17.936 (gflops).
Thus the gflop ratio of NV35 relative to R350 is: ~0.614
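As a rough check of the last two steps, this Python sketch applies the register penalty and then takes the ratio. The names are my own, and the 1.52 cycles per instruction is taken from thepkrl's NV30 results and assumed to carry over to NV35.

```python
def apply_register_penalty(gflops: float, cycles_per_instruction: float) -> float:
    """Scale a flop rate down by the average cycles each instruction takes."""
    return gflops / cycles_per_instruction

NV35_WEIGHTED = 16.740   # gflops, weighted figure from above
R350_WEIGHTED = 17.936   # gflops, no register penalty assumed for R350
CPI_6_REGISTERS = 1.52   # assumption: NV30 figure applied to NV35

nv35_final = apply_register_penalty(NV35_WEIGHTED, CPI_6_REGISTERS)
ratio = nv35_final / R350_WEIGHTED

print(f"NV35: {nv35_final:.2f} gflops, ratio vs R350: {ratio:.3f}")
```

A ratio of about 0.614 means the NV35 lands roughly 38-39% below the R350 under these assumptions.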
In the real world this could translate into a 38% performance difference between NV35 and R350 when running fp fragment shaders at full precision. The NV35's performance, however, is only available when no texturing is required; otherwise it loses 4 fp fragment shader units, and the performance difference would probably increase by another 20-30 percent (for a total difference of 60-70%).
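To put a number on the texturing case, here is a speculative Python sketch that reruns the same method with 8 NV35 units instead of 12. It assumes the weights, the divisor of 2 and the 1.52 register penalty all stay unchanged, which is my own simplification; the names are hypothetical.

```python
def nv35_effective_gflops(units: int, clock_mhz: float = 450.0,
                          cycles_per_instruction: float = 1.52) -> float:
    """Same method as in the post, parameterised on the number of fp units."""
    vec4 = clock_mhz * units * 8 / 1000.0    # vec4 mad = 8 flops
    scalar = clock_mhz * units * 2 / 1000.0  # scalar mad = 2 flops
    weighted = (0.7 * vec4 + 0.3 * scalar) / 2  # either/or issue, so halve
    return weighted / cycles_per_instruction

R350_EFFECTIVE = 17.936  # gflops, from the calculation above

no_texturing = nv35_effective_gflops(units=12)
with_texturing = nv35_effective_gflops(units=8)  # assumes 4 units go to texturing

for label, value in (("no texturing", no_texturing), ("texturing", with_texturing)):
    gap = (1 - value / R350_EFFECTIVE) * 100
    print(f"{label}: {value:.2f} gflops, {gap:.0f}% below R350")
```

Under these assumptions the texturing case comes out around 59% below R350, close to the lower end of the 60-70% estimate.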
References:
-"...the more non-Vec4 operations are used the more efficiencies will be gained – 3Dlabs have suggested that up to 30% of instructions even in a standard OpenGL Transformation pipeline may not be Vec4 instructions."
http://www.beyond3d.com/articles/p10tech/index.php?page=page2.inc
-Thepkrl's research
http://www.beyond3d.com/forum/viewtopic.php?p=100394#100394
-NV35 fragment pipeline fp fragment shader details
http://www.beyond3d.com/forum/viewtopic.php?p=121958#121958
http://www.xbitlabs.com/articles/video/display/geforcefx-5900ultra.html
Take what you will from this tedious explanation, but it seems the R350's fragment architecture holds some definite advantages over the NV35's, apart from precision (fp24 vs. fp32); instruction count of R350 relative to NV35 is also higher. By no means, though, is the NV35 at a great disadvantage. Its performance is definitely admirable (we would never have thought this many flops at this precision were possible in the past) and is even somewhat comparable to R350 at fp24, but it seems to really shine, and possibly outperform the R350, with fp16 precision. The CineFX fragment shader architecture also offers a little more flexibility and a couple of extra instructions.
All in all, it is just good competition amidst some nasty corruption (which needs to be abolished).