FM did nVidia a favor by not going CPU VS in 2 & 3 > I see it now.
If FM went CPU VS in 2 & 3: the PS1.4 of the 8500's would have made the GF4 Ti's look REAL bad w/their PS1.1, correct? Then the allegations of 'optimization' would fly & on face value would have credence.
Actually...no.
But it's easy to come to that conclusion if you don't understand the difference between PS 1.4 and PS 1.1.
What is the difference between them? It's not that PS 1.4 is "faster" or "more powerful" in the sense of getting more done per instruction and therefore requiring fewer cycles in the pixel shader to achieve the same effect. Rather it's that PS 1.4 programs can be significantly longer, so that a certain effect which takes 2 (or sometimes 3) PS 1.1 programs can all be done in one PS 1.4 program.
Now, this doesn't reduce the workload for the pixel shader: the 1 PS 1.4 program that does the work of 2 PS 1.1 programs also takes (roughly) twice as long as each PS 1.1 program. Instead, it reduces the workload on the geometry engine (including vertex shaders), and reduces bandwidth utilization.
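A toy cost model makes the trade-off concrete. All the cycle counts below are made-up illustrative numbers, not measurements of any real GPU:

```python
# Illustrative numbers only; no real GPU is being measured here.
ps11_cycles_per_pass = 8            # hypothetical pixel-shader cycles per pass
ps11_passes = 2                     # the effect needs programs A and B
ps14_cycles = ps11_cycles_per_pass * ps11_passes  # one program, ~2x as long

# Per-pixel shader work is the same either way:
pixel_work_ps11 = ps11_passes * ps11_cycles_per_pass   # 16 cycles per pixel
pixel_work_ps14 = ps14_cycles                          # 16 cycles per pixel

# But the geometry engine (T&L + vertex shaders) runs once per pass:
geometry_runs_ps11 = ps11_passes    # mesh processed twice
geometry_runs_ps14 = 1              # mesh processed once
```

Same pixel-shader total, half the geometry-engine invocations: that's the whole advantage of the single-pass path.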
To see why, take a look at what happens when you perform the effect using 2 PS 1.1 programs instead; let's call them program A and program B. When it comes time for the GPU to render a poly to which the effect is applied, it will fetch the vertex coordinates, transform them, run any vertex shader programs to adjust those vertices, light them, and then, for each pixel in the interior of the polygon, run program A and write the result to the framebuffer. Then it will go on rendering all the other polys in the scene until it is done. Then it will start a second pass, and for any polys that are not finished rendering--like this one, which still needs program B applied--it will have to repeat the process: fetch the vertices, run the vertex shaders, and render again, this time running program B on the results from program A, and finally writing these final values out to the framebuffer.
With PS 1.4, you can do the effect with a single program. So you save the task of reading in the geometry again, running any vertex programs (including T&L) again, reading the temp values from the framebuffer and writing the new values to the framebuffer. Same amount of work in the pixel shaders, much less work in the rest of the GPU.
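Here's a rough sketch, in Python rather than any real GPU API, of the multipass flow just described; the scene's poly counts and sizes are invented purely for illustration:

```python
def render_scene_multipass(polys, programs):
    """Toy model of the flow described above (not a real GPU API).

    Every pass repeats the full per-vertex work for every poly it touches;
    every pass after the first also reads back the previous pass's temp values
    from the framebuffer before writing its own result.
    """
    vertex_work = 0
    framebuffer_traffic = 0
    for i, program in enumerate(programs):
        for poly in polys:
            # fetch vertices, T&L, vertex shaders -- repeated on every pass
            vertex_work += poly["vertices"]
            # write this pass's result; later passes also read the temps back
            framebuffer_traffic += poly["pixels"] * (2 if i > 0 else 1)
    return vertex_work, framebuffer_traffic

# Invented scene: 50 polys, 100 vertices and 2000 covered pixels each.
scene = [{"vertices": 100, "pixels": 2000}] * 50
v_2pass, fb_2pass = render_scene_multipass(scene, ["A", "B"])   # PS 1.1 path
v_1pass, fb_1pass = render_scene_multipass(scene, ["AB"])       # PS 1.4 path
# Two passes double the vertex work and add a framebuffer round trip.
```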
So in some sense, Nvidia is right to complain about vertex shaders, in that the GF4's inability to use the one-pass PS 1.4 rendering path does indeed increase its vertex shading workload. Of course, they're completely wrong and very disingenuous to imply that Futuremark could have used a different PS level instead: it's impossible to implement bump-mapped per-pixel specular and diffuse lighting in a single pass in PS 1.1, 1.2 or 1.3. About the best you can do is what Carmack has done in the Doom3 engine: 1 pass in PS 1.4, 2 passes in PS 1.1, and 5 passes for a DX7 GPU that only has fixed-function pixel pipelines. The rendering style used in GT2 and GT3 is a bit more complex: it takes 1 pass in PS 1.4, 3 in PS 1.1, and cannot be done at all on a DX7 card (presumably; or perhaps FM just didn't bother coding a fallback because the performance would be so absurdly bad).
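To put those pass counts side by side (the numbers come straight from the paragraph above; the relative-cost figures just restate that per-pass geometry work scales with pass count):

```python
# Pass counts quoted above for each rendering style and hardware path.
passes = {
    "Doom3-style": {"PS 1.4": 1, "PS 1.1": 2, "DX7 fixed-function": 5},
    "GT2/GT3-style": {"PS 1.4": 1, "PS 1.1": 3},  # no DX7 fallback exists
}

# Geometry/vertex-shader work is proportional to the number of passes,
# so relative to the one-pass PS 1.4 path:
rel_geometry_cost = {
    style: {path: n / table["PS 1.4"] for path, n in table.items()}
    for style, table in passes.items()
}
# e.g. the GT2/GT3-style effect costs 3x the vertex work on a PS 1.1 card
```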
So is Nvidia right when they assert "This approach creates such a serious bottleneck in the vertex portion of the graphics pipeline that the remainder of the graphics engine (texturing, pixel programs, raster operations, etc.) never gets an opportunity to stretch its legs"? No, absolutely not.
As Futuremark suggests, this fact is easily seen by looking at the scaling factors on the various cards as the resolution is increased. Luckily, the Tech Report review has the data we need. Now, if the only bottleneck on rendering this scene were pixel throughput (in this case, the pixel shaders), then you would expect the scores to scale linearly with the number of pixels onscreen; i.e., you would expect the score @1280*1024 to be exactly .6x the score @1024*768, and the score @1600*1200 to be exactly .4096x the score @1024*768. If, on the other hand, the only bottleneck, even at 1600*1200, were the vertex shader workload, then lowering the resolution wouldn't change the fps one bit. Similarly, if the vertex shaders are the bottleneck at 1024*768, increasing the resolution won't lower the fps until pixel shading becomes enough of a burden to shift the bottleneck away from vertex shading, and even then the drop in performance at higher resolutions will be very slight.
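You can check those scaling factors yourself; they're just ratios of pixel counts:

```python
# Expected fps relative to 1024*768 if rendering were 100% pixel-bound
# (fps scales linearly with the inverse of the onscreen pixel count).
base = 1024 * 768                       # 786432 pixels
factors = {
    res: base / (res[0] * res[1])
    for res in [(1280, 1024), (1600, 1200)]
}
# (1280, 1024) -> 0.6     (786432 / 1310720, exactly 3/5)
# (1600, 1200) -> 0.4096  (786432 / 1920000)
```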
So we can use those results from TR to tell us how far these GPUs deviate from a perfect "100% pixel throughput bottleneck" on GT2. Each percentage represents the amount by which the actual score is faster than the theoretical score assuming linear scaling with resolution, using the 1024*768 scores as the base:
9700 Pro:
1280*1024 - 13.46%
1600*1200 - 23.31%
9500 Pro:
1280*1024 - 9.55%
1600*1200 - 14.63%
GF4 Ti4600:
1280*1024 - 13.76%
1600*1200 - 22.07%
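For reference, here is the arithmetic behind those percentages. The scores in the example are made up purely to show the math (they are NOT the Tech Report numbers), and `deviation_from_linear` is a hypothetical helper, not anything from the review:

```python
def deviation_from_linear(score_base, score_hi, pixels_base, pixels_hi):
    """Percent by which an observed high-res score beats the score that
    perfect linear scaling with pixel count would predict."""
    predicted = score_base * pixels_base / pixels_hi
    return (score_hi / predicted - 1.0) * 100.0

# Hypothetical example: 100 fps at 1024*768 predicts 100 * 0.6 = 60 fps at
# 1280*1024 under linear scaling; an observed 68 fps is ~13.3% above that.
dev = deviation_from_linear(100.0, 68.0, 1024 * 768, 1280 * 1024)
```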
The GF4 is just as pixel-limited as the 9700 Pro! It is slightly less so than the 9500 Pro, but nothing to complain about. And let me remind you that the entire effect we're measuring is going to be much more significant than the portion of it due only to the extra vertex skinning. Moreover, if you look at the graphs in any reviews here at B3D--in which the x-axis represents pixel throughput rather than fps, in order to facilitate exactly this sort of comparison--you'll notice that GT2 is much more pixel-limited than many 3d games on the market (which should be expected, because a game is more likely to be CPU-limited than a synthetic 3d benchmark, 3DMark01 notwithstanding).
Another demonstration of the same result comes from this comparison of a 9700 Pro with PS 1.4 support enabled vs. disabled in the drivers. The GT2 result increases by 22.5% when PS 1.4 is enabled; again, only a portion of that is due to any drop in vertex shader workload (probably more is due to the drop in required bandwidth), and only a portion of that portion is due to vertex skinning (as opposed to T&L).