Vertex Shader performance

RoOoBo

From Tom's Hardware:

In contrast, the GeForceFX uses a highly programmable floating-point array, which allows for a triangle transformation rate of over 350 Mverts/s. For comparison, the GeForce4 Ti offers 136 Mverts/s while the Radeon 9700 PRO achieves about 325.

Normalized for clock speed, this gives us the following picture:

  • NVIDIA GeForce4 Ti4600 (300 MHz): 0.453 Mverts / clock
  • NVIDIA GeForceFX (500 MHz): 0.7 Mverts / clock
  • ATI Radeon 9700 PRO (325 MHz): 1 Mverts / clock

Where do those numbers come from? Are they just PR? How are they calculated?

What work is counted per vertex? One vertex shader instruction (maybe just an exit)? Four instructions (just a matrix transformation)? Zero instructions?

Because without this info I don't really know what they are telling me.

I guess the 325 million from the R300 could come from the four VS instructions of a vector-matrix transformation running on its four vertex shader units (so it would be 4 vertices / 4 clocks = 1 vertex/clock).
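
To make that guess concrete, here is a rough sketch of the arithmetic. It assumes each vertex shader unit retires one instruction per clock and that nothing else (setup, memory) limits the rate; both are assumptions for illustration, not confirmed hardware details, and the function names are made up.

```python
def verts_per_clock(units, instructions_per_vertex):
    """Peak vertices per clock, ASSUMING one instruction per unit per clock."""
    return units / instructions_per_vertex


def mverts_per_sec(units, instructions_per_vertex, clock_mhz):
    """Peak Mverts/s: vertices per clock times the clock in MHz."""
    return verts_per_clock(units, instructions_per_vertex) * clock_mhz


# The R300 guess from the post: 4 units, 4-instruction transform, 325 MHz.
r300_guess = mverts_per_sec(units=4, instructions_per_vertex=4, clock_mhz=325)
print(r300_guess)
```

Under those assumptions the guess lands on the quoted 325 Mverts/s, but that only shows the guess is consistent with the figure, not that it is how the figure was produced.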

But I don't know how to explain the GeForce4 Ti numbers.
 
Unfortunately, without a lot more detail there is no way to tell.

It could be a setup limit, or a limit based on the number of FMAD units, or any number of things.

And as a number it's not a useful comparison anyway. Just because card A outperforms card B with a minimal shader doesn't mean the same is true with a 100-instruction shader.
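
A toy model shows how the ranking can flip. Everything here is hypothetical: the two cards, their unit counts, clocks and setup limits are invented numbers, and the model simply takes the smaller of a fixed setup limit and a shader limit of one instruction per unit per clock.

```python
def peak_mverts(clock_mhz, units, setup_verts_per_clock, instructions):
    """Peak Mverts/s as the minimum of a setup limit and a shader limit.

    ASSUMPTION: each unit retires one instruction per clock, so the shader
    limit is units / instructions vertices per clock.
    """
    shader_verts_per_clock = units / instructions
    return min(setup_verts_per_clock, shader_verts_per_clock) * clock_mhz


def card_a(instructions):
    # Hypothetical card A: high clock, few units, setup-limited at 0.7 verts/clock.
    return peak_mverts(500, 2, 0.7, instructions)


def card_b(instructions):
    # Hypothetical card B: lower clock, more units, setup-limited at 1 vert/clock.
    return peak_mverts(300, 4, 1.0, instructions)


for n in (1, 4, 100):
    print(n, card_a(n), card_b(n))
```

With a one-instruction shader both invented cards hit their setup limit and A comes out ahead; with a 100-instruction shader both are shader-limited and B is ahead. A single peak figure cannot tell you which case you are looking at.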
 
I think Tom's Hardware screwed up by listing "Mverts/clock" instead of "verts/clock". We know the R300 can do 1 vertex/clock, or 325 Mverts/sec at 325 MHz. Nvidia claimed 136 Mverts/sec for the Ti4600 @ 300 MHz, which works out to 0.453 vertices per clock. The GeForceFX can do 350 Mverts/sec @ 500 MHz, which works out to 0.7 vertices/clock.
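
For anyone who wants to check the division: Mverts/sec divided by MHz gives plain vertices per clock, since the "mega" cancels. A quick sketch using only the figures quoted in this thread (the dictionary and function names are just for illustration):

```python
# (claimed Mverts/s, core clock in MHz), as quoted in this thread
claimed = {
    "GeForce4 Ti4600": (136, 300),
    "GeForceFX": (350, 500),
    "Radeon 9700 PRO": (325, 325),
}


def verts_per_clock(mverts_per_sec, clock_mhz):
    """Million vertices per second over million clocks per second."""
    return mverts_per_sec / clock_mhz


normalized = {
    name: round(verts_per_clock(rate, mhz), 3)
    for name, (rate, mhz) in claimed.items()
}
print(normalized)
```

The results are in vertices per clock, not Mverts per clock.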
 
megadrive0088 said:
how many Vertex Shader pipelines does NV30 have?

Unknown or not applicable.

It seems to use the same approach as the 3Dlabs P10: an array (or pool) of single-precision FP units (I suppose they will be FMACs). Or that is what is stated at Tom's and other places. But there are no numbers for how many of those units there are, either.

Although I can see some benefit to this approach when executing scalar instructions (RCP and others), and perhaps even more if they can detect instructions with masked components, the fetch/decode logic seems far more complex. And that is just for a (small?) performance improvement, or to lower the transistor count (fewer units could be required for the same performance). I think ATI's approach, with a scalar unit and a SIMD unit, is far better (without any other knowledge of the NV30's real architecture).
 