It's not that fp16 is significantly faster than fp32 (or that fp32 is significantly slower)... The problem on NV3x is that it only has 2 fp32 registers (or 4 fp16 registers) free, and once you go past that, performance goes down. I'm sure you won't see much of a difference between code that uses 2 registers and 32 fp32 instructions and the same code using 32 fp16 instructions. fx12, however, is two times faster than floating point on NV30.
You also won't see much (if any) difference between fp16 and fp32 if you are doing just "color operations". However, if you want to do anything beyond "color operations", you can bump into precision limitations VERY quickly and they will be VERY obvious (for example, there is no point in using textures larger than 1024x1024 for dependent reads in fp16). Try the Mandelbrot demo from Humus on a Radeon (with fp24) and on a GeForce FX 5900 (hopefully it will run in fp32 there, to illustrate my point) and zoom in a bit...
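To put rough numbers on the 1024x1024 remark, here is a minimal sketch, not a model of the actual hardware. It assumes fp16 behaves like IEEE half precision (1 sign, 5 exponent, 10 mantissa bits), that texture coordinates are normalized to [0, 1], and that we only care about hitting texel centers. The helper names `to_fp16` and `distinct_texel_centers` are made up for this example.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    # The "e" format code packs/unpacks a 16-bit half-precision float.
    return struct.unpack("<e", struct.pack("<e", x))[0]

def distinct_texel_centers(width: int) -> int:
    """Count how many texel centers stay distinct after rounding to fp16."""
    centers = ((i + 0.5) / width for i in range(width))
    return len({to_fp16(u) for u in centers})

if __name__ == "__main__":
    for width in (512, 1024, 2048, 4096):
        print(width, "texels ->", distinct_texel_centers(width), "addressable")
```

Under those assumptions every texel center of a 1024-wide texture is still addressable, while wider textures start losing texels in the upper half of the coordinate range, where fp16 values are spaced 1/2048 apart. Note that even at 1024 there is nothing left over for sub-texel positioning.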
And Uttar: the T&L pipeline is far from fx12...