NVIDIA is really going to have to beef up their compiler to schedule all this, so I expect the initial drivers won't demonstrate full performance unless the assembly code is hand-written for the NV40.
Plus, to use the free nrm_pp they'll have to detect every dp4_pp, rsq_pp, mul_pp sequence (possibly not consecutive tokens) and change it to nrm_pp, because FXC never generated NRM until 9.0c (unreleased) and most hand coders never used the DX9 "macros" since they were warned not to.
They should provide a driver checkbox option ("force low precision normalization") and even detect dp4/rsq/mul and force it to NRM_PP. That's 3 slots into 1!
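To make the detection idea concrete, here is a rough sketch of the kind of peephole pass a driver might run. It is not how any real driver does it. Everything in it is an assumption for illustration: the `Instr` tuple, the `fold_normalize` name, and the `DOT_OPS` set are made up, registers are plain names with no swizzles or write masks, and temporaries are assumed not to be shader outputs. It tolerates unrelated instructions between the three ops and refuses to fold if the intermediate values are read anywhere else.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str        # opcode, e.g. "dp4_pp"
    dst: str       # destination register (no write mask in this sketch)
    srcs: tuple    # source registers (no swizzles in this sketch)

# Assumption: which dot products count as the start of a normalize.
# The post names dp4; a 3-component normalize would normally use dp3.
DOT_OPS = {"dp3_pp", "dp4_pp"}

def _first_touch(prog, start, temps, vec):
    """Index of the first instruction that reads/writes a temp or writes vec."""
    for idx in range(start, len(prog)):
        ins = prog[idx]
        if ins.dst in temps or ins.dst == vec or temps & set(ins.srcs):
            return idx
    return None

def _read_before_write(prog, start, reg):
    """True if reg is read again before being overwritten (still live)."""
    for ins in prog[start:]:
        if reg in ins.srcs:
            return True
        if ins.dst == reg:
            return False
    return False

def _match_at(prog, i):
    """Return (rsq_index, mul_index) if a normalize starts at i, else None."""
    dot = prog[i]
    if dot.op not in DOT_OPS or len(dot.srcs) != 2 or dot.srcs[0] != dot.srcs[1]:
        return None
    vec, len2 = dot.srcs[0], dot.dst
    if len2 == vec:
        return None
    j = _first_touch(prog, i + 1, {len2}, vec)
    if j is None or prog[j].op != "rsq_pp" or prog[j].srcs != (len2,):
        return None
    inv = prog[j].dst
    if inv == vec:
        return None
    temps = {len2, inv}
    k = _first_touch(prog, j + 1, temps, vec)
    if k is None:
        return None
    mul = prog[k]
    if mul.op != "mul_pp" or sorted(mul.srcs) != sorted((vec, inv)):
        return None
    # The intermediates vanish after folding, so nothing may read them later.
    if any(_read_before_write(prog, k + 1, t) for t in temps - {mul.dst}):
        return None
    return j, k

def fold_normalize(prog):
    """Replace dot/rsq/mul normalize sequences with a single nrm_pp."""
    prog = list(prog)
    i = 0
    while i < len(prog):
        match = _match_at(prog, i)
        if match is None:
            i += 1
            continue
        j, k = match
        prog[k] = Instr("nrm_pp", prog[k].dst, (prog[i].srcs[0],))
        del prog[j]   # drop the rsq first (higher index)
        del prog[i]   # then the dot product
    return prog

shader = [
    Instr("dp4_pp", "r0", ("r1", "r1")),
    Instr("add_pp", "r3", ("r4", "r5")),   # unrelated op in between
    Instr("rsq_pp", "r0", ("r0",)),
    Instr("mul_pp", "r2", ("r1", "r0")),
]
folded = fold_normalize(shader)
```

In this toy example the three normalize instructions collapse into one `nrm_pp r2, r1` and the unrelated `add_pp` is left alone. A real pass would also have to track swizzles, write masks, and modifiers, which is where most of the work would be.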