After hearing about Humus's tweak, I wondered whether there might be a way to improve NV40 performance as well. I checked the shader file in Doom 3 and found many dp3/rsq/mul sequences used for normalization. I then used the "NV_fragment_program2" option to rewrite parts of it, replacing all such sequences with a nrm instruction. To my surprise, there was ZERO improvement in demo1 or in a custom demo I recorded myself. So I have to ask: is this because the free fp16 normalization isn't enabled, or is the current driver already aware of the dp3/rsq/mul sequence and doing the replacement under the table?
edit: I was using the H (half) suffix on the instructions and declared all temp registers as short, so it's fp16 for sure.
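For reference, this is roughly the kind of change I mean. It is only a sketch, not the actual Doom 3 shader text: the register name `vec` and the use of `fragment.texcoord[0]` as the input are placeholders I made up for illustration, and only the relevant lines of each program are shown.

```
# Before: three-instruction normalize (plain ARB_fragment_program).
# 'vec' and texcoord[0] are hypothetical, not names from the Doom 3 shader.
TEMP vec;
DP3 vec.w, fragment.texcoord[0], fragment.texcoord[0];  # squared length
RSQ vec.w, vec.w;                                        # 1 / length
MUL vec.xyz, fragment.texcoord[0], vec.w;                # scale to unit length

# After: single normalize, with the option line placed at the top of the program.
OPTION NV_fragment_program2;
SHORT TEMP vec;                                          # fp16 temporary
NRMH vec.xyz, fragment.texcoord[0];                      # H suffix = half precision
```

Both versions should leave the same unit vector in `vec.xyz`, so if the driver already recognizes the first pattern, the two would be expected to perform the same.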