Mintmaster
Veteran
DemoCoder said:
As for normalization, it can be computed more efficiently compared to DP/RSQ/MUL via Newton-Raphson.

Interesting. Can you explain this a bit more to me? I know you can do approximate normalization with a Taylor series (for GF2 etc.), and I tried doing a Newton-Raphson expansion, but I just can't find the savings. Are you just talking about the RSQ part getting faster? That I understand, but you still need to sum the squares and scale, and I don't see the big savings over DP3 and MUL. Furthermore, you need a separate RSQ unit anyway, and you said yourself that it won't be needed much beyond normalization.
Anyway, like I said before, I see why NVidia did FP16 normalization, but beyond that it seems pointless.
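To make the op counting concrete, here is a rough Python sketch of the two paths as I understand them. The function names are made up for illustration, and the starting guess of 1.0 is my assumption (it only works for inputs that are already close to unit length, such as interpolated normals). It is not a description of how any particular chip does it.

```python
import math

def normalize_dp3_rsq_mul(v):
    """Reference path: DP3, then RSQ, then MUL."""
    d = v[0] * v[0] + v[1] * v[1] + v[2] * v[2]  # DP3: sum of squares
    r = 1.0 / math.sqrt(d)                       # RSQ: reciprocal square root
    return [c * r for c in v]                    # MUL: scale each component

def normalize_newton(v, steps=1):
    """Approximate path: Newton-Raphson on y = 1/sqrt(d).

    Assumes the input is already near unit length, so y = 1.0 is a
    usable starting guess (an assumption made for this sketch).
    """
    d = v[0] * v[0] + v[1] * v[1] + v[2] * v[2]  # the DP3 is still needed
    y = 1.0
    for _ in range(steps):
        # Newton-Raphson update for 1/sqrt(d): y <- y * (3 - d*y*y) / 2
        y = y * (1.5 - 0.5 * d * y * y)
    return [c * y for c in v]                    # the MUL is still needed

def length(v):
    return math.sqrt(sum(c * c for c in v))
```

With one step from a guess of 1.0 the update collapses to y = 1.5 - 0.5*d, which is the first-order Taylor expansion around d = 1. Either way, the sum of squares and the final scale are unchanged; only the RSQ is swapped for a couple of multiply-adds.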
I see what you're saying regarding Xenon's VMX unit, but I'm positive that the DP hardware shares the per-component multiplication hardware, and just puts a few adders at the output (which can be enabled or disabled depending on the instruction) to sum the components. This is why I think it would be pointless to separate DP3 and MUL in pixel shaders. NVidia says MAD is the most commonly used instruction, so ADD also fits well in that grouping.
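As a toy model of what I mean by shared hardware, here is a small Python sketch. The function name and the `sum_lanes` flag are hypothetical, and broadcasting the dot product to every lane is my assumption about how the result gets written back.

```python
def vector_unit(a, b, sum_lanes):
    """Toy model of one vector ALU serving both MUL and DP.

    sum_lanes=False behaves like MUL (per-component products).
    sum_lanes=True behaves like DP (products summed by the output adders).
    """
    # The per-component multipliers are used by both instructions.
    products = [x * y for x, y in zip(a, b)]
    if sum_lanes:
        # Output adders enabled: sum the lanes and broadcast the result
        # (broadcast is an assumption made for this sketch).
        total = sum(products)
        return [total] * len(products)
    # Output adders bypassed: plain component-wise multiply.
    return products
```

The only difference between the two instructions in this model is whether the adders at the output are switched in, which is why splitting DP3 and MUL into separate units looks like duplicated multipliers to me.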
The only things that seem worth optimizing in this way are the "complex" scalar functions, and they may be small anyway.