As much as I know how difficult it is to alter Chalnoth's "belief" systems, ERPS's explanation is the same one I have previously had from NVIDIA: when they moved to a vertex shader platform, the transformation end was always done with vertex programs, whilst they left some specific hardware in place for lighting. That hardware was present from NV2x onwards, since GF3 only had one vertex shader and its performance would otherwise have been significantly lower than the hardwired T&L of NV1x (especially the GTS Ultra). Performance scaling indicated that it was still present in NV3x. I have no specific information on NV40, but the die constraints seem to indicate that they would want to remove it at this point, and the NV4x platform appears to be the cleanest sweep of legacy from previous generations that I've seen from NVIDIA in a while. The overall VS performance may now completely negate the performance differential anyway.
IIRC ATI never went along these lines because their first programmable part, the 8500, already had two VSs and hence fairly good vertex program T&L performance in the first place, although I seem to remember that the two VSs only appeared to work in parallel in a few applications.