Chalnoth said:
So, I think that Valve spending 5x the development time was a mistake. I think they were attempting to tweak the assembly without full knowledge of the hardware. I think it took a lot of time, and wasn't terribly productive.
You don't think Nvidia DevRel gave Valve all the resources they possibly could to make the most anticipated bleeding-edge game of the year run decently on their products?
Of course Valve had full knowledge of the underlying hardware. The problem is likely that, when you're using HLSL, that knowledge doesn't buy you enough when you're trying to target NV3x.
Chalnoth said:
Let me just say that the NV3x generation is not a generation that should be programmed in assembly.
At the current time I think we can say exactly the opposite: your best hope of successfully targeting NV3x in DX9 is to use PS 2.0 assembly rather than HLSL, even though you still can't hope for better than mediocre performance. The most pressing cause of poor NV3x fragment shader performance is clearly the extremely tight restriction on register usage. Clever programming in PS 2.0 assembly can address this issue to the fullest extent possible, although it seems very unlikely that most shaders of any complexity can reasonably be written without overstepping the bounds of NV3x's 4 FP16 or 2 FP32 full-speed registers. Meanwhile, any architecture-neutral HLSL compiler is going to use far more temp registers, which is necessary to enable most of the optimizations compilers typically perform. An HLSL compiler targeted at generating optimal PS 2.0 code for NV3x could conceivably do much better, but probably still not as well as a determined human assembly programmer.
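To make the register-pressure point concrete, here is a rough sketch in PS 2.0 assembly of the kind of difference I mean. The shader is made up (a toy two-texture modulate-and-add, nothing from any actual game), and the two orderings are alternative versions of the same body, not one program:

ps_2_0
dcl t0.xy
dcl t1.xy
dcl_2d s0
dcl_2d s1

// Version 1, "compiler-style": fetch everything first, then combine.
// At the mul, r0 and r1 are still live and r2 comes into use, so
// three temps are in flight at once.
texld r0, t0, s0
texld r1, t1, s1
mul   r2, r0, c0
mad   r2, r1, c1, r2
mov   oC0, r2

// Version 2, hand-packed: consume each fetch as soon as it lands and
// reuse r0, so no more than two temps are ever live at once, inside
// the two-FP32-register budget quoted above.
texld r0, t0, s0
mul   r0, r0, c0
texld r1, t1, s1
mad   r0, r1, c1, r0
mov   oC0, r0

Obviously a toy case; the point is just that once a shader has more than a handful of simultaneously-live values, no amount of hand-scheduling can squeeze it under so small a register budget.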
The second cause of NV3x's poor PS performance--at least in the case of NV30, NV31 and NV34--is of course the inability to make any use whatsoever of the FX12 units in a PS 2.0 shader. But this is a fundamental limitation of the API itself, and neither HLSL nor PS 2.0 assembly can circumvent it. The only "solution" here (other than to rewrite the game in OpenGL) is for Nvidia's drivers to cheat and generate FX12 machine instructions anyway, either by special-casing shaders from known high-profile games, or by some sort of general optimization (ack!).
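To be concrete about what the API will and won't let you say: the lowest PS 2.0 can go is the partial-precision hint, which on NV3x maps to FP16; there is simply no way to spell "FX12" in the language. A trivial made-up example:

ps_2_0
dcl t0.xy
dcl_2d s0

texld  r0, t0, s0
// _pp is only a hint that FP16 precision is acceptable; it is the
// lowest precision PS 2.0 can express. Nothing in the instruction set
// maps to NV3x's FX12 fixed-point units, which is the API limitation
// described above. (OpenGL's NV_fragment_program does expose them,
// via X-suffixed instructions such as MULX.)
mul_pp r0, r0, c0
mov    oC0, r0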
Third, there are likely some scheduling tricks (beyond those needed to keep the register count as low as possible) that could help NV3x performance a bit. Of course these are just as accessible to a human assembly programmer who knows the relevant performance characteristics of the hardware (and, again, it is inconceivable that Valve would not) as they are to a good optimizing compiler; and, again, the only chance that a compiler will generate such code is if it is specifically targeted at NV3x to the exclusion of other architectures.
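Purely as a made-up illustration of the kind of trick I mean (not a claim about how NV3x actually schedules anything): if texture fetches have latency that independent math can cover, an ordering like this beats issuing the two fetches back to back:

ps_2_0
dcl t0.xy
dcl t1.xy
dcl_2d s0
dcl_2d s1

// Hypothetical: start the first fetch, then do independent ALU work
// (scaling the second set of coordinates) while it completes, rather
// than stalling on back-to-back texlds.
texld r0, t0, s0
mul   r1, t1, c2      // independent work overlapping the fetch above
texld r1, r1, s1      // dependent read through the scaled coordinates
mul   r0, r0, c0
mad   r0, r1, c1, r0
mov   oC0, r0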
So I'm not sure what you mean by suggesting that programming in HLSL (or Cg) can buy NV3x performance it can't otherwise achieve when programmed in PS 2.0 assembly. Unless, that is, you're suggesting shaders be shipped in uncompiled Cg form and compiled at runtime, without passing through PS 2.0 as an intermediate step. (I'm fairly certain HLSL does not have this ability; it is always compiled to PS 2.0 assembly at compile time.) In that case, Nvidia's drivers could presumably circumvent the DX9 spec by issuing FX12 instructions as discussed above, now with the extra context of having the full Cg code rather than the intermediately-compiled assembly version. Indeed, this ability to potentially circumvent the DX9 spec is presumably the reason Nvidia pushed Cg so hard to the exclusion of MS's HLSL.

Fortunately, that push has failed in the marketplace, and it's clear that the vast majority of DX9 games will have their shaders written in HLSL, not Cg or PS 2.0 assembly. Nor, given the performance of Cg and HLSL in PCChen's (IIRC) synthetic tests or in TR:AOD, does it seem Nvidia even got around to taking much advantage of the nefarious possibilities inherent in Cg; on average it does no better than HLSL, and sometimes worse.
There is apparently much room for improvement in Nvidia's NV3x-targeted HLSL compiler, and there may be significant optimizations still waiting to be had in their runtime assembly compiler as well (although by that stage, working from already-compiled PS 2.0 assembly, it will often be too late to salvage decent performance on NV3x). It just seems unlikely that any of this will go very far toward making up the huge gulf between NV3x and R3x0 in PS 2.0 performance.