cthellis42 said:
Oh come now, Walt... They're DX9 compliant all right--it's just that being compliant makes them run at, like, half-speed.
Now, why did I know this question would come up?...
OK, here's how I see it...
The API calls for fp24. Not fp16, and not fp32. Even if you argue that fp32 exceeds the API specification, you still must concede that fp32 is not fp24, and is therefore not compliant with the API.
OK, so why fp24 in the API and not fp16 or fp32? Well, the people formulating the API decided fp16 wasn't precise enough for DX9, and that fp32, while certainly precise enough, was unlikely to perform well given current technical know-how and manufacturing processes. So fp24 was decided upon for DX9 compliance.
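To put rough numbers on "precise enough": fp16 carries 10 explicit mantissa bits and fp32 carries 23. For fp24 I'm assuming the s16e7 layout usually cited for R3x0 (16 mantissa bits). The Python sketch here just rounds a value to each mantissa width and reports the relative error. It deliberately ignores exponent range (overflow, denormals), so it models precision only, not range.

```python
import math

# Explicit mantissa (fraction) bits per format. fp16 and fp32 follow the usual
# s10e5 / s23e8 layouts; fp24 is ASSUMED to be the R3x0-style s16e7 layout.
MANTISSA_BITS = {"fp16": 10, "fp24": 16, "fp32": 23}


def round_to_mantissa(x: float, bits: int) -> float:
    """Round x to the nearest value with `bits` explicit mantissa bits.

    Simplification: exponent range (overflow, denormals) is ignored, so this
    only models mantissa precision, not the range of the real formats.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    frac, exp = math.frexp(x)  # x == frac * 2**exp, with 0.5 <= |frac| < 1
    scale = bits + 1           # implicit leading 1 plus the explicit bits
    return math.ldexp(round(frac * 2**scale), exp - scale)


def unit_roundoff(bits: int) -> float:
    """Worst-case relative error of round-to-nearest with `bits` mantissa bits."""
    return 2.0 ** -(bits + 1)


if __name__ == "__main__":
    x = 1.0 / 3.0
    for name, bits in MANTISSA_BITS.items():
        y = round_to_mantissa(x, bits)
        rel = abs(y - x) / x
        print(f"{name}: {y} rel. error {rel:.2e} (bound {unit_roundoff(bits):.2e})")
```

Each extra mantissa bit halves the worst-case relative error, so fp24's 6 extra bits over fp16 buy roughly 64x tighter rounding, and fp32's 7 further bits buy another 128x on top of that.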
ATi goes out and builds an fp24 part, and nVidia builds a hybrid fp16/fp32 part with zero fp24 hardware support. So why did nVidia take this route? Here's what I think...
nVidia felt that fp16 would be all the precision it needed for what it projected would essentially be DX8.1 games over the life of the product, but it also decided to sort of double up in the architecture and throw fp32 in there as well, albeit without enough register/transistor support in the chip to make fp32 really competitive in the 3D gaming market (it also wanted to keep the transistor count down, since yields had yet to be determined, and we all know how those have gone this year). nVidia wanted to be able to talk about fp32 support in the chip as exceeding the API specification, but planned all along to run at fp16, if not fx12. The intention all along was to use fx12/fp16 while its competitors used fp24, and nVidia assumed this alone would give it a performance advantage. I think this is further buttressed by the fact that the original nV3x drivers didn't do fp32 (since rectified to get WHQL compliance).
However, ATi comes out with an 8x1 architecture that does only fp24, and runs it faster than nVidia's 4x2 runs fp16, so nVidia lost that gambit straight up. But the important thing to me is that the consensus behind fp24 as the API's precision turned out to be correct from a performance standpoint, and so nVidia more or less screwed itself by trying to one-up everybody else with fx12/fp16, because it made architectural assumptions about its competitors that turned out to be faulty.
So, yes, fp32 (slow as Christmas as it is) exceeds the API specification, but nV3x is not technically in compliance with the API because the chip can't do fp24. So when running full-precision DX9, nV3x has to do fp32 and runs at 1/3 to 1/2 the speed of R3x0, which runs at fp24 all the time. So the consensus behind fp24 for the API was correct, even for nVidia, in terms of performance versus rendering precision, and nVidia would have been wise to heed it. That said, even with partial precision nV3x isn't competitive with R3x0, so there are things in the architecture other than fp precision that have caused this outcome.
The problem for developers such as Valve is that everybody expects a certain degree of DX9 performance from nVidia's hardware, because that's what nVidia has been telling everybody to expect all year. Well, the chip is not technically API-compliant for DX9 (there may be other things besides fp24 that it lacks--not clear on that) because it doesn't do fp24 but *must do* fp32, which exceeds the API spec and is overkill for what the specification demands, and which, not surprisingly, kills performance. This puts the developer in the unfortunate position of having to explain this complex reality to its customers, who have been led by nVidia to believe something about their hardware that is not technically, or accurately, true. So in order for Valve to explain why its DX9 software runs better on ATi, it's necessary to do what Gabe did.
I agree with Gabe. I would set the game up to default to DX8.1 on the 5900U and let the user elect full DX9 support. Even if I'd spent a lot of time on the mixed codepath, I'd drop it and make DX8.1 the default in this manner, as there doesn't seem to be much, if any, performance difference between the nV3x path and the DX8.1 path.
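A minimal sketch of that default-plus-override logic, assuming the game already knows the adapter's name and any stored user preference. Everything here (`Codepath`, `select_codepath`, `DX81_DEFAULT_ADAPTERS`) is hypothetical; a real engine would more likely key off the vendor/device ID and reported caps than a name string.

```python
from enum import Enum
from typing import Optional


class Codepath(Enum):
    DX81 = "dx8.1"
    DX9 = "dx9"


# HYPOTHETICAL: adapters that default to the DX8.1 path unless the user opts in.
DX81_DEFAULT_ADAPTERS = {"GeForce FX 5900 Ultra"}


def select_codepath(adapter_name: str, user_choice: Optional[Codepath] = None) -> Codepath:
    """Pick the rendering path: an explicit user choice always wins."""
    if user_choice is not None:
        return user_choice  # user elected a path (e.g. full DX9) manually
    if adapter_name in DX81_DEFAULT_ADAPTERS:
        return Codepath.DX81  # sensible default for this hardware
    return Codepath.DX9


if __name__ == "__main__":
    print(select_codepath("GeForce FX 5900 Ultra"))
    print(select_codepath("GeForce FX 5900 Ultra", Codepath.DX9))
```

Keeping the user override as the first check means the default only decides what happens out of the box; it never locks anyone out of full DX9.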
It's complex, but I really think that partial precision being the only workable performance option means one of the reasons the chip doesn't perform so well under DX9 is that it does fp32 instead of fp24. That's not the only reason, certainly, but it does contribute.