Luminescent (Veteran):
Running at full precision, do any of you think the NV30 will ever perform at the level of the R300 (ARB2) in the future? Do you think its fp32 performance is stuck the way it is for good?
Joe DeFuria said:And just a few more comments:
I don't understand this statement of Carmack's, my emphasis added:
The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed.
That seems contradictory to me. Why / how is it that when nVidia runs fragment programs, "they are at a higher precision than ATI's"...when at the same time nVidia offers both a lower and a higher precision mode?
Doesn't make sense to me.
The current NV30 cards do have some other disadvantages: They take up two slots, and when the cooling fan fires up they are VERY LOUD. I'm not usually one to care about fan noise, but the NV30 does annoy me.
Given the "environment of terror" that Doom-III is supposed to have, I think the noise of the NV30 is a significant drawback for the consumer...
mboeller said:The R200 path has a slight speed advantage over the ARB2 path on the R300, but only by a small margin, so it defaults to using the ARB2 path for the quality improvements. The NV30 runs the ARB2 path MUCH slower than the NV30 path. Half the speed at the moment. This is unfortunate, because when you do an exact, apples-to-apples comparison using exactly the same API, the R300 looks twice as fast, but when you use the vendor-specific paths, the NV30 wins.
"Vendor-specific" to me means fast fixed point. He does not say that the NV30 uses FP here.
The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed. Nvidia assures me that there is a lot of room for improving the fragment program performance with improved driver compiler technology.
He does not say that the fast path is the FP16 path, only that it is one of the three NV30 precisions. So it could very well be the fixed-point path too. As for the NV30's different FP modes, he does not say how fast they are compared to the R300.
The reason is that, using the ARB path, NV30's fragment processor stays at 32 bit per component mode while R300's processor stays at 24 bit per component mode. It is only when one uses NV30 specific fragment path that some calculations are shifted to 16 bit per component mode, making NV30 faster overall.
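The three precisions at issue can be made concrete. A minimal sketch, assuming the commonly cited bit layouts (NV30's FP16 as s10e5, FP32 as s23e8, and R300's FP24 as s16e7; these layouts are not stated in the thread itself), comparing their relative precision:

```python
# Relative precision (machine epsilon) of the three shader float formats
# discussed above: NV30's FP16 and FP32, and R300's FP24.
# Assumed mantissa widths: FP16 = 10 bits, FP24 = 16 bits, FP32 = 23 bits.

FORMATS = {
    "FP16 (NV30 half)": 10,   # s10e5
    "FP24 (R300)":      16,   # s16e7
    "FP32 (NV30 full)": 23,   # s23e8
}

def epsilon(mantissa_bits):
    """Smallest relative step between adjacent representable values."""
    return 2.0 ** -mantissa_bits

for name, bits in FORMATS.items():
    print(f"{name}: eps = {epsilon(bits):.2e}")
```

The epsilons come out around 9.8e-4, 1.5e-5, and 1.2e-7 respectively, which is why Carmack can say ATI's precision sits "in between" Nvidia's two floating-point modes.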
Joe DeFuria said:The reason is that, using the ARB path, NV30's fragment processor stays at 32 bit per component mode while R300's processor stays at 24 bit per component mode. It is only when one uses NV30 specific fragment path that some calculations are shifted to 16 bit per component mode, making NV30 faster overall.
Is that somehow a limitation of how the ARB extension interacts with nVidia hardware, or something that can be changed in future nVidia drivers? In short, is nVidia's "ARB" fragment path always going to be limited to fp32?
That seems like a pretty significant limitation, considering the FX's performance in fp32. It would virtually guarantee that any OpenGL app wanting good floating-point performance on the NV30 is going to have to code to the nVidia-specific extensions.
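Carmack's .plan describes exactly this kind of per-vendor backend selection. As a rough illustration (the path names are Carmack's; the selection logic itself is a guess, not id Software's actual code), choosing a path from the GL extension string might look like:

```python
def select_backend(extensions):
    """Pick a Doom III-style rendering backend from the GL extension string.

    Mirrors the preference Carmack describes: the vendor-specific path
    where it is faster than the generic one (NV30), and the generic ARB2
    path where quality wins (R300). Illustrative sketch only.
    """
    exts = set(extensions.split())
    if "GL_NV_fragment_program" in exts:
        # NV30 path: mixes precisions (fixed point / FP16 / FP32) for speed.
        return "NV30"
    if "GL_ARB_fragment_program" in exts:
        # ARB2 path: single generic fragment-program path, full precision.
        return "ARB2"
    if "GL_ATI_fragment_shader" in exts:
        # R200 path: slightly faster than ARB2 on R300, but lower quality.
        return "R200"
    return "NV20"  # fall back to register-combiner era hardware

# An R300 exposes ARB_fragment_program but not the NV extension:
print(select_backend("GL_ARB_fragment_program GL_ATI_fragment_shader"))  # ARB2
```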
Joe DeFuria said:And just a few more comments:
I don't understand this statement of Carmack's, my emphasis added:
The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed.
That seems contradictory to me. Why / how is it that when nVidia runs fragment programs, "they are at a higher precision than ATI's"...when at the same time nVidia offers both a lower and a higher precision mode?
Doesn't make sense to me.
Read page 4 of the thread again. It should be a temporary problem. In fact, the overall performance of the ARB path should improve substantially with more optimized drivers.
Sounds like the NV30-specific path is more register-combiner oriented than fragment-program oriented.
Does the Register Combiner path allow for FP?
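As I understand it, no: register combiners operate on clamped fixed-point values rather than floats. A toy sketch of the kind of quantization that implies (the 9-bit width is only an illustrative assumption; the real internal width varies by chip):

```python
def quantize_combiner(x, bits=9):
    """Clamp to [-1, 1] and round to a signed fixed-point grid.

    Register combiners are fixed-point, not floating-point; the exact
    internal width differs per chip, so the 9-bit default here is only
    an illustrative assumption.
    """
    x = max(-1.0, min(1.0, x))
    steps = 2 ** (bits - 1) - 1          # e.g. 255 positive steps
    return round(x * steps) / steps

print(quantize_combiner(0.123456))   # loses precision a float would keep
print(quantize_combiner(5.0))        # clamps to 1.0
```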
I think you're jumping the gun a little bit. Let the drivers mature somewhat and then re-evaluate.

antlers4 said:This question has been answered: B. In FP16 mode, the NV30 dispatches somewhat fewer shader instructions per cycle than the R300 does; in FP32 its speed is halved.
demalion said:The truth, AFAICS, is that Chalnoth has indeed consistently recognized 24-bit per component as sufficient for fragment processing, and has challenged the necessity of 32-bit per component. I don't recall a change in this when the nv30's 128-bit support was announced, but I do remember, and verified, this from when the R300's 96-bit was established. Confident that it is known I'm not afraid to criticize Chalnoth, I'll take this opportunity for laziness in posting a link and ask you to take my word for it, or search for "component" with his name for yourself.
What might be confusing this is two things:
1) He initially phrased his mention of the R300's 24-bit per component capability as a tradeoff required by the R300's 0.15 micron process, amongst a long tirade of other criticisms of the R300 (the power connector, and his statements of "disappointment" based on ATI's "without limits" phrasing).
2) He has tended to advocate 32-bit FP values being used for vertex processing.
Joe DeFuria said:On a related note, I'm very impressed with the R-300's ability to essentially maintain performance in floating point mode, relative to register combiner mode. (That means the FP mode is pretty well optimized...or of course it could mean the "register combiner mode" is very unoptimized.)
antlers4 said:An important fact has come out about NV30 FP shader performance. The question that's been debated for months is: A) is FP16 twice as fast as the R300 (with FP32 comparable), or B) is FP32 half the R300's speed (with FP16 comparable)? That is, the assumption was that one of those modes (FP16 or FP32) would perform comparably, per cycle, with the R300. This question has been answered: B. In FP16 mode, the NV30 dispatches somewhat fewer shader instructions per cycle than the R300 does; in FP32 its speed is halved.
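antlers4's point can be put into a back-of-the-envelope model. The per-clock issue rates below are hypothetical placeholders (only the 325 MHz R300 and 500 MHz NV30 Ultra core clocks are real figures); they just encode the shape of answer B: FP16 somewhat below the R300 per cycle, FP32 at half the FP16 rate.

```python
def shader_throughput(clock_mhz, instrs_per_clock):
    """Fragment-program instructions per second (in millions)."""
    return clock_mhz * instrs_per_clock

# Hypothetical per-clock issue rates chosen only to illustrate answer B;
# real rates depend on the instruction mix.
r300      = shader_throughput(325, 8.0)   # R300 at 325 MHz
nv30_fp16 = shader_throughput(500, 4.0)   # NV30 at 500 MHz, FP16 mode
nv30_fp32 = shader_throughput(500, 2.0)   # NV30 FP32: half the FP16 rate

print(f"R300:      {r300:6.0f} M instr/s")
print(f"NV30 FP16: {nv30_fp16:6.0f} M instr/s")
print(f"NV30 FP32: {nv30_fp32:6.0f} M instr/s")
```

Even with the NV30's clock advantage, the model shows why answer B is the bad one for nVidia: FP16 trails the R300 overall, and FP32 falls to half that again.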
For developers doing forward looking work, there is a different tradeoff -- the NV30 runs fragment programs much slower, but it has a huge maximum instruction count. I have bumped into program limits on the R300 already.
I'm really beginning to see why the R300 was such a successful product. They took the bold step of eliminating the fixed-function hardware of all their previous chip generations and running every operation in FP24. This clean, forward-looking design allowed them to achieve excellent performance on a mature process--their whole transistor budget was devoted to getting the FP24 path running fast enough to support everything. It's remarkable that their drivers were as good as they were at launch, considering how big a break this was with previous designs.
I think the reason for Cg might be a little clearer now.
For maximum performance out of the NV30, the developers must use the Nvidia extensions, requiring more work and taking up time. Most developers will not normally spend the time to do this.
The solution, of course, is to provide developers a way to seamlessly code for both the NV30 extensions and ARB_fragment_program. Cue Cg.
Joe DeFuria said:I agree 100%. Additionally, though, I believe this is not really a new strategy for ATI, so they have experience doing this sort of thing. IIRC, ATI's vertex shader on the R-200, for example, does all the fixed T&L pipeline work via "emulation." In contrast, I believe the NV2X retains the "fixed function block" of the NV1X.
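A sketch of what "emulating the fixed T&L pipeline" in a vertex program means in practice: the driver reproduces the fixed-function math programmatically. Everything below is an illustrative toy (just the transform step plus one directional diffuse light), not ATI's driver code.

```python
def mat_vec(m, v):
    """4x4 matrix times 4-vector (row-major lists)."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def fixed_function_vertex(pos, normal, mvp, light_dir, diffuse):
    """Emulated fixed-function T&L: transform plus one diffuse light."""
    clip_pos = mat_vec(mvp, pos)
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    color = [c * n_dot_l for c in diffuse]
    return clip_pos, color

identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
pos, color = fixed_function_vertex(
    pos=[1.0, 2.0, 3.0, 1.0],
    normal=[0.0, 0.0, 1.0],
    mvp=identity,
    light_dir=[0.0, 0.0, 1.0],
    diffuse=[1.0, 0.5, 0.25],
)
print(pos, color)   # identity transform, light facing the normal
```

The point of the R-200 (and R300) approach is that once everything is a program, the "fixed function" case is just one more generated program, and the transistors saved can go into making the programmable path fast.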
According to Carmack, nVidia has stated that performance enhancements are coming....so why should developers worry about NV30 extensions at all?