Carmack's comments on NV30 vs R300, DOOM developments

Running at full precision, do any of you think the NV30 will ever perform at the level of the R300 (ARB2)? Do you think its fp32 performance is stuck the way it is for good?
 
Joe DeFuria said:
And just a few more comments:

I don't understand this statement of Carmack's, my emphasis added:

The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed.

That seems contradictory to me. Why / how is it that when nVidia runs fragment programs, "they are at a higher precision than ATI's"...when at the same time nVidia offers both a lower and a higher precision mode :?:

Doesn't make sense to me.

The current NV30 cards do have some other disadvantages: They take up two slots, and when the cooling fan fires up they are VERY LOUD. I'm not usually one to care about fan noise, but the NV30 does annoy me.

Given the "environment of terror" that Doom-III is supposed to have, I think the noise of the NV30 is a significant drawback for the consumer...

The reason is that, using the ARB path, the NV30's fragment processor stays in 32 bit per component mode while the R300's processor stays in 24 bit per component mode. It is only when one uses the NV30-specific fragment path that some calculations are shifted to 16 bit per component mode, making the NV30 faster overall.
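To make the precision split concrete, here is a minimal sketch (my own, not anything from Doom) of the two interfaces being discussed. ARB_fragment_program carries no per-instruction precision information, while NV_fragment_program lets every instruction pick a precision via an R (FP32), H (FP16), or X (12-bit fixed point) suffix:

```
/* ARB path: one precision for everything. On the NV30 this runs at
   FP32 per component; the R300 runs the same program at FP24. */
static const char *arb_fp =
    "!!ARBfp1.0\n"
    "TEMP t;\n"
    "TEX t, fragment.texcoord[0], texture[0], 2D;\n"
    "MUL result.color, t, fragment.color;\n"
    "END\n";

/* NV30 path: NV_fragment_program suffixes select precision per
   instruction -- R = FP32, H = FP16, X = 12-bit fixed point. */
static const char *nv_fp =
    "!!FP1.0\n"
    "TEX H0, f[TEX0], TEX0, 2D;\n"   /* sampled into an FP16 temporary */
    "MULX H0, H0, f[COL0];\n"        /* fixed-point multiply           */
    "MOVR o[COLR], H0;\n"            /* full-precision move to output  */
    "END\n";
```

So an ARB program gives the driver no per-instruction way to be asked for lower precision, which is presumably why the NV30-specific path is needed to reach the faster modes.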

Not much to say about the noise, except that hopefully it will change in two or three months.
 
mboeller said:
The R200 path has a slight speed advantage over the ARB2 path on the R300, but only by a small margin, so it defaults to using the ARB2 path for the quality improvements. The NV30 runs the ARB2 path MUCH slower than the NV30 path. Half the speed at the moment. This is unfortunate, because when you do an exact, apples-to-apples comparison using exactly the same API, the R300 looks twice as fast, but when you use the vendor-specific paths, the NV30 wins.

Vendor-specific means, to me, fast fixed point. He does not say that the NV30 uses FP here.

The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed. Nvidia assures me that there is a lot of room for improving the fragment program performance with improved driver compiler technology.

He does not say that the fast path is the FP16 path, only that it is one of the three NV30 modes, so it could very well be the fixed-point path too. With regard to the NV30's different FP modes, he does not mention how fast they are compared to the R300.

If it were only one of the three, he would have said so. The NV30 path probably uses a mix of all three modes.
 
The reason is that, using the ARB path, the NV30's fragment processor stays in 32 bit per component mode while the R300's processor stays in 24 bit per component mode. It is only when one uses the NV30-specific fragment path that some calculations are shifted to 16 bit per component mode, making the NV30 faster overall.

Is that somehow a limitation of how the ARB extension interacts with nVidia hardware, or something that can be changed in future nVidia drivers? In short, is nVidia's "ARB" fragment path always going to be limited to fp32?

That seems like a pretty significant limitation, considering the FX's performance in fp32. That would virtually guarantee that any OpenGL app wanting good performance with floating-point support on the NV30 is going to have to code to the nVidia-specific extensions.
 
Joe DeFuria said:
The reason is that, using the ARB path, the NV30's fragment processor stays in 32 bit per component mode while the R300's processor stays in 24 bit per component mode. It is only when one uses the NV30-specific fragment path that some calculations are shifted to 16 bit per component mode, making the NV30 faster overall.

Is that somehow a limitation of how the ARB extension interacts with nVidia hardware, or something that can be changed in future nVidia drivers? In short, is nVidia's "ARB" fragment path always going to be limited to fp32?

That seems like a pretty significant limitation, considering the FX's performance in fp32. That would virtually guarantee that any OpenGL app wanting good performance with floating-point support on the NV30 is going to have to code to the nVidia-specific extensions.

Read page 4 of the thread again. It should be a temporary problem. In fact, the overall performance of the ARB path should improve substantially with more optimized drivers.
 
Joe DeFuria said:
And just a few more comments:

I don't understand this statement of Carmack's, my emphasis added:

The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed.

That seems contradictory to me. Why / how is it that when nVidia runs fragment programs, "they are at a higher precision than ATI's"...when at the same time nVidia offers both a lower and a higher precision mode :?:

Doesn't make sense to me.


When Carmack talks about "fragment programs" he means the ARB2 path, which runs FP32 on Nvidia and FP24 (internally) on the R300. It sounds like the NV30-specific path is more register-combiner oriented than fragment-program oriented.
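For anyone who hasn't touched them: "register-combiner oriented" means fixed-point math configured through NV_register_combiners state calls rather than a downloadable program. A rough sketch of the style (my own example of a simple texture-times-color modulate; extension entry points would be fetched via the usual mechanism, omitted here):

```
/* One general combiner stage: spare0 = texture0 * primary color */
void setup_modulate_combiner(void)
{
    glEnable(GL_REGISTER_COMBINERS_NV);
    glCombinerParameteriNV(GL_NUM_GENERAL_COMBINERS_NV, 1);
    glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV,
                      GL_TEXTURE0_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_B_NV,
                      GL_PRIMARY_COLOR_NV, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB,
                       GL_SPARE0_NV, GL_DISCARD_NV, GL_DISCARD_NV,
                       GL_NONE, GL_NONE, GL_FALSE, GL_FALSE, GL_FALSE);

    /* Final combiner computes A*B + (1-A)*C + D; zero A and C and
       route spare0 through D so it goes straight to the output. */
    glFinalCombinerInputNV(GL_VARIABLE_A_NV, GL_ZERO,
                           GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glFinalCombinerInputNV(GL_VARIABLE_C_NV, GL_ZERO,
                           GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glFinalCombinerInputNV(GL_VARIABLE_D_NV, GL_SPARE0_NV,
                           GL_UNSIGNED_IDENTITY_NV, GL_RGB);
}
```

Every operation is a fixed slot in a fixed pipeline rather than an instruction stream, which is where both the speed and the inflexibility come from.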
 
Read page 4 of the thread again. It should be a temporary problem. In fact, the overall performance of the ARB path should improve substantially with more optimized drivers.

Should improve based on what premise, because nVidia said so?

What specifically is the temporary problem: that the NV30 ARB code-path is limited to fp32 for now, but will offer other modes in the future, or that fp32 performance is lacking, and that fp32 performance should increase?
 
It sounds like running the ARB2 path on the NV30 is only an academic exercise. Even if it is forced to FP16, it will still be slower than the NV30-specific path. I don't think the NV30-specific path in Doom 3 will ever go away (unless the card and its follow-ons are such a flop that there is no commercial reason to continue to support it :))

The ARB2 path in Doom 3 is of interest to those with R300s and any not-yet-announced cards which might be able to run it at full precision and acceptable speed. At this point, ARB2 is the custom path for R300 support; the ARB2 standard is close to the optimum for R300s.

An important fact has come out about NV30 FP shader performance. The question that has been debated for months is: A) is FP16 twice as fast as the R300, or B) is FP32 half the speed of the R300? That is, with the assumption that one of those modes (FP16 or FP32) would perform comparably, per cycle, with the R300. This question has been answered: B. In FP16 mode, the NV30 dispatches somewhat fewer shader instructions per cycle than the R300 does; in FP32 mode its speed is halved.
 
Does the Register Combiner path allow for FP?

That's what I would like to know.

It sounds to me like the NV30 path is basically the "nVidia version" of the R-200 path. The NV30 path is the only nVidia-specific path that allows for rendering in a "single pass".

This should say a lot to everyone about the flexibility of the R-200 pipelines, relative to the GeForce4 / NV20 path. (Nothing about performance, of course.)

People shouldn't forget that Doom3's rendering quality is targeted to be excellent without needing floating point. The big "win" for image quality in Doom3 is being able to render in a single pass. This is why NV30 quality should be a notch better than NV20...not because of the increased precision available, but because fewer passes are taken.
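To illustrate the single-pass point with a sketch (hypothetical helpers, not Doom's actual code): on a less flexible path, each extra term of the lighting equation that doesn't fit becomes another blended pass over the same geometry, and every intermediate result is quantized to the 8-bit framebuffer in between:

```
void bindPassState(int pass);   /* hypothetical: textures/combiners */
void drawSurface(void);         /* hypothetical: submit geometry    */

/* Hypothetical multipass accumulation over one surface. */
void draw_lit_surface(int numPasses)
{
    glDepthFunc(GL_EQUAL);        /* reuse depth from an earlier fill pass */
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);  /* add each term into the framebuffer    */
    for (int i = 0; i < numPasses; ++i) {
        bindPassState(i);         /* state for this term of the equation   */
        drawSurface();            /* the same geometry, submitted again    */
    }
}
```

A single-pass path evaluates the whole expression inside the fragment pipeline at full internal precision and writes the framebuffer once, which is where the quality win comes from.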

On a related note, I'm very impressed with the R-300's ability to essentially maintain performance in floating point mode, relative to register combiner mode. (That means that the FP mode is pretty well optimized...or of course it could mean the 'register combiner mode' is very unoptimized. ;))

I predicted something a while back, and I'll restate it now: because R-300/NV-30 pipeline flexibility is much closer to the R-200's than to the NV2x's, I believe ATI will have fewer driver headaches "upgrading" its drivers to support the new architecture. I think we may be seeing evidence of that here.
 
antlers4 said:
This question has been answered: B. In FP16 mode, the NV30 dispatches somewhat fewer shader instructions per cycle than the R300 does; in FP32 mode its speed is halved.
I think you're jumping the gun a little bit. Let the drivers mature somewhat and then re-evaluate.
 
demalion said:
The truth, AFAICS, is that Chalnoth has indeed consistently recognized 24-bit per component as sufficient for fragment processing, and has challenged the necessity for 32-bit per component. I don't recall a change in this when the nv30's 128-bit support was announced, but I do remember, and verified, this from when the R300's 96-bit was established. Confident that it is known that I'm not afraid to criticize Chalnoth, I'll take this opportunity for laziness in posting a link and ask you to take my word for it, or search for "component" with his name for yourself. :LOL:

Two things might be confusing this:

1) He initially framed his mention of the R300's 24-bit per component capability as a tradeoff required by its 0.15 micron process, amongst a long tirade of other criticisms of the R300 (the power connector, and his statements of "disappointment" based on ATI's "without limits" phrasing).

2) He has tended to advocate 32-bit FP values being used for vertex processing.

You are right.
Chalnoth, I misremembered, sorry.
 
Joe DeFuria said:
On a related note, I'm very impressed with the R-300's ability to essentially maintain performance in floating point mode, relative to register combiner mode. (That means that the FP mode is pretty well optimized...or of course it could mean the 'register combiner mode' is very unoptimized. ;))

This is not surprising, since it's been said again and again that all the functionality on the R300 gets converted to pixel shaders and goes through the FP pipeline. There really isn't an opportunity for fixed-function to be that much faster.

I'm really beginning to see why the R300 was such a successful product. They took the bold step of eliminating the fixed-functionality on all their previous chip generations and running all their operations with FP24. This clean, forward-looking design allowed them to achieve excellent performance on a mature process--their whole transistor budget was devoted to getting their FP24 path running fast enough to support everything. It's remarkable that their drivers were as good as they were at launch, considering how big a break this was with previous designs. (note: part of this success was deciding that FP24 was "good enough" for everything in their intended market; they couldn't have kept their transistor budget and speed goals at FP32)

It looks like in the NV30, on the other hand, the FP units are bolted onto a core that is a souped-up NV25. They didn't have the confidence that their FP units could run things fast enough to give a big enough performance delta over their existing chips--and guess what, the FP units aren't fast enough. They seem to see the FP units as more for specialized, "cinematic" rendering. This might come in handy for the CGI market, but is kind of useless for consumer gaming. The final result is a big chip that has to be clocked at 500 MHz to beat the R300, and so runs hot and power-hungry even at 0.13 micron.

I wonder if in the NV35 (or the NV31 or NV34) they will follow ATI's lead and run everything through the FP path.

[edited to add last parenthetical note in second paragraph]
 
antlers4 said:
An important fact has come out about NV30 FP shader performance. The question that has been debated for months is: A) is FP16 twice as fast as the R300, or B) is FP32 half the speed of the R300? That is, with the assumption that one of those modes (FP16 or FP32) would perform comparably, per cycle, with the R300. This question has been answered: B. In FP16 mode, the NV30 dispatches somewhat fewer shader instructions per cycle than the R300 does; in FP32 mode its speed is halved.

I noticed this too, and was very surprised. I thought that maybe I was interpreting this incorrectly, but then JC also said:
For developers doing forward looking work, there is a different tradeoff -- the NV30 runs fragment programs much slower, but it has a huge maximum instruction count. I have bumped into program limits on the R300 already.

If the NV30 runs fragment programs much slower (I really didn't expect this), I have to question the real-time usefulness of its huge maximum instruction count, especially when you can do things like multipassing, and you can only use 16 textures across all those instructions.
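For reference, the program limits Carmack says he bumped into are queryable per driver through ARB_fragment_program. A minimal sketch (on Windows the glGetProgramivARB entry point has to be fetched through the extension mechanism first):

```
#include <stdio.h>
#include <GL/gl.h>
#include <GL/glext.h>

/* Print the driver's native fragment program instruction limits. */
void print_fragment_program_limits(void)
{
    GLint total, alu, tex;
    glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                      GL_MAX_PROGRAM_NATIVE_INSTRUCTIONS_ARB, &total);
    glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                      GL_MAX_PROGRAM_NATIVE_ALU_INSTRUCTIONS_ARB, &alu);
    glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                      GL_MAX_PROGRAM_NATIVE_TEX_INSTRUCTIONS_ARB, &tex);
    printf("native limits: %d instructions (%d ALU, %d TEX)\n",
           total, alu, tex);
}
```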
 
I'm really beginning to see why the R300 was such a successful product. They took the bold step of eliminating the fixed-functionality on all their previous chip generations and running all their operations with FP24. This clean, forward-looking design allowed them to achieve excellent performance on a mature process--their whole transistor budget was devoted to getting their FP24 path running fast enough to support everything. It's remarkable that their drivers were as good as they were at launch, considering how big a break this was with previous designs.

I agree 100%. Additionally though, I believe this is not really a new strategy for ATI, so they have experience doing this sort of thing. IIRC, ATI's vertex shader on the R-200, for example, does all the fixed T&L pipeline work via "emulation." In contrast, I believe the NV2X retains the "fixed function block" of the NV1X.

It's not only impressive that they managed to achieve the compatibility they have, but also impressive that they seem to at least maintain the performance of the "emulated" features on the new hardware.
 
I think the reason for Cg might be a little clearer now.

Both NV30 and R300 can use ARB_fragment_program. As such, it's 'obvious' for developers to code to that extension. However, the NV30 is unable to compete with the speed of the R300 when using that extension.

For maximum performance out of the NV30, the developers must use the Nvidia extensions, requiring more work and taking up time. Most developers will not normally spend the time to do this.

The solution of course is to provide developers a way to seamlessly code for both the NV30 extensions and ARB_fragment_program. Cue Cg.
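To spell out the "seamless" part, here is a sketch of the Cg runtime usage ("light.cg" and the profile choice are placeholder assumptions, not anything shipped): the same source file gets compiled at load time to whichever fragment profile the hardware supports.

```
#include <Cg/cg.h>
#include <Cg/cgGL.h>

/* One Cg source, two back ends: prefer the NV30-specific profile
   when present, fall back to the generic ARB one otherwise. */
void load_shader(void)
{
    CGcontext ctx  = cgCreateContext();
    CGprofile prof = cgGLIsProfileSupported(CG_PROFILE_FP30)
                   ? CG_PROFILE_FP30      /* NV_fragment_program back end  */
                   : CG_PROFILE_ARBFP1;   /* ARB_fragment_program back end */
    CGprogram prog = cgCreateProgramFromFile(ctx, CG_SOURCE, "light.cg",
                                             prof, "main", NULL);
    cgGLLoadProgram(prog);
    cgGLEnableProfile(prof);
    cgGLBindProgram(prog);
}
```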

-Colourless
 
Hmm... while informative, I'm shaking my head (like you guys here) at the new questions raised by JC's .plan.

Will shoot him an email and hope he replies (dang, hope he remembers "Reverend"!).
 
I think the reason for Cg might be a little clearer now.

That all depends on how good the Cg compiler is at generating ARB code vs. nVidia extension code.

The other way to look at it is that this makes the reason for non-nVidia companies not to support Cg all the more clear.

For maximum performance out of the NV30, the developers must use the Nvidia extensions, requiring more work and taking up time. Most developers will not normally spend the time to do this.

I agree.

The solution of course is to provide developers a way to seamlessly code for both the NV30 extensions and ARB_fragment_program. Cue Cg.

The best solution is for nVidia to just increase the performance of its ARB_fragment_program path and forget NV30 extensions entirely. According to Carmack, nVidia has stated that performance enhancements are coming...so why should developers worry about NV30 extensions at all?
 
Joe DeFuria said:
I agree 100%. Additionally though, I believe this is not really a new strategy for ATI, so they have experience doing this sort of thing. IIRC, ATI's vertex shader on the R-200, for example, does all the fixed T&L pipeline work via "emulation." In contrast, I believe the NV2X retains the "fixed function block" of the NV1X.

Is that right, Joe? I could have sworn it was the other way around, i.e. the R200 does have an FF TCL unit and it was one of the things they cut with the RV250.

MuFu.
 
According to Carmack, nVidia has stated that performance enhancements are coming...so why should developers worry about NV30 extensions at all?

Which raises the additional question: why is CARMACK supporting NV30 extensions at all? The only reasonable explanation, IMO, is that he is not confident that nVidia's ARB2 path performance will be increased to the point where it is competitive. At least, not by the time Doom is released...
 