Carmack's comments on NV30 vs R300, DOOM developments

mboeller said:
Uhhh... SORRY

but where does Carmack say that the NV30 path uses FP at all?

I read his comments completely differently. To me it looks like the NV30 path is nothing more than the NV30-specific fixed-function path. Only in this mode was the NV30 faster than the R300 in ShaderMark.

These are the exact words of Carmack:

The R200 path has a slight speed advantage over the ARB2 path on the R300, but
only by a small margin, so it defaults to using the ARB2 path for the quality
improvements. The NV30 runs the ARB2 path MUCH slower than the NV30 path.
Half the speed at the moment. This is unfortunate, because when you do an
exact, apples-to-apples comparison using exactly the same API, the R300 looks
twice as fast, but when you use the vendor-specific paths, the NV30 wins.

The reason for this is that ATI does everything at high precision all the
time, while Nvidia internally supports three different precisions with
different performances. To make it even more complicated, the exact
precision that ATI uses is in between the floating point precisions offered by
Nvidia, so when Nvidia runs fragment programs, they are at a higher precision
than ATI's, which is some justification for the slower speed. Nvidia assures
me that there is a lot of room for improving the fragment program performance
with improved driver compiler technology.
 
mboeller said:
Uhhh... SORRY

but where does Carmack say that the NV30 path uses FP at all?

I read his comments completely differently. To me it looks like the NV30 path is nothing more than the NV30-specific fixed-function path. Only in this mode was the NV30 faster than the R300 in ShaderMark.

It would not make much sense to do that.
 
It's quite likely that the NV30 path will be dropped from the final release altogether, if both vendors optimize their ARB_fragment_program implementations sufficiently. Note that he states even the R200 path had a slight speed advantage over the ARB2 path on the R300.
So there should be enough room for optimization on _both_ of the latest-and-greatest cards. Could ARB_precision_hint_fastest be used on the NV30 to make ARB_fragment_program execute in 16-bit mode, or would that violate some clause in the extension spec? (From the spec: ".. the maximum representable magnitude of colors must be at least 2^10, while the maximum representable magnitude of other floating-point values must be at least 2^32.")
I actually find it unfortunate that the R200, NV10 and NV20 paths obviously must remain, as there is no way to produce the full-featured final image with ARB extensions alone on the target chips.
I wonder what path the original Radeon will default to?

All in all, regardless of how the final implementation works, I foresee endless debates over whether the NV30 or the R300 is faster/better, and no apples-to-apples comparison can ever be made, as one uses either 16-bit or 32-bit FP and the other 24-bit. It will probably remain the case that 16 > 24 > 32 speed-wise.
 
The R200 path has a slight speed advantage over the ARB2 path on the R300, but only by a small margin, so it defaults to using the ARB2 path for the quality improvements. The NV30 runs the ARB2 path MUCH slower than the NV30 path. Half the speed at the moment. This is unfortunate, because when you do an exact, apples-to-apples comparison using exactly the same API, the R300 looks twice as fast, but when you use the vendor-specific paths, the NV30 wins.

Vendor-specific means fast fixed point to me. He does not say that the NV30 uses FP here.

The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed. Nvidia assures me that there is a lot of room for improving the fragment program performance with improved driver compiler technology.

He does not mention that the fast path is the FP16 path, only that it is one of the three NV30 precisions. So it could very well be the fixed-point path too. With regard to the different FP modes of the NV30, he does not mention how fast they are compared to the R300.
 
Interesting, but the more interesting to me is
The newly-ratified ARB_vertex_buffer_object extension will probably let me do
the same thing for NV_vertex_array_range and ATI_vertex_array_object.

Where is this extension?! It's not in the OpenGL registry yet, and neither are the latest ARB meeting notes available :(
 
Another interesting point Carmack mentions, which has been discussed before, is the use of floating point framebuffers.

The future is in floating point framebuffers. One of the most noticeable things this will get you without fundamental algorithm changes is the ability to use a correct display gamma ramp without destroying the dark color precision.

Then he goes on to mention the problem with it today:

Unfortunately, using a floating point framebuffer on the current generation of cards is pretty difficult, because no blending operations are supported, and the primary thing we need to do is add light contributions together in the framebuffer. The workaround is to copy the part of the framebuffer you are going to reference to a texture, and have your fragment program explicitly add that texture, instead of having the separate blend unit do it. This is intrusive enough that I probably won't hack up the current codebase, instead playing around on a forked version.

Anyway, does he presume unlimited amounts of memory bandwidth? :!:
 
Another thing that leaves me a little in the dark is Carmack's mention of NV30 running fragment programs much slower in ARB_fragment_program.

The NV30 runs the ARB2 path MUCH slower than the NV30 path.
Half the speed at the moment.

I strongly presume that's because ARB_fragment_program defaults to using 32-bit FP over 16-bit FP, but I'm not up on ARB_fragment_program, and I wonder why Carmack doesn't force it to use 16-bit FP. Anyone?
 
LeStoffer said:
Another thing that leaves me a little in the dark is Carmack's mention of NV30 running fragment programs much slower in ARB_fragment_program.

The NV30 runs the ARB2 path MUCH slower than the NV30 path.
Half the speed at the moment.

I strongly presume that's because ARB_fragment_program defaults to using 32-bit FP over 16-bit FP, but I'm not up on ARB_fragment_program, and I wonder why Carmack doesn't force it to use 16-bit FP. Anyone?

Yeah, I thought about that too. The answer is probably that he can't force 16-bit FP; that's why he uses Nvidia's own extension.
 
Vendor-specific means fast fixed point to me. He does not say that the NV30 uses FP here.

Why? The Nvidia extension exposes the floating point functionality, so I think it makes sense to use it. He gets good performance plus good quality.
 
Wasn't it said in the article that there is room for improvement in the NV30's ARB2 performance? Technically speaking, the NV30 has about as many FPUs per pipeline as the R300 and can issue one op per clock like the R300. The performance delta between them should not be as great as it is now. It seems the NV30's software does not compile ARB2 programs nearly as well as it could. It may be that Nvidia worked on their proprietary functionality first and only now is turning to ARB2. Nonetheless, the R300's performance impresses me more each day, while the NV30 remains half asleep.
 
Ingenu said:
Interesting, but the more interesting to me is
The newly-ratified ARB_vertex_buffer_object extension will probably let me do
the same thing for NV_vertex_array_range and ATI_vertex_array_object.

Where is this extension?! It's not in the OpenGL registry yet, and neither are the latest ARB meeting notes available :(

That's what caught my eye too.
This is great news. I figured it was going to be finalized at the latest ARB meeting. Still waiting for the ARB meeting notes, though. Can't wait for specs and drivers :)
 
tEd said:
LeStoffer said:
Another thing that leaves me a little in the dark is Carmack's mention of NV30 running fragment programs much slower in ARB_fragment_program.

The NV30 runs the ARB2 path MUCH slower than the NV30 path.
Half the speed at the moment.

I strongly presume that's because ARB_fragment_program defaults to using 32-bit FP over 16-bit FP, but I'm not up on ARB_fragment_program, and I wonder why Carmack doesn't force it to use 16-bit FP. Anyone?

Yeah, I thought about that too. The answer is probably that he can't force 16-bit FP; that's why he uses Nvidia's own extension.

From the GL_ARB_fragment_program spec:

(22) Should we provide applications with a method to control the
level of precision used to carry out fragment program computations?

RESOLVED: Yes. The GL implementation ultimately has control over
the level of precision used for fragment program computations.
However, the "ARB_precision_hint_fastest" and
"ARB_precision_hint_nicest" program options allow applications to
guide the GL implementation in its precision selection. The
"fastest" option encourages the GL to minimize execution time,
with possibly reduced precision. The "nicest" option encourages
the GL to maximize precision, with possibly increased execution
time.

If the precision hint is not "fastest", GL implementations should
perform low-precision operations only if they could not
appreciably affect the final results of the program. Regardless
of the precision hint, GL implementations are discouraged from
reducing the precision of computations so aggressively that final
rendering results could be seriously compromised due to overflow
of intermediate values or insufficient number of mantissa bits.

Some implementations may provide only a single level of precision,
in which case these hints may have no effect. However, all
implementations will accept these options, even if they are
silently ignored.

More explicit control of precision, such as provided in "C" with
data types such as "short", "int", "float", "double", may also be
a desirable feature, but this level of detail is left to a
separate extension.

So the ARB path can hint which precision it wants, and I can almost guarantee that using ARB_precision_hint_fastest will use FP16 on the GFFX.
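For illustration, the hint quoted from the spec is written as a program option right after the header. This minimal fragment program (a made-up example, not from DOOM) just modulates a texture by the interpolated color, but asks the driver for the fastest precision:

```
!!ARBfp1.0
OPTION ARB_precision_hint_fastest;
# Modulate texture 0 by the interpolated primary color.
TEMP texel;
TEX texel, fragment.texcoord[0], texture[0], 2D;
MUL result.color, texel, fragment.color;
END
```

Per the spec text above, the driver is still free to ignore the hint, which would explain why forcing FP16 this way is not guaranteed to work.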
 
From first page:

MfA: framebuffer contents as input to pixel shaders
I've also wondered about that. Maybe the multisampling problems are the reason we haven't seen it. But I've wondered about it since before multisampling entered the home 3D scene, so that can't be the whole reason.

Chalnoth: "...the fact that Hierarchical Z is disabled on the R300 when the stencil buffer is not cleared with the z-buffer..."
Where did you get this "fact" from? It's not in the R300 docs for DX9 optimizations. It looks like there are other problems, however.
 
Chalnoth said:
Still, I'd like to know if the fact that Hierarchical Z is disabled on the R300 when the stencil buffer is not cleared with the z-buffer is hurting its performance, or whether JC has found a way around this limitation.

A lot of nVidia optimization docs say to always clear the stencil buffer together with the Z-buffer - otherwise the clear will be very slow.

So I wonder why you'd want to do something that hurts performance on nVidia and ATI cards as well...
 
Hyp-X said:
Chalnoth said:
Still, I'd like to know if the fact that Hierarchical Z is disabled on the R300 when the stencil buffer is not cleared with the z-buffer is hurting its performance, or whether JC has found a way around this limitation.

A lot of nVidia optimization docs say to always clear the stencil buffer together with the Z-buffer - otherwise the clear will be very slow.

So I wonder why you'd want to do something that hurts performance on nVidia and ATI cards as well...

The shadow algorithm needs the stencil buffer cleared between lights (you can skip the clear if the shadow volumes don't cross, but that's just an optimization), but you don't want the z-buffer cleared, as you'd then have to initialize it again with the values you already had there before the clear...
 
jpaana:
Yes, that's true. So you need to clear the stencil buffer separately, and that's slower than doing it together with Z. The same goes for ATI and nVidia. I can't see any mention of it disabling Hierarchical Z.

Now, Carmack's reverse...
 
Basic,

the R300 developer documentation has this to say on the issue of Hyper Z and stencil operations:

By far the most important part of HYPER Z is the hierarchical depth testing. This technique allows culling of large pixel blocks very efficiently based on the hierarchical view of the depth buffer that is stored on the chip. Unlike previous chips, RADEON 9500/9700 performs multiple hierarchical depth tests at the various places in the pipeline, making this technique even more effective.

There are a couple of rules that have to be followed to reap the benefits of the Hierarchical Z.
First, and the most important, do not change sense of the depth comparison function in the course of a frame rendering. That is if using D3DCMP_LESS depth function, do not change it to D3DCMP_GREATER for some part of a frame.
Second, D3DCMP_EQUAL and D3DCMP_NOTEQUAL depth comparison functions are not very compatible with Hierarchical Z operation, so avoid them if possible or replace them with other depth comparisons such as D3DCMP_LESSEQUAL.
In addition, few other things interfere with hierarchical culling; these are - outputting depth values from pixel shaders and using stencil fail and stencil depth fail operations.
Last but not least, for the highest Hierarchical Z efficiency place near and far clipping planes to enclose scene geometry as tightly as possible, and of course render everything front to back.

I asked sireric about it and he told me the following:

The docs aren't quite exact. They mean that Hierarchical Z is of no use in some cases. Basically, when Z test is set to "equal" or "not equal" or when pure stencil test is done (as I said, there's no stencil Hierarchical Z), then HZ doesn't help.

Anyway, we seem to be doing fine with HZ in Doom3.
 
Lots of different "sub-topics" going on in this thread, so I may as well just post my basic impressions of the .plan file.

I was expecting a clear performance victory for the NV30 in non AA Doom-III rendering. It doesn't appear to be the case. In fact, it appears that NV30 is pretty much on par with R-300. Very hard to judge from the "qualitative" nature of Carmack's comments, but it seems to me that the R-300 will have a slight edge on quality, and the NV30 a slight edge on performance. Again, I assume this is in NON AA / non aniso performance.

That's disappointing for the NV30, especially considering the *COUGH* Doom-III "benchmarks" released by nVidia during the FX launch that showed GeForce FX being what, twice as fast? :rolleyes:

This brings me to the topic of AA. IIRC, Carmack has stated that he really "doesn't care" about AA/aniso. That is, he builds his rendering engines and tests for performance and rendering correctness without AA/aniso in mind; it's up to the hardware to be able to handle those things. It's not that he doesn't believe AA/aniso is beneficial, of course. It's just that from the developer's perspective, his concern is non-AA/aniso performance and quality, and then if the hardware can handle enabling such features, great. If not, it's not his problem.

So sadly, he typically doesn't comment on performance with such hardware features enabled. This .plan is no exception.

Of course, WE are very interested in the performance implications of AA / aniso with Doom3. So we are left without one very important piece of the puzzle. It would be nice to know if he expects EITHER of these cards to be able to handle Doom3 with any type of AA / aniso, or if one card seems to do a better job with it or not, or if there tend to be bugs, etc.

All along I had anticipated that NV30 would be a performance leader in Doom3 without AA/aniso, but R-300 would catch up or be the leader with AA/aniso. (Simply based on the relative pixel fill rate and bandwidth of each card.) As it turns out, thus far, the NV30 has a slight advantage without AA/aniso, and I'm left wondering about AA/Aniso performance.
 
Dave:
Yes, I've read that. And that's the reason for what I said to Chalnoth. Or rather, they don't give any more reasons for it not working, and that's the reason. What sireric said is a nice confirmation of what I thought (that there is no hierarchical stencil), thanks for that info.

It would be interesting to know what happens if you change the depth comparison function, though. Will it disable HierZ for the rest of the frame, or will it be used as much as possible? It might be possible to re-enable it by switching back to the original depth comparison function.

Btw, the comment on "Carmack's reverse" was about the stencil filling passes. "Carmack's reverse" means that you do a stencil operation for every hidden pixel, and an architecture built to efficiently throw away hidden pixels doesn't help much there.
HierZ should work in the other passes.
 
And just a few more comments:

I don't understand this statement of Carmack's (my emphasis added):

The reason for this is that ATI does everything at high precision all the time, while Nvidia internally supports three different precisions with different performances. To make it even more complicated, the exact precision that ATI uses is in between the floating point precisions offered by Nvidia, so when Nvidia runs fragment programs, they are at a higher precision than ATI's, which is some justification for the slower speed.

That seems contradictory to me. Why/how is it that when nVidia runs fragment programs, "they are at a higher precision than ATI's", when at the same time nVidia offers both a lower and a higher precision mode? :?:

Doesn't make sense to me.

The current NV30 cards do have some other disadvantages: they take up two slots, and when the cooling fan spins up they are VERY LOUD. I'm not usually one to care about fan noise, but the NV30 does annoy me.

Given the "environment of terror" that Doom-III is supposed to have, I think the noise of the NV30 is a significant drawback for the consumer...
 