Carmack's comments on NV30 vs R300, DOOM developments

RussSchultz said:
Whether that means it will improve to where FP32 vector ops on the arb2 path get 1 cycle execution or not is unknown. As far as we know, the compiler needs to re-order something to prevent a pipeline stall, and that could be enough to double performance. VLIW processors can be tricky beasts, and writing compilers for them is not trivial. (DSP C compilers generally suck, for example.)

I could believe this more easily if the problem was just in Doom III. ShaderMark is showing poor performance for DX9 shaders across the board, including some relatively trivial ones. Yet the performance differences from shader to shader match up with those from the 9700, so it looks like the shaders are doing the right amount of "work". I find it hard to believe that NVidia would come up with a driver that would guarantee worst-case instruction scheduling, no matter what shader you threw at it :)
 
All the details have been covered elsewhere. Let me try to cover them all in one place.

First, let me note that ShaderMark is DirectX, not OpenGL. My analysis of the ShaderMark performance is based on the fact that the performance is LESS than half of R300 performance...to me this directly indicates an opportunity for driver efficiency optimization (i.e., relatively trivial given the gross performance deficiency, and likely fixed very soon).

The following comments relate to OpenGL specifically.

---

Carmack: GFFX:NV30 fragment path -> slightly ahead of R300:ARB2 fragment path, with R300 sometimes leading.

Note that this is the path least likely to benefit from driver optimizations: the one based on the extensions nVidia specified, which are likely to map closely to hardware functionality. This indicates what seems like a reasonable performance ceiling for a fully optimized precision and functionality implementation of the "ARB2" path.

Carmack: GFFX:ARB2 fragment path -> half the speed of GFFX:NV30 fragment path, with Carmack specifically attributing the performance deficit to the higher precision of the ARB2 path.

This indicates where "ARB2" path performance is now with regards to fragment shading.

So, why think progress from ARB2->NV30 for the NV30 might be possible?

Mentioned elsewhere in the thread: ARB2 has a precision hint specification allowing a program to request either "maximum precision" or "maximum performance" (nicest/fastest), and it is up to the drivers to take the "maximum performance" hint and effectively decide where precision can be sacrificed.
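For concreteness, the hint is expressed as a program option in ARB_fragment_program source. A minimal sketch (the shader body here is purely illustrative, not from Doom III):

```
!!ARBfp1.0
# Ask the driver to trade precision for speed where it sees fit;
# the alternative is ARB_precision_hint_nicest.
OPTION ARB_precision_hint_fastest;
TEMP diffuse;
TEX diffuse, fragment.texcoord[0], texture[0], 2D;
MUL result.color, diffuse, fragment.color;
END
```

On NV30-class hardware, a driver honoring the "fastest" hint could in principle run such a program at FP16 rather than FP32.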

But what kind of optimizations might be used for the ARB2 path with the "fastest" hint?

Interview 1 (http://www.beyond3d.com/previews/nvidia/nv30launch/index.php?p=2) said:
There was talk that FP16 (64-bit floating point rendering) could run twice the speed of FP32 (128-bit floating point rendering), is that the case?

Yes it is.

...

I assume that anything available currently using the 32-bit format will be run in FP16 mode?

Actually, no. We have native support for 32-bit integer, which is how we get the performance on the older apps. If we were to run them as FP16 then they wouldn't run as fast. So we have dedicated hardware with native support for 32-bit per pixel integer, 64-bit per pixel floating and 128-bit per pixel floating.


Also, another factor in the "NV30" code path's performance: one that might not currently be expressible through the "ARB2" code path, but that might provide opportunities for future performance enhancement, depending on how many assumptions nVidia can safely make:

R300 versus NV30 on paper (http://www.beyond3d.com/articles/nv30r300/index.php?p=9) said:
Here we can find that register combiner unit is still available, even in fragment program mode, because it is commonly used and provides a powerful blending model. For example, it allows for four operands, fast 1-x operations, separate operations on color and alpha components, and more. These operations could be performed by fragment programs, but would require multiple instructions and program parameter constants. Supporting both methods simultaneously allows a programmer to write a program to obtain texture colors and then use the combiners to obtain a final fragment color.

As such, there are two different types of fragment programs: one "normal" and one for combiners. For combiner programs, the texture colors 0 through 3 are taken from texture output registers 0 through 3, respectively. The other combiner registers are not modified in fragment program mode.

Can we tell where the performance will end up?

Nope, or at least I can't. Likely Carmack could, but he chose not to speculate and quoted nVidia's assurances instead. I can only guess that the ceiling is "NV30" fragment shading path performance.

...

Though the last quote is "paper" analysis, I think overall it can be seen that, in regards to the "ARB2" path, there is very good reason to believe there is room for optimization based on floating point precision handling in future drivers. It also seems safe to assume that the performance of the "NV30" path is a very good indication of the ceiling such enhancement would offer, that the guessing game nVidia might play to achieve it is not likely to match it absolutely in the general case, and that as long as the "NV30" path is there, game-specific optimization of the "ARB2" path seems a waste of time.

I hope providing substantiation can end the comparisons of these discussions to Shakespearean analysis ;) ....I tried to pick statements that are direct, easy to understand, and informative, with little speculation left. :LOL:

I also hope this is presented clearly enough so as to not cloud the issue. :-?
 
demalion said:
All the details have been covered elsewhere. Let me try to cover them all in one place.
...

Though the last quote is "paper" analysis, I think overall it can be seen that, in regards to the "ARB2" path, there is very good reason to believe there is room for optimization based on floating point precision handling in future drivers. It also seems safe to assume that the performance of the "NV30" path is a very good indication of the ceiling such enhancement would offer, that the guessing game nVidia might play to achieve it is not likely to match it absolutely in the general case, and that as long as the "NV30" path is there, game-specific optimization of the "ARB2" path seems a waste of time.

I hope providing substantiation can end the comparisons of these discussions to Shakespearean analysis ;) ....I tried to pick statements that are direct, easy to understand, and informative, with little speculation left. :LOL:

I also hope this is presented clearly enough so as to not cloud the issue. :-?

Very well stated, and it appears we've probably identified the performance ceiling (the "NV30" path). To attain it, I believe it would be necessary to "hint" 16-bit instead of 32-bit--thus the 32-bit performance hit would not be in effect, with the ramification that 32-bit precision is not actually being delivered. Great for benchmark PR.
 
All I have to say is I have full confidence that the R400 will be the best card for Doom 3 at the time of the R400's release. I will also say that at the time of the R350's release, it too will be the best card for Doom 3. Am I right or am I wrong?
 
Hellbinder[CE] said:
The Nv30 path is FASTER than the ARB2 path because it's running in FP16 mode..

Or did you somehow miss all that..

I got the impression that the NV30 path was using the register combiners and was using ints (12-bit per channel?). When Carmack was describing the rendering paths, he says "floating point fragment shaders, minor quality improvements, always single pass" when talking about ARB2, but only said "full featured, single pass" when describing the NV30 path, similar to what he said about the R200 path.

Does anyone know if the NV30 path in Doom3 uses integers? Do the NV30 register combiners allow floating-point math?
 
Mintmaster said:
I got the impression that the NV30 path was using the register combiners and was using ints (12-bit per channel?). When Carmack was describing the rendering paths, he says "floating point fragment shaders, minor quality improvements, always single pass" when talking about ARB2, but only said "full featured, single pass" when describing the NV30 path, similar to what he said about the R200 path.

I was under the same impression. From the NV30 vs R300 article (or maybe another one) I was under the impression that even 16-bit floating point was way slower than the integer path..
 
Let's assume that the GFFX takes 2 cycles to execute an op using FP32 precision while the R300 takes 1 cycle to execute in FP24. We would expect pixel shaders in FP32 to run at 1/2 the speed of FP24 on the R300. However, the GFFX runs at 1.5x the clockspeed of the R300, so we would expect the GFFX shaders to run at 75% speed of the R300 shader.
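As a quick back-of-envelope check of that 75% figure (using 325 vs. 500 MHz for the clocks, and the assumed 2:1 cycle count for FP32 vs. FP24):

```python
# Hypothetical per-op cycle counts from the paragraph above.
r300_clock, gffx_clock = 325e6, 500e6          # Hz: 9700 Pro vs. GFFX
r300_cycles_per_op, gffx_cycles_per_op = 1, 2  # FP24 vs. FP32 (assumed)

r300_ops_per_sec = r300_clock / r300_cycles_per_op
gffx_ops_per_sec = gffx_clock / gffx_cycles_per_op
print(gffx_ops_per_sec / r300_ops_per_sec)  # ~0.77, i.e. roughly 75%
```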

There can be only a few explanations for this discrepancy (per Carmack's comments):

A: NVidia spent most of their time working on the NV30 fragment shader extension first, and only recently started working on the ARB extension. Thus, the optimizer in the ARB extension is less mature.

B: the GFFX isn't really running at 500Mhz, but about the same rate as the R300.

C: 4-component FP32 instructions don't run at 1/2 speed of FP24, but even slower. Does anyone think NVidia only included 1 FP16 unit per pipeline? (3-4 clocks for a 96-128bit op). What are those 120M transistors for?

D: Bandwidth bottleneck (texture ops stalling pipeline)

E: Nvidia's Hyper-Z-style occlusion optimizations being less effective when running Carmack's shaders, leading to wasted computations?


F: instruction decode/dispatch in NV30's "more general" architecture can't keep functional units fed fully (doubtful given the lack of branches)


There are too many possibilities to conclude at this time that the speed differential is a fundamental HW limitation or a driver problem. However, given Nvidia's historical increases in driver performance, chances are there is at least some gain to be had in the ARB2 extension. We can test this theory very simply by running FP32 shaders using the NV30 fragment shader.

If using the NV30 fragment shader leads to a 50% drop in FP32 mode, then we can conclude one of two things: 1) FP32 ops run at 1/2 the speed of the R300, or 2) the NV30 fragment shader path is also immature, and NVidia spent more of their time optimizing the 12- and 16-bit code paths, possibly because they know these are the fastest and will make the most impressive PR demos.

In any case, given the early state of the drivers, there's too many unknowns to judge.
 
C: 4-component FP32 instructions don't run at 1/2 speed of FP24, but even slower. Does anyone think NVidia only included 1 FP16 unit per pipeline? (3-4 clocks for a 96-128bit op). What are those 120M transistors for?

It's looking more and more likely that they have just wasted space on the inclusion of a specific int pipeline alongside the FP pipeline, whereas ATI just took the route of doing everything over the FP pipeline. I'd always doubted that Geoff Ballew's response to my question was actually saying what we thought it was, but it's increasingly looking like it really is, and this seems to me to be incredibly wasteful. Someone in this thread (or another) praised ATI for doing everything over the FP pipeline and dumping the integer, but does this really merit praise? It just seems like plain old common sense to me, and nothing that hasn't been done before.

If it's really the case that NV30 has two separate paths per pipeline, then I'm starting to wonder about the rest of the NV3x chips as well.
 
I'm wondering if some of the lower-end NV3x cores will dump the fixed-function integer pipeline and do everything in the programmable FP (like ATi).
 
Well, it's a nice simplification of your core design to just do everything at one precision, but I'm not sure I agree that it is the best thing, or merits praise, since it is, in fact, the simpler/easier thing to do.

Imagine if the TNT only ran in 32-bit color, and they simplified their design by automatically converting all requests for 16-bit into 32-bit. Even if such a card could run a 32-bit framebuffer at the same speed as other cards @ 16-bit, you are wasting potential performance by using too much precision for what is being requested. The application developer should be in control of what precision is used.

As a programmer, I make the decision all the time as to whether I want to use bytes, shorts, longs, floats, and doubles. I make these decisions based on the precision I need and the performance I want to extract. If I know that I only need integer precision, or 16-bit FP precision, I would expect to be able to get some performance benefit.

Ideally, you could execute 128 bits' worth of FP ops per cycle per pipeline, in parallel with 32 bits of integer work, a texture fetch, and a texture address calculation. You could split that 128 bits of FP work into either 1 op at full precision or 2 ops at half precision.

It looks like the NV30 is a poor implementation of this idea. I expect that the R400 will probably remove most of the pixel shader resource limits, be PS3.0 compliant, and support the DX9 partial precision hint by allowing instructions to run at half precision, but 2x the speed. That is, I expect the R400 to do what the NV30 is doing (or attempting to do), and that the R300 took the fixed precision approach to simplify their design and get to market sooner.

I think ATI made the right decision, and the result is they were able to ship their card quickly this generation, while shader execution throughput with long shaders isn't an issue yet. However, I bet in the future, they will spend more time upgrading their pixel pipeline to be more flexible with respect to allocation of work amongst their functional units.
 
nutball said:
I'm wondering if some of the lower-end NV3x cores will dump the fixed-function integer pipeline and do everything in the programmable FP (like ATi).
I've asked myself (and others) the same question <a href="http://www.beyond3d.com/forum/viewtopic.php?p=71491&highlight=#71491">here</a>, but I got no answer (and I assume there will be none for quite some time).

It will be very interesting to see how the pipelines of the rest of the NV3x chips are arranged. Of course, it would be helpful to know exactly what is happening on the NV30 in this regard.
Waiting for beyond3d review of GeforceFX. :)
 
I'm wondering if some of the lower-end NV3x cores will dump the fixed-function integer pipeline and do everything in the programmable FP (like ATi).
Perhaps, although it would seem strange to take such a performance hit on the DX8 games NV31/34 will actually be well-suited for playing, and move everything to FP for the DX9 games which won't be out until they are obsolete. OTOH as DaveB implies the transistor waste would be truly painful for such mainstream/budget parts. :wince: (Suitable smilie requested!)

Of course, if the FP16/int12 pipeline could be sped up to match the register-combiner int8 pipeline, all would be well, but as there is apparently some reason why that is not the case with NV30, presumably it won't be fixed until at least NV35/36.

If then. :oops:
 
DaveBaumann said:
It's looking more and more likely that they have just wasted space on the inclusion of a specific int pipeline alongside the FP pipeline, whereas ATI just took the route of doing everything over the FP pipeline. I'd always doubted that Geoff Ballew's response to my question was actually saying what we thought it was, but it's increasingly looking like it really is, and this seems to me to be incredibly wasteful.

Yes, this has been my concern ever since we saw the NV30 OpenGL specs a couple of months ago, with those old register combiners staying put alongside the fragment processor. The point was that we already knew at that time that ATI had managed to make an R300 FP pipeline that works very well with plain int.

Anyway, I think that nVidia went this route because they needed the FP32 for professional use, and right there they probably lost the option of going with 'one' pipeline.
 
depth_test said:
That is, I expect the R400 to do what the NV30 is doing (or attempting to do), and that the R300 took the fixed precision approach to simplify their design and get to market sooner.

Your initial point was along the lines of "what's all that transistor space doing if it's not doing 1 128-bit op per cycle". I think it's beginning to look like a fair amount of space is wasted on the integer pipeline, and the FP pipeline is optimized to do one FP16 op per cycle. The performance of the ARB path, the previous comments (2 instructions per cycle), and their reluctance to answer the question all gradually point to this. If the R400 does support a greater-than-FP24 pipe (which I'm not sold on yet), then I would expect it to be optimized for one 128-bit instruction per cycle.

nutball said:
I'm wondering if some of the lower-end NV3x cores will dump the fixed-function integer pipeline and do everything in the programmable FP (like ATi).

Unfortunately, this would go against NVIDIA's prior methods for making lower end parts.
 
Imagine if the TNT only ran in 32-bit color, and they simplified their design by automatically converting all requests for 16-bit into 32-bit. Even if such a card could run a 32-bit framebuffer at the same speed as other cards @ 16-bit, you are wasting potential performance by using too much precision for what is being requested.

No you're not: instead you "wasted" the extra transistors by providing a native 32-bit pipeline. Except that you're not actually wasting transistors, as you needed one anyway; instead you're actually saving transistors by not implementing a separate, pure 16-bit pipeline. (N.B. you still gain performance using 16-bit on a 32-bit pipeline because of the lower bandwidth costs...but the same goes for the R300 with int4/int8/FP24.) If you could manage to do twice the work in the same time by using a 2:1 packed 16-bit format in your 32-bit pipeline, now *that* would be worth doing! (And presumably NV30 does this with FP16 vs. FP32 shaders.)

As a programmer, I make the decision all the time as to whether I want to use bytes, shorts, longs, floats, and doubles.

Well, the only one of those data types with smaller granularity than the CPU's execution units is a byte. And, BTW, the only reason to use bytes is to get a smaller memory footprint; on a modern CPU, addressing all your data as bytes will pretty severely cost performance due to penalties for unaligned memory access. (This is said to be a primary cause of poor P4 performance on "legacy" code.)
 
Dave H said:
instead you're actually saving transistors by not implementing a seperate, pure 16-bit pipeline. (N.B. you still save performance using 16-bit on a 32-bit pipeline because of the lower bandwidth costs...but same for R300 with int4/int8/FP24.) If you could manage to do twice the work in the same time by using a 2:1 packed 16-bit format in your 32-bit pipeline, now *that* would be worth doing! (And presumably NV30 does this with FP16 vs. FP32 shaders.)
And that is exactly the point. Lower precision operations should run faster in many circumstances. If an application only requires half the precision, you should be able to allocate the transistors that aren't needed to some other task.

As a programmer, I make the decision all the time as to whether I want to use bytes, shorts, longs, floats, and doubles.

Well, the only one of those data types with smaller granularity than the CPU's execution units is a byte. And, BTW, the only reason to use bytes is to get a smaller memory footprint; on a modern CPU, addressing all your data as bytes will pretty severely cost performance due to penalties for unaligned memory access. (This is said to be a primary cause of poor P4 performance on "legacy" code.)


Not true. Both MMX and SSE2 can run lower precision ops at faster speed. It is true that float and double on some CPUs run at the same speed, but it is not true in general. It is also true on many CPUs that the functional units are one size (e.g. 32-bit), and that scalar byte ops won't run faster. However, once we leave the realm of scalar processing and start looking at vector ops and ILP, the situation is different.

Likewise, even at fixed precision, if I request a dp3 instead of a dp4, the extra unused functional unit should be available for reuse. Or, if I do an operation with a destination mask, like add r0.w, r1, r2, I am only using 1 FP unit; the other three should be reusable.
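The SIMD point can be made concrete with simple lane arithmetic (the 128-bit width matches an SSE2 XMM register; this is a sketch of the packing argument, not a benchmark):

```python
# A fixed-width SIMD register processes more elements per instruction
# at lower precision, so peak throughput scales with lane count.
register_bits = 128  # e.g. one SSE2 XMM register

for element_bits in (8, 16, 32, 64):
    lanes = register_bits // element_bits
    print(f"{element_bits}-bit elements: {lanes} per instruction")
# 16-bit elements get 8 lanes vs. 4 for 32-bit: 2x peak throughput,
# the same packing argument made for FP16 vs. FP32 in NV30's pipeline.
```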
 
I'm wondering if the following is possible... I'm just inventing numbers, but I'm trying to get to a point where the performance numbers make sense.

The GFFX seems to have 32 FP calculators for the PS.
As far as we know, the R300 got 8 pipelines, each with one FP24 calculator...

So, let's imagine the GFFX needs 2 cycles for an FP32 op and has dedicated integer support. That might mean the GFFX has an "integer" calculator in each pipeline, so it can do:
8x integer ops/cycle, 32x 16-bit floating point ops/cycle.

That would seem too good to be true. So let's add in another element. The R300 is supposed to be able to do up to 3 instructions at the same time in optimal cases. That would indicate 3 FP24 operations/cycle maximum.

So...
The R9700P FP24 power would be: 3*325*8 = 7800
The GFFX FP16 power would be: 32*500 = 16000
The GFFX FP32 power would be: 32*500/2 = 8000

That would still make the GFFX godly... And if it was so good, Doom 3 performance would also be much better.
Another explanation is that the GFFX needs 2 cycles for FP16 and 4 cycles for FP32, and has dedicated integer hardware which can do one op per cycle, but with only 8 calculators. That would give us...

R9700P FP24: 7800
GFFX FP16 + INT: 16000/2 + 8*500 = 8000+4000 = 12000
GFFX FP32 + INT: 8000/2 + 8*500 = 4000+4000 = 8000

That would mean that if you're doing everything in FP32, performance is 4000. If you do everything in FP16, performance is 8000. If you use the integer pipeline at the same time, it's 8000 and 12000 respectively.

This would indeed give 50% of the R9700P if using FP32 all the time. Now let's see what it would give us if we use 65% FP16 and 35% FP32, with integer usable in parallel--although not at all times, because sometimes it isn't useful. So let's imagine 50% of the integer capacity can be used too.

35% of FP32 = 1400
65% of FP16 = 5200
50% of Int = 2000
1400+5200+2000 = 8600 -> slightly ahead of the Radeon 9700 Pro's 7800, and faster still if more of the integer pipeline is used.

Odd calculations and lame assumptions. It doesn't seem very logical, but at least the final numbers make sense...
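The second model above can be sketched numerically as follows. Every constant here (32 FP units, 8 int units, the cycle counts, the workload mix) is an assumption from the post, not confirmed hardware data; note that strict arithmetic gives 0.35 x 4000 = 1400 for the FP32 share:

```python
# Toy throughput model: 2 cycles/FP16 op, 4 cycles/FP32 op,
# plus a separate 8-wide, 1-cycle integer path (all assumed).
CLOCK = 500       # GFFX core clock, MHz
FP_UNITS = 32     # assumed FP calculators
INT_UNITS = 8     # assumed dedicated integer calculators

fp32_rate = FP_UNITS * CLOCK / 4   # -> 4000
fp16_rate = FP_UNITS * CLOCK / 2   # -> 8000
int_rate = INT_UNITS * CLOCK       # -> 4000

# Mixed workload: 35% FP32, 65% FP16, half the int capacity usable.
mixed = 0.35 * fp32_rate + 0.65 * fp16_rate + 0.5 * int_rate
print(fp32_rate, fp16_rate, int_rate, mixed)  # roughly 4000 / 8000 / 4000 / 8600
```

Against the R9700P's 3*325*8 = 7800 figure from earlier in the post, the mixed result lands in the same ballpark, which is the point of the exercise.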


Uttar
 
Uttar, you are using the wrong numbers: the fragment shading processor does not include the texture addressing unit and the texture interpolator. The 3 ops per clock number you gave covers all three of these units, but the fragment color processor is what the NV30/R300 use for actual shading arithmetic (this is where the NV30 gains its FP16/FP32 flexibility), so we must isolate it from the rest of the architecture. Both processors can indeed do a texture interpolation and a texture address per clock (the NV30 can do 2 address ops, according to Digit-Life), aside from the color fragment op.

The color fragment op is considered to be issued/executed at 1/clock on both the R300 and NV30 architectures (at the precision they're most comfortable with ;)). Because it is given in a VLIW format (a packed way of issuing more than one instruction in one longer instruction word), the work of the 128-bit precision RGBA op is divided amongst 4 32-bit units as scalar/vector ops. Sireric already explained the fragment shader pipeline specifics (for a single fragment shading pipeline) of the R300 for me (wish Nvidia would do the same; Ati does not shy away). Here is the skinny:

"There are 4 FMAD units, three reserved for vector units, 1 for scalar units. However, the scalar unit can kick in to give you 4 vector ops (dot4). Now, beyond the fmad, the scalar unit has a bunch of other units, including all the exotic functions (inv,log,exp, etc...), which can operate in parallel with the MAD. Those don't share the FMAD since it could not meet our timing requirements mixed with lut's. So, a simplified MAD was merged in to perform table lookups."

It is in this thread, bottom of 1st page:
http://www.beyond3d.com/forum/viewtopic.php?t=3042&highlight=sireric

We can thus conclude that the R300 also has 32 PS units, albeit with seemingly special abilities. The units, aside from being able to issue a 4-component vector op per cycle, can also execute a complex FP op (exp, rcp, log, etc.) at full 24-bit precision.
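The 32-unit figure follows directly from sireric's numbers (a trivial check; the per-pipe breakdown is from his quote above):

```python
# Per sireric: each R300 fragment pipe has 3 vector FMADs plus 1 scalar
# FMAD (which can join the vector units to complete a dot4).
PIPES = 8
FMADS_PER_PIPE = 3 + 1  # vector + scalar
print(PIPES * FMADS_PER_PIPE)  # 32 FMAD units chip-wide
```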

If you follow along to this thread, http://www.beyond3d.com/forum/viewtopic.php?t=4064 , there is speculation as to the nature of NV30's fragment processing arrangement. According to my findings, I made a conclusion towards the end of the thread which may indicate why the NV30 differs a little from the R300 in architecture arrangement.
 