Carmack's comments on NV30 vs R300, DOOM developments

Discussion in 'Architecture and Products' started by boobs, Jan 30, 2003.

  1. antlers

    Regular

    Joined:
    Aug 14, 2002
    Messages:
    457
    Likes Received:
    0
    I could believe this more easily if the problem was just in Doom III. ShaderMark is showing poor performance for DX9 shaders across the board, including some relatively trivial ones. Yet the performance differences from shader to shader match up with those from the 9700, so it looks like the shaders are doing the right amount of "work". I find it hard to believe that NVidia would come up with a driver that would guarantee worst-case instruction scheduling, no matter what shader you threw at it :)
     
  2. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    All the details have been covered elsewhere. Let me try to cover them all in one place.

    First, let me note that ShaderMark is DirectX, not OpenGL. My analysis of the ShaderMark performance would be based on the fact that the performance is LESS than half of R300 performance... to me this directly indicates an opportunity for driver efficiency optimization (i.e., one that is relatively trivial given the gross performance deficiency, and likely to be fixed very soon).

    The following comments relate to OpenGL specifically.

    ---

    Carmack: GFFX:NV30 fragment path -> slightly ahead of R300:ARB2 fragment path, with R300 sometimes leading.

    Note that this is the path least likely to benefit from driver optimizations: the one based on the extensions nVidia specified, which are likely to map closely to hardware functionality. This indicates what seems like a reasonable performance ceiling for a fully optimized precision and functionality implementation of the "ARB2" path.

    Carmack: GFFX:ARB2 fragment path -> half the speed of GFFX:NV30 fragment path, with Carmack specifically attributing the performance problem to the higher precision used in the ARB2 path.

    This indicates where "ARB2" path performance is now with regards to fragment shading.

    So, why think progress from ARB2->NV30 for the NV30 might be possible?

    Mentioned elsewhere in the thread: ARB2 has a precision hint specification that allows requesting either "maximum precision" or "maximum performance" (nicest/fastest), and it is up to the drivers to take the "maximum performance" hint and effectively decide where precision can be sacrificed.

    But what kind of optimizations might be used for the ARB2 path with the "fastest" hint?


    Also, there is another factor in the "NV30" code path's performance that might not be expressible in the "ARB2" code path at present, and that might provide opportunities for future performance enhancement depending on how many assumptions nVidia can safely make:

    Can we tell where the performance will end up?

    Nope, or at least I can't. Likely Carmack could, but he chose not to speculate and instead quoted nVidia's assurances. I can only guess that the ceiling is the "NV30" fragment shading path's performance.

    ...

    Though the last quote is "paper" analysis, I think overall it can be seen that in regards to the "ARB2" path there is very good reason to believe there is room for optimization based on floating point precision handling in future drivers. It also seems safe to assume that the performance of the "NV30" path seems to be a very good indication of the ceiling such enhancement would offer, and that the guessing game nVidia might play to achieve that is not likely to absolutely match it in the general case (and as long as the "NV30" path is there, game specific optimization for the "ARB2" path seems a waste of time).

    I hope providing substantiation can end the comparisons of these discussions to Shakespearean analysis ;) ... I tried to pick statements that are direct, easy to understand, and informative, with little speculation left. :lol:

    I also hope this is presented clearly enough so as to not cloud the issue. :-?
     
  3. Hellbinder

    Banned

    Joined:
    Feb 8, 2002
    Messages:
    1,444
    Likes Received:
    12
    The NV30 path is FASTER than the ARB2 path because it's running in FP16 mode..

    Or did you somehow miss all that..
     
  4. Windfire

    Regular

    Joined:
    Feb 16, 2002
    Messages:
    353
    Likes Received:
    1
    Location:
    Seattle, WA
    Very well stated. It appears we've probably identified the performance ceiling (the "NV30" path), and that to attain it, it would be necessary to "hint" 16-bit instead of 32-bit precision, so the 32-bit performance hit no longer applies, with the ramification that 32-bit precision is not actually being used. Great for benchmark PR.
     
  5. jvd

    jvd
    Banned

    Joined:
    Feb 13, 2002
    Messages:
    12,724
    Likes Received:
    9
    Location:
    new jersey
    All I have to say is that I have full confidence the R400 will be the best card for Doom 3 at the time of the R400's release. I will also say that at the time of the R350's release, it too will be the best card for Doom 3. Am I right or am I wrong?
     
  6. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I got the impression that the NV30 path was using the register combiners and was using ints (12-bit per channel?). When Carmack was describing the rendering paths, he says "floating point fragment shaders, minor quality improvements, always single pass" when talking about ARB2, but only said "full featured, single pass" when describing the NV30 path, similar to what he said about the R200 path.

    Does anyone know if the NV30 path in Doom3 uses integers? Do the NV30 register combiners allow floating-point math?
     
  7. NocturnDragon

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    393
    Likes Received:
    17
    I was under the same impression. From the NV30 vs R300 article (or maybe another one) I was under the impression that even 16-bit floating point was way slower than the integer path..
     
  8. depth_test

    Newcomer

    Joined:
    Feb 1, 2003
    Messages:
    14
    Likes Received:
    0
    Let's assume that the GFFX takes 2 cycles to execute an op at FP32 precision while the R300 takes 1 cycle at FP24. We would then expect FP32 pixel shaders on the GFFX to run at 1/2 the speed of FP24 shaders on the R300. However, the GFFX runs at 1.5x the clock speed of the R300, so we would expect the GFFX shaders to run at about 75% of the speed of the R300 shaders.
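
A quick sketch of that back-of-envelope expectation (the 2-cycle FP32 cost is this post's assumption; the clocks are the shipping 500 MHz GFFX and 325 MHz R300):

```python
# Assumed figures: FP32 costs the GFFX 2 cycles per op vs 1 cycle of
# FP24 on the R300, and the GFFX clock is roughly 1.5x the R300's.

GFFX_CLOCK = 500   # MHz
R300_CLOCK = 325   # MHz

per_clock_ratio = 1 / 2                  # FP32 op takes twice the cycles
clock_ratio = GFFX_CLOCK / R300_CLOCK    # ~1.54x

expected_relative_speed = per_clock_ratio * clock_ratio
print(f"{expected_relative_speed:.0%}")  # roughly the 75% figure above
```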

    There can be only a few explanations for this discrepancy (per Carmack's comments):

    A: NVidia spent most of their time working on the NV30 fragment shader extension first, and only recently started working on the ARB extension. Thus, the optimizer in the ARB extension is less mature.

    B: the GFFX isn't really running its shaders at 500MHz, but at about the same rate as the R300.

    C: 4-component FP32 instructions don't run at 1/2 speed of FP24, but even slower. Does anyone think NVidia only included 1 FP16 unit per pipeline? (3-4 clocks for a 96-128bit op). What are those 120M transistors for?

    D: Bandwidth bottleneck (texture ops stalling pipeline)

    E: NVidia's HyperZ-equivalent (occlusion culling) is less effective when running Carmack's shaders, leading to wasted computations?


    F: instruction decode/dispatch in NV30's "more general" architecture can't keep functional units fed fully (doubtful given the lack of branches)


    There are too many possibilities to conclude at this time whether the speed differential is a fundamental HW limitation or a driver problem. However, given NVidia's historical increases in driver performance, chances are there is at least some gain to be had in the ARB2 extension. We can test this theory very simply by running FP32 shaders using the NV30 fragment shader extension.

    If using the NV30 fragment shader also leads to a 50% drop in FP32 mode, then we can conclude one of two things: 1) FP32 ops run at 1/2 the speed of the R300's FP24 ops, or 2) the NV30 fragment shader extension is immature as well, and NVidia spent more of their time optimizing the 12- and 16-bit code paths, possibly because they know these are the fastest and will make the most impressive PR demos.

    In any case, given the early state of the drivers, there are too many unknowns to judge.
     
  9. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    It's looking more and more likely that they have just wasted space on the inclusion of a specific int pipeline alongside the FP pipeline, whereas ATI just took the route of doing everything over the FP pipeline. I'd always doubted that Geoff Ballew's response to my question was actually saying what we thought it was, but it's increasingly looking like it really is, and this seems to me to be incredibly wasteful. Someone in this thread (or another) praised ATI for doing everything over the FP pipeline and dumping the integer path, but does this really merit praise? It just seems like plain old common sense to me, and nothing that hasn't been done before.

    If it's really the case that NV30 has two separate paths per pipeline, then I'm starting to wonder about the rest of the NV3x chips as well.
     
  10. nutball

    Veteran Subscriber

    Joined:
    Jan 10, 2003
    Messages:
    2,492
    Likes Received:
    979
    Location:
    en.gb.uk
    I'm wondering if some of the lower-end NV3x cores will dump the fixed-function integer pipeline and do everything in the programmable FP (like ATi).
     
  11. depth_test

    Newcomer

    Joined:
    Feb 1, 2003
    Messages:
    14
    Likes Received:
    0
    Well, it's a nice simplification of your core design to just do everything at one precision, but I'm not sure I agree that it is the best thing, or merits praise, since it is, in fact, the simpler/easier thing to do.

    Imagine if the TNT only ran in 32-bit color, and they simplified their design by automatically converting all requests for 16-bit into 32-bit. Even if such a card could run a 32-bit framebuffer at the same speed as other cards @ 16-bit, you are wasting potential performance by using too much precision for what is being requested. The application developer should be in control of what precision is used.

    As a programmer, I make decisions all the time about whether I want to use bytes, shorts, longs, floats, or doubles. I make these decisions based on the precision I need and the performance I want to extract. If I know that I only need integer precision, or 16-bit FP precision, I would expect to be able to get some performance benefit.

    Ideally, you could execute 128 bits' worth of FP ops per cycle per pipeline, in parallel with 32 bits of integer work, a texture fetch, and a texture address calculation. You could split up that 128 bits of FP work into either 1 op at full precision or 2 ops at half precision.
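
That split could be sketched like this (the 128-bit datapath width and the 2:1 FP16 packing are this post's hypothesis, not a confirmed spec):

```python
# Toy model: a 128-bit-wide FP unit per pipe can retire one 4xFP32 op
# per cycle, or two 4xFP16 ops packed side by side.

DATAPATH_BITS = 128

def vec4_ops_per_cycle(component_bits):
    """4-component ops of the given precision that fit in the datapath."""
    return DATAPATH_BITS // (4 * component_bits)

print(vec4_ops_per_cycle(32))  # 1 full-precision (FP32) vec4 op per cycle
print(vec4_ops_per_cycle(16))  # 2 half-precision (FP16) vec4 ops per cycle
```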

    It looks like the NV30 is a poor implementation of this idea. I expect that the R400 will probably remove most of the pixel shader resource limits, be PS3.0 compliant, and support the DX9 partial precision hint by allowing instructions to run at half precision, but 2x the speed. That is, I expect the R400 to do what the NV30 is doing (or attempting to do), and that the R300 took the fixed precision approach to simplify their design and get to market sooner.

    I think ATI made the right decision, and the result is they were able to ship their card quickly this generation, while shader execution throughput with long shaders isn't an issue yet. However, I bet in the future, they will spend more time upgrading their pixel pipeline to be more flexible with respect to allocation of work amongst their functional units.
     
  12. mr

    mr
    Newcomer

    Joined:
    Oct 7, 2002
    Messages:
    143
    Likes Received:
    0
    I've asked myself (and others) the same question <a href="http://www.beyond3d.com/forum/viewtopic.php?p=71491&highlight=#71491">here</a>, but I got no answer (and I assume there will be none for quite some time).

    It will be very interesting to see how the pipelines of the rest of the NV3x chips are arranged. Of course, it would be helpful to know exactly what is happening on the NV30 in this regard.
    Waiting for the Beyond3D review of the GeForceFX. :)
     
  13. Dave H

    Regular

    Joined:
    Jan 21, 2003
    Messages:
    564
    Likes Received:
    0
    Perhaps, although it would seem strange to take such a performance hit on the DX8 games NV31/34 will actually be well-suited for playing, and move everything to FP for the DX9 games which won't be out until they are obsolete. OTOH as DaveB implies the transistor waste would be truly painful for such mainstream/budget parts. :wince: (Suitable smilie requested!)

    Of course, if the FP16/int12 pipeline could be sped up to match the register-combiner int8 pipeline, all would be well, but as there is apparently some reason why that is not the case with NV30, presumably it won't be fixed until at least NV35/36.

    If then. :shock:
     
  14. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,262
    Likes Received:
    22
    Location:
    Land of the 25% VAT
    Yes, this has been my concern ever since we saw the NV30 OpenGL specs a couple of months ago, with those old register combiners staying put alongside the fragment processor. The point was that we already knew at that time that ATI had managed to make an R300 FP pipeline that works very well with plain int.

    Anyway, I think that nVidia went this route because they needed the FP32 for professional use, and right there they probably lost the option of going with 'one' pipeline.
     
  15. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Your initial point was along the lines of "what's all that transistor space doing if it's not doing 1 128-bit op per cycle". I think it's beginning to look like a fair amount of space is wasted on the integer pipeline and the FP pipeline is optimized to do one FP16 op per cycle. The performance of the ARB path, the previous comments (2 instructions per cycle) and their reluctance to answer the question all gradually point to this. If R400 does support a greater-than-FP24 pipe (which I'm not sold on yet) then I would expect it to be optimized for one 128-bit instruction per cycle.

    Unfortunately, this would go against NVIDIA’s prior methods for making lower end parts.
     
  16. Dave H

    Regular

    Joined:
    Jan 21, 2003
    Messages:
    564
    Likes Received:
    0
    No you're not: instead you "wasted" the extra transistors by providing a native 32-bit pipeline. Except that you're not actually wasting transistors, as you needed one anyway; instead you're actually saving transistors by not implementing a separate, pure 16-bit pipeline. (N.B. you still gain performance using 16-bit on a 32-bit pipeline because of the lower bandwidth costs... but the same goes for the R300 with int4/int8/FP24.) If you could manage to do twice the work in the same time by using a 2:1 packed 16-bit format in your 32-bit pipeline, now *that* would be worth doing! (And presumably NV30 does this with FP16 vs. FP32 shaders.)

    Well, the only one of those data types with smaller granularity than the CPU's execution units is a byte. And, BTW, the only reason to use bytes is to get a smaller memory footprint; on a modern CPU, addressing all your data as bytes will cost you pretty severely due to the penalties for unaligned memory access. (This is said to be a primary cause of poor P4 performance on "legacy" code.)
     
  17. Dave H

    Regular

    Joined:
    Jan 21, 2003
    Messages:
    564
    Likes Received:
    0
    Why?
     
  18. depth_test

    Newcomer

    Joined:
    Feb 1, 2003
    Messages:
    14
    Likes Received:
    0
    And that is exactly the point. Lower-precision operations should run faster in many circumstances. If an application only requires half the precision, you should be able to allocate the transistors that aren't needed to some other task.


    Not true. Both MMX and SSE2 can run lower-precision ops at a faster rate. It is true that float and double run at the same speed on some CPUs, but it is not true in general. It is also true that on many CPUs the functional units are one size (e.g., 32-bit), and that scalar byte ops won't run faster. However, once we leave the realm of scalar processing and start looking at vector ops and ILP, the situation is different.

    Likewise, even at fixed precision, if I request a dp3 instead of a dp4, the extra unused functional unit should be available for reuse. Or, if I do an operation with a destination mask, like add r0.w, r1, r2, I am only using 1 FP unit; the other three should be reusable.
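
As a toy model of that point (the 4-units-per-pipe figure is taken from the discussion above, not a confirmed spec):

```python
# With 4 FP units per pipe, instructions that touch fewer components
# leave units idle that a flexible scheduler could, in principle, hand
# to another instruction.

UNITS_PER_PIPE = 4

def idle_units(components_used):
    """FP units left over after an op using the given component count."""
    return UNITS_PER_PIPE - components_used

print(idle_units(4))  # dp4: all units busy, 0 free
print(idle_units(3))  # dp3: 1 unit free
print(idle_units(1))  # add r0.w, r1, r2: 3 units free
```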
     
  19. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I'm wondering if the following is possible... I'm just inventing numbers, but I'm trying to get to a point where the performance numbers make sense.

    The GFFX seems to have 32 FP calculators for the PS.
    As far as we know, the R300 got 8 pipelines, each with one FP24 calculator...

    So, let's imagine the GFFX needs 2 cycles for an FP32 op (1 cycle for FP16) and has dedicated integer support. That might mean the GFFX has an "integer" calculator in each pipeline, thus it can do:
    8 integer ops/cycle and 32 16-bit floating point ops/cycle.

    That would seem too utopian. So let's add another element. The R300 is supposed to be able to do up to 3 instructions at the same time in optimal cases. That would indicate a maximum of 3 FP24 operations per pipeline per cycle.

    So...
    The R9700P FP24 power would be: 3*325*8 = 7800
    The GFFX FP16 power would be: 32*500 = 16000
    The GFFX FP32 power would be: 32*500/2 = 8000

    That would still make the GFFX godly... And if it was so good, Doom 3 performance would also be much better.
    Another explanation is that the GFFX needs 2 cycles for FP16 and 4 cycles for FP32, and has dedicated integer hardware that can do an op in one cycle, but with only 8 calculators. So that would give us...

    R9700P FP24: 7800
    GFFX FP16 + INT: 16000/2 + 8*500 = 8000+4000 = 12000
    GFFX FP32 + INT: 8000/2 + 8*500 = 4000+4000 = 8000

    That would mean that if you're doing everything in FP32, performance is 4000. If you do everything in FP16, performance is 8000. If you use the integer pipeline at the same time, it's 8000 and 12000 respectively.

    This would indeed give 50% of the R9700P if using FP32 all the time. Now let's see what it would give us if we use 65% FP16 and 35% FP32. We can also use integer in parallel, although not at all times, because sometimes it isn't useful. So let's imagine 50% of the integer capacity can be used too.

    35% of FP32 = 800
    65% of FP16 = 5200
    50% of Int = 2000
    800+5200+2000 = 6000+2000 = 8000 -> pretty much on par with the Radeon 9700 Pro, but can be slightly faster if more of the integer pipeline is used.

    Odd calculations and lame assumptions. It doesn't seem very logical, but at least the final numbers make sense...
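
For what it's worth, the two scenarios can be reproduced as a quick sketch (all unit counts and cycle costs are guesses from this post, not confirmed specs):

```python
# Figures are in millions of ops per second; every unit count and
# cycle cost below is speculative.

R300_CLOCK = 325   # MHz (Radeon 9700 Pro)
NV30_CLOCK = 500   # MHz (GeForce FX)

def mops(units, clock_mhz, cycles_per_op=1):
    """Millions of ops/s for `units` calculators at `clock_mhz` MHz."""
    return units * clock_mhz // cycles_per_op

# Scenario 1: 32 FP units; FP16 in 1 cycle, FP32 in 2.
r300_fp24 = 3 * mops(8, R300_CLOCK)   # up to 3 co-issued ops per pipe
nv30_fp16 = mops(32, NV30_CLOCK)
nv30_fp32 = mops(32, NV30_CLOCK, cycles_per_op=2)
print(r300_fp24, nv30_fp16, nv30_fp32)   # 7800 16000 8000

# Scenario 2: FP16 takes 2 cycles, FP32 takes 4, plus 8 one-cycle
# integer calculators usable in parallel.
nv30_int = mops(8, NV30_CLOCK)
print(mops(32, NV30_CLOCK, 2) + nv30_int)   # FP16 + INT: 12000
print(mops(32, NV30_CLOCK, 4) + nv30_int)   # FP32 + INT: 8000
```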


    Uttar
     
  20. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Uttar, you are using the wrong numbers: the fragment shading processor does not include the texture addressing unit and the texture interpolator. The 3-ops-per-clock number you gave is for these three units together, but the fragment color processor is what the NV30/R300 use for the actual shading arithmetic (this is where the NV30 gets its FP16/FP32 flexibility), so we must isolate it from the rest of the architecture. Both processors, indeed, can do a texture interpolation and a texture address per clock (NV30 can do 2 address ops, according to Digit-Life), aside from the color fragment op.

    The color fragment op is considered to be issued/executed at 1/clock on both the R300 and NV30 architectures (at the precision they're most comfortable with :wink:). Because it is given in a VLIW format (a packed way of issuing more than one instruction in one longer instruction), the work of the 128-bit precision RGBA op is divided amongst 4 32-bit units as scalar/vector ops. Sireric already explained the fragment shader pipeline specifics (for a single fragment shading pipeline) of the R300 for me (wish NVidia would do the same; ATI does not shy away). Here is the skinny:

    "There are 4 FMAD units, three reserved for vector units, 1 for scalar units. However, the scalar unit can kick in to give you 4 vector ops (dot4). Now, beyond the fmad, the scalar unit has a bunch of other units, including all the exotic functions (inv,log,exp, etc...), which can operate in parallel with the MAD. Those don't share the FMAD since it could not meet our timing requirements mixed with lut's. So, a simplified MAD was merged in to perform table lookups."

    It is in this thread, bottom of 1st page:
    http://www.beyond3d.com/forum/viewtopic.php?t=3042&highlight=sireric

    We can thus conclude that the R300 also has 32 PS units, though with seemingly special abilities. The units, aside from being able to issue a 4-component vector op per cycle, can also execute a complex FP op (exp, rcp, log, etc.) at full 24-bit precision.
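
As a quick sanity check on that unit count (using Sireric's per-pipe figures quoted above):

```python
# 8 fragment pipes, each with 3 vector FMAD units plus 1 scalar FMAD
# unit, per Sireric's description of the R300.

PIPES = 8
FMADS_PER_PIPE = 3 + 1   # 3 vector + 1 scalar

print(PIPES * FMADS_PER_PIPE)  # 32 PS units total
```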

    If you follow along to this thread, http://www.beyond3d.com/forum/viewtopic.php?t=4064 , there is speculation as to the nature of NV30's fragment processing arrangement. Based on my findings, I drew a conclusion towards the end of the thread which may indicate why the NV30's architecture differs a little from the R300's.
     