NV30/31/34/35 Fragment Processor Diagram (Speculation)

Discussion in 'Architecture and Products' started by Zephyr, Mar 31, 2003.

  1. Zephyr

    Newcomer

    Joined:
    Aug 18, 2002
    Messages:
    74
    Likes Received:
    0
    NV30/31/34/35 fragment processor diagram, just my speculation. Any comments are welcome, especially comments from ATi's engineers :) I am sure that nVidia's engineers cannot comment on this :lol:

    [​IMG]

    [​IMG]

    [​IMG]

    [​IMG]

    The fragment processor is organized as many ALUs, and it is highly flexible.
     
  2. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    3 comments:
    • Where do you get 6/12 z/stencil/shader units for NV31/35 from? Seems to me to be really impractical wrt memory accesses (non-power-of-2 pipes accessing memory => problems with address calculations and lots of misaligned memory accesses => less than optimal performance)
    • Is there any evidence around that NV30 functions as an 8x1 (as opposed to a deeply pipelined 4x2) in the PS1.x shader modes?
    • For the NV34/35, I would expect it to be possible to emulate fixed-function operation with shader functionality, so I see no particular reason why fixed-function mode would expose fewer pixel pipelines than the shader modes ..?
     
  3. Zephyr

    Newcomer

    Joined:
    Aug 18, 2002
    Messages:
    74
    Likes Received:
    0
    NV31 has a much better PS1.0 score (3Dmark2001se) than NV25 does, so I think it should have 6 shading units. AFAIK, NV35 borrows the successful idea from NV31, so it's possible that NV35 has 12 shading units, which is a good compromise for both PS1.1~1.3 and PS2.0.

    NV30's 8 fx color shading units can explain PS1.x (not including 1.4) scores from shadermark/3DMark 2001 well.

    I am not sure whether NV34/35 keep the antique fixed-function parts. For now, I think they keep them.
     
  4. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    Oddly enough I've unofficially heard 6 units in relation to NV31 a number of times before. It also fits with the Digitimes article, which many dismissed offhand, stating that one of the reasons NV30 was delayed was to beef up the pipelines from 6 to 8; that's a good indication that a 6-pipe chip was in the works.
     
  5. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    Zephyr, interesting way to view the pipelines. As you know, nVidia's OpenGL NV_30 specs suggest that for each pipeline you have one Fragment Processor and two Register Combiners (one tied to the old PS 1.3 path and the other to the Fragment program).

    Under some circumstances you might be able to run the two paths in parallel, and we can thus call it two pipelines instead of one.

    But: I still don't fully understand this approach, because you would be rendering pixels from the same polygon and thus probably can't mix integer PS 1.3 and FP 2.0 operations in the same clock cycle. (You would normally have tied a specific shader per polygon AFAIU).

    :!:
     
  6. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    BTW: Zephyr, Dave (and Kristof): It would be awesome if the NV30 vs R300 tech compare is updated. Most is already in there sure, but maybe you would want to have some other dev guys look it over and update it accordingly. Right now there is a bit more speculation than the otherwise great work (and beyond3d) deserve.
     
  7. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    I will ignore the NV35 diagrammatic speculation (for obvious reasons), but I am curious how you arrived at such organized speculations. What tests did you use to come to these speculations? Do you have all three boards (discounting the NV35)?
     
  8. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    I think Dave is working on this... hopefully we'll see more than just performance-comparison numbers... I'm just as curious about the hardware itself as about pure performance numbers.
     
  9. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Still unconvinced about the 6 pipes: for Z/stencil, the overhead of adding 2 extra Z/stencil units (which are, after all, rather cheap) (to bump the number up to 8 and that way get aligned memory accesses) would seem to me to be less expensive in HW than adding support for misaligned memory accesses to the memory controller.

    6 shader operations seems a bit more plausible, as you could conceivably put a bunch of FIFOs to buffer up data across the pipeline count mismatches, although myself I would guess that the NV31's pixel pipes just have a richer/more efficient collection of functional units available for pixel shading than what NV25 has got, allowing more shader ops per pipeline per clock.

    As for the NV30 4x2 vs 8x1: 8 shading units could very well be grouped in a 4x2 fashion, just like the texture units apparently are - the only way to tell the difference is to test the performance falloff with increasing number of shading instructions - going from e.g. 5 to 6 instructions would have no performance hit with 4x2 but ~17% performance hit with 8x1. If they are organized as 4x2, and the 2 shaders in each pipeline are connected in series, you could have that during any given clock cycle (assuming shader latency >= 1 cycle), the 8 shading units indeed work on 8 different pixels, just like Nvidia is claiming .. :?
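
    The falloff test described above can be sketched numerically. This is only a toy model under stated assumptions: a "4x2" layout is taken to retire up to two shading instructions per pipeline pass, an "8x1" layout one instruction per pipeline per clock, and the function names are made up for illustration. It is not a description of the real hardware.

```python
import math

def pixels_per_clock_4x2(n_instr: int) -> float:
    """Assumed model: 4 pipelines, each with 2 shading units in series."""
    passes = math.ceil(n_instr / 2)  # each pass retires up to 2 instructions
    return 4 / passes

def pixels_per_clock_8x1(n_instr: int) -> float:
    """Assumed model: 8 pipelines with one shading unit each."""
    return 8 / n_instr

def falloff(rate_fn, n_from: int, n_to: int) -> float:
    """Fractional throughput loss when the shader grows from n_from to n_to instructions."""
    return 1 - rate_fn(n_to) / rate_fn(n_from)

if __name__ == "__main__":
    for name, fn in [("4x2", pixels_per_clock_4x2), ("8x1", pixels_per_clock_8x1)]:
        print(f"{name}: 5 -> 6 instructions costs {falloff(fn, 5, 6):.1%}")
```

    Under these assumptions the 4x2 model loses nothing going from 5 to 6 instructions (both need 3 passes), while the 8x1 model loses 1/6 of its throughput, which is the ~17% figure quoted above.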
     
  10. Zephyr

    Newcomer

    Joined:
    Aug 18, 2002
    Messages:
    74
    Likes Received:
    0
    Yes, I agree with you. There are still some unanswered questions about NV30. I need many more numbers to verify them.

    I only have an R300 card, but I analyze many numbers from TOMS, HARDOCP, ANAND, DIGIT-LIFE, IXBTLABS and BEYOND3D (with its forum, of course). Besides, cho, my friend, provides me with many useful NV31/34 numbers at my request.
     
  11. Zephyr

    Newcomer

    Joined:
    Aug 18, 2002
    Messages:
    74
    Likes Received:
    0
    The reason I ask for comments is that I need verification, especially special PS test scores. IMHO, only test numbers can tell us whether it is right or not :)
     
  12. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    nv30:
    Hmm...your labels "pipeline" and "pixel" seem to be based on the count of processing units, not their functionality (or do you have benchmarks in mind that correspond to the outline?). When you term something "operation at pixels/clock" I think it should generally be "operations per clock" from what I remember of exhibited functionality....they aren't synonymous.

    Are the texture ops texture address ops? If so, which benchmarks show 8 texture ops per clock? I think it uses the same ALUs used for fp32 fragment processing for them, and can do 8 fp operations per clock total between them. Also, I think the register bits for using the fp units for fragment shading are limited to what is necessary for 4 fp16 ops (pack/unpack allowing it to be used for 2 fp32 ops, and maybe more sometimes with component masking...don't recall benchmarks clarifying if that is the case).

    Do we have indication somewhere that floating point processing has independent register combiners? I.e., double what the GF 4 has? I thought it just added outputting z buffer values in place of color values as functionality (which was also NV2A functionality)...the equivalent of a new combining op rather than more units.


    Anyways, I think the tables at the bottom should be one table, because I think the shader functionalities are interdependent...based on that, I'm going to put my understanding as well as I can manage right now.

    AFAIK, c) should be "8 (4 during fp ops) texture ops per clock" and d) should be "4 floating point ops per clock + 4 integer ops per clock".

    nv31:
    Seems to have half the fp processing capabilities of the nv30, and half the texture fetching capability.

    I'd think that c) would be "4 (2 during fp ops) texture ops per clock" and d) would be "2 fp ops per clock + 4 integer ops per clock".

    Could you show where you saw the performance advantage for the nv31 compared to nv25? I looked in a few places and didn't see that. Here is the hardware.fr benchmark set.

    nv34:
    Seems to be put forth as something like 1/2 of the nv30 (the nv31 is a bit more than "half" I think).

    So maybe c) would be "4 (2 during fp ops) texture ops per clock" and d) would be "2 fp ops per clock + 2 integer ops per clock".

    nv35:
    Hmm...what is the failing for nv30? Needs more color output per clock and more simple/"legacy" color processing capabilities. I think the way it will do that is another set of register combiners.

    I'd guess that would be c) with "8 (4 during fp ops) texture ops per clock" and d) with "4 fp ops per clock + 8 integer ops per clock" (simply because I don't see how they'd have the transistor budgets for more fp units).

    I think in all cases the number of pixels per clock is limited by the integer/register combiner output, which is tied to minimum required bus width (integer op outputs * 32).
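
    As a quick arithmetic check of that rule of thumb (integer op outputs * 32), here is a small sketch. The per-chip integer op counts are simply the guesses from this post, not measurements, and the function name is hypothetical.

```python
BITS_PER_OUTPUT = 32  # the post's factor: one 32-bit value per integer op output

# Integer ops per clock as guessed in the post (speculation, not measured)
GUESSED_INTEGER_OPS = {"nv30": 4, "nv31": 4, "nv34": 2, "nv35": 8}

def min_bus_width_bits(integer_outputs_per_clock: int) -> int:
    """Minimum bus width implied by the post's rule of thumb."""
    return integer_outputs_per_clock * BITS_PER_OUTPUT

if __name__ == "__main__":
    for chip, ops in GUESSED_INTEGER_OPS.items():
        print(f"{chip}: {ops} outputs/clock -> {min_bus_width_bits(ops)} bits")
```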

    Sorry for any "Monday Mistakes" in advance.
     
  13. thepkrl

    Newcomer

    Joined:
    Mar 14, 2003
    Messages:
    12
    Likes Received:
    0
    I have some results that might help you.

    I've been testing NV30 (5800 Ultra) fragment program performance with driver 43.45 (results are the same as for 42.92 with which I started). Testing is done with OpenGL NV_fragment_program.

    I have tested performance for all instructions with FP32, FP16, FX12 with both dependent operations and parallel independent operations. There is no difference between FP32/FP16 (but see the register usage results below). FX12 operations are significantly faster (3-4x).

    Operations/cycle
    FP    FX      Instructions
    4     12-16   mul/dp3/dp4
    4     12      mad/add/sub/max/min/flr/frc
    4     12      seq/sge/sgt/sle/slt/sne/str/sfl
    2     8       lrp
    4     -       sin/cos/ex2/lg2/dst/rcp/x2d/ddx/ddy
    2     -       rsq/lit/pow
    1     -       rfl
    4     -       pack/unpack/kil
    8     -       tex/txp
    0.8   -       txd

    It seems all instructions run at full speed even if they depend on the previous instruction (this includes dependent texture lookups). Constants and conditionals seem free, at least in short programs.

    The interesting exception is MUL/DP3/DP4. Independent instructions are executed at 16 ops/cycle, dependent at 11-12 ops/cycle. Note that a group of 16MUL+12ADD FX12-units would provide just the resources needed for the FX12 performance.

    FP and FX instructions do run in parallel. With 4 FP-ops/cycle, adding 8 FX-ops/cycle seems to be free, while adding 12 FX-ops/cycle slows the shader. This number is a result of just a few tests; I haven't yet benchmarked these cases in much detail.

    Texture fetches and FP-ops do not work in parallel, so FP unit is probably involved in texture fetches somehow (perhaps DDX,DDY calculation). FX-ops do work in parallel with texture fetches.

    It seems all instructions are executed as vector ops with no support for parallel RGB and A instructions. Operations with just one element enabled are as fast as with all elements enabled.

    Register usage is the key to performance, as has been mentioned earlier. For maximum performance, it seems you can only use 2 FP32-registers or 4 FP16-registers. Every two new registers slow down things, and going over 8 regs slows even more:

    4.2 cyc/pix: 1reg (2 movs, 16 adds)
    4.5 cyc/pix: 2reg (2 movs, 16 adds)
    5.8 cyc/pix: 3reg (2 movs, 16 adds)
    5.5 cyc/pix: 4reg (2 movs, 16 adds)
    7.5 cyc/pix: 5reg (2 movs, 16 adds)
    7.1 cyc/pix: 6reg (2 movs, 16 adds)
    9.9 cyc/pix: 7reg (2 movs, 16 adds)
    9.9 cyc/pix: 8reg (2 movs, 16 adds)
    15.0 cyc/pix: 9reg (2 movs, 16 adds)

    In the above test the N registers are used in order. If the register usage order is very mixed, performance seems to drop even more. This suggests there are about 2-4 real registers for each pixel in flight (depending on whether the output register is counted or extra temporaries are reserved). If more registers are used, data is moved between active registers and some slower memory buffer, which adds extra instructions.
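
    To make the stepping in the register table easier to see, here is a small sketch that only re-reads the measured numbers above. The helper names are made up, and nothing here models the hardware; it just averages each pair of register counts and expresses the cost relative to the one-register case.

```python
# cycles/pixel by number of registers used (2 movs + 16 adds), from the table above
CYCLES_PER_PIXEL = {1: 4.2, 2: 4.5, 3: 5.8, 4: 5.5, 5: 7.5,
                    6: 7.1, 7: 9.9, 8: 9.9, 9: 15.0}

def slowdown(regs: int) -> float:
    """Cost relative to the one-register case."""
    return CYCLES_PER_PIXEL[regs] / CYCLES_PER_PIXEL[1]

def pair_means() -> list[float]:
    """Mean cycles/pixel for register counts (1,2), (3,4), (5,6), (7,8)."""
    return [(CYCLES_PER_PIXEL[k] + CYCLES_PER_PIXEL[k + 1]) / 2 for k in (1, 3, 5, 7)]

if __name__ == "__main__":
    for regs in sorted(CYCLES_PER_PIXEL):
        print(f"{regs} reg(s): {slowdown(regs):.2f}x the 1-register cost")
    print("per-pair means:", [round(m, 2) for m in pair_means()])
```

    The per-pair means rise with every two extra registers, and the ninth register is the single largest jump, which is the pattern described above.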

    Reading color registers seems to take an extra FP-op (32-bit int to FP conversion?). Texture registers can be accessed without extra cost (presumably as they are floats already).

    In general it seems that the performance is rather predictable, meaning modifying instruction order has little effect, as long as just a few registers are used. It seems that the FX does execute the instructions one at a time and has a sufficient number of instructions in flight that dependencies are not important.
     
  14. thepkrl

    Newcomer

    Joined:
    Mar 14, 2003
    Messages:
    12
    Likes Received:
    0
    As for architectural speculation on the NV30 fragment shader:

    Earlier Geforces had TEXTURE SHADER followed by REGISTER COMBINERS. Texture shader is really very much like a simple fragment shader unit, as it can fetch textures and do some limited floating point operations.

    It would make sense that with NV30 the texture shader was replaced with the fragment shader and the register combiners were left intact. Stencil/z/antialiasing etc. are all handled after the register combiners and should therefore behave similarly on NV30 as on Geforce 4. And this seems to be the case.

    The 4-pixel color write limitation probably comes also from the combiners, as that was the number of colors/cycle the Geforce4 could handle. The 8 Z-writes per cycle was already supported in XBox (this I read from these forums, don't know if it's true).

    With NV35, perhaps the combiners are finally removed and replaced with more processing power in the fragment shader, or perhaps by moving combiner functionality inside the fragment. This would remove the 4-pixel/clock limitation and also make PS1.4 execution easier, as texture and combiner operations could be freely mixed.

    Some supporting facts from NVidia docs:

    The original NV_fragment_program spec supported "combiner fragment programs", which allowed outputting four colors from the fragment program, which were then processed in register combiners (which "operates as in NV20", quote from OpenGLforNV30.pdf). NVidia docs also suggest that these programs are more efficient, as the combiners support more complex operations (at less precision).

    It would be possible that combiner programs are compiled into fragment shader programs, but in that case why add that extension instead of just adding the combiner operations directly to the fragment shader language? Using the old combiners would have made hardware design easier.

    Interesting related facts from current fragment program specs:
    02/01/02 pbrown Removed support for combiner fragment programs (!!FCP1.0).
    07/24/02 pbrown Removed PK4UBG and UP4UBG instructions.
    (the 02/01/02 should probably be 02/01/03 as it is the last entry)

    First of all, they removed the combiner fragment programs I just mentioned. They are not available in 43.35 (or earlier) drivers, except in NV30 emulation mode. Why do this? One possibility is that NV35 no longer benefits from combiner fragment programs, as it has no separate combiners.

    The removed PK4UBG and UP4UBG instructions were used for texture gamma correction (mentioned in OpenGLforNV30.pdf). After their removal it seems there is no efficient way to do gamma corrected textures in OpenGL (you could do individual POW-operations per channel, but that would be prohibitively slow). I do hope they return in some form. Perhaps they don't work correctly in NV30? And perhaps NV30 could do gamma corrected AA downsampling using fragment programs if they did...

    You can find the referenced nvidia docs at:
    http://www.nvidia.com/dev_content/nvopenglspecs/GL_NV_fragment_program.txt
    http://developer.nvidia.com/docs/IO/3260/ATT/OpenGLforNV30.pdf
     
  15. Zephyr

    Newcomer

    Joined:
    Aug 18, 2002
    Messages:
    74
    Likes Received:
    0
    I borrow the language from nVidia's reply :)

    From thepkrl's test, maybe you are right.

    For now, I don't have strong evidence that NV30 has double the number of register combiners. But it is very possible that NV30 has them. Sure, it needs verification, and your words also need verification.

    I think my lists are clearer, and I don't agree with your words "8 (4 during fp ops) texture ops per clock" and "4 floating point ops per clock + 4 integer ops per clock" now, including the similar words for NV31/34.

    Very strange.

    [​IMG]

    You can also see a similar result at digit-life.com (http://www.digit-life.com/articles2/gffx/nv31-nv34.html).

    There are rumors that NV35 will cut down some old-fashioned parts of NV30.
     
  16. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    If not surprising, still very interesting findings. Key question: Do the Register Combiners that are coupled with the Fragment Program have the ability to run the instructions in FP? If not, CineFX has a surprisingly large handicap on full FP shaders, as already speculated.
     
  17. Zephyr

    Newcomer

    Joined:
    Aug 18, 2002
    Messages:
    74
    Likes Received:
    0
    Good job, thepkrl, your words are very helpful.
     
  18. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    I really don't think the diagrams posted above are very accurate (I haven't looked at many of the comment posts).

    I believe that the current "PS 2.0" functionality takes the place of the old pixel shader that was in the NV2x. This fragment shader always executes floating-point ops, and part of nearly any pixel shader program will be executed in this portion of the processor.

    The NV3x also has at least four additional shader processors, called register combiners in OpenGL. These work at 12-bit fixed point, and are for doing calculations only after all texturing operations are completed (I think...I really don't know if this is the way it works precisely, this is all speculation, but it seems to make sense...).

    In the end, this means that the best performance will be leveraged when an equal amount of calculations are done in FP and fixed point.

    The primary question still is, how much pure processing power is there in the fragment processor? I have yet to see an adequate theoretical description that could explain the experimental results.
     
  19. thepkrl

    Newcomer

    Joined:
    Mar 14, 2003
    Messages:
    12
    Likes Received:
    0
    FP32 additions (FP16 is same speed)
    3.98 fragm/cycle 0.25 cycle/fragm: 1add-FP32
    1.90 fragm/cycle 0.53 cycle/fragm: 2add-FP32
    1.26 fragm/cycle 0.79 cycle/fragm: 3add-FP32
    0.95 fragm/cycle 1.06 cycle/fragm: 4add-FP32
    0.76 fragm/cycle 1.32 cycle/fragm: 5add-FP32
    0.63 fragm/cycle 1.59 cycle/fragm: 6add-FP32
    0.54 fragm/cycle 1.85 cycle/fragm: 7add-FP32

    FX12 additions
    3.98 fragm/cycle 0.25 cycle/fragm: 1add-FX12
    3.98 fragm/cycle 0.25 cycle/fragm: 2add-FX12
    3.98 fragm/cycle 0.25 cycle/fragm: 3add-FX12
    1.90 fragm/cycle 0.53 cycle/fragm: 4add-FX12
    1.88 fragm/cycle 0.53 cycle/fragm: 5add-FX12
    1.88 fragm/cycle 0.53 cycle/fragm: 6add-FX12
    1.26 fragm/cycle 0.79 cycle/fragm: 7add-FX12

    Texture loads with two paired texture fetches followed by FX12 adds
    3.98 fragm/cycle 0.25 cycle/fragm: 1tex
    3.97 fragm/cycle 0.25 cycle/fragm: 2tex-paired
    1.88 fragm/cycle 0.53 cycle/fragm: 3tex-paired
    1.88 fragm/cycle 0.53 cycle/fragm: 4tex-paired
    1.25 fragm/cycle 0.80 cycle/fragm: 5tex-paired
    1.25 fragm/cycle 0.80 cycle/fragm: 6tex-paired
    0.94 fragm/cycle 1.06 cycle/fragm: 7tex-paired

    Texture loads with individual texture fetches followed by FX12 add
    3.98 fragm/cycle 0.25 cycle/fragm: 1tex
    1.88 fragm/cycle 0.53 cycle/fragm: 2tex-nonpaired
    1.26 fragm/cycle 0.80 cycle/fragm: 3tex-nonpaired
    0.94 fragm/cycle 1.06 cycle/fragm: 4tex-nonpaired
    0.75 fragm/cycle 1.33 cycle/fragm: 5tex-nonpaired
    0.63 fragm/cycle 1.59 cycle/fragm: 6tex-nonpaired
    0.54 fragm/cycle 1.86 cycle/fragm: 7tex-nonpaired
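
    One way to read the four tables above is as a "passes" model: a peak of about 4 fragments/cycle, divided by the number of passes the program needs, where a pass retires a fixed number of ops. The sketch below checks that reading against the measured numbers. The group sizes (1 FP32 add, 3 FX12 adds, 2 paired texture fetches, 1 non-paired fetch per pass) are assumptions read off the steps in the data, the trailing FX12 adds in the texture tests are assumed to be free, and all names are hypothetical.

```python
import math

PEAK = 4.0  # fragments/cycle for a one-instruction program, from the tables

# Ops assumed to be retired per pass for each test series (an assumption
# inferred from where the measured rates step down)
GROUP = {"add-FP32": 1, "add-FX12": 3, "tex-paired": 2, "tex-nonpaired": 1}

# Measured fragments/cycle for 1..7 ops, copied from the tables above
MEASURED = {
    "add-FP32":      [3.98, 1.90, 1.26, 0.95, 0.76, 0.63, 0.54],
    "add-FX12":      [3.98, 3.98, 3.98, 1.90, 1.88, 1.88, 1.26],
    "tex-paired":    [3.98, 3.97, 1.88, 1.88, 1.25, 1.25, 0.94],
    "tex-nonpaired": [3.98, 1.88, 1.26, 0.94, 0.75, 0.63, 0.54],
}

def predicted(series: str, n_ops: int) -> float:
    """Fragments/cycle predicted by the passes model."""
    return PEAK / math.ceil(n_ops / GROUP[series])

def worst_error(series: str) -> float:
    """Largest relative gap between model and measurement in one series."""
    return max(abs(predicted(series, n) - m) / m
               for n, m in enumerate(MEASURED[series], start=1))

if __name__ == "__main__":
    for series in MEASURED:
        print(f"{series}: worst relative error {worst_error(series):.1%}")
```

    With these group sizes the model stays within roughly 7% of every measurement. The measured rates sit consistently a little below the model once a second pass is needed, so each extra pass seems to cost slightly more than 0.25 cycles.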

    Program details (only some listed, the rest are similar):

    "1add-FP32",
    "ADD o[COLR],R0,R0;",

    "7add-FP32",
    "ADD R0,R0,R0;",
    "ADD R0,R0,R0;",
    "ADD R0,R0,R0;",
    "ADD R0,R0,R0;",
    "ADD R0,R0,R0;",
    "ADD R0,R0,R0;",
    "ADD o[COLR],R0,R0;",

    "1add-FX12",
    "ADDX o[COLH],H0,H0;",

    "7add-FX12",
    "ADDX H0,H0,H0;",
    "ADDX H0,H0,H0;",
    "ADDX H0,H0,H0;",
    "ADDX H0,H0,H0;",
    "ADDX H0,H0,H0;",
    "ADDX H0,H0,H0;",
    "ADDX o[COLH],H0,H0;",

    "1tex",
    "TEX o[COLH],f[TEX0],TEX0,2D;",

    "2tex-paired",
    "TEX H0,f[TEX0],TEX0,2D;",
    "TEX H1,f[TEX1],TEX0,2D;",
    "ADDX o[COLH],H0,H1;",

    "7tex-paired",
    "TEX H0,f[TEX0],TEX0,2D;",
    "TEX H1,f[TEX1],TEX0,2D;",
    "ADDX H2,H2,H0;",
    "ADDX H2,H2,H1;",
    "TEX H0,f[TEX2],TEX0,2D;",
    "TEX H1,f[TEX3],TEX0,2D;",
    "ADDX H2,H2,H0;",
    "ADDX H2,H2,H1;",
    "TEX H0,f[TEX4],TEX0,2D;",
    "TEX H1,f[TEX5],TEX0,2D;",
    "ADDX H2,H2,H0;",
    "ADDX H2,H2,H1;",
    "TEX H0,f[TEX6],TEX0,2D;",
    "ADDX o[COLH],H2,H0;",

    "2tex-nonpaired",
    "TEX H0,f[TEX0],TEX0,2D;",
    "ADDX H1,H1,H0;",
    "TEX H0,f[TEX1],TEX0,2D;",
    "ADDX o[COLH],H1,H0;",

    "7tex-nonpaired",
    "TEX H0,f[TEX0],TEX0,2D;",
    "ADDX H1,H1,H0;",
    "TEX H0,f[TEX1],TEX0,2D;",
    "ADDX H1,H1,H0;",
    "TEX H0,f[TEX2],TEX0,2D;",
    "ADDX H1,H1,H0;",
    "TEX H0,f[TEX3],TEX0,2D;",
    "ADDX H1,H1,H0;",
    "TEX H0,f[TEX4],TEX0,2D;",
    "ADDX H1,H1,H0;",
    "TEX H0,f[TEX5],TEX0,2D;",
    "ADDX H1,H1,H0;",
    "TEX H0,f[TEX6],TEX0,2D;",
    "ADDX o[COLH],H1,H0;",

    Fixed point operations can be executed between texture fetches, and the result is faster than using FP16

    0.95 fragm/cycle 1.06 cycle/fragm: tex+madfx12+dep.tex+add cwrite [10:d1] [b0 x0]
    0.63 fragm/cycle 1.59 cycle/fragm: tex+madfp16+dep.tex+add cwrite [11:d0] [b0 x0]

    "tex+madfx12+dep.tex+add",
    "TEX H0,f[TEX0],TEX0,2D;",
    "TEX H1,f[TEX1],TEX0,2D;",
    "MADX H0,H0,H0,H1;",
    "MADX H0,H0,H1,H0;",
    "TEX H0,H0,TEX0,2D;",
    "TEX H1,H1,TEX0,2D;",
    "ADD o[COLH],H0,H1;",

    "tex+madfp16+dep.tex+add",
    "TEX H0,f[TEX0],TEX0,2D;",
    "TEX H1,f[TEX1],TEX0,2D;",
    "MAD H0,H0,H0,H1;",
    "MAD H0,H0,H1,H0;",
    "TEX H0,H0,TEX0,2D;",
    "TEX H1,H1,TEX0,2D;",
    "ADD o[COLH],H0,H1;",

    I believe the register combiners are still 9 bit (as they are described as exactly similar to NV2x in NVidia docs). Unfortunately it is not possible to test fragment shaders and register combiners at the same time to see if they use any shared resources, as NVidia removed the support for this (it was originally documented to exist).
     
  20. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    Which drivers, thepkrl? Sorry if I missed it.
     