GeForce FX: 8x1 or 4x2?

Discussion in 'General 3D Technology' started by Dave Baumann, Feb 10, 2003.

  1. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    depends on downstream of what :)

    because the stages are looped, there can be 2 schemes here

    [tex][tex] or [fp alu]
    /--loop
    | [stage 1 alu]
    | [stage 2 alu]
    \--loop or write

    /--loop
    | [tex][tex] or [fp alu]
    | [stage 1 alu]
    | [stage 2 alu]
    \--loop or write

    seems it is the second one here, because you are capable of dependent tex reads
    but the question is whether fp is outside of the loop or not (can you send something from the second stage back not only to the tex fetcher but to fp???)

    probably not inside... but...

    maybe it is better to draw it this way

    /--loop
    | [stage 1 alu] <- [ tex tex or fp ]
    | [stage 2 alu] <-/
    \--loop or write
     
  2. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    But maybe they thought that it was faster for old games to use the fixed pipeline. It isn't as if the first NV3x iteration is going to play DX9 games at full speed either (just as the GeForce 256 DDR isn't expected to run current DX7 games at full speed). Maybe they thought that using the FP path for the old fixed function would be too slow, and taking into account current results with PS2.0 they could be right.

    BTW, I wasn't saying that NV30 really is VLIW (or whatever), just summarizing a bit what UncleSam said and my own thoughts if that was the real implementation.
     
  3. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    yes - i also think that fp/i and tex can be used only in OR order because of common silicon. i simply packed all this interpolation math into [tex] on the scheme because it's more specific than a generic fp alu.

    seems that this new generic fp alu USES math blocks from the tex fetchers (and, more likely?? filters)

    but the tex math is not a _generic_ fp/i alu as for nv30, so i drew different schemes to make that clear

    seems that the constraint about no filtering on float tex's also comes from this
     
  4. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    I have a problem understanding what 'stage' means here. We know from the OpenGL NV30 specification doc that the register combiner can be used after the NV_fragment_program hardware. So the result from FP based PS can be fed to the old integer based PS in a pipelined way. Of course that could be just the API interface and internally it could be doing something different. The problem I have with the way you explain how I12 and FP/TEX units operate is that I don't know if they are operating in parallel or in a pipelined way: is the input to [stage 1 alu] the output of TEX/FP?
     
  5. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Surely it's got to be the second since it can operate on 16 textures, with only two samplers.

    Well, I wonder if two integer processors is more or less than one FP? i.e. could they have replaced these two integer units with one FP and processed everything over two FP units? I guess not at FP32 width, which is why it seems that R300's FP24 is a bit of a 'sweet spot' for now.

    Sorry, I was addressing UncleSam's post previously, yours snuck in in the meantime.

    Anyway, can we reach some sort of summarised conclusion here for the feeble of mind (lack of sleep)? Have we decided that it is indeed 4 pipes with a number of processors per pipe (putting the theory of a single pipe with a 'sea' of processors out the window)? And how does this comparatively relate back to R300's pixel pipe?

     
  6. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    IMHO the best solution is to have a generic fp alu that can also operate in i mode by switching off operations on the exponent part of the fp value.

    for example when you have F32 and want its mantissa as an integer

    something regular like

    [loop begin/setup]
    [tex]->[stage1 fp/i]
    [tex]->[stage1 fp/i]
    ...
    [emit/loop]

    but you need to have a copy of the temporary registers for each stage, a full fp alu for each stage, and dedicated fp for filtering and fetch in each tex...

    it's a lot of silicon :( but very powerful and flexible in use

    as for independence of stages - i don't clearly understand what you mean. these 2 stages are chained: the second gets the pixel that passed the first a clock ago....
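
    On the "F32 mantissa as an integer" idea above, here is a rough Python sketch of what the fields of an IEEE 754 single look like when you pull them apart. The function names are made up for illustration, and this only shows the number format; it says nothing about how NV30 actually wires its ALUs.

```python
import struct

def float32_fields(x):
    """Split an IEEE 754 single-precision value into (sign, exponent, mantissa)."""
    # Reinterpret the 32-bit float pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 stored fraction bits
    return sign, exponent, mantissa

def significand_as_int(x):
    """Return the significand as an integer, restoring the implicit leading 1.

    Assumption for this sketch: inf/NaN are not special-cased.
    """
    _, exponent, mantissa = float32_fields(x)
    # Normal numbers carry an implicit leading 1; zero and denormals do not.
    return mantissa | (1 << 23) if exponent != 0 else mantissa
```

    With the exponent path ignored, what is left is a 24-bit integer per component, which is the part an "i mode" would operate on in this picture.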
     
  7. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    Reusing ALUs from the TMUs? The TMUs implement the PS2.0 instructions? But I think that would have a very large penalty when using both texture and FP (that is, always)!!!

    The constraint in R300 I think was because the filtering is done with fixed point precision. Adding FP precision to the bilinear filtering hardware would have meant a lot of transistors. The same should apply to NV30. I don't think it is because of what you say. In fact if the TMUs were FP capable that would mean that FP filtering could also be used, but of course with a very large penalty.
     
  8. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    probably it is more flexible (kind of VLIW programmed anyway) as we can make 8 Z outputs, but if John says that we always do 4 coloured pixels per clock it's easier to represent it as 4 pipes.

    so let it be 4 pipes! :D

    but each of the 4 pipes can produce 2 Z instead of color+Z if we need that

    the pipes share microcode; each row consists of setup for the 2 i stages, two tex ops or one fp alu op, and additional info about the interconnection (data flow) between units

    the difference from ati is that we are not limited by phases - we can fetch tex-s anytime and can execute up to 1024 microcode lines (but each constant costs one of them - seems that constants are placed into the microcode one way or another - as a constant in the instruction, or code and data simply share common memory)
     
  9. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    I think that's an optimisation in GF4's ability of having 4 z units per pipe for MSAA purposes. I've put a question to NV to establish if this ability for 2Z + 2 Stencil is affected when MSAA is in operation, which should give us a response.
     
  10. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    you're right!

    i surely need to sleep :lol:
     
  11. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    ok, lets wait!
     
  12. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    In fact I don't think that the 8 Z and Stencil only path goes through the shaders at all. It must be a different path/bus from the interpolators to the Z/Stencil unit after the PS hardware. That bus would also be smaller than the input bus for the PS hardware as it just needs pixel coordinates (x, y, z) and can ignore all the other interpolators/parameters (color, textures).
     
  13. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    Why not:

    /--loop
    | [tex][tex] or [fp alu]
    \--loop or "write"
    /--loop
    | [stage 1 alu]
    | [stage 2 alu]
    \--loop or write

    Actually GF4 does no looping in register combiners - it combines pipelines instead! (4x2 = 8, so no need for looping)
    Does anyone know if GFFX supports more than 8 register combiner stages?
    If not, it's probably working the same way as the GF4...
     
  14. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    Maybe it's the penalty that we are seeing.
    Also I'm still wondering if the FX can co-issue vec3 and scalar on the same cycle similarly to the R300.

    I don't think the reused part of the TMU is the filters.
    Fetching textures needs a lot of calculation per-pixel: W divide, LOD calculation, direction and degree of anisotropy.
    Because the input coordinates are FP it needs considerable FP processing power - that's what the R300 diagram calls "floating point address processor".
    I think that's what really is shared in the NV30.

    The actual fetchers and bilinear multipliers (which are fixed-point) are hardly reusable.
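
    To give a feel for the per-pixel address math listed above (W divide and LOD calculation), here is a minimal Python sketch using the textbook isotropic mip-level formula. The function names and the choice of the "longest derivative" rule are assumptions for illustration, not a description of what the NV30 or R300 address processor really does.

```python
import math

def perspective_divide(s, t, q):
    """Project homogeneous texture coordinates (s, t, q) to (u, v)."""
    return s / q, t / q

def mip_lod(du_dx, dv_dx, du_dy, dv_dy, tex_w, tex_h):
    """Isotropic LOD from screen-space derivatives of normalized (u, v)."""
    # Scale the derivatives to texel units, then measure the footprint
    # along the screen x and y directions.
    len_x = math.hypot(du_dx * tex_w, dv_dx * tex_h)
    len_y = math.hypot(du_dy * tex_w, dv_dy * tex_h)
    rho = max(len_x, len_y)
    # One texel per pixel gives LOD 0; magnification is clamped to LOD 0.
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0
```

    Even this simplified version needs divides, multiplies, a square root and a log2 per pixel, all on floating-point inputs, which is the point being made about the address processor carrying real FP hardware.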
     
  15. binmaze

    Newcomer

    Joined:
    Feb 12, 2003
    Messages:
    88
    Likes Received:
    0
    confused...

    As a layman, i can't understand anything of this dynamic pipeline thing and its significance or impact.

    Would anyone like to enlighten me?
    1. Is this type of dynamic pipeline structure (not necessarily NV30) good for the future?

    2. I heard NV30 would be good for Doom3 and prolly good for the games that use the D3 engine. But what about other games?
    Will future games use much the same tech as D3 or not? Or won't it matter whether they use that tech or not?

    3. Is there any massive driver issue due to this complex structure? If so, would there be a drastic performance improvement once it's fixed?

    4. What's a good way to upgrade from NV30 to NV35? I heard it wouldn't gain much with 256 bits. And it's oc'd to the limit by the factory itself, and produces great heat and noise with that aggressive cooler. And again, it's already quite big and has complex layers (10? 12?).
    We now hear there'd be no Low-K tech soon enough.

    So what can you do to make NV35 beat R350?

    Thanks in advance.
     
  16. UncleSam DL iXBT

    Newcomer

    Joined:
    Feb 27, 2003
    Messages:
    36
    Likes Received:
    0
    not the same, because 2.0 shaders can be quite long and there can be more than 8 tex fetches, so it is better to loop here i think

    if we combine pipelines we need to send the fp alu state from one to another
    that's pretty much silicon

    much easier to loop in one pipeline

    and IMHO the scheme with two loops is not useful either - much cheaper to loop together
     
  17. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    It's like we're building a consensus full of sound and fury, signifying...

    Given that 100% of 3D gaming software that I'm aware of always uses single and multitexturing--and never to any fixed proportion--why would a 4x2 organization ever be superior to 8x1...? At worst 4x2 will be half as fast, at best, as fast. In a mixture between worst and best cases, which is what all 3D gaming software currently is (I don't see this changing in the next year), 8x1 is clearly the better plan, IMO. Thus the Dustbuster, although it may be as though it never was, was born... (re: nVidia's overarching need for 500MHz.) Simply ramp R300 to 500MHz, compare, and you will have your proof (likely no easier to do than for nv30, as long as R300 is constrained to .15 microns. At .13 microns, though...maybe a different story entirely, for R300.)
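
    The "half as fast at worst, as fast at best" arithmetic can be sketched with a very idealized model: each pipe applies its texture units once per clock and loops until all layers are done. The function name is hypothetical, and the model deliberately ignores memory bandwidth, filtering cost and everything else that matters on real hardware.

```python
import math

def pixels_per_clock(pipes, tmus_per_pipe, layers):
    """Peak pixels/clock for `layers` textures per pixel (layers >= 1).

    Idealized assumption: no bandwidth or filtering limits.
    """
    # A pipe needs ceil(layers / tmus_per_pipe) clocks to finish one pixel.
    clocks_per_pixel = math.ceil(layers / tmus_per_pipe)
    return pipes / clocks_per_pixel

for layers in (1, 2, 3, 4):
    rate_8x1 = pixels_per_clock(8, 1, layers)
    rate_4x2 = pixels_per_clock(4, 2, layers)
    print(layers, rate_8x1, rate_4x2, rate_4x2 / rate_8x1)
```

    In this model 4x2 is at half the 8x1 rate for a single texture, matches it for even layer counts, and lands in between for odd counts above one.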

    It seems to me that entirely too much apology is being expended to understand nv30's probable 4x2 organization, rather than effort expended to understand why nVidia could not do an 8x1 organization at either .15 microns or .13. An examination of nv30's Integer pipeline coupled with a look at nVidia's circuit count, and a look at nVidia's approach to its shaders, might yield a better look at the nv30 design philosophy--as opposed to arguing over 4x2 versus 8x1--between which 8x1 is better.
     
  18. Joe DeFuria

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,994
    Likes Received:
    71
    I agree.

    The only reason to go with a 4x2 vs. an 8x1 is....cost. It's cheaper not only in terms of the chip, but in the bandwidth needed to fully satisfy the chip. 8x1 demands more bandwidth to keep it fully satisfied than 4x2 would.

    The problem with that though, is that if some other engineering team, say, at ATI, can produce an 8x1 card, and also produce it with the bandwidth needed to satisfy it....and produce it earlier....and consuming less power....and perhaps even cheaper....then you're in trouble. ;)
     
  19. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, Walt, keep in mind that in single-texturing the FX will be memory bandwidth-limited anyway, so moving to 8x1 there would not make much difference. Once you put in memory bandwidth limitations, the FX's worst-case scenario becomes 75% of the performance of an FX with 8x1.
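
    A minimal sketch of the bandwidth argument, assuming the simplest possible model: the delivered single-texture rate is the smaller of what the pipes can issue and what memory can sustain. The function names and the cap values are hypothetical; the cap of about 5.33 pixels per clock is just back-solved to show what would give a 75% ratio in this model, not a measured figure for the FX.

```python
def effective_rate(pipe_rate, bandwidth_cap):
    """Delivered pixels/clock: pipe issue rate limited by a memory bandwidth cap."""
    return min(pipe_rate, bandwidth_cap)

def ratio_4x2_to_8x1(bandwidth_cap):
    """Single-texturing performance of 4x2 relative to 8x1 under the same cap."""
    # Single texturing: 4x2 issues 4 pixels/clock, 8x1 issues 8.
    return effective_rate(4, bandwidth_cap) / effective_rate(8, bandwidth_cap)

# Hypothetical caps, in pixels per clock the memory system can sustain.
for cap in (4, 16 / 3, 8):
    print(cap, ratio_4x2_to_8x1(cap))
```

    With no effective cap (8 or more) the ratio is the raw 50%; as the cap tightens toward 4 pixels per clock the two organizations converge, and a cap around 5.33 sits at 75%.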

    You also need to realize that depending on the way the pipelines are linked, running more pixels in parallel may kill performance for very small polygons.

    As for the GeForce FX vs. Radeon 9700 Pro, one thing that I would still like to see is trilinear fillrate. If the FX actually can filter 8 textures per clock with trilinear filtering, then for most situations it will have more raw fillrate per clock. If it can only filter 8 textures per clock with bilinear filtering, then it will (theoretically) be somewhat slower than the 9700 Pro on a clock for clock basis, depending on the rendering situation.
     
  20. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Nope. bilinear.
     