R350/NV35 Z Fillrate with FSAA

Discussion in 'Architecture and Products' started by Dave Baumann, Aug 19, 2003.

  1. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Actually, Ailuros, I think Tag's right there.
    Remember: Rampage was a 4x1, so if you had 2 textures, it took 2 clocks (yet another example of how the NV20 had some serious advantages over Rampage, even if Rampage had some over the NV20 too); that's +100%.
    So if you use 2x MSAA, and you assume the 200MHz DDR memory rumor to be roughly accurate, I'd say the hit of 2x MSAA most likely wouldn't be +100% memory bandwidth, because you wouldn't have more texture fetches, and a few other things wouldn't be doubled.
    So, you could say that the performance hit for 2x MSAA when using dual texturing, and the real one, not the fillrate hit, is quite small indeed!
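
    As a rough illustration of the reasoning above, here is a minimal Python sketch. The function names and the simple "one loop per extra texture" model are assumptions made for illustration; only the rule that a 4x1 design needs 2 clocks for 2 textures comes from the post. The model ignores memory bandwidth entirely, which is exactly why the real hit would be small rather than zero.

```python
import math


def clocks_per_pixel(textures: int, tmus_per_pipe: int) -> int:
    """Clocks one pipeline needs per pixel when it loops to apply all textures."""
    return max(1, math.ceil(textures / tmus_per_pipe))


def samples_per_clock_per_pipe(msaa_samples: int, textures: int, tmus_per_pipe: int) -> float:
    """AA samples each pipeline has to emit per clock, averaged over the texture loop."""
    return msaa_samples / clocks_per_pixel(textures, tmus_per_pipe)


# A 4x1 design (one TMU per pipe), as described for Rampage in the post.
no_aa_single_tex = samples_per_clock_per_pipe(1, textures=1, tmus_per_pipe=1)
msaa2_dual_tex = samples_per_clock_per_pipe(2, textures=2, tmus_per_pipe=1)
print(no_aa_single_tex, msaa2_dual_tex)
```

    Under this simplified model, dual texturing with 2x MSAA asks each pipe for the same number of samples per clock as single texturing without AA, so the back end is not pushed any harder than in the plain single-texture case.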

    Sure, it probably wouldn't be nonexistent. But saying it's 'free' actually makes semi-sense, considering that term probably came from marketing...

    Rev: I agree fully with you :) For me, Rampage is an interesting part. I don't care about whether it'd have crushed the NV20 or not, and whether it'd have been released on time, and so on. That's history. Some of the techs in it, even if often outdated or already implemented/modified in current products, are interesting, though.


    Uttar
     
  2. Tagrineth

    Tagrineth murr
    Veteran

    Joined:
    Feb 14, 2002
    Messages:
    2,537
    Likes Received:
    25
    Location:
    Sunny (boring) Florida
    OK, I'm game.

    R300 uses two 128-bit busses, and NV3x uses four 64-bit busses to achieve 256-bit-ness.

    NV2x uses four 32-bit busses, by the way. 8)

    Textures are actually repeated less than you may expect. SLI evolved beyond the old even, odd scanline V2 days... even V5 is able to use bands up to 64 pixels thick, and Rampage increased that to 128. Textures would in fact very often fall within those bands, allowing them to be stored only in one 'repository'.

    Nobody I know has any idea what you're talking about. 8)

    See also Uttar's post, probably right above this one.

    Again, see Uttar's post. Plus you said yourself that GP-1 is low-specced - it's something like one pixel pipeline at 100MHz, sorta like Neon-250. And whaddya know, it's actually a bit faster than N250. Cool!

    See Uttar's post 8)

    On its own, though, yes, Spectre would've taken a pretty good hit with MSAA...

    No serious IHV would ever use a flawed, obviously bad quality AF implementation officially in its drivers... nor would they not allow you to use trilinear filtering no matter how hard you try... OH, WAIT! BUT THAT IS DONE! :eek:

    The visual error can be totally unnoticeable if you control the settings carefully enough. People on their Voodoo5's have gained 20-30 fps (up FROM ~40fps) with careful tweaking of the unlocked HSR settings. What makes you think that would be impossible to improve with some professional tweaking from people who actually KNOW what the settings mean, rather than blindly experimenting?
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    A flawed HSR technique can result in much more massive problems than a flawed anisotropic filtering technique. Specifically, if the algorithm screws up on the HSR, triangles will go a-missing.
     
  4. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Triangle setup is still not completely shared between chips.

    Typical answer I'd expect from any of the ex-3dhq wizards.

    What about it? NV20 had the same multitexturing fillrate as dual chip Spectre and the only other significant advantage over it would have been the filter on scanout trick, which by the way works under conditionals too even on NV25. It saves bandwidth only AFAIK when the average performance rate doesn't drop below 2/3rd of the resolution refresh rate.

    Rampage barely made it into OGL debug mode, so other than theory there are no real performance numbers to refer to (even off the record), let alone for a Rampage+Sage combination.

    The peak numbers touted for each were Sage 50MVPS vs 40MVPS on NV20. Where each of the two would have been able to sustain a higher number in reality is open to everyone's imagination, yet the NV20's "texture shader units" weren't particularly weak either, rather the contrary.

    Series2 used Supersampling. Go and recheck your facts up to which resolution SSAA was possible on Series2 and 3 (locked in drivers) and maybe it might ring a bell why SSAA can only be bandwidth free on a TBDR. Again MSAA is fillrate and bandwidth free on a TBDR, else explain how the hell it managed to even get a decent framerate for its specs in 1280*1024 with AA on with a meager 200MTexels/sec fillrate.

    In terms of Supersampling how many fps did the K2 yield with 4xSSAA on in the same game in 32bpp and how many a dual chip V5? 350MTexels/sec for the K2 and 667MTexels/sec for the V5.
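
    For reference, the raw arithmetic behind that comparison can be sketched in a few lines of Python. The helper name is made up for illustration, and the model is deliberately naive: it only divides the quoted texel fillrate by the number of supersamples and ignores overdraw, bandwidth and everything a TBDR does differently, which is precisely the part the question above is pointing at.

```python
def per_sample_fillrate(mtexels_per_sec: float, ssaa_samples: int) -> float:
    """Texel fillrate left per output pixel once every pixel is rendered ssaa_samples times."""
    return mtexels_per_sec / ssaa_samples


# Fillrate figures as quoted in the post: 350 MTexels/s (K2), 667 MTexels/s (dual chip V5).
k2 = per_sample_fillrate(350.0, 4)
v5 = per_sample_fillrate(667.0, 4)
print(f"K2: {k2} MTexels/s, V5: {v5} MTexels/s per output pixel at 4x SSAA")
```

    On paper the V5 keeps close to twice the fillrate per output pixel, so any similarity in measured frame rates would have to come from factors this naive division leaves out.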


    3d-FX ---> GF-FX :p

    Seriously though, NV3x have different anisotropic filtering modes, where it's left to the user's choice which mode he prefers, instead of being restricted to a mediocre solution that operates only in conjunction with MSAA.

    They dropped geometry when the rendering took too long. Anyone with a basic understanding how the q3a engine's BSP operates and what true hardware HSR can or cannot do would know the difference.
     
  5. Heathen

    Regular

    Joined:
    Jul 6, 2002
    Messages:
    380
    Likes Received:
    0
    You sure about that? :wink:
     
  6. Tim

    Tim
    Regular

    Joined:
    Mar 28, 2003
    Messages:
    875
    Likes Received:
    5
    Location:
    Denmark
    I don't know if she is sure, but I do know that she is wrong: R3x0 is 4*64bit.
     
  7. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,320
    Likes Received:
    525
  8. Tagrineth

    Tagrineth murr
    Veteran

    Joined:
    Feb 14, 2002
    Messages:
    2,537
    Likes Received:
    25
    Location:
    Sunny (boring) Florida
    True. How much space does that burn up, then? Can't be *that* bad... or am I wrong?

    'kay. =P

    Recursive texturing and massive way-ahead-of-its-time bandwidth not an advantage?

    SAGE was pretty well known for being efficient with added lights (much like Naomi2's Elan) at the time. I wouldn't be surprised if some of its more creative optimisations are already in NV3x's stunningly fast fixed-function TCL... but it would have to be, considering it was relying on the AGP bus for most of its bandwidth (I think 3dfx put it at a realistic, sustained 10 MPolys/sec, but I could be a few million off).

    That wasn't what I was referring to. I was referring to any performance at all.

    Both were playable with 4x AA at 800x600. K2 doesn't let you go any higher without dropping to 2x... but K2 and V5's frame rates are never that different, with or without AA. K2 is generally slightly faster, though not by much.

    Recursive texturing and an intelligent cache make Rampage do AF with a much lower bandwidth hit than usual, plus the TMUs are capable of trilinear per tick, unlike NV2x.

    You could do the same thing in many strictly front-to-back games. And AFAIR Rampage(/SAGE) has some kind of ordering assist, not sure if it's in the driver or hardware level, but I'm 90% sure it's possible.


    I pulled that from memory, oh well, my point was that I know that they aren't truly 256-bit either.
     
  9. Althornin

    Althornin Senior Lurker
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,326
    Likes Received:
    5
    You seem to be missing a major point here....
    They ARE truly 256bit - Rampage wasn't. Look, in one case, you have a 256bit bus to one chip (undeniable) that is broken into 4 64bit channels simply for efficiency - they can be used (IIRC) to transfer either 4 64bit datums, or 2 128bit datums, or one 256bit datum, or some combo of the above, in each cycle.
    In the other case, you are trying to say that two entirely separate 128bit busses to two separate chips are a 256bit bus. Please tell me you get the difference.

    In one case, it's like you have 4 train tracks going to one station - each can carry a load, and they can work together to carry a bigger load. In the other, you've got one train going to one station, and the other going to another station....
     
  10. aths

    Newcomer

    Joined:
    Feb 8, 2002
    Messages:
    128
    Likes Received:
    3
    Location:
    Germany (at the Baltic Sea)
    Maybe better to say "1xAA" ;) because it calculates just one subpixel.

    NV30 can support 8 texel or "zexel" per clock, not 16 "zexel". The table shows that NV30 can write 16 z-samples, of course only with 2x AA, because with 2x AA the 16 z-samples belong to 8 "zexel". To process a Z-pass without textures, more work is needed than just the Z-test.
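
    The relationship between "zexel" and z-samples described above can be written down as a small Python sketch. The function name and the idea of a hard cap of 16 z-samples per clock are assumptions for illustration, based only on the figures quoted here (8 "zexel" per clock, 16 z-samples with 2x AA); this is not a description of the actual hardware logic.

```python
def z_samples_per_clock(independent_zexels: int, msaa_samples: int, sample_cap: int = 16) -> int:
    """Z-samples written per clock: each independent zexel carries msaa_samples samples,
    limited by an assumed cap on how many samples the back end can write per clock."""
    return min(independent_zexels * msaa_samples, sample_cap)


# NV30 figures as quoted in the post: 8 independent zexel per clock.
print(z_samples_per_clock(8, 1))  # no AA
print(z_samples_per_clock(8, 2))  # 2x AA
print(z_samples_per_clock(8, 4))  # 4x AA, limited by the assumed cap
```

    In this model the 16 z-samples only show up once AA is enabled, because without AA there are only 8 positions to test per clock.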


    According to Demirug, the GeForces probably only discard complete 2x2 tiles with their Early Z.


    I honestly doubt that. The additional gain in edge smoothing drops off significantly after 4x RG, 6x sparse or even 8x sparse. I think 8x is the maximum for the foreseeable future.
     
  11. ram

    ram
    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    218
    Likes Received:
    0
    Location:
    Switzerland
    Aths, I think you missed my point. The question was why NV30/35 can write 8 independent zixels, while NV2x can only write 4 of them. FX can sample and write 16 zixels per clock, but the latter only with MSAA. Assuming the ROPs are decoupled from the rest of the pipeline, the question remains what part of the chip prevents it from handling 16 >independent< zixels.

    Among other things, the purpose of a ROP is to do framebuffer blending operations and z-/stencil-tests. Obviously, there is enough logic to handle 16 color, z-/stencil values / clock, as 4X MSAA is possible without a fillrate drop if there is no bandwidth bottleneck.
    Obviously. The question is what exactly has been enhanced, if not the ROPs. And what exactly limits the ROPs from handling 16 independent zixels. The limitation to 8 independent zixels could still be just a limitation in the ROPs, which might have something to do with the number of read ports at the ROPs. If the ROPs are organized as 4x4, there might be only read ports for four x/y-values (+offset/aa mode), which have been enhanced to eight with GFFX. Another place would be the samplers, which can only output 8 x/y values. Or even a limitation with z-buffer compression logic.
     
  12. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Are you familiar with the term redundancy at all?

    As I said Tilers had loopback functions way before that and nowadays chips have advanced even beyond that. As for bandwidth I'd rather have a very effective bandwidth saving technique (if not a full TBDR str8 away) with 10GB/sec raw bandwidth, than 30GB/sec raw bandwidth with no advanced bandwidth saving techniques at all.

    Fixed function what on NV3x's? LOL :D What on God's green earth do we have Vertex Shaders for nowadays?


    I love it when people try to elegantly flip out of it......

    With the only other difference that you needed two chips to achieve what a single chip TBDR was able to. KYRO2 was able for 4xOGSS up to 1024*768 and 2x Vertical up to 1280*1024; where the K2 supposedly flips back to 2x in 1024 is beyond me. Shall I scratch another one on the scoreboard milady ;)

    With the only other difference that it hardly deserves the term AF as a starter. Did you know that Xabre has antialiasing for free? 1x AA over the whole scene which is effectively what I could call in that sense AA for free too.

    In wild imaginations anything is possible.

    Gee my oh my.... :roll:
     
  13. aths

    Newcomer

    Joined:
    Feb 8, 2002
    Messages:
    128
    Likes Received:
    3
    Location:
    Germany (at the Baltic Sea)
    "Because" NV30 can work as 4x2 or 8x0. Afaik, you need some additional interpolators to write more Z-values, but I have no in-depth knowledge of actual Z-logic.

    So NV25 also can :) but only with 4x MSAA, and without textures.

    Some simplifications in the control logic, I assume. (Or maybe the Triangle Setup cannot handle 16 "pixel" or "zixel" per clock anyway?)

    Who still plays with AA disabled? Your question is interesting indeed, but I think this is not a serious limitation of NV30.
     
  14. ram

    ram
    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    218
    Likes Received:
    0
    Location:
    Switzerland
    AFAIK, assuming no bandwidth bottleneck, NV25's ability to sample and write 16 (partly dependent) zixel/clock in the 4XMSAA case isn't limited by the usage of textures. Basically, there could be up to two textures per pixel. It's the same with the NV30/35.
     
  15. aths

    Newcomer

    Joined:
    Feb 8, 2002
    Messages:
    128
    Likes Received:
    3
    Location:
    Germany (at the Baltic Sea)
    Yeah, but there is a serious bandwidth bottleneck. 4x MSAA with 32-bit rendering and textures means 16 32-bit subpixels (512 bit) alone. In fact, NV25 doesn't need its 16 ROPs.
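
    The 512-bit figure can be checked with a short Python sketch. The function names are invented for illustration, and the bus comparison rests on an assumption that is not stated in this thread: a 128-bit DDR memory interface moving two transfers per memory clock. Core and memory clocks are also treated as equal, so this is only an order-of-magnitude comparison.

```python
def subpixel_bits_per_clock(pixels_per_clock: int, msaa_samples: int, bits_per_sample: int) -> int:
    """Bits of subpixel data produced per clock if every sample were written out."""
    return pixels_per_clock * msaa_samples * bits_per_sample


def bus_bits_per_clock(bus_width_bits: int, transfers_per_clock: int) -> int:
    """Bits the memory bus can move per memory clock."""
    return bus_width_bits * transfers_per_clock


# 4 pixels per clock, 4x MSAA, 32-bit samples, as in the post.
needed = subpixel_bits_per_clock(4, 4, 32)
# Assumption: 128-bit DDR interface (two transfers per clock).
available = bus_bits_per_clock(128, 2)
print(needed, available, needed / available)
```

    Under these assumptions, the subpixel data alone would need about twice what the bus can carry per clock, before counting Z traffic and texture fetches.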
     
  16. ram

    ram
    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    218
    Likes Received:
    0
    Location:
    Switzerland
    Right. I'd say it needs at least parts of them to do the 16 z-compares/clock. As 16 z-writes aren't needed in practice, the write ability is probably limited to 8 or even just 4 zixel/clock with the NV25. It should be possible to check that with one of the benchmarks doing z-only passes, 4xAA, B2F rendering @ halved core clock.
     
  17. aths

    Newcomer

    Joined:
    Feb 8, 2002
    Messages:
    128
    Likes Received:
    3
    Location:
    Germany (at the Baltic Sea)
    8 ROPs are enough to provide the fillrate for a Z-pass with 2x AA enabled.
     
  18. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    There's no reason that NV25 would be designed to do 16 Z-compares per clock because there's not enough bandwidth.
     
  19. aths

    Newcomer

    Joined:
    Feb 8, 2002
    Messages:
    128
    Likes Received:
    3
    Location:
    Germany (at the Baltic Sea)
    Nvidia states that NV25 can calculate 16 AA samples per clock. With MSAA, every subpixel has to be z-tested, so I deduce that NV25 has 16 ROPs. Indeed, there is not enough bandwidth to practically utilize this.

    But integer logic like a ROP seems to be small, on the other hand, and NV can advertise an insane AA fillrate. NV cut out the possibility of calculating 2 AF samples per pipe on NV25. So bilinear AF costs as much performance as trilinear AF (which means lowering the fillrate by 75%).
     
  20. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    They also claimed that 4x AA was free on the NV3x, and this hasn't been shown to be the case. In fact, the results shown in this thread show that only 2x AA is "free" on the NV3x.
    It doesn't matter how large you think it is: you don't over-engineer things on principle.
     