The Official NVIDIA G80 Architecture Thread

Discussion in 'Architecture and Products' started by Arun, Nov 8, 2006.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    There's a fair number of niggly errors, definitely v1.0-itis, but still very readable.

    Jawed
     
  2. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
Has anyone noted these quite moderate occlusion results:

[occlusion benchmark images no longer available]
     
    #82 fellix, Nov 9, 2006
    Last edited by a moderator: Nov 9, 2006
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Those numbers (I'm guessing which you mean, since they aren't actually showing) on their own don't seem to mean much as far as I can tell.

    I do wonder, though, if this might be why G7x is so poor at Oblivion's foliage?

    Jawed
     
  4. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Err, isn't it 8 quads to a batch?

    Can it reorder batch elements on the fly? I believe Chalnoth threw that out, and with 2x16 it ought to be *possible* (though complex) to swap a single 16 section around. I suppose the test for that is fairly simple -- two consecutive branches corresponding to, say, even 16x2 regions and odd 16x2 regions, respectively. I wouldn't expect it to manage that, but it would be cute if it did.

    Is the 32 batch size dictated by 8 texture units x 4 color components? That would seem to imply vector instructions issued wide, rather than serialized, which seems incredibly unlikely to me. Or is there something important about keeping a single texture unit across a whole quad? /me lost :(
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    HSR speed can be determined quite accurately from Humus' GL_EXT_reme benchmark. Just subtract the render times (i.e. 1/fps) between the 8x overdraw and 3x overdraw (front to back) and multiply by the core clock. Then divide 5 times the screen res by that number and you get the rejection rate per clock.

    I've found it works really well since R300, maybe even earlier.
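    The calculation above can be sketched in a few lines. This is just Mintmaster's arithmetic written out; the benchmark figures in the example are made up for illustration, not real GL_EXT_reme results:

    ```python
    def rejection_rate_per_clock(fps_3x, fps_8x, core_clock_hz, width, height):
        # Render-time difference attributable to the 5 extra overdraw layers
        # (8x overdraw minus 3x overdraw, front to back).
        extra_time = (1.0 / fps_8x) - (1.0 / fps_3x)
        # Convert that time into core clock cycles.
        clocks = extra_time * core_clock_hz
        # 5 extra fullscreen layers' worth of pixels were rejected.
        rejected_pixels = 5 * width * height
        return rejected_pixels / clocks

    # Hypothetical figures: 200 fps at 3x overdraw, 100 fps at 8x,
    # 575 MHz core, 1024x768 screen.
    rate = rejection_rate_per_clock(fps_3x=200.0, fps_8x=100.0,
                                    core_clock_hz=575e6, width=1024, height=768)
    ```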
     
  6. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I looked further into the papers on the interpolator/SF unit by Stuart Oberman et al., and I was partially wrong, but so was psurge, as far as I can see :) It's really 4 units per cluster, each capable of interpolating an attribute for one quad of pixels per clock, or of performing a minimax quadratic approximation (3 hybrid iterations) for one value for one pixel per clock.

    So our values were definitely right, and all the units are shared between interpolation and SF; but the hardware internally works on quads, and uses some really smart tricks to reuse that for the 3-hybrid-iteration approximation for SF. Clearly, some hardware is idle in either case (SF or attribute interpolation), but overall it still looks like a very efficient tradeoff.

    As for HSR and HierZ, I got a few ideas on how that works internally, but I still need some testing time.


    Uttar
     
  7. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    Because branching doesn't work like we want under GLSL with ATI... They try to unroll loops and predicate branches in all public drivers. We have DX versions of the test, they used to behave, but there was a subtle change in the DX spec that is breaking the test for *both* vendors, so we need to figure out how to fix that...
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
  9. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    Isn’t this the logical result of having scalar units?
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yeah - but I'm not convinced Uttar is thinking through the rates correctly.

    There are 16 attribute interpolation "pipes", but only 4 SF pipes per cluster. 4 interpolation pipes are ganged per SF pipe.

    Jawed
     
  11. ector

    Newcomer

    Joined:
    Nov 3, 2002
    Messages:
    111
    Likes Received:
    2
    Location:
    Sweden
    If you look up so many gradients, why not precompute them into a second volume texture?
     
  12. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I wasted the last 3+ hours digging through the Stuart Oberman papers, so I'm pretty confident I got them right now, actually... :)

    You have 4 units per cluster. Each of them can interpolate 1 scalar attribute per clock for one quad (->4 scalar attribute interpolations going on there, or 16 total for the cluster) OR apply a special function to one scalar value for one pixel, per clock.

    Since the local scheduler's minimum granularity is 16 objects, I would assume that it sees those four "units" as a single entity, and a SF on 16 objects would take 4 cycles in its eyes, while a scalar interpolation of one attribute for 16 objects would take only one cycle.
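    A toy model of that throughput claim, assuming (as speculated above, not confirmed anywhere official) 4 units per cluster, quad-wide interpolation, and per-pixel SF:

    ```python
    UNITS_PER_CLUSTER = 4   # assumed unit count from the discussion above
    BATCH = 16              # scheduler's assumed minimum granularity

    def cycles_for_batch(op):
        # Interpolation: each unit handles one quad (4 pixels) per clock,
        # so 4 units cover all 16 objects in a single cycle.
        # Special function: each unit handles one pixel per clock,
        # so 16 objects need 16 / 4 = 4 cycles.
        objects_per_cycle = {
            "interp": UNITS_PER_CLUSTER * 4,  # one quad per unit
            "sf": UNITS_PER_CLUSTER * 1,      # one pixel per unit
        }[op]
        return BATCH // objects_per_cycle

    # interp -> 1 cycle per 16-object batch, sf -> 4 cycles
    ```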


    Uttar
     
  13. CouldntResist

    Regular

    Joined:
    Aug 16, 2004
    Messages:
    264
    Likes Received:
    7
    One question:
    Pixel shaders work in quads, but vertex shaders don't. Does this mean that the "special functions" will be 4 times more expensive when used in vertex shader than in pixel shader?
     
  14. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    No, they will not.
     
  15. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Tell me about it, just something else, if the g80 was planned for last year, what else does nV have planned?
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I invested the time a few weeks back :wink:

    Agreed. Actually, to be fair, it's Rys's diagram:

    http://www.beyond3d.com/reviews/nvidia/g80-arch/image.php?img=images/g80-diag-full.png

    which I think may be confusing things, at "10". The SF and interpolation rates are the same (i.e. the pipeline is equal in length for both), but it's the 3 auxiliary paths dedicated to interpolation that result in the 4x multiplier - rather than SFs taking 4x longer to compute.

    It's clearer to think of 4 quad dedicated interpolation pipes, with each quad dependent upon a single SF pipe.

    Yeah.

    Jawed
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Vista's release date has been pushed way, way back. ATI and NVidia should both have been planning on releasing something a long while back. What they've each done with the extra time, who knows?...

    It's also worth pondering how recently stuff was last added into (or more likely) taken out of D3D10.

    Jawed
     
  18. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    True, but if ATi is 4 months behind right now, I don't think we have to look too far to see where they were along the way, if nV's problem was due to chip size. Which, in all honesty, doesn't really make much sense to me: if it was chip size, that means yields of the G71 weren't as good as nV was originally saying.......

    But that doesn't make much sense either, because their net profits are through the roof even with the increased recall numbers.
     
    #98 Razor1, Nov 9, 2006
    Last edited by a moderator: Nov 9, 2006
  19. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Jawed, I'm not sure where you got that impression, but it looks horribly wrong to me.
    As reference, I'll take slide 33 of this presentation: http://rnc7.loria.fr/oberman_invited.pdf (which is a really nice presentation btw, fwiw, nice overall architecture/algorithms summary.)
    Slide 31 is also of some relevance. And if you actually want the original paper, it's all there still: http://66.102.9.104/search?q=cache:...ction+arith17&hl=en&ct=clnk&cd=1&client=opera
    And there's also the original Stuart Oberman paper on enhanced minimax quadratic approximation, if the problem is that you aren't sure how the algorithm works internally. It's basically the same for the multifunction unit, just with some smart sharing.

    What you don't seem to be realizing is that what you call the "auxiliary" paths are actually (partially) used for SF. But SF needs to do 3 iterations (it's not strictly three iterations; the second iteration is very specific, as well as some other things, which is why the paper calls it "three hybrid passes" - the details are available on page 3 of the paper if you feel like wasting some time...) - so, it makes use of two of the "auxiliary" paths for this. Considering there is at least one other "auxiliary" path available, it's easy to see that the algorithm could be extended to 4 iterations if needed for higher precision in the future, although the only use of that would be GPGPU, imo. None of the papers even allude to that possibility.

    Rys' diagram, just like mine, assumes 16 units that need 4 clocks for SF and 1 for interpolation, but clearly it's 4 units (or one bigger unit, from the scheduler's POV, as I said above) that do one pixel quad of interpolation or one SF per clock. At least as far as I can see, of course.


    Uttar
    P.S.: I'm now 99% sure that the MUL is doing Special Function setup (to put the values in range). The patents also clearly hint at the MUL functionality of the multipurpose ALU being put to use for that. Finally, I think that, except for CUDA, it makes sense to only expose the MUL when you're doing SF, as the MUL would be idling 3/4 of the time when it has to set up a sincos etc., since the SF couldn't keep up. It'd be interesting if they could expose it more generally in the future though, especially in the VS - hmm.
     
  20. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    What errors have you spotted?
     