The Official NVIDIA G80 Architecture Thread

Discussion in 'Architecture and Products' started by Arun, Nov 8, 2006.

  1. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    I'd be willing to bet that there's some difficulty fetching enough registers to go wider within the cluster. If you want to tweak something in there, you add more texture addressing units, or decrease the issue width to a quad. If the R600 really winds up being good at either modifying issue width or TLP, then the G80 arch is going to need some revisiting for gpgpu competitiveness.

    I'm not really expecting any of that in G81, though. If I wanted to "get crazy", I'd go with just a straightforward scaling. A 384->512-bit bus and 900->1200 MHz memory would imply (assuming the bandwidth:computation balance doesn't change) 10ish clusters at 750/1800. Of course, the chip wouldn't be any smaller at 80nm than G80 at 90nm if they did that, and I doubt it would run cooler either! Unless G80 has a couple of clusters in there for yield purposes that they can unlock at 80nm, I don't think I would expect even that.
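
    A rough back-of-envelope check of that "10ish" figure, as a Python sketch. The baseline of 8 clusters at a 1350 MHz shader clock is not stated in the post; it is assumed here from the shipping 8800 GTX. Reading the two ratios as bus width (bits) and memory clock (MHz) is also an assumption, and the function name is purely illustrative.

```python
def clusters_for_balance(bus_old, bus_new, mem_old, mem_new,
                         clusters_old, shader_old, shader_new):
    """Clusters needed to keep the bandwidth:compute ratio unchanged."""
    # Memory bandwidth scales with bus width times memory clock.
    bandwidth_scale = (bus_new / bus_old) * (mem_new / mem_old)
    # Shader throughput is taken as clusters times shader clock.
    compute_old = clusters_old * shader_old
    return compute_old * bandwidth_scale / shader_new

# Assumed baseline: 8 clusters at 1350 MHz (8800 GTX), target shader clock 1800 MHz.
estimate = clusters_for_balance(384, 512, 900, 1200, 8, 1350, 1800)
print(round(estimate, 2))
```

    Under those assumptions the estimate lands between 10 and 11 clusters, which is consistent with "10ish".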

    Of course, another possible rabbit is 65nm. Hmm, nah. Kirk hasn't been out vigorously denying 65nm.... ;-)
     
  2. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    It's not like anyone would bother to revisit the architecture just for GPGPU purposes IMO. That doesn't sell any significant amount of cards.
     
  3. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    Well, I know that Orton said he expects physics, which is a subset of gpgpu, to shift significant amounts of product by mid-2007. I have no reason to think NVIDIA doesn't see it similarly.
     
  4. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    The physics will be rather sparingly used in games (lowest common denominator, yadda, yadda...), so I think that won't really be a criterion, even if ATI should do it, say, twice as fast. It'll be about the "common" gaming usage as always: shader perf, texturing etc. My 2 cents.
     
  5. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    Well, I didn't mean it in a competitive analysis fashion (tho possibly Orton did!). Just that certainly ATI, and probably NV, are thinking it's enough of a difference maker from a sales and marketing pov to be worth revisiting architectures over.
     
  6. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I'd be willing to bet you just implied the R520->R580 transition never existed, and that ATI redesigned the entire chip in less than 3 months!
    Hmm :)


    Uttar
     
  7. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    That move came at the price of a wider batch and poorer dynamic branching (dbr) scaling, though.
    If R600 really turns out to have exceptional dbr performance, going wider in G81 is going to be ... interesting.

    Consider an alternative. "The missing MUL" would require 2 operands over 16 channels to activate. If, instead, the 8x2 shaders became 8x3, the additional 8 shaders per cluster would require a max of 3 ops each. That's actually a lower maximum in the way of bandwidth requirements (not counting sfu). Given that activation of the MUL isn't going to raise performance nearly as much in reality as on a gflop scoresheet, I know which one I would wish for :)
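
    The operand tally behind that comparison can be written out as a small sketch. The counts (16 channels, 2 operands for the MUL, 8 extra shaders with at most 3 operands each) come from the paragraph above; the helper name is hypothetical, and the SFU is left out just as the post does.

```python
def operand_fetches(units, operands_per_unit):
    """Worst-case register operands fetched per issue for a group of units."""
    return units * operands_per_unit

# Option A: enable the "missing MUL" across all 16 channels, 2 operands each.
missing_mul = operand_fetches(16, 2)
# Option B: widen 8x2 to 8x3, i.e. 8 extra shaders with up to 3 operands each.
extra_alus = operand_fetches(8, 3)

print(missing_mul, extra_alus)
```

    On this simple count the extra ALUs need fewer operand fetches in the worst case than the MUL does, which is the point being made.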
     
  8. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    54
    Location:
    Canada
    You keep making these bold claims. How do you know this is true? Haven't some instructions on G80 gone from single cycle to 4 cycles? Keep in mind that G80 has to also share the 128 ALUs for vertex shaders.
     
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,110
    Location:
    New York
    The worst case scenario for a vec3+scalar is to co-issue two scalar instructions => 50% ALU utilization where G80 would be at 100% utilization. They aren't really guesses - these things are pretty obvious. It doesn't have anything to do with cycles/instruction either. Now what's also obvious is that utilization on a vec3+scalar arrangement may not be bad at all on average.
     
  10. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    Why's that?

    I don't believe the register file and ALUs are organized like a junior-high dance: all the registers on one side and all the ALUs on the other. I'd expect them to be paired pretty tightly -- some subset of the register file is local to each unit in the SIMD array. This assumes that each vertex/pixel is permanently assigned to one ALU + register segment so that most accesses are local. That way the total size and bandwidth of the register file scales with ALUs, but the port requirements per segment are constant.
     
  11. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    The worst case is a scalar instruction that could not pair with any other instruction.
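
    To put numbers on the cases from the last few posts, here is a minimal sketch that just counts busy lanes on a 4-lane vec3+scalar ALU. The 4-lane model and the function name are illustrative assumptions; a purely scalar design is 100% busy by construction in this kind of count.

```python
def utilization(lanes_used, lanes_total=4):
    """Fraction of ALU lanes doing useful work in one issue slot."""
    return lanes_used / lanes_total

cases = {
    "vec3 + scalar co-issued": utilization(3 + 1),
    "two scalars co-issued": utilization(1 + 1),
    "lone scalar, nothing to pair": utilization(1),
}
for name, value in cases.items():
    print(f"{name}: {value:.0%}")
```

    This only counts lanes per issue slot; it says nothing about how often each case occurs in real shaders.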
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,110
    Location:
    New York
    Yaar.
     
  13. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Yeah, so did I, until I saw that patent Jawed threw around.
    Regardless, 90->80 doesn't give you a lot of room. Doubling ALU count + register width sounds like a lot of transistors to me, so I was keeping one of them constant....
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The worst case on G80 is 0% utilisation of the primary ALU:

    DP4 r0.w, r1, r2
    RSQ r0.w, r0.w
    MUL r3, r4, r0.w

    the MUL has to wait until the RSQ has completed (which has to wait until the DP4 has completed). So while the SF ALU is working on the RSQ, the primary ALU is idle, which is four clocks.
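
    The serial chain in that snippet can be made explicit with a small read-after-write check, sketched below. It is a simplification: registers are matched by exact name (so r0 and r0.w would not be treated as overlapping), and it does not model G80 timing or clock counts at all. The tuple layout and function name are hypothetical.

```python
# Each instruction: (opcode, destination register, source registers).
program = [
    ("DP4", "r0.w", ["r1", "r2"]),
    ("RSQ", "r0.w", ["r0.w"]),
    ("MUL", "r3", ["r4", "r0.w"]),
]

def raw_dependencies(instructions):
    """Return (consumer_index, producer_index) pairs for read-after-write hazards."""
    last_writer = {}
    deps = []
    for index, (_, dst, srcs) in enumerate(instructions):
        for src in srcs:
            if src in last_writer:
                deps.append((index, last_writer[src]))
        # Record the write only after checking reads, so RSQ sees DP4's result.
        last_writer[dst] = index
    return deps

deps = raw_dependencies(program)
print(deps)
```

    Every instruction depends on the one directly before it, so nothing within a single thread can overlap; any hiding of the latency has to come from other threads.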

    16 fragments x 4 clocks looks unhealthy in comparison with a vec3+SF GPU: 4 batches, each of 4 fragments x 1 clock (both GPUs considered as having 16-SIMD ALUs). The latter is 4x faster on the MUL. But that's an extreme case, generally G80 wins out significantly.

    Jawed
     
  15. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    Did you actually measure the performance of that code on a G80? You'd be limited to 24 pixels/clock from the ROPs, not from the shader.

    If you repeat the same piece of code 100 times (say), and if the compiler couldn't optimize any of it away, you're still looking at 8 clocks (corrected from 5) to run one iteration per ALU.

    Edit: Minor clarification.
    Edit2: Fixed scalar vs vector MUL confusion.
     
    #335 Bob, Dec 21, 2006
    Last edited by a moderator: Dec 21, 2006
  16. reltham

    Newcomer

    Joined:
    Nov 21, 2006
    Messages:
    14
    Likes Received:
    0
    Location:
    San Diego
    Why do you think it has 16 shaders tied to the same instruction? Isn't G80 MIMD?

    Also, your example is contrived and probably could easily be prevented at final compile time in the driver... not to mention the scheduling could hide it.

    Maybe I misunderstood...
     
  17. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,502
    Likes Received:
    24,397
    Yes. We're talking worst case for ALU utilization.
     
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,110
    Location:
    New York
    I don't think I fully get the example (brain is fried - just got home from work) but couldn't you contrive something similar to get similarly poor utilization of the vec3? Also, if I understand what you're trying to demonstrate, this 4-clock idle time during the RSQ is specific to G80's implementation, not to scalar architectures in general.
     
  19. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    Normally, I would put this in an R5xx or R600 thread, but everything (which isn't much) that is said about them is already pretty much established by now.
    What I didn't know was that Nvidia apparently had "something up the sleeve" in March.

    http://www.vr-zone.com/?i=4400

    Since they said that immediately after the R600 bit, we can assume it's a response to the former, and not G84/G86.
    What's the word on this thing: an overclocked 90nm G80, an 80nm die shrink and overclocked G80-based design (a la G70->G71), or a real design improvement (this last one seems a bit too soon to be true)? Is this even credible?
    I was under the impression that any response to R600 would come no sooner than June through September 2007.
     
    #339 INKster, Dec 21, 2006
    Last edited by a moderator: Dec 21, 2006
  20. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,502
    Likes Received:
    24,397
    Maybe that's when Nvidia will unleash the true power of the G80 with magical drivers? Could also be a shrink to 80 nm with GDDR4, seeing as March would put G80 at 4-5 months old.
     