The Official NVIDIA G80 Architecture Thread

Discussion in 'Architecture and Products' started by Arun, Nov 8, 2006.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Whoops, sorry, I went to bed before realising that the MUL performs the same on both GPUs - since G80 is "4x wider" per clock and 4 clocks per fragment is the "normal" rate for a vec4 operation anyway.

    Jawed
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Yeah the vec3+SF architecture is the same speed on the MUL and yes, specific to G80's implementation.

    Jawed
     
  3. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    You think G80 is devoid of any intelligence in its scheduling? We already know it can operate on multiple batches simultaneously, and when the SF unit is used for interpolation during texturing you aren't losing ALU cycles.

    DP4 takes 4 cycles, MUL takes 4 cycles, so the 4 cycles for the RSQ can be hidden completely. One example (for 32 pixel batches):

    cycles 1-8: DP4 for batch 1
    cycles 9-16: DP4 for batch 2, RSQ for batch 1
    cycles 17-24: MUL for batch 1, RSQ for batch 2
    cycles 25-32: MUL for batch 2

    And so on. The primary ALU is at 100% for all cycles. Of course, G80 can handle a lot more than just 2 batches in flight, so we'd see a lot more batches interleaved than this simple example. After all, it can hide hundreds of clocks for texture latency.
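
    The interleaving above can be sketched as a toy scheduler. This is only an illustration of the example, not a description of how G80 actually issues instructions. It assumes two units (named "ALU" and "SF" here), 8 cycles per instruction per 32-pixel batch, a strictly dependent DP4 -> RSQ -> MUL stream per batch, and that the two units may issue from different batches in the same slot (the very point under dispute in this thread). The names `schedule` and `alu_utilization` and the lowest-batch-first policy are hypothetical.

```python
# Toy model of the schedule in the post above; NOT a description of real G80 hardware.
# Assumption: each instruction occupies its unit for one 8-cycle slot per 32-pixel batch.
SLOT_CYCLES = 8

# Assumption: a strictly dependent stream; RSQ runs on the SF unit, the rest on the primary ALU.
PROGRAM = [("DP4", "ALU"), ("RSQ", "SF"), ("MUL", "ALU")]


def schedule(num_batches, program=PROGRAM):
    """Return a list of slots; each slot maps unit -> (batch number, op) or None."""
    pc = [0] * num_batches  # index of the next instruction for each batch
    slots = []
    while any(p < len(program) for p in pc):
        slot = {"ALU": None, "SF": None}
        # Hypothetical policy: the lowest-numbered ready batch wins each unit.
        for batch, p in enumerate(pc):
            if p == len(program):
                continue  # this batch has finished
            op, unit = program[p]
            if slot[unit] is None:
                slot[unit] = (batch + 1, op)
        # Advance program counters only after the slot is decided, so a dependent
        # instruction cannot issue in the same slot as its producer.
        for issued in slot.values():
            if issued is not None:
                pc[issued[0] - 1] += 1
        slots.append(slot)
    return slots


def alu_utilization(slots):
    """Fraction of slots in which the primary ALU had work."""
    busy = sum(1 for s in slots if s["ALU"] is not None)
    return busy / len(slots)


if __name__ == "__main__":
    for i, s in enumerate(schedule(2)):
        first, last = i * SLOT_CYCLES + 1, (i + 1) * SLOT_CYCLES
        print(f"cycles {first}-{last}: ALU={s['ALU']} SF={s['SF']}")
```

    Under these assumptions, two batches keep the primary ALU busy in every slot, while a single batch leaves it idle during the RSQ slot, which is the single-stream dependency case discussed elsewhere in this thread.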
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    No, the primary ALU and the SF are bound together by co-issue, meaning they have to be from the same thread.

    Jawed
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    How do you know that it works that way?
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Because Bob didn't correct me.

    Jawed
     
  7. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    He did correct you, but you didn't seem to pay much attention to it... He implied this code would only take 5 clocks. As to whether he was thinking of a Vec4 MUL or of a Scalar MUL... :) (I do have my little idea of how you can get to either of those figures, anyway)


    Uttar
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,057
    Likes Received:
    3,114
    Location:
    New York
    :lol:
     
    #348 trinibwoy, Dec 21, 2006
    Last edited by a moderator: Dec 21, 2006
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    With logic like that, how could you ever be wrong? ;)
     
    #349 nAo, Dec 21, 2006
    Last edited: Dec 21, 2006
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    The entire point of my code sample was to show that a single stream of code (i.e. not something like an unrolled loop with implicitly parallel or partially overlapping loop iterations) with per-clock instruction dependency will cause the primary ALU to sit idle.

    It's an extreme case and Bob's hypothetical case doesn't answer the dependency issue I'm talking about. Bob's answer depends solely on co-issue, not on multiple-thread scheduling.

    Jawed
     
  11. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    Whoops. Fixed now.

    No it doesn't.

    There are 2 major cases where the primary ALU will be idle: some global hazard causes all threads to stall (texturing from system memory with poor locality, for example), or your shaders are very heavy on ALU1 ops, at a ratio of more than 1:4 scalars.

    For example, if you took your original shader and replaced all ops by RSQs, well now you'd have no ALU0 ops and so it will obviously be idle.

    There are other, smaller ones that can happen, but they tend not to be all that frequent in real life (and can be worked around to some extent by the compiler).
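
    As a rough back-of-the-envelope sketch of the 1:4 ratio, the function below estimates how busy ALU0 stays for a given mix of scalar ops. It rests on assumptions that are not confirmed in this thread: that one ALU1 (special-function) scalar op costs as much time as 4 ALU0 scalar ops, and that the scheduler overlaps the two units perfectly. The name `alu0_utilization` and the `sf_cost` parameter are hypothetical.

```python
# Back-of-the-envelope model; assumptions only, not measured G80 behaviour.
# Assumption: one ALU1 (SF) scalar op takes as long as 4 ALU0 scalar ops.
SF_COST = 4


def alu0_utilization(alu0_scalar_ops, sf_scalar_ops, sf_cost=SF_COST):
    """Estimate the fraction of time ALU0 is busy, assuming perfect overlap of both units."""
    alu0_time = alu0_scalar_ops
    sf_time = sf_scalar_ops * sf_cost
    total = max(alu0_time, sf_time)  # the slower unit sets the overall time
    if total == 0:
        raise ValueError("shader has no ops")
    return alu0_time / total


if __name__ == "__main__":
    # DP4 + RSQ + vec4 MUL: 8 ALU0 scalars and 1 SF scalar, a 1:8 ratio.
    print(alu0_utilization(8, 1))
    # A shader made only of RSQs: no ALU0 work at all.
    print(alu0_utilization(0, 3))
```

    In this model ALU0 only starts to idle once the mix has more than one SF scalar per four ALU0 scalars; the DP4/RSQ/MUL sequence sits at 1:8, and an all-RSQ shader leaves ALU0 with nothing to do.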
     
  12. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    Indeed!


    Edit: Need more sleep
     
    #352 Bob, Dec 21, 2006
    Last edited by a moderator: Dec 21, 2006
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    LOL, these scalars trip everyone up, eh?

    Jawed
     
  14. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,057
    Likes Received:
    3,114
    Location:
    New York
    Hey, I didn't say that. nAo did! :D
     