Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    Indeed, according to Nvidia's Nsight GPU Tracer on my RTX 2060, the FMA pipe has almost twice the load of the ALU pipe.

    [IMG]
     

    #521 Man from Atlantis, Aug 16, 2020
    Last edited: Aug 16, 2020
    Pete, Lightman, fellix and 6 others like this.
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Nice. Overall utilization is very bad though. 25% FP and 15% INT. All that idle hardware!!
     
  3. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    I'm not sure how Nsight calculates the SM instruction numbers, but SM occupancy seems fine at 100%.

    [IMG]
     
    Pete, pharma and BRiT like this.
  4. Benetanegia

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    I believe those are not utilization rates; they are the percentages that each kind of instruction represents of the total number of instructions. There are a few missing from the chart, like load/store, texture operations and who knows what else counts towards the total for the SM.

    As a hypothetical example, if an SM executed 4 different instructions every cycle, i.e. FP, INT, LD/ST and TEX, each one of them would appear as 25%. If one was underutilized it would get less than 25% and the rest would get a bit more, and so on.
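
    The hypothetical above can be sketched in a few lines of Python. This is only an illustration of the "share of total instructions" reading; the traces and the type names (FP, INT, LDST, TEX) are made up and do not come from Nsight.

```python
from collections import Counter

def instruction_mix(issued):
    """Return each instruction type's share of all issued instructions, in percent."""
    counts = Counter(issued)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}

# Hypothetical trace: one FP, INT, LD/ST and TEX instruction per cycle for 100 cycles.
balanced = ["FP", "INT", "LDST", "TEX"] * 100
print(instruction_mix(balanced))   # every type gets the same share

# Same trace, but TEX is only issued every other cycle.
skewed = ["FP", "INT", "LDST"] * 100 + ["TEX"] * 50
print(instruction_mix(skewed))     # TEX drops below 25%, the others rise above it
```

    Under this reading the shares always sum to 100%, so a low number for one type says nothing by itself about how busy that pipe is.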
     
    Pete, PSman1700, Remij and 7 others like this.
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    That's the normalized number which will always add up to 100% because it's relative to total issued instructions across all pipelines.

    I'm talking about the absolute stats which presumably represent percentage of used vs available instruction slots for each pipeline. These will not add up to 100%. They should ideally add up to over 200% given each issued instruction can keep some pipelines busy for many cycles e.g. SFU.
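
    A minimal sketch of the distinction being drawn here, assuming the "absolute" column really is used issue slots divided by available issue slots per pipeline (which is this post's presumption, not something confirmed by Nsight documentation). All counts and slot numbers below are invented for illustration.

```python
def absolute_utilization(issued, slots):
    """Used vs. available issue slots per pipeline, in percent (assumed definition)."""
    return {pipe: 100.0 * issued[pipe] / slots[pipe] for pipe in issued}

def normalized_share(issued):
    """Each pipeline's share of the instructions issued to the listed pipelines, in percent."""
    total = sum(issued.values())
    return {pipe: 100.0 * n / total for pipe, n in issued.items()}

# Hypothetical counters for a 1000-cycle window.
issued = {"FMA": 120, "ALU": 60, "FP16": 15, "SFU": 5}
# Hypothetical slot counts: FMA/ALU/FP16 accept an instruction every 2 cycles, SFU every 8.
slots = {"FMA": 500, "ALU": 500, "FP16": 500, "SFU": 125}

absolute = absolute_utilization(issued, slots)
normalized = normalized_share(issued)
print("absolute  :", absolute, "sum =", sum(absolute.values()))
print("normalized:", normalized, "sum =", sum(normalized.values()))
```

    The normalized values sum to 100% by construction, while the absolute values are independent of each other and can sum to anything.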
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    That's not true. There are clearly many areas of completely idle SMs (light gray) and active SMs with idle warps (dark gray).

    Also, warp occupancy isn't the same as ALU utilization. You can have lots of warps, but if they're all blocked by a memory request, for example, the ALUs idle as a result.
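
    To illustrate the point, here is a toy single-issue scheduler model in Python. It is a deliberately crude sketch, not a model of any real SM: the warp count, memory latency and instruction pattern are all made-up parameters. Every warp stays resident the whole time (100% occupancy), yet the ALU is mostly idle when the memory latency is long.

```python
def simulate(num_warps, mem_latency, alu_ops_per_load, cycles):
    """Toy model: each warp alternates one load (blocking it for mem_latency
    cycles) with alu_ops_per_load ALU instructions. At most one instruction is
    issued per cycle, from the lowest-numbered ready warp.
    Returns the fraction of cycles in which an ALU instruction was issued."""
    ready_at = [0] * num_warps   # cycle at which each warp may issue again
    alu_left = [0] * num_warps   # ALU instructions left before the next load
    alu_busy = 0
    for cycle in range(cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                if alu_left[w] > 0:
                    alu_left[w] -= 1
                    alu_busy += 1
                else:
                    # Issue a load: the warp is blocked until the data returns.
                    ready_at[w] = cycle + mem_latency
                    alu_left[w] = alu_ops_per_load
                break  # single issue per cycle
    return alu_busy / cycles

# All 8 warps are resident in both runs; only the memory latency differs.
print("long latency:", simulate(8, 400, 10, 10000))
print("no latency  :", simulate(8, 0, 10, 11000))
```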
     
    Kej, Krteq, Lightman and 2 others like this.
  7. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,679
    Does Nsight actually work with a lot of games? Do you just need to turn on GPU counters in the driver? I assume it won't work with multiplayer games with anti-cheat, but it would be nice to figure out where my bottlenecks are in particular games.
     
    Man from Atlantis likes this.
  8. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    I still think it’s going to be two 16-wide FP32 ALUs plus one 16-wide INT32 ALU per scheduler, with integer MACs maybe being performed by one of the FP32 units. Not sure what to expect from the consumer tensor cores or the ray tracing hardware, though.
     
  9. Benetanegia

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    No, I'm not talking about the normalized values. Those values are normalized for the 4 instruction types shown on the absolute column and thus those 4 types will always add up to 100%.

    The absolute numbers also likely add up to 100% when you include all of the instruction types, but there are many instruction types missing: like I said, load/store, texture ops, maybe RTX in some of those games, etc.
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    If that’s the case then it only lends more credence to low ALU utilization. The dispatch unit doesn’t co-issue any instructions to my knowledge. If only 24% of the issued instructions are FP and you need to issue an FP instruction every other clock to maximize utilization then it stands to reason that the FP pipe is idle at least 50% of the time.
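
    The arithmetic behind that bound, as a short Python sketch. It takes the post's premises as assumptions: no co-issue (at most one instruction issued per clock) and an FP pipe that can accept one instruction every other clock. The function name and parameters are illustrative only.

```python
def max_pipe_utilization(share_of_issued, issue_rate=1.0, pipe_accept_rate=0.5):
    """Upper bound on a pipe's utilization.

    share_of_issued  : fraction of all issued instructions that go to this pipe
    issue_rate       : instructions issued per clock (1.0 = best case, no co-issue)
    pipe_accept_rate : instructions per clock the pipe can accept (0.5 = every other clock)
    """
    pipe_issue_rate = share_of_issued * issue_rate  # instructions/clock reaching this pipe
    return min(1.0, pipe_issue_rate / pipe_accept_rate)

fp_util = max_pipe_utilization(0.24)
print(f"FP pipe busy at most {fp_util:.0%}, idle at least {1 - fp_util:.0%}")
```

    If the scheduler issues less than one instruction per clock, the bound only gets lower, which is why the post says "at least".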
     
  11. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    FWIW, I couldn't get it to work with Ashes of the Singularity: Escalation and RDR2. Fingers crossed R* won't ban my account.
     
  12. Benetanegia

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    But it does.
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Which ones?
     
  14. Benetanegia

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    I don't remember perfectly, but there have been tweets and discussion-board posts from Nvidia engineers, posted here over the years. At least one of them said that the RT cores were totally "independent, much like texture units", so that implies at least those 2, I guess. The situation with Tensor Cores is kind of confusing to me, with some claiming they take the place of an FMA or an INT and sometimes seemingly being co-issued along with those 2.

    All in all, all references to the issue rate that I can find are clearly taken verbatim from the "Architecture Tuning Guides" for CUDA, which don't seem to talk about non-compute operations at all. My wild guess is that any op that shares the FMA datapath obviously cannot be co-issued, while the ones that have a different one, like textures, might.

    EDIT: Also, I'm not sure what the context of those numbers is: whether they are from a snapshot of a small period of time, the averages for a frame, or what. FMA utilization could certainly reach very low numbers during certain tasks, like depth generation or whatever, and that would lower the average for a frame or for any period long enough to include those idle periods. ALUs are certainly not pegged at 100% or anywhere near it all the time; they do sit idle. My contention here is that in the time periods when they are working they must reach utilization way higher than 25%.
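
    The averaging argument can be made concrete with a small Python sketch. The 25% figure is the one quoted in the thread; the "active fraction" values are purely hypothetical, since nothing here says how much of the capture the FMA pipe was actually in use.

```python
def utilization_while_active(frame_average, active_fraction):
    """Utilization during the active periods, given the average over the whole
    capture and the fraction of the capture in which the pipe did any work."""
    if not 0.0 < active_fraction <= 1.0:
        raise ValueError("active_fraction must be in (0, 1]")
    return frame_average / active_fraction

# Hypothetical: the pipe is in use for all, half, or 30% of the captured period.
for active in (1.0, 0.5, 0.3):
    busy = utilization_while_active(0.25, active)
    print(f"active {active:.0%} of the time -> {busy:.0%} utilization while active")
```

    A result above 100% would simply mean the assumed active fraction is too small to be consistent with the measured average.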
     
    #534 Benetanegia, Aug 17, 2020
    Last edited: Aug 17, 2020
  15. razor_guy_mania

    Newcomer

    Joined:
    Feb 9, 2007
    Messages:
    2
    Likes Received:
    0

    The second chip on the other side still sounds implausible. But if we assume for a moment that it's true, IMHO it could be a DL inference chip. I am just speculating here.

    Some people on this thread have suggested that it's an RT accelerator chip or another GPU, but a DL inference chip makes more sense.

    A DL inference chip could enable multiple always-on AI use cases and save power by not spinning up the whole GPU. It could be useful for applications like RTX Voice, etc.
     
  16. Benetanegia

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    Lol, although I didn't think of it when I wrote that, truthfully Occam's Razor does say that if they are hiding a supposed chip on the back, there's probably no chip, as that's the one hypothesis that requires the fewest assumptions: they are tricking us for clicks.

    I believe some knowledgeable people here on B3D said that having those off-chip would be a terrible idea because of latency, loss of synergy, duplication of resources, etc. Especially for RT, which still requires shading to happen promptly. So I agree a DL chip would be far more likely, but I think both make little to no sense. More so when the A100 completely lacks any such chip. If it made sense to have it, wouldn't it make much more sense for the A100, instead of a massive die?
     
    Cuthalu likes this.
  17. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    This chip is very likely a display controller or something (with HDMI 2.1 capabilities for example) - which GA100 lacks btw. Nothing fancy.
     
  18. razor_guy_mania

    Newcomer

    Joined:
    Feb 9, 2007
    Messages:
    2
    Likes Received:
    0
    Any source saying that GA10x lacks HDMI 2.1? I find that hard to believe.

    GA100 is a datacentre part that doesn't need to have display controllers, so they are probably just saving space.
     
  19. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    40
    Likes Received:
    31
    Given the expected power draw, could the new design allow for exposing the underside of the chip for better cooling?
     
  20. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    GA100 lacks all display outputs.
     