Nvidia Ampere Discussion [2020-05-14]

Makes sense that the FP32/INT32 ratio is 2:1 IMO.
In Nvidia's own marketing material, the share of INT32 operations in games is only up to 40% IIRC, so Turing had an unbalanced amount of INT32 units (at least for game rendering).

Indeed, according to Nvidia's Nsight GPU Trace on my RTX 2060, the FMA pipe has almost twice the load of the ALU pipe.

[Screenshot: nsight_rtx2060n2kus.png]
 

Attachments

  • HZD.nsight-gfxgpt.zip (1.4 MB)
  • CONTROL_DLSSON_RTXON.nsight-gfxgpt.zip (709.6 KB)
  • CIV_VI.nsight-gfxgpt.zip (1.3 MB)
  • GEARS5.nsight-gfxgpt.zip (1 MB)
Nice. Overall utilization is very bad though. 25% FP and 15% INT. All that idle hardware!!

I believe those are not utilization rates; they are the percentages that each kind of instruction represents against the total amount of instructions. There are a few missing from the chart, like load/store, texture operations, and who knows what else counts towards the total for the SM.

As a hypothetical example, if an SM issued 4 different instruction types every cycle, i.e. FP, INT, LD/ST, TEX, each of them would appear as 25%. If one were underutilized it would get less than 25% and the rest would get a bit more, and so on.
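For illustration, here's a quick Python sketch with made-up counts showing how a normalized mix column like that behaves (the pipe names and numbers are purely hypothetical):

```python
# Made-up per-pipe instruction counts over some capture window, purely to
# illustrate how a normalized instruction-mix column works.
issued = {"FMA": 400, "ALU": 240, "LD/ST": 200, "TEX": 160}

total = sum(issued.values())
for pipe, count in issued.items():
    print(f"{pipe}: {count / total:.0%} of issued instructions")
# The shares always sum to 100%, no matter how busy the SM actually was.
```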
 

That's the normalized number which will always add up to 100% because it's relative to total issued instructions across all pipelines.

I'm talking about the absolute stats which presumably represent percentage of used vs available instruction slots for each pipeline. These will not add up to 100%. They should ideally add up to over 200% given each issued instruction can keep some pipelines busy for many cycles e.g. SFU.
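As a rough sketch of what I mean (this is an assumed model, not necessarily how Nsight computes it, and all the numbers are invented):

```python
# Assumed model: "absolute" per-pipe utilization as busy cycles over elapsed
# cycles. A single issue can keep a pipe busy for several cycles (e.g. a
# 16-wide pipe needs 2 cycles per 32-thread warp, the SFU far more), so the
# per-pipe numbers can sum past 100%.
elapsed_cycles = 1000
issues = {"FMA": 250, "ALU": 150, "SFU": 40}       # instructions issued per pipe
cycles_per_issue = {"FMA": 2, "ALU": 2, "SFU": 8}  # illustrative occupancy per issue

for pipe, n in issues.items():
    busy = n * cycles_per_issue[pipe]
    print(f"{pipe}: {busy / elapsed_cycles:.0%} busy")
# FMA 50%, ALU 30%, SFU 32% -- together well over 100%, unlike the
# normalized instruction mix.
```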
 
I'm not sure how Nsight calculates SM instruction numbers, but SM occupancy seems fine, 100%:

[Screenshot: smoccupancy4pj96.png]

That's not true. There are clearly many areas of completely idle SMs (light gray) and active SMs with idle warps (dark gray).

Also, warp occupancy isn't the same as ALU utilization. You can have lots of resident warps that are all blocked by a memory request, for example, and the ALUs idle as a result.
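A toy example of the distinction (numbers are made up; a 32-warp-per-SM limit is assumed just for illustration):

```python
# Toy scenario: full warp occupancy with zero ALU work in a given cycle.
max_warps_per_sm = 32
resident_warps = 32               # 100% occupancy
warps_stalled_on_memory = 32      # every warp waiting on an outstanding load

eligible = resident_warps - warps_stalled_on_memory
print(f"occupancy: {resident_warps / max_warps_per_sm:.0%}")  # 100%
print(f"ALU issue possible this cycle: {eligible > 0}")       # False -> ALUs idle
```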
 
Does Nsight actually work with a lot of games? Just need to turn on gpu counters in driver? Assuming it won't work with multiplayer games with anti-cheat, but would be nice to figure out where my bottlenecks are in particular games.
 
I still think it’s going to be two 16-wide FP32 ALUs + one 16-wide INT32 ALU per scheduler, with integer MACs maybe getting performed by one of the FP32 units. Not sure what to expect from the consumer tensor cores or the ray tracing hw though.
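Back-of-the-envelope math for that speculated layout (pure speculation, assuming the usual 4 schedulers per SM):

```python
# Per-SM throughput if each scheduler really had 2 x 16-wide FP32 + 1 x 16-wide INT32.
schedulers_per_sm = 4
fp32_lanes_per_scheduler = 2 * 16
int32_lanes_per_scheduler = 1 * 16

print("FP32 ops/clk per SM:", schedulers_per_sm * fp32_lanes_per_scheduler)   # 128
print("INT32 ops/clk per SM:", schedulers_per_sm * int32_lanes_per_scheduler) # 64
# That lands exactly on the 2:1 FP32/INT32 ratio discussed earlier in the thread.
```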
 
That's the normalized number which will always add up to 100% because it's relative to total issued instructions across all pipelines

I'm talking about the absolute stats which presumably represent percentage of used vs available instruction slots for each pipeline. These will not add up to 100%. They should ideally add up to over 200% given each issued instruction can keep some pipelines busy for many cycles e.g. SFU.

No, I'm not talking about the normalized values. Those values are normalized for the 4 instruction types shown in the absolute column, and thus those 4 types will always add up to 100%.

The absolute numbers also likely add up to 100% when you include all of the instruction types, but there are many instruction types missing, like I said: load/store, texture ops, maybe RTX in some of those games, etc.
 

If that’s the case then it only lends more credence to low ALU utilization. The dispatch unit doesn’t co-issue any instructions to my knowledge. If only 24% of the issued instructions are FP and you need to issue an FP instruction every other clock to maximize utilization then it stands to reason that the FP pipe is idle at least 50% of the time.
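Spelling that arithmetic out (assuming at most one instruction issued per clock per scheduler and a 16-wide FP32 pipe, i.e. 2 clocks per 32-thread warp):

```python
# Upper bound on FP32 pipe utilization from the instruction mix alone.
fp_share_of_issues = 0.24           # fraction of issued instructions that are FP
fp_issue_rate_for_full_pipe = 0.50  # need an FP issue every other clock to stay busy

fp_pipe_busy = fp_share_of_issues / fp_issue_rate_for_full_pipe
print(f"FP pipe busy at most {fp_pipe_busy:.0%} of the time")  # ~48%, i.e. idle >50%
```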
 
Does Nsight actually work with a lot of games? Just need to turn on gpu counters in driver? Assuming it won't work with multiplayer games with anti-cheat, but would be nice to figure out where my bottlenecks are in particular games.

FWIW, I couldn't get it to work with Ashes of the Singularity: Escalation and RDR2. Fingers crossed R* won't ban my account.
 
Which ones?

I don't remember perfectly, but there have been tweets and posts on discussion boards from Nvidia engineers, posted here over the years. At least one of them said that the RT cores were totally "independent, much like texture units", so that implies at least those 2 I guess. The situation with Tensor Cores is kind of confusing to me, with some claiming they take the place of an FMA or an INT and sometimes seemingly being co-issued alongside those 2.

All in all, all references to the issue rate that I find are clearly taken verbatim from the "Architecture Tuning Guides" for CUDA, which don't seem to talk about non-compute operations at all. My wild guess is that any op that shares the FMA datapath obviously cannot be co-issued, while ones with a different datapath, like textures, might.

EDIT: Also, not sure what the context of those numbers is; whether they come from a snapshot of a small period of time, the averages for a frame, or what. FMA occupation could obviously reach very low numbers during certain tasks I imagine, like depth generation or whatever, and that would lower the average occupation for a frame or for any period long enough to include those idle stretches. ALUs are certainly not pegged at 100% or anywhere near it all the time; they do sit idle. My contention here is that in the time periods when they are working they must reach occupation way higher than 25%.
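To illustrate that last point with completely made-up phase durations and utilizations:

```python
# Toy illustration of the frame-averaging point: a few invented render phases
# with invented FMA-pipe utilizations. FMA-idle phases drag the frame average
# down even if the pipe is well used while it is actually working.
phases = [
    # (name, duration in ms, FMA pipe utilization during that phase)
    ("depth prepass", 4.0, 0.05),
    ("g-buffer",      3.0, 0.60),
    ("lighting",      3.0, 0.85),
    ("post",          2.0, 0.30),
]

frame_ms = sum(ms for _, ms, _ in phases)
average = sum(ms * util for _, ms, util in phases) / frame_ms
print(f"frame-average FMA utilization: {average:.0%}")  # ~43%, well below the 85% peak
```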
 
To me Occam's razor says that it is a dual-GPU card.

1) xx90 name is apparently back. Usually reserved for dual-GPU cards.
2) Memory on both sides.
3) (Independent) cooling on both sides.
4) Chip on both sides! And the leaker covered it, so as not to disclose what it is. Probably to leverage it later in some way. A second big leak/reveal for a lot of clicks, etc.
5) Power requirements.

If it looks like a duck and quacks like a duck...


The second chip on the other side still sounds implausible. BUT if we assume for a moment that it's true, IMHO it could be a DL inference chip. I am just speculating here.

Some people on this thread have suggested that it's an RT accelerator chip or another GPU, but a DL inference chip makes more sense.

A DL inference chip could enable multiple always-on AI use-cases and save power by not spinning up the whole GPU. Could be useful for applications like RTX Voice, etc.
 
The second chip on the other side still sounds implausible.

lol, although I didn't think of it when I wrote that, truthfully Occam's Razor does say that if they are hiding a supposed chip on the back, there's probably no chip. That's the hypothesis that requires the fewest assumptions: they are tricking us for clicks.

BUT if we assume for a moment that it's true, IMHO it could be a DL inference chip. I am just speculating here.

Some people on this thread have suggested that it's an RT accelerator chip or another GPU, but a DL inference chip makes more sense.

A DL inference chip could enable multiple always-on AI use-cases and save power by not spinning up the whole GPU. Could be useful for applications like RTX Voice, etc.

I believe some knowledgeable people here on B3D said that having those off-chip would be a terrible idea because of latency and loss of synergy/duplication of resources, etc. Especially for RT, which still requires shading to happen promptly. So I agree a DL chip would be far more likely, but I think both make little to no sense. More so when the A100 completely lacks any such chip. If it made sense to have it, wouldn't it make much more sense for the A100 instead of a massive die?
 
This chip is very likely a display controller or something (with HDMI 2.1 capabilities for example) - which GA100 lacks btw. Nothing fancy.
 

Any source saying that GA10x lacks HDMI 2.1? I find that hard to believe.

GA100 is a datacentre part that doesn't need to have display controllers, so they are probably just saving space.
 