Nvidia Ampere Discussion [2020-05-14]

Theoretically they did, but practically it's not achievable; you always have to deal with a 20-40% decrease in FP32 performance in most programs. This was really a big marketing trick.
If the code is pure FP32, Ampere can reach its rated TFLOPS. For example, in Geekbench 5 and AIDA64 GPGPU:

and in some rendering benchmarks like V-Ray and Blender, Ampere is also very close to its rated FLOPS:
 
Theoretically they did, but practically it's not achievable; you always have to deal with a 20-40% decrease in FP32 performance in most programs. This was really a big marketing trick.
And how exactly do you think INT32 is being run on GPUs which don't have a dedicated INT32 h/w? And how is this different from how they are run on Ampere?
 
@Scott_Arm

The thing with Ampere is that they didn't really double the FP32 units. You always need some INT work. The minimum INT usage that was shown was 20%, so I think you can always only use 60-80% of the FP32 throughput.

Here is a well-optimized CUDA FP32 benchmark, and Ampere only achieves a 40-50% performance uplift.

https://www.evolution.ai/post/bench...ith-tensorflow-on-the-nvidia-geforce-rtx-3090
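To make the arithmetic concrete, here is a toy issue-slot model (my own simplification, not taken from the linked benchmark) of how an INT mix eats into Ampere's FP32 rate, and what uplift over Turing that leaves:

```python
def ampere_fp32_fraction(int_fraction):
    """Fraction of Ampere's peak FP32 rate left when `int_fraction` of
    issued instructions are INT32.  Toy model: each INT issue displaces
    one FP32 issue on the shared FP/INT datapath."""
    assert 0.0 <= int_fraction <= 0.5  # shared path caps INT at half the slots
    return 1.0 - int_fraction

def uplift_vs_turing(int_fraction):
    """Ampere speedup over Turing at equal SM count and clock.  Turing
    runs INT on a dedicated pipe, so its FP32 rate (half of Ampere's
    peak) is unaffected by the INT mix."""
    return 2.0 * ampere_fp32_fraction(int_fraction)

print(ampere_fp32_fraction(0.20))  # 0.8 -> the "80% usable FP32" case
print(uplift_vs_turing(0.26))      # ~1.48 -> roughly the 40-50% uplift seen
```

At around a quarter INT in the instruction mix, the model lands right in the 40-50% range the benchmark shows, without any "marketing trick" being needed to explain it.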
This is most probably using the tensor cores, since it's based on a recent tensorflow framework (unless they disabled them by choice). And there, the 3090 is not much faster on paper than the unconstrained RTX Titan.
Image from the GA104 whitepaper, condensed down to the TFLOPS section.
 

Attachments

  • Titan RTX vs. RTX 3090 - AI training.png
And how exactly do you think INT32 is being run on GPUs which don't have a dedicated INT32 h/w? And how is this different from how they are run on Ampere?
There is a lot of confusion about how FP and INT can be scheduled. Nvidia has done a poor job of clarifying it. Nvidia's material gives the impression that it's either 128 FP or 64+64, so with even a single INT instruction you would lose half of the FP capability of an SM. On AMD GPUs, as I understand it, INT instructions can be issued arbitrarily at the stream processor level.
 
If the code is pure FP32, Ampere can reach its rated TFLOPS. For example, in Geekbench 5 and AIDA64 GPGPU:

and in some rendering benchmarks like V-Ray and Blender, Ampere is also very close to its rated FLOPS:
You can reach max TFLOPS only in corner cases. AIDA64 computes fractals, for which you don't need to read any data from caches/memory.
That is the ideal situation: the ALUs never have to wait for data.
The Blender Barcelona Pavilion scene has very simple geometry, so data fetching is not a bottleneck there either.
 
And how much FP capability would you lose in a CU containing 2 SIMD units in this case?
You would lose however many stream processors have INT instructions scheduled. AMD can issue any arbitrary mix of FP and INT instructions, as far as I understand it.
 
There is a lot of confusion about how FP and INT can be scheduled. Nvidia has done a poor job of clarifying it. Nvidia's material gives the impression that it's either 128 FP or 64+64, so with even a single INT instruction you would lose half of the FP capability of an SM. On AMD GPUs, as I understand it, INT instructions can be issued arbitrarily at the stream processor level.

There are 4 independent partitions in an SM, each with their own warps, instruction cache, scheduler and dispatcher. They don’t run in lock step in any way. You can do 16 FP + 16 INT in one partition while doing 32 FP in another.

The white paper states “All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.” This can be confusing if you take it to mean that this is the only combination possible but clearly that isn’t the case.
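A quick enumeration (my own sketch of the datapath layout described above, not Nvidia's wording) shows that the intermediate mixes exist, not just the two endpoints the whitepaper sentence names:

```python
from itertools import product

# Per clock, each of the 4 SM partitions can feed its two 16-wide
# datapaths as either 32 FP32 ops, or 16 FP32 + 16 INT32 ops
# (one datapath is FP-only, the other is shared FP/INT).
PARTITION_MODES = [(32, 0), (16, 16)]  # (fp32_ops, int32_ops)

sm_totals = sorted({
    (sum(fp for fp, _ in mix), sum(i for _, i in mix))
    for mix in product(PARTITION_MODES, repeat=4)
})
print(sm_totals)
# [(64, 64), (80, 48), (96, 32), (112, 16), (128, 0)]
```

The whitepaper's "128 FP32, or 64 FP32 and 64 INT32" are just the two extremes of this range.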
 
You would lose however many stream processors have INT instructions scheduled. AMD can issue any arbitrary mix of FP and INT instructions, as far as I understand it.

What do you mean by stream processor? By definition all lanes of each 32-wide SIMD must execute the same instruction. You can’t mix FP and INT within a SIMD in the same clock cycle.
 
There are 4 independent partitions in an SM, each with their own warps, instruction cache, scheduler and dispatcher. They don’t run in lock step in any way. You can do 16 FP + 16 INT in one partition while doing 32 FP in another.

The white paper states “All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.” This can be confusing if you take it to mean that this is the only combination possible but clearly that isn’t the case.

Yes I learned the correct granularity back around launch from this forum. I believe the poster Degustator was replying to was under the impression it was only the two combinations you listed. Nvidia's wording gives that impression.

What do you mean by stream processor? By definition all lanes of each 32-wide SIMD must execute the same instruction. You can’t mix FP and INT within a SIMD in the same clock cycle.

I thought I remembered it being stated here around the Turing launch that AMD GPUs could issue any arbitrary mix of INT and FP in a cycle. It must have been at a SIMD level.
 
You would lose however many stream processors have INT instructions scheduled. AMD can issue any arbitrary mix of FP and INT instructions, as far as I understand it.
I'm not sure you do.
You can issue two instructions per clock on two SIMD units at best. It's about as "arbitrary" as it can be.
Ampere can have 2 FP32 instructions, or 1 FP32 + 1 INT, per clock in each partition of an SM (of which there are 4 in an Ampere SM).
RDNA can have 2 FP32, or 1 FP32 + 1 INT, or 2 INTs per clock in each CU of a WGP (of which there are 2 in an RDNA WGP).
From this point of view, the only difference is that RDNA's peak INT throughput can match its peak FP32 throughput, while Ampere's is only half of that. Otherwise they are the same.
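The per-clock mixes described above can be written down directly; the lists below are just shorthand for "what the two datapaths/SIMDs of one unit can do in a clock", as a sketch of the argument:

```python
# Allowed per-clock instruction pairs on the two datapaths (Ampere SM
# partition) or two SIMDs (RDNA CU), per the post above.
AMPERE_MIXES = [("FP32", "FP32"), ("FP32", "INT32")]
RDNA_MIXES   = [("FP32", "FP32"), ("FP32", "INT32"), ("INT32", "INT32")]

def peak(mixes, op):
    """Best-case issues per clock of a given op type."""
    return max(mix.count(op) for mix in mixes)

# Peak INT throughput relative to peak FP32:
print(peak(AMPERE_MIXES, "INT32") / peak(AMPERE_MIXES, "FP32"))  # 0.5
print(peak(RDNA_MIXES, "INT32") / peak(RDNA_MIXES, "FP32"))      # 1.0
```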
 
Yes I learned the correct granularity back around launch from this forum. I believe the poster Degustator was replying to was under the impression it was only the two combinations you listed. Nvidia's wording gives that impression.

Got it.

I thought I remembered it being stated here around the Turing launch that AMD GPUs could issue any arbitrary mix of INT and FP in a cycle.

Not within a SIMD. Each Navi CU has 2 independent 32-wide SIMDs which can either execute 32 INT or 32 FP each clock.
 
You would lose however many stream processors have INT instructions scheduled. AMD can issue any arbitrary mix of FP and INT instructions, as far as I understand it.
Per-work-item INT instructions on AMD entirely block FP instructions. There's a single SIMD that handles INT and FP for per-work-item calculations.

So the argument about "NVidia losing FP32 because of INT sharing" has a simple answer: "well, doh". Nothing new here.

Ampere has a continuously available FP SIMD. The real trick is keeping data ready for it to use. Running INT on another SIMD helps keep data ready. And when INT is not required, there's a chance to get a burst of extra FP goodness.

I'm not sure you do.
You can issue two instructions per clock on two SIMD units at best. It's about as "arbitrary" as it can be.
Ampere can have 2 FP32 instructions, or 1 FP32 + 1 INT, per clock in each partition of an SM (of which there are 4 in an Ampere SM).
RDNA can have 2 FP32, or 1 FP32 + 1 INT, or 2 INTs per clock in each CU of a WGP (of which there are 2 in an RDNA WGP).
From this point of view, the only difference is that RDNA's peak INT throughput can match its peak FP32 throughput, while Ampere's is only half of that. Otherwise they are the same.
SMs and WGPs or CUs don't really compare cleanly.

It's best to forget about the CU (or WGP) level in RDNA. Each SIMD has its own instruction issue, and all the SIMDs are INT/FP "dual-action". Instruction issue to RDNA SIMDs is not controlled by the CU.

It's clearer to consider instruction-issue and register file. Ampere has dual instruction issue to two SIMDs. RDNA has single instruction issue to only one SIMD.

(The instruction issue of special functions, (TEX) data-loads, per-hardware-thread and branching evaluation all adds lots of complexity - they do affect the progress of work on the SIMDs, but they aren't directly relevant to an INT versus FP throughput discussion).

Ampere's theoretical FP throughput is far, far higher than RDNA2's will be. Ensuring that there's work for the SIMDs to do is looking more and more like the central problem. Complex, math-intensive shaders are only getting more common, but render-pass count in games is increasing and that hurts SIMD utilisation.
 
SMs and WGPs or CUs don't really compare cleanly.
Well, they kinda do actually. SM and WGP are the base-level building blocks of NV and AMD GPUs. They have different architectures, of course, and a different mix and types of h/w inside them.

It's best to forget about the CU (or WGP) level in RDNA. Each SIMD has its own instruction issue, and all the SIMDs are INT/FP "dual-action". Instruction issue to RDNA SIMDs is not controlled by the CU.
Sure but that's scheduling differences which arise from the fact that Turing/Ampere's SIMDs are 16 wide while RDNA's are 32 wide.

It's clearer to consider instruction-issue and register file. Ampere has dual instruction issue to two SIMDs. RDNA has single instruction issue to only one SIMD.
Ampere doesn't really have dual issue of instructions, it issues them in consecutive cycles to either of the two SIMDs which are 16 wide in h/w and thus take 2 clocks to run through a warp.

The differences between the Ampere and RDNA architectures are obviously pretty huge. But from the point of view of INT execution, Ampere isn't that much different from RDNA now: both will utilize FP32 SIMDs to run INT32 instructions, and both have the same wave/warp widths (RDNA also has a 64-wide option, but I don't know when it is used over the 32-wide one), so both will "lose" the same amount of FP32 throughput due to the need to run INTs sometimes.

Basically, it's not that Ampere's flops are a "marketing trick"; it's that Turing's flops were effectively "underrated", since its FP32 units didn't need to run INTs and were dealing with FP32 only.
To get a "proper" comparison of gaming math between Turing and Ampere, you need to add these 25-30% of INT instructions on top of Turing's FP32 throughput.
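As a back-of-the-envelope version of that comparison (per-SM figures, with ~27% INT-per-FP assumed for the gaming mix; my arithmetic, not from either whitepaper):

```python
def turing_math_per_clock(fp_pipes=64, int_per_fp=0.27):
    """Turing SM: 64 FP32 pipes plus a dedicated INT32 pipe, so the INT
    work rides along 'for free' on top of the rated FP32 throughput."""
    return fp_pipes * (1 + int_per_fp)

def ampere_math_per_clock(slots=128, int_per_fp=0.27):
    """Ampere SM: 128 shared FP/INT issue slots; while the INT share of
    the mix stays below half, every slot does useful work."""
    assert int_per_fp / (1 + int_per_fp) <= 0.5
    return slots

ratio = ampere_math_per_clock() / turing_math_per_clock()
print(round(ratio, 2))  # ~1.57: the per-SM gain in total math, not 2x
```

So per SM and clock, Ampere does roughly 1.5-1.6x the total math of Turing on a gaming mix, which is what the rated 2x FP32 headline shrinks to once Turing's "free" INT is counted.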
 
Ampere doesn't really have dual issue of instructions, it issues them in consecutive cycles to either of the two SIMDs which are 16 wide in h/w and thus take 2 clocks to run through a warp.
You're right, it's best to think of Ampere as dual-threaded issue to the two SIMDs - the cadence reduces instruction cache and issuer bandwidth.

Ability to issue to the two SIMDs depends upon operand availability. The heart of maximum SIMD throughput depends on at least two instructions, and their operands, being independently available every cycle. Obviously the operand collector helps here, though it adds latency.
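As a sketch of that cadence (a toy model under the assumptions in this sub-thread, not a timing-accurate simulator): one instruction leaves the issue port per clock, alternating between the two 16-wide datapaths, and each 32-wide warp instruction occupies its datapath for two clocks:

```python
def cycles_to_drain(n_instructions):
    """Clocks to retire n 32-wide warp instructions through two 16-wide
    datapaths fed by a single issue port (one instruction per clock,
    alternating paths; each instruction busies its path for 2 clocks)."""
    busy_until = [0, 0]          # clock at which each datapath frees up
    clock = 0
    for i in range(n_instructions):
        path = i % 2             # alternate between the two datapaths
        clock = max(clock, busy_until[path])  # stall if that path is busy
        busy_until[path] = clock + 2
        clock += 1               # issue port: at most one per clock
    return max(busy_until)

# The single issue port is enough to keep both 16-wide halves saturated:
print(cycles_to_drain(1000))  # 1001 -> ~1 warp instruction (32 lanes)/clock
```

In steady state the two-clock occupancy and the one-per-clock issue port balance out exactly, which is the bandwidth saving the cadence buys.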
 
Ampere's theoretical FP throughput is far, far higher than RDNA2's will be. Ensuring that there's work for the SIMDs to do is looking more and more like the central problem. Complex, math-intensive shaders are only getting more common, but render-pass count in games is increasing and that hurts SIMD utilisation.

We speculated earlier in this thread that the 2x FP capability was simply Nvidia choosing the cheapest path to increasing performance over Turing. Maybe Nvidia isn't too bothered about the excess compute capacity. The difference is quite stark, though: 150% more flops for 50% more bandwidth.
 
The thing is that we don't see that scaling in FP32 either. The Witcher 3 uses less than 20% INT calculations, but the 3090 only gets a 38% increase over the 2080 Ti at 4K.

Source, Witcher 3 benchmark at 4K: https://www.guru3d.com/articles-pages/geforce-rtx-3090-founder-review,24.html
Source, picture of INT usage in The Witcher 3: https://m.hexus.net/tech/reviews/gr...g-architecture-examined-and-explained/?page=2
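A quick sanity check on those numbers (rated boost-clock TFLOPS from the spec sheets, INT share read off the hexus chart; my arithmetic, not from either review):

```python
rated_3090, rated_2080ti = 35.6, 13.4   # rated FP32 TFLOPS (spec sheets)
int_share = 0.20                        # INT fraction in The Witcher 3
observed = 1.38                         # guru3d 4K scaling result

paper_scaling = rated_3090 / rated_2080ti                    # ~2.66x
int_adjusted  = rated_3090 * (1 - int_share) / rated_2080ti  # ~2.13x

print(round(paper_scaling, 2), round(int_adjusted, 2), observed)
# Even after discounting the INT share, 1.38x is far below 2.13x, so
# something other than INT contention (e.g. bandwidth) limits scaling.
```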
 
@Digidi Does Witcher 3 support asynchronous compute?

Edit: Do Turing and Ampere actually support asynchronous compute in the same way as GCN and RDNA do? That would be a good way to utilize more of that ALU capacity, assuming you're not already bottlenecked by something else like bandwidth.
 