Nvidia Ampere Discussion [2020-05-14]

Ampere can execute rt+ fp32+tensor ops concurrently unlike turing.


https://www.hardwareluxx.de/index.p....html
 
Am I reading this right that there are half as many tensor cores per SM, but each tensor core can do twice as many dense matrix operations per unit time, and twice as many again for sparse matrix operations? So effectively each SM has the same dense matrix throughput as Turing, but twice the sparse matrix throughput.
 
I may have misread, but I thought I read half as many tensor cores, each 4x more performant.
 
Am I reading this right that there are half as many tensor cores per SM, but each tensor core can do twice as many dense matrix operations per unit time, and twice as many again for sparse matrix operations? So effectively each SM has the same dense matrix throughput as Turing, but twice the sparse matrix throughput.
Yes, same as A100.
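To put rough numbers on that, a napkin sketch assuming the per-tensor-core rates from the Turing and GA102 whitepapers (8 cores/SM at 64 FP16 FMA/clk on Turing; 4 cores/SM at 128 FMA/clk dense on GA10x, doubled with 2:4 structured sparsity):

```cpp
#include <cstdio>

int main() {
    const int turing_dense  = 8 * 64;            // 512 FMA/clk per SM
    const int ampere_dense  = 4 * 128;           // 512 FMA/clk per SM -- same as Turing
    const int ampere_sparse = ampere_dense * 2;  // 1024 FMA/clk per SM with 2:4 sparsity
    std::printf("Turing dense: %d, Ampere dense: %d, Ampere sparse: %d\n",
                turing_dense, ampere_dense, ampere_sparse);
    return 0;
}
```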
 
Also, the tensor cores are still going to eat all of your RF bandwidth

Are you 100% sure about that? In that case it would be a departure from A100, right? Since IIRC, wasn't A100's high FP16 (non-tensor) rate explained by running both the FP16 path and the tensor cores at the same time?

EDIT: From this post: https://forum.beyond3d.com/posts/2128606/

That's correct, we doubled non-Tensor-Core FP16 math up to 4x multiply-add rate relative to FP32. It was straightforward to support given 2xFP16 in the scalar datapath and 2xFP16 that could naturally be provided in the tensor core datapath.
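For reference, the "2xFP16 in the scalar datapath" being described maps to the packed-half intrinsics in CUDA. A minimal sketch (my illustration, not NVIDIA's code):

```cpp
#include <cuda_fp16.h>

// Each __hfma2 is one fused multiply-add on a packed pair of halves, so a
// single FP32-width lane retires two FP16 FMAs per instruction.
__global__ void fp16x2_fma(const __half2* a, const __half2* b, __half2* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hfma2(a[i], b[i], c[i]);
}
```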
 
That was for A100 if I'm not mistaken. They have half the throughput on the consumer variants.

Gaming Ampere has only twice the throughput. I think more doesn't make any sense. NVIDIA has truly separated their HPC/DL business from the consumer lineup:
TSMC vs. Samsung
FP16/FP64/tensor cores vs. FP32/RT
Large L2 cache vs. the same (or smaller) size as Turing
 
Are you 100% sure about that? In that case it would be a departure from A100, right? Since IIRC, wasn't A100's high FP16 (non-tensor) rate explained by running both the FP16 path and the tensor cores at the same time?

EDIT: From this post: https://forum.beyond3d.com/posts/2128606/
This was per my discussion with NVIDIA. When I asked about how much pressure the tensor cores put on the register file, and whether that made it hard to use the CUDA pipes at the same time, I was told that it was the same situation as Turing.
 
Sounds like memory overclocking could be huge, because the card is likely to be bandwidth limited. It can also be fillrate limited, and ROPs need a lot of bandwidth too. They did increase L1 cache size and/or bandwidth, didn't they? Maybe improved cache hit rates will help bandwidth utilization in general.

Is it possible we'll actually see scenarios where they're texture rate limited? How often does that happen? With half as many texture units per CUDA core as Turing, maybe this will be a possibility?
 
Heck, I'm beginning to wonder what "Big" Ampere's benchmarks will look like.
 
Note that the new bit is being able to do RT + tensor at the same time. Also, the tensor cores are still going to eat all of your RF bandwidth, so don't expect to be able to run the CUDA pipes alongside the tensor cores.
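For anyone wondering why the tensor cores lean on the register file so hard, here's a minimal WMMA sketch (the standard CUDA API, nothing Ampere-specific): every warp keeps whole matrix tiles resident in its registers and streams them into the tensor cores each issue, which is bandwidth the regular CUDA pipes would otherwise be using.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* a, const half* b, float* c) {
    // Three fragments live in the warp's registers at once: two 16x16 half
    // tiles plus a 16x16 float accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);   // operands fed from the RF every issue
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```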

Bandwidth seems to be the limiting factor in gaming performance all around. All the extra transistors and parallelization apparently get bottlenecked by it; performance looks to roughly equate to what you'd expect from Turing with the same bandwidth used maximally. If performance were purely memory-bandwidth capped you'd expect just 23% over a 2080 Ti, but the extra silicon is still useful beyond that. I also don't know if memory overclocking will help many people; presumably good chips max out at 19.5 Gbps, and I wouldn't be surprised if you run into issues trying to push them higher.

It'll be interesting to see how the extra 23% or so of bandwidth the 3090 provides translates into performance.
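For reference, napkin math on where the ~23% figures come from, assuming the announced memory specs (14 Gbps on 352-bit for the 2080 Ti, 19 Gbps on 320-bit for the 3080, 19.5 Gbps on 384-bit for the 3090):

```cpp
#include <cstdio>

// GB/s = per-pin rate (Gbps) * bus width (bits) / 8
static double bw(double gbps_per_pin, int bus_bits) {
    return gbps_per_pin * bus_bits / 8.0;
}

int main() {
    const double ti2080  = bw(14.0, 352);   // 616 GB/s
    const double rtx3080 = bw(19.0, 320);   // 760 GB/s
    const double rtx3090 = bw(19.5, 384);   // 936 GB/s
    std::printf("3080 vs 2080 Ti: +%.0f%%\n", (rtx3080 / ti2080 - 1) * 100);   // ~+23%
    std::printf("3090 vs 3080:    +%.0f%%\n", (rtx3090 / rtx3080 - 1) * 100);  // ~+23%
    return 0;
}
```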
 
From the hothardware link:

"With Ampere, NVIDIA wanted to be able to process Bounding Box and Triangle intersection rates in parallel. So, Ampere’s separate Bounding Box and Triangle resources can run in parallel, and as mentioned, Triangle Intersection rates are twice as fast.

A new Triangle Position Interpolation unit has also been added to Ampere to help create more accurate motion blur effects."

Sounds like pretty significant speedups to ray tracing. The triangle position interpolation is interesting. So frame to frame they can interpolate a new position and if it's outside the current triangle intersection then I guess they can go and test a new intersection. Different from having to test the bounding box and triangle intersection every frame.

Sounds like the interpolation is for motion blur. So it would be interpolated for each intersection check (each ray). It is a very old method.
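For the curious, the standard technique looks like this (a sketch of the classic offline-renderer approach; I'm assuming NVIDIA's interpolation unit does something equivalent in hardware): vertex positions are stored at shutter open and shutter close, and lerped to each ray's time before the intersection test.

```cpp
__device__ float3 lerp3(float3 a, float3 b, float t) {
    return make_float3(a.x + t * (b.x - a.x),
                       a.y + t * (b.y - a.y),
                       a.z + t * (b.z - a.z));
}

// Each ray carries a time in [0,1]; the triangle it is tested against is
// rebuilt on the fly for that time.
__device__ void triangle_at_time(const float3 v_open[3], const float3 v_close[3],
                                 float ray_time, float3 out[3]) {
    for (int i = 0; i < 3; ++i)
        out[i] = lerp3(v_open[i], v_close[i], ray_time);
}
```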
 
Sounds like the ALUs will be easier to keep busy than on Turing, because you won't have all of that INT32 ALU just sitting idle... as long as you have the bandwidth. I'm curious what kinds of shaders would be bandwidth limited and how often that's an issue.
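Napkin roofline math on that, assuming the announced 3080 figures (8704 CUDA cores at ~1.71 GHz, 760 GB/s): a shader needs roughly 39 FLOPs per byte of DRAM traffic before the ALUs, rather than memory, become the bottleneck.

```cpp
#include <cstdio>

int main() {
    const double tflops = 8704 * 1.71e9 * 2.0 / 1e12;  // cores * clock * 2 (FMA) ~= 29.8
    const double bw_gbs = 760.0;
    const double flop_per_byte = tflops * 1000.0 / bw_gbs;
    std::printf("~%.0f FLOP/byte (~%.0f FLOPs per 32-bit fetch) to stay ALU-bound\n",
                flop_per_byte, flop_per_byte * 4.0);
    return 0;
}
```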

If bandwidth-limited situations are more common for pixel shaders or compute shaders, I wonder if core overclocking will even be worth it. I know it speeds up all of the other GPU parts as well, but if you speed them up, they need more bandwidth too. I'm assuming the raster engines will always benefit from clock increases and will rarely be bandwidth limited.
 
This was per my discussion with NVIDIA. When I asked about how much pressure the tensor cores put on the register file, and whether that made it hard to use the CUDA pipes at the same time, I was told that it was the same situation as Turing.
Wouldn't pure FP16 calculations in the Tensor cores relieve the register (file) pressure somewhat compared to TF32 and whatnot?
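Napkin math suggests it should, at least for the operand footprint (assuming TF32 operands occupy a full 32-bit register each, while FP16 packs two per register):

```cpp
#include <cstdio>

int main() {
    const int elems = 16 * 16;  // one 16x16 operand tile
    std::printf("FP16 tile: %d B, TF32 tile: %d B of registers\n",
                elems * 2, elems * 4);  // 512 B vs 1024 B per tile
    return 0;
}
```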
 