Nvidia Ampere Discussion [2020-05-14]

Note that the new bit is being able to do RT + tensor at the same time. Also, the tensor cores are still going to eat all of your RF bandwidth, so don't expect to be able to run the CUDA pipes alongside the tensor cores.
From what I understand, what is depicted in this image is the possibility of using the new async modes to overlap all of this; "concurrency" is probably not the best wording for it.
 
Note that the new bit is being able to do RT + tensor at the same time. Also, the tensor cores are still going to eat all of your RF bandwidth, so don't expect to be able to run the CUDA pipes alongside the tensor cores.
I also wonder how this holds up with regard to power requirements. And if I'm not mistaken, this means the tensor cores either have to apply DLSS to the previous frame or are working on something else, like AI denoising, for the current frame.
 
Isn't AI denoising part of DLSS, and isn't DLSS always a post-processing effect?
 
Wouldn't pure FP16 calculations in the Tensor cores relieve the register (file) pressure somewhat compared to TF32 and whatnot?
FP16 is only 2x faster than TF32, so no -- the operands are half the size, but you go through them twice as fast. Bandwidth for matrix multiplication depends a lot on how large a chunk of both matrices you can keep as close to the ALUs as possible; in the context of tensor cores that basically means the register file directly. If I remember correctly, there were some investigations around here showing that code squeezing the absolute maximum out of the tensor cores loaded data into registers twice to avoid bank conflicts.
The 4x FP16 rate comes from 2 general-purpose FP16 ops on the scalar path plus 2 additional FP16 ops from the tensor cores. But the tensor cores are capable of more than that: looking at A100, there are 64 FP16x2 units per SM and 4 tensor cores, each capable of 256 ops per clock.
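As a rough sanity check (my arithmetic, assuming the whitepaper's 108 SMs and ~1.41 GHz boost clock, counting an FMA as 2 ops):

108 SMs x 128 packed FP16 FMAs/clk x 2 x 1.41 GHz ≈ 39 TFlops on the FP32 SIMDs
108 SMs x 4 TCs x 256 FMAs/clk x 2 x 1.41 GHz ≈ 312 TFlops on the tensor cores

which lines up with the whitepaper's 78 TFlops for general-purpose FP16 if the tensor cores contribute another ~39 TFlops of non-MMA FP16, as described above.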
 
Just noticed the RTX3080 requires a 750W power supply. I have a 650W one and my CPU is Ryzen 3900X. I wonder if my power supply would be constrained if I got a 3080. What do you think?
 
24:00 - "With Nvidia RTX IO, vast worlds will load instantly. Picking up where you left off will be instant."

A hint that quick resume might be coming to PC? Please let it be!!

Yeah, but you will have to wait on DirectStorage according to Microsoft: "We’re targeting getting a development preview of DirectStorage into the hands of game developers next year."

So late 2021 at the earliest, methinks.
 
With the +15% power limit, the 3080 is at almost 370W. An overclocked 3900X is around 200W, so you are already near 90% load on the PSU.
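Rough sums with those numbers: 370W + 200W ≈ 570W, and 570/650 ≈ 88%, before counting the motherboard, drives and fans.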

I'm not overclocking the CPU, although I've seen max load at 142W. True, that's with the boost clocks on the 3080. The 3070 is a much safer proposition indeed. I would wait for a 3070 with 16GB, but my brother's birthday is close, I'm travelling back home, and I wanted to surprise him with my current GTX1060. I might replace it with his GTX750Ti in the meantime, maybe...
 
FP16 is only 2x faster than TF32, so no -- the operands are half the size, but you go through them twice as fast. Bandwidth for matrix multiplication depends a lot on how large a chunk of both matrices you can keep as close to the ALUs as possible; in the context of tensor cores that basically means the register file directly. If I remember correctly, there were some investigations around here showing that code squeezing the absolute maximum out of the tensor cores loaded data into registers twice to avoid bank conflicts.
The 4x FP16 rate comes from 2 general-purpose FP16 ops on the scalar path plus 2 additional FP16 ops from the tensor cores. But the tensor cores are capable of more than that: looking at A100, there are 64 FP16x2 units per SM and 4 tensor cores, each capable of 256 ops per clock.
Maybe I'm not understanding you right, but FP16 throughput for A100 is 78 TFlops as per the Ampere whitepaper. Half of that comes from the FP32-SIMDs in the SM; the TCs deliver the other half: 39 TFlops. With FP16 tensor math they are at 312 TFlops w/o sparsity. So my take on it is that this is far from their maximum throughput, and that has to have an effect on RF usage.
 
Maybe I'm not understanding you right, but FP16 throughput for A100 is 78 TFlops as per the Ampere whitepaper. Half of that comes from the FP32-SIMDs in the SM; the TCs deliver the other half: 39 TFlops. With FP16 tensor math they are at 312 TFlops w/o sparsity. So my take on it is that this is far from their maximum throughput, and that has to have an effect on RF usage.
So if I understand you correctly, you mean that general-purpose FP16 should be even higher than 4x (78 TFlops), but it isn't due to RF usage/bandwidth? The 312 TFlops figure does burn pretty much all of the RF bandwidth and the 78 TFlops figure does not. But TCs are special pieces of hardware: for example, they require cooperation of an entire warp, and you can't have divergence where some of the threads in a warp issue commands to the TC and others don't.
If you mean that you can run RT and FP16 compute shaders at the 4x rate and occupy the RT cores, CUDA cores and tensor cores, then that's technically correct. But you're not really stressing the tensors all that much. :)
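To illustrate the warp-cooperation point, here is a minimal sketch of my own (assuming CUDA's wmma API on an sm_70+ part) of how a tensor-core MMA is issued; all 32 threads of the warp must reach the load/mma/store calls together, because the fragments are spread across the whole warp:

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16x16 FP16 MMA with FP32 accumulation.
// Launch with exactly one warp, e.g. warp_mma_16x16x16<<<1, 32>>>(dA, dB, dC);
__global__ void warp_mma_16x16x16(const half* a, const half* b, float* c)
{
    // Fragments are distributed across the 32 lanes of the warp; no single
    // thread owns a whole tile, so every lane has to participate in each call.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    // These are warp-wide (collective) operations; diverging so that only
    // some lanes issue them is not allowed.
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension of A
    wmma::load_matrix_sync(b_frag, b, 16);  // leading dimension of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B

    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

A real kernel would obviously tile this across many warps and stream fragments through the register file, which is where the RF-bandwidth pressure discussed above comes from.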
 
So if I understand you correctly, you mean that general-purpose FP16 should be even higher than 4x (78 TFlops), but it isn't due to RF usage/bandwidth?
Nope, I just stated that it's at that level and that this is a lower utilization than its peak (78 vs. 312 TFlops). But it obviously has enough RF bandwidth to get near its peak. So my train of thought was that the RF is not utilized 100% by pure FP16 non-MMA calculations, thus a certain amount should be available to at least partially feed the FP32-SIMDs.
 
Nope, I just stated that it's at that level and that this is a lower utilization than its peak (78 vs. 312 TFlops). But it obviously has enough RF bandwidth to get near its peak. So my train of thought was that the RF is not utilized 100% by pure FP16 non-MMA calculations, thus a certain amount should be available to at least partially feed the FP32-SIMDs.

CarstenS: tensor cores lower register file bandwidth per math operation because they work on tensors rather than scalars (N^3/N^2).
Nvidia actually shows this phenomenon in their animated tensor core cartoons in their keynotes.

So it's likely they are close to peak RF bandwidth both at 78 scalar TFlops and at 312 tensor TFlops.
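A back-of-envelope way to see that scaling (my framing, not exact hardware numbers): a scalar FMA needs roughly 4 register accesses per math op (3 reads, 1 write), while an NxNxN MMA performs N^3 FMAs against roughly 4N^2 operand accesses (A, B, and the accumulator in and out), so RF traffic per FMA falls as roughly 4/N.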
 