Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
    From what I understand, what is depicted in this image is the possibility of using the new async modes to overlap all this stuff; "concurrency" is probably not the best wording for it.
     
  2. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I also wonder how this holds up wrt power requirements. And if I'm not mistaken, this means the tensor cores either have to apply DLSS for the previous frame or are working on something else, like AI denoising, for the current frame.
     
    Lightman likes this.
  3. Isn't AI denoising part of DLSS, and isn't DLSS always a post-processing effect?
     
  4. dorf

    Newcomer

    Joined:
    Dec 21, 2019
    Messages:
    126
    Likes Received:
    417
    I don't think so. Maybe this elucidates the matter somewhat (timestamped):
     
    Cyan, Krteq, BRiT and 3 others like this.
  5. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    FP16 is 2x faster, so no. Bandwidth for matrix multiplication depends a lot on how large a chunk of both matrices you can keep as close to the ALUs as possible; in the context of tensor cores this basically means the register file directly. If I remember correctly, there were some investigations around here showing that the code to get the absolute maximum out of the tensor cores loaded data into registers twice to avoid bank conflicts.
    The 4x FP16 rate is 2 general-purpose FP16 ops from the scalar path plus 2 additional FP16 ops from the tensor cores. But the tensor cores are capable of more than that: looking at A100, there are 64 FP16x2 units per SM and 4 tensor cores, each capable of 256 ops.
     
    DegustatoR and BRiT like this.
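    A quick back-of-the-envelope check of the per-SM unit counts quoted above (a sketch: the 1.41 GHz boost clock and 108-SM count are assumptions taken from the A100 whitepaper, not from the post):

    ```python
    # Rough check of the A100 per-SM FP16 figures quoted above.
    # Assumes 2 FLOPs per FMA; clock and SM count (1.41 GHz, 108 SMs)
    # are assumed A100 whitepaper values, not from the post itself.

    FP16X2_UNITS = 64        # general-purpose FP16x2 units per SM
    TENSOR_CORES = 4         # tensor cores per SM
    TC_FMA_PER_CLOCK = 256   # FP16 FMAs per tensor core per clock

    # FMAs per SM per clock
    scalar_fma = FP16X2_UNITS * 2                  # 128 from the SIMD path
    tensor_fma = TENSOR_CORES * TC_FMA_PER_CLOCK   # 1024 from tensor cores

    SMS, CLOCK_GHZ = 108, 1.41
    scalar_tflops = scalar_fma * 2 * SMS * CLOCK_GHZ / 1e3  # 2 FLOPs per FMA
    tensor_tflops = tensor_fma * 2 * SMS * CLOCK_GHZ / 1e3

    print(round(scalar_tflops))  # ~39 TFLOPS general-purpose FP16
    print(round(tensor_tflops))  # ~312 TFLOPS FP16 tensor (dense)
    ```

    Both results line up with the whitepaper throughput numbers discussed later in the thread.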
  6. Picao84

    Veteran

    Joined:
    Feb 15, 2010
    Messages:
    2,109
    Likes Received:
    1,195
    Just noticed the RTX3080 requires a 750W power supply. I have a 650W one and my CPU is Ryzen 3900X. I wonder if my power supply would be constrained if I got a 3080. What do you think?
     
  7. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    Yes, and the graph shows the tensor work taking place at the same time as the RT work with a slight lag. So RT denoising makes sense rather than DLSS.
     
    PSman1700 likes this.
  8. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    With the +15% power limit the 3080 is almost at 370W. An overclocked 3900X is around 200W, so you are already near 90% load on the PSU.
     
    Lightman and PSman1700 like this.
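    The headroom estimate above, spelled out (a sketch using the posters' own figures, which ignores drives, fans and other board power):

    ```python
    # PSU headroom for the build discussed above: a 3080 at its raised
    # +15% power limit plus an overclocked Ryzen 3900X on a 650 W supply.
    # All wattages are the posters' estimates, not measurements.

    PSU_W = 650
    GPU_W = 370   # 3080 with +15% power limit
    CPU_W = 200   # overclocked 3900X estimate from the post

    load = GPU_W + CPU_W                # ignores drives, fans, etc.
    print(load, f"{load / PSU_W:.0%}")  # 570 88%
    ```

    At stock CPU load (the ~142W mentioned below) the margin is a bit more comfortable, but transient spikes are the usual worry at this level.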
  9. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    24:00 - "With Nvidia RTX IO, vast worlds will load instantly. Picking up where you left off will be instant."

    A hint that quick resume might be coming to PC? Please let it be!!

     
    Lightman, BRiT and PSman1700 like this.
  10. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Separate, if I understand correctly. DLSS occurs around the same time an antialiasing algo would normally occur. It’s meant to slot in as a replacement.

    AI denoising should occur during/just after ray tracing step.
     
    Dictator and PSman1700 like this.
  11. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,262
    Likes Received:
    22
    Location:
    Land of the 25% VAT
    Yeah, but you will have to wait on DirectStorage according to Microsoft: "We’re targeting getting a development preview of DirectStorage into the hands of game developers next year."

    So late 2021 at the earliest, methinks.
     
  12. Picao84

    Veteran

    Joined:
    Feb 15, 2010
    Messages:
    2,109
    Likes Received:
    1,195
    I'm not overclocking the CPU, although I've seen max load at 142W. True that with the boost clocks on the 3080. The 3070 is a much safer proposition indeed. I would wait for a 3070 with 16GB, but my brother's birthday is close, I'm travelling back home and wanted to surprise him with my current GTX1060. Might replace with his GTX750Ti meanwhile maybe...
     
  13. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Maybe I don't understand you right, but FP16 throughput for A100 is 78 TFlops as per the Ampere whitepaper. Half of that comes from the FP32-SIMDs in the SM; the TCs deliver the other half: 39 TFlops. With FP16-Tensor math, they are at 312 TFlops w/o sparsity. So my take on it is that that's far from their maximum throughput. That has to have an effect on RF usage.
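    The utilization argument above in numbers (a sketch using the whitepaper figures as quoted in the post):

    ```python
    # In plain-FP16 mode the tensor cores contribute half of the 78 TFLOPS
    # total, which is only a small fraction of their 312 TFLOPS FP16-tensor
    # peak (dense, w/o sparsity).

    total_fp16 = 78.0          # TFLOPS, general-purpose FP16 (SIMD + TC)
    tc_share = total_fp16 / 2  # 39 TFLOPS coming from the tensor cores
    tc_peak = 312.0            # TFLOPS, FP16 tensor math w/o sparsity

    print(f"{tc_share / tc_peak:.1%}")  # 12.5% of tensor-core peak
    ```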
     
  14. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
  15. Picao84

    Veteran

    Joined:
    Feb 15, 2010
    Messages:
    2,109
    Likes Received:
    1,195
    Cyan likes this.
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    disco_ and Lightman like this.
  17. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    So if I understand you correctly, you mean that general-purpose FP16 should be even higher than 4x (78 TFlops), but it's not due to RF usage/bandwidth? The 312 TFlops figure does burn pretty much all the RF bandwidth, and the 78 TFlops figure does not. But TCs are special pieces of hardware: for example, they require the cooperation of an entire warp, and you can't have divergence where some threads in a warp issue commands to the TC and others do not.
    If you mean that you can run RT and FP16 compute shaders at the 4x rate and occupy the RT cores, CUDA cores and tensor cores, then that's technically correct. But you're not really stressing the tensors all that much. :)
     
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Nope, I just stated that it's at that level and that it's a lower utilization than its peak (78 vs. 312 TFlops). But it obviously has enough RF bandwidth to get near its peak. So my train of thought was that the RF is not utilized 100% by pure FP16 non-MMA calculations, thus a certain amount should be available to at least partially feed the FP32-SIMDs.
     
  19. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    CarstenS: tensor cores lower register file bandwidth per math operation because they work on tensors rather than scalars (N^3/N^2).
    Nvidia actually shows this phenomenon in their animated tensor core cartoons in their keynotes.

    So it’s likely they are close to peak RF bandwidth both at 78 scalar TFlops as well as at 312 tensor TFlops.
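    The N^3/N^2 point above, made concrete (a sketch; the tile sizes are illustrative, not actual tensor core shapes):

    ```python
    # An N x N x N matrix-multiply tile does O(N^3) FMAs while moving only
    # O(N^2) operands, so register-file traffic per math op shrinks as the
    # tile grows. That is why tensor cores can sustain far more FLOPS on
    # the same RF bandwidth than scalar FMAs can.

    def ops_per_operand(n: int) -> float:
        """FMAs per operand moved for an n x n x n matrix-multiply tile."""
        fmas = n ** 3          # one FMA per (i, j, k) triple
        operands = 3 * n ** 2  # read A and B tiles, write the C tile
        return fmas / operands

    print(ops_per_operand(1))   # scalar FMA: ~0.33 ops per operand moved
    print(ops_per_operand(16))  # 16x16x16 tile: ~5.33 ops per operand moved
    ```

    A 16x scaling in arithmetic intensity from a 16-wide tile is consistent with both modes sitting near the same RF bandwidth limit at very different FLOP rates.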
     
  20. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    9,044
    Likes Received:
    1,116
    Location:
    WI, USA
    I think AV1 is near so maybe that isn't really worthwhile now. Actually why don't they have AV1 encoding at this point? Heh
     
    #1260 swaaye, Sep 5, 2020
    Last edited: Sep 5, 2020