We have a bigger table that includes comparisons with the Kepler and Maxwell generations of Tesla accelerators, but this table is too big to display. (
You can view it here in a separate window.) The FP16 (with either FP16 or FP32 accumulate), bfloat16 (BF16), and Tensor Float32 (TF32) formats used on the new Tensor Core units show performance both without the sparse matrix support and with the 2X improvement it provides when turned on.
The sparse matrix support also gooses INT4 and INT8 inference processing on the Tensor Cores by a factor of 2X when it is activated. It is not available for FP64 processing on the Tensor Cores, but the Tensor Core implementation of 64-bit matrix math can deliver 2X the throughput on FP64 math compared to the FP64 units on the GA100 and 2.5X that of the GV100, which only had plain vanilla FP64 units.
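That 2X from sparsity is not magic: the Ampere Tensor Cores exploit fine-grained 2:4 structured sparsity, in which two of every four values in a weight matrix are zeroed out so the hardware can skip half of the multiply-accumulate work. The NumPy sketch below only illustrates the pruning pattern; the prune_2_of_4 helper is our own name for it, and real deployments rely on NVIDIA's pruning and retraining tools plus the compressed storage format, not this toy code.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four.

    This mimics the 2:4 structured sparsity pattern the Ampere sparse
    Tensor Cores expect: at most two non-zeros per group of four, which
    lets the hardware skip half the multiply-accumulates for roughly 2X
    math throughput. Illustration only, not the actual hardware path.
    """
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest |values| in each group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

# Example: prune a small FP16 weight matrix to the 2:4 pattern.
rng = np.random.default_rng(0)
dense = rng.standard_normal((4, 8)).astype(np.float16)
sparse = prune_2_of_4(dense)
assert (sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2
```

Because only the two surviving values in each group need to be stored and multiplied, the math units do half the work per matrix tile, which is where the doubled FP16, BF16, TF32, INT8, and INT4 throughput figures come from.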
...
“It may not be obvious from the documentation, but it is a non-trivial exercise to get another 2X performance out of the SMs with the Tensor Cores,” Alben tells
The Next Platform. “We pushed it as far as we thought we could in Volta without the thing catching on fire, but with a lot of hard work, we figured out how to get another 2X out of the system, and we were able to do that and get even better utilization than we did in the Volta generation. We are definitely proud to see the results.”
By the way, here is one thing to look forward to: That extra 20 percent of memory bandwidth and memory capacity will be unlocked, and so will the remaining 18.5 percent of latent performance embodied in the 20 SMs that are left dark to increase the yield on the chips. This is a bigger block of latent capacity in the Ampere device than there was in the Volta device.
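If you want to check those percentages, the arithmetic works out under the commonly cited GA100 configuration, which we take here to be 128 SMs fabricated with 108 active and six HBM2 stacks with five enabled:

\[
\frac{128 - 108}{108} = \frac{20}{108} \approx 18.5\% \text{ more SM throughput},
\qquad
\frac{6 - 5}{5} = 20\% \text{ more HBM2 bandwidth and capacity.}
\]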