Nvidia Ampere Discussion [2020-05-14]

But doesn't running FP16 on the Tensor Cores mean that half of your FP16 FLOPS exist only for matrix multiplications? Or are they now versatile enough to run both fast matrix multiplications and other operations? And if so, why drag the regular units along at all?
 

A Tensor Core matrix multiplication is just a bunch of scalar FMAs that expect their operands in matrix form. My linear algebra is very rusty, but I assume this works by feeding the Tensor Cores scalar FMA operands packed into sparse matrices.

That gives you 32 FP16 FMAs and the other 32 FMAs run on the 16 regular ALUs.
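
To make that concrete, here is a minimal CUDA sketch (an illustration, not a claim about how the scheduler actually splits work between Tensor Cores and the regular ALUs) of what the mixed-precision Tensor Core path looks like from the programmer's side via the WMMA API: FP16 operand tiles, FP32 accumulation, one warp per 16x16x16 tile.

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B + C on the Tensor Cores.
// A and B are FP16, the accumulator is FP32 (the mixed-precision FMA path).
__global__ void wmma_16x16x16(const half *A, const half *B, float *D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // C = 0
    wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // D = A*B + C
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g. wmma_16x16x16<<<1, 32>>>(A, B, D);
```

Anything that is not a matrix multiply-accumulate still has to go through the regular FP16/FP32 ALUs, which is presumably why they are kept around.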
 
It seems the multiplication also happens at full precision, based on slide 55 here - https://developer.download.nvidia.c...tions/s21760-cuda-new-features-and-beyond.pdf
It's also stated explicitly around the 36th minute of this session - https://developer.nvidia.com/gtc/2020/video/s21760
So TF32 just truncates the input operands and does the math at full precision.
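
A rough host-side sketch of that idea, modelling the operand rounding as simple truncation of the low 13 FP32 mantissa bits as described above (the hardware may round to nearest rather than truncate), with the multiply and accumulate done at full FP32 precision:

```
#include <cstdint>
#include <cstdio>
#include <cstring>

// Model a TF32 multiplication operand: an FP32 value with the mantissa cut
// down to 10 explicit bits (FP32 has 23, so the low 13 are dropped).
static float to_tf32_operand(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFE000u;                 // clear the low 13 mantissa bits
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

// TF32-style FMA as described above: reduced-precision multiplication
// operands, full FP32 multiply and accumulate.
static float tf32_fma(float a, float b, float acc)
{
    return to_tf32_operand(a) * to_tf32_operand(b) + acc;
}

int main()
{
    // Precision is lost in the operands (1.0000001f truncates to 1.0f)...
    printf("%.9g\n", tf32_fma(1.0000001f, 3.0f, 0.0f));   // prints 3
    // ...but not in the accumulation, which stays at FP32.
    printf("%.9g\n", tf32_fma(3.0f, 3.0f, 0.001f));       // prints ~9.001
}
```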

Why else would you accumulate at FP32 if you were not computing the addition operands at FP32?

The mixed-precision FMA units for TF32 and FP16 are basically the same. The only difference is the handling of the exponent, which has 3 more bits in TF32 than in FP16.
The cost of an FMA is in the number of mantissa bits: the silicon area of an FMA multiplier is roughly proportional to the square of the mantissa width, and that width is the same for TF32 and FP16 (see the quick computation after the list below).
See also :
"Smaller mantissa brings a number of other advantages such as reducing the multiplier power and physical silicon area."
  • float32: 24^2=576 (100%)
  • float16: 11^2=121 (21%)
  • bfloat16: 8^2=64 (11%)
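
Plugging TF32 into the same back-of-the-envelope rule (purely illustrative, assuming the quoted area-is-proportional-to-mantissa-squared rule of thumb) shows why its multiplier costs essentially the same as FP16's:

```
#include <cstdio>

// Relative multiplier cost under the "area ~ mantissa bits squared" rule of
// thumb quoted above. Mantissa widths include the implicit leading bit.
struct Format { const char *name; int mantissa_bits; };

int main()
{
    const Format fmts[] = {
        { "float32",  24 },  // 23 explicit + 1 implicit
        { "tf32",     11 },  // 10 explicit + 1 implicit, same width as float16
        { "float16",  11 },
        { "bfloat16",  8 },  //  7 explicit + 1 implicit
    };
    const float base = 24.0f * 24.0f;    // float32 as the 100% reference
    for (const Format &f : fmts) {
        const int area = f.mantissa_bits * f.mantissa_bits;
        printf("%-8s %2d^2 = %3d  (%3.0f%%)\n",
               f.name, f.mantissa_bits, area, 100.0f * area / base);
    }
}
```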
 
This is an Nvidia thread, so don't think there is much interest in researching Google's BF16 format.
Feel free to start a new thread if that is your intent.
Nvidia was smart enough to add BF16 to the A100, have you really been ignorant of this fact?
 
If it doesn't reduce the memory footprint, what's the goal? Faster transfers from point A to point B?
 
Nvidia was smart enough to add BF16 to the A100, have you really been ignorant of this fact?
Ignorance is bliss, or is it inclusiveness? Including BF16 means supporting a currently established standard data format for clients to use; it would be negligence otherwise. There are likely scenarios where BF16 performance is sufficient, and in those situations the ability to maintain the status quo is Nvidia's advantage.

Having the ability to switch to the TF32 format and its extra precision covers scenarios where BF16 is lacking, and gains in performance, cost, or precision can be found using the new TF32. Once independent training and inference benchmark results appear, we will have a better idea about those scenarios.
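
For a feel of the precision difference being discussed, here is a small sketch that rounds an FP32 value down to TF32-like and BF16-like operand precision by truncation (a simplification; real conversions typically round to nearest). Both formats keep the full 8-bit FP32 exponent, so only the mantissa width differs:

```
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncate an FP32 value to a given number of explicit mantissa bits.
// TF32 keeps 10 explicit mantissa bits, BF16 keeps 7; both keep the
// 8-bit FP32 exponent, so the dynamic range is identical.
static float truncate_mantissa(float x, int keep_bits)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= ~((1u << (23 - keep_bits)) - 1u);   // drop the low mantissa bits
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

int main()
{
    const float x = 2.7182818f;
    printf("fp32 : %.9g\n", x);
    printf("tf32 : %.9g (10 mantissa bits)\n", truncate_mantissa(x, 10));
    printf("bf16 : %.9g ( 7 mantissa bits)\n", truncate_mantissa(x, 7));
}
```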
 
DIVING DEEP INTO THE NVIDIA AMPERE GPU ARCHITECTURE
May 28, 2020

Here is how the GA100 stacks up against the Pascal, Volta, and Turing GPUs used in Tesla accelerators in terms of features and performance on the widening array (pun intended) of numeric formats that Nvidia has supported to push more throughput on AI workloads through its GPUs:

nvidia-pascal-volta-ampere-comparison-table.jpg

We have a bigger table that includes comparisons with the Kepler and Maxwell generations of Tesla accelerators, but this table is too big to display. (You can view it here in a separate window.) The FP16 with either FP16 or FP32 accumulate, bfloat16 (BF16), and Tensor Float32 (TF32) formats used on the new Tensor Core units show performance without the sparse matrix support and the 2X improvement with it turned on.

The sparse matrix support also gooses INT4 and INT8 inference processing on the Tensor Cores by a factor of 2X when it is activated. It is not available on the FP64 processing on the Tensor Cores, but the Tensor Core implementation of 64-bit matrix math can deliver 2X the throughput on FP64 math compared to the FP64 units on the GA100 and 2.5X that of the GV100, which only had plain vanilla FP64 units.
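
As a rough idea of what the sparsity feature exploits: Ampere's scheme is 2:4 fine-grained structured sparsity, where at most two of every four consecutive values are non-zero and are stored as two values plus small position indices, so the Tensor Cores only execute the surviving half of the FMAs. The CPU-side sketch below is only illustrative; the struct, names, and magnitude-based pruning rule are assumptions, not Nvidia's actual storage format or API.

```
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <utility>

// One group of four weights pruned 2:4 - keep the two largest-magnitude
// values plus their 2-bit positions, so only half of the FMAs remain.
struct SparseGroup {
    float   value[2];   // the two surviving weights
    uint8_t index[2];   // their positions within the group (0..3)
};

static SparseGroup prune_2_of_4(const float w[4])
{
    int best = 0, second = 1;
    if (std::fabs(w[second]) > std::fabs(w[best])) std::swap(best, second);
    for (int i = 2; i < 4; ++i) {
        if      (std::fabs(w[i]) > std::fabs(w[best]))   { second = best; best = i; }
        else if (std::fabs(w[i]) > std::fabs(w[second])) { second = i; }
    }
    return { { w[best], w[second] }, { (uint8_t)best, (uint8_t)second } };
}

// Dot product of the compressed group with dense activations:
// two FMAs instead of four, which is where the 2X throughput comes from.
static float sparse_dot(const SparseGroup &g, const float a[4])
{
    return g.value[0] * a[g.index[0]] + g.value[1] * a[g.index[1]];
}

int main()
{
    const float w[4] = { 0.9f, -0.02f, 0.01f, -1.3f };
    const float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f };
    const SparseGroup g = prune_2_of_4(w);
    printf("dense  dot: %g\n", w[0]*a[0] + w[1]*a[1] + w[2]*a[2] + w[3]*a[3]);
    printf("sparse dot: %g\n", sparse_dot(g, a));
}
```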
...
“It may not be obvious from the documentation, but it’s a non-trivial exercise to get another 2X performance out of the SMs with the Tensor Cores,” Alben tells The Next Platform. “We pushed it as far as we thought we could in Volta without the thing catching on fire, but with a lot of hard work, we figured out how to get another 2X out of the system, and we were able to do that and get even better utilization than we did in the Volta generation. We are definitely proud to see the results.”

By the way, here is one thing to look forward to: That extra 20 percent of memory bandwidth and memory capacity will be unlocked, and so will the remaining 18.5 percent of latent performance embodied in the 20 SMs that are dark to increase the yield on the chips. This is a bigger block of latent capacity in the Ampere device than was in the Volta device.
https://www.nextplatform.com/2020/05/28/diving-deep-into-the-nvidia-ampere-gpu-architecture/
 
So data read from RAM is compressed before it's moved into L2, and then decompressed to L1?
No. Data leaving the SMs can be compressed before being written to L2 or memory. If compute later accesses that data again, it is read back into L2 in compressed form. So you save bandwidth on the way out of the GPU and on the way back in, and you also gain effective L2 capacity, since data is only decompressed when it leaves L2 for L1.
 
That's a great joke/fake

All those fins and the fan have no purpose at all - there is no airflow path, the fan blades are melted to the frame, etc.

Just some joker with a 3D printer
 

What do you mean by "fan blades are melted to a frame"?

The image suggests that the PCB is really short, with one fan cooling from the "front side" of the graphics card and a second one from the backside. There are fins visible under the supposed backside fan running lengthwise along the card, even though the ones next to the fan serve no clear purpose.

Overall I'm really doubtful about it too, but..
 
Why do I think of Doom 3 seeing the top image :p Interesting design for the cooling, unseen in traditional GPUs so far, with the two-fan setup, one on each side. Won't airflow out the back of the GPU (out of a PC case) be limited that way?
 
One good thing about the pandemic is that I'm saving huge loads of money lol. Gonna need it.

edit: mmm... if that airflow is correct, it's gonna shower the CPU/RAM with a lot of hot air.
 