Nvidia Ampere Discussion [2020-05-14]

According to Nvidia, TF32 provides the same accuracy in training. From here:
https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/

In fact, I have the results of Nvidia's comparison between FP32 and TF32. I'm not sure I can share them since I don't see them anywhere online, but I can say that the networks trained using TF32 have the same accuracy as those trained with FP32. For AI, TF32 is really a safe replacement for FP32, with a huge speedup in performance.
AI training - probably yes, since that is what it was designed to do well. But this line of discussion started around the usefulness for scientific HPC applications.

Apart from that: tensor cores are massive MMA arrays; they cannot do anything else, AFAIK. For example, they do not have the SFUs found in the traditional cores.
 
AI training - probably yes, since that is what it was designed to do well. But this line of discussion started around the usefulness for scientific HPC applications.
Once more, FP64 Tensor format is designed for HPC simulation workloads as well as AI training. NVIDIA is encouraging developers to migrate their HPC code to the new format.

"With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.

Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips with less work, saving not only time and power but precious memory and I/O bandwidth as well.

We refer to this new capability as Double-Precision Tensor Cores."
https://agenparl.eu/double-precision-tensor-cores-speed-high-performance-computing/
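To make the "one instruction" claim a bit more concrete, here is a minimal sketch of what a DMMA-backed tile looks like from CUDA C++ (assuming CUDA 11+ and an sm_80 part; the 8x8x4 double fragment shape is the one the Ampere toolkit exposes through wmma, and the kernel name dmma_tile is mine):

#include <mma.h>
using namespace nvcuda;

// One warp computes an 8x8 double-precision tile C = A*B (A is 8x4 row-major,
// B is 4x8 col-major); the k=4 inner product maps onto the DMMA operation.
__global__ void dmma_tile(const double* A, const double* B, double* C) {
    wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 8, 4, double> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0);
    wmma::load_matrix_sync(a_frag, A, 4);                 // leading dimension = 4
    wmma::load_matrix_sync(b_frag, B, 4);                 // leading dimension = 4
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // the tensor-core MMA step
    wmma::store_matrix_sync(C, acc_frag, 8, wmma::mem_row_major);
}

// Launch with a single warp, e.g. dmma_tile<<<1, 32>>>(dA, dB, dC); build with nvcc -arch=sm_80.

In practice you wouldn't hand-write this for HPC code; the point is only that the same warp-level building block sits underneath the cuBLAS/CUTLASS FP64 paths.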
 
Once more, FP64 Tensor format is designed for HPC simulation workloads as well as AI training. NVIDIA is encouraging developers to migrate their HPC code to the new format.
What is the FP64 Tensor format? For scientific use, it has to be IEEE 754 FP64.
 
What is the FP64 Tensor format? For scientific use, it has to be IEEE 754 FP64.
It is.

the A100 Tensor Core includes new IEEE-compliant FP64 processing that delivers 2.5x the FP64 performance of V100.

To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
 
Peak FP64 is 9.7 TFLOPS, not 19.5.

19.5 TFLOPS FP64 is for deep learning training.

They are not for DL only. They just do matrix-on-matrix multiplication instead of "scalar" operations, so their use may (or may not) be more limited, depending on the requirements of the application. My remark comes after a stream of posts talking about how matrix-on-matrix multiplication can actually be useful in HPC in some/most cases. Which is something that I wouldn't know first hand, but if it is, then my remark follows: A100 offers 19.5 TFLOPS of FP64.

If their FP64 tensor cores were capable of non-ML tasks, why would they even need any FP64 ALUs at all?

Because matrix multiplication may not be useful in every situation. And FWIW, A100 may not even have FP64 ALUs at all outside the tensor cores, just like Volta/Turing didn't have FP16 ALUs, yet offered 2x FP16 rates through the TCs.
 
They are not for DL only. They just do matrix-on-matrix multiplication instead of "scalar" operations, so their use may (or may not) be more limited, depending on the requirements of the application. My remark comes after a stream of posts talking about how matrix-on-matrix multiplication can actually be useful in HPC in some/most cases.
What are all these FP64 matrix-on-matrix non-ML loads that the posts mention?
And if they exist in any relevant proportion, why is nvidia's own GA100 page only linking the FP64 Tensor performance to ML training?


Which is something that I wouldn't know first hand, but if it is, then my remark follows: A100 offers 19.5 TFLOPS of FP64.
Except that is in direct contradiction to the public specifications.
 
Maybe it's due to the reduced precision, so it likely depends on what the mantissa/exponent requirements of the application itself are? TF32 definitely has less precision than FP32, but for ML purposes it's suitable? The Ampere documents don't go into which IEEE standard the FP64 is compliant with, nor what level of precision it has, like they do with TF32. At least this is all based on my limited knowledge.
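On the mantissa/exponent point: TF32 keeps FP32's 8 exponent bits but only 10 explicit mantissa bits, so its range matches FP32 while its precision is closer to FP16. A rough host-side emulation (my own sketch; it truncates, whereas the hardware conversion rounds to nearest) is just masking off the low 13 mantissa bits of a float:

#include <cstdint>
#include <cstring>
#include <cstdio>

// Keep sign (1) + exponent (8) + top 10 mantissa bits; zero the remaining 13.
// This only approximates TF32 (the real conversion rounds instead of truncating).
static float to_tf32_approx(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float v = 1.0f + 1.0f / 4096.0f;   // needs 12 mantissa bits; exact in FP32
    printf("FP32: %.10f  TF32-ish: %.10f\n", v, to_tf32_approx(v));  // second value collapses to 1.0
    return 0;
}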
 
I brought this here.

I also know that Pascal P100 runs at much lower clocks than GP102, despite the fact that the architecture was identical except for the addition of half-rate FP64 in P100.

That makes sense. It's a larger chip with more execution units and the TDP is similar, so they needed to lower the clocks to keep power consumption and heat dissipation in check.
Also, IIRC the P100 was nvidia's very first 16FF chip so there could have been some process optimization between GP100 and GP102.


lol No, it doesn't say so, no. That's just jumping to conclusions, wildly, with absolutely no proof, whatsoever.
Only if you believe TDP doesn't depend on clock rates in a given chip.
 
Maybe it's due to the reduced precision, so it likely depends on what the mantissa/exponent requirements of the application itself are?
FP64 Tensor runs at the full precision of FP64. The only mode with reduced precision is TF32.

TF32 definitely has less precision than FP32, but for ML purposes it's suitable?
Yes, and it doesn't require a code change.

The Ampere documents don't go into which IEEE standard the FP64 is compliant with, nor what level of precision it has
IEEE-compliant FP64 means standard IEEE 754 double precision.
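To illustrate the "no code change" point at the library level, here is a hedged sketch against CUDA 11's cuBLAS: the GEMM call and the FP32 buffers are the same as on any older GPU; only an opt-in math mode tells cuBLAS it may use TF32 tensor-core paths internally (accumulation stays in FP32). The wrapper name sgemm_with_tf32 is mine:

#include <cublas_v2.h>

// Standard single-precision GEMM, C = A*B, column-major as cuBLAS expects.
// The one-line math-mode switch is the only Ampere-specific addition.
void sgemm_with_tf32(cublasHandle_t handle, int m, int n, int k,
                     const float* A, const float* B, float* C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);   // opt in to TF32 (CUDA 11)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A, m, B, k, &beta, C, m);
}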
 
That's what I said, ain't it? ;)

Not really sure what you're trying to say tbh. As OlegSH just pointed out right above, DGEMM's importance in HPC is beyond any doubt. I don't see what doesn't compute about Nvidia offering different instruction types for different tasks.
Also, AFAIK in Turing at least, it was possible to dual-issue a TC instruction and either an FP32 or an INT32 instruction too, so maybe that extends to FP64 ALUs, if there are any at all.
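For reference, the DGEMM in question is just the stock cuBLAS call; per NVIDIA's Ampere material, the library is supposed to route it through the Double-Precision Tensor Cores on A100 without any API change (a sketch; dgemm_square is my own wrapper name):

#include <cublas_v2.h>

// Plain FP64 GEMM on n x n matrices: C = A*B. The call is unchanged from
// older GPUs; on A100 the library can use DMMA under the hood.
void dgemm_square(cublasHandle_t handle, int n,
                  const double* A, const double* B, double* C) {
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
}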
 
If the apparent inclusion of FP64 ALUs is what bugs you, my theory for that is the following:

Turing didn't have FP16 ALUs, yet it did offer "scalar" FP16 instructions at 2x rate running on the tensor cores. If my math is correct, that translates to "scalar" instructions running in the TCs providing 1/4 of the tensor/matrix-on-matrix rates (i.e. ~32 TFLOPS vs ~120). If running "scalar" FP64 instructions on Ampere TCs has the same 1/4 impact (probable), that would result in a 1:4 FP64:FP32 rate compared to the desired 1:2 rate. Hence the inclusion of actual FP64 ALUs.
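Spelling that arithmetic out with public peak figures (my own reconstruction; I'm assuming the poster means roughly V100-class numbers of ~31 TFLOPS "scalar" FP16 vs ~125 TFLOPS tensor FP16, and A100's 19.5 TFLOPS FP32):

\[
\frac{31}{125} \approx \frac{1}{4}, \qquad
\frac{1}{4} \times 19.5\ \text{TFLOPS (FP64 tensor)} \approx 4.9\ \text{TFLOPS}, \qquad
\frac{4.9}{19.5} \approx 1{:}4\ \text{FP64:FP32}.
\]

The separate 9.7 TFLOPS non-tensor FP64 figure is what restores the traditional 1:2 ratio.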
 
AI training - probably yes, since that is what it was designed to do well. But this line of discussion started around the usefulness for scientific HPC applications.
But AI/ML is where the money is. From Hyperion Research:
[attached: Hyperion Research market chart]
HPC is shrinking, and Nvidia made the right decision to go all-in on AI/ML/DL with Ampere. For the record, by the end of 2019 Nvidia had more than 1,500 direct customers for DGX systems. From nothing 2 years ago, and at 100k+ per system, that's where the money is. Forget HPC; already for this generation, it's not the main focus anymore...
 
Just to clarify, double precision (FP64) matrix multiplication is just one of many things that HPC codes do. All our other (double precision) kernels (currently) expect traditional double precision ALUs.
 