Nvidia Ampere Discussion [2020-05-14]

CarstenS · May 26, 2020

xpea said:
According to Nvidia, they provide the same accuracy in training. From here:
https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/

In fact, I have the results of Nvidia comparison between FP32 and TF32. Not sure I can share since I don't see it anywhere online, but I can say that the networks trained using TF32 have the same accuracy than FP32. For AI, TF32 is really a safe replacement for FP32 with a huge speedup in performance

AI training - probably yes, since that is what it was designed to do well. But this line of discussion started around the usefulness for scientifc HPC applications.

Apart from that: Tensor cores are massive MMA arrays, they cannot do anything else, AFAIK, for example they do not have the SFUs in the traditional cores.

DavidGraham · May 26, 2020

CarstenS said:
AI training - probably yes, since that is what it was designed to do well. But this line of discussion started around the usefulness for scientifc HPC applications.

Once more, FP64 Tensor format is designed for HPC simulation workloads as well as AI training. NVIDIA is encouraging developers to migrate their HPC code to the new format.

"With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.

Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips with less work, saving not only time and power but precious memory and I/O bandwidth as well.

We refer to this new capability as Double-Precision Tensor Cores."
https://agenparl.eu/double-precision-tensor-cores-speed-high-performance-computing/

CarstenS · May 26, 2020

DavidGraham said:
Once more, FP64 Tensor format is designed for HPC simulation workloads as well as AI training. NVIDIA is encouraging developers to migrate their HPC code to the new format.

What is FP64 Tensor Format? For scientific, it has to be IEEE754 FP64.

DavidGraham · May 26, 2020

CarstenS said:
What is FP64 Tensor Format? For scientific, it has to be IEEE754 FP64.

It is.

the A100 Tensor Core includes new IEEE-compliant FP64 processing that delivers 2.5x the FP64 performance of V100.

To meet the rapidly growing compute needs of HPC computing, the A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU.

https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/

Benetanegia · May 26, 2020

ToTTenTranz said:
Peak FP64 is 9.7 FLOPs, not 19.5.

19.5TF FP64 is for deep learning training.

They are not for DL only. They are just matrix on matrix multiplication instead of "scalar", so their use may (or may not) be more limited, depending on the requirements of the aplication. My remark comes after a stream of posts talking about how matrix on matrix multiplication can actually be useful in HPC in some/most cases. Which is something that I wouldn't know first hand, but if it is, then my remark follows: A100 offers 19.5 Tflops of FP64.

If their FP64 tensor cores were capable of non-ML tasks, why would they even need any FP64 ALUs at all?

Because matrix multiplication may not be useful in every situation. And FWIW, A100 may not even have FP64 ALUs at all outside the tensor cores, just like Volta/Turing didn't have FP16 ALUs, yet offered 2x FP16 rates through the TCs.

Deleted member 13524 · May 26, 2020

Benetanegia said:
They are not for DL only. They are just matrix on matrix multiplication instead of "scalar", so their use may (or may not) be more limited, depending on the requirements of the aplication. My remark comes after a stream of posts talking about how matrix on matrix multiplication can actually be useful in HPC in some/most cases.

What are all these FP64 matrix-on-matrix non-ML loads that the posts mention?
And if they exist in any relevant proportion, why is nvidia's own GA100 page only linking the FP64 Tensor performance to ML training?

Benetanegia said:
Which is something that I wouldn't know first hand, but if it is, then my remark follows: A100 offers 19.5 Tflops of FP64.

Except that is in direct contradiction to the public specifications.

Malo · May 26, 2020

Maybe due to the reduced precision so it likely depends on what the mantissa/exponent requirements are of the application itself? The TF32 definitely has less precision than FP32 but for ML purposes it's suitable? The Ampere documents don't go into what IEEE the FP64 is compliant with nor what levels of precision it has, like it does with TF32. At least this is all based on my limited knowledge.

Deleted member 13524 · May 26, 2020

I brought this here.

Benetanegia said:
I also know that Pascal P100 runs much lower clocks than GP102 despite the fact arch was identical except for the addition of half rate FP64 in P100.

That makes sense. It's a larger chip with more execution units and the TDP is similar, so they needed to lower the clocks to keep power consumption and heat dissipation in check.
Also, IIRC the P100 was nvidia's very first 16FF chip so there could have been some process optimization between GP100 and GP102.

Benetanegia said:
lol No, it doesn't say so, no. That's just jumping to conclusions, wildly, with absolutely no proof, whatsoever.

Only if you believe TDP doesn't depend on clock rates in a given chip.

DavidGraham · May 26, 2020

Malo said:
Maybe due to the reduced precision so it likely depends on what the mantissa/exponent requirements are of the application itself?

FP64 Tensor runs the full precision of FP64. The only mode with reduced precision is TF32.

Malo said:
he TF32 definitely has less precision than FP32 but for ML purposes it's suitable?

Yes, and it doesn't require a code change.

Malo said:
The Ampere documents don't go into what IEEE the FP64 is compliant with nor what levels of precision it has

IEEE FP64 compliant means the standard FP64.

Benetanegia · May 26, 2020

ToTTenTranz said:
And if they exist in any relevant proportion, why is nvidia's own GA100 page only linking the FP64 Tensor performance to ML training?

Why? I don't care? As shown by DavidGraham above, twice, Nvidia is actually advising to switch?

Except that is in direct contradiction to the public specifications.

No.

Bondrewd · May 26, 2020

DavidGraham said:
Yes, and it doesn't require a code change.

idk ask the poor cuDNN devs lmao

Benetanegia · May 26, 2020

Benetanegia said:
Why? I don't care? As shown by DavidGraham above, twice, Nvidia is actually advising to switch?

EDIT: It is not only linking FP64 Tensor performance to ML training.

No.

CarstenS · May 26, 2020

DavidGraham said:
It is.

https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/

Something does not compute here - why making the TC's FP64 2.5x as fast and yet keep lugging the old FP64 units around? Maybe, FP64 is not just GEMM?

Benetanegia · May 26, 2020

CarstenS said:
Something does not compute here - why making the TC's FP64 2.5x as fast and yet keep lugging the old FP64 units around? Maybe, FP64 is not just GEMM?

For the same reason Turing had 2xFP16 rate alongside the much higher Tensor FP16. One is matrix multiplication the other is not. Both modes still executed on the TCs.

CarstenS · May 26, 2020

That's what I said, ain't it?

OlegSH · May 26, 2020

CarstenS said:
FP64 is not just GEMM

Yet, it's used enough to serve as a performance measure for TOP500 supercomputers ranking.
It's used in many practical HPC tasks here and there, otherwise it would have been pointless to test supercomputers on DGEMM.

Benetanegia · May 26, 2020

CarstenS said:
That's what I said, ain't it?

Not really sure what you're trying to say tbh. As OlegSH just pointed out right above, DGEMM's importance in HPC is beyond any doubt. I don't see what does not compute in Nvidia offering different intruction types for different tasks.
Also afaik in Turing at least, it was posible to dual issue a TC and either an FP32 or INT32, too, so maybe that extends to FP64 ALUs if there are any at all.

Benetanegia · May 26, 2020

If the apparent inclusion of FP64 ALUs is what bugs you, my theory for that is the following:

Turing didn't have FP16 yet it did offer "scalar" FP16 instructions at 2x rate running on the tensor cores. If my math is correct that translates to "scalar" instructions running in the TCs, providing 1/4 the rate of the tensor/matrix-on-matrix rates (i.e 32 TFLOPS vs 120). If running "scalar" FP64 instructions on Ampere TCs has the same 1/4 impact (probable), that would result in 1:4 FP64:FP32 rate compared to the desired 1:2 rate. Hence the inclusion of actual FP64 ALUs.

xpea · May 27, 2020

CarstenS said:
AI training - probably yes, since that is what it was designed to do well. But this line of discussion started around the usefulness for scientifc HPC applications.

But AI ML is where the money is. From hyperion research:

HPC is shrinking and Nvidia made the right decision to go all in AI ML DL with Ampere. For the record, by the end of 2019, Nvidia had more than 1500 direct customers for DGX systems. From nothing 2 years ago, and at +100k per system, that's where the money is. Forget HPC, already for this generation it's not the main focus anymore...

nnunn · May 27, 2020

Just to clarify, double precision (FP64) matrix multiplication is just one of many things that HPC codes do. All our other (double precision) kernels (currently) expect traditional double precision ALUs.

Nvidia Ampere Discussion [2020-05-14]

CarstenS

Moderator

DavidGraham

CarstenS

Moderator

DavidGraham

Benetanegia

Deleted member 13524

Guest

Malo

Yak Mechanicum

Deleted member 13524

Guest

DavidGraham

Benetanegia

Bondrewd

Benetanegia

CarstenS

Moderator

Benetanegia

CarstenS

Moderator

OlegSH

Benetanegia

Benetanegia

xpea

nnunn

Similar threads