Nvidia Ampere Discussion [2020-05-14]

Would be interesting to know if an FP16-trained network matches an FP32-trained one in accuracy or if there is some small loss.
Isn't that what the conversion is about? You determine parameters for accuracy and AMP tries to match that? I would imagine it giving you two options closest to your spec: one slightly below, one slightly above, if it cannot match it perfectly.
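For anyone unfamiliar with how this is usually wired up, below is a minimal mixed-precision training sketch using PyTorch's torch.cuda.amp (NVIDIA's apex.amp works along the same lines). The tiny model, random data and hyperparameters are placeholders, not anything from the BERT results discussed in this thread.

```python
# Minimal mixed-precision training sketch with PyTorch's torch.cuda.amp
# (requires a CUDA GPU; the tiny model and random data are placeholders).
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

device = "cuda"
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()  # dynamic loss scaling keeps small FP16 gradients from underflowing

for step in range(100):
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    with autocast():                  # matmuls etc. run in FP16, reductions stay FP32
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales first, skips the step on inf/NaN grads
    scaler.update()
```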
 
Isn't that what the conversion is about? You determine parameters for accuracy and AMP tries to match that? I would imagine it giving you two options closest to your spec: one slightly below, one slightly above, if it cannot match it perfectly.

What I was expecting to see as a comparison would be something like: "Using dataset foo and queries bar, the neural network achieved inference accuracy of xx% for the FP32-trained and xx% for the FP16-trained model." Based on the wording of the blog I assume the accuracy is very similar, but it would be nice to know how similar and how it was measured.
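If someone wanted to produce that kind of number themselves, the comparison could be as simple as the sketch below: evaluate an FP32-trained and an FP16/AMP-trained checkpoint on the same validation set. build_model(), the checkpoint file names and val_loader are hypothetical placeholders.

```python
# Hypothetical sketch of the comparison asked for above: same validation data,
# one FP32-trained and one FP16/AMP-trained checkpoint, report top-1 accuracy.
# build_model(), the checkpoint paths and val_loader are placeholders.
import torch

@torch.no_grad()
def accuracy(model, loader, device="cuda"):
    model.eval().to(device)
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# model_fp32 = build_model(); model_fp32.load_state_dict(torch.load("fp32_trained.pt"))
# model_amp  = build_model(); model_amp.load_state_dict(torch.load("amp_trained.pt"))
# print(f"FP32-trained: {accuracy(model_fp32, val_loader):.2%}")
# print(f"AMP-trained:  {accuracy(model_amp,  val_loader):.2%}")
```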
 
So I googled bert fp32 vs fp16. BERT is all the rage nowadays. Somewhat surprisingly, I found a data point showing what I was trying to share anecdotally. Unfortunately the blog post doesn't give a comparison of accuracy between FP32- and FP16-trained models. Would be interesting to know if an FP16-trained network matches an FP32-trained one in accuracy or if there is some small loss.

https://news.developer.nvidia.com/nvidia-achieves-4x-speedup-on-bert-neural-network/
For BERT, these were the latest graphs for FP32 vs FP16 (A100 vs V100):
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
The graph is actually incorrect in talking about FP32 for A100; it uses TF32.
No mention of BF16. Also nothing about the accuracy difference, if any.
 
For BERT, these were the latest graphs for FP32 vs FP16 (A100 vs V100):
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
The graph is actually incorrect in talking about FP32 for A100; it uses TF32.
No mention of BF16. Also nothing about the accuracy difference, if any.

The graph explicitly states TF32 was used for FP32. It's just in the small print.

The link to the old blog post I gave tried to show that BERT was initially FP32 (i.e. the research, the initial work, the reference to compare against). After BERT was proven, the optimizations to use lower precision came into the mix, where Nvidia optimized BERT to use FP16 on Volta. In this case the people doing the optimizations were very different from the ones who did the research.

The point I'm really trying to make is the value and use of FP32 as a reference/research utility. There is no denying the value of lower precision, but often it comes down to choosing the right tool for the job and the schedule you have. There isn't a strict one-size-fits-all solution here, especially when research and schedules are taken into account.
 
For BERT, these were the latest graphs for FP32 vs FP16 (A100 vs V100):
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
The graph is actually incorrect in talking about FP32 for A100; it uses TF32.
No mention of BF16. Also nothing about the accuracy difference, if any.

Looking at the BERT graph, TF32 (effectively a 19-bit format) isn't quite the touted 10x speedup over FP32 on V100.

I'm still interested to know what the results would be for BF16, as that should be even faster than FP16.
But I think I'm repeating myself, and most of you don't seem to understand what I'm talking about.
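For what it's worth, the "10x" figure does match the peak rates on paper. A quick check, assuming the published 15.7 TFLOPS V100 FP32 peak and the 156 TFLOPS A100 TF32 peak from the spec numbers quoted further down:

```python
# Back-of-the-envelope ratio of peak rates only; end-to-end BERT training also
# spends time in non-GEMM work and memory traffic, so the measured speedup is lower.
a100_tf32_peak_tflops = 156.0   # dense, without structured sparsity
v100_fp32_peak_tflops = 15.7    # published V100 FP32 peak (assumed here)
print(a100_tf32_peak_tflops / v100_fp32_peak_tflops)  # ~9.9x on paper
```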
 
This is an Nvidia thread, so I don't think there is much interest in researching Google's BF16 format.
Feel free to start a new thread if that is your intent.
 
Looking at the BERT graph, TF32 (effectively a 19-bit format) isn't quite the touted 10x speedup over FP32 on V100.

I'm still interested to know what the results would be for BF16, as that should be even faster than FP16.
But I think I'm repeating myself, and most of you don't seem to understand what I'm talking about.

For Ampere, BF16 and FP16 would likely execute at the same speed. TF32 would be ~half the speed versus FP16. This is probably caused by the FP32 input/output (memory bandwidth?). Proper FP32 would be way slower, and hence TF32 is interesting, especially as it's a drop-in replacement (FP32 input/output). Other hardware could of course be different.

Peak FP16 Tensor Core: 312 TFLOPS | 624 TFLOPS (with structured sparsity)
Peak BF16 Tensor Core: 312 TFLOPS | 624 TFLOPS (with structured sparsity)
Peak TF32 Tensor Core: 156 TFLOPS | 312 TFLOPS (with structured sparsity)
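Since the formats keep coming up, here is a small sketch comparing their layouts. The exponent/mantissa widths are the published ones; everything else is plain arithmetic. It shows why TF32 and BF16 keep FP32's dynamic range (same 8-bit exponent) and are close to drop-in, while FP16's 5-bit exponent is what usually forces loss scaling.

```python
# Rough comparison of the floating-point formats discussed in this thread.
# The exponent/mantissa widths are the published layouts; everything else is
# plain arithmetic, not a model of the tensor-core hardware.
formats = {            # name: (exponent bits, mantissa bits)
    "FP32": (8, 23),
    "TF32": (8, 10),   # 19 bits of payload, stored in a 32-bit container
    "BF16": (8, 7),
    "FP16": (5, 10),
}
for name, (e, m) in formats.items():
    emax = 2 ** (e - 1) - 1                   # largest unbiased exponent (IEEE-style)
    max_normal = (2 - 2 ** -m) * 2.0 ** emax  # largest finite normal value
    eps = 2.0 ** -m                           # relative precision (ulp of 1.0)
    print(f"{name}: range up to ~{max_normal:.2e}, relative precision ~{eps:.1e}")
```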
 
It's quite strange how Nvidia tries to promote TF32 and doesn't talk about the benefits of BF16; your reply does the same.
Different interpretation: NVidia has already lost the race for the market segments where BF16 is of relevance. Bulk computational resources are wasted on image data. There you have a number of ASIC vendors which are shipping embedded CPU + encoder + tensor acceleration units. NVidia tried to enter that market with a derivative of the Jetson platform, but pretty much failed. Too late to market. Intel also failed.
These platforms have enough computational power to perform pre-evaluation of image content before encoding. Low precision, simple networks, running several times a second with limited power budget. But good enough to provide at least fuzzy heuristics for when something could pop up, including most basic segmentation and reliable tracking. (And maybe soon also more than just that.)

So what's left is just detection and classification for objects which have already been segmented and are being tracked by edge computing. And for that, high confidence is key. 95% vs 98%, 99%, 99.5% or 99.9% makes a huge difference in that step with regard to making it applicable in production, because that makes the difference between requiring human supervision or not.
 
GTC 2020: How CUDA Math Libraries can help you unleash the power of the new NVIDIA A100 GPU
In the first part of this talk we will focus on how the new features of the NVIDIA A100 GPU can be accessed through the CUDA 11.0 Math libraries. These include 3rd generation tensor core functionality for double precision (FP64), TensorFloat-32 (TF32), half precision (FP16) and Bfloat16 (BF16); as well as increased memory bandwidth, multi-GPU performance improvements, and the hardware JPEG decoder.

In the second part of the talk, we will deep dive into the mixed-precision tensor core accelerated solvers and see how 3rd generation tensor cores can boost many HPC applications (workloads), bringing exciting speedups of up to 4x on the A100 GPU.
https://developer.nvidia.com/gtc/2020/video/s21681
 
Has it been explained anywhere how (non-Tensor) FP16 performance is now 4:1 instead of 2:1 to FP32?
 
Something does not compute here: why make the tensor cores' FP64 2.5x as fast and yet keep lugging the old FP64 units around?
Please correct me if I am wrong here.

Traditional FP64 units are simply a byproduct of the FP32 CUDA cores: you simply combine two FP32 cores (with some additional data paths) and that's it. That's why FP64 is always half rate in these GPUs; it doesn't take much effort to do this, as it is the lowest effort required to achieve high DP throughput. In other words: there are no actual unique FP64 ALUs on the die. You can see this clearly when running FP64 code on HPC GPUs; it will take over the entire shader core utilization.

This then comes back to the design of the chip: NVIDIA deemed they still need FP32 CUDA cores and texture units in an AI chip, so doing traditional FP64 on top of them is easy enough after that. It doesn't cost that much and will likely remain there for compatibility purposes and other general calculations.

For the rest, Tensor cores will now provide the majority of the throughput: they will replace FP32 for training with TF32, automatically and without a code change, and they will achieve high matrix FP64 throughput, though that requires developers to adapt their code to it.

Each Tensor core in Turing was capable of 64 FP16/FP32 mixed-precision operations per clock; Ampere increases that to 256 per clock! So per-clock throughput has increased 4x. With this comes the revelation that NVIDIA can use this heightened ability to unlock running FP64 ops on the tensor cores; I think each FP64 GEMM op now takes about 32 clocks on each tensor core.
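Reading "operations" as FMAs, the 256-per-clock figure lines up with the 312 TFLOPS peak in the spec numbers quoted earlier. A quick sanity check, assuming the published A100 figures of 108 SMs, 4 tensor cores per SM and a ~1.41 GHz boost clock:

```python
# Peak dense FP16 tensor-core throughput from per-tensor-core rates.
# 108 SMs, 4 tensor cores/SM and 1.41 GHz are the published A100 figures (assumed here).
sms = 108
tensor_cores_per_sm = 4
fma_per_clock_per_tensor_core = 256   # the figure discussed above
flops_per_fma = 2                     # one multiply + one add
boost_clock_hz = 1.41e9
peak = sms * tensor_cores_per_sm * fma_per_clock_per_tensor_core * flops_per_fma * boost_clock_hz
print(f"{peak / 1e12:.0f} TFLOPS")    # ~312 TFLOPS, matching the quoted peak
```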
 
The fact that TF32 stores FP19 values in 32 bits is indeed a waste of memory bandwidth and capacity.
When those 32-bit values are loaded into the tensor cores, the first thing that happens is that 13 unused bits are thrown away.
It's not a waste, as the output of a TF32 matrix multiplication is FP32 and can be used as input to FP32 math.
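To make the "13 unused bits" concrete, here is a small NumPy sketch that emulates TF32 inputs by clearing the low mantissa bits of float32 values (truncation for simplicity; the actual hardware rounding mode and the full-FP32 accumulation are not modelled):

```python
import numpy as np

def to_tf32_truncate(x):
    """Emulate TF32 inputs: clear the low 13 mantissa bits of each float32 value,
    keeping the sign bit, the full 8-bit exponent and a 10-bit mantissa."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.array([1.0 + 2 ** -12, 3.0e38], dtype=np.float32)
print(to_tf32_truncate(x))  # fine mantissa detail is dropped (-> 1.0),
                            # but the full FP32 exponent range is preserved
```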
 
An interesting use of the multi-instance feature is to allocate some instances for training, some for inferencing, or any other task, simultaneously and with fault isolation.
I can see where this would be handy for segmenting resources across multiple containers.
A100 is unique. A100 is maybe what you would refer to as a universal platform. It is not geared just to training this time. It is not geared just to inferencing. It has the capability of a platform to do both. It also has the ability to create multiple instances at the same time, so that you could choose to have part of those instances be overall training, part of them overall be inferencing and many other different types of combinations around that.
https://seekingalpha.com/article/43...virtual-tmt-conference-transcript?part=single
 
Good catch. Haven't seen mention of that anywhere besides the table of peak throughput numbers.
Yep, albeit only in discussion directly under the Nvidia blog:
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
Ronny Krashinsky (replying to Chulian Zhang), 8 days ago:
That's correct, we doubled non-Tensor-Core FP16 math up to 4x multiply-add rate relative to FP32. It was straightforward to support given 2xFP16 in the scalar datapath and 2xFP16 that could naturally be provided in the tensor core datapath.

edit 200905: Since posting this, Nvidia managed to "lose" the comment section, but I managed to find the comment in Krashinsky's Disqus profile. A screenshot is attached for your reference and in case it might get lost over there too: Ampere_GA100_4xFP16-rate_Krashinsky.png
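That quote also matches the published peak rates; a trivial check, assuming the 19.5 TFLOPS FP32 and 78 TFLOPS non-tensor-core FP16 peaks from NVIDIA's A100 spec table (those numbers are not from this thread):

```python
# Non-tensor-core FP16 vs FP32 peak throughput on A100 (published spec figures, assumed).
fp32_peak_tflops = 19.5
fp16_peak_tflops = 78.0
print(fp16_peak_tflops / fp32_peak_tflops)  # -> 4.0, i.e. the 4x multiply-add rate
```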

Please correct me if I am wrong here.

Traditional FP64 units are simply a byproduct of the FP32 CUDA cores: you simply combine two FP32 cores (with some additional data paths) and that's it. That's why FP64 is always half rate in these GPUs; it doesn't take much effort to do this, as it is the lowest effort required to achieve high DP throughput. In other words: there are no actual unique FP64 ALUs on the die. You can see this clearly when running FP64 code on HPC GPUs; it will take over the entire shader core utilization.
IIRC, Nvidia has stated on multiple occasions that the FP64 units are separate.

If you're right though, it's indeed not very surprising, but I was under the impression that this was AMD's approach.
 
IIRC, Nvidia has stated on multiple occasions that the FP64 units are separate.
Yeah, I don't think that aspect has changed from Volta.
The GV100 SM is partitioned into four processing blocks, each with 16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB Register File.
 