Nvidia Ampere Discussion [2020-05-14]

On a side note, no wonder NVIDIA "complained" years ago about HBM not scaling the way IHVs would want it to. For those wondering about the lack of far more FP32 performance: within the current process/manufacturing/die-area and overall bandwidth constraints (amongst many others, of course), IHVs obviously try to find the best possible balance for each target market.

Ampere isn't a mainstream consumer product, and I'd be very surprised if Turing's successor comes with HBM.
 
To explain once more the difference between FP32 and TF32:
TF32 does matrix multiplication with the matrix values rounded to 19-bit numbers (1 sign bit, 8 exponent bits, 10 mantissa bits), i.e. with FP19 precision.
Hence the tensor cores in TF32 mode cannot do matrix multiplication at full FP32 precision.

For AI training FP19 can be enough; even BF16 can be enough.
Google's TPU2/3 are built around BF16, and the A100 supports BF16 as well.
BF16 halves memory/cache storage and doubles effective bandwidth compared to TF32, which is stored in 32 bits.
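To get a rough feel for what FP19 and BF16 throw away, here's a small NumPy sketch that simply zeroes out the low mantissa bits of an FP32 value (real hardware rounds rather than truncates, so treat this only as an approximation of the precision loss; the helper name is mine):

```python
import numpy as np

def truncate_mantissa(x, kept_bits):
    """Zero the low-order mantissa bits of a float32 value.

    FP32 has 23 mantissa bits; TF32 keeps 10 of them, BF16 keeps 7.
    All three share the same 8-bit exponent, so the numeric range is identical.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    drop = 23 - kept_bits
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)
    return float((bits & mask).view(np.float32))

x = 1.0 / 3.0
print("FP32:", np.float32(x))             # 1 + 8 + 23 = 32 bits
print("TF32:", truncate_mantissa(x, 10))  # 1 + 8 + 10 = 19 bits
print("BF16:", truncate_mantissa(x, 7))   # 1 + 8 + 7  = 16 bits
```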
 
Just to clarify, double precision (FP64) matrix multiplication is just one of many things that HPC codes do. All our other (double precision) kernels (currently) expect traditional double precision ALUs.

Out of curiosity - are your HPC codes typically limited by ALU throughput, or is increasing memory (main/cache/shared) bandwidth per FLOP more important?
 
To explain once more the difference between FP32 and TF32:
TF32 does matrix multiplication with the matrix values rounded to 19-bit numbers (1 sign bit, 8 exponent bits, 10 mantissa bits), i.e. with FP19 precision.
Hence the tensor cores in TF32 mode cannot do matrix multiplication at full FP32 precision.
You forgot to say that TF32 operates on FP32 inputs, all internal accumulators are FP32, and the output is FP32. In training there is no difference between FP32 and TF32. That's why Nvidia completely replaced FP32 with TF32.

https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
TF32 adopts the same 8-bit exponent as FP32 so it can support the same numeric range.

The combination makes TF32 a great alternative to FP32 for crunching through single-precision math, specifically the massive multiply-accumulate functions at the heart of deep learning and many HPC apps.

Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in FP32. Non-matrix operations continue to use FP32.
Sure, people have doubts now, but soon users will compare TF32 and FP32 and see that it gives exactly the same output with up to 10 times faster training (up to 20 times with sparsity).
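For what it's worth, this is what the "no code change" story looks like from a framework: TF32 is a global switch rather than a new tensor dtype. A hedged example using PyTorch (these flags arrived in later PyTorch releases; tensors stay plain FP32 throughout):

```python
import torch

# Allow FP32 matmuls/convolutions to run on Ampere tensor cores in TF32 mode.
torch.backends.cuda.matmul.allow_tf32 = True   # matmuls via cuBLAS
torch.backends.cudnn.allow_tf32 = True         # convolutions via cuDNN

a = torch.randn(4096, 4096, device="cuda")     # ordinary FP32 tensors
b = torch.randn(4096, 4096, device="cuda")
c = a @ b   # on an A100 this matmul may use TF32 tensor cores; inputs and outputs stay FP32
```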
 
You forgot to say that TF32 operates on FP32 inputs, all internal accumulators are FP32, and the output is FP32. In training there is no difference between FP32 and TF32. That's why Nvidia completely replaced FP32 with TF32.

https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/

The fact that TF32 stores FP19 values in 32 bits is indeed a waste of memory bandwidth and capacity.
When those 32-bit values are loaded into the tensor cores, the first thing that happens is that the 13 unused bits are thrown away.
 
When those 32-bit values are loaded into the tensor cores, the first thing that happens is that the 13 unused bits are thrown away.
Doesn't the addition of the bias matrix afterwards happen at full FP32 precision (so that the output fully occupies 32 bits)? Not sure how it helps with convergence in NNs, but I've heard it helps quite a bit.
 
Doesn't the addition of the bias matrix afterwards happen at full FP32 precision (so that the output fully occupies 32 bits)? Not sure how it helps with convergence in NNs, but I've heard it helps quite a bit.
The main thing that matters IMHO is that the accumulation happens at FP32 precision. Once that is done, the output can be reduced back to lower precision, optionally after adding the bias. Either the output is fed back into the tensor cores, which only use FP19 anyway, or a non-linearity like relu, sigmoid or tanh is applied, which requires only low-precision input and output.
In any case, in principle it should not be necessary to store the matrices back at 32 bits when the tensor cores only use 19 bits of the matrix values; storing them at 19 bits would be sufficient. Some hardware-based compression/decompression could have enabled that (or maybe this mysterious compute data compression can do this?).
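To make that flow concrete, a rough NumPy sketch of one layer: reduced-precision operands, FP32 accumulation, FP32 bias add, a non-linearity, then an optional down-cast before the next layer. It only emulates the precision behaviour, not the actual tensor-core datapath, and the truncation is an approximation of the hardware rounding:

```python
import numpy as np

def to_tf32_like(x):
    """Keep sign + 8 exponent bits + 10 mantissa bits (19 bits total)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

def tf32_like_linear(a, b, bias):
    prod = to_tf32_like(a) @ to_tf32_like(b)  # multiply FP19-precision operands, accumulate in FP32
    out = prod + bias                         # bias add at full FP32 precision
    out = np.maximum(out, 0.0)                # non-linearity (ReLU) needs no extra precision
    return to_tf32_like(out)                  # optional down-cast before feeding the next layer

a = np.random.randn(128, 256).astype(np.float32)
b = np.random.randn(256, 64).astype(np.float32)
bias = np.zeros(64, dtype=np.float32)
y = tf32_like_linear(a, b, bias)
```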
 
Out of curiosity - are your HPC codes typically limited by ALU throughput, or is increasing memory (main/cache/shared) bandwidth per FLOP more important?
A good question.
HPC folks I know are quite positive about the changes in Ampere because their tasks are bandwidth-bound in many cases.
They were also quite impressed by the new L2 cache architecture with cache residency control; they said they already had pipelines that can greatly benefit from on-chip producer-consumer queues in L2.
Asynchronous barriers and copy instructions plus the new warp reduction ops were praised too.

I don't know why folks here think that NVIDIA hasn't done a hell of a lot of profiling for Ampere to ensure that it's great at traditional compute.
Also, all the new features are likely based on feedback from devs, which is why most of the devs I know are very positive about the changes in Ampere.
 
Frankly, I like this kind of waste that provides 10 times the performance :LOL:
More seriously, it's obviously for backward compatibility with FP32. The user doesn't have to change anything in their data to get an instant speed-up.
I like the no-waste of BF16, which does 20 times better (obviously also compared to the V100 without tensor cores).
And it's not just about speed: if you waste memory and your big model cannot fit in memory, you cannot even train it.
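A quick back-of-the-envelope on the capacity point, counting weights only (activations, gradients and optimizer state come on top and are usually much larger; the one-billion-parameter figure is just an illustration):

```python
# Weight storage for an illustrative one-billion-parameter model.
params = 1_000_000_000
bytes_per_value = {"FP32/TF32 storage": 4, "BF16": 2}

for fmt, nbytes in bytes_per_value.items():
    gib = params * nbytes / 2**30
    print(f"{fmt}: {gib:.1f} GiB of weights")
# Halving the storage format can decide whether a model fits on the GPU at all.
```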
 
I don't know why folks here think that NVIDIA hasn't done a hell of a lot of profiling for Ampere to ensure that it's great at traditional compute.
Also, all the new features are likely based on feedback from devs, which is why most of the devs I know are very positive about the changes in Ampere.
The performance benchmarks can't come soon enough! There should be some very revealing benchmark comparisons since many are currently running "mature" training and inference models.
 
I like the no-waste of BF16, which does 20 times better (obviously also compared to the V100 without tensor cores).
And it's not just about speed: if you waste memory and your big model cannot fit in memory, you cannot even train it.
Agree. Speed is one variable to consider. Accuracy is another one and BF16 doesn't provide enough precision for many networks...
 
Agree. Speed is one variable to consider. Accuracy is another one and BF16 doesn't provide enough precision for many networks...
I already asked this before, without reply:
please provide some reference papers showing that BF16 would not be sufficient.
 
Out of curiosity - are your HPC codes typically limited by ALU throughput, or is increasing memory (main/cache/shared) bandwidth per FLOP more important?

For us, it's increasing memory bandwidth.

Lots of HPC jobs are famously "bandwidth limited". For these, bandwidth per FLOP determines performance. So while it's nice to know our new A100 cards will offer "19.5 TFLOPS of FP64", for us the problem is how to feed such feisty cores. (1.6 TB/s of HBM2 does help!)
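To put a rough number on "bandwidth per FLOP" with the figures above (back-of-the-envelope only, not a benchmark):

```python
# Machine balance for the quoted A100 figures.
peak_fp64_flops = 19.5e12   # 19.5 TFLOPS of FP64 (the figure quoted above)
hbm2_bandwidth  = 1.6e12    # ~1.6 TB/s of HBM2

balance = peak_fp64_flops / hbm2_bandwidth
print(f"machine balance: {balance:.1f} FLOP per byte")
print(f"i.e. about {balance * 8:.0f} FP64 operations per double loaded from HBM2")
# Kernels with lower arithmetic intensity than this stay bandwidth bound.
```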
 
I already asked this before, without reply:
please provide some reference papers showing that BF16 would not be sufficient.

Maybe your experience is different from mine. I know quite a few people who work with DNNs. The gold standard is to implement training in fp32, then try to optimize to fp16 or even lower precision if possible. For some layers/networks this works out, for some it doesn't. A lot of tensor accelerators do multiplies in a lower-precision format while input, accumulation and output are in a higher precision. TF32 multiplies in lower precision (19 bit) but accumulation, input and output are fp32. It's a pretty good compromise between quality and performance. If TF32 can replace fp32 in training, that is a very big boost. Since TF32's input/output is fp32, it's a drop-in replacement for fp32. From the network developer/scientist perspective it just works without any code changes (albeit the precision can be worse than fp32).
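As a sketch of that workflow, this is roughly what the step from fp32 to mixed precision looks like in PyTorch with torch.cuda.amp (placeholder model and data; the tensor cores multiply in FP16 and accumulate in FP32):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model/data; the point is the precision recipe, not the network.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = GradScaler()   # rescales the loss so FP16 gradients don't underflow

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    with autocast():                  # eligible ops run in FP16 on tensor cores,
        loss = loss_fn(model(x), y)   # master weights and accumulation stay FP32
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```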

Inference can often use lower precision than training. There the holy grail is to get to int8 or even int4 if possible. fp32 inference can serve as a reference, but it's very likely that some lower-precision format is good enough.

It's typical to see different hardware solutions for training and inference, as the requirements in the two cases differ enough that creating special silicon gives an advantage (cost, power consumption).

If one knows the use case exactly, the end result can be something like Google's TPU or what Tesla uses in its cars. On the other hand, if one is building a generic product for the datacenter that a wide variety of customers want to use, then flexibility is required to cover a reasonable number of different use cases. The more flexible you go, the less chance you have to create a small and optimal solution; i.e. there is space for very specific accelerators, very generic processors (CPUs) and possibly something in between (GPUs?).
 
For us, it's increasing memory bandwidth.

Lots of HPC jobs are famously "bandwidth limited". For these, bandwidth per FLOP determines performance. So while it's nice to know our new A100 cards will offer "19.5 TFLOPS of FP64", for us the problem is how to feed such feisty cores. (1.6 TB/s of HBM2 does help!)
Interesting. Is that mainly main-memory bandwidth, or do things like the large and fast (7.2 TB/s) L2 cache for data reuse, as well as async copy taking pressure off the L1 and register file, help as well?
 
Maybe your experience is different from mine. I know quite a few people who work with DNNs. The gold standard is to implement training in fp32, then try to optimize to fp16 or even lower precision if possible. For some layers/networks this works out, for some it doesn't. A lot of tensor accelerators do multiplies in a lower-precision format while input, accumulation and output are in a higher precision. TF32 multiplies in lower precision (19 bit) but accumulation, input and output are fp32. It's a pretty good compromise between quality and performance. If TF32 can replace fp32 in training, that is a very big boost. Since TF32's input/output is fp32, it's a drop-in replacement for fp32. From the network developer/scientist perspective it just works without any code changes (albeit the precision can be worse than fp32).

Inference can often use lower precision than training. There the holy grail is to get to int8 or even int4 if possible. fp32 inference can serve as a reference, but it's very likely that some lower-precision format is good enough.

It's typical to see different hardware solutions for training and inference, as the requirements in the two cases differ enough that creating special silicon gives an advantage (cost, power consumption).

If one knows the use case exactly, the end result can be something like Google's TPU or what Tesla uses in its cars. On the other hand, if one is building a generic product for the datacenter that a wide variety of customers want to use, then flexibility is required to cover a reasonable number of different use cases. The more flexible you go, the less chance you have to create a small and optimal solution; i.e. there is space for very specific accelerators, very generic processors (CPUs) and possibly something in between (GPUs?).

This still doesn't answer the question with solid paper evidence showing that BF16 would not be sufficient for training.
It's quite strange how Nvidia promotes TF32 and doesn't talk about the benefits of BF16, and neither does your reply.
I'm reading quite a few AI training papers, which happen to be mostly Google papers.
There they don't talk about GPUs but about TPUs, which are BF16.
For example this quote: "All models are trained in Tensorflow [25] using the Lingvo [26] toolkit on 8x8 Tensor Processing Units (TPU) slices with a global batch size of 4,096."
If training works with BF16 for Google, and they design their TPUs around BF16, that gives them a huge edge over people who are told to stick with FP32/TF32.
 
If training works with BF16 for Google, and they design their TPUs around BF16, that gives them a huge edge over people who are told to stick with FP32/TF32.
With The Tensor Float32 format, Nvidia did something that looks obvious in hindsight: It took the exponent of FP32 at eight bits, so it has the same range as either FP32 or Bfloat16, and then it added 10 bits for the mantissa, which gives it the same precision as FP16 instead of less as Bfloat16 has.
https://forum.beyond3d.com/posts/2126178/

https://www.nextplatform.com/2020/05/14/nvidia-unifies-ai-compute-with-ampere-gpu/
 
This still doesn't answer the question with solid paper evidence showing that BF16 would not be sufficient for training.
It's quite strange how Nvidia promotes TF32 and doesn't talk about the benefits of BF16, and neither does your reply.
I'm reading quite a few AI training papers, which happen to be mostly Google papers.
There they don't talk about GPUs but about TPUs, which are BF16.
For example this quote: "All models are trained in Tensorflow [25] using the Lingvo [26] toolkit on 8x8 Tensor Processing Units (TPU) slices with a global batch size of 4,096."
If training works with BF16 for Google, and they design their TPUs around BF16, that gives them a huge edge over people who are told to stick with FP32/TF32.

I wasn't trying to prove anything. I just wrote what I have seen happen in real life. Based on the people I know and have worked with, it's typical to implement fp32 training first, then optimize to lower precision and compare against the fp32 model. Sometimes the lower-precision optimizations work out, sometimes not. Inference, on the other hand, is a very different animal.

A lot of DNN research/development is done by folks who are surprisingly computer illiterate. Those sciency folks just like to have high precision plus Python and make things work for their papers. It's a whole other talent to take that research and optimize the hell out of it when making something production-worthy.
 
So I googled BERT fp32 vs fp16. BERT is all the rage nowadays. A little surprisingly, I found a data point showing what I was trying to anecdotally share. Unfortunately the blog post doesn't compare the accuracy of the fp32- and fp16-trained models. It would be interesting to know whether the fp16-trained network matches the fp32-trained one in accuracy or whether there is some small loss.

The BERT github repository started with a FP32 single-precision model, which is a good starting point to converge networks to a specified accuracy level. Converting the model to use mixed precision with V100 Tensor Cores, which computes using FP16 precision and accumulates using FP32, delivered the first speedup of 2.3x. Tensor Core’s mixed precision brings developers the best of both worlds: the execution speed of a lower precision to achieve significant speedups, and with sufficient accuracy to train networks to the same accuracy as higher precisions. More details about Tensor Cores can be found in our Programming Tensor Cores blog.
The next optimization adds an optimized layer normalization operation called “layer norm” for short, which improves performance by building on the existing cuDNN Batch Normalization primitive, and netted an additional 9% speedup. Next, doubling batch size from its initial size of 8 to 16 increased throughput another 18%.
And finally, the team used TensorFlow's XLA, a deep learning compiler that optimizes TensorFlow computations. XLA was used to fuse pointwise operations and generate a new optimized kernel to replace multiple slower kernels. Some of the specific operations that saw speedups include a GELU activation function, the scale and shift operation in Layer Norm, the Adam weights update, attention softmax and attention dropout. A recent blog describes how to get the most out of XLA running on GPUs. This optimization brought an additional 34% performance speedup.


https://news.developer.nvidia.com/nvidia-achieves-4x-speedup-on-bert-neural-network/
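The per-step speedups quoted there compound to roughly the headline number (a quick sanity check, assuming the individual gains multiply independently):

```python
# Compounding the speedups quoted in the blog excerpt above.
mixed_precision = 2.3    # FP16 tensor cores with FP32 accumulation
layer_norm      = 1.09   # optimized layer norm
batch_size      = 1.18   # batch size 8 -> 16
xla_fusion      = 1.34   # TensorFlow XLA kernel fusion

total = mixed_precision * layer_norm * batch_size * xla_fusion
print(f"combined speedup: {total:.1f}x")   # ~4.0x, matching the blog title
```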
 