Nvidia Ampere Discussion [2020-05-14]

The sparse int just looks like it's for poorly pruned deployment neural nets to begin with. How else would you zero half the nodes with no effect on the output?

Well, I suppose an easy way to optimize is highly tempting for a lot of devs, and locking them into an Nvidia-only supported mode is good for Nvidia.
Looks like you are not happy about the new sparsity tech. And BTW it's not only for int; it also works with FP16, BF16 and TF32.
it's all here:
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
A100 introduces fine-grained structured sparsity
With the A100 GPU, NVIDIA introduces fine-grained structured sparsity, a novel approach that doubles compute throughput for deep neural networks.

Sparsity is possible in deep learning because the importance of individual weights evolves during the learning process, and by the end of network training, only a subset of weights have acquired a meaningful purpose in determining the learned output. The remaining weights are no longer needed.

Fine grained structured sparsity imposes a constraint on the allowed sparsity pattern, making it more efficient for hardware to do the necessary alignment of input operands. Because deep learning networks are able to adapt weights during the training process based on training feedback, NVIDIA engineers have found in general that the structure constraint does not impact the accuracy of the trained network for inferencing. This enables inferencing acceleration with sparsity.

For training acceleration, sparsity needs to be introduced early in the process to offer a performance benefit, and methodologies for training acceleration without accuracy loss are an active research area.

Sparse matrix definition
Structure is enforced through a new 2:4 sparse matrix definition that allows two non-zero values in every four-entry vector. A100 supports 2:4 structured sparsity on rows, as shown in Figure 9.

Due to the well-defined structure of the matrix, it can be compressed efficiently and reduce memory storage and bandwidth by almost 2x.
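
To make the 2:4 constraint concrete, here is a minimal NumPy sketch (an illustration only, not NVIDIA's implementation): in every group of four weights along a row, keep the two largest-magnitude entries, zero the other two, and record which positions were kept; those indices are the compression metadata stored alongside the non-zero values.

```python
import numpy as np

def prune_2_4(weights):
    """Impose 2:4 structured sparsity along each row: in every group of 4
    weights, keep the 2 largest-magnitude entries and zero the other 2.
    Returns the pruned matrix and the kept-index metadata."""
    rows, cols = weights.shape
    assert cols % 4 == 0, "row length must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4)

    # Indices of the two largest-magnitude weights in each group of four.
    keep_idx = np.argsort(np.abs(groups), axis=-1)[..., 2:]

    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep_idx, True, axis=-1)

    pruned = np.where(mask, groups, 0.0).reshape(rows, cols)
    return pruned, keep_idx

W = np.random.randn(4, 8).astype(np.float32)
W_sparse, meta = prune_2_4(W)
print(W_sparse)   # exactly two of every four consecutive entries are zero
```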

[Image: Fine-Grain-Structured-Sparsity.png]

Figure 9. A100 fine-grained structured sparsity prunes trained weights with a 2-out-of-4 non-zero pattern, followed by a simple and universal recipe for fine-tuning the non-zero weights. The weights are compressed for a 2x reduction in data footprint and bandwidth, and the A100 Sparse Tensor Core doubles math throughput by skipping the zeros.

NVIDIA has developed a simple and universal recipe for sparsifying deep neural networks for inference using this 2:4 structured sparsity pattern. The network is first trained using dense weights, then fine-grained structured pruning is applied, and finally the remaining non-zero weights are fine-tuned with additional training steps. This method results in virtually no loss in inferencing accuracy based on evaluation across dozens of networks spanning vision, object detection, segmentation, natural language modeling, and translation.
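
As a rough sketch of that three-step recipe (train dense, apply 2:4 pruning, fine-tune under the mask), here is what it could look like in PyTorch on a toy model; this is only an illustration of the published steps, not NVIDIA's actual pruning tooling.

```python
import torch
import torch.nn as nn

def make_2_4_mask(weight):
    """Binary mask keeping the 2 largest-magnitude weights in every
    group of 4 along the last dimension."""
    w = weight.reshape(-1, 4)
    idx = w.abs().argsort(dim=1, descending=True)[:, :2]
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(1, idx, True)
    return mask.reshape(weight.shape)

model = nn.Linear(16, 8)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 8)

def train(steps):
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()

# Step 1: train with dense weights.
train(100)

# Step 2: fine-grained structured pruning to the 2:4 pattern.
mask = make_2_4_mask(model.weight.data)
model.weight.data *= mask

# Step 3: fine-tune the remaining non-zero weights, re-applying the mask
# after each update so the pruned entries stay at zero.
for _ in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()
    model.weight.data *= mask
```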

The A100 Tensor Core GPU includes new Sparse Tensor Core instructions that skip the compute on entries with zero values, resulting in a doubling of the Tensor Core compute throughput. Figure 9 shows how the Tensor Core uses the compression metadata (the non-zero indices) to match the compressed weights with the appropriately selected activations for input to the Tensor Core dot-product computation.
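
The following NumPy sketch imitates that dataflow in software (an approximation of the idea, not the actual Sparse Tensor Core instruction): the compressed weights keep only the non-zero values, and the per-group metadata indices select the matching activations before the dot product, so only half the multiply-adds are performed.

```python
import numpy as np

def compress_2_4(row):
    """Split a 2:4-pruned row into its non-zero values and their
    within-group positions (the metadata the hardware stores)."""
    groups = row.reshape(-1, 4)
    idx = np.argsort(groups == 0, axis=1, kind="stable")[:, :2]  # non-zero positions
    vals = np.take_along_axis(groups, idx, axis=1)
    return vals, idx   # each half the size of the original row

def sparse_dot(vals, idx, activations):
    """Dot product touching only the activations the metadata selects,
    i.e. half the multiply-adds of the dense dot product."""
    act_groups = activations.reshape(-1, 4)
    selected = np.take_along_axis(act_groups, idx, axis=1)
    return float((vals * selected).sum())

w = np.array([0.0, 1.5, 0.0, -2.0, 0.5, 0.0, 3.0, 0.0], dtype=np.float32)  # 2:4-pruned
a = np.random.randn(8).astype(np.float32)

vals, idx = compress_2_4(w)
assert np.isclose(sparse_dot(vals, idx, a), w @ a)   # same result, half the work
```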
 
The sparse int just looks like it's for poorly pruned deployment neural nets to begin with. How else would you zero half the nodes with no effect on the output?
For an equal number of non-zero parameters a larger sparse (pruned) network is almost always more accurate than a smaller dense (unpruned) network. You need hardware support to run the sparse network (that supports the specific pruning patterns used), but if you do you get close to the accuracy of a larger network at the execution cost of the smaller network.
 

This newly expanded range starts at an NCAP 5-star ADAS system and runs all the way to a DRIVE AGX Pegasus robotaxi platform. The latter features two Orin SoCs and two NVIDIA Ampere GPUs to achieve an unprecedented 2,000 trillion operations per second, or TOPS — more than 6x the performance of the previous platform.
Doesn't look like the A100 in the picture. Or a very cut-down version with only 4 HBM stacks.
https://blogs.nvidia.com/blog/2020/05/14/drive-platform-nvidia-ampere-architecture/
 
It may not be a perfect scale representation, but surely there must be some effort made to indicate the area occupied by each SM component in the block diagram.

Nvidia has been using the same block diagrams since Kepler. They’ve never been to scale. E.g. the schedulers are likely larger than depicted and the fixed function geometry hardware isn’t represented at all.
 
NVIDIA A100 Ampere Resets the Entire AI Industry
May 14, 2020

Every AI company measures its performance against the Tesla V100. Today, that measuring stick changes, and takes a big leap with the NVIDIA A100. Note the cards are actually labeled as GA100 but we are using A100 to align with NVIDIA’s marketing materials.

[Image: NVIDIA-Tesla-A100-Specs.jpg]

You may ask yourself what the asterisks next to many of those numbers are. NVIDIA says those are the numbers with structural sparsity enabled. We are going to discuss more of that in a bit.
...
NVIDIA is inventing new math formats and adding Tensor Core acceleration to many of these. Part of the story of the NVIDIA A100’s evolution from the Tesla P100 and Tesla V100 is that it is designed to handle BFLOAT16, TF32, and other new computation formats. This is exceedingly important because it is how NVIDIA is getting claims of 10-20x the performance of previous generations. At the same time, raw FP64 (non-Tensor Core) performance, for example, has gone from 5.3 TFLOPS with the Tesla P100, to 7.5 TFLOPS for the SXM2 Tesla V100 (a bit more in the SXM3 versions), and up to 9.7 TFLOPS in the A100. While traditional FP64 double precision is increasing, the accelerators and new formats are on a different curve.
...
NVLink speeds have doubled to 600GB/s from 300GB/s. We figured this was the case recently in NVIDIA A100 HGX-2 Edition Shows Updated Specs. That observation seems to be confirmed along with the PCIe Gen4 observation.

The A100 now utilizes PCIe Gen4. That is actually a big deal. With Intel’s major delays of the Ice Lake Xeon platform that will include PCIe Gen4, NVIDIA was forced to move to the AMD EPYC 64-core PCIe Gen4 capable chips for its flagship DGX A100 solution. While Intel is decisively going after NVIDIA with its Xe HPC GPU and Habana Labs acquisition, AMD is a GPU competitor today. Still, NVIDIA had to move to the AMD solution to get PCIe Gen4. NVIDIA’s partners will also likely look to the AMD EPYC 7002 Series to get PCIe Gen4 capable CPUs paired to the latest NVIDIA GPUs.

NVIDIA wanted to stay x86 rather than go to POWER for Gen4 support. The other option would have been to utilize an Ampere Altra Q80-30 or similar as part of NVIDIA CUDA on Arm. It seems like NVIDIA does not have enough faith in Arm server CPUs to move its flagship DGX platform to Arm today. This may well happen in future generations so it does not need to design-in a competitor’s solution.

I was able to ask Jensen a question directly on the obvious question: supply. Starting today onward, a Tesla V100 for anything that can be accelerated with Tensor Cores on the A100 is a tough sell. As a result, the industry will want the A100. I asked how will NVIDIA prioritize which customers get the supply of new GPUs. Jensen said that the A100 is already in mass production. Cloud customers already have the A100 and that it will be in every major cloud. There are already customers who have the A100. Systems vendors can take the HGX A100 platform and deliver solutions around it. The DGX A100 is available for order today. That is a fairly typical data center launch today where some customers already are deploying before the launch. Still, our sense is that there will be lead times as organizations rush to get the new GPU for hungry AI workloads.

With the first round of GPUs, we are hearing that NVIDIA is focused on the 8x GPU HGX and 4x GPU boards to sell in its own and partner systems. NVIDIA is not just selling these initial A100’s as single PCIe GPUs. Instead, NVIDIA is selling them as pre-assembled GPU and PCB assemblies.

https://www.servethehome.com/nvidia-tesla-a100-ampere-resets-the-entire-ai-industry/


 
This is one impressive GPU.
Though clearly the design target was not reached, as 1/8 of the SMs (plus 4 more, i.e. 20 of the 128 on the full die) and 1/6 of the HBM2 (8 GB) and L2 (8 MB) were disabled due to low yield.
Later versions will likely enable all 128 SMs (~8K CUDA cores) and 48 GB of HBM2, though that will either increase the 400 Watt TDP or lower the clock further.

One 'deception' I noticed concerns the claim of 156 TF for FP32.
This was presented in the videos as equivalent to FP32 computation when not making use of tensor cores.
But it is not. The tensor cores work with the new floating point format 'deceptively' called TF32.
In fact this is a 19-bit floating point format and would have been better called FP19.
TF32 has an 8-bit exponent and a 10-bit mantissa, as shown below.
[Image: GTC_PPB_08.jpg]
 
Later versions will likely enable all 128 SMs (~8K CUDA cores) and 48 GB of HBM2, though that will either increase the 400 Watt TDP or lower the clock further.
I guess they will up the power consumption, as V100's power was also increased twice, first to 350w and a second time to 450w.

Or they may go the V100S route where they increased clocks and bandwidth while simultaneously slashing power from 300w to 250w.

This was presented in the videos as equivalent to FP32 computation when not making use of tensor cores.
The new format works without changing code, so I guess there is "some" merit to this comparison?
 
The new format works without changing code, so I guess there is "some" merit to this comparison?
Also, any AI acceleration from the new format will only be available on Ampere. Merely matching Volta is no longer a competitive option.
 
One of the clever bits in the Ampere architecture this time around is a new numerical format that is called Tensor Float32, which is a hybrid between single precision FP32 and half precision FP16 and that is distinct from the Bfloat16 format that Google has created for its Tensor Processor Unit (TPU) and that many CPU vendors are adding to their math units because of the advantages it offers in boosting AI throughput. Every floating point number starts with a sign for negative or positive and then has a certain number of bits that signify the exponent, which gives the format its dynamic range, and then another set of bits that are the significand or mantissa, which gives the format its precision.

The IEEE FP64 format is not shown, but it has an 11-bit exponent plus a 52-bit mantissa and it has a range of ~2.2e-308 to ~1.8e308. The IEEE FP32 single precision format has an 8-bit exponent plus a 23-bit mantissa and it has a smaller range of ~1e-38 to ~3e38. The half precision FP16 format has a 5-bit exponent and a 10-bit mantissa with a range of ~5.96e-8 to 65,504. Obviously that truncated range at the high end of FP16 means you have to be careful how you use it. Google’s Bfloat16 has an 8-bit exponent, so it has the same range as FP32, but it has a shorter 7-bit mantissa, so it has less precision than FP16.

With the Tensor Float32 format, Nvidia did something that looks obvious in hindsight: It took the exponent of FP32 at eight bits, so it has the same range as either FP32 or Bfloat16, and then it added 10 bits for the mantissa, which gives it the same precision as FP16 instead of less, as Bfloat16 has. The new Tensor Cores supporting this format can input data in FP32 format and accumulate in FP32 format, and they will speed up machine learning training without any change in coding, according to Kharya. Incidentally, the Ampere GPUs will support the Bfloat16 format as well as FP64, FP32, FP16, INT4, and INT8 – the latter two being popular for inference workloads, of course.
https://www.nextplatform.com/2020/05/14/nvidia-unifies-ai-compute-with-ampere-gpu/
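
The bit widths and ranges quoted in that passage follow directly from the exponent and mantissa sizes; a small sketch (approximate, ignoring subnormals and special values) reproduces them:

```python
# Largest normal value and precision step implied by each format's bit counts.
formats = {
    #        (exponent bits, mantissa bits)
    "FP32": (8, 23),
    "TF32": (8, 10),   # 19 bits total including the sign bit
    "BF16": (8, 7),
    "FP16": (5, 10),
}

for name, (e_bits, m_bits) in formats.items():
    bias = 2 ** (e_bits - 1) - 1
    max_normal = (2 - 2 ** -m_bits) * 2 ** bias   # largest finite value
    epsilon = 2 ** -m_bits                        # precision step at 1.0
    print(f"{name}: max ~ {max_normal:.3e}, epsilon = {epsilon:.2e}")

# FP16 tops out around 6.55e4, while FP32, TF32 and BF16 all reach ~3.4e38
# because they share the same 8-bit exponent; only the precision differs.
```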
 
This is one impressive GPU.
Though clearly the design target was not reached, as 1/8 of the SMs (plus 4 more, i.e. 20 of the 128 on the full die) and 1/6 of the HBM2 (8 GB) and L2 (8 MB) were disabled due to low yield.
Later versions will likely enable all 128 SMs (~8K CUDA cores) and 48 GB of HBM2, though that will either increase the 400 Watt TDP or lower the clock further.

One 'deception' I noticed concerns the claim of 156 TF for FP32.
This was presented in the videos as equivalent to FP32 computation when not making use of tensor cores.
But it is not. The tensor cores work with the new floating point format 'deceptively' called TF32.
In fact this is a 19-bit floating point format and would have been better called FP19.
TF32 has an 8-bit exponent and a 10-bit mantissa, as shown below.
[Image: GTC_PPB_08.jpg]

There’s no claim of 156 TF/s for FP32. That claim is for TF32, which is a mixed precision format for matrix multiplication and addition. The input and output operands are FP32, but the multiplier inputs are FP19, with their output accumulated at FP32. The app doesn’t need to change its code and you get 8x (or 16x with sparsity) higher peak flop/s than full FP32 math.
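
A rough NumPy emulation of that pipeline (for illustration; not bit-exact Tensor Core behavior) is to round the FP32 inputs to TF32's 10-bit mantissa before the multiply and then accumulate in FP32. On random data it shows TF32 giving up a few bits of accuracy relative to a plain FP32 matmul, which is the trade for the higher peak throughput.

```python
import numpy as np

def round_to_tf32(x):
    """Approximate TF32 rounding: keep FP32's 8-bit exponent but round the
    23-bit mantissa down to 10 bits (ignores NaN/inf edge cases)."""
    bits = x.astype(np.float32).view(np.uint32)
    bits = (bits + (1 << 12)) & np.uint32(0xFFFFE000)  # round, then clear low 13 bits
    return bits.view(np.float32)

def tf32_matmul(a, b):
    """Emulate the TF32 Tensor Core path: TF32 multiplier inputs, FP32 accumulate."""
    return round_to_tf32(a) @ round_to_tf32(b)

a = np.random.randn(64, 64).astype(np.float32)
b = np.random.randn(64, 64).astype(np.float32)

ref = a.astype(np.float64) @ b.astype(np.float64)     # FP64 reference
err_fp32 = np.abs(a @ b - ref).max()
err_tf32 = np.abs(tf32_matmul(a, b) - ref).max()
print(f"max abs error vs FP64: FP32 {err_fp32:.2e}, TF32 {err_tf32:.2e}")
```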
 
There’s no claim of 156 TF/s for FP32. That claim is for TF32, which is a mixed precision format for matrix multiplication and addition. The input and output operands are FP32, but the multiplier inputs are FP19, with their output accumulated at FP32. The app doesn’t need to change its code and you get 8x (or 16x with sparsity) higher peak flop/s than full FP32 math.

If the app doesn't need to change, how does the app then distinguish between training with FP32 and FP19 (aka TF32)?
It also begs the question of whether there is much value in FP19, as dropping 3 more bits from the mantissa gives you BFloat16, which makes training 2x faster.
BTW, note that you cannot use the sparsity feature for training; it is for inferencing.
 
Regarding the power efficiency of Ampere: A100 has 54 billion transistors, Titan RTX has 18 billion, and V100 has 21 billion, both of the latter at a TDP of 280~300 W. So roughly speaking, A100 has 2.5x to 3x the transistor count while simultaneously increasing power to 400 W (a ~40% increase).

I know this math is extremely rough around the edges, but it can give us some sort of indication of how much progress NVIDIA has achieved on 7nm. The claim that Ampere is 50% faster than Turing at half the power is not that far-fetched, at least?
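
Putting that back-of-the-envelope math in one place (same rough numbers as above, and assuming performance scales with transistor count, which is a big assumption):

```python
# Very rough perf/watt comparison from public transistor counts and TDPs.
a100_transistors, a100_tdp = 54e9, 400        # A100, 7nm
v100_transistors, v100_tdp = 21e9, 300        # V100, 12nm
titan_transistors, titan_tdp = 18e9, 280      # Titan RTX (Turing), 12nm

print(a100_transistors / v100_transistors)    # ~2.6x the transistors of V100
print(a100_transistors / titan_transistors)   # 3.0x the transistors of Titan RTX
print(a100_tdp / titan_tdp - 1)               # ~43% more power than Titan RTX

# Transistors per watt roughly double versus Turing, which is at least in the
# same ballpark as the "50% faster at half the power" claim.
```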
 
If the app doesn't need to change, how does the app then distinguish between training with FP32 and FP19 (aka TF32)?

The app doesn't. TF32 will be the standard mode when developers use certain libraries from NVIDIA, and in the future from other companies.
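
For what it's worth, this is how that library-level default later surfaced in PyTorch (the exact flag names depend on the framework version, so treat this as a sketch); the model code itself stays plain FP32:

```python
import torch

# TF32 is a library-level mode, not something the model opts into.
torch.backends.cuda.matmul.allow_tf32 = True   # matmuls may use TF32 Tensor Cores
torch.backends.cudnn.allow_tf32 = True         # likewise for cuDNN convolutions

# Unchanged user code: tensors are still declared and stored as FP32.
x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")
y = x @ w   # runs through TF32 Tensor Cores on Ampere when the flags are on
```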
 
BTW, note that you cannot use the sparsity feature for training; it is for inferencing.
Sparsity can be used for both training and inferencing, though it currently has more benefit when used for inferencing.
While the sparsity feature more readily benefits AI inference, it can also be used to improve the performance of model training.
https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/

https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
https://blogs.nvidia.com/blog/2020/05/14/sparsity-ai-inference/
 
Regarding the power efficiency of Ampere: A100 has 54 billion transistors, Titan RTX has 18 billion, and V100 has 21 billion, both of the latter at a TDP of 280~300 W. So roughly speaking, A100 has 2.5x to 3x the transistor count while simultaneously increasing power to 400 W (a ~40% increase).

I know this math is extremely rough around the edges, but it can give us some sort of indication of how much progress NVIDIA has achieved on 7nm. The claim that Ampere is 50% faster than Turing at half the power is not that far-fetched, at least?
Any comment on this guys?
 
Likely will know more once independent testing/reviews are done. Right now we only have Nvidia's numbers.
 