Nvidia Ampere Discussion [2020-05-14]

So from the specs it looks like the A100 is running at 1425MHz.
That's lower than any V100 12FFN implementation I know of (1440-1455MHz), even though the A100 is on 7nm and can draw up to 100W more than the highest-consuming V100 (the 300W mezzanine version).

So with 400W maximum power and 19.5 FP32 TFLOPS, the A100 comes out at roughly 20.5 W/TFLOP (400 ÷ 19.5). The 3-year-old V100, with 300W max power and 14.9 FP32 TFLOPS, comes out at roughly 20.1 W/TFLOP (300 ÷ 14.9).

Did FP32 power efficiency actually go down in the transition from 12FFN Volta to 7nm Ampere?
Or does the 400W figure only apply when the GPU is running its tensor units fully in parallel with the FP32 ALUs (and is that even possible without hitting L2 cache / VRAM bandwidth bottlenecks)?

This is just the first data point of many, but it could be that Nvidia's delay in adopting 7nm comes down to not being able to reach higher clocks on 7nm than they could on 16/12FFN.


Not quite: each Orin can go at least up to 45W (the L2+ spec), but @Ryan Smith suggests they could go as high as 65-70W each. Then there's whatever that daughterboard along the upper edge is, too. The whole platform is supposed to be 800W.
@Nebuchadnezzar wrote that the A100 can go up to 400W, though he also assumed the GPUs in Robotaxi are the A100, which they don't seem to be if you compare their pictures.
The dGPUs in Robotaxi seem to be substantially smaller, maybe around 600mm^2.
 
It's impossible to say what matters and what doesn't without knowing how many transistors are assigned to which h/w unit and how they contribute to the rated 400W power draw figure.

If we compare transistor numbers directly then there's an obvious transistors/watt gain with Ampere - but again it's hard to say how uniform that gain is or how it will translate into graphics-oriented GPUs.
 
The NVLink implementation in V100 alone accounts for about 50W of power, which is the difference between the V100 NVLink and V100S PCIe power consumption figures. The newer iteration in A100 will probably need more power than that.

If we compare transistor numbers directly then there's an obvious transistors/watt gain with Ampere
An obvious one, and a huge one at that.
without knowing how many transistors are assigned to which h/w unit
Tensor core count has actually been reduced in the A100: it now stands at 432 vs. 640 in the V100, and they even run at lower clocks. From that we can infer that a significant chunk of the transistor budget went into the new tensor units to improve their per-core throughput (IPC); they also support significantly more formats now, which costs additional transistors as well.
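For a sense of scale, here's my own back-of-the-envelope arithmetic using Nvidia's rated boost clocks and tensor TFLOPS figures rather than the clocks discussed above:

V100: 125 FP16 tensor TFLOPS / (640 TCs x 1.53 GHz x 2 ops per FMA) ≈ 64 FMA per TC per clock
A100: 312 FP16 tensor TFLOPS / (432 TCs x 1.41 GHz x 2 ops per FMA) ≈ 256 FMA per TC per clock

So each third-generation tensor core does roughly 4x the work per clock of a Volta one, which is presumably where a lot of that transistor budget went.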
 
Doesn't the required precision for the task at hand determine what you can use in the A100, and how fast? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?
 
Doesn't the required precision for the task at hand determine what you can use in the A100, and how fast? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?
Ampere GA100 TCs support FP64 now.
 
Doesn't the required precision for the task at hand determine what you can use in the A100, and how fast? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?

Often, though not always, that math comes down to matrix multiplies, and that's exactly what tensor cores accelerate.
 
Doesn't the required precision for the task at hand determine what you can use in the A100, and how fast? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?
The A100 supports two significant new data formats in its tensor cores: TF32 and FP64. TF32 doesn't require code changes, but FP64 does.

NVIDIA is instructing developers to migrate their code to the FP64 tensor format to achieve a 2.5X increase in throughput.

"With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.

Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips with less work, saving not only time and power but precious memory and I/O bandwidth as well.

We refer to this new capability as Double-Precision Tensor Cores."

https://agenparl.eu/double-precision-tensor-cores-speed-high-performance-computing/
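To make the difference concrete, here is a minimal sketch of what the two paths look like through cuBLAS with CUDA 11 on an A100 (sm_80). This is my own illustration, not code from the article: the matrix size, the zero-filled buffers and the missing error checks are placeholders. At the cuBLAS level, TF32 is an opt-in math mode on an otherwise unchanged FP32 GEMM (frameworks can flip it on for you, which is where the "no code change" framing comes from), while an ordinary DGEMM is what the CUDA-X libraries route through the FP64 tensor cores (DMMA) on GA100.

// Build with something like: nvcc -arch=sm_80 gemm_sketch.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1024;                                    // placeholder matrix size
    const size_t fbytes = (size_t)n * n * sizeof(float);
    const size_t dbytes = (size_t)n * n * sizeof(double);

    float  *fA, *fB, *fC;
    double *dA, *dB, *dC;
    cudaMalloc(&fA, fbytes); cudaMalloc(&fB, fbytes); cudaMalloc(&fC, fbytes);
    cudaMalloc(&dA, dbytes); cudaMalloc(&dB, dbytes); cudaMalloc(&dC, dbytes);
    // Contents left zeroed: this only shows the API surface, not a real workload.
    cudaMemset(fA, 0, fbytes); cudaMemset(fB, 0, fbytes); cudaMemset(fC, 0, fbytes);
    cudaMemset(dA, 0, dbytes); cudaMemset(dB, 0, dbytes); cudaMemset(dC, 0, dbytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float  f1 = 1.0f, f0 = 0.0f;
    const double d1 = 1.0,  d0 = 0.0;

    // TF32 path: same FP32 GEMM call, just opted in to the TF32 tensor-core math mode.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &f1, fA, n, fB, n, &f0, fC, n);

    // FP64 path: a plain double-precision GEMM; per the article, the CUDA 11 libraries
    // run this through the double-precision tensor cores (DMMA) on GA100.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &d1, dA, n, dB, n, &d0, dC, n);

    cudaDeviceSynchronize();
    printf("GEMMs issued\n");

    cublasDestroy(handle);
    cudaFree(fA); cudaFree(fB); cudaFree(fC);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

The point of the sketch: the FP64 speedup comes from the library/tensor-core path underneath, so code that already goes through cuBLAS/CUDA-X gets it more or less for free, while hand-written FP64 kernels are the ones that need migrating.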
 

This snippet from the whitepaper is neat:

Asynchronous Barrier
The A100 GPU provides hardware-accelerated barriers in shared memory. These barriers are available using CUDA 11 in the form of ISO C++-conforming barrier objects. Asynchronous barriers split apart the barrier arrive and wait operations, and can be used to overlap asynchronous copies from global memory into shared memory with computations in the SM. They can be used to implement producer-consumer models using CUDA threads. Barriers also provide mechanisms to synchronize CUDA threads at different granularities, not just warp or block level.
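For concreteness, the split arrive/wait pattern the whitepaper describes looks roughly like this with CUDA 11's cuda::barrier and cuda::memcpy_async. This is my own minimal sketch, not whitepaper code: the kernel name, the scale-by-two work, and the assumption that the input length is a multiple of the block size are all made up for illustration.

// Build with something like: nvcc -std=c++14 -arch=sm_80 barrier_sketch.cu
#include <cuda/barrier>
#include <cooperative_groups.h>
#include <utility>

namespace cg = cooperative_groups;

__global__ void scale_kernel(const float* in, float* out, int n) {
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;

    extern __shared__ float tile[];      // dynamic shared memory: one float per thread
    __shared__ barrier_t bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());        // expected arrival count = threads in the block
    }
    block.sync();

    int base = blockIdx.x * block.size();

    // Asynchronous global->shared copy whose completion is tracked by the barrier,
    // instead of staging through registers followed by a blocking __syncthreads().
    cuda::memcpy_async(block, tile, in + base, sizeof(float) * block.size(), bar);

    // Split arrive/wait: arriving does not block this thread...
    barrier_t::arrival_token token = bar.arrive();

    // ...so independent work that doesn't touch `tile` could overlap the copy here.

    bar.wait(std::move(token));          // past this point the copy and all arrivals are done

    int i = base + block.thread_rank();
    if (i < n) out[i] = 2.0f * tile[block.thread_rank()];
}

// Launch sketch: scale_kernel<<<n / 256, 256, 256 * sizeof(float)>>>(d_in, d_out, n);

The same barrier object is what lets a producer stage in a block hand data to a consumer stage without a full __syncthreads() between them, which is the producer-consumer use the whitepaper mentions.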
 
It's impossible to say what matters and what doesn't without knowing how many transistors are assigned to which h/w unit and how they contribute to the rated 400W power draw figure.

If we compare transistor numbers directly then there's an obvious transistors/watt gain with Ampere - but again it's hard to say how uniform that gain is or how it will translate into graphics-oriented GPUs.

Exactly. Unless we know what workloads result in 400W power consumption, the numbers don't tell us anything. If I had to guess, the peak power numbers probably correspond to max tensor throughput, given the massive amount of data movement required.
 
I still think it's odd that this new 7nm GPU is clocking lower than Nvidia's very first 16FF implementation, but sure, it could be down to the chip being so massively wide.
We'll wait and see how the consumer GPUs come out.
 
We don't know about gaming yet, but am I the only one seeing MA-SSI-VE architecture changes that provide a HU-GE performance jump, and thus an efficiency jump, in the intended workloads?

[attached screenshots]
A 3 to 7 times real-world performance gain on BERT training/inference is above expectations.
 