Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

Tags:
  1. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    Would this mean the ability to program the GPU directly with C++ instead of using CUDA?





    edit. Digging a bit more
     
  2. So from the specs it looks like the A100 is running at 1425MHz.
    That's lower than any V100 16FFN implementation I know of (1440-1455MHz), even though the A100 is on 7nm and can draw 100W more than the highest-consuming V100 (mezzanine, 300W).

    So with 400W maximum power and 19.5 FP32 TFLOPs, the A100 comes in at about 20.5 W/TFLOP. The 3-year-old V100, with 300W max power and 14.9 FP32 TFLOPs, comes in at about 20.1 W/TFLOP.

    Did FP32 power efficiency actually go down in the transition from 16FFN Volta to 7nm Ampere?
    Or does that 400W only apply when the GPU is running its tensor units fully in parallel with the FP32 ALUs (and is that even possible without hitting L2 cache / VRAM bottlenecks)?

    This is just the first data point of many, but it could be that nvidia's delay in adopting 7nm is due to the fact that they can't reach higher clocks on 7nm than they could on 16/12FFN.
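    The perf-per-watt comparison above is simple division; a minimal sketch using the figures quoted in this post (spec-sheet peak TDPs and FP32 throughputs):

```python
# Watts per FP32 TFLOP at peak, using the numbers quoted in the post.
def watts_per_tflop(tdp_w: float, fp32_tflops: float) -> float:
    return tdp_w / fp32_tflops

a100 = watts_per_tflop(400, 19.5)  # roughly 20.5 W/TFLOP
v100 = watts_per_tflop(300, 14.9)  # roughly 20.1 W/TFLOP
print(f"A100: {a100:.1f} W/TFLOP, V100: {v100:.1f} W/TFLOP")
```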


    @Nebuchadnezzar wrote that the A100 can go up to 400W, though he also assumed the GPUs in Robotaxi are the A100, which they don't seem to be if you compare their pictures.
    The dGPUs in Robotaxi seem to be substantially smaller, maybe around 600mm^2.
     
    BRiT likes this.
  3. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
    FP32 and DP compute performance is not relevant for the A100. With TensorCores (FP16), or FP16 alone, compute performance is 2.5x higher. That works out to 87.5% higher efficiency, with 1.6TB/s of bandwidth...
     
    DavidGraham and PSman1700 like this.
  4. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
    TensorCore throughput is what matters, not DP over CUDA cores.
     
    DavidGraham likes this.
  5. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    It's impossible to say what matters and what doesn't without knowing how many transistors are assigned to which h/w unit and how they contribute to the rated 400W power draw figure.

    If we compare transistor numbers directly, then there's an obvious transistors-per-watt gain with Ampere - but again, it's hard to say how uniform that gain is or how it will translate into graphics-oriented GPUs.
     
  6. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    The 1st iteration of NVLink alone accounts for about 50W of power, which is the difference between the V100 NVLink and V100S PCI-E power consumption figures. The 2nd iteration will probably need even more power than that.

    An obvious one, and a huge one at that.
    Tensor core count has been reduced in the A100: it now stands at 432 vs ~640 in V100, and they even run at lower clocks. We can infer from that that a significant amount of transistor budget went into the new tensor units in A100 to improve their per-clock throughput; they also now support significantly more formats, which likewise requires a significant transistor budget.
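    The "more work per tensor core" inference can be back-of-enveloped from the public peak numbers. A sketch - the peak TFLOPS figures and boost clocks below are spec-sheet assumptions (V100: ~125 FP16 tensor TFLOPS, 640 cores at ~1.53GHz; A100: 312 TFLOPS, 432 cores at ~1.41GHz):

```python
# FP16 FLOPs per tensor core per clock, inferred from peak throughput.
# Peak numbers and boost clocks are assumptions taken from spec sheets.
def flops_per_core_per_clock(peak_tflops: float, cores: int, clock_ghz: float) -> float:
    return peak_tflops * 1e12 / (cores * clock_ghz * 1e9)

v100 = flops_per_core_per_clock(125, 640, 1.53)  # roughly 128 FLOPs/clock/core
a100 = flops_per_core_per_clock(312, 432, 1.41)  # roughly 512 FLOPs/clock/core
print(f"V100: {v100:.0f}, A100: {a100:.0f}")
```

    So each A100 tensor core does roughly 4x the work per clock of a V100 tensor core, which is consistent with fewer cores at lower clocks still delivering much higher peak throughput.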
     
    Pete and A1xLLcqAgt0qc2RyMz0y like this.
  7. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    Doesn't the required precision for the task at hand determine how fast and what you can use in A100? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?
     
  8. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    Ampere GA100 TCs support FP64 now.
     
  9. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    Often, but not always, that math comes down to matrix multiplies, and that's exactly what tensor cores accelerate.
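    For illustration, the core pattern a tensor core instruction accelerates is a fused multiply-accumulate of small matrix tiles, D = A x B + C. A plain-Python sketch of that math (obviously not how you'd write it for a GPU):

```python
# A tiny dense matrix multiply-accumulate: the D = A*B + C pattern
# that tensor cores implement in hardware on small tiles.
def matmul_add(A, B, C):
    n, k, m = len(A), len(B), len(B[0])
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(matmul_add(A, B, C))  # [[19.0, 22.0], [43.0, 50.0]]
```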
     
  10. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    OK, I thought I'd seen that in the spec sheet.
     
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    The A100 supports two significant new Tensor Core data formats: TF32 and FP64. TF32 doesn't require code changes, but FP64 does.

    NVIDIA is instructing developers to migrate their code to the FP64 Tensor format to achieve a 2.5X increase in throughput.

    https://agenparl.eu/double-precision-tensor-cores-speed-high-performance-computing/
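    As a rough illustration of why TF32 is a drop-in replacement for FP32 code: it keeps FP32's 8-bit exponent (so dynamic range is unchanged) but carries only 10 mantissa bits. A sketch that emulates the precision loss by truncating the low mantissa bits of an FP32 encoding (an assumption for illustration - the real hardware rounds rather than truncates):

```python
import struct

def tf32_truncate(x: float) -> float:
    # FP32 has 23 mantissa bits; TF32 keeps the top 10, so zero the
    # low 13 bits of the IEEE-754 single-precision bit pattern.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & ~0x1FFF))[0]

print(tf32_truncate(1.0 + 2**-10))  # representable in 10 mantissa bits: survives
print(tf32_truncate(1.0 + 2**-11))  # precision lost: collapses to 1.0
```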
     
    Pete, nnunn, BRiT and 2 others like this.
  12. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
  13. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,511
    Likes Received:
    24,411
  14. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    manux and BRiT like this.
  15. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
  16. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    This snippet from the whitepaper is neat

     
    JoeJ, BRiT and pharma like this.
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Exactly. Unless we know what workloads result in 400W power consumption, the numbers don't tell us anything. If I had to guess, the peak power numbers probably correspond to max tensor throughput, given the massive amount of data movement required.
     
  18. I still think it's odd that this new 7nm GPU is clocking lower than nvidia's very first implementation of 16FF, but sure, it could be due to it being so massively wide.
    We'll wait and see how the consumer GPUs come out.
     
  19. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    We don't know about gaming yet, but am I the only one seeing MA-SSI-VE architecture changes that provide a HU-GE performance jump, and thus an efficiency jump, in the intended workloads?

    A 3 to 7 times real-world performance gain on BERT training/inference is above expectations
     
    Konan65 likes this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.