Nvidia Ampere Discussion [2020-05-14]

Exactly where do they claim a 20x FP32 performance increase when comparing Ampere and Volta?
In the big slide from press deck just above your post?
Other reviews had no problems interpreting the graphs along with what Nvidia stated.
I'm fully aware of how they came up with the numbers. Like I said, I should have been more specific and said "in their slides", which would at minimum need some fine print saying "check the details here, since this is actually BS if you read it at face value".

That's a slide from the press deck. And I'm reading "FP32 training". TF32 is Nvidia's default precision for training on the A100.
But TF32 isn't FP32, that was the point.
The other comparisons are a little iffy too; to my understanding, that FP64 number holds true only for matrix multiplications (i.e., something the Tensor Cores run fast), and the same goes for INT8 (which additionally assumes you can take advantage of sparsity support).
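To make the distinction concrete, here is a minimal Python sketch (my own illustration, not from the slides) of what TF32 does to an FP32 value: TF32 keeps FP32's sign bit and 8-bit exponent but only 10 of FP32's 23 mantissa bits. The sketch approximates the conversion by truncating the low mantissa bits, whereas real hardware rounds to nearest.

```python
import struct

def fp32_to_tf32(x: float) -> float:
    """Approximate TF32 rounding of an FP32 value by truncating the
    mantissa from FP32's 23 bits down to TF32's 10 bits. Sign and the
    8-bit exponent are unchanged, so the dynamic range is preserved."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float -> raw bits
    bits &= ~((1 << 13) - 1)  # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.2345678
print(f"FP32: {x!r}")                 # full single precision
print(f"TF32: {fp32_to_tf32(x)!r}")   # 1.234375 -- ~3 decimal digits survive
```

So TF32 keeps FP32's range but only about three decimal digits of precision, which is exactly why labeling TF32 throughput as "FP32" is the sticking point here.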
 

Which is why they only list FP64 as 10 teraflops, and not a huge number. Nvidia gonna Nvidia, whatever.
 
Except in that slide I posted, where they list it as 19.5 TFLOPS, which is tensor cores doing tensor stuff at FP64 precision.

edit:
The slide says
FP32 312 TFLOPS
INT8 1248 TOPS
FP64 19.5 TFLOPS

When it really should say
FP32 19.5 TFLOPS
INT8 624 Tensor TOPS (1248 with Sparsity support)
FP64 9.7 TFLOPS
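
As a sanity check, the corrected non-tensor peaks fall straight out of the published specs. A quick back-of-the-envelope in Python, assuming the shipping A100's 108 active SMs, its 1.41 GHz boost clock, and an FMA counted as 2 operations:

```python
# A100 non-tensor peak throughput from published SM counts and clocks.
SMS = 108          # active SMs on the shipping A100
CLOCK_GHZ = 1.41   # boost clock

fp32_peak = SMS * 64 * 2 * CLOCK_GHZ / 1000  # 64 FP32 cores/SM, FMA = 2 ops
fp64_peak = SMS * 32 * 2 * CLOCK_GHZ / 1000  # 32 FP64 cores/SM

print(f"FP32 peak: {fp32_peak:.1f} TFLOPS")  # ~19.5
print(f"FP64 peak: {fp64_peak:.1f} TFLOPS")  # ~9.7
```

The Tensor Core FP64 path doubles that 9.7 to 19.5 TFLOPS, and sparsity support doubles the 624 INT8 Tensor TOPS to 1248, which is how the slide arrives at its headline figures.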
 
OTOH, that column is headed by "peak", and traditional FP32/64 numbers are given for FMAs only as well, so you're just used to those, although they tell you just another peak value. MUL, ADD, or (especially, beware) DIV, SQRT and POW are only loosely connected to this. I'd rather question the identical peaks with a 150-watt smaller power envelope compared to the SXM4 model; the peaks could be massively shorter in duration or, depending on instruction mix, a purely theoretical number.
 
For FP32, that's on top, sure. But Kaotik seems to dislike more than that.
Omitting that in most use cases you can neither achieve the conditions for peak performance due to functional constraints, nor would sustained operation fit within the TDP budget; on top of that, wildly re-interpreting terms to include operations-not-performed in the stated peak numbers, and deliberately mislabeling data types.

It's not that bad compared to the one time Nvidia had the nerve to label bitwise operations on 32-bit operands followed by popcnt as "33 1-bit FLOPs". But the numbers are useless regardless.
 
THE NEW GENERAL AND NEW PURPOSE IN COMPUTING
June 29, 2020
We already went through the salient characteristics and performance using various data formats and processing units of Nvidia GPU accelerators from Kepler through Ampere in the architectural deep dive we did at the end of May. Now, we are going to look at the cost of Ampere A100 accelerators relative to prior generations at their current street prices. Ultimately, this is always about money as much as it is about architecture. All of the clever architecture in the world doesn’t amount to much if you can’t afford to buy it.

In the table below, we show the price, bang for the buck, and bang for the watt of the past four-ish generations of GPU accelerators (counting Turing as its own generation makes four in addition to the Pascal, Volta, and Ampere GPU accelerators). There are a lot of different performance metrics we can look at, and that is particularly true when considering sparse matrix support for machine learning inference. In this comparison, we assume the best case scenario using the various data formats and compute units as well as sparse matrix calculations when calculating performance per watt and performance per dollar. We also bring it all together in the mother of all ratios, dollars per performance per watt, which is the be-all, end-all calculation that drives architectural choices at the hyperscalers and cloud builders.
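To make those ratios concrete, here is a hypothetical sketch in Python; the function and the example inputs are placeholders for illustration, not the article's actual street prices or a definitive reading of its methodology:

```python
def value_metrics(peak_tflops: float, watts: float, price_usd: float) -> dict:
    """Bang for the watt, bang for the buck, and the 'mother of all ratios'
    for a single accelerator. All inputs are hypothetical placeholders."""
    perf_per_watt = peak_tflops / watts        # TFLOPS per watt
    perf_per_dollar = peak_tflops / price_usd  # TFLOPS per dollar
    return {
        "TFLOPS/W": perf_per_watt,
        "TFLOPS/$": perf_per_dollar,
        # one plausible reading of "dollars per performance per watt":
        "$ per (TFLOPS/W)": price_usd / perf_per_watt,
    }

# Example with made-up numbers: a 400 W accelerator with 312 TFLOPS peak
# at a $10,000 street price.
print(value_metrics(peak_tflops=312, watts=400, price_usd=10_000))
```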


[Image: nvidia-kepler-to-a100-little-table.jpg]


That’s just a subset of the data about server-class GPU accelerators from Nvidia that we have compiled over time. We have this big table that goes all the way back to the Kepler GPU accelerators, which you can view in a separate window here because it doesn’t fit in our column width by a long shot. This is the companion set for the two tables we did in the architectural dep dive, but as we said, now we are making performance fractions with watts and bucks.
...
As part of the Ampere rollout, Nvidia released some HPC and AI benchmarks to give customers a sense of how real-world applications perform on the A100 accelerators and how they compare to the previous Turing and Volta generations, as appropriate.

These are initial tests, and we have a strong suspicion that over time as the software engineers get their hands on Ampere accelerators and update their software, the performance will get even better. There is precedent for this, after all.

[Image: nvidia-ampere-hpc-ai-benchmarks-p100-v100-a100.jpg]


Our best guess is that the HPC applications in the chart above are not running on the Tensor Cores, so there is headroom for the software to catch up with the hardware and double the performance for these applications. So maybe when everything is all tuned up, it will be somewhere between 15X and 20X improvement on the same A100 hardware this time next year.

The important thing to note in that chart is how performance improved on the V100s with no changes in the hardware but lots of changes in the software, both in 2018 and 2019. There is no reason to believe that the features in the Ampere GPU will not be exploited to the hilt – eventually – by HPC software as much as by AI software. The stakes are just as high, and the software engineers are just as smart.
https://www.nextplatform.com/2020/06/29/the-new-general-and-new-purpose-in-computing/


Edit: The article is a good read and has a bit more information.
 
Some speculation about the top consumer Ampere GPU.
The A100 went from the V100's 6 GPCs with 14 SMs per GPC to 8 GPCs with 16 SMs per GPC,
or from 84 SMs / 5,376 FP32 cores to 128 SMs / 8,192 FP32 cores.
Logically the GA102 will also have 8 GPCs; keeping 12 SMs per GPC, that would take
the TU102's 6 GPCs with 12 SMs per GPC to a GA102 with 8 GPCs and 12 SMs per GPC,
or from 72 SMs / 4,608 FP32 cores to 96 SMs / 6,144 FP32 cores.
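
For what it's worth, those core counts follow directly from the GPC x SM arithmetic; a quick check in Python, assuming 64 FP32 cores per SM as on Volta, Turing, and the A100:

```python
# FP32 core counts implied by the GPC x SM configurations above.
def fp32_cores(gpcs: int, sms_per_gpc: int, cores_per_sm: int = 64) -> int:
    return gpcs * sms_per_gpc * cores_per_sm

print(fp32_cores(6, 14))  # GV100:  84 SMs -> 5376 cores
print(fp32_cores(8, 16))  # GA100: 128 SMs -> 8192 cores
print(fp32_cores(6, 12))  # TU102:  72 SMs -> 4608 cores
print(fp32_cores(8, 12))  # speculated GA102: 96 SMs -> 6144 cores
```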
 
I wouldn't draw conclusions that the Volta > Ampere jump relates in any way to the gaming side's Turing > Ampere jump, especially when we take into account that the A100 and the gaming Amperes will be quite a bit different, with the former lacking RT acceleration and most likely dedicating more space to Tensor Cores.
 

You are quite wrong about that: the V100 was a good predictor for the TU102, with the latter gaming GPU dedicating even more space per SM to tensors. Likewise, the A100 will serve as a basis for the GA102.
I would hate to see those big Tensor Cores carried over from the A100 to the GA102, but given the past it would not be a complete surprise.
 
Volta and Turing weren't the same architecture; the two Amperes should be, even if the implementations differ in a similar way. You can't just pick a few nice indicators, ignore the rest of the differences, and call it a day.
 

I was speculating about the number of GPCs and SMs in a plausible way.
Besides that, there will be obvious differences, as there always have been: no double precision, far fewer NVLinks, a reduced ECC/memory system...
If 96 SMs for a GA102 is unreasonable to you, argue that.
 
I think it all depends on whether or not gaming Ampere will also get the beefy Tensor Cores. Since the chips' specialties in training and inference seem to diverge a bit, I can see Nvidia not offering the training-optimized µarch throughout the product line, instead leaving more space for simpler cores with high throughput only in traditional applications and inferencing.
 
Did Jen-Hsun's statement (which I don't recall, hence my question) specify whether he meant the absolute number of produced chips (in millions) or the number of models (GA100, GA102, GA104, etc.)? Samsung could be making the GA102/104 while TSMC churns out the GA106/107/108.
 