In the big slide from the press deck just above your post?

Exactly where do they claim a 20x FP32 performance increase when comparing between Ampere and Volta?
I'm fully aware of how they came up with the numbers. Like I said, I should have been more specific and said "in their slides", which at minimum would need some fine print saying "check the details here, since this is actually BS if you just read it as it is".

Other reviews had no problems interpreting the graphs along with what Nvidia stated.
But TF32 isn't FP32, that was the point.

That's a slide from the press deck. And I'm reading "FP32 training". TF32 is Nvidia's default precision for training on the A100.
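For context on that point: TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits, and it only applies to matrix-multiply work routed through the Tensor Cores. Here is a minimal sketch of how a framework exposes that switch, using PyTorch's TF32 flags; the flag names are PyTorch's, but whether TF32 is on by default has varied by framework and version, so treat the defaults as an assumption rather than a given.

```python
import torch

# TF32 only affects matmul/convolution work dispatched to Ampere Tensor Cores;
# everything else still runs as ordinary IEEE FP32. Defaults have varied by
# framework and version, so check rather than assume.
torch.backends.cuda.matmul.allow_tf32 = True   # allow TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # allow TF32 for cuDNN convolutions

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c_tf32 = a @ b   # on an A100 this may run as TF32 even though the tensors are "FP32"

torch.backends.cuda.matmul.allow_tf32 = False  # force strict FP32 for comparison
c_fp32 = a @ b
print((c_tf32 - c_fp32).abs().max())
```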
In the big slide from the press deck just above your post?
I'm fully aware of how they came up with the numbers. Like I said, I should have been more specific and said "in their slides", which at minimum would need some fine print saying "check the details here, since this is actually BS if you just read it as it is".
But TF32 isn't FP32, that was the point.
The other comparisons are a little iffy too; at least to my understanding, the FP64 number holds true only for matrix multiplications (i.e., the kind of work the Tensor Cores run fast), and the same goes for INT8 (which in addition assumes you can take advantage of the sparsity support).
Except in that slide I posted, where they list it as 19.5 TFLOPS, which is the Tensor Cores doing tensor stuff at FP64 precision.

Which is why they only list FP64 as 10 teraflops, and not a huge number. Nvidia gonna Nvidia, whatever.
Omitting that in most use cases you can neither achieve the conditions for peak performance (due to functional constraints) nor fit sustained operation within the TDP budget, and on top of that wildly re-interpreting terms so that operations-not-performed count toward the stated peak numbers, and deliberately mislabeling data types.

For FP32, that's on top, sure. But Kaotik seems to dislike more than that.
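To make the dispute concrete, here is a rough sketch of how the headline multipliers fall out of the commonly quoted peak figures, as I understand them; the point is that the big ratios bake in Tensor Core formats and 2:4 structured sparsity, which is exactly what the complaint is about.

```python
# Commonly quoted peak rates (TFLOPS / TOPS). The headline comparisons pit
# A100 Tensor Core formats (often with sparsity) against plain V100 rates.
v100 = {"fp32": 15.7, "fp64": 7.8, "int8_dp4a": 62.0}
a100 = {
    "fp32": 19.5,            # plain CUDA-core FP32
    "tf32_dense": 156.0,     # Tensor Core TF32
    "tf32_sparse": 312.0,    # TF32 with 2:4 structured sparsity
    "fp64": 9.7,             # plain CUDA-core FP64
    "fp64_tensor": 19.5,     # Tensor Core FP64 (matrix math only)
    "int8_sparse": 1248.0,   # Tensor Core INT8 with sparsity
}

print(a100["tf32_sparse"] / v100["fp32"])       # ~19.9x -> the "20x" style FP32-training claim
print(a100["fp32"] / v100["fp32"])              # ~1.24x -> plain FP32, no Tensor Cores
print(a100["fp64_tensor"] / v100["fp64"])       # ~2.5x  -> FP64, but only for matmuls
print(a100["int8_sparse"] / v100["int8_dp4a"])  # ~20x   -> INT8, assuming sparsity
```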
We already went through the salient characteristics and performance using various data formats and processing units of Nvidia GPU accelerators from Kepler through Ampere in the architectural deep dive we did at the end of May. Now, we are going to look at the cost of Ampere A100 accelerators relative to prior generations at their current street prices. Ultimately, this is always about money as much as it is about architecture. All of the clever architecture in the world doesn’t amount to much if you can’t afford to buy it.
In the table below, we show the price, bang for the buck, and bang for the watt of the past four-ish generations of GPU accelerators (counting Turing as its own generation makes four in addition to the Pascal, Volta, and Ampere GPU accelerators). There are a lot of different performance metrics we can look at, and that is particularly true when considering sparse matrix support for machine learning inference. In this comparison, we assume the best-case scenario using the various data formats and compute units as well as sparse matrix calculations when calculating performance per watt and performance per dollar. We also bring it all together in the mother of all ratios, dollars per performance per watt, which is the be-all, end-all calculation that drives architectural choices at the hyperscalers and cloud builders.
That’s just a subset of the data about server-class GPU accelerators from Nvidia that we have compiled over time. We have this big table that goes all the way back to the Kepler GPU accelerators, which you can view in a separate window here because it doesn’t fit in our column width by a long shot. This is the companion set for the two tables we did in the architectural dep dive, but as we said, now we are making performance fractions with watts and bucks.
...
As part of the Ampere rollout, Nvidia released some HPC and AI benchmarks to give customers a sense of how real-world applications perform on the A100 accelerators and how they compare to the previous Turing and Volta generations as appropriate.
These are initial tests, and we have a strong suspicion that over time as the software engineers get their hands on Ampere accelerators and update their software, the performance will get even better. There is precedent for this, after all.
https://www.nextplatform.com/2020/06/29/the-new-general-and-new-purpose-in-computing/

Our best guess is that the HPC applications in the chart above are not running on the Tensor Cores, so there is headroom for the software to catch up with the hardware and double the performance for these applications. So maybe when everything is all tuned up, it will be somewhere between 15X and 20X improvement on the same A100 hardware this time next year.
The important thing to note in that chart is how performance improved on the V100s with no changes in the hardware but lots of changes in the software, both in 2018 and 2019. There is no reason to believe that the features in the Ampere GPU will not be exploited to the hilt – eventually – by HPC software as much as by AI software. The stakes are just as high, and the software engineers are just as smart.
I wouldn't be drawing conclusions that the Volta > Ampere jump relates in any way to the gaming side's Turing > Ampere, especially when we take into account that the A100 and the gaming Amperes will be quite a bit different, with the former lacking RT acceleration and most likely dedicating more space to Tensor Cores.

Some speculation about the top consumer Ampere GPU:
The A100 went from the V100's 6 GPCs with 14 SMs per GPC to 8 GPCs with 16 SMs per GPC,
or from 84 SMs / 5376 FP32 cores to 128 SMs / 8192 FP32 cores.
Logically the GA102 will also have 8 GPCs, and keeping 12 SMs per GPC that would take
the TU102's 6 GPCs with 12 SMs per GPC to a GA102 with 8 GPCs and 12 SMs per GPC,
or from 72 SMs / 4608 FP32 cores to 96 SMs / 6144 FP32 cores (the arithmetic is sketched below).
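A quick sketch of that arithmetic, assuming 64 FP32 cores per SM as on these parts; the counts are the full-die configurations used in the post, and the GA102 row is the poster's speculation, not a confirmed configuration.

```python
# SM and FP32-core counts from GPC layout, assuming 64 FP32 cores per SM.
def cores(gpcs, sms_per_gpc, fp32_per_sm=64):
    sms = gpcs * sms_per_gpc
    return sms, sms * fp32_per_sm

print("V100 (full die):", cores(6, 14))      # (84, 5376)
print("A100 (full die):", cores(8, 16))      # (128, 8192)
print("TU102 (full die):", cores(6, 12))     # (72, 4608)
print("GA102 (speculated):", cores(8, 12))   # (96, 6144)
```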
I wouldn't be drawing conclusions that the Volta > Ampere jump relates in any way to the gaming side's Turing > Ampere, especially when we take into account that the A100 and the gaming Amperes will be quite a bit different, with the former lacking RT acceleration and most likely dedicating more space to Tensor Cores.
Volta and Turing weren't the same architecture; the Amperes should be, even if the implementations differ in a similar way. You can't just pick a few nice indicators, ignore the rest of the differences, and call it a day.

You are quite wrong about that: the V100 was a good predictor for the TU102, with the latter gaming GPU having even more space dedicated to tensors per SM. Likewise, the A100 will serve as a basis for the GA102.
I would hate to see those big Tensor Cores transition from the A100 to the GA102, but given the past it would not be a complete surprise.
Volta and Turing weren't the same architecture; the Amperes should be, even if the implementations differ in a similar way. You can't just pick a few nice indicators, ignore the rest of the differences, and call it a day.
This again.

Ampere leak: if it's true, it will come on Samsung 8nm (an improved version of Samsung's 10nm).