In the big slide from the press deck just above your post?

Exactly where do they claim a 20x FP32 performance increase when comparing between Ampere and Volta?
I'm fully aware of how they came up with the numbers. Like I said, I should have been more specific and said "in their slides", which at minimum would need some fine print saying "check the details here, since this is actually BS if you just read it as it is".

Other reviews had no problems interpreting the graphs along with what Nvidia stated.
But TF32 isn't FP32, that was the point.

That's a slide from the press deck. And I'm reading "FP32 training". TF32 is Nvidia's default precision for training on the A100.
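For context on that point: TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits, and it only applies to matrix-multiply work routed through the Tensor Cores. Here is a minimal sketch of how a framework exposes that switch, using PyTorch's TF32 flags; the flag names are PyTorch's, but whether TF32 is on by default has varied by framework and version, so treat the defaults as an assumption rather than a given.

```python
import torch

# TF32 only affects matmul/convolution work dispatched to Ampere Tensor Cores;
# everything else still runs as ordinary IEEE FP32. Defaults have varied by
# framework and version, so check rather than assume.
torch.backends.cuda.matmul.allow_tf32 = True   # allow TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # allow TF32 for cuDNN convolutions

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c_tf32 = a @ b   # on an A100 this may run as TF32 even though the tensors are "FP32"

torch.backends.cuda.matmul.allow_tf32 = False  # force strict FP32 for comparison
c_fp32 = a @ b
print((c_tf32 - c_fp32).abs().max())
```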
In the big slide from the press deck just above your post?
I'm fully aware of how they came up with the numbers. Like I said, I should have been more specific and said "in their slides", which at minimum would need some fine print saying "check the details here, since this is actually BS if you just read it as it is".
But TF32 isn't FP32, that was the point.
The other comparisons are a little iffy too; at least to my understanding, the FP64 number holds true only for matrix multiplications (i.e., the kind of work the Tensor Cores run fast), and the same goes for INT8 (which in addition assumes you can take advantage of the sparsity support).
Except in that slide I posted, where they list it as 19.5 TFLOPS, which is the Tensor Cores doing tensor stuff at FP64 precision.

Which is why they only list FP64 as 10 teraflops, and not a huge number. Nvidia gonna Nvidia, whatever.
Omitting that in most use cases you can neither achieve the conditions for peak performance (due to functional constraints) nor fit sustained operation within the TDP budget, and on top of that wildly re-interpreting terms so that operations-not-performed count toward the stated peak numbers, and deliberately mislabeling data types.

For FP32, that's on top, sure. But Kaotik seems to dislike more than that.
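To make the dispute concrete, here is a rough sketch of how the headline multipliers fall out of the commonly quoted peak figures, as I understand them; the point is that the big ratios bake in Tensor Core formats and 2:4 structured sparsity, which is exactly what the complaint is about.

```python
# Commonly quoted peak rates (TFLOPS / TOPS). The headline comparisons pit
# A100 Tensor Core formats (often with sparsity) against plain V100 rates.
v100 = {"fp32": 15.7, "fp64": 7.8, "int8_dp4a": 62.0}
a100 = {
    "fp32": 19.5,            # plain CUDA-core FP32
    "tf32_dense": 156.0,     # Tensor Core TF32
    "tf32_sparse": 312.0,    # TF32 with 2:4 structured sparsity
    "fp64": 9.7,             # plain CUDA-core FP64
    "fp64_tensor": 19.5,     # Tensor Core FP64 (matrix math only)
    "int8_sparse": 1248.0,   # Tensor Core INT8 with sparsity
}

print(a100["tf32_sparse"] / v100["fp32"])       # ~19.9x -> the "20x" style FP32-training claim
print(a100["fp32"] / v100["fp32"])              # ~1.24x -> plain FP32, no Tensor Cores
print(a100["fp64_tensor"] / v100["fp64"])       # ~2.5x  -> FP64, but only for matmuls
print(a100["int8_sparse"] / v100["int8_dp4a"])  # ~20x   -> INT8, assuming sparsity
```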
We already went through the salient characteristics and performance using various data formats and processing units of Nvidia GPU accelerators from Kepler through Ampere in the architectural deep dive we did at the end of May. Now, we are going to look at the cost of Ampere A100 accelerators relative to prior generations at their current street prices. Ultimately, this is always about money as much as it is about architecture. All of the clever architecture in the world doesn’t amount to much if you can’t afford to buy it.
In the table below, we show the price, bang for the buck, and bang for the watt of the past four-ish generations of GPU accelerators (counting Turing as its own generation makes four in addition to the Pascal, Volta, and Ampere GPU accelerators). There are a lot of different performance metrics we can look at, and that is particularly true when considering sparse matrix support for machine learning inference. In this comparison, we assume the best-case scenario using the various data formats and compute units as well as sparse matrix calculations when calculating performance per watt and performance per dollar. We also bring it all together in the mother of all ratios, dollars per performance per watt, which is the be-all, end-all calculation that drives architectural choices at the hyperscalers and cloud builders.
That’s just a subset of the data about server-class GPU accelerators from Nvidia that we have compiled over time. We have this big table that goes all the way back to the Kepler GPU accelerators, which you can view in a separate window here because it doesn’t fit in our column width by a long shot. This is the companion set for the two tables we did in the architectural dep dive, but as we said, now we are making performance fractions with watts and bucks.
...
As part of the Ampere rollout, Nvidia released some HPC and AI benchmarks to give customers a sense of how real-world applications perform on the A100 accelerators and how they compare to the previous Turing and Volta generations as appropriate.
These are initial tests, and we have a strong suspicion that over time as the software engineers get their hands on Ampere accelerators and update their software, the performance will get even better. There is precedent for this, after all.
https://www.nextplatform.com/2020/06/29/the-new-general-and-new-purpose-in-computing/

Our best guess is that the HPC applications in the chart above are not running on the Tensor Cores, so there is headroom for the software to catch up with the hardware and double the performance for these applications. So maybe when everything is all tuned up, it will be somewhere between 15X and 20X improvement on the same A100 hardware this time next year.
The important thing to note in that chart is how performance improved on the V100s with no changes in the hardware but lots of changes in the software, both in 2018 and 2019. There is no reason to believe that the features in the Ampere GPU will not be exploited to the hilt – eventually – by HPC software as much as by AI software. The stakes are just as high, and the software engineers are just as smart.
I wouldn't be drawing conclusions that the Volta > Ampere jump relates in any way to the gaming side's Turing > Ampere, especially when we take into account that the A100 and the gaming Amperes will be quite a bit different, with the former lacking RT acceleration and most likely dedicating more space to Tensor Cores.

Some speculation about the top consumer Ampere GPU:
The A100 went from the V100's 6 GPCs with 14 SMs per GPC to 8 GPCs with 16 SMs per GPC,
or from 84 SMs / 5376 FP32 cores to 128 SMs / 8192 FP32 cores.
Logically the GA102 will also have 8 GPCs, and keeping 12 SMs per GPC that would take
the TU102's 6 GPCs with 12 SMs per GPC to a GA102 with 8 GPCs and 12 SMs per GPC,
or from 72 SMs / 4608 FP32 cores to 96 SMs / 6144 FP32 cores (the arithmetic is sketched below).
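A quick sketch of that arithmetic, assuming 64 FP32 cores per SM as on these parts; the counts are the full-die configurations used in the post, and the GA102 row is the poster's speculation, not a confirmed configuration.

```python
# SM and FP32-core counts from GPC layout, assuming 64 FP32 cores per SM.
def cores(gpcs, sms_per_gpc, fp32_per_sm=64):
    sms = gpcs * sms_per_gpc
    return sms, sms * fp32_per_sm

print("V100 (full die):", cores(6, 14))      # (84, 5376)
print("A100 (full die):", cores(8, 16))      # (128, 8192)
print("TU102 (full die):", cores(6, 12))     # (72, 4608)
print("GA102 (speculated):", cores(8, 12))   # (96, 6144)
```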
I wouldn't be drawing conclusions that the Volta > Ampere jump relates in any way to the gaming side's Turing > Ampere, especially when we take into account that the A100 and the gaming Amperes will be quite a bit different, with the former lacking RT acceleration and most likely dedicating more space to Tensor Cores.
Volta and Turing weren't the same architecture; the Amperes should be, even if the implementations differ in a similar way. You can't just pick a few nice indicators, ignore the rest of the differences, and call it a day.

You are quite wrong about that: the V100 was a good predictor for the TU102, with the latter gaming GPU having even more space dedicated to tensors per SM. Likewise, the A100 will serve as a basis for the GA102.
I would hate to see those big Tensor Cores transition from the A100 to the GA102, but given the past it would not be a complete surprise.
Volta and Turing weren't the same architecture; the Amperes should be, even if the implementations differ in a similar way. You can't just pick a few nice indicators, ignore the rest of the differences, and call it a day.
This again.

Ampere leak: if it's true, it will come on Samsung 8nm (an improved version of Samsung's 10nm).