Nvidia Ampere Discussion [2020-05-14]

We don't know about gaming yet, but am I the only one seeing MA-SSI-VE architecture changes that provide a HU-GE performance jump, and thus an efficiency jump, in the intended workloads?

A 3 to 7 times real-world performance gain on BERT training/inference is above expectations.

Isn't HPC an intended workload too? I mean, of course they focused more on the AI stuff this time around, but still, that's their big HPC chip for everything else too.
There the improvements aren't that impressive, considering the 2.5x transistor budget and higher power consumption.

Also, it seems the AI comparisons aren't really apples to apples: they're using the lower-precision TF32 format on the A100 in the BERT-Large "FP32" training comparison.
 
Isn't HPC an intended workload too? I mean, of course they focused more on the AI stuff this time around, but still, that's their big HPC chip for everything else too.
There the improvements aren't that impressive, considering the 2.5x transistor budget and higher power consumption.

Also, it seems the AI comparisons aren't really apples to apples: they're using the lower-precision TF32 format on the A100 in the BERT-Large "FP32" training comparison.
Yeah, A100 is a funky GEMM machine, but nothing snazzy besides that.
Which works for the intended market, but may or may not piss off the wide and varied HPC crowds.
 
There the improvements aren't that impressive, considering the 2.5x transistor budget and higher power consumption.
That's a ~70% average speedup in HPC workloads, which is pretty good when contrasted against the only ~30% increase in pure FP32. Alongside the much larger speedups in AI, I'd say that's quite impressive, actually.
 
We don't know about gaming yet, but am I the only one seeing MA-SSI-VE architecture changes that provide a HU-GE performance jump, and thus an efficiency jump, in the intended workloads?

Is it? What I see is them spending the extra transistor budget on an absurd number of tensor cores that matter little to HPC, while at the same time not being able to increase clocks even when said tensor cores aren't being used, thus reducing power efficiency for HPC tasks compared to their own predecessor, or even Vega 20 (let alone Arcturus, which should appear this year).

Of course, Nvidia will be trying to sell the idea that every scientific calculation out there is being replaced with machine learning, though I don't know if that's true.
 
Is it? What I see is them spending the extra transistor budget on an absurd number of tensor cores that matter little to HPC, while at the same time not being able to increase clocks even when said tensor cores aren't being used, thus reducing power efficiency for HPC tasks compared to their own predecessor, or even Vega 20 (let alone Arcturus, which should appear this year).

Of course, Nvidia will be trying to sell the idea that every scientific calculation out there is being replaced with machine learning, though I don't know if that's true.
Eh, the GEMM cores now go brrrrrr even in proper FP64, plus it's a solid bandwidth uptick.

Arcturus is, yeah, a far meaner actual GPGPU part, but it has no software or documentation to speak of, so it's basically a Frontier devboard.
 
I'm looking forward to seeing the market impact A100 and Arcturus will have, and especially in which markets, with their respective TAMs. Right now, all the rage (and thus all the money) seems to be in machine learning. I'm also wondering how much of an HPC accelerator the FP64 FMACs in the Tensor Cores will be, given they're (probably) not fully fledged FP64 units. I'm pretty sure they could be used to good effect in Linpack (yes, I know, Linpack Linshmack).
 
I wonder if Nvidia will create a smaller/more affordable PCIe-based AI/HPC card.
As a thought experiment, chop the A100 in half and keep 3 HBM2 stacks.
This would result in a ~400 mm² GPU, immensely improving yield.
With 64 SMs and a 10% clock increase, this would deliver roughly 65% of the A100's performance.
Spec would be:
  • Power <250 W
  • 24 GB at 960 GB/s
  • 200 TFLOPS FP16 / BF16
  • 100 TFLOPS TF32
  • 12.5 TFLOPS FP32
  • 6.25 TFLOPS FP64
Price: $2K
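
A rough sketch of the arithmetic behind that ~65% figure, assuming Nvidia's published A100 (SXM4) peaks of 108 enabled SMs, ~1.41 GHz boost, 312/156/19.5/9.7 TFLOPS for FP16-tensor/TF32-tensor/FP32/FP64, and ~1555 GB/s from five active HBM2 stacks; the half-chip configuration itself is purely hypothetical:

```python
# Hypothetical "half A100" sketch: 64 SMs, +10% clock, 3 HBM2 stacks.
# Baseline numbers are Nvidia's published A100 (SXM4) peak specs.

a100 = {
    "sms": 108,                # enabled SMs on the shipping A100
    "fp16_tensor_tflops": 312,
    "tf32_tensor_tflops": 156,
    "fp32_tflops": 19.5,
    "fp64_tflops": 9.7,
    "hbm2_stacks": 5,          # active stacks
    "bandwidth_gbps": 1555,
}

half_sms = 64
clock_uplift = 1.10

# Compute throughput scales with SM count and clock.
scale = (half_sms / a100["sms"]) * clock_uplift   # ~0.65
print(f"compute scale vs A100: {scale:.2f}")

for key in ("fp16_tensor_tflops", "tf32_tensor_tflops", "fp32_tflops", "fp64_tflops"):
    print(f"{key}: {a100[key] * scale:.1f}")

# Bandwidth scales with stack count (memory clock assumed unchanged).
bw = a100["bandwidth_gbps"] * 3 / a100["hbm2_stacks"]
print(f"bandwidth: {bw:.0f} GB/s")   # ~930 GB/s, close to the quoted 960 GB/s
```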
 
I wonder if Nvidia will create a smaller/more affordable PCIe-based AI/HPC card.
As a thought experiment, chop the A100 in half and keep 3 HBM2 stacks.
This would result in a ~400 mm² GPU, immensely improving yield.
With 64 SMs and a 10% clock increase, this would deliver roughly 65% of the A100's performance.
Spec would be:
  • Power <250 W
  • 24 GB at 960 GB/s
  • 200 TFLOPS FP16 / BF16
  • 100 TFLOPS TF32
  • 12.5 TFLOPS FP32
  • 6.25 TFLOPS FP64
Price: $2K
Nah, but we've already seen the one chip with 4 stacks
 
Is it? What I see is them spending the extra transistor budget on an absurd number of tensor cores that matter little to HPC, while at the same time not being able to increase clocks even when said tensor cores aren't being used, thus reducing power efficiency for HPC tasks compared to their own predecessor, or even Vega 20 (let alone Arcturus, which should appear this year).

Of course, Nvidia will be trying to sell the idea that every scientific calculation out there is being replaced with machine learning, though I don't know if that's true.

It's definitely not. Sometimes you simply need accurate simulation, up to and including 64-bit precision, not "an AI says this is maybe the answer based on guessing a trillion times".

But I think they see the most money in the AI market, and designing a 7nm chip is expensive and takes damned long, so they just went for what they saw as the highest profit margins they could get first and foremost.
 
It's definitely not. Sometimes you simply need accurate simulation, up to and including 64-bit precision, not "an AI says this is maybe the answer based on guessing a trillion times".

But I think they see the most money in the AI market, and designing a 7nm chip is expensive and takes damned long, so they just went for what they saw as the highest profit margins they could get first and foremost.

My understanding of scientific HPC algorithms is that at their hearts they very often rely on huge amounts of matrix multiplication. This is not something restricted to neural nets by any means. Since the Tensor Cores are in fact specialized matrix multiplication hardware, and since they added 32 and 64-bit float support to them for Ampere, they are in fact extremely well suited for HPC.
 
And Nvidia has published numbers from different applications and use cases. Performance improvements range between 1.5x and 2.1x for HPC. GA100 has 2.5x more transistors than GV100, so the scaling isn't bad. Considering that GA100 covers more use cases than GV100, the performance improvement is about as good as it can get. I don't think just increasing the number of SMs would achieve the same.
 
My understanding of scientific HPC algorithms is that at their hearts they very often rely on huge amounts of matrix multiplication. This is not something restricted to neural nets by any means. Since the Tensor Cores are in fact specialized matrix multiplication hardware, and since they added 32 and 64-bit float support to them for Ampere, they are in fact extremely well suited for HPC.
Some applications seem to be fine with iterative solvers in order to reach a desired precision. For those, if the specialized cores' throughput advantage at a given precision outweighs the extra time the iterative solver takes to converge, you come out with a net win as well.
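
A toy illustration of that trade-off, as a sketch only: classic mixed-precision iterative refinement does the expensive solve in reduced precision (here plain NumPy float32 standing in for a tensor-core path) and recovers accuracy by correcting against an FP64 residual:

```python
import numpy as np

def refine_solve(A, b, iters=5):
    """Toy mixed-precision iterative refinement.

    The factorization/solve runs in float32 (the cheap, high-throughput
    precision), while residuals are computed in float64 to recover
    full precision.
    """
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # FP64 residual
        dx = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction solve
        x += dx.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512)) + 512 * np.eye(512)  # well-conditioned example
b = rng.standard_normal(512)
x = refine_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # residual shrinks toward FP64 level
```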
 
From the white paper, this one needs more explanation:
"A100 adds Compute Data Compression. Compression saves up to 4x DRAM read/write bandwidth, up to 4x L2 read bandwidth, and up to 2x L2 capacity."
In the absence of details on how this is implemented, it looks like some kind of SM software-based compression/decompression.
 
My understanding of scientific HPC algorithms is that at their hearts they very often rely on huge amounts of matrix multiplication. This is not something restricted to neural nets by any means. Since the Tensor Cores are in fact specialized matrix multiplication hardware, and since they added 32 and 64-bit float support to them for Ampere, they are in fact extremely well suited for HPC.
Until you realize that the new tensor cores cannot replace general FP32 matrix-matrix multiplication.
Hint: TF32 != FP32
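
For reference, TF32 keeps FP32's 8-bit exponent but only FP16's 10-bit mantissa, so matrix inputs are effectively rounded before the multiply. A quick way to emulate that in Python (my own sketch, using simple truncation rather than Nvidia's exact rounding behaviour):

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 input rounding: keep FP32's 8-bit exponent,
    truncate the 23-bit mantissa down to TF32's 10 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)          # drop the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.0001
print(x, "->", to_tf32(x))   # everything below ~2^-10 of the leading digit is lost
```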
 
From the white paper, this one needs more explanation:
"A100 adds Compute Data Compression. Compression saves up to 4x DRAM read/write bandwidth, up to 4x L2 read bandwidth, and up to 2x L2 capacity."
In the absence of details on how this is implemented, it looks like some kind of SM software-based compression/decompression.

Interesting. Sounds suspiciously like what MooresLawIsDead claimed about tensor core based VRAM compression.
 
Until you realize that the new tensor cores cannot replace general FP32 matrix-matrix multiplication.
Hint: TF32 != FP32

FP64 = FP64 though, and there the A100 tensor cores deliver 19.5 TFLOPS, which I doubt would be attainable otherwise: 40 TFLOPS of FP32 with a 1:2 FP64 ratio seems unrealistic, 1:1 even more so, and probably a total waste of die area and power tbh.

At the end of the day, I really think it's more realistic to assume that Nvidia knew (from customer feedback) which formats would benefit their prospective customers the most, and where, and delivered accordingly. That's reflected in the fact that they did offer FP64 support on the TCs at 2x the normal speed, while they didn't even bother supporting FP32. Or I guess we can go around assuming it's a massive oversight and Nvidia is clueless.
 
Hint: TF32 != FP32
According to Nvidia, it provides the same accuracy in training. From here:
https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
To validate the accuracy of TF32, we used it to train a broad set of AI networks across a wide variety of applications from computer vision to natural language processing to recommender systems. All of them have the same convergence-to-accuracy behavior as FP32.

That’s why NVIDIA is making TF32 the default on its cuDNN library which accelerates key math operations for neural networks. At the same time, NVIDIA is working with the open-source communities that develop AI frameworks to enable TF32 as their default training mode on A100 GPUs, too.
In fact, I have the results of Nvidia's comparison between FP32 and TF32. I'm not sure I can share them since I don't see them anywhere online, but I can say that the networks trained using TF32 have the same accuracy as those trained with FP32. For AI, TF32 really is a safe replacement for FP32, with a huge speedup in performance.
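
For context, this is what that "default training mode" ends up looking like from the framework side; as an illustration of my own (not something from the thread), recent PyTorch (1.7 or later, if I recall correctly) exposes TF32 on Ampere through a pair of global flags while tensors stay declared as float32:

```python
import torch

# On an A100, these flags let FP32 matmuls/convolutions run through the
# TF32 tensor-core path; the tensors themselves remain torch.float32.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b   # dispatched to TF32 tensor cores when the flags are set
```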
 
FP64 = FP64 though, and there the A100 tensor cores deliver 19.5 TFLOPS,
Peak FP64 is 9.7 TFLOPS, not 19.5.

19.5 TFLOPS FP64 is the figure for deep learning training.
If their FP64 tensor cores were capable of non-ML tasks, why would they even need any FP64 ALUs at all?
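
For what it's worth, both figures drop straight out of the published per-SM rates; a back-of-envelope check assuming 108 SMs at ~1.41 GHz, 32 FP64 FMAs per SM per clock on the regular ALUs, and double that through the DMMA tensor-core path:

```python
sms, clock_ghz = 108, 1.41
fma_flops = 2  # one fused multiply-add counts as two FLOPs

# Regular FP64 ALUs: 32 FMAs per SM per clock.
fp64_alu_tflops = sms * 32 * fma_flops * clock_ghz / 1e3
# FP64 via tensor cores (DMMA): effectively 64 FMAs per SM per clock.
fp64_tensor_tflops = sms * 64 * fma_flops * clock_ghz / 1e3

print(fp64_alu_tflops)     # ~9.7
print(fp64_tensor_tflops)  # ~19.5
```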



At the end of the day, I really think it's more realistic to assume that Nvidia knew (from customer feedback) which formats would benefit their prospective customers the most, and where, and delivered accordingly. That's reflected in the fact that they did offer FP64 support on the TCs at 2x the normal speed,
Or GA100 might just be a chip that tries really hard to compete with the likes of Google's TPU in the ML market, while going against competitor CPUs, GPUs and dedicated accelerators in the HPC market, all while stubbornly keeping its GPU core functionality.
This sounds great on paper, but if they find decent opposition on more than one side, it might put the product in a difficult position, because it spread its transistors too thin fighting on different fronts and power efficiency ended up hurting in some of them.

It's a risk, like the one AMD took with Vega 10, which didn't pay off as they expected.


Or I guess we can go around assuming it's a massive oversight and Nvidia is clueless.
So Nvidia is immune to mistakes and only capable of delivering perfect products?
 