Nvidia Ampere Discussion [2020-05-14]

Personal opinion, your mileage may vary:
If the 3070/3060 Ti are at or above the performance level of the 2080 Ti and the Radeon RX 6800 (non-XT) is not too far away, the latter's 16 GByte of graphics memory becomes an increasingly compelling argument in its favour.

In other words: a 16 GByte variant of at least the 3070, and a 20 GByte 3080, would be much more desirable.
 
According to Anthony at Linus Tech Tips, "RX 6800 (the non-XT) is closer to 3080 than 3070"; it's in the ShortCircuit unboxing video.
 
Yeah, the 6800 is apparently considerably faster than the 3070. VRAM capacity is the least of its worries.
 
GA104 is faster than TU104; that has happened every generation. I think the naming is meaningless: it's a 3060-class part, but it won't cost $199-249 like the 960, 1060 or 1660 did. More likely $349-399, like the 970 or 1070.

Exactly this. The numbers in the names are worthless in this day and age. The 970 debuted in 2014 for $330; four years later the RTX 2070 released with a $599 MSRP.
We should just stick to comparing performance per cost, and on that measure the evolution of consumer graphics cards over the past 6 years has been really disappointing IMO, especially compared to the 6 years before that.
AMD's reduced ability to compete has only ever been good for Nvidia's pockets.
 
All things being relative, the additional memory and increased bandwidth on the A100 do seem to affect performance a bit.


[Charts: A100 speedups in FP16 AI training, HPC, and data analytics]

https://www.nvidia.com/en-us/data-center/a100/
 
If some of those unlabelled benchmarks are specifically chosen to fit the 80 GByte budget, then of course it would; why wouldn't it?

Edit: here are the footnotes from the presentation that I could find relating to the benchmarks you posted. Obviously, they were rounding to whole numbers.
1 AI Training running DLRM using Huge CTR framework on a 450 GB Criteo dataset. Normalized speedup ~2.6X
2 HPC: Quantum Espresso with CNT10POR8 dataset on a 1.6TB dataset. Normalized speedup ~1.8X
3 Data Analytics: big data benchmark with 10TB dataset, 30 analytical retail queries, ETL, ML, NLP. Normalized speedup ~1.9X
[my bold]
From there, it seems pretty natural for larger local memory to allow for much better perf scaling.
 
The 10 TB / 30 queries setup is the standard configuration for this benchmark. Nvidia started using it a few years ago to showcase AI/Hadoop performance using RAPIDS to data science/business analytics types, always with 10 TB / 30 queries (which doesn't fit an 80 GB budget).

Edit: Not familiar with the other benchmarks so can't comment on configurations used. I would venture to guess they used standard configurations as well.
 
Just pointing out the memory requirements in order to better quantify those (previously unnamed) benchmarks. I only found out later, in my edit, what they were and how much memory they used. If those are standard amounts, fine! :)
 
I am curious to see how Ampere GPUs perform in Unreal Engine 5. There is no raytracing, but the engine seems heavy on the compute side: Nanite mixes the hardware rasterizer for larger triangles with a software rasterizer for small triangles, and Lumen seems to use compute a lot. Ampere is better at raytracing, but for an engine that heavily favours compute power without using raytracing, it will probably still perform better than RDNA 2.

I will not be surprised if Ampere GPUs perform very well in UE5.
 
Compared to the 2070, the 3090 is only about twice as fast for my fluid sim (BlazeFX),
which is less than I expected.
I guess I'm limited by memory bandwidth.
36 TFLOP/s is not much of an improvement over 7.5 TFLOP/s in this case.
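(Rough sanity check, assuming the published memory bandwidths are right: the 2070 is rated at about 448 GB/s and the 3090 at about 936 GB/s, a ratio of roughly 2.1x, which lines up with the observed ~2x speedup if the simulation really is bandwidth-bound.)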
 
Is it possible to get access to the matrix math (tensor cores) in these GPUs? I'm wondering: if you programmed this in terms of matrix operations, would you get a useful speedup?
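
For what it's worth, the tensor cores are at least reachable from plain CUDA via the WMMA intrinsics. A minimal sketch of a single warp-level 16x16x16 tile multiply-accumulate, assuming an sm_70-or-newer part; the kernel and pointer names here are just for illustration:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively multiplies a pair of 16x16 half-precision tiles and
// accumulates the result in FP32 on the tensor cores.
__global__ void wmma_tile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // 16 = leading dimension in elements
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Real kernels tile a large matrix over many of these calls; whether that maps onto a fluid solver is the open question.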
 
The amount of FP32 computation is low relative to the amount of data read/written.
This is inherent to this kind of fluid simulation, which involves juggling very large volume textures.

There is one step that is high on FP32 computation, and that is cubic texture sampling.
Sadly, texture units have stagnated at linear interpolation only.
If TMUs were enhanced with cubic interpolation, that would be nice.
You can never match TMU cubic interpolation with shader computation, as even then fetching the texels is the bottleneck, not the computation.

I do see similarities between tensor cores and a TMU capable of cubic interpolation.
This functionality could be efficiently merged into a unified tensor/TMU hardware unit.
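
To make the similarity concrete, here is a rough sketch of 1D Catmull-Rom filtering built from four point fetches (my own illustration, not BlazeFX code; the texture object and coordinate convention are assumptions). The filter itself reduces to a length-4 dot product of weights and texels, which is exactly the shape of operation a tensor core evaluates:

```cuda
// Cubic (Catmull-Rom) filtering along one axis, built from 4 point fetches.
// A cubic-capable TMU would hide the fetches; the arithmetic that remains
// is just weights dot texels.
__device__ float catmull_rom_1d(cudaTextureObject_t tex, float x)
{
    float xi = floorf(x - 0.5f) + 0.5f;  // centre of the nearest texel
    float f  = x - xi;                   // fractional offset in [0, 1)

    // Catmull-Rom weights for the texels at offsets -1, 0, +1, +2
    float w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float w3 = f * f * (-0.5f + 0.5f * f);

    // Four point fetches: this is the bandwidth-bound part.
    float t0 = tex1D<float>(tex, xi - 1.0f);
    float t1 = tex1D<float>(tex, xi);
    float t2 = tex1D<float>(tex, xi + 1.0f);
    float t3 = tex1D<float>(tex, xi + 2.0f);

    // The filter itself is a length-4 dot product: w0*t0 + w1*t1 + w2*t2 + w3*t3.
    return w0 * t0 + w1 * t1 + w2 * t2 + w3 * t3;
}
```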

I've been speculating whether the RX 6800 would be any good for this kind of fluid simulation.
It might well not be out of the box, but I'll only be sure once I can test it.
 
Matrix ALUs are effectively optimised for a high FLOP-to-bandwidth ratio :)
But I have no idea what your texel re-use is like...

Tensor cores compute a vector dot product, like a0*w0 + a1*w1 + a2*w2 + a3*w3 ...
That is not what you need for fluid simulation (but it is what you need for cubic interpolation).
The tensor cores only reach full speed with heavy data reuse, as in convolution kernels, batched computation and matrix*matrix multiplication.
That is also unlike fluid simulation, where there is little data reuse and no matrix*matrix work.
(We are getting a bit off topic here.)
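
(To put rough numbers on the reuse point: a naive n x n matrix multiply in FP16 does about 2n^3 FLOPs while touching roughly 6n^2 bytes, so its arithmetic intensity grows as about n/3 FLOPs per byte, whereas a typical advection or stencil pass does only a handful of FLOPs per texel fetched. The latter stays pinned to memory bandwidth no matter how many tensor FLOPs the chip has.)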
 
One has to wonder if fluid/smoke/physics could end up partially or fully accelerated by DNNs. This is old stuff from 2016; there has to be newer research available. One more potential use for tensor cores?

 