Nvidia Ampere Discussion [2020-05-14]

Personal opinion, your mileage may vary:
If the 3070/3060 Ti are at or above the performance level of the 2080 Ti and the Radeon RX 6800 (non-XT) is not too far away, the latter's 16 GByte of graphics memory becomes an increasingly compelling argument in its favour.

In other words: a 16 GByte variant of at least the 3070, and a 20 GByte 3080, would be much more desirable.
 
According to Anthony at Linus Tech Tips, "RX 6800 (the non-XT) is closer to 3080 than 3070"; it's in the ShortCircuit unboxing video.
 
Yeah, the 6800 is apparently considerably faster than the 3070. VRAM capacity is the least of its worries.
 
GA104 is faster than TU104; that has happened every generation. I think the naming is meaningless: it's a 3060-class part, but it won't cost $199-249 like the 960, 1060 or 1660 did. More likely $349-399, like the 970 or 1070.

Exactly this. The numbers in the names are worthless in this day and age. The 970 debuted in 2014 for $330; four years later the RTX 2070 released with a $599 MSRP.
We should just stick to comparing performance per cost, and on that measure the evolution of consumer graphics cards over the past 6 years has been really disappointing IMO, especially compared to the 6 years before that.
AMD's reduced ability to compete has only ever been good for Nvidia's pockets.
 
All things being relative, the additional memory and increased bandwidth on the A100 do seem to affect performance a bit.


[Charts: A100 speedups in FP16 AI training, HPC, and data analytics]

https://www.nvidia.com/en-us/data-center/a100/
 
If some of those unlabelled benchmarks are specifically chosen to fit the 80 GByte budget, then of course it would; why wouldn't it?

Edit: here are the footnotes from the presentation that I could find relating to the benchmarks you posted. Obviously, they were rounding to whole numbers.
1 AI Training running DLRM using Huge CTR framework on a 450 GB Criteo dataset. Normalized speedup ~2.6X
2 HPC: Quantum Espresso with CNT10POR8 dataset on a 1.6TB dataset. Normalized speedup ~1.8X
3 Data Analytics: big data benchmark with 10TB dataset, 30 analytical retail queries, ETL, ML, NLP. Normalized speedup ~1.9X
[my bold]
From there, it seems pretty natural for larger local memory to allow for much better perf scaling.
 
The 10 TB / 30 queries setup is the standard configuration for this benchmark. Nvidia started using it a few years ago to showcase AI/Hadoop performance using RAPIDS to data science/business analytics types, always with 10 TB / 30 queries (which doesn't fit an 80 GB budget).

Edit: Not familiar with the other benchmarks so can't comment on configurations used. I would venture to guess they used standard configurations as well.
 
Just pointing out the memory requirements in order to better quantify those (previously unnamed) benchmarks. I only found out later, in my edit, what they were and how much memory they used. If those are standard amounts, fine! :)
 
I am curious to see how Ampere GPUs perform in Unreal Engine 5. There is no raytracing, but the engine seems heavy on the compute side: Nanite mixes the hardware rasterizer for larger triangles with a software rasterizer for small triangles, and Lumen seems to use compute a lot. Ampere is better at raytracing, but for an engine that heavily favours compute power without using raytracing, it will probably still perform better than RDNA 2.

I will not be surprised if Ampere GPUs perform very well in UE5.
 
Compared to the 2070, the 3090 is only about twice as fast for my fluid sim (BlazeFX),
which is less than I expected.
I guess I'm limited by memory bandwidth.
36 TFLOP/s is not much of an improvement over 7.5 TFLOP/s in this case.
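(Rough sanity check, assuming the published memory bandwidths are right: the 2070 is rated at about 448 GB/s and the 3090 at about 936 GB/s, a ratio of roughly 2.1x, which lines up with the observed ~2x speedup if the simulation really is bandwidth-bound.)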
 
Is it possible to get access to the matrix math (tensor cores) in these GPUs? I'm wondering: if you programmed this in terms of matrix operations, would you get a useful speedup?
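
For what it's worth, the tensor cores are at least reachable from plain CUDA via the WMMA intrinsics. A minimal sketch of a single warp-level 16x16x16 tile multiply-accumulate, assuming an sm_70-or-newer part; the kernel and pointer names here are just for illustration:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively multiplies a pair of 16x16 half-precision tiles and
// accumulates the result in FP32 on the tensor cores.
__global__ void wmma_tile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // 16 = leading dimension in elements
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Real kernels tile a large matrix over many of these calls; whether that maps onto a fluid solver is the open question.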
 
The amount of FP32 computation is low relative to the amount of data read/written.
This is inherent to this kind of fluid simulation, which involves juggling very large volume textures.

There is one step that is high on FP32 computation, and that is cubic texture sampling.
Sadly, texture units have stagnated at linear interpolation only.
If TMUs were enhanced with cubic interpolation, that would be nice.
You can never match TMU cubic interpolation with shader computation, as even then fetching the texels is the bottleneck, not the computation.

I do see similarities between tensor cores and a TMU capable of cubic interpolation.
This functionality could be efficiently merged into a unified tensor/TMU hardware unit.
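
To make the similarity concrete, here is a rough sketch of 1D Catmull-Rom filtering built from four point fetches (my own illustration, not BlazeFX code; the texture object and coordinate convention are assumptions). The filter itself reduces to a length-4 dot product of weights and texels, which is exactly the shape of operation a tensor core evaluates:

```cuda
// Cubic (Catmull-Rom) filtering along one axis, built from 4 point fetches.
// A cubic-capable TMU would hide the fetches; the arithmetic that remains
// is just weights dot texels.
__device__ float catmull_rom_1d(cudaTextureObject_t tex, float x)
{
    float xi = floorf(x - 0.5f) + 0.5f;  // centre of the nearest texel
    float f  = x - xi;                   // fractional offset in [0, 1)

    // Catmull-Rom weights for the texels at offsets -1, 0, +1, +2
    float w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float w3 = f * f * (-0.5f + 0.5f * f);

    // Four point fetches: this is the bandwidth-bound part.
    float t0 = tex1D<float>(tex, xi - 1.0f);
    float t1 = tex1D<float>(tex, xi);
    float t2 = tex1D<float>(tex, xi + 1.0f);
    float t3 = tex1D<float>(tex, xi + 2.0f);

    // The filter itself is a length-4 dot product: w0*t0 + w1*t1 + w2*t2 + w3*t3.
    return w0 * t0 + w1 * t1 + w2 * t2 + w3 * t3;
}
```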

I've been speculating whether the RX 6800 would be any good for this kind of fluid simulation.
It might well not be out of the box, but I'll only be sure once I can test it.
 
Matrix ALUs are effectively optimised for a high FLOP-to-bandwidth ratio :)
But I have no idea what your texel re-use is like...

Tensor cores compute a vector dot product, like a0*w0 + a1*w1 + a2*w2 + a3*w3 ...
That is not what you need for fluid simulation (but it is what you need for cubic interpolation).
The tensor cores only reach full speed with heavy data reuse, as in convolution kernels, batched computation and matrix*matrix multiplication.
That is also unlike fluid simulation, where there is little data reuse and no matrix*matrix work.
(We are getting a bit off topic here.)
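
(To put rough numbers on the reuse point: a naive n x n matrix multiply in FP16 does about 2n^3 FLOPs while touching roughly 6n^2 bytes, so its arithmetic intensity grows as about n/3 FLOPs per byte, whereas a typical advection or stencil pass does only a handful of FLOPs per texel fetched. The latter stays pinned to memory bandwidth no matter how many tensor FLOPs the chip has.)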
 
One has to wonder if fluid/smoke/physics could end up partially or fully accelerated by DNNs. This is old stuff from 2016; there has to be newer research available. One more potential use for tensor cores?

 