NVidia Ada Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Jul 10, 2021.

  1. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Absolutely, and it depends on the workload. With RT and DLSS enabled, Ampere has much better FLOPS per watt (while being a full node behind!!!) and FLOPS per transistor than RDNA2. Now if you look at pure rasterization performance, RDNA2 has the edge. But we are in 2021, not 2019 anymore. Pure rasterization is not a problem with this generation.
    The same can be said of MI200. An FP64 monster that looks good at first sight, but it targets the dying traditional HPC market, where the vast majority of workloads are being replaced by AI/ML. It looks like AMD is always one step behind...

    So what is important? What workload matters in 2021 to judge the FLOPs/watts or FLOPs/transistor metrics on a high-end GPU?
     
    PSman1700 likes this.
  2. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    A full node behind? And rasterization performance is not “fine”. We can still make use of many times more. It’s not a solved issue.
     
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    PSman1700 and xpea like this.
  4. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    The density and power advantages I've seen stated here for RDNA’s 7nm over Ampere’s 10nm are not those of a full node shrink.
     
  5. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,451
    Likes Received:
    471
    After the launch of MI100, AMD stated that customers had asked for a powerful FP64 solution, because there were none available and they needed to upgrade. But maybe I'm wrong and you know their customers better than they do.

    As for "the dying HPC market":

    https://www.globenewswire.com/news-...e-of-Cloud-Computing-is-Driving-Industry.html
     
    Lightman likes this.
  6. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Correct. That's what I wanted to say: RDNA2 on TSMC 7nm vs Ampere on Samsung 8nm, which is a slightly improved 10nm (like Turing's 12nm was a slightly improved 16nm).
    TSMC 7nm is widely considered a full node improvement over the Samsung 10nm derivative. In terms of peak density, it's ~94 MTx/mm2 for TSMC 7nm vs ~51 MTx/mm2 for Samsung 10/8nm. Of course we can argue that historically a full node was 4 times the density, but those days are over...
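    As a quick sketch, the quoted peak-density figures work out to a bit under a 2x gap. The MTx/mm2 numbers here are simply the ones cited in this post, not independently verified:

    ```python
    # Compare the quoted peak logic densities (MTx/mm^2).
    # Figures are the ones cited in the post above, treated as approximate.
    tsmc_7nm = 94.0     # TSMC 7nm peak density (quoted)
    samsung_8nm = 51.0  # Samsung 10/8nm-class peak density (quoted)

    ratio = tsmc_7nm / samsung_8nm
    print(f"TSMC 7nm vs Samsung 8nm peak density: {ratio:.2f}x")
    ```

    That ~1.84x is short of the classical full-node expectation, which is the crux of the disagreement above.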
     
    PSman1700 likes this.
  7. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    That's true, but then you are comparing different architectures from different companies at different fabs on different processes. Also design plays a large role. Higher clocks often need some additional transistor investments for example.

    We've got one clue though: Compare transistor density between A100-Ampere and RDNA2, which are at least from the same fab and the same process class - but still there are process variants for 7 nm class.
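    As a rough illustration of that A100-vs-RDNA2 comparison (a sketch using commonly reported transistor counts and die areas, which vary slightly by source and are assumptions here, not figures from this thread):

    ```python
    # Achieved transistor density for two 7nm-class TSMC chips,
    # from commonly reported specs: (transistor count, die area in mm^2).
    chips = {
        "A100 (GA100)":        (54.2e9, 826.0),
        "RX 6900 XT (Navi 21)": (26.8e9, 520.0),
    }

    for name, (transistors, area_mm2) in chips.items():
        density = transistors / 1e6 / area_mm2  # MTx per mm^2
        print(f"{name}: {density:.1f} MTx/mm^2")
    ```

    The resulting gap (~66 vs ~52 MTx/mm2) reflects the design mix (SRAM vs logic vs I/O) and clock targets as much as any process variant, which is exactly the caveat above.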
     
  8. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Yeah, compared to:
    https://www.idc.com/getdoc.jsp?containerId=prUS48127321
    If we link these two reports, by the time Hopper launches the AI/ML market will already be more than 10 times the size of the traditional HPC market... and the difference will continue to grow quickly.
     
    #228 xpea, Aug 5, 2021
    Last edited: Aug 5, 2021
    DavidGraham and PSman1700 like this.
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Is that really more than marketing? I mean, Nvidia is saying the same thing about AI. And how much better does MI100 fare wrt FP64 than A100? Is the difference enough for their customers to go from nay to yay?

    There's a whole lot of "cloud" and "services" there. Are you sure they refer to HPC as the classical "FP64-or-bust" segment?
     
  10. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,451
    Likes Received:
    471
    I'm sorry, but the fact that the AI/ML market is bigger doesn't support your opinion that the HPC market is dying. It doesn't say anything about the evolution of the HPC market at all. The article clearly states that the HPC market is growing, so your statement was invalid.

    Of course the entire HPC market isn't based on FP64 accelerators. But the same applies to the AI/ML market; it isn't based purely on GPU accelerators either. The point is that demand for FP64 accelerators hasn't disappeared. Nvidia doesn't care, so it makes sense for AMD to take advantage of that. The AI/ML market is bigger, but the competition is much stronger. Anyway, MI200 is going to be quite an interesting solution even for AI.
     
    Lightman likes this.
  11. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Ehhh. Minor update to matrix cores.
    MI300 yes.
    Even nV dudes like that one!
     
  12. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    We don't see the same thing. 10 years ago, FP64 HPC was the only market for accelerators. Today, AI/ML has replaced the vast majority of FP64 workloads, to the point that AI/ML is already 8.3 times bigger than FP64 HPC. Whether FP64 HPC is growing or not, it has become insignificant compared to AI/ML, hence my term "dying".
     
  13. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    I don't know. Going by the specs, FP32 and FP16 throughputs are the same on Ampere. Then you should presumably be able to run them concurrently (you'd need two async workloads, of course). But how this actually happens and at what speeds would need to be tested, and I haven't seen any data on this.
     
    PSman1700 likes this.
  14. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
    Only FLOPS per watt matters, because transistors are cheap and compute units are very efficient. The biggest problem is data movement.

    FP64 is an inefficient way to calculate data. Using mixed precision in cases where FP64 isn't necessary increases efficiency many times over. Why settle for 1 exaflop when you can scale to 32 exaflops?

    That makes single-purpose products like AMD's CDNA less competitive and cost-ineffective for most companies and cloud providers. nVidia's datacenter business exploded with Volta (HPC and DL training) and Turing (DL inference); now with GA100 they can tackle every workload with one product.

    The same reason why RDNA2 failed: being good at "pure" rasterizing isn't good enough today.
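    The mixed-precision scaling argument in this post maps roughly onto A100's published peak rates (a sketch; the TFLOPS figures are from NVIDIA's public spec sheet as commonly reported, not from this thread):

    ```python
    # Peak throughput ratio behind the "1 exaflop vs 32 exaflops" framing,
    # using commonly reported A100 peak rates in TFLOPS.
    fp64_vector = 9.7    # A100 FP64, non-tensor
    fp16_tensor = 312.0  # A100 FP16 tensor core, dense

    speedup = fp16_tensor / fp64_vector
    print(f"FP16 tensor vs FP64 vector: {speedup:.0f}x")
    ```

    In practice, techniques like iterative refinement let mixed-precision solvers recover FP64-level accuracy for some workloads at a fraction of the FP64 cost, which is what makes this ratio more than a paper number.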
     
    #234 troyan, Aug 5, 2021
    Last edited: Aug 5, 2021
  15. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
    Agree. But the question remains: what to do with an insane 75 TF GPU?
    Scaling console games up this far is pointless.
    Multisampling is not efficient.
    Maxing out RT just to bring it to its knees is not efficient either, I guess (we'll see if / how they improve).
    So we need to add something new that isn't present in the console game we aim to port.
    Which could be (summing up my previous proposals): volumetric stuff (fog simulation, lighting), a layered framebuffer to address the shortcomings of screen-space hacks, fancy SM-based area shadow techniques. And of course GI, if compute can do this better than RT. What else?
    No matter what, there should be more than enough async compute work around to compensate for the speculated issues of running the traditional gfx pipeline on chiplets. So even if there is a problem at all, it feels pretty rhetorical to me (that would change if chiplets move to the entry/mid level).
    Even if we just scale up RT, the BVH building work on very detailed geometry alone would already provide shitloads of async compute work.

    So I don't think there'll be a problem utilizing the GPU, *if* we do this extra work.
    It depends on how many such GPUs get sold to gamers, which should depend on the visual improvements we can achieve by cranking things up, relative to the high price of the HW.
    Feels crazy, because on the other hand we surely can sell more games by putting the focus on scaling down (Series S, Steam Deck, Switch, poor man's PC).
    The expected issues from chiplets, yes or no, won't be a problem, but the increasing variety of over- and underspecced HW is. Multi-gen and multi-platform games become even more expensive to make and more compromised, while the lower and higher ends of HW become more niche, so it's hard to say what's worth it.
     
  16. yuri

    Regular

    Joined:
    Jun 2, 2010
    Messages:
    283
    Likes Received:
    296
    TBH, aiming CDNA (Vega) at pure HPC is not that weird, given the SW side of the business. Targeting AI/ML requires top-notch SW, and AMD's SW is traditionally far from that.
     
    DavidGraham and xpea like this.
  17. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    There is no such question. You forget that a 75 TF top end means a ~25 TF low end, and even that will not be enough to run games from last year at maximum settings. The lineup isn't made out of one GPU.

    And even beyond that, scaling RT and compute-based raster is far from over. Games aren't really hitting the point at which we can say "well, we don't need better graphics now".
     
    DavidGraham, Jawed and PSman1700 like this.
  18. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    It means everything gets more expensive.
     
  19. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    MI200 probably will be, yes. Depending, of course, on the competition at the time when it actually goes to market, and not on preliminary shipments for deployment tests. But that's neither Lovelace nor Hopper.
     
  20. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Nah.
    That's now.
     