Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

DavidGraham · Mar 3, 2020

Sounds like HPC Ampere is a 128CU, another chip just showed up, with 7936 CUDA cores.

https://twitter.com/x/status/1234861928230653955

DavidGraham · Mar 4, 2020

More chips are appearing, this time we have a 118CU chip:

https://twitter.com/x/status/1234932539019710469

The 124CU chip @1100MHz achieves a CUDA score of: 222337
The 118CU chip @1100MHz achieves a CUDA score of: 169368

So going from 118 to 124 resulted in an increase of 31%!

Also worth noting that these results are not comparable to CUDA scores for Tesla or Turing, as the tests for Ampere are using insider drivers and CUDA 11 which is yet to be released.

445.01 & 445.35 are both insider drivers using WDDM 2.7 for Windows 10 20H1

Current latest driver is 442.50 and CUDA 10.2

https://twitter.com/x/status/1233532641531637760

Malo · Mar 4, 2020

DavidGraham said:
So going from 118 to 124 resulted in an increase of 31%!

Yeah because that makes sense. No other factors are involved of course.

Oooh Nvidiaaaaaaaa!

CarstenS · Mar 4, 2020

FWIW, a 2080 Ti @1100 MHz is around 106k in GB5.

Benetanegia · Mar 4, 2020

I think it's interesting that if the clocks are not being missrepresented, those benchmark results are pretty consistent with previous rumors/leaks.

First, the higher than expected performance-per-SM is consistent with the proposed setup of 16xFP32 + 16xINT or 16xFP32 + 16xFP32, and I'd even say that by the expected amount. Turing would be effectively ~1.4x warps every 2 cycles (36 INT per 100 FP thing), while Ampere would actually be able to do the full 2 warps, which is a 40% perf increase, which we are kinda actually seeing in those benches.

Second, 50% higher performance is roughly what we are shown and I think with such low clocks on 7nm and the afforementioned setup, half the power consumption is pretty realistic, almost a given to be very low.

Deleted member 2197 · Mar 4, 2020

Malo said:
Oooh Nvidiaaaaaaaa!

Yea, I know it must hurt so much!

DegustatoR · Mar 4, 2020

Benetanegia said:
I think it's interesting that if the clocks are not being missrepresented, those benchmark results are pretty consistent with previous rumors/leaks.

First, the higher than expected performance-per-SM is consistent with the proposed setup of 16xFP32 + 16xINT or 16xFP32 + 16xFP32, and I'd even say that by the expected amount. Turing would be effectively ~1.4x warps every 2 cycles (36 INT per 100 FP thing), while Ampere would actually be able to do the full 2 warps, which is a 40% perf increase, which we are kinda actually seeing in those benches.

Second, 50% higher performance is roughly what we are shown and I think with such low clocks on 7nm and the afforementioned setup, half the power consumption is pretty realistic, almost a given to be very low.

Could be just as easily due to memory bandwidth changes.

Benetanegia · Mar 4, 2020

DegustatoR said:
Could be just as easily due to memory bandwidth changes.

You think? It's the same memory setup as Volta except for clocks, and perf on Volta vs Turing is more consistent with TFLOPS than it is with Volta's 50% higher memory BW. In this particular benchmark anyway.

DegustatoR · Mar 4, 2020

Benetanegia said:
You think? It's the same memory setup as Volta except for clocks, and perf on Volta vs Turing is more consistent with TFLOPS than it is with Volta's 50% higher memory BW. In this particular benchmark anyway.

I'm just saying that it's hard to tell from these results. L2 is 5X+ larger, memory size varies from 24 to 32 to 48 GBs so it's hard to say if it's even four stacks and not six now for example.

Also how does Geekbench count the number of SPs? Shouldn't it detect the proper number if it's 2X per SM now? Or does it just use some fixed number per SM and multiply it by SM count?

Benetanegia · Mar 4, 2020

DegustatoR said:
I'm just saying that it's hard to tell from these results. L2 is 5X+ larger, memory size varies from 24 to 32 to 48 GBs so it's hard to say if it's even four stacks and not six now for example.

Yeah, I get what you mean, but I still think there's more to it.

Also how does Geekbench count the number of SPs? Shouldn't it detect the proper number if it's 2X per SM now? Or does it just use some fixed number per SM and multiply it by SM count?

As far as I can tell, the bench doesn't count the number of SPs at all. It only reports number of CUs.

CarstenS · Mar 4, 2020

DegustatoR said:
Also how does Geekbench count the number of SPs? Shouldn't it detect the proper number if it's 2X per SM now? Or does it just use some fixed number per SM and multiply it by SM count?

Since it uses OpenCL and CUDA, it can read it directly from what the driver reports.

trinibwoy · Mar 4, 2020

CarstenS said:
FWIW, a 2080 Ti @1100 MHz is around 106k in GB5.

That would put an Ampere SM at 15% higher IPC than Turing. We're looking at the same alu config + better caches I think.

trinibwoy · Mar 4, 2020

CarstenS said:
Since it uses OpenCL and CUDA, it can read it directly from what the driver reports.

Those frameworks report the number of SMs not ALUs correct?

w0lfram · Mar 4, 2020

xpea said:
hmmm...

NAVI10 is 251mm2 on 7nm for 10.3 billion transistors
TU106 is 445mm2 on 16/12nm for 10.8 billion transistors

Number of transistors is in the same ballpark (within 5%), TDP is also very close despite AMD full node advantage, but RDNA lacks VRS, Ray Tracing acceleration, Tensor cores, DLSS, good video encoder and support of INT4/8 for fast inference !!! It's very clear which one has the upper hand.

So IMHO, RDNA is far away from Turing. AMD can only compete because of their node advantage. They already made their big move with RDNA and RDNA 2 will be a small architecture evolution (mostly bringing VRS and RT) on a refined mode, where Ampere is a totally new architecture with full mode shrink. Like everybody, I want close competition for the sake of reasonable prices. But let's be pragmatic, even with only a node shrink and zero uarch improvement (and it's not), it will be a bloodbath for AMD...

AND...? (is it not obvious?)

Navi10 is faster in games, not equal in games. That is why nVidia released SUPER. (that still can't compete in some games). Understand?

Secondly, navi-10 uses a hybrid design rdna(1), and the long held secret of rnda2 (AMD's full new gaming architecture) has yet to be seen. But we know it is not weighed down with gcn, or what limited that design... it free from that.

TU-106 equalized (for mhz, transistors, etc) can't beat hybrid navi, then how will it compete whit rdna2's gaming efficiency ? We are talking about a uArch, that AMD has been working on (in secret) for 3 years and once shown to Clients years ago, jumped on board. Both the new Xbox & PlayStation will be using rdna2, not to mention we've seen some of the specs. Xbox might have the gaming performance of the rtx2080, using rdna2.

You might want to stop on let that sink in.

Thridly, Ampere is not new, it's architecture is 100% based off of Turing, just further refined. You are fabricating. And yes, on a full node shrink, that is totally new to nVidia. They will have growing pains.

But you still have not refuted the fact that rdna(1) is more powerful (at gaming) than turing architecture. And more of nvidia's design (turing 2.0?), is not going the help in games, because (again) it's based on an antiquated design.

I don't want a bigger 2080 shrink down, I want a revolutionary one. And this ampere, as we know it, is only nvidia's next volta business sector dGPU.

naenrda · Mar 4, 2020

w0lfram said:
AND...? (is it not obvious?)

Navi10 is faster in games, not equal in games. That is why nVidia released SUPER. (that still can't compete in some games). Understand?

Secondly, navi-10 uses a hybrid design rdna(1), and the long held secret of rnda2 (AMD's full new gaming architecture) has yet to be seen. But we know it is not weighed down with gcn, or what limited that design... it free from that.

TU-106 equalized (for mhz, transistors, etc) can't beat hybrid navi, then how will it compete whit rdna2's gaming efficiency ? We are talking about a uArch, that AMD has been working on (in secret) for 3 years and once shown to Clients years ago, jumped on board. Both the new Xbox & PlayStation will be using rdna2, not to mention we've seen some of the specs. Xbox might have the gaming performance of the rtx2080, using rdna2.

You might want to stop on let that sink in.

Thridly, Ampere is not new, it's architecture is 100% based off of Turing, just further refined. You are fabricating. And yes, on a full node shrink, that is totally new to nVidia. They will have growing pains.

But you still have not refuted the fact that rdna(1) is more powerful (at gaming) than turing architecture. And more of nvidia's design (turing 2.0?), is not going the help in games, because (again) it's based on an antiquated design.

I don't want a bigger 2080 shrink down, I want a revolutionary one. And this ampere, as we know it, is only nvidia's next volta business sector dGPU.

Navi isn’t faster in games, it’s roughly on par with Turing while having less features. Also, please don’t peddle this “Navi/RDNA is a hybrid” that came from the lowest of the low tech publications...

RDNA2 will be a solid improvement but nothing revolutionary.

Rootax · Mar 4, 2020

w0lfgram keeps raising the bar... I'm waiting for "and oh BTW, Vega was trashing Turing too, you know"

w0lfram · Mar 6, 2020

naenrda said:
Navi isn’t faster in games, it’s roughly on par with Turing while having less features. Also, please don’t peddle this “Navi/RDNA is a hybrid” that came from the lowest of the low tech publications...

RDNA2 will be a solid improvement but nothing revolutionary.

full stop. I just proved my case above, there is no argument here.
If navi10 and TU-106 are the same size and navi-10 is on average +15% faster... how is it not more efficient. More performance ("ipc"), using less transistors..? No matter how you scale it, rdna(1) comes out on top for freq vs output.

And, I am not peddling anything. rdna2 is different, period!

JasonLD · Mar 6, 2020

w0lfram said:
full stop. I just proved my case above, there is no argument here.
If navi10 and TU-106 are the same size and navi-10 is on average +15% faster... how is it not more efficient. More performance ("ipc"), using less transistors..? No matter how you scale it, rdna(1) comes out on top for freq vs output.

And, I am not peddling anything. rdna2 is different, period!

Fact that 445mm2 12nm GPU is competing against 251mm2 7nm2 not only in performance but efficiency pretty much ends the argument period. Its AMD that needs to catch up, not the other way.

You are completely ignoring 7nm advantage vs 12nm on same number of transistors. Without that advantage on Performance/Power Savings, Navi wouldn't look pretty against Turing.

trinibwoy · Mar 6, 2020

w0lfram said:
I just proved my case above, there is no argument here.

Not really. You're just repeating unfounded statements over and over. That doesn't amount to proof.

If navi10 and TU-106 are the same size and navi-10 is on average +15% faster... how is it not more efficient. More performance ("ipc"), using less transistors..?

How did you calculate the transistor count? How many transistors did you allocate to tensors and RT?

Instead of random guessing why don't you compare the 5700xt and 2070 super. They literally have the same specs.

techuse · Mar 6, 2020

trinibwoy said:
Not really. You're just repeating unfounded statements over and over. That doesn't amount to proof.

How did you calculate the transistor count? How many transistors did you allocate to tensors and RT?

Instead of random guessing why don't you compare the 5700xt and 2070 super. They literally have the same specs.

I dont necessarily agree with Wolfram overall but hes correct in terms of transistor counts. 5700xt is 10.3 billion. The 2070 super is 13.6 billion.

Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

DavidGraham

DavidGraham

Malo

Yak Mechanicum

CarstenS

Moderator

Benetanegia

Deleted member 2197

Guest

DegustatoR

Benetanegia

DegustatoR

Benetanegia

CarstenS

Moderator

trinibwoy

Meh

trinibwoy

Meh

w0lfram

naenrda

Rootax

w0lfram

JasonLD

trinibwoy

Meh

techuse

Similar threads