Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Status
Not open for further replies.
More chips are appearing, this time we have a 118CU chip:


The 124CU chip @1100MHz achieves a CUDA score of: 222337
The 118CU chip @1100MHz achieves a CUDA score of: 169368

So going from 118 to 124 CUs resulted in a score increase of 31%!
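
For anyone who wants to check the arithmetic, here is a quick sketch in Python. Only the two scores and the CU counts come from the listings above; the variable names are made up for illustration.

```python
# Scores and CU counts exactly as quoted from the two benchmark listings.
score_124, score_118 = 222337, 169368
cus_124, cus_118 = 124, 118

# Relative gains going from the 118CU entry to the 124CU entry.
score_gain = score_124 / score_118 - 1
cu_gain = cus_124 / cus_118 - 1

# Score normalized by CU count, to see how much of the gain is per-CU.
per_cu_118 = score_118 / cus_118
per_cu_124 = score_124 / cus_124

print(f"score gain: {score_gain:.1%}")
print(f"CU gain:    {cu_gain:.1%}")
print(f"score per CU: {per_cu_118:.0f} -> {per_cu_124:.0f}")
```

The score gap is several times larger than the CU gap, so most of the difference has to come from something other than CU count.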

Also worth noting that these results are not comparable to CUDA scores for Tesla or Turing, as the tests for Ampere are using insider drivers and CUDA 11 which is yet to be released.

445.01 & 445.35 are both insider drivers using WDDM 2.7 for Windows 10 20H1

The current latest driver is 442.50, with CUDA 10.2.

 
I think it's interesting that, if the clocks are not being misrepresented, those benchmark results are pretty consistent with previous rumors/leaks.

First, the higher than expected performance-per-SM is consistent with the proposed setup of 16xFP32 + 16xINT or 16xFP32 + 16xFP32, and I'd even say that by the expected amount. Turing would be effectively ~1.4x warps every 2 cycles (36 INT per 100 FP thing), while Ampere would actually be able to do the full 2 warps, which is a 40% perf increase, which we are kinda actually seeing in those benches.

Second, 50% higher performance is roughly what we are shown, and I think with such low clocks on 7nm and the aforementioned setup, half the power consumption is pretty realistic; very low power is almost a given.
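
The issue-rate argument above can be put into a toy model. This is only a sketch under loud assumptions: the 36 INT per 100 FP mix is the figure quoted in the post, every instruction costs one issue slot, and memory, scheduling and dependencies are ignored. The "dual" case assumes two pipes that can each take any instruction, which is an idealization of the rumored setup, not a confirmed design.

```python
# Instruction mix quoted in the post: 36 INT per 100 FP instructions.
fp, integer = 100, 36
total = fp + integer

# Baseline: a single pipe issues every instruction one after another.
serial_cycles = total

# Turing-style: separate FP and INT pipes run concurrently,
# so the longer of the two streams sets the time.
turing_cycles = max(fp, integer)

# Assumed idealized dual setup: two pipes, either can take any instruction.
dual_cycles = total / 2

turing_rate = serial_cycles / turing_cycles   # throughput relative to baseline
dual_rate = serial_cycles / dual_cycles
uplift = dual_rate / turing_rate - 1          # gain of dual setup over Turing-style

print(f"Turing-style: {turing_rate:.2f}x, dual: {dual_rate:.2f}x, uplift: {uplift:.0%}")
```

With the exact 136/100 ratio the uplift comes out somewhat above 40%; using the rounded ~1.4x figure from the post gives roughly the 40% mentioned there.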
 
Could be just as easily due to memory bandwidth changes.
 
You think? It's the same memory setup as Volta except for clocks, and perf on Volta vs Turing is more consistent with TFLOPS than it is with Volta's 50% higher memory BW. In this particular benchmark anyway.
 
I'm just saying that it's hard to tell from these results. L2 is 5X+ larger, and memory size varies from 24 to 32 to 48 GB, so it's hard to say if it's even four stacks and not six now, for example.

Also how does Geekbench count the number of SPs? Shouldn't it detect the proper number if it's 2X per SM now? Or does it just use some fixed number per SM and multiply it by SM count?
 
I'm just saying that it's hard to tell from these results. L2 is 5X+ larger, memory size varies from 24 to 32 to 48 GBs so it's hard to say if it's even four stacks and not six now for example.

Yeah, I get what you mean, but I still think there's more to it.

Also how does Geekbench count the number of SPs? Shouldn't it detect the proper number if it's 2X per SM now? Or does it just use some fixed number per SM and multiply it by SM count?

As far as I can tell, the bench doesn't count the number of SPs at all. It only reports number of CUs.
 
Also how does Geekbench count the number of SPs? Shouldn't it detect the proper number if it's 2X per SM now? Or does it just use some fixed number per SM and multiply it by SM count?
Since it uses OpenCL and CUDA, it can read it directly from what the driver reports.
 
hmmm...

NAVI10 is 251mm2 on 7nm for 10.3 billion transistors
TU106 is 445mm2 on 16/12nm for 10.8 billion transistors

The number of transistors is in the same ballpark (within 5%), and TDP is also very close despite AMD's full-node advantage, but RDNA lacks VRS, ray tracing acceleration, Tensor cores, DLSS, a good video encoder and INT4/INT8 support for fast inference! It's very clear which one has the upper hand.
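
As a rough check on those figures, here is a small Python sketch that works out transistor density and the gap in transistor count. The die sizes and counts are the ones quoted above; the dictionary layout and names are just for illustration.

```python
# Die area (mm^2) and transistor count (billions) as quoted in the post.
chips = {
    "Navi 10": {"area_mm2": 251, "transistors_bn": 10.3},
    "TU106": {"area_mm2": 445, "transistors_bn": 10.8},
}

# Density in millions of transistors per mm^2.
density = {
    name: c["transistors_bn"] * 1000 / c["area_mm2"]
    for name, c in chips.items()
}

# How much larger TU106's transistor count is relative to Navi 10's.
count_gap = chips["TU106"]["transistors_bn"] / chips["Navi 10"]["transistors_bn"] - 1

for name, d in density.items():
    print(f"{name}: {d:.1f} MTr/mm2")
print(f"transistor count gap: {count_gap:.1%}")
```

The count gap lands just under 5%, while the density figures show how much of the die-size difference comes from the node rather than from the design.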

So IMHO, RDNA is far away from Turing. AMD can only compete because of their node advantage. They already made their big move with RDNA, and RDNA 2 will be a small architecture evolution (mostly bringing VRS and RT) on a refined node, whereas Ampere is a totally new architecture with a full node shrink. Like everybody, I want close competition for the sake of reasonable prices. But let's be pragmatic: even with only a node shrink and zero uarch improvement (and that won't be the case), it will be a bloodbath for AMD...

AND...? (is it not obvious?)

Navi10 is faster in games, not equal in games. That is why nVidia released SUPER (which still can't compete in some games). Understand?


Secondly, Navi 10 uses a hybrid design, RDNA(1), and the long-held secret of RDNA2 (AMD's fully new gaming architecture) has yet to be seen. But we know it is not weighed down by GCN or what limited that design... it's free from that.

If TU106, equalized (for MHz, transistors, etc.), can't beat hybrid Navi, then how will it compete with RDNA2's gaming efficiency? We are talking about a uArch that AMD has been working on (in secret) for 3 years, and once it was shown to clients years ago, they jumped on board. Both the new Xbox & PlayStation will be using RDNA2, not to mention we've seen some of the specs. The Xbox might have the gaming performance of the RTX 2080, using RDNA2.

You might want to stop and let that sink in.


Thirdly, Ampere is not new; its architecture is 100% based off of Turing, just further refined. You are fabricating. And yes, it's on a full node shrink, which is totally new to nVidia. They will have growing pains.

But you still have not refuted the fact that RDNA(1) is more powerful (at gaming) than the Turing architecture. And more of nVidia's design (Turing 2.0?) is not going to help in games, because (again) it's based on an antiquated design.

I don't want a bigger 2080 shrunk down, I want a revolutionary one. And this Ampere, as we know it, is only nVidia's next Volta-style business-sector dGPU.
 
Navi isn’t faster in games; it’s roughly on par with Turing while having fewer features. Also, please don’t peddle this “Navi/RDNA is a hybrid” claim that came from the lowest of the low tech publications...

RDNA2 will be a solid improvement but nothing revolutionary.
 
full stop. I just proved my case above, there is no argument here.
If navi10 and TU-106 are the same size and navi-10 is on average +15% faster... how is it not more efficient? More performance ("ipc"), using fewer transistors..? No matter how you scale it, rdna(1) comes out on top for freq vs output.

And, I am not peddling anything. rdna2 is different, period!
 
The fact that a 445mm2 12nm GPU is competing against a 251mm2 7nm one, not only in performance but also in efficiency, pretty much ends the argument. It's AMD that needs to catch up, not the other way around.

You are completely ignoring the advantage of 7nm vs 12nm at the same number of transistors. Without that advantage in performance and power savings, Navi wouldn't look pretty against Turing.
 
I just proved my case above, there is no argument here.

Not really. You're just repeating unfounded statements over and over. That doesn't amount to proof.

If navi10 and TU-106 are the same size and navi-10 is on average +15% faster... how is it not more efficient. More performance ("ipc"), using less transistors..?

How did you calculate the transistor count? How many transistors did you allocate to tensors and RT?

Instead of random guessing, why don't you compare the 5700xt and 2070 super? They literally have the same specs.
 
I don't necessarily agree with Wolfram overall, but he's correct in terms of transistor counts. The 5700xt is 10.3 billion. The 2070 super is 13.6 billion.
 