NVidia Ada Speculation, Rumours and Discussion

Besides the transistors spent on DX12 Ultimate features, doubling FP32 throughput within an SM does more for gaming than improving clocks and adding an L3 cache. The performance improvement from Turing to Ampere is in the same ballpark as Navi 2's, while the transistor budget is smaller.
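For a rough sense of what the per-SM doubling looks like on paper (SM counts and lane layout from the public whitepapers; the clocks below are approximate rated boost clocks, not measured values):

```python
# Per-SM FP32 lanes: Turing has 64 FP32 + 64 INT32; Ampere keeps 64 dedicated FP32
# and turns the INT32 path into a shared FP32/INT32 path, so peak FP32 per clock doubles.
# Note the Ampere peak assumes the shared datapath issues FP32 every cycle (no INT32 work).
TURING_FP32_PER_SM = 64
AMPERE_FP32_PER_SM = 64 + 64  # second datapath is FP32 *or* INT32 on a given cycle

def peak_tflops(sms: int, fp32_per_sm: int, clock_ghz: float) -> float:
    """Peak FP32 throughput in TFLOPS: lanes * 2 ops per FMA * clock (GHz)."""
    return sms * fp32_per_sm * 2 * clock_ghz / 1000

print(f"2080 Ti (TU102, 68 SMs): ~{peak_tflops(68, TURING_FP32_PER_SM, 1.55):.1f} TFLOPS")
print(f"3070    (GA104, 46 SMs): ~{peak_tflops(46, AMPERE_FP32_PER_SM, 1.73):.1f} TFLOPS")
```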
Obviously, but it's worth remembering that dual FP32 also gives Nvidia total dominance in professional rendering, the scientific community and the general creator market, which is a good profit center and an important vector of influence.
Historically, Nvidia has always been strong in these markets, but I've never seen such a big gap with AMD in the past, where a mid-range die (GA104) beats the top Navi 21 die.
 
We don't know how a theoretical GA104 with fewer flops but more cache would perform. Given the low hardware utilization when gaming, there's a very good chance that config would be better for games. That depends, of course, on whether the extra flops were expensive in terms of transistor budget.

There is the 2080 Ti with 6 MB of L2 cache (50% more), 37% more bandwidth and 48% more compute units. Gaming performance is equal between the 3070 and the 2080 Ti, so going wider doesn't help with gaming performance.
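As a quick sanity check of those ratios, using the published spec-sheet numbers (68 vs 46 SMs, 616 vs 448 GB/s, 6 vs 4 MB L2):

```python
# Quick sanity check of the quoted ratios, using published spec-sheet numbers.
specs = {
    "2080 Ti": {"SMs": 68, "bandwidth (GB/s)": 616, "L2 (MB)": 6},
    "3070":    {"SMs": 46, "bandwidth (GB/s)": 448, "L2 (MB)": 4},
}

for metric in specs["2080 Ti"]:
    ratio = specs["2080 Ti"][metric] / specs["3070"][metric]
    print(f"2080 Ti has {ratio - 1:.1%} more {metric} than the 3070")
# -> 47.8% more SMs, 37.5% more bandwidth, 50.0% more L2
```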
 

The 2080 Ti has more SMs than the 3070, but the 3070's SMs are twice as wide, so computation-wise (in FP32) the 3070 is much wider than the 2080 Ti.
 
Doubling the FP32 units was probably just the lowest-barrier-to-entry way to get some more performance, since they had no way to improve clocks. It's certainly not ideal for a gaming architecture; both Nvidia and AMD axed this approach many generations ago. Nvidia's R&D seems to go primarily to other markets; gaming is no longer their focus.
 
And yet Ampere is superior to a pure gaming-only architecture with the same number of transistors...

The 2080 Ti has more SMs than the 3070, but the 3070's SMs are twice as wide, so computation-wise (in FP32) the 3070 is much wider than the 2080 Ti.

Only for FP32; the scheduled instructions per SM are identical. The Ampere SM has 50% more L1 cache per SM, but with twice the FP32 throughput it still has only 4 MB of L2 cache, or 66% of the 2080 Ti's.
 
Doubling the FP32 units was probably just the lowest-barrier-to-entry way to get some more performance, since they had no way to improve clocks. It's certainly not ideal for a gaming architecture; both Nvidia and AMD axed this approach many generations ago. Nvidia's R&D seems to go primarily to other markets; gaming is no longer their focus.

Seems gaming is their primary focus, since they outperform their competitors in every area of that market segment. AMD has wide GPUs too; it's the PS5 GPU that's the exception, but that isn't AMD's fault.
 
Only for FP32; the scheduled instructions per SM are identical. The Ampere SM has 50% more L1 cache per SM, but with twice the FP32 throughput it still has only 4 MB of L2 cache, or 66% of the 2080 Ti's.

So it looks like the 2080 Ti has more bandwidth, cache, etc., but the 3070 with more compute power can keep up with it in games. That seems to support the idea that more compute power (in FP32) helps with gaming performance (or rather that the 2080 Ti has too much bandwidth for its relatively limited compute power).
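One way to see the "too much bandwidth for its compute" angle is bytes of DRAM bandwidth per peak FP32 FLOP. These are rough spec-sheet numbers, and keep in mind Ampere's peak flops are harder to sustain because the second datapath is shared with INT32:

```python
# Bytes of DRAM bandwidth per peak FP32 FLOP (lower = more compute per byte of bandwidth).
# Peak TFLOPS are spec-sheet figures; Ampere's peak assumes the shared FP32/INT32
# datapath does FP32 every cycle, which games rarely achieve.
cards = {
    "2080 Ti": {"bw_gb_s": 616, "peak_tflops": 13.4},
    "3070":    {"bw_gb_s": 448, "peak_tflops": 20.3},
}

for name, c in cards.items():
    bytes_per_flop = c["bw_gb_s"] / (c["peak_tflops"] * 1000)
    print(f"{name}: ~{bytes_per_flop:.3f} bytes of bandwidth per peak FLOP")
# 2080 Ti: ~0.046, 3070: ~0.022
```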
 
And yet Ampere is superior to a pure gaming-only architecture with the same number of transistors...



Superior because of RT and DLSS. In rasterization it's about on par while using 20-40% more power, depending on which brand of GPU you have. Power draw for most aftermarket 6800 XT cards is typically around 240-290 W depending on the game; a 3080 is at 330-370 W in those same games.
 
So it looks like the 2080 Ti has more bandwidth, cache, etc., but the 3070 with more compute power can keep up with it in games. That seems to support the idea that more compute power (in FP32) helps with gaming performance (or rather that the 2080 Ti has too much bandwidth for its relatively limited compute power).

The 2080 Ti has more geometry units, too. The 2080 Ti is much wider, but it doesn't perform better.

Superior because of RT and DLSS. In rasterization it's about on par while using 20-40% more power, depending on which brand of GPU you have. Power draw for most aftermarket 6800 XT cards is typically around 240-290 W depending on the game; a 3080 is at 330-370 W in those same games.

I don't know of any 3080 without GDDR6X. But there is the Quadro A6000 card with 16 Gbit/s GDDR6, drawing 50 W less than a 3090: https://www.igorslab.de/was-kann-de...denz-pur-und-sieg-gegen-die-geforce-rtx-3090/
Looks like with GDDR6, Ampere is superior to RDNA2.
 
Aftermarket 3090s are at 370-420 W. Cutting 50 W off that still leaves it quite a bit higher than a 6900 XT, which draws only a few watts more than a 6800 XT.
 
That can hardly be judged. AMD dedicated more transistors to saving memory bandwidth, which lets other parts be simpler.
A useful feature of a huge cache is that it has massive, fine-grained redundancy, which means effective yields aren't materially worsened despite the die increasing in size.

The cache is always at full capacity, and defects only affect CUs and ROPs. The RX 6800 has the full 128 MB of Infinity Cache.

So not only does the cache save off-die bandwidth (and power), but it also helps with yields.
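A minimal sketch of why repairable cache barely hurts yield, using a simple Poisson defect model; the defect density and area split below are made-up illustrative numbers, not Navi 21 data:

```python
import math

# Poisson yield model: P(zero killer defects in area A) = exp(-D0 * A).
# SRAM with row/column redundancy can usually map out a defective block,
# so only the non-repairable (logic) area counts toward the yield-killing area.
D0 = 0.1              # defects per cm^2 (illustrative)
die_cm2 = 5.2         # a ~520 mm^2 class die (illustrative)
cache_fraction = 0.2  # share of the die that is repairable SRAM (illustrative)

yield_no_repair = math.exp(-D0 * die_cm2)
yield_with_repair = math.exp(-D0 * die_cm2 * (1 - cache_fraction))

print(f"no cache repair:   {yield_no_repair:.1%}")
print(f"with cache repair: {yield_with_repair:.1%}")
```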
 
There is the 2080 Ti with 6 MB of L2 cache (50% more), 37% more bandwidth and 48% more compute units. Gaming performance is equal between the 3070 and the 2080 Ti, so going wider doesn't help with gaming performance.

The 2080 Ti is a good comparison point. The 3070 does seem to be doing more with less. On paper its only advantages are ROPs and flops. The 3070 has a lot fewer RT cores and a lot less INT32 and still manages to match the 2080 Ti with RT on.
 
Even an undervolted / TDP-constrained Fiji was able to beat a factory-OCed GTX 980 while having lower power consumption. That's quite common behavior for big, power-hungry GPUs with wide buses; the same applies to GT200, GF100, etc. The problem is that you need to manipulate one product (the more expensive one) and compare it to another product (the cheaper one) at factory settings to get these seemingly good results. Apples to oranges.
 
Can someone explain what's going on in this graph?:

[Chart: Metro Exodus EE frame time variance, 3840x2160, DX12 Ultra, RT High, Hybrid Reflections]


"Sometimes it’s really good to evaluate everything, because the details of the variances reveal more subtleties than you might think. Just this metric is really a must and I can only advise every tester to make this effort as well."

Yes, I cherry-picked that graph, but it makes my point.

Overall it's a really nice article, super-impressive work.

There are some data points, affecting both AMD and NVidia, where theoretically faster cards show worse results than slower cards. Those seem to me to be outliers that should be rejected, but he says these outliers are reproducible, so they shouldn't be rejected.

It's really tricky stuff doing consistent testing, and we already know the dangers of using canned benchmarks for performance evaluation. But Igor's definitely on another level here.

I just want to point out that while the Suprim X cooler on the 3090 Ti is a 4-slot cooler versus 3-slot coolers on AMD, the difference in cooling capability at "390 W" as shown on these pages:

MSI GeForce RTX 3090 Ti Suprim X Review - Cooler Performance Comparison | TechPowerUp
MSI Radeon RX 6900 XT Gaming X Trio Review - Cooler Performance Comparison | TechPowerUp

isn't massive, only about 3 degrees Celsius. In theory a more efficient cooler means lower power consumption, since GPUs draw more power as temperature rises. But I don't think it makes the test too unbalanced.
 
The 2080 Ti is a good comparison point. The 3070 does seem to be doing more with less. On paper its only advantages are ROPs and flops. The 3070 has a lot fewer RT cores and a lot less INT32 and still manages to match the 2080 Ti with RT on.
RT cores, async compute and the tensor stuff are all dramatically better in Ampere than in Turing, so "RT on" is not a good comparison point.
 
Can someone explain what's going on in this graph?:
Sure. That graph is frame time variance, not frame time. In situations where the frame rate is hard capped by some shared limiter (bandwidth, CPU, triangle rate, whatever) and that limiter is responsible for the minimum frame rate, cards will share a similar number there. Now, if GPU A is generally faster than GPU B, and GPU A has a higher maximum frame rate than GPU B but a similar minimum frame rate, then the variance of GPU A will be higher than GPU B's precisely because it is offering a better overall experience.

Of course, this doesn't have to be the case; it is just the most likely scenario, and without the actual data one can't be certain. If one really wants the lowest variance, it can be achieved on any card by capping the frame rate at whatever the minimum frame rate is for it in that application. Then one can enjoy a zero-variance experience ;).
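To make that concrete, here is a toy calculation with synthetic frame times (not Igor's data) showing how a faster card with the same worst-case frame time can report a higher variance, and how a frame cap flattens it:

```python
import statistics

def frame_time_stats(frame_times_ms):
    return {
        "avg_fps": 1000 / statistics.mean(frame_times_ms),
        "variance": statistics.pvariance(frame_times_ms),  # ms^2
    }

# Both GPUs hit the same ~25 ms worst case (some shared limit), but GPU A
# renders the easy frames much faster, which inflates its frame time variance.
gpu_a = [10, 11, 10, 25, 10, 12, 25, 10]   # faster card, spiky frame times
gpu_b = [20, 21, 20, 25, 20, 22, 25, 20]   # slower card, flatter frame times

# Capping GPU A at its own worst case (40 fps -> 25 ms) gives zero variance.
gpu_a_capped = [max(t, 25) for t in gpu_a]

for name, ft in [("GPU A", gpu_a), ("GPU B", gpu_b), ("GPU A capped", gpu_a_capped)]:
    s = frame_time_stats(ft)
    print(f"{name}: {s['avg_fps']:.0f} fps avg, variance {s['variance']:.1f} ms^2")
```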

Igor does graph actual frame times here: https://www.igorslab.de/msi-geforce...nur-ein-testlauf-fuer-die-geforce-rtx-4090/8/

Bizarrely, the frame times for that particular game are not included in the results. It is also quite strange that Borderlands performs better at 2560x1440 than at 1920x1080; perhaps there is some issue with that particular game or the testing setup.
 