Nvidia Ampere Discussion [2020-05-14]

3070 for 1440p gaming may well turn out dangerously close to 3080...
...

If the games are being bottlenecked by bandwidth or the raster engines, as people are suggesting, then I don't see how that could be true. More likely, I think as games drop the vertex shader pipeline and switch to mesh shaders or compute front-ends, that difference in compute power will shine more on the 3080.
 
Personally, I believe game engines will trend toward a higher compute/bandwidth ratio. The reason is that GPUs are walking the same path CPUs went down some time ago: computation is, in general, more power efficient than moving data.

Ray tracing is a good example. It can reduce the number of off-screen rendering passes per frame, because many of them (shadow maps, reflection renders, etc.) can be replaced with ray queries. This favors GPUs with a higher compute/bandwidth ratio.
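To put rough numbers on that, here's a back-of-envelope sketch in Python (the light count, map resolution, and sampling rate are illustrative assumptions, not measurements):

```python
# Rough bandwidth estimate for the off-screen passes ray tracing can replace.
# All inputs are illustrative assumptions, not measured data.
def shadow_map_bandwidth_gb_s(num_lights=4, resolution=2048, bytes_per_texel=4,
                              reads_per_texel=1, fps=60):
    """Bytes written while rendering the maps plus bytes read while sampling them."""
    texels = resolution * resolution
    bytes_per_frame = num_lights * texels * bytes_per_texel * (1 + reads_per_texel)
    return bytes_per_frame * fps / 1e9

print(f"~{shadow_map_bandwidth_gb_s():.1f} GB/s just for shadow maps")
# ~8.1 GB/s under these assumptions; reflection and cubemap passes add more.
# Ray-traced shadows/reflections instead spend ALU time traversing a BVH,
# which shifts the workload toward compute and away from raw bandwidth.
```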
 
I found this undervolt test:

Yah, I've seen that. That's an undervolt to see how low he can get power while maintaining pretty much stock FE clocks in games. I'm more interested in seeing if you can overclock with a more minor undervolt that just gets it under the power limit. It would be a balancing act of lowering power and pushing clocks as high as they can go while just staying under that limit. I definitely prefer a setup where the clock is 100% stable, which is how I have my card set up right now. It feels a lot smoother when you have an fps cap set.
 
I think he didn't consider what the UE5 demo was doing. AMD's GPUs are probably built for the same purposes, gaming and the professional market, just like they are now. Just be happy if they can match Turing/Ampere performance. We're up to the point of close to 40 TF GPUs, with 20 becoming the norm down the line. Progress is nothing to complain about, especially the DLSS and ray tracing advancements. I never thought a 3080 would outperform the 2080 by so much (between 80 and 100%), beating the 2080 Ti handily at a much lower price. And now AMD is coming. Good times.


No, AMD split their graphics architecture into two separate domain-specific architectures: RDNA and CDNA.

Real-Time Gaming (Frames/Second) architecture -vs- High-Performance Compute (Flops/Second) architecture.
 
You've mentioned this several times before. What I don't understand is why you continue to act like Nvidia hasn't done the exact same thing. V100 and A100 are very different beasts compared to Nvidia's gaming chips.

I think people are assuming that because compute power scaled up much more drastically than the rest of the GPU, the architecture must not be designed for gaming. I don't really agree: memory bandwidth is just an incredibly expensive problem to solve, and the vertex shader pipeline is being replaced with a highly parallel, compute-driven mesh shader pipeline that will leverage all of that compute performance. That leaves ROPs, and we don't know whether any of these games are ROP-limited.
 
I think people are assuming that because compute power scaled up much more drastically than the rest of the GPU, the architecture must not be designed for gaming. I don't really agree: memory bandwidth is just an incredibly expensive problem to solve, and the vertex shader pipeline is being replaced with a highly parallel, compute-driven mesh shader pipeline that will leverage all of that compute performance. That leaves ROPs, and we don't know whether any of these games are ROP-limited.

The arch is obviously not designed for gaming; the compute is way out of proportion to the other bottlenecks, and just tossing out "mesh shaders will solve it!" isn't a compelling argument. It's going to be just as bandwidth-strapped running everything else. Even if meshes have a lower data rate than vertex buffers, it's still just as bottlenecked; there's still a vast amount of hypothetical compute performance lying around doing nothing for most of a frame.

So far Ampere is, transistor for transistor, less efficient than Turing for gaming and will probably remain that way. And given the vast power usage versus the advance in silicon nodes, it's almost certainly less power efficient as well. And of course it's not like AMD's CDNA/RDNA split: those appear to be two distinct architectures, while the Ampere gaming models mostly seem to differ from A100 in things like Tensor and CUDA core counts more than anything else.

I severely doubt Nvidia intended their highest-end chip to perform, on average, only 10% or so better than the mass-market bin of the exact same die; that's probably why it's called the 3090 instead of Titan, as the PR guys want to preserve the prestige of the latter name.
 
@Frenetic Pony The mesh shader pipeline, or compute pipelines like Unreal's Nanite, will leverage that compute where the current vertex shader pipeline will not. As for bandwidth, I don't know what the options are: a 512-bit bus or HBM, I suppose, but those are very costly choices.
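For context on those options, the arithmetic is simple (the per-pin data rates below are the announced GDDR6X speeds; the 512-bit configuration is hypothetical):

```python
# Memory bandwidth = bus width (bits) / 8 * per-pin data rate (Gbps) -> GB/s.
# Per-pin rates are the announced GDDR6X speeds; the 512-bit part is hypothetical.
def bandwidth_gb_s(bus_width_bits, gbps_per_pin):
    return bus_width_bits / 8 * gbps_per_pin

print(bandwidth_gb_s(320, 19.0))   # 3080:  760 GB/s
print(bandwidth_gb_s(384, 19.5))   # 3090:  936 GB/s
print(bandwidth_gb_s(512, 19.0))   # hypothetical 512-bit card: 1216 GB/s
```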

I think they pushed the stock config much harder than usual, which left less headroom and high power consumption. They could have released it at sub-300 W with nearly the same performance.
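A crude sketch of why that's plausible, using the usual dynamic-power approximation P ∝ f·V² (the clock and voltage offsets below are illustrative, not a measured 3080 V/f curve):

```python
# Dynamic power scales roughly with frequency * voltage^2.
# The offsets below are illustrative, not a measured V/f curve.
def relative_power(freq_scale, volt_scale):
    return freq_scale * volt_scale ** 2

scale = relative_power(0.96, 0.90)      # ~4% lower clocks, ~10% lower voltage
print(f"~{scale:.2f}x power")           # ~0.78x -> 320 W * 0.78 ~= 250 W
# Performance usually drops by less than the clock reduction, hence
# "sub-300 W with nearly the same performance" is plausible.
```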
 
The arch is obviously not designed for gaming

Nonsense. It's nearly identical on the compute side, and what's changed is mostly about being able to run RTX and DLSS concurrently.

the compute is way out of proportion to the other bottlenecks

I don't think so. It's just people being knee-jerk fixated on single-dimensional GFLOPS as if it were the absolute performance metric. It never has been.
There was a very clear opportunity to maximize scheduling rate on the SM while increasing FP32, and they did it the only way possible: by adding a second FP32 SIMD (one that didn't even get its own data path**) alongside the many other units that were already there. It was never supposed to come with a doubling of performance; there was simply no other way to increase FP32 throughput except by doubling the unit. The same has happened with TMUs and ROPs in the past: every few generations they look like overkill, but they're only a small percentage of the actual transistor budget, and the same is true here. It's enough if the performance increase is greater than the area increase, and so far it has come with a more than 30% performance uplift over a card (the 2080 Ti) with the same number of SMs (68), for a very minor increase in area.
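To put numbers on that last point, a quick sketch using the reference boost clocks (sustained clocks in games will differ):

```python
# Theoretical FP32 rate = FP32 lanes * 2 flops per FMA * boost clock (GHz) -> TFLOPS.
# Reference boost clocks; sustained clocks in games differ.
def tflops(fp32_lanes, boost_ghz):
    return fp32_lanes * 2 * boost_ghz / 1000

rtx_3080   = tflops(68 * 128, 1.710)   # ~29.8 TFLOPS: 68 SMs with the doubled FP32 SIMD
rtx_2080ti = tflops(68 * 64,  1.545)   # ~13.4 TFLOPS: 68 SMs, single FP32 SIMD
print(rtx_3080 / rtx_2080ti)           # ~2.2x the paper FLOPS...
# ...for a ~1.3x gaming uplift, because the second SIMD shares the scheduler,
# register files and data paths with everything else in the SM.
```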

EDIT:

there's still a vast amount of hypothetical compute performance lying around doing nothing for most of a frame.

Yeah, and a lot of texturing performance lying around doing nothing, and a lot of ROP performance lying around doing nothing, and in Turing also a lot of INT32 compute performance lying around doing nothing, and a long list of other hypothetical performance lying around doing nothing. What exactly makes FP32 so special that it requires special consideration?

So far Ampere is, transistor for transistor, less efficient than Turing for gaming

How so? Even accounting for the much-improved RT and Tensor cores plus the scheduling changes to make those run concurrently, GA104 is 17.4 billion transistors versus 18.6 billion for TU102, and it will most definitely beat it. As for the 3080, it has roughly 20% of its chip disabled, so it would be equivalent to a 28.3 * 0.8 ≈ 22.6 billion transistor chip, and that's only about 20% more transistors for a 30%+ performance uplift.
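Spelled out with the public transistor counts (scaling a full die's count by its enabled fraction is obviously a simplification):

```python
# Perf-per-transistor back-of-envelope. Public transistor counts; scaling a full
# die by its enabled SM fraction is a simplification.
tu102_2080ti = 18.6e9               # TU102 (itself partially disabled on the 2080 Ti)
ga102_full   = 28.3e9
ga102_3080   = ga102_full * 0.8     # ~22.6e9 "effective" transistors (68 of 84 SMs)
print(f"{ga102_3080 / tu102_2080ti - 1:.0%} more transistors")   # ~22%
# ...for a 30%+ gaming uplift over the 2080 Ti.
```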

and will probably remain that way.

It isn't that way now, and its advantage will do nothing but grow as games that better exploit its strengths start popping up.

** Now that would have been an indication supporting your claim if it had had its own datapath.
 
I severely doubt Nvidia intended their highest-end chip to perform, on average, only 10% or so better than the mass-market bin of the exact same die; that's probably why it's called the 3090 instead of Titan, as the PR guys want to preserve the prestige of the latter name.

Titans have never been significantly faster than the Tis below them, so that's not it.
 
Regarding the whitepaper's Titan RTX vs. 3090 comparison for AI compute: in its table, the Titan RTX is listed at 65.2 TFLOPS for FP16 Tensor with FP32 accumulate, where it actually delivers 130 TFLOPS.
Is this a mistake? And similarly, does the 3090 then deliver 142 TFLOPS rather than the listed 71 TFLOPS?
 
Previous generations had fp16 accumulate at full throughput and fp32 accumulate halved. I'd guess it's the same situation here.
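A quick sanity check of that guess against the figures quoted above (tensor core counts and boost clocks are the public specs; the halving itself is the assumption being tested):

```python
# FP16 tensor rate = tensor cores * FMAs per core per clock * 2 flops * boost clock (GHz).
# Core counts and clocks are public specs; the FP32-accumulate halving is the guess.
def tensor_tflops(tensor_cores, fma_per_clock, boost_ghz):
    return tensor_cores * fma_per_clock * 2 * boost_ghz / 1000

titan_rtx = tensor_tflops(576, 64, 1.770)   # ~130.5 TFLOPS with FP16 accumulate
print(titan_rtx, titan_rtx / 2)             # halved -> ~65.2, the whitepaper's figure
rtx_3090  = tensor_tflops(328, 128, 1.695)  # ~142.3 TFLOPS with FP16 accumulate
print(rtx_3090, rtx_3090 / 2)               # halved -> ~71.2, the listed figure
```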
 
If the games are being bottlenecked by bandwidth or the raster engines, as people are suggesting, then I don't see how that could be true. More likely, I think as games drop the vertex shader pipeline and switch to mesh shaders or compute front-ends, that difference in compute power will shine more on the 3080.
Look through the reviews, single out the titles where the Radeons shine, and cross-check with the relative improvement of Ampere vs. Turing.
 
You've mentioned this several times before. What I don't understand is why you continue to act like Nvidia hasn't done the exact same thing. V100 and A100 are very different beasts compared to Nvidia's gaming chips.

Are you suggesting they don't use the same architecture...? (V100/A100)
 
Previous generations had fp16 accumulate at full throughput and fp32 accumulate halved. I'd guess it's the same situation here.

The Turing TU102 Quadro and Titan have full-rate FP16 with FP32 accumulate enabled, i.e. 130 TFLOPS (unlike the 65.2 TFLOPS the Ampere whitepaper mistakenly quotes for the Titan).

https://blog.slavv.com/titan-rtx-quality-time-with-the-top-turing-gpu-fe110232a28e
"Full-rate mixed-precision training (FP16 with FP32 accumulation) — A few paragraphs ago, mixed precision training was explained. When the model utilizes Tensor cores, it performs matrix multiply-accumulate operation really quick. The second step in this operation (accumulate) must be done at FP32 to preserve accuracy and is then converted to FP16. The accumulate operation performs at half speed on RTX 2080 and RTX 2080 Ti, but on full-rate on the Titan RTX. In practice, this makes the Titan RTX perform 10% to 20% faster where Tensor Cores are utilized."
 
I think people are assuming that because compute power scaled up much more drastically than the rest of the GPU, the architecture must not be designed for gaming. I don't really agree: memory bandwidth is just an incredibly expensive problem to solve, and the vertex shader pipeline is being replaced with a highly parallel, compute-driven mesh shader pipeline that will leverage all of that compute performance. That leaves ROPs, and we don't know whether any of these games are ROP-limited.
I'm honestly getting déjà vu from the Fury X days with these arguments.
 