Nvidia Ampere Discussion [2020-05-14]

I still don't get why people assume that RDNA2 is more suited for games. GA104, at 392mm², has 6 rasterizers, 48 geometry units and 96 ROPs, with 20 TFLOPS (or 10 TFLOPS plus 10 TOPS, counted the Turing way). Isn't this gaming-focused enough?!
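For reference, a rough back-of-the-envelope sketch of where those figures come from, assuming the RTX 3070 cut of GA104 (46 of the 48 SMs, ~1.73 GHz boost); the inputs are just public spec numbers, not measurements:

Code:
# Peak math throughput sketch for GA104 in its RTX 3070 configuration.
# Assumed public specs: 46 SMs, 128 FP32 lanes per SM on Ampere
# (64 dedicated FP32 + 64 shared FP32/INT32), ~1.73 GHz boost clock.
sms = 46
fp32_lanes_per_sm = 128
boost_hz = 1.73e9

peak_fp32 = sms * fp32_lanes_per_sm * 2 * boost_hz   # 2 FLOPs per FMA
print(f"Peak FP32: {peak_fp32 / 1e12:.1f} TFLOPS")   # ~20 TFLOPS

# Counted the Turing way (one 64-wide FP32 path plus one 64-wide INT32 path):
half = peak_fp32 / 2
print(f"Turing-style: {half / 1e12:.1f} TFLOPS FP32 + {half / 1e12:.1f} TOPS INT32")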
 
I'm honestly getting déjà vu from the Fury X days with these arguments.
GCN has inherent utilization issues which prevented reaching peak throughput without using new programming paradigms.
Ampere doesn't have any such issues - and the new paradigms are limited to new h/w in it, like RT and TCs.
The scaling being less than the peak figures here is mostly due to a) FP32 isn't really doubled once you consider the INTs, which sit on top of peak FP32 in Turing but are now part of peak FP32 in Ampere - and this is a sizeable chunk of in-game math, around 25 to 30% of all throughput.
And b) current-gen games are targeting current-gen GPUs' compute/bandwidth balances. With Ampere's increase in compute outpacing its increase in bandwidth, it will need more math to fully saturate its FP capabilities than what modern-day games are pushing.
So while there is some similarity to past GCN launches, the underlying h/w of Ampere is completely different, and extracting this performance will be significantly easier going forward. You won't need Mantle for that.
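A toy model of point a), assuming (per NVIDIA's own Turing-era figure) roughly 36 INT32 instructions per 100 FP32, i.e. about a quarter of issued math; the exact share obviously varies per game:

Code:
# Toy model: how an INT32 share eats into Ampere's "doubled" FP32.
int_share = 0.26   # assumption: ~36 INT32 per 100 FP32 instructions => ~26% of issued math

turing_fp32_per_clk = 64                      # per SM: dedicated FP32 path, INT32 runs alongside it
ampere_fp32_per_clk = 128 * (1 - int_share)   # per SM: the second datapath is shared FP32/INT32

speedup = ampere_fp32_per_clk / turing_fp32_per_clk
print(f"Effective FP32 gain per SM per clock: {speedup:.2f}x")   # ~1.48x rather than 2x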
 
Regarding the comparison in the whitepaper between the Titan RTX and the 3090 for AI compute:
In this table the Titan RTX is listed at 65.2 TFLOPS for FP16 Tensor with FP32 accumulate, where it actually has 130 TFLOPS.
The question is: is this a mistake? And similarly, does the 3090 then have 142 TFLOPS or the listed 71 TFLOPS?
With Turing, they had reserved unconstrained Tensor throughput for the Quadro cards; not 100% sure about the Titan though:
https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf
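For what it's worth, a quick sketch of where those table entries would come from arithmetically, using public SM counts, tensor core counts and boost clocks as assumptions; whether the half-rate FP32-accumulate restriction really applies to the Titan RTX is exactly the open question:

Code:
def dense_fp16_tensor_tflops(sms, tc_per_sm, fma_per_tc_per_clk, boost_ghz):
    # 2 FLOPs per FMA, dense (no sparsity)
    return sms * tc_per_sm * fma_per_tc_per_clk * 2 * boost_ghz / 1000.0

titan_rtx = dense_fp16_tensor_tflops(72, 8, 64, 1.77)     # ~130 TFLOPS with FP16 accumulate
rtx_3090  = dense_fp16_tensor_tflops(82, 4, 128, 1.695)   # ~142 TFLOPS with FP16 accumulate

# The 65.2 / 71 entries are simply these figures halved, which is what a
# half-rate FP32-accumulate mode would give.
print(f"Titan RTX: {titan_rtx:.1f} full rate, {titan_rtx / 2:.1f} if FP32-acc runs at half rate")
print(f"RTX 3090:  {rtx_3090:.1f} full rate, {rtx_3090 / 2:.1f} if FP32-acc runs at half rate")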
 
GCN has inherent utilization issues which prevented reaching peak throughput without using new programming paradigms.
Ampere doesn't have any such issues [...]
I'm curious, can you summarise precisely why GCN has utilisation issues and why Ampere doesn't?

When you're explaining this, you must disentangle compute from rasterisation/bandwidth/TEX/cache-hierarchy/ROPs/work-distribution/load-balancing.

I'm asking because you state that Ampere "doesn't have any such issues". So I'm curious to see your precise explanation for this.

Careful: DX12 and Vulkan both introduced "new programming paradigms" which Ampere benefits from (thanks, AMD) so your explanation needs to be based upon DX11 or earlier APIs.
 
The scaling being less than the peak figures here is mostly due to a) FP32 isn't really doubled once you consider the INTs, which sit on top of peak FP32 in Turing but are now part of peak FP32 in Ampere - and this is a sizeable chunk of in-game math, around 25 to 30% of all throughput.

Yes we probably need a better way to measure throughput rather than just focusing on FLOPS.

FLOPS+IOPS (CUDA OPS?) is better.

In that comparison the 2080 Ti would be 27 CUDA OPS and the 3080 would be 30 CUDA OPS. So they're much closer when measured that way, but Ampere's split of IOPs vs FLOPs is more efficient than Turing's, hence the greater-than-linear increase in performance.
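A quick sketch of that combined-ops comparison, taking public SM counts and boost clocks as the assumptions:

Code:
def combined_tops(sms, lanes_per_sm, boost_ghz):
    return sms * lanes_per_sm * 2 * boost_ghz / 1000.0   # 2 ops per FMA, result in T-ops

# RTX 2080 Ti: 68 SMs, 64 FP32 lanes + 64 INT32 lanes per SM, ~1.545 GHz boost
fp32_2080ti = combined_tops(68, 64, 1.545)     # ~13.4 TFLOPS
total_2080ti = fp32_2080ti * 2                 # plus ~13.4 TOPS INT32 alongside -> ~27 combined

# RTX 3080: 68 SMs, 128 FP32 lanes per SM (half of them shared with INT32), ~1.71 GHz boost
total_3080 = combined_tops(68, 128, 1.71)      # ~30 combined; INT work comes out of this pool

print(f"2080 Ti: ~{total_2080ti:.0f} combined  |  3080: ~{total_3080:.0f} combined")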
 
I'm curious, can you summarise precisely why GCN has utilisation issues and why Ampere doesn't?
Weak graphics frontend, low single-thread performance, issues with state changes creating pipeline bubbles which can only be solved by running either pure compute (hello, compute-based culling and similar tricks of this console gen) or async compute (which fills bubbles that aren't there in NV h/w in the first place, which is why NV doesn't benefit from it as much). None of this exists in Ampere.

When you're explaining this, you must disentangle compute from rasterisation/bandwidth/TEX/cache-hierarchy/ROPs/work-distribution/load-balancing.
Why is that? Do all those who say that Ampere is scaling badly in games "disentangle" all these from compute while they say this?

Careful: DX12 and Vulkan both introduced "new programming paradigms" which Ampere benefits from (thanks, AMD) so your explanation needs to be based upon DX11 or earlier APIs.
Ampere doesn't benefit from DX12 and Vulkan as much as you think (?); it just needs them to access its new h/w - but that is an API-level decision, not a h/w limitation. You can hit Ampere's peak math utilization in DX11 and OpenGL just fine; you just can't access RT and TCs from them.
 
Weak graphics frontend, low single-thread performance, issues with state changes creating pipeline bubbles which can only be solved by running either pure compute (hello, compute-based culling and similar tricks of this console gen) or async compute (which fills bubbles that aren't there in NV h/w in the first place, which is why NV doesn't benefit from it as much). None of this exists in Ampere.


Why is that? Do all those who say that Ampere is scaling badly in games "disentangle" all these from compute while they say this?


Ampere doesn't benefit from DX12 and Vulkan as much as you think (?); it just needs them to access its new h/w - but that is an API-level decision, not a h/w limitation. You can hit Ampere's peak math utilization in DX11 and OpenGL just fine; you just can't access RT and TCs from them.
All of these sound like very generic statements. I doubt anyone but NVIDIA actually knows how easy or hard it will be to utilize those resources until devs get to grips with Ampere, and anything we say right now is wishful thinking at best. The only thing that can be said for certain is that it's a lot of ALUs to feed, and we've seen this before. How it will come out in practice is another story entirely.
 
Careful: DX12 and Vulkan both introduced "new programming paradigms" which Ampere benefits from (thanks, AMD) so your explanation needs to be based upon DX11 or earlier APIs.
AMD has not reinvented the wheel here.
libGCM was used way before Mantle came up. Bindless extensions were introduced way before Mantle came up.
Execute-indirect extensions were introduced way before Mantle came up.
And people now use more and more GPU-driven pipelines, which became possible with bindless and multi-draw-indirect extensions, not Mantle.
AMD is not a charity. The reasons behind pushing low-level console-derived APIs to PC were quite obvious -- to monetize its console wins by getting console optimizations backported to PC.
It helped AMD a lot by making GCN longer-lived than it would have been otherwise.
So thanks to AMD for introducing intrinsic-level optimizations on PC, and boo to AMD for stalling graphics progress for a decade by switching developers' attention to backporting low-level optimizations rather than fixing the broken stuff in hardware, which it did in the end with RDNA.
 
All of these sound like very generic statements. I doubt anyone but NVIDIA actually knows how easy or hard it will be to utilize those resources until devs get to grips with Ampere, and anything we say right now is wishful thinking at best. The only thing that can be said for certain is that it's a lot of ALUs to feed, and we've seen this before. How it will come out in practice is another story entirely.

Games are not designed around 30 TFLOPS, so I don't think we will see a huge difference for the next few years. But compute benchmarks like Luxmark let Ampere shine:
[Image: RTX-3080-Founders-Edition-Luxmark.png (Luxmark benchmark chart)]

https://www.xanxogaming.com/reviews...her-performance/#Productivity_(Vray_y_Luxmark

So, relative to its compute throughput, a 3080 will look worse in most games than a 3070 does.
 
- AMD launches a graphics card whose performance doesn't scale with TFLOPs throughput as expected.
Armchair experts: GCN has utilization issues.

- nVidia launches a graphics card whose performance doesn't scale with TFLOPs throughput as expected:
Armchair experts: It's the game engines that aren't prepared for this innovative architecture and developers need to optimize for it.



In the meanwhile, actual game developers who have been optimizing for GCN for the better part of the last decade:



¯\_(ツ)_/¯
 
All of these sound like very generic statements. I doubt anyone but NVIDIA actually knows how easy or hard it will be to utilize those resources until devs get to grips with Ampere, and anything we say right now is wishful thinking at best. The only thing that can be said for certain is that it's a lot of ALUs to feed, and we've seen this before. How it will come out in practice is another story entirely.
We have the details of the Ampere SM architecture already. It isn't fundamentally different from Turing, and most paths and stores have been beefed up to accommodate the increased throughput. Can you point to anything in Ampere which looks like it may create FLOPS utilization issues besides what I've described already (FP/INT and math/bandwidth balances in current-gen s/w)?

Because right now it's you who's making very generic statements. GCN's issues have been well known for years.

Armchair experts: It's the game engines that aren't prepared for this innovative architecture and developers need to optimize for it.
That's the opposite of what I'm saying, in case you didn't read it. Ampere doesn't require any specific optimizations; it just needs more math than games are pushing right now. What benchmarks we have show this already. Or do you consider Borderlands 3 to be optimized for Ampere?

In the meanwhile, actual game developers who have been optimizing for GCN for the better part of the last decade:
Who have no choice but to extract performance this way since they have to ship games on the thing? What would they say exactly?
 
Ampere doesn't require any specific optimizations; it just needs more math than games are pushing right now.
The act of addressing deficiencies (underutilisation included) is called optimization, whether that means, say, adding more optional compute-heavy effects or rethinking your VRAM/resource usage to better enable ILP. You are arguing with yourself here.

He is basically arguing that some people have double standards. Putting that debate aside, you can argue that what Ampere does is an easier starting point for optimization, and existing code can coincidentally benefit from the doubled throughput. But arguing that it does not need "specific" optimization is... a bit slippery IMO.
 
The act of addressing deficiencies (underutilisation included) is called optimization, whether that means, say, adding more optional compute-heavy effects or rethinking your VRAM/resource usage to better enable ILP. You are arguing with yourself here.

He is basically arguing that some people have double standards. Putting that debate aside, you can argue that what Ampere does is an easier starting point for optimization, and existing code can coincidentally benefit from the doubled throughput. But arguing that it does not need "specific" optimization is... a bit slippery IMO.
Making a game with more complex shading isn't really an optimization for any particular h/w. GCN required very specific optimizations to reach its peak processing power.

This is irrelevant anyway, however, since AMD essentially admitted these GCN flaws when it made RDNA, where most of them are "fixed". And GCN certainly won't compete with Ampere, at least in graphics.
 
If Ampere is 'underutilized' and already hitting a power wall, what exactly would you gain by increasing utilization?

In theory you could optimise to extract more FPS from the same throughput, which is what game-release drivers already do, for example. While hardware capability is a known quantity, software isn't, especially now that RT is really going to start to be used.
 
In theory you could optimise to extract more FPS from the same throughput. While hardware capability is a known quantity, software isn't, especially now that RT is really going to start to be used.

With no power increase? I don't see that happening. The same thing happens with consoles: better utilization over a console's lifespan inevitably leads to higher power consumption. That's one of the stated reasons they beefed up the cooling this time.
 
With no power increase? I don't see that happening. The same thing happens with consoles: better utilization over a console's lifespan inevitably leads to higher power consumption. That's one of the stated reasons they beefed up the cooling this time.

Depends. If your code is not very efficient, it might be wasting clock cycles. You can improve it to waste fewer clock cycles, speeding up performance without increasing power usage. I'm not a game developer, but I develop mobile applications and that happens sometimes, especially on cross-platform projects where something might be very quick on iOS but slower on Android (typical Xamarin...), and you do things a little differently just for that one platform to improve performance. That doesn't mean the app uses more power; in fact it may use less, because the code was changed to cache things, for example, instead of recreating them over and over.

Edit - Maybe we are mixing different concepts of "underutilized" here: a quantitative one, like GPU usage being at X%, and a more qualitative one. Case in point: FP32 vs FP16. Where possible, FP32 may be replaced by FP16, which in theory brings a speed-up without an increase in power consumption. If the developer isn't doing this where they could, then it's a qualitative case of the hardware being underutilized.
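As a small illustration of the qualitative side (this only shows the memory/bandwidth half of the FP32 vs FP16 trade, not ALU rates, and it's a generic sketch rather than anything from a particular engine):

Code:
import numpy as np

# The same million values stored as FP32 vs FP16: half the footprint and memory
# traffic, at the cost of precision the developer has to be able to tolerate.
values_fp32 = np.linspace(0.0, 1.0, 1_000_000, dtype=np.float32)
values_fp16 = values_fp32.astype(np.float16)

print(values_fp32.nbytes // 1024, "KiB as FP32")   # ~3906 KiB
print(values_fp16.nbytes // 1024, "KiB as FP16")   # ~1953 KiB
print("worst-case rounding error:",
      float(np.abs(values_fp32 - values_fp16.astype(np.float32)).max()))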
 
- AMD launches a graphics card whose performance doesn't scale with TFLOPs throughput as expected.
Armchair experts: GCN has utilization issues.

- nVidia launches a graphics card whose performance doesn't scale with TFLOPs throughput as expected:
Armchair experts: It's the game engines that aren't prepared for this innovative architecture and developers need to optimize for it.



In the meanwhile, actual game developers who have been optimizing for GCN for the better part of the last decade:



¯\_(ツ)_/¯
DX12 needs more adoption; it's not that good right now. For instance, Star Wars Battlefront II can take up to 16GB of RAM for the game alone (VRAM not counted) when run on DX12 for whatever reason, and the framerate can drop to 5fps or so if your PC has 16GB of RAM.

With the new 3080, my ideal of running 99.9% of games at 1440p 165fps might become a reality.

 
If Ampere is 'underutilized' and already hitting a power wall, what exactly would you gain by increasing utilization?
A couple of GPU Boost steps down seems like a fair trade for an additional 20 to 30% of performance. Power usage goes down fast when you drop clocks.
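A toy model of that trade, assuming dynamic power scales roughly with f*V² and that voltage falls roughly linearly with frequency near the top of the V/f curve (so power goes roughly with f³ there); these are assumptions, not measurements:

Code:
# Rough clock-vs-power sketch under the assumptions above.
for clock_drop in (0.05, 0.10):        # roughly a couple of GPU Boost bins
    f = 1.0 - clock_drop
    relative_power = f ** 3            # P ~ f * V^2, with V tracking f near the top of the curve
    print(f"-{clock_drop:.0%} clock -> ~{1 - relative_power:.0%} less power "
          f"for {clock_drop:.0%} less raw throughput")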
 