Performance scaling on big GPUs discussion ...

A former NVIDIA and AMD engineer (he authored TXAA) has shared some interesting insights on the performance of big GPUs. He believes there are several reasons why a big GPU will scale worse with current-generation games than a smaller one. I will list his reasons here to start a discussion about them.

1- The front-end (command processor) of the GPU is a serial machine, so a bigger, lower-clocked GPU is more likely to get bottlenecked there than a smaller, higher-clocked GPU. Triangle rasterization has a similar issue.

2- The DX12 API itself is a limitation on big GPUs.
DX12 on PC established a bad API baseline. Drivers disabled the ability to pipeline with split barriers by default, and queues are CPU-scheduled with high latency ... most games are loaded with non-pipelined workload {drain, idle, fill} regions (this gets progressively worse on larger GPUs). A sketch of what a split barrier looks like follows this list.

3- Very high fps has higher overhead.
The higher the fps, the higher the cost of frame-boundary idle (including an app context switch from game to driver feature or compositor, etc.), and it gets worse on larger machines (more drain/fill time) ... without also pipelining frames, the PC is in trouble for 480 Hz displays, where a frame budget is only about 2 ms and even a few hundred microseconds of fixed boundary overhead is a double-digit share of the frame.

4- Inefficient game code. Some games do lots of dependent passes with serialization, so it takes only one long pass to do serious damage to the "total realized performance due to the size of the rest of the machine going idle". Larger GPUs also suffer more when caches are invalidated. And when upscaling from a lower resolution, games are not doing a good job managing triangle LODs, as often "the amount of cluster culling isn't reducing as resolution drops (think of shadow passes, etc) = poor scaling".

5- Big GPUs with high power consumption often show more volatile clocks than smaller GPUs, and they need more rapid power-state changes.
So the big chip goes idle more often, its power state changes more often, and each change adds latency before it is back at peak performance for work = slower.
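
To make point 2 concrete, here is a minimal sketch of what a D3D12 split barrier looks like (the resource, states, and function name are illustrative placeholders, not from the post): the transition is begun with BEGIN_ONLY right after the last write and ended with END_ONLY right before the first read, so in principle the driver can overlap the work recorded in between with the barrier instead of draining the machine.

```cpp
// Minimal illustrative sketch of a D3D12 split barrier. The resource and
// state choices here are hypothetical; real engines pick these per pass.
#include <d3d12.h>

void SplitTransition(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* texture)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = texture;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    // Begin the transition right after the last write to the texture.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
    cmdList->ResourceBarrier(1, &barrier);

    // ... record independent draws/dispatches here that could hide the transition ...

    // End the transition just before the first read of the texture.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
    cmdList->ResourceBarrier(1, &barrier);
}
```

The complaint quoted above is that drivers disabled this pipelining by default, so the {drain, idle, fill} pattern remains even when the app expresses the split.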
 
Yeah, it's gotten to the point where throwing hardware at the problem has increasingly diminishing returns because the API and software bottlenecks are just too severe. I was admonished for sharing my amateur interpretation of Nsight traces, but I strongly believe modern GPUs are woefully underutilized and it's probably getting worse.
 
I think a lot of games still have not switched over heavily to GPU-driven rendering, so the CPU is in the loop. Hopefully something like work graphs will push more in that direction. It seems like GPUs may have the same issue as CPUs, where a lot of work is being done just to decode the instruction stream and keep the machine occupied. The whole front end of a CPU is really interesting. I don't know that much about the GPU front-end, but if it bottlenecks, all of that width isn't going to do anything.
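
For context on what "CPU out of the loop" looks like today without work graphs: the usual GPU-driven path on D3D12 is ExecuteIndirect, where a culling compute pass writes the draw arguments and count into GPU buffers and the CPU records a single call. A minimal sketch, with the command signature, buffers, and limits as assumed placeholders:

```cpp
// Minimal sketch of GPU-driven draw submission via D3D12 ExecuteIndirect.
// drawSignature, argumentBuffer, countBuffer and maxDraws are hypothetical
// placeholders; a GPU culling pass is assumed to have filled the buffers.
#include <d3d12.h>

void RecordGpuDrivenDraws(ID3D12GraphicsCommandList* cmdList,
                          ID3D12CommandSignature*    drawSignature,
                          ID3D12Resource*            argumentBuffer,
                          ID3D12Resource*            countBuffer,
                          UINT                       maxDraws)
{
    // The CPU never reads the culling results back: the GPU decides how many
    // of the maxDraws argument slots actually execute via countBuffer.
    cmdList->ExecuteIndirect(drawSignature, maxDraws,
                             argumentBuffer, 0,
                             countBuffer, 0);
}
```

Work graphs would go further by letting shaders spawn follow-up work directly, without the CPU pre-sizing these buffers.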
 
I've been wishing to run the GPU independently from the CPU for more than a decade.
Work graphs are clearly interesting, but very little hardware supports them yet :(
 
The whole front end of a CPU is really interesting. I don't know that much about the GPU front-end, but if it bottlenecks, all of that width isn't going to do anything.

It's almost certainly a work distribution problem, but it's not clear that the problem is the GPU front-end and not further upstream. I ran some quick numbers using TPU's latest reviews, and performance per flop of Nvidia's GPUs hasn't improved much across Ampere/Ada/Blackwell even with the higher clocks, massive L2 sizes and off-chip bandwidth increases. If the front-end were a significant bottleneck, then Ada should be a lot more efficient than Ampere for the same flops, but that's not the case.

I didn't look at every card and there are a million other variables but a few comparisons stood out to me.

The 4060 Ti has 43% higher clocks and similar flops to the 3070 Ti, yet has worse performance per flop (and worse absolute performance too). This can be partially explained by the 3070 Ti's massive advantage in bandwidth per flop and higher raw fillrate, but that should be mitigated somewhat by the 4060 Ti's much larger L2. That 43% higher clock doesn't seem to be helping much, though.

The 5080 vs 3090 faceoff is also interesting. The 5080 has the advantage of 54% higher clocks and a massively larger L2 at similar off-chip bandwidth and SM count. The 3090 has a lot more bandwidth per flop, but again that's mitigated by the 5080's much larger cache. Here the 3090 wins again in perf per flop, and the 5080's huge clock advantage isn't enough to put it ahead.

There's no real evidence that higher clocks are helping to feed the beast.
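
For anyone who wants to reproduce the comparison, this is essentially all the arithmetic it involves. A rough sketch: the clock and TFLOPS figures are approximate public spec-sheet values, and the fps fields are placeholders to be filled in with TPU's measured averages (not real data):

```cpp
// Rough sketch of the perf-per-flop comparison described above.
#include <cstdio>

struct Card {
    const char* name;
    double boostGhz;   // approximate boost clock (GHz)
    double tflops;     // approximate FP32 TFLOPS
    double avgFps;     // PLACEHOLDER: fill in the measured average fps
};

int main() {
    Card cards[] = {
        {"RTX 3070 Ti", 1.77, 21.7, 0.0},
        {"RTX 4060 Ti", 2.54, 22.1, 0.0},
        {"RTX 3090",    1.70, 35.6, 0.0},
        {"RTX 5080",    2.62, 56.3, 0.0},
    };
    for (const Card& c : cards) {
        // perf/flop = measured fps normalized by theoretical FP32 throughput
        double perfPerTflop = (c.tflops > 0.0) ? c.avgFps / c.tflops : 0.0;
        std::printf("%-12s %.2f GHz  %.1f TFLOPS  %.3f fps/TFLOP\n",
                    c.name, c.boostGhz, c.tflops, perfPerTflop);
    }
    return 0;
}
```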
 
It's almost certainly a work distribution problem, but it's not clear that the problem is the GPU front-end and not further upstream.

I guess GPUs are different because they're inherently parallel, whereas CPU parallelism is a much different problem, so targeting the width (cores) is more difficult.
 
There's no real evidence that higher clocks are helping to feed the beast.

Good post by Arun.
 
While he may have some points, man is he ever bitter. His tweets are all angry and come across as “I did it better when I was at AMD” and “I’m smarter than the rest of you, you’re all doing it wrong”.

Moaning about game ready drivers like it’s some kind of conspiracy to make AMD look bad, because AMD cards have higher theoretical FLOPs? Not someone I need to read.
 
There's no real evidence that higher clocks are helping to feed the beast.
Nvidia's FE has been distributed in the form of TPCs for ages now; it shouldn't be a problem for them.
Also, I've been looking at recent benchmarks of newer games, and to me it doesn't look like there's any sort of scaling problem on Blackwell at all - aside from the fact that GB202 is way more CPU-limited than people seem to realize.

Here's MSM2's result:
https://tpucdn.com/review/spider-man-2-performance-benchmark/images/performance-rt-3840-2160.png
+35% from the 4090 to the 5090, which is more than the relative flops gain between them (~30%).

And KCD2:
https://tpucdn.com/review/kingdom-c...ce-benchmark/images/performance-3840-2160.png
+32% from the 4090 to the 5090, roughly in line with the flops scaling.

People seem to be overly concentrated on "average" results, which if anything basically show that CPUs aren't keeping up with GPUs.
Once you start looking at heavy modes that are likely to be mostly GPU-limited, you're essentially getting the scaling you'd expect from the specs.
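
As a quick sanity check using only the numbers quoted in this post (the measured 4K gains versus the ~30% relative flops increase from the 4090 to the 5090), the heavy modes do land at or above flops-proportional scaling. A tiny sketch:

```cpp
// Quick check of the scaling argument above, using only the numbers quoted
// in the post: measured 4090 -> 5090 gains vs. the ~30% relative flops gain.
#include <cstdio>

int main() {
    const double flopsGain = 0.30;  // ~relative FP32 flops gain, as quoted
    const struct { const char* game; double gain; } results[] = {
        {"Spider-Man 2 (RT, 4K)", 0.35},
        {"KCD2 (4K)",             0.32},
    };
    for (const auto& r : results) {
        // A ratio >= 1.0 means the measured gain met or exceeded what the flops delta predicts.
        std::printf("%-24s +%.0f%% measured, %.2fx of the flops-predicted gain\n",
                    r.game, r.gain * 100.0, r.gain / flopsGain);
    }
    return 0;
}
```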
 
I think everyone gets the point that good scaling is possible. I hope we all knew that already but it’s good to point it out.

Maybe the thread can move on now from individual macro-level examples to the micro-level details of what's required in the hardware, programming model and workloads, exploring what's needed for good scaling to become more common as GPUs get ever wider and have different balance and workload-distribution points.
 