Performance scaling on big GPUs discussion ...

A former NVIDIA and AMD engineer (he authored TXAA) has shared some interesting insights on the performance of big GPUs. He believes there are several reasons why a big GPU will scale worse with current-generation games than a smaller one. I will list his reasons here to start a discussion about them.

1- The front-end (command processor) of the GPU is a serial machine, so a bigger, lower-clocked GPU is more likely to get bottlenecked there than a smaller, higher-clocked GPU. Triangle rasterization has a similar issue.

2- The DX12 API itself is a limitation on big GPUs.
DX12 on PC established a bad API baseline. Drivers disabled the ability to pipeline with split barriers by default, and queues are CPU-scheduled with high latency ... most games are loaded with non-pipelined workload {drain, idle, fill} regions (this gets progressively worse on larger GPUs). A sketch of what a split barrier looks like follows this list.

3- Very high fps has higher overhead.
The higher the fps, the higher the cost of frame-boundary idle (including an app context switch from game to driver feature or compositor, etc.), and it gets worse on larger machines (more drain/fill time) ... without also pipelining frames, the PC is in trouble for 480 Hz displays, where a frame budget is only about 2 ms and even a few hundred microseconds of fixed boundary overhead is a double-digit share of the frame.

4- Inefficient game code. Some games do lots of dependent passes with serialization, so it takes only one long pass to do serious damage to the "total realized performance due to the size of the rest of the machine going idle". Larger GPUs also suffer more when caches are invalidated. And when upscaling from a lower resolution, games are not doing a good job managing triangle LODs, as often "the amount of cluster culling isn't reducing as resolution drops (think of shadow passes, etc) = poor scaling".

5- Big GPUs with high power consumption often show more volatile clocks than smaller GPUs, and they need more rapid power-state changes.
So the big chip goes idle more often, its power state changes more often, and each change adds latency before it is back at peak performance for work = slower.
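
To make point 2 concrete, here is a minimal sketch of what a D3D12 split barrier looks like (the resource, states, and function name are illustrative placeholders, not from the post): the transition is begun with BEGIN_ONLY right after the last write and ended with END_ONLY right before the first read, so in principle the driver can overlap the work recorded in between with the barrier instead of draining the machine.

```cpp
// Minimal illustrative sketch of a D3D12 split barrier. The resource and
// state choices here are hypothetical; real engines pick these per pass.
#include <d3d12.h>

void SplitTransition(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* texture)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = texture;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    // Begin the transition right after the last write to the texture.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
    cmdList->ResourceBarrier(1, &barrier);

    // ... record independent draws/dispatches here that could hide the transition ...

    // End the transition just before the first read of the texture.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
    cmdList->ResourceBarrier(1, &barrier);
}
```

The complaint quoted above is that drivers disabled this pipelining by default, so the {drain, idle, fill} pattern remains even when the app expresses the split.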
 
Yeah, it's gotten to the point where throwing hardware at the problem has increasingly diminishing returns because the API and software bottlenecks are just too severe. I was admonished for sharing my amateur interpretation of Nsight traces, but I strongly believe modern GPUs are woefully underutilized and it's probably getting worse.
 
I think a lot of games still have not switched over heavily to GPU-driven rendering, so the CPU is in the loop. Hopefully something like work graphs will push more in that direction. It seems like GPUs may have the same issue as CPUs, where a lot of work is being done just to decode the instruction stream and keep the machine occupied. The whole front end of a CPU is really interesting. I don't know that much about the GPU front-end, but if it bottlenecks, all of that width isn't going to do anything.
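
For context on what "CPU out of the loop" looks like today without work graphs: the usual GPU-driven path on D3D12 is ExecuteIndirect, where a culling compute pass writes the draw arguments and count into GPU buffers and the CPU records a single call. A minimal sketch, with the command signature, buffers, and limits as assumed placeholders:

```cpp
// Minimal sketch of GPU-driven draw submission via D3D12 ExecuteIndirect.
// drawSignature, argumentBuffer, countBuffer and maxDraws are hypothetical
// placeholders; a GPU culling pass is assumed to have filled the buffers.
#include <d3d12.h>

void RecordGpuDrivenDraws(ID3D12GraphicsCommandList* cmdList,
                          ID3D12CommandSignature*    drawSignature,
                          ID3D12Resource*            argumentBuffer,
                          ID3D12Resource*            countBuffer,
                          UINT                       maxDraws)
{
    // The CPU never reads the culling results back: the GPU decides how many
    // of the maxDraws argument slots actually execute via countBuffer.
    cmdList->ExecuteIndirect(drawSignature, maxDraws,
                             argumentBuffer, 0,
                             countBuffer, 0);
}
```

Work graphs would go further by letting shaders spawn follow-up work directly, without the CPU pre-sizing these buffers.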
 
I've been wishing to run the GPU independently from the CPU for more than a decade.
Work graphs are clearly interesting, but very little hardware supports them yet :(
 
The whole front end of a CPU is really interesting. I don't know that much about the GPU front-end, but if it bottlenecks, all of that width isn't going to do anything.

It's almost certainly a work distribution problem, but it's not clear that the problem is the GPU front-end and not further upstream. I ran some quick numbers using TPU's latest reviews, and performance per flop of Nvidia's GPUs hasn't improved much across Ampere/Ada/Blackwell even with the higher clocks, massive L2 sizes and off-chip bandwidth increases. If the front-end were a significant bottleneck, then Ada should be a lot more efficient than Ampere for the same flops, but that's not the case.

I didn't look at every card and there are a million other variables but a few comparisons stood out to me.

The 4060 Ti has 43% higher clocks and similar flops to the 3070 Ti, yet has worse performance per flop (and worse absolute performance too). This can be partially explained by the 3070 Ti's massive advantage in bandwidth per flop and higher raw fillrate, but that should be mitigated somewhat by the 4060 Ti's much larger L2. That 43% higher clock doesn't seem to be helping much, though.

The 5080 vs 3090 faceoff is also interesting. The 5080 has the advantage of 54% higher clocks and a massively larger L2 at similar off-chip bandwidth and SM count. The 3090 has a lot more bandwidth per flop, but again that's mitigated by the 5080's much larger cache. Here the 3090 wins again in perf per flop, and the 5080's huge clock advantage isn't enough to put it ahead.

There's no real evidence that higher clocks are helping to feed the beast.
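
For anyone who wants to reproduce the comparison, this is essentially all the arithmetic it involves. A rough sketch: the clock and TFLOPS figures are approximate public spec-sheet values, and the fps fields are placeholders to be filled in with TPU's measured averages (not real data):

```cpp
// Rough sketch of the perf-per-flop comparison described above.
#include <cstdio>

struct Card {
    const char* name;
    double boostGhz;   // approximate boost clock (GHz)
    double tflops;     // approximate FP32 TFLOPS
    double avgFps;     // PLACEHOLDER: fill in the measured average fps
};

int main() {
    Card cards[] = {
        {"RTX 3070 Ti", 1.77, 21.7, 0.0},
        {"RTX 4060 Ti", 2.54, 22.1, 0.0},
        {"RTX 3090",    1.70, 35.6, 0.0},
        {"RTX 5080",    2.62, 56.3, 0.0},
    };
    for (const Card& c : cards) {
        // perf/flop = measured fps normalized by theoretical FP32 throughput
        double perfPerTflop = (c.tflops > 0.0) ? c.avgFps / c.tflops : 0.0;
        std::printf("%-12s %.2f GHz  %.1f TFLOPS  %.3f fps/TFLOP\n",
                    c.name, c.boostGhz, c.tflops, perfPerTflop);
    }
    return 0;
}
```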
 
It's almost certainly a work distribution problem, but it's not clear that the problem is the GPU front-end and not further upstream.

I guess GPUs are different because they're inherently parallel, whereas CPU parallelism is a much different problem, so targeting the width (cores) is more difficult.
 
There's no real evidence that higher clocks are helping to feed the beast.

Good post by Arun.
 
While he may have some points, man is he ever bitter. His tweets are all angry and come across as “I did it better when I was at AMD” and “I’m smarter than the rest of you, you’re all doing it wrong”.

Moaning about game ready drivers like it’s some kind of conspiracy to make AMD look bad, because AMD cards have higher theoretical FLOPs? Not someone I need to read.
 
There's no real evidence that higher clocks are helping to feed the beast.
Nvidia's FE has been distributed in the form of TPCs for ages now; it shouldn't be a problem for them.
Also, I've been looking at recent benchmarks of newer games, and to me it doesn't look like there's any sort of scaling problem on Blackwell at all - aside from the fact that GB202 is way more CPU-limited than people seem to realize.

Here's MSM2's result:
https://tpucdn.com/review/spider-man-2-performance-benchmark/images/performance-rt-3840-2160.png
+35% from the 4090 to the 5090, which is more than the relative flops gain between them (~30%).

And KCD2:
https://tpucdn.com/review/kingdom-c...ce-benchmark/images/performance-3840-2160.png
+32% from the 4090 to the 5090, roughly in line with the flops scaling.

People seem to be overly concentrated on "average" results, which if anything basically show that CPUs aren't keeping up with GPUs.
Once you start looking at heavy modes that are likely to be mostly GPU-limited, you're essentially getting the scaling you'd expect from the specs.
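
As a quick sanity check using only the numbers quoted in this post (the measured 4K gains versus the ~30% relative flops increase from the 4090 to the 5090), the heavy modes do land at or above flops-proportional scaling. A tiny sketch:

```cpp
// Quick check of the scaling argument above, using only the numbers quoted
// in the post: measured 4090 -> 5090 gains vs. the ~30% relative flops gain.
#include <cstdio>

int main() {
    const double flopsGain = 0.30;  // ~relative FP32 flops gain, as quoted
    const struct { const char* game; double gain; } results[] = {
        {"Spider-Man 2 (RT, 4K)", 0.35},
        {"KCD2 (4K)",             0.32},
    };
    for (const auto& r : results) {
        // A ratio >= 1.0 means the measured gain met or exceeded what the flops delta predicts.
        std::printf("%-24s +%.0f%% measured, %.2fx of the flops-predicted gain\n",
                    r.game, r.gain * 100.0, r.gain / flopsGain);
    }
    return 0;
}
```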
 
I think everyone gets the point that good scaling is possible. I hope we all knew that already but it’s good to point it out.

Maybe the thread can move on now from individual macro-level examples to the micro-level details of what's required in the hardware, programming model and workloads, exploring what's needed for good scaling to become more common as GPUs get ever wider and have different balance and workload-distribution points.
 