Current Generation Games Analysis Technical Discussion [2023] [XBSX|S, PS5, PC]

But the 4080 with 76 SMs can keep up with the 7900 xtx with 96CUs in most games
Most games have smaller shaders so as not to bottleneck on the number of available registers, meaning the Nvidia GPUs can run many of them fast and saturate more of their pipelines with them.
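In CUDA terms (just an analogy for the same register-file hardware; the kernel below is purely illustrative, not from any game), that trade-off is something developers can steer explicitly by capping registers per thread so more warps stay resident:

Code:
// Hypothetical "small" shader: __launch_bounds__(256, 2) asks the compiler to
// keep register usage low enough that at least 2 blocks of 256 threads can be
// resident per SM; anything over that budget gets spilled to local memory.
__global__ void __launch_bounds__(256, 2) lightShader(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f + 1.0f;  // trivial body standing in for a low-register-pressure shader
}
// The same cap can be applied build-wide with nvcc's -maxrregcount=N flag.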
 
But the 4080 with 76 SMs can keep up with the 7900 xtx with 96CUs in most games

Sure, because most games are heavily optimized for NV graphics cards, with AMD graphics cards generally getting less optimization due to their respective market positions. That puts NV at a performance advantage regardless of the hardware capabilities of a given GPU. That's not developers deliberately hamstringing AMD GPUs, it's just the reality of their market positions in the PC space.

Starfield OTOH is heavily optimized for AMD GPUs because it's heavily optimized for the XBS consoles. Considering there was a 1 year delay in release just to get it into releasable form, NV GPUs on PC likely got a similar level of optimization time as AMD GPUs usually get in most other games.

So, in this case for example, Starfield is optimized to take advantage of the large register files on AMD GPUs, which is potentially detrimental on NV GPUs.

Basically, the shoe is on the other foot almost purely due to how heavily the game was optimized to work on the XBS consoles. Again not out of developers deliberately making NV GPUs worse, but just the reality of NV not having a presence on XBS consoles.

Regards,
SB
 
Basically, the shoe is on the other foot almost purely due to how heavily the game was optimized to work on the XBS consoles. Again not out of developers deliberately making NV GPUs worse, but just the reality of NV not having a presence on XBS consoles.
As an Nvidia owner it sucks (for heavily optimized Xbox titles). But going forward I think we can hope to see frame generation, DLSS, and the hardware ray tracing accelerators make up for the loss.
 
Most games have smaller shaders so as not to bottleneck on the number of available registers, meaning the Nvidia GPUs can run many of them fast and saturate more of their pipelines with them.

Yah, it'll be interesting to see if that becomes more of a bottleneck this gen. For Nvidia it looks like they haven't changed much on this front since Kepler (2012)
  • The register file size is 64K 32-bit registers per SM.
  • The maximum number of registers per thread is 255.
Edit: Will really depend on whether large shaders become more commonplace, or if small shaders with low register pressure still rule.
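Those numbers are queryable at runtime if anyone wants to check their own card; a minimal CUDA sketch (the kernel is a stand-in, not anything from Starfield) that reads back what the compiler actually allocated and what that means for theoretical occupancy:

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel; in practice you'd query the kernel you actually care about.
__global__ void exampleKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    // regsPerMultiprocessor reports 65536 (64K 32-bit registers) on the
    // consumer architectures from Kepler through Ada.
    printf("registers per SM: %d, max threads per SM: %d\n",
           prop.regsPerMultiprocessor, prop.maxThreadsPerMultiProcessor);

    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, exampleKernel);
    printf("registers per thread assigned by the compiler: %d\n", attr.numRegs);

    // Theoretical occupancy for 256-thread blocks with no dynamic shared memory.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, exampleKernel, 256, 0);
    printf("resident blocks per SM: %d (%d of %d possible threads)\n",
           blocksPerSM, blocksPerSM * 256, prop.maxThreadsPerMultiProcessor);
    return 0;
}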
 
Starfield OTOH is heavily optimized for AMD GPUs because it's heavily optimized for the XBS consoles. Considering there was a 1 year delay in release just to get it into releasable form, NV GPUs on PC likely got a similar level of optimization time as AMD GPUs usually get in most other games.

I'd say the level of optimisation at play here for RDNA2 (and largely by extension RDNA3) is probably a fair bit higher than Nvidia sees in the majority of titles.

Most games are cross-platform, and so the PC versions of those games likely use Nvidia as the lead PC architecture. So yes, that will result in a relative performance advantage for Nvidia. However, I don't expect that in that scenario NV GPUs will have been as heavily tuned for as the Xbox (and by extension AMD GPUs) was in Starfield's case. This game was basically built from the ground up to max out RDNA2, and Nvidia can't really make that claim for any of its architectures in any game.
 
The results imply a high level of tuning for RDNA. Those SIMD utilization numbers are extremely impressive. That kind of utilization doesn’t happen by accident. Despite that, utilization on Ampere and Ada is still pretty good. Better than most games I’ve profiled on a 3090.

Nvidia has been stingy on register capacity though. Even A100 and H100 are still on 64KB. Maybe it’s time for them to upgrade.
 
The results imply a high level of tuning for RDNA. Those SIMD utilization numbers are extremely impressive. That kind of utilization doesn’t happen by accident. Despite that, utilization on Ampere and Ada is still pretty good. Better than most games I’ve profiled on a 3090.

Nvidia has been stingy on register capacity though. Even A100 and H100 are still on 64KB. Maybe it’s time for them to upgrade.

The Chips and Cheese article says they keep the register file small so they can fit more SMs. I'd be curious to know how much die space the register file takes as a percentage of the SM. It's 256 KB per SM (64K 32-bit entries). Each thread is limited to 255 registers. Not sure how much they could grow the register file while still increasing SM count for the next architecture. I'm curious how the 255 register limit for a thread is set. I think there's 32 threads per warp, and a 48 warp limit per SM. Not sure how you get to that 255 register limit.
 
The Chips and Cheese article says they keep the register file small so they can fit more SMs. I'd be curious to know how much die space the register file takes as a percentage of the SM. It's 256 KB per SM (64K 32-bit entries). Each thread is limited to 255 registers. Not sure how much they could grow the register file while still increasing SM count for the next architecture. I'm curious how the 255 register limit for a thread is set. I think there's 32 threads per warp, and a 48 warp limit per SM. Not sure how you get to that 255 register limit.

The 255 is a per thread limit likely governed by the scheduler and compiler implementation. It’s not related to the per SIMD or per SM hardware limits.

Increasing SM counts probably isn’t the right move at this point. Not only is occupancy often low but warp coherence for issued instructions is also low and getting worse with heavier usage of RT. They should spend transistors on feeding the beast instead of making an even bigger, hungrier monster.
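For a rough feel of how the 255-register ceiling and the 48-warp cap interact (back-of-envelope only; this ignores the per-warp allocation granularity the hardware actually uses):

Code:
#include <cstdio>

int main()
{
    const int regsPerSM      = 65536;  // 64K 32-bit registers per SM
    const int threadsPerWarp = 32;
    const int maxWarpsPerSM  = 48;     // scheduler cap on Ampere/Ada

    // Resident warps as a function of registers per thread. At the 255-register
    // ceiling only ~8 warps fit, i.e. about 1/6 of the scheduler's capacity.
    for (int regsPerThread : {32, 64, 128, 255}) {
        int warps = regsPerSM / (regsPerThread * threadsPerWarp);
        if (warps > maxWarpsPerSM) warps = maxWarpsPerSM;
        printf("%3d regs/thread -> %2d resident warps (cap %d)\n",
               regsPerThread, warps, maxWarpsPerSM);
    }
    return 0;
}

So the 255 limit isn't derived from the 48-warp cap at all; a full 48 warps would only leave roughly 42 registers per thread, which fits with it being a compiler/ISA encoding limit rather than an SM hardware limit.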
 
So I guess drivers are still doing a lot of shader replacement to optimize games? Or does Nvidia feed them back to Bethesda to implement alternate shaders for nvidia?

In the case of this NerdTech guy, I don't know if what he's suggesting can actually happen. A large, complex shader might not be reducible, and I don't know whether a complex shader could be re-implemented as simpler shaders that do the same work at the driver level, without changes to the actual game.
 

FYI this is the "Nvidia removed their hardware scheduler and that's why Radeons are so much less CPU dependent" guy. This is just repeating what chipsandcheese reported*, so I'm not sure what value there is in reposting this antivax loon's regurgitated opinions here.

*Actually, I'm being overly generous. He's interpreting what they're saying, and incorrectly.

 
FYI this is the "Nvidia removed their hardware scheduler and that's why Radeons are so much less CPU dependent" guy. This is just repeating what chipsandcheese reported*, so I'm not sure what value there is in reposting this antivax loon's regurgitated opinions here.

*Actually, I'm being overly generous. He's interpreting what they're saying, and incorrectly.

This dude has always been a grifter. His stupid driver overhead video was debunked almost immediately back in the day but I still see it pop up on occasion.
 
I think it explains it. Limitation is the register size which lowers...
It doesn't. Alex compared 6800 XT to the 3080 in his review, if I recall correctly. RDNA 2 features the same 256 kilobyte register file per Compute Unit (CU) as Ampere or Ada's 256 KB per SM. Chipsandcheese compared the register file size for the 7900 series' CU (384 KB), which isn't applicable to the RDNA 2 and lower-tier RDNA 3 chips that have 256 KB register file per CU. Per partition, RDNA2 features a 128 KB register file for 32 ALUs, whereas Ampere/Ada has just 64 KB for the 16+16 FMA light and heavy pipes. However, Ampere/Ada executes each warp over two cycles for floating-point operations as opposed to a single cycle on RDNA.

Even assuming a scenario where the whole game was 100% occupancy-limited (which is very far from the truth), I cannot see how a 4060 Ti with more registers per chip and more threads in flight could be slower than an RX 7600, as seen here: https://www.techspot.com/photos/article/2731-starfield-gpu-benchmark/#Ultra-1080p. Therefore, their review doesn't provide any conclusive answers, unfortunately.
 
It doesn't. Alex compared 6800 XT to the 3080 in his review, if I recall correctly. RDNA 2 features the same 256 kilobyte register file per Compute Unit (CU) as Ampere or Ada's 256 KB per SM. Chipsandcheese compared the register file size for the 7900 series' CU (384 KB), which isn't applicable to the RDNA 2 and lower-tier RDNA 3 chips that have 256 KB register file per CU. Per partition, RDNA2 features a 128 KB register file for 32 ALUs, whereas Ampere/Ada has just 64 KB for the 16+16 FMA light and heavy pipes. However, Ampere/Ada executes each warp over two cycles for floating-point operations as opposed to a single cycle on RDNA.

Even assuming a scenario where the whole game was 100% occupancy-limited (which is very far from the truth), I cannot see how a 4060 Ti with more registers per chip and more threads in flight could be slower than an RX 7600, as seen here: https://www.techspot.com/photos/article/2731-starfield-gpu-benchmark/#Ultra-1080p. Therefore, their review doesn't provide any conclusive answers, unfortunately.

The article isn't conclusive as it didn't profile every compute dispatch or look at every performance indicator but does point to heavy optimization for RDNA.

The 7600 likely has much higher L2 bandwidth than the 4060 Ti based on the earlier analysis here. That could be one factor in its outperformance vs the 4060 Ti. The 6700 XT matches the 7600 in Starfield and has a similarly fast L2.

"In AMD’s favor, they have a very high bandwidth L2 cache. As the first multi-megabyte cache level, the L2 cache plays a very significant role and typically catches the vast majority of L0/L1 miss traffic. Nvidia’s GPUs become L2 bandwidth bound in the third longest shader, which explains a bit of why AMD’s 7900 XTX gets as close as it does to Nvidia’s much larger flagship. AMD’s win there is a small one, but seeing the much smaller 7900 XTX pull ahead of the RTX 4090 in any case is not in line with anyone’s expectations. AMD’s cache design pays off there."
 
The article isn't conclusive as it didn't profile every compute dispatch or look at every performance indicator but does point to heavy optimization for RDNA.
Not just RDNA, Vega 64 is 25% faster than GTX 1080 @1080p, RX 580 is 30% faster than GTX 1060.

 
The article isn't conclusive as it didn't profile every compute dispatch or look at every performance indicator but does point to heavy optimization for RDNA.

The 7600 likely has much higher L2 bandwidth than the 4060 Ti based on the earlier analysis here. That could be one factor in its outperformance vs the 4060 Ti. The 6700 XT matches the 7600 in Starfield and has a similarly fast L2.

"In AMD’s favor, they have a very high bandwidth L2 cache. As the first multi-megabyte cache level, the L2 cache plays a very significant role and typically catches the vast majority of L0/L1 miss traffic. Nvidia’s GPUs become L2 bandwidth bound in the third longest shader, which explains a bit of why AMD’s 7900 XTX gets as close as it does to Nvidia’s much larger flagship. AMD’s win there is a small one, but seeing the much smaller 7900 XTX pull ahead of the RTX 4090 in any case is not in line with anyone’s expectations. AMD’s cache design pays off there."
If I understand the article correctly, occupancy isn't a requirement for performance, since low occupancy does not correlate with poor performance.

But a lack of L2 hits or bandwidth can cause threads to stall, in which case having additional threads resident would be desirable while waiting for the correct data to arrive.
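That trade-off can be made semi-quantitative with a Little's-law style estimate; the latencies and the amount of independent work per warp below are placeholders, not Starfield measurements:

Code:
#include <cstdio>

// Rough latency-hiding estimate: to keep an SM's issue ports busy you need
// roughly (stall latency) / (cycles of independent work per warp) warps resident.
int main()
{
    const int l2HitLatency = 250;  // illustrative L2 hit latency, cycles
    const int dramLatency  = 600;  // illustrative DRAM miss latency, cycles
    const int workPerWarp  = 20;   // assumed independent ALU cycles between loads

    printf("warps needed to hide an L2 hit:   ~%d\n", (l2HitLatency + workPerWarp - 1) / workPerWarp);
    printf("warps needed to hide a DRAM miss: ~%d\n", (dramLatency + workPerWarp - 1) / workPerWarp);
    // If register pressure caps residency at ~8 warps, neither stall is fully
    // hidden - which is why cache hit rate/bandwidth and occupancy interact.
    return 0;
}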
 
Not just RDNA, Vega 64 is 25% faster than GTX 1080 @1080p, RX 580 is 30% faster than GTX 1060.

Crazy. I think we’ve often theorized that this was possible, under very specific conditions of pure optimization; I didn’t really believe it could come true. Hell of a lot of things have to fall in place for this to happen though.
 
Not just RDNA, Vega 64 is 25% faster than GTX 1080 @1080p, RX 580 is 30% faster than GTX 1060.

This has been the common scenario for quite some time now in these big multiplatform titles.
 