Current Generation Games Analysis Technical Discussion [2023] [XBSX|S, PS5, PC]

But you can say that still image comparisons (I have issues with this, so it isn't something I agree with) in that type of essentially academic comparison is used as a part of the marketing for games and therefore is relevant in terms of how it's received.

While at the same time, if the performance gains are low (say in the 5% range mentioned earlier), then you would argue that they too are so low you wouldn't notice them outside of an academic comparison either. And as of now that 5% performance gain isn't really something marketable.

Keep in mind that it may be 5% over X time or Y frames, but it might be 20%+ for a single frame or a few frames. IE - enough to prevent there being a serious hiccup or judder that might have happened if you didn't have VRS.

Average framerate is a very imprecise method with which to judge how effective any given technique is, especially if its use is to provide a smoother and more consistent frametime. In this case it wouldn't even matter if it led to a 0% improvement in frame rendering times 90% of the time, as long as it provided significant improvements in the most problematic 10% of frames.

To do the same thing without VRS you would need to degrade the entire screen instead of just a portion of it in order to prevent those problematic performance hiccups.
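To put rough numbers on that (a minimal sketch; the frame times below are invented purely to illustrate the point, not measurements from any game):

Code:
// Invented numbers: 95 frames at 16 ms plus 5 frames that would spike to
// 25 ms without VRS but stay at 20 ms with it (a 20% saving, but only on
// the worst frames).
#include <cstdio>

int main() {
    const int frames = 100;
    double total_without_vrs = 95 * 16.0 + 5 * 25.0;  // ms
    double total_with_vrs    = 95 * 16.0 + 5 * 20.0;  // ms
    printf("average without VRS: %.2f ms\n", total_without_vrs / frames);  // 16.45 ms
    printf("average with VRS:    %.2f ms\n", total_with_vrs / frames);     // 16.20 ms
    printf("average gain: %.1f%%, worst-frame gain: %.0f%%\n",
           (1.0 - total_with_vrs / total_without_vrs) * 100.0,             // ~1.5%
           (1.0 - 20.0 / 25.0) * 100.0);                                   // 20%
    return 0;
}

The average barely moves, but the 25 ms hitches are gone, which is the whole point of spending the technique there.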

Regards,
SB
 
I’ve seen enough profiled games to know that low occupancy on my 3090 is a common problem and isn’t unique to Starfield.
Regarding the subject of Occupancy, I was watching a dev talk about it (timestamped), and he explains some interesting facts. Basically, occupancy is not the end goal, it's just a tool to hide latency, it's how many waves you can load on a single CU, but too many waves (high occupancy) can compete for caches and memory controller (memory loads), thus massively reducing performance. Some platforms actually allow the developer to reduce and control the occupancy rate to boost performance.
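For what it's worth, CUDA exposes exactly that kind of control; a minimal sketch (the kernel and its body are placeholders, not from any real game):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// __launch_bounds__ caps register use so the compiler targets a given
// occupancy; the runtime can then be asked how many blocks actually fit
// per SM for this kernel.
__global__ void __launch_bounds__(256, 2)   // 256 threads/block, aim for >= 2 blocks/SM
myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // placeholder work
}

int main() {
    int blocksPerSM = 0;
    // How many 256-thread blocks of myKernel can be resident on one SM,
    // given its register and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("resident blocks per SM: %d (%d warps)\n", blocksPerSM, blocksPerSM * 256 / 32);
    return 0;
}

The second launch-bounds argument is effectively a request for a minimum occupancy; relaxing it (or allowing more registers per thread) is how you let occupancy fall on purpose.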


There is also this blogpost from a developer talking about occupancy, and in summary he concludes that low occupancy isn't always a bad thing and high occupancy is not always a good thing. You need to optimize the occupancy rate best suited for your resources.

there are 2 more questions to consider:
Is low occupancy always bad?
Is high occupancy always good?

For the first question the answer is no. If you notice in the above code, the shader compiler goes to great lengths to add as much distance between the instruction that requests the memory and the instruction that uses it by rearranging the order of any instruction that it can. It may be the case that it manages to fit enough instructions so that by the time it needs to use the memory it is already here and there is no need to stall (and to swap to another batch) at all. Bottom line is not to rely only on a low occupancy metric to start optimizing the shader, check for other bottlenecks like stalls due to memory reads (texture reads) first. If they are high then your shader program may benefit from increasing the occupancy (normally by reducing the number of VGPRs it uses).

For the second question the answer is also no. If a shader program has many memory read instructions and requires a lot of memory traffic, a large buffer of active batches (high occupancy) will fight over the limited cache resources, each batch potentially invalidating cache lines for data belonging to another batch/instruction. Determining how to achieve a good balance will take some profiling within the context of your application and platform.

Update: There is another reason why high occupancy/low VGPR count can be bad, especially for shaders with a lot of memory reads. The compiler in its effort to keep the VGPR count low may “serialise” the memory reads in order to reuse the registers more (for example issue a memory fetch, wait for the data, store the value in a register, use it, and then reuse that register to store the value from the next memory read). This can lead to bad memory fetch scheduling and increased memory latency. If the compiler has more registers available it can schedule many memory reads upfront, wait for the data and cache the values in registers before using them, improving memory latency overall. This memory latency is what higher occupancy (large number of in flight waves to swap) is supposed to improve but does not always achieve, so the advice to always profile to determine the actual effect of the shader changes still stands.
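A hypothetical CUDA illustration of that last point (kernel names are mine; in practice the compiler may unroll and batch the first version anyway, the point is only the register/latency trade-off):

Code:
// In sum4_serial the compiler can recycle one register, so the loads tend to
// be issued and consumed one after another; in sum4_batched four loads are
// requested up front into separate registers, overlapping their latencies at
// the cost of more live registers (i.e. potentially lower occupancy).
__global__ void sum4_serial(const float* in, float* out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k)
        acc += in[i + k * stride];       // load, then use immediately
    out[i] = acc;
}

__global__ void sum4_batched(const float* in, float* out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = in[i + 0 * stride];        // four independent loads in flight
    float b = in[i + 1 * stride];
    float c = in[i + 2 * stride];
    float d = in[i + 3 * stride];
    out[i] = (a + b) + (c + d);          // values consumed only after all are requested
}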

 
Regarding the subject of Occupancy, I was watching a dev talk about it (timestamped), and he explains some interesting facts. Basically, occupancy is not the end goal, it's just a tool to hide latency, it's how many waves you can load on a single CU, but too many waves (high occupancy) can compete for caches and memory controller (memory loads), thus massively reducing performance. Some platforms actually allow the developer to reduce and control the occupancy rate to boost performance.

Yeah occupancy is just a means to an end. However low occupancy is highly correlated with low hardware utilization and poor latency hiding. At least in the game workloads that I’ve seen.
 
Regarding the subject of Occupancy, I was watching a dev talk about it (timestamped), and he explains some interesting facts. Basically, occupancy is not the end goal, it's just a tool to hide latency, it's how many waves you can load on a single CU, but too many waves (high occupancy) can compete for caches and memory controller (memory loads), thus massively reducing performance. Some platforms actually allow the developer to reduce and control the occupancy rate to boost performance.


There is also this blogpost from a developer talking about occupancy, and in summary he concludes that low occupancy isn't always a bad thing and high occupancy is not always a good thing. You need to optimize the occupancy rate best suited for your resources.

Related research in agreement with the point made in the post.
 
Keep in mind also that what would be considered good or bad levels of occupancy will vary not only by workload (the type of work being done) but also by hardware (more or less cache, more or less register file, etc.).

It is quite likely that optimal levels of occupancy on NV hardware are going to be significantly different from optimal levels of occupancy on AMD hardware, and that there will be differences even within different lines of the same vendor's hardware.
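As a quick sketch of how much those limits move around, the per-SM numbers that bound occupancy can simply be queried (CUDA shown; the equivalent figures for AMD hardware come from their own APIs and docs):

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Print the per-SM limits that bound occupancy on whatever GPU is installed;
// the same shader can land at very different occupancy on different parts.
int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s\n", p.name);
    printf("  SMs:                  %d\n", p.multiProcessorCount);
    printf("  registers per SM:     %d (%d KB)\n",
           p.regsPerMultiprocessor, p.regsPerMultiprocessor * 4 / 1024);
    printf("  shared memory per SM: %zu KB\n", p.sharedMemPerMultiprocessor / 1024);
    printf("  max threads per SM:   %d (%d warps)\n",
           p.maxThreadsPerMultiProcessor, p.maxThreadsPerMultiProcessor / 32);
    return 0;
}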

Regards,
SB
 
However low occupancy is highly correlated with low hardware utilization and poor latency hiding. At least in the game workloads that I’ve seen.

As explained above, that's often a misconception: occupancy is just one of the contributing factors to better utilization, but there are many, many others.

Fallacy: Occupancy is a metric of utilization
– No, it’s only one of the contributing factors.

In the end what you care about is achieving "peak" utilization. Ampere and Ada probably work best with low occupancy: the driver feeds the hardware fast code that fits well into the memory subsystem, and close-to-peak utilization is achieved. In the research paper, they achieved 60% to 2x the TFLOPS by reducing the occupancy by 2x, on GPUs ranging from Tesla to Fermi.

Fallacy: Increasing occupancy is the only way to improve latency hiding
– No, increasing ILP is another way

This applies to all CUDA-capable GPUs. E.g. on G80:
‒ Get ≈100% peak with 25% occupancy if no ILP
‒ Or with 8% occupancy, if 3 operations from each thread can be concurrently processed

Fallacy: “Low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation”
– We’ve just seen 84% of the peak at mere 4% occupancy. Note that this is above 71% that cudaMemcpy achieves at best.

Running fast may require low occupancy
• Must use registers to run close to the peak
• The larger the bandwidth gap, the more data must come from registers
• This may require many registers = low occupancy
• This often can be accomplished by computing multiple outputs per thread

GFLOPS go up, occupancy goes down
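A minimal sketch of the "multiple outputs per thread" idea from those slides (hypothetical kernels, tail handling omitted):

Code:
// saxpy4 gives each thread four independent multiply-adds in flight, so fewer
// resident warps are needed to hide latency than with saxpy1; the cost is more
// registers per thread, i.e. lower occupancy by design. (Tail handling when n
// isn't a multiple of 4 is omitted.)
__global__ void saxpy1(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];               // one output per thread
}

__global__ void saxpy4(float a, const float* x, float* y, int n)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 3 < n) {
        float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        float y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];
        y[i]     = a * x0 + y0;                      // four independent outputs:
        y[i + 1] = a * x1 + y1;                      // more ILP per thread,
        y[i + 2] = a * x2 + y2;                      // fewer warps needed in flight
        y[i + 3] = a * x3 + y3;
    }
}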



Knowing all of this, I argue that this ChipsandCheese article that analyzed Starfield code got it backwards. All NVIDIA GPUs have higher-than-typical occupancy in Starfield, yet most of the time they run below their TDP (a very strong sign of under-utilization), so naturally higher occupancy is a detriment to NVIDIA's hardware in this game; it needs to come down so performance can go up.
 
I argue that this ChipsandCheese article that analyzed Starfield code got it backwards.
I don't think your definition of backwards is correct here. They specifically state in their article that Nvidia cards operate and are designed for low occupancy. They specifically state that the 4090 has very low occupancy in comparison to the AMD cards.
The AMD cards are nearly over-occupied, to the point that it could be considered a bottleneck.

There's no backwards, Starfield's shaders are pushing cards such that they need more occupancy for latency hiding. They don't indicate what needs to be done to improve performance on Nvidia cards, they didn't say it needs more occupancy to perform better. Nowhere in that article did they mention this.
 
They don't indicate what needs to be done to improve performance on Nvidia cards, they didn't say it needs more occupancy to perform better. Nowhere in that article did they mention this.
They flip-flop on the issue; they allude many times to higher occupancy meaning higher utilization, which again is a fallacy.

However, cache access latency is very high on GPUs, so higher occupancy often correlates with better utilization.

The takeaway from this shader is that AMD’s RDNA 3 architecture is better set up to feed its execution units. Each SIMD has three times as much vector register file capacity as Nvidia’s Ampere, Ada, or Turing SMSPs, allowing higher occupancy. That in turn gives RDNA 3 a better chance of hiding latency.

While L1 hitrates are good, high occupancy still matters because texture sampling incurs higher latency than plain vector accesses.

The takeaway from this shader is that AMD is able to achieve very high utilization thanks to very high occupancy. In fact, utilization is so high that AMD is compute bound. Nvidia hardware does well in this shader, but not quite as well because they again don’t have enough register file capacity to keep as much work in flight

This made many people think the reason why NVIDIA GPUs are underperforming is due to Starfield achieving lower occupancy on their hardware compared to AMD. Which isn't true at all; the article even shows a clear case of this not being true.

Therefore, high occupancy doesn’t help Nvidia get better utilization than AMD. Earlier, I mentioned that higher occupancy doesn’t necessarily lead to better utilization, and this is one such case

Anyway, I reiterate that occupancy needs to come way down for NVIDIA in Starfield; this will lead to far better hardware utilization and will max out the TDP of NVIDIA GPUs, as intended.
 
They flip-flop on the issue; they allude many times to higher occupancy meaning higher utilization, which again is a fallacy.

This made many people think the reason why NVIDIA GPUs are underperforming is due to Starfield achieving lower occupancy on their hardware compared to AMD. Which isn't true at all; the article even shows a clear case of this not being true.

Anyway, I reiterate that occupancy needs to come way down for NVIDIA in Starfield; this will lead to far better hardware utilization and will max out the TDP of NVIDIA GPUs, as intended.
There is no flip-flop here. The shader is working as intended; it’s been designed so that Xbox Series GPUs are reaching their maximum. That helps AMD; it just doesn’t help Nvidia.

It won’t come down. That’s the point, they aren’t going to rewrite all the shaders of a game to benefit nvidia, and no drivers will address it either.

it’s an issue that nvidia needs to solve, as games continue to move away from baked technologies the shaders are going to get longer and more complex. As per the article, L2 and VRAM hit rates will determine how well fed your execution units are. If you can’t hit, you need to wait and use another thread, and occupancy will rise as a result. Occupancy is a symptom; it’s not something you design for.
 
Anyway, I reiterate that occupancy needs to come way down for NVIDIA in Starfield; this will lead to far better hardware utilization and will max out the TDP of NVIDIA GPUs, as intended.

Nvidia’s architectures aren’t designed for low occupancy. They are designed for low register usage per thread. Low occupancy works if you can keep the hardware busy with a small number of threads. This is true for AMD architectures too and isn’t unique to Nvidia.

As laid out in the presentation, low occupancy works when your dataset fits in shared memory or L1 and there isn't a lot of latency to hide to begin with. Lots of independent math instructions also help. That is a very specific set of conditions and is not a general rule that applies to all workloads.

Those conditions are easier to achieve in general compute workloads than in games. The second you have lots of traffic going to L2 or VRAM you need more threads in flight to hide that latency.
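A sketch of the kind of workload where the low-occupancy case holds (hypothetical kernel: the working set is staged into shared memory once, after which the inner loop is pure on-chip work, so few resident warps are needed):

Code:
// Hypothetical "data fits on chip" case: each 1024-thread block stages its
// tile into shared memory once, then the inner loop touches only shared
// memory and registers. With almost no off-chip latency left to hide, the
// kernel can run well even with few warps resident.
__global__ void iterate_on_tile(const float* in, float* out, int iters)
{
    __shared__ float tile[1024];                 // per-block working set
    int t = threadIdx.x;                         // assumes blockDim.x == 1024
    int base = blockIdx.x * 1024;

    tile[t] = in[base + t];                      // stage the tile once
    __syncthreads();

    float v = 0.0f;
    for (int k = 0; k < iters; ++k)              // on-chip-only inner loop
        v += tile[(t + k) & 1023] * 1.0001f;
    out[base + t] = v;
}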
 
Those conditions are easier to achieve in general compute workloads than in games.
I don't believe that to be the case; I previously quoted two "game" developers who clearly state that these conditions are widespread in games, and that chasing occupancy blindly can often result in degraded performance.

it’s an issue that nvidia needs to solve, as games continue to move away from baked technologies the shaders are going to get longer and more complex
I don't believe that to be the case either, as other more complex and shader heavy games showed none of that, despite them being optimized for consoles too (whether Xbox or PlayStation exclusives).

they aren’t going to rewrite all the shaders of a game to benefit nvidia, and no drivers will address it either
That remains to be seen. Though I really wish this weren't true at all. The game needs more performance.
 
I don't believe that to be the case either, as other more complex and shader heavy games showed none of that, despite them being optimized for consoles too (whether Xbox or PlayStation exclusives).
A convenient response when there is no way for this comparison to possibly occur.

That remains to be seen. Though I really wish this weren't true at all. The game needs more performance.
Again, a very important item to stress that continues to be left out: this game was optimized by Xbox ATG for a very long time.

Please consider the level of knowledge these guys have. These are the same team that will assist Coalition and all their first party teams. You will not find a team with more intimate knowledge of Xbox hardware and how to extract it than this group.

*****
According to Xbox gaming CEO Phil Spencer, the Xbox Advanced Technology Group (ATG) joined up early in Starfield's development to help Bethesda optimize the new space RPG for Xbox platforms.

Read more: https://www.tweaktown.com/news/9134...ng-optimize-starfields-performance/index.html
******
Discussion around the ATG group:

I largely suspect @Andrew Lauritzen probably has contact with them at some point in time. I saw his dx12 asteroid demos at Build 2015.

I’m all for armchair engineering fun, but when the best SMEs of a platform have been helping to work on a title for a long time, you have to give them credit that they know how best to optimize for their own platform.
 
I’m all for armchair engineering fun, but when the best SMEs of a platform have been helping to work on a title for a long time, you have to give them credit that they know how best to optimize for their own platform.

Yep, the downsides of high occupancy (cache thrashing etc.) also hurt AMD’s architectures, so clearly that wasn’t an issue in Starfield, else they wouldn’t have pushed occupancy as high as they did. Assuming of course they actually know what they’re doing :p

One big factor missing from this conversation is that we have no idea what Starfield occupancy actually looks like on the consoles. The chipsandcheese analysis was done on RDNA 3! We could be making a big fuss out of nothing.
 
That’s the point, they aren’t going to rewrite all the shaders of a game to benefit nvidia, and no drivers will address it either.
I don't see why they would need that. None of the console ports have required this so far, so why would Starfield be an exception?

Consoles have 256 KB of reg file and 32 wavefronts per CU, if I recall correctly; Navi 33 has the same 256 KB of reg file and 32 wavefronts per CU.

Shaders are mostly tailored for consoles, so I don't see why Ampere or Ada, which both feature 256 KB of reg file per SM and 48 wavefronts per SM, would have any trouble with console settings.
If anything, Ampere and Ada also have a higher capacity configurable L1/Shared memory space per SM, so I am not sure why anybody thinks RDNA 3 has any advantages here.
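Back-of-envelope, a 256 KB register file translates into resident waves roughly like this (a sketch only, ignoring allocation granularity, shared-memory limits and the per-SM/per-CU wave caps; not vendor-accurate occupancy math):

Code:
#include <cstdio>

// 256 KB = 64K 32-bit registers, as discussed above for both the consoles'
// CUs and Ampere/Ada SMs (lane width differs: wave64 vs warp32).
int wavesByRegisters(int regFileBytes, int regsPerThread, int lanesPerWave)
{
    int totalRegs32 = regFileBytes / 4;                  // 32-bit registers
    return totalRegs32 / (regsPerThread * lanesPerWave);
}

int main() {
    const int counts[] = {32, 64, 96, 128};
    for (int regs : counts) {
        printf("%3d regs/thread -> %2d warp32 or %2d wave64 resident\n",
               regs,
               wavesByRegisters(256 * 1024, regs, 32),
               wavesByRegisters(256 * 1024, regs, 64));
    }
    return 0;
}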

Yet, the RX 7600, with fewer CUs and with the downsized 256 KB register file per CU, is somehow faster in this game than the typically 25-40% faster 4060 Ti.
Moreover, we all know that power consumption is low on Ada, implying that there are bottlenecks (which I am still inclined to believe have nothing to do with the registers or cache bandwidth).

While more registers are certainly good for performance with shaders that are heaviest on register usage, I am pretty sure a 256 KB reg file would be the common middle ground and optimal choice for performance per square millimeter in modern games, as long as current consoles with the same 256 KB of reg file exist on the market. And I don't see the 384 KB register file size providing more than a 1-3% performance increase at most (certainly not in the 25-40% range) in current games, and likely at a higher relative area cost.

AMD may need more registers specifically for Ray Tracing (which was mentioned in the context of rays in flight on their RDNA 3 slides) and ray queries, because they need to embed the traversal shaders since they do traversal in SW, which results in high reg file pressure and low occupancy in the long ray query shaders, as can be seen in Portal with RTX, for example.
For typical rasterization shaders, though, 384 KB of reg file doesn't seem like a good choice, hence the 256 KB reg file in Navi 33. I would not be surprised to see the 256 KB reg file come back in the high end on AMD once (or if) there is hardware traversal in some future RDNA.
 
I don't see why they would need that. None of the console ports have required this so far, so why would Starfield be an exception?
There seems to be a public opinion that Starfield is woefully unoptimized, and to justify that argument they cherry-pick screenshots and point at frame rate and resolution on a 4090. To which my response is that they built the game the best they could, optimized for console; they can’t rewrite the fundamental pillars of the game to ensure Nvidia cards get more performance from it.

Shaders are mostly tailored for consoles, so I don't see why Ampere or Ada, which both feature 256 Kilobytes of reg file per SM and 48 wavefronts per SM, would have any trouble with console settings.
Absolutely! To the above point, no 4XXX cards have any issue smashing the game at console-level settings. They are complaining that ultra settings are not netting them some massively high number above AMD.

Moreover, we all know that power consumption is low on Ada, implying that there are bottlenecks (which I am still inclined to beleave have nothing to do with the registers or cache bandwidth).
I think if you’re sitting around waiting for data to arrive and you keep switching threads and don’t have many to work on, it’s sitting idle, and I think that should have a role to play in reducing power draw.
 
They are complaining that ultra settings are not netting them some massively high number above AMD.
I see the same picture across all presets and resolutions, unfortunately.

I think if you’re sitting around waiting for data to arrive and you keep switching threads and don’t have many to work on
The question is where it is sitting and waiting most of the time. My bet would be that it's not the top 3 shaders mentioned in the Chips and Cheese article (and likely not shaders at all), since accelerating all of the top 3 shaders by a factor of 1.5x would not bring the RTX 4090 to the point where it usually sits relative to the 7900 XTX. It would add just about +6% FPS on top of where it was, which still falls short of the average 25% performance difference between these GPUs in pure rasterization. I also have a hard time believing that in a console game optimized for the consoles' 256 KB reg file, every shader would suddenly be limited by the number of registers.
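For what it's worth, the arithmetic behind that kind of estimate is just Amdahl's law; the 18% share of frame time below is an assumed value picked to show how a 1.5x shader speedup lands in the ~6% range, not a number taken from the article:

Code:
#include <cstdio>

// Amdahl-style estimate: if the top 3 shaders take fraction f of the frame
// and are sped up by s, the whole frame speeds up by 1 / (1 - f + f/s).
int main() {
    double f = 0.18;   // assumed combined share of frame time for the top 3 shaders
    double s = 1.5;    // hypothetical speedup of those shaders
    double frameSpeedup = 1.0 / (1.0 - f + f / s);
    printf("whole-frame speedup: %.1f%%\n", (frameSpeedup - 1.0) * 100.0);   // ~6.4%
    return 0;
}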
 
Yeah occupancy is just a means to an end. However low occupancy is highly correlated with low hardware utilization and poor latency hiding. At least in the game workloads that I’ve seen.
The entire concept of persistent threads is "low occupancy", since you only spawn exactly as many threads as needed to match the number of execution units, which is 1 wave per SIMD. Spawning more threads than the available execution resources makes the progress of your program even more dependent on the hardware's scheduling algorithm. If the threads occupying the execution units are spinning on a lock that's held by another thread that's not active, chances are that your program will be deadlocked unless the hardware in question can guarantee a completely fair scheduling algorithm. If all the threads spinning on a lock and the threads holding that lock are occupying the execution units, then in many more cases we can observe forward progress guarantees for our program, since we only need the hardware's scheduling algorithm to fairly schedule work across all active threads.

Despite the low occupancy behind the persistent threads technique, it can be an optimization since it allows threads to efficiently reuse the registers and on-chip cache without having to over-synchronize the kernel from the host side. From this example we can see that occupancy isn't always or only the major factor in performance, as we can see further below from IHV accounts ...
[Attached slide: Optimizing DX12/DXR GPU Workloads]
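A minimal persistent-threads sketch in CUDA (the work item layout and processItem are hypothetical, and it side-steps the lock/forward-progress subtleties discussed above by using a simple work counter instead of locks):

Code:
#include <cuda_runtime.h>

// Instead of one thread per item, launch only as many blocks as the GPU can
// keep resident and let each block loop, pulling items from a global counter.
// Occupancy is low by construction, but registers and on-chip cache stay warm
// across items, and there is no per-dispatch host round trip.
__device__ int g_nextItem;

__device__ void processItem(int item, float* data)
{
    // placeholder: the whole block works on the 256 floats belonging to this item
    int base = item * 256;
    for (int k = threadIdx.x; k < 256; k += blockDim.x)
        data[base + k] *= 2.0f;
}

__global__ void persistentKernel(float* data, int numItems)
{
    __shared__ int item;
    while (true) {
        if (threadIdx.x == 0)
            item = atomicAdd(&g_nextItem, 1);          // one item per block per iteration
        __syncthreads();
        if (item >= numItems) return;                  // queue drained: whole block exits
        processItem(item, data);
        __syncthreads();                               // don't grab the next item early
    }
}

void launchPersistent(float* d_data, int numItems)
{
    int zero = 0, numSMs = 0, blocksPerSM = 0;
    cudaMemcpyToSymbol(g_nextItem, &zero, sizeof(int));
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, persistentKernel, 128, 0);
    // Launch exactly what fits on the machine, no more.
    persistentKernel<<<numSMs * blocksPerSM, 128>>>(d_data, numItems);
}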
 
The entire concept of persistent threads is "low occupancy", since you only spawn exactly as many threads as needed to match the number of execution units, which is 1 wave per SIMD. Spawning more threads than the available execution resources makes the progress of your program even more dependent on the hardware's scheduling algorithm. If the threads occupying the execution units are spinning on a lock that's held by another thread that's not active, chances are that your program will be deadlocked unless the hardware in question can guarantee a completely fair scheduling algorithm. If all the threads spinning on a lock and the threads holding that lock are occupying the execution units, then in many more cases we can observe forward progress guarantees for our program, since we only need the hardware's scheduling algorithm to fairly schedule work across all active threads.

Despite the low occupancy behind the persistent threads technique, it can be an optimization since it allows threads to efficiently reuse the registers and on-chip cache without having to over-synchronize the kernel from the host side. From this example we can see that occupancy isn't always or only the major factor in performance, as we can see further below from IHV accounts ...

Ideally, we would have single-threaded GPUs with perfect caching, tiny latencies and no need for TLP. However, that's not the world we live in. What happens when you run into a long-running memory op?

This is a common sight in Nvidia's profiler when looking at warp stall reasons. LGSB dominates. These are long scoreboard stalls - i.e. waiting on memory.

[Attached screenshot: warp stall reasons from Nvidia's profiler]
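A sketch of the kind of access pattern that shows up as long scoreboard stalls (hypothetical kernel: each hop's address depends on the previous load, so the warp can do nothing but wait unless other warps are resident to switch to):

Code:
// Pointer chasing: each load's address depends on the previous load's result,
// so the warp has nothing left to issue while it waits on memory. That wait is
// exactly what the profiler reports as a long scoreboard stall, and it is only
// hidden if other warps are resident to switch to. Assumes `next` holds valid
// indices (e.g. a permutation).
__global__ void chase(const int* next, int* out, int hops)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = tid;
    for (int k = 0; k < hops; ++k)
        idx = next[idx];                 // dependent load: must complete before the next hop
    out[tid] = idx;
}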
 