Current Generation Games Analysis Technical Discussion [2023] [XBSX|S, PS5, PC]

The 7600 likely has much higher L2 bandwidth than the 4060 Ti based on the earlier analysis
Well, the 7600 would also have worse occupancy due to reduced register file capacity. But let's return to the 7900 XTX versus 4090 comparison. In the comparisons, both were tested at 4K with a 20 ms rendering time per frame. If the 4090 was faster by a typical margin of 25%, it would have achieved 62.5 FPS instead of 50. Moreover, even if the top 3 shaders were 2x faster on the 7900 XTX, the 7900 XTX would have reached roughly 54 FPS, making the 4090 still 15% faster.

Their conclusion regarding the superior occupancy seems quite far-fetched given the small number of tests. Shaders are rarely bound by register file capacity, which is why AMD opted for reduced capacity in its more cost-sensitive mainstream GPUs. For shorter shaders, the number of threads that can be scheduled matters more than register file size, and GPUs capable of scheduling more threads per SM can be faster in such shaders. So better performance in shaders doesn't come down solely to higher register file capacity; scheduling strategies and scheduling capacities play significant roles too.

What I find quite misleading is their choice to compare the number of threads in flight per partition, because the SM has twice as many partitions. So for a more direct and less dramatic comparison, both the number of threads in flight per SM and the register file capacity should be doubled. The difference between 7.4 and 10 threads in flight doesn't sound as dramatic as 3.7 versus 10, right?
 
Well, the 7600 would also have worse occupancy due to reduced register file capacity. But let's return to the 7900 XTX versus 4090 comparison. In the comparisons, both were tested at 4K with a 20 ms rendering time per frame. If the 4090 was faster by a typical margin of 25%, it would have achieved 62.5 FPS instead of 50. Moreover, even if the top 3 shaders were 2x faster on the 7900 XTX, the 7900 XTX would have reached roughly 54 FPS, making the 4090 still 15% faster.

The frame took 18.1ms on the 4090 and 20.2ms on the 7900 XTX. The 3 profiled dispatches account for 3.22ms on the 4090 and 3.33ms on the 7900 XTX which covers only 17% of the total frametime. We don't know what's happening in the remaining 83% of the frame as it wasn't profiled and we can't assume the 4090 was 25% faster on those workloads. In the end the 4090 was still 12% faster overall so it doesn't seem that mysterious.
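
Just to put the arithmetic in one place, here's a quick check in plain Python using only the figures quoted in these posts; it reproduces the ~17% and ~12% numbers up to rounding:

```python
# Sanity-check of the frame-time arithmetic, using the figures quoted above.
ms_4090, ms_xtx = 18.1, 20.2        # full frame times (ms)
disp_4090, disp_xtx = 3.22, 3.33    # the 3 profiled dispatches (ms)

print(f"4090: {1000 / ms_4090:.1f} fps, 7900 XTX: {1000 / ms_xtx:.1f} fps")
print(f"profiled share of the frame: {disp_4090 / ms_4090:.1%} (4090), {disp_xtx / ms_xtx:.1%} (XTX)")
print(f"overall 4090 advantage: {ms_xtx / ms_4090 - 1:.1%}")

# Hypothetical from the post above: the 3 dispatches running 2x faster on the XTX.
print(f"XTX with 2x faster top-3 shaders: {1000 / (ms_xtx - disp_xtx / 2):.1f} fps")
```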

Their conclusion regarding the superior occupancy seems quite far-fetched given the small number of tests. Shaders are rarely bound by register file capacity, which is why AMD opted for reduced capacity in its more cost-sensitive mainstream GPUs. For shorter shaders, the number of threads that can be scheduled matters more than register file size, and GPUs capable of scheduling more threads per SM can be faster in such shaders. So better performance in shaders doesn't come down solely to higher register file capacity; scheduling strategies and scheduling capacities play significant roles too.

Sure, but scheduling capacity is directly increased by large register file sizes, so they're not unrelated. Here's a trace from Portal RTX. The longest-running shaders are almost entirely occupancy-limited due to available register capacity. Compute, L1, L2 and RT hardware are far from saturated due to insufficient work in flight to cover latencies. It's especially bad on the long TraceRay call in the middle of the frame.

[Attached image: GPU trace of a Portal RTX frame]
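
To make the register-limited occupancy point a bit more concrete, here's a minimal sketch. The per-SM limits are the public Ada/Ampere figures (64K 32-bit registers, 48 resident warps); shared memory, block size and allocation granularity are deliberately ignored, so treat it as an illustration of the mechanism rather than a profiler replacement:

```python
# Rough estimate of how per-thread register allocation caps resident warps per SM.
def warps_resident(regs_per_thread, regfile=65536, max_warps=48, warp_size=32):
    return min(max_warps, regfile // (regs_per_thread * warp_size))

for regs in (32, 64, 96, 128, 168):
    w = warps_resident(regs)
    print(f"{regs:3d} regs/thread -> {w:2d} warps resident ({w / 48:.0%} occupancy)")
```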

What I find quite misleading is their choice to compare the number of threads in flight per partition, because the SM has twice as many partitions. So for a more direct and less dramatic comparison, both the number of threads in flight per SM and the register file capacity should be doubled. The difference between 7.4 and 10 threads in flight doesn't sound as dramatic as 3.7 versus 10, right?

It should either be number of threads per scheduler, per SM/CU or per chip. Double counting Nvidia schedulers would be confusing and even more misleading.
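
For illustration, here's that normalization spelled out with the figures from the post above. The partition counts assumed here (4 sub-partitions per SM, 2 SIMD32s per CU) are my reading of Ada and RDNA 3, so take it as a sketch:

```python
# The same measurement expressed at different granularities.
nv_per_sched, amd_per_simd = 3.7, 10.0   # warps/waves in flight per scheduler

print("per scheduler:", nv_per_sched, "vs", amd_per_simd)              # 3.7 vs 10
print("per SM vs CU :", nv_per_sched * 4, "vs", amd_per_simd * 2)      # 14.8 vs 20
print("ratio        :", round((nv_per_sched * 4) / (amd_per_simd * 2), 2))  # ~0.74
```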
 
Crazy. I think we’ve often theorized that this was possible, under very specific conditions of pure optimization; I didn’t really believe it could come true. Hell of a lot of things have to fall in place for this to happen though.
Yeah.

Problem is, the performance of Starfield, even on Radeon, is still not good enough for what is on display. The game lacks most modern rendering features, yet it performs like a modern title with those features; it should run a lot faster.
 
Yeah.

Problem is, the performance of Starfield, even on Radeon, is still not good enough for what is on display. The game lacks most modern rendering features, yet it performs like a modern title with those features; it should run a lot faster.
There are 3 branches of modern rendering going forward:
A) you bake everything and focus on streaming everything in.
B) everything is calculated at run time but using software and as many fast paths as possible
C) everything is run time like above but using ray tracing hardware.


I’m fairly positive that Starfield ticks (B).
And (B) is a reasonable stepping stone for engines transitioning to (C). I think (A) will have a harder time moving to real time, as all of its tech is built around making streaming better and better, with less focus on making real-time rendering run faster and faster.
 
Lack of SSR is weird.
SSR looks pretty terrible in most games with a third person camera, though. Knowing they were targeting 30fps on console, I imagine they weighed the image quality against their frame budget and decided cube maps were the way to go.

I’m fairly positive that Starfield ticks (B).
It's surprising that people on this forum don't see this. While I do think baked assets aren't a bad choice for some games, for a game like this I think it's important to have a dynamic time of day, reflections that don't break in 3rd person, and AO/GI that can take into account an airlock full of potatoes.
 
Have any details been shared on Starfield's GI and shadow implementations? Looks like reflections are real-time cubemaps per DF. Lack of SSR is weird.
I don’t recall fully but I think most of it could be cube maps for both GI and reflections.

@Dictator any update on how Bethesda approached these?
 
PC games have actually been pretty good on shader compiling recently. I remember Ghostrunner was a problem title in this area, especially bothersome in a game that's so fast-paced, so I downloaded the Ghostrunner 2 demo to see if they addressed it this time and... nope. Come on, man.

And hell, I was running in DX11.

Reinhard Von Hamid said:
The demo is a complete stutter fest

Devs, please, for the love of God and everything that is holy: Precompile. Your. Shaders. The game is basically unplayable due to the extreme amount of stuttering, especially if you launch it in DirectX 12, which can greatly improve CPU-limited performance. I also noticed quite a few traversal stutters during the demo, but since this game is using Unreal Engine 4, I sadly don't think they will be fixed before launch, just like in the first game. Oh well...

Definitely have noticed the traversal stuttering too; it's not all shader compilation, but there's no excuse not to at least have some brief precompile process for the shaders at this point.

It also has broken vsync: you will get drops long before you get to 100% GPU utilization. If you enable vsync and it can't maintain 60, it will bump around 48 fps at 80% GPU. Luckily you can force fast sync since it's DX11, but this really shouldn't be necessary. Working vsync and shader precompilation is very basic stuff in 2023.

There's the PS5 version without these issues of course, but man, whatever they're using for reconstruction isn't working too well: the shimmering and flickering (while also simultaneously looking quite blurry) is quite brutal, like we're talking Resident Evil 4 pre-patch levels here. The highest mode available on the PS5 is Quality, which still ran at 60 in my brief testing (from the description the game gives I guess it can drop below?). Quality is supposed to have the highest resolution, but I can't tell a difference in clarity between Quality and Performance.

I guess it's possible all modes on the PS5 have RT enabled, which would explain the low base resolution for reconstruction. In my brief experience forcing DX12 mode on the PC to try it, I was getting into the 40s with DLSS Performance at 4K along with even more massive stuttering; possibly not even Ultra Performance mode would have gotten it to 60. However, the game crashes about 10 seconds in whenever ray tracing is enabled; I guess that's why there's no menu option to enable DX12 and it defaults to 11.

If RT is always on for all modes on the PS5, it's an odd choice considering the big hit to image quality.
 
And hell, I was running in DX11.
Speaking of which, we are in 2023, and DX11 is still faster than DX12 in a 2023 title like Ghostrunner 2.

@4K, the 4090 is 20% faster in DX11 than DX12, 20%! Same for the 4080, while the 3080 Ti and 4070 Ti are 15% faster in DX11 than DX12. For AMD, the 7900 XTX is 4% faster in DX11 than DX12, while the 6900 XT is 8% faster.

So a modern GPU has to lose between 4% and 20% of its performance, for no visual gain, just to enable DX12. If you are an NVIDIA user and plan to play without ray tracing, there is no reason for you to use DX12 and lose fps for absolutely no gain whatsoever. You also lose smoothness, as stuttering increases with DX12. The same applies to AMD users as well.

Then, of course, comes the drop from ray tracing, and this game follows the standard performance difference between AMD and NVIDIA, where at native 4K Ada is 35% faster than RDNA3 (4080 vs 7900 XTX) and Ampere is 40% faster than RDNA2 (3080 Ti vs 6900 XT).

 
Speaking of which, we are in 2023, and DX11 is still faster than DX12 in a 2023 title like Ghostrunner 2.

@4K, the 4090 is 20% faster in DX11 than DX12, 20%! Same for the 4080, while the 3080 Ti and 4070 Ti are 15% faster in DX11 than DX12. For AMD, the 7900 XTX is 4% faster in DX11 than DX12, while the 6900 XT is 8% faster.

So a modern GPU has to lose between 4% and 20% of its performance, for no visual gain, just to enable DX12. If you are an NVIDIA user and plan to play without ray tracing, there is no reason for you to use DX12 and lose fps for absolutely no gain whatsoever. You also lose smoothness, as stuttering increases with DX12. The same applies to AMD users as well.

Then, of course, comes the drop from ray tracing, and this game follows the standard performance difference between AMD and NVIDIA, where at native 4K Ada is 35% faster than RDNA3 (4080 vs 7900 XTX) and Ampere is 40% faster than RDNA2 (3080 Ti vs 6900 XT).

These are not low-level programmers building these games. It's not surprising to me that DX11 performs better than DX12 for some of these studios.
 
B) everything is calculated at run time but using software and as many fast paths as possible
Even so, this has got to be the worst-performing real-time global illumination system ever. Worse than Red Dead Redemption 2, Quantum Break, and Cyberpunk 2077. Also vastly worse than Metro Exodus and Dying Light 2 (with their ray traced global illumination and reflections). Starfield runs far worse than any of them, while not offering a better visual impact.

I could also list Watch Dogs 2, Ghost Recon Breakpoint, Far Cry 6, Days Gone, Battlefield 2042, Anthem (any modern Frostbite game to be honest)
 
These are not low-level programmers building these games. It's not surprising to me that DX11 performs better than DX12 for some of these studios.
I agree. The question is, how long will this remain the case? Baldur's Gate 3 was released last month with the same problem: DX11 performing better than Vulkan even in CPU-limited scenarios.

How can these APIs remain hard to master after all this time? This is a fundamental problem that needs to be addressed as soon as possible, and it's a major oversight from the API designers. When your latest API consistently performs worse than your decades-older API, you know you've made a major mistake and need to come up with a new solution fast.
 
The Chips and Cheese findings regarding CU usage and shader size are particularly interesting, imho, when considering the recent info from the Forza Motorsport devs about the complexity of their shaders, and how they optimized for a smaller number of total shaders in the shader book/shader bible, but had more complex shaders.

Perhaps a similar optimization is going on there, which again would not surprise me given the high likelihood of XTG being involved.

If only XTG could get a 40 fps mode and more VRR modes in!
 
The Chips and Cheese findings regarding CU usage and shader size are particularly interesting, imho, when considering the recent info from the Forza Motorsport devs about the complexity of their shaders, and how they optimized for a smaller number of total shaders in the shader book/shader bible, but had more complex shaders.

Perhaps a similar optimization is going on there, which again would not surprise me given the high likelihood of XTG being involved.

If only XTG could get a 40 fps mode and more VRR modes in!

What's interesting with something like this, and what I feel isn't often considered in optimization discussions, is when an approach might be more optimal in one area but possibly detrimental in another. We tend to use the term optimization generically, but is it right to use it that way in scenarios like that?

It's worth noting that in Nvidia's DX12 guide they recommend:

  • Expect to maintain separate render paths for each IHV minimum
    • The app has to replace driver reasoning about how to most efficiently drive the underlying hardware

But in terms of the broader earlier discussion with respect to Starfield and the recent ChipsandCheese article making the rounds, I have a sense that many people might be latching onto it and, in some circles, interpreting it too much from the IHV-battle standpoint. Given the limited data, and that it only really covers one side (we don't have any information on the game and how Starfield is coded), should we be making broad critiques of the hardware side's merits/demerits?
 
What's interesting with something like this, and what I feel isn't often considered in optimization discussions, is when an approach might be more optimal in one area but possibly detrimental in another. We tend to use the term optimization generically, but is it right to use it that way in scenarios like that?

It's worth noting that in Nvidia's DX12 guide they recommend:

  • Expect to maintain separate render paths for each IHV minimum
    • The app has to replace driver reasoning about how to most efficiently drive the underlying hardware
But in terms of the broader earlier discussion with respect to Starfield and the recent ChipsandCheese article making the rounds, I have a sense that many people might be latching onto it and, in some circles, interpreting it too much from the IHV-battle standpoint. Given the limited data, and that it only really covers one side (we don't have any information on the game and how Starfield is coded), should we be making broad critiques of the hardware side's merits/demerits?

LoL, expecting game devs to maintain separate render paths for each IHV seems like asking too much. I suspect most devs maintain one path, i.e. optimize mostly for the common case (e.g. NVIDIA) and let the others be.

This isn't just about slightly different compile flags when compiling your shaders (though that might help a bit); it's about the overall rendering design and shader complexity. Sure, CPU compilers have a favor-speed/favor-size flag, but I'm not sure how relevant or comparable that would be to a GPU shader.

I agree that it's a very limited data set to draw conclusions from, but since the current observations are so radically different from what we have seen previously, I think it's a particularly interesting data point. It also makes the upcoming Forza release an interesting one.

IF, and it's a big IF, we see similar results on Forza Motorsport, I think we could start to talk about a trend, and possibly even draw some confident conclusions about WHY these titles perform better on AMD hardware vs NVIDIA.

**Caveat: the Forza results will be muddied further by the presence of RT ops, which we already know are much faster on NV hardware. Perhaps a non-RT benchmark of Forza at 4K would be useful to test the "AMD prefers more complex shaders" theory.

Whatever the result, I love the analysis by Chips and Cheese; they are digging deeper and presenting more data than anyone else! Kudos!
 
With respect, they are discussing the SM not the ray tracing accelerators.
Every execution unit uses the same SM resources. FP32 units and RT accelerators can work concurrently. That's the reason why you can use ray tracing without a huge performance impact on Nvidia GPUs when inefficient rendering techniques get replaced with hardware ray tracing.
 
Yeah.

Problem is, the performance of Starfield, even on Radeon, is still not good enough for what is on display. The game lacks most modern rendering features, yet it performs like a modern title with those features; it should run a lot faster.
I really do not get this. I mean, it's a game, and the main point is: is it a fun game? Also, on every project I have ever been involved in that is not open source by nature, we hit our targets, and the code and other things that will not be published can/will be a serious mess (well, internal docs should be good too) as long as it hits the goal we set out to achieve.
And if Bethesda never set out to make a technical masterpiece, how can we say it is a problem that it is not?
I am not against people analysing it and pointing out things, but expecting the game to hit targets that we set is the problem in my book.
 
Given the limited data, and that it only really covers one side (we don't have any information on the game and how Starfield is coded), should we be making broad critiques of the hardware side's merits/demerits?

Starfield is just one data point of many. The relationship between register usage, occupancy, latency and bandwidth has been discussed (mostly by Nvidia) since CUDA 1.0. I think we can objectively say RDNA 3 is better equipped than Ada to utilize available compute and bandwidth in complex workloads. Less register intensive workloads that fit in L2 will favor Ada.
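
A back-of-the-envelope version of that relationship is just Little's law: the bytes you need in flight equal bandwidth times latency. The numbers below are purely illustrative assumptions, not measurements from any particular GPU:

```python
# Little's law view of latency hiding: outstanding work = bandwidth * latency.
latency_ns = 400        # assumed average memory latency
bandwidth_gbs = 800     # assumed sustained bandwidth target (GB/s)
req_bytes = 128         # assumed request size

bytes_in_flight = bandwidth_gbs * latency_ns   # GB/s * ns == bytes
print(f"{bytes_in_flight:,.0f} bytes in flight needed "
      f"({bytes_in_flight / req_bytes:,.0f} outstanding {req_bytes}-byte requests)")
```

The point being: the less work in flight a shader sustains, the less of that latency gets covered and the further measured bandwidth falls below the paper number.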
 
Even so, this has got to be the worst-performing real-time global illumination system ever. Worse than Red Dead Redemption 2, Quantum Break, and Cyberpunk 2077. Also vastly worse than Metro Exodus and Dying Light 2 (with their ray traced global illumination and reflections). Starfield runs far worse than any of them, while not offering a better visual impact.

I could also list Watch Dogs 2, Ghost Recon Breakpoint, Far Cry 6, Days Gone, Battlefield 2042, Anthem (any modern Frostbite game to be honest)
This is such a lazy statement that lacks nuance and detail. None of those titles do GI the same way; everyone does it differently, some with hardware RT and others without. Everyone can say they do GI, but that doesn't mean GI is applied to everything in the scene; it could just be applied to the environment and the characters. What about GI for thousands of physically movable objects? What about GI coming from light sources millions of miles away? What about light leakage? What about the speed of updates and the draw distance of the GI?

What about how much latency there is in a game? RDR2 has something like triple-buffered frames. RDR2 has baked lighting, with the element of a global lighting source applied on top of that baked lighting.

The GPU only cares about technical weight; art is something else entirely, and most people appear to be mixing up art with technical capability.

Most people knocking on Starfield, I suspect, have never really played it.

edit: I hate having to do other people's homework:
RDR2 Graphical Study
  • Baked AO once per day global light shift
  • Global light pass

    The game renders a fullscreen quad for calculating directional lighting which is moonlight in this case. There is some baked lighting from the "top-down world lightmap" mentioned previously.
  • Not the same as Starfield

Quantum Break Siggraph
  • Slide 21: Let's begin with global illumination. Ideally, the solution would be fully dynamic, and, for every frame, we could compute the global illumination from scratch
  • Slide 22: We experimented with voxel cone tracing and virtual point lights but it became clear that achieving the level of quality we wanted was too expensive with these techniques.
  • Slide 23: So, during pre-production, it became clear that we needed to use some form of precomputation.
Not Starfield

Ubisoft Titles (Watch Dogs 2, GR Breakpoint, FC 6)
All use some iteration of AnvilNext 2.0
  • All pre-computed Global Illumination with Light Probes
  • FB:WATCH_DOGS does support Global Illumination. We prioritized developing our Global Illumination technique since we believe that’s one of the key feature that differentiates between current-generation and next-generation games running in a fully dynamic world (i.e.: no pre-rendering possible). Our technique is custom made to fit all the requirements of our environment (urban city, vast exteriors, detailed interiors). At its core, it uses light probes that are baked at the highest quality using a vast cloud of computers.
Not Starfield

All Frostbite Engines
  • BVH-based brute-force renderer for baking
  • All GI is baked; indirect lighting is done by lightmaps

Let's talk about the ray traced solutions of Metro Exodus and Days Gone, as well as Cyberpunk.
Yeah, those are going to be very high quality and run well because of hardware acceleration.
How well will the BVH function with 10,000 objects on screen that can all move around due to physics? Can they update a BVH that quickly?

I have my doubts.
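
For what it's worth, refitting an existing BVH (as opposed to rebuilding it) is just a linear bottom-up pass over the tree, which is why engines can tolerate a lot of movers; whether this particular engine could is another question. A toy 1-D sketch of the idea, not anyone's actual implementation:

```python
# Toy BVH refit: after objects move, recompute node bounds bottom-up without
# touching the tree topology. 1-D intervals stand in for AABBs to keep it short.
from dataclasses import dataclass

@dataclass
class Node:
    left: int = -1      # child indices, -1 for leaves
    right: int = -1
    obj: int = -1       # object index for leaves
    lo: float = 0.0
    hi: float = 0.0

def refit(nodes, i, positions, radius=0.5):
    n = nodes[i]
    if n.obj >= 0:                              # leaf: bounds follow the moved object
        n.lo, n.hi = positions[n.obj] - radius, positions[n.obj] + radius
    else:                                       # inner node: union of the children
        refit(nodes, n.left, positions)
        refit(nodes, n.right, positions)
        n.lo = min(nodes[n.left].lo, nodes[n.right].lo)
        n.hi = max(nodes[n.left].hi, nodes[n.right].hi)

# Tiny example: two movable objects under one root.
nodes = [Node(left=1, right=2), Node(obj=0), Node(obj=1)]
refit(nodes, 0, positions=[1.0, 4.0])
print(nodes[0].lo, nodes[0].hi)   # root bounds now cover both objects: 0.5 4.5
```

The real cost questions are refit quality (a refit-only tree degrades as things keep moving) and how often a full rebuild is needed; that, rather than raw object count, is usually the interesting part.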
 
with 10,000 objects on screen that can all move around due to physics
This can't be the reason. Every time someone criticizes an aspect of Starfield, they get shut down because of the 10,000-object thing. No, when the game is running this suboptimally it is not rendering thousands of objects on screen; it is running regular, plain scenes found in dozens of games with standard object/NPC/animation counts.

Not Starfield
What is Starfield exactly? What does it do so differently that elevates it above the mentioned examples and justifies its significantly higher cost?
 