Digital Foundry Article Technical Discussion [2023]

Wasn't really expecting the M1 Max to run NMS at such a terrible framerate. What's the bottleneck here? Maybe the CPU?

I hope DF can do an in-depth discussion of NMS' Mac port in the future (or more Mac vs. PC port comparisons). From what I've seen, most games like NMS and BG3 have a significant graphics downgrade under the "same" settings.
 

It's emulating x86 on ARM as well as taking D3D12 and translating it to Metal in real-time. That introduces a ton of performance issues that wouldn't exist in native rendering.
 
Pretty sure they announced NMS as a native Mac port at WWDC. I mean, they even integrated MetalFX temporal upscaling. No way this is emulation.
 
Not sure if there are two versions, but this says it's available on Apple silicon and Intel. MetalFX doesn't require Apple silicon; it will run on AMD and Intel GPUs as well.
 
Did a bit of searching, and here's what they say on the official website:

"No Man’s Sky has been built from the ground up with a new rendering pipeline to take full advantage of Metal and Apple silicon. In addition to the Apple silicon Mac lineup, it also runs on select Intel-based Macs."

So I think it should be native.
 

Quick summary: there are still issues with core utilization, and while the asynchronous shader cache + skip-draw approach helps with shader stutter, it is not a solution by itself. It can still produce significant stutters on a cold cache, so games will likely have to combine a precompilation step with the asynchronous path. Hopefully most will!

There is also still traversal stutter. Overall, improved, but... not exactly encouraging.
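To make the combined precompile-plus-async approach concrete, here's a minimal C++ sketch of the idea. PsoKey, Pso and CompilePso are hypothetical stand-ins, not Unreal's or any driver's real API: precompile everything the development-time gathering step recorded, and fall back to async compilation plus a skipped draw for anything it missed.

```cpp
// Minimal sketch (made-up PSO API): precompile known pipeline states, and fall
// back to async compile + skip-draw for anything the gathering step missed.
#include <chrono>
#include <cstdint>
#include <future>
#include <thread>
#include <unordered_map>
#include <vector>

struct Pso {};                     // stands in for a compiled pipeline state object
using PsoKey = std::uint64_t;      // hash of shader bytecode + render state

Pso CompilePso(PsoKey) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // pretend driver compile
    return {};
}

std::unordered_map<PsoKey, std::shared_future<Pso>> g_cache;     // touched from one thread only

// Step 1: cold-cache avoidance. Kick off compilation of every PSO the
// development-time gathering step recorded, e.g. during a loading screen.
void PrecompileKnownPsos(const std::vector<PsoKey>& known_keys) {
    for (PsoKey key : known_keys)
        g_cache.emplace(key, std::async(std::launch::async, CompilePso, key).share());
}

// Step 2: the async + skip-draw fallback for anything that was missed.
// Returns false (and skips the draw) instead of stalling the frame.
bool TryDraw(PsoKey key) {
    auto it = g_cache.find(key);
    if (it == g_cache.end())
        it = g_cache.emplace(key, std::async(std::launch::async, CompilePso, key).share()).first;
    if (it->second.wait_for(std::chrono::seconds(0)) != std::future_status::ready)
        return false;              // object pops in a frame or two later
    Pso pso = it->second.get();    // bind pso and issue the draw here
    (void)pso;
    return true;
}

int main() {
    PrecompileKnownPsos({1, 2, 3});
    while (!TryDraw(4)) std::this_thread::sleep_for(std::chrono::milliseconds(16));  // "frames"
}
```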
 
Sounds like the PC platform as a whole just isn't equipped to deal with the increased complexity and quantity of shaders in games in any ideal way.

Chalk this up as another win for consoles this generation, showing how fixed hardware has some pretty significant advantages.

Not to say PC gaming is bad in comparison; it still has plenty of advantages. But it's not going to be the 'consoles but inherently better' situation it's been for the past decade or so.
 


Great video. One note: this looks like the classic case where the bottleneck is somewhere other than CPU/GPU compute resources, especially seeing the load spread nice and evenly across all the cores while utilization stays extremely low, in some cases under 50% on all cores.

If you're memory-subsystem bound, you can throw as many CPU cores and as much CPU frequency at the problem as you want, but you won't get much faster; you just have different things stalling while waiting for data (a quick sketch of that effect follows below).

I'd be really curious to re-run the same tests with either:

1) a Raptor Lake CPU, with fast/slow DDR4 and fast/slow DDR5
2) an AMD platform, two nearly identical SKUs with and without V-Cache (or a 7950X3D, testing each CCD separately)

My hunch is that it's going to scale with memory latency/bandwidth more than it will with cores.
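For what it's worth, here's a rough, self-contained C++ sketch of the saturation effect being described, not a rigorous benchmark: summing a buffer far larger than any L3 cache is limited by DRAM bandwidth, so beyond a handful of threads the total time barely improves no matter how many cores you throw at it. Buffer size and thread counts are arbitrary.

```cpp
// Rough illustration of being memory-subsystem bound: fixed total work split
// across N threads stops getting faster once DRAM bandwidth is saturated.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const size_t n = size_t{1} << 26;          // 512 MiB of uint64_t, well past any L3
    std::vector<uint64_t> data(n, 1);

    for (unsigned threads : {1u, 2u, 4u, 8u, 16u}) {
        std::vector<uint64_t> partial(threads, 0);   // per-thread sums (false sharing ignored)
        auto t0 = std::chrono::steady_clock::now();
        {
            std::vector<std::thread> pool;
            const size_t chunk = n / threads;        // thread counts chosen to divide n evenly
            for (unsigned t = 0; t < threads; ++t)
                pool.emplace_back([&, t] {
                    const uint64_t* b = data.data() + t * chunk;
                    partial[t] = std::accumulate(b, b + chunk, uint64_t{0});
                });
            for (auto& th : pool) th.join();
        }
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        uint64_t total = std::accumulate(partial.begin(), partial.end(), uint64_t{0});
        std::printf("%2u threads: %7.1f ms (sum=%llu)\n", threads, ms,
                    static_cast<unsigned long long>(total));
    }
}
```

On a typical desktop the time usually stops dropping around 4-8 threads even though per-core utilization looks low, which is the signature described above.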
 
I obviously don't understand the problem, because I've worked in many software environments where you need to track certain resources. Why can't shaders be specific types of assets that are tracked, isolated, and pre-compiled during download/installation/setup?

I feel like I'm missing something.
 

We have to see more actual UE5 games built with 5.2 and later; with precompilation, this async method could indeed reduce shader stuttering to the point where it's extremely negligible. Bear in mind Alex was testing the async/draw-skip path in a worst-case scenario, relying on it entirely rather than as a complement. AFAIK UE5 has also significantly increased the scope of its PSO-gathering process, so the precompilation step is far more complete now. I'm just concerned about whether developers will include it, or decide async is 'good enough'.

But yes, it's a very complex problem, which is why there isn't one single, simple solution to it. Frankly, I'm more concerned about the traversal stutters.
 

If we were seeing good gains from 6 to 8 cores but not from, let's say, 8 to 10, then maybe, but here I doubt this is mainly a bandwidth problem. Threading is hard, and I just believe it wasn't the focus of UE5; the problem is rooted deep in the UE rendering pipeline. I'm afraid we'll end up in a Unity situation, where the engine is used a lot because the editor is good but performance is irregular or bad. You could argue we're already there. Time will tell, but it's not looking good.

Great video.
 

The core scaling comparison wasn't specifically what I was referring to. It looks like they've done pretty great work job-ifying everything, and in the parts of the video where Alex shows the detailed performance counters, load is spread extremely well across all the cores. None of them are working very hard, though, nor is the GPU. There's one CPU core with somewhat higher utilization than the others that hits the 60s and 70s percentage-wise, but none of them get even close to 100% like you'd expect with the typical single-core bottleneck / lack of multithreading. It looks to me like the cores just aren't being fed.

Just for fun, and I know it's very much wishful thinking, I'd also be curious to see how one of the modern frequency-optimized Epyc SKUs does in the same test.
Something like a 9174F with 256 MB of L3 cache, a 4.4 GHz boost clock, and 12-channel DDR5: https://www.amd.com/en/products/cpu/amd-epyc-9174f

[Attached image: DFUE52.jpg]
 

Sounds more like an incredible amount of tech debt that needs to be dealt with on the PC side: there are so many hardware configurations that you can't simply ship a precompiled set of shaders. The question is whether they're going to go back and address these issues, or just let hardware brute-force its way forward.


I personally wouldn't mind a 30-60 second compile when the game updates or my hardware changes if it means better performance in the game. I also wouldn't mind sharing my hardware configuration with the game company so it can be used for other machines with similar configurations.
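A hypothetical sketch of that idea: key the local shader cache on GPU, driver, and game build, and only re-run the 30-60 second precompile when one of those changes. The GpuInfo fields and hashing scheme here are made up for illustration; a real launcher or engine would query them through its graphics API, and the same key could index a publisher-hosted cache shared between machines with identical configurations.

```cpp
// Illustrative only: one cache key combining everything that can invalidate
// compiled pipelines (hardware, driver, game build).
#include <cstdint>
#include <functional>
#include <string>

struct GpuInfo {
    std::string vendor;   // e.g. "AMD", "NVIDIA", "Intel"
    std::string device;   // e.g. "Radeon RX 7900 XTX"
    std::string driver;   // driver version string
};

uint64_t ShaderCacheKey(const GpuInfo& gpu, const std::string& game_build) {
    std::hash<std::string> h;
    uint64_t key = h(gpu.vendor);
    auto mix = [&](const std::string& s) { key = key * 1099511628211ull ^ h(s); };
    mix(gpu.device);
    mix(gpu.driver);
    mix(game_build);
    return key;
}

bool NeedsPrecompile(uint64_t stored_key, const GpuInfo& gpu, const std::string& build) {
    return stored_key != ShaderCacheKey(gpu, build);   // hardware/driver/patch changed
}

int main() {
    GpuInfo gpu{"AMD", "Radeon RX 7900 XTX", "23.7.1"};
    const uint64_t stored = ShaderCacheKey(gpu, "build-1.0.3");
    gpu.driver = "23.8.1";                              // driver update invalidates the cache
    return NeedsPrecompile(stored, gpu, "build-1.0.3") ? 0 : 1;
}
```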
 
It's largely an API (DX12) problem. The Vulkan working group outlined this problem specifically in their latest blog post update, which is why they've added extensions to Vulkan that mitigate it.
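The post doesn't name the extensions, but VK_EXT_graphics_pipeline_library and VK_EXT_shader_object are the usual candidates for mitigating pipeline compilation stutter. A small sketch of how an engine might probe for them (purely illustrative, error handling omitted):

```cpp
// Check which pipeline-stutter mitigation extensions a GPU/driver exposes.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <cstring>
#include <vector>

static bool HasExtension(VkPhysicalDevice gpu, const char* name) {
    uint32_t count = 0;
    vkEnumerateDeviceExtensionProperties(gpu, nullptr, &count, nullptr);
    std::vector<VkExtensionProperties> props(count);
    vkEnumerateDeviceExtensionProperties(gpu, nullptr, &count, props.data());
    for (const auto& p : props)
        if (std::strcmp(p.extensionName, name) == 0) return true;
    return false;
}

int main() {
    VkApplicationInfo app{VK_STRUCTURE_TYPE_APPLICATION_INFO};
    app.apiVersion = VK_API_VERSION_1_1;
    VkInstanceCreateInfo ici{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    ici.pApplicationInfo = &app;
    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t gpuCount = 0;
    vkEnumeratePhysicalDevices(instance, &gpuCount, nullptr);
    std::vector<VkPhysicalDevice> gpus(gpuCount);
    vkEnumeratePhysicalDevices(instance, &gpuCount, gpus.data());

    for (VkPhysicalDevice gpu : gpus) {
        VkPhysicalDeviceProperties p{};
        vkGetPhysicalDeviceProperties(gpu, &p);
        std::printf("%s: graphics_pipeline_library=%d shader_object=%d\n",
                    p.deviceName,
                    HasExtension(gpu, "VK_EXT_graphics_pipeline_library"),
                    HasExtension(gpu, "VK_EXT_shader_object"));
    }
    vkDestroyInstance(instance, nullptr);
}
```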
 
From talking to Unreal people behind the scenes: UE5 has a non-parallelised (single-threaded) render thread, and that may be the core issue witnessed here (pun not intended), i.e. the one core sitting at around 65% while most of the rest are at around 30%.

Game thread, Render thread, then a bunch of jobs, right?
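As a toy illustration of that threading model (purely conceptual, not UE code): many worker jobs can produce draw commands in parallel, but if everything funnels through a single render thread for submission, that one thread caps the frame time while the workers sit mostly idle, which matches the "one busy core, many half-idle cores" profile.

```cpp
// Toy model: N worker jobs produce cheap "draw commands" in parallel, one
// render thread consumes them serially and becomes the bottleneck.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::queue<int> render_queue;       // draw "commands" produced by worker jobs
std::atomic<bool> done{false};

void WorkerJob(int frames, int draws_per_frame) {
    for (int i = 0; i < frames * draws_per_frame; ++i) {
        std::this_thread::sleep_for(std::chrono::microseconds(50));  // cheap parallel work
        { std::lock_guard<std::mutex> lk(m); render_queue.push(i); }
        cv.notify_one();
    }
}

void RenderThread() {
    while (true) {                  // the single consumer: every command funnels through here
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !render_queue.empty() || done; });
        if (render_queue.empty() && done) return;
        render_queue.pop();
        lk.unlock();
        std::this_thread::sleep_for(std::chrono::microseconds(200)); // serial submit cost
    }
}

int main() {
    const int workers = 8, frames = 10, draws = 100;
    auto t0 = std::chrono::steady_clock::now();
    std::thread render(RenderThread);
    std::vector<std::thread> jobs;
    for (int i = 0; i < workers; ++i) jobs.emplace_back(WorkerJob, frames, draws);
    for (auto& j : jobs) j.join();
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
    render.join();
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - t0).count();
    std::printf("%d workers producing, 1 render thread consuming: %.0f ms\n", workers, ms);
}
```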
 