Current Generation Games Analysis Technical Discussion [2023] [XBSX|S, PS5, PC]

Well I didn't realize you weren't the original person I was talking to, so my bad.

In that light, I really don't know what you were trying to say then. My response was mainly intended for the arguments troyan was making.

troyan said the 4090 has 4x the paper compute throughput of the 6900 XT. You countered by saying the 4090 needs to run INT32 on the same pipes as FP32. My point is the 6900 also runs INT32 on the same pipes, so the paper flops comparison is valid. The difference is that only half of the 4090's ALU pipes can do INT32. The 6900 would theoretically have an efficiency advantage in workloads where over half of the instructions are INT32.

Flops don't come anywhere close to telling the whole story though, so it's probably a moot point. Bandwidth, cache efficiency, drivers and scheduling all seem to matter more than raw flops these days.
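To put rough numbers on that - a minimal sketch, using public boost clocks and ALU counts as approximate inputs and a toy issue model, not a profile of any real workload:

```python
# Back-of-envelope paper FP32 throughput, plus a toy model of how an
# instruction mix containing INT32 lands on Ada-style vs RDNA2-style ALU layouts.
# Clocks and ALU counts are approximate public figures.

def paper_tflops(alus, boost_ghz):
    return alus * 2 * boost_ghz / 1000  # FMA counted as 2 FLOPs per clock

rtx_4090 = paper_tflops(16384, 2.52)   # ~82.6 TFLOPS
rx_6900xt = paper_tflops(5120, 2.25)   # ~23.0 TFLOPS
print(f"paper FP32 gap: {rtx_4090 / rx_6900xt:.1f}x")  # ~3.6x

def ada_like(int_frac):
    # Half the lanes are FP32-only, half can issue FP32 or INT32.
    # Below a 50% INT32 share the shared half absorbs all the INT work;
    # above that, the shared half becomes the bottleneck.
    return 1.0 if int_frac <= 0.5 else 0.5 / int_frac

def rdna2_like(int_frac):
    # Every lane can issue either type, so any mix keeps all lanes busy.
    return 1.0

for f in (0.2, 0.5, 0.7):
    print(f"INT32 share {f:.0%}: Ada-style issue rate {ada_like(f):.2f}, "
          f"RDNA2-style {rdna2_like(f):.2f}")
```

Which is just the "the 6900 only pulls ahead per flop once INT32 is over half the mix" point in code form.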
 

They tested in a particularly GPU-heavy area. I wonder what is going on with the rendering here. Performance doesn't scale well at all with resolution, and it is very low overall.
 
The 4090 has additional hardware units to assist with the RT in Cyberpunk, whereas on AMD that work runs only on the CUs.

Are you sure iroboto was talking about Starfield there? I followed the quote chain back and it becomes ambiguous.
I used Starfield and UE5 as an example.

These RT cores do not matter. A modern nVidia GPU can process multiple different kinds of workloads concurrently.

Here are a few numbers from the latest games benched by TechPowerUp - 4090 vs. 6900 XT in 4K:
Ratchet: 1.79x
Armored Core: 1.84x
Baldurs Gate 3: 1.92x
Atlas Fallen: 2.12x
====================
Immortals: 1.6x
Starfield (other sites): 1.55x

And for ray tracing: in Ratchet the 4090 can get 60+ FPS with ultra settings and is 3x faster: https://gamegpu.com/action-/-fps-/-tps/ratchet-clank-rift-apart-v1-808-test-gpu-cpu
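To line those game ratios up against the rough paper-compute gap (the ~3.6x figure is the back-of-envelope number from earlier in the thread, so treat this as illustrative only):

```python
# Fraction of the paper FP32 gap (4090 vs 6900 XT, ~3.6x from public specs)
# that actually shows up as frame rate in the 4K ratios quoted above.

PAPER_FP32_RATIO = 3.6

game_ratios = {
    "Ratchet & Clank": 1.79,
    "Armored Core": 1.84,
    "Baldur's Gate 3": 1.92,
    "Atlas Fallen": 2.12,
    "Immortals of Aveum": 1.60,
    "Starfield": 1.55,
}

for game, measured in game_ratios.items():
    print(f"{game:18s} {measured:.2f}x delivered "
          f"({measured / PAPER_FP32_RATIO:.0%} of the paper gap)")
```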
 
From my perspective, it's important to separate 'optimized for Nvidia' from optimization in the sense that its inherent technologies have been optimized. If you want to make the case that Nvidia should be performing better, at least relative to AMD, that wasn't really what I was addressing; I think if we wait long enough, a game of this caliber will resolve those issues.

But the core technologies of UE5, for instance, don't leverage RT hardware; both Nanite and Lumen are, at their core, software solutions. And at heart, so is Starfield. They can technically run on really old GPUs because they've done everything in compute rather than relying on hardware accelerators.

I remember when Nvidia first came out with RT hardware, a lot of people gave it a thumbs down because they couldn't see the difference; they felt the performance trade-off wasn't worth it, and that hardware makers should have continued down the pure-compute path and built something equivalent. But with UE5 and Starfield, which have effectively gone that route, I think you can see that it's not that performant. And that's why, IMO, RT hardware and tensor cores are the future of GPUs. We can't scale this performance forever unless we somehow scale bandwidth along with compute.

That is what I wanted to point out. Not that the performance of the Nvidia chips couldn't get better; I'm just pointing out that this particular workload isn't playing to the strengths of Nvidia cards, and a lot of accelerators are left unused.
 

We're just now getting games that push the envelope of geometric complexity and dynamic lighting. The debate over the past ~5 years has been mostly theoretical, but soon we'll have a lot more data from truly next-gen AAA titles. It should be easier to have a real discussion of the tradeoffs between tech choices - Nanite, upscaling, path tracing, etc.

All that aside though it’s becoming increasingly apparent that modern games are not well suited to current hardware or vice versa. Subjectively the delivered visuals and performance just don’t make sense. And the limited profiling data we have supports the idea that hardware is being very poorly utilized, at least on PC. Starfield and Aveum are just the latest examples.
 
We still don't have a game tailored to current hardware by a technically competent studio. Spider-Man 2 seems disappointing so far, but it will be a good indicator of whether or not this gen delivers much over last gen.
 

Maybe but what if it never happens? What if the modern rendering pipeline is fundamentally broken and no amount of talent can fix it? It may be time for a complete paradigm shift. The current situation on PC is appalling.
 
Developers can always move back to baked lighting and such. That's where we are seeing massive gains in performance. No one is stopping developers from doing this, but the cost and the limitations on game design will make the game feel like a last-generation game.

Once a developer has made the move away from baked lighting, geometry, shadows and the old way of making things, I don't see any likelihood of them returning.
The old way makes costs significantly higher: games require more labour, offer less flexibility for changes, and are harder to scale to be larger or more procedural.
There's significantly more inherent risk with that method.

Our move to dynamic and real-time calculations is to free ourselves of these limitations, but the costs are being put back onto the hardware owners to resolve. The pressure will be on IHVs to deliver hardware acceleration to obtain performance. Nvidia predicted this well, and AMD decided to go another route to try to bridge the compute bandwidth problem. Those are just two ways to go about it. I don't know which solution will win out, but I think hardware acceleration is going to be an important piece moving forward unless we can somehow sort out the bandwidth problem.

Starfield could easily have been a smaller game with baked lighting, and therefore a massive performance boost, but they decided to go bigger. You can't bake a thousand procedural worlds, and you're not going to bake the atmospheric changes and day/night cycles of every major area you can interact with either. It's already a large game weighing in at 130GB (still smaller than COD), and it has a tiny footprint of 4GB of VRAM and 6GB of system memory. I think that says a lot about where we are headed.

With Sony titles you're seeing something else: a lot of games taking on very large VRAM pools, which is causing problems for cards under 12GB. That's another type of issue - the performance is there, but it's a mess once you run out of memory. You can push the limits of VRAM for baking, but eventually you'll hit a hard wall, and then you're forced to go real-time if you want higher-fidelity lighting, at the cost of performance.

That's the tradeoff I'm seeing right now, and people are struggling to see the difference between baked and real-time. Over time, game design will take more advantage of how real-time systems can be manipulated by players, and that will be the end of baked. At least IMO.
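As a rough illustration of why baking everything hits that wall, here's a minimal back-of-envelope sketch; every input below (surface area, texel density, bytes per texel, number of lighting states) is an assumed ballpark, not Starfield data:

```python
# Ballpark memory cost of baking lighting for a large area. All numbers are
# assumed placeholders, purely to show how quickly the total grows.

def baked_lighting_gb(surface_m2, texels_per_m, bytes_per_texel, lighting_states):
    texels = surface_m2 * texels_per_m ** 2
    return texels * bytes_per_texel * lighting_states / 1e9

estimate = baked_lighting_gb(
    surface_m2=10_000_000,   # assumed: ~10 km^2 of lightmapped surfaces
    texels_per_m=8,          # assumed lightmap density
    bytes_per_texel=8,       # assumed HDR color + directionality data
    lighting_states=6,       # assumed times of day / weather variants
)
print(f"~{estimate:.0f} GB of baked data")  # ~31 GB for one such area
```

Scale that across many explorable areas, or try to cover a full day/night cycle with enough states, and it either blows past disk/VRAM budgets or gets visibly coarse - which is the tradeoff described above.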
 
troyan said the 4090 has 4x the paper compute throughput of the 6900 XT. You countered by saying the 4090 needs to run INT32 on the same pipes as FP32. My point is the 6900 also runs INT32 on the same pipes, so the paper flops comparison is valid. The difference is that only half of the 4090's ALU pipes can do INT32. The 6900 would theoretically have an efficiency advantage in workloads where over half of the instructions are INT32.

Flops don't come anywhere close to telling the whole story though, so it's probably a moot point. Bandwidth, cache efficiency, drivers and scheduling all seem to matter more than raw flops these days.
Clearly the solutions aren't remotely similar, given what happened going from pre-Ampere to Ampere+ GPUs versus AMD GPUs: Nvidia basically *doubled* the theoretical TFLOPS figures while only gaining maybe 20-25% at best in performance, all else being roughly equal. That shows a massive discrepancy in performance per theoretical flop between Nvidia and AMD (before RDNA3) in a before-and-after comparison, which means the way they work is clearly not comparable. And it has nothing to do with lack of optimization (which is the main point here), because it manifests at every tier of GPU in their respective series.

We can literally compare Turing and Ampere GPUs in terms of TFLOPS and see that Ampere has *massively* less performance per flop than Turing did. Meanwhile, comparing Vega vs. RDNA1 vs. RDNA2, nothing remotely similar exists there. Yet with RDNA3 we also see a big regression (even bigger than Turing vs. Ampere, because RDNA3 seems to just be terrible): a 7800 XT has a stated 37 TF, yet performs very similarly to the 21 TF 6800 XT.
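A minimal sketch of that perf-per-paper-flop point, pairing cards that land at roughly the same real-world performance (the 7800 XT ≈ 6800 XT pairing is from the post above; the 3070 ≈ 2080 Ti pairing is the commonly observed one, so take both as approximations):

```python
# Paper TFLOPS needed to roughly match an older card's performance, using
# pairs that deliver similar frame rates. TFLOPS are public paper figures.

pairs = [
    # (older card, paper TF, newer card at ~same performance, paper TF)
    ("RTX 2080 Ti (Turing)", 13.4, "RTX 3070 (Ampere)", 20.3),
    ("RX 6800 XT (RDNA2)",   20.7, "RX 7800 XT (RDNA3)", 37.3),
]

for old_name, old_tf, new_name, new_tf in pairs:
    print(f"{new_name} carries ~{new_tf / old_tf:.1f}x the paper TFLOPS of "
          f"the {old_name} for roughly the same performance")
```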
 

Yes, there are 100 reasons flops aren't comparable between architectures. I was addressing your specific comment about INT32.
 

You're right, there's a cost to moving away from baked lighting, and the IQ benefits aren't always obvious (vs. a good baked implementation). It would be easier to swallow that cost if the hardware were being used effectively. It would be interesting to see how well XSX/PS5 are being utilized in modern dynamically lit games and how that compares to what we're seeing on PC. It's a similar concern as with RT, but RT could be explained away as a workload that wasn't a good fit for GPU hardware - i.e. branchy, latency-sensitive code.

The promise of compute pipelines is that you can do anything with enough bandwidth and flops. If the hardware were being slammed because bandwidth and compute were being pushed to the limit, that would be a great sign that the workload is a good fit for the architecture; then we would just need more hardware. What we're seeing so far, though, is that the hardware isn't being slammed. Performance is poor while hardware sits idle, which indicates a poor match between software and hardware.
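For anyone who wants to eyeball the "hardware isn't being slammed" part on their own PC, a minimal sampling sketch using NVIDIA's NVML via the pynvml bindings. Big caveat: NVML's utilization counter only says a kernel was resident during the sample window, not how full the ALUs are, so it's an upper bound; real occupancy analysis needs Nsight/RGP-style profiling.

```python
# Sample GPU "busy" and VRAM figures once a second while a game runs.
# Requires an NVIDIA GPU/driver and the pynvml bindings (pip install nvidia-ml-py).

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    for _ in range(30):  # ~30 seconds of samples
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"gpu busy {util.gpu:3d}%  mem controller {util.memory:3d}%  "
              f"vram used {mem.used / 2**30:4.1f} GiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```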
 
Agreed. Looking at the charts, there is definitely an optimization issue with Nvidia chipsets, so I hope that gets resolved, at least with respect to Starfield. I may wait for a patch as well; I have no intention of upgrading my 3070 until next gen, and somehow I've got to push this game to 3840x1440 or even 4K.

But yeah, it's clear that either some hard bottlenecks are being hit that need a workaround, or with this compute-based method there simply isn't enough bandwidth for that type of technique.
 
 
I don't really get the ROI on this sort of stuff. It's not going to help AMD sell more cards. It's just going to cement their reputation for being second class, with a cherry on top of actively plotting to degrade the end-user experience for 90% of the market. Not sure what's in it for Bethesda either, as AMD can't be paying them enough to compensate for the hit to their reputation and integrity.

Now that AMD has publicly said Bethesda is free to implement DLSS, and John is claiming they already did, maybe a patch will be coming soon.

Maybe one benefit is that it will open more eyes for those who still naively think AMD is a white knight.
 
I don't think AMD is lying when they say they didn't 'forbid' Bethesda from implementing DLSS, but I think there were enough strong implications that it didn't need to be said outright. Kind of like a mob boss: "I need you to take care of our FSR implementation, if you know what I mean. Keep it safe." lol

Not to say AMD are evil or anything, Nvidia have done some extremely anti-competitive things in the past as well, even very recently, but I think we can safely shelve any notion that AMD are the 'good guys'.
 
That is what I wanted to point out. Not that the performance of the Nvidia chips couldn't get better; I'm just pointing out that this particular workload isn't playing to the strengths of Nvidia cards, and a lot of accelerators are left unused.
Compute is not an argument. Here are some Blender results in which the 4090 is 3x+ faster than a 6900 XT: https://techgage.com/article/blender-3-6-performance-deep-dive-gpu-rendering-viewport-performance/
In LuxMark 4.0 the 4090 is 2x+ faster: https://techgage.com/article/nvidia-geforce-rtx-4060-ti-creator-review/

Even with just screen-space lighting, UE5 can't produce high enough frame rates. The difference between rasterizing with screen-space lighting and path tracing is ~2.2x in performance: https://imgsli.com/MjAzMDAw

In DESORDRE the 4090 can render a path-traced frame at 1080p (without reflections) faster than a rasterized frame in Starfield. How is this possible?! Just two years ago real-time path tracing was unimaginable.
 
Do we really need to explain why non-gaming results don't apply to gaming?
 
Compute is not an argument. Here are some Blender results in which the 4090 is 3x+ faster than a 6900 XT
So therein lies the issue. When would you be satisfied with the performance of UE5 and Starfield? When a 4090 is running 2x-3x faster than a 6900 XT? Are you trying to make the argument that in a properly optimized Starfield the 4090 would be somewhere between 140fps and 210fps at 1080p? Because that is what you're suggesting. Is there any data today that suggests a 4090 can go 50% faster? Is there any data today that suggests a CPU can run at those speeds?

Can't you see that using other, unrelated data points as floating anchors isn't really useful?
It may be useful in arguing that your 4090 should be doing better, sure. I don't doubt it; I want my 3070 to perform better too. But what exactly is the threshold here? This is a massive game that is updating massive amounts of data. If there were a CPU and memory that could support 200fps, I'd be blown away that Starfield could recalculate its lighting at 200fps. It's insanity if you think about it. This isn't Valorant or Counter-Strike, where all the prebaked calculations are done and you're just doing some simple math and filling the screen.

97 fps is roughly a 10ms frame time, 120fps is about 8.3ms, and 240fps is about 4.2ms. If you consider how much bandwidth is required to recalculate all the real-time cube-map reflections, atmospherics, lighting and shadows, how much more bandwidth would you need to keep going faster and faster? As far as I know the 4090 has about 1000GB/s of bandwidth; at ~97fps that works out to roughly 10GB of traffic per frame at 1080p as napkin math, with maybe 15GB per frame at 1440p and 20GB per frame at 4K.

So if we scale that to a 3080, which has 760GB/s: at 1080p that predicts nearly 76FPS. Guess what the benchmark shows: 71.5!
If we look at a 3060 Ti, which has 448GB/s: at 1080p that predicts about 44.8FPS. Guess what the benchmark shows: 47.4!

Is this a trend? I dunno, we need to do a deeper more thorough investigation. Napkin math is just that, napkin math.

Is bandwidth not a consideration? It has to be a natural bottleneck if compute is not.
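Putting that napkin math in one place - a minimal sketch where the ~10GB/frame figure is just back-fitted from the 4090's bandwidth and frame rate (not a measured quantity), and the measured fps values are the ones quoted above:

```python
# Napkin math: if Starfield at 1080p were purely limited by memory bandwidth
# at roughly 10 GB of traffic per frame (back-fitted from the 4090, not
# measured), predicted fps is just bandwidth / bytes-per-frame.

ASSUMED_GB_PER_FRAME_1080P = 10.0

cards = {
    # name: (memory bandwidth in GB/s, 1080p fps quoted above)
    "RTX 4090":    (1008, 97.0),
    "RTX 3080":    (760, 71.5),
    "RTX 3060 Ti": (448, 47.4),
}

for name, (bandwidth, measured) in cards.items():
    predicted = bandwidth / ASSUMED_GB_PER_FRAME_1080P
    print(f"{name:11s} predicted {predicted:5.1f} fps vs measured {measured:5.1f} fps")
```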
 
In DESORDRE the 4090 can render a path-traced frame at 1080p (without reflections) faster than a rasterized frame in Starfield. How is this possible?! Just two years ago real-time path tracing was unimaginable.

Probably because at 1080p the 4090 is heavily CPU bottlenecked in Starfield, making it a really odd comparison to make.

But it would be a strange comparison to make anyway, given that DESORDRE is a really simple game with really simple geometric shapes, and that all the stuff that gives Starfield its character (detailed models, detailed texturing, atmospherics, animated meshes, NPCs, dense vegetation) is missing from every screenshot I've seen of it.

It's so different from Starfield that I can't see anything meaningful coming from comparing the two.
 