Digital Foundry Article Technical Discussion [2023]

I wholeheartedly feel like DirectStorage will be an absolute gamechanger on PC for a variety of applications. The real issue right now is getting the momentum up and running to standardize it across the platform. But this technology really is impressive just from looking at early test results. It shouldn't be underestimated just because we haven't seen it fully utilized yet.
 
I wholeheartedly feel like DirectStorage will be an absolute gamechanger on PC for a variety of applications. The real issue right now is getting the momentum up and running to standardize it across the platform. But this technology really is impressive just from looking at early test results. It shouldn't be underestimated just because we haven't seen it fully utilized yet.
I mean, there should be no doubt about it. It's inevitable and necessary. The days of rapid increases in VRAM are over because of how slowly the cost per GB improves nowadays. It's pretty insane that we still have midrange 8GB GPUs these days, when 8GB cards first showed up nearly a decade ago with the R9 290/290X and were already normal midrange models by 2015.

DirectStorage is absolutely critical to getting a ton more out of the VRAM we do have, much like this same paradigm is for the new consoles and their limited 16GB of total RAM. It has to happen. And it will not just be some 'feature option'; it's soon going to become a core, fundamental part of the way new games are built, with SSDs as a true minimum requirement.

To my subjective eyes, the texture quality in TLOU is clearly better. For A Plague Tale, I'm in the market area at the start of the game, where NPCs are repeated and dedicated VRAM usage is still 9GB at 4K with DLSS Quality.
The only advantage I think I could give to TLOU in the texture/shader department here is the sheer amount of decent quality materials they have authored. It's one of those areas that really demonstrates Naughty Dog's AAAA-level budget, with just how many different assets they (and their dozen+ outsourced asset creation studios...) deliver throughout the chapters, all at a consistent, quite high quality.

But at an individual asset level, I don't really think it's doing anything particularly special here at all.
 
You were talking about PCIe being a bottleneck earlier. Now you're talking about DDR. Which is it? They're completely different things. Also, the above statement makes no sense. If we swap in the bandwidth of the different segments as follows: 7GB/s -> 64GB/s -> 500GB/s, how do you conclude that the RAM (the middle one) is the bottleneck?
I don't know if you remember, but at the start of this I said "excluding storage bottlenecks". The reason I said that is that both hUMA and NUMA suffer from this issue. With regards to the rest of my statement, I should have been clearer. I've included a picture below to better represent what I was trying to explain. Architecturally speaking, there are a lot of inefficiencies in this one picture.

numa.png
 
The most interesting part for me has been that everybody tries to pinpoint why PC TLOU does not just crush PS5 TLOU in every instance possible, without really acknowledging that TLOU was a game made for PS3 that got ported to PS4, then ported again to PS5, and then ported again to PC.
Heck, the design and implementation choices made per port must have added so much cruft to jury-rig things into working that it probably got more "Here be dragons" comments than the map Columbus had when he set sail for India.

Btw, did ND redo the graphics each time, or at least for PS3 -> PS4? Or were the original files high-def enough to just scale down again? Or are they just using the same graphics?
 
I wholeheartedly feel like DirectStorage will be an absolute gamechanger on PC for a variety of applications. The real issue right now is getting the momentum up and running to standardize it across the platform. But this technology really is impressive just from looking at early test results. It shouldn't be underestimated just because we haven't seen it fully utilized yet.
Assuming developers use it, which so far seems not to be the case. It's been out for months and only one game has bothered with it. I understand why they don't do GPU decompression, because it requires them to repackage their assets using GDeflate or a GDeflate-compatible algorithm, but they could at least use it to take advantage of the much faster NVMe drives. That doesn't seem too hard to implement.
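For context on how small the basic (non-GPU-decompression) integration is, here's a rough sketch of a single DirectStorage read using the public dstorage.h API. The archive name, chunk size and fence handling are hypothetical placeholders rather than anything from a shipping game:

```cpp
// Minimal DirectStorage sketch (hypothetical file/sizes, error checks omitted).
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void LoadBlob(ID3D12Device* device, ID3D12Resource* destBuffer,
              ID3D12Fence* fence, uint64_t fenceValue)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    // One queue per priority/source type is typical.
    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets.pak", IID_PPV_ARGS(&file)); // hypothetical archive

    // Enqueue a single uncompressed read straight into a D3D12 buffer.
    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_NONE;
    request.Source.File.Source          = file.Get();
    request.Source.File.Offset          = 0;
    request.Source.File.Size            = 64 * 1024 * 1024; // assumed 64MB chunk
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = 64 * 1024 * 1024;
    queue->EnqueueRequest(&request);

    // Signal a fence when the batch completes, then kick the batch off.
    queue->EnqueueSignal(fence, fenceValue);
    queue->Submit();
}
```

Switching that to GPU decompression is mostly a matter of setting CompressionFormat to DSTORAGE_COMPRESSION_FORMAT_GDEFLATE and repackaging the assets accordingly, which is exactly the part the post above says developers have been avoiding.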
 
I don't know if you remember, but at the start of this I said "excluding storage bottlenecks". The reason I said that is that both hUMA and NUMA suffer from this issue. With regards to the rest of my statement, I should have been clearer. I've included a picture below to better represent what I was trying to explain. Architecturally speaking, there are a lot of inefficiencies in this one picture.

numa.png
The PC architecture is so incredibly inefficient. When you compare Windows laptops with Apple's M-series SoCs, it becomes really apparent.

Really hope for a unified architecture next generation, at least for laptops. Otherwise I won't upgrade for another 2 years.
 
The most interesting part for me has been that everybody tries to pinpoint why PC TLOU does not just crush PS5 TLOU in every instance possible, without really acknowledging that TLOU was a game made for PS3 that got ported to PS4, then ported again to PS5, and then ported again to PC.
Heck, the design and implementation choices made per port must have added so much cruft to jury-rig things into working that it probably got more "Here be dragons" comments than the map Columbus had when he set sail for India.

Btw, did ND redo the graphics each time, or at least for PS3 -> PS4? Or were the original files high-def enough to just scale down again? Or are they just using the same graphics?
TLOU Remastered on PS4 is essentially the PS3 game.

TLOU Part I on PS5 has been completely redone.
 
Our dear friend PJ is a long ways away from being "totally correct". As you perfectly explained below, the conversation was never about which platform was the most powerful or the like. It was strictly centered around systems integration and data management.

As noted in my previous post to you, my argument is against the following claims you made: "[PS5's] more performant i/o architecture and unified memory that has allowed the consoles to be more nimble and outclass the PC platform with these newer games", which are false because the PS5 does not have a more performant i/o architecture than what is possible on a PC by almost any measure. And the PS5 has not 'outclassed' (which I take to mean outperformed) the PC as a platform (which I take to mean any possible PC configuration), in any modern game.

If I've misinterpreted your argument, then that's entirely different, just say what you actually meant and perhaps we can come to an agreement.

What I will say is that although the PS5 has less potential data throughput than a top end PC (11GB/s vs 14-20GB/s), it can decompress that data using far fewer CPU resources than a PC that is using CPU based decompression, and even slightly fewer than a PC using GPU based decompression. Therefore we can say that it's considerably more efficient, albeit less performant, than a PC using the Win32 API (which is what basically every game you are using as an example of PS5's advantages is doing), and it's a little more efficient while being less performant than a PC using DirectStorage and GPU based decompression.

You have complained about the references to Direct Storage as "hypotheticals" however you are using the performance of a small number of games which are clearly terrible ports (e.g. TLOU) and/or don't use the latest features of the PC's "IO architecture" to draw conclusions about the relative theoretical performance of the PS5's "IO architecture" vs that of the PC. If you want to talk about real world results in games, then do so, and we can discuss why we might be seeing different results across the different platforms. But it was yourself that chose to talk about the theoretical capabilities of the platform in referencing the capability of "IO architecture". So of course we are going to talk about the theoretical capabilities of the architecture in that case. It's also disingenuous to say we cannot bring up those very real aspects of the architecture just because they are yet to be used in actual games, given that GPU decompression is only a few months old, and would likely need to be integrated into the game development quite early on. Again, if you want to talk about the theoretical capabilities of the PC's "IO architecture" as it pertains to currently released games when using the Win32 API, that's one thing, but you placed no such restriction on your claims of the PS5's "more performant I/O architecture". If you had, this would be a different discussion.


And he has since doubled and tripled down on the belief that consoles do not have a leg up on PC in this area as of present day. As it relates to the topic at hand, he is totally wrong. You don't get cookies for giving answers to a question that was never asked. You get straws.

As noted above, if you are now reining in your claim to reference only the application of the PC's "IO architecture" in currently released games (which is heavily limited due to the use of Win32 in almost every title) then I'll happily agree that the PS5 is significantly more efficient in that regard. But still not more performant, since it is quite possible, even with Win32, to stream higher amounts of data on a PC with a sufficiently powerful CPU to handle the decompression. For example, it should take around 9-10 Zen 2 cores at 3.6GHz to decompress a 14GB/s Kraken stream from a high end PC SSD, and far fewer high end Zen 4 cores. And of course high end CPUs can comfortably spare that many cores and still have a lot more CPU power left over than the PS5. Obviously this would be far less efficient than the PS5 solution though, which is where GPU decompression comes in.
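As a quick sanity check on that 9-10 core figure, the arithmetic is just stream rate divided by per-core decompression rate. The per-core Kraken throughput below is an assumed round number (consistent with the post's own figures), not a measurement:

```cpp
#include <cstdio>

int main() {
    // Assumptions (not measurements): a Zen 2 core around 3.6GHz decompressing
    // Kraken at roughly 1.5 GB/s, and a target decompressed stream of 14 GB/s.
    const double targetStreamGBs  = 14.0; // decompressed output rate
    const double perCoreKrakenGBs = 1.5;  // assumed per-core throughput

    const double coresNeeded = targetStreamGBs / perCoreKrakenGBs;
    std::printf("~%.1f cores needed\n", coresNeeded); // prints ~9.3 cores
    return 0;
}
```

With a higher assumed per-core rate for Zen 4, the required core count drops accordingly, which is the post's point about newer CPUs.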

 
Assuming developers use it, which so far seems not to be the case. It's been out for months and only one game has bothered with it. I understand why they don't do GPU decompression, because it requires them to repackage their assets using GDeflate or a GDeflate-compatible algorithm, but they could at least use it to take advantage of the much faster NVMe drives. That doesn't seem too hard to implement.
You're not thinking broadly enough. Isn't it the case that it's just too early and these games have been in development for a long time? Devs have a hard enough time finishing what's already on their plates to focus on a new feature right now. But it's inevitable, in the same way RTX is.
 
The most interesting part for me has been that everybody tries to pinpoint why PC TLOU does not just crush PS5 TLOU in every instance possible, without really acknowledging that TLOU was a game made for PS3 that got ported to PS4, then ported again to PS5, and then ported again to PC.
Heck, the design and implementation choices made per port must have added so much cruft to jury-rig things into working that it probably got more "Here be dragons" comments than the map Columbus had when he set sail for India.

Btw, did ND redo the graphics each time, or at least for PS3 -> PS4? Or were the original files high-def enough to just scale down again? Or are they just using the same graphics?
The PS5 TLOU has little to do with the PS3 version; it's the PS4 TLOU Part II engine and assets.
 
I don't know if you remember, but at the start of this I said "excluding storage bottlenecks". The reason I said that is that both hUMA and NUMA suffer from this issue. With regards to the rest of my statement, I should have been clearer. I've included a picture below to better represent what I was trying to explain. Architecturally speaking, there are a lot of inefficiencies in this one picture.

numa.png

I think you're misunderstanding what actually needs to travel over that PCIe interface. Also, the diagram is wrong: PCIe 4.0 x16 is 32GB/s in each direction, 64GB/s total.

However it doesn't need that much bandwidth, as amply demonstrated by basically any PCIe generational comparison benchmarks which show little to no improvement whenever a new standard is introduced that doubles the bandwidth of the previous one.

The reason for that is that while the CPU and GPU both need their own dedicated bandwidth, those are isolated pools for operations that only take place between the GPU and its memory, or the CPU and its memory. Some data sets need to be visible to, or worked on by, both processors, and in those instances you would traditionally copy the data back and forth between the pools to allow that. So some limited PCIe bandwidth is required for those copy operations, but it's quite limited and nowhere near going to saturate that bandwidth. The bigger issue in that scenario is the added latency incurred by the copy operation, which a well-developed application will attempt to hide behind other ops, but this will not always be possible. That's where GPU Upload Heaps come in: using that mechanism, the CPU can both read and write data in VRAM without the need to copy it back across the PCIe bus to system memory first. Much the same way you would do it in a hUMA system, albeit with slower accesses.
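For anyone curious what that looks like in practice, a minimal D3D12 sketch is below. It assumes a recent Agility SDK that exposes the GPU upload heap type and a system with Resizable BAR enabled; the buffer size and fallback policy are placeholders, not a recommendation:

```cpp
// Sketch: CPU-visible VRAM via a GPU upload heap (requires Agility SDK support
// for D3D12_HEAP_TYPE_GPU_UPLOAD and Resizable BAR; treat as illustrative only).
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

bool CreateGpuUploadBuffer(ID3D12Device* device, UINT64 size,
                           ComPtr<ID3D12Resource>& outBuffer, void** outCpuPtr)
{
    // Check whether the driver/OS exposes GPU upload heaps at all.
    D3D12_FEATURE_DATA_D3D12_OPTIONS16 opts16{};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS16,
                                           &opts16, sizeof(opts16))) ||
        !opts16.GPUUploadHeapSupported)
        return false; // fall back to the classic upload-heap + copy path

    D3D12_HEAP_PROPERTIES heapProps{};
    heapProps.Type = D3D12_HEAP_TYPE_GPU_UPLOAD; // lives in VRAM, CPU-writable

    D3D12_RESOURCE_DESC desc{};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = size;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.Format           = DXGI_FORMAT_UNKNOWN;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    if (FAILED(device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE,
            &desc, D3D12_RESOURCE_STATE_COMMON, nullptr,
            IID_PPV_ARGS(&outBuffer))))
        return false;

    // CPU writes through this pointer land directly in VRAM over the BAR,
    // with no separate staging copy across PCIe afterwards.
    return SUCCEEDED(outBuffer->Map(0, nullptr, outCpuPtr));
}
```

If the feature isn't supported, you're back to the traditional upload heap plus CopyBufferRegion path, which is exactly the copy-and-hide-the-latency scenario described above.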

The other big use of that PCIe interface is of course transferring IO data from the SSD into VRAM. But as we know, the absolute maximum that can hit is 7GB/s from the SSD, decompressed to ~14GB/s on the CPU. So IO can never take up more than 14GB/s of the available 32GB/s, and that will be halved back to 7GB/s when GPU decompression is used. And of course, some of the data coming from the SSD is not intended for the GPU (it's CPU-only data) and so does not need to be sent to VRAM, further reducing the usage.

In addition, PCIe 5.0 is in the process of being rolled out, which doubles that bandwidth again to 64GB/s in each direction, or 128GB/s total.

In other words the above diagram, as well as being wrong, does not show any specific bandwidth bottleneck.
 
The PC architecture is so incredibly inefficient. When you compare Windows laptops with Apple's M-series SoCs, it becomes really apparent.

Really hope for a unified architecture next generation, at least for laptops. Otherwise I won't upgrade for another 2 years.

I'd suggest you read through the last few pages of this thread, which discuss that topic in great detail, rather than relying on one incorrect diagram that doesn't actually show any major inefficiency. If we did get UMA in PCs then you're basically waving goodbye to upgradability. And you're saying hello to less effective bandwidth, along with lower CPU performance thanks to higher memory latencies.

A far better solution for the PC space is to bring as many of the advantages of UMA as possible to the PC's split memory design while retaining all the advantages of that split design. Short term, things like GPU upload heaps, Resizable BAR and faster PCIe standards are helping to make that happen. Longer term, we probably need to be looking at some sort of cache coherency between CPU and GPU.
 
I think everyone has been pretty consistent in agreeing that PC development is harder, that's pretty much a given. The main thrust of much of the discussion has been
...
They could have a real impact today, while also having an elegant fallback path for systems that don't support them.
...
And even if you can't support that, it'll simply fall back to CPU decompression.
...
So much less efficient (basically what happens now) but still a totally viable fall back.
I think there's a real misconception here about where bad performance and bugs in PC releases come from. Listing multiple features that are only available on high end devices, and not necessarily all available at once, but touting their fallbacks, illustrates exactly the kind of complexity that makes PC less practical to develop for. Game code isn't "load or render as much as you can this frame, and otherwise just show a black screen for the parts that aren't ready" -- everything has to be pipelined, replaced with placeholders, streamed in at low resolution, and scheduled so that the entire frame renders with consistent performance. You can't read from your GPU memory until you've finished writing to it. You can't draw a texture which isn't decompressed. "Just falling back to system memory" without a complicated and scheduled system to upload memory to the GPU means throwing away a huge percentage of your frame time. Each one of these configurations has to be considered, planned around, placed in conditionals throughout the code, have content measured against it, etc. Supporting a lot of optional features results in a renderer that's staggeringly complex. Complexity is the enemy; no amount of skill lets you spend infinite time tracking down different bug fixes for dozens of different configurations. A skilled developer makes smart choices about what features are worth supporting and what aren't, and that's usually based on technologies that have been around long enough to be widely available to consumers.
 
I think there's a real misconception here about where bad performance and bugs in PC releases come from. Listing multiple features that are only available on high end devices, and not necessarily all available at once, but touting their fallbacks, illustrates exactly the kind of complexity that makes PC less practical to develop for. Game code isn't "load or render as much as you can this frame, and otherwise just show a black screen for the parts that aren't ready" -- everything has to be pipelined, replaced with placeholders, streamed in at low resolution, and scheduled so that the entire frame renders with consistent performance. You can't read from your GPU memory until you've finished writing to it. You can't draw a texture which isn't decompressed. "Just falling back to system memory" without a complicated and scheduled system to upload memory to the GPU means throwing away a huge percentage of your frame time. Each one of these configurations has to be considered, planned around, placed in conditionals throughout the code, have content measured against it, etc. Supporting a lot of optional features results in a renderer that's staggeringly complex. Complexity is the enemy; no amount of skill lets you spend infinite time tracking down different bug fixes for dozens of different configurations. A skilled developer makes smart choices about what features are worth supporting and what aren't, and that's usually based on technologies that have been around long enough to be widely available to consumers.

Fortunately GPU decompression isn't only available on high end systems. Any Windows 10 or newer system with a Maxwell or GCN GPU will support it. And I believe it can also be exposed as a user toggle to use CPU decompression instead if the user's system is more suited to that.
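On the toggle point, DirectStorage does expose a process-wide configuration that can disable GPU decompression, so wiring it to a settings menu looks plausible. A minimal sketch, with the settings flag being a hypothetical game option rather than part of the API:

```cpp
// Sketch: letting the user opt out of GPU decompression. This has to run before
// the first DStorageGetFactory() call; "userPrefersCpuDecompression" is a
// hypothetical game setting.
#include <dstorage.h>

void ConfigureDirectStorage(bool userPrefersCpuDecompression)
{
    DSTORAGE_CONFIGURATION config{};                         // defaults elsewhere
    config.DisableGpuDecompression = userPrefersCpuDecompression;
    DStorageSetConfiguration(&config);                       // applies to factories created after this
}
```

Because the configuration has to be set before the factory exists, this fits a launch-time settings read rather than a mid-game toggle.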

For the GPU upload heaps, I hear what you're saying, but can't a lot of that be handled by the driver? E.g. I know that currently the AMD driver will automatically move data into the BAR from the upload heap if it determines it's more optimally located there. That's how we can see speed-ups from Resizable BAR in existing games without patches.

Still, as noted in the post you quoted, no one is saying PC development is as easy as console development, and perhaps GPU upload heaps won't see extensive usage until they are supported on the minimum targeted spec. I've no idea what level of hardware currently supports the feature.
 
Assuming developers use it, which so far seems not to be the case. It's been out for months and only one game has bothered with it.

The GPU decompression portion has only been out of beta since last November, and that's really the part of DS that holds the most interest. I'm not sure loading times per se have really been a bottleneck in ports until now; it's more the extra CPU resources potentially required, and you can't expect a groundswell of integration a few months after release.

I understand why they don't do GPU decompression, because it requires them to repackage their assets using GDeflate or a GDeflate-compatible algorithm, but they could at least use it to take advantage of the much faster NVMe drives. That doesn't seem too hard to implement.

MS has basically said that integrating DS needs to be done quite early in development. All indications I've seen are that it is not trivial to implement as a patch.
 
TLOU Remastered on PS4 is essentially the PS3 game.

From an asset quality standpoint, kind of - Joel and Ellie's in-game models were substantially improved, but nothing drastic. But it sounds like quite a lot had to change:

Porting the Last of Us Was Hell

Remastering The Last of Us for PlayStation 4 isn’t as simple as flipping a switch, according to Naughty Dog’s Neil Druckmann. Speaking in an interview with Edge, the creative director explained the developer “expected it to be hell, and it was hell.”

You see, the PlayStation 3 is a tricky piece of hardware, one for which The Last of Us “was optimized on a binary level.” And even with some of the best engineers in the industry working on the project, “just getting an image onscreen, even an inferior one with the shadows broken, lighting broken and with it crashing every 30 seconds… that took a long time.”
 
I mean, there should be no doubt about it. It's inevitable and necessary. The days of rapid increases in VRAM are over because of how slowly the cost per GB improves nowadays. It's pretty insane that we still have midrange 8GB GPUs these days, when 8GB cards first showed up nearly a decade ago with the R9 290/290X and were already normal midrange models by 2015.

DirectStorage is absolutely critical to getting a ton more out of the VRAM we do have, much like this same paradigm is for the new consoles and their limited 16GB of total RAM. It has to happen. And it will not just be some 'feature option'; it's soon going to become a core, fundamental part of the way new games are built, with SSDs as a true minimum requirement.


The only advantage I think I could give to TLOU in the texture/shader department here is the sheer amount of decent quality materials they have authored. It's one of those areas that really demonstrates Naughty Dog's AAAA-level budget, with just how many different assets they (and their dozen+ outsourced asset creation studios...) deliver throughout the chapters, all at a consistent, quite high quality.

But at an individual asset level, I don't really think it's doing anything particularly special here at all.

I'm now 7 hours into TLOU and the game has made me go 'woah' in so many cases that I can agree with the reviews I linked before. I don't like these kinds of games, but I bought TLOU on the basis of those reviews and was ready to refund it if it didn't live up to those great expectations.


With A Plague Tale, it's been 3 hours of playtime via Game Pass, and from what I've seen up till now, I doubt it's even going to match TLOU, let alone be better.
 
The GPU decompression portion has only been out of beta since last November, and that's really the part of DS that holds the most interest. I'm not sure loading times per se have really been a bottleneck in ports until now; it's more the extra CPU resources potentially required, and you can't expect a groundswell of integration a few months after release.



MS has basically said that integrating DS needs to be done quite early in development. All indications I've seen are that it is not trivial to implement as a patch.
As I said, I understand not using DirectStorage with GPU decompression. That's much harder to implement, but the guys at Square managed to get 1.1 into Forspoken and it came out like 2 months before their game shipped. Surely Naughty Dog could have used it for this game, considering the awful load times.
 
Why would data be struggling to get to where it needs to be on a properly equipped PC with a properly written application? I'm seriously asking you to explain the bottleneck here in terms of interfaces, bandwidth and processing capability vs workload, because all I'm hearing are wild claims about PS5 superiority without any technical details to back them up.

Obviously a PS5 is going to outperform an ill-equipped PC in this respect (and as we've seen, if the application isn't properly utilising a correctly equipped PC, the result will be the same), but let's take a PCIe 4.0 NVMe-equipped system using DirectStorage 1.1 with GPU decompression as a baseline, along with a CPU and GPU that are at least a match for those in the console. And as a preview I will say that a little more CPU power should be required on the PC side for a similar result, but I'll leave you to explain the details of why...



Why would Direct Storage be unable to keep up? As with my above point, please explain this in terms of interfaces, bandwidth and processing capability vs workload. We already have benchmarks showing decompression throughput far in excess of the known limits of the hardware decompressor in the PS5, so what is it that you think will not be able to keep up, and why?

Perhaps it's the GPU's ability to keep up with the decompression workload at the same time as rendering. Which begs the question of how much data you are expecting to be streamed in parallel with actual gameplay? Even a modest GPU of around PS5 capability can decompress enough data to fill the entire VRAM of a standard 8GB GPU in less than a second, and you're never going to completely refresh your VRAM like that mid-gameplay. If I recall, the Matrix Awakens demo was streaming less than 150MB/s, and I think @HolySmoke presented some details for Rift Apart here before which showed even that has modest streaming requirements on average.

Of course there will be full scene changes, including load screens or animations (which includes ultra-short load animations like the rift transitions in Rift Apart), where a relatively large amount of data will be loaded from disk in a very short period. But you aren't actually rendering much of anything on the GPU at those points, and so the entire GPU resources can be dedicated to decompression, much in the same way that the hardware block on the PS5 is used.

For normal, much more modest streaming requirements, async compute is used, targeting the spare compute resources on these GPUs, as a GPU is very rarely 100% compute limited.
So I still believe that DirectStorage has had no real test - even Forspoken was not a real test (not much data was requested). The real test will be seeing how it fares against the PS5 with the hopefully soon-coming port of R&C Rift Apart. Even though that game did not fully utilize the PS5's bandwidth (visible when people run it off Gen 4 SSDs with less bandwidth than Sony officially asks for), it would still be the best test we could have for now.
I also simply don't believe any figures claiming the GPU usage would be small, with no real burden in parallel with rendering the game. I happily repeat that I simply DON'T BELIEVE that a cobbled-together software solution is on par with the PS5's very well thought out hardware array. Period. Don't harass me with further questioning - I don't believe it and you're going to accept that - are we clear?! If you continue, you wander onto the blocklist until we have a final good test at hand. Oh and btw - when I say "DirectStorage is not on par with PS5" I mainly mean the RTX 20xx cards and their AMD counterparts. Those are the cards that everyone tests against the PS5. RTX 30xx or even 40xx cards will most likely have enough GPU resources to handle decompression while rendering/raytracing a game. But I stand by what I have often proclaimed - that RTX 20xx cards will see no light against later-gen PS5 exclusives.
That I happily post here to be quoted later.
 