State of GDeflate

Remij

Going to bring it up here because this thread is always being looked at by everyone.. and well maybe it will be pertinent to DF game analysis discussions in the future.

Is there an obvious way we could discern on PC whether a DirectStorage supported game is using CPU decompression or GPU decompression for GPU assets? Is there a way to tell if GDeflate specifically is being used by a game on the GPU? If not, would PCIe(Tx) bandwidth readings give potential clues? Am I missing something more obvious?
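For anyone wanting to chase the PCIe-bandwidth angle: on NVIDIA cards the driver exposes per-direction PCIe throughput counters through NVML, so you can sample bus traffic while a level loads. A rough sketch, assuming the discrete GPU is device 0 and that the game's streaming is the dominant bus traffic at the time:

```cpp
// Sample the GPU's PCIe throughput once a second via NVML.
// If assets cross the bus still compressed (GPU decompression), host->GPU traffic during
// loading should sit closer to the compressed data size; CPU-side decompression pushes
// something nearer the uncompressed size.
// Link against the NVML library (nvml.lib / libnvidia-ml); error handling omitted.
#include <nvml.h>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    nvmlInit();
    nvmlDevice_t dev{};
    nvmlDeviceGetHandleByIndex(0, &dev);          // assumes the dGPU is device 0

    for (int i = 0; i < 60; ++i) {                // ~one minute of one-second samples
        unsigned int rxKBs = 0, txKBs = 0;        // NVML reports KB/s
        nvmlDeviceGetPcieThroughput(dev, NVML_PCIE_UTIL_RX_BYTES, &rxKBs); // into the GPU
        nvmlDeviceGetPcieThroughput(dev, NVML_PCIE_UTIL_TX_BYTES, &txKBs); // out of the GPU
        std::printf("PCIe in: %.2f GB/s  out: %.2f GB/s\n", rxKBs / 1e6, txKBs / 1e6);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    nvmlShutdown();
    return 0;
}
```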

Also, do you guys think GDeflate has a real future? Does the GPU get a dedicated GDeflate decompression block which somehow avoids contention with GPU resources, while also allowing for fallback support on the GPU for cards without the hardware block?

I'd like to start seeing some positive developments for DirectStorage on PC. I mean, FF16 uses it (don't know if it's using GPU decompression) and it loads extremely fast. Faster than Playstation 5 in fact.. but it doesn't change the fact that for gameplay streaming scenarios, doing large amounts of decompression can impact performance and potentially cause hitching and other issues which games of the future will undeniably have to contend with.
 
You can use Special K to see if a game uses GPU or CPU decompression, including GDeflate use. I think GDeflate has a debatable future atm given how poor the first few results have been. I think it's perhaps actively detrimental in a game like Ratchet.
 
I'd like to start seeing some positive developments for DirectStorage on PC. I mean, FF16 uses it (don't know if it's using GPU decompression) and it loads extremely fast. Faster than Playstation 5 in fact.. but it doesn't change the fact that for gameplay streaming scenarios, doing large amounts of decompression can impact performance and potentially cause hitching and other issues which games of the future will undeniably have to contend with.
I was talking about that years ago on this forum, and it was not hard to predict. Sure, for traditional black-screen loading GPUs will be faster than anything else (CPU or dedicated I/O), but game engines are streaming more and more during gameplay and less during loading screens, and there you need super low latencies that both the CPU and GPU currently lack.

The future is accelerated I/O that doesn't impact CPU or GPU resources, the way it's done on PS5. PS5 has games like Demon's Souls, Ratchet or AstroBot with virtually no loading screens (or < 2 sec), seamless playable transitions between scenes, and streaming done during gameplay without impacting the rather limited CPU/GPU resources of PS5.

This is how we were playing during the 8/16-bit and 64-bit eras too. And please, could someone also start making 4K CRT TVs with modern technology? Those LCD, LED and OLED screens... just don't work for video games.
 
The future is accelerated I/O that doesn't impact CPU or GPU resources, the way it's done on PS5. PS5 has games like Demon's Souls, Ratchet or AstroBot with virtually no loading screens (or < 2 sec), seamless playable transitions between scenes, and streaming done during gameplay without impacting the rather limited CPU/GPU resources of PS5.

But so does PC.

I said years ago that we need to see something similar to PS5's I/O block integrated into all PC CPUs in the future, as it would benefit the platform in its entirety.
 
But so does PC.
Read the whole thing to get the whole message: without impacting the rather limited CPU/GPU resources of PS5.
I said years ago that we need to see something similar to PS5's I/O block integrated into all PC CPUs in the future, as it would benefit the platform in its entirety.
That's exactly what Globby said. A standardised IO acceleration unit is a good idea.
 
The problem with the PC is that it will take a long time before new hardware is used by the majority of people. Every PC also has a different configuration, so you never know if you have CPU or GPU cycles to spare on decompression. Usually that's not a problem, as you can make the software fall back to a different system or give the choice to the end user. But with asset loading, the compression scheme that might be efficient for CPU decompression is different from what is best for GPU decompression or dedicated hardware decompression. So to support all the different variants you need to have your assets compressed in those variants, which will double your install size.
 
The problem with PC is that there isn't going to be a standard data path for all data from all storage devices. For these techniques to be of use, they would have to send compressed data past any bottlenecks before decompression. But with PCIe speeds increasing almost yearly (we should have 242 GB/s per direction with PCIe 7.0 next year), we are reaching the point where compression's main purpose will be keeping file sizes smaller. But drives are getting larger, too. If you aren't limited by bandwidth and you store your assets in a ready-to-use state, there shouldn't be a CPU or GPU cost to decompress those assets.
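For what it's worth, that 242 figure roughly checks out for an x16 link; the efficiency factor below is my own ballpark for framing/FEC overhead, not an official number:

```cpp
// Back-of-envelope PCIe 7.0 bandwidth check. 128 GT/s per lane is the spec'd signaling
// rate; the framing-efficiency factor is an assumption for illustration.
#include <cstdio>

int main() {
    constexpr double gtPerSecPerLane = 128.0;                         // PCIe 7.0, per lane
    constexpr int    lanes           = 16;                            // x16 slot
    constexpr double rawGBps         = gtPerSecPerLane * lanes / 8.0; // 256 GB/s per direction
    constexpr double efficiency      = 0.945;                         // assumed framing/FEC overhead
    std::printf("raw %.0f GB/s, usable ~%.0f GB/s per direction\n", rawGBps, rawGBps * efficiency);
    return 0;
}
```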

I guess what I'm saying is that the pace of transfer-speed growth looks like it's going to outpace the need for hardware decompression in the near future. If hardware decompression were on the market next week, it would take long enough for that hardware to build any substantial install base that, by then, faster transfers would have made its advantages moot.
 
You can use Special K to see if a game uses GPU or CPU decompression, including GDeflate use. I think GDeflate has a debatable future atm given how poor the first few results have been. I think it's perhaps actively detrimental in a game like Ratchet.
Oh really? That's good! I'll have to check it out then. I'm curious if FF16 is using the GPU at all for decompression. Yeah, I agree, it hasn't given very great first impressions thus far. I can understand the point of GDeflate as an interim solution until a dedicated solution can be implemented and adopted across a large enough portion of the market.. but what's THAT going to look like, and when? And that's what leads me to ask the question of whether it has a future or not. If the way GPU asset decompression is going to work for the foreseeable future on PC is by keeping the assets compressed over the PCIe bus and decompressing on the GPU.. AND you need to maintain some form of fallback which can decompress on GPU compute for cards that don't have the dedicated block... what other option is there?

But so does PC.

I said years ago that we need to see something similar to PS5's I/O block integrated into all PC CPUs in the future, as it would benefit the platform in its entirety.
The problem with implementing a dedicated block in PC CPUs is that the assets would have to be decompressed into RAM and sent over the PCIe bus uncompressed.. avoiding that is one of the reasons they want to decompress assets on the GPU in the first place. So doing that wouldn't gain you anything.. the CPUs we have are already capable of doing that super quick at load time.
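To put toy numbers on that point (both figures below are assumptions, purely for illustration):

```cpp
// Illustration only: with a hypothetical 2:1 asset compression ratio, decompressing on the
// CPU roughly doubles the traffic that has to cross the CPU->GPU PCIe link, compared with
// sending the compressed stream across and decompressing on the GPU.
#include <cstdio>

int main() {
    constexpr double ssdReadGBps      = 7.0;  // assumed Gen4 NVMe sequential read
    constexpr double compressionRatio = 2.0;  // assumed 2:1 compression on GPU assets

    const double gpuDecompressTraffic = ssdReadGBps;                     // stays compressed over PCIe
    const double cpuDecompressTraffic = ssdReadGBps * compressionRatio;  // inflated in RAM, sent raw

    std::printf("CPU->GPU link: %.1f GB/s (GPU decompress) vs %.1f GB/s (CPU decompress)\n",
                gpuDecompressTraffic, cpuDecompressTraffic);
    return 0;
}
```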

I think the bigger issue is contention for resources on both the CPU and GPU.. which is why we need dedicated hardware. Do we get dedicated hardware in both the CPU and GPU? Or does something else fundamentally change with PC architecture going forward? I wish we'd get some hints as to what the plan is.
 
I guess what I'm saying is that the pace of transfer-speed growth looks like it's going to outpace the need for hardware decompression in the near future. If hardware decompression were on the market next week, it would take long enough for that hardware to build any substantial install base that, by then, faster transfers would have made its advantages moot.

I feel that outlook might be optimistic as it pertains to the consumer market.

Enterprise workloads have recently driven PCIe speeds (after a relatively long lull with PCIe 3.0), but actual consumer adoption of even PCIe 5.0 lags well behind ratification and enterprise adoption, due to costs that aren't entirely addressable via semiconductor scaling (which has itself slowed). Signal integrity for PCIe 7 is going to be even tighter and drive costs even higher due to the physical materials/manufacturing required, much less anything beyond that.

If anything, looking at the broader industry in general, we're seeing that the cost of moving data around isn't trending in a direction where it can easily be solved by just throwing more hardware at it anymore; we need more synergistic software/hardware solutions.
 
I think the bigger issue is contention for resources on both the CPU and GPU.. which is why we need dedicated hardware. Do we get dedicated hardware in both the CPU and GPU? Or does something else fundamentally change with PC architecture going forward? I wish we'd get some hints as to what the plan is.
Yes, I think so. The model is the PS5 CPU + GPU + I/O architecture done by Cerny and his team. They'll need to create a third unit: an I/O chip that directly feeds the GPU RAM with the requested data (from the CPU or even the GPU) at zero cost to both CPU and GPU, and at minimal cost to bandwidth, as the data is loaded only once and already in GPU-ready form. They could also implement the cache scrubbers of the PS5 GPU, which would further improve efficiency and lower latencies.

The I/O chip and its caches will take care of the difference in compression formats / variants @Pjotr was talking about. That will have to be accounted for by the developers: they'll have to code for the CPU, the GPU and the I/O chip. More complexity there, but overall a more streamlined game development process.

We are talking about a complete redesign of PC motherboards with a third socket, the I/O socket. That chip could also simply be soldered onto motherboards at the beginning.

[Attached image: custom_IO.jpg]
 
We are talking about a complete redesign of PC motherboards with a third socket, the I/O socket.
Ok.. but that doesn't happen all at once. It would probably take a decade for the hardware to penetrate the market and for OS and software support to materialize. I don't disagree that some form of this is what the future holds.. but we must crawl before we walk, and walk before we run. So in the meantime, what is the best way to do it without changing the I/O architecture of the PC as it is today? I think it's quite obvious we need to get some form of dedicated hardware decompression on the GPU which doesn't contend with GPU resources. The GPU is probably the most frequently updated component of PC hardware.. that would ensure the quickest adoption, and also allow some fallback by still being able to decompress on the GPU compute cores for older GPUs.

Once that is settled, then I think you move on to some fundamental change to the motherboard I/O.
 

This seems relevant to this thread, from "How God of War Ragnarök was ported to PC":

Digital Foundry: Ragnarök is both a PS4 and PS5 game, and it's easy to notice the differences in install sizes for each platform. How are you handling asset compression and decompression for PC?

Matt DeWald: One of the things that got called out with our spec sheet is just the size of the game - and it's a really large game in fairness, it's all of Ragnarök plus all of Valhalla plus all of the patches that we released since launch.

But on PC, we lack dedicated hardware for decompression, so we looked at multiple decompression methodologies and systems that we could use. We just didn't find one that would benefit us on the performance side or the stability side, because there have been stability problems with certain decompression technologies. So we just chose to use up more disk space, because disk space is relatively cheap these days. It does take a large amount of space, but it's much better than having a poor experience because the decompression stalls you out or causes frame hitches or other things.
 
With that line I wonder if the possibility going forward is to have SSDs on the GPU and basically just have duplicate data. The GPU handles decompression and I/O off the SSD connected to the GPU for VRAM, and the CPU does it for system memory.
 
We've had conversations in the past about the GPU having direct access to storage. As discussed before, the challenge to literally make this work exactly as it sounds requires dedicated storage used only by the GPU and nothing else, which becomes difficult because (to @arandomguy 's point) this requires duplication of game data -- one copy on your typical user storage device and another copy on the "GPU storage device." Unfortunately, if you carry this thought to its logical conclusions, you end up with GPU storage issues with sizing, autonomous management and cleanup.

Why can't GPUs directly access the SSD you're already using today? Because filesystems are still a thing, and touching a file inside a filesystem requires CPU and main memory activity which can't be duplicated by the GPU. Filesystems mandate a litany of access processes, arguably the most egregious is access controls. Your file system must parse which file(s) are being touched, if the user(s) involved are allowed to touch those, and then log the access requests and approvals/denials. You also need to think about EDR controls (malware protection) and user-data or full-disk encryption (or both, for high-security compute needs), and how files in a filesystem are mapped to a partition table, which is mapped to logical volumes, which are mapped to physical blocks on one or more disks -- all of which have their own unique topologies, and all of these mappings are stored in main system memory.

DirectStorage has actually solved most of the heavy-lifting parts of this equation, which encourages packing all game assets into a single file. That single file then permits a single filesystem ACL check at the original creation of the open file handle, after which a separate DirectStorage call directly block-maps the file through all the aforementioned translation layers (FS -> partition -> volume -> physical disk(s)) so the storage driver can functionally transfer block reads directly to the GPU. This block transfer also skips one memory-copy step, which is yet more performance enhancement.
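For anyone who hasn't looked at the API, this is roughly what that flow looks like in code. It's a condensed sketch from memory of the public DirectStorage samples, so treat the exact fields as approximate; "data.pak", the offsets and the destination resource are placeholders:

```cpp
// Sketch of a DirectStorage GPU-decompression load: open the pack file once (one
// filesystem/ACL hit), then enqueue byte-range -> GPU-buffer requests against it.
// Error handling omitted; assumes DirectStorage 1.1+ (dstorage.h).
#include <dstorage.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdint>
using Microsoft::WRL::ComPtr;

void LoadAsset(ID3D12Device* device, ID3D12Resource* destBuffer, ID3D12Fence* fence,
               uint64_t fenceValue, uint64_t packOffset, uint32_t compressedSize,
               uint32_t uncompressedSize)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"data.pak", IID_PPV_ARGS(&file));   // single open-handle ACL check

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;                         // GPU decompression targets this device

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // Each request is just "this byte range of the pack file -> this GPU buffer",
    // optionally tagged with a compression format for the runtime to undo on the way.
    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source        = file.Get();
    request.Source.File.Offset        = packOffset;
    request.Source.File.Size          = compressedSize;
    request.UncompressedSize          = uncompressedSize;
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = uncompressedSize;

    queue->EnqueueRequest(&request);
    queue->EnqueueSignal(fence, fenceValue);               // signaled once the data has landed
    queue->Submit();
}
```

In a real engine the factory, file and queue would of course be created once and reused; it's written as a single function here only to keep the sketch self-contained.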

Finally, there's already an I/O port -- this is how PCIe "works." In the above picture, it's shown on the CPU package (but not inside the CPU core die) because all modern CPUs have included the PCIe root complex on the package for nearly a decade now, which significantly improves latency and speed. There's no rational benefit to moving the PCIe root complex out of the CPU package; they don't intertwine in an obstructive way today. In fact, going to a dedicated I/O "socket" would be moving backwards to the days when we had northbridges, which caused their own signaling latency and congestion issues.

I think people really need to stop and understand what DirectStorage has already solved before trying to revert some things we purposefully left to history.
 
Why can't GPUs directly access the SSD you're already using today? Because filesystems are still a thing, and touching a file inside a filesystem requires CPU and main memory activity which can't be duplicated by the GPU. Filesystems mandate a litany of access processes, arguably the most egregious is access controls. Your file system must parse which file(s) are being touched, if the user(s) involved are allowed to touch those, and then log the access requests and approvals/denials.
Can't you have a read-only scratchpad area dedicated to the GPU/games? The file system duplicates the necessary game data onto that partition and you can ignore file-access responsibilities, because it'll just be flat out "anything can read, nothing can write", and the GPU can read whatever bits it wants. I presume that's technically feasible but not practical, due to existing legacy structures imposing a certain system-wide way of doing things.
 

There's no need (or use) for a 3rd "IO socket" on the PC. The I/O complex noted in the PS5 diagram above already exists on the PC. It's part of the Zen CPU package. On PS5 they simply added a hardware decompression unit into the existing Zen IO complex.

As noted already by others, that would be inefficient on PC because 1. it's wasted die space for any CPU that isn't being used for gaming purposes and 2. it means decompressing the data before sending it over the CPU-GPU PCIe link, and thus losing those data compression benefits there. Although you would still get those benefits over the SSD-CPU PCIe link, which is usually the slower one, so some benefit can still be had in that respect.

The solution to my mind on PC would very closely resemble the existing GDeflate-based GPU decompression solution, but instead of using the SMs/CUs on the GPU, it would use a dedicated block on the GPU to decompress the GPU-native data, while any data intended for processing by the CPU would still be decompressed by the CPU in main memory. Yes, that is still an impact on the CPU, but it represents only a small portion of the total data to be decompressed.

To be clear, as I understand it, the issue with GDeflate-based GPU decompression at present is more about how it handles the operation in such a way that control is taken away from the application and can cause the application to stall (i.e. stuttering etc.) rather than actual resource contention - i.e. it's just implemented poorly. Resource contention would still be a thing of course, but I don't think that's why we haven't seen much use of GDeflate so far. So the hardware-based decoder would need to be implemented in such a way as to avoid that kind of stalling. And it would of course solve any resource contention issues as well.

The more interesting question for me is what formats and fallback options would be supported. My guess would be something CPU-friendly (e.g. zlib and maybe Kraken like PS5) with the fallback option being direct to the CPU, rather than supporting GDeflate in hardware with a fallback to the GPU's shaders. That is, unless they can fix the GDeflate shader option to not cause stalls in the application. The issue with PC, as noted by others, is that you can't know what the end user will be running. But in all likelihood, the people with older GPUs lacking this hardware decompression unit are also likely to be the people with older CPUs that can less easily cope with the CPU fallback path. So devs are left with the slightly contradictory choice of recommending a GPU with hardware decompression paired with a smaller CPU, or a bigger CPU with a potentially older/weaker GPU. And to sidestep that dilemma they could just forgo a lot of the compression altogether and expect the user to suffer a bigger install - just as happened with Ragnarök. And I assume offering the game in two separate packages would be too great a burden on the packaging and validation process.

And so perhaps GPU shader-based decompression is still the best solution as a baseline, with the hardware unit being the luxury for high-end systems. But they have to make it work properly first, or it's a non-starter. And I'm not sure if that's even possible.
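On the "which path will actually get used" question: if I'm remembering the DirectStorage 1.2 additions correctly, the runtime can already tell you whether GDeflate will go down a driver-optimized GPU path, the generic compute-shader fallback, or the CPU fallback. A sketch of the kind of check I mean, with the interface and flag names written from memory of dstorage.h (treat them as approximate):

```cpp
// Hedged sketch: ask the DirectStorage runtime which decompression path it intends to use
// for GDeflate on this system. Assumes a queue created as in earlier examples, and that
// the IDStorageQueue2 / DSTORAGE_COMPRESSION_SUPPORT names are as I recall them.
#include <dstorage.h>
#include <cstdio>

void ReportGDeflatePath(IDStorageQueue2* queue)
{
    DSTORAGE_COMPRESSION_SUPPORT support =
        queue->GetCompressionSupport(DSTORAGE_COMPRESSION_FORMAT_GDEFLATE);

    if (support & DSTORAGE_COMPRESSION_SUPPORT_GPU_OPTIMIZED)
        std::printf("GDeflate: driver/metacommand-optimized GPU path\n");
    else if (support & DSTORAGE_COMPRESSION_SUPPORT_GPU_FALLBACK)
        std::printf("GDeflate: generic compute-shader fallback on the GPU\n");
    else if (support & DSTORAGE_COMPRESSION_SUPPORT_CPU_FALLBACK)
        std::printf("GDeflate: CPU fallback\n");
    else
        std::printf("GDeflate: not supported on this setup\n");
}
```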
 
Can't you have a read-only scratchpad area dedicated to the GPU/games? The file system duplicates the necessary game data onto that partition and you can ignore file-access responsibilities, because it'll just be flat out "anything can read, nothing can write", and the GPU can read whatever bits it wants. I presume that's technically feasible but not practical, due to existing legacy structures imposing a certain system-wide way of doing things.

I think that's quite doable, as one can make a special partition reserved for that purpose. On the other hand, if you've already solved the I/O path problem, there's really not that big a difference between a special partition and one specific big file.
 
As noted already by others, that would be inefficient on PC because 1. it's wasted die space for any CPU that isn't being used for gaming purposes.

If it becomes a standard hardware feature in CPUs it will benefit everything to do with Windows, not just gaming.

Video editing could use it, Photoshop, Maya... they could even keep Windows itself more compressed and smaller on disk.
 

Generally, lossless compression/decompression is not a big bottleneck on the CPU. You have been able to enable compression in NTFS for ages (probably more than a decade) and that's generally considered a net win for I/O if your data is not already compressed. If that was the case for CPUs ten years ago, it's even less demanding on a contemporary CPU.
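For reference, per-file NTFS compression really is just one ioctl; something like this, where the path is a placeholder and the default format maps to LZNT1:

```cpp
// Turn on NTFS compression for an existing file. The same FSCTL works on a directory
// handle (opened with FILE_FLAG_BACKUP_SEMANTICS) so that new files created in it are
// compressed by default.
#include <windows.h>
#include <winioctl.h>

bool EnableNtfsCompression(const wchar_t* path)   // e.g. L"D:\\games\\assets.bin" (placeholder)
{
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    USHORT state = COMPRESSION_FORMAT_DEFAULT;    // let NTFS pick its default (LZNT1)
    DWORD bytesReturned = 0;
    BOOL ok = DeviceIoControl(h, FSCTL_SET_COMPRESSION, &state, sizeof(state),
                              nullptr, 0, &bytesReturned, nullptr);
    CloseHandle(h);
    return ok != FALSE;
}
```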
 
Generally, lossless compression/decompression is not a big bottleneck on the CPU. You have been able to enable compression in NTFS for ages (probably more than a decade) and that's generally considered a net win for I/O if your data is not already compressed. If that was the case for CPUs ten years ago, it's even less demanding on a contemporary CPU.

But if they've never had the ability to decompress GBs of data per second, then the software isn't going to use it.

If you gave every CPU the ability to decompress 10 GB/s without taxing the CPU, I imagine some data-heavy programs would look to take advantage of it.

It wouldn't be something Joe Bloggs would see the benefit of, but someone working on music/video production might.
 