DirectStorage GPU Decompression, RTX IO, Smart Access Storage

PCs will do just fine with GPU decompression, but it does require additional care from developers to make sure it doesn't block graphics rendering and cause a hitch/stutter. A naive implementation will lead to just that, since loading and decompression may easily saturate all bandwidths, leaving nothing for the rendering system.
Developers can't control how powerful somebody's GPU is gonna be, and so whether it will end up 'blocking' any graphics pipeline. Some of those results demonstrate some pretty heavy penalties.

If we really want to allow developers to take advantage of DirectStorage in the future, and not just use it in some light form for faster loading screens or a bit of extra streaming help or whatever, that performance loss needs to be mitigated. Like, almost completely. I'd say less than 5% penalty would be in the acceptable realm, but anything more and you're simply getting into the territory of "Why do we not just have something that can do this better and not sap precious GPU resources?".
 
Developers can't control how powerful somebody's GPU is gonna be, and so whether it will end up 'blocking' any graphics pipeline.
They can and do pretty much always. It's called "minimum and recommended system requirements".

If we really want to allow developers to take advantage of DirectStorage in the future, and not just use it in some light form for faster loading screens or a bit of extra streaming help or whatever
Not sure what "really take advantage" means, because DirectStorage is in its essence "faster loading".

that performance loss needs to be mitigated. Like, almost completely.
No, it doesn't. You can budget your GPU usage for DirectStorage to perform GPU decompression just as you budget the GPU for any other task it may be doing in modern games, whether it's rendering, simulation or some ray-traced sound calculations. The point is to make sure that your DS workload doesn't block other workloads running on the same GPU from executing.
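Just as a rough illustration of what that budgeting hangs off of: the DirectStorage API itself only takes requests tagged with a compression format and a destination resource, and the app decides how many of them it submits per frame. A minimal sketch of one GDeflate read into a GPU buffer might look something like this (assuming `device`, `destBuffer` and the sizes already exist; the file name is illustrative and error handling is omitted):

```cpp
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: enqueue one GDeflate-compressed read that DirectStorage
// will decompress on the GPU straight into `destBuffer`.
void EnqueueCompressedRead(ID3D12Device* device, ID3D12Resource* destBuffer,
                           UINT32 compressedSize, UINT32 uncompressedSize)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    // Keeping streaming reads on their own queue is what lets the app
    // "budget" them against rendering work.
    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets.gdeflate", IID_PPV_ARGS(&file)); // illustrative path

    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source        = file.Get();
    request.Source.File.Offset        = 0;
    request.Source.File.Size          = compressedSize;
    request.UncompressedSize          = uncompressedSize;
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = uncompressedSize;

    queue->EnqueueRequest(&request);
    queue->Submit(); // the GPU decompression runs as part of this queue's work
}
```

In practice you'd also enqueue a status or fence signal to know when the data has landed, and cap how much you submit per frame - that per-frame cap is effectively the budget being talked about here.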

I'd say less than 5% penalty would be in the acceptable realm, but anything more and you're simply getting into the territory of "Why do we not just have something that can do this better and not sap precious GPU resources?".
What's "something"? Are you willing to pay for a dedicated expansion card which will handle this? I'd imagine that 99.9% of gamers would prefer their existing GPU to take care of that, and if the results on screen are solid the 5, 10, 15 or 20% performance cost would be irrelevant since that'd be what the developers intended to spend on this workload.
 
Maybe nVidia will add a dedicated storage block in a future GPU generation, with DirectStorage xx supporting it later on :eek: And then AMD and Intel will follow, as usual...
 
Back on the GPU front: how do modern GPUs handle multiple simultaneous work kernels, in the sense of prioritization? I'm obviously asking from the perspective of, how do we keep a GDeflate work kernel moving but at the same time deprioritize it enough to keep regular shader traffic moving at an acceptable rate? Is such a thing possible on today's hardware? Is this perhaps a possible variation / derivation on Variable Rate Shading?
 
Modern PCs usually have a surfeit of CPU power; I'm surprised no one has explored how an optimized, vectorized CPU decompression implementation compares.
 
Back on the GPU front: how do modern GPUs handle multiple simultaneous work kernels, in the sense of prioritization?
Nv and AMD have ways of specifying thread priority AFAIR, but I don't think these are exposed anywhere in DX, so in essence the only way is to make sure that the payload fits into async window(s). Which in itself is hard to do across all the relevant/supported h/w.

I do wonder if DS needs better priority controls which are missing at the moment.
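For completeness, the one knob D3D12 does expose is per command queue rather than per thread, and the usual way to carve out an "async window" is a separate compute queue that the graphics queue only waits on where it actually consumes the results. A rough sketch, assuming `device`, `graphicsQueue`, a recorded `decompressCmdList`, `fence` and `fenceValue` already exist:

```cpp
// Sketch: put decompression-style compute work on its own async compute queue
// so it can overlap rendering instead of serializing with it.
D3D12_COMMAND_QUEUE_DESC computeDesc{};
computeDesc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
computeDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL; // HIGH also exists

Microsoft::WRL::ComPtr<ID3D12CommandQueue> computeQueue;
device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

// Kick the decompression dispatches on the compute queue...
ID3D12CommandList* lists[] = { decompressCmdList };
computeQueue->ExecuteCommandLists(1, lists);
computeQueue->Signal(fence.Get(), ++fenceValue);

// ...and make the graphics queue wait only at the point where the
// decompressed data is actually read, not for the whole pass.
graphicsQueue->Wait(fence.Get(), fenceValue);
```

How well that window actually hides the decompression cost still depends on the GPU in question, which is the hard part mentioned above.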
 
Modern PCs usually have a surfeit of CPU power; I'm surprised no one has explored how an optimized, vectorized CPU decompression implementation compares.
The intent of all these newfangled DirectI/O and DirectStorage features is to minimize or eliminate CPU and main memory accesses, therefore reducing overall latency of the transfer and unpack from disk to VRAM. Looping the CPU back into the equation would be a step away from the desired outcome. Each move on the CPU also requires main memory access, which is (at least) one order of magnitude slower than GPU <-> VRAM accesses. Now, does that really impact the performance of modern games? Maybe not today, maybe it does tomorrow.
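For rough context on that last point (ballpark peak figures, not measurements): a PCIe 4.0 x16 link tops out around 32 GB/s, dual-channel DDR5 sits somewhere in the 70-90 GB/s range, while the GDDR6/6X on current high-end cards is in the roughly 700-1000 GB/s range, which is where the order-of-magnitude comparison comes from.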

Nv and AMD have ways of specifying thread priority AFAIR, but I don't think these are exposed anywhere in DX, so in essence the only way is to make sure that the payload fits into async window(s). Which in itself is hard to do across all the relevant/supported h/w.

I do wonder if DS needs better priority controls which are missing at the moment.
Thanks @DegustatoR for the answer. Since you mention the two big IHVs have ways of specifying priority for GPU work but not via DX, am I to assume this is then exposed either by GPGPU methods (e.g. OpenCL / HIP) and/or Vulkan? Time for the googles!
 
The intent of all these newfangled DirectI/O and DirectStorage features is to minimize or eliminate CPU and main memory accesses, therefore reducing overall latency of the transfer and unpack from disk to VRAM. Looping the CPU back into the equation would be a step away from the desired outcome. Each move on the CPU also requires main memory access, which is (at least) one order of magnitude slower than GPU <-> VRAM accesses. Now, does that really impact the performance of modern games? Maybe not today, maybe it does tomorrow.
But, as I understand it, there is no storage -> GPU path right now even with GPU decompression enabled; a staging area in CPU memory is still required. Which raises the question: why not just do it on the CPU?
 
But, as I understand it, there is no storage -> GPU path right now even with GPU decompression enabled; a staging area in CPU memory is still required. Which raises the question: why not just do it on the CPU?
There can be no "storage -> GPU path" right now or ever, because the storage in question isn't connected to a GPU. Even if we entertain the idea of a system where a GPU can just read any data from system storage (whether it's connected to the CPU's PCIe or directly to the GPU), that wouldn't bring us any improvements over the current implementation, where data is read into an upload heap in system memory and then copied to GPU local memory - because system RAM b/w is several times higher than the PCIe b/w over which such a copy occurs.

The question of why not just do it on the CPU doesn't have a definitive answer, because you can't predict the CPU or GPU a system will have. Which is why MS explicitly recommends a user-facing option for what to use for data decompression - something which is, for some reason, completely missing from the implementations we have up to now.

Generally though, transferring data in compressed form improves effective bus b/w, so if that is the limiting bottleneck in whatever you're doing, decompressing the data on the GPU may lead to sizeable performance improvements.
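As a quick worked example of that last point, with a purely illustrative 2:1 GDeflate ratio: effective bandwidth is roughly raw link bandwidth multiplied by the compression ratio, so a ~32 GB/s PCIe 4.0 x16 link ends up delivering on the order of 64 GB/s of uncompressed asset data per second when the decompression happens on the GPU side of the link.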
 
The biggest bottleneck is, and has been for a long time now, the game software itself. If a future game requires fast loading, they will put in the work to code and optimize it to work as it should, DStorage required or not.

It simply needs to be a priority. If the studio sets out to make a game pushing loading and streaming tech, then they can do it with what's provided.
 
There can be no "storage -> GPU path" right now or ever, because the storage in question isn't connected to a GPU. Even if we entertain the idea of a system where a GPU can just read any data from system storage (whether it's connected to the CPU's PCIe or directly to the GPU), that wouldn't bring us any improvements over the current implementation, where data is read into an upload heap in system memory and then copied to GPU local memory - because system RAM b/w is several times higher than the PCIe b/w over which such a copy occurs.

The question of why not just do it on the CPU doesn't have a definitive answer, because you can't predict the CPU or GPU a system will have. Which is why MS explicitly recommends a user-facing option for what to use for data decompression - something which is, for some reason, completely missing from the implementations we have up to now.

Generally though, transferring data in compressed form improves effective bus b/w, so if that is the limiting bottleneck in whatever you're doing, decompressing the data on the GPU may lead to sizeable performance improvements.
In theory PCIe should support peer-to-peer transfers that bypass host memory, but I don't know if Intel or AMD support this properly. And yes, you wouldn't realize a throughput benefit, only possibly a latency benefit, by enabling this.
 
In theory PCIe should support peer-to-peer transfers that bypass host memory, but I don't know if Intel or AMD support this properly. And yes, you wouldn't realize a throughput benefit, only possibly a latency benefit, by enabling this.
Make Radeon SSGs a thing again. Game data stored on SSDs directly attached to GPUs
 
In theory PCIe should support peer-to-peer transfers that bypass host memory, but I don't know if Intel or AMD support this properly.
It could, if all the storage options out there were smart enough to read data from themselves from any file system known to man.
Otherwise you'll still need the CPU to read the data and put it into a GPU-recognizable form somewhere first, for the GPU to be able to load and decompress it.
In other words I don't see this as a viable option, and I don't really see any benefits from it either. CPUs are completely capable of handling data reads and staging buffers in system memory. By avoiding all that with a complex and expensive solution you're freeing up what is likely <1% of CPU time and <100MB of system memory.
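To make the "data reads and staging buffers in system memory" part concrete, this is roughly what the conventional D3D12 path looks like. A simplified sketch, where `device`, `cmdList` and `fileBytes` are assumed to already exist, the d3dx12.h helpers are used for brevity, and error handling plus resource-state transitions are elided:

```cpp
// Assumes <d3d12.h>, d3dx12.h and <wrl/client.h> are included and that
// `fileBytes` is a std::vector<uint8_t> already read from disk (e.g. via ReadFile).
Microsoft::WRL::ComPtr<ID3D12Resource> upload, dest;

CD3DX12_HEAP_PROPERTIES uploadHeap(D3D12_HEAP_TYPE_UPLOAD);
CD3DX12_HEAP_PROPERTIES defaultHeap(D3D12_HEAP_TYPE_DEFAULT);
CD3DX12_RESOURCE_DESC bufDesc = CD3DX12_RESOURCE_DESC::Buffer(fileBytes.size());

// 1. Staging buffer in system memory (upload heap), written by the CPU.
device->CreateCommittedResource(&uploadHeap, D3D12_HEAP_FLAG_NONE, &bufDesc,
    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&upload));

// 2. Final buffer in GPU local memory (default heap).
device->CreateCommittedResource(&defaultHeap, D3D12_HEAP_FLAG_NONE, &bufDesc,
    D3D12_RESOURCE_STATE_COPY_DEST, nullptr, IID_PPV_ARGS(&dest));

// 3. CPU copies the file contents into the staging buffer...
void* mapped = nullptr;
upload->Map(0, nullptr, &mapped);
memcpy(mapped, fileBytes.data(), fileBytes.size());
upload->Unmap(0, nullptr);

// 4. ...and the GPU copies it across PCIe into its own memory.
cmdList->CopyBufferRegion(dest.Get(), 0, upload.Get(), 0, fileBytes.size());
```

The CPU work here is essentially a file read and a memcpy, which is the "<1% of CPU time and <100MB of system memory" being referred to above.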
 
It could if all storage options out there would be smart enough to read data from themselves from any file system known to men.
Otherwise you'll still need CPU to read the data and put it into a GPU recognizable form somewhere first for the GPU to be able to load and decompress it.
In other words I don't see this as a viable option, and also I don't really see any benefits from that. CPUs are completely capable of handling data reads and staging buffers in system memory. By avoiding all that with a complex and expensive solution you're freeing up what is likely <1% of CPU time and <100MB of system memory.

It could be arranged so that the CPU issues commands to read from specific sectors on the storage and sets video memory as the target address. This way there's no need to move the data into main memory, so you can free up some main memory bandwidth. Note that the main memory bandwidth saving is doubled, because otherwise the GPU also has to read the data back out of main memory.
However, for historical reasons many PC systems do not have storage connected to the CPU directly, but to a separate I/O chip. In many cases the CPU connects to the I/O chip via some proprietary link, so I'm not sure this can be done reliably on most systems.
 
But, as I understand it, there is no storage -> GPU path right now even with GPU decompression enabled; a staging area in CPU memory is still required. Which raises the question: why not just do it on the CPU?
If the GPU and the platform support resizable BAR (“Smart Access Memory”), the GPU can expose its local memory in full in the system physical address space, and in turn you can ask the storage controller to write (DMA) to the GPU local memory.

APIs like DirectStorage are meant to be OS-level support of this capability.

RDMA products in HPC work similarly, but with a NIC rather than an NVMe/storage controller.
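On the Linux/HPC side the storage variant of this already ships as NVIDIA's GPUDirect Storage, exposed through the cuFile API, where the NVMe read DMAs straight into a CUDA device allocation without a host bounce buffer. A rough sketch (the path is illustrative, `devPtr` is assumed to be a cudaMalloc'd buffer, and error checking is omitted):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cufile.h>

// Sketch of a GPUDirect Storage read: the storage DMA targets GPU memory.
void DirectReadToGpu(void* devPtr, size_t size)
{
    cuFileDriverOpen();

    int fd = open("/data/assets.bin", O_RDONLY | O_DIRECT); // illustrative path

    CUfileDescr_t descr{};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    cuFileBufRegister(devPtr, size, 0);      // pin the GPU buffer for DMA
    cuFileRead(handle, devPtr, size, 0, 0);  // file offset 0 -> devPtr offset 0

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
}
```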
 
In theory PCIe should support peer-to-peer transfers that bypass host memory, but I don't know if Intel or AMD support this properly. And yes, you wouldn't realize a throughput benefit, only possibly a latency benefit, by enabling this.
There was a catch - the NVMe spec didn't specify that different queues should be flagged to permit out-of-order transfers, so that a stalling CPU memory controller wouldn't prevent transfers to the GPU and vice versa. It didn't matter when you used one queue per CPU core (which simplifies scheduling from the CPU side), as everything ends up being buffered in the same L3 regardless. It matters a lot, though, when one of the peers can't make guarantees about when it can accept the transfer.

It's just a fine detail - but it renders scheduling unreliable and can have a negative impact on system performance. And since that behavior isn't required by the specification, it's not supported by NVMe firmware...

Software-wise (the OS, and partially also the driver end), as well as on the hardware side (CPU and GPU), everything appears to have been prepared to just open up a couple of additional NVMe queues and stream directly to ReBAR-capable devices...
 
We've discussed this in the recent past; it's worth reminding people again:

Just because it's technically possible to have a storage controller directly target video memory for a data transfer doesn't mean you can actually accomplish this task without involving main memory and the CPU. You must remember whole-disk encryption (aka BitLocker) is mandated for all manufacturers who sell Windows 10 and Windows 11-labeled PCs. Further, Windows 11 will try to enforce BitLocker without asking during a new installation. When BitLocker is enabled, the literal raw blocks on the storage device are encrypted and require the OS to broker the decryption process. This isn't something a GPU will accomplish.

Even if we ignore full disk encryption, partitions and filesystems and encryption at a filesystem level (EFS) are also still a thing. All of these require CPU overhead to manage access patterns, logical identity and access controls, file system decompression and decryption, and to maintain filesystem metadata such as last access time.

Unpacking the various layers of NTFS will never be as simple as DMA from storage to VRAM.
 
Further, Windows 11 will try to enforce BitLocker without asking during a new installation.
When I installed Windows 11 fresh on my new computer in October 2023, it did not default to BitLocker on.
To be fair, I did the sneaky OOBE trick install where I enabled the feature that lets you skip the usually required Microsoft account step, so that may be related.
 