DirectStorage GPU Decompression, RTX IO, Smart Access Storage

Anyone know if it's been confirmed to be in use? The Agility SDK ships with these DStorage files by default, but that's not really a sign that they're actually used. Diablo 4 also ships with those files, but the application never actually loads them.
The game's process is loading the DLLs. Whether it's actually in use, though, I don't know.

On a side note, I find it both funny and sad how much the hype surrounding DStorage has died. The new Forza supports it, but people seemingly no longer care enough to test it in any meaningful way.
That's because people generally misunderstand what APIs are. The fact that a new API feature becomes available doesn't mean that lots of applications using it will suddenly appear. DStorage is like any other API: the engines are what need to change the most to make proper use of it.
 
And even when they do use them, no API is a magic bullet that will change everything overnight; there are a million other things affecting load times, file streaming performance and whatnot besides raw data transfer and unpacking speed.
 
That's what I said. The improvements people are craving from DStorage actually need to happen in the game engines.
 
I think the people who are actually following things understand that the issue was always software-related. Ratchet and Clank came out and essentially confirmed that PC could easily keep up with the PS5 without DirectStorage, and it's one of the best examples of SSD usage on that console to this day. It became clear with Forspoken as well that games can load extremely quickly if the developers make that a goal. So you had two games which were PS5 exclusives, whose claim to fame was ultra-fast loading... and they both worked perfectly fine on PC without GPU-based decompression, or even DirectStorage at all.

Bandwidth honestly isn't being stressed at all, especially in streaming scenarios. 5 GB/s is even overkill for this generation. People understand that better now than they did before. I'm not saying that won't change as the generation goes on, but currently I think people's minds have been put at ease a bit, which is why the hype has died.
 
On a side note: what's the current compression ratio you can achieve for textures and models, respectively, with DStorage? Are we at the point yet where it has caught up with WebP and the like, or is this still a rather "naive" approach to compression that attempts to directly compress raw bitmaps without isolating the entropy or accounting for the need for mipmaps?

Do we have transcoding to the BC family of formats on the GPU yet? Or is this still only an option for GPUs with enough VRAM to afford entirely uncompressed textures?
 
That's just the explanation of how deflate works.

I meant whether the necessary transformations to render deflate efficient were in the default chain yet. You know, stuff like representing color channels in a multi-planar layout as offsets to each other to reduce the entropy in the derived channels. Run-level coding to turn gradients into constants. Reducing each mip to a diff against the next-smaller mip level, so that each level only encodes a narrow frequency band of the information.

Basically, every possible transformation that ensures that information is not repeated or masked by interference with other entropy within the uncompressed, raw buffer passed to deflate. Just the basic preprocessing necessary to ensure that the alphabet size for the Huffman encoder is reduced, and common "features" actually end up as a common symbol.

All the stuff that makes the difference between zipping a raw .bmp file, which will top out at somewhere between a 20% and 40% compression ratio, and a more advanced format like WebP, which will usually get you close to a 1-5% compression ratio instead. And that despite both using the same deflate-family algorithm as the final encoding step, with WebP even having a slightly larger input buffer.
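
To make that concrete: here is a minimal CPU-side sketch of one such pre-pass, plain C++ and not anything DStorage ships, with the subtract-green idea borrowed from PNG/WebP-style filtering. The output buffer is what you would then hand to a deflate/GDeflate encoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative pre-pass: convert interleaved RGBA8 into a multi-planar
// layout where R and B are stored as offsets from G. The transform is
// lossless and trivially reversible; its only job is to shrink the
// alphabet the later Huffman stage has to cover.
std::vector<uint8_t> decorrelate_rgba(const uint8_t* rgba, size_t pixelCount)
{
    std::vector<uint8_t> planes(pixelCount * 4);
    uint8_t* r = planes.data();      // will hold R - G
    uint8_t* g = r + pixelCount;     // G stays as-is
    uint8_t* b = g + pixelCount;     // will hold B - G
    uint8_t* a = b + pixelCount;     // alpha is usually flat already

    for (size_t i = 0; i < pixelCount; ++i) {
        const uint8_t R = rgba[i * 4 + 0];
        const uint8_t G = rgba[i * 4 + 1];
        const uint8_t B = rgba[i * 4 + 2];
        g[i] = G;
        r[i] = static_cast<uint8_t>(R - G);   // wraps mod 256, reversible
        b[i] = static_cast<uint8_t>(B - G);
        a[i] = rgba[i * 4 + 3];
    }
    return planes;
}
```

On a typical albedo texture the R-G and B-G planes cluster tightly around zero, which is exactly the smaller-alphabet situation the Huffman stage benefits from.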
 
Transcoding to something like BCn is probably still too expensive.

More like: it has been viable for half a decade, and known for a full decade. It just hasn't been applied to transcoding yet, only to pure encoding from render-to-texture output. A 3080 can re-compress a 4K texture with reasonable quality in around 40 µs.
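
For a sense of scale of the per-block work: a deliberately naive CPU reference of a BC1 block encoder (pick two endpoints, quantize to RGB565, emit 2-bit indices) is about this much code. The GPU encoders being talked about run the same idea, typically one thread or wave per 4x4 block, plus far better endpoint refinement; treat this as an illustration only.

```cpp
#include <algorithm>
#include <cstdint>

// Naive BC1 encoder for one 4x4 block of RGB texels (alpha ignored,
// degenerate single-colour blocks not special-cased). Output is 8 bytes,
// i.e. 4 bpp.
static uint16_t to565(int r, int g, int b)
{
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

void encode_bc1_block(const uint8_t rgb[16][3], uint8_t out[8])
{
    // Endpoints: the two texels that are most extreme along a luma axis.
    auto luma = [&](int i) { return 2 * rgb[i][0] + 5 * rgb[i][1] + rgb[i][2]; };
    int lo = 0, hi = 0;
    for (int i = 1; i < 16; ++i) {
        if (luma(i) < luma(lo)) lo = i;
        if (luma(i) > luma(hi)) hi = i;
    }
    uint16_t c0 = to565(rgb[hi][0], rgb[hi][1], rgb[hi][2]);
    uint16_t c1 = to565(rgb[lo][0], rgb[lo][1], rgb[lo][2]);
    if (c0 < c1) std::swap(c0, c1);           // c0 > c1 selects 4-colour mode

    // Rebuild the 4-entry palette exactly as the decoder will.
    int pal[4][3];
    auto expand = [](uint16_t c, int* p) {
        p[0] = ((c >> 11) & 31) * 255 / 31;
        p[1] = ((c >> 5) & 63) * 255 / 63;
        p[2] = (c & 31) * 255 / 31;
    };
    expand(c0, pal[0]);
    expand(c1, pal[1]);
    for (int ch = 0; ch < 3; ++ch) {
        pal[2][ch] = (2 * pal[0][ch] + pal[1][ch]) / 3;
        pal[3][ch] = (pal[0][ch] + 2 * pal[1][ch]) / 3;
    }

    // One 2-bit index per texel: nearest palette entry by squared error.
    uint32_t indices = 0;
    for (int i = 0; i < 16; ++i) {
        int best = 0, bestErr = 1 << 30;
        for (int j = 0; j < 4; ++j) {
            const int dr = rgb[i][0] - pal[j][0];
            const int dg = rgb[i][1] - pal[j][1];
            const int db = rgb[i][2] - pal[j][2];
            const int err = dr * dr + dg * dg + db * db;
            if (err < bestErr) { bestErr = err; best = j; }
        }
        indices |= static_cast<uint32_t>(best) << (2 * i);
    }

    // Block layout: colour0, colour1 (little-endian RGB565), then indices.
    out[0] = uint8_t(c0); out[1] = uint8_t(c0 >> 8);
    out[2] = uint8_t(c1); out[3] = uint8_t(c1 >> 8);
    out[4] = uint8_t(indices);       out[5] = uint8_t(indices >> 8);
    out[6] = uint8_t(indices >> 16); out[7] = uint8_t(indices >> 24);
}
```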
 
More like: it has been viable for half a decade, and known for a full decade. It just hasn't been applied to transcoding yet, only to pure encoding from render-to-texture output. A 3080 can re-compress a 4K texture with reasonable quality in around 40 µs.

But why do you want to do that? The only upside seems to be savings on disk space, which is not really that limiting these days.
Most games are not really limited by disk transfer time, and if you do JPEG/WebP decoding + BCn compression it's almost certainly going to be a performance loss (not to mention quality loss).
 
That's just the explanation of how deflate works.

I meant whether the necessary transformations to render deflate efficient were in the default chain yet. You know, stuff like representing color channels in a multi-planar layout as offsets to each other to reduce the entropy in the derived channels. Run-level coding to turn gradients into constants. Reducing each mip to a diff against the next-smaller mip level, so that each level only encodes a narrow frequency band of the information.

Basically, every possible transformation that ensures that information is not repeated or masked by interference with other entropy within the uncompressed, raw buffer passed to deflate. Just the basic preprocessing necessary to ensure that the alphabet size for the Huffman encoder is reduced, and common "features" actually end up as a common symbol.

All the stuff that makes the difference between zipping a raw .bmp file, which will top out at somewhere between a 20% and 40% compression ratio, and a more advanced format like WebP, which will usually get you close to a 1-5% compression ratio instead. And that despite both using the same deflate-family algorithm as the final encoding step, with WebP even having a slightly larger input buffer.
Hm, I don't think texture compression is a focus for DStorage. GDeflate is a general-purpose compression option that allows you to store assets on disk in a compressed format, send them to the GPU in that format, and decompress them there. The assets themselves are most likely BCn compressed if we're talking about textures specifically. GDeflate is an added compression layer for storage/distribution.
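
For context on where GDeflate sits in practice, here's a rough sketch of an engine-side load with DirectStorage 1.1+. Identifiers are from the public dstorage.h as I remember them and error handling is omitted, so treat it as an illustration rather than copy-paste code:

```cpp
#include <cstdint>
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Stream one GDeflate-compressed blob from disk straight into a GPU buffer.
// 'device' is an existing ID3D12Device and 'destBuffer' is large enough to
// hold the uncompressed data; the runtime/driver handles the decompression.
void LoadCompressedAsset(ID3D12Device* device, ID3D12Resource* destBuffer,
                         const wchar_t* path, uint32_t compressedSize,
                         uint32_t uncompressedSize)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(path, IID_PPV_ARGS(&file));

    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source        = file.Get();
    request.Source.File.Offset        = 0;
    request.Source.File.Size          = compressedSize;   // bytes on disk
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = uncompressedSize;
    request.UncompressedSize            = uncompressedSize;

    queue->EnqueueRequest(&request);
    queue->Submit();   // a real loader would also enqueue a fence signal
                       // to know when the data is ready for use
}
```

The point being: the on-disk payload stays whatever the engine packed (BCn or otherwise); GDeflate is purely the transport layer on top.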

As for textures, the most interesting thing lately was Nv's AI texture compression paper.
 
The only upside seems to be savings on disk space, which is not really that limiting these days.
I guess perception differs here? 100 GB+ of assets is still not a nice thing to have when you have to account for the entire infrastructure cost of deployment.
The player's own storage may not be paid out of your pocket, but everything leading up to that is.

There is not much the distribution channel can do to get the transfer volume down either. Assets that already have half-baked compression applied don't exactly compress well.
JPEG/WebP decoding + BCn compression it's almost certainly going to be a performance loss (not to mention quality loss).
If you skip the BCn part: virtually lossless. The 5% compression figure for WebP is a good estimate for the equivalent of a >95% quality JPEG, and it still beats BCn quality by a long shot.
The BCn part is necessary, though, to cut the VRAM cost in half; you still have target systems where that's needed.

Performance loss? Certainly, but the biggest part is already GDeflate itself. An intelligent choice of intermediate formats is what renders that worthwhile in the first place. Reorganizing the data layout, together with some non-packing run-level coding, adds only minimal extra cost.
The assets themselves are most likely BCn compressed if we're talking about textures specifically.
I would certainly hope not. Trying to compress naked BCn with deflate doesn't yield a good compression rate at all. GDeflate is just a common building block, not a general-purpose tool. Applying it to BCn-compressed content would barely eliminate uniform-color patches, but nothing beyond that.
The same goes for geometry. You can take an "everything looks like a nail if all you have is a hammer" approach and apply it to raw vertex data and index lists, but that's completely missing out on the potential.
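
That claim is easy to sanity-check offline. A throwaway harness along these lines, with plain zlib standing in for GDeflate and made-up input file names, will show a raw BCn dump of a texture shrinking far less than a decorrelated planar dump of the same texels:

```cpp
#include <cstdio>
#include <vector>
#include <zlib.h>

// Deflate-compressed size of a buffer at maximum compression, used here as
// a rough stand-in for what a GDeflate encoder would achieve on the payload.
static size_t deflated_size(const std::vector<unsigned char>& src)
{
    uLongf dstLen = compressBound(static_cast<uLong>(src.size()));
    std::vector<unsigned char> dst(dstLen);
    compress2(dst.data(), &dstLen, src.data(),
              static_cast<uLong>(src.size()), Z_BEST_COMPRESSION);
    return dstLen;
}

static std::vector<unsigned char> load_file(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> data(static_cast<size_t>(size));
    std::fread(data.data(), 1, data.size(), f);
    std::fclose(f);
    return data;
}

int main()
{
    // Hypothetical dumps of the same texture: raw BC7 blocks vs. the
    // decorrelated multi-planar layout discussed earlier in the thread.
    const auto bc7    = load_file("albedo_bc7.bin");
    const auto planes = load_file("albedo_planes.bin");
    if (bc7.empty() || planes.empty()) return 1;
    std::printf("BC7 blocks:  %.1f%% of original\n",
                100.0 * deflated_size(bc7) / bc7.size());
    std::printf("planar dump: %.1f%% of original\n",
                100.0 * deflated_size(planes) / planes.size());
}
```
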
As for textures, the most interesting thing lately was Nv's AI texture compression paper.
A great choice if your concern is making the most of VRAM, at the expense of extra computational power: 4x less storage space than BCn, traded for a 4x-32x slowdown on texture accesses. I don't think we can afford to waste that much computational power at runtime any time soon. And in terms of compression ratio versus computational expense, in any scenario where you can stream assets and therefore amortize the decompression cost, this is still beaten by more advanced algorithms by 1.5-5x in compression ratio.

So no, this is not really a viable approach, and won't be for the foreseeable future.
 
I guess perception differs here? 100 GB+ of assets is still not a nice thing to have when you have to account for the entire infrastructure cost of deployment.
The player's own storage may not be paid out of your pocket, but everything leading up to that is.

There is not much the distribution channel can do to get the transfer volume down either. Assets that already have half-baked compression applied don't exactly compress well.

If you skip the BCn part: virtually lossless. The 5% compression figure for WebP is a good estimate for the equivalent of a >95% quality JPEG, and it still beats BCn quality by a long shot.
The BCn part is necessary, though, to cut the VRAM cost in half; you still have target systems where that's needed.

Performance loss? Certainly, but the biggest part is already GDeflate itself. An intelligent choice of intermediate formats is what renders that worthwhile in the first place. Reorganizing the data layout, together with some non-packing run-level coding, adds only minimal extra cost.

I think the problem here is that players don't like long loading times, so it's probably more acceptable for most people to have larger game files but shorter loading times. If download size is really a big concern, you can always have assets compressed with WebP/JPEG and transcode them at installation time.

I'm not sure if it's possible on all GPUs, but since many GPUs have hardware video decoders, it might be possible to use some sort of H.264/H.265 intra frame to encode images and decode them on the GPU, which might be fast enough (not sure about this, because a consumer GPU's video decoder only needs to be fast enough for playing real-time video).
 
I'm not sure if it's possible on all GPUs, but since many GPUs have hardware video decoders, it might be possible to use some sort of H.264/H.265 intra frame to encode images and decode them on the GPU, which might be fast enough (not sure about this, because a consumer GPU's video decoder only needs to be fast enough for playing real-time video).
That would work surprisingly badly, at least with those two codecs. H.264 and H.265 intra frames need around 1 MB and 500 kB respectively for a 4K frame to reach a PSNR where the loss is visually insignificant. Those codecs have aged badly.

Coincidentally, the P and especially the B frames are incredibly well suited to compressing similar textures. So if you sort your textures by theme, and your artist used a lot of common brushes, the chances that entire features will simply be re-projected between frames are great. You can even "cheat" the codec and use both (I or P) boundary frames as texture catalogs, and the B-frames in between can then simply cross-reference them despite not actually sharing visual similarity with either.

The performance wouldn't be great though: just 100-200 4K frames per second, and even less if you go all-intra.

I think the problem here is that players don't like long loading times, so it's probably more acceptable for most people to have larger game files but shorter loading times.
I don't really see long loading times either, though... GDeflate is fast enough even for older architectures; I think we have established that by now. Some simple format conversions on top, in the two-digit-microsecond range per texture, can't exactly hurt either, I expect?
 

That's the problem though. If you use WebP or JPEG you'll have to decode using the CPU, which is slow. Then you'll have to transfer the decoded data, meaning at uncompressed size, over PCIe to the GPU, and then use the GPU to do the transcoding (and maybe build the mip chain at the same time). Compared to using GDeflate to load and decompress BCn textures, this does not seem to be a win.

I don't think H.265 intra-frame is too bad, but anyway, WebP uses VP8, which newer NVIDIA GPUs can decode at up to 4Kx4K through NVDEC. I'm not sure how fast the decoding is (NVIDIA claims "much faster than real time", at least on professional GPUs), so it might be possible, but compatibility with other GPU vendors still needs to be considered, which would require different code paths for each vendor.

I'm not so sure about using P and B frames for textures though. P and B frames tend to have less clarity, and that's fine for a video frame which is displayed for only 1/30 of a second, but for a static texture it can be too blurry. I-frame only is probably the safe bet.
 
That's the problem though. If you use WebP or JPEG you'll have to decode using the CPU, which is slow.
Actually you don't. Well, kind of don't.

You can just take 95% of the WebP codec but substitute GDeflate for the final deflate implementation. Everything else in that codec is already a perfect fit for the GPU, and suddenly you've got it on the GPU, with all the benefits. Benefits such as: mipmap chains with perfect frequency-preserving sampling come for free as part of the compression scheme. You just need some extra metadata to correctly dispatch the reconstruction of the final target buffers, and a container format you can pass straight into GDeflate.
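
The container/metadata side of that is not much more than bookkeeping. Something along these lines, with entirely hypothetical names, just to illustrate the split between the GDeflate payload and the reconstruction metadata:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical container for the scheme described above: the entropy-coded
// payload goes straight into the GPU decompressor, while a small metadata
// side channel tells a reconstruction compute pass how to rebuild each
// plane / mip level from the decompressed bytes.
struct PlaneDesc {
    uint32_t byteOffset;   // into the decompressed payload
    uint32_t byteSize;
    uint16_t width, height;
    uint8_t  channel;      // e.g. 0 = G, 1 = R-G, 2 = B-G, 3 = A
    uint8_t  mipLevel;
};

struct TexturePackage {
    uint32_t               uncompressedSize;  // declared output size of the request
    std::vector<PlaneDesc> planes;            // consumed by the reconstruction pass
    std::vector<uint8_t>   payload;           // bit-exact blob handed to GDeflate
};
```
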

JPEG? Well, the same applies. I used that a decade ago; back then only the JPEG depacketization and the Huffman decoding needed to be done on the CPU, as they didn't scale properly on the GPU. Everything else was better placed on the GPU and scaled perfectly. Actually better scalability than H.264 via the hardware decoder at the time, at least if you didn't go for the bottom-of-the-line model...
Nowadays, you would likewise just take a modified JPEG format with the deflate part substituted, wrap it in a nicer container, but still keep the whole macroblock handling as it is.

Neither of these is WebP or JPEG in the sense of being binary-compatible, but the important part is neither the container format nor the bitstream; it's the know-how in the block encoding (JPEG) and the channel/frequency-domain isolation (WebP), respectively. Transcoding an existing WebP or JPEG file this way is trivial: it's literally just depacketizing, deflate-decompressing the payload, recompressing, and writing the metadata somewhere else. That's it.

Fun fact: even the AI compression paper @DegustatoR mentioned builds, for the most part, on the design principles of the lossless WebP codec, at least the part where they break the texture down into a mip chain with isolated frequency ranges before applying their own encoder to each level individually.
WebP uses VP8
Only optionally: VP8 intra-frame coding is used only for the lossy codec.
The interesting part about WebP is the lossless codec, which is the one I'm referring to when talking about switching to a multi-planar layout, run-level coding and built-in mip chains.
I'm not so sure about using P and B frames for textures though. P and B frames tend to have less clarity, and that's fine for a video frame which is displayed for only 1/30 of a second, but for a static texture it can be too blurry.
That's just encoder configuration. Nothing else.
 
Only optionally: VP8 intra-frame coding is used only for the lossy codec.
The interesting part about WebP is the lossless codec, which is the one I'm referring to when talking about switching to a multi-planar layout, run-level coding and built-in mip chains.

Lossless WebP is unlikely to be better than, say, BC3 or BC7 in compression ratio. Normally you get maybe 12-14 bpp, while BC3/BC7 is 8 bpp with alpha (or 4 bpp if you use BC1).
 
A new DirectStorage test has been released, developed by AMD: "The DirectStorage sample application renders a scene while asynchronously loading assets. A camera moves through a scene into bounding boxes which trigger asynchronous streaming of the assets. The assets appear once they have completed loading." The test compares the Win32 API (DirectStorage off) against DirectStorage with CPU decompression and DirectStorage with GPU decompression.

With CPU decompression, bandwidth increases by 2x to 5x vs the Win32 API, and load times improve by 40% to 200% depending on the scene. However, CPU decompression sometimes comes with a performance loss of about 2% vs the Win32 API.

With GPU decompression, bandwidth increases by at least 3x vs CPU decompression, and load times improve by a further 50% to 400%. However, the performance loss grows to around 10%.

 
So is that performance loss coming from overloading VRAM, or from the actual compute cost?
Either way, this isn't good enough. It looks like dedicated decompression hardware may be needed, like on consoles, unless PCs are just expected to brute-force this forever.
 
Either way, this isn't good enough. It looks like dedicated decompression hardware may be needed, like on consoles, unless PCs are just expected to brute-force this forever.
PCs will do just fine with GPU decompression, but it does require additional care from developers to make sure it doesn't block graphics rendering and cause a hitch/stutter. A naive implementation will lead to just that, since loading and decompression can easily saturate all the available bandwidth, leaving nothing for the rendering system.
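
A minimal sketch of what that "additional care" can look like, assuming a DirectStorage queue is already set up. The per-frame byte budget and the class itself are made up for illustration; the point is simply that requests get metered out instead of dumped onto the queue all at once:

```cpp
#include <cstdint>
#include <deque>
#include <dstorage.h>

// Meters streaming requests so GPU decompression doesn't saturate the
// bandwidth a frame needs for rendering. The caller keeps the resources
// referenced by each DSTORAGE_REQUEST alive until completion.
struct PendingLoad {
    DSTORAGE_REQUEST request;
    uint64_t uncompressedBytes;
};

class StreamingThrottle {
public:
    StreamingThrottle(IDStorageQueue* queue, uint64_t bytesPerFrame)
        : m_queue(queue), m_budget(bytesPerFrame) {}

    void Push(const PendingLoad& load) { m_pending.push_back(load); }

    // Call once per frame, e.g. right after Present(). Always lets at least
    // one request through so oversized loads can't stall forever.
    void SubmitFrame()
    {
        uint64_t spent = 0;
        while (!m_pending.empty() &&
               (spent == 0 ||
                spent + m_pending.front().uncompressedBytes <= m_budget)) {
            m_queue->EnqueueRequest(&m_pending.front().request);
            spent += m_pending.front().uncompressedBytes;
            m_pending.pop_front();
        }
        if (spent > 0)
            m_queue->Submit();
    }

private:
    IDStorageQueue*         m_queue;
    uint64_t                m_budget;
    std::deque<PendingLoad> m_pending;
};
```
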
 