DirectStorage GPU Decompression, RTX IO, Smart Access Storage

NVIDIA could probably use peer-to-peer DMA on Linux, but this would require major revisions to the WDDM driver model on Windows.
What are the chances NVIDIA is intending to use their myriad of Tensor cores to do the decompression heavy lifting? It would make sense, since they are expanding support to all RTX GPUs (Turing and Ampere) and Tensor cores are only found on RTX GPUs. They also sit mostly idle unless DLSS is engaged, and even then they only work for a fraction of a second after each frame.
 
What are the chances NVIDIA is intending to use their myriad of Tensor cores to do the decompression heavy lifting?
Tensor cores use reduced-precision floating point processing, intended for training of neural networks.
https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/

But LZ-family compression algorithms treat the data as a stream of 8-bit integers (i.e. plain bytes, like ASCII text), and decoding is a simple dictionary look-up. There's no use for reduced-precision arithmetic during decompression.
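To make the point concrete, here is a toy sketch of LZ-style decoding (the token format is made up for illustration, not any real LZ codec): the decoder only copies bytes from already-decoded output, so there is no floating-point work, reduced-precision or otherwise.

```python
# Toy LZ-style decompression sketch: each token is either a literal byte
# or a (distance, length) back-reference into the already-decoded output.
# Decoding is pure byte copying / table lookup - no arithmetic on values.

def lz_decode(tokens):
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):        # literal byte
            out.append(tok)
        else:                           # (distance, length) back-reference
            dist, length = tok
            for _ in range(length):     # byte-at-a-time copy handles overlaps
                out.append(out[-dist])
    return bytes(out)

# "abcabcabc" encoded as three literals plus one overlapping back-reference
tokens = [ord('a'), ord('b'), ord('c'), (3, 6)]
print(lz_decode(tokens))  # b'abcabcabc'
```

Note how the back-reference (3, 6) copies more bytes than currently exist in the window: overlapping copies are exactly how LZ codecs encode runs cheaply.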


That said, neural-network algorithms can be used during the encoding process to improve compression ratios, as they can discover additional repeating patterns with a larger dictionary, at the cost of reducing processing bandwidth by an order of magnitude.
 
I also wonder if the new type of compression is dictionary-based; wouldn't that make it harder to run well on GPUs?

It would make sense that it's BCPack, but the fact that they went out of their way to not say BCPack muddies the waters for me.

How much detail do we have on BCPack?
 
This is what I'm saying. Lack of h/w support isn't an issue, PC h/w is updated fast. The issue is with the current install base.
What exactly do you consider fast? For example, RTX has achieved a 10% share of NVIDIA GPUs in a little over two years; is that fast?
Comments like these are often caused by tech forums giving a skewed picture of how most people update and upgrade hardware.
 
What exactly do you consider fast? For example, RTX has achieved a 10% share of NVIDIA GPUs in a little over two years; is that fast?
Comments like these are often caused by tech forums giving a skewed picture of how most people update and upgrade hardware.
Most games will still be targeting previous-gen console h/w by the time the PC gets parts with dedicated decompression units. I consider this "fast".
 
What about automatic, programmable XPRESS8K compression? Since it is NOT part of NTFS, re-compression after a write must be done manually (could the OS do it? So far it doesn't). Compression methods other than XPRESS4K are underwhelming on slower CPUs (XPRESS16K and LZX have the worst compression/overhead ratio).
 
I also wonder if the new type of compression is dictionary-based; wouldn't that make it harder to run well on GPUs?
The DEFLATE algorithm - i.e. LZ77/LZSS plus Huffman coding, used in the ZIP format (RFC 1951) - is already a combination of dictionary coding and entropy coding (Huffman). The dictionary coder works well with repeating patterns of bytes (i.e. text data), while the entropy coder uses a shorter 'prefix code', a 3-4 bit integer, to encode bytes with high occurrence.
Modern lossless image and audio compression formats also use entropy coding, specifically arithmetic coding.
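You can see the dictionary coder at work with Python's built-in zlib module, which implements DEFLATE: repetitive data collapses to almost nothing, while random data (no patterns for LZ77 to find, flat byte statistics for Huffman) barely shrinks at all.

```python
# DEFLATE (RFC 1951) via Python's zlib: dictionary + entropy coding in action.
import os
import zlib

repetitive = b"the quick brown fox " * 500   # 10,000 bytes of repeats
random_ish = os.urandom(10_000)              # nothing for LZ77/Huffman to exploit

print(len(zlib.compress(repetitive, 9)))     # a few hundred bytes at most
print(len(zlib.compress(random_ish, 9)))     # roughly 10,000 bytes (no gain)

# compression is lossless: round trip recovers the input exactly
assert zlib.decompress(zlib.compress(repetitive, 9)) == repetitive
```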

Other lossless compression methods are run-length encoding (RLE), which only works for streams of repetitive data, and wavelet encoding, which is good at encoding transients (i.e. audio signals and smooth gradients) but is computationally expensive. Neither is part of DEFLATE though.

How much detail do we have on BCPack?
Still not much. The GameStack 'DirectStorage for Windows' session featured a slide where the texture compression method is described as DEFLATE over standard BC (i.e. S3TC/DXTC); they didn't specifically name it BCPACK though (see the 11:50 mark in the video, and slide #12 in the PDF file).

I'd still stand by my earlier assumption that BCPACK uses a two-stage process similar to Oodle Leviathan/Kraken, i.e. an LZ-family compression pass over a "lossless transform" step which reorders bytes in a BCn texture to improve the compression ratio of the LZ pass. It could also include an improved lossy texture compression algorithm, similar to Oodle Texture Rate Distortion Optimization (RDO) processing, with finer control of quality/compression trade-offs, which decodes to BCn formats.
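A hedged sketch of the general "lossless transform before LZ" idea (in the spirit of what Kraken-style codecs do; the actual BCPACK transform is not public, so this is purely illustrative). BC1 blocks are 8 bytes: two 16-bit color endpoints followed by 4 bytes of 2-bit indices. Deinterleaving the blocks into an endpoint stream and an index stream groups statistically similar bytes together, which can help a later DEFLATE/LZ pass:

```python
# Illustrative BC1 deinterleave transform (NOT the real BCPACK transform):
# split each 8-byte block into its endpoint half and index half, then
# concatenate all endpoint bytes followed by all index bytes. The transform
# is fully reversible, so it loses nothing; it only rearranges bytes so an
# LZ coder sees longer, more uniform runs.
import zlib

BLOCK = 8  # bytes per BC1 block

def deinterleave_bc1(data: bytes) -> bytes:
    assert len(data) % BLOCK == 0
    endpoints, indices = bytearray(), bytearray()
    for off in range(0, len(data), BLOCK):
        endpoints += data[off:off + 4]    # color0 + color1 (2 x 16-bit)
        indices += data[off + 4:off + 8]  # 16 x 2-bit selectors
    return bytes(endpoints + indices)

def interleave_bc1(data: bytes) -> bytes:
    half = len(data) // 2
    endpoints, indices = data[:half], data[half:]
    out = bytearray()
    for off in range(0, half, 4):
        out += endpoints[off:off + 4] + indices[off:off + 4]
    return bytes(out)

# synthetic "texture": slowly varying endpoints, repetitive index patterns
blocks = b"".join(
    bytes([i // 8, 0x40, i // 8, 0x20]) + b"\xe4\xe4\x1b\x1b"
    for i in range(1024)
)
plain = len(zlib.compress(blocks, 9))
transformed = len(zlib.compress(deinterleave_bc1(blocks), 9))
print(plain, transformed)  # compare DEFLATE sizes with and without transform
assert interleave_bc1(deinterleave_bc1(blocks)) == blocks  # reversible
```

Whether the transform actually wins depends on the data; real codecs pick transforms per chunk and keep whichever compresses smaller.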
 
What about automatic, programmable XPRESS8K compression? Since it is NOT part of NTFS, re-compression after a write must be done manually (could the OS do it? So far it doesn't)
We've discussed this earlier in the DirectStorage for Windows thread.

CompactOS compression is currently implemented as a file system filter driver in the Windows I/O Manager stack. It's not just a command-line tool - it's available to any Windows application through the well-documented Compression API. Yes, it's true that any write will decompress the file, but game assets only need to be written once during the installation process (or by game updates), and new user data can be re-compressed by the application using the Compression API.

As an added benefit, CompactOS uses contiguous write allocations, so the compressed file is not fragmented and is written in one single chunk (provided there is enough free space on the disk) - whereas the older NTFS cluster-based compression results in a very heavily fragmented file (and it only works with 4K clusters, even though cluster sizes up to 64 KB, and recently up to 2 MB, are supported by NTFS).


I believe the CompactOS filter driver could be refactored into using GPU compute (or a hardware decoder) to process the data directly in the GPU memory, using some 'fast path' in the I/O Manager stack. But it remains to be seen if actual Windows implementation of DirectStorage includes improvements to CompactOS.
 
It would be nice to have automatic re-compression even on folders that are not directly controlled by a digital distribution client like Steam/GOG/EGS etc. A lot of games would gain something, at least in initial loading times, if not in almost all resource loading from disk. Unfortunately this is only possible with the old and inefficient NTFS compression.
A hardware decoder would be sweet for XPRESS16K or LZX, but with XPRESS4K and XPRESS8K we get some improvement even on a low-tier CPU.
 
Requirements for DirectStorage API are posted:
Any 1TB NVMe SSD
Any DirectX 12 Ultimate GPU

So that means only RTX 20, RTX 30 and RX 6000 GPUs.

RX 5000/ Radeon VII cards are left in the dust.

Question is, will RTX IO bring something new to the table on top of DirectStorage?
 
Requirements for DirectStorage API are posted:
Any 1TB NVMe SSD
Any DirectX 12 Ultimate GPU

So that means only RTX 20, RTX 30 and RX 6000 GPUs.

RX 5000/ Radeon VII cards are left in the dust.

Question is, will RTX IO bring something new to the table on top of DirectStorage?

The requirements read:
Windows 11 Specifications - Microsoft
  • DirectStorage requires an NVMe SSD to store and run games that use the "Standard NVM Express Controller" driver and a DirectX12 GPU with Shader Model 6.0 support.
That would mean this:
  • Shader Model 6.0 — GCN 1+, Kepler+, DirectX 12 (11_0+) with WDDM 2.1.
  • Shader Model 6.1 — GCN 1+, Kepler+, DirectX 12 (11_0+) with WDDM 2.3.
  • Shader Model 6.2 — GCN 1+, Kepler+, DirectX 12 (11_0+) with WDDM 2.4.
  • Shader Model 6.3 — GCN 1+, Kepler+, DirectX 12 (11_0+) with WDDM 2.5.
  • Shader Model 6.4 — GCN 1+, Kepler+, Skylake+, DirectX 12 (11_0+) with WDDM 2.6.
  • Shader Model 6.5 — GCN 1+, Kepler+, Skylake+, DirectX 12 (11_0+) with WDDM 2.7.
  • Shader Model 6.6 — GCN 1+, Kepler+, Skylake+, DirectX 12 (11_0+) with WDDM 2.9.
While these are excluded:
Source:
High-Level Shading Language - Wikipedia
 
So at first it seemed as if "Sampler Feedback" was a key component in DirectStorage, but that seems to have disappeared.

Wondering, though, whether Sampler Feedback will be a "highest performance tier" option...
 