Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

DirectStorage for Windows is going to be introduced at Game Stack Live (April 20-21, 2021).
https://developer.microsoft.com/en-us/games/events/game-stack-live/

System & Tools
DirectStorage for Windows


Microsoft is excited to bring DirectStorage, an API in the DirectX family originally designed for the Velocity Architecture, to Windows PCs! DirectStorage will bring best-in-class I/O tech to both PC and console, just as DirectX 12 Ultimate does with rendering tech. With a DirectStorage-capable PC and a DirectStorage-enabled game, you can look forward to vastly reduced load times and virtual worlds that are more expansive and detailed than ever. In this session, we will be discussing the details of how this technology will help you build your next-generation PC games.
 
Awesome, I was just thinking about this last night, wondering if it was going to turn out to be one of those much-talked-about technologies that never see the light of day. This is definitely a date for my diary.
Yes! I can't wait to find out more about this. Hopefully we get details on the requirements and how DirectStorage incorporates RTX I/O and AMD's support.
 
Well, a Microsoft engineer on the dev Discord said DirectStorage works in conjunction with Sampler Feedback Streaming.

So chances are you just need a DX12U GPU and a regular NVMe SSD. But I too wonder if it's really that easy... April can't come soon enough.
 

I'm wondering if it's going to require resizable BAR support, or something like that.
 
Microsoft probably just provides the API. From the API point of view it's likely easy, but it will require rewriting the engine to support streaming optimally. Behind the API is a layer of hardware that could require specific implementations in both the driver and the hardware to be optimal. Working isn't necessarily the same as optimal.

Worst case is engines like GTA V's, which uses a single thread for a very long loading period. It wouldn't magically become faster, as there is some insane CPU bottleneck that would need to be worked around. Similarly, other games/engines could do CPU-side processing of loaded data that would need to be changed for optimal performance.
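To illustrate the kind of rework being described (this is my own sketch, not anything from the post), here's the difference between a loader that does all per-asset CPU work on one thread and one that spreads the same work over a pool. All names and data here are made up:

```python
# Illustrative sketch: a single-threaded loader processes assets one by
# one, so CPU-side work (parsing, decompression) serializes even when
# the storage itself is fast. Issuing the same work over a thread pool
# is one way an engine would need restructuring to benefit.
from concurrent.futures import ThreadPoolExecutor
import zlib

def process_asset(raw: bytes) -> int:
    # Stand-in for the CPU-side work done after a read completes.
    return len(zlib.decompress(raw))

assets = [zlib.compress(bytes(i % 256 for i in range(10_000))) for _ in range(8)]

# Single-threaded: each asset waits for the previous one to finish.
serial = [process_asset(a) for a in assets]

# Thread pool: the same processing issued in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(process_asset, assets))

assert serial == parallel  # same results either way
```

The results are identical; only the wall-clock behavior differs, which is exactly why a fast SSD alone can't fix a single-threaded loading path.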
 
a Microsoft engineer from the dev discord said DirectStorage works in conjunction with Sampler Feedback Streaming.
They are two important parts of what Microsoft calls the Xbox Velocity Architecture (which also includes hardware LZ-family decompression).
On Xbox Series X (and DirectX 12 Ultimate GPUs), Sampler Feedback augments Tiled Resources to help determine which missing tiles and MIP levels are to be streamed into video memory, while DirectStorage would help improve loading times on NVMe disks.

https://devblogs.microsoft.com/dire...edback-some-useful-once-hidden-data-unlocked/

https://news.xbox.com/en-us/2020/07/14/a-closer-look-at-xbox-velocity-architecture/

https://devblogs.microsoft.com/directx/directx-12-ultimate-for-holiday-2020/
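The feedback-driven streaming loop described above can be sketched in a few lines (a deliberately simplified toy, with made-up tile IDs, not the actual D3D12 sampler feedback API): the GPU records which (tile, mip) pairs it actually sampled, and the streamer requests only the ones not already resident.

```python
# Toy model of sampler-feedback-driven tile streaming: compare what the
# GPU sampled against what is resident, and page in only the difference.
sampled = {(3, 0), (3, 1), (7, 2), (9, 2)}   # (tile_id, mip) pairs the GPU touched
resident = {(3, 1), (9, 2)}                  # already in video memory

to_stream = sorted(sampled - resident)       # missing tiles/mips to page in
print(to_stream)  # [(3, 0), (7, 2)]
```

The point of the design is that `to_stream` is usually a small set, so the I/O layer (DirectStorage) only ever loads what the renderer demonstrably needs.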
 
I looked through the pre-recorded video and the session slides, and it seems like the design is still in a preliminary stage, though this is expected given the ambitious goals.


First of all, DirectStorage is indeed a user-mode layer on top of the existing I/O stack, primarily designed to issue multiple parallel I/O requests. To make it work, they are redesigning the actual Windows I/O Manager subsystem described above to handle batch processing of I/O packets - so, for example, you would schedule loading (paging) of a thousand new MIP textures (as hinted by Tiled Resources/Sampler Feedback shaders) and only track the status of the entire I/O batch, not each individual I/O packet.
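The batched-request model can be sketched like this (names and the thread-pool mechanism are mine; the real API presumably batches at the kernel level rather than with user threads): many reads are submitted at once and only the batch as a whole is awaited, not each request.

```python
# Sketch of batch-tracked I/O: submit many region reads at once and
# wait on the whole batch, never polling individual requests.
from concurrent.futures import ThreadPoolExecutor, wait
import os, tempfile

def read_region(path, offset, length):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Fake asset file standing in for a packaged texture archive.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(bytes(range(256)) * 64)
tmp.close()

requests = [(tmp.name, off, 64) for off in range(0, 1024, 64)]  # one "batch"

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(read_region, *r) for r in requests]
    done, _ = wait(futures)          # track the whole batch, not each packet

results = [f.result() for f in futures]
assert all(len(r) == 64 for r in results)
os.unlink(tmp.name)
```

The design win is bookkeeping: one completion event for a thousand texture tiles instead of a thousand IRP round-trips through the I/O manager.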

It wasn't made clear whether this involves a redesign of the StorPort/StorNVMe driver or the use of block size / alignment / granularity hints from NVMe 1.3.

He did mention they can bypass the filesystem driver and volume manager, but only in passing. I guess this would introduce another 'fast path' which could involve continuous pre-allocation of clusters/sectors on file write operations, similar to what CompactOS file compression does, to enable reading back with large contiguous I/O batches and without the need to track complex LBA sector chains in the filesystem driver and Cache Manager.
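My guess about why contiguous pre-allocation would matter can be shown with a toy extent model (the `coalesce` helper and the sector numbers are hypothetical, purely for illustration): adjacent extents collapse into one large request, while a fragmented layout forces one request per run of the LBA chain.

```python
def coalesce(extents):
    """Merge adjacent (start_sector, length) extents into maximal runs."""
    runs = []
    for start, length in sorted(extents):
        if runs and runs[-1][0] + runs[-1][1] == start:
            runs[-1][1] += length          # extends the previous run
        else:
            runs.append([start, length])   # a new discontiguous run
    return [tuple(r) for r in runs]

# A pre-allocated file: clusters handed out back-to-back, so the whole
# read collapses into a single large I/O.
assert coalesce([(100, 8), (108, 8), (116, 8)]) == [(100, 24)]

# A fragmented file still needs one request per run.
assert coalesce([(100, 8), (300, 8), (108, 8)]) == [(100, 16), (300, 8)]
```

Fewer runs means fewer requests and no cluster-chain walking at read time, which fits the CompactOS comparison above.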


Second, the data from the NVMe drive is initially loaded into system memory, then moved to GPU video memory (though without any CPU processing). So it's not using peer-to-peer DMA this time, and no specific requirements for the NVMe/PCIe drive were raised.

Texture compression is assumed to be a DEFLATE (i.e. LZ77/LZSS+Huffman, the ZIP format) pass over standard BCn (i.e. S3TC/DXTC). DEFLATE decompression is performed in GPU memory by either compute shaders or dedicated hardware blocks. They are still working on the compressor/decompressor toolset, so I guess this will be different from/improved over the Xbox BCPACK toolset. They said it can keep up with typical NVMe bandwidth, which didn't sound like multi-GB/s performance to me.
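The two-layer scheme is easy to demonstrate with Python's zlib, which produces the same raw LZ77+Huffman stream format (this is only a CPU-side sketch of the data layout; the BCn stage is stubbed out with dummy bytes, and the actual GPU decompressor is of course not zlib):

```python
# Minimal sketch of DEFLATE-over-BCn: the texture is first block-
# compressed (BCn, lossy, fixed-rate -- stubbed out here), then a
# lossless DEFLATE pass shrinks it further for on-disk storage.
import zlib

bcn_blocks = bytes(range(256)) * 16   # stand-in for BCn-encoded texel blocks

# Raw DEFLATE (LZ77 + Huffman); wbits=-15 drops the zlib wrapper.
c = zlib.compressobj(level=9, wbits=-15)
packed = c.compress(bcn_blocks) + c.flush()   # what would live on disk
unpacked = zlib.decompress(packed, wbits=-15) # what the load-time pass undoes

assert unpacked == bcn_blocks                 # the DEFLATE layer is lossless
assert len(packed) < len(bcn_blocks)          # and it actually saves space
```

Because the DEFLATE layer is lossless, the GPU gets back bit-identical BCn data it can sample directly; only this outer layer needs a fast decompressor.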

When specifically asked in the chat about the Nvidia RTX I/O slides (where the data flows directly from the NVMe drive to onboard GPU memory), the presenter said that peer-to-peer transfers could be implemented in future releases, but referred to Nvidia for RTX I/O details.


Developer preview is expected this Summer, so the tentative release date would be end of 2021.
 

So the most important confirmation from this is that GPU-based decompression is a standard feature of DirectStorage, which essentially solves the CPU decompression bottleneck and will be available on all DX12U-class GPUs (at least). Hopefully this puts to rest any lingering arguments that GPU-based decompression is just an Nvidia marketing ploy and not viable without hardware-based units.

Interesting also that there is no P2P transfer as part of the standard but not that surprising. So it seems as though that may well be a unique feature of RTX-IO over and above the standard DirectStorage. Although I'm not sure how much benefit that brings given the decompression element is already covered by DirectStorage.
 
They said it can keep up with typical NVMe bandwidth, which didn't sound like multi-GB/s performance to me.

Just wanted to add, now that I've had a chance to watch the full video: he specifically says (23:30) that their GPU-based decompressor easily saturates gaming NVMe SSD bandwidths, so I'd take that to mean it can handle the 7GB/s rate of current SSDs without too much trouble. That also aligns with Nvidia's claims for RTX I/O. He uses the specific example earlier in the video of this being used on a 2.5GB/s drive (for obvious reasons) at full rate.

I found the statements about eventually moving this decompression into dedicated hardware (on PC) particularly interesting though. Presumably this is on the roadmap for GPU vendors as a dedicated hardware unit in future GPUs.

Also to note, and closely linked to the above, is that this is using a new and specific compression/decompression solution, not just using the GPU to decompress existing compression formats. That means (1) developers will have to use this specific compression tech for their games, similar to how PS5 devs use Kraken and XSX devs use BCPACK, and (2) creating a dedicated hardware block for it in the future is more straightforward, as we can rely on all DirectStorage-based games using the same compression format.
 
Random thought: is it possible to use the hardware decoders/encoders meant for video codecs to serve a similar purpose? i.e., Nvidia RTX I/O routes data from system memory directly to the hardware decode/encode blocks (driven by NVDEC and CUDA, with NVENC in the reverse direction if necessary), then dumps it directly into VRAM? I suspect those hardware accelerators can handle a great many codecs, and I wonder if it's just a matter of firmware to support this.
 
Since he talks about sustaining 2.5 Gbyte/s earlier (at 13:40-15:15), it's safe to assume it's the same speed he talks about at a later point (23:30).

Then again, there were no specifics about algorithm used and compression ratio achieved, and lossless dictionary-based compression is known to resist attempts at parallel processing implementations, as discussed above. So it made a lot of sense to me when he said that Microsoft works on a 'GPU-friendly' compressor/decompressor, and called it 'a new class of compression tech' (at 22:10-23:30). I'd assume this new toolset would require some fine-tuning before it could saturate PCIe 4.0/5.0 bandwidth figures.
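One common way to make dictionary-based compression parallel-friendly, and my guess at the kind of rework 'GPU-friendly' implies, is to split the stream into fixed-size chunks compressed independently, trading some ratio (no cross-chunk history) for the ability to decompress every chunk concurrently. A CPU-side sketch with assumed chunk sizes:

```python
# Chunked compression: each chunk carries its own dictionary history,
# so all chunks can be decompressed in parallel (threads here; on a GPU
# this would be one workgroup per chunk).
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024
data = bytes(i % 251 for i in range(512 * 1024))

chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
compressed = [zlib.compress(c) for c in chunks]   # no cross-chunk references

with ThreadPoolExecutor() as pool:                # every chunk is independent
    decompressed = list(pool.map(zlib.decompress, compressed))

assert b"".join(decompressed) == data             # lossless reassembly
```

A single monolithic DEFLATE stream can't be split this way because back-references reach across the whole window, which is exactly the serial dependency discussed above.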
 
Since he talks about sustaining 2.5 Gbyte/s earlier (at 13:40-15:15), it's safe to assume it's the same speed he talks about at a later point (23:30).

I'm not sure I'd draw that conclusion. His specific wording is that "the GPU can support a very consistent, constant maxed IO rate. So let's say you have drive capable of say 2.5GB/s, your GPU is capable of maintaining that".

I'd assume he's just using the 2.5GB/s example because it's the speed of the XSX SSD which most of this work to date has been based around.

He later goes on to say "There's an initial prototype [of the GPU decompression algorithm] that we have, it easily saturates gaming SSD bandwidths"

That to me isn't setting a fairly low limit of 2.5GB/s, but rather saying that there are no SSDs out there that the algorithm can't keep up with. And that would align perfectly with Nvidia's own statement:

"GeForce RTX GPUs [I take this to mean anything from an RTX 2060 upwards] are capable of decompression performance beyond the limits of even Gen4 SSDs, offloading dozens of CPU cores’ worth of work to deliver maximum overall system performance for next generation games."

Here's the video in case anyone wants to view it:


The question now is how much GPU resource 2.5 GB/s data decompression takes.

If you're transferring at 2.5GB/s it probably doesn't matter, as you're likely at a load screen rather than doing background streaming, which likely wouldn't be transferring at anywhere near that speed (or else you'd have streamed the entire content of most games in about 20 seconds). Background streaming is likely to be just a small fraction of that transfer rate.
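A quick back-of-envelope check of that claim (both numbers below are my assumptions, not figures from the talk):

```python
# Rough estimate: even a generous steady-state streaming budget is a
# small fraction of the burst rate a load screen would use.
drive_rate_gb_s = 2.5      # assumed burst rate during a load screen
budget_mb_per_s = 200      # assumed steady background-streaming need

fraction = (budget_mb_per_s / 1000) / drive_rate_gb_s
print(f"background streaming uses {fraction:.0%} of the drive")  # 8%
```

So under those assumptions, decompression during gameplay would only be running at a small fraction of its load-screen throughput, with GPU cost presumably scaling down accordingly.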
 
The question now is how much GPU resource 2.5 GB/s data decompression takes.
Nvidia stated that with RTX I/O the performance hit was "negligible", and that was assuming full Gen 4 7GB/s saturation, likely due to what pbjliverpool said directly above me. Load screens are when the GPU would actually saturate the bus at full Gen3 and Gen4 speeds; the streaming requirements during gameplay would likely be far less. Even still, I don't think the performance impact is going to be very large. Games will be designed with it in mind, and will perform as well as they can.

Think about it this way... the performance hit to the GPU is guaranteed to be far less than it would be to the CPU :)

This will easily hold us over until dedicated decompression blocks can be added in hardware to the GPU in the future.
 
For the existing situation, one of the diagrams shows that the data goes NVMe drive => RAM => CPU => RAM => GPU.

I thought that data could go CPU => GPU directly now?
 
Why not NVMe => GPU?

Of course, it would be nice if it's possible to bypass main memory. But I guess there are just too many compatibility hurdles, and the performance gain is probably not big enough to be worth it.
Even resizable BAR, which has been in the standard for quite some time and is supposed to be a relatively simple feature, is still treated quite cautiously by the vendors.
 
Of course, it would be nice if it's possible to bypass main memory. But I guess there are just too many compatibility hurdles, and the performance gain is probably not big enough to be worth it.
Even resizable BAR, which has been in the standard for quite some time and is supposed to be a relatively simple feature, is still treated quite cautiously by the vendors.
Isn't this exactly what AMD did with the Radeon SSG? To my understanding, the GPU communicated directly with the pair of SSDs via a PCIe bridge chip, without the round trip through system memory.
Whether it would be doable as a "universal solution" that works with every vendor is of course another matter.
 