DirectStorage GPU Decompression, RTX IO, Smart Access Storage

It does? Completely missed that. So as long as your GPU is getting driver updates (at all) you should be good? Or is this tied to a specific Windows 10 minor release?
1909+ AFAIR. My general expectation is that it will work on any Win10 which is still being supported at least.
DS seems to be distributed similarly to the D3D12 Agility SDK? Meaning that the API DLLs will ship with titles using it and will proxy to newer versions if those are present in the OS.
 
Unless I'm entirely mistaken, AMD has already demonstrated that they can stream directly to BAR from NVMe, for a specific combination of their own chipset, CPU generation, GPU and qualified NVMe devices which work with their own (not the MS one) NVMe driver, but still with standard NTFS.

Not from standard NTFS, no. We've already discussed at length why this isn't true, but here's a cliff's notes refresher: the NTFS file system requires a very specific process to access any file, all of which invokes multiple calls to the CPU and isn't something the GPU can solve. Authorization (auth-z) and authentication (auth-n) are one such requirement; permissions checking of the file itself after handling authN+Z is another; tracking of last-access metadata, which can't be bypassed except at the entire-volume level, is a third; management of the NTFS file caching hierarchy is yet another. And every single part of that is a cakewalk compared to disentangling the file-level fragmentation that absolutely exists and must be drilled back into the underlying non-linear sectors of the physical storage.

No, this isn't solved.
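For concreteness, this is roughly what even the most minimal Win32 read path invokes on the CPU (a sketch only; the path is made up and error handling is trimmed). Each commented step maps to one of the items above:

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // CreateFileW is where authN/authZ and the file's own ACL check happen;
    // the returned handle encodes the granted access rights.
    HANDLE h = CreateFileW(L"D:\\assets\\level0.pak",   // hypothetical path
                           GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           // Opting out of the NTFS cache hierarchy is possible,
                           // but then offsets and sizes must be sector-aligned.
                           FILE_FLAG_NO_BUFFERING, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    alignas(4096) static char buf[16 * 4096];
    DWORD got = 0;
    // ReadFile is where last-access metadata updates and the translation from
    // file offset to (possibly fragmented) on-disk sectors happen -- all of it
    // CPU-side, inside the file system driver.
    if (!ReadFile(h, buf, sizeof(buf), &got, nullptr)) { CloseHandle(h); return 1; }

    printf("read %lu bytes\n", got);
    CloseHandle(h);
    return 0;
}
```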

It's not as tricky as people tend to make it for a vendor which has control over all the involved components. It just requires clean tracking of resources over multiple driver stacks. The file system overhead some are so afraid of is not that bad when you can just build lookup tables in RAM as a one-time cost per application. Even alignment issues and the like don't make much of a difference for typical asset sizes - just something the GPU driver needs to mask.

Turns out, there is already a lookup table in RAM -- that's how filesystems work! What doesn't work is putting that lookup table in GPU format, which in any case solves none of the problems I stated above.

I keep seeing people who are clueless about how filesystems, logical volumes, physical volumes and physical storage interfaces actually integrate and work together telling the rest of us "it isn't so hard as you think!" For those of us who literally work in this space, yes it is that hard. I'm sorry if you don't believe me, maybe you can explain it specifically instead of handwaving it off next time.

Going to system RAM does have a huge impact on the CPU after all. With PCIe 4.0 x4 on the storage side, that's still 16GB/s of memory bandwidth burnt: the ~8GB/s stream gets written into RAM and read back out again, and the memory bus is half-duplex, so both passes count. Accounting for some inefficiencies, that's almost one DDR4 memory channel's worth of bandwidth lost. One of these nasty details you won't see in a synthetic benchmark (due to the lack of easily accessible load statistics!), but which will bite you later on.
Requiring transfer to main system memory is the "backward compatible" method; it only has to happen if the GPU cannot pull storage assets physically off the NVMe drive into the dGPU memory pool directly. As it turns out, you don't need blah blah special chipsets and CPUs and anything else; you need industry-standard PCIe spec devices which permit the native instructions to perform the peer-to-peer transfer, and that's foundationally it. That's ultimately where all this D3DIO was supposed to shine: have the GPU load assets directly from the storage, bypassing the CPU, thus also bypassing main memory. I've extensively covered this topic several times in this very thread; it's "simple" if you ignore how the NVMe storage in every commodity PC on the planet isn't a dedicated resource specifically linked to the GPU.

When we find ourselves in a place where that storage asset is shared with a modern filesystem inside of a modern operating system, both of which place strict controls around access methods and storage abstraction layers, it is no longer "so simple as..."
 
It does? Completely missed that. So as long as your GPU is getting driver updates (at all) you should be good? Or is this tied to a specific Windows 10 minor release?

1909+ AFAIR. My general expectation is that it will work on any Win10 which is still being supported at least.
DS seems to be distributed similarly to the D3D12 Agility SDK? Meaning that the API DLLs will ship with titles using it and will proxy to newer versions if those are present in the OS.
Win10 has DirectStorage support, but the new I/O stack is only available on Win11.

edit:
With graphs:
Windows 10 has DirectStorage, GPU decompression etc, but it's still going through this I/O stack
[graph: the legacy Windows 10 I/O stack]

while Win11 gets to enjoy this when using DirectStorage
[graph: the new Windows 11 DirectStorage I/O stack]
 
Authorization (auth-z) and authentication (auth-n) are one such requirement; permissions checking of the file itself after handling authN+Z is another; tracking of last-access metadata, which can't be bypassed except at the entire-volume level, is a third; management of the NTFS file caching hierarchy is yet another. And every single part of that is a cakewalk compared to disentangling the file-level fragmentation that absolutely exists and must be drilled back into the underlying non-linear sectors of the physical storage.
I honestly still don't get why the end result of this isn't supposed to be cacheable? I mean the real mapping from a file handle's address space straight to the tuple of disk, sectors and offsets within?

Neither authorization nor authentication looks like it would need to be done repeatedly.
Metadata constraints should be possible to relax safely with an exclusive lock in place and without further side effects; just update on open and close.

Bypassing the NTFS cache hierarchy appears trivial after a flush + locking the file so it can't be changed or moved.
Resolving the file-level fragmentation in a single run for the entire file rather than on-demand doesn't exactly look impossible either, the only prerequisite being that the file record is frozen so that the regular invalidation can't happen.
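For what it's worth, Windows even exposes that mapping to user mode: FSCTL_GET_RETRIEVAL_POINTERS hands you the file's cluster runs. A rough sketch of snapshotting it once per file (hypothetical path; real code has to loop on ERROR_MORE_DATA, and still has to translate volume clusters to disk sectors, per the volume-stack caveat below):

```cpp
#include <windows.h>
#include <winioctl.h>
#include <vector>
#include <cstdio>

int main()
{
    // Hypothetical asset file; holding the handle keeps it from vanishing,
    // though a real implementation would also need to pin it against defrag.
    HANDLE h = CreateFileW(L"D:\\assets\\level0.pak", GENERIC_READ,
                           FILE_SHARE_READ, nullptr, OPEN_EXISTING, 0, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    STARTING_VCN_INPUT_BUFFER in{};   // start mapping at VCN 0
    std::vector<char> raw(64 * 1024);
    auto* out = reinterpret_cast<RETRIEVAL_POINTERS_BUFFER*>(raw.data());
    DWORD bytes = 0;

    // One-time cost per file: ask NTFS for the cluster runs backing it.
    if (DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS, &in, sizeof(in),
                        out, (DWORD)raw.size(), &bytes, nullptr))
    {
        LONGLONG vcn = out->StartingVcn.QuadPart;
        for (DWORD i = 0; i < out->ExtentCount; ++i) {
            // An LCN of -1 marks a sparse/compressed hole with no disk backing.
            printf("VCN %lld..%lld -> LCN %lld\n", vcn,
                   out->Extents[i].NextVcn.QuadPart - 1,
                   out->Extents[i].Lcn.QuadPart);
            vcn = out->Extents[i].NextVcn.QuadPart;
        }
    }
    CloseHandle(h);
    return 0;
}
```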

All the truly nasty parts are in the volume stack (and the PartMgr), which is being bypassed entirely. All under the prerequisite "you did not configure anything exotic". BitLocker counting as exotic in this case is somewhat ironic. RAID (other than RAID 1) etc. is impossible to support as well, naturally. It also effectively prevents any funny details in this stack from changing stuff on the fly.
Even though "bypassed" is maybe not the correct term - the aggregated offsets introduced by the skipped stack have to be fed back once to the file system stack in order to be applied there.

That leaves you with just a handful of NTFS features (compression, EFS) which may render the individual file incompatible with raw access. Except that with the filter manager bypassed, those have already been ruled out as being in use.

Now, what are we left with? The naked file system driver, with a file in a frozen state, and the additional addressing introduced by the entire lower stack temporarily applied straight to the cached mapping?

Sure, the GPU still can't directly request storage access. But it doesn't need to, does it? Unless I completely misunderstood, and the NVMe protocol does not allow the CPU to post a read request to the NVMe device pointing to the GPU's address space, due to some security feature I couldn't find documented?

Scheduling for the GPU is still (mostly) CPU driven, so why wouldn't you just issue the read from there? Everything expensive is already eliminated from the stack. A solitary syscall remains, but maybe not even that depending on where it's dispatched from. Best case scenario is now closer to <250 cycles in user space, and <1000 cycles in kernel space assuming hot caches (extrapolated from what a comparable, trimmed down stack achieves under Linux).
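And the command you'd be posting is tiny. Per the NVMe spec, a read is a 64-byte submission queue entry whose data pointer is just a bus address, so nothing in the command format itself cares whether that address is in system RAM or a GPU BAR. Conceptual sketch only, not a working driver (queue setup, PRP lists and DMA mapping all elided; the helper is hypothetical):

```cpp
#include <cstdint>

// 64-byte NVMe submission queue entry (common command format from the
// NVMe base spec). Conceptual layout; a real driver builds this in
// kernel space against a properly mapped queue.
struct NvmeSqe {
    uint8_t  opcode;      // 0x02 = Read (NVM command set)
    uint8_t  flags;       // fused ops / PRP vs SGL selection
    uint16_t cid;         // command identifier
    uint32_t nsid;        // namespace ID
    uint32_t cdw2, cdw3;  // reserved for Read
    uint64_t mptr;        // metadata pointer
    uint64_t prp1;        // data pointer: any bus address -- incl. a GPU BAR
    uint64_t prp2;        // second PRP entry / PRP list for larger transfers
    uint64_t slba;        // starting LBA
    uint16_t nlb;         // number of logical blocks, 0-based
    uint16_t control;
    uint32_t dsmgmt;
    uint32_t cdw14, cdw15;
};
static_assert(sizeof(NvmeSqe) == 64, "SQE must be 64 bytes");

// Hypothetical helper: read `nblocks` at `lba` straight into a GPU BAR
// bus address obtained from the GPU driver's pinning machinery.
NvmeSqe makeP2PRead(uint64_t lba, uint16_t nblocks, uint64_t gpuBusAddr)
{
    NvmeSqe sqe{};
    sqe.opcode = 0x02;           // NVM Read
    sqe.nsid   = 1;              // assumed namespace
    sqe.prp1   = gpuBusAddr;     // P2P target instead of system RAM
    sqe.slba   = lba;
    sqe.nlb    = nblocks - 1;    // field is 0-based
    return sqe;
}
```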

The expectation is not that you would get to eliminate the CPU from the render feedback -> read -> dispatch loop yet! Even from a security aspect that's just straight-out horrifying, and just begs for exploits to happen. The GPU speaking NVMe is a pipe dream for now. Maybe Nvidia has enough firmware space left to implement that for their embedded RISC-V core, but that still won't end up supporting shared access between GPU and CPU. As you correctly said: the stack on the OS side is quite fragile during regular operation.

You do need to get the synchronization of queued transfers to dependent dispatches on the GPU right though.
Unless I'm missing a feature, there are details to consider when you have a non-trivial PCIe topology, due to the store-and-forward behavior of the involved PCIe switches. The CPU might have been able to fetch the completion event from the NVMe device before the data has reached the GPU, and even get to do the actual dispatch with some messages still stalled.
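In D3D12 terms, the conservative workaround would be to not treat the NVMe completion alone as "data visible to the GPU". A hedged sketch: pollNvmeCompletion() and readbackLastWrittenByte() are hypothetical stand-ins for driver plumbing, the latter being a non-posted read through the same path that flushes the posted P2P writes ahead of it:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical stand-ins for plumbing user mode doesn't have:
bool pollNvmeCompletion();        // sees the NVMe CQ entry for our read
void readbackLastWrittenByte();   // non-posted read via the same path,
                                  // flushing posted P2P writes ahead of it

void submitOrderedDispatch(ID3D12Device* device, ID3D12CommandQueue* queue,
                           ID3D12CommandList* const* lists, UINT count)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // The queue stalls the dependent dispatch until the fence reaches 1.
    queue->Wait(fence.Get(), 1);
    queue->ExecuteCommandLists(count, lists);

    // The completion entry alone is NOT proof the data reached the GPU
    // (posted writes may still sit buffered in a switch), so force a
    // non-posted read on the same path before releasing the fence.
    while (!pollNvmeCompletion()) { /* or wait on the MSI-X interrupt */ }
    readbackLastWrittenByte();
    fence->Signal(1);   // CPU-side signal releases the dispatch
}
```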

PCIe switches where you don't know how they will behave in your system are a nightmare to work with when orchestrating a protocol with more than 2 peers. That's actually where I suspect "standard features" won't cut it, and why AMD requires you to use a fully known PCIe topology only, with guarantees outside the specification for all involved peers.

How did AMD solve that? My best guess would be that they foremost tweaked the spanning tree to always include the CPU's PCIe root complex (i.e. a direct route between GPU and NVMe is forbidden, even if they happen to hang off the same switch!), so that the CPU is guaranteed to observe an absolute order of events.

Requiring transfer to main system memory is the "backward compatible" method; it only has to happen if the GPU cannot pull storage assets physically off the NVMe drive into the dGPU memory pool directly.
There are several alternatives in between.
  • GPU speaking NVMe is the least likely to happen. You don't just pull; you have to schedule commands and implement a full protocol over PCIe. Also, NVMe isn't designed for multi-master access with a single completion queue.
  • In between you have the CPU issuing NVMe control commands, with the GPU's memory being the direct target of the read commands. Like I said above, bus topology can get you into trouble with the order of events. Also, as you've stated repeatedly, plenty of work had to go into making the storage stack efficient enough to make this even worthwhile.
  • That still leaves you with a bounce via main memory as the default route: a DMA write from NVMe to main memory followed by a DMA read initiated by the GPU from main memory (see the sketch below this list). Memory bandwidth successfully burned, but at least no CPU time wasted and no conceptual issues. Drivers/APIs required some tweaks to get there and be able to share a single buffer, in any case. Also still dependent on file system optimizations.
  • The truly "backwards" method of the CPU doing the transfer, which is where we had gotten to after GPUs made their memory host-visible. At this point file system performance didn't even matter that much any more... And yet this is still the route to go if the CPU needs to touch the data for any reason (SATA, CPU-based BitLocker, software RAID, file system compression etc.)
  • And lastly the ancient way of the GPU's driver stack forcibly allocating a staging buffer and requiring you to copy into it (while still performing a DMA read), with the double-buffering of a caching file system on top, giving you the worst of all worlds.
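The bounce route from the third bullet, sketched in D3D12 (assumptions: a single shared upload-heap buffer acts as both the storage stack's DMA target and the GPU copy engine's source; lifetime handling and the actual file read are elided):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch of the default "bounce" route. Returns the staging buffer;
// the caller must keep it alive until the copy has executed on the GPU.
ComPtr<ID3D12Resource> bounceUpload(ID3D12Device* device,
                                    ID3D12GraphicsCommandList* cmdList,
                                    ID3D12Resource* vramBuffer, UINT64 size)
{
    D3D12_HEAP_PROPERTIES heap{ D3D12_HEAP_TYPE_UPLOAD };
    D3D12_RESOURCE_DESC desc{};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = size;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    ComPtr<ID3D12Resource> staging;
    device->CreateCommittedResource(&heap, D3D12_HEAP_FLAG_NONE, &desc,
                                    D3D12_RESOURCE_STATE_GENERIC_READ,
                                    nullptr, IID_PPV_ARGS(&staging));

    void* cpuPtr = nullptr;
    staging->Map(0, nullptr, &cpuPtr);
    // 1) NVMe -> main memory: an (unbuffered) file read into cpuPtr lands
    //    here via DMA. No CPU copy, but `size` bytes of RAM write bandwidth.
    //    (actual file read elided)

    // 2) Main memory -> VRAM: GPU-initiated DMA read; another `size` bytes
    //    of RAM read bandwidth. This is the double-touch tax from above.
    cmdList->CopyBufferRegion(vramBuffer, 0, staging.Get(), 0, size);
    return staging;
}
```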
 

Great to see this supported in Vulkan too.

Also, the following paragraph confirms RTX-IO has absolutely nothing to do with P2P transfers from SSD direct to GPU completely bypassing system memory (as AMD's Smart Access Storage is claimed to do), so it looks like those earlier RTX-IO diagrams were indeed quite misleading:

"NVIDIA RTX™ IO lets compressed data be delivered to GPU memory with minimal staging in system memory. The GPU is utilized for decompression using GDeflate at high throughput. The CPU is freed to perform other auxiliary tasks."
 
I really can't wait for this stuff to just become standard in all games.

Anything that can help potentially alleviate CPU bottlenecks is desperately needed.

Also, I'm really interested in AMD's Smart Access Storage and just exactly how they are bypassing system RAM for GPU-destined assets. If that ends up being true, I honestly can't see Nvidia not being able to do the same, working with Intel and AMD.
 
If that ends up being true, I honestly can't see Nvidia not being able to do the same, working with Intel and AMD.
It should.

If I've pieced it together correctly:
  • There are a total of 3 critical transactions involved where ordering constraints are strict:
    • P2P write from NVMe to GPU
    • Posted Completion message from NVMe to CPU
    • Fence release from CPU to GPU.
  • You don't want to enforce ordering for the NVMe or the GPU though (a stall on the GPU should not stall the NVMe writing to system RAM), so this has got to be using the IDO feature, with a coherent ID for all 3 pathways.
  • The GPU has one or more dedicated queues on the NVMe, and the NVMe is using IDO with a distinct ID per queue.
That last point appears to be where older PCIe NVMe controllers may actually not be up to the task; it's an optional implementation detail, PCIe 4.0 or not.
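For reference, IDO is a per-device knob in the Device Control 2 register of the PCIe capability (offset 0x28, bits 8/9 per the PCIe base spec). A sketch of checking it on Linux by walking the capability list through sysfs (needs root to read past the first 64 config bytes; `setpci -s <bdf> CAP_EXP+0x28.w` does the same from the shell; little-endian host assumed):

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

constexpr uint16_t DEVCTL2_IDO_REQ_EN = 1u << 8;  // IDO Request Enable
constexpr uint16_t DEVCTL2_IDO_CPL_EN = 1u << 9;  // IDO Completion Enable

// Returns true if either IDO bit is set on the device at the given
// bus/device/function address, e.g. "0000:01:00.0".
bool idoEnabled(const char* bdf)
{
    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/config", bdf);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    uint8_t ptr = 0;
    pread(fd, &ptr, 1, 0x34);            // head of the capability list
    uint8_t hdr[2] = {};
    while (ptr) {
        pread(fd, hdr, 2, ptr);          // [capability ID, next pointer]
        if (hdr[0] == 0x10) break;       // 0x10 = PCI Express capability
        ptr = hdr[1];
    }

    uint16_t devctl2 = 0;                // Device Control 2: cap base + 0x28
    if (ptr) pread(fd, &devctl2, 2, ptr + 0x28);
    close(fd);

    printf("%s DevCtl2=0x%04x IDO req=%d cpl=%d\n", bdf, devctl2,
           !!(devctl2 & DEVCTL2_IDO_REQ_EN), !!(devctl2 & DEVCTL2_IDO_CPL_EN));
    return devctl2 & (DEVCTL2_IDO_REQ_EN | DEVCTL2_IDO_CPL_EN);
}

int main(int argc, char** argv)
{
    return argc > 1 ? !idoEnabled(argv[1]) : 2;
}
```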

Not too much black magic involved though, other than the need for the CPU to signal the GPU on top of handling the raw NVMe completion event interrupt, and doing so with awareness of the ordering constraints.

Definitely nothing which shouldn't work on any Intel CPU at least since Skylake, or on any Zen1 or newer AMD CPU.

Not all motherboards may be entirely IDO capable though, which would render them prone to deadlocks with P2P in this constellation.
 
I almost missed that article

[graph: DirectStorage loading-time benchmark from the article]

 
First game with DS is out and the results aren't that impressive - the game handles loading fine even without DS.
This could change if a game were to use GPU decompression, probably, but then we are already looking at ~1 sec loading times without it.
Streaming still seems like a much better application for the DS API.


Interestingly, using DS improves loading time considerably on a SATA SSD, suggesting that whatever the game does to make use of DS is actually beneficial even without a DS-compatible medium?
 
First game with DS is out and the results aren't that impressive - the game handles loading fine even without DS.
This could change if a game were to use GPU decompression, probably, but then we are already looking at ~1 sec loading times without it.
Streaming still seems like a much better application for the DS API.


Interestingly, using DS improves loading time considerably on a SATA SSD, suggesting that whatever the game does to make use of DS is actually beneficial even without a DS-compatible medium?

SATA SSDs are supported by DS AFAIK. You don't get all the NVMe priority queues & co, but I guess you get the new "system", GPU decompression, etc.
 
SATA SSDs are supported by DS AFAIK. You don't get all the NVMe priority queues & co, but I guess you get the new "system", GPU decompression, etc.
You are correct in the sense that you can use the DS API to read data from any storage source, really.
The only advantage NVMe has here is the support for BypassIO, which is NVMe exclusive.
The point though is that it was assumed you'd need a (fast) NVMe device to see the benefits of DS, and from the tests above it can be theorized that even SATA SSDs can benefit. The question is whether this is because the DS API is faster at reading data off a SATA SSD, or because the game uses a different code path for data reads when using DS, and that latter code path is just more efficient even without DS.
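The source-agnostic part is visible in the API shape itself. A minimal sketch along the lines of Microsoft's public HelloDirectStorage sample (error handling elided; the path, sizes and the GDeflate compression are assumptions for illustration):

```cpp
#include <dstorage.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void loadAsset(ID3D12Device* device, ID3D12Resource* destBuffer,
               UINT32 sizeOnDisk, UINT32 uncompressedSize)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // Works for any file the OS can read -- NVMe, SATA, even HDD; only the
    // BypassIO fast path underneath is NVMe-specific.
    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"D:\\assets\\level0.pak", IID_PPV_ARGS(&file));

    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source        = file.Get();
    request.Source.File.Offset        = 0;
    request.Source.File.Size          = sizeOnDisk;
    request.UncompressedSize          = uncompressedSize;
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = uncompressedSize;
    queue->EnqueueRequest(&request);

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    queue->EnqueueSignal(fence.Get(), 1);   // fires once the data is in place
    queue->Submit();
}
```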
 
I just hope that Forspoken doesn't represent the typical DirectStorage implementation we're going to be getting over the next few years.

On another note, Metro 2033 (the original release) loads in like 1-2 seconds on an NVMe drive :eek:
 