I'm sitting here watching the video. First, he keeps conflating "block size" with "request size" starting at 6:25. The axis is mislabeled; he actually says it's request size and not block size, but then waffles back and forth calling it the wrong thing. Here's why it matters: your storage block size is either 512 bytes or 4096 bytes -- there aren't any other options. (No, we're not talking about filesystem cluster size for NTFS / exFAT / XFS / ext.)
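To make the block-vs-request distinction concrete, here's a quick sketch (Python; `aligned_span` is a made-up helper for illustration, not anything from the talk) of how an arbitrary request gets rounded out to whole storage blocks -- the device only ever moves full blocks, no matter what size you ask for:

```python
BLOCK = 4096  # physical block size: 512 or 4096 bytes, nothing else

def aligned_span(offset: int, length: int, block: int = BLOCK) -> tuple[int, int]:
    """Round an arbitrary request down/up to block boundaries --
    this is what the storage stack actually transfers."""
    start = (offset // block) * block              # round down to a boundary
    end = -(-(offset + length) // block) * block   # round up (ceiling division)
    return start, end - start

# A 10 KiB request starting 1 KiB into a file touches three full 4 KiB blocks:
print(aligned_span(1024, 10 * 1024))  # -> (0, 12288)
```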
This distinction of request size is important, because what we're really talking about here is how we "glob" assets together into fewer, larger, sequential reads. (He even mentions this if you listen to him, but it's not written down correctly in several successive slides.)
What he goes on to describe over the next four minutes is how to "pack" files in a way that lets us consolidate a zillion serially-linked small reads into far fewer big, fat reads.
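That packing idea is easy to sketch. Below is a minimal Python stand-in (the `pack` helper and the asset names are invented for illustration -- this is the shape of the technique, not the actual DirectStorage on-disk format):

```python
import io

def pack(assets: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate many small assets into one blob, recording (offset, length)
    for each -- so a whole pile of files becomes one big sequential read."""
    blob, table, off = io.BytesIO(), {}, 0
    for name, data in assets.items():
        table[name] = (off, len(data))
        blob.write(data)
        off += len(data)
    return blob.getvalue(), table

# Three tiny reads become one fat read; slicing recovers each asset.
assets = {"grass.dds": b"G" * 100, "rock.dds": b"R" * 200, "tree.dds": b"T" * 50}
blob, table = pack(assets)
off, length = table["rock.dds"]
assert blob[off:off + length] == b"R" * 200
```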
Starting around the 13 minute mark we see the new way to manage VRAM reservations: essentially telling the video memory pool to be ready to receive the aforementioned big fat read, which may (a) still be compressed and (b) actually be multiple resources (apparently referred to as subresources now, per the slide at the 14 minute mark). Contrast that with today's method: a stack of singular resources, all individually uploaded in a serial, end-to-end fashion.
At 14:14 we hear him describe loading resources from disk as device independent, but he doesn't say it's a block-level read. This harks back to some spirited conversations earlier in this thread: we're simply reading files off a filesystem, which means we're still not moving blocks straight from the storage device into VRAM.
Aha, and at 15:50, there it is -- we're still copying from disk with an API call (hello, NTFS filesystem!) into RAM (but with far fewer RAM-to-RAM copies, so that's a solid win) and then into the GPU.
At 16:35 we start talking about the new file IO methods in Windows 11. We'll take them in order:
BypassIO is something I hypothesized quite a while ago in our discussion thread; here's the API spec:
https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/bypassio Basically, what it does is take an NTFS file handle (yup, it has to start at NTFS), walk the entire volume driver stack (file to filesystem logical blocks, filesystem logical blocks to a volume map, volume map to a disk sector map), and then let NTFS lock those physical sector details and access them by communicating straight to the StorNVME driver.
Apparently BypassIO only works on the StorNVME driver right now, and only with NTFS (which means filesystem overhead like atime stamps and ACL checks still applies). If you read the API spec, a BypassIO handle is also forced back down the traditional IO stack if the file is filesystem-compressed or filesystem-encrypted, if a defrag touches the file in any way after the handle is obtained, or in a multiuser environment if someone else on the system gets a handle to the same file. The API request is basically an alternate way to get a file handle, and if it fails the BypassIO check you still get a handle -- access simply follows the standard disk IO path.
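That "try the fast path, degrade gracefully to the normal stack" behavior is worth sketching. To be clear, this is an analogy only -- Linux's O_DIRECT standing in for BypassIO, since both bypass part of the kernel IO path and both have to fall back when the filesystem says no:

```python
import os
import tempfile

def open_fast_path(path: str):
    """Try a cache-bypassing open (Linux O_DIRECT here, standing in for the
    BypassIO idea -- an analogy, NOT the Windows API). If the filesystem
    refuses, fall back to the normal IO path, just like a failed BypassIO
    check degrades to the standard disk IO stack."""
    if hasattr(os, "O_DIRECT"):
        try:
            return os.open(path, os.O_RDONLY | os.O_DIRECT), True
        except OSError:  # e.g. the filesystem doesn't support it (tmpfs, etc.)
            pass
    return os.open(path, os.O_RDONLY), False

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"asset bytes")
    path = f.name

fd, bypassed = open_fast_path(path)  # bypassed tells us which path we got
os.close(fd)  # NOTE: real O_DIRECT reads also require block-aligned buffers
os.unlink(path)
```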
The second Windows 11 IO method he touches upon is IORing (
https://learn.microsoft.com/en-us/windows/win32/api/ioringapi/) which allows an application to queue up a pile of file handles into a buffer, and then receive an event notification when everything in the queue has been read into memory. There's a great writeup of how Windows IORing compares to Linux's similar io_uring here:
https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations/ Of course, since the IORing function works on handles, you just feed it the stack of BypassIO handles you created above... By the way, if you read the MS IORing API spec, it looks like this is fully supported in later Win10 builds.
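The submit-a-pile, get-one-completion shape is easy to mock up without the real ioringapi calls. Here's a hedged Python stand-in using a thread pool (the asset files and sizes are invented; the real thing submits operations to a kernel ring, not threads):

```python
import concurrent.futures as cf
import os
import tempfile

# A few temp files standing in for game assets (names/sizes invented).
paths = []
for i in range(4):
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes([i]) * 1000)
        paths.append(f.name)

def read_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()

# Submit the whole pile at once, then wait on ONE "batch complete" event --
# the shape of the IORing model, not the real ioringapi calls.
with cf.ThreadPoolExecutor() as pool:
    futures = [pool.submit(read_file, p) for p in paths]
    cf.wait(futures)                      # single completion notification
    results = [f.result() for f in futures]

assert results[2] == bytes([2]) * 1000    # every read landed in memory
for p in paths:
    os.unlink(p)
```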
At 19:20 he mentions Cancellation, or the ability to kill off a pending read request in order to slot in a more important one. It looks like this is an IORing function:
https://learn.microsoft.com/en-us/windows/win32/api/ioringapi/nf-ioringapi-closeioring which lets you kill the entire ring; there's also a way to cancel just specific operations in the ring (BuildIoRingCancelRequest) without killing the entire queue. Either way, it's nice to be able to cancel a pending IO if it turns out to be unnecessary.
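A sketch of the cancellation idea -- again Python futures standing in for real IORing cancellation, with invented asset names. The point is that a still-queued read can be dropped to make room for a more important one:

```python
import concurrent.futures as cf
import time

def slow_read(name: str) -> str:
    time.sleep(0.2)  # stand-in for a pending disk read
    return name

# One worker thread, so later submissions sit queued -- and a queued
# request can still be cancelled, like killing one entry in the ring.
with cf.ThreadPoolExecutor(max_workers=1) as pool:
    pending = {name: pool.submit(slow_read, name)
               for name in ("hero_tex", "distant_lod", "skybox")}
    # A more important request shows up; cancel a queued read to make room.
    cancelled = pending["skybox"].cancel()  # succeeds only while still queued
    urgent = pool.submit(slow_read, "boss_tex")
    print("skybox cancelled:", cancelled)
    print("urgent result:", urgent.result())
```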
Now at 25:50 he shows that textures and assets need to be precomputed differently to be "DirectStorage compatible." Which is fine, but it also means getting the biggest bang from DirectStorage isn't as simple as a "backport" of an existing asset.
In the end, here are a few things that pop into my mind:
- I wager the CPU isn't busy at all in the non-DirectStorage use cases. Serialized IO (read and wait until complete, read and wait until complete) isn't CPU intensive at all, but it's still stupidly slow even on a fast disk interface. All things considered, the CPU is very likely MUCH busier with DirectStorage enabled, mostly because it can actually be getting work done instead of waiting on an IO completion polling cycle.
- Go back to timestamp 15:50 and look at the memory copies again. Did you catch it? If you think your VRAM-hungry games are a problem today, wait until your GPU is responsible for all the decompression stages. Before GPU decompression, the GPU only receives the end result, and in theory only the portions of the end result that need to be uploaded (e.g. a driver may upload only certain mip levels versus the entire texture dependency chain as linked subresources.)
- The NTFS filesystem is still here; this isn't block-level transfer from disk straight to the GPU. I feel like a broken record: direct block transfers to VRAM were never going to happen in this era of security. Even BypassIO starts with a full file-level-to-block-level traversal of the entire Windows disk IO stack.
- The real magic here probably isn't BypassIO, it's the IORing method for asynchronous batching of files. And what's interesting to me is, any developer could've written something quite similar, and in fact quite a few have (see also: basically any high-performance relational database.) It's nice that Windows has finally provided the easy button for appdevs who didn't want to write their own, and did one better by cutting out a memory copy step, since the kernel executes the reads directly into the application's buffers.
That's my read on the situation. Not a lot of surprises IMO, and I'm glad to see it coming to fruition. Games will be better for it, and so too will a ton of other apps that elect to use it for non-gaming workloads (e.g. imagine Blender using IORing to pull in the shitload of assets linked to your scene...)
Edit on Monday morning: Had several typos bothering me, and I also simplified some language. My stream of consciousness was kinda hard to read in a few places, since I was pausing the video to type.