DirectStorage GPU Decompression, RTX IO, Smart Access Storage

They do make it sound like they're bypassing the CPU and system memory completely (which DirectStorage alone still has to traverse), which suggests they're taking advantage of P2P DMA direct from SSD to GPU, something AMD platforms are certainly capable of at a hardware level.

Correct; at 23:28 it's even confirmed that this DMAs straight into the Resizable BAR. So this is 100% standardized PCIe 3.0 protocol functionality.
The SSD certification they're doing may be to ensure the SSDs have the requisite DMA capabilities.
No, they all do, no exceptions for NVMe. But what's critical is that the NVMe driver pushing via PCIe effectively forces the CPU to back off until the transfer has completed.

So it's actually a strict performance requirement, in order to ensure that granting the NVMe access isn't going to stall command access from the CPU.

Stalling access from the CPU is a nightmare when it comes to GPUs. Just trust me on that one, you absolutely do not want that to happen, ever.
If true, that does support the possibility that RTX-IO (if it actually exists) is doing something similar.
Actually less so, because the magic isn't happening on the GPU side. The new magic is happening on the chipset and, primarily, the NVMe driver side, as an un-cached read into memory-mapped DMA needs to be intercepted and redirected appropriately.

NVidia of course can do this stuff. Multi-GPU with CUDA is all P2P DMA (if you don't have NVLink on your platform), and has been for a decade. Except the worst thing you could do was an A->B->C->A ring shift of buffers, as you would end up with crazy perf drops from collisions on the PCIe switch.
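For illustration, this is roughly what that decade-old path looks like from the host side. A minimal sketch using the public CUDA runtime API; the device indices and buffer size are arbitrary:

```cpp
// Peer-to-peer copy between two GPUs over PCIe, no system RAM staging.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, /*device*/ 0, /*peerDevice*/ 1);
    if (!canAccess) { std::printf("no P2P path between GPU 0 and GPU 1\n"); return 1; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // map GPU 1's memory into GPU 0's address space

    const size_t bytes = 64 << 20;      // 64 MiB test buffer
    void *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);            // allocated on GPU 0 (current device)
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);            // allocated on GPU 1

    // The copy is DMA'd across the PCIe switch; the A->B->C->A ring pattern
    // mentioned above is exactly this call issued in a cycle across 3+ GPUs.
    cudaMemcpyPeer(dst, /*dstDevice*/ 1, src, /*srcDevice*/ 0, bytes);
    cudaDeviceSynchronize();
    return 0;
}
```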

AFAIK PCIe switches are almost always a cut-through design, not store-and-forward. Good for cost, good for latency, bad for collision handling. Just thinking about it, a major requirement on the chipset is probably switching to store-and-forward operation mode dynamically, in order to not lose bus utilization from excessive back-offs.

You might - by some weird chance - see NVidia GPUs simply working on SAS-enabled mainboards. Probably all it requires is for NVidia to actually just not actively try and mess up.
 
So again, doing DMA transfers from a block storage device to a GPU makes a lot of big assumptions about how the storage is represented and ordered. We're talking about how partition geometry is laid out, how a filesystem manages (non?)-contiguous blocks of data, how security descriptors are organized and applied, how auditing is handled, just to name a small handful. None of this is so trivial as "oh my GPU is gonna pick up this file from an NTFS or XFS filesystem" because all that partition, filesystem, security and auditing abstraction is done at a CPU level, performed by kernel modules that aren't GPU code.

I'm still strongly suspecting we'll end up with a new block-contiguous partition at a minimum, or perhaps a whole separate block storage device (read: another NVMe drive) specific to graphics needs.
 

A trusted module (likely some part of the kernel) on the CPU can tell the GPU which physical blocks (and even which part of a block, if the filesystem supports that) to read, so the filesystem is not a problem.
If you want the GPU to be able to write to the SSD then of course that's a completely different problem, but IMHO it's currently not a high priority in gaming applications.
 
See, that's the misunderstanding that pervades this conversation. Filesystems are not block devices; they're an abstraction (albeit a very useful one) on top of a block device. You can of course "emulate" a block device by using a file as a target, however that file has no specific guarantees of being stored in the underlying block device as contiguous addressable blocks. For the efficiency gains we want to get, we need a way to map a contiguous region of blocks in the underlying block storage device to a memory map.

Let's think through your counterexample: we can absolutely "solve" the problem by asking the CPU if we have access to the file, and if the read itself needs to be audited (even reads incur writes on a standard file system -- you're aware of "Last Access Time", right?) We then need the CPU to query the filesystem allocation table for where the file is located within the logical volume, and then we need to walk the logical volume -> logical partition -> storage target software stack for each block that stores the file (because there is no guarantee the blocks are contiguous.) We can then issue I/O requests for each pseudo-randomly placed block, or if we only need part of the file, the relevant blocks.

Yeah, so what we just did there is recreate literally the same file I/O stack we always had. Hundreds or even thousands of I/O requests to the source block target because a filesystem somehow gets involved.

For this to actually work, the storage really should be free of the limits that filesystems necessarily impose. Contiguous block allocation for objects, free of object-specific security requirements, free of atime audit requirements, free of locking semantics for multiuser access, devoid of memory caching (since ostensibly we'd be managing that directly with the application.) Inarguably, most of what this direct I/O technology is trying to accomplish is completely disconnected from what a "typical" file system tries to provide.
 
You don't need a file to be contiguous, just get the list of blocks (which could be a lot for a large file, but you can't actually transfer a lot of data in one go anyway). You can also lock the file exclusively while the file is being used, no need for multiuser or even multi-process access.
 
Think about what you've described: every single piece you mentioned is wrapped in the overhead of a traditional file system.

The list of blocks? You take a file descriptor, map it to a file allocation table, map it to a logical volume, map it to (one or more) logical partition(s), map those to (one or more) logical block storage targets. You still have to write an atime record even if you're only reading, you still have to ask for the file lock even if it's not multi-user, you still have to check permissions of the file -- all of these things are 100% CPU driven and get 100% in the way of a GPU doing a nice, light DMA transfer. To reiterate: you've described nothing different than the entire file system I/O stack. The only potential change is the lack of copy-to-main-memory, which arguably gains you something but it isn't the win we all want and expect.

Great, a GPU DMA transfer from storage -- after spending about 12,000 cycles in CPU and main memory futzing around with filesystem semantics. That's the whole point: if you're ever going to stop doing that inane shit, you have to stop using a (traditional, current-form) filesystem to do it.
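
To make that concrete: even the "just hand me the block list" step is a kernel service. Here's a minimal sketch of what enumerating a file's physical extents looks like on Windows today, via the real FSCTL_GET_RETRIEVAL_POINTERS ioctl (the path and buffer size are arbitrary placeholders):

```cpp
#include <windows.h>
#include <winioctl.h>
#include <cstdio>
#include <vector>

int main() {
    HANDLE h = CreateFileW(L"C:\\games\\assets.pak", GENERIC_READ,
                           FILE_SHARE_READ, nullptr, OPEN_EXISTING, 0, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    STARTING_VCN_INPUT_BUFFER in = {};     // start the extent walk at VCN 0
    std::vector<BYTE> buf(64 * 1024);
    DWORD bytes = 0;
    if (DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS,
                        &in, sizeof(in), buf.data(), (DWORD)buf.size(),
                        &bytes, nullptr)) {
        auto* rp = reinterpret_cast<RETRIEVAL_POINTERS_BUFFER*>(buf.data());
        LONGLONG vcn = rp->StartingVcn.QuadPart;
        for (DWORD i = 0; i < rp->ExtentCount; ++i) {
            // Lcn is the physical cluster on the volume; -1 marks a hole
            // (sparse/compressed), i.e. no physical blocks at all.
            std::printf("extent %lu: VCN %lld -> LCN %lld, %lld clusters\n",
                        (unsigned long)i, vcn, rp->Extents[i].Lcn.QuadPart,
                        rp->Extents[i].NextVcn.QuadPart - vcn);
            vcn = rp->Extents[i].NextVcn.QuadPart;
        }
    }
    CloseHandle(h);
}
```

Note that every line of this runs on the CPU, through kernel transitions, before a single byte of payload moves anywhere.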
 

Considering the current transfer speeds of SSDs, 12,000 CPU cycles are probably cheap.
From my understanding, the foremost gain from DirectStorage is likely the GPU's ability to decompress data directly into video memory. The bypass of copy-to-main-memory is a win but not the main one (it's probably not going to be in the first iteration either).
It's also possible (not sure about the first iteration either) that once the list of blocks is established, the GPU can stream data on demand from the storage instead of getting everything in one go. This allows better video memory management.
A dedicated partition with contiguous files might be a win, but the cost is probably too high compared to the benefits (unless of course you are on a console, in which case you probably can do that). But even then you'll need some kind of filesystem, as people tend to have multiple games installed at the same time.
 
The list of blocks? You take a file descriptor, map it to a file allocation table, map it to a logical volume, map it to (one or more) logical partition(s), map those to (one or more) logical block storage targets.
Just have a read please, okay? https://docs.microsoft.com/en-us/wi...i/nf-ioringapi-buildioringregisterfilehandles

The classic 70s-style file descriptors have outlived their purpose, for good. Together with the idea of having to play ping-pong with the kernel for scheduling a simple read on an already opened file. The overhead you are describing (and the entire access control on top of that) is now mostly gone, or only paid once per *file*, not per *access*.
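For the curious, a minimal sketch of that "pay once per file" model using the ioring API from that link (Windows 11's ioringapi.h; the file name and sizes are placeholders):

```cpp
#include <windows.h>
#include <ioringapi.h>

int main() {
    HIORING ring = nullptr;
    IORING_CREATE_FLAGS flags = {};
    if (FAILED(CreateIoRing(IORING_VERSION_3, flags, /*sq*/ 64, /*cq*/ 128, &ring)))
        return 1;

    HANDLE file = CreateFileW(L"assets.pak", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, 0, nullptr);

    // Handle validation and access checks are paid here, once, at registration.
    HANDLE files[] = { file };
    BuildIoRingRegisterFileHandles(ring, 1, files, /*userData*/ 0);
    UINT32 submitted = 0;
    SubmitIoRing(ring, 1, INFINITE, &submitted);  // registration is itself async

    // Subsequent reads reference the registered index; no per-access ping-pong.
    static BYTE buffer[64 * 1024];
    BuildIoRingReadFile(ring, IoRingHandleRefFromIndex(0),
                        IoRingBufferRefFromPointer(buffer), sizeof(buffer),
                        /*fileOffset*/ 0, /*userData*/ 1, IOSQE_FLAGS_NONE);
    SubmitIoRing(ring, 1, INFINITE, &submitted);

    IORING_CQE cqe;
    while (PopIoRingCompletion(ring, &cqe) == S_OK) { /* consume completions */ }
    CloseIoRing(ring);
    CloseHandle(file);
    return 0;
}
```
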
You still have to write an atime record even if you're only reading
No, definitely not every time.

But even then, that atime record was never flushed straight to disk with NTFS. File system metadata uses pages for buffering, like any ordinary IO, and those are for the most part only written back to disk with a delay. You have to get really explicit to even work around that (FlushFileBuffers etc.)
It's also possible (not sure about the first iteration either) that once the list of blocks are established, the GPU can stream data on demand from the storage instead of getting everything in one go.
No, that's not really possible. The GPU simply can't stream. Only the CPU can instruct the NVMe to actively push data to the GPU. And at the end of a long list of transferred chunks, the CPU has to inform the GPU that it may now safely access the memory exposed via PCIe. The CPU monitors completion of the NVMe's work queue and posts a semaphore to the GPU.
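In D3D12 terms, that handoff might look something like the following. This is a rough sketch of the pattern described, not any shipping implementation; PollNvmeTransferComplete() is a made-up stand-in for the NVMe completion-queue monitoring:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical stand-in for watching the NVMe completion queue drain.
bool PollNvmeTransferComplete();

void GateGpuOnStorage(ID3D12Device* device, ID3D12CommandQueue* gpuQueue) {
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // GPU side: work queued after this point stalls until the fence hits 1,
    // i.e. until the CPU says the PCIe-exposed memory is safe to touch.
    gpuQueue->Wait(fence.Get(), 1);

    // CPU side: monitor the NVMe work queue, then post the "semaphore".
    while (!PollNvmeTransferComplete()) { /* spin, or sleep on an interrupt */ }
    fence->Signal(1);  // CPU-signaled fence releases the waiting GPU queue
}
```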
The bypass of copy-to-main-memory is a win but not the main one
For AMD it is. That "one copy" is actually 4 half-duplex transfers with 4 context switches and at least one round of process scheduling.
Use the BAR and it's down to at least 2 transfers and fewer synchronization requirements, but other than the burned memory bandwidth, still not much is saved on the latency introduced on the CPU from the scheduling problems.
Use the ioring API and you can cut down on the CPU overhead a lot, but you still pay for that memory bandwidth, and for a CPU core just spinning to perform the upload.

Factoring in all the other software side improvements, this is then the most significant remainder.

This would have been a far more pressing issue back in the days of DDR3, though; DDR5 systems should actually be impacted a lot less. Your average dual-channel configuration has more than enough bandwidth to spare nowadays....
 
So let's do this:

I'm keenly aware of how this works. All of it. I'm not even sure why you posted it, probably because....

The classic 70s style file descriptors have outlived their purpose, for good. Together with the idea of having to play ping-pong with the kernel for scheduling a simple read on an already opened file. The overhead you are describing (and the entire access control on top of that) is now mostly gone or only paid once per *file*, not per *access*.
The "per access" overhead I'm speaking of is at a physical block level stored on the block storage target, it never was checking file descriptors over and over for the same file. Remember: we're talking about DMA transfers from a GPU which has literally no idea about how the logical file is physically mapped. Do you remember the conversation was about reducing CPU overhead? Because CPU overhead in this specific example is wholly unchanged from the GPU doing the reads than the CPU doing the same, because the CPU is still doing funtionally every single part of a file read except for the final step of having the GPU driver pull the blocks instead of the block storage device driver doing the same.

No, definitely not every time.

But even then, that atime record was never flushed straight to disk with NTFS. File system metadata uses pages for buffering, like any ordinary IO, and those are for the most part only written back to disk with a delay. You have to get really explicit to even work around that (FlushFileBuffers etc.)
ATime can indeed be disabled at a whole-filesystem level; it requires specific elevated permissions to do so. It's a bad idea on end-user compute devices for myriad reasons; it's not even a good idea in the server world outside of very specific use cases. The default configuration for all NTFS is atime enabled, and Microsoft would be insane to change it at a global level.

Annnnd again, it's still a CPU-driven event even if the write itself is cached and delayed (which it would be no matter the case.) The problem you're handwaving away is the CPU cycles necessary to do this work, and the interaction between the GPU making the read and notifying the CPU stack that a read was performed. You've literally saved no time at all by allowing the GPU to do this read, in this particular example, and in fact it's probably worse, because the locking semantics are going to ping-pong between the block storage device driver and the GPU "block driver", whatever that would be.

No, that's not really possible. The GPU simply can't stream. Only the CPU can instruct the NVMe to actively push data to the GPU. And at the end of a long list of transferred chunks, the CPU has to inform the GPU that it may now safely access the memory exposed via PCIe. The CPU monitors completion of the NVMe's work queue and posts a semaphore to the GPU.
At a pure hardware level, PCIe devices are absolutely able to drive their own DMA requests and/or direct bus transfers to other endpoints -- it's a function of the PCIe spec itself, albeit an optional one. Both endpoints have to support the call, of course. And even then, the current Windows OS driver stack probably isn't expecting that behavior today even if the underlying PCIe bus architecture 100% supports it. No small part of this would be an NVMe driver change to be aware of the instructions and device arbitration required.

AAAaaaaaaand again, remember the part where we're trying to get the CPU out of the line of the I/O transfer to accelerate storage to graphics? So far into your retort, we've removed zero of the CPU overhead we keep talking about in this thread.

Look, I do this shit for a living -- for more than one Fortune 250 IT shop, and for more than two decades. This isn't my first, fifth or 20th rodeo in this space. My entire point, several times in this thread, is: the Windows I/O stack is absolutely capable of millions of IOPS without significant CPU overhead. This whole DirectIO thing seems to be more about getting the decompression pieces hardware accelerated, rather than core changes to the I/O stack itself. I'm not sure why I have to keep repeating myself; you've brought up literally nothing that I hadn't already addressed.

The only way a GPU is going to get DMA transfers running from a storage device (which very well could be an HBA such as Fibre Channel or iSCSI in the datacenter compute space) will require either significant changes to NTFS, or a different filesystem altogether, which would map to its own logical volume and partition scheme. That new storage topology would necessarily drop a number of things, or at least significantly restructure how they're handled today. That's the message I've continuously put forth in this thread, and I stand by it.
 
So dumbing this all down for me so I can understand it... basically @Albuquerque you believe that in order for DMA transfer from storage to VRAM to work, Microsoft would essentially need to create a new gaming-specific filesystem which runs on a separate partition/drive formatted to that filesystem, right?
 
It's easier if you think about what DMA actually means: Direct Memory Access. In the context of this thread, we're talking about the data from disk being directly transferred to video memory without any interaction with the main CPU.

How "normal people" think of storage is actually in the form of filesystems; a useful abstraction which allows you to create, manipulate and delete files. However, your physical storage device doesn't store files, it stores blocks of binary data. As such, you can't issue a direct memory access request against a file, you must issue it against very specific blocks of data. What makes this nearly impossible to skip the main CPU (and the OS code necessary to create the filesystem abstraction) is that files are never guaranteed to be, and rarely ever are, written in contiguous blocks on the storage device.

Filesystems also have to care about whether your user ID is permitted to touch a file, to read a file, to make changes to a file, to delete a file... Filesystems also have to care about the last time a file was accessed or when it was saved. All of this file metadata is also stored in your storage device, not as a file, but as a journaled log or bitmap or other such "file allocation table" thing. Since all modern filesystems must care about these metadata things, all of those actions must be checked and executed by your main CPU before any DMA would happen.

The purpose is to get data directly from your storage device into the video card memory without the need to use the main CPU. For any modern file system, this direct access would violate several key tenets of identity and access management. Meaning there's no obvious way for current filesystems to play in this space; something new will have to come about. It could be "so simple as" a new raw partition type, or something slightly more complex like a new filesystem type.

Ooooor it could be something supremely hand-wavey, like a unique extension to an existing filesystem which would permit a file with unique flags to be A: written contiguously to the underlying storage blocks, B: would permit access without checking for permissions (lol you're permitted to not check permissions!) and C: would never manage timestamps.

Honestly though, that sort of thing can lead to other problems. Imagine a portion of your storage which has no security controls, by design, because it can load data into your mega-performance video card faster. I'm quite sure there's no bad actors anywhere out there who might like such lax controls...
 
I see. Thanks. So anything like a new partition or filesystem would have to be 100% locked down right?

I think PC architecture is going to see some big changes coming in the future.
 
Yes, but also no.

It's probable we wouldn't need a whole new partition technology; a UEFI Class 3 GPT partition would likely work just fine with a new partition-type GUID or GPT Attribute flag to indicate it's a utility or reserved partition space for this sort of workload. Once we have the unique partition metadata specified, Windows could then know how to properly interact (or alternatively, deny interaction) with that partition.

So for example, common things Windows can already do with uniquely flagged partition types:
  • Deny mounting the partition by an interactive user logon
  • Prohibit filesystem overlays
  • Block direct or indirect access by user-runtime applications
  • Permit unique filesystems or filesystem attributes unrelated to the main boot partition or user data partitions.

You could hypothetically build this new GPT partition type so that only a specific Windows kernel process can map to/from regions of that partition's blocks on the underlying storage. You could also build the proper controls so that no data coming from that process can be marked "executable." Ostensibly, limiting access to this new partition to only specific kernel hooks would limit what could be done by a bad actor to hide malicious code. Probably. It is still Windows though ;) Anyway, the point is we have the technology available today to do this already, using the existing modern UEFI spec and existing Windows management toolchain.
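To ground that a little: the plumbing for uniquely flagged partitions already exists in winioctl.h today. A sketch with an entirely made-up partition-type GUID; the attribute bits shown are real constants Windows already honors (strictly speaking the BASIC_DATA ones apply to basic data partitions, and a new type would define its own set, which is exactly the GPT Attribute flag idea above):

```cpp
#include <windows.h>
#include <winioctl.h>

// Made-up GUID for a hypothetical "direct storage pool" partition type.
static const GUID PARTITION_DIRECTSTORAGE_POOL_GUID =
    { 0xDEADBEEF, 0x0001, 0x0002, { 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A } };

void DescribeReservedPartition(PARTITION_INFORMATION_EX& part) {
    part.PartitionStyle = PARTITION_STYLE_GPT;
    part.Gpt.PartitionType = PARTITION_DIRECTSTORAGE_POOL_GUID;
    // Existing attribute bits from winioctl.h that Windows honors today:
    part.Gpt.Attributes =
        GPT_ATTRIBUTE_PLATFORM_REQUIRED            // treat as required, don't repartition
      | GPT_BASIC_DATA_ATTRIBUTE_NO_DRIVE_LETTER   // never auto-mount for users
      | GPT_BASIC_DATA_ATTRIBUTE_HIDDEN;           // hide from the shell
}
```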

Here's something we still need to consider though: if it's a new partition, then it's a hard allocation of space in your storage device which would not be available for any other user data partitions. Said another way, you'd find yourself in a place where some number of GB will be reserved for this direct storage mechanism and is not otherwise available for your normal game installs. This brings up another limitation then: how much space do you reserve? How much space does each game take up? Is this direct storage pool fully persistent, or does it work somewhat like a "cache" that is occasionally loaded or unloaded? Do games perhaps need a "cache warm up" phase to optimally use this pool?

What I'm suggesting isn't the only way to solve these problems, there are other ways to make this work. There's not going to be a perfect solution to this interesting challenge, merely a more interesting set of tradeoffs for each...
 
I assume, based on the explanation above, that the CPU initiating a request for a texture the GPU requires, which is then sent direct from SSD to VRAM without traversing system RAM (like AMD's Smart Access Storage suggests it will do), would not then be considered a DMA request? Would it have to be the GPU itself making the request directly to the SSD, without the CPU necessarily even being aware of it, to be considered a DMA request? I assume at the moment the GPU asks the CPU, the CPU tells the SSD to send the texture to system RAM, and the CPU then passes it on to VRAM?

Does anyone know how this was handled on those Radeon Pros that had local SSD storage attached, which the GPU could see as local memory?
 
I personally am OK with allocating a specific partition specifically to games. Hell, as it is I have 3 drives each dedicated to specific things: one dedicated to Steam games specifically, one for the rest of the launchers (Ubi, Epic, Origin, etc.), and one for emulation/storing GOG game backups and mods for all my games.

So dedicating specific drives to gaming would be nothing new.

But yeah I wonder how it would be integrated into Windows... and by that I mean the user experience. At what point would they ask you to create this new partition? Would that fall on Steam and the others to implement a way for the user to create a partition beforehand, or would the game install detect whether it's present or not, and if not, trigger the process to create one?

I don't think a cache idea would be good. I think it needs to be fully persistent, and people just accept that they have to dedicate either a portion or an entire drive specifically to games.
 
I assume, based on the explanation above, that the CPU initiating a request for a texture the GPU requires, which is then sent direct from SSD to VRAM without traversing system RAM (like AMD's Smart Access Storage suggests it will do), would not then be considered a DMA request?

The short answer for how a texture fetch works today goes like this:
  • Application in userspace calls to D3D API to fetch a texture from a file
  • D3D userspace runtime invokes the requisite filesystem API to retrieve the file
  • Filesystem userspace runtime invokes the requisite kernel storage hooks to make the access request
  • The kernel space storage request is then sent out to the storage driver for the requisite block fetches
  • The storage driver retrieves the requested storage blocks from the underlying physical storage controller
  • Storage controller fetches blocks, places them in main memory, and notifies the storage driver of completion
  • Storage driver notifies the kernel of the completion and the memory address of the loaded asset
  • Kernel would then notify the filesystem with the new handle ID and the relevant memory pointer
  • Filesystem hands the pointer back to D3D
  • D3D can now do its own work to decompress or whatever; when its done what it needs, it will then notify the video driver of the in-memory pointer of the resulting data
  • The video driver will then either fully or partially load the completed result from system memory into dedicated VRAM
  • The video driver then notifies the DXAPI stack that the texture is ready, which notifies the app

By converting this to a Direct I/O call we could hypothetically drop this list by quite a bit:
  • Application in userspace calls to D3D "DirectI/O" API to fetch a texture (by way of a memory-mapped pointer to storage)
  • D3DIO userspace driver would then ostensibly hand this storage request directly to the userspace video card driver
  • The video driver instructs the physical video card to perform a PCIe peer-to-peer DMA request to the host storage controller which "owns" the memory mapped pointer region
  • Physical host storage device acknowledges the request, hands the mapped blocks straight over PCIe to the video card
  • Video card notifies the video card driver that the requested asset has been loaded, which notifies DXAPI, which notifies the app

The CPU is almost entirely avoided in the DXIO method, and that's roughly how we should expect it to lay out.
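
Worth noting: the DirectStorage API Microsoft has shipped (dstorage.h) already has roughly the shape of that second list from the application's point of view, even though the initial Windows implementation still stages through system memory. A minimal sketch of enqueueing a read straight into a D3D12 resource; the file name, sizes, and the pre-created device/resource/fence are assumptions:

```cpp
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void LoadAsset(ID3D12Device* device, ID3D12Resource* destBuffer,
               ID3D12Fence* fence, UINT64 fenceValue, UINT32 assetBytes) {
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    DSTORAGE_QUEUE_DESC queueDesc = {};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets.pak", IID_PPV_ARGS(&file));

    DSTORAGE_REQUEST request = {};
    request.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    // The GPU-decompression piece is selected here for compressed assets:
    // request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source          = file.Get();
    request.Source.File.Offset          = 0;
    request.Source.File.Size            = assetBytes;
    request.UncompressedSize            = assetBytes;
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = assetBytes;
    queue->EnqueueRequest(&request);

    queue->EnqueueSignal(fence, fenceValue);  // app waits on this fence, not on I/O calls
    queue->Submit();
}
```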
 
They have people way smarter than me solving this problem :) My "it should look like this" list of items is very simply reading the diagrams they've already posted on how they envision it working.
So then what do you make of AMD Smart Access Storage (and presumably RTX I/O)? How does that play into this? Because it seems like AMD already states that they will be able to bypass system RAM for GPU-destined assets entirely. RTX I/O, back when it was announced, showed quite clearly the direct path from storage to VRAM without requiring much CPU intervention.
 