Thanks for the detailed write-ups. Could the CPU not instruct the SSD DMA controller to send the required data direct to the GPU? Wouldn't that still allow all the security checks and file permissions to remain in place, but also allow the system memory to be bypassed?

Realistically, this gains incredibly little. The heaviest CPU burden here (in terms of active compute cycles) is all the filesystem semantics around identity and access management, and then walking the filesystem descriptor (the journalled file system, or file table, or file bitmap, or whatever) to discover the relevant physical storage host controller(s) and, underneath, the mapped physical storage device blocks.
Said another way: the inclusion of a filesystem in this I/O call makes the process CPU heavy.
One potential gain here would be bypassing the need for a second copy of the object in main system memory. While this immediately seems like a win (and it very well could be), you then get into some performance curiosities about leveraging heterogeneous GPU systems as a novel way to "hardware accelerate" all the DXIO work away from the active rendering GPU. Mental picture for you: leverage a modern Intel iGPU or an AMD APU as the I/O offload accelerator; D3D12 already supports this asynchronous compute model today. Modern iGPUs/APUs are simultaneously connected to the PCIe root complex and also to main system memory. Bypassing main memory becomes an arbitrary and ultimately useless line in the sand for those folks who could use an embedded GPU (and thus, the main system memory pool) to do all the fancy GPU-accelerated decompression and storage management, then hand it off to the dGPU doing all the raster work.
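To make that mental picture a bit more concrete, here's a rough sketch (illustrative only, not from any shipping title) of how an app could use plain DXGI to tell the integrated and discrete adapters apart before deciding which one gets the I/O and decompression work. The least-dedicated-VRAM heuristic and the function name are just assumptions for the example:

#include <cstdint>
#include <dxgi1_6.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Pick the adapter with the least dedicated VRAM as the presumed iGPU/APU and
// the one with the most as the presumed rendering dGPU. Heuristic only; error
// handling omitted. Link against dxgi.lib.
void PickAdapters(ComPtr<IDXGIAdapter1>& integrated, ComPtr<IDXGIAdapter1>& discrete)
{
    ComPtr<IDXGIFactory6> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    SIZE_T minVram = SIZE_MAX;
    SIZE_T maxVram = 0;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE)
            continue; // skip WARP / software adapters

        if (desc.DedicatedVideoMemory < minVram) { minVram = desc.DedicatedVideoMemory; integrated = adapter; }
        if (desc.DedicatedVideoMemory > maxVram) { maxVram = desc.DedicatedVideoMemory; discrete = adapter; }
    }
}

IDXGIFactory6 also exposes EnumAdapterByGpuPreference, which can hand back the minimum-power (usually integrated) or high-performance adapter directly if you'd rather not roll your own heuristic.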
So yeah, could we get some gain in performance by skipping the main memory copy of the object? Certainly. Is that secondary memory copy actually costing CPU cycles? Sure, but it costs a minuscule quantity of CPU power compared to the monster pile of CPU cycles needed to walk the filesystem abstraction layer.
Filesystems are the enemy here.
As I mentioned earlier, there are ways we could work around that bottleneck while keeping a filesystem. A contrived example could involve the software installer writing a singular binary blob (file) with all the assets bundled together. Once the write is complete, the app would then use a series of instructions to map out how the aforementioned asset file was physically written to the underlying storage (all the way down to discrete LBAs - or Logical Block Addresses), then create a bitmap of sorts serving as a rosetta stone for "this object/asset is stored on this storage controller, on this storage port connected to that controller, on this storage endpoint GUID attached to that storage port, starting at this LBA and extending to this LBA for the first 96KB, then skipping ahead to this LBA through this next LBA for the following 36KB, then skipping ahead again to this LBA through this other LBA for the next 124KB, and then...."
That bitmap rosetta stone would have to be built specifically on each machine during installation, there would need to be a flag placed on that file to ensure it could never be "defragged" at any abstraction level, and if you ever migrated your disk (e.g. you upgrade from a 1TB gaming drive to a 4TB gaming drive) you'd have to let every app regenerate its bitmap because all the LBAs would change.
This method would permit a single filesystem call to get the master filehandle on the large asset blob, and then combined with an in-memory copy of that bitmap rosetta stone, the app could possibly issue direct I/O calls underneath to the millions of mapped blocks. This solution is a fragile, brittle, and rigid system that could be made to work but would also potentially suffer a LOT of strange pitfalls when it comes down to how a modern OS expects to be able to manage files separately from block I/O calls.
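To illustrate what that rosetta stone might look like in memory, here's a purely hypothetical sketch -- the struct names and fields are made up for the example, and on Windows/NTFS the raw material for building it would come from something like FSCTL_GET_RETRIEVAL_POINTERS plus the volume-to-disk mapping:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical "rosetta stone" for one asset blob: each run says "this many
// bytes of the blob live at this LBA range on this controller".
struct ExtentRun {
    uint32_t controllerId;  // which storage controller / NVMe namespace
    uint64_t startLba;      // first physical block of the run
    uint64_t byteOffset;    // where this run begins inside the blob
    uint64_t byteLength;    // length of the run in bytes
};

struct AssetBlobMap {
    uint32_t lbaSizeBytes;        // 512 or 4096, probed at install time
    std::vector<ExtentRun> runs;  // sorted by byteOffset, rebuilt on every install/migration
};

// Translate "give me bytes [offset, offset+length) of the blob" into the
// physical runs a direct block I/O path would have to read.
std::vector<ExtentRun> ResolveRange(const AssetBlobMap& map, uint64_t offset, uint64_t length)
{
    std::vector<ExtentRun> out;
    for (const ExtentRun& run : map.runs) {
        const uint64_t runEnd = run.byteOffset + run.byteLength;
        if (runEnd <= offset || run.byteOffset >= offset + length)
            continue;  // no overlap with the requested range

        const uint64_t begin = std::max(offset, run.byteOffset);
        const uint64_t end   = std::min(offset + length, runEnd);

        ExtentRun piece = run;
        piece.startLba   = run.startLba + (begin - run.byteOffset) / map.lbaSizeBytes;
        piece.byteOffset = begin;
        piece.byteLength = end - begin;
        out.push_back(piece);
    }
    return out;
}

Every read of the blob would have to funnel through something like ResolveRange before any direct block I/O could be issued, which is exactly the kind of fragile plumbing described above.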
Also worth noting: in the direct storage I/O stack where the video card and the storage controller can work through a peer-to-peer transfer, there has to be a modification to both the video driver and storage driver stacks. Specifically, when each of those devices goes into PCIe Bus Master mode to do their jobs, the OS needs to know that the underlying hardware is carrying out commands that the OS may not have actually issued or even be aware of. So when the physical video card sends the PCIe TLP commands to the physical storage controller, the upstream OS storage driver needs to be aware of the physical device arbitration status so that other OS-handled I/O requests are properly queued and pipelined to stack in behind the current active bus transfer.
Edit: it later occurred to me the singular binary blob of asset data will also require several other steps in how it's created at the filesystem level so that individual assets are written to disk in such a way that they align at LBA boundaries. This means there's a pre-step of determining the physical LBA size of the destination storage device before creating the file, and then while creating the asset blob, each asset needs to be padded so that they all line up properly on LBA boundaries. Otherwise you end up with asset fragments that spill over into other LBAs, and then you have to build a bunch of extra logic to prune nonsense off the blocks, which also means you're loading more data and burning more GPU cycles than you actually need to.
Example to help illustrate: if your underlying storage device is a modern 4Kn drive, then you want your assets padded to the nearest 4 kilobyte boundary -- so a 9KB asset will necessarily consume 12KB on disk -- three 4KB blocks, padded so that it perfectly fits into three physical blocks. If your underlying storage is actually a 512e or 512n device, then your 9KB asset could ostensibly consume only 9KB on disk -- eighteen 512B blocks aligned to the underlying eighteen physical blocks.
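A tiny worked version of that padding math, using the 9KB asset from the example above:

#include <cstdint>
#include <cstdio>

// Round an asset up to the next multiple of the device's logical block size so
// every asset starts and ends exactly on an LBA boundary.
uint64_t PaddedSize(uint64_t assetBytes, uint64_t lbaBytes)
{
    return ((assetBytes + lbaBytes - 1) / lbaBytes) * lbaBytes;
}

int main()
{
    const uint64_t asset = 9 * 1024;  // the 9KB asset from the example
    std::printf("4Kn drive:      %llu bytes (%llu x 4KB blocks)\n",
                (unsigned long long)PaddedSize(asset, 4096),
                (unsigned long long)(PaddedSize(asset, 4096) / 4096));  // 12288 bytes, 3 blocks
    std::printf("512-byte drive: %llu bytes (%llu x 512B blocks)\n",
                (unsigned long long)PaddedSize(asset, 512),
                (unsigned long long)(PaddedSize(asset, 512) / 512));    // 9216 bytes, 18 blocks
    return 0;
}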
We know that dedicated hardware will be introduced at some point as well.
This is an intriguing possibility. Certainly leveraging the iGPU to handle the CPU-destined data decompression (which, even under the DirectStorage GPU decompression model, would still be done on the CPU) could be a very real win, both from an architectural perspective and a practical/performance one.
I still see value in passing the GPU-destined data directly to the GPU for decompression there, thus avoiding the extra system memory copies and saving PCIe bandwidth (on account of sending compressed data over that bus rather than uncompressed data). But for the CPU-destined data, unburdening the CPU itself of that decompression job by utilising the otherwise idle iGPU seems like an excellent use of already present resources.
So again, the bypass of system memory isn't going to net much of anything in terms of CPU cycles saved. You might avoid a handful of milliseconds of data transfer, depending on how big your asset might be.
The enormous all-CPU-consuming pink elephant in the room is the filesystem; most everything else in the work pipeline consumes so very little CPU time as to functionally not matter.
I still don't think this is necessarily the case. In fact I think it's less likely than not, to be honest. It's analogous to PhysX processors IMO. Theoretically it's an advantage, but practically it adds cost and complexity, limits flexibility, and is quite difficult to implement in the PC space because of all the various parties that would need to agree on standards and support. I suspect the likely already-good-enough GPU implementation, which should be relatively easy to introduce, will win out. I also think that streaming and decompression requirements will become relatively smaller over time compared to growing CPU and GPU performance.

Microsoft has already stated it's the case though. I don't see why it would be any harder than it is for any other hardware standard and support to happen? This is specifically why AMD and Nvidia (and Intel) are utilizing the standard MS is bringing forward. GPU decompression, and anything else done within the current architecture, is a stop-gap to allow actual hardware and architectural changes to be created and adopted by the market.
The SDK has been out for a while, correct? Is there a reason not a single dev has put out a game taking advantage of it? The first high-profile one will be Forspoken and that will be coming out almost a year after DirectStorage was made available. Is it time-consuming/complicated to implement?

It is likely time-consuming to implement while providing smallish actual benefits.
That's disappointing. I've been hearing about the supposed great benefits for over two years now.
IMO such benefits could be coming with the GPU decompression path being added, but without it I rather doubt that DS brings much to the table in terms of actual I/O performance.
And if I remember correctly, this was supposed to be part of RTX IO, which is still MIA two years later.
But that reduced latency from skipping the system memory copy may have some benefits I guess? Perhaps say for SFS, where there is a very short window to fetch the necessary tile from memory? Or are there so many other bottlenecks in the loading process that the effort to implement DS as it stands right now may not be worth it?
With DirectStorage 1.1, we present a new compression format, contributed by NVIDIA, called GDeflate.
“NVIDIA and Microsoft are working together to make long load times in PC games a thing of the past,” said John Spitzer, VP of Developer and Performance Technology at NVIDIA. “Applications will benefit by applying GDeflate compression to their game assets, enabling richer content and shorter loading times without having to increase the file download size.”
GDeflate is a novel lossless data compression standard optimized for high-throughput decompression on the GPU with deflate-like compression ratios. GDeflate saves CPU cycles by offloading costly decompression operations to the GPU, while saving system interconnect bandwidth and on-disk footprint at the same time. GDeflate compression is inherently data-parallel, which enables greater scalability across a wide range of GPU architectures. It is designed to provide significant bandwidth amplification when loading from the fastest NVMe devices, supporting both bulk-loading and fine-grained streaming scenarios.
GDeflate provides a new GPU decompression format that all hardware vendors can support and optimize for. Microsoft is working with key partners like AMD, Intel, and NVIDIA to provide drivers tailored for this format. “Intel is excited to release drivers co-engineered with Microsoft to work seamlessly with the DirectStorage Runtime to bring optimized GPU decompression capabilities to game developers!” said Murali Ramadoss, Intel Fellow and GM of GPU Software Architecture. Like all DirectX technologies, with DirectStorage, Microsoft is working to ensure that gamers have great options for compatibility and performance for their hardware.
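For anyone curious what this looks like from the app side, here's a rough sketch of enqueuing one GDeflate-compressed asset with the DirectStorage 1.1 runtime. It's based on the shape of the public dstorage.h API and the Microsoft samples as I recall them, so treat the exact field names as assumptions; the file name, sizes, and surrounding resources (device, destination buffer, fence) are placeholders:

#include <d3d12.h>
#include <dstorage.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Enqueue one GDeflate-compressed asset. Assumes an existing ID3D12Device, a
// destination ID3D12Resource big enough for the uncompressed data, a fence for
// completion, and sizes recorded when the asset was packed. Error handling and
// resource management omitted; needs the DirectStorage 1.1 redistributable.
void LoadGDeflateAsset(ID3D12Device* device, ID3D12Resource* destBuffer,
                       ID3D12Fence* fence, UINT64 fenceValue,
                       UINT32 compressedSize, UINT32 uncompressedSize)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets.bin", IID_PPV_ARGS(&file));  // placeholder asset blob

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // Compressed bytes come straight off the SSD; the runtime and the vendor
    // driver handle the GDeflate decompression on the GPU into the buffer.
    DSTORAGE_REQUEST request{};
    request.Options.SourceType          = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType     = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat   = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source          = file.Get();
    request.Source.File.Offset          = 0;
    request.Source.File.Size            = compressedSize;
    request.UncompressedSize            = uncompressedSize;
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = uncompressedSize;

    queue->EnqueueRequest(&request);
    queue->EnqueueSignal(fence, fenceValue);  // fence signals when the data is resident
    queue->Submit();
}

The notable part is the single CompressionFormat flag: the app only declares that the bytes on disk are GDeflate, and the runtime plus the vendor-optimized driver decide where the decompression actually happens.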