Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

In order for the DirectStorage-specific GPU decompression to work, the entire asset management chain has to be collapsed into a singular, preprocessed blob. An example implementation of this process is the GLTF format he describes starting where I timestamped the AMD presentation above. This is to ensure maximum read speed from the storage layer, and also a "known size" for both the compressed and uncompressed assets in order to reserve the VRAM via the driver (see timestamp 12:15).

This precomputed blob stores all the linked resources for the asset, and according to the AMD presentation, those are called subresources (he's talking about these around the 14m mark; also see the API documentation on-screen, specifically calling out the SubResources class).
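To make the "known size" idea concrete, here's a rough C++ sketch of what a header for such a packed blob could look like. This layout is purely my own illustration (the names and fields are made up), not the actual format from the AMD sample:

#include <cstdint>

// One record per subresource (e.g. a mip level or array slice).
// Both sizes are computed offline, so at runtime the driver/app can
// reserve VRAM and staging space before issuing a single read.
struct SubresourceRecord {
    uint64_t fileOffset;        // where the compressed bytes sit inside the blob
    uint32_t compressedSize;    // size on disk
    uint32_t uncompressedSize;  // size after GPU decompression
    uint32_t subresourceIndex;  // which D3D12 subresource this fills
};

struct AssetBlobHeader {
    uint32_t magic;             // sanity check / versioning
    uint32_t subresourceCount;  // number of SubresourceRecord entries that follow
    // ...followed by the records, then the compressed payloads themselves
};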

I watched the video and the recommendation to bundle all the metadata into a single blob makes sense to avoid multiple dependent reads. However, this metadata resides only in system memory and contains pointers to the actual resources. The pointers are used by the CPU to initiate the load requests into the GPU. The actual load request can be for a single compressed 64KB tile of a texture from disk to the GPU. You don’t need to copy the entire compressed texture to the GPU in order to decompress one tile.
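For what it's worth, this is the kind of request I'm picturing: a single compressed tile going straight from the file into one region of an already-created texture. Only a sketch, written from memory of the DirectStorage 1.x headers (the exact field names are worth double-checking against dstorage.h, and the queue/file/texture objects are assumed to have been created earlier):

#include <dstorage.h>
#include <d3d12.h>

// Enqueue one GDeflate-compressed 64KB tile; the GPU decompresses it
// directly into the destination texture region, no full-texture copy needed.
void EnqueueOneTile(IDStorageQueue* queue, IDStorageFile* file,
                    ID3D12Resource* texture,
                    uint64_t fileOffset, uint32_t compressedSize)
{
    DSTORAGE_REQUEST request = {};
    request.Options.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_TEXTURE_REGION;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;

    request.Source.File.Source = file;
    request.Source.File.Offset = fileOffset;     // pulled from the metadata blob
    request.Source.File.Size   = compressedSize; // the compressed tile only

    request.Destination.Texture.Resource = texture;
    request.Destination.Texture.SubresourceIndex = 0;
    request.Destination.Texture.Region = { 0, 0, 0, 256, 256, 1 }; // one 256x256 tile

    request.UncompressedSize = 64 * 1024;        // a 256x256 BC7 tile is 64KB

    queue->EnqueueRequest(&request);
    queue->Submit();                             // kick off the batch
}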

You do need a few extra buffers on the GPU when doing GPU decompression, but it shouldn’t be a huge problem. The size of the staging buffer on the GPU is a function of the ratio of VRAM bandwidth + decoding speed to the incoming bandwidth. Since VRAM bandwidth is a couple of orders of magnitude higher than the rest of the I/O subsystem, the VRAM buffers can be quite small. AMD recommends 256MB in this presentation. Nvidia recommends 128MB as a good starting point.
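And the staging buffer size itself is just a knob on the factory, so it's easy to tune per machine. A minimal sketch, assuming DirectStorage 1.x (the 256MB figure is just AMD's suggestion from the talk, not something the API requires):

#include <dstorage.h>
#include <wrl/client.h>

// Bump the staging buffer from the default (32MB) to 256MB. The buffer only
// has to cover data that's "in flight" between the SSD and the decompression
// shader, which is why it can stay small relative to total VRAM.
void ConfigureStagingBuffer()
{
    Microsoft::WRL::ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));
    factory->SetStagingBufferSize(256 * 1024 * 1024);
}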
 
At the same time, all of that was still true when the resources were in system RAM. It was always billed (even in the presentation) as "up to 5x" and that number doesn't change in the GPU decompression method vs the CPU decompression method. So at best, you're +256MB of VRAM consumed on the AMD gear, and +128MB consumed on NVIDIA gear, versus workloads today in a world where we're already running out of VRAM.

So I'll reiterate my point again: the performance upswing of GPU decompression still comes at a cost of VRAM. Is it a worthy tradeoff? For maximum performance on large assets, where the video card has sufficient VRAM, absolutely. Is it a default case for a whole lot of the video cards that exist today? Not really.

Also, the GPU isn't initiating any disk transfers at all; it's still handled by the video driver and actioned by the main CPU as a load request into system memory, which is then moved into VRAM. The deck is 100% clear on that transaction flow, even in BypassIO mode.
 

You also need to account for the likelihood that with awesome streaming tech you are going to evict assets from VRAM more aggressively and keep less stuff resident. You can’t really use today’s VRAM usage as a baseline.

Also, the GPU isn't initiating any disk transfers at all; it's still handled by the video driver and actioned by the main CPU as a load request into system memory, which is then moved into VRAM. The deck is 100% clear on that transaction flow, even in BypassIO mode.

Yup that’s what I said.
 

An increased use of tiled resources and the use of something like SFS, made more usable thanks to faster decompression, might be able to offset this to some degree (less resident in VRAM, stuff getting dumped faster when not needed).

Trouble is, we might not see adoption of techniques like that quickly. Cards like the 570, 5700XT, 1060 etc. are all still pretty popular, and don't support DX12U. The tools to handle increasing VRAM pressure seem to be arriving, but their use might not be high enough up the priorities list for a while.
 

Pascal supports Shader Model 6 so the 1060 should be fine. Those cards are probably too slow in other areas anyway.
 

Yeah, I was thinking about Sampler Feedback and its use with tiled resources. DS thankfully has a lower bar for entry than DX12U.

I should really put an NVMe drive in my PC and do that avocado test on my old timey RX570.
 
In the end, all of this DirectStorage tech is good stuff and nobody should think otherwise. In the same breath, a lot of it is future-looking, and so several of the coolest features will work better on later cards with higher capabilities and capacities. There's nothing wrong with this at all; indeed, some of these technologies will be useful even beyond gaming; the BypassIO and IORing methods are absolutely applicable to general purpose applications.
 
Exactly. It's simply a much more efficient use of the architecture that is already there, and a useful improvement which acts as a stop-gap of sorts to buy time until proper architectural improvements can be made and adopted by the market.

What will future implementations look like? Will future GPUs incorporate extra silicon dedicated to decompression? Microsoft have implied that future implementations could be dedicated hardware based.

There are still fundamental OS issues to solve to reduce latency and other overheads which don't exist on consoles, but the PC also doesn't exactly have to be as efficient as consoles to be superior... and that's where the improvements to the API come in, which make taking advantage of this improved dataflow path much easier for developers... which ultimately means it's more likely to be adopted and used by games and applications.
 
Hope this is the correct thread ...
Addlink has made waves in the tech industry with the announcement of its S95 8TB Gen4x4 SSD. The drive has achieved a remarkable sequential read speed of 28GB/s in a 32TB NVMe RAID array, tested on an AMD Threadripper workstation with four Addlink 8TB SSDs using an MSI M.2 XPANDER-AERO RAID card on an MSI TRX40 Creator motherboard.

The S95 SSD features TLC 3D NAND technology and offers exceptional read speeds of up to 7GB/s, making it at least two times faster than Gen 3 NVMe SSDs and more than 14 times faster than SATA SSDs.
...
In addition to its outstanding performance, the S95 SSD offers impressive endurance, with 2800TBW for the 4TB version and 5600TBW for the 8TB version.
 
Considering we're nowhere near maxing out the new SSDs, performance like this doesn't have much relevance to gaming.
 
Yeah, we've been able to put SSDs into striped RAID configurations for a while now, the problem is that applications need to be written specifically for fast storage for it to be a meaningful improvement for most use cases. Obviously there are some that will benefit regardless, like simply copying of large volumes of data, but real world consumer applications (like games) generally need to be specifically written to take advantage of it and even there you'll likely run into other bottlenecks before you get even close to needing more than say 7 GB/s or even 4 GB/s drives.

Regards,
SB
 
Yeah, this is really just a marketing ploy. NVMe RAID has been around for a hot minute; the story here is someone spent a decent chunk of coin to build a proper PCIe 4 x16 interface card to then strap in four (of any, really) NVMe drives and pump out a cool marketing number for a QD32 1MB sequential read -- which they didn't call out, but that's near-exactly what it was, I'm quite sure.

Historical anecdote: Quite a while back, when the C-stepping 3930K came out with the "technically not PCIe 3 but also totally has PCIe 3" interface spec, I bought a HighPoint 2720SGL RAID card and strapped six Vertex 3 SSDs to it, all in a fat RAID 0 stripe set, and cranked out some crazy ass benchmark numbers. I never had a single problem with the array, however all that throughput was utterly wasted on the applications of the time. It could pump out ~2.5GBytes/sec in sustained throughput in synthetic benches more than ten years ago, and yet literally not a single piece of software I owned at the time could realistically get to even a quarter of that number in regular usage.
 
Can any application other than high-end video editing saturate any-bandwidth storage?

If you have a sufficiently large database and a sufficiently complex query, I suppose so. However, there aren't likely many general consumers that would need anything like that. :D Maybe a hobbyist forum on a homemade server in the basement that gets a ton of traffic? :)

It's difficult to come up with many consumer facing applications that would be able to take advantage of that much bandwidth without first running into other potential bottlenecks.

Heck, even simple file copying runs into bottlenecks preventing full speed data transfers. For example, without batching many smaller transfers into a single large transfer you can incur enough overhead that your GB/s NVMe SSD suddenly starts transferring files at single or double digit MB/s speeds when moving hundreds or potentially thousands of KB-sized or even single-digit-MB-sized files. And batching those smaller files into one large transfer can itself be a bottleneck, where your transfer will stall until they are assembled into a larger block transfer.
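A toy illustration of that per-file overhead, nothing DirectStorage-specific and the file names are made up, just to make the point concrete:

#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Reads one file in a single shot.
static std::vector<char> ReadWholeFile(const std::string& path)
{
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    std::vector<char> data(f ? static_cast<size_t>(f.tellg()) : 0);
    f.seekg(0);
    f.read(data.data(), static_cast<std::streamsize>(data.size()));
    return data;
}

int main()
{
    using clock = std::chrono::steady_clock;

    // 10,000 tiny files: each one pays the open/seek/close tax.
    auto t0 = clock::now();
    for (int i = 0; i < 10000; ++i)
        ReadWholeFile("assets/chunk_" + std::to_string(i) + ".bin");
    auto t1 = clock::now();

    // Same bytes packed into one archive: a single large sequential read.
    ReadWholeFile("assets/chunks_packed.bin");
    auto t2 = clock::now();

    std::printf("small files: %lld ms, packed: %lld ms\n",
        static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()),
        static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()));
    return 0;
}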

Regards,
SB
 
I'd wager that some dev tools, especially game dev tools and environments, would come close to saturating such setups, but that's likely due to a history of throwing hardware at the problem in that industry.
 
Yeah, we've been able to put SSDs into striped RAID configurations for a while now, the problem is that applications need to be written specifically for fast storage for it to be a meaningful improvement for most use cases. Obviously there are some that will benefit regardless, like simply copying of large volumes of data, but real world consumer applications (like games) generally need to be specifically written to take advantage of it and even there you'll likely run into other bottlenecks before you get even close to needing more than say 7 GB/s or even 4 GB/s drives.

Regards,
SB
That's the "big" problem. There is only so much data that CPUs and GPUs can process; beyond that point, the sheer volume of extra bandwidth only shaves a little off the latencies. Also, not every bit of data must be read into memory, as most of it is already there, because there is normally not that much changing between frames. So really only the initial bandwidth spike on a complete location change might create problems, and even those cases were handled quite well in the past.

Also, nowadays there is often a procedural part in games which is normally not loaded from storage but instead generated on the CPU. That leads to situations where the game logic must first generate the world and only then start loading things into memory, because the game simply does not know what to load until the game's world has been generated.
 
At the same time, all of that was still true when the resources were in system RAM. It was always billed (even in the presentation) as "up to 5x" and that number doesn't change in the GPU decompression method vs the CPU decompression method. So at best, you're +256MB of VRAM consumed on the AMD gear, and +128MB consumed on NVIDIA gear, versus workloads today in a world where we're already running out of VRAM.

So I'll reiterate my point again: the performance upswing of GPU decompression still comes at a cost of VRAM. Is it a worthy tradeoff? For maximum performance on large assets, where the video card has sufficient VRAM, absolutely. Is it a default case for a whole lot of the video cards that exist today? Not really.

Also, the GPU isn't initiating any disk transfers at all; it's still handled by the video driver and actioned by the main CPU as a load request into system memory, which is then moved into VRAM. The deck is 100% clear on that transaction flow, even in BypassIO mode.

Not necessarily. From what I recall, GPUs hardly make efficient use of VRAM today, aggressively prefetching data that may never be needed in return for limiting GPU accesses to memory that can't be serviced by VRAM. The decompression + IO is supposed to allow for less aggressive prefetching and more efficient use of VRAM. So ideally you are trading 128-256 MB of buffer for some amount of freed VRAM that can actually be utilized for data that's needed for rendering.

So if your scheme is performant enough, the decompression buffer should be smaller than the amount of VRAM that's freed from storing unneeded data. Basically, a net gain.
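To put rough numbers on it (my own back-of-envelope, not from the presentation): a 4096x4096 BC7 texture is about 16MB for the top mip alone, roughly 21MB with the full mip chain. If sampler feedback shows only a quarter of its tiles are actually being sampled, streaming at tile granularity saves something like 15MB on that one texture, so fifteen to twenty textures handled that way already pay for a 256MB staging buffer.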
 
I finally fixed my NVMe drives to properly support BypassIO. There are a couple of things that I haven't seen mentioned yet but are probably worth knowing about.

First of all, it breaks disk performance counters in Windows in applications that use DirectStorage. That is to say, they work perfectly normally in applications that do not use DirectStorage. But applications that do use DirectStorage effectively become invisible to the OS as far as those disk counters are concerned. In other words, you can have a DS application that completely maxes out your SSD but Task Manager, Resource Monitor, HWinfo, etc. will all report your disk as being idle. This obviously has some implications for measuring DS performance since we're effectively blind to how the disk itself is performing.

Secondly, BypassIO seems to work transparently with DirectStorage and doesn't need any specific work done. As long as the application uses DS and the drive supports BypassIO then it seems to work automatically.
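For anyone else poking at this: if I remember right, newer Windows 11 builds let you check it per volume with "fsutil bypassIo state C:\", which also lists the reasons if BypassIO is being vetoed (filter drivers, encryption and so on).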

Also, I was kinda hoping that it would fix that weird read behavior in Forspoken but that doesn't seem to be the case. It was still invisible to the above performance counters but I managed to confirm it via the drive's SMART info.
 
If it breaks disk performance counters, then perhaps there's no weird read behavior in Forspoken to begin with?
 