DirectStorage GPU Decompression, RTX IO, Smart Access Storage

The I/O stack in Windows isn't perfect, just like it isn't perfect in Linux. It also isn't the bottleneck you and others seem to assume it is.

The "others" literally being Microsoft themselves.

Microsoft said:
Modern games load in much more data than older ones and are smarter about how they load this data. These data loading optimizations are necessary for this larger amount of data to fit into shared memory/GPU accessible memory. Instead of loading large chunks at a time with very few IO requests, games now break assets like textures down into smaller pieces, only loading in the pieces that are needed for the current scene being rendered. This approach is much more memory efficient and can deliver better looking scenes, though it does generate many more IO requests.

Unfortunately, current storage APIs were not optimized for this high number of IO requests, preventing them from scaling up to these higher NVMe bandwidths creating bottlenecks that limit what games can do. Even with super-fast PC hardware and an NVMe drive, games using the existing APIs will be unable to fully saturate the IO pipeline leaving precious bandwidth on the table.

That’s where DirectStorage for PC comes in.
 
The "others" literally being Microsoft themselves.
And again, you fail to recognize the context with which I've twice now tried to educate you.

Issuing tens or hundreds of thousands of I/O operations requires a lot of power to service those requests. A reduction in CPU time absolutely helps a system which is power constrained or otherwise CPU burdened -- both of which directly apply to consoles with "reserved" CPU capacity and shared iGPU power draw to contend with.

On a desktop PC with dedicated power budget to just the CPU, the cycles aren't a limiter.

We can directly prove this by showing Windows boxes crushing the IOP rate of any bullshit game when executing real, high-performance enterprise workloads without any problem whatsoever. The paltry I/O needs of any video game pale in comparison to a SQL cluster managing centralized store replenishment for six thousand five hundred stores, cross-linked with DMV registration data for every car in every county, cross-linked again with weather statistics for the past 18 months, cross-linked again with prior sales data for the same 18 months, and cross-linked once more with the topology of a spoke-and-hub supply chain.

I could probably light a block of concrete on fire with the power consumption of that SQL cluster and the NVMeOF frame it's connected to. Your assertion of IOP limits in Windows being anything related to a SATA interface is laughable, at best.
 
Yeah you can get by pretty well with your games on a lowly hard drive even today. Load times are longer of course but it doesn't cripple the experience.

I guess we'll see if upcoming games do something drastically different.
 
And again, you fail to recognize the context with which I've twice now tried to educate you.

Issuing tens or hundreds of thousands of I/O operations requires a lot of power to service those requests. A reduction in CPU time absolutely helps a system which is power constrained or otherwise CPU burdened -- both of which directly apply to consoles with "reserved" CPU capacity and shared iGPU power draw to contend with.

On a desktop PC with dedicated power budget to just the CPU, the cycles aren't a limiter.

This makes no sense. The consoles have fixed-frequency CPUs (yes, the PS5 is slightly variable, but not massively so like a mobile CPU).

But taking the XSX CPU for example, regardless of the console's power constraints you have a guaranteed frequency of 3.5GHz on an 8-core Zen 2 based CPU. There are plenty of PC CPUs that don't have those resources, so suggesting PCs don't need DirectStorage because they don't have the same resource constraints as consoles makes no sense. And that's before we even consider the hardware decompressors in the consoles.
 
Despite it not making sense to you, the reality is that a modern Windows kernel can push multiple millions of IOPS with service times in the dozens of microseconds.

Game load times are not bottlenecked by a Windows I/O stack failure on modern, capable hardware.
 
Despite it not making sense to you, the reality is that a modern Windows kernel can push multiple millions of IOPS with service times in the dozens of microseconds.

Game load times are not bottlenecked by a Windows I/O stack failure on modern, capable hardware.
The reality is that MS are creating DirectStorage for a reason: sub-1-second, or close to 1-second, load times, i.e. effectively instantaneous, and to reduce as much strain on the CPU at run-time as possible.
 
Yes, we agree: they're optimizing an I/O path for devices which are CPU constrained. It seems to make a lot of sense on consoles.

There are multitudes of datapoints and reviews showing desktop gaming PCs are not experiencing I/O wait for loading game assets. Your statement does not invalidate anything I've said, nor does it refute millions of Windows servers (running the exact same kernel as Windows 10) being able to perform disk I/O at a level far beyond anything a video game would need.
 
For whatever* reason, decompression of assets seems to be single-threaded, or at least very shyly multithreaded, on gaming PCs. DirectStorage removes that and gives a kind-of-guaranteed decompression performance that does not tax your PC's CPU.
*my guess is: you can never know beforehand how many cores there are in a PC, and most devs cannot be bothered to write dynamic decompression that scales with the core count.
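
For illustration, a minimal sketch of what core-count-aware decompression could look like, assuming assets are split into independently compressed chunks. Everything here (the Chunk layout, decompress_chunk standing in for the real codec) is hypothetical, not something DirectStorage or any particular engine actually ships:

Code:
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

struct Chunk {
    const std::byte* src; std::size_t srcSize;  // compressed bytes
    std::byte*       dst; std::size_t dstSize;  // decompressed output
};

// Stand-in for the real per-chunk codec (zlib, LZ4, Oodle, ...);
// here it just copies so the sketch is self-contained.
void decompress_chunk(const Chunk& c)
{
    std::memcpy(c.dst, c.src, std::min(c.srcSize, c.dstSize));
}

// Spread independent chunks across however many cores this PC has,
// instead of assuming a fixed (usually single) worker count.
void decompress_all(std::vector<Chunk>& chunks)
{
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([&] {
            for (std::size_t j = next++; j < chunks.size(); j = next++)
                decompress_chunk(chunks[j]);
        });
    for (auto& t : pool) t.join();
}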
 
Game load times are not bottlenecked by a Windows I/O stack failure on modern, capable hardware.

So just to be clear, why do you think Microsoft are saying that they are, and what do you think is the real bottleneck preventing NVMe drives with 10x the throughput of SATA SSDs from delivering significant speed-ups in game load time?

Yes, we agree: they're optimizing an I/O path for devices which are CPU constrained. It seems to make a lot of sense on consoles.

Then why wouldn't it also make sense on PC? Granted, there are much more powerful CPUs in the PC space, but unless you're rocking a 3700X (which is still pretty high-end compared to the average PC), you're likely just as CPU-constrained as the consoles, or even more so.

So if this makes sense for the consoles then surely it makes sense for those PC's as well.

And taking that one step further, even if you have a 5950X with 16 cores, why would you want to bog down 5 or 6 of them purely to service IO requests and decompression when DirectStorage can reduce that to a small fraction of a single core?

There are multitudes of datapoints and reviews showing desktop gaming PCs are not experiencing I/O wait for loading game assets.

I'm curious to see these benchmarks showing NVMe drives demonstrating multiple times the game load speed of SATA SSDs.
 
I'm sorry that you think millions of incredibly high-end Windows servers running the Windows 10 kernel are limited to the speed of a SATA disk interface.

I don't know why you think that.

It isn't correct in the slightest.

Microsoft pointed out an optimization to their console systems which reduces CPU overhead for issuing lots and lots of I/O requests. Yeah, that's nice and probably worthwhile. Yeah, it's easily ported to the PC.

No, that doesn't equate to every Windows 10 PC being massively bottlenecked to SATA disk speeds.

I'm done having this incredibly stupid and narrow-minded conversation. There is infinitely more data showing the Windows 10 kernel having enormous amounts of disk I/O headroom than there is showing the contrary.

There are plenty of sales pitches for it though, aren't there?

NV RTX IO, available only for your $2000/ea video card! Buy a few :)

I bet I can find a boxed copy of SoftRAM for you, too. I'll make you a deal...
 
Microsoft pointed out an optimization to their console systems which reduces CPU overhead for issuing lots and lots of I/O requests. Yeah, that's nice and probably worthwhile. Yeah, it's easily ported to the PC.
Yeah, that appears to be the main selling point of DirectStorage on PC.

You are absolutely correct, Windows servers can offer millions of I/Os, but the cost in CPU time is high, and CPUs also have to decompress game data during loading and during streaming. The problem is that current gaming APIs are incredibly inefficient at using CPU resources for these logistical tasks, as most of them are still single-threaded. DirectStorage is simply a way to correct these limitations in the DirectX API specifically, freeing the CPU to focus more on its natural tasks of handling game logic, physics simulation and rendering.
 
The key point of DS on PC is the standardization of GPU decompression. Read speeds on PC are not limited by storage; they are limited by data processing (decompression), which tends to happen on a couple of CPU threads at best.
 
https://devblogs.microsoft.com/directx/directstorage-developer-preview-now-available/

DirectStorage Compatibility 
Microsoft is committed to ensuring that when game developers adopt a new API, they can reach as many gamers as possible. As such, games built against the DirectStorage SDK will be compatible with Windows 10, version 1909 and up; the same as the DirectX 12 Agility SDK.

DirectStorage features can be broken down into:
  • The new DirectStorage API programming model that provides a DX12-style batched submission/completion calling pattern, relieving apps from the need to individually manage thousands of IO requests/completion notifications per second
  • GPU decompression providing super-fast asset decompression for load time and streaming scenarios (coming in a later preview)
  • Storage stack optimizations: On Windows 11, this consists of an upgraded OS storage stack that unlocks the full potential of DirectStorage, and on Windows 10, games will still benefit from the more efficient use of the legacy OS storage stack
This means that any game built on DirectStorage will benefit from the new programming model and GPU decompression technology on Windows 10, version 1909 and up. Additionally, because Windows 11 was built with DirectStorage in mind, games running on Windows 11 benefit further from new storage stack optimizations. The API runtime implementation and the GPU decompression technology is delivered via the DirectStorage SDK, and ships with your game. As a game developer, you need only implement DirectStorage once into your engine, and all the applicable benefits will be automatically applied and scaled appropriately for gamers.

In fact, this great compatibility extends to a variety of different hardware configurations as well. DirectStorage enabled games will still run as well as they always have even on PCs that have older storage hardware (e.g. HDDs).
So compatible with Win10 as old as 1909 even.
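
To make the batched submission/completion model concrete, here's a rough sketch of the calling pattern. Field and function names follow the public DirectStorage SDK samples (the developer preview's structs may differ slightly), and the file name, chunk sizes, and resources are made up:

Code:
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// device: an existing ID3D12Device; dest: an existing ID3D12Resource buffer;
// fence/fenceValue: how the app later learns the whole batch completed.
void LoadAssets(ID3D12Device* device, ID3D12Resource* dest,
                ID3D12Fence* fence, UINT64 fenceValue)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets.pak", IID_PPV_ARGS(&file)); // hypothetical pack file

    // Enqueue many requests; none of this touches the OS storage stack yet.
    const UINT32 kChunk = 64 * 1024;                        // made-up chunk size
    for (UINT32 i = 0; i < 1024; ++i) {
        DSTORAGE_REQUEST r{};
        r.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
        r.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
        r.Source.File.Source = file.Get();
        r.Source.File.Offset = UINT64(i) * kChunk;
        r.Source.File.Size   = kChunk;
        r.Destination.Buffer.Resource = dest;
        r.Destination.Buffer.Offset   = UINT64(i) * kChunk;
        r.Destination.Buffer.Size     = kChunk;
        queue->EnqueueRequest(&r);
    }

    // One signal and one submit cover the entire batch -- no per-request
    // completion notifications for the app to babysit.
    queue->EnqueueSignal(fence, fenceValue);
    queue->Submit();
}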
 
  • The new DirectStorage API programming model that provides a DX12-style batched submission/completion calling pattern, relieving apps from the need to individually manage thousands of IO requests/completion notifications per second
And that's actually the key feature here. Because even though we have had overlapped IO with IO completion ports for a long time now, it requires a decent amount of per-request overhead for bookkeeping what's in flight, both in-application and in the OVERLAPPED structures. Plus, even though GetQueuedCompletionStatusEx already allowed batched processing of completion events, batched submission did not exist.

Well, if you don't insist on batched submission, massively parallel scatter loading was already possible before, though. Batching isn't that difficult either, if you are willing to pay for the bookkeeping across multiple pending batches.

There is yet another catch, though: if you can't batch requests, good old friends named AV software (and a bunch of other freeloaders on the IO stack) also wake up for every single request, and these are often where the actual CPU-time part of the IO overhead comes from.
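
For contrast, a bare-bones sketch of that legacy pattern: one OVERLAPPED and one ReadFile call per request, with completions at least reaped in batches via GetQueuedCompletionStatusEx (file name and sizes are placeholders, error handling omitted):

Code:
#include <windows.h>
#include <vector>

int main()
{
    HANDLE file = CreateFileW(L"assets.pak", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED, nullptr);
    HANDLE iocp = CreateIoCompletionPort(file, nullptr, /*CompletionKey*/ 0, 0);

    // One OVERLAPPED (plus app-side bookkeeping) per in-flight request.
    const int   kRequests = 1024;
    const DWORD kChunk    = 64 * 1024;            // placeholder chunk size
    std::vector<OVERLAPPED> ov(kRequests);
    std::vector<std::vector<char>> buf(kRequests, std::vector<char>(kChunk));
    for (int i = 0; i < kRequests; ++i) {
        ov[i] = {};
        ULONGLONG offset = ULONGLONG(i) * kChunk;
        ov[i].Offset     = DWORD(offset);
        ov[i].OffsetHigh = DWORD(offset >> 32);
        ReadFile(file, buf[i].data(), kChunk, nullptr, &ov[i]); // one submission each
    }

    // Completions, at least, can be drained in batches.
    int done = 0;
    while (done < kRequests) {
        OVERLAPPED_ENTRY entries[64];
        ULONG n = 0;
        GetQueuedCompletionStatusEx(iocp, entries, 64, &n, INFINITE, FALSE);
        done += int(n);
    }
    CloseHandle(iocp);
    CloseHandle(file);
}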
 
There are multitudes of datapoints and reviews showing desktop gaming PCs are not experiencing I/O wait for loading game assets.
I'm not trying to weigh in on who definitely is or isn't correct, just want to point out that using current PC games as a benchmark doesn't really work, because they aren't built, in terms of data structure and memory management and all that, to take advantage of SSDs the way some select 'next-gen' games are on consoles so far. The point is that IO requests will increase drastically.

Maybe you're right, and this still won't bottleneck the system, and Microsoft are basically wasting their time on a fairly pointless update (given that next-gen games will be targeting fairly well-spec'd systems), but that seems a bit strange, no? You say it's just for CPU-limited systems like consoles, except the consoles have pretty good CPUs in them, only a mild glance back compared to desktop CPUs. It's hard to imagine that this fairly short gap contains, somewhere inside it, the defining line at which things go from not good enough to good enough, such that MS have needed to step in and do some emergency upgrades to ensure things work fine on consoles.

I would guess MS do know something we don't, and that there will be tangible benefits to come from them doing all this work.
 
I'm not trying to weigh in on who definitely is or isn't correct, just want to point out that using current PC games as a benchmark doesn't really work, because they aren't built, in terms of data structure and memory management and all that, to take advantage of SSDs the way some select 'next-gen' games are on consoles so far.
Show us one and we can have that conversation.

There are Optane P5800X reviews (PCIe 4.0 x4, a single U.2 interface) on the net now showing Microsoft SQL Server performance nearing a million IOPS with I/O service times below 100 μs. This, again, is on a single storage device, using the same Windows kernel as a Windows 10 distribution. This isn't bulk-rate reading of a single enormous flat file for maximum bandwidth; these are query results, which means a LOT of random I/O.

I'm still waiting for someone to show any data at all that the Windows kernel is bottlenecking disk I/O today. I'd love to hear a realistic story as to why stupidly-simple-in-comparison game data files are somehow more complex and more onerous to disk throughput than an enterprise-scale transactional database.

I suspect what we're really facing here is an API which does all the heavy-lifting work for game designers who don't want to put in the code effort, which is truly fine. Making it easier for a dev is a rational and reasonable argument, far more so than claiming the kernel will service I/O in some remarkable, game-changing (ha!) faster way. I'm sure the kernel can use more tweaking, as all code can; it isn't bottlenecking disk I/O today on the crap storage we find in commodity-grade consumer devices like typical NVMe drives.

I also buy into GPU decompression being more of the "meat and potatoes" of this newfangled feature.
 
Show us one and we can have that conversation.

There are Optane P5800X reviews (PCIe 4.0 x4, a single U.2 interface) on the net now showing Microsoft SQL Server performance nearing a million IOPS with I/O service times below 100 μs. This, again, is on a single storage device, using the same Windows kernel as a Windows 10 distribution. This isn't bulk-rate reading of a single enormous flat file for maximum bandwidth; these are query results, which means a LOT of random I/O.

But how many CPU cores are those servers using to hit that level of data transfer? And how much of that data are they having to decompress on the fly? This isn't about saying PCs can't move that much data, it's about reducing the massive overhead associated with it.

We literally have it straight from Digital Foundry and Microsoft's Andrew Goossen:

Digital Foundry said:
The final component in the triumvirate is an extension to DirectX - DirectStorage - a necessary upgrade bearing in mind that existing file I/O protocols are knocking on for 30 years old, and in their current form would require two Zen CPU cores simply to cover the overhead, which DirectStorage reduces to just one tenth of single core.

"Plus it has other benefits," enthuses Andrew Goossen. "It's less latent and it saves a ton of CPU. With the best competitive solution, we found doing decompression software to match the SSD rate would have consumed three Zen 2 CPU cores. When you add in the IO CPU overhead, that's another two cores. So the resulting workload would have completely consumed five Zen 2 CPU cores when now it only takes a tenth of a CPU core.

That's just the overhead associated with a 2.4GB/s SSD. You can multiply that by almost 3x for the fastest SSDs on the market, which are themselves capable of 1m IOPs. How many gaming PCs can afford to throw 15 Zen 2 cores at a game loading scenario? Or even worse, in game streaming? And that is of course assuming that games can spread the IO load and decompression over multiple CPU cores evenly, rather than being single-thread limited, which is more often the case. In which case you're limited to about 1.2GB/s on the IO side, and around 800MB/s on the decompression side.

I'm still waiting for someone to show any data at all that the Windows kernel is bottlenecking disk I/O today. I'd love to hear a realistic story as to why stupidly-simple-in-comparison game data files are somehow more complex and more onerous to disk throughput than an enterprise-scale transactional database.

Then what's the explanation for this?

[Chart: Doom Eternal game load time comparison across drives, from the TechPowerUp review linked below]


https://www.techpowerup.com/review/western-digital-wd-black-sn850-1-tb-ssd/13.html
 
But how many CPU cores are those servers using to hit that level of data transfer? And how much of that data are they having to decompress on the fly? This isn't about saying PCs can't move that much data, it's about reducing the massive overhead associated with it.

We literally have it straight from Digital Foundry and Microsoft's Andrew Goossen:
Yeah, so a desktop PC with eight cores at a minimum is going to be fine for level loading at the maximum rate of a commodity SSD. Remember, Zen 2 is a prior-generation architecture -- and I've been panned for insinuating consoles are at a CPU deficit. Guess what? Consoles are at a CPU deficit compared to desktops.

That's just the overhead associated with a 2.4GB/s SSD. You can multiply that by almost 3x for the fastest SSDs on the market, which are themselves capable of 1m IOPs.
You're conflating terms here. CPU consumption isn't directly a function of I/O bandwidth; in fact, you can achieve maximum disk bandwidth with a relatively tiny amount of I/O and related CPU. Processor consumption is a function of total outstanding I/Os in the stack, which doesn't necessarily relate to overall bandwidth. There are multiple factors there: total depth of the I/O queues, number of I/O queues available (at least four on commodity NVMe disks, only one for ATA), and the service times of the disk serving up your requests. If you're going to proxy CPU to disk I/O, then IOPs is the closer way to track it, not bandwidth. Now to get to the point: how many IOPs are we dealing with during these level loads? Do we know? Because if we don't, then we need to find out.
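
As a back-of-the-envelope illustration (the numbers here are assumed, not measured from any game), Little's Law ties outstanding I/Os, service time, and IOPs together:

\[
\text{IOPS} = \frac{\text{outstanding I/Os}}{\text{service time}}
\qquad\Rightarrow\qquad
\frac{100\ \text{requests in flight}}{100\ \mu\text{s}} = 1{,}000{,}000\ \text{IOPS}
\]

So a drive with 100 μs service times needs only ~100 requests in flight to sustain a million IOPs, and the CPU cost tracks the request count, not the bytes each request moves.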

Part of this goes back to file management of the game itself, which should be obvious now.

How many gaming PCs can afford to throw 15 Zen 2 cores at a game loading scenario?
Honestly? All of them. It's a level load! We aren't actively playing the game here, we're waiting to play the game. Want to make an argument about GPU decompression? Great, but you can't buy a "gaming" desktop today with fewer than eight CPU threads, and that gives you six more threads for doing... something else.

Or even worse, in game streaming?
Writes are far less painful than reads, especially asynchronous writes (which would be indicative of recording your streamed video game to a video file). Reads are pathological because they're blocking operations; writes are not blocking unless you have a specific need for 100% data integrity, which isn't what any of the commodity streaming/recording studio software is doing. Writes are cached and coalesced by the OS and then written to disk as a few large, contiguous blocks rather than scattered in zillions of individual I/Os (aka random writes).
Then what's the explanation for this?
Explanation is simple: they aren't waiting on disk.

The level load time by itself describes literally nothing about a bottleneck, pro or con. Want to actually demonstrate support for your attempt at a point? Toss in a perfmon log during a level load and have it show disk service times. Here's a hint: you aren't waiting on the service times of the disk, which is why faster and faster disks aren't making a difference to the level load time.
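
If anyone wants to capture exactly that, one way is Windows' built-in typeperf (the counters and flags are standard perfmon names; the one-second interval and output file name are arbitrary):

Code:
typeperf "\PhysicalDisk(_Total)\Avg. Disk sec/Read" ^
         "\PhysicalDisk(_Total)\Current Disk Queue Length" ^
         "\PhysicalDisk(_Total)\Disk Reads/sec" -si 1 -o loadtime.csv

Run it across a level load: if Avg. Disk sec/Read stays low while the load drags on, the wait isn't the disk.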
 