Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Hmm... does Diablo 4 use DirectStorage? It has the dstorage.dll and dstoragecore.dll files in the directory just like Forspoken.

[attached image: dstorage.png]

Or are those files just something that is included with games built on the newer SDKs by default now?
 
Good question. It might be worth using Process Explorer to see if it's loading the DLLs.

If you're unfamiliar with it: press Ctrl+L to open the lower pane, then Ctrl+D to switch it from handles to DLLs. Then just find the executable in the upper list.
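
If you'd rather script the check than eyeball Process Explorer, here's a rough sketch using the PSAPI enumeration calls. Nothing here is specific to DirectStorage besides the DLL name we're looking for:

```cpp
// Sketch: list running processes that currently have dstorage.dll loaded.
// Run it while the game is at a menu or in-world; build x64 and run elevated for best coverage.
#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main()
{
    DWORD pids[4096], bytesReturned = 0;
    if (!EnumProcesses(pids, sizeof(pids), &bytesReturned)) return 1;

    for (DWORD i = 0; i < bytesReturned / sizeof(DWORD); ++i)
    {
        HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pids[i]);
        if (!proc) continue;

        HMODULE mods[1024];
        DWORD needed = 0;
        if (EnumProcessModules(proc, mods, sizeof(mods), &needed))
        {
            DWORD count = min(needed / (DWORD)sizeof(HMODULE), (DWORD)1024);
            for (DWORD m = 0; m < count; ++m)
            {
                char name[MAX_PATH];
                if (GetModuleBaseNameA(proc, mods[m], name, MAX_PATH) &&
                    _stricmp(name, "dstorage.dll") == 0)
                {
                    char exe[MAX_PATH] = "?";
                    GetModuleBaseNameA(proc, mods[0], exe, MAX_PATH); // first module is the exe itself
                    printf("%s (pid %lu) has dstorage.dll loaded\n", exe, pids[i]);
                }
            }
        }
        CloseHandle(proc);
    }
    return 0;
}
```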
 
Yes they did.. Sapphire made one.

So they did!!! Sapphire always did mad stuff with their Toxic variants - the Sapphire HD 4890 Toxic was the first GPU with a 1GHz core clock out of the box (<-- random fact of the day)

But the point still stands, 99% of the 7970s were 3GB.
 
I'm sitting here watching the video. First, he keeps conflating "block size" with "request size" starting at the 6:25 mark. The axis is mislabeled; he actually says it's request size and not block size, but then waffles back and forth calling it the wrong thing. Here's why it matters: your storage block size is either 512 bytes or 4096 bytes -- there aren't any other options. No, we're not talking about filesystem cluster size for NTFS / exFAT / XFS / EXT[3,4]. This distinction of request size is important, because what we're really talking about here is how we "glob" assets together into fewer, larger, sequential reads (and he even mentions this if you listen to him, but it's not written down correctly in several successive slides.)

What he goes on to describe over the next four minutes is how to "pack" files in a way that allows us to consolidate a zillion serially-linked small reads into far fewer, big fat reads.
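
To make the packing idea concrete, here's a toy sketch. The PackEntry layout and file names are invented purely for illustration; the point is just that one large sequential read replaces a pile of tiny ones:

```cpp
// Toy illustration of packing: assets that load together are stored contiguously,
// so one big sequential read covers the whole group and each asset is then just a
// slice of that buffer. The pack layout here is made up.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct PackEntry {            // hypothetical table-of-contents entry
    std::string name;
    uint64_t    offset;       // where the asset starts inside the pack file
    uint32_t    size;
};

int main()
{
    // Entries for one "load group", laid out back to back in the pack file.
    std::vector<PackEntry> group = {
        { "rock_albedo.dds", 0,       4 << 20 },
        { "rock_normal.dds", 4 << 20, 4 << 20 },
        { "rock_mesh.bin",   8 << 20, 1 << 20 },
    };

    const uint64_t begin = group.front().offset;
    const uint64_t end   = group.back().offset + group.back().size;

    // One big, sequential read covering the whole group...
    std::vector<char> blob(end - begin);
    FILE* f = fopen("level0.pack", "rb");
    if (!f) return 1;
    fseek(f, (long)begin, SEEK_SET);
    fread(blob.data(), 1, blob.size(), f);
    fclose(f);

    // ...and every asset is now just an offset into that one buffer.
    for (const PackEntry& e : group)
        printf("%s -> %u bytes at blob+%llu\n",
               e.name.c_str(), e.size, (unsigned long long)(e.offset - begin));
    return 0;
}
```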

Starting around the 13 minute mark we see the new way to manage VRAM reservations: essentially telling the video memory pool to be ready to receive the aforementioned big fat read, which may A: still be compressed and B: actually be multiple resources (apparently referred to as subresources now, per the slide at the 14 minute mark), versus today's method, which would be a stack of singular resources, all individually uploaded in a serial end-to-end fashion.
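
For reference, here's roughly what one of those big fat, still-compressed reads looks like through the public dstorage.h API. This is my own sketch, not anything from the slides; the pack file name, offsets/sizes, and the pre-created D3D12 buffer are made up, and real code would keep the queue alive and signal a fence before using the data:

```cpp
// One DirectStorage request = one packed region of the file, still GDeflate-compressed
// on disk, decompressed on the way into a single GPU buffer. (Link against dstorage.lib.)
#include <dstorage.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void EnqueuePackedRead(ID3D12Device* device, ID3D12Resource* gpuBuffer)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"level0.pack", IID_PPV_ARGS(&file));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // One request covering many packed assets (subresources) at once.
    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source          = file.Get();
    request.Source.File.Offset          = 0;
    request.Source.File.Size            = 64u * 1024 * 1024;   // compressed bytes on disk
    request.Destination.Buffer.Resource = gpuBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = 128u * 1024 * 1024;  // uncompressed bytes
    request.UncompressedSize            = 128u * 1024 * 1024;

    queue->EnqueueRequest(&request);
    queue->Submit();  // real code would also enqueue a status/fence signal here
}
```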

At 14:14 we hear him describe loading resources from disk as device independent, but he doesn't say it's a block-level read. This harkens back to some spirited conversations earlier in this thread: we're simply reading files off a filesystem, which means we're still not moving blocks straight from the storage device into VRAM.

Aha, and at 15:50, there it is -- we're still copying from disk with an API call (hello, NTFS filesystem!) into RAM (but with far fewer RAM-to-RAM copies, so that's a solid win) and then into the GPU.

At 16:35 we start talking about the new file IO methods in Windows 11. We'll take them in order:

BypassIO is something I hypothesized quite a while ago in our discussion thread; here's the API spec: https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/bypassio Basically, what it does is take an NTFS file handle (yup, it has to start at NTFS), walk the entire volume driver stack (file to filesystem logical blocks, filesystem logical blocks to a volume map, a volume map to a disk sector map), and then enable a way for NTFS to lock those physical sector details and access them by talking straight to the StorNVME driver.

Apparently BypassIO only works on the StorNVME driver right now, and also only with the NTFS file system (which means the filesystem overhead for ATIME stamps and ACL checks is still there.) If you read the API spec, the BypassIO handle is also forced back down the traditional IO stack if the file is filesystem-compressed or filesystem-encrypted, if a defrag affects the file in any way after the handle is obtained, or in a multiuser environment if someone else on the system gets a handle to the same file. This API request is basically an alternate way to get a file handle, and if it fails the BypassIO check, you still get a handle; the access simply follows the standard disk IO path.

The second Windows 11 IO method he touches upon is IORing (https://learn.microsoft.com/en-us/windows/win32/api/ioringapi/), which allows an application to queue up a pile of file handles into a buffer; the app then receives an event notification when everything in the queue has been read into memory. There's a great writeup of how Windows IORing compares to Linux's similar io_uring here: https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations/ Of course, since the IORing function works on handles, you just feed it the stack of BypassIO handles you created above... By the way, if you read the MS IORing API spec, it looks like this is fully supported in later Win10 builds.
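
Here's roughly what that flow looks like against the public ioringapi.h surface. This is a sketch of my own, not anything from the talk; the file names, buffer sizes, and queue depths are made up:

```cpp
// IORing sketch: build a batch of reads, submit them all at once, then drain completions.
#include <windows.h>
#include <ioringapi.h>
#include <vector>
#include <cstdio>

int main()
{
    // 1. Create the ring: a submission queue and a completion queue.
    IORING_CREATE_FLAGS flags{ IORING_CREATE_REQUIRED_FLAGS_NONE,
                               IORING_CREATE_ADVISORY_FLAGS_NONE };
    HIORING ring = nullptr;
    if (FAILED(CreateIoRing(IORING_VERSION_3, flags, 64, 128, &ring))) return 1;

    // 2. Queue up a pile of reads against ordinary file handles.
    const wchar_t* files[] = { L"asset0.pak", L"asset1.pak", L"asset2.pak" };
    std::vector<HANDLE> handles;
    std::vector<std::vector<char>> buffers;
    for (UINT_PTR i = 0; i < 3; ++i)
    {
        HANDLE h = CreateFileW(files[i], GENERIC_READ, FILE_SHARE_READ, nullptr,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        handles.push_back(h);
        buffers.emplace_back(4u * 1024 * 1024);
        BuildIoRingReadFile(ring,
                            IoRingHandleRefFromHandle(h),
                            IoRingBufferRefFromPointer(buffers.back().data()),
                            (UINT32)buffers.back().size(),
                            0,                 // file offset
                            i,                 // user data, identifies this request
                            IOSQE_FLAGS_NONE);
    }

    // 3. Submit the whole batch and wait until everything has completed.
    UINT32 submitted = 0;
    SubmitIoRing(ring, 3, INFINITE, &submitted);

    // 4. Drain the completion queue.
    IORING_CQE cqe{};
    while (PopIoRingCompletion(ring, &cqe) == S_OK)
        printf("request %zu finished: hr=0x%08lx, %llu bytes\n",
               (size_t)cqe.UserData, (ULONG)cqe.ResultCode,
               (unsigned long long)cqe.Information);

    CloseIoRing(ring);
    for (HANDLE h : handles) CloseHandle(h);
    return 0;
}
```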

At 19:20 he mentions Cancellation, or the ability to kill off a pending read request in order to slot in a more important one. It looks like this lives at the IORing level: https://learn.microsoft.com/en-us/windows/win32/api/ioringapi/nf-ioringapi-closeioring lets you kill the entire ring, and there's also a cancel request you can queue (BuildIoRingCancelRequest) to kill specific pending operations without killing the entire queue. Either way, it's nice to be able to cancel a pending IO if it's an unnecessary IO.
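
A minimal sketch of the per-request flavor, assuming you kept the user-data keys you submitted with (the function and parameter names below are from ioringapi.h; the helper itself is mine):

```cpp
#include <windows.h>
#include <ioringapi.h>

// Hedged sketch: cancel one pending operation in an existing ring by the user-data
// key it was submitted with, without closing the whole ring. 'ring', 'file', and
// 'opUserData' stand in for whatever you used when you built the original read.
HRESULT CancelOneRead(HIORING ring, HANDLE file, UINT_PTR opUserData)
{
    HRESULT hr = BuildIoRingCancelRequest(ring,
                                          IoRingHandleRefFromHandle(file),
                                          opUserData,  // which pending op to cancel
                                          0);          // user data of the cancel itself
    if (FAILED(hr)) return hr;

    UINT32 submitted = 0;
    return SubmitIoRing(ring, 0, 0, &submitted);       // push the cancel to the kernel
}
```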

Now at 25:50 he shows that textures and assets need to be precomputed and repacked to be "DirectStorage compatible." Which is fine, but it also means getting the biggest bang from DirectStorage isn't as simple as a "backport" of an existing asset.

In the end, here are a few things that pop into my mind:
  • I wager the CPU isn't busy at all in the non-DirectStorage use cases. Serialized I/O (read and wait until complete, read and wait until complete) isn't CPU intensive at all, but it's still stupidly slow even on a fast disk interface. All things considered, the CPU is very likely MUCH busier with DirectStorage enabled, mostly because it can actually be getting work done versus waiting for an IO completion polling cycle.
  • Go back to timestamp 15:50 and look at the memory copies again. Did you catch it? If you think your VRAM-hungry games are a problem today, wait until your GPU is responsible for all the decompression stages. Before GPU decompression they're only getting the end result, and in theory only the portions of the end result that need to be uploaded (e.g. a driver may only upload certain mip levels versus the entire texture dependency chain uploaded as linked subresources.)
  • The NTFS filesystem is still here; this isn't block-level transfer from disk straight to the GPU. I feel like a broken record: direct block transfers to VRAM were never going to happen in this era of security. Even BypassIO starts with a full file-level-to-block-level traversal of the entire Windows disk IO stack.
  • The real magic here probably isn't BypassIO, it's the IORing method for asynchronous batching of files. And what's interesting to me is, any developer could've written something quite similar, and in fact quite a few have (see also: basically any high-performance relational database software.) It's nice that Windows has finally provided the easy button for appdevs who didn't want to write their own, and did one better by eliminating a memory copy step: the kernel executes the reads directly into memory the application can see.

That's my read on the situation. Not a lot of surprises IMO, and I'm glad to see it coming to fruition. Games will be better for it, and so will a ton of other apps that elect to use it for non-gaming purposes (e.g. imagine Blender using IORing to pull in the shitload of assets linked to your scene...)

Edit on Monday morning: Had several typos bothering me, and I also simplified some language. It was kinda hard to read my stream of consciousness in a few places while I was pausing the video to type :D
 
As discussed in the thread comparing external storage, the high price seems to be entirely due to the small form factor. The standard, longer M.2 2280 format is much more widespread and cheaper than the shorter 2230 size MS chose. Being larger allows more chips for the same storage etc.

Seeing consumer prices here, WD have a cheaper 2230 option. If 2230 becomes more widespread (it's used in the Steam Deck now), prices should drop a bit, but it'll always be more expensive for one-chip solutions versus four-chip ones.
 
Searching for info about Oodle Kraken to explain that it's not a PS5 technology, I found some interesting stuff


About the PS5 implementation


It seems the HW decompressors aren't used in many titles on PS5 and Xbox Series.
Most ports and cross-generation titles (which is almost everything currently released on Xbox Series X and PS5) don't use the HW decoders at all and do SW decompression on the CPU cores. We're still selling lots of Oodle Data licenses for PS5 and Xbox Series X titles, despite both having free HW decompression. Just because it's available doesn't mean software actually uses it.

Either device has 16GB of RAM and easily reads above 3.2GB/s with decompression, and a good chunk of RAM goes into purely transient memory like GPU render targets etc, so in practice there's a lot less data that actually needs to be resident, especially since many of the most memory-intensive things like high-detail mip map levels don't need to be present for the first frame to render and can continue loading in the background. Any load times significantly above 5 seconds, on either device, have nothing to do with either the SSD or codec speeds and are bottlenecked elsewhere, usually CPU bound.
 
It seems the HW decompressors aren't used in many titles on PS5 and Xbox Series.

Note that post was from May 2021.

[attached screenshot]

I mean yeah, of course most multiplatform releases 6 months after the new consoles arrived were not using the HW decompressors much. It was obvious, and people would regularly comment on why so many games did not seem to be living up to the reputation of lightning-fast SSDs.

You can't make any assumptions about how they're being utilized nearly 2 years later based on that post.
 
It seems the HW decompressors aren't used in many titles on PS5 and Xbox Series.

That makes sense considering games with <5 second load times on either platform are extremely rare.

I imagine changing engines to suit is what's taking the time, and it's no doubt been slowed down further by having to support last gen.
 
It seems the HW decompressors aren't used in many titles on PS5 and Xbox Series.
I wonder if this is just a tooling problem that needs to be rectified; not every game company makes everything top to bottom.
It’s a curious situation, or perhaps there are limitations on the license if you go multiplatform.
 
I wonder if this is just a tooling problem that needs to be rectified; not every game company makes everything top to bottom.
It’s a curious situation, or perhaps there are limitations on the license if you go multiplatform.

That was 2021, and for cross-gen games.
 
Seeing consumer prices here, WD have a cheaper 2230 option. If 2230 becomes more widespread (it's used in the Steam Deck now), prices should drop a bit, but it'll always be more expensive for one-chip solutions versus four-chip ones.

When you can get one high-density chip off the wafer at an equivalent yield to four low-density chips, the silicon itself is a wash; then wafer costs will make the lower-density chips more expensive - particularly when you account for semi packaging, controller simplicity and testing. This gradual pricing curve is observable with RAM, optical media and solid state storage - any medium where the upper bound is always being pushed to greater capacities. If you have a look at what 16GB of DDR2/DDR3 will cost you across multiple sticks versus one stick of DDR4/DDR5, you'll see it in action. For solid state, binning is quite aggressive as well, so four chips might be four failed higher-density chips that still pass QA, in which case the cost is sunk and the pricing is about loss recovery.

I imagine economies of scale come into it as well. When you're making less because there is less demand, it costs more.
 
It seems the HW decompressors aren't used in many titles on PS5 and Xbox Series.
Load times aren't always constrained by IO, though, so a game loading for more than 5 seconds doesn't mean that the HW decompressors aren't being used. It simply means that the bottleneck lies somewhere other than the movement of bits from storage to RAM. If you were already bottlenecked somewhere else, it doesn't matter that you can now move data 100x faster.
 
^^ This is it, precisely.

So many people seem to be confused about how the code itself can completely constrain a system while A: not stressing the drive at all and, at the same time, B: not stressing the CPU either. So much of the DirectStorage talk in the AMD presentation was about the simplest parts of obvious code optimization: don't stupidly load just one asset at a time, in serial fashion (load this, wait until it finishes, load next, wait until it finishes, repeat until done.)

If you want a modern flash drive to perform at peak, you need to do BIG reads and a lot of them at the same time. Shove the disk queue completely full, and place big fat file reads in that queue.
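
For the curious, here's what "shove the queue full of big reads" looks like with plain overlapped Win32 IO, no DirectStorage required. The file name, request size, and queue depth below are arbitrary, and the sketch assumes the pack file is at least queueDepth * requestSize bytes:

```cpp
// Keep the NVMe queue full of large requests by issuing many overlapped reads at
// once instead of read-and-wait, then drain the completions afterwards.
#include <windows.h>
#include <vector>
#include <cstdio>

int main()
{
    const DWORD requestSize = 8u * 1024 * 1024;   // big fat reads
    const int   queueDepth  = 16;                 // keep the device busy

    HANDLE file = CreateFileW(L"pack.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    std::vector<std::vector<char>> buffers(queueDepth, std::vector<char>(requestSize));
    std::vector<OVERLAPPED> ovs(queueDepth);

    // Issue all the reads up front -- the queue is full before we wait on anything.
    for (int i = 0; i < queueDepth; ++i)
    {
        ULONGLONG offset = (ULONGLONG)i * requestSize;
        ovs[i] = {};
        ovs[i].Offset     = (DWORD)(offset & 0xFFFFFFFF);
        ovs[i].OffsetHigh = (DWORD)(offset >> 32);
        ovs[i].hEvent     = CreateEventW(nullptr, TRUE, FALSE, nullptr);
        ReadFile(file, buffers[i].data(), requestSize, nullptr, &ovs[i]);
    }

    // Now drain the completions.
    for (int i = 0; i < queueDepth; ++i)
    {
        DWORD bytes = 0;
        GetOverlappedResult(file, &ovs[i], &bytes, TRUE);
        printf("request %d completed: %lu bytes\n", i, bytes);
        CloseHandle(ovs[i].hEvent);
    }
    CloseHandle(file);
    return 0;
}
```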
 
Load times aren't always constrained by IO, though, so a game loading for more than 5 seconds doesn't mean that the HW decompressors aren't being used. It simply means that the bottleneck lies somewhere other than the movement of bits from storage to RAM. If you were already bottlenecked somewhere else, it doesn't matter that you can now move data 100x faster.

Fabian Giesen is a RAD Game Tools / Epic employee, and he said in 2021, about cross-gen games, that very few games use the hardware decompressor because studios buy the license for the Oodle software solution. This is the fact he gave to support the claim that many games don't use the hardware decompressor. And this is logical: cross-gen games are developed most of the time around the lowest common denominator, the PS4 or the Xbox One.

And you can load very fast, like R&C Rift Apart, and the bottleneck will be the CPU; same with Forspoken*. There is no game where the bottleneck is I/O, but with optimization you can load very fast before hitting the CPU bottleneck. And if devs use the hardware decompressor and all the systems around I/O, it means more CPU power for other tasks and better RAM usage. He confirms that the I/O complex SRAM is used as a buffer for compressed data in the PS5 and only decompressed data ends up in RAM; he can't give more details because he is under NDA.

* The devs said the bottleneck is the initialization of entities by the CPU in a zone, for both games. For the Matrix Awakens demo, the stutters come from the initialization of entities by the CPU.
 