Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

How hard is it to create a game-centric file system and API to be used exclusively on a secondary PARTITION of your drive, while Windows and other apps all sit in good old NTFS?

I think this is the tricky bit. What is a game-centric filesystem? Sony on PlayStation, and Microsoft on Xbox, heavily steer the game creation process with their devkits and SDKs. Games are developed exactly for the target platform, which includes how assets are stored. You need to put the cart before the horse (games). But then what happens when you have a game installed on a system where the user doesn't want to, or can't, use a game-centric filesystem? Does the game now run worse on NTFS?

Game centric and read only could be a missed opportunity.
I think so too. I think Microsoft will want a filesystem that provides benefits to all Windows applications, not just games. But a game-centric one would be quicker to deploy and could even serve as an easy opt-in for willing beta testers - without having to hose your whole drive.

Hell, they could even treat this GameFile.Sys like the PageFile.Sys: just allocate a huge amount of contiguous space on your existing drive after an optimization defrag is run, and then treat everything inside it with the new special APIs. Sort of like using VHDs for VMs.
These exist as constructs within the current filesystem, I think you'd want to have a space that exists wholly outside the existing software stack, otherwise you have the overhead of the original system plus the new one.
 
Maybe make it a requirement. Introduce a next-gen, SSD-only filesystem that can work externally from normal Windows. If that's what it takes to move the PC forwards, it needs to be done, because we can't be tied to legacy hardware forever. We've had fundamental hardware changes over the years, like CPU sockets and RAM sockets changing. Even Apple will ditch outmoded hardware after a while when it's holding them back. If it takes another major hardware shift to solve the 1970s-based file system, go for it. Include a FastIO port/bay/something or other in new systems, and once adapted, move over to it completely. Surely MS doesn't want to be using decades-old tech for the next 30 years? They must be thinking of ways to go forwards, somehow or other.
 
I think this is the tricky bit. What is a game-centric filesystem?

Switch "game-centric" for "Whatever-XBSX-is-doing-like" and there it goes. Would that be a missed oportunity of doing something more broad, robust and useful? Yes. But that would be the quickiest and easiest thing to implement to allow next-gen like performance on PC.

The robust solution can still be developed in parallel, aiming to arrive 5 years from now or even later...

Arguably, releasing the quick and dirty band-aid Direct Storage version might provide valuable feedback for the development of the robust new file system, if that actually is a thing.

But then what happens when you have a game installed on a system where the user doesn't want to, or can't, use a game-centric filesystem? Does the game now run worse on NTFS?

"System Requirements: An SSD with a Direct to Storage™ partition with 70 Gb free."

done.
 
Switch "game-centric" for "Whatever-XBSX-is-doing-like" and there it goes.
I don't think we know what XBSX is doing. We have better insight into PS5, which suggests more individually managed files - the polar opposite of assets being consolidated into large multi-gigabyte data packs.

But more individually managed files come with an overhead. We'll still have filesystem minimum block sizes (e.g. a 1 KB file on a 32 KB block takes up 32 KB of disk space) and the need to address and manage all of these files, so we could be looking at an order-of-magnitude increase in filesystem management. I wonder if PS5 still has traditional Linux-style file permissions?
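
To put a rough number on that block-size overhead, here's a quick back-of-the-envelope sketch. The file size, block size and file count are all made up, purely to illustrate how slack space scales once you have lots of small, individually managed assets:

```python
# Back-of-the-envelope slack-space estimate.  File size, block size and
# file count are made-up, illustrative numbers.

def allocated_size(file_size: int, block_size: int) -> int:
    """Round a file's size up to a whole number of filesystem blocks."""
    blocks = -(-file_size // block_size)   # ceiling division
    return blocks * block_size

file_size  = 1 * 1024        # a 1 KB asset
block_size = 32 * 1024       # a 32 KB cluster
num_files  = 1_000_000       # hypothetical count of individually managed assets

used      = num_files * file_size
allocated = num_files * allocated_size(file_size, block_size)
print(f"data: {used / 2**20:,.0f} MiB, "
      f"allocated: {allocated / 2**20:,.0f} MiB, "
      f"slack: {(allocated - used) / 2**20:,.0f} MiB")
# -> data: 977 MiB, allocated: 31,250 MiB, slack: 30,273 MiB
```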

It may be that Series X isn't doing anything radical on the filesystem front and much of its performance comes from the built-in decompression, only having to deal with one bus, plus a thinner software stack. These are fairly substantial savings.

Hopefully we'll find out more about both systems but I'm not optimistic.
 
If it takes another major hardware shift to solve the 1970s based file system, go for it. Include a FastIO port/bay/something or other in new systems, and once adapted, move over to it completely.
You've just described PCI, IDE, SCSI, SATA, PCIe, M.2, U.2, mSATA and SATA Express! :LOL: They all started out fast enough, then they weren't. The reason the standards keep changing is that you don't want the issue of legacy support slowing down a new standard. A new interface generally means a new motherboard. There is always your local bus, but then you're sticking one bus on top of another, which is the last thing you want.

The reason it's not a solved problem is because it's a fiendishly difficult problem to solve.
 
Game centric and read only could be a missed opportunity.
I think having a fast temporal cache

Who said anything about cache? I'm talking permanent file, permanently put in place and non-movable on the normal filesystem. Think VHD, where all games are installed inside of it if you don't want to deal with partitions.

As for being game-only, that's one way to avoid having to retest every single application that has ever existed in the WinOS ecosystem since the beginning of time. You make it something new and start off as being game- and application-centric.

Edit: To clarify what I said earlier: it removes the native filesystem overhead by requiring the new gamefile.sys to be contiguous and non-movable. You allocate the file on the existing partition starting at Sector N for Length Z. You let the filesystem do whatever it has to do to reserve that space, but all APIs for it deal with reading between Sector N and N+Z. Think of it as a Partition within a Partition.
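
A minimal sketch of that idea, just to make it concrete - all the names (the gamefile.sys container, gamefile_read, the asset table) are hypothetical, and real code would need the filesystem's cooperation to make the container genuinely contiguous and non-movable, which plain user-space code like this cannot enforce:

```python
# Minimal sketch of the "Partition within a Partition" idea: one big
# preallocated container file, with assets addressed by raw (offset, length)
# pairs inside it.  All names here are hypothetical; a real version would
# need the filesystem to guarantee the container is contiguous and
# non-movable, which this code cannot do.
import os

CONTAINER      = "gamefile.sys"        # hypothetical container, like pagefile.sys
CONTAINER_SIZE = 64 * 1024 * 1024      # 64 MiB just for the demo

# Reserve the space up front (the "Sector N for Length Z" step).
if not os.path.exists(CONTAINER):
    with open(CONTAINER, "wb") as f:
        f.truncate(CONTAINER_SIZE)

def gamefile_read(offset: int, length: int) -> bytes:
    """Read a blob by raw offset, skipping any per-asset file lookup."""
    with open(CONTAINER, "rb", buffering=0) as f:   # unbuffered read
        f.seek(offset)
        return f.read(length)

# A tiny asset table the game would ship with: name -> (offset, length).
asset_table = {"level1.mesh": (0, 4096), "level1.tex": (4096, 8192)}
blob = gamefile_read(*asset_table["level1.mesh"])
print(len(blob))   # 4096
```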
 
Who said anything about cache? I'm talking permanent file, permanently put in place and non-movable on the normal filesystem. Think VHD, where all games are installed inside of it if you don't want to deal with partitions.

As for being game-only, that's one way to avoid having to retest every single application that has ever existed in the WinOS ecosystem since the beginning of time. You make it something new and start off as being game- and application-centric.
The cache behaviour would be fine for other applications too, e.g. to cache some video footage, a huge open world, some volume simulation for offline rendering, etc.
It could also be fine for games. Install the game to a big HDD, and the game caches your current level or region of the world to the SSD. Also cache the entire game state so it loads up in a second. And most important: cache your vendor-specific BVH for RT :) It really has to be writable from the game.

I do not disagree with the general idea, but the issue I see is this: I do not want to spend money on an SSD I can only use for games. You'd have a hard time selling me that. And even then I'd want a cheap and small model, probably too small to install any real number of games on.
In that sense the cache idea might not be that bad, maybe.
 
What comes to my mind is requiring a dedicated DirectStorage drive. It would only be accessed via the DirectStorage APIs and therefore compatibility with existing applications becomes a non-issue.

Although given most systems only have one M.2 slot, that would restrict NVMe use to gaming only for most users. Everything else would have to operate from good old SATA or HDD. Unless you use a PCIe expansion card, of course.
 
What comes to my mind is requiring a dedicated DirectStorage drive. It would only be accessed via the DirectStorage APIs and therefore compatibility with existing applications becomes a non-issue.
This. A new API can also give rise to a new type of NVMe drive that is equipped with a dedicated compression/decompression chip if necessary.
 
...

It may be that Series X isn't doing anything radical on the filesystem front and much of its performance comes from the built-in decompression, only having to deal with one bus, plus a thinner software stack. These are fairly substantial savings.

Hopefully we'll find out more about both systems but I'm not optimistic.

For Series X they claim decompression will save 3 CPU cores and DirectStorage will save 2 CPU cores. They're claiming a pretty massive reduction in I/O overhead. Should be one tenth of one core for I/O and decompression.
 
DSoup and everyone,

I wanted to talk about a few things with PC architecture vs PS5.
It seems to me there are roughly 3 areas where improvement could come from:
1. Mechanical (less physical signal path)
2. Greater SSD controller capability vs a standard PC (more transactions per second)
3. Software layer (less overhead per IO)

Mechanical:
On most modern systems the integrated north bridge on the CPU has 4x PCIe 4.0 lanes which can provide a little less than 8 GB/s directly to the socket/die (half that for PCIe 3.0). These NVMe lanes are dedicated lanes, separate from the 16x PCIe lanes to the dGPU.
So I don't *think* you have to worry about anything for this scenario that is outside the socket/die/main memory/dGPU system (i.e. the south bridge and the shared lanes from the SB to the CPU).
This means you can get consistent access to the high-speed NVMe SSD without much overhead mechanically. The PS5 likely still needs a bus that connects the NVMe SSD to the socket/die; I doubt the SSD is on-chip with PS5 or we would see greater raw throughput (and cost).

Even going through main memory isn't a significant slowdown on modern systems from an access latency standpoint. You're still talking ~1000x less latency for main memory than for the SSD portion of this transaction: nanoseconds for main memory vs microseconds for SSD access.
So bandwidth and latency don't really seem to be a big concern from the PC side, unless I am missing something in my thinking through this (which I could be, for sure; things are never as simple as they seem =)

It looks like this to me at this point: SSD (10us) -> PCIe to CPU (300ns) -> Main Memory (write+read ~30ns) -> PCIe to dGPU (300ns) -> dGPU memory write (15ns) - Total: 10.35us of latency per transaction of a given size.
The above covers the access-latency portion of Little's law for throughput (throughput = parallelism x transaction size / access latency).
A console has an on-die GPU, so likely ~250-300ns less latency per transaction, which is a very small amount vs the PC but still helps drive SSD -> GPU throughput a little higher.
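
As a quick sanity check on those numbers and the Little's law point, here's a small calculation using the latencies above (summing them actually gives ~10.65us rather than 10.35us, which matches the correction posted a bit further down). The request size and queue depth are assumptions, picked only to show that latency isn't the limiter once enough requests are kept in flight:

```python
# Sanity check of the latency chain above, plus a rough Little's-law
# throughput estimate.  Latencies are the ones quoted in the post; the
# request size and queue depth are assumptions for illustration only.

latencies_ns = {
    "SSD access":              10_000,   # 10 us
    "PCIe SSD -> CPU":            300,
    "main memory write+read":      30,
    "PCIe CPU -> dGPU":           300,
    "dGPU memory write":           15,
}
total_ns = sum(latencies_ns.values())
print(f"end-to-end latency ~{total_ns / 1000:.2f} us")      # ~10.65 us

# Little's law: throughput = (requests in flight * request size) / latency.
request_size = 64 * 1024      # 64 KiB per request (assumed)
in_flight    = 32             # assumed queue depth
throughput   = in_flight * request_size / (total_ns * 1e-9)
print(f"~{throughput / 1e9:.0f} GB/s theoretical with {in_flight} in flight")
# -> ~197 GB/s: far beyond what the SSD or bus can deliver, so with enough
#    requests in flight the per-transaction latency is not the limiter.
```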

Mechanically, it would appear that consoles are not going to be all that different from a PC from a latency perspective, even with the extra hop on the PC to the dGPU across the much higher bandwidth GPU PCIe lanes. (Latency here will likely be dominated by the SSD retrieval times.)

SSD Controller capability:
PS5 seems to have a potential advantage in a few places:
1. SSD Controller request handling:
A more robust SSD controller that is able to access the storage chips in a more optimized fashion (possibly a more highly parallel fashion, from how some of the descriptions read).
This, in theory, could be applied to the PC as well if the SSD controllers added features. However, at this point we don't really know how much better/different the PS5 controller is than a standard high-end PCIe NVMe SSD controller today. This shouldn't be a problem to produce/enable on the PC architecture, as it's part of the NVMe SSD and its OS-level driver.
From reading about the PS5 controller I *think* that it improves the ability to have multiple outstanding accesses satisfied in parallel; again, I could be wrong here. This helps improve sustained throughput across various transaction sizes (less queue and more do!).
This is the parallelism portion of Little's law.

2. Possible ability to bypass main memory and the CPU with DMA-like functionality
Does the architecture allow for a DMA-type transfer directly from the SSD to the GPU, avoiding CPU/memory and decreasing latency per transaction?
I'm not sure this can happen today on a PC (I think it can), but it could easily happen with the next generation of PC hardware.
From a throughput standpoint I'm not sure this would make much difference, but it really depends on the software IO layer efficiency, so it might make a huge difference or nearly none at all =)
RDMA on Ethernet is great because it avoids a large amount of additional processing; I don't think that is anywhere near as large an issue with SSD access.

Software Layer:
The software IO layer is (as everyone stated) very much in need of a large optimization in Windows, which is hopefully what the DirectStorage enhancements will improve upon greatly, and it's where PS5/XSX have a potentially great advantage at the moment, as they can insert a custom IO layer which is likely thinner from a latency standpoint than Windows is today, and possibly more parallel.

I'm guessing even with PS5/XSX there is still a driver needed, which is likely more optimized than current storage drivers on Windows, but again I could be wrong; I am somewhat uncertain how this is handled on consoles today and whether those requirements change with PS5/XSX.

The filesystem could be something new and far more optimized than the current systems we have today, though I'm not sure how much more efficient these can really become; there is basic functionality needed to work out where the data you need is and how to request it.

The compression is likely pretty interesting from a current perspective, but its purpose seems to be allowing for higher overall throughput (more efficient use of the available storage bus bandwidth), so the PC can really brute-force this pretty easily as time goes by with the current 4x PCIe 4.0 lanes - but PCIe 3.0 certainly won't cut the mustard here without compression.
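
Rough numbers on that point - the link rates and the 2:1 ratio below are assumed round figures just to show the shape of the argument, not measurements:

```python
# Back-of-the-envelope: what compression buys in effective asset throughput
# versus just using a faster link.  Link rates and the 2:1 ratio are assumed
# round numbers, not measurements.

def effective_rate(raw_gbps: float, compression_ratio: float) -> float:
    """Effective asset throughput = raw drive/link rate * compression ratio."""
    return raw_gbps * compression_ratio

pcie3_x4 = 3.5   # ~GB/s usable from a PCIe 3.0 x4 NVMe drive (assumed)
pcie4_x4 = 7.0   # ~GB/s usable from a PCIe 4.0 x4 NVMe drive (assumed)
ratio    = 2.0   # assumed average compression ratio for game assets

print(f"PCIe 3.0 x4: {pcie3_x4} GB/s raw, {effective_rate(pcie3_x4, ratio)} GB/s with 2:1 compression")
print(f"PCIe 4.0 x4: {pcie4_x4} GB/s raw, {effective_rate(pcie4_x4, ratio)} GB/s with 2:1 compression")
# A PCIe 3.0 drive without compression falls well short of a PCIe 4.0 drive,
# but with ~2:1 compression it lands in the same ballpark.
```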

Now questions to the crowd:
From a mechanical perspective - what have I missed in my summary above, and what additional possible improvements would PS5/XSX have?
From a software perspective - what have I missed in my summary above, and what additional possible improvements would PS5/XSX have?

Thanks all for your time and thoughts!
 
"System Requirements: An SSD with a Direct to Storage™ partition with 70 Gb free."

done.

I think they could probably do both - implement DirectStorage for your standard Windows/NTFS partition, and also for a new filesystem (accessed with a modified IO stack) using a dedicated SSD partition.

The game wouldn't know / care which it was running on, but it would automagically perform better using the more optimal arrangement.
 
As it would appear I cannot edit my post above, my math was a bit off on the latency calc for the PC. It should have been 10.65us.
 
Mechanical:
On most modern systems the integrated north bridge on the CPU has 4x PCIe 4.0 lanes which can provide a little less than 8 GB/s directly to the socket/die (half that for PCIe 3.0). These NVMe lanes are dedicated lanes, separate from the 16x PCIe lanes to the dGPU.

Great post, I just have a couple of bits to add.

I think this is only the case for Zen 2 at the moment (not sure about Zen). Intel still accesses the NVMe drive through the chipset. That will change with Rocket Lake though (due this year), which should match Zen 2's 20 spare lanes of PCIe 4.0 direct to CPU.

This means you can get consistent access to the high-speed NVMe SSD without much overhead mechanically. The PS5 likely still needs a bus that connects the NVMe SSD to the socket/die; I doubt the SSD is on-chip with PS5 or we would see greater raw throughput (and cost).

Yep, Sony confirmed the SSD connects to the APU via PCIe 4.0 x4.

SSD Controller capability:
PS5 seems to have a potential advantage in a few places:
1. SSD Controller request handling:
A more robust SSD controller that is able to access the storage chips in a more optimized fashion (possibly a more highly parallel fashion, from how some of the descriptions read).
This, in theory, could be applied to the PC as well if the SSD controllers added features. However, at this point we don't really know how much better/different the PS5 controller is than a standard high-end PCIe NVMe SSD controller today. This shouldn't be a problem to produce/enable on the PC architecture, as it's part of the NVMe SSD and its OS-level driver.
From reading about the PS5 controller I *think* that it improves the ability to have multiple outstanding accesses satisfied in parallel; again, I could be wrong here. This helps improve sustained throughput across various transaction sizes (less queue and more do!)

PS5 uses a 12-channel interface vs 8 for the current and next-gen top-end controllers on the PC. Conversely, it uses slower memory to hit the 5.5 GB/s throughput. Additionally, they implement 6 priority levels for data requests vs NVMe's 3. No idea how much real-world difference that would make!
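
Dividing those quoted totals through gives a feel for what the wider interface buys - the per-channel figures below are just that division, not spec-sheet numbers:

```python
# Per-channel NAND bandwidth implied by the totals quoted above.  The drive
# totals are the thread's numbers; the per-channel figures are just the
# division, not spec-sheet values.

ps5_total_mbps, ps5_channels = 5500, 12   # PS5: 5.5 GB/s over 12 channels
pc_total_mbps,  pc_channels  = 7000, 8    # top-end PC PCIe 4.0 controller

print(f"PS5: ~{ps5_total_mbps / ps5_channels:.0f} MB/s per channel")   # ~458 MB/s
print(f"PC:  ~{pc_total_mbps / pc_channels:.0f} MB/s per channel")     # ~875 MB/s
```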

2. Possible ability to bypass main memory and the CPU with DMA-like functionality
Does the architecture allow for a DMA-type transfer directly from the SSD to the GPU, avoiding CPU/memory and decreasing latency per transaction?
I'm not sure this can happen today on a PC (I think it can), but it could easily happen with the next generation of PC hardware.

Nvidia certainly have a solution to this as per Xpea's post above, although it's not yet implemented on commercial gaming GPUs. I'm fairly sure AMD's HBCC is also capable of this but isn't currently implemented in drivers.
 
Great post, I just have a couple of bits to add.
I think this is only the case for Zen 2 at the moment (not sure about Zen). Intel still accesses the NVMe drive through the chipset. That will change with Rocket Lake though (due this year), which should match Zen 2's 20 spare lanes of PCIe 4.0 direct to CPU.
Yeah by modern I meant PC hardware that will be available in the PS5/XSX time frame, could have said that more clearly =)
 
For Series X they claim decompression will save 3 CPU cores and DirectStorage will save 2 CPU cores. They're claiming a pretty massive reduction in I/O overhead. Should be one tenth of one core for I/O and decompression.
I was late to this. But sort of a reminder that the move to Windows 10 away from 7 was a larger departure than most people believe. A lot of people ragged on MS for locking DX12 onto Windows 10. But when they released 12on7 and people tried to play Gears 5, it wasn't working nearly so great.
So heads up: there could be a lot of changes under the hood, and a lot of legacy I/O stuff can be changed with regard to W10. There is a lot of stuff happening with Windows at the base level. Windows 10 at launch is a very different Windows 10 from today's patched build. I'm not sure how much legacy knowledge will continue to apply come the newer releases that are supposed to be due out this year.
 
Interesting discussion regarding potential DirectStorage API improvements, though some points were already raised in the thread about the Sony PlayStation 5 filesystem patent - specifically, the effect of large allocation blocks on SSD performance.

the way the SSD is integrated into the system is beyond what Windows offers currently
That is not to say PC won’t catch up of course. Eventually they always do
It's still a standard NVMe SSD, with 2.4 GByte/s throughput - compared to ~7 GByte/s from top-end SSD controllers in 2020 Q3 (Phison PS5018-E18, Silicon Motion SM2264, Samsung 980 Pro).

Every disk I/O benchmark shows how large blocks maximize SSD throughput. Therefore the I/O subsystem needs to respect the native sizes of the flash memory write page (8-16 KB) and erase block (1-2 MB). You can do it either by increasing the LBA sector size from the default 512 bytes - with potentially detrimental effects on backward compatibility - or by following NVMe controller hints for optimal I/O block size in the StorNVMe miniport driver, which requires updates to the filesystem block allocation algorithm.
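
The block-size/queue-depth effect is easy to see even with a crude user-space test. The sketch below is only an approximation (the file path is a placeholder and the page cache is not bypassed, so it's nowhere near a proper benchmark), but it illustrates what those benchmarks measure:

```python
# Crude user-space illustration of "large blocks + deep queue": time reads
# of the same 1 GiB at a small block size / queue depth 1 versus a large
# block size / queue depth 16.  The file path is a placeholder and the page
# cache is not bypassed, so this is not a proper benchmark.
import time
from concurrent.futures import ThreadPoolExecutor

PATH = "some_large_file.bin"   # placeholder: any existing file of a few GB

def read_range(args):
    offset, length = args
    with open(PATH, "rb", buffering=0) as f:   # unbuffered, but not O_DIRECT
        f.seek(offset)
        return len(f.read(length))

def run(block_size: int, queue_depth: int, total_bytes: int) -> float:
    """Return achieved MiB/s reading total_bytes in block_size chunks."""
    work = [(off, block_size) for off in range(0, total_bytes, block_size)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=queue_depth) as pool:
        list(pool.map(read_range, work))
    return total_bytes / (time.perf_counter() - start) / 2**20

total = 1 * 2**30   # read 1 GiB
print(f"4 KiB blocks, QD1 : {run(4 * 2**10, 1, total):8.0f} MiB/s")
print(f"1 MiB blocks, QD16: {run(2**20, 16, total):8.0f} MiB/s")
```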

In the end, a mid-range PC from 2021 would have about the same throughput as the next-gen consoles (for an additional cost, though).

they could even treat this GameFile.Sys like the PageFile.Sys, just allocate a huge amount of contiguous space on your existing drive after an optimization defrag is run and then treat everything inside it with the new special APIs. Sort of like using VHDs for VMs.
Yes, it's possible to store it all in a large compressed file - and actually you won't even need to run Defrag, since the new 'CompactOS' NTFS compression performs contiguous allocation automatically, so there's no file fragmentation.

There could be additional compression algorithms, suited for specific types of game data, with offline tools selecting the best possible variation.


make it a requirement. Introduce a next-gen SSD only filesystem that can work externally from normal Windows. I

Flash-memory aware filesystems with write-once logic were researched and implemented three decades ago in the era of PCMCIA cards, and they simply did not live up to expectations. It turned out standard filesystems work better in 100% of the use cases when you implement the translation layer inside the SSD controller, and not on the OS driver level. This way LBA emulation and automatic background garbage collection can work around write amplification more efficiently by taking into account all performance details of onboard flash memory.
 
I can well believe that in creating a new API, they also create a new end-to-end filesystem driven by that API as needed when saying it was a ground-up approach to fast IO
a solid-state optimised filesystem running on its own partition, or part of a logical portion of an existing drive, would be much quicker to deploy.
I don't really think it's a new filesystem - this would break compatibility with a lot of existing tools, and would require either some 'smart' automated disk management tools, or the end-users themselves, to proactively manage a separate partition or a separate SSD just for games. Everything they announced could be done on top of existing file I/O and filesystems, maybe with some specific SSD settings and API hooks for hardware-accelerated decompression.

I think these are rather improvements to disk space allocation strategies, file compression algorithms, and read/write I/O performance with deep queues and optimal block sizes - something like a higher-level transaction layer for NTFS and the LZX 'CompactOS' compression. Combined with enough system memory, these could overcome some of the most obvious obstacles for Windows developers.

And if they really need a low-overhead filesystem, exFAT is already there and it supports a 'flash memory parameters' block (but it cannot be used for system partitions).


DirectStorage cannot undo all of that because fundamentally to Windows your PC is just a bunch of components that needs Windows, its filesystem and Windows kernel drivers to move small bits of data around in an incredibly administratively burdensome way.
It's only because the basic I/O block (the disk sector size) has not really changed since the 1983 IBM PC XT, while disk storage sizes (and game asset file sizes) increased by six orders of magnitude (10^6 = 1,000,000), from 10 MBytes to 10 TBytes. Make the OS allocate and process these basic blocks in their thousands, problem solved.
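
To put numbers on how lopsided that ratio has become, here's the request rate the OS would need to sustain a top-end drive at different block sizes - the 7 GB/s target is just the ballpark figure quoted earlier in the thread:

```python
# How many I/O requests per second the OS must issue to sustain a given
# throughput at different block sizes.  The 7 GB/s target is the ballpark
# top-end PCIe 4.0 figure quoted earlier in the thread.

target_bytes_per_s = 7 * 10**9

for block in (512, 4 * 1024, 64 * 1024, 2 * 1024 * 1024):
    iops = target_bytes_per_s / block
    print(f"{block:>9} B blocks -> {iops:>12,.0f} requests/s")
# 512 B blocks need ~13.7 million requests/s; 64 KB blocks need ~107,000.
```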


The only way Microsoft can fix this in software is to toss pretty much most of the Windows code in the bin, along with the filesystem, and start over
They've managed to implement LZX file compression in Windows 10 and Advanced Format 4Kn (4 KByte native sectors) in Windows 8 without throwing anything in the bin, and similarly 2 MB clusters in exFAT and NTFS simply required an updated release of Windows 10.

They just need to add support for 64 KB sectors (either native or emulated with deep-queue 512-byte I/O requests) and make this the default I/O granularity and disk allocation unit, and disk throughput will skyrocket (although for best efficiency, this would probably need an x86_64 CPU with native support for 64 KB virtual memory pages, which is not even announced yet).
 