Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

So yes, the drives were outperforming the CPU's ability to respond to CPU-driven I/O management. PS5 (and presumably XSX) sidestep this by having the I/O decoupled from the CPU, which is a double win: the CPU is no longer slowing down I/O, and the crazy-fast I/O no longer consumes significant CPU time. I guess this is why Mark Cerny said a few times that the design "saves X amount of Zen cores" for certain PS5 hardware functions. But CPU-driven I/O is how Windows does things, and the more you load on, the more load you're putting on the CPU and the less CPU time you have left.
I mean, it sounds like DirectStorage is directly going to address this issue. So we'll have to wait and see.
 
This became the foundation for the Xbox Velocity Architecture, which comprises our custom-designed NVME SSD, a custom dedicated hardware decompression block, our new DirectStorage API which provides developers with direct low-level access to the NVME controller
the bearded guy in Xbox videos.
https://www.windowscentral.com/xbox-series-x-what-do-game-devs-think
'Direct access to the NVMe controller' is sure an interesting point. On Windows, this could be implemented with a new NVMe storage port driver that's designed around the NVMe command interface and controller hints for optimal I/O block size - instead of the StorPort port driver / StorNVMe miniport driver model, which is based on a generalization of the SCSI command set.

I still think the DirectStorage API would be a user-mode layer designed to issue large file I/O requests with deeper queues, which should be far more efficient on NVMe storage. This would still be based on the Windows I/O Manager driver stack, the virtual Memory Manager and the file Cache Manager, as well as existing Installable File System drivers and filters.

This way they can tweak the I/O subsystem to reliably support large block sizes and use new or updated internal structures to reflect NVMe control flow, while also remaining compatible with the StorPort driver model for legacy SATA devices.

They could also intercept ReadFile/WriteFile requests from legacy applications and rearrange them into similar deep-queue, large-block transactions when the new storage drivers are installed.
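
Just to illustrate what I mean by deep-queue, large-block access with today's APIs, here's a rough sketch using plain Win32 unbuffered overlapped reads - the file path, block size and queue depth are made-up placeholders, and this is only my guess at the kind of access pattern such a layer would generate, not anything Microsoft has documented:

```c
/* Sketch: deep-queue, large-block unbuffered reads with Win32 overlapped I/O.
   Assumes a hypothetical large asset file; error handling is minimal. */
#include <windows.h>
#include <stdio.h>

#define QUEUE_DEPTH 8              /* requests kept in flight (assumed value) */
#define BLOCK_SIZE  (1024 * 1024)  /* 1 MiB per request (assumed value) */

int main(void)
{
    /* FILE_FLAG_NO_BUFFERING bypasses the Cache Manager; FILE_FLAG_OVERLAPPED
       lets us keep several requests queued at the drive at once. */
    HANDLE file = CreateFileW(L"D:\\assets\\level0.pak", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
    if (file == INVALID_HANDLE_VALUE) { printf("open failed: %lu\n", GetLastError()); return 1; }

    OVERLAPPED ov[QUEUE_DEPTH] = {0};
    void *buf[QUEUE_DEPTH];
    ULONGLONG offset = 0;

    /* Issue QUEUE_DEPTH reads before waiting for any of them. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        /* VirtualAlloc returns page-aligned memory, which satisfies the sector
           alignment required by FILE_FLAG_NO_BUFFERING. */
        buf[i] = VirtualAlloc(NULL, BLOCK_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        ov[i].hEvent     = CreateEventW(NULL, TRUE, FALSE, NULL);
        ov[i].Offset     = (DWORD)(offset & 0xFFFFFFFF);
        ov[i].OffsetHigh = (DWORD)(offset >> 32);
        if (!ReadFile(file, buf[i], BLOCK_SIZE, NULL, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING) {
            printf("read %d failed: %lu\n", i, GetLastError());
        }
        offset += BLOCK_SIZE;
    }

    /* Drain the queue; a real streaming loop would re-issue each slot here. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        DWORD bytes = 0;
        GetOverlappedResult(file, &ov[i], &bytes, TRUE);  /* wait for completion */
        printf("request %d: %lu bytes\n", i, bytes);
        CloseHandle(ov[i].hEvent);
        VirtualFree(buf[i], 0, MEM_RELEASE);
    }
    CloseHandle(file);
    return 0;
}
```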


I'm just going to ask the same question again: what is your explanation for why huge leaps in SSD NAND performance, controller improvements and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance?

It's because applications are not designed to efficiently utilize this enormous bandwidth. Did you really expect to get a different answer for the same question?


Decompression is one of those things that in reality can be stupidly parallel, but is implemented by people who don't know that, and then it becomes a single-threaded load that takes forever.
It's not just decompression or other processing overhead, it's also overall program flow and the data set.

Imagine you have a 1970s-era computer system with a tape archive application that reads 80-character lines from punch cards and writes them to text files on magnetic tape, and a TTY application that sends text files over a 300 bit/s modem line.

If you port these applications and OS interfaces to a modern computer with SATA disks and Gigabit Ethernet and run them on the same set of text data from the 1970s - do you really expect to max out network and disk bandwidth?

Processing would only take a fraction of a second on modern hardware, so your theoretical bandwidth is hundreds of megabytes per second. Unfortunately you only have several hundred kilobytes of text to transfer and then your program stops - so your real-life bandwidth is even less than a megabyte per second.

That's the difference between maxing at 3 GBytes/s in synthetic disk benchmarks and averaging to 30 MBytes/s in real-world applications.
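
That said, the quoted point about parallelism is real: if assets are compressed as independent chunks, each chunk can be decompressed on its own thread with no shared state. A rough sketch with zlib and Win32 threads - the chunk size, thread-per-chunk layout and synthetic test data are all my own assumptions, not any particular engine's format:

```c
/* Sketch: decompressing independently-compressed chunks in parallel.
   Assumes each chunk was produced by a separate zlib compress() call;
   chunk and thread counts are arbitrary. Build against zlib. */
#include <windows.h>
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_CHUNKS 4
#define CHUNK_RAW  (4 * 1024 * 1024)   /* 4 MiB of uncompressed data per chunk */

typedef struct {
    const Bytef *src;  uLong  src_len;   /* compressed input */
    Bytef       *dst;  uLongf dst_len;   /* decompressed output */
} Chunk;

static DWORD WINAPI InflateChunk(LPVOID param)
{
    Chunk *c = (Chunk *)param;
    /* One-shot zlib decompression; each chunk is a self-contained stream,
       so there is no shared state and no locking between threads. */
    if (uncompress(c->dst, &c->dst_len, c->src, c->src_len) != Z_OK)
        c->dst_len = 0;
    return 0;
}

int main(void)
{
    Chunk  chunks[NUM_CHUNKS];
    HANDLE threads[NUM_CHUNKS];

    /* Make some compressible test data and compress each chunk separately. */
    for (int i = 0; i < NUM_CHUNKS; i++) {
        Bytef *raw = malloc(CHUNK_RAW);
        memset(raw, 'A' + i, CHUNK_RAW);
        uLongf comp_len = compressBound(CHUNK_RAW);
        Bytef *comp = malloc(comp_len);
        compress(comp, &comp_len, raw, CHUNK_RAW);
        free(raw);

        chunks[i].src = comp;              chunks[i].src_len = comp_len;
        chunks[i].dst = malloc(CHUNK_RAW); chunks[i].dst_len = CHUNK_RAW;
    }

    /* Decompress all chunks concurrently - this is the "stupidly parallel" part. */
    for (int i = 0; i < NUM_CHUNKS; i++)
        threads[i] = CreateThread(NULL, 0, InflateChunk, &chunks[i], 0, NULL);
    WaitForMultipleObjects(NUM_CHUNKS, threads, TRUE, INFINITE);

    for (int i = 0; i < NUM_CHUNKS; i++) {
        printf("chunk %d: %lu bytes decompressed\n", i, (unsigned long)chunks[i].dst_len);
        CloseHandle(threads[i]);
    }
    return 0;
}
```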


if there is a possibility for some hw improvements to make decompression/I/O run without consuming CPU
It should be possible to plug hardware processing into a filesystem minifilter driver.
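
For what it's worth, the hook point would look something like the skeleton below: a minifilter that registers a post-read callback, which is where read buffers could be handed to a hardware decompression engine. This is just the standard Filter Manager boilerplate with a hypothetical stub, not anything Microsoft has announced:

```c
/* Sketch: filesystem minifilter skeleton (Filter Manager model).
   A real filter would hand IRP_MJ_READ buffers to a decompression engine in
   the post-read callback; here it is just a stub. Requires the WDK. */
#include <fltKernel.h>

static PFLT_FILTER gFilterHandle = NULL;

static FLT_POSTOP_CALLBACK_STATUS
PostReadCallback(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects,
                 PVOID CompletionContext, FLT_POSTOP_CALLBACK_FLAGS Flags)
{
    UNREFERENCED_PARAMETER(Data);
    UNREFERENCED_PARAMETER(FltObjects);
    UNREFERENCED_PARAMETER(CompletionContext);
    UNREFERENCED_PARAMETER(Flags);
    /* Hypothetical hook: if the read landed in a compressed region, queue the
       buffer described by Data->Iopb->Parameters.Read to a hardware
       decompression block and finish the request once the engine signals. */
    return FLT_POSTOP_FINISHED_PROCESSING;
}

static NTSTATUS
FilterUnload(FLT_FILTER_UNLOAD_FLAGS Flags)
{
    UNREFERENCED_PARAMETER(Flags);
    FltUnregisterFilter(gFilterHandle);
    return STATUS_SUCCESS;
}

static const FLT_OPERATION_REGISTRATION Callbacks[] = {
    { IRP_MJ_READ, 0, NULL, PostReadCallback },   /* post-operation only */
    { IRP_MJ_OPERATION_END }
};

static const FLT_REGISTRATION FilterRegistration = {
    sizeof(FLT_REGISTRATION),      /* Size */
    FLT_REGISTRATION_VERSION,      /* Version */
    0,                             /* Flags */
    NULL,                          /* ContextRegistration */
    Callbacks,                     /* OperationRegistration */
    FilterUnload,                  /* FilterUnloadCallback */
    /* remaining optional callbacks default to NULL */
};

NTSTATUS
DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    NTSTATUS status;
    UNREFERENCED_PARAMETER(RegistryPath);

    status = FltRegisterFilter(DriverObject, &FilterRegistration, &gFilterHandle);
    if (!NT_SUCCESS(status))
        return status;
    status = FltStartFiltering(gFilterHandle);
    if (!NT_SUCCESS(status))
        FltUnregisterFilter(gFilterHandle);
    return status;
}
```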
 
we have a 16-channel four-SSD PCIe 4.0 card that can sustain 29 Gb/sec on Linux but which couldn't sustain 5 Gb/sec on Windows. That's a bottleneck.
It's definitely a first world problem when your 29Gb/sec SSD only delivers 5Gb in one OS.

Premiere Pro is a good example of a very demanding non-linear editing application that works with multi-gigabyte 4K video files requiring fast disk I/O, and its workloads are well approximated by sequential disk I/O benchmarks with large file and block sizes.

But even if we assume that Windows I/O subsystem has six times as much overhead as Linux, and it's not some specific RAID driver issue with I/O Manager or SMB redirector - would it really help the majority of real-world applications, which average 30 Mbytes/s in random disk access patterns with small file and block sizes, if processing overhead is reduced six- or ten-fold?


The million dollar question is, can Microsoft remedy this without breaking everything
They would probably leave it 'as is' rather than break anything, if a break was the only option. Microsoft goes to great lengths to provide binary compatibility even for very old software that's no longer maintained, to the point of replacing standard APIs with custom code to work around application bugs.

The Windows file I/O subsystem scaled from 32-bit Windows NT 3.1 Advanced Server running on a quad-processor 200 MHz Pentium Pro with a dozen megabytes of memory and RAID volumes of a dozen hard disks and gigabytes of storage space, to 64-bit Windows Server 2016 Datacenter running on 100+ 3 GHz cores with terabytes of NUMA memory and fiber-optic links to clusters of hundreds of disks and petabytes of storage space. All while using essentially the exact same filesystem and I/O subsystem, and maintaining nearly 100% compatibility with existing applications.

During the exact same 25-year span, how many times have the Linux kernel and its I/O subsystems and filesystems been tweaked, substantially changed, or totally rewritten with breaking changes to the APIs and drivers?


How do you explain the disparity in I/O throughput on the same hardware running Windows then Linux? Did you watch the LTT video @Davros linked in this post, and LTT's apology to Tim Sweeney?
How can I? They show me some server hardware and guys staring at fancy monitors, but there are no details on software configuration, drivers, or workloads used, no performance analysis or synthetic benchmarks, just some embedded ads and ramblings. They could as well have made it a trash talk show episode with videographers debating Linux then engaging in staged fights.
Linux is cool if you want to tune the I/O scheduling subsystem, or just change it altogether (interrupt to realtime polling). Windows has a legacy of umm.. well, software legacy.
But CPU-driven I/O is how Windows does things. And the more you load on, the more load you're putting on the CPU and less CPU time you have left.
NVMe protocol only uses bus-mastering DMA and MSI interrupts.
the drives were outperforming the CPU's ability to respond to CPU-driven I/O management.
Hardware time-out is an exact opposite of what you have described here.
It's probably not a single issue but a combination of related issues that's hampering I/O
Or it could be just about anything one can imagine.
 
But even if we assume that Windows I/O subsystem has six times as much overhead as Linux, and it's not some specific RAID driver issue with I/O Manager or SMB redirector - would it really help the majority of real-world applications, which average 30 Mbytes/s in random disk access patterns with small file and block sizes, if processing overhead is reduced six- or ten-fold?
The issue seems to crop up too often to be some specific RAID driver issue. It's occurring on different PC hardware (a server, a generic Windows PC) with different storage solutions, but the one common factor where hardware performance crumbles is Windows.

As for your second question, the answer is that it very much depends on how data is stored. This discussion spawned from the console forum and discussion around PS5, which has a very different storage and I/O paradigm where everything, hardware and software stack, has been redesigned to minimize load times. That can work on a new closed platform like a console, where games can be bundled and distributed to take advantage of this, but making a radical change like this in Windows? I don't think it would be all good.

It's because applications are not designed to efficiently utilize this enormous bandwidth. Did you really expect to get a different answer for the same question?
And Linux applications are? What about just the disk I/O benchmark, which is the best-case scenario you'll ever get? You're just reading data and not doing anything with it, unlike a real-world scenario where an application is reading data for a purpose.
 
the answer is that it very much depends on how data is stored. This discussion spawned from the console forum and discussion around PS5, which has a very different storage and I/O paradigm where everything, hardware and software stack, has been redesigned to minimize load times.
making a radical change like this in Windows? I don't think it would be all good.
These same applications and data patterns (i.e. console games and art assets) will come over to the PC - the question is whether mid-range PCs will be able to run them with sufficient performance in 2021.

I still think it could be done with the existing Windows I/O subsystem. It wouldn't be as efficient, just like I said in an earlier thread, but at least Windows users are ready to trade off some performance and efficiency for broad extensibility and compatibility on both software and hardware levels - so in the end, installing more powerful hardware would improve performance of existing software, which is rarely possible for embedded devices.

What about just the disk I/O benchmark, which is the best-case scenario you'll ever get?
There is not a lot of benchmarking software for Linux - in the Phoronix Test Suite, which uses some common command-line utilities and scripts, sequential disk read and write tests typically report much lower peak bandwidth numbers for the same make/model. I'm not sure these results are comparable to Windows benchmarks in regard to block sizes and queue depths.


And linux applications are?
Depends on your definition of Linux application.

The Linux kernel is open source, so breaking changes are less of a concern since the source code is always available to make updates. This seems like a better support model for HPC/server applications tailored for the cloud, because you have qualified engineers, programmers and system administrators to manage your hardware and software stack. There are hundreds of vendors who submit changes to the Linux kernel to support their platforms, and it's their responsibility and in their best interest to update the source code for their proprietary applications and drivers and configure them to extract the best possible performance.

On the other hand, Windows has traditionally used binary executable files and proprietary closed source code, and valued binary compatibility above all else - which carries lower support costs and requires less qualified staff, which is best for their once-typical on-premises file server and database server setups.

For example, backward compatibility is the main reason why Windows sets a hard limit of 4 KBytes for virtual memory pages, even though the x86-64 architecture also supports 2 MByte and 1 GByte pages, and ARM64 additionally supports 16 KByte and 64 KByte pages. There is limited support for 2 MByte large pages in Windows - they can only be allocated physically, with no support for paging or kernel-mode use, for a lot of different reasons.
All the while, Linux does support large 2 MByte pages in kernel mode with Transparent Hugepages (THP), and there is even a background service that converts contiguous physical memory regions into huge pages for old applications. It's not enabled by default either, but at least some applications and drivers can use large pages if they can benefit from them.
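
For a concrete picture of how restricted the Windows side is, this is roughly what it takes to get a single 2 MByte large page out of user mode today - the only assumption being that 'Lock pages in memory' (SeLockMemoryPrivilege) has already been granted to the account via security policy:

```c
/* Sketch: allocating a 2 MByte large page on Windows via VirtualAlloc.
   Assumes the account already holds SeLockMemoryPrivilege; otherwise the
   privilege cannot be enabled and the allocation will fail. */
#include <windows.h>
#include <stdio.h>

static BOOL EnableLockMemoryPrivilege(void)
{
    HANDLE token;
    TOKEN_PRIVILEGES tp;
    BOOL ok;

    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return FALSE;

    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    LookupPrivilegeValueW(NULL, L"SeLockMemoryPrivilege", &tp.Privileges[0].Luid);

    /* Enable the privilege on the token; it must already be granted to the account. */
    AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
    ok = (GetLastError() == ERROR_SUCCESS);
    CloseHandle(token);
    return ok;
}

int main(void)
{
    SIZE_T large = GetLargePageMinimum();   /* 2 MiB on x86-64 */
    printf("large page size: %zu bytes\n", large);

    if (!EnableLockMemoryPrivilege()) {
        printf("SeLockMemoryPrivilege not available\n");
        return 1;
    }

    /* Large-page allocations are physical: non-pageable, and they fail outright
       if the OS cannot find enough contiguous physical memory. */
    void *p = VirtualAlloc(NULL, large,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (!p) {
        printf("VirtualAlloc(MEM_LARGE_PAGES) failed: %lu\n", GetLastError());
        return 1;
    }
    printf("got one large page at %p\n", p);
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```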

So in theory, free open-source 'desktop' Linux applications could also be written to extract as much performance as possible, just like the Linux Kernel.

Unfortunately there are almost no 'desktop' applications, besides some games ported for SteamOS and a few basic productivity apps, and most of them use multi-platform frameworks with similar less-than-optimal file access patterns, so there are virtually no 'desktop' applications that can saturate disk I/O to the max.
It's even worse for embedded/mobile applications, where the actual GUI and Linux device drivers use proprietary closed code that ships as binaries - once the vendor releases the initial firmware, it's almost always 'ship and forget' support mode, and even the most tedious bugs and vulnerabilities are rarely fixed. Embedded systems also use cheaper eMMC flash memory, which is several times slower than NVMe drives.
 
For the last time, no. A software API does not negate all of the driver and device I/O that Windows does. The fact you keep suggesting this strongly suggests you need to spend a good hour reading this Microsoft documentation.

They can drill down to hardware as far as they want, it's their code. If they want to bypass WDM and make a whole new I/O layer designed for peer-to-peer DMA, they can obviously do so. Sure, they'd need to reimplement things built on top of it, like filesystems, but for a single one that's not a big deal ... they don't need to support every flavour of legacy filesystem for DirectStorage; they can simply demand a partition be formatted specifically for it, filesystem and all.
 
They can drill down to hardware as far as they want, it's their code. If they want to bypass WDM and make a whole new I/O layer designed for peer-to-peer DMA, they can obviously do so. Sure, they'd need to reimplement things built on top of it, like filesystems, but for a single one that's not a big deal ... they don't need to support every flavour of legacy filesystem for DirectStorage; they can simply demand a partition be formatted specifically for it, filesystem and all.

Of course Microsoft can do this, yet there is no evidence in Microsoft's forward-looking technology roadmap that it is imminent. Nor is it something you would want them to rush. Testing this in Xbox Series X is far less risky, with fewer repercussions. And flipping to a new stack and filesystem optimised for SSDs is an additional stream of support; Microsoft are not about to abandon the hundreds of millions of Windows users who still use a spinning HDD.

I wouldn't categorise the effort of making a new filesystem as "no big deal". Ask Sun how long it took ZFS to get traction. Or ask Apple about APFS, which was in development for almost a decade before it was ready to put on people's actual devices. Changing the stack, I/O and filesystem all at once? Brave.
 
Microsoft are not about to abandon the hundreds of millions of Windows users who still use a spinning HDD.

Abandoning them would be not giving PC developers important tools when it's perfectly possible. Obviously they can't really afford to advantage AMD over NVIDIA, so on the pure graphics side they can't allow devs as low level access as on the Xbox ... but a 3GB/s NVMe drive isn't expensive, they should provide PC devs with an efficient peer to peer DMA system if they do it for console devs.

IMO they should start an "Optimized For X" certification system for complete PCs and GPUs/drives and get as much feature parity between certified PCs and the Xbox X as feasible. I doubt they'll do it and that's why we need a real desktop alternative to Windows for PCs.
 
Abandoning them would be not giving PC developers important tools when it's perfectly possible. Obviously they can't really afford to advantage AMD over NVIDIA, so on the pure graphics side they can't allow devs as low level access as on the Xbox ... but a 3GB/s NVMe drive isn't expensive, they should provide PC devs with an efficient peer to peer DMA system if they do it for console devs.

IMO they should start an "Optimized For X" certification system for complete PCs and GPUs/drives and get as much feature parity between certified PCs and the Xbox X as feasible. I doubt they'll do it and that's why we need a real desktop alternative to Windows for PCs.

Yeah, HDDs are basically gone now unless you're at the very low end of pricing or you're picking a machine that's going to be used for data storage or video editing. SSD prices keep dropping.
I don't think MS wants to set up an 'Optimized for X' program on the PC. It will just lead to confusion in the end, and a lot of companies will try and skirt around the specification as much as possible.
 
I didn't say specification, I said certification.

A certification mark is well protected legally and they can test the components against their certification database to display in Windows whether it's certified. Some third world nothing brand might pretend to be a proper brand, but that's not much of a problem.
 
I doubt they'll do it and that's why we need a real desktop alternative to Windows for PCs.
This alternative will give you what precisely?
It won't give you xsx or ps5 certification either.

MS is bringing pc and xbox into alignment in many ways, tools, api, dev environment etc.
They don't control the PC hardware, and many things that give PCs their benefits also have their downsides.

I've always expected them to provide DirectStorage on Windows, but there are parts they don't control, so the PC won't get the full Velocity Architecture.
All seems reasonable to me and going beyond what they have to do on pc.
 
No matter how though, whether through just hooking the SSD to the GPU directly or otherwise - to quote Jeff Goldblum/Malcolm, PC will ... find a way.
 
This alternative will give you what precisely
It gives indications of performance levels, and for a complete system it ensures there's nothing dragging it down - like, for instance, some shit NVMe drive which lacks, say, the random access or write throughput to get close to what console devs will expect for being able to use their X streaming code.

Parents who buy their kid a PC/laptop probably won't do the research to estimate its performance level. Lazy people don't want to be bothered by it. You could do it by year, Optimized for X 20/21/etc., with 20 being designed to run Xbox One X games without significant compromise, 21 Xbox X, 22 start improving on it. I'd prefer the Chromebook model and Microsoft creating PCs which Just Werk, but better than nothing.
They don't control the pc hardware and many things that give pc's there benefits also has there downsides.
They've had certification programs for the Designed for Windows logos; they could have a more specific one for gaming. Precisely because they are trying to bring PC gaming into alignment with Xbox, at least for their own store, it makes PC gaming more accessible.
I've always expected them to provide DirectStorage on Windows, but there are parts they don't control, so the PC won't get the full Velocity Architecture.
All seems reasonable to me and going beyond what they have to do on pc.
Peer to peer DMA has been done already on PCs, Linux obviously but still.
 
Yeah, HDDs are basically gone now unless you're at the very low end of pricing or you're picking a machine that's going to be used for data storage or video editing. SSD prices keep dropping.
I don't think MS wants to set up an 'Optimized for X' program on the PC. It will just lead to confusion in the end, and a lot of companies will try and skirt around the specification as much as possible.

I don't think HDDs are gone at all; most people with a "gaming" PC will have a small SSD to boot from (256/512GB) and a multi-TB HDD to store games on. Sure, people who own 2080 Tis will have all-SSD storage, but the majority, with cards like the 2070 and under, will be using a hard drive to load games. If you can't afford a 2080 but can afford a 2070, you are not going to spend $300 on a 2TB SSD. Plus they will likely have only one M.2 slot. Even a 1TB boot M.2 doesn't leave you much space for modern games. If you are running a QLC drive, it slows down as you fill it up too. 2TB SSDs are still expensive, especially the faster non-QLC models, and the trusted brands (Corsair, Crucial, Samsung) are even more.
 
I don't think HDDs are gone at all; most people with a "gaming" PC will have a small SSD to boot from (256/512GB) and a multi-TB HDD to store games on. Sure, people who own 2080 Tis will have all-SSD storage, but the majority, with cards like the 2070 and under, will be using a hard drive to load games.
You think most people with a RTX 2070 aren't even using SATA SSDs?

I don't personally know a single person that is using a HDD to play PC games. A 256GB SATA3 SSD goes for what, 40€? And people with towers can usually put at least 4 of those in any motherboard.
 
I don't personally know a single person that is using a HDD to play PC games.

Now you do :) I have a SATA 256GB SSD for Windows + 4TB HDD for games. It's really not a problem at the moment.

I have a hungry M.2 slot on my X570 board though, waiting for a 7GB/s DirectStorage-compatible drive in the 2TB range for when it becomes necessary. Even then though, the HDD will still be there hosting a good few dozen current-generation games and emulators.
 