Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

Discussion in 'PC Hardware, Software and Displays' started by DavidGraham, May 18, 2020.

  1. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    That was one version of one API. Tossing in new APIs with new versions of an OS is how things have worked for six decades.

    What is the heads up, specifically? I'm looking at roadmaps for the next four planned versions of Windows 10 and there are no changes to the fundamental Windows architecture. You can't just change this without breaking software and hardware, which is why these far-horizon tech roadmaps exist. Assuming you're looking at the same roadmap as me - the latest is dated May 3rd - can you tell me which change code you think suggests something?


    These are software changes. Software is easy to change. Inserting a hardware decompressor into the storage-I/O-RAM-CPU/GPU pipeline managed by different kernel drivers is a different kettle of fish.
     
  2. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,503
    Likes Received:
    16,535
    Location:
    The North
    I have possibly been given the impression that we're seeing a heavy move towards something like 10X, i.e. a great deal of things running in containers. But perhaps I misheard and that is just a 10X thing. The move to every app running in its own container may free MS to rewrite major underlying structures, is what I was thinking. I could be wrong though. Perhaps I was overstretching what Xbox does with its games.
     
    DSoup likes this.
  3. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    Arguably they don't need to change the architecture of the storage subsystem. The current multi-layered approach, with storage port/miniport driver - storage class driver - storage filter driver and installable filesystem driver - filesystem filter/minifilter driver, has worked quite well for NT-kernel-based operating systems.

    This allows 3rd party antivirus and compression/encryption software to intercept I/O blocks from every disk device without knowing intricate details about its connection bus (SCSI/SATA/SAS, PCIe/NVMe, USB/UAS), form factor (CD/DVD, HDD/SSD, add-on board, flash card) and filesystem (FAT/FAT32/exFAT, NTFS/ReFS, CDFS/ISO9660/UDF).


    What they do need is to change the size of data structures and optimise I/O patterns to operate on large blocks of data, such as 64 KB sectors and large clusters - and that's the difficult part, because large-scale modifications to StorPort / StorNVMe may break compatibility with existing miniport drivers and user applications. The easier way to achieve better performance would be a transaction optimisation helper layer that works on top of existing block device and filesystem drivers, similar to the 'CompactOS' NTFS compression feature.


    On the contrary, all it takes is to implement a custom filesystem filter driver and register an NTFS reparse point type tag. The OS-level filesystem driver will parse the tag/GUID and load the metadata from the reparse point attached to the compressed file, then pass the file data to the associated filter driver, which will further pass it to the user-mode helper that performs hardware or software compression/encryption on the I/O blocks.

    This is exactly how 'CompactOS' compression is implemented in Windows 10, and how a bunch of extended NTFS features were implemented in Windows 2000 and above.
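
    For a rough idea of what these tags look like from user mode, here is a minimal sketch (the target file is whatever you point it at, and hard-coding the WOF/CompactOS tag value in the comment is purely illustrative) that reads a file's reparse tag via FSCTL_GET_REPARSE_POINT:

    Code:
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    // MAXIMUM_REPARSE_DATA_BUFFER_SIZE lives in the DDK headers (ntifs.h),
    // so define it here for a plain user-mode build.
    #ifndef MAXIMUM_REPARSE_DATA_BUFFER_SIZE
    #define MAXIMUM_REPARSE_DATA_BUFFER_SIZE (16 * 1024)
    #endif

    int wmain(int argc, wchar_t **argv)
    {
        if (argc < 2) { wprintf(L"usage: reparsetag <file>\n"); return 1; }

        // FILE_FLAG_OPEN_REPARSE_POINT opens the reparse point itself rather than
        // whatever it resolves to; FILE_FLAG_BACKUP_SEMANTICS also allows directories.
        HANDLE h = CreateFileW(argv[1], FILE_READ_ATTRIBUTES,
                               FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT, NULL);
        if (h == INVALID_HANDLE_VALUE) { wprintf(L"open failed: %lu\n", GetLastError()); return 1; }

        BYTE buf[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
        DWORD bytes = 0;
        if (DeviceIoControl(h, FSCTL_GET_REPARSE_POINT, NULL, 0, buf, sizeof(buf), &bytes, NULL))
        {
            // The first DWORD of the returned buffer is the reparse tag that the
            // filesystem driver uses to pick the owning filter/minifilter.
            DWORD tag = ((REPARSE_GUID_DATA_BUFFER *)buf)->ReparseTag;
            wprintf(L"reparse tag: 0x%08lX\n", tag);  // CompactOS/WOF files should report 0x80000017
        }
        else
        {
            wprintf(L"no reparse point (error %lu)\n", GetLastError());
        }
        CloseHandle(h);
        return 0;
    }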
     
    #63 DmitryKo, May 30, 2020
    Last edited: Jan 14, 2021
  4. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    More sandboxing is happening but this will have no impact on anything happening in this discussion. Arguably, things will get worse for sandboxed applications as Windows has a growing list of low-level APIs that are disallowed.

    Adding more layers of abstraction will only exacerbate the existing problem. To approach the efficiencies of the PS5 you need fewer separate kernel drivers, or a trusted kernel driver model where different kernel drivers can share data in a trusted way without the massive overhead of IRPs for passing data - as your SSD speed ramps up, that will consume more and more overhead.

    This is the Windows 10 I/O model. The more data you transfer and the more kernel-level device drivers and lower-level drivers/libraries are involved (and there are a lot to get from the storage to either RAM or VRAM), the worse it is.

    [Diagram: Windows 10 I/O driver stack]


    And the flexibility afforded by this level of abstraction is also the prime culprit in why I/O is so slow - relatively. Let's be real here, nobody is saying SSDs on Windows are slow; it's just that when you remove a bunch of overhead from the design, e.g. on PS5, you can simply move more data. Raw SSD speed is not what is holding back SSDs on Windows; the Windows driver model and I/O model are holding them back - hence why large SSD RAID arrays always disappoint in raw performance. Windows would need to change substantially to engineer out these designs, and I can't see how you can change it that much without breaking a lot of device drivers.
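
    To make the IRP overhead concrete, here's a minimal WDM-style sketch (the routine name and device extension layout are made up for illustration) of what every intermediate layer in that stack does with a read request before the hardware ever sees it:

    Code:
    #include <ntddk.h>

    // Hypothetical filter sitting somewhere in the storage stack.
    typedef struct _FILTER_DEVICE_EXTENSION {
        PDEVICE_OBJECT LowerDeviceObject;   // next driver down the stack
    } FILTER_DEVICE_EXTENSION, *PFILTER_DEVICE_EXTENSION;

    NTSTATUS FilterDispatchRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PFILTER_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

        // Pass the request unchanged to the next lower driver. Every such hop costs
        // a stack-location manipulation, a function call, and (on completion) another
        // traversal back up the stack - multiplied by every layer between the
        // filesystem and StorNVMe.
        IoSkipCurrentIrpStackLocation(Irp);
        return IoCallDriver(ext->LowerDeviceObject, Irp);
    }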
     
    Ivan and Lightman like this.
  5. Jay

    Jay
    Veteran Regular

    Joined:
    Aug 3, 2013
    Messages:
    3,630
    Likes Received:
    2,955
    Does Windows in Azure use the same architecture?

    This could be another opportunity for the Xbox and Azure teams to work together on mutually beneficial things, which then partially gets backported to Windows 10, or serves as the basis of a new filesystem I/O stack.
     
    BRiT likes this.
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,503
    Likes Received:
    16,535
    Location:
    The North
    No, Azure setups are different. Everything in Azure is containerized, so I believe the underlying I/O is set up to serve that kind of environment. I'm not sure of the pros and cons necessarily, except that Windows continues to slowly move in that direction.
     
    pharma likes this.
  7. Jay

    Jay
    Veteran Regular

    Joined:
    Aug 3, 2013
    Messages:
    3,630
    Likes Received:
    2,955
    Not sure what that means for the filesystem I/O stack - got any more details/input?

    Windows is moving to a containerized environment, and Xbox already runs with a custom Hyper-V setup.
    I'm assuming that Azure will have a better I/O stack in relation to SSDs; remote desktop (Windows in the cloud) would benefit from a more modern I/O stack as well.

    On the consumer side Xbox is in need of it more than the PC, so I could see cross-department collaboration starting with Xbox with Azure input, with a view to moving it to the PC under the guise of DirectStorage.
     
  8. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    I'm not sure why CPU overhead has to go up - it should actually decrease with large blocks of data, because they require fewer I/O operations.

    If that were the case, we wouldn't see raw disk bandwidth max out with large 64 KB I/O blocks in synthetic tests while choking to a minimum with the default 512 B blocks, on the same existing I/O subsystem and storage drivers.
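
    Something like this minimal sketch of such a test (the file path is just a placeholder; with FILE_FLAG_NO_BUFFERING the block size must be a multiple of the volume's logical sector size):

    Code:
    #include <windows.h>
    #include <stdio.h>

    // Sequentially read a file with a given unbuffered block size and report MB/s.
    static double ReadWithBlockSize(const wchar_t *path, DWORD blockSize)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (h == INVALID_HANDLE_VALUE) return 0.0;

        // FILE_FLAG_NO_BUFFERING needs a sector-aligned buffer; VirtualAlloc returns
        // page-aligned memory, which satisfies 512/4096-byte sectors.
        void *buf = VirtualAlloc(NULL, blockSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);

        unsigned long long total = 0;
        DWORD got = 0;
        while (ReadFile(h, buf, blockSize, &got, NULL) && got > 0)
            total += got;

        QueryPerformanceCounter(&t1);
        double secs = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return (total / (1024.0 * 1024.0)) / secs;
    }

    int main(void)
    {
        // D:\testfile.bin stands in for any large file on the NVMe volume under test.
        wprintf(L"512 B blocks: %.1f MB/s\n", ReadWithBlockSize(L"D:\\testfile.bin", 512));
        wprintf(L"64 KB blocks: %.1f MB/s\n", ReadWithBlockSize(L"D:\\testfile.bin", 65536));
        return 0;
    }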


    These are not layers of abstraction, but separate OS modules each doing their part of the job. This was a deliberate design decision to componentize previously monolithic parts into several modules. Even if you consolidate everything back into fewer modules, there are still hardware protocols and process multithreading and isolation requirements which have to be followed.

    The approach with kernel-mode drivers containing device-class specific functions for I/O and resource management, miniport drivers providing device-specific functions, and user-mode drivers implementing actual processing and user-mode interaction was implemented to reduce unrecoverable blue/green "screen of death" kernel stops, since generic port/class drivers are designed and tested by Microsoft, and hardware/software vendors only need to develop less complex miniport/filter drivers.
     
    #68 DmitryKo, May 31, 2020
    Last edited: May 31, 2020
    PSman1700, BRiT and iroboto like this.
  9. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,503
    Likes Received:
    16,535
    Location:
    The North
    No, unfortunately this is out of my area of knowledge. I don't know how I/O is handled in the cloud.
     
    Jay likes this.
  10. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    16,866
    Likes Received:
    4,190
     
    xpea and Lightman like this.
  11. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    Because I/O is CPU-driven. :???: The more I/O, the more CPU work is required. Again, I can only keep linking Microsoft's dev pages on the Windows I/O model.

    Because 64 KB is the maximum I/O request size for the Windows filesystem. :???:

    Microsoft's number one goal is maintainability of software code, and the fundamental design philosophy behind this is dependency inversion. Reusing software libraries and frameworks, and building software on abstracted interfaces (application/OS API, software/hardware API), is how Microsoft does everything; it's the only way Microsoft designs software.

    I think anybody producing graphics drivers, RAID drivers or any kind of I/O device for which an existing class does not exist would strongly disagree. :yep2:
     
  12. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    Every OS abstracts its programming interfaces and data structures in multiple layers.

    The point is, if the entire block I/O layer uses the same abstractions - i.e. data and control structures - in each of the different components that comprise the device stack, they are, well, on the same level of abstraction. And even if you move all these components into a single OS module, you cannot eliminate the work items in that flow graph - these steps still need to be performed somewhere.

    Filesystem I/O is on a higher level - so if you come up with another 'simpler' filesystem, you still have not eliminated the filesystem abstraction.


    I/O buffers are only limited by available memory pools, there is no hard limit of 64 KBytes.

    This was the maximum NTFS cluster size before Windows 10 version 1709, but currently NTFS supports up to 2 MB, and exFAT supports up to 32 MB.

    There is also a requirement to align non-buffered file I/O requests to multiples of the formatted sector size, i.e. 512, 2048, or 4096 bytes depending on the disk media.
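
    Those alignment requirements can be queried from user mode; a small sketch (\\.\C: is just an example volume) using IOCTL_STORAGE_QUERY_PROPERTY:

    Code:
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        // Open the volume itself (no data access rights needed for a property query).
        HANDLE h = CreateFileW(L"\\\\.\\C:", 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        STORAGE_PROPERTY_QUERY q = { StorageAccessAlignmentProperty, PropertyStandardQuery };
        STORAGE_ACCESS_ALIGNMENT_DESCRIPTOR d = { 0 };
        DWORD bytes = 0;

        if (DeviceIoControl(h, IOCTL_STORAGE_QUERY_PROPERTY, &q, sizeof(q),
                            &d, sizeof(d), &bytes, NULL))
        {
            // Unbuffered I/O must be a multiple of the logical sector size; matching the
            // physical sector size avoids read-modify-write on 512e media.
            printf("logical sector:  %lu bytes\n", d.BytesPerLogicalSector);
            printf("physical sector: %lu bytes\n", d.BytesPerPhysicalSector);
        }
        CloseHandle(h);
        return 0;
    }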


    We're discussing filesystems and low-level I/O in WDM and WDF (KMDF/UMDF). This is the standard driver model for SATA, USB 3.0, and PCIe storage devices since Server 2003, and both StorAHCI and StorNVMe, as well as Intel RST/RSTe, are implemented as StorPort miniport drivers.

    There are other types of device drivers, such as display, network, printer, scanner, input etc. which do not need to follow the WDM/WDF data model for I/O requests - though some are similarly componentized into port/class/miniport/filter modules.
     
    #72 DmitryKo, Jun 1, 2020
    Last edited: Jun 1, 2020
  13. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    Not cluster size, I/O request size. In SDK terms, MM_MAXIMUM_DISK_IO_SIZE. You're confusing filesystem and kernel I/O. One exists within the other.
     
  14. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    There is a legacy 64 KByte I/O limit from the pre-Windows Vista era still present in wdm.h, but it's not used by the kernel-mode Memory Manager in either Windows 10 or Server 2016.
    Also note the system PAGE_SIZE granularity of kernel pool allocations and DMA transfers.

    Code:
    // Define the old maximum disk transfer size to be used by MM and Cache
    // Manager.  Current transfer sizes can typically be much larger.
    //
    #define MM_MAXIMUM_DISK_IO_SIZE          (0x10000)
    
     
    #74 DmitryKo, Jun 2, 2020
    Last edited: Jun 2, 2020
  15. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    I'm lost by this post. We're talking about very different things. You're posting about file caching and DMA requests :???:
    I'm talking about Windows 10's CPU-driven I/O process.
     
    #75 DSoup, Jun 2, 2020
    Last edited: Jun 2, 2020
  16. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    You posted a kernel Memory Manager constant from an NT-era DDK to prove that I/O request packets (IRP) are limited to 64 Kbyte buffers for disk I/O operations in Windows 10.


    In fact I/O buffers, whether for DMA or double-buffered memory-mapped I/O, are physically allocated by the driver using ExAllocatePoolWithTag, MmAllocateContiguousMemorySpecifyCache, MmAllocatePagesForMDL, or IoAllocateMdl, MmProbeAndLockPages, MmMapLockedPagesSpecifyCache, etc., and mapped into virtual memory with memory descriptor list (MDL) structures. So in effect they are only limited by the kernel non-paged memory pool.
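
    For illustration, this is roughly the lock-and-map MDL pattern I mean (the routine name is made up; a real storage stack does this as part of IRP processing, and StorPort largely handles it for miniports):

    Code:
    #include <ntddk.h>

    NTSTATUS MapUserBufferForDma(PVOID UserBuffer, ULONG Length, PMDL *OutMdl, PVOID *OutSystemVa)
    {
        PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
        if (mdl == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        __try {
            // Pin the user pages so they cannot be paged out during the transfer;
            // IoWriteAccess because the device will write into the buffer (a disk read).
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return GetExceptionCode();
        }

        // Map the locked pages into system address space for the driver/DMA setup.
        PVOID sysVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
        if (sysVa == NULL) {
            MmUnlockPages(mdl);
            IoFreeMdl(mdl);
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        *OutMdl = mdl;
        *OutSystemVa = sysVa;
        return STATUS_SUCCESS;
    }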

    For the 32-bit platform, non-paged pool is limited to 256 MBytes (~65532 pages) in Windows 2000/XP and Server 2003, 2 GBytes in Vista/2008, 4 GBytes in W7/2008R2 and W8.1/2012R2, and 3 GBytes in W10/Server 2016.
    For the 64-bit platform, it's limited to 128 GBytes (or to the entire RAM if smaller) in XP/Server 2003, W8.1/2012R2 and W10/Server 2016, 40% of RAM in Vista, and 75% of RAM in Server 2008 and W7/2008R2.
    https://docs.microsoft.com/en-us/windows/win32/memory/memory-limits-for-windows-releases

    Not so much. The file cache has been a key part of the disk I/O driver stack since the original NT 3.x (and arguably OS/2 1.1 and DOS 4.x SmartDrive); it works by memory-mapping files into virtual address space, then letting the memory manager load missing physical pages from disk.
    Windows Internals, Part 2 describes kernel drivers; it should be updated with Windows 10 details in the upcoming 7th edition.

    You think StorNVMe uses memory-mapped I/O?
     
    #76 DmitryKo, Jun 2, 2020
    Last edited: Jun 4, 2020
    PSman1700 likes this.
  17. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    I did, and that was a mistake on my part. But your posts are focussed on how hardware accesses memory and how the higher-level filesystem works in some I/O devices. Underpinning all of this is the Windows I/O driver model. There are many kernel-level drivers involved in reading data off storage, decompressing it and getting it to where it needs to be - main RAM or VRAM - and they communicate using the kernel-level I/O model, because drivers cannot directly communicate with each other, only through kernel I/O messaging.

    Again, I'm confused by this post. What has memory-mapped I/O got to do with this? Maybe it would help if you explain how you think data is read off an SSD and gets to RAM on the video card. Because I think you're overlooking a bunch of critical stages which none of your posts have addressed, but which are the only thing I am focussed on. :yes:
     
  18. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    I've described all of them in the posts above. The principal components involved in disk I/O for NVMe storage are:

    I) I/O Manager
    1) filesystem driver stack, made of installable filesystem driver (FAT32/exFAT/NTFS/ReFS) and filesystem filter/minifilter driver (encryption, compression, virus scanning, symbolic links, mount points, deduplication, etc.)
    2) storage driver stack, made of storage port driver - storage miniport driver - storage class driver - storage filter (StorPort driver and StorNVMe miniport)​
    II) Cache Manager
    III) virtual Memory Manager

    They will also interact with other kernel components like object manager, security reference monitor, power manager, Plug and Play etc.

    Specifically encryption/compression works at the file level - if NTFS reparse point data is encountered for the file, the driver stack calls a specified filesystem filter/minifilter driver, which can process the metadata and file data, and/or map file buffers into user address space for system services or applications to process.


    Unfortunately you cannot directly load data from disk into local video memory (unless it's an integrated APU with no local video memory). The display driver port/miniport (WDDM) model is not plugged into the filesystem driver stack and Cache Manager. This would probably require another major revision of the WDDM driver model, and hardware changes to the memory management unit (MMU) in both GPU and CPU to accommodate cache-coherent memory I/O protocols.

    Direct transfer between NVMe controller and GPU is not possible either, since PCIe peer-to-peer DMA is optional for the root complex and desktop chipsets never implement it.

    If you want low-level details and/or historical data, I refer you to Microsoft Docs, SysInternals / Windows Internals, and community.osr.com / osronline.com (use Google "site:" search).
     
    #78 DmitryKo, Jun 4, 2020
    Last edited: Jun 6, 2020
    Malo, PSman1700, BRiT and 3 others like this.
  19. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    Heck that's easy.


    I. Load from disk

    File I/O requests typically use CreateFile / ReadFile / ReadFileScatter and these must provide a pre-allocated user-mode buffer (or a scatter page list to multiple buffers).

    The I/O Manager checks whether the file cache contains the required data - if yes, the filesystem driver's Fast I/O path calls into the Cache Manager (and Memory Manager).
    Otherwise the I/O Manager locks the physical pages in the user buffer and maps them into kernel address space using memory descriptor list (MDL) structures, then creates I/O request packet IRP_MJ_READ.

    The IRP passes through the filesystem stack down to the StorPort/StorNVMe driver pair, which set up the NVMe controller to perform Direct I/O DMA to the user-mode buffer mapped by the MDLs.

    When complete, the driver reports success (or failure) to the I/O Manager, which unmaps the user-mode buffer from kernel address space, then returns the number of read bytes to the user-mode process.
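
    A minimal user-mode sketch of that path (the file path and transfer size are placeholders, error handling trimmed), using ReadFileScatter with unbuffered overlapped I/O:

    Code:
    #include <windows.h>
    #include <stdio.h>

    #define PAGES 16   // read 16 pages in one scattered request

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const DWORD page = si.dwPageSize;          // typically 4096

        // ReadFileScatter requires FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED.
        HANDLE h = CreateFileW(L"D:\\testfile.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        // One contiguous, page-aligned allocation here, but each segment could point
        // to a completely different page - that's the "scatter" part.
        BYTE *buf = (BYTE *)VirtualAlloc(NULL, PAGES * page,
                                         MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        FILE_SEGMENT_ELEMENT seg[PAGES + 1] = { 0 };   // NULL-terminated segment list
        for (DWORD i = 0; i < PAGES; i++)
            seg[i].Buffer = PtrToPtr64(buf + i * page);

        OVERLAPPED ov = { 0 };                         // read from file offset 0
        if (!ReadFileScatter(h, seg, PAGES * page, NULL, &ov) &&
            GetLastError() == ERROR_IO_PENDING)
        {
            DWORD got = 0;
            GetOverlappedResult(h, &ov, &got, TRUE);   // block until the transfer completes
            printf("read %lu bytes\n", got);
        }

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }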


    II. Move/map into video memory

    Once your data is loaded into system memory, you call the graphics API runtime to submit your resources to the video memory.

    The runtime uses WDDM kernel-mode driver (DXGK) callbacks to manage local video memory; I/O requests are served by the video port driver using DMA paging buffers and MDLs.

    WDDM 1.x uses linear video memory model where DXGK video memory manager maps a part of GPU video memory to an aperture 'segment' in CPU memory space. The Direct3D 9/10/11 runtimes automatically manage local video or shared (system) memory pools to allocate created resources.

    In Direct3D 12 and WDDM 2.x DDIs, there is no automatic resource management. The video memory manager allocates GPU virtual address space for each user process. The resource binding model requires the programmer to organize GPU resources into descriptor heaps/tables; the runtime processes the descriptors to assign them a virtual video memory address. The application then has to allocate physical memory for the resources from available shared (system) memory or local video memory.

    For discrete GPUs (GpuMmu model), the driver takes an abstracted page table format from the video memory manager and maps internal GPU page tables to point into physical video or physical system memory. The video memory manager pages physical memory from system memory to local video memory by requesting DMA transfers. Additionally the system memory aperture 'segment' contains the entire local video memory for the CPU to access.
    This is the memory model used for AMD GCN and NVidia Kepler ('Unified Memory'), or later.

    For integrated graphics (IoMmu), you actually don't have local video memory, only system memory; GPU MMU uses the same virtual address space as the CPU, and the Memory Manager handles page faults as usual.
    This is the memory model for Intel UHD.


    'CPU I/O', which I take to be your term for either programmed or memory-mapped (port-mapped) I/O, is not used by the StorAHCI or StorNVMe miniport drivers; StorPort only supports Direct I/O DMA mode - specifically bus-master DMA and interrupt signalling. The reason is, UDMA HDDs and PCI bus-mastering controllers like Intel PIIX for the i430/440 series and ICH for the i810/820/840/865 series were commonplace by 2003, when development of Longhorn (Windows Vista) started.
     
    #79 DmitryKo, Jun 4, 2020
    Last edited: Jan 14, 2021
    DSoup, pharma, Malo and 4 others like this.
  20. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,694
    Likes Received:
    3,182
    Location:
    Guess...
    That's interesting, I wasn't aware of this. Do you know if this applies to all desktop chipsets, including the latest AMD and Intel platforms? This would make bypassing system memory for transfers impossible for desktop chipsets then, I suppose?

    I assume then that the EPYC chipset that the DGX works on in the GPU DirectStorage prototypes would have PCIe peer-to-peer DMA enabled, unlike its desktop counterparts.

    EDIT: Re-reading this, it does suggest that Zen platforms do indeed support P2P DMA between any two devices, even on different root ports.
     
    #80 pjbliverpool, Jun 4, 2020
    Last edited: Jun 4, 2020
    Malo likes this.