Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

Discussion in 'PC Hardware, Software and Displays' started by DavidGraham, May 18, 2020.

  1. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    That was one version of one API. Tossing in new APIs with new versions of an OS is how things have worked for six decades.

    What is the heads up, specifically? I'm looking at roadmaps for the next four planned versions of Windows 10 and there are no changes to the fundamental Windows architecture. You can't just change this without breaking software and hardware, which is why these far-horizon tech roadmaps exist. Assuming you're looking at the same roadmap as me - the latest is dated May 3rd - can you tell me which change code you think suggests something?


    These are software changes. Software is easy to change. Inserting a hardware decompressor into the storage-I/O-RAM-CPU/GPU pipeline managed by different kernel drivers is a different kettle of fish.
     
  2. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,503
    Likes Received:
    16,535
    Location:
    The North
    I have possibly been given the impression that we're seeing a heavy move towards something like 10X, i.e. a great deal of things running in containers. But perhaps I misheard and that is just a 10X thing. The move to every app running in its own container may free MS to rewrite major underlying structures, is what I was thinking. I could be wrong though. Perhaps I was overstretching what Xbox does with its games.
     
    DSoup likes this.
  3. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    Arguably they don't need to change the architecture of the storage subsystem. The current multi-layered approach, with storage port/miniport driver - storage class driver - storage filter driver and installable filesystem driver - filesystem filter/minifilter driver, has worked quite well for NT-kernel-based operating systems.

    This allows 3rd party antivirus and compression/encryption software to intercept I/O blocks from every disk device without knowing intricate details about its connection bus (SCSI/SATA/SAS, PCIe/NVMe, USB/UAS), form factor (CD/DVD, HDD/SSD, add-on board, flash card) and filesystem (FAT/FAT32/exFAT, NTFS/ReFS, CDFS/ISO9660/UDF).


    What they do need is to change the size of data structures and optimise I/O patterns to operate on large blocks of data, such as 64 KB sectors and large clusters - and that's the difficult part, because large-scale modifications to StorPort / StorNVMe may break compatibility with existing miniport drivers and user applications. The easier way to achieve better performance would be a transaction optimisation helper layer that works on top of existing block device and filesystem drivers, similar to the 'CompactOS' NTFS compression feature.


    On the contrary, all it takes is to implement a custom filesystem filter driver and register an NTFS reparse point type tag. The OS-level filesystem driver will parse the tag/GUID and load the metadata from the reparse point attached to the compressed file, then pass the file data to the associated filter driver, which will further pass it to the user-mode helper that performs hardware or software compression/encryption on the I/O blocks.

    This is exactly how 'CompactOS' compression is implemented in Windows 10, and how a bunch of extended NTFS features were implemented in Windows 2000 and above.
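
    For a rough idea of what these tags look like from user mode, here is a minimal sketch (the target file is whatever you point it at, and hard-coding the WOF/CompactOS tag value in the comment is purely illustrative) that reads a file's reparse tag via FSCTL_GET_REPARSE_POINT:

    Code:
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    // MAXIMUM_REPARSE_DATA_BUFFER_SIZE lives in the DDK headers (ntifs.h),
    // so define it here for a plain user-mode build.
    #ifndef MAXIMUM_REPARSE_DATA_BUFFER_SIZE
    #define MAXIMUM_REPARSE_DATA_BUFFER_SIZE (16 * 1024)
    #endif

    int wmain(int argc, wchar_t **argv)
    {
        if (argc < 2) { wprintf(L"usage: reparsetag <file>\n"); return 1; }

        // FILE_FLAG_OPEN_REPARSE_POINT opens the reparse point itself rather than
        // whatever it resolves to; FILE_FLAG_BACKUP_SEMANTICS also allows directories.
        HANDLE h = CreateFileW(argv[1], FILE_READ_ATTRIBUTES,
                               FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT, NULL);
        if (h == INVALID_HANDLE_VALUE) { wprintf(L"open failed: %lu\n", GetLastError()); return 1; }

        BYTE buf[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
        DWORD bytes = 0;
        if (DeviceIoControl(h, FSCTL_GET_REPARSE_POINT, NULL, 0, buf, sizeof(buf), &bytes, NULL))
        {
            // The first DWORD of the returned buffer is the reparse tag that the
            // filesystem driver uses to pick the owning filter/minifilter.
            DWORD tag = ((REPARSE_GUID_DATA_BUFFER *)buf)->ReparseTag;
            wprintf(L"reparse tag: 0x%08lX\n", tag);  // CompactOS/WOF files should report 0x80000017
        }
        else
        {
            wprintf(L"no reparse point (error %lu)\n", GetLastError());
        }
        CloseHandle(h);
        return 0;
    }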
     
    #63 DmitryKo, May 30, 2020
    Last edited: Jan 14, 2021
  4. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    More sandboxing is happening but this will have no impact on anything happening in this discussion. Arguably, things will get worse for sandboxed applications as Windows has a growing list of low-level APIs that are disallowed.

    Adding more layers of abstraction will only exacerbate the existing problem. To approach the efficiencies of the PS5 you need fewer separate kernel drivers, or a trusted kernel driver model where different kernel drivers can share data in a trusted way without the massive overhead of IRPs for passing data - as your SSD speed ramps up, that will consume more and more overhead.

    This is the Windows 10 I/O model. The more data you transfer and the more kernel-level device drivers and lower-level drivers/libraries are involved (and there are a lot to get from the storage to either RAM or VRAM), the worse it is.

    [Diagram: Windows 10 I/O driver stack]


    And the flexibility afforded by this level of abstraction is also the prime culprit in why I/O is so slow - relatively. Let's be real here, nobody is saying SSDs on Windows are slow; it's just that when you remove a bunch of overhead from the design, e.g. on PS5, you can simply move more data. Raw SSD speed is not what is holding back SSDs on Windows; the Windows driver model and I/O model are holding them back - hence why large SSD RAID arrays always disappoint in raw performance. Windows would need to change substantially to engineer out these designs, and I can't see how you can change it that much without breaking a lot of device drivers.
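
    To make the IRP overhead concrete, here's a minimal WDM-style sketch (the routine name and device extension layout are made up for illustration) of what every intermediate layer in that stack does with a read request before the hardware ever sees it:

    Code:
    #include <ntddk.h>

    // Hypothetical filter sitting somewhere in the storage stack.
    typedef struct _FILTER_DEVICE_EXTENSION {
        PDEVICE_OBJECT LowerDeviceObject;   // next driver down the stack
    } FILTER_DEVICE_EXTENSION, *PFILTER_DEVICE_EXTENSION;

    NTSTATUS FilterDispatchRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PFILTER_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

        // Pass the request unchanged to the next lower driver. Every such hop costs
        // a stack-location manipulation, a function call, and (on completion) another
        // traversal back up the stack - multiplied by every layer between the
        // filesystem and StorNVMe.
        IoSkipCurrentIrpStackLocation(Irp);
        return IoCallDriver(ext->LowerDeviceObject, Irp);
    }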
     
    Ivan and Lightman like this.
  5. Jay

    Jay
    Veteran Regular

    Joined:
    Aug 3, 2013
    Messages:
    3,630
    Likes Received:
    2,955
    Does Windows in Azure use the same architecture?

    This could be another opportunity for the Xbox and Azure teams to work together on mutually beneficial things, which then partially gets backported to Windows 10, or serves as the basis of a new filesystem I/O stack.
     
    BRiT likes this.
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,503
    Likes Received:
    16,535
    Location:
    The North
    No, Azure setups are different. Everything in Azure is containerized, so I believe the underlying I/O is set up to serve that kind of environment. I'm not sure of the pros and cons necessarily, except that Windows continues to slowly move in that direction.
     
    pharma likes this.
  7. Jay

    Jay
    Veteran Regular

    Joined:
    Aug 3, 2013
    Messages:
    3,630
    Likes Received:
    2,955
    Not sure what that means for the filesystem I/O stack - got any more details/input?

    Windows is moving to a containerized environment, and Xbox already runs with a custom Hyper-V setup.
    I'm assuming that Azure will have a better I/O stack in relation to SSDs; remote desktop (Windows in the cloud) would benefit from a more modern I/O stack as well.

    On the consumer side Xbox is in need of it more than the PC, so I could see cross-department collaboration starting with Xbox with Azure input, with a view to moving it to the PC under the guise of DirectStorage.
     
  8. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    I'm not sure why CPU overhead has to go up - it should actually decrease with large blocks of data, because they require fewer I/O operations.

    If that were the case, we wouldn't see raw disk bandwidth max out with large 64 KB I/O blocks in synthetic tests while choking to a minimum with the default 512 B blocks, on the same existing I/O subsystem and storage drivers.
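
    Something like this minimal sketch of such a test (the file path is just a placeholder; with FILE_FLAG_NO_BUFFERING the block size must be a multiple of the volume's logical sector size):

    Code:
    #include <windows.h>
    #include <stdio.h>

    // Sequentially read a file with a given unbuffered block size and report MB/s.
    static double ReadWithBlockSize(const wchar_t *path, DWORD blockSize)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (h == INVALID_HANDLE_VALUE) return 0.0;

        // FILE_FLAG_NO_BUFFERING needs a sector-aligned buffer; VirtualAlloc returns
        // page-aligned memory, which satisfies 512/4096-byte sectors.
        void *buf = VirtualAlloc(NULL, blockSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);

        unsigned long long total = 0;
        DWORD got = 0;
        while (ReadFile(h, buf, blockSize, &got, NULL) && got > 0)
            total += got;

        QueryPerformanceCounter(&t1);
        double secs = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return (total / (1024.0 * 1024.0)) / secs;
    }

    int main(void)
    {
        // D:\testfile.bin stands in for any large file on the NVMe volume under test.
        wprintf(L"512 B blocks: %.1f MB/s\n", ReadWithBlockSize(L"D:\\testfile.bin", 512));
        wprintf(L"64 KB blocks: %.1f MB/s\n", ReadWithBlockSize(L"D:\\testfile.bin", 65536));
        return 0;
    }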


    These are not layers of abstraction, but separate OS modules each doing their part of the job. This was a deliberate design decision to componentize previously monolithic parts into several modules. Even if you consolidate everything back into fewer modules, there are still hardware protocols and process multithreading and isolation requirements which have to be followed.

    The approach with kernel-mode drivers containing device-class specific functions for I/O and resource management, miniport drivers providing device-specific functions, and user-mode drivers implementing actual processing and user-mode interaction was implemented to reduce unrecoverable blue/green "screen of death" kernel stops, since generic port/class drivers are designed and tested by Microsoft, and hardware/software vendors only need to develop less complex miniport/filter drivers.
     
    #68 DmitryKo, May 31, 2020
    Last edited: May 31, 2020
    PSman1700, BRiT and iroboto like this.
  9. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,503
    Likes Received:
    16,535
    Location:
    The North
    No, unfortunately this is out of my area of knowledge. I don't know how I/O is handled in the cloud.
     
    Jay likes this.
  10. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    16,866
    Likes Received:
    4,190
     
    xpea and Lightman like this.
  11. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    Because I/O is CPU-driven. :???: The more I/O, the more CPU work is required. Again, I can only keep linking Microsoft's dev pages on the Windows I/O model.

    Because 64 KB is the maximum I/O request size for the Windows filesystem. :???:

    Microsoft's number one goal is maintainability of software code, and the fundamental design philosophy behind this is dependency inversion. Reusing software libraries and frameworks, and building software on abstracted interfaces (application/OS API, software/hardware API), is how Microsoft does everything; it's the only way Microsoft designs software.

    I think anybody producing graphics drivers, RAID drivers or any kind of I/O device for which an existing class does not exist would strongly disagree. :yep2:
     
  12. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    Every OS abstracts its programming interfaces and data structures in multiple layers.

    The point is, if the entire block I/O layer uses the same abstractions - i.e. data and control structures - in each of the different components that comprise the device stack, they are, well, on the same level of abstraction. And even if you move all these components into a single OS module, you cannot eliminate the work items in that flow graph - these steps still need to be performed somewhere.

    Filesystem I/O is on a higher level - so if you come up with another 'simpler' filesystem, you still have not eliminated the filesystem abstraction.


    I/O buffers are only limited by available memory pools, there is no hard limit of 64 KBytes.

    This was the maximum NTFS cluster size before Windows 10 version 1709, but currently NTFS supports up to 2 MB, and exFAT supports up to 32 MB.

    There is also a requirement to align non-buffered file I/O requests to multiples of the formatted sector size, i.e. 512, 2048, or 4096 bytes depending on the disk media.
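
    Those alignment requirements can be queried from user mode; a small sketch (\\.\C: is just an example volume) using IOCTL_STORAGE_QUERY_PROPERTY:

    Code:
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        // Open the volume itself (no data access rights needed for a property query).
        HANDLE h = CreateFileW(L"\\\\.\\C:", 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        STORAGE_PROPERTY_QUERY q = { StorageAccessAlignmentProperty, PropertyStandardQuery };
        STORAGE_ACCESS_ALIGNMENT_DESCRIPTOR d = { 0 };
        DWORD bytes = 0;

        if (DeviceIoControl(h, IOCTL_STORAGE_QUERY_PROPERTY, &q, sizeof(q),
                            &d, sizeof(d), &bytes, NULL))
        {
            // Unbuffered I/O must be a multiple of the logical sector size; matching the
            // physical sector size avoids read-modify-write on 512e media.
            printf("logical sector:  %lu bytes\n", d.BytesPerLogicalSector);
            printf("physical sector: %lu bytes\n", d.BytesPerPhysicalSector);
        }
        CloseHandle(h);
        return 0;
    }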


    We're discussing filesystems and low-level I/O in WDM and WDF (KMDF/UMDF). This is the standard driver model for SATA, USB 3.0, and PCIe storage devices since Server 2003, and both StorAHCI and StorNVMe, as well as Intel RST/RSTe, are implemented as StorPort miniport drivers.

    There are other types of device drivers, such as display, network, printer, scanner, input etc. which do not need to follow the WDM/WDF data model for I/O requests - though some are similarly componentized into port/class/miniport/filter modules.
     
    #72 DmitryKo, Jun 1, 2020
    Last edited: Jun 1, 2020
  13. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    Not cluster size, I/O request size. In SDK terms, MM_MAXIMUM_DISK_IO_SIZE. You're confusing filesystem and kernel I/O. One exists within the other.
     
  14. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    There is a legacy 64 KByte I/O limit from the pre-Windows Vista era still present in wdm.h, but it's not used by the kernel-mode Memory Manager in either Windows 10 or Server 2016.
    Also note the system PAGE_SIZE granularity of kernel pool allocations and DMA transfers.

    Code:
    // Define the old maximum disk transfer size to be used by MM and Cache
    // Manager.  Current transfer sizes can typically be much larger.
    //
    #define MM_MAXIMUM_DISK_IO_SIZE          (0x10000)
    
     
    #74 DmitryKo, Jun 2, 2020
    Last edited: Jun 2, 2020
  15. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    I'm lost by this post. We're talking about very different things. You're posting about file caching and DMA requests :???:
    I'm talking about Windows 10's CPU-driven I/O process.
     
    #75 DSoup, Jun 2, 2020
    Last edited: Jun 2, 2020
  16. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    You posted a kernel Memory Manager constant from an NT-era DDK to prove that I/O request packets (IRP) are limited to 64 Kbyte buffers for disk I/O operations in Windows 10.


    In fact I/O buffers, whether for DMA or double-buffered memory-mapped I/O, are physically allocated by the driver using ExAllocatePoolWithTag, MmAllocateContiguousMemorySpecifyCache, MmAllocatePagesForMDL, or IoAllocateMdl, MmProbeAndLockPages, MmMapLockedPagesSpecifyCache, etc., and mapped into virtual memory with memory descriptor list (MDL) structures. So in effect they are only limited by the kernel non-paged memory pool.
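
    For illustration, this is roughly the lock-and-map MDL pattern I mean (the routine name is made up; a real storage stack does this as part of IRP processing, and StorPort largely handles it for miniports):

    Code:
    #include <ntddk.h>

    NTSTATUS MapUserBufferForDma(PVOID UserBuffer, ULONG Length, PMDL *OutMdl, PVOID *OutSystemVa)
    {
        PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
        if (mdl == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        __try {
            // Pin the user pages so they cannot be paged out during the transfer;
            // IoWriteAccess because the device will write into the buffer (a disk read).
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return GetExceptionCode();
        }

        // Map the locked pages into system address space for the driver/DMA setup.
        PVOID sysVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
        if (sysVa == NULL) {
            MmUnlockPages(mdl);
            IoFreeMdl(mdl);
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        *OutMdl = mdl;
        *OutSystemVa = sysVa;
        return STATUS_SUCCESS;
    }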

    For the 32-bit platform, non-paged pool is limited to 256 MBytes (~65532 pages) in Windows 2000/XP and Server 2003, 2 GBytes in Vista/2008, 4 GBytes in W7/2008R2 and W8.1/2012R2, and 3 GBytes in W10/Server 2016.
    For the 64-bit platform, it's limited to 128 GBytes (or to the entire RAM if smaller) in XP/Server 2003, W8.1/2012R2 and W10/Server 2016, 40% of RAM in Vista, and 75% of RAM in Server 2008 and W7/2008R2.
    https://docs.microsoft.com/en-us/windows/win32/memory/memory-limits-for-windows-releases

    Not so much. The file cache has been a key part of the disk I/O driver stack since the original NT 3.x (and arguably OS/2 1.1 and DOS 4.x SmartDrive); it works by memory-mapping files into virtual address space, then letting the memory manager load missing physical pages from disk.
    Windows Internals, Part 2 describes kernel drivers; it should be updated with Windows 10 details in the upcoming 7th edition.

    You think StorNVMe uses memory-mapped I/O?
     
    #76 DmitryKo, Jun 2, 2020
    Last edited: Jun 4, 2020
    PSman1700 likes this.
  17. DSoup

    DSoup X
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    15,262
    Likes Received:
    11,353
    Location:
    London, UK
    I did, and that was a mistake on my part. But your posts are focussed on how hardware accesses memory and how the higher-level filesystem works in some I/O devices. Underpinning all of this is the Windows I/O driver model. There are many kernel-level drivers involved in reading data off storage, decompressing it and getting it to where it needs to be - main RAM or VRAM - and they communicate using the kernel-level I/O model, because drivers cannot directly communicate with each other, only through kernel I/O messaging.

    Again, I'm confused by this post. What has memory-mapped I/O got to do with this? Maybe it would help if you explain how you think data is read off an SSD and gets to RAM on the video card. Because I think you're overlooking a bunch of critical stages which none of your posts have addressed, but which are the only thing I am focussed on. :yes:
     
  18. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    I've described all of them in the posts above. The principal components involved in disk I/O for NVMe storage are:

    I) I/O Manager
    1) filesystem driver stack, made of installable filesystem driver (FAT32/exFAT/NTFS/ReFS) and filesystem filter/minifilter driver (encryption, compression, virus scanning, symbolic links, mount points, deduplication, etc.)
    2) storage driver stack, made of storage port driver - storage miniport driver - storage class driver - storage filter (StorPort driver and StorNVMe miniport)​
    II) Cache Manager
    III) virtual Memory Manager

    They will also interact with other kernel components like object manager, security reference monitor, power manager, Plug and Play etc.

    Specifically encryption/compression works at the file level - if NTFS reparse point data is encountered for the file, the driver stack calls a specified filesystem filter/minifilter driver, which can process the metadata and file data, and/or map file buffers into user address space for system services or applications to process.


    Unfortunately you cannot directly load data from disk into local video memory (unless it's an integrated APU with no local video memory). The display driver port/miniport (WDDM) model is not plugged into the filesystem driver stack and Cache Manager. This would probably require another major revision of the WDDM driver model, and hardware changes to the memory management unit (MMU) in both GPU and CPU to accommodate cache-coherent memory I/O protocols.

    Direct transfer between NVMe controller and GPU is not possible either, since PCIe peer-to-peer DMA is optional for the root complex and desktop chipsets never implement it.

    If you want low-level details and/or historical data, I refer you to Microsoft Docs, SysInternals / Windows Internals, and community.osr.com / osronline.com (use Google "site:" search).
     
    #78 DmitryKo, Jun 4, 2020
    Last edited: Jun 6, 2020
    Malo, PSman1700, BRiT and 3 others like this.
  19. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    918
    Likes Received:
    1,122
    Location:
    55°38′33″ N, 37°28′37″ E
    Heck that's easy.


    I. Load from disk

    File I/O requests typically use CreateFile / ReadFile / ReadFileScatter and these must provide a pre-allocated user-mode buffer (or a scatter page list to multiple buffers).

    The I/O Manager checks whether the file cache contains the required data - if yes, the filesystem driver's Fast I/O path calls into the Cache Manager (and Memory Manager).
    Otherwise the I/O Manager locks the physical pages in the user buffer and maps them into kernel address space using memory descriptor list (MDL) structures, then creates I/O request packet IRP_MJ_READ.

    The IRP passes through the filesystem stack down to the StorPort/StorNVMe driver pair, which set up the NVMe controller to perform Direct I/O DMA to the user-mode buffer mapped by the MDLs.

    When complete, the driver reports success (or failure) to the I/O Manager, which unmaps the user-mode buffer from kernel address space, then returns the number of read bytes to the user-mode process.
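
    A minimal user-mode sketch of that path (the file path and transfer size are placeholders, error handling trimmed), using ReadFileScatter with unbuffered overlapped I/O:

    Code:
    #include <windows.h>
    #include <stdio.h>

    #define PAGES 16   // read 16 pages in one scattered request

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const DWORD page = si.dwPageSize;          // typically 4096

        // ReadFileScatter requires FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED.
        HANDLE h = CreateFileW(L"D:\\testfile.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        // One contiguous, page-aligned allocation here, but each segment could point
        // to a completely different page - that's the "scatter" part.
        BYTE *buf = (BYTE *)VirtualAlloc(NULL, PAGES * page,
                                         MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        FILE_SEGMENT_ELEMENT seg[PAGES + 1] = { 0 };   // NULL-terminated segment list
        for (DWORD i = 0; i < PAGES; i++)
            seg[i].Buffer = PtrToPtr64(buf + i * page);

        OVERLAPPED ov = { 0 };                         // read from file offset 0
        if (!ReadFileScatter(h, seg, PAGES * page, NULL, &ov) &&
            GetLastError() == ERROR_IO_PENDING)
        {
            DWORD got = 0;
            GetOverlappedResult(h, &ov, &got, TRUE);   // block until the transfer completes
            printf("read %lu bytes\n", got);
        }

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }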


    II. Move/map into video memory

    Once your data is loaded into system memory, you call the graphics API runtime to submit your resources to the video memory.

    The runtime uses WDDM kernel-mode driver (DXGK) callbacks to manage local video memory; I/O requests are served by the video port driver using DMA paging buffers and MDLs.

    WDDM 1.x uses linear video memory model where DXGK video memory manager maps a part of GPU video memory to an aperture 'segment' in CPU memory space. The Direct3D 9/10/11 runtimes automatically manage local video or shared (system) memory pools to allocate created resources.

    In Direct3D 12 and WDDM 2.x DDIs, there is no automatic resource management. The video memory manager allocates GPU virtual address space for each user process. The resource binding model requires the programmer to organize GPU resources into descriptor heaps/tables; the runtime processes the descriptors to assign them a virtual video memory address. The application then has to allocate physical memory for the resources from available shared (system) memory or local video memory.

    For discrete GPUs (GpuMmu model), the driver takes an abstracted page table format from the video memory manager and maps internal GPU page tables to point into physical video or physical system memory. The video memory manager pages physical memory from system memory to local video memory by requesting DMA transfers. Additionally the system memory aperture 'segment' contains the entire local video memory for the CPU to access.
    This is the memory model used for AMD GCN and NVidia Kepler ('Unified Memory'), or later.

    For integrated graphics (IoMmu), you actually don't have local video memory, only system memory; GPU MMU uses the same virtual address space as the CPU, and the Memory Manager handles page faults as usual.
    This is the memory model for Intel UHD.


    'CPU I/O', which I take to be your term for either programmed or memory-mapped (port-mapped) I/O, is not used by the StorAHCI or StorNVMe miniport drivers; StorPort only supports Direct I/O DMA mode - specifically bus-master DMA and interrupt signalling. The reason is, UDMA HDDs and PCI bus-mastering controllers like Intel PIIX for the i430/440 series and ICH for the i810/820/840/865 series were commonplace by 2003, when development of Longhorn (Windows Vista) started.
     
    #79 DmitryKo, Jun 4, 2020
    Last edited: Jan 14, 2021
    DSoup, pharma, Malo and 4 others like this.
  20. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,694
    Likes Received:
    3,182
    Location:
    Guess...
    That's interesting, I wasn't aware of this. Do you know if this applies to all desktop chipsets, including the latest AMD and Intel platforms? This would make bypassing system memory for transfers impossible for desktop chipsets then, I suppose?

    I assume then that the EPYC chipset that the DGX works on in the GPU DirectStorage prototypes would have PCIe peer-to-peer DMA enabled, unlike its desktop counterparts.

    EDIT: Re-reading this, it does suggest that Zen platforms do indeed support P2P DMA between any two devices, even on different root ports.
     
    #80 pjbliverpool, Jun 4, 2020
    Last edited: Jun 4, 2020
    Malo likes this.