Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

it does suggest that Zen platforms do indeed support P2P DMA
Looks dubious - they whitelist AMD Zen, but not Zen+ or Zen2, and Intel support is limited to a few LGA-2066 models. And what about hardware cache coherence - the root complex would need to snoop all peer-to-peer transactions...

the EPYC chipset that the DGX works on in the GPU DirectStorage prototypes would have PCIe peer-to-peer DMA enabled in the chipset
EPYC processors do not need a chipset, there are 128 PCIe lanes onboard. GPUDirect Storage looks like a Linux-only feature, like the special GPUDirect RDMA drivers in server NICs from Mellanox.
 
EPYC processors do not need a chipset, there are 128 PCIe lanes onboard.

Wouldn't this hold for Zen and Zen 2 as well then, which both feature 24 lanes? Obviously they still need a chipset for other system elements, which 4 of those lanes are used for, but the other 20 are split between GPU and NVMe. So does that mean they share the same upstream PCIe bridge and thus P2P DMA is enabled by default?

EDIT: I've scoured the net on this and as yet been unable to find confirmation one way or the other as to whether the NVMe and GPU share the same host port/bridge from the CPU or whether they're on different ports. I did stumble upon this though, which is a whitelisting for Renoir (Zen 2) so it looks like this is supported for Zen 2 as well (and presumably then Zen+).
 
Wouldn't this hold for Zen and Zen 2 as well
a whitelisting for Renoir (Zen 2) so it looks like this is supported for Zen 2
It's not in the master yet, and their vendor/device IDs do not match the PCIe Root Ports on my Zen 2 processor (Ryzen 5 3600). Overall this doesn't look like a fully tested or widely supported feature that's ready for production deployment.


The problem is, the PCIe Root Complex is not required to route peer-to-peer transactions either within or beyond its hierarchy tree, and it does not expose its P2P capabilities as a discoverable configuration option.

Only PCIe Switches (virtual PCI-PCI bridges) are required to support P2P unconditionally - and even then it was primarily intended for multifunction devices and embedded/industrial markets, where the limited lanes from the CPU are typically split between several microcontrollers which talk to each other directly, saving the limited memory and CPU bandwidth (look at Xilinx, Broadcom, etc). This is the motivation given in the original commit request for the P2PDMA component, and the latest version only supports devices connected to the same root port or the same upstream port.

The PCIe spec assumes implementations may incorporate a virtual / physical PCIe Switch within the Root Complex to enable software transparent routing of P2P transactions, but it's not required either.
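
For reference, this is roughly how a kernel driver consumes that P2PDMA component on the Linux side - a minimal sketch using the in-kernel pci_p2pdma helpers (exact signatures have shifted between kernel versions, so treat it as illustrative rather than definitive):
Code:
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Ask the P2PDMA core whether peer-to-peer DMA between an NVMe drive
 * (acting as the memory provider) and a GPU (the client) is possible,
 * then carve out part of the provider's BAR as a P2P buffer. */
static void *try_p2p_buffer(struct pci_dev *nvme, struct pci_dev *gpu, size_t size)
{
    struct device *clients[] = { &gpu->dev };

    /* A negative distance means no usable common upstream port was found
     * (and the platform isn't covered by the whitelist fallback either). */
    if (pci_p2pdma_distance_many(nvme, clients, 1, true) < 0)
        return NULL;

    /* Only succeeds if the provider has published P2P memory,
     * e.g. an NVMe controller memory buffer (CMB). */
    return pci_alloc_p2pmem(nvme, size);
}

Note that consumer SSDs rarely expose a controller memory buffer in the first place, which is one more reason this has remained a server-side feature so far.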

been unable to find confirmation one way or the other as to whether the NVMe and GPU share the same host port/bridge from the CPU
No, they are on separate Root Ports. Here's how the PCI device tree looks on my AMD X570 / Ryzen 5 3600 system with RX 5700 XT graphics and GAMMIX S11 SSD (Device Manager - View - Devices by connection):
Code:
⯆ PCI Bus
      AMD SMBus
    ⯆ PCI Express Root Port
        ⯆  Standard NVM Express Controller
               XPG GAMMIX S11 Pro
    ⯆ PCI Express Root Port
        ⯆ PCI Express Upstream Switch Port
            ⯆ PCI Express Downstream Switch Port
                ⯆ PCI Express Upstream Switch Port
                      PCI Express Downstream Switch Port
                      PCI Express Downstream Switch Port
                    ⯆ PCI Express Downstream Switch Port
                        ⯆ Intel(R) I211 Gigabit Network Connection
                      PCI Express Downstream Switch Port
            ⯆ PCI Express Downstream Switch Port
                  AMD PCI
                  AMD PSP 11.0 Device
                ⯆ AMD USB 3.10 eXtensible Host Controller - 1.10 (Microsoft)
                    ⯆ USB Root Hub (USB 3.0)
                          Generic USB Hub
                ⯆ AMD USB 3.10 eXtensible Host Controller - 1.10 (Microsoft)
                    ⯆ USB Root Hub (USB 3.0)
                        ⯆ Generic SuperSpeed USB Hub
                            ⯆ USB Attached SCSI (UAS) Mass Storage Device
                                JMicron Generic SCSI Disk Device
            ⯆ PCI Express Downstream Switch Port
                ⯆ Standard SATA AHCI Controller
                      WDC WD4003FZEX-00Z4SA0
    ⯆ PCI Express Root Port
        ⯆ PCI Express Upstream Switch Port
            ⯆ PCI Express Downstream Switch Port
                ⯆ AMD Radeon RX 5700 XT
                      BenQ EW3270U
                ⯆ High Definition Audio Controller
                    ⯆ AMD High Definition Audio Device
                          3 - BenQ EW3270U (AMD High Definition Device)
    ⯆ PCI Express Root Port
          AMD PCI
    ⯆ PCI Express Root Port
          AMD PCI
        ⯆ AMD USB 3.10 eXtensible Host Controller - 1.10 (Microsoft)
            ⯆ USB Root Hub (USB 3.0)
                ⯆ USB Composite Device
                    ⯈ USB Input Device
                    ⯈ USB Input Device
                ⯆ USB Composite Device
                    ⯈ USB Input Device
                    ⯈ USB Input Device
                    ⯈ USB Input Device
        ⯆ High Definition Audio Controller
            ⯆ Realtek(R) Audio
                  Realtek Asio Component
                  Realtek Audio Effects Component
                  Realtek Audio Universal Service
                  Realtek Digital Output (Realtek(R) Audio)
                  Realtek Hardware Support Application
                  Headphones (Realtek(R) Audio)
              PCI Encryption/Decryption Controller
    ⯆ PCI Express Root Port
          Standard SATA AHCI Controller
    ⯆ PCI Express Root Port
          Standard SATA AHCI Controller
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
      PCI standard host CPU bridge
   ⯆ PCI standard ISA bridge
          Direct memory access controller
          Programmable interrupt controller
          System CMOS/real time clock
          System speaker
          System timer

Note that chipset-based I/O controllers and the primary PCIe x16 slot (which can share lanes with the secondary x16 slot) are actually connected through a PCIe Switch - which manifests as an Upstream Switch Port and multiple Downstream Switch Ports - and the Ethernet controller is connected through another lower-level switch to allow multiple Ethernet ports.

Other Root Ports and devices are not routed through a switch.
 
It's not in the master yet, and their vendor/device IDs do not match the PCIe Root Ports on my Zen 2 processor (Ryzen 5 3600).

I'm guessing that's because you're using the desktop version of Zen 2 (Matisse) whereas the device ID used in the link above was for Renoir which is the APU implementation with a slightly different configuration.
 

Linus made a video apologizing to Tim Sweeney about the PS5 SSD and his lack of due diligence and understanding.

At the end though, he still has the questions that a lot of us here have as well. How much of the overhead resides on the OS Kernel vs GPU driver, whether a high core CPU can sufficiently mitigate some of the decompression overhead, as well as what DirectStorage is going to do on the PC side of things.

Sorry if this has already been posted elsewhere, but since it does pertain to PCs, I thought this would be a good place for it.
 
At the end though, he still has the questions that a lot of us here have as well. How much of the overhead resides on the OS Kernel vs GPU driver, whether a high core CPU can sufficiently mitigate some of the decompression overhead, as well as what DirectStorage is going to do on the PC side of things.

Interesting watch although I think we've already been through pretty much all of what he says there on these threads in a fair bit more detail. TBH we've probably gone as far as we can go now until DirectStorage is released and we can understand what the hell it actually does. Will it be a few tweaks here and there, or will it be a revolutionary overhaul of the PC's IO architecture?

One point he did make which stuck out to me was around the PC "for the first time ever" being the lowest common denominator. In fact, that has always been the case with every console generation. Hell, it's probably still the case now with this generation. The question isn't whether some target gaming PCs are slower in some ways than the latest consoles (the answer is always yes), the question is do those consoles exceed any PC in one or more areas, i.e. are they breaking new ground. And even then, the answer is usually yes at the start of a console generation.
 
In fact, that has always been the case with every console generation.

More so back then than now. The OG Xbox basically had a GF4 half a year before the PC got it, and it had twin vertex shaders giving it a huge performance advantage. Even the first Halo, coming much later to PC, had many missing effects.
 
Heck that's easy.

That's amazing, really. :yes: I truly appreciate you going to such lengths here, because I was expecting to have to write something like this in response. But it doesn't really zone in on all of the higher level inter-driver I/O in the kernel itself which is the reason why the harder you push I/O, the less real CPU time you have left over. The Windows 10 kernel does balance its CPU usage based on a number of hardware factors, unlike back in the days of Windows 95/98/2000 where pushing a few IDE drives could literally leave almost no free CPU time for the user at all.

The fundamental issue is many/most I/O requests are copied and re-copied as the data is transferred through the various subsystems. This is one of the problems that Nvidia's GPUDirect Storage aims to solve. To reiterate how bad pushing more I/O can be - i.e. going with faster and faster SSDs, particularly when operating in parallel - over in the console forum, @London-boy recently posted this apology from Linus at Linus Tech Tips to Tim Sweeney over some misunderstanding about what PS5 does so differently, but usefully it includes a slice on what happened when these guys tried loading more SSD I/O onto their existing servers because of all the I/O overhead.

If you want to skip the apology and the overview of PS5's different approach to I/O, skip to 6:30, where he begins talking about why Halo MCC running on PC with a vastly faster SSD actually doesn't produce any faster loads. He also talks about why going NVMe over an older slow SSD actually results in a fraction of the performance improvement you would expect based on the raw speeds, then talks about what actually happened to their server when the SSD I/O load was crazy high. The server running Linux ran, but barely; Windows was worse. Identical hardware, the only difference being the software stack/OS.

Interesting watch although I think we've already been through pretty much all of what he says there on these threads in a fair bit more detail. TBH we've probably gone as far as we can go now until DirectStorage is released and we can understand what the hell it actually does. Will it be a few tweaks here and there, or will it be a revolutionary overhaul of the PC's IO architecture?

Indeed.
 
Current games are not optimized to take advantage of SSDs. Star Citizen and Doom Eternal are, and they show massive improvements in gameworld streaming and instant loading, as opposed to unplayable streaming and long load times. Both show improvements on NVMe (and even Optane) over old SATA disks.

As for that Linus video, he's not saying either the PC or the PS5 is better than the other regarding SSD tech. Obviously, like he noted, both are not available yet (PS5 and future SSD PC tech).
 
I'm guessing that's because you're using the desktop version of Zen 2 (Matisse)
This still won't enable GPUDirect Storage or RDMA on desktops.

Advanced CUDA features are designed to work on specific HPC hardware - like POWER8 supercomputers which implement heterogeneous memory access with NVLink cache coherence protocols on NVIDIA GPUs, or custom-built DGX-2 systems which use server-grade Xeon Platinum CPUs with 48 PCIe lanes.
Such features have to use fallbacks on regular desktop computers, so they won't deliver the same performance.
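
To see what your own box reports, the GPU-to-GPU flavour of this is exposed through the CUDA runtime's peer-access query - a small sketch (device numbering assumed, error checking omitted):
Code:
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int can_access = 0;

    /* Can GPU 0 directly read/write GPU 1's memory over PCIe/NVLink? */
    cudaDeviceCanAccessPeer(&can_access, 0, 1);

    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   /* flags must be 0 */
        printf("Direct P2P path available between GPU 0 and GPU 1\n");
    } else {
        printf("No direct P2P path - copies get staged through system RAM\n");
    }
    return 0;
}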

Specifically, NVIDIA DGX-2 supercomputers include a two-level PCIe switch complex with twelve switches connecting each pair of GPUs to one shared PCIe slot, designed to enable RDMA with high-performance 100GbE fiber-optic NICs.

So to prototype the GPUDirect Storage driver, they simply installed an NVMe RAID card in one of the 8 PCIe slots (see the text below Figure 2).


Thus it's the sheer number of PCIe lanes and custom system design based on PCIe switches which enables direct P2P transactions between NVMe disks and GPUs on the DGX-2 system - and the price starts at US $350,000, so they can afford the cost of additional hardware (though note the massive heatsinks on each of the PCIe switches, legend 5 on the system diagram below).

[Image: DGX-2 system diagram (exploded view)]

we've probably gone as far as we can go now until DirectStorage is released and we can understand what the hell it actually does
It provides a new file I/O API designed to expose the performance of NVMe drives by following optimal flash memory access patterns.

Hopefully it requires no proprietary hardware - I wouldn't want GPU vendors to start adding M.2 slots on their video cards and offering 'certified' SSD bundles with outrageous price tags.

the question is do those consoles exceed any PC in one or more areas, i.e. are they breaking new ground.
New PCs sold in 2021 will be the same or better, and the consoles will end up being equivalent to a low-end PC in the following years. That's no different than any previous generation.
 
the higher level inter-driver I/O in the kernel itself which is the reason why the harder you push I/O, the less real CPU time you have left over
The fundamental issue is many/most I/O requests are copied and re-copied as the data is transferred through the various subsystems.
Direct I/O DMA does not copy any data - it's delivered directly to the physical memory in page size granularity and projected into virtual address space of each process as needed.

It's rather that applications and services don't allocate large enough buffers and/or implement data-fetching logic designed for slow hard disk drive access, thus overwhelming the NVMe controller with I/O requests that need LBA translation from 512B sectors.

Since it's not possible to revise all this legacy code, Microsoft will have to either a) design a transaction layer that tries to optimize existing file I/O requests, and/or b) come up with a new API optimized for NVMe drives. Maybe they will have to do both.
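
In the meantime, an application that cares can already get part of the way there with plain Win32 unbuffered, overlapped I/O - a minimal sketch of issuing one large, sector-aligned read instead of thousands of tiny cached ones (made-up file path, error handling omitted):
Code:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD chunk = 8 * 1024 * 1024;   /* one 8 MiB request */

    /* FILE_FLAG_NO_BUFFERING bypasses the cache manager but requires the
     * buffer, offset and length to be aligned to the volume sector size;
     * VirtualAlloc returns page-aligned memory, which satisfies that. */
    HANDLE file = CreateFileA("D:\\assets\\level.pak", GENERIC_READ,
                              FILE_SHARE_READ, NULL, OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
    void *buffer = VirtualAlloc(NULL, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    OVERLAPPED ov = { 0 };
    DWORD read = 0;
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    ReadFile(file, buffer, chunk, NULL, &ov);       /* queues the request */
    GetOverlappedResult(file, &ov, &read, TRUE);    /* waits for completion */
    printf("read %lu bytes\n", (unsigned long)read);

    CloseHandle(ov.hEvent);
    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(file);
    return 0;
}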


Windows 10 kernel does balance its CPU usage based on a number of hardware factors, unlike back in the days of Windows 95/98/2000 where pushing a few IDE drives could literally leave almost no free CPU time for the user at all
It didn't really start with Windows 10 - actually 'NT OS/2' (and OS/2 1.x before it) was designed with all these direct I/O capabilities right from the start, because it targeted IBM PS/2 systems capable of MCA bus-mastering.

The quality of NT storage drivers did improve around Windows 2000/2003/Vista, which was a direct consequence of the port/class/miniport/filter layers in WDM (Windows Driver Model) and additional abstractions in WDF (Windows Driver Framework). These frameworks are built on the same NTDDK OS interfaces and structures as the original NT 3.x, but they abstract away some fine details of kernel-mode programming that require careful attention from driver writers, such as thread contexts, power management and Plug and Play configuration. This port/miniport model really paid off in driver stability when PCI/PCIe and USB 2/3 became the industry standard for all peripheral devices. Video drivers were fundamentally broken until WDDM 2.0, though.


Windows/386 3.1x and Windows 9x used an entirely different monolithic kernel designed to a much lesser standard. These were an utter nightmare to use, since the system would barely last a single day without a fatal system crash. It was such a relief to move on to Windows 2000 and XP (notwithstanding their security vulnerabilities, which resulted in the maelstrom of rootkit infections a few years later).
 
Specifically, NVIDIA DGX-2 supercomputers include a two-level PCIe switch complex with twelve switches connecting each pair of GPUs to one shared PCIe slot, designed to enable RDMA with high-performance 100GbE fiber-optic NICs.

So to prototype the GPUDirect Storage driver, they simply installed an NVMe RAID card in one of the 8 PCIe slots (see the text below Figure 2).


Thus it's the sheer number of PCIe lanes and custom system design based on PCIe switches which enables direct P2P transactions between NVMe disks and GPUs on the DGX-2 system

It seems, based on the table below Figure 2, that they can use the system's default NVMe storage too, which is connected to the first-level PCIe switch. The RAID cards appear to simply be an optional extra which they've included in the diagram/table to illustrate the maximum possible aggregate bandwidth into the GPUs.

So if we ignore the RAID NVMe cards and focus only on the default NVMe drives which connect to the first-layer PCIe switch, wouldn't that basically replicate how a desktop Zen is configured (on a much smaller scale, of course)? I don't understand why adding a second PCIe switch layer, to which the GPUs are connected, would make P2P DMA more possible - if anything, wouldn't it make it more complex, as you're adding an extra switch layer in there? On the Zen, both the NVMe and the GPU would hang straight off the CPU (on separate ports) so the number of PCIe lanes is sufficient and there's no PCIe switch required. The configuration is much simpler while still offering the same connection between the single NVMe drive and GPU that the DGX does between multiple instances of each (thanks to the more complex PCIe switch configuration).
 
wouldn't that basically replicate how a desktop Zen is configured
No. Unlike the DGX-2, there are no Switches in between the M.2 slot, the PCIe x16 slot, and the chipset NIC.

There is a Switch between the chipset I/O controllers, with an additional lower-level Switch between the Ethernet controllers, and a Switch between the two PCIe x16 slots. This makes possible P2P transactions between the two GPUs, and P2P transactions between Ethernet controllers and chipset USB 3 and SATA controllers and M.2 slot.

On the Zen, both the NVMe and the GPU would hang straight off the CPU (on separate ports) so the number of PCIe lanes is sufficient and there's no PCIe switch required
Why use an Ethernet switch (network bridge) to connect Ethernet ports when you can just cut the cables and physically connect same-colored twisted pairs to each other?

There are physical wires running from the slots or chips into the CPU and it can talk to each device on an end-to-end connection, but it doesn't mean the CPU is designed to initiate transmission protocols or maintain physical connections from one peripheral device to another. For starters, the length of the end-to-end signal path is potentially twice as much so additional redriver/retimer/multiplexer/clock buffer/whatever logic will be required to maintain signal integrity.

So it can only work reliably with a PCIe Switch at the devices' end.
 
Why use an Ethernet switch (network bridge) to connect Ethernet ports when you can just cut the cables and physically connect same-colored twisted pairs to each other?

There are physical wires running into the CPU and it can talk to each device on an end-to-end connection, but it doesn't mean the CPU is designed to initiate transmission protocols or maintain physical connections from one device to another. For starters, the length of the signal path is potentially twice as much so additional redriver/retimer/multiplexer/clock buffer/whatever logic will be required to maintain signal integrity.

No. Unlike the DGX-2, there are no Switches in between the M.2 slot, the PCIe x16 slot, and the chipset NIC.

There is a Switch between the chipset I/O controllers, with an additional lower-level Switch between the Ethernet controllers, and a Switch between the two PCIe x16 slots. This makes possible P2P transactions between the two GPUs, and P2P transactions between Ethernet controllers and chipset USB 3 and SATA controllers and M.2 slot.

Thanks, I think I understand now. So essentially the GPUs and the NVMe drives on each of the 4 PCI trees in the DGX use the same PCIe root port, hence why P2P DMA is relatively straightforward. But that is obviously not the case with a Zen.

Presumably then, in this implementation of GDS, we wouldn't expect an NVMe drive on a different PCIe tree to be able to P2P DMA to a GPU other than the 4 on its own tree? I've no idea how that works with what is presumably a RAID array, but I guess there is some way of balancing the throughput from all the drives across each of the trees to maximise the data transfer rate to the 16 GPUs.
 
the GPUs and the NVMe drives on each of the 4 PCI trees in the DGX use the same PCIe root port, hence why P2P DMA is relatively straightforward
we wouldn't expect an NVMe drive on a different PCIe tree to be able to P2P DMA to a GPU other than the 4 on its own tree

Yes, P2P works for multiple devices on the same root port because, by design, they are physically connected to this port through a switch (a port is a collection of lanes, while a switch has a single upstream port and several downstream ports to connect multiple devices to that same port by physically dividing its lanes between all devices).

Any switch can route P2P transactions between its ports - so P2P should also work with devices connected through several switches in a multi-level hierarchy.

P2P won't work for devices connected to different root ports (since the root complex is not required to route P2P transactions between its ports).
[Image: Example PCI Express topology]

https://en.wikipedia.org/wiki/Root_complex


PS. The P2PDMA driver actually enables P2P capability by detecting a switch in the hierarchy - i.e. finding if any two devices are connected through the same upstream port.

The whitelist is a fallback for when they cannot detect a common upstream port: they check whether the devices belong to the same whitelisted Root Complex, since the code won't currently detect multi-level switching as used by recent CPUs and chipsets.

EDIT Sep 2020: The code actually allows P2P DMA transfers between devices on the same PCIe Root Complex, and a recent commit enables it on AMD Zen and later processors.
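
For the curious, the detection part boils down to walking up the bridges above each device and looking for a shared upstream port - something like this simplified sketch (the real driver also checks that the common bridge is actually a switch port, looks at ACS settings, and falls back to the whitelist):
Code:
#include <linux/pci.h>

/* Is 'bridge' somewhere on the path between 'dev' and the root complex? */
static bool is_upstream_of(struct pci_dev *bridge, struct pci_dev *dev)
{
    struct pci_dev *up;

    for (up = pci_upstream_bridge(dev); up; up = pci_upstream_bridge(up))
        if (up == bridge)
            return true;
    return false;
}

/* Two endpoints are P2P-capable if their paths towards the root complex
 * meet at a common bridge (i.e. a shared upstream port) before reaching it. */
static bool share_an_upstream_port(struct pci_dev *a, struct pci_dev *b)
{
    struct pci_dev *up;

    for (up = pci_upstream_bridge(a); up; up = pci_upstream_bridge(up))
        if (is_upstream_of(up, b))
            return true;
    return false;
}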
 
At the end though, he still has the questions that a lot of us here have as well. How much of the overhead resides on the OS Kernel vs GPU driver, whether a high core CPU can sufficiently mitigate some of the decompression overhead, as well as what DirectStorage is going to do on the PC side of things.

After talking to some co-workers I don't think it will matter. I myself have had access to a card that has the same amount of RAM as the entire XSX and around twice the flops, if I did my math right. Pair that up with system RAM and SSDs in the 4-8GB/s range, plus a CPU that gets 15-20% more IPC and more cores than what's in the console, and then DirectStorage is the cherry on top. I don't think the PS5 will have an advantage on the PC side of things because a PC won't need to stream as much.
 
Direct I/O DMA does not copy any data - it's delivered directly to the physical memory in page size granularity and projected into virtual address space of each process as needed.

The Windows I/O subsystem, and its handling of I/O Request Packets (which dates back to VMS, because Dave Cutler), spends a lot of time managing the I/O stack and copying/moving packets. Just see how often functions like IoCopyCurrentIrpStackLocationToNext, RtlCopyMemory, RtlMoveMemory are called. Now figure the number of individual IRPs which are spawned from a single read of just a 40MB file as that particular I/O activity is passed around the different Windows subsystems.

I don't know what Linus Tech Tips were trying to do in terms of I/O with their server, where Windows overloaded a 24-core processor, but the Windows kernel overhead for I/O is not an unknown. There is a reason that Linux is so prevalent in network infrastructure.

Pair that up with system RAM and SSDs in the 4-8GB/s range, plus a CPU that gets 15-20% more IPC and more cores than what's in the console, and then DirectStorage is the cherry on top.
Can you tell us what DirectStorage is?
 
its handling of I/O Request Packets ... spends a lot of time managing the I/O stack and copying/moving packets
Just see how often functions like IoCopyCurrentIrpStackLocationToNext, RtlCopyMemory, RtlMoveMemory are called.
I/O request packets (IRPs) are not copied or moved - they are passed down the driver stack by reference, i.e. by a pointer passed to IoCallDriver().

I/O stack parameters are indeed copied - the I/O stack is how each driver keeps track of its own actions on that particular IRP - but these only take 36 bytes.
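
For anyone following along, this is the canonical WDM pass-through pattern those calls belong to - a minimal sketch assuming a filter driver whose device extension starts with the lower device object saved by IoAttachDeviceToDeviceStack():
Code:
#include <ntddk.h>

typedef struct _FILTER_EXTENSION {
    PDEVICE_OBJECT LowerDeviceObject;   /* saved at AddDevice time */
} FILTER_EXTENSION, *PFILTER_EXTENSION;

NTSTATUS FilterDispatchPassThrough(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PFILTER_EXTENSION ext = (PFILTER_EXTENSION)DeviceObject->DeviceExtension;

    /* Duplicate only the small per-driver stack location... */
    IoCopyCurrentIrpStackLocationToNext(Irp);

    /* ...then hand the very same IRP (by pointer) to the next driver down. */
    return IoCallDriver(ext->LowerDeviceObject, Irp);
}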

There is a reason that Linux is so prevalent in network infrastructure.
It costs less to license and support in large-scale deployments.
 
Pair that up with system RAM and SSDs in the 4-8GB/s range, plus a CPU that gets 15-20% more IPC and more cores than what's in the console, and then DirectStorage is the cherry on top. I don't think the PS5 will have an advantage on the PC side of things because a PC won't need to stream as much.
Minimum Requirements: 24GB RAM, 24 Cores, 8GB/sec NVME SSD ...
Not that this is a bad thing, especially if we can get this at a reasonable price. Of course the only way to bring the price down would be the need to keep up with the PS5, so it would be a win/win :devilish:
 
Can you tell us what DirectStorage is?
Sorry

Minimum Requirements: 24GB RAM, 24 Cores, 8GB/sec NVME SSD ...
Not that this is a bad thing, especially if we can get this at a reasonable price. Of course the only way to bring the price down would be the need to keep up with the PS5, so it would be a win/win :devilish:

I remember the days of having to upgrade constantly on the PC side. Not only that, but there was much more you had to upgrade. CPU/mobo/RAM of course, that's as it always is. There were co-processors at one point which gave way to graphics cards and sound cards, and then optical storage, and heck, you'd have to upgrade your modem! Damn, I remember paying big bucks for a hardware 56k modem. Remember that? There were some 56k modems that offloaded a lot of work to the CPU, so you'd often get worse performance, but the more expensive modems would use less of the CPU and you'd get better connections. I remember having to buy a true CD-ROM because it used 2 lasers and got much faster read speeds. The good old expensive days.

RAM is cheap enough that most people running a 16GB system can simply buy another 16GB for $100-$200. I don't think that if you have 32GB of RAM with a 16GB graphics card you really even need an 8GB/s NVMe drive.

That's also barring there being a different solution on PCs.
 