Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

GPUDirect Benchmarking - HPC-Works - Confluence (atlassian.net)
September 27, 2021
The GPUDirect RDMA technology exposes GPU memory to I/O devices by enabling a direct communication path between GPUs in two remote systems. This feature eliminates the need to use the system CPUs to stage GPU data in and out of intermediate system memory buffers. As a result, the end-to-end latency is reduced and the sustained bandwidth is increased (depending on the PCIe topology).

The GDRCopy (GPUDirect RDMA Copy) library leverages the GPUDirect RDMA APIs to create CPU memory mappings of GPU memory. The advantage of a CPU-driven copy is the very small overhead involved, which is helpful when low latency is required.
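
For anyone curious what a CPU-driven copy through GDRCopy actually looks like, here is a minimal sketch using the calls from the public gdrapi.h (gdr_open, gdr_pin_buffer, gdr_map, gdr_copy_to_mapping). It's a sketch only: error handling is trimmed, the payload and sizes are placeholders, and the exact signatures and the GPU_PAGE_SIZE constant should be checked against the gdrapi.h version you have.

```cpp
// Minimal sketch of a CPU-driven write into GPU memory via GDRCopy.
// Requires the gdrdrv kernel module and CUDA; error checks omitted for brevity.
#include <gdrapi.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>

int main()
{
    const size_t msg_size = 128;                 // the small-message range where GDRCopy shines

    char host_src[128];
    std::memset(host_src, 0xAB, msg_size);       // placeholder payload

    // GPU buffer; gdr_pin_buffer wants a GPU-page-aligned region (cudaMalloc allocations qualify).
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, GPU_PAGE_SIZE);

    gdr_t g = gdr_open();                        // open a handle to the gdrdrv driver
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_buf, GPU_PAGE_SIZE, 0, 0, &mh);

    void* bar_ptr = nullptr;
    gdr_map(g, mh, &bar_ptr, GPU_PAGE_SIZE);     // map the pinned GPU pages into CPU address space

    // The low-latency path: a plain CPU store into GPU memory, no cudaMemcpy or DMA descriptor setup.
    gdr_copy_to_mapping(mh, bar_ptr, host_src, msg_size);

    gdr_unmap(g, mh, bar_ptr, GPU_PAGE_SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_buf);
    return 0;
}
```
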
...
Latency
Over an 8.5x performance boost is achieved for messages up to 128 B (comparing with and without GPUDirect and GDRCopy), and over 5x for 128 B to 4 KB messages.
Another clear observation is that GDRCopy provides a latency benefit for small messages.
GPUDirect RDMA by itself is not enough for best performance on small messages.
...
GPUDirect manages to push the GPU bandwidth to maximum PCIe capacity. GDRCopy doesn’t influence bandwidth.
 
^^ Seems about right. Tiny block sizes are where the most overhead would be, so a significant latency reduction in tiny blocks certainly makes sense. And further, to no surprise, using this ability to pipe GPU data directly to a network interface has significant upsides for HPC applications.

More surprising to me is the benefit at even larger message sizes. Do we know if Linux struggled with this in the past? Even at "peak" utilization the memory transfer rates still seem... low? Curious how this might translate over (if ever) to a physical storage medium.
 
https://schedule.gdconf.com/session...nologies-of-forspoken-presented-by-amd/886052

A GDC session scheduled for March 23rd will be focused on Forspoken, a highly anticipated PC+PS5 game from Square Enix that's due to launch in late May. Part of the discussion will cover Forspoken's support for DirectStorage, a major reworking of how Windows handles storage I/O which seeks to:

  • Bring PS5/XBSX-like super-fast loading times to PC gaming

  • Better enable games to be designed with super-fast storage in mind (a rough sketch of the API shape follows below)
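
For a sense of what that reworking looks like on the programmer's side, here is a rough sketch of enqueueing a single file read straight into a GPU buffer with the DirectStorage SDK. It's only a sketch: D3D12 device and buffer creation are omitted, the helper name LoadAssetDirect is my own, and the struct/field names are from the public dstorage.h as I remember it, so verify against the SDK version you're building with.

```cpp
// Rough sketch: queue one DirectStorage read from a file straight into a D3D12 buffer.
// D3D12 setup, fences, and error handling beyond HRESULTs are omitted.
#include <dstorage.h>
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

HRESULT LoadAssetDirect(ID3D12Device* device,
                        ID3D12Resource* destBuffer,   // GPU buffer that receives the data
                        const wchar_t* path,
                        UINT32 sizeBytes)
{
    ComPtr<IDStorageFactory> factory;
    HRESULT hr = DStorageGetFactory(IID_PPV_ARGS(&factory));
    if (FAILED(hr)) return hr;

    // One queue per source type/priority; requests are batched and submitted together.
    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    hr = factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));
    if (FAILED(hr)) return hr;

    ComPtr<IDStorageFile> file;
    hr = factory->OpenFile(path, IID_PPV_ARGS(&file));
    if (FAILED(hr)) return hr;

    // One struct describes the source (file range) and destination (GPU buffer);
    // the runtime and driver move the data without a Win32 ReadFile bounce buffer.
    DSTORAGE_REQUEST request{};
    request.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Source.File.Source      = file.Get();
    request.Source.File.Offset      = 0;
    request.Source.File.Size        = sizeBytes;
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = sizeBytes;
    request.UncompressedSize            = sizeBytes;

    queue->EnqueueRequest(&request);

    // A fence signal would normally be enqueued here so the game knows when the data landed;
    // Submit() kicks off the whole batch.
    queue->Submit();
    return S_OK;
}
```
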
Snippets from the GDC session summary:

Forspoken is also supporting the new Microsoft DirectStorage API. A part of the session will be dedicated to its addition to the game, highlighting the challenges the studio faced and the benefits it is bringing to the title.

The audience will also be presented with DirectStorage in a real case scenario and get some advice to integrate it into their future projects.

The intended audience is graphics and engine programmers interested in adding cutting-edge graphics technologies and DirectStorage into their game.

I will be excited for this; hopefully it's posted somewhere so you don't have to be an attendee.
 
Nvidia Unveils Big Accelerator Memory: Solid-State Storage for GPUs | Tom's Hardware (tomshardware.com)
March 15, 2022
Microsoft's DirectStorage application programming interface (API) promises to improve the efficiency of GPU-to-SSD data transfers for games in a Windows environment, but Nvidia and its partners have found a way to make GPUs seamlessly work with SSDs without a proprietary API. The method, called Big Accelerator Memory (BaM), promises to be useful for various compute tasks, but it will be particularly useful for emerging workloads that use large datasets.
...
"The goal of Big Accelerator Memory is to extend GPU memory capacity and enhance the effective storage access bandwidth while providing high-level abstractions for the GPU threads to easily make on-demand, fine-grain access to massive data structures in the extended memory hierarchy," a description of the concept by Nvidia, IBM, and Cornell University cited by The Register reads.

BaM essentially enables Nvidia GPUs to fetch data directly from system memory and storage without using the CPU, which makes GPUs more self-sufficient than they are today. Compute GPUs continue to use local memory as a software-managed cache, but will move data using a PCIe interface, RDMA, and a custom Linux kernel driver that enables SSDs to read and write GPU memory directly when needed. Commands for the SSDs are queued up by GPU threads if the required data is not available locally. Meanwhile, BaM does not use virtual memory address translation and therefore does not experience serialization events like TLB misses. Nvidia and its partners plan to open-source the driver to allow others to use their BaM concept.
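
For contrast, below is roughly the conventional CPU-staged path that BaM (and GPUDirect Storage) set out to remove: the CPU reads from the SSD into a host bounce buffer and then issues a copy to the GPU. This is only the baseline, not the BaM API (which isn't public at this point), and the file name is made up.

```cpp
// Conventional CPU-staged path that BaM aims to bypass: SSD -> host bounce buffer -> GPU.
// Linux-only sketch; "dataset.bin" is a made-up file name for illustration.
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    const size_t chunk = 1 << 20;            // 1 MiB per read

    // Pinned host bounce buffer: this staging step (and the CPU driving it) is the overhead
    // BaM removes by letting GPU threads queue NVMe commands themselves.
    void* h_bounce = nullptr;
    cudaHostAlloc(&h_bounce, chunk, cudaHostAllocDefault);

    void* d_buf = nullptr;
    cudaMalloc(&d_buf, chunk);

    int fd = open("dataset.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, h_bounce, chunk);   // CPU pulls SSD data into host memory...
    if (n > 0)
        cudaMemcpy(d_buf, h_bounce, (size_t)n, cudaMemcpyHostToDevice); // ...then DMA to the GPU

    close(fd);
    cudaFree(d_buf);
    cudaFreeHost(h_bounce);
    return 0;
}
```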

Astute readers will remember that AMD attempted to wed GPUs with solid-state storage with its Radeon Pro SSG graphics card several years ago. While bringing additional storage to a graphics card allows the hardware to optimize access to large datasets, the Radeon Pro SSG board was designed purely as a graphics solution, not for complex compute workloads. Nvidia, IBM, and others are taking things a step further with BaM.
 
Hilarious how their PR spins the Microsoft DirectStorage API, which is now a standard part of Windows, as being a "proprietary API". I couldn't take anything they said in the rest of their material seriously or at face value after that.
 
Interesting. As I understand it, AMD already had something similar in their workstation GPUs, and this is a step up from that. AMD will probably implement something similar; they already explored the tech, so that's good.
 
Hilarious how their PR spins the Microsoft DirectStorage API, which is now a standard part of Windows, as being a "proprietary API". I couldn't take anything they said in the rest of their material seriously or at face value after that.

Even more hilarious since it appears to try to imply that whatever NV are doing isn't proprietary. :p

Regards,
SB
 
https://www.theverge.com/2022/3/23/22993860/forspoken-pc-microsoft-directstorage-nvme-ssd-gdc

Load-time examples from the article:

1st example
1.9 sec on an NVMe SSD
4 sec on a SATA SSD
21.5 sec on an HDD

2nd example
1.7 sec on an NVMe SSD
3.2 sec on a SATA SSD
19.9 sec on an HDD


From the article:
"But, says Ono, “I/O is no longer a bottleneck for loading times” — the data transfer speeds of DirectStorage are clearly faster for SSDs, and they could improve it in future if they figure out other CPU bottlenecks and take full advantage of GPU asset decompression."

The game will also support FidelityFX Super Resolution 2.0.

Great to see PC more than keeping up with consoles in IO.
 
I don’t think that is the conclusion of this article. PC may already be keeping up in some situations, but this DirectStorage release doesn't seem to bring much in that respect until the rest of the bottlenecks have been addressed. But as always, PC will get there eventually, and in some cases was already there with sheer brute force.
 
Nice to see the PC at least matching the PS5 in IO. DS seems very streamlined, removing the bottlenecks that existed in Windows. There is a PS5 version of this game, so comparisons between the two will be interesting to see.
 
I don’t think that is the conclusion of this article. PC may already be keeping up in some situations, but this DirectStorage release doesn't seem to bring much in that respect until the rest of the bottlenecks have been addressed. But as always, PC will get there eventually, and in some cases was already there with sheer brute force.

I have to disagree; it shows that the PC was always capable of the same speeds the current consoles can manage. Load times are sub-2 seconds with an NVMe drive rated at just shy of 5 GB/s transfers.

[Image: forspoken_ssd_speed_2.jpg]

From the article you can see that with DirectStorage the file I/O transfer speed is 4829 MB/s vs 2862 MB/s with the old Win32 I/O. The bottleneck has now moved to other tasks, and MS isn't done with DirectStorage; in time those other bottlenecks will be removed too.

"he current implementation of DirectStorage in Forspoken is only removing one of the big I/O bottlenecks — others exist on the CPU"

It's the first game using this, and it was supposed to be released around this time before the delay. So better, more efficient implementations will come along with more improvements to DirectStorage.
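
For reference, the "old win32 IO" side of that comparison is roughly the classic per-chunk ReadFile loop below, with everything landing in system memory first and the CPU left to decompress and upload it. A minimal sketch with a made-up file name, not Forspoken's actual loader (which presumably layers overlapped/unbuffered reads on top of this).

```cpp
// Rough sketch of the traditional Win32 read path that DirectStorage replaces:
// synchronous ReadFile into a system-memory buffer, one chunk at a time.
// "asset.bin" is a made-up file name for illustration.
#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    HANDLE file = CreateFileW(L"asset.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (file == INVALID_HANDLE_VALUE) { std::printf("open failed: %lu\n", GetLastError()); return 1; }

    const DWORD chunk = 1 << 20;                 // 1 MiB per request
    std::vector<char> buffer(chunk);

    DWORD bytesRead = 0;
    while (ReadFile(file, buffer.data(), chunk, &bytesRead, nullptr) && bytesRead > 0)
    {
        // Data is now in system RAM; the game still has to decompress it on the CPU
        // and upload it to the GPU, which is where the remaining CPU bottlenecks live.
    }

    CloseHandle(file);
    return 0;
}
```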
 
That's before the GPU is put to work. We are already seeing north of 7 GB/s raw speeds on higher-end NVMe drives; GPU decompression will make for some impressive numbers there.
And as the above graph shows, it isn't just load speeds: it's the read performance that has increased a lot with the first iteration of DS, which will come in handy for streaming, not just loading. And loading isn't the only thing that improves by going to SSDs.
Also, I must say, quite impressive performance coming from old SATA SSDs.
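
To put a concrete shape on the GPU decompression point: in the DirectStorage SDK that later shipped GPU decompression (1.1), the change on the request side is essentially one extra field, marking the source data as GDeflate-compressed so the runtime can hand decompression to the GPU. The field and enum names here are from memory of that later SDK and worth double-checking; the helper name is my own.

```cpp
// Hypothetical helper building on the earlier request sketch: same enqueue, but the file range
// holds GDeflate-compressed data and the runtime decompresses it on the GPU (DirectStorage 1.1+).
#include <dstorage.h>
#include <d3d12.h>

void EnqueueCompressedRead(IDStorageQueue* queue, IDStorageFile* file, ID3D12Resource* destBuffer,
                           UINT32 compressedSize, UINT32 uncompressedSize)
{
    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE; // GPU-side decompression
    request.Source.File.Source        = file;
    request.Source.File.Offset        = 0;
    request.Source.File.Size          = compressedSize;      // size on disk
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = uncompressedSize;
    request.UncompressedSize            = uncompressedSize;  // size after decompression
    queue->EnqueueRequest(&request);
}
```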
 
That's before the GPU is put to work. We are already seeing north of 7 GB/s raw speeds on higher-end NVMe drives; GPU decompression will make for some impressive numbers there.
And as the above graph shows, it isn't just load speeds: it's the read performance that has increased a lot with the first iteration of DS, which will come in handy for streaming, not just loading. And loading isn't the only thing that improves by going to SSDs.
Also, I must say, quite impressive performance coming from old SATA SSDs.

Yes, the old-school SSDs are impressive. I can't wait to see what else Microsoft does with Windows 11 going into the future.
 