Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

I wonder if DirectStorage and RTX IO are able to scale up that far? 56GB/s is faster than your average RAM throughput!

Will be interesting. I would still just rather see an SSD added to the graphics card itself.

Imagine this card, but attached directly to your graphics card, bypassing everything else.
 
RTX IO detailed in Ampere Whitepaper

GeForce RTX GPUs will deliver decompression performance beyond the limits of even Gen4 SSDs, offloading potentially dozens of CPU cores’ worth of work to ensure maximum overall system performance for next-generation games. Lossless decompression is implemented with high performance compute kernels, asynchronously scheduled. This functionality leverages the DMA and copy engines of Turing and Ampere, as well as the advanced instruction set and architecture of these GPUs’ SMs. The advantage of this is that the enormous compute power of the GPU can be leveraged for burst or bulk loading (at level load for example) when GPU resources can be leveraged as a high performance I/O processor, delivering decompression performance well beyond the limits of Gen4 NVMe. During streaming scenarios, bandwidths are a tiny fraction of the GPU capability, further leveraging the advanced asynchronous compute capabilities of Turing and Ampere.
https://www.nvidia.com/content/dam/...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf

Looks promising.
 
An existing 4TB NVMe at $500, times 4, is $2K, plus whatever the cost of the card is.
The AORUS Gen4 AIC card goes for $150 at Newegg, and a 4TB NVMe disk would set you back ~$800.

Initial Phison E18 SSDs seem to be limited to 2TB though, and Samsung 980 Pro only goes up to 1TB from the leaked specs.


56GB/s is faster than your average RAM throughput!
56 Gbyte/s is quite possible with dual-channel DDR4-3600 (PC4-28800), and PCIe 4.0 x16 goes up to 32 GByte/s (in each direction).
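A quick back-of-the-envelope for those two peak figures (theoretical maxima, illustrative only):

```python
# Peak-bandwidth arithmetic behind the figures above (theoretical maxima only).

# Dual-channel DDR4-3600: 3600 MT/s, 8 bytes per transfer, 2 channels.
ddr4_dual_channel = 3600e6 * 8 * 2 / 1e9
print(f"DDR4-3600 dual-channel: {ddr4_dual_channel:.1f} GB/s")      # ~57.6 GB/s

# PCIe 4.0 x16: 16 GT/s per lane, 128b/130b line coding, 16 lanes, per direction.
pcie4_x16 = 16e9 * (128 / 130) * 16 / 8 / 1e9
print(f"PCIe 4.0 x16 (one direction): {pcie4_x16:.1f} GB/s")        # ~31.5 GB/s
```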

But I don't think upcoming DirectStorage games could make use of simultaneous reads/writes on the scale of 30 Gbyte/s, because any GPU will be unable to keep up with decompression at this data rate. Not until the year 2028 - and I would rather spend on a trip to the Los Angeles Summer Olympics than on a $1500 NVMe RAID, a $2000 HEDT platform, and a $2500 Titan video card. :nope:

I would still just rather see an SSD added to the graphics card itself.
It would make no difference if the PCIe Switch was located directly on the add-on card and not in the CPU Root Complex. PCIe is a point-to-point protocol, unlike conventional PCI.
 
So, 4 * $800 + $150 = a cool $3350, with tax around $3618. No problem. :LOL:
 
It would make no difference if the PCIe Switch was located directly on the add-on card, and not in the CPU Root Complex. PCIe is a point-to-point protocol, unlike conventional PCI.

Doesn't it currently go HDD/SSD to CPU, then CPU to GPU? So if you could just do SSD to GPU you'd have an easier time, and you'd be able to iterate faster in storage because you wouldn't be reliant on motherboards and CPUs supporting it.
 
Doesn't it currently go HDD/SSD to CPU, then CPU to GPU?
PCIe devices use physical point-to-point links - it does not matter if transfers go through the built-in PCIe Switch in the CPU's PCIe Root Complex, or through a dedicated PCIe Switch chip on the add-on board or a multi-function ASIC. Either way the links are physically switched to connect different endpoints.

if you could just do SSD to GPU you'd have an easier time
Only if you connect the NVMe disks to the GPU memory controller with dedicated PCIe x4 links - however for the SSD to be visible to the host CPU and accessible by the OS disk/file management, such dedicated GPU links would still have to go through a PCIe Switch.

So if you can simply use the SSD connected to the CPU Root Port to the same effect, then why bother with dedicated GPU links and either proprietary driver code or another radical redesign of the Windows Display Driver Model?

and you'd be able to iterate faster in storage because you wouldn't be reliant on motherboards and CPUs supporting it.
A dedicated 16-lane, 2-port PCIe Switch chip costs extra, and Gen4 switches are not available on the market yet - whereas a built-in PCIe Switch comes with your Zen* processor for free.
 
Nvidia are claiming the performance cost is trivial at 7GB/s, so it doesn't sound like 4x that throughput would be out of reach.
The GA102 whitepaper above talks about memory bandwidth, not decompression performance (and uses the same slide where the SSD goes through the NIC and decompressed data have double the bandwidth).

Suppose RTX IO / DirectStorage could sustain 7 GByte/s reads from the SSD at all times, and their lossless algorithm has an average 2:1 compression ratio (i.e. half the original size) - it does not really follow that decompression takes zero time with no significant delay, or that the assigned GPU cores can always keep up with any dataset to double the bandwidth at any input rate.
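Just to put a number on that - a minimal sketch, assuming the sustained 7 GByte/s read rate and 2:1 ratio above (both assumptions, not measured figures):

```python
# Illustrative only: the output rate the decompression kernels would need to
# sustain under the assumptions stated above.

ssd_read_gbs = 7.0        # assumed sustained compressed read rate (GByte/s)
compression_ratio = 2.0   # assumed average 2:1 lossless ratio

required_output_gbs = ssd_read_gbs * compression_ratio
print(f"Decompressed output to keep up with: {required_output_gbs:.0f} GByte/s")  # 14

# Any stretch where the kernels fall below this rate either stalls the SSD reads
# or piles up compressed data in a video-memory staging buffer.
```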
 

There have been various quotes from Nvidia about how small the performance impact is. Here's one but I've seen at least a couple of others along the same lines:

https://www.back2gaming.com/guides/nvidia-rtx-io-in-detail/

"When asked about the performance hit of RTX IO on the GPU itself, an NVIDIA representative responded that RTX IO utilizes only a tiny fraction of the GPU, “probably not measurable”."
 
There have been various quotes from Nvidia about how small the performance impact is.
NVIDIA representative responded that RTX IO utilizes only a tiny fraction of the GPU
7 GByte/s is indeed a fraction of the total video memory bandwidth, whether that's 200 GByte/s in a mid-range card or 1 TByte/s in a high-end card. However decompression overhead is another thing - traditional lossless algorithms are based on dictionary coding, which is not easily parallelized even with large blocks. LZ78/LZX/LZW-based algorithms surely can't sustain ~30 GByte/s output even on top-tier HEDT CPUs with 32 or more threads, and GPU implementations have not shown any significant speed-up compared to CPUs.
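To illustrate the dependency problem: in an LZ77-style decoder every match copies bytes that were themselves just produced, so output byte N can depend on output byte N-1. A toy decoder (deliberately simplified, not any particular production format) makes the serial chain obvious:

```python
# Toy LZ77-style decoder - deliberately simplified, not a real production format -
# showing why dictionary decompression resists naive parallelization.

def lz77_decode(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":                 # ("lit", byte_value)
            out.append(tok[1])
        else:                               # ("match", distance, length)
            _, dist, length = tok
            start = len(out) - dist
            for i in range(length):
                # A match may overlap the bytes it is producing, so each output
                # byte can depend on one written just before it. Splitting the
                # stream across threads only works if it is also split into
                # independently-compressed blocks, which costs ratio.
                out.append(out[start + i])
    return bytes(out)

print(lz77_decode([("lit", ord("a")), ("lit", ord("b")), ("match", 2, 6)]))
# -> b'abababab'
```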
 
Did you just ignore the whitepaper?

GeForce RTX GPUs will deliver decompression performance beyond the limits of even Gen4 SSDs, offloading potentially dozens of CPU cores’ worth of work to ensure maximum overall system performance for next-generation games. Lossless decompression is implemented with high performance compute kernels, asynchronously scheduled. This functionality leverages the DMA and copy engines of Turing and Ampere, as well as the advanced instruction set and architecture of these GPUs’ SMs. The advantage of this is that the enormous compute power of the GPU can be leveraged for burst or bulk loading (at level load for example) when GPU resources can be leveraged as a high performance I/O processor, delivering decompression performance well beyond the limits of Gen4 NVMe. During streaming scenarios, bandwidths are a tiny fraction of the GPU capability, further leveraging the advanced asynchronous compute capabilities of Turing and Ampere.
 
Let uncle Davros try and straighten everything out
You say "nvidia cant keep up with the demands of decompression"
nvidia say "yes we can"
 
Compression on the GPU is probably faster, more efficient and more flexible than what's in the consoles.
If it was all that on the GPU they wouldn't have bothered creating custom hardware blocks to do it and would have just invested in a beefier GPU. And you probably mean decompression.
 
If it was all that on the GPU they wouldn't have bothered creating custom hardware blocks to do it and would have just invested in a beefier GPU. And you probably mean decompression.

More flexible is a given. Faster is already confirmed, at least in relation to XSX, but very likely in relation to PS5 as well.

Efficiency almost certainly goes to the consoles though in terms of silicon budget and power draw given that fixed function hardware almost always beats similarly performing general purpose hardware in that regard.

And there's also the advantage of not having your limited GPU resources pulling double duty.
 
GPU decompression won't take much to exceed that of even the PS5, maybe 2TF or so. I posted it before, but on an early version 1.0 of GPU-based decompression they were getting 60-120 GB/s on a PS5. There's still possibilities of improvements, but even at first pass that's 6-12 GB/s per GPU TF used.
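A quick sanity check of that per-TF figure, using the 10.28 TF and 60-120 GB/s numbers quoted in the reposted material below (rough arithmetic only; the SSD rates and 2:1 ratio at the end are assumptions):

```python
# Rough arithmetic behind the "6-12 GB/s per TF" estimate (illustrative only).

ps5_tflops = 10.28
full_gpu_decode_gbs = (60, 120)   # quoted BC7Prep decode throughput for the whole GPU

for gbs in full_gpu_decode_gbs:
    print(f"{gbs} GB/s / {ps5_tflops} TF = {gbs / ps5_tflops:.1f} GB/s per TF")
# -> roughly 5.8 and 11.7 GB/s per TF

# Flipping it around (assumed figures): matching a 5.5 GB/s (PS5) or 7 GB/s (fast
# Gen4) SSD at an assumed 2:1 ratio means 11-14 GB/s of output, i.e. very roughly
# 1-2.5 TF of compute at the rates above.
```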

Here's a repost of the info from the Nvidia Ampere thread.

----------
RAD Game Tools comes up quite a few times in the Console Tech section; here are some posts with nice discussion about it. Be sure to read the full Twitter threads about it.

Mostly that you could get 60-120 GB/s of textures decompressed if you used the entire PS5 GPU (10.28 TF). An Ampere card has nearly that many TF to spare over and above the PS5.

Naturally, you wouldn't need to use that much, but it gives you an idea on how powerful the GPUs are when it comes to decompression.

https://forum.beyond3d.com/posts/2134570/

https://forum.beyond3d.com/posts/2151140/
https://forum.beyond3d.com/posts/2134405/


External references --

http://www.radgametools.com/oodlecompressors.htm
http://www.radgametools.com/oodletexture.htm
https://cbloomrants.blogspot.com/

Oodle is so fast that running the compressors and saving or loading compressed data is faster than just doing IO of uncompressed data. Oodle can decompress faster than the hardware decompression on PS4 and XBox One.



GPU benchmark info thread unrolled: https://threadreaderapp.com/thread/1274120303249985536

A few people have asked the last few days, and I hadn't benchmarked it before, so FWIW: BC7Prep GPU decode on PS5 (the only platform it currently ships on) is around 60-120GB/s throughput for large enough jobs (preferably, you want to decode >=256k at a time).

That's 60-120GB/s output BC7 data written; you also pay ~the same in read BW. MANY caveats here, primarily that peak decode BW if you get the entire GPU to do it is kind of the opposite of the way this is intended to be used.

These are quite lightweight async compute jobs meant to be running in the background. Also the shaders are very much not final, this is the initial version, there's already several improvements in the pipe. (...so many TODO items, so little time...)

Also, the GPU is not busy the entire time. There are several sync points in between so utilization isn't awesome (but hey that's why it's async compute - do it along other stuff). This is all likely to improve in the future; we're still at v1.0 of everything. :)
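One implication of those numbers worth spelling out: the decode pays roughly as much in read bandwidth as it writes, so full-GPU decode would also eat a noticeable slice of the memory bus (rough arithmetic, assuming the tweeted figures and the PS5's 448 GB/s GDDR6 bandwidth):

```python
# Rough memory-traffic implication of the quoted numbers (illustrative only).

decode_output_gbs = (60, 120)   # BC7 data written at full-GPU decode rates
ps5_mem_bw_gbs = 448            # PS5 GDDR6 bandwidth

for out_bw in decode_output_gbs:
    traffic = out_bw * 2        # "~the same in read BW" paid on top of the writes
    print(f"{out_bw} GB/s out -> ~{traffic} GB/s of memory traffic "
          f"({traffic / ps5_mem_bw_gbs:.0%} of the bus)")
# -> ~120 GB/s (27%) and ~240 GB/s (54%)

# Another reason this is meant to run as lightweight background async compute
# rather than at full-GPU throughput.
```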
 
You say "nvidia cant keep up with the demands of decompression"
nvidia say "yes we can"
No. Nvidia did not specify the algorithm or the maximum bandwidth, but if they're using LZ-family block compression, CUDA-based libraries typically reach a few GByte/s according to academic papers, so 28 GByte/s would be too high even if you consumed the entire GPU, not just 'a fraction of the GPU'.

The entire idea of DirectStorage is to free the CPU from loading and decompression tasks by streaming the data directly to video memory as fast as possible and using a dedicated hardware chip (on the Xbox) or compute units (on the PC) - which means their block compression algorithm has to be designed for simplicity and low decompression overhead, not for the best possible compression efficiency or processing bandwidth. Not sure why it is so hard to understand.


on an early version 1.0 of GPU-based decompression they were getting 60-120 GB/s on a PS5
Here's a repost of the info from the Nvidia Ampere thread.
Interesting, but BC7Prep is not a general-purpose compression algorithm - it's a more efficient version of BC7 texture compression from 2009, which has to be decoded to the baseline BC7 format before the actual TMUs can consume it.
 