DirectStorage GPU Decompression, RTX IO, Smart Access Storage

So maybe X670's PCH is better than X570's, or maybe it's really a PCIe 3.0 problem.
A PCH connection doesn't affect read speeds much when benchmarked with CrystalDiskMark; the losses are negligible. There is no other bus traffic either: the GPU is connected to CPU lanes, so it's pretty much SSD->PCH->CPU->GPU with nothing else eating the bandwidth.
It is surprising to see the 980 Pro showing 3X higher bandwidth when it's almost exactly 2X in pure read benchmarks in CDM. I wonder what's going on there.

CPU decompression is limiting both NVMe drives to ~6.3 GB/s, which is about twice the raw read speed of the 960 Pro, so that's working fine. It is a bit surprising, though, to see a 12C/24T 5900X not being faster than that here.
 
It's got to be some kind of bug. We've seen PCIe 3 drives getting much higher speeds than that before. I have one in my system, I'll try to get some time later to try it out there.
 
I'm getting 12.4GB/s on my 3.5GB/s PCIe 3 drive so it looks to be working as expected there. How do you force it to use CPU decompression?

I'm getting about 1.86GB/s on my HDD which seems crazy fast. I guess it might be using some solid state cache to speed things up there.

My SATA drive gets about 1.1GB/s so is slower than my HDD lol. Well it is pretty old but I expected a bit better than that!

Watching performance monitor in Task Manager is fascinating. GPU usage is actually higher on the slower drives than it is on the NVMe (by a lot). I wonder if that's just because it's averaging use over a longer period of time than the NVMe decompress takes. Also while both the HDD and SATA drives spike to 100% while the load takes place, the NVMe drive barely registers any use. Maybe the same reason as GPU above?
 
NVIDIA released another RTX Path Tracing remaster, this time of the popular Portal Prelude mod, with RTXIO integrated. NVIDIA is claiming 3X the texture loading speeds with RTXIO.


 
> I'm getting 12.4GB/s on my 3.5GB/s PCIe 3 drive so it looks to be working as expected there.
Is it CPU attached? I wonder if that's why it isn't as fast as you'd expect in my case.

> How do you force it to use CPU decompression?
-gpu-decompression {0|1} launch parameter. See docs here: https://github.com/microsoft/DirectStorage/blob/main/Samples/BulkLoadDemo/README.md
GPU decompression is on by default, so the only option you really need is -gpu-decompression 0.
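
For anyone scripting the GPU vs CPU comparison, here's a minimal Python sketch that builds the two command lines. Only the -gpu-decompression {0|1} flag comes from the README linked above; the executable name BulkLoadDemo.exe and the helper build_command are my own assumptions.

```python
def build_command(exe_path, gpu_decompression):
    """Return the argument list for one run of the demo.

    exe_path is a hypothetical path to the built sample; only the
    -gpu-decompression flag itself is taken from the thread.
    """
    flag_value = "1" if gpu_decompression else "0"
    return [exe_path, "-gpu-decompression", flag_value]


cpu_run = build_command("BulkLoadDemo.exe", gpu_decompression=False)
gpu_run = build_command("BulkLoadDemo.exe", gpu_decompression=True)
print(" ".join(cpu_run))
print(" ".join(gpu_run))
# To actually launch a run (assuming the demo is built and on Windows):
#   import subprocess; subprocess.run(cpu_run, check=True)
```

Passing the arguments as a list avoids any shell quoting issues if the path contains spaces.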
 
> Is it CPU attached? I wonder if that's why it isn't as fast as you'd expect in my case.

> -gpu-decompression {0|1} launch parameter. See docs here: https://github.com/microsoft/DirectStorage/blob/main/Samples/BulkLoadDemo/README.md
> It runs with it by default so the only needed option is -gpu-decompression 0 really.

Thanks! I get around 4.4GB/s on the CPU with the NVMe. And yes it is CPU attached.

EDIT: still around 1.1GB/s on the SATA drive with CPU decompression. It's clearly well below the CPU's capabilities as it's barely breaking 30% utilisation during loads.

EDIT 2: For the record the benchmark doesn't always give accurate results, at least wrt CPU utilisation. With CPU decompression on my SATA drive it was reading only about 5% CPU usage but I could clearly see from task manager that each load was taking around 20% of the CPU performance.
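
To put the figures quoted in this exchange side by side, here's a quick sketch of the ratio between the bandwidth the demo reports and the drive's raw read speed. The function name amplification is just for illustration, and it assumes the reported number is decompressed output while the 3.5GB/s rating is compressed data coming off the drive.

```python
def amplification(effective_gbps, raw_read_gbps):
    """Ratio of reported (decompressed) bandwidth to the drive's raw read speed."""
    return effective_gbps / raw_read_gbps


RAW_READ = 3.5  # GB/s, the PCIe 3 drive's rated read speed quoted in the thread

gpu_ratio = amplification(12.4, RAW_READ)  # GPU decompression result
cpu_ratio = amplification(4.4, RAW_READ)   # CPU decompression result

print(f"GPU path: {gpu_ratio:.2f}x raw read")  # GPU path: 3.54x raw read
print(f"CPU path: {cpu_ratio:.2f}x raw read")  # CPU path: 1.26x raw read
```

A ratio well above 1 is only possible because the data on disk is compressed; a ratio close to 1 suggests decompression, not the drive, is the bottleneck.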
 
I thought this was a new demo... but it's just the same one I compiled quite a while ago now.

Nice to hear about Portal Prelude RTX having RTXIO support though! Can't wait to download that and try it out.
 
Forza Motorsport will support DirectStorage on PC. And Half-Life 2 RTX will support RTX-IO.


 
> Forza Motorsport will support DirectStorage on PC. And Half-Life 2 RTX will support RTX-IO.


Yes, those requirements make sense, as it scales down to the Series S. Seems to scale very well, just like FH5 did.

Makes you wonder what exactly went wrong with that new UE5.1 game, Immortals, having a much higher min spec than the Series S.
 
> Makes you wonder what exactly went wrong with that new UE5.1 game, Immortals, having a much higher min spec than the Series S.
It's 720p at 60 on Series X. It likely is even lower than that on Series S while being limited to 30. UE5 is just a very heavy engine (not that surprising considering what it does and how it does it really).
 
> It's 720p at 60 on Series X. It likely is even lower than that on Series S while being limited to 30. UE5 is just a very heavy engine (not that surprising considering what it does and how it does it really).

It's also just not well optimized, or well built for that matter. Once you get beyond the big vistas where the artists actually spent time, there's a ton of generic copy-pasted corridors. I wouldn't take Immortals as a good benchmark for UE5, or at least not as a benchmark for a bigger budget studio. It's more of an indie vastly overstretching its budget (psst, over-ambitious devs, I'd take a good 10 hour campaign over a boring 20 hour one for the same price any day).
 
Semi-related question since it's something that might have been talked about in the past related to DirectStorage: what is the latency of modern PCI-Express?

Having worked on UMA GPUs in the past, I always assumed PCI-E latency was horrendous, but I wrote a few CUDA test programs and it looks to me like a fully uncached load from host memory is only ~1500 cycles on a 2.5GHz RTX 4090 (with an AMD CPU/motherboard) vs ~600 cycles for GPU DRAM. That's completely uncongested though, and I also stumbled on this NVIDIA interview where they claimed the main problem with PCI-E was tail latency rather than average latency: https://blocksandfiles.com/2022/07/13/nvidia-thoughts-on-composability-tail-latency-limits-cxl/

I'm not sure anyone here would know, but I figured it might be surprising to others as well that PCI-E latency is that low, it was certainly surprising to me!
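
For reference, the cycle counts above convert to wall-clock time as follows. This is only a back-of-the-envelope sketch: it assumes both counts were measured against the same 2.5GHz clock, and cycles_to_ns is just a helper name for illustration.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


CLOCK_GHZ = 2.5  # GPU clock quoted in the post

host_ns = cycles_to_ns(1500, CLOCK_GHZ)  # uncached load from host memory over PCI-E
dram_ns = cycles_to_ns(600, CLOCK_GHZ)   # load from GPU DRAM

print(f"host memory: ~{host_ns:.0f} ns")  # host memory: ~600 ns
print(f"GPU DRAM:    ~{dram_ns:.0f} ns")  # GPU DRAM:    ~240 ns
print(f"ratio:       {host_ns / dram_ns:.1f}x")  # ratio:       2.5x
```

So under those assumptions the uncongested PCI-E round trip costs roughly 2.5 times a local DRAM access, not orders of magnitude more.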
 
For whoever might care, I ran the DS 1.2 bulk load demo on my 3080Ti against two NVMe drives on different PCIe buses:

Samsung 980 Pro 1TB on a PCIe 3 m.2 interface:
[Attached image: Bulkload - 980Pro Pcie3.png]


Samsung 980 Pro 2TB on a PCIe 4 m.2 interface:
[Attached image: Bulkload - 980Pro Pcie4.png]



So, not quite double the performance for a theoretical doubling of interface bandwidth. Not sure if this helps @DegustatoR
 

The problem with DS as I see it is that it's now up to the developers to balance loading so that it doesn't eat up all the available bandwidth: bus, VRAM and potentially GPU processing throughput as well.

Previously, storage was a natural limit inherent to the system, through either its raw speed or its CPU overhead (which in turn limited how fast data could be read). Now this limit is suddenly several times higher, at which point loading starts competing with rendering for bandwidth even without the GPU doing decompression. It has to be hand tuned so that loading, even if it happens more slowly, has no impact on rendering performance.

Thus DS is just another thing that needs additional care from developers doing PC games or ports. It allows faster storage access with less CPU overhead, but that may not in fact be optimal: without hand-coded limits it may hurt the game's performance while providing benefits that are invisible to the user, like 1s loading instead of 1.5s.
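
One way to picture the "hand-coded limits" mentioned above is a per-frame byte budget for read requests. The sketch below is purely illustrative and rests on my own assumptions: schedule_reads, the request sizes and the budget are hypothetical and are not part of the DirectStorage API.

```python
from collections import deque


def schedule_reads(request_sizes, budget_per_frame):
    """Greedily split queued reads into per-frame batches under a byte budget.

    Requests are submitted in order. Each frame submits reads until the next
    one would exceed the remaining budget. A frame always submits at least
    one request, so a single oversized read cannot stall the queue forever.
    """
    pending = deque(request_sizes)
    frames = []
    while pending:
        remaining = budget_per_frame
        batch = []
        while pending and (not batch or pending[0] <= remaining):
            size = pending.popleft()
            batch.append(size)
            remaining -= size
        frames.append(batch)
    return frames


# Sizes in MB, with a hypothetical budget of 8 MB per frame.
print(schedule_reads([4, 4, 4, 10, 2], 8))
# [[4, 4], [4], [10], [2]]
```

Loading takes more frames this way, which is exactly the trade described above: slower loading in exchange for a predictable share of bandwidth left for rendering.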
 
> Avatar is the third (?) title using DirectStorage, v1.2 according to DLLs in the game's folder.
> Not sure if it's using GPU decompression though.
Anyone know if it's been confirmed to be in use? The Agility SDK ships with these DStorage files by default but it's not really a sign that they are used in any way. Diablo 4 also ships with those files but the application never actually loads them.

On a side note, I find it both funny and sad how much the hype surrounding DStorage has died. The new Forza supports it but people seemingly no longer care enough to have tested it in any meaningful way.
 