DirectStorage GPU Decompression, RTX IO, Smart Access Storage

This doesn't make any sense though. There are no changes in 1.2 compared to 1.1 that would lead to such gains on an NVMe SSD.

from the vid:

"But the most interesting one is a performance improvement that comes by way of moving the copy after GPU decompression onto the compute queue. This will provide a fairly big performance boost to applications that support GPU decompression.

As can be seen from the video, the Gen5 SSD in particular benefits greatly from this change. In DirectStorage 1.1, it was being held back and its performance was virtually identical to the Gen4 drive. Now, the Gen5 SSD is able to flex its muscles a bit more. There are still more improvements coming to DirectStorage. We know that Direct Memory Access support is on the roadmap, according to Microsoft, which should increase performance even further."
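
In D3D12 terms, the change being described boils down to something like this minimal sketch: record the post-decompression VRAM-to-VRAM copy on a compute queue rather than a copy queue. Resource names, setup, and synchronization are my assumptions here - this is an illustration of the idea, not the actual DirectStorage runtime code.

```cpp
// Minimal sketch: post-decompression VRAM-to-VRAM copy on a compute queue.
// Fences/barriers against the decompression dispatch and all error handling
// are omitted for brevity.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CopyOnComputeQueue(ID3D12Device* device,
                        ID3D12Resource* decompressedScratch, // VRAM staging (assumed)
                        ID3D12Resource* finalResource,       // VRAM destination (assumed)
                        UINT64 byteCount)
{
    // Compute queues run on the GPU's compute engines, so this copy moves at
    // full VRAM bandwidth instead of the copy engines' link-saturation speed.
    D3D12_COMMAND_QUEUE_DESC qd = {};
    qd.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&qd, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12CommandAllocator> alloc;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE,
                                   IID_PPV_ARGS(&alloc));
    ComPtr<ID3D12GraphicsCommandList> cmdList;
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE, alloc.Get(),
                              nullptr, IID_PPV_ARGS(&cmdList));

    // CopyBufferRegion is legal on compute command lists.
    cmdList->CopyBufferRegion(finalResource, 0, decompressedScratch, 0, byteCount);
    cmdList->Close();

    ID3D12CommandList* lists[] = { cmdList.Get() };
    computeQueue->ExecuteCommandLists(1, lists);
}
```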
 
This is only true for GPUs where such a move would actually lead to a performance improvement - and I'm not sure why that would be the case on modern AMD/Nvidia GPUs with dedicated h/w DMA/copy queues.
 
It's explained in the vid. The copy is VRAM to VRAM, and it runs faster on the compute queue than using the copy engines, which are optimized for RAM-to-VRAM transfers.
"Optimized" is a strange way of saying "limited to". Because that's what the copy engines really are, they are throttled down to the speed necessary to saturate the external interfaces. Nothing beyond that. They also only support format conversions typical for that interface boundary.

That actually goes to pretty extreme extents on professional series Nvidia hardware where you got up to 3 copy engines, each tailored to saturate exactly one of the links (2x PCIe, 1x NVlink), and they are exactly matched so that you can reach full duplex transfer on the PCIe as well as (half) duplex transfer on the NVLink (the other partner making it full duplex), but that's all they are good for.

I think I mentioned it in the past - it's really weird how neither DirectX nor Vulkan so far exposed this difference between the different copy engines, nor did any driver enforce this. Because that's the odd part - full duplex operation e.g. only works at peak efficiency when all applications agree on scheduling transfers in a specific direction only exclusively onto the same instance. If you mix, you got stalls. And you actually don't get any form on overlapping execution of transfers on these engines either - you schedule badly and your bus drops down to effectively being merely half-duplex.

You'd had expected that to be also cleared out now that we've got such a strong focus on IO efficiency.

Well, not that it matters now with DMA on the horizon, which ultimately removes the copy engines from the equation entirely. Still sad how little love they got, for how friendly they were to your CPU.
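
For what it's worth, the closest an application can get to that discipline through the public APIs is to dedicate one copy queue per transfer direction and hope the driver maps them onto separate engines. A minimal sketch, with the caveat that this is a convention an application can follow, not something the API guarantees:

```cpp
// Sketch of the scheduling discipline described above: one copy queue per
// transfer direction, so the driver can (hopefully) map each onto its own
// copy engine and the PCIe link stays full duplex. Neither D3D12 nor Vulkan
// lets you pick the engine instance explicitly.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

struct TransferQueues
{
    ComPtr<ID3D12CommandQueue> upload;   // host -> VRAM transfers only
    ComPtr<ID3D12CommandQueue> readback; // VRAM -> host transfers only
};

TransferQueues CreateTransferQueues(ID3D12Device* device)
{
    TransferQueues q;
    D3D12_COMMAND_QUEUE_DESC qd = {};
    qd.Type = D3D12_COMMAND_LIST_TYPE_COPY;

    // Two distinct queues: mixing directions on either one risks the stalls
    // described above and effectively drops the bus to half duplex.
    device->CreateCommandQueue(&qd, IID_PPV_ARGS(&q.upload));
    device->CreateCommandQueue(&qd, IID_PPV_ARGS(&q.readback));
    return q;
}
```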
 
"Optimized" is a strange way of saying "limited to". Because that's what the copy engines really are, they are throttled down to the speed necessary to saturate the external interfaces. Nothing beyond that. They also only support format conversions typical for that interface boundary.

That actually goes to pretty extreme extents on professional series Nvidia hardware where you got up to 3 copy engines, each tailored to saturate exactly one of the links (2x PCIe, 1x NVlink), and they are exactly matched so that you can reach full duplex transfer on the PCIe as well as (half) duplex transfer on the NVLink (the other partner making it full duplex), but that's all they are good for.

I think I mentioned it in the past - it's really weird how neither DirectX nor Vulkan so far exposed this difference between the different copy engines, nor did any driver enforce this. Because that's the odd part - full duplex operation e.g. only works at peak efficiency when all applications agree on scheduling transfers in a specific direction only exclusively onto the same instance. If you mix, you got stalls. And you actually don't get any form on overlapping execution of transfers on these engines either - you schedule badly and your bus drops down to effectively being merely half-duplex.

You'd had expected that to be also cleared out now that we've got such a strong focus on IO efficiency.

Well, not that it matters now with DMA on the horizon, which ultimately removes the copy engines from the equation entirely. Still sad how little love they got, for how friendly they were to your CPU.

With those kinds of limitations, I guess it's not surprising that there were such big speed-ups to be had when moving to faster drives.
 
Nvidia have once again updated and improved their DirectStorage GPU decompression performance.


And as a reminder to everyone, that's 33GB/s on "only" a 3080Ti. So much more should be achievable on a faster GPU as long as we're not hitting a driver limit.

Also if a 3080Ti can hit this kind of performance then it shouldn't take huge resources to hit PS5-like levels, which are only 1/3 of this at max, let alone a more normal in-game streaming rate, which is probably going to be a third of that again or less most of the time.
 
Yeah. If we divide the bandwidth they're getting (33GB/s) by the raw read speed of the SSD they're using (12.5GB/s), we get the effective compression ratio: ~2.64x.

So the max theoretical bandwidth currently, with a 16GB/s PCIe Gen5 drive, would be up to ~42GB/s.


Of course this ONLY refers to this specific benchmark.
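
For anyone who wants to plug in their own numbers, here's the arithmetic spelled out (the 33GB/s and 12.5GB/s figures are the ones quoted above; the 16GB/s Gen5 drive is a projection, not a measurement):

```cpp
// Effective compression ratio and projected Gen5 output from the quoted figures.
#include <cstdio>

int main()
{
    const double observed_output_gbps = 33.0;  // decompressed data delivered
    const double ssd_read_gbps        = 12.5;  // raw read speed of the drive

    const double ratio = observed_output_gbps / ssd_read_gbps;   // ~2.64x

    const double gen5_read_gbps = 16.0;        // hypothetical fast Gen5 drive
    std::printf("compression ratio: ~%.2fx\n", ratio);
    std::printf("projected Gen5 output: ~%.0fGB/s\n", gen5_read_gbps * ratio);
    return 0;
}
```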
 
Nvidia have once again updated and improved their DirectStorage GPU decompression performance.

This doesn't make much sense though. If the gains observed there are because of GPU decompression improvements in newer drivers, then they should be present on a Gen4 SSD too, no?
 
Not if it doesn't have the raw bandwidth to show off the improvements.

The driver improvements could also be taking advantage of hardware that's specific to Gen5 SSDs.
 
GPU decompression is a part of the pipeline: you do a compressed data read and send the data to the GPU, which then decompresses it.
If the latter part happens faster in newer drivers, then the overall results should improve on all PCIe generations.
 
A GPU can only be sent so much data per second on Gen4... The improvements may allow decompression bandwidth to go higher, but you're limited by the speed at which you can send the raw data over the PCIe link.
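
A back-of-the-envelope model of that bottleneck, with purely illustrative numbers (not measurements):

```cpp
// Output rate = compressed transfer rate (capped by both drive and link)
// times the compression ratio, capped by the GPU decompressor itself.
// A faster decompressor only shows up once the link stops being the limit.
#include <algorithm>
#include <cstdio>

double OutputRateGBps(double ssd_read, double pcie_link,
                      double gpu_decomp_output, double ratio)
{
    const double compressed_in = std::min(ssd_read, pcie_link);
    return std::min(compressed_in * ratio, gpu_decomp_output);
}

int main()
{
    // A Gen4 drive behind a ~7GB/s link vs a Gen5 drive behind a ~14GB/s
    // link, ratio ~2.64x, hypothetical decompressor good for 40GB/s out.
    std::printf("Gen4: %.1f GB/s\n", OutputRateGBps(7.0, 7.0, 40.0, 2.64));   // link-bound
    std::printf("Gen5: %.1f GB/s\n", OutputRateGBps(12.5, 14.0, 40.0, 2.64)); // drive-bound
    return 0;
}
```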
 
Not sure if this has been discussed and theorized about yet, but with respect to GPU decompression, has there been any talk about the implications of DRM and/or encryption? The issue would be that an actual implementation on the PC side becomes problematic if, for instance, you could not bypass routing the data through system memory and having the CPU do that work.
 

Had never considered that. I'm not really sure how DRM works, but I wouldn't expect it to get in the way of loading textures, models… but maybe it would? What exactly does DRM normally do to verify software is legitimate and licensed as you play?
 

My understanding with DRM now is that there are often layered solutions - sometimes an in-house proprietary system that actually checks the license, plus a commercial solution that protects that DRM. A related concern is that the DRM mechanism is often really an anti-tamper one, particularly with respect to games that implement MP.

The other concern is more whether asset encryption is involved. Some form of basic asset encryption is relatively common, to guard against asset ripping. Even though the solutions are often rather rudimentary and bypassable (not just for ripping, but also for people looking to mod), I'm wondering if that would complicate a GPU decompression implementation if the CPU has to handle the decryption.

Part of this is also just me wondering about the possible reasons behind a slower (and yet to happen) implementation, or the scarcity of mentions of future implementations, of GPU decompression.
 
A compiled BulkLoadDemo from DS 1.2 with some results.

Ran it on my drives and here's what I've got:

[attached: screenshot of BulkLoadDemo results]
 
I'm a bit surprised by my old 960 Pro result - it seems rather low in comparison to what the 980 Pro shows.
Wonder if it's a PCIe 3.0 issue or a PCH connection one.

Could it be limited by the decompression process? That's certainly the case when the CPU is involved.
I have tried it on a PCIe 5.0 SSD (Crucial T700) and it's only slightly faster than your 980 Pro at 0.30 seconds (~29GB/s).

[EDIT] Just realized that you are running them on the same computer? If that's the case then maybe the effect of the PCH is worse than expected...
[EDIT2] I tried to run it on other NVMe drives on the same computer; one is a WD SN850X and the other a WD SN850, and both have very similar read performance (both are PCIe 4.0).
My understanding is that one of the drives is connected to the CPU and the other to the X670 chipset.
It takes ~0.38 seconds (almost the same as your 980 Pro result) on both drives, so maybe the X670's PCH is better than the X570's, or maybe it's really a PCIe 3.0 problem.
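
For converting these timings into bandwidth: working backwards from the quoted 0.30 seconds ≈ 29GB/s, the payload appears to be roughly 8.7GB. That size is an inference from the quoted numbers, not a figure from the demo itself:

```cpp
// Rough sanity check on the timings above, using the inferred payload size.
#include <cstdio>

int main()
{
    const double payload_gb = 0.30 * 29.0;   // ~8.7GB (inferred, not measured)

    const double times_s[] = { 0.30, 0.38 };
    for (double t : times_s)
        std::printf("%.2f s -> ~%.0f GB/s\n", t, payload_gb / t);
    return 0;
}
```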
 