DirectStorage GPU Decompression, RTX IO, Smart Access Storage

@DegustatoR, How much of that performance difference do you think is due to the drives and how much is due to what supplies the pci-e lanes?
need to check what supplies the m2 slots with pci-e lanes...

Dunno. From my general observation, a chipset-attached drive seems to work about as well as it did back when it was attached to the CPU. Haven't seen any issues running anything off it either.
 
BypassIO is just one part of DS. In fact, 3DMark says that it's disabled for me on all drives, NVMe included. Probably because of my AV.
It's a rather essential part; no BypassIO... no DirectStorage:
[attached screenshots]

Hence my surprise when people talk like SSDs/HDDs are "supported"... as Microsoft/Windows 11 is telling me that my SSDs are not supported by DirectStorage:
[attached screenshot]
 
It's a rather essential part; no BypassIO... no DirectStorage
Not really, as you can see from the results on the previous page.

[benchmark results screenshot]


I honestly have no idea what the hell that means. None of my drives have BitLocker on them.

Edit: Or maybe they do? Damn, the 24H2 clean install seems to have enabled it.
 
Not really, as you can see from the results on the previous page.

[benchmark results screenshot]


I honestly have no idea what the hell that means. None of my drives have BitLocker on them.
Your results and this:
[attached screenshot]

should tell you that it does nothing for SSDs.
All the "benefits" come after system memory.

I think we are miscommunicating.
I am talking about the "red box" I drew in that picture.

NVMe drives gain speed there; SSDs do not, nor do HDDs.
 
Should tell you that it does nothing for SSDs.
It does quite a bit for "SSDs" (aka SATA SSDs), as my results are showing.
As I've said, DS isn't just BypassIO, and the latter isn't really even that important.

On a side note, I think it's hella fun that BypassIO is not compatible with BitLocker...
 
I just marked it in red to show that the SSDs do not support DirectStorage... only NVMe drives do that :LOL:
DirectStorage can be used with SATA SSDs (which you keep calling just "SSDs" for some reason) and even HDDs as of v1.2.
BypassIO is not the only feature of DS, and it is not required for the DS API to operate. It provides some CPU overhead wins if present, but that's all.
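For reference, this is roughly the shape of a DirectStorage read on the API side. A minimal sketch only: the file path, sizes, and the destination buffer are hypothetical placeholders, and error handling is omitted. Note that nothing in it depends on whether the file sits on an NVMe drive, a SATA SSD, or an HDD; the runtime picks the IO path (BypassIO or the regular filesystem stack) behind the API.

C++:
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Enqueue one compressed read: file -> (GPU GDeflate decompression) -> VRAM buffer.
void LoadAsset(ID3D12Device* device, ID3D12Resource* destBuffer,
               ID3D12Fence* fence, UINT64 fenceValue)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets/level0.gdeflate", IID_PPV_ARGS(&file)); // hypothetical path

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    DSTORAGE_REQUEST request{};
    request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    request.Source.File.Source          = file.Get();
    request.Source.File.Offset          = 0;
    request.Source.File.Size            = 8 << 20;  // compressed size on disk (example)
    request.UncompressedSize            = 16 << 20; // decompressed size (example)
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = 16 << 20;

    queue->EnqueueRequest(&request);
    queue->EnqueueSignal(fence, fenceValue); // signaled once the data is in VRAM
    queue->Submit();
}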
 
It does quite a bit for "SSDs" (aka SATA SSDs), as my results are showing.
As I've said, DS isn't just BypassIO, and the latter isn't really even that important.

On a side note, I think it's hella fun that BypassIO is not compatible with BitLocker...
No, it does something to the system AFTER the SSD; your own data shows this:
980 NVMe (CPU attached), increase in bandwidth:
[attached screenshot]

960 NVMe (chipset attached), increase in bandwidth:
[attached screenshot]
(You should verify your motherboard's bifurcation; some NVMe slots share bandwidth with other devices, such as SATA ports, and thus never get full bandwidth.)

Now the real pudding:
SATA SSD (non-NVMe)
[attached screenshot]

NO bandwidth increase
All the increase comes AFTER the data has entered RAM, and thus the drive is no longer a factor.
SSDs/HDDs are not "supported"; the system might do something after the data has left the drives, but the drives themselves get nothing from DirectStorage, unlike NVMe drives.

So I think we have a miscommunication... but SSDs/HDDs get nothing from DirectStorage. If you look at your Windows Game Bar, I bet it will tell you the drives are not supported.

Your own data shows this, quite clearly ;)
 
No, it does something to the system AFTER the SSD; your own data shows this
So? That's still the result of using DirectStorage.

NO bandwidth increase
BypassIO doesn't grant any b/w increase. It lowers CPU overhead on data reads, which means that more data can be read from storage at a similar CPU usage. If your CPU is fast enough to handle reads at storage peak speed w/o BypassIO, then BypassIO won't net you anything. Which is likely the main reason why it's not supported on slow SATA devices.

You should verify your motherboard's bifurcation
I shouldn't do anything of the sort. The 960 Pro is a PCIe 3.0 NVMe device, and it shows roughly half the speed of the CPU-attached PCIe 4.0 980 Pro here.
I will disable BitLocker though (don't need it on the desktop, thanks MS) and check if that improves the results.
 
So? That's still the result of using DirectStorage.
Not on the SSD it's not ;)

BypassIO doesn't grant any b/w increase. It lowers CPU overhead on data reads, which means that more data can be read from storage at a similar CPU usage. If your CPU is fast enough to handle reads at storage peak speed w/o BypassIO, then BypassIO won't net you anything. Which is likely the main reason why it's not supported on slow SATA devices.
You forget... NVMe drives are not using the SATA protocol.
All SATA drives (SSD or HDD) test the same in "Avocado":
[attached screenshot]
SATA being the limiting factor there.

Again, Microsoft Game Bar will tell you the exact same thing.
Your SATA SSD is not supported 🤷‍♂️
 
Not on the SSD it's not ;)
Yes, on the "SSD" it is also the result of using the DirectStorage API.
Dunno why we keep arguing about this.

SATA being the limiting factor there.
Exactly, so using BypassIO on them is likely pointless, as you'd need to go back to a Pentium III to see any performance gains.
Still doesn't mean that you can't use DS on SATA devices and get a boost from decompression.
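Back-of-the-envelope, with assumed numbers (a ~520 MB/s practical SATA ceiling and a ~2:1 GDeflate ratio, both illustrative rather than measured):

C++:
#include <cstdio>

int main()
{
    const double sataReadMBps  = 520.0; // assumed practical SATA 6Gb/s ceiling
    const double gdeflateRatio = 2.0;   // assumed compression ratio

    // The drive still delivers only ~520 MB/s of *compressed* bytes, but the
    // GPU inflates them, so the application sees roughly double the asset data.
    std::printf("effective uncompressed rate: %.0f MB/s\n",
                sataReadMBps * gdeflateRatio); // ~1040 MB/s
}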
 
All SATA drives (SSD or HDD) test the same in "Avocado":
There's an interesting reason for that. Most/all HDDs have DRAM caches in the 64MB-256MB range, and the reason that HDDs/SSDs perform the same in this test is that Avocado.marc is only a small 2MB file.

But you can try this with a larger 700MB version of this test here, which includes a mix of GDeflate, Zlib, and uncompressed assets. Performance is unaffected on SATA SSDs, but HDDs completely crumble under this larger load because it no longer fits in cache.

Regarding BypassIO, remember that it's unsupported on Win10, yet DirectStorage runs fine without it. It's an optional feature, not a required one.
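If anyone wants to measure what BypassIO alone contributes, DirectStorage 1.1+ lets you turn it off per process before the factory is created; a minimal sketch below (on Win11 you should also be able to query per-volume support with "fsutil bypassIo state C:\"):

C++:
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a factory with BypassIO forced off, for A/B benchmarking against the default.
ComPtr<IDStorageFactory> MakeFactoryWithoutBypassIO()
{
    DSTORAGE_CONFIGURATION config{};   // zero-init = library defaults
    config.DisableBypassIO = TRUE;     // force the regular filesystem IO stack
    DStorageSetConfiguration(&config); // must be called before DStorageGetFactory

    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));
    return factory;
}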
 
Which is likely the main reason why it's not supported on slow SATA devices.
The driver model for SATA and NVMe behaves differently. SATA requires serialization of all communication with the drive onto a single thread. That adds synchronization overhead as a baseline that you can't eliminate in the first place. In the opposite direction, it also requires a scatter from a small staging buffer visible to the controller to the user-space mapped memory.

There is hardly a point in eliminating the entire rest of the storage stack if you still can't get rid of the most expensive implementation detail that's ruining your IOPS and storage latency in any threaded scenario.

Even without BypassIO, NVMe simply scales a lot better in threaded benchmarks. It's as simple as "each logical CPU core gets a dedicated NVMe protocol queue for talking with the drive", and the NVMe device speaks linear memory with no staging buffer.
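To illustrate the queue-depth point from user space (this is plain Win32 overlapped IO, not the kernel NVMe queue pairs themselves; the path, block size, and depth are made up, and the file is assumed large enough):

C++:
#include <windows.h>
#include <vector>

// Issue many reads up front instead of one at a time: NVMe can service these
// from parallel hardware queues, while SATA must serialize them.
void ReadWithQueueDepth(const wchar_t* path)
{
    // FILE_FLAG_NO_BUFFERING + FILE_FLAG_OVERLAPPED: skip the cache, allow async reads.
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, nullptr);

    constexpr int   kDepth     = 32;      // outstanding requests
    constexpr DWORD kBlockSize = 1 << 20; // 1 MiB per request (sector-aligned)

    std::vector<OVERLAPPED> ov(kDepth);
    std::vector<void*>      buf(kDepth);

    for (int i = 0; i < kDepth; ++i)
    {
        buf[i] = VirtualAlloc(nullptr, kBlockSize, MEM_COMMIT, PAGE_READWRITE);
        ov[i] = {};
        ov[i].Offset = i * kBlockSize; // all requests in flight at once
        ReadFile(h, buf[i], kBlockSize, nullptr, &ov[i]);
    }
    for (int i = 0; i < kDepth; ++i)
    {
        DWORD bytes = 0;
        GetOverlappedResult(h, &ov[i], &bytes, TRUE); // wait for each completion
        VirtualFree(buf[i], 0, MEM_RELEASE);
    }
    CloseHandle(h);
}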



On an unrelated side note, that benchmark is doing really weird stuff:
  • The "RAM to VRAM" with "DirectStorage off" test is done using a shader - not using the copy engine which would had been optimized for steady ideal utilization of the PCIe bus.
    • A shader running with only a single thread group. Which ends up being stalled 90% and not properly establishing a sufficiently deep queue for host memory reads, and also fails to even wake up the GPU from idle clocks...
    • Doing a host<->device transfer on the 3D engine is not a smart choice either... With "DirectStorage on", it's doing the copy on the copy engine instead, as it should have done from the start. And surprise, transfer performance doesn't come crashing down on an idle GPU...
    • No comparison whatsoever is done with a host-initiated copy to host-visible VRAM, which tends to perform much better - usually at full PCIe link speed...
  • The "Storage to VRAM" test appears to be so poorly optimized, it's not even remotely hitting a 100% active time of the NVMe drive (more like 50-60%?), indicating a lack of queue depth and/or threading.
  • The "Storage to VRAM" performance number displayed above "DirectStorage on" is off. It's pre-multiplied by the GDeflate compression ratio?...
    • For a strongly parallel / async resource loader, real world performance would approach min("Storage to RAM", "RAM to VRAM", "GDeflate Input rate").
    • For a single-threaded, sequential loader, it's 1 / (1/"Storage to RAM" + 1/"RAM to VRAM" + 1/"GDeflate Input rate"); worked numbers in the sketch after this list.
    • By mixing up input and output rates, it's an apples & pears comparison.
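For what it's worth, a worked example of the two loader models above, using made-up rates rather than numbers from the screenshots:

C++:
#include <algorithm>
#include <cstdio>

int main()
{
    const double storageToRam = 6.5;  // GB/s, hypothetical NVMe read rate
    const double ramToVram    = 12.0; // GB/s, hypothetical PCIe upload rate
    const double gdeflateIn   = 20.0; // GB/s, hypothetical GDeflate input rate

    // Fully parallel/async loader: the slowest stage is the bottleneck.
    double parallel = std::min({storageToRam, ramToVram, gdeflateIn});

    // Single-threaded sequential loader: stage times add up, so the combined
    // rate is the harmonic composition of the three.
    double sequential =
        1.0 / (1.0 / storageToRam + 1.0 / ramToVram + 1.0 / gdeflateIn);

    std::printf("parallel  : %.2f GB/s\n", parallel);   // 6.50
    std::printf("sequential: %.2f GB/s\n", sequential); // ~3.48
}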

On an extra unrelated side note - it appears that with DirectStorage, Microsoft has taken control over the association of copy engines to transfer directions, at last. That used to be one of the worst pitfalls in a transparent, SHARED multi-engine concept, where even though from the device side everyone was SUPPOSED TO use one of the engines for upstream and the other one for downstream, that detail was completely left open in the API design. Still fucked up though, in the sense that it was still left as an opaque implementation detail. Looks like MS simply went for "copy engine 0 for downloads from device to host, copy engine 1 for uploads from host to device", mirroring the association that Nvidia had established for their (also opaque) CUDA implementation. (Copy engine 2 - if present - is for device-to-device transfers.)
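A sketch of that convention from the application side: one D3D12 copy queue per transfer direction, so the driver can map them onto separate copy engines. Which hardware engine actually backs each queue remains an opaque driver decision; the upload/download split here is an application-side convention, not something the API guarantees.

C++:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

struct CopyQueues
{
    ComPtr<ID3D12CommandQueue> upload;   // host -> device transfers
    ComPtr<ID3D12CommandQueue> download; // device -> host transfers
};

// Create two dedicated copy queues and keep each one to a single direction.
CopyQueues CreateDirectionalCopyQueues(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc{};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;

    CopyQueues q;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q.upload));
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q.download));
    return q;
}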
 