DirectStorage GPU Decompression, RTX IO, Smart Access Storage

Totally not related to DirectStorage, but I got a kick out of seeing an iPad, PS5, and NSW in the video. :p

[edit] So, a couple of things I take from that. Much like DX12 in general, a developer that isn't using this properly will actually worsen their game's storage I/O performance. Some storage I/O requests are still better serviced by the Win32 storage API. So, if a developer just naively dumps all of their storage I/O requests onto DirectStorage, it's likely that I/O performance in their game will be significantly worse.

Also, GPU decompression isn't supported yet and is something they are still working on.

Regards,
SB
I would HOPE... that any developers building games that really require, or take advantage of DirectStorage.. would know how to properly manage their I/O throughput more effectively than not.

Most developers I can see doing the bare minimum.. which should still produce better results than the win32 API. Also, I think once the GPU decompression element comes online, that alone could help many developers improve rapid load times and reduce CPU utilization during streaming.

Seems like there's reasons why it needs to be this complex though... to be able to support future developments in the API and of course future hardware implementations.
 
It's been a really boring gen thus far, unfortunately.

At least we have tech demos again.

I would HOPE... that any developers building games that really require, or take advantage of DirectStorage.. would know how to properly manage their I/O throughput more effectively than not.

Most developers I can see doing the bare minimum.. which should still produce better results than the win32 API. Also, I think once the GPU decompression element comes online, that alone could help many developers improve rapid load times and reduce CPU utilization during streaming.

Seems like there's reasons why it needs to be this complex though... to be able to support future developments in the API and of course future hardware implementations.

Agree on this. If a developer doesn't know how to take advantage of DS, the game probably wasn't worth it technically to begin with.
 
SquareEnix put up a video detailing Luminous Studios' Forspoken technologies.

At 1:53 it shows multiple DirectStorage comparisons on PC using different drives. Loading in just:

1st example
1.9s on an NVMe SSD
4s on a SATA SSD
21.5s on an HDD

2nd example
1.7s NVMe SSD
3.2s SATA SSD
19.9s HDD


It's hard to say how those will compare to the same drives NOT utilizing DirectStorage... but I gotta say... those are some impressive load times regardless.
 
1.7 seconds load time does sound very impressive indeed. Have they any load times for the ps5 version?
Indeed. Not that I'm aware of... but honestly I can't imagine it being much more impressive than that lol. I think ~2 sec is the usual "optimized for PS5" load time so far.

DS vs no DS comparison would be more interesting TBH.
From this one it seems that the NVMe drive is only about twice as fast as the SATA SSD, which to me suggests that the bottleneck is not storage read speed.
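A quick sanity check of that reading, using the load times quoted above. The raw read speeds in this sketch are assumed ballpark figures for typical drives, not numbers from the video, and `speedup` is just a helper name made up for the example.

```python
# Load times in seconds, as quoted from the Forspoken video.
load_times = {
    "1st example": {"nvme": 1.9, "sata": 4.0, "hdd": 21.5},
    "2nd example": {"nvme": 1.7, "sata": 3.2, "hdd": 19.9},
}

# ASSUMPTION: ballpark sequential read speeds in MB/s, for illustration only.
assumed_raw_mbps = {"nvme": 3500, "sata": 550}


def speedup(times, fast="nvme", slow="sata"):
    """Return how many times faster the `fast` drive loads than the `slow` one."""
    return times[slow] / times[fast]


# If loading were purely limited by raw reads, the speedup could approach this.
raw_ratio = assumed_raw_mbps["nvme"] / assumed_raw_mbps["sata"]

for name, times in load_times.items():
    print(
        f"{name}: NVMe loads {speedup(times):.2f}x faster than SATA "
        f"(assumed raw read ratio: ~{raw_ratio:.1f}x)"
    )
```

If the measured speedup sits far below the raw read ratio, as it does with these assumed drive speeds, something other than the drive is setting the pace.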

Yea, it's strange that they opted to not do a comparison NVMe DS on vs off.. but it's all we got for now.

This is without GPU based decompression too.. I wonder what kind of PC they were running for these results?
 
OK, The Verge actually has some answers to our questions:

https://www.theverge.com/2022/3/23/22993860/forspoken-pc-microsoft-directstorage-nvme-ssd-gdc

You might be wondering if that’s substantially faster than games run without DirectStorage, and Ono admits the answer is actually no, not yet: while you’ll definitely see a huge speed boost from an SSD over the magnetic spinning platters of a hard drive, and from an NVMe SSD over a slower SATA-based drive, the current implementation of DirectStorage in Forspoken is only removing one of the big I/O bottlenecks — others exist on the CPU.

And if you’ve been hoping that DirectStorage will bring enhanced performance with regular spinning hard drives, the answer is not yet: “HDDs are not delivering the anticipated result due to hardware performance limitations,” says Ono.

forspoken_ssd_speed_5.jpg


forspoken_ssd_speed_6.jpg


forspoken_ssd_speed_4.jpg



But, says Ono, “I/O is no longer a bottleneck for loading times” — the data transfer speeds of DirectStorage are clearly faster for SSDs, and they could improve it in future if they figure out other CPU bottlenecks and take full advantage of GPU asset decompression.

forspoken_ssd_speed_2.jpg



So yea, as we figured.. the game is no longer I/O bottlenecked, but there are other CPU bottlenecks which need to be worked out before it can get much faster. Obviously GPU based decompression is one such incoming feature to help remove the new bottlenecks.
 
When I saw the performance bandwidth chart, my first question was how SATA SSDs are able to go past 600MB/s. Then I realized that the "bandwidth" figures are likely decompression bandwidth (based on decompressed data size). This makes sense, because once they are able to do decompression on the GPU it's likely to be even faster.
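To put numbers on that idea: if the chart counts decompressed bytes, the reported figure is roughly the raw read speed times the compression ratio. A minimal sketch follows; the raw speed, compression ratio, and reported value are all made-up inputs for illustration, not values from the slides.

```python
SATA_LINK_LIMIT_MBPS = 600  # the SATA ceiling mentioned in the post


def effective_bandwidth(raw_mbps, compression_ratio):
    """Decompressed MB/s delivered when reading compressed data at raw_mbps."""
    return raw_mbps * compression_ratio


def implied_compression_ratio(reported_mbps, raw_mbps):
    """Compression ratio needed to explain a reported decompressed MB/s figure."""
    return reported_mbps / raw_mbps


# ASSUMPTION: hypothetical drive speed and compression ratio.
raw_sata_mbps = 550
ratio = 2.0

reported = effective_bandwidth(raw_sata_mbps, ratio)
print(f"Reported (decompressed) bandwidth: {reported:.0f} MB/s")
print(f"Exceeds the SATA link limit: {reported > SATA_LINK_LIMIT_MBPS}")
print(f"Implied ratio for a 900 MB/s reading: "
      f"{implied_compression_ratio(900, raw_sata_mbps):.2f}")
```

So a SATA drive can show up above 600MB/s on such a chart without the link itself ever carrying more than its limit.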
 
So the slides for the Forspoken presentation by AMD and SquareEnix were posted..

https://gpuopen.com/gdc-presentations/2022/GDC_Breaking_Down_The_World_Of_Athia.pdf (11MB)


Some good info there. Decompression and Asset Initialization are the big bottlenecks still preventing things from being faster. GPU decompression should in theory help alleviate both, since the decompression will obviously be quicker, and the data remains compressed until it hits the GPU.

At the end of the day... this is a game with ~1-3sec loading times... It doesn't really matter how we got there.. the point is that we ARE there. 1 to 3 seconds is, for all intents and purposes, "instant" loading. DirectStorage may not be much faster than the traditional Win32 API in this specific game.. but we still haven't seen CPU utilization comparisons yet. In-game world streaming is where DirectStorage can come into its own in the future.

Given this information, I'm going to assume the PS5 is slightly faster still. Possibly not even having the loading screen flash for a second at all. I can't wait to see the comparisons when they come. :)
 
It should be, for now, although over 5GB/s is still kinda fast. GPU decompression will make things a bit faster.
Well, the game isn't really utilizing more than 2.8GB/s from what they've shown.. anything more than that currently is bottlenecked by a different aspect of the CPU > GPU transfer. Yea, GPU decompression should help a bit in this case.. possibly because less data needs to be sent to the GPU over the PCIe bus in that case.

It will still be interesting. At 1-3 seconds.. it's in the realm of "I don't give a damn anymore" lol..
 
I meant the raw read speeds, which were around 5GB/s in the benchmark with DS activated.
 
AMD revealed Smart Access Storage, which is their name for DirectStorage support with GPU decompression and more. More details coming later.

should be added to thread title, maybe, along with whatever Intel makes up for theirs
 
And maybe point out what's at the core / on the kernel end in the first place?

https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations

It's ultimately just io_uring-like batched submission with gather/scatter semantics. Because that's really all you need in order to efficiently assemble a buffer for GPU transmission.
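To make "batched submission with gather/scatter" concrete, here is a toy model in plain Python. It is not the IoRing, io_uring, or DirectStorage API; `ReadRequest` and `BatchQueue` are hypothetical names for this sketch only. The point is the shape: requests are queued without doing any I/O, then serviced in one submit, with scattered file regions gathered straight into a single staging buffer.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class ReadRequest:
    file_offset: int  # where to read from in the file
    length: int       # how many bytes to read
    dest_offset: int  # where the bytes land in the staging buffer


class BatchQueue:
    """Toy stand-in for a submission queue: enqueue cheaply, submit once."""

    def __init__(self, path, dest):
        self.path = path
        self.dest = memoryview(dest)
        self.pending = []

    def enqueue(self, request):
        self.pending.append(request)  # no I/O happens here

    def submit(self):
        """Service all pending requests in one pass; return total bytes read."""
        total = 0
        with open(self.path, "rb", buffering=0) as f:
            # Service in file order, which a real backend is also free to do.
            for r in sorted(self.pending, key=lambda r: r.file_offset):
                f.seek(r.file_offset)
                window = self.dest[r.dest_offset:r.dest_offset + r.length]
                total += f.readinto(window)  # read directly into the buffer
        self.pending.clear()
        return total


# Demo: pull two scattered regions of a file into one contiguous buffer.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"AAAABBBBCCCCDDDD")
    path = tmp.name

staging = bytearray(8)
queue = BatchQueue(path, staging)
queue.enqueue(ReadRequest(file_offset=12, length=4, dest_offset=0))
queue.enqueue(ReadRequest(file_offset=4, length=4, dest_offset=4))
total = queue.submit()
os.remove(path)

print(bytes(staging), total)
```

A real ring does the same thing with shared submission and completion queues between user space and the kernel, so a whole batch costs one transition instead of one per read.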

Microsoft / Windows only? Nope. Linux users have had that for a while now. And for Windows users, it's also Windows 11 *only*. And - to be honest - still a long way off from what the Linux precursor has matured into.

That whole "decompressing stuff on the GPU" ain't new either. Heck, people started using software implementations of JPEG codecs and stuff over a decade ago, with more than just decent success.
 
There were some further details on AMD's Smart Access Storage in this interview.
No separate API or compression algorithm but improves performance/latency of DirectStorage for AMD systems.

Interesting. They do make it sound like they're bypassing the CPU and system memory completely (which DirectStorage alone still has to traverse), which suggests they're taking advantage of P2P DMA direct from SSD to GPU, something AMD platforms are certainly capable of at a hardware level. The SSD certification they're doing may be to ensure the SSDs have the requisite DMA capabilities.

If true, that does support the possibility that RTX IO (if it actually exists) is doing something similar.
 