DirectStorage GPU Decompression, RTX IO, Smart Access Storage

DegustatoR · Nov 8, 2022

pjbliverpool said:
"Can be implemented cheaply in fixed-function hardware, using existing IP" - so hardware based decompression appears to be a design goal and thus a definite possibility moving forwards.

I read this more as: you can use FF h/w for this, including already existing h/w if needed, but you're likely to get better results on a GPU anyway.

Remij · Nov 8, 2022

This is what I like to see (check the frametime graphs at the bottom)

On the CPU you can easily tell when the decompression is happening with that large cluster of spikes. On the GPU you can see two spikes, and those only happen because the scene is being reloaded, and then displayed after loading. In the middle, it's comparatively perfectly smooth as it decompresses. This test obviously shows a blank screen while it loads, but you can imagine that in streaming during gameplay scenarios... on the CPU you might see some hitches, while on the GPU it would remain perfectly smooth.

[Frametime graphs: CPU decompression (left) vs GPU decompression (right)]

DegustatoR · Nov 8, 2022

Remij said:
on the CPU you might see some hitches, while on the GPU it would remain perfectly smooth.

This depends on the decompression implementation I suppose. CPU decompression can (and really should, in the case of streaming) certainly be implemented without locking the rendering thread(s).
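
To make the "without locking the rendering thread" point concrete, here is a minimal sketch of the pattern. It assumes zlib as a stand-in codec (not GDeflate), a single worker thread, and a frame loop that only polls for finished results; `decompress_worker`, `drain` and `run_frames` are hypothetical names for illustration, not any engine's API.

```python
import queue
import threading
import zlib


def decompress_worker(jobs, results):
    """Inflate blobs from `jobs` on a background thread and post the results."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel: no more work
            break
        asset_id, blob = item
        results.put((asset_id, zlib.decompress(blob)))


def drain(results, loaded):
    """Collect whatever has finished so far without ever waiting."""
    while True:
        try:
            asset_id, data = results.get_nowait()
        except queue.Empty:
            return
        loaded[asset_id] = data


def run_frames(assets, frame_count):
    """Simulate a render loop that never blocks on decompression."""
    jobs, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=decompress_worker, args=(jobs, results), daemon=True
    )
    worker.start()

    # Queue up compressed assets (compressed here only to have test input).
    for asset_id, raw in assets.items():
        jobs.put((asset_id, zlib.compress(raw)))

    loaded = {}
    frames = 0
    for _ in range(frame_count):
        frames += 1             # stand-in for rendering one frame
        drain(results, loaded)  # non-blocking poll, the frame never stalls

    jobs.put(None)
    worker.join()               # only wait at shutdown, outside the frame loop
    drain(results, loaded)
    return frames, loaded


if __name__ == "__main__":
    demo = {"rock": b"r" * 4096, "tree": bytes(range(256)) * 16}
    frames, loaded = run_frames(demo, frame_count=10)
    print(frames, sorted(loaded))
```

The frame loop only ever calls `get_nowait`, so a slow asset delays when it shows up, not when the next frame is drawn. The worker still competes for cores and cache, which this sketch does not model.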

DegustatoR · Nov 8, 2022

The bigger issue here is that GDeflate doesn't seem to run that well on CPUs at the moment, which may be an issue when choosing how to store the assets in the first place.

Remij · Nov 8, 2022

DegustatoR said:
This depends on the decompression implementation I suppose. CPU decompression can (and really should, in the case of streaming) certainly be implemented without locking the rendering thread(s).

Yea you're right, but there are some games out there which don't do the best job of this.

Now I think Forza Horizon 5 is a good example of what I'm saying. The game has loading screens, but once it loads the open world, you're basically in there. If you fast travel somewhere, you'll get another quick loading screen, but it's super fast. FH5 uses videos for their traditional loading screens.. and of course the stuttering that happens during these loads is inconsequential, because you're not playing anything. However... when you enter races, the game cuts to your vehicle in the showroom, in real-time, and does some panning around.. and this loading can cause stuttering.. like this:

[Embedded video, at 1h29m]

There are also games which load behind pre-rendered video, and oftentimes those can have stuttering issues as well. I'm thinking GPU decompression helps solve a lot of those potential stutters.

pjbliverpool · Nov 8, 2022

Scott_Arm said:
Ryzen 7700X and 3080ti. The 3080ti can load the scene faster using about 90% gpu. The 7700X loads the scene about 0.3-0.4s slower using 100% of the CPU. That bench looks like it's designed to be pretty heavy. Really curious about benchmarks that take a streaming approach vs a full scene loading benchmark.

I'm struggling to follow that video tbh but I would assume that the GPU is simply being bottlenecked there by the SSD throughput whereas the CPU is clearly unable to keep up with the SSD throughput. I'm not seeing where you are getting the 90% GPU usage from though? I saw variation between 55-81% during what I assume was the actual decompression stage but it's not clear what was changing between runs to vary that usage.

Intel's demo, for example, shows a 2.7x speed-up on an Arc A770 over a 12900K while reducing CPU usage from 100% to 0%.

Jay · Nov 8, 2022

DegustatoR said:
The bigger issue here is that GDeflate doesn't seem to run that well on CPUs at the moment, which may be an issue when choosing how to store the assets in the first place.

Seems to run well enough to be a reasonable solution even on a CPU.
Guess if you're going to be using this, you're likely to have a reasonable minimum spec for your game.

Jay · Nov 8, 2022

Did wonder if one of the reasons we hadn't heard much about BCPack was due to waiting for a compatible PC way of packaging assets.
So even if this isn't directly BCPack compatible, by the sounds of it you may organise the assets the same, and maybe use gDeflate on PC and BCPack on XS.
Could see Forza:M being first to make big use of DS in full on PC & console.

DegustatoR · Nov 8, 2022

Jay said:
BCPack

Seems to be only a BCx texture compressor though? XS h/w supports both BCPack and Deflate - the latter should be compatible with GDeflate. BCPack seems to be DOA on PC for whatever reason - probably the need for a FF h/w decoder?

Jay · Nov 8, 2022

DegustatoR said:
Seems to be only a BCx texture compressor though? XS h/w supports both BCPack and Deflate - the latter should be compatible with GDeflate. BCPack seems to be DOA on PC for whatever reason - probably the need for a FF h/w decoder?

Thanks, couldn't remember if XS supported deflate or not.

I've given up on BCPack coming to PC until it actually happens, as everything around BCPack in general has been extremely quiet and NDA'd.
But the way you have to package /compress for gDeflate sounds like it would be similar for BCPack due to tile streaming.
So could use gDeflate on PC and BCPack on console, but the assets are structured the same way.

Until games are current gen only, there probably haven't been big enough reasons to implement it.

DegustatoR · Nov 8, 2022

Jay said:
So could use gDeflate on PC and BCPack on console, but the assets are structured the same way.

Probably more complex than that, as you can likely compress BCPack with Deflate even further, and also there's PS5 which doesn't support either (?) and just uses Kraken instead.
I'd say that you'll need (at least) three different asset stores for major platforms anyway.

Scott_Arm · Nov 8, 2022

pjbliverpool said:
I'm struggling to follow that video tbh but I would assume that the GPU is simply being bottlenecked there by the SSD throughput whereas the CPU is clearly unable to keep up with the SSD throughput. I'm not seeing where you are getting the 90% GPU usage from though? I saw variation between 55-81% during what I assume was the actual decompression stage but it's not clear what was changing between runs to vary that usage.

Intel's demo, for example, shows a 2.7x speed-up on an Arc A770 over a 12900K while reducing CPU usage from 100% to 0%.

Yah, looking at it again your numbers look to be more correct.

Jay · Nov 8, 2022

DegustatoR said:
Probably more complex than that, as you can likely compress BCPack with Deflate even further, and also there's PS5 which doesn't support either (?) and just uses Kraken instead.
I'd say that you'll need (at least) three different asset stores for major platforms anyway.

It's DirectStorage, it literally sits under DirectX.
So nothing to do with PS.
MS is about getting people to use their GDK for cross platform development PC and Xbox. So I expect them to make it as easy as possible.
I wasn't too sure if gDeflate would work well at 64k tile granularity so was half expecting it to be at the mip level.
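
On the tile granularity question, a rough sketch of what "compress each 64k tile independently" means in practice. It assumes zlib as a stand-in for GDeflate and a flat byte buffer standing in for a texture; `compress_tiles` and `read_tile` are hypothetical helpers, not DirectStorage calls.

```python
import zlib

TILE_SIZE = 64 * 1024  # 64 KiB, the tile granularity mentioned above


def compress_tiles(data, tile_size=TILE_SIZE):
    """Compress each tile as its own stream so tiles can be fetched alone."""
    return [
        zlib.compress(data[offset:offset + tile_size])
        for offset in range(0, len(data), tile_size)
    ]


def read_tile(tiles, index):
    """Decompress one tile without touching any of the others."""
    return zlib.decompress(tiles[index])


if __name__ == "__main__":
    texture = bytes(i % 251 for i in range(200_000))
    tiles = compress_tiles(texture)
    print(len(tiles), len(read_tile(tiles, 2)))
```

The trade-off is that each tile starts with an empty history window, so small independent streams usually compress somewhat worse than one big stream. That is the reason for wondering whether 64k is large enough to work well.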

But in terms of wider development as in PS5, gDeflate will be open source with a reference implementation available.
But even if it's not used on PS5, I expect a similar streaming method to be available making use of its hardware.
Just use a different compression format.

With the era of cross gen I don't think there's been much need in getting JIT streaming working on current gen.
Hoping that changes with current gen only & this release.

LordVulkan · Nov 9, 2022

Jay said:
Did wonder if one of the reasons we hadn't heard much about BCPack was due to waiting for a compatible PC way of packaging assets.
So even if this isn't directly BCPack compatible, by the sounds of it you may organise the assets the same, and maybe use gDeflate on PC and BCPack on XS.
Could see Forza:M being first to make big use of DS in full on PC & console.

GDeflate/Deflate and BCPack have completely different purposes.

GDeflate/Deflate (like RAD's Oodle Kraken) are lossless data compression formats. They can be used on everything, apply to all the data you send to the GPU, and can be implemented in HW blocks, as the PS5 and Xbox consoles have done.

BCPack and others (such as RAD's Oodle Texture) are texture compression encoding algorithms for S3TC formats (see https://en.wikipedia.org/wiki/S3_Texture_Compression). These formats are very widespread and are what the GPU ends up needing in VRAM for rendering. They are not decompressed before being written to VRAM.

So, every common workflow should first compress the textures to an S3TC format using whatever texture compression algorithm (on PC you can use Basis Universal, which Binomial donated to Khronos for standardization a few years ago) and then package everything with the lossless data compression algorithm.

This has been the common workflow for years, but the CPU was in charge of decoding each data stream. Now we have dedicated HW blocks for that purpose on consoles and, on PC, the GPU can do the job itself many times faster than the CPU, which lets it keep up with NVMe bandwidth.
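
A minimal sketch of the two-stage workflow described above. The block sizes are the standard ones (BC1 stores each 4x4 texel block in 8 bytes, BC7 in 16). The lossless stage uses zlib purely as a stand-in for GDeflate or Kraken, and the payload is zero-filled placeholder data rather than real encoder output. The function names are made up for illustration.

```python
import zlib

BC1_BYTES_PER_BLOCK = 8   # 4x4 texels in 8 bytes
BC7_BYTES_PER_BLOCK = 16  # 4x4 texels in 16 bytes


def block_compressed_size(width, height, bytes_per_block):
    """Size in bytes of one mip level stored in a 4x4 block format."""
    blocks_x = (width + 3) // 4
    blocks_y = (height + 3) // 4
    return blocks_x * blocks_y * bytes_per_block


def package(bc_payload):
    """Stage 2: lossless packing of already block-compressed data."""
    return zlib.compress(bc_payload, 9)


def unpack(packed):
    """Undo stage 2 only. The result is still BCn, ready for VRAM as-is."""
    return zlib.decompress(packed)


if __name__ == "__main__":
    rgba8 = 1024 * 1024 * 4
    bc1 = block_compressed_size(1024, 1024, BC1_BYTES_PER_BLOCK)
    bc7 = block_compressed_size(1024, 1024, BC7_BYTES_PER_BLOCK)
    print(rgba8, bc7, bc1)

    # Placeholder payload; a real pipeline would feed encoder output here.
    payload = bytes(block_compressed_size(64, 64, BC1_BYTES_PER_BLOCK))
    assert unpack(package(payload)) == payload
```

Only the outer lossless layer is ever undone at load time. The BCn data that comes out is what the GPU samples directly, which is why the two kinds of compression stack rather than compete.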

Jay · Nov 9, 2022

LordVulkan said:
BCPack and others (such as RAD's Oodle Texture) are texture compression encoding algorithms for S3TC formats (see https://en.wikipedia.org/wiki/S3_Texture_Compression). These formats are very widespread and are what the GPU ends up needing in VRAM for rendering. They are not decompressed before being written to VRAM.

Thanks for that.
I understand what you mean, but this bit was a bit unclear, so for anyone else:
BCPack & Kraken streams are decompressed, resulting in S3TC data, which is itself a compression format that is directly readable by the GPU and so doesn't need to be decompressed as well.
Hopefully not misleading here.

Jay · Nov 9, 2022

LordVulkan said:
GDeflate/Deflate and BCPack have completely different purposes.

GDeflate/Deflate (like RAD's Oodle Kraken) are lossless data compression formats. They can be used on everything, apply to all the data you send to the GPU, and can be implemented in HW blocks, as the PS5 and Xbox consoles have done.

So does that mean they're all compatible with each other?
I.e use gDeflate inside BCPack, kraken inside BCPack also?
Reason I ask is that if it was just at the mip level that easily makes sense, but these are dealing with tiles inside of the mips? Guess they all have that ability.

Ethatron · Nov 9, 2022

Jay said:
So does that mean they're all compatible with each other?
I.e use gDeflate inside BCPack, kraken inside BCPack also?
Reason I ask is that if it was just at the mip level that easily makes sense, but these are dealing with tiles inside of the mips? Guess they all have that ability.

It's the only thing you can do: massage the hardware format's bitstream to make it more compressible by general purpose coders. You cannot compete with the hardware sampler's decoding speed, so you cannot drop that representation entirely. In the most extreme case, Binomial's, they invent internal representations which are entirely different but fully transcodable to hardware formats. Not only to make them more compressible, but also to make them scalable for quality.
Maybe one day there can be i/o shaders, meaning programmable samplers/input blocks and programmable rops/output blocks. Then you can truly approach rate-distortion limits (with "rate" also being computation cost, or overall latency "cost", not just size).

Ext3h · Nov 13, 2022

DegustatoR said:
I read this more as: you can use FF h/w for this, including already existing h/w if needed, but you're likely to get better results on a GPU anyway.

"Existing IP" refers to ready-to-license building blocks. Unfortunately not to already existing or even integrated FF H/W.

pjbliverpool said:
It goes via system RAM. But there is basically no CPU impact of that.

Actually that's only the fallback solution.

Unless I'm entirely mistaken, AMD has already demonstrated that they can stream directly to BAR from NVMe, for a specific combination of their own chipset, CPU generation, GPU and qualified NVMe devices which work with their own (not the MS one) NVMe driver, but still with standard NTFS.

It's not as tricky as people tend to make it for a vendor which has control over all the involved components. It just requires clean tracking of resources over multiple driver stacks. The file system overhead some are so afraid of is not that bad when you can just build lookup tables in RAM as a one-time cost per application. Even alignment issues and the like don't make much of a difference for typical asset sizes - just something the GPU driver needs to mask.

Not expecting it from Intel any time soon though, their driver stack looks ... fractured. And NVidia will likely struggle until this is eventually properly standardized.

Going to system RAM does have a huge impact on the CPU after all. With PCIe 4.0 x4 on the storage side, that's still 16GB/s of memory bandwidth (half-duplex!) burnt. Accounting for some inefficiencies, that's almost one DDR4 memory channel's worth of bandwidth lost. One of these nasty details you won't see in a synthetic benchmark (due to lack of easily accessible load statistics!), but which will bite you later on.
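
The 16GB/s figure can be sanity-checked with some back-of-envelope arithmetic. A rough sketch, assuming a staged copy costs one write plus one read of system RAM, and assuming DDR4-2133 and DDR4-3200 as example speed grades (no particular grade is named above). The function names are just for illustration.

```python
def pcie_gb_per_s(gt_per_s, lanes, encoding=128 / 130):
    """Usable one-direction PCIe bandwidth in GB/s (PCIe 3.0+ line encoding)."""
    return gt_per_s * encoding / 8 * lanes


def ddr_channel_gb_per_s(mt_per_s, bus_bytes=8):
    """Peak bandwidth of one 64-bit DDR channel in GB/s."""
    return mt_per_s * bus_bytes / 1000


if __name__ == "__main__":
    nvme = pcie_gb_per_s(16, lanes=4)  # PCIe 4.0 runs at 16 GT/s per lane
    staged = 2 * nvme                  # written into RAM once, read back once
    print(f"NVMe link:   {nvme:.2f} GB/s")
    print(f"RAM traffic: {staged:.2f} GB/s")
    for grade in (2133, 3200):         # assumed example speed grades
        channel = ddr_channel_gb_per_s(grade)
        print(f"DDR4-{grade}: {channel:.1f} GB/s per channel, "
              f"staging uses {staged / channel:.0%}")
```

Under those assumptions the staged path costs a little under 16 GB/s of memory traffic, which is most of a DDR4-2133 channel and roughly 60% of a DDR4-3200 one. These are theoretical peaks; sustained numbers will be lower on both sides.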

Jay said:
I.e use gDeflate inside BCPack, kraken inside BCPack also?

BC itself doesn't exactly have a place in the storage formats any more. It doesn't produce a bitstream suitable for further packing.
You would be using any of the image formats that are suited to deflate compression, or whose own compression ends in it (there are many!). That requires some additional post-processing on the GPU side, in order to reverse the additional transformations which enabled the image to achieve an actually decent compression rate in the first place.
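
As a toy illustration of a reversible pre-transform that helps a deflate-style coder, here is a sketch using a simple byte-wise delta filter (similar in spirit to PNG's Sub filter) with zlib standing in for GDeflate. The input is a synthetic smooth signal, not real image data, and the decode step runs on the CPU here, whereas the post above is talking about doing it on the GPU.

```python
import random
import zlib


def delta_encode(data):
    """Replace each byte with its difference from the previous one (mod 256)."""
    out = bytearray(len(data))
    prev = 0
    for i, value in enumerate(data):
        out[i] = (value - prev) & 0xFF
        prev = value
    return bytes(out)


def delta_decode(data):
    """Exact inverse of delta_encode: running sum modulo 256."""
    out = bytearray(len(data))
    prev = 0
    for i, diff in enumerate(data):
        prev = (prev + diff) & 0xFF
        out[i] = prev
    return bytes(out)


def smooth_signal(length, seed=1234):
    """Synthetic test input: neighbouring bytes differ by at most 1."""
    rng = random.Random(seed)
    value, out = 128, bytearray()
    for _ in range(length):
        value = (value + rng.choice((-1, 0, 1))) & 0xFF
        out.append(value)
    return bytes(out)


if __name__ == "__main__":
    signal = smooth_signal(30_000)
    plain = zlib.compress(signal, 9)
    filtered = zlib.compress(delta_encode(signal), 9)
    print(len(signal), len(plain), len(filtered))
    # The "post-processing" step: inflate, then undo the filter.
    assert delta_decode(zlib.decompress(filtered)) == signal
```

The filter does not compress anything by itself. It just reshapes the data so that the entropy coder sees a handful of small values instead of a wandering signal, and the decoder has to run the inverse afterwards to get the original bytes back.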

What we are seeing showcased right now - GDeflate being applied directly to an RGB bitmap or vertex buffers - still represents a very early stage of development. Expect compression rates to get much better as people figure that out.

Chaining a BC encoder on the GPU, after the decompression, would be more viable to get the VRAM impact back down to where you originally had it when using BC family compression throughout the entire pipeline.

DegustatoR said:
CPU decompression can (and really should, in the case of streaming) certainly be implemented without locking the rendering thread(s).

Without locking - yes. But don't forget about the size of the working set for the read-only lookup tables. Easy to run parallel decompression (from the same parameter set), but likely to trash L2/L3 for other workloads.

There is actually one elephant in the room:
What to do with all the target systems not getting an update with official DirectStorage support? Windows 10 is still going to stay for a long, long time.

The way it looks there are some constraints you simply can't work around without it, specifically the reduction of GPU uploads from 4 memory transfers down to 2 or 0 (AMD only).

But you are also facing the need to maintain a common format for your assets. With the current tendency that MS will recommend you use the IHVs' proprietary implementations of GDeflate - that's looking like something you can't even consider targeting for a very long time. The "software fallback" - for now - isn't provided for a significant portion of your target audience either.

So you will be reliant on shipping your own kernels for GDeflate support (and whatever future extensions you will need) as a fallback solution for the foreseeable future.

Using CPU decompression for the fallback path would be a huge mistake. After all, at that point you have already dropped at least the previously used texture compression schemes, which now causes a vastly higher load (CPU and caches for decompression, memory and PCIe bandwidth for the now doubled data rates).

DegustatoR · Nov 13, 2022

Ext3h said:
What to do with all the target systems not getting an update with official DirectStorage support? Windows 10 is still going to stay for a long, long time.

Win10 has DirectStorage support. It doesn't have bypass I/O but I don't think that this will mean much.

Ext3h said:
With the current tendency that MS will recommend you use the IHVs' proprietary implementations of GDeflate - that's looking like something you can't even consider targeting for a very long time. The "software fallback" - for now - isn't provided for a significant portion of your target audience either.

Not sure what you mean here. GDeflate is a common storage format which can then be decoded by either the IHVs' decoders or the DX s/w fallback - which I presume will be a GPU compute solution too, just not as optimized as a custom IHV-provided one may be. CPU decompression is likely a secondary fallback solution here - which you could argue isn't even necessary, as h/w which can't run the GPU compute decompressor (SM 6.0) is very unlikely to be able to run a game made for the DS 1.1 API.

Ext3h · Nov 13, 2022

DegustatoR said:
Win10 has DirectStorage support.

It does? Completely missed that. So as long as your GPU is getting driver updates (at all) you should be good? Or is this tied to a specific Windows 10 minor release?
