Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

It's the loaded file size (likely the original uncompressed dataset size), 5.65 GB, divided by the decompression time, 0.8 s and 2.36 s respectively, giving the amount of final data delivered per second. At 100% CPU utilisation the CPU can decompress 2.4 GB/s; the GPU path uses 15% CPU to deliver 7 GB/s. There's no mention of GPU utilisation or power draw though, which would be interesting comparison points, especially in relation to the value of custom decompression hardware.

Edit: It's also worth noting the text describes it as a 'highly optimised sample', implying a best-case comparison rather than the general case. Oh, and there's no mention of the GPU used either! There's very little comparative data, so this is just a nice preview of the potential and an indicator that, at some level at least, PC should scale okay with next-gen storage at the raw data level.
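
For anyone who wants to sanity-check the maths, here's the back-of-envelope version (the 5.65 GB dataset size and the two decompression times come from the sample; the rest is just arithmetic):

```python
# Back-of-envelope check of the quoted figures: dataset size divided by
# decompression time gives decompressed data delivered per second.
dataset_gb = 5.65      # uncompressed dataset size from the sample

gpu_time_s = 0.8       # GPU (GDeflate) decompression time
cpu_time_s = 2.36      # CPU decompression time

gpu_rate = dataset_gb / gpu_time_s   # ~7.1 GB/s
cpu_rate = dataset_gb / cpu_time_s   # ~2.4 GB/s

print(f"GPU: {gpu_rate:.2f} GB/s, CPU: {cpu_rate:.2f} GB/s")
```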

I think the bandwidth cost could be the bigger consumer of GPU resources rather than the decompression task itself, as that'll likely be done with async compute.

3 GB of compressed assets in VRAM couldn't fit into the GPU's caches at once, so it will need to be decompressed in chunks, which means a larger write cost than read cost.

Read 1 MB into GPU cache > Decompress > 2 MB output written to VRAM

That likely won't be a problem for the monster GPUs that have a ton of bandwidth, but for the GPUs on smaller 192-bit buses with lower bandwidth, those few GB/s might be precious for actual rendering performance, especially if RT is on.
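
A minimal sketch of that bandwidth concern, assuming the ~2:1 ratio from the 1 MB > 2 MB example above, a purely hypothetical 2 GB/s asset streaming rate, and that both the compressed read and the decompressed write go through VRAM:

```python
# VRAM traffic estimate for GPU decompression: read compressed chunks,
# write out decompressed chunks roughly twice the size.
compression_ratio = 2.0   # 1 MB compressed -> 2 MB decompressed
output_gbs = 2.0          # hypothetical rate of decompressed assets needed

read_gbs = output_gbs / compression_ratio   # compressed data read from VRAM
write_gbs = output_gbs                      # decompressed data written to VRAM
total_gbs = read_gbs + write_gbs            # ~3 GB/s of traffic for 2 GB/s of assets

print(f"~{total_gbs:.1f} GB/s of VRAM traffic for {output_gbs:.1f} GB/s of assets")
```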

And I'm not surprised about them not disclosing the specs; they've never been particularly open about DirectStorage.
 
DS1.1 is likely targeting the mid-to-high-end spec right now; in the future this will fall back to being the low-end spec.
I can't see a reason for a low-end GPU to want this level of asset streaming; it's just not likely, and we wouldn't ask last-gen consoles to do it either, you know what I mean?

MS recommends a DX12U card, so that really narrows down a starting point for determining the likely ideal baseline.
 
The real killer feature is DirectStorage 1.1 combined with Sampler Feedback Streaming, for streaming textures and assets at high speeds. That's much more interesting than just better loading times.

However, as Epic continues to ignore DX12 Ultimate features with UE5, the adoption rate of this technique in real games might be pretty low.
 
DS1.1 is likely targeting the mid-to-high-end spec right now; in the future this will fall back to being the low-end spec.
I can't see a reason for a low-end GPU to want this level of asset streaming; it's just not likely, and we wouldn't ask last-gen consoles to do it either, you know what I mean?

MS recommends a DX12U card, so that really narrows down a starting point for determining the likely ideal baseline.

I know what you mean; however, systems with, say, an RTX 3050 are likely to benefit more from DS1.1 than a system with an RTX 3080 or above, as those higher-end systems generally have much better CPUs with more cores available for decompression.

The lower-end systems will likely have CPUs that will really benefit from GPU decompression.
 
I think it's a given that DS is more ideal for RTX 3060/Ti and above systems; an RTX 3050 probably wouldn't suffice for games that are reading 7 GB/s from storage. Though I think DS should benefit all hardware, as it reduces CPU load, optimizes the I/O stack, etc.
It's a much-needed overhaul of I/O in the PC space. It will be interesting to follow its development, and hopefully my 2080 Ti won't be missing out on much.
 
Interesting point as to whether lower-specced GPUs would benefit from super-fast storage or not. The initial assumption would be that they don't, but the requirement for rendering power generally isn't tied to asset quality so much as to lighting, shading and effects. Lower-end GPUs need simpler assets because they have less VRAM, at least for textures. Polycount limitations would reduce VRAM/asset-quality requirements, but aren't we at a point where drawing geometry isn't much of a bottleneck? And then, with idealised streaming of assets, quality is tied to pixel resolution. So the reasons to think the lower end needs fast asset streaming less than the higher end might not really be there.

Edit: Nanite LOD would serve a lower data-rate stream when scaled to lower geometry specs, reducing bandwidth requirements, right?
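
As a very rough sketch of the 'quality tied to pixel resolution' point, assuming hypothetical figures of roughly one unique texel per screen pixel with ideal virtual texturing and block-compressed textures at 1 byte per texel:

```python
# Rough upper bound on unique texture data visible per frame, assuming
# ~1 texel per pixel (ideal streaming/virtual texturing) and 1 byte/texel (BC7).
bytes_per_texel = 1.0

for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    visible_mb = w * h * bytes_per_texel / 1e6
    print(f"{name}: ~{visible_mb:.1f} MB of unique texels on screen")
```

In other words, the visible working set, and hence the streaming load, scales with output resolution rather than with the GPU tier as such.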
 
The real killer feature is DirectStorage 1.1 combined with Sampler Feedback Streaming, for streaming textures and assets at high speeds. That's much more interesting than just better loading times.

However, as Epic continues to ignore DX12 Ultimate features with UE5, the adoption rate of this technique in real games might be pretty low.
They're both effectively working to do the same thing, just in different ways. Even without SFS, DirectStorage isn't just about better load times; it will also function as an effective memory multiplier to push more detail in games. SFS is just another memory multiplier on top of that.

But yes, utilizing both should give developers an incredible amount of headroom to do what they want.
 
You gotta stop calling ue5 out for "ignoring" features that aren't appropriate for their bottlenecks dude.
 
I don't know if it's relevant to DirectStorage, but GDeflate is part of Nvidia's nvCOMP library and they have some benchmarks for it here. I tried running the Silesia test on my 2080 Ti, which gave me a decompression speed of ~40 GB/s. I don't have their full test suite, but I also tried it on some 16K textures and those decompressed in the 30-35 GB/s range.

I'd still take these numbers with a grain of salt as they might not translate to DS. But I found it interesting.
 
Addendum:

Running these same tests on Kraken (in software) on my 9900k gives me ~1.5GB/s for the Silesia test and ~1GB/s for the textures. The compressed sizes are smaller though.
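
To put those numbers next to a drive: a quick sketch of where the bottleneck sits, assuming a hypothetical 7 GB/s NVMe drive and a ~2:1 compression ratio, using the rough decompression rates measured above:

```python
# Delivered (decompressed) throughput is limited by whichever stage is slower:
# the SSD reading compressed data, or the decompressor expanding it.
ssd_read_gbs = 7.0   # hypothetical PCIe 4.0 class drive
ratio = 2.0          # assumed compression ratio

for name, decomp_gbs in {"GPU GDeflate (~40 GB/s)": 40.0,
                         "CPU Kraken (~1.5 GB/s)": 1.5}.items():
    delivered = min(ssd_read_gbs * ratio, decomp_gbs)
    print(f"{name}: ~{delivered:.1f} GB/s of decompressed data delivered")
```

So with GPU decompression the drive becomes the limiting factor, whereas with software Kraken on the CPU, the decompressor is.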
 
You gotta stop calling ue5 out for "ignoring" features that aren't appropriate for their bottlenecks dude.
I agree.
They have their list of priorities and it's probably very low on it.
If individual companies or studios want to make use of a HW feature, they can implement it and get it added, or release it as a plugin or something.
The Coalition didn't see SFS as a high enough priority when they helped with the Matrix demo.
MS needs to do it if they think there's a benefit, especially on XS.
 
I don't know if it's relevant to DirectStorage, but GDeflate is part of Nvidia's nvCOMP library and they have some benchmarks for it here. I tried running the Silesia test on my 2080 Ti, which gave me a decompression speed of ~40 GB/s. I don't have their full test suite, but I also tried it on some 16K textures and those decompressed in the 30-35 GB/s range.

I'd still take these numbers with a grain of salt as they might not translate to DS. But I found it interesting.

This is an awesome find, nice one! Looking through the linked papers, they also give compression ratios for GDeflate on texture and geometry data sets coming in at ~1.5x and ~2x respectively. So about the same as, or maybe a bit better than, straight-up Kraken, but less than advertised for BCPack or Kraken + Oodle Texture. However, I see no reason why this couldn't be used in combination with Oodle Texture or a similar RDO encoder for similar results.

They also give decompression throughput figures of around 50GB/s on an A100 down to about 30GB/s on an A30. So that seems in line with your findings.

Additionally they confirm that the decompression task is performed asynchronously.

So this further explains why Nvidia has previously described the operation's impact on game performance as barely measurable. If you consider that even a very high streaming throughput would be maybe 1 GB/s, that's roughly 1/30th of the GPU's maximum decompression throughput, which, when slotted in asynchronously, should indeed be barely measurable. Also, given they are benchmarking this on the A-series Tensor-core GPUs and it's Nvidia-developed, I wonder if it can be executed on the Tensor cores of RTX GPUs, which would presumably further nullify any performance impact.
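
To put the 'barely measurable' claim in per-frame terms, assuming 1 GB/s of streaming, ~30 GB/s of GPU decompression throughput and a 60 fps target:

```python
# Per-frame decompression cost under the assumptions above.
stream_gbs = 1.0     # sustained streaming rate
decomp_gbs = 30.0    # GPU decompression throughput
fps = 60

mb_per_frame = stream_gbs * 1000 / fps                 # ~16.7 MB arrives per frame
decomp_ms = mb_per_frame / (decomp_gbs * 1000) * 1000  # ~0.56 ms to decompress it
frame_ms = 1000 / fps

print(f"{mb_per_frame:.1f} MB/frame, ~{decomp_ms:.2f} ms "
      f"({decomp_ms / frame_ms:.1%} of a {frame_ms:.2f} ms frame)")
```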
 
You can't measure its compute cost, but you'll easily measure its bandwidth cost.

Decompressing 50 GB/s worth of data is likely going to consume over 50 GB/s of GPU bandwidth.
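
For scale, against a hypothetical 192-bit-class card with roughly 300 GB/s of memory bandwidth and the same ~2:1 ratio as above:

```python
# Share of a mid-range card's memory bandwidth eaten by the extreme case above
# (reading compressed data plus writing the decompressed result).
gpu_bw_gbs = 300.0    # hypothetical 192-bit-class card
ratio = 2.0
output_gbs = 50.0     # the extreme decompression rate discussed above

traffic = output_gbs / ratio + output_gbs   # 25 read + 50 write = 75 GB/s
print(f"~{traffic:.0f} GB/s of traffic = {traffic / gpu_bw_gbs:.0%} of total bandwidth")
```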
 
The fact is that no games are really even coming close to continually streaming in that amount of data.

Most heavy games spike to like 500 MB/s for a split second and then drop...

No game... and that's PS5 included... is going to be continually streaming in 4-5 GB/s.
Have we had any true next-gen games built with NVMe in mind yet?

People need to stop believing that, just because games haven't historically pushed a lot of data due to mechanical HDD limitations, they won't push beyond that now.

Some games could very well end up streaming that much data; that's what makes this generation so exciting, we just don't know.
 
The takeaway should be this... even if they DO... any modern RTX/RX GPU will EASILY handle it with minimal impact on performance. Doing that decompression on the CPU would be a completely different story... and that's the point.
 
We need to distinguish between sustained throughput and burst/latency. A single frame at 60 fps is only 16.67 ms. Pushing, say, only 500 MB through in that time frame would actually be equivalent to almost 30 GB/s. If we're talking about a move towards more reliance on real-time streaming of data, how we look at the numbers needs to change accordingly.
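
The arithmetic behind that burst figure:

```python
# Burst throughput needed to deliver a chunk of data within a single frame.
fps = 60
frame_time_s = 1 / fps     # ~16.67 ms
burst_mb = 500             # data to deliver inside that one frame

required_gbs = (burst_mb / 1000) / frame_time_s
print(f"{burst_mb} MB in one {frame_time_s * 1000:.2f} ms frame = ~{required_gbs:.0f} GB/s")
```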

While I don't think we will actually see sustained throughput requirements in the multiple-GB/s range anytime soon (I'm not even sure what type of game design would make use of that), burst performance could very well be meaningful in those ranges even if you only wanted to push a few hundred MB of data in a short period.

Although, given SSD speed limitations (relatively speaking), the SSD will likely be the limiting factor rather than GPU processing speed, going by the above numbers.
 