Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Yes, advances like SFS act like memory multipliers, so the 2.7x memory increase may now be closer to 6.75x with SFS, and the low-latency I/O means less needs to be buffered.
...
Yeah, that's still something I want to see in action. It already started with the Xbox One and tiled resources, but as far as I know nothing ever really benefited from it, simply because there was an HDD inside and so everything was loaded in big packets that include much more data than needed.
I guess last gen has to be dead for a while before we see engines built with that in mind (at least on Xbox). But I still think it won't change graphics quality too much. We already have quite good texture quality, more or less only limited by space and design time. It might bring the actually-used resources down significantly, but then again I doubt that current GPUs have enough power to really use that "extra" memory in a significant way. It can help reduce high-res texture pop-in, but I don't see that many more changes this could bring.
More memory for BVH structures ... but someone must calculate that stuff ...
Maybe more memory available for better TAA solutions...
In the end, it is just one bottleneck less.
 
From my understanding of this, under the 'Kraken creates optimized streams for the decoder' section, they could do this with the PS5?

Yes, it does seem that within Kraken itself they can apply various options at compression time to vary the compression ratio, with the trade-off being decompression speed. That part of the article does seem to specify that the use case for that capability on the PS5 is specifically to ensure the decoder is always able to decode at the drive's maximum throughput of 5.5GB/s, i.e. they reduce the compression factor if the decompression unit would be unable to decompress it at that speed. What you're suggesting here is essentially bottlenecking the drive by clogging up the decoder with a data stream that is compressed so heavily that the decoder can't keep up with it.
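(Purely to illustrate the encode-time decision being described above, here's a loose sketch: back off the compression level on any chunk the hardware decoder couldn't sustain at full drive speed. The compress() and decode_speed_gbps() helpers are hypothetical profiling stand-ins, not the real Oodle/Kraken API.)

```python
DRIVE_SPEED_GBPS = 5.5   # raw PS5 read rate the decoder has to keep pace with

def pick_compression_level(chunk, levels, compress, decode_speed_gbps):
    """Return the densest encoding whose measured decode rate still matches
    the drive, falling back to weaker (faster-to-decode) levels if needed.

    `compress(chunk, level)` and `decode_speed_gbps(blob)` are hypothetical
    stand-ins for offline profiling against the target decoder.
    """
    for level in sorted(levels, reverse=True):        # try the densest level first
        blob = compress(chunk, level)
        if decode_speed_gbps(blob) >= DRIVE_SPEED_GBPS:
            return blob                               # decoder keeps up: ship it
    return chunk                                      # worst case: store it uncompressed
```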

It's not clear whether that is even possible at a system level, but furthermore, this capability, if it were possible, has nothing to do with the PS5's SSD speed. Rather it's associated with its decompression unit speed. You're assuming that the PS5's decompression unit is capable of decompressing a more heavily compressed (higher workload) data stream at 2.5GB/s than the Xbox Series decompression unit is. That may or may not be the case - it's very difficult to compare given they deal with different compression formats - but it's nothing to do with the drive speed.

But a few points:

1. Games have to use DS in the first place, and on PC it can take a long time for that to become standard in games and game engines.
2. GPU decompression at 14GB/s will still be lower than PS5's maximum.
3. PS5's I/O is proven in the real world, with a handful of games already loading in under 2 seconds; DS is not proven in the real world.
4. What if Sony release a PS5 Pro with even faster speeds? PC will be behind again.
5. Nvidia's numbers for RTX I/O make no sense.

On 2, I think you've misunderstood those numbers. The average compression ratio of the PS5 using Oodle Texture is 2:1. Some data sets will be more compressed and some will be less, but on average it's 2:1. I believe the decompression unit itself can handle up to 4:1, or 22GB/s, for those data sets which are highly compressible, but those are peaks, not averages.

In the same way, Nvidia has stated that their GPUs can easily keep up with a 7GB/s input stream for an average of 14GB/s output. So we can safely assume that the GPUs are also capable of much higher decompression rates to deal with the highly compressible data, or else they would act as a bottleneck on a 7GB/s drive.

In fact it's probably safe to assume modern GPUs can go well beyond that and won't bottleneck the incoming generation of PCIe 5.0 drives either, for an average throughput of >20GB/s (with peaks at >40GB/s).
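(Spelling out the arithmetic behind those figures, assuming the same roughly 2:1 average and ~4:1 peak ratios discussed above and a ~10GB/s-class PCIe 5.0 drive:)

```python
drive_raw_gbps = 10.0                 # ballpark sequential read for a first-wave PCIe 5.0 SSD
avg_ratio, peak_ratio = 2.0, 4.0      # compression ratios discussed above

print(f"average output: ~{drive_raw_gbps * avg_ratio:.0f} GB/s")   # ~20 GB/s
print(f"peak output:    ~{drive_raw_gbps * peak_ratio:.0f} GB/s")  # ~40 GB/s
```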

I'm curious what it is about Nvidia's RTX-IO numbers that you think don't make sense?
 
In the same way, Nvidia has stated that their GPUs can easily keep up with a 7GB/s input stream for an average of 14GB/s output. So we can safely assume that the GPUs are also capable of much higher decompression rates to deal with the highly compressible data, or else they would act as a bottleneck on a 7GB/s drive.

In fact it's probably safe to assume modern GPUs can go well beyond that and won't bottleneck the incoming generation of PCIe 5.0 drives either, for an average throughput of >20GB/s (with peaks at >40GB/s).

I'm curious what it is about Nvidia's RTX-IO numbers that you think don't make sense?

I would think that when using as much GPU as you can to decompress as fast as you can, you may possibly move the bottleneck back to the CPU?

Nvidia state that RTX I/O is 14GB/s using compression; that is fine and I have no issues with that.

My issue is the number of CPU cores they claim it's equivalent to, which they state is 24. That doesn't really add up with the other numbers on the chart, nor with the numbers from Microsoft, Sony and now Forspoken.

If Forspoken is pulling 5GB/s using ~5 cores (40%) of a Ryzen 5900, that would mean ~15 cores for 15GB/s (in a perfect world).
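(For what it's worth, the back-of-envelope version of that scaling, assuming purely linear scaling from the quoted Forspoken figures - the 'perfect world' caveat:)

```python
# ~5 GB/s of CPU decompression on ~40% of a 12-core Ryzen 9 5900, i.e. roughly 5 cores.
cores_used    = 5
gbps_per_core = 5.0 / cores_used          # ~1 GB/s per core

for target_gbps in (14, 15):
    print(f"{target_gbps} GB/s -> ~{target_gbps / gbps_per_core:.0f} cores")
# ~14-15 cores in a perfect world, well short of the 24 on Nvidia's slide
```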

So how are Nvidia getting 24 CPU cores for 14GB/s? Are they using 24 Sandy Bridge cores?

Their numbers don't make any sense to me, and they seem to have used different CPU cores for each result rather than the same ones.

So for me the numbers presented by them are nothing but PR and marketing fluff, and if that's the case, is the 14GB/s number also marketing fluff?

Or maybe I'm just being dumb, who knows.
 
All the vendors are throwing around different numbers (Sony, MS, Nvidia) but I think they're all talking about broadly similar things.

As far as I can tell: everyone can use the likes of the Oodle pre-compression tools, e.g. lossy RDO for up to "20 - 50%" smaller post-compressed BCn textures (with varying levels of degradation), and actual compression rates seem pretty broadly similar for common lossless compression formats.

Direct Storage is set to be able to use hardware-supported BCPack, DEFLATE, or nothing (or a combination), along with some as-yet-unimplemented swizzle operations that I don't understand but that can apparently help. I don't know if there's going to be a DS software fallback if there is no hardware implementation though ...?

So it appears perfectly possible to use DS with no specified type of decompression, and then do all your decompression on the CPU (and just eat CPU cycles).

I found this forum post, which I think is by Fabian "ryg" Giesen of RAD Game Tools. Guy seems like a pro and seems to know about PS5 and BCPack, but I don't know for sure. I mean, it's the internet, innit.


BC = BC1,2,3,4,5,6H,7 = suite of lossy texture formats used by PC GPUs. All these need to support efficient random access and work by encoding 4x4 blocks of pixels to a fixed-size format (either 64 or 128 bits per block). They are decoded as part of the texture fetch. It is safe to assume that at any point in time, the vast majority of texture data in GPU memory is in one of these formats.

BC1-7 encoding is lossy. BCPack is a lossless coder on top of the lossy BCn data. As such, comparing to Kraken is fair.

Both are run on reading data from disk into memory, thus they don't have the random-access requirement that textures in memory do, and indeed better entropy coding is a likely thing to try. I can't comment on what exactly BCPack does or how its performance compares to Kraken, either "full fat" or the PS5 subset; this is covered by NDAs with both MS and Sony.

For BC1-5 there are several well-known, easy lossless transforms that tend to significantly increase compression ratio (5-15% reduction is common, depending on the BCn format and the data) with your usual LZ. This is really simple stuff. For example, a 64-bit BC1 block is 32 bits of color endpoints then 32 bits of 2-bit indices for every pixel in the 4x4 block. Reordering data to separate endpoints from indices, putting them into separate blocks (and so they get separate Huffman tables in say a Deflate stream), helps massively for coders that don't use the low-order bits of the position in the stream as context. This is a lot better than straight Deflate/Kraken on this type of data and both the Xbox Series S/X and the PS5 support it as part of their output write from the decompressor to memory. (Not quite free, but very nearly so.)

BCPack is more sophisticated than that, PS5 decided that the basic reordering plus Kraken was good enough.

BC6H and BC7 have many modes and a more irregular block layout and such trivial transforms don't work. Oodle Texture has "BC7Prep" which is a lossless transform that can be run on BC7 blocks to make them more amenable to compression by byte-aligned LZs, and is easy (and very fast, often >180GB/s) to undo on the GPU. It mostly boils down to making the transform aware of the different modes. BC6H we don't have anything in particular yet because it was <1% of the texture data sets for all games we looked at so there were better things to spend our time on.

In short, XBox Series S/X have regular Deflate and BCPack which is a lossless coder for _only_ BC1-7 data (itself lossy), PS5 has Kraken, both support certain simple on-the-fly transforms on BC1-5 blocks as part of the decompression process, and both can use Oodle Texture BC7Prep via GPU compute shader for BC7 where the simple transforms don't work.

BCPack is aware of BC1-7 data, Kraken is not, and both can potentially get much better results when the BC1-7 encoder feeds them the kind of redundancy they know to exploit.
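(To make the endpoint/index reordering described above a bit more concrete, here's a rough sketch of the idea. It is not RAD's implementation and it skips the separate-Huffman-table detail; it just shows the split into two more self-similar streams, with zlib standing in for the real back-end coder.)

```python
import zlib

def split_bc1_streams(bc1_data: bytes) -> bytes:
    """Separate BC1 blocks (8 bytes each: two 16-bit colour endpoints followed
    by 32 bits of 2-bit indices) into an endpoint stream and an index stream,
    so the general-purpose coder sees two much more self-similar streams."""
    endpoints, indices = bytearray(), bytearray()
    for off in range(0, len(bc1_data), 8):
        block = bc1_data[off:off + 8]
        endpoints += block[:4]     # color0 + color1 (RGB565 each)
        indices   += block[4:]     # 16 x 2-bit selector values
    return bytes(endpoints) + bytes(indices)

def merge_bc1_streams(reordered: bytes) -> bytes:
    """Inverse transform, applied after decompression (the consoles do the
    equivalent as part of the decompressor's write-out to memory)."""
    half = len(reordered) // 2
    endpoints, indices = reordered[:half], reordered[half:]
    out = bytearray()
    for i in range(0, half, 4):
        out += endpoints[i:i + 4] + indices[i:i + 4]
    return bytes(out)

# Toy comparison on some BC1-encoded texture data (hypothetical file name):
# texture = open("albedo.bc1", "rb").read()
# print(len(zlib.compress(texture, 9)), len(zlib.compress(split_bc1_streams(texture), 9)))
```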
 
I would think that when using as much GPU as you can to decompress as fast as you can, you may possibly move the bottleneck back to the CPU?

I don't follow? Why would moving decompression work to the GPU increase the bottleneck on the CPU? Or did you mean to say GPU? It's worth remembering that when we're talking about these crazy 14GB/s+ numbers, we're talking about loading screens, because you're never going to have a scenario where you're streaming that level of data mid-game (no game is even close to large enough to justify that). So we don't have to worry about that level of decompression impacting framerates. In terms of decompression while streaming, which is likely to be in the hundreds of MB/s range, Nvidia have stated that the GPU impact is barely measurable.

My issue is the number of CPU cores they claim it's equivalent to, which they state is 24. That doesn't really add up with the other numbers on the chart, nor with the numbers from Microsoft, Sony and now Forspoken.

If Forspoken is pulling 5GB/s using ~5 cores (40%) of a Ryzen 5900, that would mean ~15 cores for 15GB/s (in a perfect world).

So how are Nvidia getting 24 CPU cores for 14GB/s? Are they using 24 Sandy Bridge cores?

I believe they were using a Zen 2 based Threadripper, which is pretty low-clocked, so it would presumably be fewer cores on a faster system. That said though, I don't think you can compare these core numbers to what Microsoft, Sony or the Forspoken devs have put out there, because GPU-based decompression is something relatively new. It'll be using a different algorithm that is optimised for GPUs and may well be very inefficient on CPUs. That's as opposed to the CPU-based decompression schemes being used in the other examples, which in the case of the PS5 and XBS are accelerated through dedicated hardware.

Their numbers don't make any sense to me, and they seem to have used different CPU cores for each result rather than the same ones.

So for me the numbers presented by them are nothing but PR and marketing fluff, and if that's the case, is the 14GB/s number also marketing fluff?

Or maybe I'm just being dumb, who knows.

I honestly wouldn't be surprised if there is a marketing element in those CPU numbers. For example, there is a 24-core Threadripper, so perhaps they're just saying they needed that as opposed to an 8- or 16-core CPU?

In any case though, the 14GB/s number is different. They've stated directly in interviews that the GPUs are more than capable of keeping up with that level of throughput, and why wouldn't they be? We're talking multi-teraflop GPUs here, so there's no reason they shouldn't be able to keep up if a handful of CPU cores can do it - provided the algorithm is well architected to take advantage of GPU parallelism, that is. Just look at the quote above about the BC7Prep GPU transform operating at over 180GB/s. It's not the same as full decompression, but it gives a flavour of what GPUs can be capable of in this regard.
 
I would think that when using as much GPU as you can to decompress as fast as you can, you may possibly move the bottleneck back to the CPU?

Nvidia state that RTX I/O is 14GB/s using compression; that is fine and I have no issues with that.

My issue is the number of CPU cores they claim it's equivalent to, which they state is 24. That doesn't really add up with the other numbers on the chart, nor with the numbers from Microsoft, Sony and now Forspoken.

If Forspoken is pulling 5GB/s using ~5 cores (40%) of a Ryzen 5900, that would mean ~15 cores for 15GB/s (in a perfect world).

So how are Nvidia getting 24 CPU cores for 14GB/s? Are they using 24 Sandy Bridge cores?

Their numbers don't make any sense to me, and they seem to have used different CPU cores for each result rather than the same ones.

So for me the numbers presented by them are nothing but PR and marketing fluff, and if that's the case, is the 14GB/s number also marketing fluff?

Or maybe I'm just being dumb, who knows.

Or they may have used a GPU decompression scheme that is highly parallelizable, so more cores are favored for max throughput even on a CPU. I remember these images appearing in other, older Nvidia literature. I am not absolutely sure, but it may have come from nvCOMP marketing, which is a solution for data centers. I am guessing Nvidia took that data and applied it to RTX IO without considering that it wasn't totally applicable to a gaming setup, where CPU decompression uses a more appropriate scheme on those types of cores.
 
I just done Googled, and there seems to be some new stuff up on DStorage on the MS site. Haven't had time to go through it yet but there might be some stuff pertinent to this topic (page is dated 3 days ago):


Direct Storage is set to be able to use hardware-supported BCPack, DEFLATE, or nothing (or a combination), along with some as-yet-unimplemented swizzle operations that I don't understand but that can apparently help. I don't know if there's going to be a DS software fallback if there is no hardware implementation though ...?

I think that's only for the Xbox version of DS though. I don't think we've had any confirmation of what compression schemes the PC GPU based decompression will use. It's possible (likely?) though that they will be something entirely different that's designed specifically for GPUs.

I did pick up some really interesting bits from that article that you linked though which are specific to the Xbox. Here's a couple:

DirectStorage provides a queue type to invoke the decompression hardware with the decompression source being memory instead of a disk file. This allows the decompression hardware to be utilized if the compressed asset is not sourced from a file, or was previously sourced and kept into memory as a cache.

So you can pre-cache compressed data in RAM, where it can then be decompressed as needed. This is pretty cool, and I hope the PC gets a similar capability, which could be used to effectively increase RAM capacity for pre-caching. For example, in previous discussions around Ratchet & Clank I've always taken the install size on PS5 and doubled it to account for decompression, to give the amount of data you would need to pre-cache in RAM/VRAM on PC (33GB->66GB). If the data can remain compressed while in cache, then one could argue a strong gaming PC with 32GB system memory and 16GB graphics memory could potentially have the entire game sitting in RAM/VRAM in compressed form, waiting to be decompressed on demand, almost completely mitigating the issue of potentially slower SSDs being used.
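(Rough numbers for that, using the ~2:1 average ratio discussed earlier and the example machine above:)

```python
install_compressed_gb = 33        # quoted PS5 install size for Ratchet & Clank
avg_ratio             = 2.0       # assumed average compression ratio

decompressed_gb = install_compressed_gb * avg_ratio      # ~66 GB if cached decompressed
combined_pool   = 32 + 16                                 # system RAM + VRAM on the example PC

print(f"decompressed cache needed: ~{decompressed_gb:.0f} GB")             # doesn't fit
print(f"compressed cache fits? {install_compressed_gb <= combined_pool}")  # 33 GB <= 48 GB
```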

DirectStorage removes most of the overhead caused by the operating system. This allows a minimum guarantee closer to the hardware limits. The new minimum performance guarantee is 2.0 GB/s over a 250 ms window for raw data. The use of decompression on the content will push the final bandwidth higher.

So now we have a solid figure on this and the earlier "sustained 2.5GB/s" throughput that Microsoft talked about was seemingly more marketing speak.
 
I think that's only for the Xbox version of DS though. I don't think we've had any confirmation of what compression schemes the PC GPU based decompression will use. It's possible (likely?) though that they will be something entirely different that's designed specifically for GPUs.

I did pick up some really interesting bits from that article that you linked though which are specific to the Xbox. Here's a couple:



So you can pre-cache compressed data in RAM, where it can then be decompressed as needed. This is pretty cool, and I hope the PC gets a similar capability, which could be used to effectively increase RAM capacity for pre-caching. For example, in previous discussions around Ratchet & Clank I've always taken the install size on PS5 and doubled it to account for decompression, to give the amount of data you would need to pre-cache in RAM/VRAM on PC (33GB->66GB). If the data can remain compressed while in cache, then one could argue a strong gaming PC with 32GB system memory and 16GB graphics memory could potentially have the entire game sitting in RAM/VRAM in compressed form, waiting to be decompressed on demand, almost completely mitigating the issue of potentially slower SSDs being used.



So now we have a solid figure on this and the earlier "sustained 2.5GB/s" throughput that Microsoft talked about was seemingly more marketing speak.

Yeah, I had about 25 tabs open, think I got a couple of wires crossed. Cheers for pointing that out. I got confused by a structure that defined DS access, including BCPack / DEFLATE compression settings, not being behind an NDA. But that does appear to be console-only.

So yeah, software decompression only (for the time being anyway). Compression ratios and stuff should carry across pretty similarly though, depending on how much grunt you want to expend on decompression.

MS did have this on the May PC announcement too:

"GPU decompression is next on our roadmap, a feature that will give developers more control over resources and how hardware is leveraged." So hopefully there is going to be some kind of DirectX-standardised compression scheme, so we don't end up a bit fragmented and with proprietary implementations like with DLSS (which is great, but I wish it were "implement once, use anywhere").

The 2.0 GB/s figure is for the games, meaning that OS activities like saving 4K video, downloading, etc. won't reduce that over the 250ms window. PS5 will have some similar minimum threshold guaranteed too, I'd guess. As they're consoles, they can probably make guarantees about SSD bandwidth more easily than a full-on PC can - I'd probably try to spec an SSD a bit faster in the hope of getting similar results.

Edit:

Okay, so it looks like Direct Storage allows you to create custom decompression queues, presumably to allow you to run whatever kind of custom decompression routine suits you best. The example here is about using lots of threads to get the job done quickly.
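(Something along these lines, presumably. This is not the actual DirectStorage interface, just a sketch of the "pull compressed requests off a queue and fan them out across worker threads" pattern, with zlib as a placeholder codec and upload_to_gpu as a hypothetical callback.)

```python
import queue
import threading
import zlib

request_q = queue.Queue()      # filled with (compressed_bytes, on_done) work items

def decompression_worker():
    """Each worker pulls compressed requests and inflates them. zlib releases
    the GIL while decompressing, so several threads genuinely work in parallel."""
    while True:
        item = request_q.get()
        if item is None:                       # shutdown sentinel
            break
        compressed, on_done = item
        on_done(zlib.decompress(compressed))
        request_q.task_done()

workers = [threading.Thread(target=decompression_worker, daemon=True)
           for _ in range(8)]                  # "lots of threads", sized to the CPU
for w in workers:
    w.start()

# The I/O layer (DirectStorage in the real case) would enqueue requests as reads
# complete, e.g.: request_q.put((blob, lambda data: upload_to_gpu(data)))
```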

 
Yeah, I had about 25 tabs open, think I got a couple of wires crossed. Cheers for pointing that out. I got confused by a structure that defined DS access, including BCPack / DEFLATE compression settings, not being behind an NDA. But that does appear to be console-only.

So yeah, software decompression only (for the time being anyway). Compression ratios and stuff should carry across pretty similarly though, depending on how much grunt you want to expend on decompression.

MS did have this on the May PC announcement too:

"GPU decompression is next on our roadmap, a feature that will give developers more control over resources and how hardware is leveraged." So hopefully there is going to be some kind of DirectX-standardised compression scheme, so we don't end up a bit fragmented and with proprietary implementations like with DLSS (which is great, but I wish it were "implement once, use anywhere").

Yes, there's a video from AMD posted earlier in this thread where they mention the GPU decompression formats that "Direct Storage promotes, supports, endorses and asks ISV's to design to". So it doesn't necessarily sound like Direct Storage / Microsoft supplies a format, but it does set the standards for what formats can be used.

Edit:

Okay, so it looks like Direct Storage allows you to create custom decompression queues, presumably to allow you to run whatever kind of custom decompression routine suits you best. The example here is about using lots of threads to get the job done quickly.


Yes, for the CPU-based decompression Direct Storage hands the task back to the game, so that it can use whatever CPU-based routine it wants. It seems GPU-based decompression will be more prescribed though, based on the above.

BCPack is coming to the PC.

What's the source for that? It'd be great to have that confirmed.
 
Yes, for the CPU-based decompression Direct Storage hands the task back to the game, so that it can use whatever CPU-based routine it wants. It seems GPU-based decompression will be more prescribed though, based on the above.

I've been having a look for CPU decompression related stuff that might be relevant to this topic. On the Oodle development history page ( http://www.radgametools.com/oodlehist.htm ), under the "Release 2.9.3 - July 26, 2021" section it says:
  • "enhancement : CPU BC7Prep decode will now use AVX2 256-bit instructions when available, for a speed win of about 25% typically (it's frequently memory bound). You can disable usage of wide vectors when desired via OodleTexRT_BC7PrepDecodeFlags_AvoidWideVectors (see docs)."

So on Xbox and most PCs, it looks like developers will be able to take advantage of full-width 256-bit AVX2 for at least Oodle's BC7Prep optimised decoder. I think that's one of the formats that can be used with that Oodle Texture RDO to offer those delightful "20 - 50% smaller" but near-lossless results.

From the more recent "Release 2.9.6 - May 2, 2022" there's also this:
  • "new : Data : New example showing how to use Oodle Data with DirectStorage on Windows"
... but I can't find that on the site. It's probably detailed in the software or documentation or something.
 
I've been having a look for CPU decompression related stuff that might be relevant to this topic. On the Oodle development history page ( http://www.radgametools.com/oodlehist.htm ), under the "Release 2.9.3 - July 26, 2021" section it says:
  • "enhancement : CPU BC7Prep decode will now use AVX2 256-bit instructions when available, for a speed win of about 25% typically (it's frequently memory bound). You can disable usage of wide vectors when desired via OodleTexRT_BC7PrepDecodeFlags_AvoidWideVectors (see docs)."

So on Xbox and most PCs, it looks like developers will be able to take advantage of full-width 256-bit AVX2 for at least Oodle's BC7Prep optimised decoder. I think that's one of the formats that can be used with that Oodle Texture RDO to offer those delightful "20 - 50% smaller" but near-lossless results.

From the more recent "Release 2.9.6 - May 2, 2022" there's also this:
  • "new : Data : New example showing how to use Oodle Data with DirectStorage on Windows"
... but I can't find that on the site. It's probably detailed in the software or documentation or something.

Nice find. Guess that's why MS opted not to cut AVX.
 
Nice find. Guess that's why MS opted not to cut AVX.

Well it certainly wouldn't hurt if a developer decided to roll their own decompression stuff, but MS do have their own pretty cool hardware block in Series X/S, and Sony have a subset of Oodle stuff that's very fast in their hardware so if devs use that they're laughing too, even with cut down vector units on the CPU.

I think MS were looking for commonality across platforms, and so 256-bit AVX2 was a must-have. Same with stuff like DS and SFS - it'll help lift all their platforms. Sony were looking primarily at the money-making monster that is PS5, so their pressures were different.

Truth is, both Sony and MS have probably made lots of good decisions given their own circumstances. And I think all the technologies generally converge around a similar goal.

Good times man!

You're either not using Chrome or have 128GB of RAM.

Firefox and 32GB. Compatibility be damned!
 
Well it certainly wouldn't hurt if a developer decided to roll their own decompression stuff, but MS do have their own pretty cool hardware block in Series X/S, and Sony have a subset of Oodle stuff that's very fast in their hardware so if devs use that they're laughing too, even with cut down vector units on the CPU.

I think MS were looking for commonality across platforms, and so 256-bit AVX2 was a must-have. Same with stuff like DS and SFS - it'll help lift all their platforms. Sony were looking primarily at the money-making monster that is PS5, so their pressures were different.

Truth is, both Sony and MS have probably made lots of good decisions given their own circumstances. And I think all the technologies generally converge around a similar goal.

Good times man!

Can't agree more on this :)
 
Nice find. Guess that's why MS opted not to cut AVX.

BC7Prep on consoles has GPU decode support. BC7Prep isn't a compression step; it rearranges bits to make compression more efficient, but the bits have to be reordered after decompression to get back the original BC7 data. The PS5 handles BC7Prep decode on the GPU, and the Series consoles, Xbox One and PS4 can too.
 
I wonder, if Microsoft or Sony had gone around to developers and given them 2 choices:

16GB RAM and fast nvme SSD storage

32GB RAM and regular old SATA SSD storage


Which would developers have wanted?

I'm pretty sure the obvious answer is 32GB RAM.

What's the lowest that number can go and still be favorable overall?

Would 20GB RAM with SATA SSD storage be a better option for developers than what we got? This seems like it could have been a reality with the Series X given the memory configuration and probably pretty comparable in manufacturing cost to what we got.

Just curious.
 
I wonder, if Microsoft or Sony had gone around to developers and given them 2 choices:

16GB RAM and fast nvme SSD storage

32GB RAM and regular old SATA SSD storage


Which would developers have wanted?

I'm pretty sure the obvious answer is 32GB RAM.

What's the lowest that number can go and still be favorable overall?

Would 20GB RAM with SATA SSD storage be a better option for developers than what we got? This seems like it could have been a reality with the Series X given the memory configuration and probably pretty comparable in manufacturing cost to what we got.

Just curious.

You're forgetting the single most important number here: memory bandwidth.
32GB at the same bandwidth as the 16GB gets you almost nothing.

My favorite way to think about these things is actually memory bandwidth per frame.
Let's take the PS5 as an example.

Memory bandwidth = 448 GB/s.
At 60fps, that means a single frame can access a total of about 7.47 GB of data.
That's plenty of textures, models, render targets, etc.
This also means that if you have a total of 16GB of RAM, you're NEVER going to touch your entire RAM in the process of rendering a single frame.
A 30fps game can access pretty close to the entire 16GB in the process of rendering a frame, but even at 30fps, with 32GB of RAM over half would be wasted.
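(Putting numbers on that - bearing in mind it's an upper bound, since a real frame re-reads some data and never touches other data:)

```python
mem_bw_gbps = 448.0       # PS5 unified memory bandwidth
ram_gb      = 16.0

for fps in (60, 30):
    per_frame_gb = mem_bw_gbps / fps
    print(f"{fps} fps: up to {per_frame_gb:.2f} GB touched per frame "
          f"({per_frame_gb / ram_gb:.0%} of a {ram_gb:g} GB pool)")
# 60 fps: ~7.47 GB (~47% of 16 GB); 30 fps: ~14.93 GB (~93%)
```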

Now let's think about what a fast SSD gives us:
1. The ability to refresh our smaller pool of RAM faster.
2. The ability to cache more game state at any one time.

The faster refilling of actual RAM is good because it simplifies the entire process.
An HDD, or even a slow SSD, is not quick enough to enable on-demand paging systems for textures or other data structures.

The ability to cache more game state is a HUGE step forward, imho.
Richer, more living worlds, where game state only needs to be updated on a 1-10 fps basis, i.e. "the living world" that isn't immediately on screen, will see a huge step forward this gen.

Now if you're talking about 32GB @ 2 x 448GB/s, yeah, sure, the doubled RAM starts to make sense, but now you're no longer at price parity, even with no SSD. That's 3090 prices.

Overall I think almost all developers would opt for the 16GB + SSD option.
Modern computing is well versed in the techniques for handling multi-level caching and addressing across memories of different bandwidth and latency, but those only work once you get to decent SSD speeds; it all falls apart with an HDD.

I think the MS approach of letting the SSD act as a sort of victim cache for the RAM is a very smart idea, and it does in theory give them access to 100+GB of total memory, all HW-managed for best performance. But it's a somewhat unique mechanism, and may not be fully exploited.
(I dunno if you can use the PS5 the same way..)
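(As a toy illustration of the victim-cache idea only; the real thing is managed by the OS/hardware, not game code:)

```python
from collections import OrderedDict

class VictimCachedAssets:
    """Two-level toy cache: hot assets live in RAM; on eviction they drop to an
    SSD-backed 'victim' tier instead of being discarded, so a re-request costs
    an SSD read rather than a full reload/re-decode."""

    def __init__(self, ram_budget_items: int):
        self.ram = OrderedDict()       # LRU order, oldest first
        self.ssd_victims = {}          # stands in for pages parked on the SSD
        self.ram_budget = ram_budget_items

    def get(self, key, load_from_disk):
        if key in self.ram:
            self.ram.move_to_end(key)                  # refresh LRU position
            return self.ram[key]
        if key in self.ssd_victims:                    # cheap SSD hit, promote to RAM
            value = self.ssd_victims.pop(key)
        else:
            value = load_from_disk(key)                # slow path: full load
        self._insert(key, value)
        return value

    def _insert(self, key, value):
        self.ram[key] = value
        while len(self.ram) > self.ram_budget:
            old_key, old_val = self.ram.popitem(last=False)
            self.ssd_victims[old_key] = old_val        # demote instead of discard
```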
 