Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Indeed, but that can also be attributed to mip-map selection algorithms.

It's difficult to use what you don't have in memory, especially if you have a spinner that delivers something like 20MB/s (the figure from the Spider-Man streaming GDC talk for PS4). Of course, Spider-Man solved this partially by limiting movement speed. GTA V didn't add such limits on movement speed, and it's a good title to examine to see how things fall apart when a slow spinning disk can't keep up.

Another example is fast travel in many open-world games. Often it takes a while, textures look like garbage at first, and the detailed textures load in over time. Very annoying and immersion-breaking.
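To put a rough number on why the spinner falls apart, here's a back-of-envelope sketch in Python; the traversal speeds are made-up illustrative values, not figures from the GDC talk:

```python
# Back-of-envelope: how much asset data a drive can deliver per metre of
# player movement. Both traversal speeds below are illustrative assumptions.

def mb_per_metre(drive_mb_s: float, player_speed_m_s: float) -> float:
    return drive_mb_s / player_speed_m_s

print(f"HDD @ 20 MB/s, 10 m/s traversal:    {mb_per_metre(20, 10):.1f} MB per metre")
print(f"NVMe @ 5500 MB/s, 40 m/s traversal: {mb_per_metre(5500, 40):.1f} MB per metre")
```

Even with far faster traversal, the streaming budget per metre of movement is vastly larger on the NVMe drive, which is why the old designs had to throttle player speed instead.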
 
Indeed, but this may not necessarily be a memory issue (though I'm not saying it isn't). Sampler Feedback was created to address exactly the issue you're referring to: instead of leaving each developer to handle it differently, the driver/hardware can handle it.

https://devblogs.microsoft.com/dire...edback-some-useful-once-hidden-data-unlocked/
Why Feedback: A Streaming Scenario

Suppose you are shading a complicated 3D scene. The camera moves swiftly throughout the scene, causing some objects to be moved into different levels of detail. Since you need to aggressively optimize for memory, you bind resources to cope with the demand for different LODs. Perhaps you use a texture streaming system; perhaps it uses tiled resources to keep those gigantic 4K mip 0s non-resident if you don’t need them. Anyway, you have a shader which samples a mipped texture using A Very Complicated sampling pattern. Pick your favorite one, say anisotropic.

The sampling in this shader has you asking some questions.

What mip level did it ultimately sample? Seems like a very basic question. In a world before Sampler Feedback there’s no easy way to know. You could cobble together a heuristic. You can get to thinking about the sampling pattern, and make some educated guesses. But 1) You don’t have time for that, and 2) there’s no way it’d be 100% reliable.

Where exactly in the resource did it sample? More specifically, what you really need to know is— which tiles? Could be in the top left corner, or right in the middle of the texture. Your streaming system would really benefit from this so that you’d know which mips to load up next. Yeah while you could always use HLSL CheckAccessFullyMapped to determine yes/no did-a-sample-try-to-get-at-something-nonresident, it’s definitely not the right tool for the job.

Direct3D Sampler Feedback answers these powerful questions.
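As a toy illustration of how a streaming system would consume that feedback, here's a plain-Python stand-in for the real D3D12 feedback map and tiled-resource machinery; every name and number here is hypothetical:

```python
# Toy model of consuming sampler feedback: the GPU records, per texture tile,
# the most detailed mip it actually wanted; the streamer loads what's missing.
# Purely illustrative -- real feedback maps are D3D12 resources, not dicts.

requested_mips = {(0, 0): 3, (5, 2): 0, (5, 3): 1}   # tile -> finest mip sampled
resident_mips  = {(0, 0): 3, (5, 2): 2, (5, 3): 1}   # tile -> finest mip in memory

to_load = []
for tile, wanted in requested_mips.items():
    have = resident_mips.get(tile, 99)   # 99 = nothing resident for this tile
    if wanted < have:                    # lower mip index = more detail needed
        to_load.append((tile, wanted))

print("tiles/mips to stream in next:", to_load)   # [((5, 2), 0)]
```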
 

You read way too much into what I wrote. What I wrote still helps with SFS, as SFS would be DOA on a spinning disk due to seek times. The big drawback of SFS is that it has to miss before it knows what to fetch. It would be better to have a heuristic/DNN predict what is needed to avoid misses, and let SFS handle whatever is left. The bad thing about a miss is that by the time you have the data in RAM it might not be needed anymore. In the worst case this could lead to a pathological situation, like a dog chasing its tail.

What I'm trying to say is that there is no single magic bullet. A fast SSD, compression, SFS, better heuristics, possibly DNNs, etc. will all contribute towards a better solution. I'm not trying to reduce the problem to a magic bullet; I was just trying to say that compression and a faster SSD help avoid some of the issues we saw in the PS4 generation. Will they solve all issues? Hell no.
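A sketch of that "predict first, let feedback catch the leftovers" idea; the predictor here is just a stub standing in for a camera-motion heuristic or a trained model, and all names are hypothetical:

```python
# Illustrative only: combine speculative prefetch with reactive sampler feedback.
# predict_tiles() is a hypothetical stand-in for a heuristic or DNN predictor.

def predict_tiles(camera_pos, camera_vel, horizon_s=0.5):
    """Guess which world tiles the camera will need ~horizon_s seconds from now."""
    future_x = camera_pos[0] + camera_vel[0] * horizon_s
    future_y = camera_pos[1] + camera_vel[1] * horizon_s
    return {(int(future_x) // 64, int(future_y) // 64)}

def tiles_to_stream(predicted, feedback_misses, resident):
    # Prefetch predictions hide latency; feedback misses are the safety net, but
    # by the time a miss is serviced the data may already be stale.
    return (predicted | feedback_misses) - resident

resident  = {(1, 1)}
predicted = predict_tiles(camera_pos=(100.0, 80.0), camera_vel=(30.0, 0.0))
misses    = {(1, 2)}   # tiles the GPU sampled this frame but found non-resident
print(tiles_to_stream(predicted, misses, resident))
```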
 
Yeah, I'm on the same wavelength; there are all sorts of issues. But yes, a faster SSD will help for sure.
 
Using current compression routines, no. But both BCPACK and now Oodle Texture + Kraken are claiming just that. There's no reason RTX IO couldn't be using one of these or an equivalent.
Oodle software isn't free. Sony is paying for a license in every PS5 devkit to use Oodle Texture and Kraken.
Are you suggesting nvidia will be paying Oodle Texture and Kraken licenses for every single PC game developer? Did nvidia say they support Kraken decompression using RTX IO?

Now that you mention this, did nvidia ever even mention what formats they're able to decompress on their GPUs? I can't see that anywhere.


Sony's 8-9GB/s (11GB/s with Oodle Texture) is the figure that is comparable to Nvidia's 14GB/s.
Where are these 11GB/s coming from?
The 9GB/s with Kraken alone came from the 5.5GB/s raw with 1.64:1 compression ratio, 5.5GB/s x 1.64 = 9GB/s.
With Oodle Texture on top of Kraken, Oodle is claiming ~74% higher compression than Kraken alone, meaning 2.85:1.
So on the same raw 5.5GB/s they now get 5.5GB/s x 2.85 = 15.7GB/s. On the example taken from that blog post, they're mentioning a texture "from a recent game" showing a 3.16:1 compression ratio with Kraken + Oodle Texture, meaning the PS5's IO is sending that texture at 17.38GB/s.

Where are these 11GB/s coming from, that you conveniently compare to nvidia's superior 14GB/s number that no one was ever able to measure anywhere?
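For reference, here is the arithmetic in this back-and-forth written out in one place; the ratios are the ones claimed in this thread, none of them independently measured:

```python
# Effective I/O throughput = raw drive bandwidth x compression ratio.
# The ratios below are the claimed figures quoted above, not measurements.

def effective_gb_s(raw_gb_s: float, ratio: float) -> float:
    return raw_gb_s * ratio

print(effective_gb_s(5.5, 1.64))   # ~9.0 GB/s  (Kraken alone, Sony's average)
print(effective_gb_s(5.5, 2.00))   # 11.0 GB/s  (Kraken + Oodle Texture, ~2:1 average)
print(effective_gb_s(5.5, 3.16))   # ~17.4 GB/s (one best-case texture, not an average)
print(effective_gb_s(7.0, 2.00))   # 14.0 GB/s  (Nvidia's RTX IO slide figure)
```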


If your argument is simply that "nvidia is lying" then this isn't a technical discussion anymore and we should end it there.
No, my argument is that you're drawing way too many conclusions from a couple of Nvidia slides that don't really say anything other than: RTX IO could do 14GB/s effective throughput if
1. there's an SSD in your system that reads at 7GB/s,
2. your O.S. / DirectX is upgraded to support DirectStorage (and it works as advertised),
3. your CPU+motherboard combo lets the GPU use the SSD as a slave, and
4. there's a 2:1 compression ratio in place using some format that no one knows, because Nvidia hasn't told us which.
Not to mention, 2:1 is a compression ratio that, despite Nvidia's claims, no fast compression format had ever achieved before Oodle Texture, and Oodle Texture will not become widely available to PC game developers unless you really think Nvidia will be buying licences for them all.

 



Where are these 11GB/s coming from?

The 11GB/s comes from Oodle. They have made improvements which should push the average compression ratio to around 2:1, which would imply 2 x 5.5GB/s = 11GB/s.

Sony has previously published that the SSD is capable of 5.5 GB/s and expected decompressed bandwidth around 8-9 GB/s, based on measurements of average compression ratios of games around 1.5 to 1. While Kraken is an excellent generic compressor, it struggled to find usable patterns on a crucial type of content : GPU textures, which make up a large fraction of game content. Since then we've made huge progress on improving the compression ratio of GPU textures, with Oodle Texture which encodes them such that subsequent Kraken compression can find patterns it can exploit. The result is that we expect the average compression ratio of games to be much better in the future, closer to 2 to 1

https://cbloomrants.blogspot.com/2020/09/how-oodle-kraken-and-oodle-texture.html?m=1

Nvidia will be limited by whatever developers use in their games. That is unless nvidia somehow during install time decompresses and then compresses the assets again. Not sure if that would even be feasible as game engine might have assumptions baked in about compression.
 
11GB/s is coming from oodle. They have made improvements which should allow compression ratio to be 1:2 which would imply 2x5.5GB/s = 11GB/s.
Ok, that seems to be for the lossless compression mode. I don't think many devs will be using the lossless mode of RDO as there seems to be little to nothing to gain.

Nvidia will be limited by whatever developers use in their games.
And all compression formats work with a 2:1 compression ratio (quick answer: they don't)? And nvidia's GPGPU decompressor works with all of them?


That is unless nvidia somehow during install time decompresses and then compresses the assets again. Not sure if that would even be feasible as game engine might have assumptions baked in about compression.
This would indeed be a really cool solution, having software that intercepts the data and repacks it during installation. I also don't know whether game engines would play well with that, though.
 
And all compression formats work with a 2:1 compression ratio (quick answer: they don't)? And nvidia's GPGPU decompressor works with all of them?

We might not even see DirectStorage out of beta and in a release version of Windows until 2022. I don't feel comfortable predicting anything on the PC side until the DirectStorage beta ships and developers chime in on it. I wouldn't be surprised if Zen 4 CPUs/chipsets had something in them to make DirectStorage work better. I also wouldn't be surprised if GPUs in the 2022/2023 timeframe had dedicated decompression blocks that align with whatever Microsoft standardizes with DirectStorage.
 

It would be down to individual game studios to license whatever compression software they want to use with their games. Although I do wonder whether DirectStorage might mandate the use of BCPACK and license it freely to any developer that signs up for DirectStorage.

Nvidia haven't mentioned what compression formats RTXIO supports. It'll be very interesting to find out.



manux beat me to it. Examples of individual texture sets which happen to compress very well are not representative of average throughput. RAD Game Tools have put the average compression ratio at near 2:1. I assume their word is good enough for you?
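To illustrate the individual-texture-vs-average point, here is a toy blended-ratio calculation; the asset mix and per-type ratios are invented for the example:

```python
# The effective ratio over a whole game depends on the asset mix, not on the
# best-compressing texture. Shares and ratios are made up for illustration.

assets = [           # (share of uncompressed bytes, compression ratio)
    (0.60, 2.0),     # BC textures after RDO encoding
    (0.25, 1.3),     # geometry, animation
    (0.15, 1.0),     # audio/video, already compressed
]

compressed_fraction = sum(share / ratio for share, ratio in assets)
blended_ratio = 1.0 / compressed_fraction        # total uncompressed / compressed
print(f"blended ratio ~{blended_ratio:.2f}:1 -> ~{5.5 * blended_ratio:.1f} GB/s effective")
```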


No, my argument is you're taking way too many conclusions out of a couple of slides from nvidia that doesn't really say anything other than:
- RTX IO could do 14GB/s effective throughput if 1- there's an SSD in your system that reads at 7GB/s, 2- your O.S. / DirectX is upgraded to support DirectStorage (and it works as advertised), 3- your CPU+motherboard combo lets the GPU use the SSD as a slave and 4- there's a 2:1 compression ratio in place using some format that no one knows what it is because nvidia hasn't told us which.

So let's break that down, shall we?

1. Obviously this only applies to 7GB/s drives, of which several have already been announced or released.
2. Yes, it's dependent on DirectStorage; who ever argued otherwise?
3. No one has ever mentioned this as a requirement. Seems you're the one jumping to conclusions there without evidence to support it.
4. Yes, it assumes 2:1 compression, just like BCPACK and Kraken + Oodle Texture, and just like Nvidia have explicitly said the solution will achieve. I'm not sure why you're treating this as such a controversial point.

I'm not personally drawing any conclusions that haven't been presented in black and white by Nvidia and backed up by Microsoft. So it seems to me that you're arguing that Nvidia is lying when they explicitly state that the solution can achieve 14GB/s at a typical 2:1 compression ratio using a 7GB/s SSD.

Not to mention, 2:1 is a compression ratio that, despite Nvidia's claims, no fast compression format had ever achieved before Oodle Texture, and Oodle Texture will not become widely available to PC game developers unless you really think Nvidia will be buying licences for them all.

You mean other than BCPACK... you know, that compression routine that's intrinsically linked to DirectStorage, which RTX IO also happens to be intrinsically linked to.
 
Ok, that seems to be for the lossless compression mode. I don't think many devs will be using the lossless mode of RDO as there seems to be little to nothing to gain.

What lossless compression mode? RDO encoding is by its nature lossy. There are different quality options that can be set when encoding the texture, but as far as I'm aware they are all lossy, just to different degrees. RAD Game Tools have stated the typical compression ratio will be 2:1. Surely you're not saying they're incorrect?

And all compression formats work with a 2:1 compression ratio (quick answer: they don't)? And nvidia's GPGPU decompressor works with all of them?

Why is that relevant? When using whatever compression scheme Nvidia is referring to in its slides, it can achieve a 2:1 compression ratio. Maybe not all games will use that compression scheme (if there's a choice at all). Almost certainly not all games will use DirectStorage and RTX IO anyway, so it seems to be a moot point. The hardware supports those speeds if game developers choose to utilise them. It's as simple as that.
 
Right, and neither is Kraken at the same time; you can compress more at the cost of speed.

I will say that the discussion around compression seems to have gone too far into marketing numbers here. It has become about how fast and how much you can compress, using the largest numbers to represent real-world performance. The real question is what you are compressing.

As I understand it, modern games all use BC7 texture compression, and BC7 data is very difficult to compress further; even Oodle Texture's BC7Prep with Kraken can only get 5-15% more compression out of BC7. That's not really a lot, and it takes one additional step to decode the BC7Prep transform.

So it's a question of how many developers are comfortable with lossy RDO textures vs lossless. When I look at the modern landscape of games with quality modes, performance modes and photo modes, to me it makes sense to stay lossless. It takes more space to duplicate the texture, and if you use BC7Prep you are spending a compute shader to decode the texture after retrieval; so to me the raw 5.5GB/s throughput is actually the most important number here. I see 9 and 11, but what are the chances developers are willing to lose that texture quality? I.e. UE5 was likely using lossless textures, I assume, at least for the landscape.

It seems like if you want to make a graphical tour de force, you're going to stick very close to lossless for the things that matter, and use Oodle's RDO for fast, heavy lossy compression on lower-quality, fewer-channel maps like normals. Heck, IIRC there are no normal maps in UE5. You'll still need BC6H if you want HDR, and I'm not sure how well those do with Kraken either. But it just seems like the discussion has focused on how high compression can go.
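For a quick worked number on that 5-15% claim (the texture size is arbitrary, the percentages are the ones quoted above):

```python
# What an extra 5-15% reduction on already-BC7-encoded data actually buys.
# BC7 is fixed-rate (1 byte per texel), so the baseline here is the BC7 size.

bc7_size_mb = 100.0
for gain in (0.05, 0.15):            # BC7Prep + Kraken gain range quoted above
    on_disk = bc7_size_mb * (1.0 - gain)
    print(f"{gain:.0%} gain: {on_disk:.0f} MB on disk ({bc7_size_mb / on_disk:.2f}:1)")
```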

BC7Prep is an extra step, and there is a question of the resources devoted to decoding, including computation and bandwidth. RAD recommends performing the untransform step right after decompression, which seems to indicate you do lossless decompression into RAM, then push the data to the GPU for the untransform into native BC7, and then back into RAM. That's probably fine during level loads, but what about when streaming textures during gameplay?

The savings from SSD to RAM don't seem worth the pressure put on the bandwidth between RAM and the GPU.

I may need to be enlightened, but my thought process leads to this scenario.

Streaming texture data flow with BC7Prep: SSD -> 5.5 GB/s -> RAM -> 8-9 GB/s -> GPU -> 10-11 GB/s (?) -> RAM <-> 100s of GB/s <-> GPU during normal rendering.

The SSD is only pushing 5.5 GB/s into RAM, but the RAM bandwidth to and from the GPU must simultaneously handle textures being sent to the GPU for the prep decode step, textures being sent back to RAM after decode, and finally texture data moving back and forth to the GPU during normal rendering.

I don't see the benefit, as the increased performance gained from the SSD seems to be offset by the bandwidth lost between the GPU and RAM.

However, BC7Prep makes a lot of sense if it doesn't inhibit random access and can be performed on the GPU without all the back and forth.
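A toy way to put numbers on that round trip; the streaming rate and the total bandwidth figure are assumptions, and whether the extra traffic actually hurts depends on how contended that bandwidth already is:

```python
# Rough model of the extra RAM traffic implied by decompressing to RAM, reading
# the prepped data back for the GPU untransform pass, and writing native BC7
# out again. All figures are assumptions, not measurements.

streamed_gb_s = 5.5      # decompressed texture data arriving per second
ram_bw_gb_s   = 448.0    # assumed total memory bandwidth (a PS5-class figure)

round_trip_gb_s = streamed_gb_s * 2      # read for untransform + write back
print(f"extra traffic: {round_trip_gb_s:.1f} GB/s, "
      f"{round_trip_gb_s / ram_bw_gb_s:.1%} of {ram_bw_gb_s:.0f} GB/s total")
```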
 
2. Yes, it's dependent on DirectStorage; who ever argued otherwise?
Which might come to Windows PCs well past the RTX 3000's active life on the shelves, but that's probably a different discussion.



3- your CPU+motherboard combo lets the GPU use the SSD as a slave, and
3. No one has ever mentioned this as a requirement. Seems you're the one jumping to conclusions there without evidence to support it.

Is nvidia's own marketing material enough evidence for you?

[Image: Nvidia RTX IO marketing slide]

4. Yes it assumed 2:1 compression, just like BCPACK, just like Kraken+Oodle Texture and just like Nvidia have explicitly said the solution will achieve. I'm not sure why you're treating this as such a controversial point.
(...)
You mean other than BCPACK.... you know, that compression routine that's intrinsically linked to DirectStorage which RTXIO also happens to be intrinsically linked too.
Because not every compression format gains performance from parallel execution. Kraken uses 2 threads (!!!) per file, meaning you're either decompressing hundreds or thousands of different textures at the same time or you gain nothing from running it through GPGPU.
And now you're just assuming RTX graphics cards can decompress BCPack through GPGPU just because they're both related to DirectStorage, without knowing whether BCPack decompression is even effective through GPGPU. Microsoft is using dedicated decompression hardware for BCPack, so where's the evidence that BCPack is effective on GPGPU?
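A sketch of the parallelism concern; the per-stream decode rate is a pure assumption, just to show why you need many independent streams in flight:

```python
# If a single compressed stream can only be decoded at some fixed rate, hitting
# a big aggregate number means having many independent streams in flight.
# The per-stream rate below is an assumption for illustration only.

import math

per_stream_gb_s = 1.0    # assumed decode rate for one stream/file
target_gb_s     = 14.0   # the figure on Nvidia's RTX IO slide

streams_needed = math.ceil(target_gb_s / per_stream_gb_s)
print(f"~{streams_needed} streams in flight to reach {target_gb_s:.0f} GB/s")
```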




Why is that relevant? When using whatever compression scheme Nvidia is referring to in it's slides, it can achieve a 2:1 compression ratio.
Of course it's relevant. Are the 14GB/s achievable using a widely adopted industry standard in the PC space? With how many files decompressing at the same time (i.e. how dependent is that throughput on parallel execution)? Or is it only achievable using an Nvidia-proprietary texture format that decompresses at 14GB/s on an Nvidia GPU but can only manage 2GB/s on a GPU from another vendor?
Nvidia purposely omitting the texture compression format can be a very telling move on their part. They don't even need to be lying for the real-life performance of RTX IO to end up completely different from their statements and/or to have repercussions for the PC ecosystem.
 

And what if you use Oodle Texture with RTX IO? Oodle Texture is compressor-agnostic, as it doesn't require Kraken or any other compressor in RAD's suite of data compressors. Or what if you use some other texture encoder that does RDO optimization? What RAD is using to increase compression ratios isn't exclusive to Oodle Texture or the PS5. There are a number of encoders that support different supercompression techniques.

Unity devs have access to Crunch. Binomial offers Basis.

Nvidia is just using a factor of 2 because it's a typical compression factor for most lossless compression algorithms.
 
Yup, I feel exactly the same way about the BC7Prep round trip. Which is why the talk about compression, using the highest possible numbers, isn't really illustrative of what's happening unless your game is made up of lossy textures.
 