Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Well, now I'm just trying to figure out what you interpreted as a pretty intense take on whose words, and what exactly you're implying was a direct attack.
I certainly didn't mean any of what I wrote as an attack, only as presenting a diverging opinion.
This part here: You think I'm in the conspiracy theory level territory, I think you're in the drinking the Kool-Aid too soon territory.

He made no suggestion except to wait for the final data points from nvidia
 

I actually quoted the "conspiracy theory level" expression which was very clearly directed at me. I took no offense from it as it's just a different PoV from mine.
I also meant no offense with my Kool-Aid comment; it was just meant as an expression of an opposing view.
I'm sorry if offense was taken.
 
It's just using decompression of several files in parallel.

Just like any game would be doing when retrieving textures measuring in the tens of MB, in 64-256KB blocks, at multiple GB per second.

I recommend reading this paper on the subject, in which the authors from IBM and Columbia U. comment on the limitations of Zlib for parallel computing performance and propose a new compression format that is parallel-friendly:
Massively-Parallel Lossless Data Decompression



Note: they're using compression of HTML text and sparse matrices, which compress a lot more than textures; that's why zlib reaches 3:1 and 5:1 compression ratios there, whereas with textures it's usually ~1.8:1 or less.

[Chart from the paper: compression ratio vs. decompression throughput]


In the end, they came up with a compressor that is indeed much faster at decompressing, but it also has a much lower compression ratio (meaning the effective throughput is very far from nvidia's "as fast as 24 cores" claim). With their method they spend a bit less energy than CPU zlib on the decompression operation, though at the cost of significantly more disk space, and they depend on a very high-throughput storage source. There's no free lunch here. Kraken is probably much better here, and so should BCPack be for textures.


From the conclusion of that paper:

"Gompresso [their GPU based decompression routine] decompresses two real-world datasets 2× faster than the state-of-the-art blockparallel variant of zlib running on a modern multi-core CPU, while suffering no more than a 10 % penalty in compression ratio."

That doesn't sound like a much lower compression ratio to me. But in any case, despite working on an entirely different data set with very different block sizes and very possibly a completely different base compression routine, this article does seem to demonstrate that parallel GPU based decompression is entirely possible. So why you're using it as some kind of proof that Nvidia couldn't be doing GPU based texture decompression escapes me.

They couldn't do anything remotely close to the performance of dedicated ASICs or hardware blocks.

This is less an apples to oranges comparison and more an apples to screwdriver level comparison. I see no basis at all for using this to suggest Nvidia would be unable to perform high-speed texture decompression on the GPU. Would a dedicated ASIC be faster for the same silicon cost? Almost certainly yes. And that answers your earlier question as to why the consoles use ASICs rather than bigger GPUs.


Or they're just using massive numbers of parallel decompression threads on different texture files, which in the end makes the 14GB/s throughput an unrealistic load for any real-life scenario. And although the aggregated throughput is high, the time it takes to decompress one large texture makes it unusable for actual texture streaming in games.

As Rikimaru pointed out above, the PS5 uses 256KB blocks and Direct Storage is likely to use something similar. These aren't whole textures, they're blocks from whole textures which can be requested individually or en masse as needed by the game engine. Even if these blocks are grouped together for decompression rather than being decompressed individually (and I'm not convinced that's how it works), you said yourself that a full 4K texture only measures in the tens of MB. How many of those would you need to be decompressing per second to hit these multi-GB/s speeds that are being discussed? In your example, around 250 per second at the PS5's drive speed. And naturally, not everything being called would be a full texture.
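Quick back-of-the-envelope (my numbers, assuming roughly 20 MB per 4K texture as in your example, before any compression gains):

Code:
# Rough per-second texture budget at the PS5's raw drive speed (illustrative numbers).
drive_speed = 5.5e9     # bytes/s, PS5 raw SSD throughput
texture_size = 20e6     # bytes, a "tens of MB" 4K texture

print(drive_speed / texture_size)   # ~275 full textures per second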

And only future GPUs that actually have dedicated hardware blocks for decompression will ever make DirectStorage with GPU decompression usable.

So we're back to Nvidia is making it all up then despite them showing a live demonstration of high speed decompression running on the GPU.

It's the same nvidia-the-world's-foremost-gpu-maker who presented on stage a graphics card for (paper-)launch day, which photos later showed was being held together by woodscrews.
Lack of information is usually suspicious, and in this case they're omitting a ton of it.


Except for the aforementioned live demonstration.

He starts talking about storage at the 5 minute mark. He starts talking about Kraken and the Custom IO Unit at ~17m. He moves on from the storage talk at ~24m. In a 53 minute presentation.


You specifically said he "spent a third of the PS5 hardware presentation talking about the importance of their high-performance decompressor" to validate how important you think it is vs GPU based decompression. This is not the same as talking about storage and the overall IO system including expandable storage. He specifically started talking about the hardware decompressor at 17:35 and stopped at 18:05. A whole 30 seconds.

It's large throughput with very low single-threaded performance. Still not a good match for decompression.

If you were trying to decompress a single multi-gigabyte file. Which you're not.

They didn't show DirectStorage with CPU decompression, they only showed DirectStorage GPU vs. current on CPU. It's apples vs. oranges.
Why didn't they show CPU vs. GPU both on DirectStorage?


That isn't mentioned anywhere in the slides. The slides simply compare NVMe CPU decompression to NVMe GPU decompression (which is >3x faster than a 24-core Threadripper). Use of Direct Storage isn't mentioned in either case. And even if it were the case that the GPU decompression used DS while the CPU didn't, you're attempting to draw conclusions from that which have no supporting basis in the data. We've seen no suggestion anywhere that Direct Storage reduces CPU decompression times to less than 1/3rd of the original, yet you're basing your conclusion that GPU decompression can't possibly be real/useful on this made-up assumption.
 
After that they can remaster G-Police and I will be happy.

Day one purchase. Damn you had to remind me of that game lol, never came up until today.

Here's an article I found that touches on RTX IO:

https://thefanatic.net/the-rtx-30-made-the-ps5-and-xbox-series-x-obsolete/

''RTX IO technology allows the GPU to be used to speed up the workload when decompressing data from the SSD. So far, this task has been performed on the CPU and is so impactful that it can easily consume up to two full cores. By shifting the workload to the GPU, the duty cycle is optimized and one of the most important bottlenecks game developers face today is almost completely eliminated.
Contrary to what Sony said at the time, the loading times are not lost, even with this major breakthrough. I have told you more than once before, and as you can see I was not wrong, but it is evident that the difference has been enormous: we're going from 5 seconds to 1.5 seconds. If we look at the results with a traditional hard drive, the difference is huge.''


Article on RTX/RDNA2 GPUs:

https://www.tomsguide.com/news/forget-ps5-and-xbox-series-x-nvidias-rtx-3070-could-be-all-you-need
 
Bandwidth is just a rate. Max bandwidth allows easy and simple comparison without requiring finer details.

A max of 128 bytes per cycle vs. a max of 256 bytes may inform me about the bus width, but it tells me nothing about transfer speeds without knowing the frequencies involved. Max bandwidth accounts for the most basic but relevant variables and generates an easy-to-understand figure.

Everything else being equal, 20 GB/s vs. 5 GB/s tells me one solution moves data 4 times faster regardless of data size. That's true unless your data transfer needs are too small to readily saturate the bus at a per-cycle level.

Even if future games typically involve streaming at just 1 GB/s, a 20 GB/s SSD will still be faster than a 5 GB/s SSD in almost all cases.

In other words, saying that 100 mph doesn't mean much because I never need to go 100 miles in an hour ignores the fact that 100 mph allows me to travel more quickly than lesser rates regardless of the distance involved.
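To put rough numbers on that (a throwaway sketch; the clocks and sizes are made up for illustration):

Code:
# Peak bandwidth is just bus width times clock; transfer time is size / bandwidth.
def bandwidth(bytes_per_cycle, freq_hz):
    return bytes_per_cycle * freq_hz

def transfer_time(size_bytes, bw_bytes_per_s):
    return size_bytes / bw_bytes_per_s

# A narrower bus at a higher clock can still be the faster link.
print(bandwidth(128, 2.0e9) / 1e9)   # 256.0 GB/s
print(bandwidth(256, 0.5e9) / 1e9)   # 128.0 GB/s

# Streaming a 1 GB burst at 20 GB/s vs. 5 GB/s:
print(transfer_time(1e9, 20e9))      # 0.05 s
print(transfer_time(1e9, 5e9))       # 0.20 s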
 
Just like any game would be doing when retrieving textures measuring in the tens of MB, in 64-256KB blocks, at multiple GB per second.
See my previous post. You can't process those 256KB blocks in parallel before stitching them together into one compressed texture file. A highly parallel processing hardware architecture is worthless for those 64-256KB blocks.

That doesn't sound like a much lower compression ratio to me.
The compression ratio is in figure 13 of the article I posted here. Check the x-axis.
It's >3:1 for zlib vs. 2:1 for the GPU compressor, i.e. below 66% of the compression effectiveness.
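And the ratio matters because it multiplies the drive's raw speed into effective delivered data. A quick sketch, with an illustrative 7 GB/s PCIe 4.0 drive as the assumed baseline:

Code:
# Effective (decompressed) throughput = raw drive throughput x compression ratio.
def effective_throughput(raw_gb_s, ratio):
    return raw_gb_s * ratio

raw = 7.0                              # GB/s, illustrative PCIe 4.0 NVMe figure
print(effective_throughput(raw, 3.0))  # 21.0 GB/s delivered at 3:1 (zlib on that dataset)
print(effective_throughput(raw, 2.0))  # 14.0 GB/s delivered at 2:1 (the GPU compressor)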


This is less an apples to oranges comparison and more an apples to screwdriver level comparison. I see no basis at all for using this to suggest Nvidia would be unable to perform high speed texture decompression on the GPU.
A highly focused fixed-function hardware block makes sense for decompressing a file very fast. A highly parallel set of lower-performing ALUs does not. The commercial IP blocks I linked to have more explanations of why "throwing more cores" isn't the solution to the file decompression problem.

As Rikimaru pointed out above, the PS5 used 256k blocks and Direct Storage is likely to use something similar. These aren't whole textures, they're blocks from whole textures which can be requested individually or en-masse as needed by the game engine.
Why would the game engine request a random slice of a compressed texture file if it can't do anything with it?
At most, the game engine determines a desired LOD and requests a mipmap accordingly, but the smaller mip is still inside the larger texture file AFAIK. The advantage here is the engine doesn't need to put the large texture into the VRAM.

Even if these blocks are grouped together for decompression rather than being decompressed individually (and I'm not convinced that's how it works)
Well if you're not reading all the explanations and quotes I wrote here, nor the documentation and scientific articles I linked to about how file compression works, then it's really easy to not be convinced of anything..
¯\_(ツ)_/¯


you said yourself that a full 4K texture only measures in the tens of MB. How many of those would you need to be decompressing per second to hit these multi-GB/s speeds that are being discussed? In your example, around 250 per second at the PS5's drive speed. And naturally, not everything being called would be a full texture.
You don't wait a second for a texture to load if you're streaming textures on the fly like they envision to do in this new generation (i.e. with little to no prefetching, while walking and turning, not during a loading screen or a narrow corridor).
Ideally, you wait a couple of 16ms frames or one 33ms frame (like they did in the Unreal Engine demo). This means that, within 33ms, the system needs to, for each texture:
1 - Identify the texture file and mipmap level and request the file from storage
2 - Send the texture file at 5.5GB/s (probably into the ESRAM that sits in the IO complex)
3 - Decompress the file in the ESRAM into the RAM, at 8-20GB/s

The bottleneck here is how many texture files you can send towards the ESRAM during step 2, within less than 33ms.
At 5.5GB/s, you can send around 180MB of compressed textures assuming you have the whole 33ms. Looking at the Unreal Engine's own documentation on DXT5 textures, that's just 9 (nine) 20MB 4K*4K textures being sent within 33ms, or maybe 18 if we assume a 2:1 Kraken compression afterwards.
What good are the 8704 parallel ALUs in an RTX 3080 for decompressing these 18 files that need to be processed in a mostly serial fashion? And how fast is each of these tiny ALUs at decompressing each of these 5-20MB files?
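For reference, the budget arithmetic above in one place (a back-of-the-envelope using the sizes and the 2:1 ratio I assumed):

Code:
# Per-frame streaming budget at the PS5's raw SSD speed (illustrative sizes/ratios).
frame_time = 0.033        # s, one 30 fps frame
raw_speed = 5.5e9         # bytes/s off the SSD
texture_size = 20e6       # bytes, a 4K*4K DXT5 texture
kraken_ratio = 2.0        # assumed compression ratio

budget = raw_speed * frame_time                 # data through the link per frame
print(budget / 1e6)                             # ~181.5 MB
print(budget / texture_size)                    # ~9 textures/frame if sent uncompressed
print(budget / (texture_size / kraken_ratio))   # ~18 textures/frame at 2:1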

It's not important to be able to decompress 5000 textures within 2 seconds (which I'm sure GPUs could be very good at.. until they ran out of VRAM, at least). What's important is to be able to decompress one large compressed texture file - which is a mostly non-parallel task - within a couple of milliseconds.


You specifically said he "spent a third of the PS5 hardware presentation talking about the importance of their high-performance decompressor" to validate how important you think it is vs GPU based decompression.
Yes, he actually spent about half the presentation time talking about how the PS5's storage is so important for their next-gen vision, all of which is dependent on the SSD's raw speed and the decompressor's performance to drive data to the RAM.



The slides simply compare NVMe CPU decompression to NVMe GPU decompression (which is >3x faster than a 24-core Threadripper). Use of Direct Storage isn't mentioned in either case.
It's literally in the article you posted yourself.

[Image: NVIDIA slide comparing level load times with RTX IO vs. current methods]

A demo to show the theoretical benefits of NVIDIA RTX IO, that works in conjunction with Microsoft's DirectStorage API, was also shown. During the demo, handling the level load and decompression took about 4X as long on a PCIe Gen 4 SSD using current methods and used significantly more CPU core resources.
They're comparing DirectStorage vs. non-DirectStorage.




And even if it were the case that the GPU decompression used DS while the CPU didn't, you're attempting to draw conclusions from that which have no supporting basis in the data. We've seen no suggestion anywhere that Direct Storage reduces CPU decompression times to less than 1/3rd of the original, yet you're basing your conclusion that GPU decompression can't possibly be real/useful on this made-up assumption.
Proof #88 of how nVidia is being super shady about this: they're not even measuring decompression speed in that demo.
Look at the graph: it says level load time, and they also talk about CPU utilization.

Here's what Microsoft says about DirectStorage:
https://news.xbox.com/en-us/2020/03/16/xbox-series-x-glossary/
Modern games perform asset streaming in the background to continuously load the next parts of the world while you play, and DirectStorage can reduce the CPU overhead for these I/O operations from multiple cores to taking just a small fraction of a single core;

https://devblogs.microsoft.com/directx/directstorage-is-coming-to-pc/
The DirectStorage API is architected in a way that takes all this into account and maximizes performance throughout the entire pipeline from NVMe drive all the way to the GPU.

It does this in several ways: by reducing per-request NVMe overhead, enabling batched many-at-a-time parallel IO requests which can be efficiently fed to the GPU, and giving games finer grain control over when they get notified of IO request completion instead of having to react to every tiny IO completion.

DirectStorage is all about reducing the number of steps (i.e. CPU cycles and CPU-to-IO communication latency) that are currently needed to perform an IO request.
Loading from an NVMe drive will already be much faster with DirectStorage even when using CPU decompression, and nVidia very conveniently left that info out of their "demo".
 
See my previous post. You can't process those 256KB blocks in parallel before stitching them together into one compressed texture file. A highly parallel processing hardware architecture is worthless for those 64-256KB blocks.

I can't help but feel we're going round in circles here. You seem to have convinced yourself that GPU texture decompression is a practical impossibility, and that when Nvidia say they can do it (and even demonstrate it) they're lying and deceiving us.

Fair enough.

As it seems that you won't be swayed by any argument at this point, I suggest we end this thread derailment and leave the thread to more productive discussions.

When and if additional evidence of GPU-based decompression emerges, we can revisit the discussion.

I will just leave you with something to read over in the meantime though. The following are all papers relating to GPU-based texture or shadow map decompression, including a patent by AMD for doing exactly what Nvidia are claiming with RTX IO (real-time GPU-based texture decompression).

https://pdfs.semanticscholar.org/6eca/44be96dcf1156c047656b9286485eeb51316.pdf

https://ieeexplore.ieee.org/document/5532495

https://patents.google.com/patent/US9378560B2/en

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.440.44&rep=rep1&type=pdf

Pay particular attention to the first link, which describes parallel execution of LZW decompression on individual texture files using Nvidia CUDA cores, achieving over a 40x speed-up vs. an Intel i7 on a mere GTX 980.
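For the general shape of the idea, here's a trivial CPU-side sketch (zlib plus a thread pool, not the CUDA LZW scheme from those papers; CPython's zlib releases the GIL, so the workers genuinely overlap):

Code:
import zlib
from concurrent.futures import ThreadPoolExecutor

def decompress_one(blob):
    # Each texture file / block is an independent stream, so nothing
    # here depends on any other item finishing first.
    return zlib.decompress(blob)

def decompress_many(blobs, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decompress_one, blobs))

# Fake workload: 64 independently compressed 256 KB "texture blocks".
blocks = [zlib.compress(bytes(256 * 1024)) for _ in range(64)]
textures = decompress_many(blocks)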
 
Has anybody mentioned MS's Project Zipline and Corsica?

https://azure.microsoft.com/en-us/blog/improved-cloud-service-performance-through-asic-acceleration/
https://www.datacenterknowledge.com...-hardware-using-its-new-compression-algorithm

MS even open-sourced the RTL so you can freely implement the design in your own hardware.

https://www.hotchips.org/hc31/HC31_T2_Microsoft_CarrieChiouChung.pdf
You have to go deep into the slides, but they mention that Corsica provides Xbox-level security. The Corsica ASIC handles encryption, compression and authentication at 100 Gbps, and it handles XP10, Zlib and gzip.
 
See my previous post. You can't process those 256KB blocks in parallel before stitching them together into one compressed texture file. A highly parallel processing hardware architecture is worthless for those 64-256KB blocks.



You have MiGz, used by LinkedIn, which offers multithreaded compression and decompression while remaining compatible with any single-threaded gzip decoder.

GZip does not normally write the compressed size of each block in its header, so finding the position of the next block requires decompressing the current one, precluding multithreaded decompression. Fortunately, GZip supports additional, custom fields known as EXTRA fields. When writing a compressed file, MiGz adds an EXTRA field with the compressed size of the block; this field will be ignored by other GZip decompressors, but MiGz uses it to determine the location of the next block without having to decompress the current block. By reading a block, handing it to another thread for decompression, reading the next block and repeating, MiGz is able to decompress the file in parallel.
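A toy illustration of that trick (not MiGz's actual gzip EXTRA-field format, just the core idea of storing each block's compressed size so a reader can jump from block to block and hand them to separate threads):

Code:
import zlib, struct
from concurrent.futures import ThreadPoolExecutor

BLOCK = 256 * 1024  # split the input into independently compressed blocks

def compress_blocked(data):
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        comp = zlib.compress(data[i:i + BLOCK])
        # Prefix each block with its compressed size, playing the role of
        # MiGz's EXTRA field (this framing is made up for illustration).
        out += struct.pack("<I", len(comp)) + comp
    return bytes(out)

def decompress_blocked(blob, workers=8):
    blocks, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        blocks.append(blob[pos:pos + size])  # skip ahead without decompressing
        pos += size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

data = bytes(4 * 1024 * 1024)
assert decompress_blocked(compress_blocked(data)) == data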

This isn't a patent or some research solution that hasn't been applied in the real world. It's being used by a major corporation - namely LinkedIn, a Microsoft company.

There are also gzip variants similar to MiGz, like GZinga and BGZip, that offer random access. GZinga is used by eBay.

https://tech.ebayinc.com/engineering/gzinga-seekable-and-splittable-gzip/

Who uses BGZip? It's used to compress bioinformatics data. Guess whose hardware can support this format to accelerate bioinformatics apps? Nvidia. BGZip is supported on Nvidia GPUs using CUDA.

http://nvlabs.github.io/nvbio/
 
Remember, DirectStorage is an API. Like Direct3D, the API is a standard, a protocol: it states what needs to be implemented and what is to be accomplished. The functionality is implemented by the device driver (and some runtime components that come with Windows). So RTX I/O should be Nvidia's implementation of the set of capabilities that conform to DirectStorage; it's not entirely just a marketing name.

Or it could be like Nvidia's OptiX, which is Nvidia's own API that lets developers tap into Nvidia GPUs' ray tracing capabilities, and DirectStorage will be supported alongside it.
 
Remember, DirectStorage is an API. Like Direct3D, the API is a standard, a protocol: it states what needs to be implemented and what is to be accomplished. The functionality is implemented by the device driver (and some runtime components that come with Windows). So RTX I/O should be Nvidia's implementation of the set of capabilities that conform to DirectStorage; it's not entirely just a marketing name.

Or it could be like Nvidia's OptiX, which is Nvidia's own API that lets developers tap into Nvidia GPUs' ray tracing capabilities, and DirectStorage will be supported alongside it.

Yep, this is the real question IMO. Of course it will work, but how widely implemented will it be? If it's the former, the answer is likely "quite well" within a couple of years; if it's the latter, then it's likely no more than DLSS.
 
Are we sure there is no hardware requirement for Direct Storage on the mobo side of things?
 
We don't know at this stage. We only know none have been stated so far. Microsoft have listed an NVMe SSD and Windows 10 for DirectStorage, and Nvidia have listed DirectStorage and an RTX GPU.
Thanks, just wondering if I missed an important clue. So we're basically still waiting for more information at this point in time.
 
It seems from Astro Bot that they have shortened the "transition scene" as an artistic choice rather than purely because of SSD/IO performance improvements. Meaning the transition scene length is more of an artistic decision than a PS5 SSD/IO performance limitation. The PS5 SSD/IO is so damn fast that transitioning between scenes or worlds requires more time spent crafting aesthetically pleasing transitions than working around SSD/IO limitations... nice!

 