Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

RTX IO takes advantage of DirectStorage, which allows you to bypass the CPU and load directly into GPU memory. But DirectStorage doesn't provide decompression hardware; that's up to the hardware vendor. The GPU vendor needs to provide the decompression scheme, whether it's implemented on the GPU's shaders or in an ASIC. Otherwise, decompression has to happen on the CPU, which defeats the purpose of DirectStorage.

DirectStorage offers a direct pathway from the SSD to the GPU's memory, while RTX IO offers a way to widen that pathway.
Yes, but I don't think the decompression side of things is an issue. We already have Oodle saying their compression formats work really well on existing GPUs and it would probably cost nVidia or AMD pennies to add decompression ASICs to their cards somehow.
 
We already have Oodle saying their compression formats work really well on existing GPUs and it would probably cost nVidia or AMD pennies to add decompression ASICs to their cards somehow.

I think using existing hardware available on RTX GPUs was NV's hardware solution; it's faster, more flexible, and has a wider compatibility range. I assume AMD GPUs will have the same support in some way. It's in NV, MS, and AMD's interest to address this universally since they all have huge stakes in this market.
 
Yes, but I don't think the decompression side of things is an issue. We already have Oodle saying their compression formats work really well on existing GPUs and it would probably cost nVidia or AMD pennies to add decompression ASICs to their cards somehow.

Hindsight is 20/20, but during the Nvidia 2000/3000-series design phase HDDs were probably the dominant drives in terms of market share in the gaming space, and DirectStorage wasn't anywhere in sight. There was no point in offering decompression ASICs capable of decoding tens of gigabytes of compressed data per second (ignoring DCC) on their consumer hardware.

HDDs were incapable of pushing that much data to the GPU, and the data had to traverse the CPU anyway, which was perfectly capable of handling decompression of compressed streams measured in MB/s.

Consoles have the tech because SSDs are standard on new-gen hardware, and console manufacturers have to be more forward-thinking as their hardware has to last 7-8 years into the future.
 
I think using existing hardware available on RTX GPUs was NV's hardware solution; it's faster, more flexible, and has a wider compatibility range. I assume AMD GPUs will have the same support in some way. It's in NV, MS, and AMD's interest to address this universally since they all have huge stakes in this market.

I imagine it will be a shader-based solution. It's not ideal unless it offers random access, because of all the data that has to be shuttled back and forth across the GPU memory bus. It also takes up more VRAM. It's better to have ASICs sitting between video memory and the SSD.
 
I don't understand why some are viewing Nvidia's claims as so controversial. What's so outlandish about what they're claiming?

They've simply said that with a 7GB/s SSD (already available) and a 2:1 compression ratio (the same as claimed by both Sony and Microsoft) you will see a 14GB/s output by doing the decompression on the GPU.
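For reference, the arithmetic behind that figure is as simple as it sounds; a quick sketch (the inputs are the marketing numbers, not measurements, and it assumes the decompressor never becomes the bottleneck):

Code:
# Effective throughput = raw SSD bandwidth x compression ratio,
# assuming the decompressor can keep up with the drive.
raw_ssd_gbps = 7.0         # PCIe 4.0 NVMe drives already on the market
compression_ratio = 2.0    # the ~2:1 figure quoted by Sony, Microsoft and Nvidia

effective_gbps = raw_ssd_gbps * compression_ratio
print(f"Effective output: {effective_gbps:.0f} GB/s")   # -> 14 GB/s
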

So what you're saying is that future PC solutions will be better than consoles coming out a month from now? WOW!

In other news: The earth orbits the sun.
 
We honestly don't know if RTX I/O is simply a branding of parts of the DirectStorage API, something unique, or a superset of it with additional Nvidia-specific hooks. I agree that when it becomes standardised everyone will use it, but we don't know how far off that is, and depending on how many vendors MS has to interact with to make it work, it might take longer, given that MS has largely designed DirectX around only three major hardware vendors over the past decade or more (AMD, Intel, Nvidia). In that case, perhaps NV starting their own badge program, "designed for RTX I/O", ahead of broader adoption might offer them an edge. I just don't know if DirectStorage or RTX I/O requires additional BIOS hooks or support for novel I/O commands to work.
We do not know the exact specifics, but I think it runs into conspiracy-theory territory to assume all those negative things about it being shady and to mix it in with that fp32/int32 stuff you mentioned. Why talk about possibilities of such negatives if there is no evidence of them?
 
We do not know the exact specifics, but I think it runs into conspiracy-theory territory to assume all those negative things about it being shady and to mix it in with that fp32/int32 stuff you mentioned. Why talk about possibilities of such negatives if there is no evidence of them?

I suspect that you were at least partially addressing ToTTenTranz with your comment, but I'm going to expand on my point as to why I think this is not going to be a simple win for any vendor trying to do direct DMA across the PCI-E bus today. One of the advantages of PCI-Express over the older PCI standard is that it moved the bus itself to a switched design, allowing CPU manufacturers to add arbitrary numbers of lanes in a simple hierarchy by adding multiple PCI roots and bridging them with PCI-to-PCI bridges. Internally the CPU has switches on the root bus to handle swapping between these in a fashion that is basically transparent to the user, so they perceive themselves as having 48 PCIe lanes when internally they have, for example, 3 x 16.

DF themselves ran into the complexities this generates during the Horizon Zero Dawn benchmarking, when it was discovered that due to a misconfiguration of expansion cards Alex had inadvertently halved the bandwidth available to his PCIe x16 slot; a quick juggling of expansion cards and his GPU got back an additional 8 lanes of PCIe. What HZD was doing should have been a bog-standard use of PCIe bus transactions, but most games steer well clear of doing things like that because of the support risks Alex ran into: why bother dealing with how customers have configured their boards when you can just use a technique that doesn't require as many bus transactions? (It also reinforces that HZD was a late port; no allowance was made at the design stage for PC issues because it was designed for a fixed system where this was not a concern.)

This remains an ongoing concern in general with PCIe, as can be seen in these notes from the Linux kernel (https://www.kernel.org/doc/html/latest/driver-api/pci/p2pdma.html); in that context they are mostly discussing the advanced ultra-low-latency NICs used by the likes of algo traders, which attempt to bypass any sort of CPU involvement in NIC transactions at all. In their implementation they have been limited to only allowing P2P transactions within a given root complex, of which modern CPUs have multiple instances. For example, Kaby Lake Intel CPUs have an x16 link controlled by the CPU, which is typically dedicated to the GPU, while the other PCIe lanes are controlled by the root complex in the PCH southbridge: does that restrict transactions between the two in a P2P example? How do the multiple complexes in a Ryzen CPU deal with this? Do I need my NVMe drives to be on the same root complex as my GPU? Is any of this even relevant for Windows?
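For anyone who wants to poke at their own topology, here's a rough Linux-only sketch that groups PCI devices by the root complex they sit under (it assumes the usual sysfs layout, so treat it as illustrative rather than authoritative):

Code:
# Group PCI devices by the root complex ("pciDDDD:BB" node) they hang off,
# by resolving each /sys/bus/pci/devices entry back to its sysfs parent.
import os
from collections import defaultdict

by_root = defaultdict(list)
base = "/sys/bus/pci/devices"
for dev in os.listdir(base):
    # e.g. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
    target = os.path.realpath(os.path.join(base, dev))
    root = next(part for part in target.split("/") if part.startswith("pci"))
    by_root[root].append(dev)

for root, devs in sorted(by_root.items()):
    print(root, "->", ", ".join(sorted(devs)))
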

P2P transactions across the PCI-E bus are a fairly new area for motherboard manufacturers to consider, and while long term this will all work out very well, right now Sony and MS have an obvious advantage in doing this with their total control over bus and board topology, versus the PC market where some of these choices on board layout defy explanation. If you are going to announce significant changes in how memory and bus transactions work in my PC, I am going to be sceptical that it can be simply and easily deployed if you don't come with detailed explanations of how it all works. When it comes from a section of the PC market already notorious for launching features early and aggressively marketing them as if they were already widely adopted, I'm going to be even more sceptical.

Additional context (nice deep dive on PCIe, PCH and bus layout here): https://forums.tomshardware.com/thr...-root-complex-pcie-lanes-and-the-pch.2115479/
 
We do not know the exact specifics, but I think it runs into conspiracy-theory territory to assume all those negative things about it being shady and to mix it in with that fp32/int32 stuff you mentioned. Why talk about possibilities of such negatives if there is no evidence of them?

You think I'm in conspiracy-theory territory; I think you're in drinking-the-Kool-Aid-too-soon territory. Perhaps the real territory is somewhere in the middle.
Though I did my best to explain my position, and I've yet to see a reasonable counter-argument so far. Saying we should take everything from nvidia at face value because they wouldn't lie doesn't seem reasonable enough to me, especially at B3D.

If everything in a videogame could be done with low-clocked highly parallel processors, the new consoles would have tens of Jaguar/Atom-class cores at 1.5GHz instead of "only" 8 Zen2 cores at 3.4GHz+.
What would be your opinion if nvidia came out saying their GPUs are now 20x faster than a Zen2 core at running Javascript code?



DF themselves ran into the complexities this generates during the Horizon Zero Dawn benchmarking, when it was discovered that due to a misconfiguration of expansion cards Alex had inadvertently halved the bandwidth available to his PCIe x16 slot; a quick juggling of expansion cards and his GPU got back an additional 8 lanes of PCIe.
You.. do know @Dictator is Alex, right?
 
Neither Sony nor Microsoft are claiming decompression on the GPU.

Nor was I suggesting they were. I'm fully aware they use a custom hardware unit.

Well for starters, what use are floating point operations for file decompression?

I assume you're well aware that modern GPUs have huge INT performance as well?

I wrote this several times, but let's try it again.

How Zlib works.


Zlib (or ZIP, the most popular compression algorithm), like the vast majority of compression formats, uses single-threaded decompression.



It basically counts known sequences of bits and groups them together by giving them a different "name". As a very simple example of compression:
"0000000 111111 0000 1111" can be compressed into sequences of "number of zeros"-"number of ones", of which you could say it's 7x0 + 6x1 + 4x0 + 4x1, or "'111 110 100 100".
With this, I "compressed" 21 digits into 12.
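(If you want to play with that toy scheme yourself, here's a throwaway sketch; it's just run-length counting for illustration, not how DEFLATE actually encodes anything:)

Code:
# Toy run-length coder for the example above: one 3-bit count per run.
# 21 input digits -> runs of 7, 6, 4, 4 -> "111 110 100 100" = 12 bits.
from itertools import groupby

bits = "0000000" "111111" "0000" "1111"
runs = [len(list(g)) for _, g in groupby(bits)]      # [7, 6, 4, 4]
encoded = " ".join(f"{n:03b}" for n in runs)         # '111 110 100 100'
print(runs, encoded, f"-> {len(bits)} digits in {3 * len(runs)} bits")
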
But the result is one sequential file whose order you can't change and from which you can't pull random blocks out, otherwise it becomes unreadable. Or, as explained in the blog post:


To make ZIP compression levels higher, you need bigger files. To make ZIP decompression parallel, you'd need to split the original file into smaller files. So to make ZIP decompression more parallel, you lose compression ratio, and effective IO throughput in the process.
So at least with Zlib or anything else that uses ZIP, you don't gain by adding more threads to it.
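Both halves of that trade-off are easy to see with Python's zlib module: one long stream compresses better than the same data cut into independently-compressed chunks, and you can't just start decompressing a stream from an arbitrary offset in the middle. A rough sketch (the test data is synthetic, so only the direction of the numbers matters):

Code:
import random, string, zlib

# Generate a few MB of text-like data with redundancy spread across the file.
random.seed(0)
words = ["".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9)))
         for _ in range(5000)]
data = " ".join(random.choices(words, k=500_000)).encode()

whole = zlib.compress(data, 9)                 # one sequential stream

chunk = 64 * 1024                              # split into independent chunks
pieces = [zlib.compress(data[i:i + chunk], 9) for i in range(0, len(data), chunk)]

print("one stream:", len(whole), "bytes")
print("chunked   :", sum(len(p) for p in pieces),
      "bytes (worse ratio; shrink the chunks and the gap widens)")

try:                                           # and no random access mid-stream
    zlib.decompress(whole[len(whole) // 2:])
except zlib.error as err:
    print("decompressing from the middle fails:", err)
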


GPUs aren't better than CPUs at running single-threaded code. They excel at highly parallel tasks. Which is why making a GPGPU decompressor for ZIP makes no sense, other than to save CPU cycles if the raw I/O is slow enough that the GPU decompression doesn't become a bottleneck. The only way to make Zlib decompression faster than a CPU is to make a dedicated hardware block for it.

Kraken is a very different compression format that was developed for decompressing on 2 threads. It's not great for parallel work, but it's better than Zlib's one thread. But it's still not going to gain ridiculous amounts of performance from a GPU.
Which is why, to get crazy-high Kraken performance like >8GB/s output, once again a dedicated hardware block is needed. Unless the game engine is consistently trying to load tens or hundreds of textures at the same time (which I don't think happens). But decompressing one texture is always going to be faster on a 3GHz CPU than on a 1.9GHz shader processor.

Which is great apart from a few things:
  1. zlib does scale with CPU core count. Here it is showing clear scaling up to 18 cores (the highest number tested).
  2. No-one said RTX-IO would be using zlib or indeed any LZ-family based routine. BCPACK, which is a block compression algorithm, seems more likely, but it could easily be something else entirely.
  3. Nvidia, the world's foremost GPU maker, says they can do decompression on the GPU at >14GB/s output with minimal performance impact. They probably know a thing or two about this, so I'm inclined to believe them. Oh, and they've demonstrated it working to the press.

Cerny wasn't going to spend a third of the PS5 hardware presentation talking about the importance of their high-performance decompressor if the problem could have been solved with a couple more CUs on the GPU.

And he didn't. He spent, what, a couple of minutes? And unless you know the economic trade-off between adding additional silicon to the main APU vs. a custom hardware ASIC, it seems premature to dismiss the GPU-based option merely because it wasn't implemented in the consoles.

This is why nVidia, by claiming "we'll just use our many parallel TFLOPs on this mostly single-threaded problem that uses INT operations", is sounding shady as hell.

As mentioned above, modern GPUs have massive INT throughput too. It may even be the case that RTX IO uses the Tensor cores, which would explain why it's limited to RTX-class GPUs and is described as having a tiny performance hit. You're looking at hundreds of TOPS on offer there, which is largely unused.

And the fact that they're not even disclosing what compression format is making their GPUs so damn effective at decompression makes it even shadier.

Or they're constrained by an NDA due to links into DirectStorage.

And it's not below nVidia to be shady (or lie or be dishonest) on features that are promised years in advance. Nor is it below AMD or Intel BTW. This isn't vendor-specific.
They're not lying if their GPGPU decompressor only outputs 14GB/s if it's loading 10000 zlib-compressed textures in parallel, even if it never happens in a real-life scenario. They're just being dishonest.
Their marketing team knows how much popularity and attention the fast decompression features on Microsoft and Sony's consoles have garnered, and it would be harder to sell their $700-$1500 graphics cards if they had nothing to say about it.

Okay, so Nvidia aren't lying, they're just being dishonest. And presumably the demonstration they showed to the press was faked. Got it.
 
How do the multiple complexes in a Ryzen CPU deal with this? Do I need my NVMe drives to be on the same root complex as my GPU? Is any of this even relevant for Windows?

Apparently all AMD CPUs since Zen support P2P DMA between root ports on the same complex (which would mean any two capable devices in a typical desktop system):

https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.2-AMD-Zen-P2P-DMA

I'm not sure about Intel though. It looks like their support may be more sketchy.
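
(If anyone wants to check their own Linux box, the relevant kernel build option is CONFIG_PCI_P2PDMA; here's a quick sketch that just greps the kernel config, assuming your distro puts it in one of the usual places:)

Code:
# Look for PCI peer-to-peer DMA support in the running kernel's config.
import gzip, os, platform

def config_lines():
    cfg = f"/boot/config-{platform.release()}"
    if os.path.exists(cfg):
        return open(cfg).read().splitlines()
    if os.path.exists("/proc/config.gz"):
        return gzip.open("/proc/config.gz", "rt").read().splitlines()
    return []

hits = [line for line in config_lines() if "PCI_P2PDMA" in line]
print(hits or "Kernel config not found in the usual places.")
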

I think the more fundamental question at this stage, though, is whether we're actually talking about P2P DMA here, or whether the data still goes via the CPU/system memory but with much less interaction with the CPU. E.g. if we're looking at unbuffered data transfers with no CPU decompression, then cutting the CPU out of the data flow in the diagrams might be warranted for illustration purposes, to demonstrate the significantly reduced interaction.
 
I think the more fundamental question at this stage, though, is whether we're actually talking about P2P DMA here, or whether the data still goes via the CPU/system memory but with much less interaction with the CPU. E.g. if we're looking at unbuffered data transfers with no CPU decompression, then cutting the CPU out of the data flow in the diagrams might be warranted for illustration purposes, to demonstrate the significantly reduced interaction.
I wouldn't be surprised if it has to support both methods.

DirectStorage would need to; it wouldn't cut out that amount of hardware.
 
You think I'm in conspiracy-theory territory; I think you're in drinking-the-Kool-Aid-too-soon territory. Perhaps the real territory is somewhere in the middle.
Though I did my best to explain my position, and I've yet to see a reasonable counter-argument so far. Saying we should take everything from nvidia at face value because they wouldn't lie doesn't seem reasonable enough to me, especially at B3D.
That's a pretty intense take on his words here and I don't think Alex said anything offensive either, at least not to warrant a direct attack on another forum member. He said that to assume all of those negative things was to be conspiratorial. Now I haven't followed this thread so I'm just jumping in here, but there is a big difference between looking at the data points and suggesting an explanation versus choosing an explanation and then looking for data points to support it. The latter is what's actually considered conspiratorial; it happens a lot anyway, but people concede their point when the evidence mounts against them. You'll need to decide if you have been working from grounded data points towards an explanation, or choosing an end point (i.e. Nvidia is lying) and finding data to prove it.

I think that if a journalist who has access to materials and is bound by embargo dates says to just wait around for the real news, it's a far reach to assume he means you should drink the Kool-Aid and take everything Nvidia says at face value. He may just know things you don't.
 
Yup, you're going to need a heck of a lot of those being transferred and decompressed simultaneously to get anywhere near saturating a 5.5GB/s SSD.
You're assuming Sony won't resurrect Studio Liverpool and have them reprise Wipeout at 120Hz with more and more bespoke geometry and textures until the PS5 explodes. :runaway: After that they can remaster G-Police and I will be happy. :yes:
 
1. zlib does scale with CPU core count. Here it is showing clear scaling up to 18 cores (the highest number tested).
It's just using decompression of several files in parallel.

That ZIP isn't scalable with CPU cores isn't hard to observe; anyone with a PC can do it.
Just try to compress a ~3GB folder using 7-Zip and then decompress it, preferably from one SSD to another (or from an SSD to e.g. a RAM drive) so that storage doesn't become the bottleneck.
Then check on the Windows Task Manager how many threads are being pushed with high utilization during decompression.
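
If you want to see the contrast between the two scenarios on your own machine, here's a rough sketch (it leans on the fact that CPython's zlib releases the GIL while it churns through large buffers, so a simple thread pool is enough to show the scaling; the timings are only illustrative):

Code:
import time, zlib
from concurrent.futures import ThreadPoolExecutor

payload = b"texture-ish data with some redundancy " * 1_000_000   # ~38 MB
one_big_stream = zlib.compress(payload, 6)
many_small_files = [zlib.compress(payload[i::16], 6) for i in range(16)]  # 16 independent "files"

t0 = time.perf_counter()
zlib.decompress(one_big_stream)                  # a single stream is inherently serial
t1 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:  # independent streams can use many cores
    list(pool.map(zlib.decompress, many_small_files))
t2 = time.perf_counter()

print(f"one stream, one thread   : {t1 - t0:.3f}s")
print(f"16 streams over 8 threads: {t2 - t1:.3f}s")
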


I recommend reading this paper on the subject, in which the authors from IBM and Columbia U. comment on the limitations of Zlib for parallel computing performance and propose a new compression format that is parallel-friendly:
Massively-Parallel Lossless Data Decompression

(...) accelerating decompression on massively parallel processors like GPUs presents new challenges. Straightforward parallelization methods, in which the input block is simply split into many, much smaller data blocks that are then processed independently by each processor, result in poorer compression efficiency, due to the reduced redundancy in the smaller blocks, as well as diminishing performance returns caused by per-block overheads

Note: they're using compression of HTML text and sparse matrices, which compress a lot more than textures; that's why zlib reaches 3:1 and 5:1 compression ratios in there, whereas with textures it's usually ~1.8:1 or less.



In the end, they came up with a compressor that is indeed much faster at decompressing, but it also has a much lower compression ratio (meaning effective throughput is very far from nvidia's "as fast as 24 cores" claim). With their method they spend a bit less energy than CPU Zlib on the decompression operation, though at the cost of significantly more disk space, and they depend on a very high-throughput storage source. There's no free lunch here. Kraken is probably much better here, and so should BCPack be for textures.


They couldn't do anything remotely close to the performance of dedicated ASICs or hardware blocks.




2. No-one said RTX-IO would be using zlib or indeed any LZ-family based routine. BCPACK, which is a block compression algorithm, seems more likely, but it could easily be something else entirely.
Or they're just using massive numbers of parallel decompression threads on different texture files, which in the end makes the 14GB/s throughput an unrealistic load for any real-life scenario. And although the aggregated throughput is high, the time it takes to decompress one large texture makes it unusable for actual texture streaming in games.
And only future GPUs that actually have dedicated hardware blocks for decompression will ever make DirectStorage with GPU decompression usable.



3. Nvidia, the world's foremost GPU maker, says they can do decompression on the GPU at >14GB/s output with minimal performance impact.
It's the same nvidia-the-world's-foremost-GPU-maker who presented a graphics card on stage for (paper-)launch day, which photos later showed was held together by wood screws.
Lack of information is usually suspicious, and in this case they're omitting a ton of it.


And he didn't. He spent, what, a couple of minutes?
He starts talking about storage at the 5 minute mark. He starts talking about Kraken and the Custom IO Unit at ~17m. He moves on from the storage talk at ~24m. In a 53 minute presentation.


As mentioned above, modern GPUs have massive INT throughput too. It may even be the case that RTX IO uses the Tensor cores, which would explain why it's limited to RTX-class GPUs and is described as having a tiny performance hit. You're looking at hundreds of TOPS on offer there, which is largely unused.
It's large throughput with very low single-threaded performance. Still not a good match for decompression.


Okay, so Nvidia aren't lying, they're just being dishonest. And presumably the demonstration they showed to the press was faked.
They didn't show DirectStorage with CPU decompression; they only showed DirectStorage with GPU decompression vs. the current I/O path with CPU decompression. It's apples vs. oranges.
Why didn't they show CPU vs. GPU both on DirectStorage? Ask yourself why they would hide that, if their GPU is so much faster than the CPU at decompression. For all I know the IO overhead reduction alone is responsible for that speedup.
They also didn't say whether those 24 Threadripper cores were concurrently being taxed or not. For all you know, a 6-core Ryzen 3600 using DirectStorage could have achieved faster loading times than the RTX IO result.



PS5 for example uses 256KB chunks. Each of them could be decompressed simultaneously.
Few textures are going to be 256KB. A 32-bit colour 4K*4K decompressed texture is ~67MBytes ([4096*4096*32bit] / 8). With lossless delta colour compression I think we're looking at about half of that, so ~33MBytes, and Kraken compression should put it around the ~20MB mark.
Using 256KB for the block size is just a means to guarantee maximum throughput from the SSD controller considering the IO operations limit. It doesn't mean you can do anything useful with every isolated block. The PS5's custom IO controller probably needs to gather several 256KB blocks and join them into a larger compressed texture file (probably inside the ESRAM they mentioned).
Texture decompression can only happen after the large compressed file is put together in one place. In the case of a 20MB compressed 4K texture, we're looking at 80x 256KB blocks that you need to put together before starting to decompress the texture.
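Spelling out that back-of-the-envelope maths (the ~2:1 DCC and ~1.6:1 Kraken factors are assumptions on my part, not measured values):

Code:
# Rough sizing for an uncompressed 4K RGBA8 texture and how many 256KB
# I/O blocks its compressed form would span. The ratios are assumptions.
width = height = 4096
bytes_per_pixel = 4                          # 32-bit colour

raw = width * height * bytes_per_pixel       # ~67 MB
after_dcc = raw / 2                          # assume ~2:1 lossless delta colour compression
on_disk = after_dcc / 1.6                    # assume ~1.6:1 from Kraken on top -> ~20 MB

block = 256 * 1024
print(f"raw: {raw / 1e6:.1f} MB, on disk: {on_disk / 1e6:.1f} MB, "
      f"blocks to gather: {on_disk / block:.0f} x 256KB")
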




That's a pretty intense take on his words here and I don't think Alex said anything offensive either, at least not to warrant a direct attack on another forum member.
Well, now I'm just trying to figure out what you interpreted as a pretty intense take on whose words, and what exactly you're implying was a direct attack.
I certainly didn't mean any of what I wrote as an attack, only as presenting a diverging opinion.
 