Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Okay, so you are saying that Nvidia is lying. Or at least being quite deceitful in its claims about RTX IO performance. And you're basing this conclusion on... pure speculation. So not a technical discussion anymore, then.
No, I'm saying Sony showed specifics on the hardware architecture (zlib+kraken hardware decompressor, dual co-processors, ESRAM, coherency engines, etc.) and how that led to very specific performance figures using specific compression formats, and then they showed videogames taking advantage of the final hardware -> it's working right now.
nVidia showed two slides saying "lulz we'll just do it on GPGPU and use direct storage" and slapping a bar graph showing 2x maximum NVMe 4.0 throughput. Without even mentioning which compression formats they support -> it's not working right now.

So one of them was technical enough to allow a technical discussion over it, the other was not. And the problem is you're taking way too many conclusions from the lack of information nVidia is giving about RTX IO.


The comparison that is being made rightly compares the claimed compressed throughput from each vendor. Neither has actually demonstrated real world throughput measurements / benchmarks (showing short or zero load times in games where we have no idea how much data is being transferred doesn't tell us anything about transfer rates), yet you seem to be taking Sony's claims at face value while assuming Nvidia is being deceitful.
No, but one of them you have seen working and will be shipping in hardware+software within a month and a half. The other you may or may not see working by the end of 2022, depending on Microsoft launching a windows update that enables Direct Storage, and then on PC game developers having the time and resources to implement it.
Your apparent claim that both implementations are somehow on par in what relates to TRL is just wrong. Sony has very little room to be deceitful whereas nvidia is at a point where they can be as deceitful as they want.
In fact, them not disclosing what compression format(s) they'll support to achieve 14GB/s can already be interpreted as a deceitful move.
 
Well, if NV is actually lying, then make those claims elsewhere, as stated before. This isn't the topic for that kind of claim.
 
https://www.cnet.com/news/hands-on-...ackwards-compatibility-and-faster-load-times/
In an unscientific test, loading a Red Dead Redemption 2 save took around two minutes on an Xbox One X and about 30 seconds on the Xbox Series X.
Quick Resume feels like a game-changer. You can jump between games in about 10 seconds and Mike had four different games running at once. Games resume in the exact state you left them in, no reloading saves or returning to menu screens required.

Glad to see prior/current XB games using SSD performance profiles quite nicely.

As a point of reference, on my NVMe RAID it takes roughly 35 seconds to access the in-game menu (after starting Rockstar's storefront launcher), and another 27 seconds to actually be playing, after pressing RDR2 story-start. So yes, the XBSX SSD is hauling ass in this case. :yep2:
 
As a point of reference, on my NVMe RAID it takes roughly 35 seconds to access the in-game menu (after starting Rockstar's storefront launcher), and another 27 seconds to actually be playing, after pressing RDR2 story-start. So yes, the XBSX SSD is hauling ass in this case.

But you're comparing game launch + story start (35s + 27s) against loading a save in the case of CNET's 30s test, which may have been measured after reaching the start menu.
Did you try loading a save on the start menu?


Regardless, my guess is both the SeriesX and the PC are bottlenecked by (probably single-threaded) CPU decompression performance. I had hopes that BC games that use Zlib would make use of the hardware decompressor on either console, but if there's a hand-coded decompression thread in the game's code then there's little Microsoft or Sony could do.
 
Wow, is this the first time some of you folks have seen a PC marketing slide? The joy of the flexibility of the PC platform for the marketeer is that any claim can be theoretically true with just the right parts. NV have long had a habit of showing only the most flattering of benches in slides (as do AMD and Intel; PC GPU marketing is a dirty biz).

For me the proof on the PC, as ever, will be when anyone can test the tech away from NDAs and with release drivers. We're not that far from the days when renaming 3DMark.exe would cost you 20% in scores. I suspect that, like most GPU innovations, this will be great long term, but frankly you won't see any benefit for at least 2-3 years, until it's part of the stable Win10 releases, outside of TWIMTBP doing it for devs.

I don't understand why some are viewing Nvidia's claims as so controversial. What's so outlandish about what they're claiming?

They've simply said that with a 7GB/s SSD (already available) and a 2:1 compression ratio (the same as claimed by both Sony and Microsoft) you will see a 14GB/s output by doing the decompression on the GPU.

Are people really so doubtful that multi-teraflop GPUs can do the same job in compute shaders as the consoles' likely quite cheap hardware blocks? And if so, where's the evidence supporting these doubts?
 
But you're comparing game launch + story start (35s + 27s) against loading a save in the case of CNET's 30s test, which may have been measured after reaching the start menu.
Did you try loading a save on the start menu?

If you're able to use Quick Resume then it's quite the difference, about 10 seconds versus (35+27) seconds.

Regardless, my guess is both the SeriesX and the PC are bottlenecked by (probably single-threaded) CPU decompression performance. I had hopes that BC games that use Zlib would make use of the hardware decompressor on either console, but if there's a hand-coded decompression thread in the game's code then there's little Microsoft or Sony could do.

Yeah, it's going to be difficult for Microsoft or Sony to be able to improve older titles that may be coded that way.
 
But you're comparing game launch + story start (35s + 27s) against loading a save in the case of CNET's 30s test, which may have been measured after reaching the start menu.
Did you try loading a save on the start menu?

I was giving numbers for both. Numbers for game launcher to in-game menu. And numbers for actually starting the game once the in-game menu appeared. And starting a game IS actually resuming from a saved point (at least for me anyhow).
 
If you're able to use Quick Resume then it's quite the difference, about 10 seconds versus (35+27) seconds.
You can get a Quick Resume equivalent on the PC if you just pause, alt+tab out of the game, and then put Windows to sleep. I guess you could even do that with more than one game, assuming you have enough RAM to do so.
 
I don't understand why some are viewing Nvidia's claims as so controversial. What's so outlandish about what they're claiming?

They've simply said that with a 7GB/s SSD (already available) and a 2:1 compression ratio (the same as claimed by both Sony and Microsoft) you will see a 14GB/s output by doing the decompression on the GPU.

Are people really so doubtful that multi-teraflop GPUs can do the same job in compute shaders as the consoles' likely quite cheap hardware blocks? And if so, where's the evidence supporting these doubts?

Because system topology matters, and I've spent too long with too many peripherals failing to talk to each other on the USB bus, let alone PCIe, to believe this is as simple as Plug n' Play (and if you're as old as I am you know how hollow the promise of PnP was for years). NV has long-term form for taking genuine improvements and overclaiming them into meaningless hype. When you control an entire platform and every line of BIOS code is yours it's easier to do this; when you're working with dozens of motherboard manufacturers with varying degrees of quality control it gets significantly harder to guarantee any given level of performance.
 
I find it quite puzzling that Nvidia didn't debut or show a demo of their RTX SSD/IO solution in action. They have access to Samsung memory/storage tech and more than likely have early access to Microsoft's DirectStorage API, like most developers and/or driver development teams. I have this strange feeling something is missing at the motherboard level (BIOS) when it comes to facilitating or mapping these requests.

Nvidia has shown a demo of RTX IO in action in virtual press briefings:

https://hothardware.com/reviews/nvidia-geforce-rtx-30-series-ampere-details

[Images: RTX IO demo slides]

Hot Hardware said:
A demo to show the theoretical benefits of NVIDIA RTX IO, that works in conjunction with Microsoft's DirectStorage API, was also shown. During the demo, handling the level load and decompression took about 4X as long on a PCIe Gen 4 SSD using current methods and used significantly more CPU core resources. The demo was run on a 24-core Threadripper system and the standard load / decompress took over 5 seconds. With RTX IO, that time was cut to just 1.61 seconds. We won’t even talk about the hard drive’s performance here. Ouch – it hurts just to look at the chart.

All Zen based and newer AMD chips and platforms support P2P DMA between any 2 capable devices so if there is a CPU/platform requirement, the Threadripper would obviously meet it.
 
Are people really so doubtful that multi-teraflop GPUs can do the same job in compute shaders as the consoles' likely quite cheap hardware blocks? And if so, where's the evidence supporting these doubts?

I don't think anyone stated as much. I think the problem is that PC/console warriors are too quick to dismiss the positives that are happening on both sides. Soon, Nvidia, AMD, and possibly Intel will start showing actual demos/games with their SSD/IO tech performing. As of now, we're getting a good glimpse of it within the console space.
 
No, I'm saying Sony showed specifics on the hardware architecture (zlib+kraken hardware decompressor, dual co-processors, ESRAM, coherency engines, etc.) and how that led to very specific performance figures using specific compression formats, and then they showed videogames taking advantage of the final hardware -> it's working right now.
nVidia showed two slides saying "lulz we'll just do it on GPGPU and use direct storage" and slapping a bar graph showing 2x maximum NVMe 4.0 throughput. Without even mentioning which compression formats they support -> it's not working right now.

It is working right now. See my previous post.

Granted it won't be coming to market as soon as Sony's solution but nor did anyone claim that it was.

And yes, I take your point that details around the compression scheme have been pretty light. However besides that they've provided largely the same level of detail as Microsoft has around their velocity architecture. I wouldn't be surprised at all to learn that Nvidia aren't at liberty to discuss compression schemes at the moment if they're in any way tied into Direct Storage standards. Either that or they're not mentioning a specific scheme because they're not limited to one specific scheme thanks to this running on compute shaders as opposed to fixed function hardware.

So one of them was technical enough to allow a technical discussion over it, the other was not. And the problem is you're taking way too many conclusions from the lack of information nVidia is giving about RTX IO.

Again, please clarify which conclusions I'm taking that you disagree with, and why? The only conclusions I've taken from Nvidia's presentations are those that they've explicitly stated, and I've assumed they're telling the truth about them. Which part do you disagree with?

No, but one of them you have seen working and will be shipping in hardware+software within a month and a half. The other you may or may not see working by the end of 2022, depending on Microsoft launching a windows update that enables Direct Storage, and then on PC game developers having the time and resources to implement it.

As above, Nvidia have already demonstrated this to the press, and no-one has claimed that it's going to be available in the same time scales as the next gen consoles.

Your apparent claim that both implementations are somehow on par in what relates to TRL is just wrong.

What is wrong is your assertion that I have claimed that. I have not, and would not, since it's patently not true. How could anyone make such a claim when Sony's console releases in a few weeks and Direct Storage doesn't arrive on the PC until next year?

However, just because Nvidia's solution is not as close to market as Sony's is not sufficient reason to doubt its feasibility or their truthfulness. Let's not forget that Sony's SSD tech was announced way back in March and wouldn't launch to market for another 8 months. So Nvidia have until July next year to bring something to market by those same standards.

Sony has very little room to be deceitful whereas nvidia is at a point where they can be as deceitful as they want.
In fact, them not disclosing what compression format(s) they'll support to achieve 14GB/s can already be interpreted as a deceitful move.

I'm not sure I necessarily agree here. While I have no reason to believe Sony have been in any way deceitful in their claims, it's much easier to do so in a console environment where benchmarking the claimed performance is very difficult. In the PC space there are storage benchmarks that can be used to test these claims so Nvidia are much more open to scrutiny in the long run.

As I said though, I don't doubt either company's claims at this point, as I see no reasonable basis for doing so.
 
They've simply said that with a 7GB/s SSD (already available) and a 2:1 compression ratio (the same as claimed by both Sony and Microsoft) you will see a 14GB/s output by doing the decompression on the GPU.
Neither Sony nor Microsoft are claiming decompression on the GPU.

Are people really so doubtful that multi-terraflop GPU's can do the same job in compute shaders as the consoles likely quite cheap hardware blocks?

Well for starters, what use are floating point operations for file decompression?


And if so, where's the evidence supporting these doubts?

I wrote this several times, but let's try it again.

How Zlib works.


Zlib (or ZIP, the most popular compression algorithm), like the vast majority of compression formats, uses single-threaded decompression.



It basically counts known sequences of bits and groups them together by giving them a different "name". As a very simple example of compression:
"0000000 111111 0000 1111" can be compressed into sequences of "number of zeros"-"number of ones", of which you could say it's 7x0 + 6x1 + 4x0 + 4x1, or "'111 110 100 100".
With this, I "compressed" 21 digits into 12.
But the result is one sequential file of which you can't change the order or take random blocks out of, otherwise it becomes unreadable. Or as explained in the blog post:
Old fashioned compressors like Zip parsed the compressed bit stream serially, acting on each bit in different ways, which requires lots of branches in the decoder - does this bit tell you it's a match or a literal, how many bits of offset should I fetch, etc. This also creates an inherent data dependency, where decoding each token depends on the last, because you have to know where the previous token ends to find the next one. This means the CPU has to wait for each step of the decoder before it begins the next step.
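To make that serial dependency concrete, here's a minimal toy decoder sketch (my own illustration, not real zlib/DEFLATE code) for an LZ-style bit stream where tokens are variable-length, like the matches and literals described above. The point is that the position of the next token is only known after the current one has been decoded:

```cpp
#include <cstdint>
#include <vector>

// Toy LZ-style decoder, for illustration only. Each token is either a
// literal (1 flag bit + 8 data bits) or a match (1 flag bit + 4 length
// bits + 12 offset bits), so tokens have different lengths in the stream.
struct BitReader {
    const std::vector<uint8_t>& bytes;  // compressed stream
    size_t pos = 0;                     // current bit position
    explicit BitReader(const std::vector<uint8_t>& b) : bytes(b) {}
    uint32_t read(int n) {              // read n bits, MSB-first
        uint32_t v = 0;
        for (int i = 0; i < n; ++i, ++pos)
            v = (v << 1) | ((bytes[pos >> 3] >> (7 - (pos & 7))) & 1u);
        return v;
    }
};

// Decodes until outSize bytes have been produced. Assumes a well-formed
// stream (no bounds checking, since this is just a sketch).
std::vector<uint8_t> decode(const std::vector<uint8_t>& stream, size_t outSize) {
    std::vector<uint8_t> out;
    out.reserve(outSize);
    BitReader br(stream);
    while (out.size() < outSize) {
        if (br.read(1) == 0) {                  // literal token
            out.push_back(static_cast<uint8_t>(br.read(8)));
        } else {                                // match token: copy from earlier output
            uint32_t length = br.read(4) + 3;
            uint32_t offset = br.read(12) + 1;
            for (uint32_t i = 0; i < length && out.size() < outSize; ++i)
                out.push_back(out[out.size() - offset]);
        }
        // br.pos for the next token is only known *after* this branch ran,
        // which is the data dependency that keeps decoding on one thread.
    }
    return out;
}
```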

To get better ZIP compression ratios, you need bigger blocks of data. To make ZIP decompression parallel, you'd need to split the original data into smaller, independently compressed pieces. So making ZIP decompression more parallel costs you compression ratio, and effective I/O throughput in the process.
So at least with Zlib, or anything else built on ZIP-style compression, you don't gain much by throwing more threads at it.
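For illustration, here's roughly what that workaround looks like on the CPU: a sketch of my own (compressChunked and decompressParallel are made-up names, but compress(), uncompress() and compressBound() are zlib's real one-shot calls) that splits the data into independently compressed chunks so they can be inflated on separate threads. Every chunk starts with an empty history window, which is exactly where the compression ratio goes.

```cpp
#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Sketch: chunked zlib so decompression can run on several threads.
// Each chunk is an independent zlib stream; that independence is what buys
// parallelism, and also what costs ratio versus one big stream.
struct Chunk {
    std::vector<uint8_t> compressed;
    size_t uncompressedSize;
};

std::vector<Chunk> compressChunked(const std::vector<uint8_t>& data, size_t chunkSize) {
    std::vector<Chunk> chunks;
    for (size_t off = 0; off < data.size(); off += chunkSize) {
        size_t n = std::min(chunkSize, data.size() - off);
        Chunk c;
        c.uncompressedSize = n;
        uLongf bound = compressBound(static_cast<uLong>(n));
        c.compressed.resize(bound);
        compress(c.compressed.data(), &bound, data.data() + off, static_cast<uLong>(n));
        c.compressed.resize(bound);             // shrink to actual compressed size
        chunks.push_back(std::move(c));
    }
    return chunks;
}

std::vector<uint8_t> decompressParallel(const std::vector<Chunk>& chunks) {
    // Pre-compute output offsets so every thread writes to its own region.
    std::vector<size_t> offsets;
    size_t total = 0;
    for (const Chunk& c : chunks) { offsets.push_back(total); total += c.uncompressedSize; }

    std::vector<uint8_t> out(total);
    std::vector<std::thread> workers;
    for (size_t i = 0; i < chunks.size(); ++i) {
        workers.emplace_back([&, i] {
            uLongf destLen = static_cast<uLongf>(chunks[i].uncompressedSize);
            // Return code ignored for brevity in this sketch.
            uncompress(out.data() + offsets[i], &destLen,
                       chunks[i].compressed.data(),
                       static_cast<uLong>(chunks[i].compressed.size()));
        });
    }
    for (std::thread& t : workers) t.join();
    return out;
}
```

A real implementation would use a thread pool and check the zlib return codes, but the structure is the point: the parallelism comes from the chunking decided at compression time, not from the decompressor itself.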


GPUs aren't better than CPUs at running single-threaded code; they excel at highly parallel tasks. Which is why making a GPGPU decompressor for ZIP makes no sense, other than to save CPU cycles if the raw I/O is slow enough that the GPU decompression doesn't become a bottleneck. The only way to make Zlib decompression faster than a CPU is to make a dedicated hardware block for it.

Kraken is a very different compression format that was developed for decompressing on 2 threads. It's not great for parallel work, but it's better than Zlib's one thread. But it's still not going to gain ridiculous amounts of performance from a GPU.
Which is why, to get crazy-high Kraken performance like >8GB/s of output, once again a dedicated hardware block is needed. Unless the game engine is consistently trying to load tens or hundreds of textures at the same time (which I don't think happens). But decompressing one texture is always going to be faster on a 3GHz CPU than on a 1.9GHz shader processor.
Cerny wasn't going to spend a third of the PS5 hardware presentation talking about the importance of their high-performance decompressor if the problem could have been solved with a couple more CUs on the GPU.




This is why nVidia, by claiming "we'll just use our many parallel TFLOPs on this mostly single-threaded problem that uses INT operations", is sounding shady as hell.
And the fact that they're not even disclosing what compression format is making their GPUs so damn effective at decompression makes it even shadier.

And it's not beneath nVidia to be shady (or lie or be dishonest) about features that are promised years in advance. Nor is it beneath AMD or Intel, BTW. This isn't vendor-specific.
They're not lying if their GPGPU decompressor only hits 14GB/s when it's loading 10,000 zlib-compressed textures in parallel, even if that never happens in a real-life scenario. They're just being dishonest.
Their marketing team knows how much popularity and attention the fast decompression features on Microsoft and Sony's consoles have garnered, and it would be harder to sell their $700-$1500 graphics cards if they had nothing to say about it.
 
@PSman1700, I hope Nvidia's RTX I/O solution is good, but for PC gaming it needs to be widely adopted to be more effective. Hopefully, MB manufacturers, Nvidia, AMD and Intel are creating some type of next-generation SSD/IO standards that aren't necessarily tied to one company's particular product's IP (that never works out well).
Isn't RTX IO just nVidia's hook into DirectStorage? I would think DirectStorage has a good chance of wide adoption and AMD and Intel will have their own fancy marketing names for hooking into the DirectStorage API.
 
We honestly don't know if RTX I/O is simply a branding of parts of the DirectStorage API, something unique, or a superset of it with additional Nvidia-specific hooks. I agree that when it becomes standardised everyone will use it, but we don't know how far off that is, and depending on how many vendors MS has to interact with to make it work, it might take longer, given that MS has largely designed DirectX with only three major hardware vendors over the past decade or more (AMD, Intel, Nvidia). In that case, perhaps NV starting their own "designed for RTX I/O" badge program ahead of broader adoption might offer them an edge. I just don't know if DirectStorage or RTX I/O requires additional BIOS hooks or support for novel I/O commands to work.
 
Isn't RTX IO just nVidia's hook into DirectStorage? I would think DirectStorage has a good chance of wide adoption and AMD and Intel will have their own fancy marketing names for hooking into the DirectStorage API.

That's probably a decent assumption, but we still don't have anything substantial to say that's absolutely the situation.

Even if it is a hook into DirectStorage, I could see Nvidia providing some of the functionality outside of it, functionality that their RTX GPUs can handle. I don't know if that's fully possible via driver extensions or if it would have to be more of an assembly aimed at GPU I/O routines, routines that game engines start using, with the latest versions being part of the GPU drivers. Kind of like how PhysX libraries were packaged.

It depends on how important it is to Nvidia to offer this functionality outside of DirectX, such as Vulkan or OpenGL.

For that matter, is DirectStorage able to be used outside of DirectX?
 
You can get a Quick Resume equivalent on the PC if you just pause, alt+tab out of the game, and then put Windows to sleep. I guess you could even do that with more than one game, assuming you have enough RAM to do so.

You can quick resume any game on the XSX while playing any other game. The PC can't.
 
Isn't RTX IO just nVidia's hook into DirectStorage? I would think DirectStorage has a good chance of wide adoption and AMD and Intel will have their own fancy marketing names for hooking into the DirectStorage API.
RTX IO takes advantage of DirectStorage, which allows you to bypass the CPU and load directly into the GPU's memory. But DirectStorage doesn't provide decompression hardware; that's up to the hardware vendor. The GPU vendor needs to provide the decompression scheme, whether it's using the GPU's shaders or an ASIC. Otherwise, decompression has to happen on the CPU, which defeats the purpose of DirectStorage.

DirectStorage offers a direct pathway from the SSD to the GPU's memory, while RTX IO offers a way to widen that pathway.
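Purely as an illustration of what that "direct pathway" means at the API level, here's a rough sketch of a single read-into-a-GPU-buffer request, written against the shape of the DirectStorage SDK Microsoft later published for Windows. Treat the names, fields and the file path as illustrative rather than anything confirmed in this thread; any GPU decompression scheme (RTX IO or otherwise) would plug in behind the request's CompressionFormat option rather than appear explicitly here.

```cpp
#include <dstorage.h>     // DirectStorage for Windows (illustrative; post-dates this discussion)
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdint>

using Microsoft::WRL::ComPtr;

// Sketch only: enqueue one request that streams a file region straight into
// an existing GPU buffer. Resource creation, error handling and the fence
// wait are all omitted.
void LoadRegionIntoGpuBuffer(ID3D12Device* device, ID3D12Resource* gpuBuffer,
                             uint64_t fileOffset, uint32_t size)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets/level0.bin", IID_PPV_ARGS(&file));   // hypothetical asset file

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // The request names a file region as the source and a GPU buffer as the
    // destination; the CPU never touches the payload. A decompression format
    // would be selected via request.Options.CompressionFormat (left at NONE here).
    DSTORAGE_REQUEST request{};
    request.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Source.File.Source      = file.Get();
    request.Source.File.Offset      = fileOffset;
    request.Source.File.Size        = size;
    request.Destination.Buffer.Resource = gpuBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = size;
    request.UncompressedSize            = size;

    queue->EnqueueRequest(&request);
    queue->Submit();   // a real app would EnqueueSignal() a fence and wait on it
}
```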
 
RTX IO takes advantage of DirectStorage, which allows you to bypass the CPU and load directly into the GPU's memory. But DirectStorage doesn't provide decompression hardware; that's up to the hardware vendor. The GPU vendor needs to provide the decompression scheme, whether it's using the GPU's shaders or an ASIC. Otherwise, decompression has to happen on the CPU, which defeats the purpose of DirectStorage.

DirectStorage offers a direct pathway from the SSD to the GPU's memory, while RTX IO offers a way to widen that pathway.

If RTX GPUs since 2018 support it in hardware, I assume AMD will at least support it with RDNA2 in some way too.
 