Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

And moving everything to the CPU via dedicated fixed function hardware would be a better option as it would offer a more efficient approach.
No, it wouldn't. Fixed function hardware is efficient, yes. But compression algorithms tend to evolve over time, especially on PC. E.g. LZ or Kraken are nice, but even they evolve, and when that happens the fixed function hardware is useless. You can do something like that in a closed environment like a console, but on PC it just doesn't make sense. PCs can be upgraded with faster CPUs, more cores, ... that can do the work needed. Yes, it is a bit of brute-forcing through the stuff, but at least you can always use what is best for the situation and always use the newest stuff.
It is a bit different with things like tensor cores (or something like that). Those are not really fixed function units, but they are good at one thing and bad at others. I guess we might see something like this in the future, just to accelerate compression: some general purpose cores that are especially fast at de/compression work.

This compression thing somehow reminds me of the early Q2/3 engine games that shipped compressed files to save bandwidth, but as HDD bandwidth grew over time it became faster to decompress all the files up front and use those directly.
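
A rough sketch of that "brute-force it with more cores" idea, just to illustrate (Python; the 1 MiB chunk size and the pack/unpack helper names are arbitrary choices of mine, not anyone's actual streaming pipeline):

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB chunks, an arbitrary size for illustration

def pack(data: bytes) -> list[bytes]:
    """Compress independent chunks so they can later be decompressed in parallel."""
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def unpack(chunks: list[bytes], workers: int | None = None) -> bytes:
    """Brute-force decompression: throw however many cores you have at it.
    CPython's zlib releases the GIL, so the threads really do run in parallel."""
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))

if __name__ == "__main__":
    original = os.urandom(4 * CHUNK) + b"\x00" * (4 * CHUNK)  # mixed payload
    assert unpack(pack(original)) == original
```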
 
No, it wouldn't. Fixed function hardware is efficient, yes. But compression algorithms tend to evolve over time, especially on PC. E.g. LZ or Kraken are nice, but even they evolve, and when that happens the fixed function hardware is useless. You can do something like that in a closed environment like a console, but on PC it just doesn't make sense. PCs can be upgraded with faster CPUs, more cores, ... that can do the work needed.

If you put the hardware in the CPU itself, then when you upgrade your CPU you upgrade your fixed function decompression hardware and algorithms along with it.

Putting fixed function hardware in the CPU also means that those who can't afford a 20-core CPU would still get the same decompression performance even if they don't get the same core count.
 
No, it wouldn't. Fixed function hardware is efficient, yes. But compression algorithms tend to evolve over time, especially on PC. E.g. LZ or Kraken are nice, but even they evolve, and when that happens the fixed function hardware is useless.

Whilst compression algorithms do evolve, decompression standards (like DEFLATE) change very rarely. You can use any number of different zlib and LZW compressors on any platform, and some will produce much better compressed data, or data that decompresses much faster, but because of the universal way that compressed data in these standards is stored, any decompressor will unpack it just fine - even decompressors from fifteen years ago.

The same is true with Oodle and Kraken. Most updates to compression utilities improve the quality of the compressed data and add utilitarian features like cross-filesystem metadata support and encryption. I can't even remember the last significant compression algorithm change that genuinely required a new decompression library - probably dropping 32 kB as the default table size. Everything else is just routine software maintenance. zlib, which has been around since 1996, is on stable release 1.2.11, and that was updated in 2017 to fix a bug.
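
To illustrate the point with stock zlib (the particular levels and strategy below are just examples I picked): three different compressor settings, one unchanged decompressor handles them all.

```python
import zlib

payload = b"the quick brown fox jumps over the lazy dog " * 1000

# Three different "compressors": levels and strategies trade ratio for speed,
# but every one of them emits a standard zlib/DEFLATE stream.
fast = zlib.compress(payload, level=1)
small = zlib.compress(payload, level=9)
co = zlib.compressobj(level=9, strategy=zlib.Z_FILTERED)
filtered = co.compress(payload) + co.flush()

# One unchanged decompressor unpacks all of them, regardless of which
# compressor (or which decade's zlib) produced the stream.
for name, blob in [("fast", fast), ("small", small), ("filtered", filtered)]:
    assert zlib.decompress(blob) == payload
    print(f"{name:8s} {len(blob):6d} bytes (from {len(payload)})")
```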
 
If you put the hardware in the CPU itself, then when you upgrade your CPU you upgrade your fixed function decompression hardware and algorithms along with it.

Putting fixed function hardware in the CPU also means that those who can't afford a 20-core CPU would still get the same decompression performance even if they don't get the same core count.
It makes the logic more complicated and bigger, and you have to spend die space on just one specific case. Investing in common-purpose stuff is more "efficient" in that way ;) as those parts of the CPU can also be used for other things.

Whilst compression algorithms do evolve, decompression standards (like DEFLATE) change very rarely. You can use any number of different zlib and LZW compressors on any platform, and some will produce much better compressed data, or data that decompresses much faster, but because of the universal way that compressed data in these standards is stored, any decompressor will unpack it just fine - even decompressors from fifteen years ago.

The same is true with Oodle and Kraken. Most updates to compression utilities improve the quality of the compressed data and add utilitarian features like cross-filesystem metadata support and encryption. I can't even remember the last significant compression algorithm change that genuinely required a new decompression library - probably dropping 32 kB as the default table size. Everything else is just routine software maintenance. zlib, which has been around since 1996, is on stable release 1.2.11, and that was updated in 2017 to fix a bug.
Even decompression changes over time. Yes, there are some standard formats that aren't very efficient but are still used because they are old standards. But you still get new compression algorithms in those areas that can't be decompressed by old decompression software because something was changed.
 
And that is based on seeing one game (Forspoken) built to run on HDDs?

What does it run like on the average PC with a 6-core CPU and a SATA III SSD?

And moving everything to the CPU via dedicated fixed function hardware would be a better option as it would offer a more efficient approach.

As explained by someone else, fixed function decompressors are quite limited. GPUs are excellent at exactly that kind of (parallel) work; Nvidia showcased two years ago what GPUs can do.
RTX IO can scale to whatever new NVMe drive speeds turn out to be.
 
Even decompression changes over time. Yes, there are some standard formats that aren't very efficient but are still used because they are old standards. But you still get new compression algorithms in those areas that can't be decompressed by old decompression software because something was changed.
This argument is true for video compression too. Yet we have h.264 and then h.265 decompression blocks on the GPU because it's soooo much more efficient than playing video on general purpose silicon. I think standard decompression hardware makes sense. Even if some new technique appears, it'll only bring marginal improvements, and the gains from using a hardware block for an older technique will generally outweigh the losses. Then, if something completely revolutionary comes along, you just update the silicon in new hardware. E.g. h.264 came out in 2003, h.265 came out in 2013, and h.264 still accounted for 91% of video content as of 2019 according to this.

We have exactly the same thing with compressed texture support. You build texture decompression into the hardware pipeline, then update the silicon when a new format appears.
 
True, though NVMe/SSD speeds keep increasing as time goes on, in which case non-fixed-function approaches like GPU decompression could, and probably can, scale accordingly. At some point we might be seeing DDR4-like performance or faster in consumer NVMe drives. It would also be handy if/when newer formats appear: there wouldn't be a need for new silicon development, as the GPU could adapt to basically any new workload (in its class).
 
This argument is true for video compression too. Yet we have h.264 and then h.265 decompression blocks on the GPU because it's soooo much more efficient than playing video on general purpose silicon. I think standard decompression hardware makes sense. Even if some new technique appears, it'll only bring marginal improvements, and the gains from using a hardware block for an older technique will generally outweigh the losses. Then, if something completely revolutionary comes along, you just update the silicon in new hardware. E.g. h.264 came out in 2003, h.265 came out in 2013, and h.264 still accounted for 91% of video content as of 2019 according to this.

We have exactly the same thing with compressed texture support. You build texture decompression into the hardware pipeline, then update the silicon when a new format appears.
Video compression is much more CPU intensive, as a lot must be reconstructed (analysed, ...). Lossless data compression normally isn't. It is only a question of how much data must be decompressed. With a current CPU/GPU, data decompression is really not a big problem. Maybe an efficiency problem, but nothing more.
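
If you want a rough number for that, a one-core timing loop like this is enough (synthetic, fairly compressible data, so the GB/s figure is only indicative of the machine it runs on):

```python
import time
import zlib

# Synthetic payload; real game assets will compress and decompress differently.
raw = (b"ABCD" * 256 + bytes(range(256))) * 4096            # roughly 5 MB
blob = zlib.compress(raw, level=6)

runs = 50
t0 = time.perf_counter()
for _ in range(runs):
    zlib.decompress(blob)
elapsed = time.perf_counter() - t0

print(f"single-core decompression: {len(raw) * runs / elapsed / 1e9:.2f} GB/s, "
      f"ratio {len(raw) / len(blob):.1f}:1")
```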
 
As explained by someone else, fixed function decompressors are quite limited. GPUs are excellent at exactly that kind of (parallel) work; Nvidia showcased two years ago what GPUs can do.
RTX IO can scale to whatever new NVMe drive speeds turn out to be.

And yet the one in the PS5 received support for a format (Oodle Texture) that it wasn't designed to have.

RTX I/O will scale to whatever the bus speeds and CPU will allow it to do.

This is taken from the Forspoken DirectStorage GDC talk:

The current implementation of DirectStorage in Forspoken is only removing one of the big I/O bottlenecks — others exist on the CPU.

And

But, says Ono, “I/O is no longer a bottleneck for loading times” — the data transfer speeds of DirectStorage are clearly faster for SSDs, and they could improve it in future if they figure out other CPU bottlenecks and take full advantage of GPU asset decompression.

The CPU will ALWAYS be involved because of how the hardware is linked and how Windows works.

RTX I/O will only remove so many of the bottlenecks, and lower-end PCs tend to have slower PCIe standards, which reduces lane bandwidth even further for data transfer - and it's these systems that will benefit the most from DirectStorage.
 
And yet the one in the PS5 received support for a format (Oodle Texture) that it wasn't designed to have.

RTX I/O will scale to whatever the bus speeds and CPU will allow it to do.

This is taken from the Forspoken DirectStorage GDC talk:

The current implementation of DirectStorage in Forspoken is only removing one of the big I/O bottlenecks — others exist on the CPU.

And

But, says Ono, “I/O is no longer a bottleneck for loading times” — the data transfer speeds of DirectStorage are clearly faster for SSDs, and they could improve it in future if they figure out other CPU bottlenecks and take full advantage of GPU asset decompression.

The CPU will ALWAYS be involved because of how the hardware is linked and how Windows works.

RTX I/O will only remove so many of the bottlenecks, and lower-end PCs tend to have slower PCIe standards, which reduces lane bandwidth even further for data transfer - and it's these systems that will benefit the most from DirectStorage.

Did you follow this topic the last few days?
 
And yet the one in the PS5 received support for a format (Oodle Texture) that it wasn't designed to have.

RTX I/O will scale to whatever the bus speeds and CPU will allow it to do.

This is taken from the Forspoken DirectStorage GDC talk:

I already linked to a PCIe "RAID" card that does not directly connect to any drives. I.e. the drives are still connected via PCIe directly to the I/O on the CPU die.

With RAID-0, basically just streaming data, data transfer rates in excess of 80 GB/s (significantly faster than the PS5) incurred a CPU cost of 1-3%. At that point the CPU cost is negligible.

Data transfers in Windows using the current Windows APIs incur a significantly higher CPU cost. That same situation without that "RAID" card, which is just a bog-standard low-end NV GPU, would incur multiple tens of percent of CPU cost.

Why? Because with the "RAID" card the GPU is doing almost all of the heavy lifting WRT handling data transfers, basically leaving the CPU to do very minor bookkeeping.

DirectStorage isn't at that point yet because Microsoft isn't currently leveraging the GPU to handle data transfer tasks. Something they could potentially do in the future, just as that "RAID" card, which is just an NV GPU, is doing.

Hell, who's to say that an unused iGPU on the CPU die couldn't then be used for that purpose instead of sitting there doing virtually nothing in a gaming system with a dedicated GPU? Granted this is mostly relevant for Intel CPUs which include an iGPU on all consumer CPU products. AMD only has an iGPU on their "G" desktop processors.

Considering that RTX I/O only exists as a demo, it isn't out of the question that it could be doing something similar to SupremeRAID (the software used by that "RAID" card), which incurs negligible CPU costs for data movement.

Regards,
SB
 
How does that fit in with nVidia's slide showing RTX I/O bypassing the CPU completely?

How does that slide fit into any currently available gaming PC where an NVMe drive connects via a NIC?

How does doubling the throughput from 7 GB/s to 14 GB/s (with compression) require a 12x jump in CPU performance?

They using Intel Atoms?

I would take that slide with a pinch of salt.
 
Because the CPU performance needed may not scale linearly; doubling throughput from 7 to 14 GB/s may require 4x or 8x more CPU power to decompress the data.
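
Purely illustrative arithmetic - the per-core rate and overhead factor are made-up assumptions, not measurements - just to show how a toy Amdahl-style model makes the core count grow faster than the throughput:

```python
import math

def cores_needed(target_gbps: float, per_core_gbps: float = 1.5,
                 overhead: float = 0.05) -> int:
    """Toy model: throughput(c) = per_core * c / (1 + overhead * c).
    Every number here is an illustrative assumption, not a measurement."""
    ceiling = per_core_gbps / overhead          # the model saturates at this rate
    if target_gbps >= ceiling:
        raise ValueError(f"target exceeds the model's ceiling of {ceiling} GB/s")
    # Smallest integer c with per_core * c / (1 + overhead * c) >= target.
    return math.ceil(target_gbps / (per_core_gbps - overhead * target_gbps))

print(cores_needed(7.0), cores_needed(14.0))    # 7 vs 18 with these toy numbers
```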
 
Even decompression changes over time. Yes, there are some standard formats that aren't very efficient but are still used because they are old standards. But you still get new compression algorithms in those areas that can't be decompressed by old decompression software because something was changed.

This is my point: this scenario is really quite rare. The most popular compression/decompression for about fifteen years has been zlib, and the last time the decompression library changed in order to bring about an actual algorithm change was around 2001, which introduced future-proofing by allowing larger table sizes. Anything compressed with the latest zlib compression library, using the very latest techniques to squeeze data as small as possible, will decompress using that 2001 decompression library.

Shifty mentioned video. MPEG-4 and HEVC are both algorithms that were specifically designed to allow efficiency improvements over time, with the encoding algorithm working flawlessly against a standard decompression profile. Oodle is probably the most recent industry standard to be widely adopted (if you believe RAD Game Tools) that necessitated a new decompression algorithm. When they introduced a better compression algorithm (Kraken), Oodle decompression handled that.

Choosing not to implement popular decompression techniques in hardware because something new will come along at some point means you'll never implement decompression in hardware. Something new will eventually come along. But zlib decompression hardware was standard in last-generation consoles, and both Microsoft and Sony have really focussed their architectures on making supported decompression have zero impact on the CPU or GPU.

I already linked to a PCIe "RAID" card that does not directly connect to any drives. I.e. the drives are still connected via PCIe directly to the I/O on the CPU die.
You can solve almost any problem by throwing money at it.

The solutions used by Microsoft and Sony in the current generation of consoles are simple, effective and cheap to implement. Throwing RAID at the problem, or putting decompression on GPUs that cost more than a whole console (not your post), is a bit of a mental argument.

If money is no object, I don't doubt that both Microsoft and Sony could design a console architecture with literally no load times. No boot times either.
 
And moving everything to the CPU via dedicated fixed function hardware would be a better option as it would offer a more efficient approach.
But then you would be moving uncompressed data over the bus to transfer it to the GPU. As it stands now (with RTX IO), you send compressed data to the CPU and GPU, and the data gets decompressed locally.
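
Back-of-envelope version of that bus argument (the 2:1 ratio, SSD speed and PCIe figure are assumptions for illustration only):

```python
# Illustrative numbers only: a 2:1 compression ratio and a PCIe 4.0 x16 link.
pcie_gbps = 32.0          # roughly 32 GB/s each direction for PCIe 4.0 x16
ratio = 2.0               # assumed average compression ratio for game assets
ssd_gbps = 7.0            # raw read speed of a fast Gen4 NVMe drive

# Decompress on the CPU: the bus to the GPU carries uncompressed data.
cpu_side_bus_traffic = ssd_gbps * ratio
# Decompress on the GPU: the bus carries the compressed stream as read from disk.
gpu_side_bus_traffic = ssd_gbps

print(f"CPU-side decompression pushes {cpu_side_bus_traffic:.0f} GB/s over PCIe "
      f"({cpu_side_bus_traffic / pcie_gbps:.0%} of the link); "
      f"GPU-side pushes {gpu_side_bus_traffic:.0f} GB/s "
      f"({gpu_side_bus_traffic / pcie_gbps:.0%}).")
```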
 
But then you would be moving uncompressed data over the bus to transfer it to the GPU. As it stands now (with RTX IO), you send compressed data to the CPU and GPU, and the data gets decompressed locally.

But by having a decompressor in the CPU you can decompress everything without needing the CPU cores or, if you wish, the GPU.
 