Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Since around 2011, AMD and Intel have been incorporating northbridge (and increasingly southbridge) controller functions onto the main CPU die itself. The bus controllers very much still exist and their features still advance to support new I/O models; the distinct logic blocks and interconnects are simply on-die now, i.e. there are still controllers for the different devices that can be connected. That integration is why CPU pin counts exploded: a lot more of the motherboard's signals suddenly had to crowd into one chip.


There is no southbridge between an NVMe SSD and the CPU/GPU. The AMD FCH/Intel PCH services SATA and various other ports, but an NVMe SSD or a discrete GPU doesn't go over a southbridge (they have direct links to the memory controller hub).

Or... Nvidia have just rebranded GPUDirect Storage (from 2019) into RTX I/O.

The system diagrams are an exact match.

It also explains why that NIC block is present: GPUDirect Storage uses NICs for server and drive access, which is why it appears in the diagram.

GPUDirect Storage also uses a PCIe switch, which is what that PCIe block in the RTX I/O slide is; it's not there to show the data skipping the CPU, as everyone thought.

It's an actual, physical, separate PCIe switch.

So it seems that whole slide is a waste of time.

GPUDirect and RTX IO share the same philosophy but aren't the same thing. That being said, RTX IO is probably a derivative of GPUDirect. However, GPUDirect doesn't need DirectStorage for functionality (RTXIO does) and is incompatible with Windows. So, RTX IO used in conjunction with DirectStorage allows GPUDirect functionality on a Windows machine.
 
There is no southbridge between an NVMe SSD and the CPU/GPU. The AMD FCH/Intel PCH services SATA and various other ports, but an NVMe SSD or a discrete GPU doesn't go over a southbridge.
I know, I set this out clearly in this post - including data paths for PCs using both PCIe connections and those using legacy drive connections, because according to the Steam Hardware Survey, there are still people using hardware from that era.
 
All that does is explain why the I/O load increases, which you, I and everyone else already know.

It doesn't explain why the RTX I/O figure is so much higher than everything else.

Andrew Goossen from Microsoft, when discussing storage, decompression and I/O on XSX:



So that's 3 cores @ 4.8 GB/s.

So you're talking 9-10 CPU cores to get to the 14 GB/s on the RTX I/O slide, a huge difference from the 24 the slide shows :runaway:
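
As a rough back-of-the-envelope check (purely illustrative, and it assumes Microsoft's quoted decompression cost scales linearly with throughput, which may well not hold):

```python
# Illustrative only: linear scaling of Microsoft's quoted decompression cost.
ms_cores = 3                 # Zen 2 cores cited for decompression on XSX
ms_rate_gb_per_s = 4.8       # GB/s of decompressed data on XSX (typical figure)
rtx_io_rate_gb_per_s = 14.0  # GB/s effective rate on the RTX I/O slide

scaled_cores = ms_cores * rtx_io_rate_gb_per_s / ms_rate_gb_per_s
print(f"Linear extrapolation: {scaled_cores:.1f} cores")  # ~8.8, i.e. the 9-10 above
```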

So as I said previously, are they using Intel Atoms for the decompression work?

Have they used a slow CPU to make RTX I/O seem better than it is? Marketing that your GPU does the same work as 24 cores sounds better than saying it beats 9-10 cores.

So to me that slide is not right and isn't something I'm using for anything.

Or they could be using a lossless format that provides high compression but decodes slowly on a CPU. Who knows? It's not like MS's and Nvidia's statements were ever meant to be directly comparable.
 
I said last year that I wouldn't be surprised if we see something like PS5's I/O complex included in CPUs in the future.

Microsoft have stated they are considering a hardware-based solution as a next step after GPU decompression, but I'm personally far from convinced that it's needed. GPUs are already far faster than required to keep up with the decompression demands of the fastest SSDs, and the remaining load on the CPU should be negligible. We're already looking at sub-2-second load times even before GPU decompression arrives.

For me this fits into the category of dedicated sound cards and PhysX accelerators. Cool in principle but next to useless in practice.

And moving everything to the CPU via dedicated fixed function hardware would be a better option as it would offer a more efficient approach.

And it also runs counter to the general direction of the industry towards making things less fixed-function and more programmable. If the capability already exists on GPUs, why add complexity and cost, while also sacrificing flexibility, by adding a dedicated hardware unit to the CPU? To say nothing of the fact that you're locked out of this functionality until you buy a new CPU, when your existing GPU can handle it just fine.

On PC, if you commit to doing decompression on the CPU then you are stuck with the situation that you are still routing all compressed data via the CPU. I'd argue it makes sense to have a smarter controller elsewhere, more like the traditional northbridge.

Traffic to main memory or the GPU already goes via the CPU because that's where the traditional northbridge functions now reside. So if it made sense to do this in hardware (I don't think it does), then the CPU is a reasonable place to put that block. But that has the disadvantage, compared to the existing iteration of RTX IO, of sending uncompressed data over the PCIe bus to the GPU. I'd argue there are greater advantages in the current solution of sending that data to the GPU in compressed form, which saves considerable PCIe bandwidth (not that that is really needed either).
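
As a rough illustration of the bandwidth point (hypothetical numbers, assuming the roughly 2:1 compression ratio used elsewhere in this thread):

```python
# Hypothetical: PCIe traffic needed to deliver the same 14 GB/s of usable asset data.
effective_rate_gb_per_s = 14.0  # GB/s of decompressed data the GPU ultimately consumes
compression_ratio = 2.0         # assumed ~2:1, as in the figures discussed in this thread

# CPU decompresses first, then ships raw data to the GPU over PCIe:
pcie_traffic_cpu_path = effective_rate_gb_per_s                      # 14 GB/s on the bus

# GPU decompresses, so only compressed data crosses PCIe:
pcie_traffic_gpu_path = effective_rate_gb_per_s / compression_ratio  # 7 GB/s on the bus

print(pcie_traffic_cpu_path, pcie_traffic_gpu_path)
```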

The CPU will ALWAYS be involved because of how the hardware is linked and how Windows works.

I think you may be misinterpreting what they mean by CPU bottlenecks. They are most likely talking about:

1. The decompression, which in Forspoken still happens on the CPU (an obvious bottleneck)
2. All of the other world setup and initialisation tasks the CPU has to deal with when loading a game. These impact consoles just as much as PC and there's really good info out there about how they are usually the bottleneck even on PS5.

How does that slide fit into any currently available gaming PC, when it shows an NVMe drive connecting via a NIC?

How does doubling the throughput from 7 GB/s to 14 GB/s (with compression) require a 12x jump in CPU performance?

The 7 GB/s figure is with uncompressed data, so no CPU work is required for decompression at all. The 14 GB/s figure is the CPU doing real-time decompression of a 7 GB/s compressed stream, hence the CPU requirement balloons.
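
In other words (a purely illustrative breakdown with made-up per-GB costs, just to show why the relationship isn't linear):

```python
# Illustrative only: why going from 7 GB/s raw to 14 GB/s effective isn't "just 2x the CPU work".
# The costs below are hypothetical units of CPU work per GB, not measured figures.
io_cost_per_gb = 1.0      # file I/O overhead, paid on every GB read from the SSD
decomp_cost_per_gb = 3.0  # decompression cost, paid only when the data is compressed

# Case 1: 7 GB/s of already-uncompressed data -> no decompression work at all.
cpu_work_uncompressed = 7 * io_cost_per_gb

# Case 2: 7 GB/s compressed read, decompressed on the CPU to 14 GB/s effective.
cpu_work_compressed = 7 * io_cost_per_gb + 7 * decomp_cost_per_gb

print(cpu_work_compressed / cpu_work_uncompressed)  # 4x with these numbers; the ratio
                                                    # depends entirely on the decompression cost
```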
 
Microsoft have stated they are considering a hardware-based solution as a next step after GPU decompression, but I'm personally far from convinced that it's needed. GPUs are already far faster than required to keep up with the decompression demands of the fastest SSDs, and the remaining load on the CPU should be negligible. We're already looking at sub-2-second load times even before GPU decompression arrives.

We've seen one game with limited testing so it's very unclear exactly how much GPU and CPU performance is actually needed until the game is out and tested.

And it also runs counter to the general direction of the industry towards making things less fixed-function and more programmable.

So as per the above, why have Microsoft stated they are considering a hardware-based solution?

Fixed-function hardware is far from dead, and if Microsoft are considering going that route they obviously feel it's worthwhile.

If the capability already exists on GPUs, why add complexity and cost, while also sacrificing flexibility, by adding a dedicated hardware unit to the CPU? To say nothing of the fact that you're locked out of this functionality until you buy a new CPU, when your existing GPU can handle it just fine.

The capability already exists on certain GPUs.

RTX I/O requires an RTX GPU; plenty of people are yet to purchase one, so they don't already have a capable GPU.

And you're locked out of RT until you buy an RT-capable GPU, and locked out of mesh shaders until you buy a mesh-shader-capable GPU...

You're always locked out of something on PC until you upgrade.

I think you may be misinterpreting what they mean by CPU bottlenecks. They are most likely talking about:

1. The decompression, which in Forspoken still happens on the CPU (an obvious bottleneck)
2. All of the other world setup and initialisation tasks the CPU has to deal with when loading a game. These impact consoles just as much as PC and there's really good info out there about how they are usually the bottleneck even on PS5.

There's no confusion; they were pretty clear there are other CPU-related bottlenecks that need to be addressed.

PS5 is possibly not the best machine to use to prove your point, as it seems that, when used correctly, the I/O complex deals with everything the CPU would normally do.

The 7 GB/s figure is with uncompressed data, so no CPU work is required for decompression at all. The 14 GB/s figure is the CPU doing real-time decompression of a 7 GB/s compressed stream, hence the CPU requirement balloons.

You've missed my point which was discussed on the last few pages.
 
nvidia-rtx-io-cpu-skipped-diagram-1024x574.jpg
You've missed my point which was discussed on the last few pages.

I think you've misunderstood what people are trying to explain. The PC/RTX/DS solution on 2018+ hardware is comparable to the PS5 solution, with the potential to be (much) faster.
Unless NV, DS and everyone else is lying, of course, but I feel that's for a different topic.

https://samagame.com/blog/en/nvidias-rtx-io-will-give-pc-capabilities-comparable-to-ps5-ssds/

https://cdn.thefpsreview.com/wp-con...vidia-rtx-io-cpu-skipped-diagram-1024x574.jpg
 
nvidia-rtx-io-cpu-skipped-diagram-1024x574.jpg


I think you've misunderstood what people are trying to explain. The PC/RTX/DS solution on 2018+ hardware is comparable to the PS5 solution, with the potential to be (much) faster.
Unless NV, DS and everyone else is lying, of course, but I feel that's for a different topic.

https://samagame.com/blog/en/nvidias-rtx-io-will-give-pc-capabilities-comparable-to-ps5-ssds/

https://cdn.thefpsreview.com/wp-con...vidia-rtx-io-cpu-skipped-diagram-1024x574.jpg

I've not misunderstood anything; it's pretty clear that chart is incorrect.

And it's also a recycled system diagram from a 2019 Nvidia technology meant for servers, which used hardware PCIe switches (visible in that slide).

There's literally no gaming motherboard set up like that diagram.

But you keep on believing such a chart instead of acknowledging and challenging its obvious flaws.
 
I've not misunderstood anything; it's pretty clear that chart is incorrect.

And it's also a recycled system diagram from a 2019 Nvidia technology meant for servers, which used hardware PCIe switches (visible in that slide).

There's literally no gaming motherboard set up like that diagram.

But you keep on believing such a chart instead of acknowledging and challenging its obvious flaws.

Well, for lies and other such theories about manufacturers, I'd advise you to take that somewhere other than this topic. I'll just believe what NV, MS and other manufacturers claim until disproven.
 
Well, for lies and other such theories about manufacturers, I'd advise you to take that somewhere other than this topic. I'll just believe what NV, MS and other manufacturers claim until disproven.

I will keep it in this topic, thank you, as it's relevant to the discussion, and I also think I did a pretty good job of showing that Nvidia's claims (and that slide) are not accurate.
 
I will keep it in this topic, thank you, as it's relevant to the discussion, and I also think I did a pretty good job of showing that Nvidia's claims (and that slide) are not accurate.

Do you have any evidence that NV is lying? You'd have to come up with something more than your own humble guesses and claims.

If claiming manufacturers and corporations are lying is OK, that opens the door to more such discussions.
 
Do you have any evidence that NV is lying? You'd have to come up with something more than your own humble guesses and claims.

If claiming manufacturers and corporations are lying is OK, that opens the door to more such discussions.

I think the fact that they're marketing a gaming-centric feature using a hardware diagram that doesn't match a single gaming PC is a pretty big giveaway.

Unless you know of a motherboard that connects an NVMe drive to a NIC, which then connects to a dedicated PCIe switch? Because I don't know of any.

Do you not find it suspicious that the hardware diagram looks nothing like a current gaming PC?

Do you not find it suspicious that the hardware diagram is exactly like the one they showed in 2019 for GPUDirect Storage?

Also, Microsoft stated that reaching the XSX's equivalent 4.8 GB/s of decompressed data via a software route would take two Zen 2 CPU cores.

And yet Nvidia are claiming that 14 GB/s of decompressed data (a 2.9x increase over XSX) requires a 12x increase in CPU cores over what Microsoft claim?

Do you not find that highly suspicious?

Do you not wonder how their figure is much higher than what Microsoft (And even Sony) have claimed?

Are they using Core 2 Duos for the decompression?
 
Unless you know of a motherboard that connects an NVMe drive to a NIC, which then connects to a dedicated PCIe switch? Because I don't know of any.

Is it probable that such motherboards are in the pipeline? Or is it way too expensive?
 
Also, Microsoft stated that reaching the XSX's equivalent 4.8 GB/s of decompressed data via a software route would take two Zen 2 CPU cores.

That's not actually what MS said though. MS said that handling the file IO without Direct Storage would take 2 Zen cores to cover the overhead.

"The final component in the triumvirate is an extension to DirectX - DirectStorage - a necessary upgrade bearing in mind that existing file I/O protocols are knocking on for 30 years old, and in their current form would require two Zen CPU cores simply to cover the overhead, which DirectStorage reduces to just one tenth of single core."

They are not including CPU decompression when they talk about those two Zen 2 cores. MS then went on to say this about decompression:

""Plus it has other benefits," enthuses Andrew Goossen. "It's less latent and it saves a ton of CPU. With the best competitive solution, we found doing decompression software to match the SSD rate would have consumed three Zen 2 CPU cores. When you add in the IO CPU overhead, that's another two cores. So the resulting workload would have completely consumed five Zen 2 CPU cores when now it only takes a tenth of a CPU core. So in other words, to equal the performance of a Series X at its full IO rate, you would need to build a PC with 13 Zen 2 cores. That's seven cores dedicated for the game: one for Windows and shell and five for the IO and decompression overhead.""

So that's 5 cores in total.

And yet Nvidia are claiming that 14 GB/s of decompressed data (a 2.9x increase over XSX) requires a 12x increase in CPU cores over what Microsoft claim?

It's actually closer to a 2x difference rather than 12x when you use MS's figures.
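
A rough version of that comparison (illustrative only; it assumes MS's 5-core figure scales linearly with throughput, and we don't know Nvidia's exact assumptions):

```python
# Illustrative comparison of the MS and Nvidia core counts at the same effective rate.
ms_cores_total = 5       # 3 decompression + 2 file I/O cores, per the Goossen quote
ms_rate_gb_per_s = 4.8   # GB/s decompressed on XSX
nv_cores = 24            # cores shown on the RTX I/O slide
nv_rate_gb_per_s = 14.0  # GB/s effective on the RTX I/O slide

ms_scaled = ms_cores_total * nv_rate_gb_per_s / ms_rate_gb_per_s  # ~14.6 cores at 14 GB/s
print(nv_cores / ms_scaled)  # ~1.6x, i.e. closer to 2x than 12x
```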

There's also a lot we don't know that could influence the figures presented:

- Type of compression Nvidia were referring to (.zip? BCPack? Something else?)
- Were MS talking about .zip or BCPack?
- Were MS talking about game guaranteed 2 GB/s or peak device 2.4 GB/s?
- What CPU cores were Nvidia referring to?

There's a difference for sure between MS talking about Xbox and Nvidia's slide, but it's not remotely as big as you're saying.

There might also be other factors on PC. For example, maybe there is additional CPU overhead on PC for decompressing to main RAM and then transferring to GPU memory, and then setting assets up for GPU use through a less efficient version of DirectX.

We could really do with more detailed information.
 
Is it probable that such motherboards are in the pipeline? Or is it way too expensive?

Nothing to do with expense, it just makes no sense.

If they went this route not only would they be expecting a user to purchase an RTX GPU but also a whole new system, or at the very least a new motherboard.

My theory is RTX I/O is GPUDirect Storage from 2019 that Nvidia have simply modified to work with Direct Storage.

And these slides are likely a half-arsed, lazy attempt by Nvidia to just reuse the old marketing material they made for GPUDirect Storage.
 
That's not actually what MS said though. MS said that handling the file IO without Direct Storage would take 2 Zen cores to cover the overhead.

"The final component in the triumvirate is an extension to DirectX - DirectStorage - a necessary upgrade bearing in mind that existing file I/O protocols are knocking on for 30 years old, and in their current form would require two Zen CPU cores simply to cover the overhead, which DirectStorage reduces to just one tenth of single core."

They are not including CPU decompression when they talk about those two Zen 2 cores. MS then went on to say this about decompression:

""Plus it has other benefits," enthuses Andrew Goossen. "It's less latent and it saves a ton of CPU. With the best competitive solution, we found doing decompression software to match the SSD rate would have consumed three Zen 2 CPU cores. When you add in the IO CPU overhead, that's another two cores. So the resulting workload would have completely consumed five Zen 2 CPU cores when now it only takes a tenth of a CPU core. So in other words, to equal the performance of a Series X at its full IO rate, you would need to build a PC with 13 Zen 2 cores. That's seven cores dedicated for the game: one for Windows and shell and five for the IO and decompression overhead.""

So that's 5 cores in total.



It's actually closer to a 2x difference rather than 12x when you use MS's figures.

There's also a lot we don't know that could influence the figures presented:

- Type of compression Nvidia were referring to (.zip? BCPack? Something else?)
- Were MS talking about .zip or BCPack?
- Were MS talking about game guaranteed 2 GB/s or peak device 2.4 GB/s?
- What CPU cores were Nvidia referring to?

There's a difference for sure between MS talking about Xbox and Nvidia's slide, but it's not remotely as big as you're saying.

There might also be other factors on PC. For example, maybe there is additional CPU overhead on PC for decompressing to main RAM and then transferring to GPU memory, and then setting assets up for GPU use through a less efficient version of DirectX.

We could really do with more detailed information.

We definitely need more information, but I disagree with your numbers.

I get the impression that Nvidia's 24 CPU cores are purely for decompression, so adding in the extra three cores from what Microsoft said isn't quite right.

But.... I've dug out my old OCZ Revo drives ready for benchmarking.

Curious to see how my X58 system runs DirectStorage.
 
We definitely need more information, but I disagree with your numbers.

I get the impression that Nvidia's 24 CPU cores are purely for decompression, so adding in the extra three cores from what Microsoft said isn't quite right.

But.... I've dug out my old OCZ Revo drives ready for benchmarking.

Curious to see how my X58 system runs DirectStorage.

Here's something else to keep in mind when attempting to correlate the MS presentation to the NV slide.

According to MS...

With the best competitive solution, we found doing decompression software to match the SSD rate would have consumed three Zen 2 CPU cores. When you add in the IO CPU overhead, that's another two cores. So the resulting workload would have completely consumed five Zen 2 CPU cores when now it only takes a tenth of a CPU core

So, that's given them a 50x increase in, let's call it "efficiency". 5 cores (standard) versus 0.1 cores (improved).

For the NV presentation we only get a 48x increase in "efficiency" going by their slide. 24 cores (standard) versus 0.5 cores (improved).
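
Just to make the two ratios explicit (using only the figures quoted above):

```python
# Ratio of "standard" to "improved" CPU core counts from each source.
ms_ratio = 5 / 0.1   # MS: 5 Zen 2 cores vs 0.1 of a core with DirectStorage -> 50x
nv_ratio = 24 / 0.5  # NV slide: 24 cores vs 0.5 of a core with RTX I/O      -> 48x
print(ms_ratio, nv_ratio)
```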

So they're basically roughly equivalent, or at least in the same ballpark. Although even then they aren't "directly comparable", as we don't know what CPUs, workloads, compression, or file mix (small versus large files, one monolithic file versus multiple files) they are using for their calculations.

From that it's obvious that while the increase in efficiency is similar, they're looking at it differently, and it basically comes down to which CPU they are using for the comparison and what tests they are running. MS are presumably using a much more powerful CPU core for their comparison (the ones used in the XBS consoles) while NV are using far less capable CPU cores to make a more "dramatic"-looking graph. It's also likely that MS are using multiple files while NV may be using a single monolithic file, as that would help explain why NV's graph uses fewer CPU cores for just a dumb file transfer.

There's a lot of weirdness with the NV slide, of course, as it's primarily meant as a PR slide, whereas Microsoft's presentation is more technical and likely more accurate.

Regards,
SB
 
Well, for lies and other such theories about manufacturers, I'd advise you to take that somewhere other than this topic. I'll just believe what NV, MS and other manufacturers claim until disproven.
It's valid discussion for this thread, limited to the scope of interpreting that one slide as it's a key piece of evidence. Without really understanding what it is and where it fits in, we can't make meaningful predictions from it.

The primary issue davis.anthony identifies is that no gaming rig passes storage over a NIC, thereby making the diagrams moot. Although true, I don't think that negates the message, which is about what happens after the PCIe bus. There are two other slides:

geforce-rtx-30-series-rtx-io-games-bottlenecked-by-traditional-io.jpg


geforce-rtx-30-series-rtx-io-compressed-data-needed.jpg

These are accurate in topology to the current requirements of gaming PCs - the CPU core numbers could be overinflated for PR purposes. Introduce the third slide relative to these other two...

geforce-rtx-30-series-rtx-io-announcing-rtx-io.jpg


...and we see a clear message that with RTX IO the data is routed from PCIe straight to the GPU. Data points aside, which might well be exaggerated worst/best-case marketing numbers for PR purposes, that's the whole message of RTX IO: it bypasses CPU involvement in getting data from storage to the GPU and 'GPU RAM', which I take to be VRAM.

I think at this juncture it's better to take this at face value and then consider how it's achieved, rather than dismiss it all as smoke and mirrors. Either that or don't involve oneself in the discussion until further info arises. Even if RTX IO turns out to be smoke and mirrors, the technical discussion favours the concept of working around Windows' present IO limitations. It's better to ask "are nVidia doing it this way?" only to find out they weren't doing that at all, than to ask "is this all bullshit?" only to find out it wasn't. ;-)
 
It's valid discussion for this thread, limited to the scope of interpreting that one slide as it's a key piece of evidence. Without really understanding what it is and where it fits in, we can't make meaningful predictions from it.

The primary issue davis.anthony identifies is that no gaming rig passes storage over a NIC, thereby making the diagrams moot. Although true, I don't think that negates the message, which is about what happens after the PCIe bus. There are two other slides:




These are accurate to the current requirements of gaming PCs. Introduce the third slide relative to these other two...



And we see a clear message that the data is routed from PCIe straight to the GPU. Data points aside, which might well be exaggerated worst-case marketing numbers for PR purposes, that's the whole message of RTX IO: it bypasses CPU involvement in getting data from storage to the GPU and 'GPU RAM', which I take to be VRAM. RTX IO is different from DirectStorage, as DXS works with RTX IO.

I think at this juncture it's better to take this at face value and then consider how it's achieved, rather than dismiss it all as smoke and mirrors. Either that or don't involve oneself in the discussion until further info arises. Even if RTX IO turns out to be smoke and mirrors, the technical discussion favours the concept of working around Windows' present IO limitations.

Based on what I've found, that PCIe block on the diagram is actually a completely separate PCIe switch which has nothing to do with the CPU.

Motherboards don't have this PCIe switch either, but it's on the GPUDirect Storage diagrams.

So the diagrams are nothing like what we have now; they could represent a far-future version of RTX I/O and DirectStorage where they eventually transition to having more hardware on the board.

Having a PCIe switch like in the diagram would actually be a very good solution.

You could have, say, a 10-lane PCIe switch with a 4+6 set-up.

So you would connect your 8-lane NVMe drive to this switch, and it would send 4 lanes' worth of bandwidth to the CPU/RAM and 6 lanes' worth to the GPU... or the switch could be configured to send varying amounts of bandwidth where needed; for example, if you're not gaming it could give the GPU just 1 lane and the OS the remaining 9.
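
Purely as a sketch of that hypothetical switch behaviour (nothing like this exists on current boards; the lane counts, per-lane bandwidth and class below are made up for illustration):

```python
# Hypothetical sketch of a configurable PCIe switch splitting an NVMe drive's
# bandwidth between the CPU/RAM path and a direct-to-GPU path.
from dataclasses import dataclass


@dataclass
class HypotheticalIoSwitch:
    total_lanes: int = 10  # upstream lanes available to allocate
    gpu_lanes: int = 6     # default "gaming" split: 6 to the GPU, 4 to CPU/RAM

    def set_profile(self, gaming):
        # Reallocate lanes depending on the workload, as described above.
        self.gpu_lanes = 6 if gaming else 1

    def bandwidth_split(self, gb_per_s_per_lane=2.0):
        # Returns (GPU GB/s, CPU/RAM GB/s), assuming roughly PCIe 4.0 ~2 GB/s per lane.
        cpu_lanes = self.total_lanes - self.gpu_lanes
        return self.gpu_lanes * gb_per_s_per_lane, cpu_lanes * gb_per_s_per_lane


switch = HypotheticalIoSwitch()
switch.set_profile(gaming=True)
print(switch.bandwidth_split())  # (12.0, 8.0) GB/s with the 6+4 split
```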
 