Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Through the CPU die, although there should still be some CPU intervention, no?

Decompression is only one part; the others, such as 'file check-in' and other tasks Mark Cerny talked about, will still be handled CPU-side?

RTX I/O isn't going to completely remove the CPU from having to do work.

What decompresses the files that are not graphics-related? Audio, for example - is that still done CPU-side as normal, or does it go to the GPU for decompression and then CPU > RAM?

The benefit of PS5's set-up is that the decompression hardware and I/O complex decompress all game-related data in the most straightforward way possible.

Watch the videos posted above; it is quite similar to the PS5's idea, as far as we can believe NV, their slides and the various explanations given.
 
What decompresses the files that are not graphics-related? Audio, for example - is that still done CPU-side as normal, or does it go to the GPU for decompression and then CPU > RAM?

If it does go to the GPU for decompression, then it doesn't have to go back to the CPU before going to system RAM. The only reason that it has to hit the CPU die when reading from a storage device is that all storage PCIe lanes go through the I/O complex on the CPU. Data from the GPU can be written directly to system RAM.

If we were to think of the GPU as the point where everything is decompressed, then the only difference to the PS5 I/O chain is that, when reading, data needs to be "bounced" off the CPU. Once that is accomplished it's roughly the same from the decompression step onwards, except that the GPU is acting as the decompressor. On PS5 it would go from the decompressor to RAM. On PC, it would go from the decompressor to either VRAM or system RAM.

But yes, we currently don't know whether the GPU would handle all game related decompression tasks or not.
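To put the PC side in concrete terms, here's a minimal sketch of how one asset might be streamed straight into VRAM with GPU decompression via DirectStorage, which RTX I/O is expected to plug into. Everything here (device, file path, buffer, sizes) is a placeholder, and it assumes the GDeflate compression format used for GPU-side decompression:

#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Minimal sketch: stream one GDeflate-compressed asset straight into a VRAM buffer.
// The device, path, buffer and sizes are placeholders.
void LoadToVram(ID3D12Device* device, ID3D12Resource* vramBuffer,
                UINT32 compressedSize, UINT32 uncompressedSize)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets.gdeflate", IID_PPV_ARGS(&file)); // placeholder path

    DSTORAGE_REQUEST req{};
    req.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    req.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;  // straight to VRAM
    req.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE; // GPU decompresses
    req.Source.File.Source          = file.Get();
    req.Source.File.Size            = compressedSize;
    req.UncompressedSize            = uncompressedSize;
    req.Destination.Buffer.Resource = vramBuffer;
    req.Destination.Buffer.Size     = uncompressedSize;
    queue->EnqueueRequest(&req);
    queue->Submit(); // completion would be tracked with an ID3D12Fence
}

The request names its final destination up front; the runtime and driver decide the route, and the application never touches the compressed bytes.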

Regards,
SB
 
If it does go to the GPU for decompression, then it doesn't have to go back to the CPU before going to system RAM. The only reason that it has to hit the CPU die when reading from a storage device is that all storage PCIe lanes go through the I/O complex on the CPU. Data from the GPU can be written directly to system RAM.

If we were to think of the GPU as the point where everything is decompressed, then the only difference to the PS5 I/O chain is that, when reading, data needs to be "bounced" off the CPU. Once that is accomplished it's roughly the same from the decompression step onwards, except that the GPU is acting as the decompressor. On PS5 it would go from the decompressor to RAM. On PC, it would go from the decompressor to either VRAM or system RAM.

But yes, we currently don't know whether the GPU would handle all game related decompression tasks or not.

Regards,
SB

Then wouldn't the bandwidth from the drive to the CPU, and from the CPU to the GPU, become the bottleneck at some point?

I think the next step is to just attach the storage to the GPU instead of the CPU.
 
So your position is that Sandy Bridge was obviously "orders of magnitude" faster and better than Westmere in ways that never manifested themselves in measurable ways? And you think it's sensible to measure the transition point of the northbridge logic being incorporated into the CPU die with modern AMD technologies, as if there had been no evolution of technology in the last eleven years, rather than with Westmere to Sandy Bridge - the actual transition point?

Please quote where I referenced Sandy Bridge and Westmere in that way? That's right, I did not. In the context of discussing modern PC architectures you claimed, and I quote, "the way the processor part of the die connects to the external bus part of the die is not that different from when the separate processor chip was connected to the separate northbridge via the FSB". To which I responded that the bandwidth and latency of that connection are orders of magnitude better than the old FSB, and thus it is very different. I made absolutely no claims about the real-world performance implications of that interconnect improvement because it had absolutely no bearing on the core argument.

You trying to twist that into me claiming that "Sandy Bridge was obviously 'orders of magnitude' faster and better than Westmere" is disingenuous at best. It's also entirely irrelevant to the core argument, which was your incorrect implication that the Northbridge was some obstacle that the PC had to contend with which the consoles did not.

Nobody, myself included, said the northbridge - or the need for data to move over buses - was an obstacle, just the path the data must take.

Here's what you said:

DSoup said:
but Microsoft's API cannot change the fundamental data flow across hardware on your average PC for which there are two effective setups:

2) for drives using NVMe/PCIe connections - your data is read off the storage cells by the drive controller, passed to the bus controller in the north-bridge, then has to be routed to either main memory or the graphics card memory. If the GPU is decompressing data it's doing that from GDDR then writing it back to the GDDR for graphics use or redirecting it across the north bridge controller to main memory for use by the CPU.

Current generation consoles have very simple (and limited) architectures. Data is read off the storage cells by a single I/O controller, which decompresses it automatically, and is written to one pool of shared memory. So even where PC components and drives are much faster, they are still moving data around a lot more.

You're clearly framing the journeys over the Northbridge as an additional step the PC has to manage that the console does not. Again, this is wrong. You misrepresent both data flows above to make one seem significantly more complicated than the other. Here is what the console flow would look like if written in exactly the same terms as you used for the PC flow:

"your data is read off the storage cells by the drive controller, passed to the decompression unit where it's written to the local cache, decompressed and then written back to that cache before being directed across the north-bridge, to main memory"

Why the inconsistency? If you're going to mention the Northbridge in one description, why miss it from the other?

I get that you like to pretend things that aren't on separate chips don't exist, despite all lithography analysis of Intel CPUs and Intel's own logic diagrams very clearly showing discrete logic functions still existing, and there being a clear logic path (the bus) between the CPU and those blocks, but you... ok. You can die on that hill.

Again, stop trying to wildly misrepresent what I've said. Clearly I've never tried to claim that the functions that used to be handled by the Northbridge "no longer exist". I have referenced multiple times over the past several posts how they were integrated into the CPU, including in the very first post on this subject. The issue here is you framing the Northbridge as something the PC has to deal with and the console does not. Which, once again, is wrong. Why are you so resistant to just admitting this and moving on? These last several pages of argument have been completely unnecessary. A simple acknowledgment that the "Northbridge" functionality/requirement is the same in both console and modern PC is all that was required, rather than endless posts arguing over increasingly off-topic minutiae.

All people are saying is that on consoles, the I/O controller decompresses data (without any need to read and write it to RAM) during the process of transferring it from storage. The data undergoes two stages: first it goes to the I/O controller, where decompression happens in the on-die cache, then it's written direct to memory. It doesn't get any simpler, smarter or more efficient than that. Having to move data around a bunch of places - reading compressed data from RAM and writing decompressed data back to RAM - is a less efficient approach. But it's the only one that exists on PC right now.

No, it isn't written direct to memory. In your own framing it is directed across the Northbridge to main memory. I realise that's just semantics, but the framing is what's important, because you're trying to represent the simplicity of one data flow vs the complexity of another.

To illustrate this, take a PC that's using an APU instead. By your description above, the data undergoes the same two stages, simply in reverse. Data is written direct to memory from the SSD, where it is decompressed by the GPU (or CPU) ready for use. Just as simple as the console route you describe above. For PCs with a dGPU there is just one additional step - moving the GPU data from main memory to the GPU. And RTX I/O may even address that. Far from "moving data around a bunch of places" as you describe it above.

Also, what evidence do you have to support your suggestion that writing the data to GPU memory for decompression introduces any kind of performance penalty compared to doing it in the local memory of the console's hardware decompressor? And if you're not saying that, then why mention it at all? We've been told that GPU decompression is more than capable of keeping up with the fastest available NVMe drives, so why does it matter which memory pool is being used for it? Except of course to frame it as somehow more complicated/inferior/likely to mitigate the benefits of faster components.
 
Through the CPU die, although there should still be some CPU intervention, no?

Decompression is only one part; the others, such as 'file check-in' and other tasks Mark Cerny talked about, will still be handled CPU-side?

As far as I understand it these are handled by dedicated cores in the SSD controller itself. There was a blog post by someone from RAD Game Tools a while back which explained all these extra bits in the PS5 as basically standard components of a regular NVMe drive, just with a custom firmware. We do know for example that regular NVMe controllers can feature multiple ARM cores.

What decompresses the files that are not graphics-related? Audio, for example - is that still done CPU-side as normal, or does it go to the GPU for decompression and then CPU > RAM?

Based on the Forspoken GDC presentation, it seems CPU and GPU data will be separated and sent to their respective memory pools for parallel decompression. So I think it's fair to say there will probably still be more load on the PC CPU compared to the console CPU. But as the CPU side is much smaller than the GPU side (MS claims about 80% of streamed data is textures), the CPU load is still massively reduced.
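To sketch what that split might look like at the DirectStorage request level (the queues, buffers and sizes are all hypothetical, and whether CPU-bound data is stored uncompressed or run through a CPU-side codec afterwards is an engine choice, not something the API dictates):

#include <dstorage.h>

// Hypothetical split of one load into a GPU-destined and a CPU-destined request.
// 'gpuQueue' was created with a D3D12 device; 'cpuQueue' can be created with
// Device = nullptr, which limits it to system-memory destinations.
void EnqueueSplit(IDStorageQueue* gpuQueue, IDStorageQueue* cpuQueue,
                  IDStorageFile* file, ID3D12Resource* vramBuffer,
                  void* systemRam, UINT32 compressedSize, UINT32 uncompressedSize)
{
    // GPU-bound asset (e.g. textures): GPU decompresses, result lands in VRAM.
    DSTORAGE_REQUEST gpuReq{};
    gpuReq.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    gpuReq.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    gpuReq.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    gpuReq.Source.File.Source          = file;
    gpuReq.Source.File.Size            = compressedSize;
    gpuReq.UncompressedSize            = uncompressedSize;
    gpuReq.Destination.Buffer.Resource = vramBuffer;
    gpuReq.Destination.Buffer.Size     = uncompressedSize;
    gpuQueue->EnqueueRequest(&gpuReq);

    // CPU-bound asset (audio, game data): delivered to system RAM; here it is
    // read as stored, with any CPU-side codec applied by the title afterwards.
    DSTORAGE_REQUEST cpuReq{};
    cpuReq.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    cpuReq.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_MEMORY;
    cpuReq.Source.File.Source      = file;
    cpuReq.Source.File.Size        = compressedSize;
    cpuReq.Destination.Memory.Buffer = systemRam;
    cpuReq.Destination.Memory.Size   = compressedSize;
    cpuQueue->EnqueueRequest(&cpuReq);
}

Both requests travel the same path off the drive; the destination type is what decides which pool, and which processor's decompressor, the data ends up with.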

The benefit of PS5's set-up is that the decompression hardware and I/O complex decompress all game-related data in the most straightforward way possible.

But is there a real-world benefit to this apart from being easier to draw on paper? If 80+% of the decompression workload has now been removed from the CPU, and the I/O overhead has been reduced to a tiny fraction of what it is today, would the CPU even be a bottleneck anymore? Could it be argued that the higher decompression capacity of the combined CPU and GPU is a better trade-off, because it gives headroom to work with much faster NVMe drives?

Then what exactly has all this fuss been over? If the Northbridge has no impact on I/O performance, why have people spent so many words arguing over what and where it is? :???:

This is pretty much the crux of my argument. But going more fundamental than this, even if the Northbridge did have an impact on performance, it would be having the same impact on the console too. Because what we're referring to as the Northbridge here is actually just the CPU's integrated memory controller. And obviously to copy data from an SSD to main memory, you have to go via the memory controller.
 
So from what I can see, it would seem that RTX I/O is the GPUDirect Storage that Nvidia debuted in 2019, or is based on that work, as some of the diagrams are very similar.
 
As far as I understand it these are handled by dedicated cores in the SSD controller itself. There was a blog post by someone from RAD Game Tools a while back which explained all these extra bits in the PS5 as basically standard components of a regular NVMe drive, just with a custom firmware. We do know for example that regular NVMe controllers can feature multiple ARM cores.

But is there a real-world benefit to this apart from being easier to draw on paper? If 80+% of the decompression workload has now been removed from the CPU, and the I/O overhead has been reduced to a tiny fraction of what it is today, would the CPU even be a bottleneck anymore? Could it be argued that the higher decompression capacity of the combined CPU and GPU is a better trade-off, because it gives headroom to work with much faster NVMe drives?

The decompression work can be removed from the CPU, but what about everything else that follows?

As Mark Cerny pointed out, handling the memory writes and file 'check-ins' for 100MB worth of data coming off the disk is nothing for a CPU, but change that to several gigabytes and it becomes a huge task.

How has that part improved? Do we all need to be rocking 10-core CPUs for next gen?

Forspoken is hugely impressive, but it's still a game that can work on a SATA III SSD, so it's not going to show how hard the CPU and GPU will get hit by I/O in the next 2-3 years.

Do we have CPU utilisation results for Forspoken?
 
The decompression work can be removed from the CPU, but what about everything else that follows?

As Mark Cerny pointed out, handling the memory writes and file 'check-ins' for 100MB worth of data coming off the disk is nothing for a CPU, but change that to several gigabytes and it becomes a huge task.

This is already dealt with by DirectStorage. The overhead associated with this I/O management is reduced from multiple CPU cores to 10% of a single core, according to Microsoft.
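The way that saving is achieved at the API level is batching: instead of one blocking read and one completion per file, a title enqueues a pile of requests and waits on a single fence signal. A hedged sketch of the pattern, with the asset record and request-building helper being hypothetical:

#include <dstorage.h>
#include <vector>

struct Asset { UINT64 offset; UINT32 size; };  // hypothetical asset record
DSTORAGE_REQUEST BuildRequest(const Asset&);   // hypothetical helper that fills a request

void SubmitBatch(IDStorageQueue* queue, ID3D12Fence* fence, HANDLE fenceEvent,
                 UINT64& fenceValue, const std::vector<Asset>& loadList)
{
    // Enqueue the whole batch; nothing is submitted to the drive yet.
    for (const Asset& asset : loadList)
    {
        DSTORAGE_REQUEST req = BuildRequest(asset);
        queue->EnqueueRequest(&req);
    }

    // One fence signal and one submission cover every request above.
    queue->EnqueueSignal(fence, ++fenceValue);
    queue->Submit();

    // The CPU waits on (or polls) a single event rather than N completions.
    fence->SetEventOnCompletion(fenceValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}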

Forspoken is hugely impressive, but it's still a game that can work on a SATA III SSD, so it's not going to show how hard the CPU and GPU will get hit by I/O in the next 2-3 years.

Forspoken isn't the best example anyway, as it's not using GPU decompression yet. So CPU utilisation is going to be much higher than it will be once that is in use.
 
The decompression work can be removed from the CPU, but what about everything else that follows?

As Mark Cerny pointed out, handling the memory writes and file 'check-ins' for 100MB worth of data coming off the disk is nothing for a CPU, but change that to several gigabytes and it becomes a huge task.

How has that part improved? Do we all need to be rocking 10-core CPUs for next gen?

Forspoken is hugely impressive, but it's still a game that can work on a SATA III SSD, so it's not going to show how hard the CPU and GPU will get hit by I/O in the next 2-3 years.

Do we have CPU utilisation results for Forspoken?

Wouldn't the next step be to add dedicated DSPs / co-processors just for these steps? MS created dedicated hardware to do this on the Xbox Series. I would imagine the next step would be MS licensing it out to interested parties so it can be integrated into the CPU, or I'd imagine it would be more useful at the disk drive or disk drive controller.
 
Based on the Forspoken GDC presentation, it seems CPU and GPU data will be separated and sent to their respective memory pools for parallel decompression. So I think it's fair to say there will probably still be more load on the PC CPU compared to the console CPU. But as the CPU side is much smaller than the GPU side (MS claims about 80% of streamed data is textures), the CPU load is still massively reduced.
Yeah, what I'm getting from all of this is that the two main benefits of RTX I/O are the (theoretical) saving from only transferring compressed data over the bus, and that decompression of data needed by the GPU can be done on the GPU, thus saving CPU cycles for those specific assets. These are both positive things, of course, but I don't know if we are really at a point where bus bandwidth is limiting things in a way that is user-facing. I'm not sure this is a magic bullet for eliminating loading like some people hope it will be, but it's a step in the right direction.
 
This is already dealt with by DirectStorage. The overhead associated with this I/O management is reduced from multiple CPU cores to 10% of a single core, according to Microsoft.

Was that on the Series consoles or Windows?

Wouldn't the next step be to add dedicated DSPs / co-processors just for these steps? MS created dedicated hardware to do this on the Xbox Series. I would imagine the next step would be MS licensing it out to interested parties so it can be integrated into the CPU, or I'd imagine it would be more useful at the disk drive or disk drive controller.

I said last year that I wouldn't be surprised if we see something like PS5's I/O complex included in CPUs in the future.
 
Yeah, what I'm getting from all of this is that the two main benefits of RTX I/O are the (theoretical) saving from only transferring compressed data over the bus, and that decompression of data needed by the GPU can be done on the GPU, thus saving CPU cycles for those specific assets. These are both positive things, of course, but I don't know if we are really at a point where bus bandwidth is limiting things in a way that is user-facing. I'm not sure this is a magic bullet for eliminating loading like some people hope it will be, but it's a step in the right direction.
I don't think it's about doubling the bandwidth of the bus, but rather about sending the data in compressed form in around half the time.

Any way you look at it, it just makes sense to send compressed data over any bus when you can.
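The back-of-envelope arithmetic, assuming a 2:1 compression ratio purely for illustration (real ratios depend on the codec and the content):

#include <cstdio>

int main()
{
    constexpr double linkGBps = 7.0; // ~usable bandwidth of a PCIe 4.0 x4 NVMe link
    constexpr double assetGB  = 2.0; // uncompressed asset size, illustrative
    constexpr double ratio    = 2.0; // assumed 2:1 compression

    std::printf("uncompressed on the wire: %.2f s\n", assetGB / linkGBps);           // ~0.29 s
    std::printf("compressed on the wire:   %.2f s\n", (assetGB / ratio) / linkGBps); // ~0.14 s
}

Half the bytes over the same link: call it "half the time" or "double the effective bandwidth", it's the same saving.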
 
I said last year that I wouldn't be surprised if we see something like PS5's I/O complex included in CPUs in the future.

Tim Sweeney seems to think the PS5 architecture will have a strong influence on PCs in the future, so I wouldn't be surprised.

In the meantime, I think MS are doing an admirable job of improving where they can, given the existing state of things.
 
I said last year that I wouldn't be surprised if we see something like PS5's I/O complex included in CPUs in the future.

Maybe, but an evolution of it of some sort, if anything. With things as they are now it's not really needed either; PC I/O is competitive enough for this generation, probably faster.
 
Then what exactly has all this fuss been over? If the Northbridge has no impact on I/O performance, why have people spent so many words arguing over what and where it is? :???:
You tell me. It began with my post on how compressed data read off the SSD becomes decompressed data usable by the GPU and CPU on a PC. Whether you want to refer to the logic block as the 'northbridge' or the 'system agent' (the nomenclature since Sandy Bridge), and whether it's on- or off-die, is neither here nor there and changes nothing. ¯\_(ツ)_/¯
 
Maybe, but an evolution of it of some sort, if anything. With things as they are now it's not really needed either; PC I/O is competitive enough for this generation, probably faster.

And that is based on seeing one game (Forspoken) built to run on HDDs?

What does it run like on the average PC with a 6-core CPU and SATA III SSD?

And moving everything to the CPU via dedicated fixed-function hardware would be a better option, as it would offer a more efficient approach.
 
I said last year that I wouldn't be surprised if we see something like PS5's I/O complex included in CPU's in the future.
Well, that's another debate. Even on consoles, where there is unified memory and an APU containing both CPU and GPU, both Microsoft and Sony chose to put the decompression block in dedicated silicon rather than run it on the CPU or GPU.

On PC, if you commit to doing decompression on the CPU then you are stuck with the situation that you are still routing all compressed data via the CPU. I'd argue it makes more sense to have a smarter controller elsewhere, more like the traditional northbridge, where data read off the SSD can be decompressed and then routed directly to main RAM for the CPU, or to GDDR for the GPU.

Having to route data for the CPU via the GPU, or data for the GPU via the CPU, just for the purpose of doing basic decompression, is less efficient. People here are thinking about gaming and whether there is enough bandwidth, but PCs are used for incredibly heavy data I/O tasks, and moving data around just to get it to the right place is inefficient.
 