Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

There will be some nuance to the DirectStorage implementation in software (the game), but Microsoft's API cannot change the fundamental data flow across hardware on your average PC, for which there are two effective setups.

Yeah, I guess it's the same as with the PS5 super-duper speeds everybody thought would magically happen. To get better results, you need to tailor the software for the solution. It will be interesting to see, going forward, how much performance gain they will "scrape" out of it.
 
To be fair to anyone who believed this, Sony's pre-launch messaging included statements by an unnamed Sony rep claiming loading screens were going to be a "thing of the past" and Cerny claiming you could load in data as fast as you could turn your character around.
 

Cerny’s claims stand though as far as I am concerned.
 

It's PR just like any other; it was up to each person to believe it all or not. Anyway, it's great to see the PC platform being up there with the consoles in NVMe/I/O tech. Forspoken loading in under 2 seconds is quite a good start, as well as the 5,000 MB/s read speed test using DirectStorage.
 
There will be some nuance to the DirectStorage implementation in software (the game), but Microsoft's API cannot change the fundamental data flow across hardware on your average PC, for which there are two effective setups:

2) for drives using NVMe/PCIe connections - your data is read off the storage cells by the drive controller, passed to the bus controller in the north-bridge, then has to be routed to either main memory or the graphics card memory. If the GPU is decompressing data it's doing that from GDDR, then writing it back to the GDDR for graphics use or redirecting it across the north-bridge controller to main memory for use by the CPU.

Current generation consoles have very simple (and limited) architectures. Data is read off the storage cells by a single I/O controller, which decompresses it automatically, and it is written to one pool of shared memory. So even where PC components and drives are much faster, they are still moving data around a lot more.

The thing is that the latency involved in passing the data from one bus to the next is so small compared with the time it takes to transfer the data that it makes virtually no difference to the real world result. i.e. if you're waiting a second to load your data, what's a few extra microseconds (or less) to pass it between different controllers? The bottleneck remains how quickly you can pull the data off the SSD, or how quickly you can do all of the other processing on the CPU - by a very wide margin.
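To put rough numbers on that argument (illustrative figures for a sketch, not measurements from any specific system):

```python
# Back-of-envelope: does per-hop bus latency matter next to transfer time?
# All numbers are illustrative assumptions, not benchmarks.

bytes_to_load = 5 * 10**9          # 5 GB of asset data
drive_throughput = 5 * 10**9       # 5 GB/s sequential read
hops = 3                           # e.g. drive -> root complex -> RAM -> GPU
latency_per_hop = 1e-6             # ~1 microsecond per controller hand-off

transfer_time = bytes_to_load / drive_throughput   # 1.0 s spent on the SSD read
hop_overhead = hops * latency_per_hop              # 3 microseconds spent on hops

print(f"transfer: {transfer_time:.3f} s, hop overhead: {hop_overhead * 1e6:.0f} us")
print(f"overhead fraction: {hop_overhead / transfer_time:.1e}")  # ~3e-06
```

Even if the per-hop latency were 100x worse than assumed here, it would still be lost in the noise next to the raw transfer time.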

And you're correct that GPU decompression would be using the GPU memory and caches for that work. But I'm not clear how that is disadvantageous to the hardware block of the consoles which is effectively doing the same thing in hardware with local caches. The key factor is whether that acts as a bottleneck to the throughput or not. Which in both cases we're told that it won't.
 
There will be some nuance to the DirectStorage implementation in software (the game), but Microsoft's API cannot change the fundamental data flow across hardware on your average PC, for which there are two effective setups:

1) for drives connected over legacy (e.g. IDE) buses - your data is read off the storage cells by the drive controller, passed to the bus controller in the south-bridge, which then routes it to either main memory or the graphics card memory via the north-bridge. If the GPU is decompressing data it's doing that from GDDR, then writing it back to the GDDR for graphics use or redirecting it across the north-bridge controller to main memory for use by the CPU.

2) for drives using NVMe/PCIe connections - your data is read off the storage cells by the drive controller, passed to the bus controller in the north-bridge, then has to be routed to either main memory or the graphics card memory. If the GPU is decompressing data it's doing that from GDDR, then writing it back to the GDDR for graphics use or redirecting it across the north-bridge controller to main memory for use by the CPU.

Current generation consoles have very simple (and limited) architectures. Data is read off the storage cells by a single I/O controller, which decompresses it automatically, and it is written to one pool of shared memory. So even where PC components and drives are much faster, they are still moving data around a lot more.
GPU will only decompress things like texture and geometry data. CPU will continue to decompress other things like audio.

The thing is that the latency involved in passing the data from one bus to the next is so small compared with the time it takes to transfer the data that it makes virtually no difference to the real world result. i.e. if you're waiting a second to load your data, what's a few extra microseconds (or less) to pass it between different controllers? The bottleneck remains how quickly you can pull the data off the SSD, or how quickly you can do all of the other processing on the CPU - by a very wide margin.

And you're correct that GPU decompression would be using the GPU memory and caches for that work. But I'm not clear how that is disadvantageous to the hardware block of the consoles which is effectively doing the same thing in hardware with local caches. The key factor is whether that acts as a bottleneck to the throughput or not. Which in both cases we're told that it won't.

There shouldn't be any issue with throughput. During "load times" you're not rendering anything, and the GPU can go flat out on decompression, maximizing throughput. And during streaming, you're not going to require that bandwidth at a constant rate.

I see no reason why it would be worse on the GPU than it already is on the CPU.
 
The thing is that the latency involved in passing the data from one bus to the next is so small compared with the time it takes to transfer the data that it makes virtually no difference to the real world result. i.e. if you're waiting a second to load your data, what's a few extra microseconds (or less) to pass it between different controllers?

Bandwidth is also important. Being able to move smaller amounts of data quickly is what a general bus is designed for, but when you're trying to move large amounts of data quickly you may hit bus arbitration issues. The PCIe bus is good at moving data really quickly one way, but if you're shuttling data from storage to GPU for decompression, then moving it to RAM in an asynchronous flow, arbitration may be the cause of the numbers here. Maximum PCIe bandwidth is almost always quoted in burst mode, and burst mode is largely predicated on periods of synchronous I/O.

And you're correct that GPU decompression would be using the GPU memory and caches for that work. But I'm not clear how that is disadvantageous to the hardware block of the consoles which is effectively doing the same thing in hardware with local caches.
It's not disadvantageous, it's just different. Current generation consoles can pull compressed data from the drive and decompress a number of formats supported by the I/O controller. There is no moving data into RAM local to the GPU or CPU, or any need to potentially move data elsewhere after decompression. You may read 2GB of data from the SSD and have 6GB delivered into RAM without the GPU or CPU having been involved at all. This is all managed using cache on the I/O controller itself. One unified RAM pool also simplifies the model.

GPU will only decompress things like texture and geometry data. CPU will continue to decompress other things like audio.
This is part of the issue, surely? If you have a .PAK (zip) file that includes a mix of data types, like textures for the GPU and audio and geometry data for the world, where does it go? This is why loading takes a while now. The data needs to be loaded, picked up and re-directed.
 

I remember two years ago how you discussed I/O on the PC platform. It didn't really turn out to be that way.

Anyway, we will see how this all compares to the PS5 (or Xbox) when we can test, play and benchmark things against each other. And then you can conclude how bad it all is.
 
I remember two years ago how you discussed IO on the pc platform. Didnt really turn out to be that way.
Can you provide some quotes? I'm not sure what posts you are referring to, or what data you think contradicts them. I do recall being skeptical that a Windows API could negate innate PC architecture design, i.e. how individual components are connected via hardware buses.
 
This is part of the issue, surely? If you have a .PAK (zip) file that includes a mix of data types, like textures for the GPU and audio and geometry data for the world, where does it go? This is why loading takes a while now. The data needs to be loaded, picked up and re-directed.
Don't we know where it goes? There will be a new class of compression technology which will likely require things to be packaged differently. It will continue to go to system memory first, but now the CPU will only decompress things such as audio assets into RAM while it copies the compressed geometry and texture data to VRAM for decompression by the GPU.
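As a sketch of that split (all names here are hypothetical; real DirectStorage requests go through its queue API, which this doesn't attempt to model):

```python
# Hypothetical sketch of routing a mixed .PAK archive under GPU decompression.
# 'system_ram' and 'vram' are plain lists standing in for the two memory pools.

GPU_DECOMPRESSED = {"texture", "geometry"}   # inflated by the GPU in VRAM
CPU_DECOMPRESSED = {"audio", "script"}       # inflated by the CPU in system RAM

def route_assets(pak_entries):
    """pak_entries: list of (asset_type, compressed_blob) tuples."""
    system_ram, vram = [], []
    for asset_type, blob in pak_entries:
        if asset_type in GPU_DECOMPRESSED:
            # compressed data is copied to VRAM; the GPU inflates it there
            vram.append((asset_type, blob))
        else:
            # the CPU inflates everything else straight into system RAM
            system_ram.append((asset_type, blob))
    return system_ram, vram

ram, vram = route_assets([("texture", b"..."), ("audio", b"..."), ("geometry", b"...")])
print(len(ram), len(vram))  # 1 2
```

The key point the sketch captures is that a mixed archive no longer has a single destination: the packaging has to let the loader split it by consumer up front.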
 
Bandwidth is also important. Being able to move smaller amounts of data quickly is what a general bus is designed for, but when you're trying to move large amounts of data quickly you may hit bus arbitration issues. The PCIe bus is good at moving data really quickly one way, but if you're shuttling data from storage to GPU for decompression, then moving it to RAM in an asynchronous flow, arbitration may be the cause of the numbers here. Maximum PCIe bandwidth is almost always quoted in burst mode, and burst mode is largely predicated on periods of synchronous I/O.

Yes, bandwidth is important. That was my point: bandwidth for moving data is more important than the relatively minuscule latencies involved with passing that data between different buses. Are you suggesting that PCIe bandwidth is somehow a limiting factor here though? Because if you are, I'm really not following. On the PC the data will move from NVMe to root complex over a PCIe 4x bus - same as the PS5. After that it moves on to the GPU over a 16x bus. Why would the 16x PCIe bus from root complex to GPU in the PC be a limiting factor when the 4x PCIe bus from NVMe to root complex in the PS5 wouldn't be?

And for that matter, if a PCIe 4x interface (let alone a 16x interface) were a limiting factor in data transfer speeds, then why do benchmarks demonstrate throughput improvements, even with small block sizes, in NVMe drives all the way up to 7GB/s and beyond?
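The theoretical link rates back this up. A quick calculation (ignoring packet and protocol overhead, so real drives land somewhat below these ceilings):

```python
# Rough theoretical PCIe 4.0 throughput per link width, accounting only for
# the 128b/130b line encoding. Real-world figures sit a bit lower.

def pcie4_gb_per_s(lanes):
    gt_per_s = 16e9                # PCIe 4.0 signals at 16 GT/s per lane
    encoding = 128 / 130           # 128b/130b line code efficiency
    return lanes * gt_per_s * encoding / 8 / 1e9   # bits -> bytes -> GB/s

x4, x16 = pcie4_gb_per_s(4), pcie4_gb_per_s(16)
print(f"x4: {x4:.1f} GB/s, x16: {x16:.1f} GB/s")   # ~7.9 and ~31.5
# A 7 GB/s NVMe drive fits within its x4 link, and the x16 GPU link is 4x wider.
```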

It's not disadvantageous, it's just different. Current generation consoles can pull compressed data from the drive and decompress a number of formats supported by the I/O controller. There is no moving data into RAM local to the GPU

Yes there is. The shared memory of the console is RAM local to the GPU. The only difference here is that on PC the data moves over more busses to get there. The narrowest bus on the PC is still equivalent in width to the single bus on the PS5 though. The rest are much wider. So while this adds latency, it's insignificant to the overall result of x MB transferred in x ms.

You may read 2GB of data from the SSD and have 6GB delivered into RAM without the GPU or CPU having been involved at all. This is all managed using cache on the I/O controller itself.

This is true, but why does it matter if the CPU or GPU are not a bottleneck? All that matters is that the processing is completed without holding anything else up. The problem with PCs sans DirectStorage is that too much pressure is put on the CPU in terms of I/O management and decompression at a time when those CPU cycles are needed for other things like world setup. But if you move much of that work to a GPU that has more than enough capacity to pick it up, then this has no performance penalty vs doing it on a dedicated hardware block - which itself is also going to add latency to the process.
 
Cerny’s claims stand though as far as I am concerned.
I'm not saying any of it was impossible, but I never expected it to be what was always being done. And that's sort of the point I was making. People who are perhaps a bit less PR-literate than the average B3D poster could easily read those comments and expect there to never be loading, and that RAM was only being populated with what's on screen.
 
Yes bandwidth is important. This was my point, Bandwidth for moving data is more important than the relatively miniscule latencies involved with passing that data between different buses. Are you suggesting that PCIe bandwidth is somehow a limiting factor here though?
To be clear, I'm not suggesting that PCIe bandwidth is the issue. In your reply to me you mentioned only latency; I am making clear that both latency and bandwidth matter when moving a lot of data quickly. PCIe bandwidth is finite, but the limiting factor in Forspoken using DirectStorage is the CPU, because that is where the decompression occurs. On current generation consoles, decompression of supported data formats takes place in realtime as data is read. Supported formats include zlib (which is what most games use for .PAK files), Kraken (PS5) and BCPack (Xbox Series).
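For anyone unfamiliar, zlib is in every standard library, and the win is easy to demonstrate: you read far fewer bytes off the drive than you deliver to RAM (toy data here, so the ratio is far better than real game assets would achieve):

```python
# zlib round-trip: the point of the whole I/O pipeline is that the bytes read
# from the SSD are a fraction of the bytes delivered to memory.
import zlib

payload = b"repetitive game asset data " * 4096      # stand-in for an asset
packed = zlib.compress(payload, level=9)

ratio = len(payload) / len(packed)
print(f"{len(packed)} bytes read -> {len(payload)} bytes delivered ({ratio:.0f}:1)")

assert zlib.decompress(packed) == payload            # lossless round-trip
```

The console hardware blocks and GPU decompressors are doing this same inflate step, just somewhere other than the CPU.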

Yes there is. The shared memory of the console is RAM local to the GPU. The only difference here is that on PC the data moves over more busses to get there. The narrowest bus on the PC is still equivalent in width to the single bus on the PS5 though. The rest are much wider. So while this adds latency, it's insignificant to the overall result of x MB transferred in x ms.
Did you seriously just quote half a sentence just to make it appear like I'm wrong? Come on, man, that's really disingenuous and not the kind of thing you expect to see in the technical forums. Quoting what I wrote in full:

It's not disadvantageous, it's just different. Current generation consoles can pull compressed data from the drive and decompress a number of formats supported by the I/O controller. There is no moving data into RAM local to the GPU or CPU, or any need to potentially move data elsewhere after decompression. You may read 2GB of data from the SSD and have 6GB delivered into RAM without the GPU or CPU having been involved at all. This is all managed using cache on the I/O controller itself. One unified RAM pool also simplifies the model.

The difference between the PC approach and the consoles is that on the PC you need to read the compressed data and write it into one of the RAM pools. Then the CPU or the GPU needs to read that data and write out decompressed data. Data that needs to be in the other RAM pool then needs copying there.

At the risk of repeating myself, the current generation consoles have cache built into the I/O controller. Compressed data isn't read into RAM; it is temporarily read into super-fast on-chip cache on the I/O controller and written out uncompressed to the single RAM pool.
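A toy bit of accounting makes the difference concrete, using the 2GB-in/6GB-out ratio from earlier in the thread (this only counts main-RAM traffic and ignores on-chip caches and any cross-pool copy):

```python
# Toy accounting of RAM traffic for one 2 GB compressed read at a 3:1 ratio,
# per the 2GB -> 6GB example above. Purely illustrative.

compressed, ratio = 2, 3
decompressed = compressed * ratio                      # 6 GB of usable data

# Console: SSD -> I/O controller cache -> unified RAM (decompressed data only)
console_ram_writes = decompressed                      # 6 GB written, once

# PC: SSD -> RAM (compressed), then the decompressor re-reads it and
# writes the decompressed result back out
pc_ram_writes = compressed + decompressed              # 2 GB + 6 GB = 8 GB
pc_ram_reads = compressed                              # 2 GB read back

print(f"console RAM traffic: {console_ram_writes} GB, "
      f"PC RAM traffic: {pc_ram_writes + pc_ram_reads} GB")
```

Whether that extra traffic actually costs anything depends on whether RAM bandwidth is anywhere near saturated during loading, which is the crux of the disagreement above.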

This is true, but why does it matter if the CPU or GPU are not a bottleneck?

The CPU is still the bottleneck. Have you read the Verge article which quotes the Forspoken developer?
 
Yes, there is SRAM in PS5 where they decompress the data. Fabian Giesen from RAD Game Tools did a tweet on it, saying the data is decompressed in chunks of 256KB, but on PS5 this is transparent to the dev regardless of how they package the game data.

In R&C Rift Apart the CPU is the bottleneck too for the moment, because of the way they initialize the entities in a level.
 
Yes, there is SRAM in PS5 where they decompress the data. Fabian Giesen from RAD Game Tools did a tweet on it, saying the data is decompressed in chunks of 256KB, but on PS5 this is transparent to the dev regardless of how they package the game data.
Don't tell me, tell the other guy. I think there is generally a lack of appreciation of how complex the PC is architecturally, and this is not a flaw or a design oversight; it is necessary for the PC to be extensible and scalable.

You could definitely architect a PC to work like a modern console, but not if you want to use an external graphics card, because the I/O controller will not have a direct hardware bus to the GPU's RAM pool. Different choices, different designs, different pros and cons. It does feel like it's becoming a platform-war argument, which is why I am really coming to fucking hate these forums.
 
Well, actually this presentation shows that it isn't really DirectStorage that makes the big loading difference; it is more how you design your game to load things. Yes, DirectStorage reduces CPU usage a bit (and allows you to effectively use more bandwidth), but the loading times are already quite short, so they had already optimized their engine for that stuff.
E.g. even the HDD loading times with the Win32 API are quite good. That happens when you optimize your engine to load just the stuff you need.
 
I don’t know anymore where I posted it or who replied, but I stand corrected on those DirectStorage API benefits …
DirectStorage will definitely bring benefits, but it's not going to negate the need to move data from one bit of a PC to the next. I have not been following it that closely either, and I thought what had recently been soft-launched was the final implementation, but that is yet to come. That will bring loading compressed data over to the GPU, where it can be decompressed faster than on the CPU.

What will continue to be a thing on PC is that there are two places where you can undertake decompression, the CPU and GPU, and each has its own pool of RAM. Ultimately you will still need to load compressed data into some RAM, decompress it into some RAM, and for any data that is needed in the other pool of RAM, copy it there.

The ultimate endgame is to have a programmable I/O controller that decompresses data as it hits the north-bridge and directs it automatically to main RAM and GDDR. That will bring greater benefits, but the need for loaded data to be in a usable form - e.g. to generate a massive game world and populate the NPCs, vehicles, critters, physics, weather, effects, audio and so on - is still going to fall mostly on the CPU. DirectStorage seems to be a bet that moving decompression ultimately to the GPU - including all the to'ing and fro'ing of data - and freeing up the CPU means that latter stage is faster and, overall, things are better.

Forspoken definitely does show improvement in load times, but it's not earth-shattering. And ultimately, like everything with the PC, what you experience will depend on your combination of OS, hardware drivers, graphics card, chipset, CPU and software. Will DirectStorage be smart enough not to have the GPU decompress if the CPU is the faster option (given all the data movement that calls for)?
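Nothing public says DirectStorage does (or will do) anything like this, but the decision it would have to make can be framed as a simple cost model (the function and every number in it are hypothetical):

```python
# Hypothetical cost model for picking a decompression target. Nothing like
# this is confirmed to exist in DirectStorage; it just frames the trade-off.

def pick_decompressor(size_gb, cpu_rate, gpu_rate, pcie_rate, needs_vram):
    """All rates in GB/s. Returns 'cpu' or 'gpu' by estimated wall time."""
    cpu_time = size_gb / cpu_rate
    if needs_vram:                   # CPU path must also copy the result over PCIe
        cpu_time += size_gb / pcie_rate
    gpu_time = size_gb / gpu_rate    # compressed data crosses PCIe either way
    return "cpu" if cpu_time < gpu_time else "gpu"

# A big texture batch: slow CPU inflate vs fast GPU inflate
print(pick_decompressor(6, cpu_rate=3, gpu_rate=20, pcie_rate=25, needs_vram=True))  # gpu
```

With different inputs (small CPU-bound assets, a weak GPU) the same model tips the other way, which is exactly why a one-size-fits-all answer seems unlikely across PC configurations.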

Well, actually this presentation shows that it isn't really DirectStorage that makes the big loading difference. It is more how you design your game to load things.

Which is the same situation as games built for current generation consoles. You have Spider-Man loading in around six seconds on PS5, and Astro's Playroom in about two seconds, because both have obviously been designed to make the most of the I/O pipeline. You're not seeing that with Assassin's Creed or GTA V.
 
Again, we will see how much better the consoles are compared to the PC platform in loading/I/O performance. Going by the NVIDIA presentation, the PC arguably has the numbers going for it, which probably has more impact than some small 'additional latency routes' and other claimed disadvantages. You need to see the advantages as well: decompression on the GPU is scalable/programmable and much more capable at the same time.
Also, remember that games that load fast on your PS5 are probably designed around the I/O as well; improvements are being made and seen on the PS4 too in that regard.

As said, we will see how badly the PC performs relative to the consoles. All this platform warring might be for nothing if they end up loading/streaming about as fast, despite the architectural differences in I/O.
 