Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

It's the same situation with games built for current-generation consoles. You have Spider-Man loading in around six seconds on PS5, and Astro's Playroom in about two seconds, because both have obviously been designed to make the most of the I/O pipeline. You're not seeing that with Assassin's Creed or GTA V.
I wouldn't go that far. Just look at the already quite decent loading times from the HDD. Some PS4 games, for example, were optimized for loading times after the PS5 was out, so it was already possible to improve loading times before; it was just never the focus.
In GTA V, for example, someone in the community found an easy fix for the loading times years ago, yet even that easy fix was never implemented in the console versions. Now, with the next-gen version, they finally seem to have implemented it, so loading times are much better purely because the bad implementation of loading a config file was fixed.
Before this generation, loading times were always just "good enough" to play the game; they were never the focus. That GDC video above actually shows what happens when loading times become a focus. Even HDDs benefit from those optimizations.
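For what it's worth, the widely shared community write-up of that GTA Online case traced the problem to the config/JSON file being parsed with quadratic cost (repeated full-buffer length scans while tokenising). A hypothetical, simplified illustration of that class of bug and of the fix, not the actual game code:

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical illustration only: measuring the *remaining* buffer on every
// iteration turns a single parse pass into O(n^2) work on a multi-megabyte file.
std::size_t count_entries_slow(const char* json) {
    std::size_t entries = 0;
    for (const char* p = json; *p != '\0'; ++p) {
        std::size_t remaining = std::strlen(p);  // walks to the end of the buffer every time
        (void)remaining;
        if (*p == ',') ++entries;
    }
    return entries;
}

// The fix: measure the buffer once and reuse the length - the same pass becomes O(n).
std::size_t count_entries_fast(const char* json) {
    std::size_t entries = 0;
    const std::size_t len = std::strlen(json);   // measured once
    for (std::size_t i = 0; i < len; ++i) {
        if (json[i] == ',') ++entries;
    }
    return entries;
}
```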

Yes, much more could be done with the big bandwidth improvement, but the SATA SSD numbers already show there is not much more to gain here, as bandwidth is simply no longer the bottleneck, even without the DirectStorage API.

Again, we will see how much better the consoles are compared to the PC platform in loading/I/O performance. Going by the NV presentation, the PC arguably has the numbers going for it, which probably has more impact than some small 'additional latency routes' and other claimed disadvantages. You need to see the advantages as well: decompression on the GPU is scalable/programmable and much more capable at the same time.
Also, remember that games that load fast on your PS5 are probably designed around the I/O as well; improvements are being made and seen on the PS4 too in that regard.

As said, we will see how badly the PC performs relative to the consoles. All this platform warring might be for nothing if they end up loading/streaming about as fast, despite the architectural differences in I/O.
The PC already has one big advantage here: additional system memory. So even if you need a bit more startup time, you can cache a lot of data in main memory, which is much, much faster than any SSD. But as SSD speed is not really the bottleneck anymore, this might not be too big an advantage.
The AMD driver with an 8 GB Vega card already shows, for example, that main memory can be enough to hold textures without bottlenecking the GPU: Far Cry 6 with the HD texture pack runs quite well on Vega cards (when the feature is activated in the driver) while it stutters on RTX 3070 cards.
But I guess this is more a software problem with the game's engine than anything else. More memory just compensates for the bad implementation.
 
I wouldn't go that far. Just look at the already quite decent loading times from the HDD. Some PS4 games, for example, were optimized for loading times after the PS5 was out, so it was already possible to improve loading times before; it was just never the focus.
In GTA V, for example, someone in the community found an easy fix for the loading times years ago, yet even that easy fix was never implemented in the console versions. Now, with the next-gen version, they finally seem to have implemented it, so loading times are much better purely because the bad implementation of loading a config file was fixed.
Before this generation, loading times were always just "good enough" to play the game; they were never the focus. That GDC video above actually shows what happens when loading times become a focus. Even HDDs benefit from those optimizations.

Yes, much more could be done with the big bandwidth improvement, but the SATA SSD numbers already show there is not much more to gain here, as bandwidth is simply no longer the bottleneck, even without the DirectStorage API.

I tried to explain that as well, but you present it much better. Anyway, maybe it is better to leave it at that; it is nothing more than platform warring at this point, intentional or accidental. Like with FSR2 vs DLSS, just await the results; sometimes people get surprised by what can be done.
 
I wouldn't go that far. Just look at the already quite decent loading times from the HDD. Some PS4 games, for example, were optimized for loading times after the PS5 was out, so it was already possible to improve loading times before; it was just never the focus.
Agreed. In terms of PS4 games benefiting from faster drives, I recall - after some quick googling - this Digital Foundry article where they replaced the stock HDD with an SSD, which demonstrated - for the most part - not as massive a difference as you might expect, and which highlighted that what is often considered 'loading' is actually waiting for the CPU to build bits of the game world.

The Verge article quotes the Forspoken developer, who also mentions that when you remove I/O bottlenecks you reveal others.

There are things developers can be proactive about, including arranging and storing data in a format that facilitates quick loading and lets the engine swiftly use the loaded data. A question that remains is how much effort is required for developers to leverage modern technologies. Based on the disparity in loading times between some first-party current-generation console titles and third-party efforts, I assume some effort is required.
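To make 'arranging and storing data in a format that facilitates quick loading' a bit more concrete, here is a rough, hypothetical sketch of a packed-archive layout: assets are baked into one file with a small index at build time, so a level load becomes a few large sequential reads rather than hundreds of small file opens. The structure and names are made up for illustration:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical pack-file entry: a small index points at tightly packed,
// already GPU-ready blobs inside one big archive file.
struct PackEntry {
    std::uint64_t offset;   // byte offset of the blob within the pack file
    std::uint64_t size;     // blob size in bytes
    std::uint32_t assetId;  // identifier resolved at build time
};

// One seek plus one large sequential read per asset group, instead of opening
// and parsing many loose files at load time. (A real implementation would use
// 64-bit seek APIs for packs larger than 2 GB.)
std::vector<char> load_blob(std::FILE* pack, const PackEntry& e) {
    std::vector<char> data(static_cast<std::size_t>(e.size));
    std::fseek(pack, static_cast<long>(e.offset), SEEK_SET);
    std::fread(data.data(), 1, data.size(), pack);
    return data;
}
```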

The PC has the tremendous ability to adopt new techniques faster than consoles, but if the goal is to move decompression entirely to GPUs, that means a bite of existing GDDR needs to be reserved for ephemeral storage of compressed and decompressed data. This may not be much of an issue on modern cards, but if we accept the Steam hardware surveys, the vast majority of gamers are using more modest hardware.
 
The PC has the tremendous ability to adopt new techniques faster than consoles, but if the goal is to move decompression entirely to GPUs, that means a bite of existing GDDR needs to be reserved for ephemeral storage of compressed and decompressed data. This may not be much of an issue on modern cards, but if we accept the Steam hardware surveys, the vast majority of gamers are using more modest hardware.

You sure about that? The point of DirectStorage is direct access to GPU memory from the SSD. Meaning, the compressed data would likely be decompressed on the fly by the GPU while still being on the SSD, so only the decompressed data would be in GDDR, where it gets used as textures, for example. Is the GPU able to do that, though, to use the SSD as extended GDDR? I am not sure.

Otherwise this technology would be unusable on every GPU that is not a 12 GB+ VRAM card from AMD. I can't imagine the DirectX engineers weren't thinking about that.

How is it handled on console then, especially on Series S with its low amount of memory? The decompression blocks do not have their own memory subsystem, do they?
 
You sure about that? The point of DirectStorage is direct access to GPU memory from the SSD. Meaning, the compressed data would likely be decompressed on the fly by the GPU while still being on the SSD, so only the decompressed data would be in GDDR, where it gets used as textures, for example. Is the GPU able to do that, though, to use the SSD as extended GDDR? I am not sure.
I am relying on what Microsoft have published, which according to their devblog is:

Microsoft devblog said:
What’s Next?
This release of DirectStorage provides developers everything they need to move to a new model of IO for their games, and we’re working on even more ways to offload work from the CPU. GPU decompression is next on our roadmap, a feature that will give developers more control over resources and how hardware is leveraged

When you say "the compressed data will likely be decompressed on the fly by the GPU while still being on the SSD", how would that happen using PC hardware? How do you envisage the GPU decompressing data on the SSD? How do you explain the GPU accessing (reading/writing) the data on storage via the graphics driver, north-bridge, possibly the south-bridge, the drive controller and the NAND?

It's already slow enough that games avoid having the GPU access data in main RAM (and vice versa) because of the latency. Data can, of course, be moved across the bus, but accessing it instantaneously is not something that can be done. This is what differentiates local-bus and central-bus technologies.

¯\_(ツ)_/¯
 
Agreed. In terms of PS4 games benefiting from faster drives, I recall - after some quick googling - this Digital Foundry article where they replaced the stock HDD with an SSD, which demonstrated - for the most part - not as massive a difference as you might expect, and which highlighted that what is often considered 'loading' is actually waiting for the CPU to build bits of the game world.

The Verge article quotes the Forspoken developer, who also mentions that when you remove I/O bottlenecks you reveal others.
I've been saying this since Sony first talked about the PS5's SSD speed. My PC has a SATA SSD, an NVMe Gen 3 SSD, and a 5400 RPM platter drive. I've tested games on all of those drives, and there are differences in loading in some games, but not in others. And I do not know exactly why, but I used to have a SATA SSD in there that one game would load slower on than it does on the 5400 RPM drive. I don't know why exactly, but if I turned off real-time protection in Windows Defender then it would load faster. That makes sense when comparing that drive to itself, but I don't know why it affected that SSD and not the HDD or my other SSDs, as they showed little to no difference with real-time protection on/off.

This isn't a new thing, either. I remember when I was playing Half-Life back in the day: I had a K6-2 and my cousin got a pre-built Pentium III machine. His drive was 5400 RPM while mine was 7200 RPM, but his game would load faster. I/O and drive speed are only part of the equation.
 
I honestly do not believe that 2 seconds vs 1 second matters at all. The PS5 could be 2x as efficient... and I don't believe it will matter in the end. Look at the Win32 results already... What we're seeing here is a developer ACTUALLY designing around having a fast, efficient SSD to begin with. That was the biggest bottleneck previously holding back loading performance on PC. Their goal of literally 1 second of loading is commendable... and they're already extremely close... regardless of the API they use.

You sure about that? The point of DirectStorage is direct access to GPU memory from the SSD. Meaning, the compressed data would likely be decompressed on the fly by the GPU while still being on the SSD, so only the decompressed data would be in GDDR, where it gets used as textures, for example. Is the GPU able to do that, though, to use the SSD as extended GDDR? I am not sure.

Otherwise this technology would be unusable on every GPU that is not a 12 GB+ VRAM card from AMD. I can't imagine the DirectX engineers weren't thinking about that.

How is it handled on console then, especially on Series S with its low amount of memory? The decompression blocks do not have their own memory subsystem, do they?

No, we already know that, for the moment, data must still go to RAM, and that the CPU must still copy the compressed data to the GPU for decompression. Games which properly utilize DirectStorage plus GPU decompression will actually likely see reduced VRAM requirements.
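For anyone curious what the current model looks like from the application side, here is a minimal sketch of issuing a DirectStorage read into a GPU buffer, loosely following the public DirectStorage 1.0 samples. The device/buffer/fence objects and "asset.bin" are placeholders, error handling is omitted, and the exact struct fields are best checked against the SDK headers:

```cpp
#include <dstorage.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: stream a file straight into an existing D3D12 buffer via DirectStorage.
// With DirectStorage 1.0, any decompression still happens on the CPU beforehand.
void load_asset(ID3D12Device* device, ID3D12Resource* gpuBuffer,
                ID3D12Fence* fence, UINT64 fenceValue, UINT32 sizeBytes)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"asset.bin", IID_PPV_ARGS(&file));   // placeholder path

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // One request: read sizeBytes from the file into the GPU buffer.
    DSTORAGE_REQUEST request{};
    request.Options.SourceType          = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType     = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Source.File.Source          = file.Get();
    request.Source.File.Offset          = 0;
    request.Source.File.Size            = sizeBytes;
    request.Destination.Buffer.Resource = gpuBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = sizeBytes;
    request.UncompressedSize            = sizeBytes;

    queue->EnqueueRequest(&request);
    queue->EnqueueSignal(fence, fenceValue);  // the fence signals once the data has landed
    queue->Submit();
}
```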
 
I am relying on what Microsoft have published, which according to their devblog is:



When you say "the compressed data will likely be decompressed on the fly by the GPU while still being on the SSD", how would that happen using PC hardware? How do you envisage the GPU decompressing data on the SSD? How do you explain the GPU accessing (reading/writing) the data on storage via the graphics driver, north-bridge, possibly the south-bridge, the drive controller and the NAND?

It's already slow enough that games avoid having the GPU access data in main RAM (and vice versa) because of the latency. Data can, of course, be moved across the bus, but accessing it instantaneously is not something that can be done. This is what differentiates local-bus and central-bus technologies.

¯\_(ツ)_/¯
Hopefully one of these days we move to a newer solution, as per Nvidia GTC:
https://developer.nvidia.com/gpudirect

But somehow a direct bus from storage to GPU memory would need to be part of the specification for all motherboards, or something like that.
 
Don't we know where it goes? There will be a new class of compression technology which will likely require things to be packaged differently. It will continue to go to system memory first, but now the CPU will only decompress things such as audio assets into RAM while it copies the compressed geometry and texture data to VRAM for decompression by the GPU.

Yes, in theory, since work is already required to implement DirectStorage, it would seem logical that CPU-compressed resources would be packaged differently from GPU-compressed resources.

At that point data would go from SSD -> CPU (modern CPUs have the north bridge basically integrated into the CPU), then CPU data is routed to main memory while GPU data is routed to GPU memory. This, of course, assumes that CPUs are capable of routing data streams before they hit main memory.

Regards,
SB
 
Well, actually this presentation shows that it isn't really DirectStorage that makes the big loading difference; it is more how you design your game to load things. Yes, DirectStorage reduces CPU usage a bit (and allows more bandwidth to be used effectively), but the loading times are already quite short, so they had already optimized their engine for this.
Even the HDD loading times with the Win32 API are quite good, for example. That is what happens when you optimize your engine to load just the data you need.

This was already known, as Star Citizen showed that massive increases in loading speed are achievable with the existing I/O on Windows (when using an SSD) just by re-architecting how the game handles I/O to take advantage of SSDs. DirectStorage streamlines that further, removes or reduces some inefficiencies within the Windows storage stack, and perhaps gives developers guidelines on how to achieve similar or better results than what the Star Citizen engineers accomplished via changes to their engine and data structures.
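As a rough illustration of the kind of re-architecting meant here (not Star Citizen's actual code), the sketch below keeps many overlapped Win32 reads in flight at once so an SSD's internal parallelism is actually used, instead of issuing one blocking read at a time. File name, chunk sizes and error handling are simplified:

```cpp
#include <windows.h>
#include <vector>

// Sketch: issue many overlapped reads up front, then wait for them all.
// Deep request queues are where SSDs shine; a single synchronous ReadFile
// loop leaves most of that parallelism unused.
bool read_chunks_parallel(const wchar_t* path, size_t chunkSize, size_t chunkCount,
                          std::vector<char>& out)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return false;

    out.resize(chunkSize * chunkCount);
    std::vector<OVERLAPPED> ovs(chunkCount);   // zero-initialised
    std::vector<HANDLE> events(chunkCount);

    for (size_t i = 0; i < chunkCount; ++i) {
        ULONGLONG offset  = static_cast<ULONGLONG>(i) * chunkSize;
        ovs[i].Offset     = static_cast<DWORD>(offset & 0xFFFFFFFF);
        ovs[i].OffsetHigh = static_cast<DWORD>(offset >> 32);
        events[i]         = CreateEventW(nullptr, TRUE, FALSE, nullptr);
        ovs[i].hEvent     = events[i];
        // Returns immediately with ERROR_IO_PENDING; the drive now sees many
        // outstanding requests instead of one at a time.
        ReadFile(file, out.data() + offset, static_cast<DWORD>(chunkSize),
                 nullptr, &ovs[i]);
    }

    for (size_t i = 0; i < chunkCount; ++i) {
        DWORD bytes = 0;
        GetOverlappedResult(file, &ovs[i], &bytes, TRUE);  // block until this chunk is done
        CloseHandle(events[i]);
    }
    CloseHandle(file);
    return true;
}
```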

It's the same on PS5 and XBSX: just having the faster I/O isn't enough to reap its full benefits. The developer must still architect their game and game assets in a way that takes advantage of the faster I/O. If you look at loading on PS5 in games not specifically coded for the faster I/O, you see basically similar loading speeds to PC games which aren't specifically coded to take advantage of SSDs.

Regards,
SB
 
To be clear, I'm not suggesting that PCIe bandwidth is the issue. In your reply to me you mentioned only latency; I am making clear that both latency and bandwidth are required for moving a lot of data quickly. PCIe bandwidth is finite, but the limiting factor in Forspoken using DirectStorage is the CPU, because that is where the decompression occurs. On current-generation consoles, decompression of supported data formats takes place in real time as data is read. Supported data formats include zlib (which is what most games use for .PAK files), Kraken (PS5) and BCPack (Xbox Series).

Yes, we're in agreement here. I only made the comment as you seemed to be suggesting that there would be bandwidth issues over PCIe on the PC specifically, which obviously wouldn't be the case relative to the PS5, given that the narrowest of those PCIe buses is also present in the PS5.

Did you seriously just quote half a sentence just to make it appear like I'm wrong? Come on, man, that's really disingenuous and not the kind of thing you expect to see in the technical forums. Quoting what I wrote in full:
I certainly didn't mean to take you out of context. If I misunderstood what you were trying to say with the statement I quoted then that's my bad.

The difference between the PC approach and the consoles is that on the PC you need to read the compressed data and write it into one of the RAM pools. Then the CPU or the GPU needs to read that data and write out the decompressed data. Data that needs to be in the other RAM pool then has to be copied there.

At the risk of repeating myself, the current-generation consoles have cache built into the I/O controller. Compressed data isn't read into RAM; it is temporarily read into super-fast on-chip cache on the I/O controller and written out uncompressed to the single RAM pool.

That's all obviously correct. My argument is with the level of importance you're assigning to it. Certainly, having to copy data multiple times between different RAM pools comes with a CPU overhead, which is undesirable. But that's what DirectStorage is designed to address (insofar as it reduces the cost of those operations to the point where they're trivial, as opposed to removing the need for them). And the actual added latency from those operations, which is measured in nanoseconds or at worst low microseconds, is insignificant compared to the multiple milliseconds of an average frame. So this really isn't something that's going to prevent PCs with "much faster drives" from actually realising much higher data throughput. What it may do, though, particularly in systems that aren't using DirectStorage, is increase CPU requirements. And if the CPU is the actual bottleneck for the load time, then that could be a disadvantage for the PC compared with a console. Although faster CPUs etc...

The CPU is still the bottleneck. Have you read the Verge article which quotes the Forspoken developer?

Yes, it is at the moment, and that absolutely could give the PS5 an advantage until GPU decompression is in place. But my statement was in the context of the full implementation of DirectStorage, where the decompression is taken off the CPU and moved to the GPU, where it would not be a bottleneck. At that point you only have the I/O management on the CPU, which is fairly trivial under DirectStorage with modern multicore CPUs. Even then the CPU may still be the bottleneck, but the workload on the CPU will be very similar between PC and PS5. PS5 will likely still have a small advantage in that area, but faster CPUs etc...

Don't tell me, tell the other guy. I think there is generally a lack of appreciation of how complex the PC is architecturally.

I think there's a general over-emphasis on how impactful these differences are in real-world scenarios, at least under a full DirectStorage implementation, which is specifically designed to mitigate the issues by significantly reducing the CPU overhead. The differences aren't really as significant as you seem to think. The main components are largely the same; the two primary differences are that on PC the GPU has its own memory pool, which sits at the other end of a very wide PCIe bus, meaning data copies are required back and forth over that bus to main memory, and that the decompression is done within that memory pool rather than before hitting main memory like it is on the consoles. Mapping the routes the data takes looks something like this:

PC
NVMe -> Main Memory -> GPU Memory (decompression) -> Main Memory (CPU data only)
or
NVMe -> Main Memory (CPU data decompression) -> GPU Memory (GPU data decompression)

PS5
NVMe -> Decompression Block -> Main Memory

EDIT: Just watched the GDC presentation for Forspoken, which clarifies the above. Some great info in there!

So in the PC's case you're talking about either one or two extra hops (depending on whether CPU and GPU data is split up for decompression by those units, or it's all done on the GPU and the CPU data then sent back to main memory), but the bus those hops go over is much wider than the consoles' NVMe -> main memory bus, and the added latency incurred by those hops is minuscule.
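A quick back-of-envelope on why those extra hops are cheap in throughput terms as well, using rough assumed figures rather than measurements (~7 GB/s for a PCIe 4.0 x4 NVMe read, ~25 GB/s of practical PCIe 4.0 x16 throughput to the GPU):

```cpp
#include <cstdio>

// Rough, assumed numbers only: moving 1 GB of data.
int main() {
    const double gigabytes   = 1.0;
    const double nvme_gbps   = 7.0;   // SSD -> main memory (PCIe 4.0 x4 NVMe)
    const double pcie16_gbps = 25.0;  // main memory -> GPU memory (the "extra hop")

    const double read_ms = gigabytes / nvme_gbps   * 1000.0;  // ~143 ms
    const double copy_ms = gigabytes / pcie16_gbps * 1000.0;  // ~40 ms, and it can overlap the read

    std::printf("NVMe read: %.0f ms, extra PCIe hop: %.0f ms\n", read_ms, copy_ms);
    return 0;
}
```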

The PC has the tremendous ability to adopt new techniques faster than consoles, but if the goal is to move decompression entirely to GPUs, that means a bite of existing GDDR needs to be reserved for ephemeral storage of compressed and decompressed data.

Chris already pointed out that the PS5's hardware decompression block uses 256 KB for this function, so it's a complete non-issue for any GPU.
 
Hopefully one of these days we move to a newer solution, as per Nvidia GTC:
https://developer.nvidia.com/gpudirect. But somehow a direct bus from storage to GPU memory would need to be part of the specification for all motherboards, or something like that.
Not only that, adding a direct bus between the GPU and storage that bypasses the north/south bridges and the operating system adds a whole host of problems, not least that the OS has no visibility of any of this activity. If you want to involve the OS then the CPU needs to be involved as well, unless Nvidia are also proposing something like a secure satellite I/O processor on the GPU and having Microsoft re-engineer the Windows kernel to delegate certain I/O processes to that processor, which would also keep the operating system in the loop. ¯\_(ツ)_/¯

There is a reason Microsoft and Sony designed the hardware architecture of their consoles the way they did - virtually identically - to achieve the same performance goals: that is the best way to do it. But it's a closed architecture with very limited expansion and scalability options, which would not suit the PC.
 
Not only that, adding a direct bus between the GPU and storage that bypasses the north/south bridges and the operating system adds a whole host of problems, not least that the OS has no visibility of any of this activity. If you want to involve the OS then the CPU needs to be involved as well, unless Nvidia are also proposing something like a secure satellite I/O processor on the GPU and having Microsoft re-engineer the Windows kernel to delegate certain I/O processes to that processor, which would also keep the operating system in the loop. ¯\_(ツ)_/¯

There is a reason Microsoft and Sony designed the hardware architecture of their consoles the way they did - virtually identically - to achieve the same performance goals: that is the best way to do it. But it's a closed architecture with very limited expansion and scalability options, which would not suit the PC.
Yes, there is a reason... it's the most efficient way to do it, considering how cost-effective consoles have to be. On PC there are other ways of mitigating those differences... by actually programming games for the PC architecture's strengths... wide buses and high capacities. PCs will never have console-level efficiency in design... we know that. There's more latency getting data off the disk, and more latency in processing that data to get it to its destination ready for use... but PCs can move larger amounts of data at one time, and both pools of memory can hold far more data... requiring less fetching from disk.

Short of not having a unified memory architecture... it's actually a bonus having a much LARGER pool of high-bandwidth, lower-latency system RAM connected to much larger amounts of VRAM. On PC they'll make do with what they have... I have no doubts about that. We're a long way away from needing absolutely INSTANT loading. Using Forspoken as an example... if PC is 2 seconds and PS5 is 1 second... I'm good with that. It will make literally no difference. Developers can, and will, design their game loading/transitions/streaming around what the hardware can do... and thus if my black fade out and back in takes 1 second longer... I'll deal with it.
 
Both pools can hold far more data? That depends on how you look at it. At its most extreme, the PS5 basically has 14 GB of VRAM; not that many GPUs have that yet.
 
Yes, there is a reason... it's the most efficient way to do it, considering how cost-effective consoles have to be. On PC there are other ways of mitigating those differences... by actually programming games for the PC architecture's strengths... wide buses and high capacities. PCs will never have console-level efficiency in design... we know that. There's more latency getting data off the disk, and more latency in processing that data to get it to its destination ready for use... but PCs can move larger amounts of data at one time, and both pools of memory can hold far more data... requiring less fetching from disk.

Short of not having a unified memory architecture... it's actually a bonus having a much LARGER pool of high-bandwidth, lower-latency system RAM connected to much larger amounts of VRAM. On PC they'll make do with what they have... I have no doubts about that. We're a long way away from needing absolutely INSTANT loading. Using Forspoken as an example... if PC is 2 seconds and PS5 is 1 second... I'm good with that. It will make literally no difference. Developers can, and will, design their game loading/transitions/streaming around what the hardware can do... and thus if my black fade out and back in takes 1 second longer... I'll deal with it.

That last part is one of the reasons why AF still isn't used all that much on consoles (per DF). Also, shared memory can bring its own disadvantages: latency, memory bandwidth contention, amount of memory (quite limited compared to 16 GB VRAM GPUs, etc.). Some are only seeing the disadvantages of one platform while ignoring the other platform's disadvantages.
We will see soon enough what the differences are between PS5 and PC loading/streaming performance. I think some are going to be surprised.
 
Not only that, adding a direct bus between the GPU and storage that bypasses the north/south bridges and the operating system adds a whole host of problems, not least that the OS has no visibility of any of this activity. If you want to involve the OS then the CPU needs to be involved as well, unless Nvidia are also proposing something like a secure satellite I/O processor on the GPU and having Microsoft re-engineer the Windows kernel to delegate certain I/O processes to that processor, which would also keep the operating system in the loop. ¯\_(ツ)_/¯

That isn't what is being described in the Nvidia link, and the hardware capability to do this already exists in modern consumer PCs.

Nvidia are describing the transfer of data from SSD to GPU memory over the existing PCIe fabric and via the CPU's root complex. Most modern CPUs (from Zen upwards for AMD; I'm not sure about Intel) already implement the hardware capability, which is an optional requirement of the PCI Express spec. The difference from how it works now is that the data doesn't get copied to main memory first, and I believe the data copy is handled by the NVMe drive's own DMA engine rather than the CPU. The CPU (and OS) would still send the request to the DMA engine, though.

This seems to be what Nvidia's RTX IO is proposing, which in turn appears to be based on their GPUDirect Storage described in the link iroboto posted above. I'm pretty sure it's also what the consoles, or at least the PS5, are doing - using, of course, the Zen 2's root complex.
 
Not only that, adding a direct bus between the GPU and storage that bypasses the north/south bridges and the operating system adds a whole host of problems, not least that the OS has no visibility of any of this activity. If you want to involve the OS then the CPU needs to be involved as well, unless Nvidia are also proposing something like a secure satellite I/O processor on the GPU and having Microsoft re-engineer the Windows kernel to delegate certain I/O processes to that processor, which would also keep the operating system in the loop. ¯\_(ツ)_/¯

There is a reason Microsoft and Sony designed the hardware architecture of their consoles the way they did - virtually identically - to achieve the same performance goals: that is the best way to do it. But it's a closed architecture with very limited expansion and scalability options, which would not suit the PC.

To add to my previous response, the north and south bridges play no part here: the north bridge because it doesn't exist in modern systems, and the south bridge because a system's primary NVMe drive should already bypass it and link directly to the CPU - just like the consoles.
 
That last part is one of the reasons why AF still isn't used all that much on consoles (per DF). Also, shared memory can bring its own disadvantages: latency, memory bandwidth contention, amount of memory (quite limited compared to 16 GB VRAM GPUs, etc.). Some are only seeing the disadvantages of one platform while ignoring the other platform's disadvantages.
We will see soon enough what the differences are between PS5 and PC loading/streaming performance. I think some are going to be surprised.

Amount of memory and memory bandwidth contention are not limits of shared memory but of console cost. It is technically possible to have 32 GB of faster GDDR6 on a 512-bit bus, but the cost is too high for a console.
 
That isn't what is being described in the Nvidia link, and the hardware capability to do this already exists in modern consumer PCs. Nvidia are describing the transfer of data from SSD to GPU memory over the existing PCIe fabric and via the CPU's root complex.

Nvidia's description seems to describe a different model:

GPUDirect Storage enables a direct data path between local or remote storage, such as NVMe or NVMe over Fabric (NVMe-oF), and GPU memory. It avoids extra copies through a bounce buffer in the CPU’s memory, enabling a direct memory access (DMA) engine near the NIC or storage to move data on a direct path into or out of GPU memory — all without burdening the CPU.​
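For concreteness, that model is what Nvidia exposes on the compute side through the cuFile API in GPUDirect Storage. A rough sketch, assuming a Linux box with libcufile, a CUDA device and a supported filesystem; treat it as illustrative rather than a drop-in recipe:

```cpp
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

// Sketch of GPUDirect Storage: the read DMAs into GPU memory without a
// bounce buffer in system RAM.
bool gds_read(const char* path, size_t bytes, void** outDevPtr)
{
    cuFileDriverOpen();

    int fd = open(path, O_RDONLY | O_DIRECT);  // GDS generally wants O_DIRECT
    if (fd < 0) return false;

    CUfileDescr_t descr{};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    void* devPtr = nullptr;
    cudaMalloc(&devPtr, bytes);
    cuFileBufRegister(devPtr, bytes, 0);

    // Read 'bytes' from offset 0 of the file into offset 0 of the GPU buffer.
    ssize_t got = cuFileRead(handle, devPtr, bytes, 0, 0);

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();

    *outDevPtr = devPtr;
    return got == static_cast<ssize_t>(bytes);
}
```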

To add to my previous response, the north and south bridges play no part here: the north bridge because it doesn't exist in modern systems, and the south bridge because a system's primary NVMe drive should already bypass it and link directly to the CPU - just like the consoles.

Since around 2011, AMD and Intel have been incorporating north-bridge and south-bridge controllers on the main CPU die itself. The bus controllers very much still exist, and their features still advance to support various new I/O models. The distinct logic blocks and interconnects still exist on-die, i.e. there are still bus controllers for the different devices that can be connected. The integration is why CPU pin counts exploded: the motherboard suddenly had a lot more signals crowding into one chip.

edit: deleted some errant text.
 