Velocity Architecture - Limited only by asset install sizes

I think this bolded part is a bit misleading. On the XSX there is no separate system RAM: the decompression block sends data into the unified RAM first, and only then can the GPU access it. The diagram you attached shows how the decompression sw/hw on the NVIDIA cards bypasses system RAM to place data straight into VRAM, from where it is then used. The GPU can only work on byte-addressable data in RAM, so it cannot read directly from the SSD.

I've taken both Nvidia's presentation and comments about the Xbox to mean that these GPUs can effectively work on data that is being accessed from SSD before it's touched GPU memory. If Nvidia are actually using sw decompression, then I don't see how they could avoid this. The slides I've seen don't fit with writing compressed data to GPU memory, and then reading it back so they could decompress it with software, and then writing it back uncompressed.

From the Hotchips talk it seems that the XSX GPU can see assets on the SSD as being mapped into virtual memory - once they've been added to the page table (doesn't appear to be automatic for everything in the install). To me that sounds like the GPU can see what's where, and can understand virtual memory addresses. Though if I'm wrong in my interpretation, corrections are always welcome!
 
Just going through some of the Hotchips video presentation again - the GPU focused presenter amongst the pair talks about there being a number of GPU cache modes, and mentions a streaming mode. Could there be a way to stream from Virtual Memory directly into an area of GPU cache marked for streaming? Seems like that could be mighty useful ...
 
Just going through some of the Hotchips video presentation again - the GPU focused presenter amongst the pair talks about there being a number of GPU cache modes, and mentions a streaming mode. Could there be a way to stream from Virtual Memory directly into an area of GPU cache marked for streaming? Seems like that could be mighty useful ...
Maybe related!
[two slide images attached]
 
Just going through some of the Hotchips video presentation again - the GPU focused presenter amongst the pair talks about there being a number of GPU cache modes, and mentions a streaming mode. Could there be a way to stream from Virtual Memory directly into an area of GPU cache marked for streaming? Seems like that could be mighty useful ...

Someone here posted the FlashMap papers a while ago; I haven't read them in a while, but maybe this stuff was also addressed in those papers? I also recall maybe you, iroboto and dsoup having some interesting speculation derived from those papers.

Have been really curious to what extent the R&D in FlashMap is being leveraged in XvA components like DirectStorage; I'd be surprised if it weren't a hefty chunk of it. Just wish MS had done a more detailed breakdown of the SSD I/O similar to what Sony did back in March, but I understand why they might be waiting. DS (at least on the PC side) isn't finished yet and isn't coming until sometime in 2021. That seems to be in regards to the general PC space, though; both Nvidia and AMD have cards coming this year with some version of DirectStorage support (maybe not 100% there at release, though).
 
Blimey, this is all getting beyond me now! I only ever dabbled with APIs like OGL and D3D and never fiddled with the GPU proper. Hopefully someone with more knowledge can chip in.

I hadn't seen that RDNA ISA paper before, so thanks for bringing that up. Wish I knew what most of it meant. :D

https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf
I think this doc is meant for compiler writers. Nonetheless it’s interesting to see GPU instruction encodings!
 
I've taken both Nvidia's presentation and comments about the Xbox to mean that these GPUs can effectively work on data that is being accessed from SSD before it's touched GPU memory. If Nvidia are actually using sw decompression, then I don't see how they could avoid this. The slides I've seen don't fit with writing compressed data to GPU memory, and then reading it back so they could decompress it with software, and then writing it back uncompressed.

From the Hotchips talk it seems that the XSX GPU can see assets on the SSD as being mapped into virtual memory - once they've been added to the page table (doesn't appear to be automatic for everything in the install). To me that sounds like the GPU can see what's where, and can understand virtual memory addresses. Though if I'm wrong in my interpretation, corrections are always welcome!

AFAIK RTX I/O is a combination of DirectStorage software that enables the GPU to do decompression. It is not reading the data directly from the SSD, since processors can only read byte-addressable data. If the compressed data were coming from persistent memory you would have been right, but AFAIK processors can only read byte-addressable data into their caches from dynamic or persistent memory, not from SSDs. In the slide you showed, the GPU is not reading the data but decompressing it into VRAM. I don't even see a benefit, tbh, because the virtual address space contains the address of the data that would hypothetically be read into the caches directly from storage. So it would be sitting in RAM for nothing (unless, under your explanation, any data that is put directly into the caches isn't sent to memory).




The whole point of SFS working with the decompression block and GPU MMU is to ensure that only data that's immediately needed by the GPU is loaded into RAM, with the virtual address space accurately reflecting that. That's where most of the benefit is. So on PCs with RTX I/O, in order for the GPU to read directly from the SSD, it would have to decompress the data, convert it into byte-addressable form, and coordinate with the MMU to discard some of it. Seems inefficient, tbh. On consoles, dedicated decompression and texture-streaming hardware feeds data directly into RAM as efficiently as possible.

It is also much more efficient to simply have larger TLBs for the caches and a larger portion of RAM for virtual address tables to cope with the presence of an SSD. That's what Xbox has pointed to in their statements: "Then we will see how the address space will increase immensely" ~Phil Spencer. I think he was speaking about the virtual address space, and the aforementioned is what they did. Because of the speed and near-instant access of pages on the SSD, they can increase the size of the virtual address space to include more physical addresses in storage. This would be much more efficient than trying to get the GPU to read non-byte-addressable data directly from storage. The only thing I have seen like that is persistent memory, where the processor can access certain data directly from DRAM and persistent memory, but even then the games would have to be programmed differently. With what Xbox did, the developer can simply do things as normal but treat their app as having access to, say, 113.5GB of memory when in reality it is 13.5GB. That in itself is amazing for open-world games.
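To illustrate just the address-space side of that, here's a minimal CPU-side Win32 sketch: reserve a huge virtual range, commit only the small window that's actually resident. The 100GB and 64MB figures are illustrative, and this is ordinary demand-commit, not the actual XvA mechanism.

```c
/* Sketch: reserve a huge virtual address range, commit only what's
   resident. Ordinary Win32 demand-commit, purely to illustrate an
   address space larger than physical RAM; not the actual XvA
   mechanism. Sizes are illustrative. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T virtualSize  = 100ull * 1024 * 1024 * 1024; /* a "100GB" range      */
    SIZE_T residentSize =  64ull * 1024 * 1024;        /* what's really in RAM */

    /* Reserve address space only: no RAM or pagefile backing yet. */
    BYTE *base = VirtualAlloc(NULL, virtualSize, MEM_RESERVE, PAGE_NOACCESS);
    if (!base) { printf("reserve failed\n"); return 1; }

    /* Commit a small window on demand, as if paging an asset in. */
    BYTE *window = VirtualAlloc(base, residentSize, MEM_COMMIT, PAGE_READWRITE);
    if (!window) { printf("commit failed\n"); return 1; }

    window[0] = 0xAB; /* this page is now backed by physical memory */
    printf("reserved %zu bytes, committed %zu bytes at %p\n",
           virtualSize, residentSize, (void *)base);

    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}
```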
 
Can the PS5 do something similar?

With regards to SFS (efficient texture streaming), I don't think so. In terms of a larger virtual address space, I would be absolutely shocked if they did not match the Series X, considering both systems have 16GB of RAM. Both companies should have adjusted the size of the swap file on the SSD and the portion of RAM reserved for virtual address tables, and added custom MMUs, custom DMA engines matched to SSD speed, and larger TLBs for the larger processor caches. These would be standard for the quality of consoles MS and Sony produce. So yes, the PS5 should have larger virtual RAM as well.
 
Not according to recent interviews. Apparently the XSX/XSS GPUs can read data directly from the SSD.
Setting aside what may be an equivocal statement, I don't understand how that is possible unless there is some form of persistent memory from which the GPU can read data.

Let me give you an example of the DirectStorage API for PC working with RTX I/O. The software enables the CPU to simply request data (the work is offloaded to the GPU and DMA engines) rather than decompress it itself; decompression is done by the GPU, and all the data is stored in VRAM.

"On top of that, many of these assets are compressed. In order to be used by the CPU or GPU, they must first be decompressed."
In other words, the GPU would have to first decompress the data, put it into VRAM, then use it. It cannot use the data compressed, but it can read it directly from memory. It simply decompresses it and puts it into VRAM, for which the MMU has created virtual addresses that are stored in RAM.
 
I'm curious: if the I/O subsystem is low-latency enough, is it theoretically plausible to have memory-mapped files for textures in place of SFS? E.g. the texture unit touching an area of a certain mip level would generate a page fault, which is handled transparently by the I/O subsystem.
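For what it's worth, here's a minimal CPU-side analogy of that idea using plain POSIX mmap: nothing is read from disk at map time, and the first touch of a region faults just that page in. The file name and mip offset are hypothetical, and a real texture unit obviously isn't running this code.

```c
/* Sketch: a memory-mapped asset file where the first touch of a mip
   region faults that page in from storage. Plain POSIX mmap on the
   CPU, as an analogy only; the file name and mip offset are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("texture_mipchain.bin", O_RDONLY); /* hypothetical asset */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole mip chain; nothing is read from disk yet. */
    const uint8_t *tex = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (tex == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touching a byte inside, say, mip 2 faults just that page in;
       untouched mips never leave the drive. Offset is illustrative. */
    size_t mip2_offset = 5u * 1024 * 1024;
    if ((size_t)st.st_size > mip2_offset)
        printf("first texel of mip 2: %u\n", (unsigned)tex[mip2_offset]);

    munmap((void *)tex, st.st_size);
    close(fd);
    return 0;
}
```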

Also, I don't think MS would deploy something as radical as FlashMap on the Xbox Series, although there is still a possibility. Even with SFS the latency would need to be very low, and they haven't detailed DirectStorage yet, the last piece of the puzzle.
 
AFAIK RTX I/O is a combination of DirectStorage software that enables the GPU to do decompression. It is not reading the data directly from the SSD, since processors can only read byte-addressable data. If the compressed data were coming from persistent memory you would have been right, but AFAIK processors can only read byte-addressable data into their caches from dynamic or persistent memory, not from SSDs. In the slide you showed, the GPU is not reading the data but decompressing it into VRAM. I don't even see a benefit, tbh, because the virtual address space contains the address of the data that would hypothetically be read into the caches directly from storage. So it would be sitting in RAM for nothing (unless, under your explanation, any data that is put directly into the caches isn't sent to memory).

If the GPU in RTX IO isn't able to work on data coming from the SSD, how would it be able to do software decompression on it before it's reached memory? I think I might be missing something here. I'm interpreting the slides as meaning that data hits the GPU after coming across the PCIe bus, gets decompressed, and only lands in VRAM after decompression.

With XSX I'm suggesting that the GPU has the ability to access data on the SSD via the virtual address space, once it's been added to the virtual address space. If you can see it, read it, and want to process it, with no benefit from it hitting VRAM first, I think it would make a lot of sense to be able to do that. It would reduce latency, BW, power and all that. For example, a software decompression system for assets you then want to place in VRAM and use repeatedly.




The whole point of SFS working with the decompression block and GPU MMU is to ensure that only data that's immediately needed by the GPU is loaded into RAM, with the virtual address space accurately reflecting that. That's where most of the benefit is. So on PCs with RTX I/O, in order for the GPU to read directly from the SSD, it would have to decompress the data, convert it into byte-addressable form, and coordinate with the MMU to discard some of it. Seems inefficient, tbh. On consoles, dedicated decompression and texture-streaming hardware feeds data directly into RAM as efficiently as possible.

At Hotchips MS said that SFS starts by allocating virtual memory space for the entire texture. So both the pages in VRAM and the pages still on the SSD should, by my understanding, be visible to the GPU.

With RTX IO, I don't know how they're going to do it. Storing the compressed data on the SSD in a format that can be added to a Windows-managed virtual address space might be one way to go, though. Perhaps then the GPU could just access it as if it were in memory on the other side of the PCIe bus.

It is also much more efficient to simply have larger TLBs for the caches and a larger portion of RAM for virtual address tables to cope with the presence of an SSD. That's what Xbox has pointed to in their statements: "Then we will see how the address space will increase immensely" ~Phil Spencer. I think he was speaking about the virtual address space, and the aforementioned is what they did. Because of the speed and near-instant access of pages on the SSD, they can increase the size of the virtual address space to include more physical addresses in storage. This would be much more efficient than trying to get the GPU to read non-byte-addressable data directly from storage. The only thing I have seen like that is persistent memory, where the processor can access certain data directly from DRAM and persistent memory, but even then the games would have to be programmed differently. With what Xbox did, the developer can simply do things as normal but treat their app as having access to, say, 113.5GB of memory when in reality it is 13.5GB. That in itself is amazing for open-world games.

Yeah, I'm pretty sure the virtual address range is expanded to encompass selected assets on the SSD. At which point, shouldn't any device with access to the contents of the virtual address range (including what's in "virtual memory" on the SSD) be able to see it and access it?

Perhaps we've got our wires crossed on a couple of things. I think adding selected contents of the SSD to the memory address range, by mapping memory addresses directly or indirectly to flash locations, would be the entire basis of the GPU being able to see and access data on the SSD without it needing to be copied into VRAM first.

Edit: I'm thinking that once you add something on the SSD to the address space, it should effectively become like the contents of a ROM cart of old. That content doesn't change, it has an address, and you can read from it or copy it into system RAM.
 

That's a good point. I should really go back to that FlashMap paper and try and get my head around more of it!
 

"With XSX I'm suggesting that the GPU has the ability to access data on the SSD via the virtual address space"
This doesn't mean the GPU is getting the data directly from the SSD. You should read up on virtual addresses. They are simply abstractions over different physical addresses (i.e. physical addresses in RAM and physical addresses on disk), such that the OS can page blocks of data in and out of RAM from the SSD as the application needs them. The processor never actually has access to the data on the SSD; it thinks the data is part of physical RAM, yet it is still on storage.
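A toy sketch of what that abstraction means mechanically, with a made-up single-level page table (real MMUs walk multi-level tables in hardware, so this is purely illustrative): a miss on the present bit is the "page in from SSD" moment, and the application never sees it.

```c
/* Toy single-level page table, to show what "abstraction of physical
   addresses" means mechanically: the processor issues virtual
   addresses, and the table entry says whether the page is resident in
   RAM or must first be paged in from storage. Entirely illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 16u

typedef struct {
    bool     present; /* page currently resident in RAM? */
    uint32_t frame;   /* physical frame number if present */
} PageTableEntry;

static PageTableEntry page_table[NUM_PAGES];

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;

    if (!page_table[vpn].present) {
        /* Page fault: here the OS would copy the block in from the SSD,
           pick a free frame, and mark the entry present. */
        printf("page fault on vpn %u, paging in from storage\n", vpn);
        page_table[vpn].frame   = vpn; /* pretend frame allocation */
        page_table[vpn].present = true;
    }
    return page_table[vpn].frame * PAGE_SIZE + offset;
}

int main(void)
{
    printf("paddr = %u\n", translate(5 * PAGE_SIZE + 123)); /* faults first */
    printf("paddr = %u\n", translate(5 * PAGE_SIZE + 999)); /* then hits    */
    return 0;
}
```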

For RTX I/O, the blocks of data from the SSD are probably (my guess) stored somewhere in system memory, compressed and unusable by either the CPU or GPU, from where they are decompressed by the GPU instead of the CPU into VRAM. GPUs can access, I think, up to 256MB of system RAM. As Microsoft clearly stated, the CPU and GPU can only use the data once it's decompressed (i.e. in this case, once it's in RAM). After it is decompressed into VRAM, the GPU can then use it.


The whole point of a virtual address space is to abstract away the fact that the app only has a smaller amount of RAM (in this case 13.5GB). This is done for performance but also for security, so that games don't have access to sensitive OS information in RAM or to other apps' data in RAM. The OS swaps anything that's on the SSD in and out of RAM and the app never knows; the app actually thinks it has all 16GB. With the next-gen consoles, and Series X in particular, it will think it has 100GB of RAM, but in reality the GPU will be working with the CPU, MMUs and MSP to swap pages from the SSD into RAM, and then into processor caches. All this is abstracted away. But in any one scene, from my understanding, only 13.5GB will actually be available, so devs will have to keep that in mind. So no, the processor cannot instantly get data into the processor caches from the SSD, and here are the biggest reasons:

1.) The data has to be placed in a byte-addressable state in DRAM or a persistent memory solution for the processor to use it; you can't simply get block data from the SSD directly into a processor cache. There would have to be extra hardware to convert that data into byte-addressable form and store it before the processor could use it, so you're looking at a small cache just for this direct access. This is why persistent memory solutions let the processor access some data instantly and bypass RAM, but even then the whole hardware stack is re-engineered to ensure there are no bottlenecks.

2.) If the processor somehow tries to bypass the DMA engine to access data directly from the SSD, it will bottleneck the processes trying to get data through the normal route (SSD->DRAM->processor cache). This not only wastes processor cycles but would be very slow, because those processes have to wait on the direct-to-processor route. It's why, even when persistent memory is added, it's added alongside memory, so that requests for data pass through the traditional route: SSD->persistent memory->DRAM->processor cache, with the processor able to instantly get some data from persistent memory. The MMUs create a virtual address space containing all three kinds of physical addresses.

3.) The bandwidth of the SSD and its access times (higher latency) are much worse than DRAM. You're looking at 25-80 microseconds for the SSD and less than 20 nanoseconds for RAM. Unless the processor were getting data from persistent memory (which has much lower latency), the required data would reach the processor caches faster via the traditional route. There would be no benefit to getting data directly from the SSD.
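Taking the figures in point 3 at face value, the gap is roughly three orders of magnitude:

$$\frac{t_{\text{SSD}}}{t_{\text{DRAM}}} \approx \frac{25\ \mu\text{s}}{20\ \text{ns}} = 1{,}250 \qquad\text{up to}\qquad \frac{80\ \mu\text{s}}{20\ \text{ns}} = 4{,}000$$

so every direct-from-SSD read would stall the processor for the equivalent of thousands of DRAM accesses.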
 
"With XSX I'm suggesting that the GPU has the ability to access data on the SSD via the virtual address space"
This doesn't mean the GPU is getting the data directly from the SSD. You should read up on virtual addresses. They are simply abstractions over different physical addresses (i.e. physical addresses in RAM and physical addresses on disk), such that the OS can page blocks of data in and out of RAM from the SSD as the application needs them. The processor never actually has access to the data on the SSD; it thinks the data is part of physical RAM, yet it is still on storage.

I mean, I do know that virtual addresses are abstractions over various physical addresses (though it never hurts to know more). That's been kind of the basis of how I think the GPU sees things on the SSD for months now. MS were pretty clear that DirectStorage still uses CPU, but at a vastly reduced level, so it's definitely involved somewhere, maybe in translating the address and then telling the SSD exactly which areas to access (if it's using something like FlashMap).

The key thing for me is that the GPU can see the data, request it and have it sent, without needing application code running on the CPU to decide what the GPU should have and when it should have it, and to force-feed the CPU-side copy of the data to the GPU. So I'm comparing XSX to the way things are currently done on PC.

If that's not what you'd describe as "direct", fair enough. There's abstraction involved, absolutely. There has to be in order to link assets on the SSD to the address space.

For RTX I/O, the blocks of data from the SSD are probably (my guess) stored somewhere in system memory, compressed and unusable by either the CPU or GPU, from where they are decompressed by the GPU instead of the CPU into VRAM. GPUs can access, I think, up to 256MB of system RAM. As Microsoft clearly stated, the CPU and GPU can only use the data once it's decompressed (i.e. in this case, once it's in RAM). After it is decompressed into VRAM, the GPU can then use it.

That sounds like an interesting idea, though the slide from Nvidia doesn't indicate that data is stored in an intermediate location anywhere in RAM before it's decompressed by the GPU and written to VRAM. Of course, the slide may be a simplification of the full process, and there may be some system RAM involved. I'm still not sure how, if the data is stored in RAM in a format that is unusable by the GPU, you could run software decompression on it, though.

The whole point of a virtual address space is to abstract away the fact that the app only has a smaller amount of RAM (in this case 13.5GB). This is done for performance but also for security, so that games don't have access to sensitive OS information in RAM or to other apps' data in RAM. The OS swaps anything that's on the SSD in and out of RAM and the app never knows; the app actually thinks it has all 16GB. With the next-gen consoles, and Series X in particular, it will think it has 100GB of RAM, but in reality the GPU will be working with the CPU, MMUs and MSP to swap pages from the SSD into RAM, and then into processor caches. All this is abstracted away. But in any one scene, from my understanding, only 13.5GB will actually be available, so devs will have to keep that in mind. So no, the processor cannot instantly get data into the processor caches from the SSD, and here are the biggest reasons:

For performance and optimisation reasons, I think developers on XSX will be able to know which area of memory they are currently addressing. Maybe something like different address ranges for the fast 10GB and the slower 3.5GB, and another for whatever assets on the SSD have been assigned an address. So I think the application will be able to work out what's where if it needs to.

RDNA 1 has a 48-bit address range IIRC, which is vastly beyond the limits of an SSD, so I think the "100GB" thing from MS about "virtual memory" would be about the amount of RAM they're using for storing their translation tables or whatnot, which I figure will be in the system-reserved portion.
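For scale, a 48-bit address range works out to

$$2^{48}\ \text{bytes} = 2^{8} \times 2^{40}\ \text{bytes} = 256\ \text{TiB},$$

which is indeed vastly beyond any console SSD, so the practical limit would be the RAM spent on translation tables rather than the address bits themselves.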

1.) The data has to be placed in a byte-addressable state in DRAM or a persistent memory solution for the processor to use it; you can't simply get block data from the SSD directly into a processor cache. There would have to be extra hardware to convert that data into byte-addressable form and store it before the processor could use it, so you're looking at a small cache just for this direct access. This is why persistent memory solutions let the processor access some data instantly and bypass RAM, but even then the whole hardware stack is re-engineered to ensure there are no bottlenecks.

I'm thinking that the IO block could create machine-readable data from assets that have been stored appropriately before compression. The IO block is going to have some cache or scratchpad of its own anyway, so you can decompress into it and then copy to memory. If you're only going to use the data once, and the app has selected to do so, my thinking is to just allow it to be read straight into cache as if it had actually been read from system RAM (which, as far as the GPU is concerned, could be the case). Different parameters for access (read-only vs copy-to-RAM) might allow the data to be used appropriately at this stage.
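Here's a rough sketch of that decompress-into-scratchpad-then-copy flow, with zlib standing in for the hardware codec; the fixed scratchpad size, and the idea that the IO block works this way at all, are assumptions for illustration only.

```c
/* Sketch of a "decompress into a scratchpad, then copy to RAM" flow.
   zlib stands in for the console's hardware codec, and the fixed
   on-chip scratchpad is an assumption for illustration only. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define SCRATCHPAD_SIZE (64 * 1024)

static unsigned char scratchpad[SCRATCHPAD_SIZE]; /* stand-in for IO-block SRAM */

/* Decompress one block into the scratchpad, then copy it to its final
   home in "RAM" (here just another buffer). Returns 0 on success. */
int stage_block(const unsigned char *src, unsigned long src_len,
                unsigned char *ram_dst, unsigned long *dst_len)
{
    uLongf out_len = SCRATCHPAD_SIZE;
    if (uncompress(scratchpad, &out_len, src, src_len) != Z_OK)
        return -1;

    memcpy(ram_dst, scratchpad, out_len); /* the extra hop being debated */
    *dst_len = out_len;
    return 0;
}
```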

2.) If the processor somehow tries to bypass the DMA engine to access data directly from the SSD, it will bottleneck the processes trying to get data through the normal route (SSD->DRAM->processor cache). This not only wastes processor cycles but would be very slow, because those processes have to wait on the direct-to-processor route. It's why, even when persistent memory is added, it's added alongside memory, so that requests for data pass through the traditional route: SSD->persistent memory->DRAM->processor cache, with the processor able to instantly get some data from persistent memory. The MMUs create a virtual address space containing all three kinds of physical addresses.

This is a good point that I hadn't thought of. I can only think that if MS want to be able to do this, they'll have made the customisations.

If you were going to attempt it on a console where you could customise at will, how would you approach it?

3.) The bandwidth of the SSD and its access times (higher latency) are much worse than DRAM. You're looking at 25-80 microseconds for the SSD and less than 20 nanoseconds for RAM. Unless the processor were getting data from persistent memory (which has much lower latency), the required data would reach the processor caches faster via the traditional route. There would be no benefit to getting data directly from the SSD.

One benefit would be bandwidth - if you only want to use the data once before writing a modified version to RAM, then you'd have saved an unnecessary write and read. Perhaps not such a big issue with the BW figures involved, but at a peak of 4.8 GB/s from the IO block that's still something. Perhaps more of an issue for the XSS, though perhaps still not much of an issue there.

I'm not following how consuming data from the SSD without going off-chip to RAM first is higher latency than SSD->RAM->GPU. It's really hard to find figures on GPU memory latency (for me, anyway), but the numbers I've seen have been as high as 200~500 ns depending on workload. That's still vastly quicker than the SSD, but not nothing.
 

"I'm not following how consuming data from the SSD without going off chip to ram first is higher latency than SSD->ram-GPU."

By the time you access a certain amount of data on the SSD to send it directly to the processor, you've bottlenecked the traditional route, because it has to wait. The SSD controller can only handle a certain number of requests. Are there multi-threaded SSD controllers? Yes, but they would be trying to complete the current request as soon as possible. On the other hand, you would gain more from parallelism and other fundamental hardware principles such as locality by simply taking all the data into byte-addressable RAM, then using the 560GB/s bandwidth, which is orders of magnitude faster than NAND. The access times are also in nanoseconds on DRAM compared to microseconds on the SSD, so you're essentially bottlenecking the memory hierarchy to send a small amount of data to the processor.

And without the blocks being placed in byte-addressable memory, the data is unusable to the processor. It's like sending 32 bytes of data to a single 64-bit register.

Processors cannot use SSDs as memory because:
1.) They store data in blocks of pages, yet the caches need to access bytes.
2.) They are extremely slow, i.e. low bandwidth: 6GB/s even with decompression HW vs 560GB/s for RAM.
3.) Their latency is measured in microseconds, compared to nanoseconds for DRAM. Even persistent memory only comes close to single-digit microsecond access times, so DRAM will always be needed.

So there is no way the XSX GPU is reading data directly from the SSD to render games while bypassing RAM. The data that is needed is sent to unified RAM, from where it is then sent to a processor cache. For the application, all of this is abstracted away by the OS, which does demand paging of block data from the SSD into RAM.

"That sounds like an interesting idea, though the slide from Nvidia doesn't indicate that data is stored in an intermediate location anywhere in ram before it's decompressed by the GPU and written to vram. Of course, the slide may be a simplification of the full process, and there may be some system ram involved. I'm still not sure how if the data is stored in ram in format that is unusable for GPU, you could run software decompression on it though."

Yes, they're not showing the whole process, but it is still an accurate representation of where the data ends up and which processor does the decompression. After that, as Microsoft stated, the CPU and GPU can use the data. That's why I strongly suspect the compressed data is sent to system RAM or some other form of RAM, from where the GPU accesses it and decompresses it into VRAM. The MMU does all the page address translations and the DMA controllers handle the requests from the CPU and GPU. Remember, if the processor is trying to bypass RAM for a small amount of data at the same time as sending all the other data into RAM, one of the processes is bottlenecked. So how does persistent memory overcome this bottleneck? When a request for data requires the DMA engine to get data from PMEM, it waits until whatever process is using DRAM is done, and then PMEM gets its chance. But the advantage of PMEM is that the processor can actually use the data, unlike with the SSD, where it would have to be converted into byte-addressable data. The other advantage is that the latency of PMEM is orders of magnitude lower than NAND SSDs, despite still being considerably slower than DRAM. So during this break, the processor can squeeze extra data from PMEM and use an extra DMA and MMU unit to get it.

"If you were going to attempt it on a console where you could customise at will, how would you approach it?"
I think the current architecture they chose is the best. It sounds like a nightmare to design such a system. You'd need an extra memory cache just for this, to translate the data into byte-addressable form.

"MS were pretty clear that DirectStorage still uses CPU, but at a vastly reduced level, so it's definitely involved somewhere, maybe in translating the address and then telling the SSD exactly which areas to access (if it's using something like FlashMap)."

IIRC the MMU handles all page translations. The DMA engine gets the data from the physical address. The CPU overhead is from IO requests and maybe check-in and load management.

"I'm thinking that the IO block could create machine readable data from assets that have been stored appropriately before compression.The IO block is going to have some cache or scratchpad of its own anyway so you can decompress into it and then copy to memory."

This is possible as long as the data is coming from memory; I can see the GPU MMU working with the DMA engines to make it happen. But the benefits are unknown at this point, and off the top of my head this would be extra work with little benefit, for one big reason:

You'd be forgoing the benefits of the principle of locality by trying to pick out certain data. The inherent design of the memory hierarchy is that data likely to be needed next is kept very close. So if you've discarded that data and not sent it to RAM, next time you'll waste processor cycles putting it into RAM or the scratchpad cache you speak of. Better to just send it all to RAM.
 

Well it certainly sounds like you know an awful lot more about this than me, and with the points you're making I'm inclined to think what you're saying is right, so I guess I'm a convert. There's just a couple of things I'd like to try and clear up about what I was saying.

Processors cannot use SSDs as memory because:
1.) They store data in blocks of pages, yet the caches need to access bytes.
2.) They are extremely slow, i.e. low bandwidth: 6GB/s even with decompression HW vs 560GB/s for RAM.
3.) Their latency is measured in microseconds, compared to nanoseconds for DRAM. Even persistent memory only comes close to single-digit microsecond access times, so DRAM will always be needed.

These three points I think I do understand. I was assuming there was something on the SoC to allow data to be presented to the GPU in a form it could use after the GPU had requested it - it wouldn't need to know (or care) about the format in which it was actually stored on the SSD, as that was all abstracted away. And while the BW of an SSD is low and its latency is high, I figured that, especially if the developer knew they only wanted to use the data once (e.g. software decompression of assets as they are streamed in), they might as well get it done while it's on-chip (but as you say, this presents practical problems). I wasn't proposing, say, pulling the same texture samples off the SSD for multiple frames.

I am curious now about the different ways in which a system might bypass ram and stay on SoC, however impractical and unlikely it might be. Something fun to think about in bed at night, anyway.
 
Can someone chime in briefly here? So about the point of bandwidth differences between SSDs and main system memory, that's 1000% the case. However, is it possible to view this from a per-frame POV? I remember a DiRT 5 dev, when describing the Series X SSD, saying they could basically swap and replace a texture mid-frame.

Now, they could've been referring to that texture being in main memory, but IIRC they were describing a feature of the storage with that response. And it's possible (very possible, actually) that it was in line with what rtongo is describing, i.e. the process is virtually invisible to the developer in terms of requesting the data from the SSD, because some slice of system memory is being reserved as a cache of sorts to map and stream in data from the SSD (in Series X's case, the 100 GB "partition" - not a hard-segmented cluster of the drive the way disk partitions generally work, I just lack a better term for it), so it's automatically handling those operations for the developer in the background.

I guess this is getting hypothetical now, but supposing there were ever a way for a processor to convert/break up block data into byte data on the fly and then drop it into a small high-level on-board cache (an L4$, for example), then pull that into the lower-level caches of the processor... even knowing the gargantuan bandwidth difference between main memory and SSDs, if data of a given size, say 10 MB, were expected to be used for a few frames by the processor, does it come to a point where there isn't much of a difference between doing it through the (not currently in existence) method I'm describing here vs. the more traditional process?

Because if a game has a 60 FPS target, for example, and it were doing nothing but pulling in data from storage, and that storage managed even, say, 2 GB/s for uncompressed data, that'd come to about 34 MB per frame. That's ignoring the latency of SSDs, though; but say the game in question just needs that 10-34 MB of data by x amount of time. Do you think it would be worth this hypothetical implementation of storage-to-processor-cache access (let alone whether it'd even be possible with changes to memory access schemes) if it meant simply accounting for the latency ahead of time, with the processor buffering a read of that 10-34 MB of data from storage into a local L4$ so that the data is there by the time it's needed?
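That per-frame arithmetic checks out; a trivial sketch with the same numbers:

```c
/* Per-frame streaming budget from the paragraph above:
   2 GB/s of uncompressed reads at 60 fps -> roughly 33-34 MB per frame. */
#include <stdio.h>

int main(void)
{
    double bandwidth_bps = 2.0e9; /* 2 GB/s, uncompressed */
    double fps           = 60.0;

    double bytes_per_frame = bandwidth_bps / fps;
    printf("budget: %.1f MB per frame\n", bytes_per_frame / 1.0e6); /* ~33.3 MB */
    return 0;
}
```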

I get that while this is happening other components wouldn't be able to access the SSD, but that's still just a few cycles at most. Maybe not all of the 10-34 MB of data in this hypothetical example need be buffered into this hypothetical cache (a cache that size would be expensive anyway, even as a slower L4), just enough to offset latency by buffering a slice of it beforehand, converting it into a byte-addressable state (maybe some type of hardware or feature to do this?), and writing it to the L4$.

Honestly, I wonder if that type of memory access design is feasible or desirable, or if it'd just be better to stick with increasing the bandwidth of components to boost the traditional memory access schemes as they already exist. I know the hypothetical I'm describing could also complicate cache coherency, but I figure the offsetting benefit would be less bus contention on main system memory in hUMA designs, by making use of SSD storage more flexible and less reliant on main memory acting as a middleman. Plus I've been thinking about technological features for 10th-gen consoles, and this has been on my mind.

But to bring it back on topic... forgetting the whole conversion stuff, and assuming the GPU could work with the block data, or supposing the SSD storage were a persistent memory solution but at SSD bandwidths with lower-quality persistent memory (so higher latency): could the raw bandwidth difference between that and main memory be offset by considering what size of data would be needed on a per-frame basis, and ensuring the processor had a prefetch window large enough to simply read in that small amount of data (we're talking 10 or so MBs) over a given range of cycles, streaming it in to work with directly through its caches? Would that be a valid means of accessing data, versus still needing to copy it to main memory first?

I hope that was clear x3.
 