Digital Foundry Article Technical Discussion [2022]

In X years, this functionality will definitely be part of the PC's I/O chipset, which is where it should be. Data is read, decompressed and routed to main RAM or video RAM with no CPU or GPU intervention. But that's a long way off. Intel needs to drive this, working with OS vendors to build support for it into the OS.

I suspect you're right, in that ultimately I think more custom silicon in general from both AMD and Intel is where we'll see more advancement than in general x86 performance, as that will become increasingly difficult in the coming years. I guess the question is what applications outside of gaming can readily benefit from this (or whether the cost of implementing such an engine on the I/O chipset eventually becomes negligible even if the benefits are not readily apparent to everyday apps).
 
I'm not sure what the future holds for dedicated decompression hardware on the PC.

On the one hand, we've seen various video codecs supported on integrated and discrete GPUs, but those are typically for industry standardised formats that use a limited range of bandwidths. So they're ideal for dedicated hardware.

PC apps on the other hand can have wildly different demands, with wildly different streaming / loading patterns. Despite video codecs being so widely supported in GPUs for so many years, the most widely used compressed video format for games seems to be *drumroll* .... Bink? And using software decompression too.....

I'm not sure where the balance of flexibility and speed vs cost and complexity (and industry battles to standardise and pick free or licensed formats) will need to be to make hardware decompression blocks worthwhile. But I don't think it'll be driven primarily by performance at first. I think it could be driven by power though. You wouldn't think so looking at Intel's comically false TDPs, but before panic sets in and power budgets go up to the moon, a huge focus of the engineers is power efficiency and battery life.

I'm not sure about decompressing before you send over the PCIe bus though. I know that's how DS is currently doing it in the absence of GPU decompression, but with 2:1 compression, and increasing loads on the PCIe bus for things like updating BVH trees for ray tracing (e.g. Spiderman), I think you'd ideally want to be decompressing on the GPU side, where you're past the narrow bit and into vast golden meadows of GPU bandwidth. It would probably be faster and lower latency to do it that way too - not only is time over PCIe halved, but GPUs can decompress at hundreds of GB/s.
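To put rough numbers on the halving (a back-of-envelope sketch in Python; the link rate, asset size and 2:1 ratio are illustrative assumptions, not measurements):

```python
# Back-of-envelope: PCIe bus time for shipping assets raw vs compressed.
# All figures are illustrative assumptions.

PCIE3_X16_GBPS = 16.0   # approx. usable GB/s on PCIe 3.0 x16
ASSET_GB = 10.0         # uncompressed asset data for a level load
RATIO = 2.0             # assumed 2:1 compression

raw_time = ASSET_GB / PCIE3_X16_GBPS                   # CPU decompresses, raw data crosses the bus
compressed_time = (ASSET_GB / RATIO) / PCIE3_X16_GBPS  # compressed data crosses, GPU decompresses

print(f"raw over PCIe:        {raw_time:.2f} s")         # 0.63 s
print(f"compressed over PCIe: {compressed_time:.2f} s")  # 0.31 s
```

Half the bus time, and the freed-up half of the bus is then available for things like those BVH updates.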

It's all very exciting. Meanwhile, in reality land, I'm thinking of trying to get a TPM 1.2 module and bypassing some of the Windows 11 checks to get full speed DS on my crusty old 4790K.
 
You don't get full speed on DS without a TPM on Windows 11? I did not know that; looks like a session on Amazon is in order.

I don't know the details, but it looks like there's only so much they can do without the under-the-bonnet changes in Windows 11:


"Storage stack optimizations: On Windows 11, this consists of an upgraded OS storage stack that unlocks the full potential of DirectStorage, and on Windows 10, games will still benefit from the more efficient use of the legacy OS storage stack "

"This means that any game built on DirectStorage will benefit from the new programming model and GPU decompression technology on Windows 10, version 1909 and up. Additionally, because Windows 11 was built with DirectStorage in mind, games running on Windows 11 benefit further from new storage stack optimizations. "


I don't know if the difference will be small or big, but it'll be another interesting one. I'm lacking CPU support and TPM 2.0 for Win 11, but I've read there's (for now) a registry workaround that MS has acknowledged - it requires at least TPM 1.2, however. My mobo has TPM 1.2 support, but finding a compatible module might be a bit of a crapshoot. ¯\_(ツ)_/¯
 
Meanwhile, in reality land, I'm thinking of trying to get a TPM 1.2 module and bypassing some of the Windows 11 checks to get full speed DS on my crusty old 4790K.

What the above poster says. I have W11 running on everything from an i7 920 to the 3900X system. TPM has little to do with DirectStorage; it's just a security thing.

Anyway, I'm not seeing it happen either, a dedicated ASIC for decompression in the PC space. At least I hope not; GPUs are too damn fast and more flexible. Something you'd not want to give up.
 
I'm running Windows 11 on my rig without any TPM at all, I just made 3 registry changes before the install.
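For reference, the three values most commonly reported for this live under HKLM\SYSTEM\Setup\LabConfig, while Microsoft's separately acknowledged upgrade workaround is a MoSetup value (and that route still wants TPM 1.2). A hedged sketch of those edits via Python's winreg, assuming the widely reported value names are the ones meant here - run elevated, at your own risk:

```python
# Hedged sketch: registry values widely reported to skip Windows 11 setup
# checks. Names are as commonly documented, not guaranteed; run as admin.
import winreg

def set_dwords(subkey, names):
    # Create/open the key under HKLM and set each named DWORD to 1.
    key = winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, subkey, 0,
                             winreg.KEY_SET_VALUE)
    for name in names:
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, 1)
    winreg.CloseKey(key)

# The "3 registry changes" route (bypasses the installer's checks):
set_dwords(r"SYSTEM\Setup\LabConfig",
           ["BypassTPMCheck", "BypassSecureBootCheck", "BypassRAMCheck"])

# Microsoft's acknowledged upgrade workaround (still requires TPM 1.2):
set_dwords(r"SYSTEM\Setup\MoSetup",
           ["AllowUpgradesWithUnsupportedTPMOrCPU"])
```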

Yeah, I've seen a few ways that seem to have worked out okay for folks, but I'm hoping to go via the method on the official MS support page. Have you noticed any performance regression in any areas (particularly gaming) after the shift, or is everything going okay so far? Some older CPU users have reported some apps struggling vs Win 10, but that could be down to any number of things.

I'd like to get my old machine up to full DS and Win12U so I can chin rub and chip in during the Digital Foundry cross platform discussions.
 
Yeah, I've seen a few ways that seem to have worked out okay for folks, but I'm hoping to go via the method on the official MS support page. Have you noticed any performance regression in any areas (particularly gaming) after the shift, or is everything going okay so far? Some older CPU users have reported some apps struggling vs Win 10, but that could be down to any number of things.

I'd like to get my old machine up to full DS and Win12U so I can chin rub and chip in during the Digital Foundry cross platform discussions.

No performance regression on any of my systems (all upgraded to W11). On the contrary, I think it's the better gaming OS vs W10. Even the 920 system runs just as nicely as it did with W7 and W10. Booted Planetside 2 the other day on the 920 (OC'd to 3.6GHz) for a LAN, which actually gave better stability than it did with the W10 install.

I'd say go for it, you'll like it.

Edit: there's no reason not to use the registry hacks to get W11 installed.
 
Yeah, I've seen a few ways that seem to have worked out okay for folks, but I'm hoping to go via the method on the official MS support page. Have you noticed any performance regression in any areas (particularly gaming) after the shift, or is everything going okay so far? Some older CPU users have reported some apps struggling vs Win 10, but that could be down to any number of things.

I'd like to get my old machine up to full DS and Win12U so I can chin rub and chip in during the Digital Foundry cross platform discussions.
I've had no issues at all.
 
For this generation of consoles, this was entirely about chasing performance and not about keeping costs down (money). Engineering and manufacturing this custom silicon and creating APIs to make it work cost time and money. Both Microsoft and Sony could have let the APUs in the consoles do the decompression as they always have done, via the CPU or the zlib decompressors they had in the previous generation.

In X years, this functionality will definitely be part of the PC's I/O chipset, which is where it should be. Data is read, decompressed and routed to main RAM or video RAM with no CPU or GPU intervention. But that's a long way off. Intel needs to drive this, working with OS vendors to build support for it into the OS.
I think I’m going to disagree with you, but at the same time say that you raised some great points. Though I think they are apples-to-oranges comparisons here.


If we look at the PC space specifically, PCs are designed to operate entirely without a GPU, whereas consoles must have a GPU. The inclusion of HUMA and shared memory generates a scenario on console that could never occur on PC: quite simply, I/O to memory is as direct as it gets. You run into a particular challenge with this setup if you perform decompression on the APU: in particular, your bandwidth contention goes up, negating the ability to keep graphics performance up while streaming and decompressing at the same time. They needed a solution here because compute alone wouldn't be able to solve this issue; whether they decompressed on the CPU side or the GPU side, there just wouldn't be enough bandwidth to keep streaming data into memory, decompress and store it, and run game code and rendering at the same time.

But simultaneously, this is where we are headed: to continue to bolster graphical fidelity, we need significantly higher I/O if we want diverse worlds and setups while maintaining high texture, geometry and lighting fidelity. And because there is a HUMA setup, a dedicated ASIC to decompress the data on the way to RAM is both an elegant and simple solution, at least from an architecture perspective.

Combined with increasing video game file sizes to improve fidelity, the need for compression is further exacerbated by the lack of available hard drive space, the PS5 comes with only 50% more capacity than PS4 did. Series X doubled it, but game sizes have nearly tripled since last gen. Without compression there would be no way players would be able to fit more than a handful of games on their consoles without expansion.

In the console space, I see this as a solution to a very large problem. It's elegant, cheap and it works. I don't see this as spending more money to change the industry, however; the price of consoles still needs to be kept down, and they are still quite cheap comparatively.

In the PC space, there will always be split pools, and NVMe drives continue to drop in price while increasing in speed and capacity. As long as raw drive bandwidth continues to outpace the decompression units on console, you technically don't need to compress; it's just significantly more efficient to (for the very same reasons).

DirectStorage needs to accomplish two things in particular to generate significant I/O speed-ups in the PC space: first, bring I/O directly to the GPU's ALUs to decompress, bypassing the copy to memory first (as opposed to I/O to memory, to ALU, and then writing the results back to VRAM) - there would be real savings in going straight to the ALUs; and second, bypass CPU memory entirely. If the future of DirectStorage can accomplish this, then I don't see any reason for there to be a dedicated decompression chip. Quite simply, over time that chip will become a bottleneck, as there is more than sufficient ALU available on PC graphics cards (see 20-40 TFLOPS). PC gamers will continue to push the envelope in graphical settings well beyond what consoles are capable of, including I/O. It's just taking a long time to happen here on PC because there has still yet to be a reason for us to get an NVMe drive. People want information on how it's going to work, commitments that they aren't buying hardware that won't be used, etc. All terrible things that console gamers don't need to worry about.
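One way to see the two wins being argued for is to count the bytes each link moves per path. A toy Python model (asset size, ratio and the hypothetical direct disk-to-VRAM path are all assumptions for illustration):

```python
# Toy model: data moved per link when loading 10 GB (uncompressed) of
# assets at an assumed 2:1 ratio, under three candidate paths.
ASSET, RATIO = 10.0, 2.0
comp = ASSET / RATIO  # 5 GB as stored on disk

paths = {
    # disk -> system RAM, CPU decompresses, raw copy over PCIe to VRAM
    "CPU decompress":          {"written to sys RAM": comp + ASSET, "PCIe": ASSET},
    # disk -> system RAM staging, compressed over PCIe, GPU decompresses
    "GPU decompress (staged)": {"written to sys RAM": comp,         "PCIe": comp},
    # hypothetical direct disk -> VRAM, GPU decompresses (no CPU copy)
    "GPU decompress (direct)": {"written to sys RAM": 0.0,          "PCIe": comp},
}

for name, links in paths.items():
    print(name, {link: f"{gb:g} GB" for link, gb in links.items()})
```

The staged GPU path already halves PCIe traffic; the direct path also removes the system-RAM staging copy, which is the second win described above.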
 
I'm not sure what the future holds for dedicated decompression hardware on the PC.

On the one hand, we've seen various video codecs supported on integrated and discrete GPUs, but those are typically for industry standardised formats that use a limited range of bandwidths. So they're ideal for dedicated hardware.

PC apps on the other hand can have wildly different demands, with wildly different streaming / loading patterns. Despite video codecs being so widely supported in GPUs for so many years, the most widely used compressed video format for games seems to be *drumroll* .... Bink? And using software decompression too.....

I'm not sure where the balance of flexibility and speed vs cost and complexity (and industry battles to standardise and pick free or licensed formats) will need to be to make hardware decompression blocks worthwhile. But I don't think it'll be driven primarily by performance at first. I think it could be driven by power though. You wouldn't think so looking at Intel's comically false TDPs, but before panic sets in and power budgets go up to the moon, a huge focus of the engineers is power efficiency and battery life.

I'm not sure about decompressing before you send over the PCIe bus though. I know that's how DS is currently doing it in the absence of GPU decompression, but with 2:1 compression, and increasing loads on the PCIe bus for things like updating BVH trees for ray tracing (e.g. Spiderman), I think you'd ideally want to be decompressing on the GPU side, where you're past the narrow bit and into vast golden meadows of GPU bandwidth. It would probably be faster and lower latency to do it that way too - not only is time over PCIe halved, but GPUs can decompress at hundreds of GB/s.

It's all very exciting. Meanwhile, in reality land, I'm thinking of trying to get a TPM 1.2 module and bypassing some of the Windows 11 checks to get full speed DS on my crusty old 4790K.

You took the words right out of my mouth!

Also, to add to this, we know Xbox can decompress files direct from main memory as well as from the SSD before they hit memory. If PC Direct Storage can do the same then it potentially opens up the option of caching assets in VRAM in a compressed form to be decompressed on demand (much like happens already with the lossy GPU native texture formats) which could greatly increase the effective VRAM capacity.
 
For this generation of consoles, this was entirely about chasing performance and not about keeping costs down (money).

This and every generation of consoles is and always has been about keeping costs down. Otherwise we wouldn't have had 2019 midrange hardware in them.
 
This and every generation of consoles is and always has been about keeping costs down. Otherwise we wouldn't have had 2019 midrange hardware in them.
Well we're talking about IO here and not GPU performance, that's not particularly fair.
But when concerning IO PS5 however:

Copying data from the internal PS5 SSD to the NVMe slot is very fast.
The reverse is not true; it takes about 7x longer.

It's just something I don't think PC would want as a solution. If the future is compressed and encrypted everything on PC for security and space saving, this is just too slow.
 
The inclusion of HUMA and shared memory generates a scenario on console that could never occur on PC: quite simply, I/O to memory is as direct as it gets. You run into a particular challenge with this setup if you perform decompression on the APU: in particular, your bandwidth contention goes up, negating the ability to keep graphics performance up while streaming and decompressing at the same time.

That's a great point. Consoles could potentially be saving themselves a write and then a read from main ram - up to several GB/s (+ contention and all that) when things are busy. From that perspective it's a relatively affordable way of effectively increasing bandwidth at busy IO times and possibly reducing unpredictable* stuttering.

*If you're doing dynamic res based on last frame's rendering time for example.
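A quick sketch of that saving (the stream rate and ratio below are assumed figures, in the ballpark of PS5's quoted numbers, not measurements):

```python
# Unified-memory traffic when streaming on a hUMA console.
# Assumed: 5.5 GB/s compressed off the SSD at a 2:1 ratio.
stream_comp = 5.5
stream_raw = stream_comp * 2.0   # 11 GB/s after decompression

# Decompress on the APU: write compressed to RAM, read it back,
# then write the decompressed output.
apu_traffic = stream_comp + stream_comp + stream_raw   # 22 GB/s

# Decompress inline in the I/O block: only the output is written.
io_traffic = stream_raw                                # 11 GB/s

print(f"APU decompress: {apu_traffic:g} GB/s of RAM traffic")
print(f"I/O-block path: {io_traffic:g} GB/s of RAM traffic")
# The ~11 GB/s difference is exactly the write-then-read saved above.
```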

You took the words right out of my mouth!

Also, to add to this, we know Xbox can decompress files direct from main memory as well as from the SSD before they hit memory. If PC Direct Storage can do the same then it potentially opens up the option of caching assets in VRAM in a compressed form to be decompressed on demand (much like happens already with the lossy GPU native texture formats) which could greatly increase the effective VRAM capacity.

That's an interesting idea. Keep textures compressed in GPU memory, and only uncompress what sampler feedback says you definitely need. And when you no longer need it - just flush the uncompressed copy.

Would be cool if DS ends up allowing for this down the line.
 
I think I’m going to disagree with you, but at the same time say that you raised some great points. Though I think they are apples-to-oranges comparisons here.


If we look at the PC space specifically, PCs are designed to operate entirely without a GPU, whereas consoles must have a GPU. The inclusion of HUMA and shared memory generates a scenario on console that could never occur on PC: quite simply, I/O to memory is as direct as it gets. You run into a particular challenge with this setup if you perform decompression on the APU: in particular, your bandwidth contention goes up, negating the ability to keep graphics performance up while streaming and decompressing at the same time. They needed a solution here because compute alone wouldn't be able to solve this issue; whether they decompressed on the CPU side or the GPU side, there just wouldn't be enough bandwidth to keep streaming data into memory, decompress and store it, and run game code and rendering at the same time.

But simultaneously, this is where we are headed: to continue to bolster graphical fidelity, we need significantly higher I/O if we want diverse worlds and setups while maintaining high texture, geometry and lighting fidelity. And because there is a HUMA setup, a dedicated ASIC to decompress the data on the way to RAM is both an elegant and simple solution, at least from an architecture perspective.

Combined with increasing video game file sizes to improve fidelity, the need for compression is further exacerbated by the lack of available hard drive space, the PS5 comes with only 50% more capacity than PS4 did. Series X doubled it, but game sizes have nearly tripled since last gen. Without compression there would be no way players would be able to fit more than a handful of games on their consoles without expansion.

In the console space, I see this as a solution to a very large problem. It's elegant, cheap and it works. I don't see this as spending more money to change the industry, however; the price of consoles still needs to be kept down, and they are still quite cheap comparatively.

In the PC space, there will always be split pools, and NVMe drives continue to drop in price while increasing in speed and capacity. As long as raw drive bandwidth continues to outpace the decompression units on console, you technically don't need to compress; it's just significantly more efficient to (for the very same reasons).

DirectStorage needs to accomplish two things in particular to generate significant I/O speed-ups in the PC space: first, bring I/O directly to the GPU's ALUs to decompress, bypassing the copy to memory first (as opposed to I/O to memory, to ALU, and then writing the results back to VRAM) - there would be real savings in going straight to the ALUs; and second, bypass CPU memory entirely. If the future of DirectStorage can accomplish this, then I don't see any reason for there to be a dedicated decompression chip. Quite simply, over time that chip will become a bottleneck, as there is more than sufficient ALU available on PC graphics cards (see 20-40 TFLOPS). PC gamers will continue to push the envelope in graphical settings well beyond what consoles are capable of, including I/O. It's just taking a long time to happen here on PC because there has still yet to be a reason for us to get an NVMe drive. People want information on how it's going to work, commitments that they aren't buying hardware that won't be used, etc. All terrible things that console gamers don't need to worry about.
GPU decompression then for PC? How many % of GPU resources is that going to take if you want to decompress, say, 10GB/s of (compressed) data? Remember that the future of gaming is to do that during gameplay (PS5 design), not during loading screens.
 
GPU decompression then for PC? How many % of GPU resources is that going to take if you want to decompress, say, 10GB/s of (compressed) data? Remember that the future of gaming is to do that during gameplay (PS5 design), not during loading screens.
Unless you're getting 100% saturation out of the GPU, which games never are (most are between 50-60%), there's going to be compute available to do decompression. The only challenge in the PC space will be that it can't cut the just-in-time as closely as a console can, due to priority on async compute. That said, a larger pool for streaming textures would be required, but technically that shouldn't be an issue, as VRAM is dedicated on GPUs and doesn't need to hold game or audio data. There should be space available to have a larger streaming pool to compensate for the lack of just-in-time.

Regardless, the solution would still be significantly better than today's setup. I'm not particularly focused on 10GB/s, because that is a single target for a single system. What happens when NVMe drives hit 20GB/s+ in data transfer? That decompression hardware will become the bottleneck. Leaving decompression to the GPUs, more powerful GPUs will be able to handle more. As a solution, imo, this makes sense in the PC space, provided the APIs are able to flow data directly to the GPU's ALUs without needing to go through a series of other things.
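Putting hypothetical numbers on the percentage question (the decompression throughput below is a placeholder assumption, not a benchmark of any real GPU):

```python
# What share of a GPU would streaming decompression eat? The throughput
# figure is a hypothetical placeholder, not a measurement.
def gpu_share(stream_comp_gbs, ratio, gpu_decomp_gbs):
    """Fraction of the GPU needed to sustain a compressed stream rate,
    given an assumed decompressed-output rate at 100% utilisation."""
    return (stream_comp_gbs * ratio) / gpu_decomp_gbs

# 10 GB/s compressed at 2:1 on a card assumed to manage 80 GB/s of
# decompressed output flat out: 20/80 = 25% of the GPU.
print(f"{gpu_share(10, 2.0, 80):.0%}")
```

If games typically leave 40-50% of the GPU idle, that fits within the headroom; and a future 20 GB/s drive paired with a GPU twice as fast at decompression costs the same 25% share, which is the scaling argument above.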
 
That's an interesting idea. Keep textures compressed in GPU memory, and only uncompress what sampler feedback says you definitely need. And when you no longer need it - just flush the uncompressed copy.

Would be cool if DS ends up allowing for this down the line.
If you're streaming them, then they are partially resident. You are decompressing the partially resident texture you need into memory. I don't think keeping that section compressed makes a lot of sense; it's not likely to benefit from compression as a small texture.
 
GPU decompression then for PC? How many % of GPU resources is that going to take if you want to decompress, say, 10GB/s of (compressed) data? Remember that the future of gaming is to do that during gameplay (PS5 design), not during loading screens.

This has been answered in depth the last couple of times you have brought it up.

Your continued bad-faith posts to prop up your preferred platform have outstayed their welcome here. They do not make for a productive discussion.
 
If you're streaming them, then they are partially resident. You are decompressing the partially resident texture you need into memory. I don't think keeping that section compressed makes a lot of sense; it's not likely to benefit from compression as a small texture.

I was thinking in terms of Sampler Feedback Streaming. In the Microsoft talk I think they said each mipmap is split up into 64KB tiles, and that each one can be streamed in on demand based on what you're trying to sample. I was hoping that DS GPU texture decompression would be compatible with streaming in these mipmap tiles.

Taking it further, and following on from what @pjbliverpool was talking about, I was imagining having the immediate area + fast moving assets stored compressed in memory. Of these, only the specific mip tiles you need would be stored uncompressed in memory.

So for example you could have a traditional 16 GB of textures uncompressed in memory (well ... still DXTC but nothing more), or alternatively, in this example, you'd keep them compressed using only say 8 GB, with maybe only 1 GB being used for the current frame, decompressed up to 2 GB, for a total of 10 GB used. So in a fast moving PC game with higher resolution assets and a higher sensitivity mouse (for 1/10th of a second leet 180 degree spins) you could always have all the necessary data a couple of frames away - even with a frumpy old SATA SSD.
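The arithmetic from that example as a tiny helper (all sizes and the 2:1 ratio are the speculative figures above, nothing more):

```python
# VRAM budget for the compressed-resident scheme sketched above.
def vram_used(total_raw_gb, ratio, resident_raw_gb):
    """Everything cached compressed, plus an uncompressed working set."""
    return total_raw_gb / ratio + resident_raw_gb

# 16 GB of textures kept at 2:1, with a 2 GB decompressed working set:
print(vram_used(16, 2.0, 2))  # 10.0 GB, vs 16 GB fully uncompressed
```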

Highly speculative example, and you may well be right about the practicality of compressing something like 64KB mip map tiles for use with something like SFS. A ram buffer for compressed assets, with decompression on demand of only a very small proportion, might be a possible mitigation for GPU ram and SSD speed limitations though...?

Going even further maybe you could chain SFS of a 64KB tile and (decompression + ML upscale) to create a higher res mip map tile. MS have talked about experimenting with ML texture upscaling a while back. As you'd already have the 64KB tile in GPU on-chip memory maybe you could avoid a trip out to main ram.

Probably pie in the sky thinking, but fun to talk about. I'm sure you can find holes in it. :D
 
MS is uniting Xbox and Windows PC; what their teams engineer for the Xbox will carry over to PC/Windows, which is basically what we are seeing today. MS's solution is less platform-agnostic.
 