Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

So how does this magical compression pipeline work?
Today:
We take our standard DXT1/BC1-5 textures, which are block compressed so that the GPU can retrieve any block/tile it wants on demand. The texture is then decompressed on the GPU and sent off for processing.
This works well with streaming virtual texturing.

Future?
We take our block-compressed textures, from which the GPU can randomly access any tile/portion straight from the HDD. We then losslessly compress the entire texture to save space... (Kraken, as I understand it, is not a block compressor; BCPack I have no clue about). Then what?
Do we have to send the entire texture to the GPU, decompress the whole thing, and only then grab the tiles/blocks we want?
The ideal case would be being able to send portions of those blocks/tiles while still Kraken/BCPack-compressed, and have the GPU do its normal decompression from there.
 

The compression should be invisible to everything above the storage layer.
 
Okay, but then you're sending the whole texture over instead of being able to send a single block, if the decompression is on the SoC.

With standard block compression you can send just the blocks of the texture you need.

So if you need, say, 2-4 blocks, that might be 20 MB uncompressed. But now you're going to send the full texture at 40 MB compressed, decompress it, and then pick out those same blocks afterwards?
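
To put toy numbers on that (every figure below is invented purely for illustration), here's the napkin math: if the lossless layer can only be decoded as a whole file, the read cost scales with the texture rather than with the tiles you actually need.

```python
# Napkin math for the objection above - every number here is made up for illustration.
tiles_needed          = 4        # tiles actually sampled this frame
mb_per_tile_bc        = 5        # size of one block-compressed tile
whole_texture_kraken  = 40       # whole texture after an archive-level lossless pass

tile_granular_read = tiles_needed * mb_per_tile_bc   # 20 MB if tiles are individually addressable
whole_archive_read = whole_texture_kraken            # 40 MB if only the whole file can be decoded

print(f"{tile_granular_read} MB vs {whole_archive_read} MB read from storage")
# Everything beyond the 4 needed tiles gets decompressed and then thrown away.
```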
 
ToTTenTranz your post a few pages back about the XSX memory was pretty clear but I have a correction or two and a suggestion for better bandwidth utilization and latency hiding.
The "two pools" aren't really two physical pools, that division is virtual. There are 10 chips, but 6 of them have twice the capacity. To achieve maximum bandwidth whenever possible, all data is interleaved among the 10 chips (10*32bit = 320bit). This means a 10MB file will supposedly be split into 10*1MB partitions, one for each chip, so that the memory controller can write/read from all chips in parallel, hence using a 320bit bus.
While I agree that there are 10 chips, the 10MB file should be split into 500KB partitions per channel (10 chips, 2x 16bit channels per chip, for a total of 20 channels), which then allows the memory controller to write/read from all chips in parallel for max bandwidth.

But the memory controller can only interleave the data among all 10 chips while all 10 chips have space available to do so. When the 1GB chips are full, the memory controller can only interleave the data among the 2GB chips that still have space available. There are 6 chips with 1 extra GB after the first GB is full, so that leaves us with 6*32bit = 192bit.

Of course, the system knows this, so it makes a virtual split from the get-go, meaning the memory addresses pointing to the red squares become the "fast pool", and the ones pointing to the orange squares become the "slow pool". This way the devs can decide whether certain data goes to the fast red pool or the slow orange pool, depending on how bandwidth-sensitive it is. They're not left wondering if e.g. a shadow map is going to be accessed at 560GB/s or 336GB/s, as that could become a real problem.
I think this should read: the memory controller can only interleave the data among all 20 channels and their memory spaces when each channel has space to do so. The addressable memory spaces come in two sizes (500MB/1GB), and the bandwidth at which they can be utilized is determined by the client bus used to access them (CPU 192bit bus / GPU 320bit bus). A channel/memory space used for CPU access will reduce the overall system bandwidth for as long as the data is stored there. The memory controller uses a marker (c for CPU, g for GPU, and an additional marker if data is split between two or more channels). The latency incurred when doing CPU vs GPU ops can then be spread over the 20 channels. The impression I have gotten is that a whole channel and its memory space would be allocated to the client that needs it. But I think you limit the system by doing so: if a channel holds CPU data and the system completes a cycle around the memory pool without that data being used, you lose that channel's bandwidth for GPU work.

Now I am wondering if it is possible to split each channel into virtual partitions: take a 1GB chip (2 channels, 500MB of memory space per channel) and virtually partition each channel so the CPU and GPU can each store data, with the GPU getting 300MB and the CPU 200MB. The same is done for the 2GB chips, so overall the system has 10GB of RAM for the GPU and 6GB for the CPU/system, but spread across every chip. In practice, you should be able to run the GPU at full bandwidth all the time: you schedule GPU-related work first (reads/writes), and once that is completed you can then do CPU tasks before the cycle returns to that channel.
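
As a sanity check for any partitioning scheme like this, the headline XSX figures fall straight out of bus width times the published 14Gbps GDDR6 pin rate; a quick sketch:

```python
# Where the published XSX bandwidth figures come from: bus width x 14 Gbps GDDR6 pins.
PIN_RATE_GBPS = 14

def bandwidth_gb_s(bus_width_bits: int) -> float:
    return bus_width_bits * PIN_RATE_GBPS / 8        # gigabits -> gigabytes per second

print(bandwidth_gb_s(320))   # 560.0 -> the 10GB region, striped across all 20 channels
print(bandwidth_gb_s(192))   # 336.0 -> the extra 6GB on the six 2GB chips (12 channels)
print(bandwidth_gb_s(16))    # 28.0  -> what a single 16bit channel contributes
```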
 
Okay, but then you're sending the whole texture over instead of being able to send a single block, if the decompression is on the SoC.

With standard block compression you can send just the blocks of the texture you need.

So if you need, say, 2-4 blocks, that might be 20 MB uncompressed. But now you're going to send the full texture at 40 MB compressed, decompress it, and then pick out those same blocks afterwards?

Why? The data's all encrypted, too. That doesn't mean that the filesystem has to load every file, decrypt the entire file into memory and then seek inside the decrypted file to find the data it needs.
 
The PC solution is there. The guy from Epic China said that for the flying sequence it works well on a PC NVMe SSD by arranging the data layout on the SSD so that the data is read sequentially; if you want to do the same thing in an open world, you just need to duplicate data to keep the reads as sequential as possible.

It is a good short-term solution, and this way, even without compression, you can do everything the PS5 SSD is doing, but the game size will be bigger than on PS5. This is brute force, but that is always how the PC wins.
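
A crude sketch of that "duplicate data so reads stay sequential" layout idea (the region and asset names below are invented): each streaming region gets its own contiguous run in the pack file, even if that means writing the same asset more than once.

```python
# Toy pack-file writer: duplicate shared assets so each region is one sequential read.
# All names and sizes below are invented for illustration.
assets  = {"rock_big": b"R" * 1024, "tree_01": b"T" * 2048, "cliff": b"C" * 4096}
regions = {"valley": ["rock_big", "tree_01"], "coast": ["rock_big", "cliff"]}

directory = {}                              # region -> (offset, length) of its contiguous run
with open("world.pak", "wb") as pak:
    for region, names in regions.items():
        start = pak.tell()
        for name in names:                  # "rock_big" gets written twice, once per region
            pak.write(assets[name])
        directory[region] = (start, pak.tell() - start)

print(directory)   # streaming a region is now a single (offset, length) sequential read
```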
 
Why? The data's all encrypted, too. That doesn't mean that the filesystem has to load every file, decrypt the entire file into memory and then seek inside the decrypted file to find the data it needs.
Okay, so something that needs clarification for me, likely my misunderstanding here:

We use block compression to ensure that any part of this texture can be decompressed randomly by the GPU, so the GPU can decompress the texture without regard to order. This is desirable for GPU setups. The trade-off is that it doesn't compress that well, since you're only compressing small blocks (4x4 texels for the BC formats). So in some cases you can compress further.

BUT

This makes sampling straightforward. On the texture below, the virtual texturing system must select the 32x32 or 64x64 tiles that will be loaded for consumption. With block compression, the GPU can still sample the blocks it wants because it knows where they are and can retrieve them without needing to decompress the entire texture. Then you send those tiles from the HDD to the GPU. That's how you save memory bandwidth.

If you use zlib or Kraken on top, you have to compress the whole texture, because they don't do block compression. So how can the GPU sample the blocks directly from the HDD when the whole texture needs to be decompressed first?
[Image: example_mipmap_chain-1.png - an example mipmap chain]
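
To make the block-level random access concrete, here's a rough sketch of the arithmetic (my own illustration, not any console API): because every BC block is a fixed number of bytes, the byte ranges covering any tile can be computed directly, with no decompression involved.

```python
# Sketch: fixed-rate BC blocks make random access pure arithmetic (illustrative only).
BLOCK_BYTES = {"BC1": 8, "BC3": 16, "BC5": 16, "BC7": 16}   # bytes per 4x4 texel block

def block_offset(fmt: str, tex_width: int, block_x: int, block_y: int) -> int:
    """Byte offset of the 4x4 block at (block_x, block_y) in a linearly laid-out mip."""
    blocks_per_row = (tex_width + 3) // 4
    return (block_y * blocks_per_row + block_x) * BLOCK_BYTES[fmt]

def tile_block_ranges(fmt: str, tex_width: int, tile_x: int, tile_y: int, tile_size: int = 64):
    """Byte ranges covering one tile_size x tile_size tile: one contiguous run per block row."""
    bpr = tile_size // 4                        # blocks per tile row
    first_bx, first_by = tile_x * bpr, tile_y * bpr
    row_bytes = bpr * BLOCK_BYTES[fmt]
    return [(block_offset(fmt, tex_width, first_bx, first_by + r), row_bytes) for r in range(bpr)]

# Example: the 16 row-runs of BC1 data for the 64x64 tile at (2, 3) in a 4096-wide mip.
print(tile_block_ranges("BC1", 4096, 2, 3))
```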
 
I think you've actually made that argument yourself in past posts - what's the use of open competing protocols if they can't be adopted because they aren't universal standards? MS (or whoever) stepping up and saying, "do it this way," at least gets something usable working, as opposed to sitting around waiting for committee after committee to finally settle on something.

And the "do it this way" is why Microsoft should limit themselves to writing the API and letting others choose to offer, or implement, solutions. I liked the idea of Microsoft providing a CPU-driven BCPack decompressor, but of course this whole discussion has been about lifting the load from the CPU and removing the need to load data into RAM for decompression first.

This is one of these very specific problems where adding layers - an API - is contrary to the goal. You want to remove abstraction but by doing so you limit options.

MS mandating ZLib and BCPack in hardware, and that being adopted by everyone, may be a worse solution than some other, ideal compressor, but it will be far better than the realities of conflicting, underused, proprietary, competing solutions.

As long as whatever Microsoft mandate is patent free and royalty free and a minimum requirement for a particular DirectX extension, that's fine.

Isn't this exactly what they're already doing for GPU hardware standards via Direct3D?

Sure those standards are no doubt set with a level of input from NV, AMD and Intel but there's no reason why the same couldn't apply here.
And that's what should happen. It's the suggestion that Microsoft should just mandate X in hardware that I object to.

There is always a bit of chicken and egg here, but really it's for Intel and AMD - as the architects of most chipsets used in PCs - to work out how best to implement this, then work with Microsoft on an API. You really can't have software dictating hardware design; software is easy to change, hardware is not.
 
Okay, so something that needs clarification for me, likely my misunderstanding here:

We use block compression to ensure that any part of this texture can be decompressed randomly by the GPU, so the GPU can decompress the texture without regard to order. This is desirable for GPU setups. The trade-off is that it doesn't compress that well, since you're only compressing small blocks (4x4 texels for the BC formats). So in some cases you can compress further.

BUT

This makes sampling straightforward. On the texture below, the virtual texturing system must select the 32x32 or 64x64 tiles that will be loaded for consumption. With block compression, the GPU can still sample the blocks it wants because it knows where they are and can retrieve them without needing to decompress the entire texture. Then you send those tiles from the HDD to the GPU. That's how you save memory bandwidth.

If you use zlib or Kraken on top, you have to compress the whole texture, because they don't do block compression. So how can the GPU sample the blocks directly from the HDD when the whole texture needs to be decompressed first?
[Image: example_mipmap_chain-1.png - an example mipmap chain]

https://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files

It's possible to seek within a compressed file and only decompress the data you need.
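
For anyone curious how that works, the usual trick (and roughly what that link describes) is to compress in independent chunks and keep a small index; here's a minimal sketch using Python's zlib, with the 64KiB chunk size being an arbitrary choice:

```python
import zlib

CHUNK = 64 * 1024   # arbitrary chunk size; each chunk compresses and decompresses independently

def compress_with_index(data: bytes):
    """Compress data as independent chunks and record where each chunk lands in the output."""
    blobs, index, out_pos = [], [], 0
    for i in range(0, len(data), CHUNK):
        blob = zlib.compress(data[i:i + CHUNK])
        index.append((i, out_pos, len(blob)))      # (uncompressed offset, file offset, size)
        blobs.append(blob)
        out_pos += len(blob)
    return b"".join(blobs), index

def read_range(packed: bytes, index, start: int, length: int) -> bytes:
    """Decompress only the chunks overlapping [start, start + length)."""
    out = bytearray()
    for u_off, c_off, c_len in index:
        if u_off + CHUNK <= start or u_off >= start + length:
            continue                               # chunk doesn't cover the requested range
        out += zlib.decompress(packed[c_off:c_off + c_len])
    return bytes(out[start % CHUNK : start % CHUNK + length])

packed, idx = compress_with_index(bytes(range(256)) * 4096)     # ~1 MiB of dummy data
print(read_range(packed, idx, 200_000, 16))                     # touches one or two chunks only
```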
 
So how does this magical compression pipeline work?
Today:
We take our standard DXT1/BC1-5 textures, which are block compressed so that the GPU can retrieve any block/tile it wants on demand. The texture is then decompressed on the GPU and sent off for processing.
This works well with streaming virtual texturing.

Future?
It'll vary depending on need. Like you, I think VT won't use packing compression; it just can't. You'll want compression at a smaller granularity, perhaps sticking with the texture-level compression supported in hardware. For other content, you'll compress the hell out of it and decompress the entire asset into RAM. This is what we already do with some artwork - it's better to compress it as a JPEG or PNG and decompress into a texture buffer.

Thus in theory, something like UE5 won't benefit from device-level compression unless that compression can be fine-grained enough. I think from what little I saw of BCPack it might be able to, and with MS designing their SFS for PRT, they must be able to address some BCPack directly. Oh, or not - given the 100 GB reference, maybe they unpack content to a 100 GB reserve of DXTC files?
 
Thanks, I thought I was going crazy here. So unless both Kraken and BCPack can support some form of random/block access within the texture, the idea of using them for VT systems seems slim. But I appreciate the discussion around the non-VT use cases, because not every game will use VT or need it. Speeding up game loading still seems fairly powerful, though, as does handling all the other files that need to be loaded that may not involve virtual texturing.

Okay, perhaps this little tidbit on Oodle Kraken can shed some light on whether VT with Kraken is possible.
http://www.radgametools.com/oodlekraken.htm

I found this blog on Kraken vs ZLib on booting a game up on PS4/XB1.
Quite an improvement.

https://www.jonolick.com/home/oodle-and-ue4-loading-time
But I haven't been able to find anything yet on combining these non-random-access compressors with virtual texturing.
 
Now I am wondering if it is possible to split each channel into virtual partitions: take a 1GB chip (2 channels, 500MB of memory space per channel) and virtually partition each channel so the CPU and GPU can each store data, with the GPU getting 300MB and the CPU 200MB. The same is done for the 2GB chips, so overall the system has 10GB of RAM for the GPU and 6GB for the CPU/system, but spread across every chip. In practice, you should be able to run the GPU at full bandwidth all the time: you schedule GPU-related work first (reads/writes), and once that is completed you can then do CPU tasks before the cycle returns to that channel.
We had a thread discussing this, but it seems to be locked.

You would end up with 300MB of GPU data on each of the 1GB chip channels, and 600MB on the 2GB chips, since they have twice the space per channel, causing an imbalance in the request queues. The problem remains the same.

If you partition the channels at different sizes, you drop the average bandwidth in theory. So far each alternative proposal for how to partition the data brings a whole can of worms, and MS seem to have used the technique which best mitigates any potential issues.

Since full utilization can only happen when all channels are getting an equal number of requests per second, the high-speed partition needs to be an equal size on all channels. Otherwise some tasks would request more from one chip than the others. Therefore the only solution is to tightly interleave the data equally across all chips, like a RAID0, and use the excess space of the bigger chips as a separate partition for relatively rarely accessed data, which avoids some chips getting more requests than others. The OS is not used during gaming, and games should have enough low-speed data to put in that area.

Both memory spaces should be accessible by either the CPU or the GPU; the distinction is that the CPU tends to be the one with data that is accessed slowly or rarely, and there isn't much of that in what the GPU needs. That's its secret: it's always hungry.
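
A toy model of that "stripe the first 10GB across everything, park the excess on the bigger chips" arrangement, just to visualise it - the 256-byte stride and the channel ordering are assumptions for illustration, not the real controller's behaviour:

```python
# Toy address-to-channel model: stripe the first 10GB across all 20 channels,
# put the excess 6GB on the 12 channels of the 2GB chips. Stride/ordering are assumed.
GB, STRIDE = 1024 ** 3, 256
FAST_CHANNELS, SLOW_CHANNELS = 20, 12

def channel_for(addr: int):
    if addr < 10 * GB:
        return "fast", (addr // STRIDE) % FAST_CHANNELS          # all 20 channels in play
    return "slow", ((addr - 10 * GB) // STRIDE) % SLOW_CHANNELS  # only the 2GB chips' channels

print(channel_for(0))            # ('fast', 0)
print(channel_for(5 * STRIDE))   # ('fast', 5)  - consecutive strides walk the 20 channels
print(channel_for(11 * GB))      # ('slow', 4)  - lands in the 6GB "slow" region
```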
 
ToTTenTranz your post a few pages back about the XSX memory was pretty clear but I have a correction or two and a suggestion for better bandwidth utilization and latency hiding.

While I agree that there are 10 chips, the 10MB file should be split into 500KB partitions per channel (10 chips, 2x 16bit channels per chip, for a total of 20 channels), which then allows the memory controller to write/read from all chips in parallel for max bandwidth.


I think this should read: the memory controller can only interleave the data among all 20 channels and their memory spaces when each channel has space to do so. The addressable memory spaces come in two sizes (500MB/1GB), and the bandwidth at which they can be utilized is determined by the client bus used to access them (CPU 192bit bus / GPU 320bit bus). A channel/memory space used for CPU access will reduce the overall system bandwidth for as long as the data is stored there. The memory controller uses a marker (c for CPU, g for GPU, and an additional marker if data is split between two or more channels). The latency incurred when doing CPU vs GPU ops can then be spread over the 20 channels. The impression I have gotten is that a whole channel and its memory space would be allocated to the client that needs it. But I think you limit the system by doing so: if a channel holds CPU data and the system completes a cycle around the memory pool without that data being used, you lose that channel's bandwidth for GPU work.

Yes, you're correct. I explained everything through "number of chips" and not "number of channels" since each GDDR6 chip effectively has 2x 16bit channels. It's 20 channels in practice, but that would be a bit harder to explain, and this way the diagram matches the PCB pictures showing 10 GDDR6 chips.
But hey, I do show two "traces" between the SoC and each GDDR6 chip ;)


Now I am wondering if it is possible to split each channel into virtual partitions: take a 1GB chip (2 channels, 500MB of memory space per channel) and virtually partition each channel so the CPU and GPU can each store data, with the GPU getting 300MB and the CPU 200MB. The same is done for the 2GB chips, so overall the system has 10GB of RAM for the GPU and 6GB for the CPU/system, but spread across every chip. In practice, you should be able to run the GPU at full bandwidth all the time: you schedule GPU-related work first (reads/writes), and once that is completed you can then do CPU tasks before the cycle returns to that channel.

AFAIK you can assign any memory address to any client (CPU or GPU), but I think the memory controller can only serve one client at a time, not two clients in parallel. Otherwise it'd be the same as having two distinct memory controllers.

I'm not the most knowledgeable person to comment about this, though. By far.
 
I have no clue if I've understood it correctly either, but they do have several distinct memory controllers. The question is, can they serve separate clients at the same time (CPU & GPU), or do they all act "as one"?

edit:
Come to think of it, I can see a plethora of issues that could manifest themselves if they could serve different clients at once.
 
Yup, lots of industry standards began as proprietary APIs, but the difference is they evolved naturally and were discussed and adopted by consensus by the industry forum responsible - and Microsoft are a vital part of all of these forums.

Microsoft forcing their standard on everybody else, even if it's free, limits innovation and open competition in this field.
Doing so effectively prevents competing standards from gaining traction. I don't think anybody would argue this is good.

Oh yeah, I'm definitely not advocating MS trying to force anything on PC hardware makers. In fact I'm hoping this is a starting point for discussions - if BCPack is something everyone is happy with then fine, but as it still seems to be under development, hopefully it's the beginning of a journey towards something I really do think the PC would benefit from, and be well placed to take on board. A common standard, with a software fallback that's brought into hardware as and when hardware providers can fit it in. Kind of like video decoders, I suppose.

What about data in main RAM used by the CPU? Having this functionality on the GPU creates a problem. I personally think the logical location for a decompressor is the modern equivalent of the southbridge, because all data pulled from I/O goes through that bus. That is the point of a dedicated I/O controller. :yes:

If other parts of the system could benefit from this new compression system then definitely. That would probably be worth the cost of roughly 2x the texture traffic over PCIe to the GPU (PCIe 4 and then 5 should have plenty of headroom - rough numbers sketched below). If the compression scheme is only really suitable for graphics, then I still think the GPU side of the PCIe bus would be a good place for it, as you get the benefits of reduced traffic and a reduced memory footprint for streaming pools in main RAM, and crucially the GPU vendors can implement it at will, without waiting for mobo chipsets or CPUs to implement it, with the GPU teams controlling implementation and drivers.

Tying a graphics / GPU focused decompressor to the GPU vendor's drivers and driver teams would seem to be the fastest and safest way to approach this, I think.

But hopefully whatever comes (and next gen consoles have shown that something is needed) will 1) turn up and 2) be as general purpose for all kinds of different PC uses as possible, assuming the cost of that isn't prohibitive.
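
Rough napkin math on that ~2x PCIe traffic trade-off (the SSD rate and compression ratio below are illustrative guesses; the PCIe figures are the standard per-direction x16 numbers):

```python
# Back-of-envelope: is shipping already-decompressed texture data over PCIe viable?
pcie4_x16 = 16 * 16 * (128 / 130) / 8    # ~31.5 GB/s per direction (16 GT/s, 128b/130b)
pcie5_x16 = 32 * 16 * (128 / 130) / 8    # ~63.0 GB/s per direction

ssd_raw_gbs       = 7.0                  # illustrative fast NVMe drive
compression_ratio = 2.0                  # illustrative average ratio for texture data

host_decompressed_traffic = ssd_raw_gbs * compression_ratio   # what the GPU link would carry
print(f"{host_decompressed_traffic} GB/s vs PCIe4 {pcie4_x16:.1f} / PCIe5 {pcie5_x16:.1f} GB/s")
# Headroom either way, but decompressing on the GPU side of the link halves that traffic.
```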
 
Mr. Fox I remember that thread and I did ask a few questions in there as well.

You would end up with 300MB of GPU data on each of the 1GB chip channels, and 600MB on the 2GB chips, since they have twice the space per channel, causing an imbalance in the request queues. The problem remains the same.

If you partition the channels at different sizes, you drop the average bandwidth in theory. So far each alternative proposal for how to partition the data brings a whole can of worms, and MS seem to have used the technique which best mitigates any potential issues.

Since full utilization can only happen when all channels are getting an equal number of requests per second, the high-speed partition needs to be an equal size on all channels. Otherwise some tasks would request more from one chip than the others. Therefore the only solution is to tightly interleave the data equally across all chips, like a RAID0, and use the excess space of the bigger chips as a separate partition for relatively rarely accessed data, which avoids some chips getting more requests than others. The OS is not used during gaming, and games should have enough low-speed data to put in that area.
I did read through the memory thread but may have glossed over or misunderstood that, for max possible bandwidth, the memory sizes needed to be the same. Working it out now, the ideal memory size would be 500MB per channel, so over 20 channels you get the 10GB @ 560GB/s for the GPU that MS indicated. So, as ToTTenTranz said in his initial post, the 2GB chips are the ones carrying the "slower" memory pool, as they have the extra 500MB virtual partition per channel.

AFAIK you can assign any memory address to any client (CPU or GPU), but I think the memory controller can only serve one client at a time, not two clients in parallel. Otherwise it'd be the same as having two distinct memory controllers.
I'm at the limit of my understanding here so unless someone points me to a video that allows me to visualize this aspect of the memory system I'm good.

Thanks as usual guys for taking the time to reply to my questions.
 
We all are. Without additional information, what's been provided is probably the best we can do with the knowledge we have today.
 
Okay, so something that needs clarification for me, likely my misunderstanding here:

We use block compression to ensure that any part of this texture can be decompressed randomly by the GPU, so the GPU can decompress the texture without regard to order. This is desirable for GPU setups. The trade-off is that it doesn't compress that well, since you're only compressing small blocks (4x4 texels for the BC formats). So in some cases you can compress further.

BUT

This makes sampling straightforward. On the texture below, the virtual texturing system must select the 32x32 or 64x64 tiles that will be loaded for consumption. With block compression, the GPU can still sample the blocks it wants because it knows where they are and can retrieve them without needing to decompress the entire texture. Then you send those tiles from the HDD to the GPU. That's how you save memory bandwidth.

If you use zlib or Kraken on top, you have to compress the whole texture, because they don't do block compression. So how can the GPU sample the blocks directly from the HDD when the whole texture needs to be decompressed first?
[Image: example_mipmap_chain-1.png - an example mipmap chain]

You break the texture and its mips into standard-size tiles. You block compress each tile (BC6 is probably the most likely format as it supports HDR), which produces 4x4 pixel blocks that are typically 16 bytes in size. You then losslessly compress the block-compressed blocks that make up a tile.

You losslessly decompress the texture tiles into VRAM and the GPU simply grabs the (still block-compressed) 4x4 blocks it needs.
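
If it did work like that, a minimal sketch of the flow might look like this - zlib stands in for whatever lossless layer (Kraken/BCPack) sits on top, and the block compressor is a dummy placeholder, so this is just the shape of the pipeline, not any console's actual implementation:

```python
import zlib

TILE = 64                 # hypothetical tile size in texels (64x64)
BLOCK_BYTES = 16          # one BC6/BC7 4x4 block is 16 bytes

def fake_bc_encode(tile_texels: bytes) -> bytes:
    """Placeholder for a real BC encoder: emits one 16-byte block per 4x4 texels."""
    n_blocks = (TILE // 4) * (TILE // 4)
    return tile_texels[:n_blocks * BLOCK_BYTES].ljust(n_blocks * BLOCK_BYTES, b"\0")

def pack_tile(tile_texels: bytes) -> bytes:
    """Offline: block compress the tile, then wrap the result in a lossless layer."""
    return zlib.compress(fake_bc_encode(tile_texels))

def load_tile(packed: bytes) -> bytes:
    """Runtime: undo only the lossless layer; what lands in VRAM is still BC blocks."""
    return zlib.decompress(packed)

tile_texels = bytes(range(256)) * 64          # stand-in texel data for one tile
packed = pack_tile(tile_texels)
print(len(packed), "bytes on disk ->", len(load_tile(packed)), "bytes of GPU-ready BC blocks")
```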
 
You break the texture and its mips into standard-size tiles. You block compress each tile (BC6 is probably the most likely format as it supports HDR), which produces 4x4 pixel blocks that are typically 16 bytes in size. You then losslessly compress the block-compressed blocks that make up a tile.

You losslessly decompress the texture tiles into VRAM and the GPU simply grabs the (still block-compressed) 4x4 blocks it needs.
I’m pretty sure it won’t do that. There’s nothing to gain or compress if you are compressing individual blocks.
 