Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

What about a PCIE SSD, with a decompression block, that can also be directly connected to a GPU in the fashion of an NVLink type device?

The SSD can function as it currently does in a PC, but it can also be directly accessed like the console SSDs, without having to shuttle through a bunch of buses.
 
What about a PCIE SSD, with a decompression block, that can also be directly connected to a GPU in the fashion of an NVLink type device?

The SSD can function as it currently does in a PC, but it can also be directly accessed like the console SSDs, without having to shuttle through a bunch of buses.
You'd need a hardware compression standard. You'd need all game content to be compressed with the same standard, and then you have to worry about which standard, and how that decompression is handled. Let's say PS5 outsells XBSX, for argument's sake - do you choose Kraken as the method of choice because it's the most popular, or ZLib as the most generic, or ZLib+BCPack because that's the MS choice? Or do you just stick a whole 4-core i5 on there to run whatever decompression you want? ;)
 
What about a PCIE SSD, with a decompression block, that can also be directly connected to a GPU in the fashion of an NVLink type device?

The SSD can function as it currently does in a PC, but it can also be directly accessed like the console SSDs, without having to shuttle through a bunch of buses.

You'd lose the speed benefit of compression, so the full uncompressed data would have to make the trip instead. Seems like Sony and MS both put this at roughly 50% of the speed (compared to the compressed data).

Edit: unless it's directly connected to the GPU. Not sure how feasible that'd be.
 
As far as I know:
Kraken and ZLib are the generic full disk compression.
Most devs are moving from zlib to kraken (according to cerny)
RDO and BCPack are optimizations specific to BCn textures, applied on top of the generic one
RDO and BCPack are different, it's the same goal but not the same approach
If you ditch compression on PC you ditch not only the generic one, but also BCPack

PC games need to run on a wide variety of hardware, the CPU is improving quickly, and it's a lot easier to require a bit more CPU or generic GPU compute power for decompression than to require two or three times more storage and bandwidth for uncompressed data. The difference we will see is that everyone will ditch ZLib, which takes a lot of processing power, and replace it with a modern compression scheme that is much less demanding on the CPU.
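Just as a rough illustration of that gap (the actual console codecs aren't publicly available, so this uses the open zstandard package purely as a stand-in for "a modern compression", and the synthetic payload makes the absolute numbers meaningless):

import time
import zlib
import zstandard  # pip install zstandard; illustrative stand-in, not Kraken/BCPack

payload = bytes(range(256)) * 400_000   # ~100 MB of synthetic, highly compressible data

def bench(name, compress, decompress):
    blob = compress(payload)
    start = time.perf_counter()
    decompress(blob)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(blob) / 1e6:.0f} MB on disk, "
          f"~{len(payload) / 1e9 / elapsed:.2f} GB/s decompression")

bench("zlib", lambda d: zlib.compress(d, 6), zlib.decompress)
cctx, dctx = zstandard.ZstdCompressor(level=6), zstandard.ZstdDecompressor()
bench("zstd", cctx.compress, dctx.decompress)

The exact numbers depend entirely on the data, but the general pattern (similar ratios, far cheaper decompression) is what makes switching attractive.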

By far the biggest improvement needed on PC is a standardized raw format on NVMe, ditching the whole file system. NTFS sucks at parallel workloads. Going through the CPU is not as big of a deal as it is on console, where CPU power gets much more limited as the generation ages.

A game requiring special hardware on PC doesn't work well because it takes many years for something to become ubiquitous, while the CPU improves continuously and requires no effort to support - no multiple branches of code. Single-purpose specialized hardware blocks are reserved for consoles because they need every last drop of BOM efficiency on a platform that will play released games for 6-7 years.
 
The problem is that decompressing this volume of data in real time uses many CPU cores.

Does decompression scale indefinitely with more cores/threads?

I'd say it doesn't. The Kraken page makes a big deal out of the fact that it supports two threads on decompression, so dedicating more than 2 cores to it would be useless.
Looking around the internet for multi-threaded decompression I'm finding only dead ends, which makes sense considering how the computational operations for decompression work (data loses its original ordering once compressed). Unless, of course, you purposely and manually chunk the data, which then results in lower compression ratios and defeats the purpose of both saving disk space and increasing effective transfer rates.
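To make that trade-off concrete, here's a minimal sketch of the chunking approach (zlib and the 1 MiB chunk size are arbitrary example choices): each chunk is compressed independently, so chunks can be decompressed on separate cores, but every chunk boundary throws away cross-chunk redundancy and hurts the ratio.

import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB per chunk (arbitrary assumption)

def pack(data: bytes) -> list[bytes]:
    # Each chunk is compressed on its own, which is what makes parallel
    # decompression possible at all - and what costs compression ratio.
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def unpack_parallel(chunks: list[bytes], workers: int = 4) -> bytes:
    # CPython's zlib releases the GIL while working on large buffers, so a
    # thread pool is enough here to spread decompression across cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))

data = open(__file__, "rb").read() * 5_000   # stand-in payload
assert unpack_parallel(pack(data)) == data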

It seems to me that when a console maker claims "this decompression block is doing the same work as 5 Zen2 cores", they're actually saying "this decompression block is doing the same work as a single Zen2 core clocked at 17GHz", meaning it's practically impossible to get the same decompression speed using software decompression on today's hardware.
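That reading is just swapping cores for clock, assuming a ~3.5GHz Zen 2 clock (the exact clock is my assumption, not a quoted figure):

zen2_clock_ghz = 3.5      # assumed per-core clock
cores_quoted = 5          # "5 Zen 2 cores" claim for the decompression block
print(f"~{zen2_clock_ghz * cores_quoted:.1f} GHz of single-core-equivalent throughput")
# ~17.5 GHz - far beyond any real core, hence the dedicated hardware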



As I've said before, there is a difference between system-level IO compression/decompression of everything coming off the disk, and selective compression/decompression of specific data sets. The consoles do the former; I'm suggesting that PCs can do the latter, and therefore the decompression requirements would not have to be as high as the "5 Zen 2 cores" quoted for the PS5. It's a trade-off between disk space and CPU requirements that the consoles don't have to make, so they can simply compress everything.

You get on average 45-64% more effective throughput with Kraken over the "uncompressed" data feed (which would already contain GPU-native compressed textures), according to Sony's own figures.
Only 64% over uncompressed data? Where did Sony mention this?
I have the notion that at least shadowmaps could compress a lot, up to 50:1.


Are you suggesting that, in order not to overburden every system with fewer than 16 CPU cores, developers would not accept an install footprint 45-64% larger on the PC? Maybe they wouldn't, but that doesn't seem like the impossible scenario you're painting it to be.
This is enough to make a 100GB game turn into 200-300GB. It's way too much, you couldn't even fit 2 games on a 500GB NVMe.
Devs wouldn't want to deal with the backlash of having their game occupy the bulk of a PC's storage, IMO.
But I'm a bit doubtful of those compression rates, to be honest.



I'm suggesting that this is one possible route that could be taken. Actually, I find it unlikely that Microsoft would stipulate a dedicated hardware decompression block in order to gain DirectStorage compliance; I think that if drives need DirectStorage compliance at all, it will be more focused on their DMA capabilities, which is likely to be more important to get out into the market. That said, they could encourage the use of a hardware decompression block through some sort of "DirectStorage Ultra/Premium" type certification which does include such a block. Or SSD makers could simply do this off their own backs, given that it could be a good selling point in terms of capacity, but then it's less likely to be tied into how games are packaged.
But that way the maximum effective throughput (post-decompression) would be ~7GB/s (theoretical 8GB/s), which is the maximum a 4-lane PCIe 4.0 can do.
Considering how we'll have 7GB/s NVMe SSDs later this year, this seems like a bit of a dead-end.


Such a card would still be limited by a 4x PCIe interface into the CPU on non-server chips, unless you were going to use it instead of a GPU, which wouldn't make much sense in a gaming system.
Not if you use a board that splits the PCIe 4.0 16x connection to the GPU into two 8x connections, which is what many PCIe 4.0 motherboards can do already. This way you'd get ~14GB/s max from I/O and another 14GB/s for the GPU (i.e. same as 16x PCIe 3.0).
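For reference, those figures fall straight out of the per-lane rate (real payload throughput lands a bit lower once protocol overhead is counted):

GT_PER_LANE = 16           # PCIe 4.0 signalling rate, GT/s per lane
ENCODING = 128 / 130       # 128b/130b line coding
per_lane_gb_s = GT_PER_LANE * ENCODING / 8     # ~1.97 GB/s per lane
for lanes in (4, 8, 16):
    print(f"PCIe 4.0 x{lanes}: ~{per_lane_gb_s * lanes:.1f} GB/s theoretical")
# x4 ~7.9, x8 ~15.8, x16 ~31.5 GB/s; PCIe 3.0 runs at half these rates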



Nope, it's easy for devs to support one standard (Kraken on PS5, BCPack on XSX) but what compression will devs use to package their PC games? Or is the expectation that the hardware will support them all? And new ones in the future?

I think this depends on the effort to adopt one or another compression method.
Do you need a long time to apply special treatment to the data before compressing to Kraken or BCPack? I find that a bit hard to believe. It doesn't take much time for either of us to put a bunch of files into a zip, so why are Kraken and BCPack any different? Neither of them sounds very restrictive.

The only problem I'd see here could be licensing costs, but my guess is every PS5 devkit already has a Kraken license and BCPack already belongs to Microsoft.

So 3rd party dev studios have their raw data. For the Series X they pack the game using zlib + BCPack; for the PS5 they pack it using Kraken.
It's not like this will take more time than e.g. compiling.
 
That's where the DirectStorage certification could potentially come in. Microsoft could stipulate that DirectStorage-compatible games must have a zlib/BCPack-compressed distributable and lead the way with UWP games. I'm not suggesting this is overly likely, but it's one potential solution to the problem.

Microsoft forcing their standard over alternative commercial offerings has all sorts of monopolistic considerations. They're just not going to go anywhere near that with a barge pole. :nope: This also destroys the chance of any competing/better standard being adopted and really undermines the point of an API - the purpose is to be flexible and extensible.

I think this depends on the effort to adopt one or another compression method. Do you need a long time to apply special treatment to the data before compressing to Kraken or BCPack? I find that a bit hard to believe. It doesn't take much time for either of us to put a bunch of files into a zip, so why are Kraken and BCPack any different? Neither of them sounds very restrictive.

I'm not following. I'm talking about the compression format that games come bundled in for a particular platform, both for digital distribution and for installation - which are not necessarily the same. There has been no mention of XSX or PS5 having compression libraries, only realtime decompression. Compression is a very different kettle of fish, given that the best compression libraries use a lot of memory and a lot of CPU time for deeper/wider/longer heuristic/polymorphic data pattern recognition. On PC there's nothing to stop a generic digital distribution (like what we have now), followed by a formal installation phase where data is compressed and re-written to disk, but this would likely lead to greater 'free space' requirements for download/installation, and compressing tens of gigabytes of data is not going to be instant.
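A rough sketch of what that install-time step might look like (the paths, the ".pak" suffix and the use of the open zstandard package are all illustrative assumptions, not anything MS or Sony have described):

import pathlib
import zstandard  # pip install zstandard; illustrative codec choice

def reinstall(download_dir: str, install_dir: str, level: int = 19) -> None:
    # Pay a heavy one-off compression cost at install time so that load-time
    # decompression stays cheap: high zstd levels compress slowly but still
    # decompress quickly.
    cctx = zstandard.ZstdCompressor(level=level)
    dst_root = pathlib.Path(install_dir)
    for src in pathlib.Path(download_dir).rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(download_dir)
        dst = dst_root / rel.parent / (rel.name + ".pak")   # hypothetical suffix
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(cctx.compress(src.read_bytes()))

# reinstall("downloads/game_distributable", "games/installed_game")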
 
You'd need a hardware compression standard. You'd need all game content to be compressed with the same standard, and then you have to worry about which standard, and how that decompression is handled. Let's say PS5 outsells XBSX, for argument's sake - do you choose Kraken as the method of choice because it's the most popular, or ZLib as the most generic, or ZLib+BCPack because that's the MS choice? Or do you just stick a whole 4-core i5 on there to run whatever decompression you want? ;)

Haha, true. I'd expect it to be the same as the decoder in the XSX - MS have committed to that decoder for at least the lifetime of their upcoming console, and they've been making increasing moves to integrate the Windows and Xbox gaming ecosystems. Matching the decoder in the XSX seems like a logical move.

You'd lose the speed benefit of compression, so the full uncompressed data would have to make the trip instead. Seems like Sony and MS both put this at roughly 50% of the speed (compared to the compressed data).

Edit: unless it's directly connected to the GPU. Not sure how feasible that'd be.

Well, my thinking was that said PCIE card would be able to have its data accessed just as it does now, in which case the decompression block could be bypassed, and decompression then handled by the CPU.

However, with a dedicated link to the GPU too, the GPU could have DMA, utilising the decompression block.

I have no idea how feasible this would be though. A quick Google shows a single NVLink bridge as being capable of 50GB/s. So it's at least possible to better the compressed 4.8GB/s of the XSX, which IMO is the only spec a Windows gaming PC needs to meet or exceed. But there are all kinds of technicalities outside the bounds of my tiny mind.
 
Only 64% over uncompressed data? Where did Sony mention this?
I have the notion that at least shadowmaps could compress a lot, up to 50:1.

Sony have stated the uncompressed throughput of their SSD is 5.5GB/s, whereas with compression they expect this to be more like 8-9GB/s effective. That's a 45-64% improvement. I'm assuming that once the data stream has been through the decompression block it's 'ready to use' and doesn't need further software decompression. So that means Kraken is gaining you, let's say, 55% on average over the 'ready to use' data that I'm suggesting PCs might have to keep on the SSD to avoid decompression in software.

This is enough to make a 100GB game turn into 200-300GB. It's way too much, you couldn't even fit 2 games on a 500GB NVMe.

I think something may have gone awry with your maths there. That's going to make the footprint of a 100GB game into 145-164GB. Not ideal obviously, but not a deal breaker either IMO. It would mean that a PC needs to feature a 1.35TB SSD to match the PS5's 825GB one. Such a drive size is not unusual in the PC space.
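For anyone checking the arithmetic, it's just Sony's 5.5GB/s raw vs 8-9GB/s effective figures applied to capacity:

raw, eff_low, eff_high = 5.5, 8.0, 9.0                 # GB/s, Sony's figures
gain_low, gain_high = eff_low / raw, eff_high / raw    # 1.45x .. 1.64x
print(f"100 GB (compressed) -> {100 * gain_low:.0f}-{100 * gain_high:.0f} GB stored uncompressed")
print(f"PS5's 825 GB ~= {825 * gain_high / 1000:.2f} TB of uncompressed PC storage")
# -> roughly 145-164 GB per game, and ~1.35 TB of PC SSD to match 825 GB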

But that way the maximum effective throughput (post-decompression) would be ~7GB/s (theoretical 8GB/s), which is the maximum a 4-lane PCIe 4.0 can do.
Considering how we'll have 7GB/s NVMe SSDs later this year, this seems like a bit of a dead-end.

Agreed, but this isn't a solution to increase bandwidth, purely to reduce disk footprint. My suggestion is that 7GB/s is good enough for now, and within a year or so of the consoles' launch PCIe 5.0 will deliver all the bandwidth necessary. Decompression may help there depending on controller speed.

Not if you use a board that splits the PCIe 4.0 16x connection to the GPU into two 8x connections, which is what many PCIe 4.0 motherboards can do already. This way you'd get ~14GB/s max from I/O and another 14GB/s for the GPU (i.e. same as 16x PCIe 3.0).

Agreed. Using 8 lanes each for the GPU and SSD is realistic and would give more than enough data throughput until PCIe 5.0 comes along (using RAID SSDs). The downsides of course are massive expense, and you lose half of your CPU<->GPU bandwidth, although granted that's unlikely to have any real-world impact for a long time.

EDIT: Just realized this couldn't work. If you're streaming data off the disk at 14GB/s that means you're streaming it to the GPU at something similar. So if you're only on an 8x PCIe interface with the GPU, that's all your CPU<->GPU bandwidth gone.
 
I'm not following. I'm talking about the compression format that games come bundled in for a particular platform, both for digital distribution and for installation - which are not necessarily the same. There has been no mention of XSX or PS5 having compression libraries, only realtime decompression.

Why would developers adopt any compression format other than the ones already supported by the very powerful decompression blocks on either console?
I get that when they were limited by the 40MB/s throughput of hard disk drives, they could just use CPU decompression whenever they saw fit, because they'd almost always be I/O limited.
But with the new consoles the I/O is at least 63x faster, whereas the CPU obviously didn't increase as much.


Sony have stated the uncompressed throughput of their SSD is 5.5GB/s whereas with compression they expect this to be more like 8-9GB/s effective. That's a 45-64% improvement.
They also said the effective throughput with Kraken can go up to 22GB/s, meaning a 4:1 ratio.
It seems to me that both platforms are limited either by decompressor performance or decompressor -> RAM bandwidth (ideally both at the same time, which would mean they were engineered to match each other).

I think something may have gone awry with your maths there.
Yes, I reversed the multiplication. I assumed a 64% compression means you're left with a file that is 36% of the original one's size. Isn't this how compression is usually measured?
 
I guess he did 10.3 / 12.2 = 0.844, so it's ~84% of XBSX, a ~16% deficit, rounded down to a nicer-sounding 15%.
Alternatively, 12.2 / 10.3 = 1.18, so XBSX has 18% more TFs.

How so? You have RT units capable of performing however many intersect tests per clock. 1 unit at 10 GHz should be able to process the same number as 10 units at 1 GHz, I'd have thought. How does parallelism help with AMD's implementation of RT in RDNA2?
Take this broken, illustrative ASM (it doesn't actually run):
ldr r1, [A]              ; load A from memory
ldr r2, [W]              ; load W from memory
v_add_f32 r0, r1, r2     ; r0 = A + W
v_mul_f32 r3, r0, r2     ; r3 = (A + W) * W
str r3, [C]              ; store the result to C

Say you have 1 CU at 2x the clock speed of another setup that has 2 CUs.
Your inefficiency increases, because the 2 CUs will run this program once (each handling its own data in parallel):
ldr r1, [A]
ldr r2, [W]
v_add_f32 r0, r1, r2
v_mul_f32 r3, r0, r2
str r3, [C]

But the 1 CU will need to run this, back to back:
ldr r1, [A]
ldr r2, [W]
v_add_f32 r0, r1, r2
v_mul_f32 r3, r0, r2
str r3, [C]
ldr r1, [A]
ldr r2, [W]
v_add_f32 r0, r1, r2
v_mul_f32 r3, r0, r2
str r3, [C]

So look at the number of memory operations: with the 2 CUs you've got 3 memory operations per CU, all happening simultaneously. This is how we divide work over the GPU.
In the second case, you need to make 6 memory requests, one after the other.

So even though you might have a 2x clock speed advantage over the 2 CUs, you have 2x the number of memory requests to get through. Each command takes a different number of cycles to complete, but LDR and STR commands will definitely be anywhere between 4-5 or more cycles depending on the data you require, and if your memory subsystem doesn't speed up, the clock speed advantage goes away. What takes 4-5 clock cycles for LDR/STR on the 2 CUs may take 8-10 clock cycles or more on the faster CU, just because it's sitting around idle waiting for memory to keep up.
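Here's a deliberately crude model of that effect (every number in it - the memory ops, ALU ops, 5ns memory cost, and the lack of any latency hiding - is an illustrative assumption, not RDNA data), just to show why the memory portion doesn't shrink with clock:

def runtime_ns(programs, mem_ops, alu_ops, clock_ghz, mem_latency_ns=5.0):
    # ALU work scales with the clock; the memory cost is fixed in wall-clock terms.
    cycle_ns = 1.0 / clock_ghz
    return programs * (mem_ops * mem_latency_ns + alu_ops * cycle_ns)

# 2 CUs at 1 GHz each run the 5-instruction program once, in parallel:
wide = runtime_ns(programs=1, mem_ops=3, alu_ops=2, clock_ghz=1.0)
# 1 CU at 2 GHz has to run it twice, back to back:
fast = runtime_ns(programs=2, mem_ops=3, alu_ops=2, clock_ghz=2.0)
print(f"2 CUs @ 1 GHz: {wide:.1f} ns   1 CU @ 2 GHz: {fast:.1f} ns")
# The ALU portion halves with clock, but the memory portion doubles with the
# extra pass, so the fast-and-narrow configuration ends up behind.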

So in large parallel situations, where we can assign both the number of thread blocks and the threads per block, I believe massively parallel will dominate lower CU counts. But in a typical fixed-function pipeline, where every stage of the pipeline is scheduled, the clock speed will help create scenarios where it's advantageous or at least equals the large CU differential. If you're looking at pure compute shader power, though, compute shaders typically work very well with large parallel workloads (we just tell them how many threads and how many blocks, and the work gets distributed), as that is what they are designed to do.

The FF pipeline will have a harder time keeping things saturated because it needs to schedule the work in and out (this was a major problem with GCN, which took 4 cycles to complete an instruction). Compute shaders typically have different restrictions, like only being able to see the data of neighbouring CUs. But this is what led us to the design of tile-based rendering: we assign a tile of work, say an 8x8 block of pixels, to be computed per threadblock.

With RDNA 2 and the introduction of an L1 cache shared among the CUs within a shader array (not present in GCN), the more CUs you have in a shader array, the more they'll share with neighbouring CUs. You could theoretically have larger block sizes to work on as a result. Doing work within L1 is ideal, as it's the fastest memory request we have; once L1 misses, you'll need to reach out to L2 to fetch, and if not L2 then to VRAM and back. So that's where the advantage is going to be for massively parallel designs and compute shaders.
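As a rough picture of that tiling, here's the CPU-side equivalent of carving a frame into 8x8 tiles, one per threadblock (the tile size and the 1920x1080 frame are just example assumptions):

TILE = 8

def tiles(width: int, height: int):
    # Yield (x, y, w, h) for each tile; edge tiles may be smaller than 8x8.
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            yield (tx, ty, min(TILE, width - tx), min(TILE, height - ty))

work_items = list(tiles(1920, 1080))
print(len(work_items), "threadblocks of up to", TILE * TILE, "pixels each")
# -> 32,400 blocks; a wider GPU keeps more of these in flight at once, and
#    blocks sharing a shader array can reuse each other's L1-resident data.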


** lol, I was trying to use [ B ] as my second array, and it kept bolding everything, I was like what gives? Anyway switched it up.
 
Why would developers adopt any compression format other than the ones already supported by the very powerful decompression blocks on either console?
They wouldn't. I said that in this post.

I'm talking about the dilemma for PC games. Where having options is the problem. Console = one standard = no choice = easy. PC = many options = dilemma.
 
They wouldn't. I said that in this post.

I'm talking about the dilemma for PC games. Where having options is the problem. Console = one standard = no choice = easy. PC = many options = dilemma.

Ok, sorry I somehow skipped that part. I agree with you then.

On the PC, I don't think fast decompression (>7GB/s effective throughput, to match the consoles) will happen anytime soon. It's either gigantic game install sizes to brute-force transfer speeds of up to 7GB/s, or making use of much larger amounts of RAM (32GB minimum) to cache more data, or a mixture of both.
 
They also said the effective throughput with Kraken can go up to 22GB/s, meaning a 4:1 rate..

They actually said the decompressor can operate at up to 22GB/s if the data compresses particularly well (4:1). Very little data compresses that much, so that wouldn't be a sustained throughput; it'd be peak throughput in ideal corner cases. The average throughput across all data types is 8-9GB/s, so that should be the basis on which we calculate game footprint. It's not exact obviously, as some types of data are likely to be streamed more than others, but it seems a pretty reasonable basis for comparison.

It seems to me that both platforms are limited either by decompressor performance or decompressor -> RAM bandwidth (ideally both at the same time, which would mean they were engineered to match each other).

I would assume they're both scaled to be able to decompress any data types that stream off the disk in real time, i.e. 2.4GB/s of zlib/BCPack on the XSX and 5.5GB/s of Kraken on the PS5. The post-compression size will be a factor of how well that particular data type compresses, but according to Sony, even if everything were compressed at 4:1 the decompression unit could keep up. So I'm not seeing a bottleneck there. As for decompressor-to-RAM bandwidth, that's presumably only limited by the Infinity Fabric and main memory controller bandwidth, both hugely faster than anything the SSD can throw out, with or without compression.

Yes, I reversed the multiplication. I assumed a 64% compression means you're left with a file that is 36% of the original one's size. Isn't this how compression is usually measured?

If you have a file that's 9GB on the PS5 after decompression then the original compressed size was 5.5GB (on average). So 61%. And 61% of 1350 = ~825.

Ok, sorry I somehow skipped that part. I agree with you then.

On the PC, I don't think fast decompression (>7GB/s effective throughput, to match the consoles) will happen anytime soon. It's either gigantic game install sizes to brute-force transfer speeds of up to 7GB/s, or making use of much larger amounts of RAM (32GB minimum) to cache more data, or a mixture of both.

I think we all agree on this (so long as gigantic can be used to describe up to 64% bigger than the PS5 install size).
 
Nope, it's easy for devs to support one standard (Kraken on PS5, BCPack on XSX) but what compression will devs use to package their PC games? Or is the expectation that the hardware will support them all? And new ones in the future?

[image: standards.png]
Reminds me of that time Titanfall was released with fully uncompressed audio. >_>

Another performance worry was tackled with sheer disk space - the 48GB install has around 35GB of uncompressed audio. Most games use compressed sound files, but Respawn would rather spend CPU time on running the game as opposed to unpacking audio files on the fly. This isn't a problem on Xbox One - and wouldn't be on PlayStation 4 in theory - as the next-gen consoles have dedicated onboard media engines for handling compressed audio.

"On a higher PC it wouldn't be an issue," points out Baker. "On a medium or moderate PC, it wouldn't be an issue, it's that on a two-core [machine] with where our min spec is, we couldn't dedicate those resources to audio."
 
They wouldn't. I said that in this post.

I'm talking about the dilemma for PC games. Where having options is the problem. Console = one standard = no choice = easy. PC = many options = dilemma.

If MS establishes a standard, PC developers would most likely follow their lead. It's likely to be the path of least resistance overall.
 
On the other hand, it could still take a number of years before devs follow suit, subject to the proliferation of HW decompression support in the market. Meanwhile, the bar for CPU performance/requirements will have been raised pretty darn high with the impending generation.
 
Last edited:
Well, my thinking was that said PCIE card would be able to have its data accessed just as it does now, in which case the decompression block could be bypassed, and decompression then handled by the CPU.

However, with a dedicated link to the GPU too, the GPU could have DMA, utilising the decompression block.

I have no idea how feasible this would be though. A quick Google shows a single NVLink bridge as being capable of 50GB/s. So it's at least possible to better the compressed 4.8GB/s of the XSX, which IMO is the only spec a Windows gaming PC needs to meet or exceed. But there are all kinds of technicalities outside the bounds of my tiny mind.

Yeah, I suspect that the SSD having direct access to the GPU and the GPU having a decompression unit to send data to VRAM would be the quickest solution.

That's essentially what the two console manufacturers have done, isn't it?
 
On the other hand, it could still take a number of years before devs follow suit, subject to the proliferation of HW decompression support in the market. Meanwhile, the bar for CPU performance/requirements will have been raised pretty darn high with the impending generation.

I don't see this as much different from any other transition period brought on by new hardware. In the interim, I expect they'll brute-force it at the high end (high RAM + CPU requirements) and scale down to the low end like always. We really didn't see this last gen because there wasn't much the consoles were doing that PCs couldn't. I've always tended to go back and forth between primarily console and primarily PC gaming as new consoles come out, and last gen was the first time that didn't happen, as I did almost all my gaming on PC throughout.
 