Ratchet & Clank technical analysis *spawn

Flappy Pannus · Aug 15, 2023

techuse said:
Motherboards would seem the best candidate for adding a decompression block considering CPUs and GPUs are already hitting power and transistor extremes.

Then you remove one of the main benefits of GPU decompression - saving PCI/E bandwidth by transmitting the compressed assets over the bus, which you can only do if the GPU is handling the decompression.

For APU's of course, doesn't really matter.

pjbliverpool · Aug 15, 2023

Remij said:
What I want is a dedicated DirectStorage PCIE card with 1 or more NVMe slots which has a dedicated I/O + decompression block which is capable of decompressing data directly to RAM or VRAM depending on the asset.. freeing up the CPU/GPU almost completely.

Yes, I would prefer the data to decompressed and sent over the PCIe directly where it needs to go, instead of compressed data to system RAM and then to GPU.

I think the solution needs to be on the motherboard itself.

If the data were to be decompressed on the same side of the PCIe bus that the NVMe drive sits on as would be the case with the expansion card idea above, then you would lose the doubling effect on bandwidth of decompression. i.e. a 7GB/s drive on a PCIe 4.0 4x interface would be limited to 7GB/s throughput rather than the 14GB/s+ that you can get with decompression.

The motherboard (or perhaps more realistically the IO die on the CPU itself essentially matching the PS5 solution) is one option, but as Flappy noted above, you then lose the bandwidth savings of the GPU decompression scheme.

Also, I'm not sure how well that would work with point to point memory transfers between disk and VRAM as AMD's Smart Access Storage seems to promise. Presumably it wouldn't.

For me the ideal is either an ASIC on the GPU, or GPU decompression to be truly free. And leave CPU side data to the CPU. An argument could even be made for a complimentary CPU side ASIC to handle just the CPU data but that seems like massive overkill to me in this world of many core CPU's that are almost always underutilised.

techuse · Aug 15, 2023

Flappy Pannus said:
Then you remove one of the main benefits of GPU decompression - saving PCI/E bandwidth by transmitting the compressed assets over the bus, which you can only do if the GPU is handling the decompression.

For APU's of course, doesn't really matter.

Fair point, it seems like we have bandwidth to spare though. Also, PCI/E 7 is coming with the next round of GPUs which will double bandwidth up to 32 GB/s each way per lane.

Remij · Aug 15, 2023

pjbliverpool said:
If the data were to be decompressed on the same side of the PCIe bus that the NVMe drive sits on as would be the case with the expansion card idea above, then you would lose the doubling effect on bandwidth of decompression. i.e. a 7GB/s drive on a PCIe 4.0 4x interface would be limited to 7GB/s throughput rather than the 14GB/s+ that you can get with decompression.

The motherboard (or perhaps more realistically the IO die on the CPU itself essentially matching the PS5 solution) is one option, but as Flappy noted above, you then lose the bandwidth savings of the GPU decompression scheme.

Also, I'm not sure how well that would work with point to point memory transfers between disk and VRAM as AMD's Smart Access Storage seems to promise. Presumably it wouldn't.

For me the ideal is either an ASIC on the GPU, or GPU decompression to be truly free. And leave CPU side data to the CPU. An argument could even be made for a complimentary CPU side ASIC to handle just the CPU data but that seems like massive overkill to me in this world of many core CPU's that are almost always underutilised.

7GB/s is already more than enough. PCIe 5.0 makes that 14GB/s.. No game is going to be approaching that any time soon. Ratchet peaks at around 2GB/s in the most demanding scenario, I think. PCIe bandwidth is only so important because of the constant copying/reading that has to happen over the bus. Ideally we'd want to bypass system RAM, right? It would be nice to be able to pull that data off the drive, decompress it directly to where it needs to go.

Flappy Pannus · Aug 16, 2023

techuse said:
Fair point, it seems like we have bandwidth to spare though. Also, PCI/E 7 is coming with the next round of GPUs which will double bandwidth up to 32 GB/s each way per lane.

A little early to speculate on 7 when we don't have PCIE5 GPU's yet, and even with PCIE4 we routinely see the lanes cut in half for even $400/$500 cards now - my concern is that due to cost cutting this may even get worse in the future. PCIE-5X cards but with...4 lanes except for cards that start at $1k?

Still, as you allude to, the roadmap for PCI-E is indeed very aggressive, and at least on paper, it actually becoming a bottleneck to texture transfer speed in terms of just gb/sec will probably never materialize in the timeframe of a game actually shipping with detailed textures enough to saturate whatever is the standard at that time.

So perhaps with the general trend towards heterogenous architectures in CPU's we get dedicated decompression blocks that can handle a wide array of formats, who knows.

techuse · Aug 16, 2023

Flappy Pannus said:
A little early to speculate on 7 when we don't have PCIE5 GPU's yet, and even with PCIE4 we routinely see the lanes cut in half for even $400/$500 cards now - my concern is that due to cost cutting this may even get worse in the future. PCIE-5X cards but with...4 lanes except for cards that start at $1k?

Still, as you allude to, the roadmap for PCI-E is indeed very aggressive, and at least on paper, it actually becoming a bottleneck to texture transfer speed in terms of just gb/sec will probably never materialize in the timeframe of a game actually shipping with detailed textures enough to saturate whatever is the standard at that time.

So perhaps with the general trend towards heterogenous architectures in CPU's we get dedicated decompression blocks that can handle a wide array of formats, who knows.

Manufacturers selling handicapped GPUs is enabled by customers not having strong enough willpower to just stop buying them.

Flappy Pannus · Aug 16, 2023

techuse said:
Manufacturers selling handicapped GPUs is enabled by customers not having strong enough willpower to just stop buying them.

That approach is less effective in an era where said company can just use that fab capacity to produce GPU's that retail for 10X more, and their main competitor is also hamstringing the PCIE bus on their midrange cards as well. They're not selling well now, no indication that Nvidia is feeling particularly threatened.

The point though is that while perhaps not justified for the current cost of GPU's, there is an increased cost associated with newer PCIE standards that manufactures are passing onto consumers, and as such you can't just look at the PCIE roadmap and determine we'll have 512GB/sec bidirectional bandwidth available on consumer PC's relatively soon (esp as just the specifications for PCIE7 is expect to be finalized in 2025 - which means at best, you won't see it on motherboards until ~2027). Still, PCIE5 is more than enough for the coming years...provided the lanes aren't crippled.

arandomguy · Aug 16, 2023

pjbliverpool said:
It'd certainly be interesting to see the Re-Bar results but you would have thought Nvidia have tested whether that provides a significant benefit already and if so, would have whitelisted this game in their drivers to enable Re-Bar (I think that's how it works for them?)

I should clarify with the AMD side it wouldn't just be ReBAR but their Smart Access Memory (SAM) technology.

It was kind of glossed over after the initial buzz but at least according to AMD their Smart Access Memory is more involved than just toggling ReBAR and also includes considerations for their overall CPU platform as well (which I'm not sure if ever was deep dived into if there is a difference in behavior on Intel). This is also why I wonder if there is differing behavior with the CPU/platform used and not just SAM/ReBAR on/off as well.

Nvidia's ReBAR toggle has always had mixed results which is why they likely use the white list. This does suggest there is some limitation with respect to this area (compared to AMD's SAM solution) somewhere in Nvidia's current stack that they can't fully address. So in the case for Nvidia it may not be so simple as just enabling ReBAR.

techuse · Aug 16, 2023

Flappy Pannus said:
That approach is less effective in an era where said company can just use that fab capacity to produce GPU's that retail for 10X more, and their main competitor is also hamstringing the PCIE bus on their midrange cards as well. They're not selling well now, no indication that Nvidia is feeling particularly threatened.

The point though is that while perhaps not justified for the current cost of GPU's, there is an increased cost associated with newer PCIE standards that manufactures are passing onto consumers, and as such you can't just look at the PCIE roadmap and determine we'll have 512GB/sec bidirectional bandwidth available on consumer PC's relatively soon (esp as just the specifications for PCIE7 is expect to be finalized in 2025 - which means at best, you won't see it on motherboards until ~2027). Still, PCIE5 is more than enough for the coming years...provided the lanes aren't crippled.

Oh I wasn't singling out Nvidia. They are certainly offering more than AMD these days. There is a limited amount of demand for those super high end GPUs. If sales of the mid range GPUs for both companies completely cratered, AMD and Nvidia would have to change something. I don't believe there is nearly enough demand to allocate all that production to more expensive GPUs.

Shifty Geezer · Aug 16, 2023

Flappy Pannus said:
Then you remove one of the main benefits of GPU decompression - saving PCI/E bandwidth by transmitting the compressed assets over the bus, which you can only do if the GPU is handling the decompression.

For APU's of course, doesn't really matter.

Isn't the bottleneck SSD IO and not PCI/E bandwidth? We have those and VRAM limits. Compute is largely in abundance and not a limiting factor, so we have a decision of where to put the burden. Uncompressed assets on SSD isn't an option as we need that BW, so we have a binary choice of whether to keep the assets compressed in RAM and passed to GPU compressed, or decompressed in VRAM.

The former requires a fast bus. The latter requires more VRAM.

Edit: Fix confusing typo

Flappy Pannus · Aug 16, 2023

Shifty Geezer said:
Isn't the bottleneck SSD IO and not PCI/E bandwidth? We have those and VRAM limits. Compute is largely in abundance and not a limiting factor, so we have a decision of where to put the burden. Uncompressed assets on SSD isn't an option as we need that BW, so we have a binary choice of whether to keep the assets compressed in RAM and passed to GPU compressed, or decompressed in RAM.

The former requires a fast bus. The latter requires more VRAM.

Wait, maybe I'm reading this wrong but why does the latter require more vram? Yes, they're passed over the PCIE bus to the GPU compressed, but the GPU still needs to uncompress the textures before it can display them, hence the small 'scratchpad' in vram with DS GPU decompression for that purpose. The GPU is just 'faster' (well, theoretically) than the CPU is at this, but the gdeflate textures still need to be unpacked before being used, doesn't matter where they're stored.

Having the decompression happen on the CPU/motherboard side increases the RAM and PCI-E bandwidth demands, but not vram.

Shifty Geezer · Aug 16, 2023

Flappy Pannus said:
Having the decompression happen on the CPU/motherboard side increases the RAM and PCI-E bandwidth demands, but not vram.

Yes. It's decompression on the GPU JIT that needs more VRAM. If VRAM is scarce, decompression in RAM reduces VRAM requirement at the cost of PCIE BW.

Ah, a typo on my part. Meant to write "we have a binary choice of whether to keep the assets compressed in RAM and passed to GPU compressed, or decompressed in VRAM."

Allandor · Aug 16, 2023

Shifty Geezer said:
Yes. It's decompression on the GPU JIT that needs more VRAM. If VRAM is scarce, decompression in RAM reduces VRAM requirement at the cost of PCIE BW.

Ah, a typo on my part. Meant to write "we have a binary choice of whether to keep the assets compressed in RAM and passed to GPU compressed, or decompressed in VRAM."

Another choice would be to cache the compressed texture in vram and therefore remove the uncompressed as soon as it is no longer needed. Might decrease memory usage over time, but comes at the cost of memory bandwidth and GPU time.

The advantage of the CPU + system Memory is so far my preferred way, as there are plenty of unused resources most of the time as CPU cores increase multithreading is stagnating.
And PCIe bandwidth, ... well PCIe 5.0 is out but not used so far. This bus is not a limiting factor so far. On smaller cards, yes, but then you normally don't need the highest res assets. So I really don't see a big problem on PC side. It is more windows that limits the usage of nvme drives, but that also changes with dx storage.

function · Aug 16, 2023

Flappy Pannus said:
https://twitter.com/x/status/1690643454022361088

Increasingly seems like this is a big part of why performance with DS is problematic in places.

Raw decompression of any modern gaming card is massively above anything that R&C could possibly require, and yet the performance hit on Nvidia is anywhere between significant and enormous. It's much better on AMD in the general case, though in worst cases (1% low, 0.1% low) its can still be enormous.

So anyway, there seems to be some kind of conflict on the GPU between game rendering / game compute and DS decompression. I'm wondering if AMD's better showing across 99% of the game is due to having a hardware scheduler on the GPU, which may be able to respond much faster and in a more fine grained manner, vs Nvidia's partly CPU driven scheduling which is much further away from what's going on and may not have the same visibility or ability to rapidly and efficiently allocate work.

Just a thought, it's not I'd have any idea how to tell ... \_(ツ)_/¯

Edit: The stuff I said on Nvidia not having scheduling hardware turns out to be bunk - they do have it. There are other differences in the way things are managed on AMD vs Nvidia that could be responsible for what we see in R&C. It all seems a bit complicated.

function · Aug 16, 2023

Shifty Geezer said:
Yes. It's decompression on the GPU JIT that needs more VRAM. If VRAM is scarce, decompression in RAM reduces VRAM requirement at the cost of PCIE BW.

Ah, a typo on my part. Meant to write "we have a binary choice of whether to keep the assets compressed in RAM and passed to GPU compressed, or decompressed in VRAM."

Another complication might be the impact on performance on traffic over the PCIe bus.

If you're struggling to spare a few tens of MB's of VRAM for a decompression buffer, you might already be using some main ram as vram and trying to texture or swap textures over the PCI bus. Compressed textures coming from the SSD would take up about half the PCIe bandwidth, and therefore have a smaller knock on effect on other PCIe traffic.

The PCie bus seems to be a bit weird. Well before you've saturated its bandwidth you can sometimes already be seeing deteriorating performance due to it. I'm not sure why this is - I don't know a lot about it. Perhaps it's sensitive to traffic that interrupts a steady stream of draw calls or something.

I think there's likely to be a lot of fine tuning left to be done at the implementation level before we see exactly what the limits of PC DS 1.2 are.

Remij · Aug 20, 2023

So I wanted to have some fun and see what it would be like to run the game off of a MicroSD card which maxes out at ~30MB/s on my PC. First I tried the game with everything maxed... and uh yea.. that wasn't going to work.. but then towards the end of the first video I change only the textures from Very High to Low.. and after it reloaded everything in low textures, it was surprisingly playable.. though the crystal rift changing still took a long time.

Sorry for just the 1080p quality, I encoded them to a lower res so it wouldn't take so long to upload.

The next video was me basically playing the first level of the game with everything maxed except textures on Low.. and it was surprisingly great... right up until the portal sequence.. which is still just way too much data at once even on low. I never tried it on Very High, because it would probably crash lol.

Honestly, I'm quite surprised by it. This MicroSD card is a UHS 3 class card.. so 30MB/s... but there are VSC class cards which do 90MB/s.. I wonder how one of them would handle the game on a powerful PC? (not a Steam Deck)

Metal_Spirit · Aug 21, 2023

Remij said:
So I wanted to have some fun and see what it would be like to run the game off of a MicroSD card which maxes out at ~30MB/s on my PC. First I tried the game with everything maxed... and uh yea.. that wasn't going to work.. but then towards the end of the first video I change only the textures from Very High to Low.. and after it reloaded everything in low textures, it was surprisingly playable.. though the crystal rift changing still took a long time.

Sorry for just the 1080p quality, I encoded them to a lower res so it wouldn't take so long to upload.

The next video was me basically playing the first level of the game with everything maxed except textures on Low.. and it was surprisingly great... right up until the portal sequence.. which is still just way too much data at once even on low. I never tried it on Very High, because it would probably crash lol.

Honestly, I'm quite surprised by it. This MicroSD card is a UHS 3 class card.. so 30MB/s... but there are VSC class cards which do 90MB/s.. I wonder how one of them would handle the game on a powerful PC? (not a Steam Deck)

What are the specs on this PC, specially RAM?
And what's the point on this? The game can probably even run in a floppy disk. Once assets on RAM it should work normally, specially if there are large amounts of it. Question is, the instantaneous transitions that should make the game seamless are not there. And that's the whole concept of the game.

Remij · Aug 21, 2023

Metal_Spirit said:
What are the specs on this PC, specially RAM?
And what's the point on this? The game can probably even run in a floppy disk. Once assets on RAM it should work normally, specially if there are large amounts of it. Question is, the instantaneous transitions that should make the game seamless are not there. And that's the whole concept of the game.

32GB of RAM. The point was to see how the game would run from a MicroSD card I have on my PC. Does there need to be any other point than my own curiosity?

And I'm well aware that once data is in RAM from storage things should work normally. Also well aware that the concept of the game is instantaneous transitions and that they aren't there with the MicroSD card.

PS. No the game wouldn't work on a floppy disk because it wouldn't fit.

Metal_Spirit · Aug 21, 2023

Remij said:
32GB of RAM. The point was to see how the game would run from a MicroSD card I have on my PC. Does there need to be any other point than my own curiosity?

And I'm well aware that once data is in RAM from storage things should work normally. Also well aware that the concept of the game is instantaneous transitions and that they aren't there with the MicroSD card.

PS. No the game wouldn't work on a floppy disk because it wouldn't fit.

I also found your testing very interesting... My question was due to sheer curiosity about the reality of the testing conditions.

Also the floppy disk was just a figure of speech related to even lower data transfers...

Remij · Aug 21, 2023

Metal_Spirit said:
I also found your testing very interesting... My question was due to sheer curiosity about the reality of the testing conditions.

Also the floppy disk was just a figure of speech related to even lower data transfers...

lol I know, I'm just playing

I just wanted to see what it would be like running from the MicroSD card and how well it would handle various sequences.

With Very High textures, you could barely spin the camera around with out massive freezes.

Ratchet & Clank technical analysis *spawn

Flappy Pannus

pjbliverpool

B3D Scallywag

techuse

Remij

Flappy Pannus

techuse

Flappy Pannus

arandomguy

techuse

Shifty Geezer

uber-Troll!

Flappy Pannus

Shifty Geezer

uber-Troll!

Allandor

function

None functional

function

None functional

Remij

Metal_Spirit

Remij

Metal_Spirit

Remij

Similar threads