Ratchet & Clank technical analysis *spawn

More Ratchet & Clank RT analysis from ComputerBase: with PS5 RT settings, the 4080 is 20% faster than the 7900 XTX, and the 3080 is 25% faster than the 6800 XT. However, the 3080 runs out of memory at 4K and becomes unplayable.

With max RT settings, the 4080 becomes 40% faster than the 7900 XTX, while the 3080 is 50% faster than the 6800 XT, and again the 3080 runs out of VRAM at 4K.

 
Basically, unless DirectStorage can be made to work better, PC might be a big problem for the generation.

...potentially. This game got so much attention because it was supposedly the 'stress test' for texture streaming, and it turns out that even without DS CPU decompression being utilized (which could still reduce overhead from standard Windows I/O calls, afaik), the game - while not 'perfect' - is quite manageable for even midrange CPUs to handle.

Ideal? Of course not - again, there's some opaqueness here about how Radeon is handling this generally so much better (CPU? GPU?) - but it's certainly some egg on Nvidia's face considering how much they hyped RTX IO, and the first game to actually utilize it shows a performance regression instead of a benefit.

Who knows what will come down the pike in the next few years that will demand more of I/O, but it's a question of what developer effort is necessary to adapt engines to perform more loading up-front and take advantage of 32+ GB of RAM, I guess, and/or to have PC versions ship with more CPU-friendly texture compression formats and, hence, larger install sizes. Both are reasonable workarounds IMO if DS GPU decompression isn't up to snuff, at least until we potentially get some fixed-function hardware on next-gen GPUs that doesn't require sharing rendering resources.
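
To put rough numbers on that install-size tradeoff, here's a tiny back-of-envelope sketch; the 120 GB raw-asset figure and both compression ratios are pure placeholders for illustration, not measurements from any real title.

```cpp
// Back-of-envelope for the install-size tradeoff mentioned above. The raw
// footprint and compression ratios are placeholders, not measured figures.
#include <cstdio>

int main() {
    const double uncompressed_assets_gb = 120.0; // assumed raw asset footprint

    struct Scheme { const char* name; double ratio; };
    const Scheme schemes[] = {
        {"GPU-friendly (GDeflate-like, assumed 2.0:1)", 2.0},
        {"CPU-friendly (lighter codec, assumed 1.5:1)", 1.5},
        {"Uncompressed / preloaded into RAM",           1.0},
    };

    for (const Scheme& s : schemes)
        std::printf("%-45s -> %6.1f GB on disk\n",
                    s.name, uncompressed_assets_gb / s.ratio);
}
```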

That all presumes the bottleneck here is indeed GPU resource contention though, and not the less direct path compressed textures have to take to get into vram on the PC vs. consoles.
 
People are very quick to form conclusions based on a single test case... There's another DirectStorage game, you know... and THAT game, Forspoken, loads faster on PC than it does on console.


MAYBE the issue is in the implementation, and how far Nixxes were willing to go to incorporate it into the game? This was a game built specifically for PlayStation, being adapted to PC... not built with PC in mind. Discussion is great, and I think it's fair to say that some of us expected different results than what we're seeing with this title specifically... but it's too early.
 
That's interesting, although you would expect that when PCIe bandwidth is the constraint. More interesting still, though, is that it would seem to support the conclusion that AMD is using GPU decompression. So it really does seem like they don't suffer the same general performance drop-off as Nvidia. Perhaps RTX IO is just not as efficient as AMD's implementation, despite all the hype.

That would be funny, considering this compression format was developed by Nvidia lol.

Wonder if it has something to do with AMD having a better async compute implementation?

The theory I'm still circling back to and would like explored is with respect to Resizable BAR, which in theory could have an impact, as it is directly related to how data is moved across the PCIe bus into GPU VRAM.

I guess what would need to be looked at first is the impact on AMD GPUs with ReBAR both on and off. If there is a difference with it on (better), this could mean that DS GPU decompression is impacted by ReBAR, which has the implication that Nvidia (at least its current stack) would face complications.

Have there been any tests done on Intel's Arc GPUs? Especially on an Intel/Intel platform? For all we know, the GPU + CPU/platform vendor combination may also have some impact here.
 
That all presumes the bottleneck here is indeed GPU resource contention though, and not the less direct path compressed textures have to take to get into vram on the PC vs. consoles.

While I can see the split memory pool copies as potentially being responsible (or contributing to) for the loading time bottleneck, and/or stuttering during the portal sequence, I can't see that having any impact on the general performance decrease with Direct Storage as the data still needs to take the same route (in uncompressed form at that) with Direct Storage off where performance is higher.

The theory I'm still circling back to and would like explored is with respect to Resizable BAR, which in theory could have an impact, as it is directly related to how data is moved across the PCIe bus into GPU VRAM.

I guess what would need to be looked at first is the impact on AMD GPUs with ReBAR both on and off. If there is a difference with it on (better), this could mean that DS GPU decompression is impacted by ReBAR, which has the implication that Nvidia (at least its current stack) would face complications.

Have there been any tests done on Intel's Arc GPUs? Especially on an Intel/Intel platform? For all we know, the GPU + CPU/platform vendor combination may also have some impact here.

It'd certainly be interesting to see the ReBAR results, but you would have thought Nvidia would have already tested whether that provides a significant benefit and, if so, whitelisted this game in their drivers to enable ReBAR (I think that's how it works for them?).

Regarding Arc, CapFrameX tested it and found it also loses performance with DirectStorage enabled. It's posted here somewhere but I can't find it right now, so I'm not sure whether that was on an Intel platform or not.

Isn't Smart Access Storage AMD's branding equivalent to RTX IO?

Based on what we've seen of this so far (which is pretty sketchy), it looks to go beyond RTX IO in that it completely bypasses system memory for the data transfers. It looks like it might use point-to-point DMA to transfer data directly from the drive into VRAM. I don't think it's in use yet though.
 
Just another two cents on why DirectStorage might be slower: it might just have to do with decompression being done on the GPU, which uses the GPU and memory bandwidth on the graphics card. When done on the CPU side, more PCIe bandwidth might be used, but the assets are only copied over once and do not need extra processing by the GPU. Also, main memory is normally not used as much, so there is some "free bandwidth" for decompression.
As long as the CPU and PCIe interface do not limit anything (e.g. low-end CPUs + a PCIe 4x interface), not using DirectStorage should always be faster.
Maybe with one exception: small reads should be much faster with DS. So far it does not seem that R&C is using minimal asset sizes to only load the things it needs. We are still far away from games using this kind of streaming.
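
To make that reasoning concrete, here's a minimal back-of-envelope model of the two paths; the batch size, the 2:1 ratio and both bandwidth figures are assumptions for illustration, not measurements.

```cpp
// Rough model of the two paths discussed above (all figures are assumptions,
// not measurements): how many bytes cross the PCIe bus, and how much extra
// GPU memory traffic decompression generates for a given asset batch.
#include <cstdio>

int main() {
    const double asset_uncompressed_gb = 4.0;   // assumed batch of textures
    const double compression_ratio     = 2.0;   // assumed GDeflate-style 2:1
    const double pcie_gbps             = 26.0;  // ~PCIe 4.0 x16 usable
    const double vram_gbps             = 700.0; // placeholder GPU bandwidth

    const double compressed_gb = asset_uncompressed_gb / compression_ratio;

    // Path 1: CPU decompression -> uncompressed copy over PCIe, no GPU work.
    double pcie_time_cpu = asset_uncompressed_gb / pcie_gbps;

    // Path 2: GPU decompression -> compressed copy over PCIe, then the GPU
    // reads the compressed data and writes the uncompressed result to VRAM.
    double pcie_time_gpu = compressed_gb / pcie_gbps;
    double vram_traffic  = compressed_gb + asset_uncompressed_gb; // read + write
    double vram_time     = vram_traffic / vram_gbps;

    std::printf("CPU path: %.2f GB over PCIe (%.0f ms), no extra VRAM traffic\n",
                asset_uncompressed_gb, pcie_time_cpu * 1000.0);
    std::printf("GPU path: %.2f GB over PCIe (%.0f ms), %.2f GB VRAM traffic (%.1f ms)\n",
                compressed_gb, pcie_time_gpu * 1000.0, vram_traffic, vram_time * 1000.0);
}
```

Real numbers obviously depend on the title, the ratio, and how much of that VRAM traffic overlaps with rendering, but it shows why the two paths stress different parts of the system.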
 
Isn't Smart Access Storage AMD's branding equivalent to RTX IO?

Not quite sure. AMD has a suite of technologies that will work with Direct Storage.


AMD is so vague in its announcement that it's hard to determine exactly what SAS entails. AMD basically stated that SAS, SAM, Radeon GPU Asset Decompression and other platform tech will be offered to allow AMD GPUs more direct access to storage and to offload work from the CPU. When MS announced DS 1.1 with GPU decompression, AMD publicly stated that they had worked with MS to support DS 1.1 and that they had released metacommand drivers to ISVs, but made no reference to SAS.

It's hard to discern whether GPU decompression falls under SAS or is considered a separate tech.
 
Just another two cents on why DirectStorage might be slower: it might just have to do with decompression being done on the GPU, which uses the GPU and memory bandwidth on the graphics card. When done on the CPU side, more PCIe bandwidth might be used, but the assets are only copied over once and do not need extra processing by the GPU. Also, main memory is normally not used as much, so there is some "free bandwidth" for decompression.
As long as the CPU and PCIe interface do not limit anything (e.g. low-end CPUs + a PCIe 4x interface), not using DirectStorage should always be faster.
Maybe with one exception: small reads should be much faster with DS. So far it does not seem that R&C is using minimal asset sizes to only load the things it needs. We are still far away from games using this kind of streaming.


DS GPU decompression is faster than DS CPU decompression, which is faster than DS disabled, especially with large textures (3+ GB) when benchmarking relatively simple scenes. AMD's demo showed decompressing large textures eating 10-20% of GPU performance. So what does that look like in actual games? The impact may noticeably hit frame rates even though, technically, the loading of textures is faster. Or loading may be negatively affected as decompression competes for shaders with other GPU workloads. It may take talented devs to mitigate the impact of either issue, limiting the utility offered by DS to a subset of titles.
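
As a rough illustration of what that 10-20% figure could mean for frame rates, assuming (and it is only an assumption) that the decompression cost simply steals that share of GPU time during heavy streaming:

```cpp
// Illustrative only: what a 10-20% GPU-side decompression cost (the figure
// quoted from AMD's demo above) does to frame rate during heavy streaming.
#include <cstdio>

int main() {
    const double base_fps[]     = {120.0, 90.0, 60.0};
    const double decomp_share[] = {0.10, 0.20}; // assumed share of GPU time

    for (double fps : base_fps) {
        double frame_ms = 1000.0 / fps;
        for (double share : decomp_share) {
            // If decompression steals `share` of the GPU, rendering gets the rest.
            double effective_fps = fps * (1.0 - share);
            std::printf("%3.0f fps baseline, %2.0f%% decompression -> ~%5.1f fps (%.2f ms -> %.2f ms)\n",
                        fps, share * 100.0, effective_fps,
                        frame_ms, 1000.0 / effective_fps);
        }
    }
}
```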

Ideally, to me, it makes more sense to offload decompression from both the CPU and GPU, but neither Nvidia, AMD nor Intel seems too excited about the idea of adding ASIC decompressors to their GPUs. Maybe they want to see wide support of DS before committing transistors to such a feature.
 
Ideally, to me, it makes more sense to offload decompression from both the CPU and GPU, but neither Nvidia, AMD nor Intel seems too excited about the idea of adding ASIC decompressors to their GPUs. Maybe they want to see wide support of DS before committing transistors to such a feature.

But this would require compressed CPU data to be passed from system memory over the PCIe bus to the GPU, decompressed there, and then passed back over the PCIe bus in uncompressed form to system memory. That's a lot of extra steps and wasted bandwidth just to offload the decompression of CPU data off the CPU, which probably isn't that burdensome anyway given that it will be far smaller than the GPU-destined data.
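
A quick sketch of the extra traffic being described here, with made-up payload sizes and an assumed 2:1 ratio:

```cpp
// Sketch of the extra PCIe traffic described above if CPU-destined data were
// routed through a GPU-side decompressor. Sizes are made up for illustration.
#include <cstdio>

int main() {
    const double cpu_data_compressed_gb   = 0.5; // assumed CPU-destined payload
    const double cpu_data_uncompressed_gb = 1.0; // assumed 2:1 ratio

    // Decompress on the CPU: the data never crosses the GPU's PCIe link.
    double pcie_cpu_path = 0.0;

    // Decompress on the GPU: compressed data goes out to the GPU, and the
    // uncompressed result comes back to system memory.
    double pcie_gpu_path = cpu_data_compressed_gb + cpu_data_uncompressed_gb;

    std::printf("CPU-side decompression: %.2f GB over PCIe\n", pcie_cpu_path);
    std::printf("GPU-side round trip:    %.2f GB over PCIe\n", pcie_gpu_path);
}
```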

I'd say the current design is pretty much optimal for the PC architecture, but if DirectStorage can't sort out its performance issues, then a dedicated ASIC on the GPU would be an improvement. That said, AMD doesn't seem to experience the same performance loss as NV, so hopefully this is just a matter of driver optimisation, as per the Compusemble quote above.
 
But this would require compressed CPU data to be passed from system memory over the PCIe bus to the GPU, decompressed there, and then passed back over the PCIe bus in uncompressed form to system memory. That's a lot of extra steps and wasted bandwidth just to offload the decompression of CPU data off the CPU, which probably isn't that burdensome anyway given that it will be far smaller than the GPU-destined data.

I'd say the current design is pretty much optimal for the PC architecture, but if DirectStorage can't sort out its performance issues, then a dedicated ASIC on the GPU would be an improvement. That said, AMD doesn't seem to experience the same performance loss as NV, so hopefully this is just a matter of driver optimisation, as per the Compusemble quote above.

No. Only data destined for the GPU needs to be serviced by decompressors on the GPU. Compressed data destined for the CPU can be decompressed there. It's a scheme that the current DS supports.

[Attached diagram illustrating the scheme]

The issue with data decompression on the GPU is that it eats up VRAM, bandwidth and shader resources.
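
Something like the following is what I picture for that split; it's purely a conceptual sketch with hypothetical types and stub decompressors, not the actual DirectStorage API:

```cpp
// Hypothetical routing sketch (not the DirectStorage API): only GPU-destined
// requests are handed to the GPU decompressor; CPU-destined requests are
// decompressed on the CPU, so nothing makes a pointless PCIe round trip.
#include <cstdio>
#include <string>
#include <vector>

enum class Destination { CpuMemory, GpuBuffer };

struct Request {
    std::string name;
    Destination dest;
    size_t compressedBytes;
};

// Stand-ins for the real decompression back ends.
void decompressOnCpu(const Request& r) {
    std::printf("CPU decompress: %-14s (%zu bytes)\n", r.name.c_str(), r.compressedBytes);
}
void decompressOnGpu(const Request& r) {
    std::printf("GPU decompress: %-14s (%zu bytes)\n", r.name.c_str(), r.compressedBytes);
}

int main() {
    std::vector<Request> batch = {
        {"texture_atlas", Destination::GpuBuffer, 8u << 20},
        {"mesh_lods",     Destination::GpuBuffer, 2u << 20},
        {"collision",     Destination::CpuMemory, 1u << 20},
        {"audio_bank",    Destination::CpuMemory, 4u << 20},
    };

    for (const Request& r : batch) {
        if (r.dest == Destination::GpuBuffer)
            decompressOnGpu(r);   // payload stays on the GPU side of the bus
        else
            decompressOnCpu(r);   // payload never crosses to the GPU and back
    }
}
```

The real runtime presumably batches and schedules this very differently; the point is only that nothing CPU-destined needs to cross the bus twice.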
 
Ideally, to me, it makes more sense to offload decompression from both the CPU and GPU, but neither Nvidia, AMD nor Intel seems too excited about the idea of adding ASIC decompressors to their GPUs. Maybe they want to see wide support of DS before committing transistors to such a feature.

Hard to even propose that before, when we didn't have a standard for those supposed ASICs to work against. Now we do: it's GDeflate. I think fixed-function decompression blocks would be prudent in Lovelace/RDNA4, but I guess we'll see.
 
No. Only data destined for the GPU needs to be serviced by decompressors on the GPU. Compressed data destined for the CPU can be decompressed there. It's a scheme that the current DS supports.

[Attached diagram illustrating the scheme]

The issue with data decompression on the GPU is that it eats up VRAM, bandwidth and shader resources.

I understand this, but the post I was responding to was proposing using an ASIC on the GPU to offload both GPU and CPU decompression. So in that case you would have to pass the CPU data back and forth across the PCIe bus. The diagram you have posted above is what my post is already referring to as the optimal solution for PC, minus perhaps an ASIC on the GPU to handle GPU data decompression. And of course, if something like AMD's Smart Access Storage were to become mainstream and vendor-agnostic, then there would be additional benefits to be had from transferring the GPU data directly into VRAM, bypassing system memory altogether.
 
Hard to even propose that before, when we didn't have a standard for those supposed ASICs to work against. Now we do: it's GDeflate. I think fixed-function decompression blocks would be prudent in Lovelace/RDNA4, but I guess we'll see.

GDeflate is a compression scheme that's designed around the parallel nature of GPUs. It isn't necessary for an ASIC, though. There would still be an issue with algorithm standardization, but there is nothing stopping AMD, Nvidia and Intel from offering a solution that supports multiple algorithms, other than transistor cost.
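
For illustration, here's a toy version of that tile-parallel idea. It uses run-length encoding as a stand-in codec, since GDeflate itself is Deflate-based with (as I understand it) independently decodable tiles of roughly 64 KiB of uncompressed data; only the structure is meant to be representative.

```cpp
// Toy illustration of tile-parallel decompression in the spirit of GDeflate:
// the stream is split into independently decodable tiles, so each tile can be
// decompressed by a separate worker. RLE stands in for the real codec.
#include <cstdio>
#include <future>
#include <utility>
#include <vector>

struct Tile { std::vector<std::pair<unsigned char, unsigned char>> runs; }; // (value, run length)

std::vector<unsigned char> decompressTile(const Tile& t) {
    std::vector<unsigned char> out;
    for (auto [value, count] : t.runs)
        out.insert(out.end(), count, value);
    return out;
}

int main() {
    // Four independently decodable tiles (GDeflate's tiles hold ~64 KiB of
    // uncompressed data each; these are tiny purely for readability).
    std::vector<Tile> tiles = {
        {{{'A', 5}, {'B', 3}}},
        {{{'C', 4}}},
        {{{'D', 2}, {'E', 6}}},
        {{{'F', 1}, {'G', 7}}},
    };

    // One worker per tile; no tile depends on another tile's output.
    std::vector<std::future<std::vector<unsigned char>>> workers;
    for (const Tile& t : tiles)
        workers.push_back(std::async(std::launch::async, decompressTile, std::cref(t)));

    // Reassemble the output in tile order.
    std::vector<unsigned char> result;
    for (auto& w : workers) {
        std::vector<unsigned char> part = w.get();
        result.insert(result.end(), part.begin(), part.end());
    }
    std::printf("decompressed %zu bytes: %.*s\n", result.size(),
                static_cast<int>(result.size()),
                reinterpret_cast<const char*>(result.data()));
}
```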

I understand this, but the post I was responding to was proposing using an ASIC on the GPU to offload both GPU and CPU decompression. So in that case you would have to pass the CPU data back and forth across the PCIe bus. The diagram you have posted above is what my post is already referring to as the optimal solution for PC, minus perhaps an ASIC on the GPU to handle GPU data decompression. And of course, if something like AMD's Smart Access Storage were to become mainstream and vendor-agnostic, then there would be additional benefits to be had from transferring the GPU data directly into VRAM, bypassing system memory altogether.

My bad. I was referring to any solution where the CPU or GPU is used for decompressing GPU-destined data. The problem with GPU decompression is that even with direct access you need two buffers, multiple reads, writes or copies, and a round trip through the GPU hardware before the data makes it to its final destination and is ready to be used for rendering.
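
A hypothetical accounting of that chain, just to show how the bytes add up; the buffer names, sizes and steps are assumptions, and the real path depends on the runtime and driver:

```cpp
// Sketch of the "two buffers and a round trip" point above: count the bytes
// touched on the way to the final resource. The steps and sizes here are
// hypothetical; whether the last copy is needed depends on whether the
// decompressor can write straight into the destination resource.
#include <cstdio>

int main() {
    const double compressed_mb   = 64.0;  // assumed compressed asset
    const double uncompressed_mb = 128.0; // assumed 2:1 ratio

    double traffic = 0.0;
    auto step = [&](const char* what, double mb) {
        traffic += mb;
        std::printf("%-48s %6.1f MB\n", what, mb);
    };

    step("upload compressed data into a staging buffer",   compressed_mb);
    step("GPU reads the staging buffer for decompression", compressed_mb);
    step("GPU writes the uncompressed output buffer",      uncompressed_mb);
    step("copy the output into the final resource",        uncompressed_mb * 2); // read + write
    std::printf("%-48s %6.1f MB\n", "total bytes moved for one asset", traffic);
}
```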

Both AMD (cited above with the 10-20% figure) and Intel note the impact that GPU decompression can have on framerates.


GPU decompression naturally competes for resources with rendering. Ideally, the work is complementary—that is, if decompression is memory-bound, and rendering is compute-bound, then decompression could be essentially free. In practice, the experience will depend on differences in platforms and software. For example, Expanse’s frame rate is not typically affected by GPU decompression, nor does frame rate noticeably affect bandwidth. However, in a targeted benchmark mode that tests the platform I/O performance, the frame rate is measurably affected by GPU decompression. Early data and investigation of this tension in real workloads is extremely interesting, and will feed into hardware and software roadmaps for the years to come.
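
A toy model of that "complementary workloads" point, with entirely made-up per-frame numbers:

```cpp
// Toy model of the "complementary workloads" point in the Intel quote above:
// if decompression is memory-bound while rendering is compute-bound, and the
// two can overlap, the decompression cost can largely hide. Numbers are made up.
#include <algorithm>
#include <cstdio>

int main() {
    const double render_compute_ms = 14.0; // assumed compute-bound rendering work
    const double render_memory_ms  =  8.0; // assumed memory traffic for rendering
    const double decomp_memory_ms  =  4.0; // assumed memory-bound decompression

    // Fully serialized: decompression simply adds to the frame.
    double serialized = std::max(render_compute_ms, render_memory_ms) + decomp_memory_ms;

    // Perfect overlap: the frame is limited by whichever unit is busiest.
    double overlapped = std::max(render_compute_ms, render_memory_ms + decomp_memory_ms);

    std::printf("serialized: %.1f ms/frame (%.1f fps)\n", serialized, 1000.0 / serialized);
    std::printf("overlapped: %.1f ms/frame (%.1f fps)\n", overlapped, 1000.0 / overlapped);
}
```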
 
GDeflate is a compression scheme that's designed around the parallel nature of GPUs. It isn't necessary for an ASIC, though. There would still be an issue with algorithm standardization, but there is nothing stopping AMD, Nvidia and Intel from offering a solution that supports multiple algorithms, other than transistor cost.

What is that generally though - how much die space does the PS5's solution take? From what I remember it's quite small.
 
What is that generally though - how much die space does the PS5's solution take? From what I remember it's quite small.

I don't know. It's definitely less than the cost of the CPU cores needed to accommodate the same level of decompression and I/O management, but that's not saying a lot. LOL. But it can't be too big, as I don't recall anyone identifying the blocks on the SoCs of the consoles, even though we know from layout images that the FPUs on the PS5 were cut down.
 
What I want is a dedicated DirectStorage PCIe card with 1 or more NVMe slots, with a dedicated I/O + decompression block capable of decompressing data directly to RAM or VRAM depending on the asset... freeing up the CPU/GPU almost completely.

Yes, I would prefer the data to be decompressed and sent over PCIe directly to where it needs to go, instead of sending compressed data to system RAM and then on to the GPU.

I think the solution needs to be on the motherboard itself.
 
The motherboard would seem the best candidate for adding a decompression block, considering CPUs and GPUs are already hitting power and transistor extremes.
 