Why would DirectStorage hog the PCIe bus? Isn't the entire point of it to send compressed textures into VRAM to be decompressed there? Wouldn't that use less bandwidth?
DirectStorage does two things. What you're referring to is the GDeflate decompression that happens on the GPU.
The other half is shifting uploads from the 3D/compute queues to the copy queue as a hidden implementation detail. That's not just semantic sugar to keep them out of the other queues: the copy queue is backed by an ASIC that is designed to perform a transfer at exactly the full available PCIe bandwidth. A transfer scheduled on either of the other two engine types instead executes as a low-thread-count copy kernel, which leaves a lot of "scheduling bubbles" throughout the GPU's memory system along the way, and those bubbles are coincidentally what prevents the issue entirely. Anything that introduces bubbles saves your ass.
It's not that hard to try for yourself in a small synthetic benchmark. Just schedule 1GB+ of uploads on the copy engine in parallel with your regular 3D load, and see what it does to perceived execution latency. Don't even bother monitoring GPU utilization during that experiment; you will most likely hit a blind spot there too.
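A minimal sketch of that experiment, assuming Windows + D3D12 (error handling, the destination-texture case, and the parallel render loop on the DIRECT queue are all left out for brevity):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_12_0, IID_PPV_ARGS(&device));

    // Dedicated copy queue -- this is what maps to the DMA/copy engine.
    D3D12_COMMAND_QUEUE_DESC qd = { D3D12_COMMAND_LIST_TYPE_COPY };
    ComPtr<ID3D12CommandQueue> copyQueue;
    device->CreateCommandQueue(&qd, IID_PPV_ARGS(&copyQueue));

    const UINT64 SIZE = 1ull << 30; // 1 GiB payload

    auto makeBuffer = [&](D3D12_HEAP_TYPE heap, D3D12_RESOURCE_STATES state) {
        D3D12_HEAP_PROPERTIES hp = { heap };
        D3D12_RESOURCE_DESC rd = {};
        rd.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
        rd.Width = SIZE;
        rd.Height = 1; rd.DepthOrArraySize = 1; rd.MipLevels = 1;
        rd.SampleDesc.Count = 1;
        rd.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
        ComPtr<ID3D12Resource> r;
        device->CreateCommittedResource(&hp, D3D12_HEAP_FLAG_NONE, &rd,
                                        state, nullptr, IID_PPV_ARGS(&r));
        return r;
    };
    auto src = makeBuffer(D3D12_HEAP_TYPE_UPLOAD, D3D12_RESOURCE_STATE_GENERIC_READ);
    auto dst = makeBuffer(D3D12_HEAP_TYPE_DEFAULT, D3D12_RESOURCE_STATE_COMMON);

    ComPtr<ID3D12CommandAllocator> alloc;
    ComPtr<ID3D12GraphicsCommandList> list;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COPY, IID_PPV_ARGS(&alloc));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COPY, alloc.Get(),
                              nullptr, IID_PPV_ARGS(&list));
    list->CopyBufferRegion(dst.Get(), 0, src.Get(), 0, SIZE);
    list->Close();

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    // Kick the transfer and time submission-to-completion via the fence.
    ID3D12CommandList* lists[] = { list.Get() };
    copyQueue->ExecuteCommandLists(1, lists);
    copyQueue->Signal(fence.Get(), 1);
    fence->SetEventOnCompletion(1, evt);
    WaitForSingleObject(evt, INFINITE);
    QueryPerformanceCounter(&t1);

    double s = double(t1.QuadPart - t0.QuadPart) / freq.QuadPart;
    printf("copy queue: %.2f GB/s\n", (SIZE / 1e9) / s);
    // While this runs, submit your normal frame on the DIRECT queue and
    // watch frame time -- that's the "perceived execution latency" above.
}
```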
Looking at Spider-Man 2... PCIe Rx and Tx never exceed 8GB/s. There's plenty of bandwidth there.
Unexpected information, but not surprising. It indicates that asset streaming was deliberately throttled, which is one of the legit workarounds.
But it's also possibly constrained by level/engine design. You only hit PCIe limits for an extended duration if a whole bulk of assets is already resident in RAM. If streaming is constrained by the disk instead, all is fine...
Does this GPU have 8GB of VRAM?
Tested on a 7900XT with 20GB of VRAM. The additional VRAM didn't really change much about the frequency of (usually too-late) asset streaming.
IMO the issue is simply the decompression happening on the GPU cores stalling the engine.
Quite unlikely. Even the DirectStorage demo never resulted in more than one decompression kernel effectively running concurrently; that scheduling detail is locked away in the implementation of DirectStorage. And the API design of DirectStorage doesn't really support that assumption either: while you do have the option to await a GPU-side fence, that fence spans the entire duration from disk activity to decompression (see the sketch below). Visibly delayed assets pretty much rule out that this has happened.
If you instead mean shader utilization introducing a stall: yes, that's happening. But that only introduces a gradual slowdown, not a full stall of >100ms.
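For reference, this is roughly what a GDeflate request looks like and what that one fence covers. A sketch assuming DirectStorage 1.1+ (GDeflate support), an existing device and destination buffer; "assets.bin", the offsets, and the sizes are placeholders:

```cpp
#include <dstorage.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void LoadCompressedAsset(ID3D12Device* device, ID3D12Resource* dstBuffer,
                         UINT32 compressedSize, UINT32 uncompressedSize)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"assets.bin", IID_PPV_ARGS(&file)); // placeholder path

    DSTORAGE_QUEUE_DESC qd = {};
    qd.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    qd.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    qd.Priority   = DSTORAGE_PRIORITY_NORMAL;
    qd.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&qd, IID_PPV_ARGS(&queue));

    // One request: read compressed bytes from disk, decompress on the GPU.
    DSTORAGE_REQUEST req = {};
    req.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
    req.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    req.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
    req.Source.File.Source          = file.Get();
    req.Source.File.Offset          = 0;
    req.Source.File.Size            = compressedSize;   // bytes on disk
    req.Destination.Buffer.Resource = dstBuffer;
    req.Destination.Buffer.Offset   = 0;
    req.Destination.Buffer.Size     = uncompressedSize;
    req.UncompressedSize            = uncompressedSize;
    queue->EnqueueRequest(&req);

    // The one fence you get: it signals only after the ENTIRE pipeline --
    // NVMe read, upload across PCIe, GPU GDeflate decompression -- has
    // finished. There is no separate signal for "decompression started".
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    queue->EnqueueSignal(fence.Get(), 1);
    queue->Submit();
    // Wait on 'fence' (CPU event or GPU queue Wait) before using the asset.
}
```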