Bandwidth between CPU and VRAM is a tiny fraction of that between CPU and local RAM, which is the reason CPU and GPU have their own local RAM pools in the first place.
On a typical Zen 2 system you're likely to be looking at around 51.2GB/s between the CPU and system RAM. CPU to GPU bandwidth is about 14-15GB/s in each direction over PCIe 3.0, and over PCIe 4.0 it's double that, so around 30GB/s in each direction.
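For reference, a rough sketch of where those figures come from (assuming DDR4-3200 in dual channel for the Zen 2 system and x16 links to the GPU; real-world throughput lands a bit under these theoretical peaks):

```python
# Back-of-envelope peak bandwidth figures (theoretical maxima; real-world
# numbers come in somewhat lower due to protocol overhead).

# Dual-channel DDR4-3200: 3200 MT/s * 8 bytes per transfer * 2 channels
ddr4_3200_dual = 3200e6 * 8 * 2 / 1e9           # ~51.2 GB/s

# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding, 16 lanes
pcie3_x16 = 8e9 * (128 / 130) / 8 * 16 / 1e9    # ~15.8 GB/s each way

# PCIe 4.0 doubles the per-lane rate to 16 GT/s
pcie4_x16 = 16e9 * (128 / 130) / 8 * 16 / 1e9   # ~31.5 GB/s each way

print(f"DDR4-3200 dual channel:      {ddr4_3200_dual:.1f} GB/s")
print(f"PCIe 3.0 x16 (per direction): {pcie3_x16:.1f} GB/s")
print(f"PCIe 4.0 x16 (per direction): {pcie4_x16:.1f} GB/s")
```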
So I wouldn't really describe it as a tiny fraction. Granted, latency will be rubbish compared to local memory, but we're not talking about the CPU rendering out of VRAM here. We're talking about the time to copy what is likely a few hundred MB of game data from where it is decompressed in VRAM to where it needs to be in main RAM for the CPU to work on it (assuming that's even how DirectStorage/RTX-IO works, which is far from given). And all of this is done at a loading screen, so we're talking about timescales in full seconds, not the microseconds of latency added by working over a PCIe bus rather than from local memory.
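To put rough numbers on that point (a sketch, assuming ~300MB of CPU-side data needs to cross the bus and using the per-direction rates above):

```python
# Time to copy a few hundred MB of decompressed CPU-side data across the
# CPU<->GPU link, versus the latency cost of going over PCIe at all.

payload_gb = 0.3          # assume ~300 MB of CPU-side game data
pcie3_gbps = 15.0         # ~GB/s per direction, PCIe 3.0 x16
pcie4_gbps = 30.0         # ~GB/s per direction, PCIe 4.0 x16

print(f"Copy over PCIe 3.0: {payload_gb / pcie3_gbps * 1000:.0f} ms")  # ~20 ms
print(f"Copy over PCIe 4.0: {payload_gb / pcie4_gbps * 1000:.0f} ms")  # ~10 ms

# PCIe round-trip latency is on the order of a microsecond per transaction:
# noise next to the tens of milliseconds of transfer time above, and both
# are noise next to a multi-second loading screen.
```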
Tell me how you propose splitting the PCIe resource and I'll math it up for you. You're presumably wanting as many lanes moving raw data as fast as possible to the GPU for decompression, while reserving enough to carry the decompressed data back to main RAM, and enough for all the other devices that rely on low-latency PCIe bus access to function, like audio and networking.
You have 4 lanes coming from the SSD to the GPU via the SSD<->CPU link and then the CPU<->GPU PCIe link, so essentially 4 lanes' worth of your x16 CPU->GPU bandwidth are taken up by that. You still have all 16 lanes going back from GPU to CPU to move any data that needs to end up in system RAM. Given that data is now decompressed and thus potentially twice as large as when it came over, those 16 lanes should still be double what you need to keep up with the maximum speed from the SSD into the GPU. Not that you're likely to need anything like that maximum speed, as the data required by the CPU would only be a small proportion of the total data streaming in from the SSD. MS say 80% of streamed game data is textures, so at the very most you're only looking at 20% of what you stream to the GPU having to come back over the x16 PCIe link into main memory.
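A quick sketch of that proportion argument (assumptions on my part: a PCIe 4.0 x4 NVMe drive at roughly 7GB/s, 2:1 compression, and the 20% CPU-bound worst case implied by the MS textures figure):

```python
# How much of the x16 return path (GPU -> CPU/system RAM) would actually be
# needed, assuming the worst case of 20% of streamed data being CPU data.

ssd_raw_gbps = 7.0            # assumed ~GB/s peak for a PCIe 4.0 x4 NVMe drive
compression_ratio = 2.0       # decompressed data is ~2x the size on disk
cpu_share = 0.20              # MS: ~80% of streamed data is textures

decompressed_rate = ssd_raw_gbps * compression_ratio   # ~14 GB/s out of the decompressor
return_traffic = decompressed_rate * cpu_share         # ~2.8 GB/s back to system RAM

pcie4_x16_gbps = 30.0
print(f"Return traffic needed:          {return_traffic:.1f} GB/s")
print(f"Share of x16 return bandwidth:  {return_traffic / pcie4_x16_gbps:.0%}")  # ~9%
```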
As an example, let's say at game load you need to pull 10GB off the SSD and into memory: 2GB of that is for the CPU and 8GB is for the GPU. To keep things simple, let's say you have a 5GB/s SSD with an effective throughput of 10GB/s with compression.
Provided you load and decompress the CPU data first, that will be in VRAM and decompressed within the first 0.4 seconds. You then push it back over the CPU<->GPU PCIe link (4GB of it now, once decompressed) at a rate of ~30GB/s, so it takes about 0.13 seconds to get that decompressed data into system RAM. Meanwhile you're still spending the next 1.6 seconds bringing the remainder of the GPU data into VRAM from the SSD.
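Laying that timeline out explicitly (same numbers as the example: 5GB/s raw SSD reads, 2:1 compression, ~30GB/s for the return trip over a PCIe 4.0 x16 link):

```python
# Rough timeline for the 10GB load example: 2GB (compressed) of CPU data,
# 8GB (compressed) of GPU data, decompressed on the GPU at 2:1.

ssd_raw_gbps = 5.0        # raw SSD read speed
compression = 2.0         # 10 GB/s effective with compression
pcie_return_gbps = 30.0   # PCIe 4.0 x16, GPU -> system RAM

cpu_compressed_gb = 2.0
gpu_compressed_gb = 8.0

t_cpu_load = cpu_compressed_gb / ssd_raw_gbps                         # 0.4 s to pull in the CPU data
t_cpu_copyback = cpu_compressed_gb * compression / pcie_return_gbps   # ~0.13 s to copy 4GB back to system RAM
t_gpu_load = gpu_compressed_gb / ssd_raw_gbps                         # 1.6 s for the rest of the GPU data

print(f"CPU data off SSD:         {t_cpu_load:.2f} s")
print(f"Copy back to system RAM:  {t_cpu_copyback:.2f} s (overlaps with the GPU stream)")
print(f"Remaining GPU data:       {t_gpu_load:.2f} s")
print(f"Total load (SSD-bound):   {t_cpu_load + t_gpu_load:.2f} s")
```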
So I'm not seeing how the PCIe bridge between CPU and GPU acts as a bottleneck in any way in this scenario. Even if you couldn't transfer the CPU data from the SSD first and push it back in parallel with streaming the remainder of the GPU data, you're still only adding at worst 0.13 seconds to your 2 second timeframe.
A congested PCIe bus will be terrible for audio and networking. PCIe has a maximum theoretical bandwidth, but you can only approach it if you're willing to sacrifice low-latency priority devices.
You're doing this at a load/transition screen, so why would PCIe traffic between the CPU and GPU be heavily utilised by audio and networking at that point? More to the point, why would that impact the CPU<->GPU PCIe link at all? Those functions sit on the south bridge, which has its own separate PCIe link to the CPU.