Async texture streaming

Let's discuss a typical problem which can mess up the timing of your application significantly, without any technical need for it: texture streaming and similar tasks.

Since we got the shiny Vulkan and DX12 APIs, we suddenly have better control over memory transfers, in the form of copy engines / queues being exposed as individual entities.

While this sounds like an improvement (mostly because you no longer accidentally perform a full device launch for a copy which then stalls on PCIe transfers), the implementations are not quite there yet.

What theoretically works is using multiple queues/engines with different priorities: as long as the lower-priority work is nicely sliced into small packages, you are able to express overlapping transfers with different real-time requirements. (Compare the upload of per-frame data vs. e.g. streaming of textures which may be bound "when done".) Except that neither API actually mandates that priority-based scheduling is implemented. Whoops, so it's still up to the developer to *guess* how much idle time there is on the copy queue, and to schedule manually, on a per-frame basis.
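To make that concrete, here is a minimal D3D12 sketch of what the API lets you express: a high-priority copy queue for per-frame uploads next to a normal-priority one for streaming. Keep in mind the point above that nothing forces the driver to actually schedule by these priorities; the queue split and names are just illustrative.

```cpp
// Sketch only: two D3D12 copy queues with different priorities.
// The API lets you express this, but drivers are not required to
// implement priority-based scheduling between them.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateCopyQueues(ID3D12Device* device,
                      ComPtr<ID3D12CommandQueue>& perFrameCopyQueue,
                      ComPtr<ID3D12CommandQueue>& streamingCopyQueue)
{
    // High priority: small per-frame uploads (constants, dynamic buffers).
    D3D12_COMMAND_QUEUE_DESC hi = {};
    hi.Type     = D3D12_COMMAND_LIST_TYPE_COPY;
    hi.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    device->CreateCommandQueue(&hi, IID_PPV_ARGS(&perFrameCopyQueue));

    // Normal priority: bulk texture streaming, sliced into small batches
    // so it can (in theory) be preempted between submissions.
    D3D12_COMMAND_QUEUE_DESC lo = {};
    lo.Type     = D3D12_COMMAND_LIST_TYPE_COPY;
    lo.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    device->CreateCommandQueue(&lo, IID_PPV_ARGS(&streamingCopyQueue));
}
```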

The next issue? PCIe is effectively full-duplex. Using an API like CUDA, up- and downloads are even scheduled to two different units, so you get to actually use that available bandwidth, with all the associated benefits to effective latency. (In case you are wondering: With CUDA, it's the memory uploads which got a dedicated engine. Downloads blend in with copy tasks from the 3D APIs.)
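For comparison, a minimal sketch using the CUDA runtime API: with pinned host memory and two streams, an upload and a download can be in flight at the same time, which is what lets you treat the PCIe link as full-duplex. Buffer names and sizes are placeholders.

```cpp
// Sketch: overlapping an upload and a download with the CUDA runtime API.
// With pinned host memory and separate streams, H2D and D2H copies can run
// concurrently on separate copy engines, using PCIe in both directions.
#include <cuda_runtime.h>

void FullDuplexCopy(void* devUploadDst, const void* devReadbackSrc,
                    size_t uploadBytes, size_t readbackBytes)
{
    void* hostUpload   = nullptr;
    void* hostReadback = nullptr;
    cudaMallocHost(&hostUpload, uploadBytes);     // pinned, required for truly async copies
    cudaMallocHost(&hostReadback, readbackBytes);

    cudaStream_t upStream, downStream;
    cudaStreamCreate(&upStream);
    cudaStreamCreate(&downStream);

    // Both transfers are enqueued without serializing against each other.
    cudaMemcpyAsync(devUploadDst, hostUpload, uploadBytes,
                    cudaMemcpyHostToDevice, upStream);
    cudaMemcpyAsync(hostReadback, devReadbackSrc, readbackBytes,
                    cudaMemcpyDeviceToHost, downStream);

    cudaStreamSynchronize(upStream);
    cudaStreamSynchronize(downStream);

    cudaStreamDestroy(upStream);
    cudaStreamDestroy(downStream);
    cudaFreeHost(hostUpload);
    cudaFreeHost(hostReadback);
}
```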

If we once again compare this to D3D12 / Vulkan, some quick testing shows that all of the work just ends up on the same engine with the current implementations on both Nvidia and AMD hardware. It doesn't matter whether you try to batch transfers by direction or limit format conversions; targeting the copy queue alone effectively limits you to half-duplex, and bulk uploads will add significant latency to even the smallest downloads.
 
Is the PCIe bus a significant bottleneck for texture streaming?
Depends on what you stream. Multi-GPU (via PCIe) or any form of video content means you hit that bottleneck quickly. With ideas like "let's do GI and other view-independent shared content on a render farm and stream it in CPU-friendly codec X", this is bound to get even worse. Add any form of bulk feedback passes from GPU to CPU on top of that, as long as you can't treat the PCIe link as the full-duplex connection it actually is.

Even if the PCIe itself isn't the bottleneck, you are also competing for memory bandwidth on the host. Which is (in theory) plenty, but can also be temporarily exhausted, in which case you get additional stalls in memory transfers. That's just additional jitter though. Unless you start doing something crazy like processing a huge working set with a bandwidth-bound algorithm on 8+ threads on a modern platform (completely obscure example, it's not like anyone out there would combine 8 cores + SMT with a tiny dual-channel DDR4 setup?...), in which case that jitter gets excessive, or rather quickly turns into reduced throughput.

So if, due to limitations in the API implementations, you can't reliably schedule work in a way which achieves a full duty cycle, at least not without effectively implementing preemption in user space (which is a bad idea for a lot of reasons, seriously), you are already required to keep utilization low enough that jitter doesn't propagate to frame times.

From practical experience with an application which does combine all of that (multi-GPU, bandwidth-heavy CPU load and read-back), I pretty much found that you can't really utilize more than 30-40% of the theoretical peak PCIe half-duplex bandwidth before things go haywire. (Going haywire in the sense of massive jitter, and on top of that watchdog violations...) Which does occasionally become a bottleneck even with PCIe 3.0 x16, and is close to unusable with e.g. the PCIe 2.0 x8 you still encounter at times (especially in multi-GPU setups...).
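In practice, that manual scheduling tends to boil down to something like the following rough sketch (plain C++, all structures and numbers are placeholders): cap streaming copies to a per-frame byte budget derived from the ~30-40% utilization figure above, and spill the rest to later frames.

```cpp
// Rough sketch of a per-frame streaming budget. The 30-40% figure comes
// from the discussion above; the structures are placeholders.
#include <cstddef>
#include <deque>

struct PendingCopy { /* staging offset, destination, ... */ size_t bytes; };

class StreamingThrottle {
public:
    // e.g. PCIe 3.0 x16 is roughly 16 GB/s peak; at 60 fps and ~35% target
    // utilization that works out to roughly 90 MB of streaming copies per frame.
    explicit StreamingThrottle(size_t bytesPerFrame) : budget_(bytesPerFrame) {}

    void Enqueue(PendingCopy copy) { queue_.push_back(copy); }

    // Called once per frame: submit copies until the budget is spent,
    // leave the rest for later frames instead of saturating the engine.
    template <typename SubmitFn>
    void Flush(SubmitFn&& submit) {
        size_t spent = 0;
        while (!queue_.empty() && spent + queue_.front().bytes <= budget_) {
            spent += queue_.front().bytes;
            submit(queue_.front());
            queue_.pop_front();
        }
    }

private:
    size_t budget_;
    std::deque<PendingCopy> queue_;
};
```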
 
Search for the DX12 implementation in The Division; they use 3 copy queues to manage texture streaming.
https://developer.download.nvidia.c...gular/GDC17/DX12CaseStudies_GDC2017_FINAL.pdf
Except that if you take a look at their statement as to why they used multiple queues ("it eases thread synchronization"), it becomes clear that they did not even expect any performance gains from multiple queues; they just used them to relax the ordering constraints on submission vs. execution.

It still maps to the same copy engine instance on "real" submission to the GPU. The only difference is that if one of your software-side queues stalls on work submitted to a different queue, the other queues can still advance. You should not expect any fair scheduling from such an approach when nearing 100% utilization of the copy engine.
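For illustration, a hedged sketch of that setup: several D3D12 copy queues, each paired with its own fence, so CPU-side waits and submissions on one queue never serialize against the others, even though the hardware still executes everything on a single copy engine.

```cpp
// Sketch: several software-side copy queues, each with its own fence.
// A wait on one queue's fence does not block submission on the others,
// which is the synchronization convenience described above; the driver
// still feeds a single copy engine underneath.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

struct CopyQueueSlot {
    ComPtr<ID3D12CommandQueue> queue;
    ComPtr<ID3D12Fence>        fence;
    UINT64                     lastSignaled = 0;
};

void CreateCopyQueueSlots(ID3D12Device* device, CopyQueueSlot slots[3])
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
    for (int i = 0; i < 3; ++i) {
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&slots[i].queue));
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&slots[i].fence));
    }
}

// After submitting command lists on slot i:
//   slots[i].queue->Signal(slots[i].fence.Get(), ++slots[i].lastSignaled);
// A consumer that only depends on slot i waits on that fence alone, so a
// long-running batch on another slot never stalls it on the CPU side.
```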
 
Streaming systems usually deal with devices that are many orders of magnitude slower than RAM, i.e. mechanical or even optical drives. I guess it comes down to "schedule it far enough in advance that it will be ready by the time you need the data".

Other examples come to mind. The Far Cry guys detailed how they manage texture streaming and LODs in the Dunia Engine for Far Cry 4. It assumes 30 fps and allows far LODs to load over a couple of frames.

The "megatextures" system for the original RAGE rellied on decompressing assets on the GPU (CUDA/OpenCL at the time when compute shaders were not ubiquitous), maybe that's you answer if you are hitting the limits of 16x PCIe gen3.

I don't know if the Coalition guys ever detailed the streaming system for Gears 4, but they do some clever load/eviction on textures where you can see textures being swapped if you are VRAM limited; yet that all happens without any hitches in frame rate. And they seem to be using the higher tiers (as in higher than FL_11) of Tiled Resources.

Or maybe test your app on an NVLink-enabled system to see how it behaves.
 