Async texture streaming

Discussion in 'Rendering Technology and APIs' started by Ext3h, Apr 24, 2019.

  1. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    Let's discuss a typical problem which can mess up the timing of your application significantly, without any technical need for it: texture streaming, and similar tasks.

    Since we got the shiny Vulkan and DX12 APIs, we suddenly have better control over memory transfers, in the form of copy engines / queues being exposed as individual entities.

    While this sounds like an improvement (mostly because you no longer end up accidentally performing a full device launch for a copy which is stalling on PCIe transfers), the implementations are not quite there yet.

    What theoretically works is using multiple queues/engines with different priorities, so that as long as the lower-priority work is nicely sliced into small packages, you are able to express overlapping transfers with different real-time requirements. (Compare upload of per-frame data vs. e.g. streaming of textures which may be bound "when done".) Except that neither API actually mandates that priority-based scheduling is implemented. Whoops, so it's still up to the developer to *guess* how much idle time there is on the copy queue, and schedule manually, on a per-frame basis.
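
    For reference, requesting different priorities is trivial on the API side; the part that is not guaranteed is whether the scheduler actually honours the hint. A minimal D3D12 sketch (the helper is hypothetical, error handling omitted):

    Code:
        #include <d3d12.h>
        #include <wrl/client.h>

        using Microsoft::WRL::ComPtr;

        // Hypothetical helper: one high-priority copy queue for per-frame
        // uploads, one normal-priority queue for background texture streaming.
        // Priority is only a hint; neither D3D12 nor Vulkan mandates that the
        // scheduler honours it, which is exactly the problem described above.
        void CreateCopyQueues(ID3D12Device* device,
                              ComPtr<ID3D12CommandQueue>& urgentCopyQueue,
                              ComPtr<ID3D12CommandQueue>& streamingCopyQueue)
        {
            D3D12_COMMAND_QUEUE_DESC desc = {};
            desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;

            desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
            device->CreateCommandQueue(&desc, IID_PPV_ARGS(&urgentCopyQueue));

            desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
            device->CreateCommandQueue(&desc, IID_PPV_ARGS(&streamingCopyQueue));
        }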

    The next issue? PCIe is effectively full-duplex. Using an API like CUDA, up- and downloads are even scheduled to two different units, so you get to actually use that available bandwidth, with all the associated benefits to effective latency. (In case you are wondering: With CUDA, it's the memory uploads which got a dedicated engine. Downloads blend in with copy tasks from the 3D APIs.)
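
    For comparison, this is all it takes to get both directions in flight at once with the CUDA runtime: with pinned host memory, H2D and D2H copies issued on separate streams end up on separate DMA engines on hardware that has them, so PCIe actually runs full-duplex. Rough sketch, no error checking:

    Code:
        #include <cuda_runtime.h>

        int main() {
            const size_t bytes = 64ull << 20;  // 64 MiB each way

            // Pinned host buffers, so the copies can be truly asynchronous.
            void *hUp, *hDown, *dUp, *dDown;
            cudaMallocHost(&hUp, bytes);
            cudaMallocHost(&hDown, bytes);
            cudaMalloc(&dUp, bytes);
            cudaMalloc(&dDown, bytes);

            cudaStream_t upload, download;
            cudaStreamCreate(&upload);
            cudaStreamCreate(&download);

            // Issued on different streams, the two directions overlap instead
            // of serializing on a single copy engine.
            cudaMemcpyAsync(dUp, hUp, bytes, cudaMemcpyHostToDevice, upload);
            cudaMemcpyAsync(hDown, dDown, bytes, cudaMemcpyDeviceToHost, download);

            cudaStreamSynchronize(upload);
            cudaStreamSynchronize(download);
            return 0;
        }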

    If we once again compare this to D3D12 / Vulkan, some quick testing shows that all of the work just ends up on the same engine with the current implementations on Nvidia and AMD hardware. It doesn't matter whether you try to batch transfers by direction or limit format conversions; targeting the copy queue alone effectively limits you to half-duplex, and bulk uploads will add significant latencies to even the smallest downloads.
     
    #1 Ext3h, Apr 24, 2019
    Last edited: Apr 24, 2019
    digitalwanderer, Heinrich4 and BRiT like this.
  2. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,106
    Likes Received:
    883
    Location:
    still camping with a mauler
    Is the PCIe bus a significant bottleneck for texture streaming?
     
    digitalwanderer likes this.
  3. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    Depends on what you stream. Multi-GPU (via PCIe) or any form of video content means you hit that bottleneck quickly. With ideas like "let's do GI and other view-independent shared content on a render farm and stream it in CPU-friendly codec X", this is bound to get even worse. Add on top of that any form of bulk feedback passes from GPU to CPU, for as long as you can't treat PCIe as the full-duplex link it actually is.

    Even if the PCIe link itself isn't the bottleneck, you are also competing for memory bandwidth on the host. Which is (in theory) plenty, but can also be temporarily exhausted, in which case you get additional stalls in memory transfers. That's just additional jitter, though. Unless you start doing something crazy like processing a huge working set with a bandwidth-bound algorithm on 8+ threads on a modern platform (completely obscure example, it's not like anyone out there would combine 8 cores + SMT with a tiny dual-channel DDR4?...), in which case that jitter gets excessive, or rather quickly turns into reduced throughput.

    So if, due to limitations in the API implementations, you can't reliably schedule work in a way that achieves a full duty cycle (at least not without effectively implementing preemption in user space, which is a bad idea for a lot of reasons, seriously), you are already forced to keep utilization low enough that the jitter doesn't propagate to frame times.

    From practical experience with an application which does combine multi-GPU, bandwidth-heavy CPU load and read-back, I pretty much found that you can't really utilize more than 30-40% of the theoretical peak PCIe half-duplex bandwidth before things go haywire. (Going haywire in the sense of massive jitter, and on top of that watchdog violations...) That does occasionally become a bottleneck even with PCIe 3.0 x16, and is close to unusable with e.g. the PCIe 2.0 x8 you still encounter at times (especially in multi-GPU setups...).
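
    In practice that means slicing the streaming uploads into small chunks and capping the bytes submitted per frame. A hypothetical sketch of such a throttle (class name and numbers are made up; the 0.35 factor corresponds to the 30-40% figure above):

    Code:
        #include <cstddef>
        #include <cstdint>
        #include <deque>

        struct UploadChunk { const void* src; uint64_t dstOffset; size_t bytes; };

        // Submit queued chunks until the per-frame byte budget is exhausted,
        // then stop, so the copy engine keeps enough idle time to absorb jitter.
        class StreamingThrottle {
        public:
            StreamingThrottle(double pciePeakBytesPerSec, double targetFrameSec)
                : budgetPerFrame_(static_cast<size_t>(pciePeakBytesPerSec * targetFrameSec * 0.35)) {}

            void enqueue(const UploadChunk& c) { pending_.push_back(c); }

            // Called once per frame; 'submit' wraps whatever copy-queue recording
            // the engine actually uses (D3D12 copy list, vkCmdCopyBuffer, ...).
            template <typename SubmitFn>
            void flush(SubmitFn&& submit) {
                size_t spent = 0;
                while (!pending_.empty() && spent + pending_.front().bytes <= budgetPerFrame_) {
                    spent += pending_.front().bytes;
                    submit(pending_.front());
                    pending_.pop_front();
                }
            }

        private:
            size_t budgetPerFrame_;
            std::deque<UploadChunk> pending_;
        };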
     
    #3 Ext3h, Apr 26, 2019
    Last edited: Apr 26, 2019
    digitalwanderer and BRiT like this.
  4. doompc

    Joined:
    Mar 19, 2015
    Messages:
    7
    Likes Received:
    6
  5. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    Except that if you take a look at their statement as to why they used multiple queues ("it eases thread synchronization"), it becomes clear that they did not even expect any performance gains from multiple queues; they just used them to relax the ordering constraints on submission vs. execution.

    It still maps to the same copy engine instance on "real" submission to the GPU. The only difference is that if one of your software-side queues stalls on work submitted to a different queue, the other queues can still advance. You should not expect any fair scheduling from such an approach when nearing 100% utilization of the copy engine.
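
    In D3D12 terms the benefit is roughly the following (sketch only, fence and command list setup omitted): a GPU-side Wait() on one copy queue does not prevent another copy queue's lists from being picked up, even though both ultimately feed the same engine.

    Code:
        #include <d3d12.h>

        // 'listsA' depend on graphics work guarded by 'gfxFence'; 'listsB' do not.
        void SubmitStreamingWork(ID3D12CommandQueue* copyQueueA,
                                 ID3D12CommandQueue* copyQueueB,
                                 ID3D12Fence* gfxFence, UINT64 frameFenceValue,
                                 UINT numA, ID3D12CommandList* const* listsA,
                                 UINT numB, ID3D12CommandList* const* listsB)
        {
            // Queue A stalls (GPU-side) until the fence is signalled; queue B's
            // work can still be scheduled in the meantime. Both still execute on
            // the same copy engine, so this relaxes ordering, nothing more.
            copyQueueA->Wait(gfxFence, frameFenceValue);
            copyQueueA->ExecuteCommandLists(numA, listsA);

            copyQueueB->ExecuteCommandLists(numB, listsB);
        }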
     
    #5 Ext3h, Apr 28, 2019
    Last edited: Apr 28, 2019
  6. doompc

    Joined:
    Mar 19, 2015
    Messages:
    7
    Likes Received:
    6
    Streaming systems usually deal with devices that are many orders of magnitude slower than RAM, i.e. mechanical or even optical drives. I guess it comes down to "schedule it far enough in advance that it will be ready by the time you need the data".

    Other examples come to mind. The Far Cry guys detailed how they manage texture streaming and LODs in the Dunia Engine for Far Cry 4. It assumes 30 fps and allows far LODs to load over a couple of frames.


    The "megatextures" system for the original RAGE rellied on decompressing assets on the GPU (CUDA/OpenCL at the time when compute shaders were not ubiquitous), maybe that's you answer if you are hitting the limits of 16x PCIe gen3.

    I don't know if the Coalition guys ever detailed the streaming system for Gears 4, but they do some clever load/eviction on textures where you can see textures being swapped if you are VRAM limited, yet it all happens without any hitches in framerate. And they seem to be using higher tiers (as in higher than FL_11) of Tiled Resources.
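
    For reference (not claiming this is what Gears 4 actually does), the tiled-resources path boils down to remapping 64 KB tiles of a reserved texture onto a heap, so changing residency of individual mip tiles never needs a full-texture copy. A sketch of mapping a single tile:

    Code:
        #include <d3d12.h>

        // Resource/heap creation omitted; 'reservedTexture' must have been
        // created via CreateReservedResource.
        void MapOneTile(ID3D12CommandQueue* queue, ID3D12Resource* reservedTexture,
                        ID3D12Heap* tilePoolHeap, UINT heapTileOffset,
                        UINT tileX, UINT tileY, UINT subresource)
        {
            D3D12_TILED_RESOURCE_COORDINATE coord = {};
            coord.X = tileX;
            coord.Y = tileY;
            coord.Subresource = subresource;

            D3D12_TILE_REGION_SIZE region = {};
            region.NumTiles = 1;

            // Using D3D12_TILE_RANGE_FLAG_NULL instead would unmap (evict) the tile.
            D3D12_TILE_RANGE_FLAGS rangeFlags = D3D12_TILE_RANGE_FLAG_NONE;
            UINT rangeTileCount = 1;

            queue->UpdateTileMappings(reservedTexture, 1, &coord, &region,
                                      tilePoolHeap, 1, &rangeFlags,
                                      &heapTileOffset, &rangeTileCount,
                                      D3D12_TILE_MAPPING_FLAG_NONE);
        }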

    Or maybe test your app on an NVLink-enabled system to see how it behaves.
     
    BRiT likes this.