Let's discuss a typical problem which can significantly mess up the timing of your application without any technical need for it: texture streaming, and similar tasks.
Since we got the shiny Vulkan and DX12 APIs, we suddenly have better control over memory transfers, in the form of copy engines / queues being exposed as individual entities.
While this sounds like an improvement (mostly because you no longer accidentally perform a full device launch for a copy that then stalls on PCIe transfers), the implementations are not quite there yet.
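For reference, a minimal sketch of what that exposure looks like in D3D12 (the Vulkan equivalent is a queue family exposing only VK_QUEUE_TRANSFER_BIT); CreateCopyQueue is just an illustrative helper name, not an API call:

```cpp
// Minimal sketch (D3D12): the copy engine is exposed as its own queue type,
// so a transfer no longer has to ride along on the graphics queue.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandQueue> CreateCopyQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COPY;            // dedicated copy engine
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue)); // error handling omitted
    return queue;
}
```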
What theoretically works is using multiple queues/engines with different priorities: as long as the lower-priority work is nicely sliced into small packages, you are able to express overlapping transfers with different real-time requirements. (Compare the upload of per-frame data vs. e.g. streaming of textures which may be bound "when done".) Except that neither API actually mandates that priority-based scheduling is implemented. Whoops, so it's still up to the developer to *guess* how much idle time there is on the copy queue, and to schedule manually, on a per-frame basis.
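A minimal sketch of how you'd express that intent in Vulkan, assuming transferFamilyIndex was found beforehand via vkGetPhysicalDeviceQueueFamilyProperties; the catch described above is that pQueuePriorities is only a hint, and the spec doesn't mandate the driver actually schedules by it:

```cpp
// Minimal sketch (Vulkan): request two queues from the same transfer-capable
// family with different priorities, to be filled into VkDeviceCreateInfo.
// NOTE: priorities are hints only; drivers are not required to preempt or
// interleave based on them.
#include <vulkan/vulkan.h>

VkDeviceQueueCreateInfo MakeTransferQueues(uint32_t transferFamilyIndex)
{
    // 1.0 for latency-critical per-frame uploads,
    // 0.1 for background streaming sliced into small packages.
    static const float priorities[2] = { 1.0f, 0.1f };

    VkDeviceQueueCreateInfo info = {};
    info.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    info.queueFamilyIndex = transferFamilyIndex;
    info.queueCount       = 2;
    info.pQueuePriorities = priorities;
    return info;
}
```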
The next issue? PCIe is effectively full-duplex. Using an API like CUDA, uploads and downloads are even scheduled to two different units, so you actually get to use that available bandwidth, with all the associated benefits to effective latency. (In case you are wondering: with CUDA, it's the memory uploads which get a dedicated engine. Downloads blend in with copy tasks from the 3D APIs.)
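A minimal CUDA sketch of what that looks like from the host side; FullDuplexCopy is an illustrative helper, and the host buffers are assumed to be pinned (allocated via cudaMallocHost), since pageable memory would serialize the copies:

```cpp
// Minimal sketch (CUDA): issuing the upload and the download on two separate
// streams lets hardware with dual copy engines drive both directions of the
// full-duplex PCIe link concurrently.
#include <cuda_runtime.h>

void FullDuplexCopy(void* dDst, const void* hSrc,   // upload (host -> device)
                    void* hDst, const void* dSrc,   // download (device -> host)
                    size_t bytes)
{
    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // Both transfers are in flight at once; on dual-copy-engine hardware
    // they overlap on the wire instead of queueing behind each other.
    cudaMemcpyAsync(dDst, hSrc, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(hDst, dSrc, bytes, cudaMemcpyDeviceToHost, down);

    cudaStreamSynchronize(up);
    cudaStreamSynchronize(down);
    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
}
```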
If we once again compare this to D3D12 / Vulkan, some quick testing shows that all of the work just ends up on the same engine with the current implementations on Nvidia and AMD hardware. It doesn't matter whether you try to batch transfers by direction or limit format conversions: targeting the copy queue alone effectively limits you to half-duplex, and bulk uploads will add significant latency to even the smallest downloads.
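A rough sketch of the kind of test meant here, with illustrative names; the command lists are assumed to be pre-recorded and already batched by direction:

```cpp
// Minimal sketch (D3D12) of the experiment described above: batch transfers
// by direction onto two separate COPY queues and time the download with a
// fence. On the drivers tested, both queues still map to the same engine,
// so the download's fence only signals after the bulk upload has drained.
#include <d3d12.h>

void SubmitByDirection(ID3D12CommandQueue* uploadQueue,
                       ID3D12CommandQueue* downloadQueue,
                       ID3D12CommandList* const* uploadLists,   UINT numUploads,
                       ID3D12CommandList* const* downloadLists, UINT numDownloads,
                       ID3D12Fence* fence, UINT64 fenceValue)
{
    // Bulk uploads on one copy queue...
    uploadQueue->ExecuteCommandLists(numUploads, uploadLists);

    // ...and a tiny readback on the other. Ideally these would run
    // full-duplex; in practice the readback waits behind the uploads.
    downloadQueue->ExecuteCommandLists(numDownloads, downloadLists);
    downloadQueue->Signal(fence, fenceValue);  // measures end-to-end latency
}
```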