Ext3h
It causes the steps of 32 in "pure compute" (not an actual compute context, but compute slots in the graphics context queue!), as well as the steep increase in serialized mode. It's possible that there's additional driver-level management and construction of the queue.
Possibly, the 32 "queues" are software-defined slots for independent calls that the driver has determined the GPU can issue in parallel, possibly through a single command front end.
If running purely in compute, this seems to stair-step in timings as one would expect.
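A minimal model of that stair-step (all constants here are illustrative assumptions of mine, not measured values): if the driver can only keep 32 slots in flight per submission, and each started batch costs one round trip, total latency grows in steps of 32 dispatches.

```python
import math

# Hypothetical model: 32 software slots per batch; each started batch
# costs one fixed round trip regardless of how full it is.
SLOTS_PER_BATCH = 32
BATCH_LATENCY_US = 50  # illustrative constant, not a measured value

def total_latency_us(num_dispatches: int) -> int:
    """Latency stair-steps: every started batch of 32 costs the same."""
    batches = math.ceil(num_dispatches / SLOTS_PER_BATCH)
    return batches * BATCH_LATENCY_US

# 1..32 dispatches -> 1 batch, 33..64 -> 2 batches, and so on.
print(total_latency_us(1))   # 50
print(total_latency_us(32))  # 50
print(total_latency_us(33))  # 100
```

That flat-then-jump shape is exactly the stair-step pattern in the timings.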
In an actual compute context (as used for OpenCL and CUDA), this appears to work much better, but according to the specs, the GPU actually has an entirely different hardware queue for that mode.
It also explains the reported CPU usage when the DX12 API is hammered with async compute shaders: since the queue is rather short, containing only 32 entries, it means a lot of round trips.
On top of that, there appears to be some type of driver "bug".
Remember how I mentioned that the AMD driver appears to merge shader programs? It looks like Nvidia isn't doing that for async queues, only for the primary graphics queue. The driver could probably be optimized here, by enforcing concatenated execution of independent shaders whenever the software queue becomes overfull.
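A sketch of what such a driver-side mitigation could look like (pure speculation on my part; `Dispatch`, the queue layout, and the capacity are invented for illustration, not Nvidia's actual driver internals): once the 32-entry software queue is full, concatenate further independent dispatches into the last slot instead of stalling for a round trip.

```python
# Speculative sketch of driver-side dispatch merging; everything here is
# an assumed model for illustration, not a real driver API.
from dataclasses import dataclass, field

QUEUE_CAPACITY = 32  # assumed software queue depth

@dataclass
class Dispatch:
    shader_id: int
    depends_on: set = field(default_factory=set)  # shader_ids it must wait for

def enqueue(queue: list, pending: list) -> None:
    """Drain pending dispatches into the software queue.

    While slots are free, each dispatch gets its own slot (fully async).
    When the queue is full, merge independent dispatches into the last
    slot, enforcing concatenated execution rather than a round trip."""
    for d in pending:
        if len(queue) < QUEUE_CAPACITY:
            queue.append([d])          # one dispatch per slot: can run async
        else:
            last_ids = {x.shader_id for x in queue[-1]}
            if not (d.depends_on & last_ids):
                queue[-1].append(d)    # independent: concatenate into the slot
            else:
                raise RuntimeError("dependent dispatch needs a round trip")

queue: list = []
enqueue(queue, [Dispatch(i) for i in range(40)])
print(len(queue))      # 32 slots used
print(len(queue[-1]))  # last slot holds 9 concatenated dispatches
```

The point of the sketch: merging trades a little parallelism in the last slot for not blocking the CPU on queue drain.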
Well, at least I assume that Nvidia is actually merging draw calls in the primary queue; otherwise I couldn't explain the discrepancy between the latency measured here in sequential mode and the throughput measured in other synthetic benchmarks. Even so, it remains true that Maxwell v2 has a shorter latency than GCN when it comes to refilling the ACEs.
Good catch! That could also explain why the hardware needs to flush the entire hardware queue before switching between "real" compute and graphics mode, and also the issues we saw here in this thread with Windows killing the driver when a true compute shader program was left waiting for the compute context to become active. Either that, or some rather similar double use of some ASIC.

AMD's separate compute paths may provide a form of primitive tracking separate from the primitive tracking in the geometry pipeline.
The separate-command-list cases do seem to pipeline well enough. Perhaps there is a unified tracking system that does not readily handle ordering geometric primitives and compute primitives within the same context?
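A toy illustration of that difference (entirely my interpretation; the one-primitive-per-tick tracker model is invented for the example): a single in-order tracker serializes graphics and compute primitives within one context, while separate trackers let the two kinds retire side by side.

```python
# Toy model: each primitive takes 1 tick to retire. A unified in-order
# tracker retires one primitive per tick regardless of kind; separate
# trackers retire one "gfx" and one "compute" primitive per tick.
def unified_ticks(prims):
    return len(prims)  # strictly serialized within the context

def separate_ticks(prims):
    gfx = sum(1 for p in prims if p == "gfx")
    comp = sum(1 for p in prims if p == "compute")
    return max(gfx, comp)  # the two paths overlap

mixed = ["gfx", "compute"] * 8  # 8 of each, interleaved in one context
print(unified_ticks(mixed))   # 16
print(separate_ticks(mixed))  # 8
```

Under this (assumed) model, mixing the two kinds in one context is where the unified tracker loses, which matches what the benchmarks in this thread show.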