How is the appropriate context determined? The same code is being interpreted by GCN as compute and by Maxwell as graphics.
Is there a flag that needs to be set, marking the code as compute? Or is it something the driver/hardware determines on its own?
The context is defined when the queues the commands are sent to are created.
However, how the graphics system categorizes them internally shouldn't matter to the API, as long as they behave appropriately.
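In D3D12 terms, the "flag" is effectively the command list type the queue is created with; a minimal sketch (error handling omitted):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // The API-visible context is fixed here: a COMPUTE queue only accepts
    // compute/copy command lists, a DIRECT queue accepts graphics work too.
    // How the driver maps either type onto hardware engines is its business.
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // explicit compute context
    // desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT; // graphics context instead

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
}
```

Whether the driver then routes a DIRECT queue's dispatches through its graphics or compute front end is invisible at this level.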
Perhaps there's a wrinkle in that behavior which caused the timestamps not to work in Nvidia's compute queue?
https://forum.beyond3d.com/posts/1869354/
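For reference, this is roughly how those timestamps are taken on a compute queue (a sketch; `device`, `queue`, a compute `cmdList`, and a `readbackBuffer` are assumed to exist from earlier setup):

```cpp
// Bracket a dispatch with two GPU timestamps and resolve them to a buffer.
D3D12_QUERY_HEAP_DESC qhDesc = {};
qhDesc.Type  = D3D12_QUERY_HEAP_TYPE_TIMESTAMP;
qhDesc.Count = 2;
ComPtr<ID3D12QueryHeap> queryHeap;
device->CreateQueryHeap(&qhDesc, IID_PPV_ARGS(&queryHeap));

cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 0);
cmdList->Dispatch(64, 1, 1);
cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 1);
cmdList->ResolveQueryData(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP,
                          0, 2, readbackBuffer.Get(), 0);

// Ticks-to-seconds conversion; this call fails on a queue that doesn't
// support timestamps, which is one way such a wrinkle could surface.
UINT64 ticksPerSecond = 0;
queue->GetTimestampFrequency(&ticksPerSecond);
```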
That may actually be working properly, as long as the driver can assume that the work items in the queue are independent.
That assumption seems safe, since DX12's compute queue is explicitly asynchronous outside of programmer-defined synchronization points, and independent user contexts in virtualized graphics products would likewise be independent by definition.
With its latest IP, AMD has built significant hardware management for both scenarios, whereas Nvidia's implementations appear to lean more on software.
IMHO, the awful performance when enforcing serial execution points to a lack of dependency management in the hardware queue, which would force a roundtrip to the CPU between every single step.
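To illustrate, enforcing serial execution in the benchmark sense looks something like the following (a sketch; `device`, `queue`, and a per-step `cmdLists` array with `kNumSteps` entries are assumed; `CreateEvent`/`WaitForSingleObject` come from `<windows.h>`):

```cpp
// Strictly serialize N dispatch steps by fencing and waiting on the CPU
// after each one. If the hardware queue had its own dependency tracking,
// none of these CPU waits would be necessary.
ComPtr<ID3D12Fence> fence;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);

for (UINT64 step = 1; step <= kNumSteps; ++step) {
    ID3D12CommandList* lists[] = { cmdLists[step - 1].Get() };
    queue->ExecuteCommandLists(1, lists);
    queue->Signal(fence.Get(), step);
    if (fence->GetCompletedValue() < step) {
        fence->SetEventOnCompletion(step, done);
        WaitForSingleObject(done, INFINITE); // full CPU roundtrip per step
    }
}
```

Even when the wait is expressed GPU-side via `ID3D12CommandQueue::Wait`, a driver without hardware dependency management could plausibly fall back to something equivalent to this internally.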
It's possible that there's more driver-level management and construction of the queue.
The 32 "queues" may be software-defined slots holding calls the driver has determined to be independent, which the GPU can then issue in parallel, possibly through a single command front end.
When running purely compute, the timings do seem to stair-step as one would expect.
This is unlike GCN, where "serial execution" doesn't appear to actually mean serial. It's possible that the driver only enforces the memory order in that case, still pushing otherwise conflicting jobs to the ACEs and using the scheduling capabilities of the hardware. That could also explain the better performance when enforcing "serial" execution: the optimizer may now treat subsequent invocations as dependent, and may therefore even concatenate threads, which ultimately reduces register usage.
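"Only enforcing the memory order" would correspond roughly to a UAV barrier between dependent dispatches, rather than a full fence roundtrip (a sketch, assuming `cmdList` and a UAV resource `uavBuffer`):

```cpp
// A UAV barrier only guarantees that the first dispatch's writes are
// visible to the second; the scheduler remains free to overlap any work
// it can prove independent.
cmdList->Dispatch(64, 1, 1);                 // producer

D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
barrier.UAV.pResource = uavBuffer.Get();     // nullptr would mean "all UAVs"
cmdList->ResourceBarrier(1, &barrier);

cmdList->Dispatch(64, 1, 1);                 // consumer sees producer's writes
```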
It's a long shot, but it might be that Nvidia's GPUs have no support for inter-shader semaphores while operating in a graphics context.
AMD's separate compute paths may provide a form of primitive tracking distinct from that of the geometry pipeline.
It does seem like the separate command-list cases can pipeline well enough. Perhaps there is a unified tracking system that does not readily handle the ordering of geometry primitives and compute primitives within the same context?