What time (and what's the session number) for your talk?
I can't figure out how to link the session on the IDF site :S In any case, it's GVCS004 - 3D Optimization for Intel Graphics, Gen9, at 9:30-10:30 AM in Room 2007.
Without taking away from the usefulness of other DX12 features, I'm surprised by how little asynchronous compute/shaders (aka "multi-engine" in Microsoft terms) is mentioned in this thread.
I think that's just because this is a "feature levels" thread and as noted, all DX12 implementations must support asynchronous queues. From a developer perspective, the capability is not hidden behind a cap because it's supported everywhere, and that's great. There are tons of great DX12 features "in general" (I'll point to execute indirect as another) that are supported everywhere.
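To make the "no cap required" point concrete, here's a minimal sketch of creating separate direct and compute queues on a D3D12 device (error handling elided; assumes the standard d3d12 headers on Windows, and the function name is just illustrative):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// No cap check needed: any D3D12 device can create direct, compute and copy queues.
void CreateQueues()
{
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    D3D12_COMMAND_QUEUE_DESC desc = {};            // Priority/Flags/NodeMask defaults
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // "async compute" queue
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}
```

Whether work submitted to those two queues actually overlaps on the hardware is, of course, exactly the implementation-dependent part being discussed below.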
Incidentally, there are several other threads that are entirely dedicated to async compute, so I don't think we need to derail this one too much.
Multi-Engine efficiency will vary across devices, but unfortunately there is no cap or feature level to indicate the level of support.
A "cap" for that sort of thing is highly problematic for a number of reasons. How much you gain from using multiple queues depends not only on the hardware architecture, but on the nature of the workload itself. For instance, it's obviously not going to be much faster to "asynchronously" run a sampler-heavy task alongside another sampler-heavy task regardless of the hardware, since both are contending for the same units. On the hardware side it gets far trickier... depending on the architecture, a given workload might already map efficiently onto the majority of the machine, or there could be constraints that prevent that, leaving much of the machine idle. Even different SKUs can behave very differently here, with wider ones typically needing more explicit parallelism.
It's quite similar to CPUs though - the ideal is always to mortgage as little parallelism as possible while still filling the machine, as there are always parallelization overheads (in this case largely the additional synchronization and scheduling). Unfortunately there's no easy way to know "how much" parallelism an implementation needs, and I don't think a simple caps bit could express that given the inherent complexities.
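The CPU analogy can be made concrete with a toy sketch (nothing here is from any graphics API; `parallel_sum` is just an illustrative name). Splitting a fixed amount of work across more threads gives the same answer, but every extra thread is "mortgaged" parallelism: more spawn/join, scheduling, and synchronization cost for the same total work.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Sum 0..n-1 using `workers` threads (workers >= 1). Each thread accumulates
// privately and synchronizes once at the end; the spawn/join and atomic
// traffic are the parallelization overhead you pay to fill the machine.
uint64_t parallel_sum(uint64_t n, unsigned workers) {
    std::atomic<uint64_t> total{0};
    std::vector<std::thread> pool;
    uint64_t chunk = (n + workers - 1) / workers;  // split work evenly
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            uint64_t begin = w * chunk;
            uint64_t end = std::min(n, begin + chunk);
            uint64_t local = 0;                    // accumulate privately...
            for (uint64_t i = begin; i < end; ++i) local += i;
            total += local;                        // ...sync once at the end
        });
    }
    for (auto& t : pool) t.join();                 // join/scheduling overhead
    return total;
}
```

The result is identical at any thread count; only the overhead changes, which is exactly why "how much parallelism do I need to fill this machine" doesn't reduce to a caps bit.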
(on AMD GCN this is a wavefront).
Not to get too off-topic, but can a single CU be running both compute and 3D at once? I thought some resources were shared with the 3D pipe, such that the "granularity" of switching between 3D and compute is at least a whole CU, right? Otherwise is the smem just sitting around doing nothing while 3D is running? :O
Obviously for pure compute workloads most (all?) implementations can mix multiple kernels at HW-thread/"wavefront" granularity.