What would this measure, since the threads are not independent at a lane level, but must execute on SIMD hardware?
The capabilities of hardware and software to actually handle interleaved execution efficiently, rather than just activating formerly unused SIMD units. The goal isn't just to issue at least one wavefront to each SIMD - it is to achieve full utilization even in the presence of pipeline stalls and other limiting factors.
With the exception of dependencies between queued commands visible to the API, what do these other elements measure as far as how freely the queues can issue past each other?
Individually: Nothing, at least nothing you couldn't read from the specs.
When issued concurrently: The actual possible gains from async scheduling.
The entire idea behind async compute is to increase utilization by interleaving different workloads, which lets you fully utilize the GPU even without having optimized each individual shader for the pipeline of every GPU architecture.
However, there are many limitations which can prevent actual concurrent execution at a per-SIMD level, let alone efficient execution: register file usage, L1 and L2 cache misses, shared command paths and so on. So two workloads can't be assumed suitable for concurrent execution by default; each pair actually needs to be tested individually.
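To make the register-file limit concrete, here is a rough sketch of the well-known GCN occupancy calculation (the 256-VGPR-per-lane register file and the 10-wavefront cap are GCN figures; the function name is ours). A register-hungry kernel leaves no room on a SIMD for wavefronts from a second workload, no matter what the queues do:

```python
def waves_per_simd(vgprs_per_thread, vgpr_budget=256, max_waves=10):
    """Rough GCN occupancy sketch: each SIMD's register file holds
    256 VGPRs per lane, and at most 10 wavefronts can be resident.
    A kernel using N VGPRs per thread allows floor(256/N) resident
    wavefronts, capped at 10."""
    return min(max_waves, vgpr_budget // vgprs_per_thread)

# A lean kernel (24 VGPRs) leaves headroom for interleaving a
# second workload; a register-hungry one (128 VGPRs) barely fits
# two wavefronts of its own.
print(waves_per_simd(24))   # 10
print(waves_per_simd(128))  # 2
```

This is only one of the limits mentioned above - cache pressure and shared command paths don't show up in a static calculation like this at all, which is exactly why pairs of workloads have to be measured.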
The multi-queue version of the test did not reveal a major difference in the behavior of GCN, although I didn't see more than a few batches posted.
There's no guarantee of uniform queue loading, and AMD has admitted the number of queues is currently overkill, so restricting an ACE's ability to track resources to a small fraction of the GPU's throughput might hamstring it.
I actually don't know for sure yet, either. That's why I suggested including these parameters as variables in the testbench.
The lower limit for Nvidia is 64? I thought it was half that.
You are somewhat right; I oversimplified that statement. The warp size is 32 for Nvidia, so 32 is also the optimal workgroup size. However, with 2048 shaders and only 32 concurrent kernels, you can only achieve full shader saturation with at least 64 threads per kernel on average. You can still launch a kernel with a workgroup size of 32, but there shouldn't be fewer than 64 threads per kernel on average.
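The arithmetic behind that figure is simple; the numbers below (2048 shaders, 32 concurrent kernel slots) are the ones from the post above:

```python
shader_units = 2048        # CUDA cores on the GPU in question
concurrent_kernels = 32    # hardware limit on simultaneously resident kernels
warp_size = 32             # Nvidia warp size

# With every kernel slot occupied, each kernel must contribute this
# many threads on average to cover all shader units:
min_avg_threads = shader_units // concurrent_kernels
print(min_avg_threads)                # 64
print(min_avg_threads // warp_size)   # 2 warps per kernel on average
```

So a workgroup size of 32 is fine, but a kernel averaging a single warp can't saturate the chip once the concurrent-kernel limit is the bottleneck.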
Either way, since scheduling and launch are measured at wavefront or warp granularity, what insight does a fully utilized SIMD give towards measuring how freely queued launches can move past each other?
Well, out-of-order launches, i.e. scheduling full wavefronts, are only one half of async compute. The actual concurrent execution of multiple wavefronts per SIMD is the other half.
You can construct scenarios where only one or the other aspect will limit you, but the goal is to find the point where you are limited by neither, only by the raw performance of the card. That is the sweet spot you want to hit.
In order to stay on the original topic:
Effective latency is also a function of both. To keep it down, you need your kernel scheduled ASAP, and you need efficient concurrent execution on SIMD units which are possibly already under full load. You can never count on finding completely idle SIMD units, nor can you guarantee that scheduling is even possible at any given moment, due to various constraints.
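The way both halves feed into effective latency can be sketched with a toy model (entirely illustrative - the function name and the contention factor are made up, not measured quantities):

```python
def effective_latency(queue_wait, exec_time, simd_contention):
    """Toy model: effective latency is the wait for a free scheduling
    slot plus the execution time stretched by contention on the SIMDs
    the wavefronts land on (contention >= 1.0; 1.0 means idle SIMDs)."""
    return queue_wait + exec_time * simd_contention

# Even with instant scheduling, busy SIMDs stretch the kernel...
print(effective_latency(0.0, 1.0, 2.0))  # 2.0
# ...and idle SIMDs don't help if scheduling is blocked.
print(effective_latency(2.0, 1.0, 1.0))  # 3.0
```

Neither term can be driven to zero in practice, which is the point: a latency-sensitive kernel has to tolerate both a scheduling delay and sharing the SIMDs it lands on.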