How is that full saturation? So you issue 64 warps. Then what? Next clock you can't issue new instructions from any of the threads that are being worked on until those warps are done (ILP aside). So the SMM won't issue anything useful for another 5 clocks at best (warp is executing a MAD) and hundreds of clocks at worst (warp is executing a memory access). The same goes for GCN.
My fault, I forgot about instruction latency on Maxwell. So the lowest reasonable limit for the warp count is even much higher. Even better: more parameters to play with.
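To put a rough number on that lower limit, a back-of-the-envelope sketch (the 4 schedulers per SMM and the 64-warp capacity are from this discussion; the ~400-cycle memory latency is my assumption, not a measured figure):

```cpp
#include <cstdio>

// With no ILP, a scheduler that issues one instruction per clock needs
// roughly `latency` independent warps in flight to avoid bubbles between
// dependent instructions; multiply by the number of schedulers per SMM.
int min_warps(int schedulers, int latency_cycles) {
    return schedulers * latency_cycles;
}

int main() {
    const int smm_schedulers = 4; // Maxwell SMM: 4 warp schedulers

    // ALU-only lower bound: 4 * 6 = 24 warps per SMM.
    std::printf("6-cycle ALU latency:       >= %d warps\n",
                min_warps(smm_schedulers, 6));

    // Memory-bound case (assumed ~400 cycles): far beyond the 64 warps an
    // SMM can even hold, so memory latency can only ever be partially
    // hidden by occupancy alone.
    std::printf("~400-cycle memory latency: >= %d warps\n",
                min_warps(smm_schedulers, 400));
}
```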
I guess you were referring to the numbers from this paper?
http://lpgpu.org/wp/wp-content/uploads/2013/05/poster_andresch_acaces2014.pdf
GCN has only a 4-cycle latency for simple SP instructions, which I already accounted for, at least for "pure" SP workloads. Not to mention that this is also the minimum latency for all instructions on GCN, and that GCN features a much larger register file.
The 6-cycle SP latency for Maxwell however is ... weird. I actually thought that Maxwell had a LOWER latency for primitive SP instructions than GCN, but the opposite appears to be the case???
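One thing that makes the comparison less lopsided than it looks: a GCN SIMD only issues an instruction for a given wavefront once every 4 clocks (64 lanes over a 16-wide SIMD), so the 4-cycle latency is fully covered by the issue cadence and even a single wave per SIMD never stalls on an ALU dependency. Maxwell issues back-to-back every clock, so its 6-cycle latency actually shows. A small sketch of that arithmetic, using the figures above:

```cpp
#include <cstdio>

// Warps/waves one scheduler needs so a chain of dependent ALU instructions
// never stalls: ceil(latency / issue_cadence).
int waves_needed(int latency_cycles, int issue_cadence_cycles) {
    return (latency_cycles + issue_cadence_cycles - 1) / issue_cadence_cycles;
}

int main() {
    // Maxwell: issue every clock, ~6-cycle SP latency.
    std::printf("Maxwell: %d warps per scheduler\n", waves_needed(6, 1)); // 6
    // GCN: one wave instruction per SIMD every 4 clocks, 4-cycle latency.
    std::printf("GCN:     %d wave per SIMD\n", waves_needed(4, 4));       // 1
}
```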
The app currently uses two queues: one compute and one direct, and there's a scenario where both go through the direct queue.
No, in both scenarios the application is always using a dedicated direct and a dedicated compute queue (at least from the software view). What changes, though, are the flags set on the compute queue, whether there is a CPU-side barrier between committing to the direct and the compute queue, and whether there are any draw calls at all.
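For reference, a minimal D3D12 sketch of what that setup looks like (function and variable names are mine; error handling omitted). Both scenarios submit to the same two dedicated queues; the structural difference is whether the CPU blocks on a fence between the two submissions:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Both scenarios: one dedicated direct queue, one dedicated compute queue.
void SubmitBothQueues(ID3D12Device* device,
                      ID3D12CommandQueue* directQueue,
                      ID3D12CommandQueue* computeQueue,
                      ID3D12CommandList* drawList,
                      ID3D12CommandList* computeList,
                      bool cpuSideBarrier)
{
    directQueue->ExecuteCommandLists(1, &drawList);

    if (cpuSideBarrier) {
        // CPU-side barrier variant: drain the direct queue before
        // committing the compute work.
        ComPtr<ID3D12Fence> fence;
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
        directQueue->Signal(fence.Get(), 1);
        HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);
        fence->SetEventOnCompletion(1, evt);
        WaitForSingleObject(evt, INFINITE);
        CloseHandle(evt);
    }

    computeQueue->ExecuteCommandLists(1, &computeList);
}
```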
What happens in the "super slow" scenario, where everything is run on one direct queue, is that all compute dispatches and draw calls run serially, meaning it won't even run 32 compute dispatches concurrently for some reason. The same happens on the direct queue if you only dispatch compute kernels without draws.
Uhm, no, that's not the case. It's just refusing to run the compute dispatches concurrently because it has been told not to: the software queue was explicitly flagged to be executed sequentially.
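To illustrate what "told not to" can look like on the API side — I don't know the exact mechanism the app's sequential flag maps to, but in D3D12 the canonical way to forbid overlap between dispatches is a UAV barrier between them (a null resource means "all UAV accesses"):

```cpp
#include <d3d12.h>

// Records n dispatches; with `sequential` set, a UAV barrier after each
// Dispatch forces the driver to execute them one after another, otherwise
// it is free to overlap them. (Illustrative sketch, not the app's code.)
void RecordDispatches(ID3D12GraphicsCommandList* cl, bool sequential, int n)
{
    for (int i = 0; i < n; ++i) {
        cl->Dispatch(1, 1, 1);
        if (sequential) {
            D3D12_RESOURCE_BARRIER barrier = {};
            barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
            barrier.UAV.pResource = nullptr; // null = barrier on all UAV accesses
            cl->ResourceBarrier(1, &barrier);
        }
    }
}
```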
For that test, it's actually GCN which behaved oddly, as it chose to ignore that flag, probably because the driver assumed it was safe to do so.
Don't try to make too much sense of the results; so far the benchmark is neither limited by anything except the concurrency at the call level, nor does it even measure the impact of concurrency at the SMM/CU level.
The current benchmark has only yielded very few usable results so far, namely the number of calls the schedulers of Maxwell, GCN 1.0/1.1 and GCN 1.2 can each dispatch, with the biggest surprise being that the HWS (two ACEs "fused") in GCN 1.2 apparently differs in behavior from the plain ACEs used in GCN 1.0 and 1.1.
Beyond that, it provoked the driver crash on Nvidia from starving calls, and revealed that driver optimization on GCN where the sequential flag could be ignored.
Apart from that, you can rather clearly see that power management screwed up the numbers on both Maxwell and GCN. Maxwell went into boost whenever it had no draw calls at all (or at least received a constant speedup for some other unknown reason), and GCN didn't leave 2D clocks fast enough when confronted with only a single "batch" of compute calls.
Which made these two graphs in particular about as inconclusive as it could possibly get:
http://www.extremetech.com/wp-content/uploads/2015/09/3qX42h4.png
http://www.extremetech.com/wp-content/uploads/2015/09/vevF50L.png
Just to pick two graphs which were abducted from this thread and ... misinterpreted, because they mostly measured a lot of random noise. A+ for scientific method.