Ryan (or anyone else from Anand) - how did you set up/implement the GEMM tests? I'm guessing it's cuBLAS multiplying two matrices which are *both* very large? I agree memory bandwidth is a key question for efficiency here - I'm thinking the way the tensor cores are used by cuDNN might have different characteristics in terms of "external bandwidth required per amount of computation" (depending on what they're doing). Might also be interesting to try downclocking core and/or memory separately and see what happens.
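For reference, something like this is roughly what I'd imagine the test looks like - just a sketch on my part (matrix size, iteration count, and plain FP32 cublasSgemm are all my assumptions, not necessarily what Anand actually ran):

[code]
// Hypothetical cuBLAS GEMM throughput test - a minimal sketch, not AnandTech's actual code.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 8192;                     // two *large* square matrices
    const int iters = 50;
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(float) * N * N);
    cudaMalloc((void**)&dB, sizeof(float) * N * N);
    cudaMalloc((void**)&dC, sizeof(float) * N * N);
    cudaMemset(dA, 0, sizeof(float) * N * N);   // contents don't matter for throughput
    cudaMemset(dB, 0, sizeof(float) * N * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // For the tensor cores one would use cublasGemmEx with FP16 inputs
    // and CUBLAS_GEMM_DEFAULT_TENSOR_OP instead of plain SGEMM.

    // Warm-up launch so clocks ramp up before we start timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * N * N * N * iters;     // 2 FLOPs per multiply-add
    printf("%.1f GFLOP/s\n", flops / (ms * 1e-3) / 1e9);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
[/code]

With N that large the arithmetic intensity should be high enough to be compute-bound rather than bandwidth-bound, which is exactly why the separate core/memory downclocking experiment would be so telling.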
BTW, on the Beyond3D "Estimated FMA latency" test - it doesn't really make sense for GCN to be 4.5 cycles.
There are possible HW explanations for non-integer latencies, but they're not very likely. The test inherently has some overhead (which can be amortised by trading off how long it runs in various ways) - presumably it times a long chain of dependent FMAs and divides by the chain length, so any fixed setup/timer cost inflates the per-instruction estimate. Maybe that overhead is just higher on GCN for some reason, which makes it "look" like 4.5 cycles when it's really 4 cycles; I'm not quite sure.
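For context, the standard way to measure this is a dependent FMA chain timed with the on-chip clock - I'm guessing that's roughly what the Beyond3D tool does, though this CUDA version is just my sketch of the approach (the GCN equivalent would need an OpenCL harness, but the principle is the same):

[code]
// Dependent-chain FMA latency sketch - my guess at the approach, not Beyond3D's actual code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float *out, long long *cycles, int iters) {
    float a = 1.0f, b = 1.000001f, c = 0.0f;
    long long t0 = clock64();
    // Every FMA consumes the previous result, so issue rate = 1/latency.
    #pragma unroll 64
    for (int i = 0; i < iters; ++i)
        c = fmaf(a, b, c);
    long long t1 = clock64();
    *out = c;                          // keep the chain from being optimised away
    *cycles = t1 - t0;
}

int main() {
    const int iters = 1 << 20;
    float *dOut;
    long long *dCycles;
    cudaMalloc((void**)&dOut, sizeof(float));
    cudaMalloc((void**)&dCycles, sizeof(long long));

    fma_latency<<<1, 1>>>(dOut, dCycles, iters);   // 1 lane of 1 warp
    cudaDeviceSynchronize();

    long long cycles = 0;
    cudaMemcpy(&cycles, dCycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    // Loop/branch overhead and the clock64() reads are folded into the
    // total, which is exactly how a 4-cycle pipe can "look" like 4.5.
    printf("%.2f cycles/FMA\n", (double)cycles / iters);

    cudaFree(dOut);
    cudaFree(dCycles);
    return 0;
}
[/code]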
I always thought it'd be interesting to get power consumption numbers when running that test btw (it would probably have to be changed to run in a loop) - it's effectively using the GPU as little as theoretically possible while still keeping it active non-stop (1 lane of 1 warp/wave). So in a way it's the smallest possible step up from "idle", and it shows what the minimum power is when you're not allowed to just power gate (or shut down) everything!
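A rough sketch of how that measurement could work on the NVIDIA side - the spin kernel and one-second sampling interval are arbitrary choices of mine, and on a GPU driving a display the watchdog would kill a kernel that spins this long:

[code]
// Sample board power via NVML while a single-lane FMA loop keeps the GPU
// busy - a sketch, not a polished tool. Link against -lnvidia-ml.
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>
#include <nvml.h>

__global__ void spin_fma(volatile int *stop, float *out) {
    // 1 lane of 1 warp: the smallest possible non-idle load.
    float a = 1.0f, b = 1.000001f, c = 0.0f;
    while (!*stop)
        for (int i = 0; i < 65536; ++i)
            c = fmaf(a, b, c);
    *out = c;                          // keep the chain live
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Mapped host memory lets the host flip the stop flag while the kernel
    // is still running (a plain cudaMemcpy on the default stream would
    // deadlock waiting for the kernel to finish).
    int *hStop, *dStop;
    cudaHostAlloc((void**)&hStop, sizeof(int), cudaHostAllocMapped);
    *hStop = 0;
    cudaHostGetDevicePointer((void**)&dStop, hStop, 0);

    float *dOut;
    cudaMalloc((void**)&dOut, sizeof(float));
    spin_fma<<<1, 1>>>(dStop, dOut);   // runs in the background

    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 30; ++i) {
        unsigned int mw = 0;
        nvmlDeviceGetPowerUsage(dev, &mw);   // board power, in milliwatts
        printf("%.3f W\n", mw / 1000.0);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    nvmlShutdown();

    *hStop = 1;                        // tell the kernel to exit
    cudaDeviceSynchronize();
    cudaFreeHost(hStop);
    cudaFree(dOut);
    return 0;
}
[/code]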
I'm getting my Titan V in early/mid January - I'll definitely write some microbenchmarks to test a few things I'm curious about. I'm also thinking of maybe writing some articles describing the deep learning HW landscape; we'll see...
(P.S.: Agreed with silent_guy, every instance of "scheduling hardware" should really be replaced by "dependency tracking hardware"!)
EDIT: And needless to say, thanks for the really nice article with original analysis and tests - happy to see you guys spending the time to do that!