I finally got to play with a 1080 today and I thought I'd post my findings.
So the int8/16 dot product instructions (dp4a and dp2a) are indeed full throughput and work just as advertised. The fp16 support, however, has a lot of functionality that isn't documented:
- HFMA2.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>,<->b.<H0_H0|H1_H1>,<->c.<H0_H0|H1_H1|F32>;
- <HADD2|HMUL2>.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>,<->b.<H0_H0|H1_H1>;
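To make the H0_H0 / H1_H1 operand flags concrete, here's a value-level model of the packed lanes, with plain floats standing in for the fp16 halves. This is my reading of the encoding, not GPU code; the FTZ/FMZ/SAT flags and the merge modes aren't modeled:

```c
/* Two packed fp16 lanes, modeled with floats for clarity. */
typedef struct { float h0, h1; } half2_t;

/* .H0_H0 and .H1_H1 broadcast one lane of the source across both lanes. */
static half2_t h0_h0(half2_t v) { return (half2_t){ v.h0, v.h0 }; }
static half2_t h1_h1(half2_t v) { return (half2_t){ v.h1, v.h1 }; }

/* HFMA2 d, a, b, c: a fused multiply-add on each packed lane. */
static half2_t hfma2(half2_t a, half2_t b, half2_t c)
{
    return (half2_t){ a.h0 * b.h0 + c.h0, a.h1 * b.h1 + c.h1 };
}
```

In this model, `hfma2(h1_h1(a), b, c)` corresponds to `HFMA2 d, a.H1_H1, b, c`: the high half of a multiplies both halves of b.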
On sm_61 these fp16 instructions of course run at 1/64 throughput (that's the instruction issue rate, not the math), so all this is of limited use there. But it might be good to be aware of it if you get hold of hardware that isn't crippled (sm_60, sm_62?).
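For comparison, the integer dot products are easy to pin down. Here's a plain-C reference model of what dp4a and dp2a compute (signed-operand variants, my reading of the PTX docs; a semantics sketch, not GPU code):

```c
#include <stdint.h>

/* dp4a: four packed signed bytes of a times four of b, summed into c. */
static int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t c)
{
    for (int i = 0; i < 4; ++i) {
        int8_t ai = (int8_t)(a >> (8 * i));
        int8_t bi = (int8_t)(b >> (8 * i));
        c += (int32_t)ai * (int32_t)bi;
    }
    return c;
}

/* dp2a: two packed signed 16-bit lanes of a times two signed bytes of b
   (the low or high byte pair, per the .lo/.hi selector), summed into c. */
static int32_t dp2a_ref(uint32_t a, uint32_t b, int32_t c, int hi)
{
    for (int i = 0; i < 2; ++i) {
        int16_t ai = (int16_t)(a >> (16 * i));
        int8_t  bi = (int8_t)(b >> (8 * (i + (hi ? 2 : 0))));
        c += (int32_t)ai * (int32_t)bi;
    }
    return c;
}
```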
Interestingly, there's a lot of mixed fp32/fp16 precision support. Though the mode I was most interested in (fp16x2 dot product with fp32 accumulate) isn't supported. Any time you mix in fp32 registers, the instruction can only operate on one of the packed fp16 values in the other registers (as far as I can tell, anyway).
The H0_H0 and H1_H1 flags are there to facilitate a SIMD matrix-multiply outer product, but they're also used to select the packed values in single-throughput mode. The merge flags don't currently work on sm_61, but it's clear they're meant to provide BFI-like functionality.
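Here's the outer-product pattern those broadcasts enable, again at the value level with floats standing in for the fp16 halves (a sketch of the idea, not GPU code):

```c
/* Packed fp16 pair, modeled with floats. */
typedef struct { float h0, h1; } half2_t;

/* 2x2 outer-product accumulate: two HFMA2s, one with the a operand
   broadcast via .H0_H0 and one via .H1_H1. */
static void outer2x2(half2_t a, half2_t b, half2_t c[2])
{
    c[0].h0 += a.h0 * b.h0;  c[0].h1 += a.h0 * b.h1;  /* HFMA2 c0, a.H0_H0, b, c0 */
    c[1].h0 += a.h1 * b.h0;  c[1].h1 += a.h1 * b.h1;  /* HFMA2 c1, a.H1_H1, b, c1 */
}
```

So two instructions produce four fma results from two packed operands, which is what makes the broadcast flags worthwhile for a small matrix-multiply tile.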
Oh, and one other thing... it looks like there's a performance bug in the compiler. If you're loading from memory in fp16 and then converting to fp32 for compute, ptxas tries to be clever and uses HADD2 instead of F2F to do the conversion. But on sm_61 hardware this is a 16x slower path. On sm_60 it makes a lot more sense, since there it's a 4x speedup. I've already submitted a bug for this.
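For reference, the conversion F2F is doing here is cheap to express: fp16 to fp32 is exact, so it's just bit rewiring plus subnormal handling, no rounding. A plain-C model of it (my sketch, not the hardware path):

```c
#include <stdint.h>
#include <string.h>

/* fp16 -> fp32 conversion (what F2F.F32.F16 computes): exact, so no
   rounding is involved; only subnormal inputs need extra work. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    uint32_t bits;
    float f;

    if (exp == 0x1f) {                       /* inf / NaN */
        bits = sign | 0x7f800000u | (man << 13);
    } else if (exp != 0) {                   /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112) << 23) | (man << 13);
    } else if (man != 0) {                   /* subnormal: renormalize */
        exp = 113;
        while (!(man & 0x400)) { man <<= 1; --exp; }
        bits = sign | (exp << 23) | ((man & 0x3ff) << 13);
    } else {                                 /* +/- zero */
        bits = sign;
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

In CUDA C you'd normally just call __half2float() and let the compiler pick the instruction, which is exactly where the HADD2-vs-F2F choice above bites.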