Wouldn't that show up in the SASS though? Or are you saying that an emulated HFMA2 would be transparent via the JIT?
No... sm_50 and sm_52 do not have this capability.
If so you would just be able to access half-words and would achieve 64 ops/clock/SMM.
So 2*M*N fp16 vs. 1*M*N fp32 takes 6x more time?
An emulated half2 FMA (necessary on pre-sm_53 targets) takes 11 instructions, but you get 2 fp16 FMAs per thread.
11 instructions / 2 FMAs = 5.5, so that sounds like roughly 6 ops per fp16 FMA to me.
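
For reference, here is a minimal sketch of what an emulated half2 FMA could look like on a pre-sm_53 target: unpack both halves to fp32, do two fp32 FMAs, and repack. The wrapper name `hfma2_compat` is hypothetical, and this is only an assumption about the general shape of such an emulation, not the sequence anyone in this thread actually measured. The instruction count the compiler emits for it may well differ from 11, and the result may not be bit-identical to a native HFMA2 because it rounds twice (once to fp32, once to fp16).

```cuda
#include <cuda_fp16.h>

// Hypothetical helper: native HFMA2 where the hardware has it,
// fp32 emulation everywhere else.
__device__ __forceinline__ __half2 hfma2_compat(__half2 a, __half2 b, __half2 c)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
    // sm_53 and newer: a single packed fp16 FMA instruction.
    return __hfma2(a, b, c);
#else
    // Older targets: unpack each half2 into two floats (F2F conversions),
    // do the math in fp32, then convert and pack the two results again.
    float2 fa = __half22float2(a);
    float2 fb = __half22float2(b);
    float2 fc = __half22float2(c);
    return __floats2half2_rn(fmaf(fa.x, fb.x, fc.x),
                             fmaf(fa.y, fb.y, fc.y));
#endif
}
```

The unpack and pack steps around the two fp32 FMAs are where the extra instructions come from, which is why the per-FMA cost ends up well above 1 op.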
One way to discover more about what's going on under the hood is to not perform FMAs but just MULs.
If F2F conversion is happening then you'll be skipping unpacking/packing of the addend.
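
A minimal sketch of that MUL-only experiment, assuming a hypothetical kernel name `hmul2_only` and a simple one-element-per-thread layout. It is meant for inspecting the generated SASS, not for timing, since a kernel like this is memory-bound.

```cuda
#include <cuda_fp16.h>

// Hypothetical test kernel: multiply only, so there is no addend
// to unpack or pack on the emulated path.
__global__ void hmul2_only(const __half2* a, const __half2* b,
                           __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
        // Native packed fp16 multiply.
        out[i] = __hmul2(a[i], b[i]);
#else
        // Emulated path: convert to fp32, multiply, convert back.
        float2 fa = __half22float2(a[i]);
        float2 fb = __half22float2(b[i]);
        out[i] = __floats2half2_rn(fa.x * fb.x, fa.y * fb.y);
#endif
    }
}
```

Building this to a cubin for the target of interest (for example with `nvcc -arch=sm_52 -cubin`) and dumping it with `cuobjdump -sass` should show whether F2F conversions appear, and comparing the instruction count against the FMA version shows how much of the cost is the addend handling.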
If I disassemble the binaries that nvcc generates for my code, it's just HFMA2s, and the code doesn't even run on pre-sm_53 targets as-is, so I'm not seeing transparent emulation.
I'll check and make sure I'm not building with stripped PTX.