Looking at the tests hardware.fr did, I have a hunch that the lowered performance of FP16 writes (without blending) ....
For the sake of clarity, you have to include a G16R16F test (yes, that is a valid render target format). Then you can compare 32-bit integer (RGBA8) vs. 32-bit FP (G16R16F) writes without any bandwidth difference or contention, since both formats take 4 bytes per pixel. That should boil down to the pure FP vs. INT blending performance difference.
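To make the bandwidth argument concrete, here is a rough sketch of the per-pixel arithmetic. The table and helper names are mine, A16B16G16R16F is just the D3D9 name for the 4xFP16 format, and the fill rate in the last line is a made-up number for illustration, not a measurement.

```python
# Bits per channel for each render target format discussed here.
# The dictionary and function names are illustrative, not from any API.
FORMATS = {
    "RGBA8": (8, 8, 8, 8),              # 4 x 8-bit integer
    "G16R16F": (16, 16),                # 2 x FP16
    "A16B16G16R16F": (16, 16, 16, 16),  # 4 x FP16
}


def bytes_per_pixel(name):
    """Size of one pixel written to the render target, in bytes."""
    return sum(FORMATS[name]) // 8


def write_bandwidth_gb_s(name, gpixels_per_s):
    """Raw write traffic in GB/s at a given fill rate (no compression assumed)."""
    return bytes_per_pixel(name) * gpixels_per_s


for name in FORMATS:
    print(name, bytes_per_pixel(name), "bytes/pixel")

# Hypothetical fill rate of 2 Gpixels/s, purely for illustration.
print(write_bandwidth_gb_s("G16R16F", 2.0), "GB/s")
```

RGBA8 and G16R16F both come out at 4 bytes per pixel, while 4xFP16 is 8, so only the first pair isolates the FP vs. INT difference from the bandwidth difference.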
I think we agree that it makes no difference to the ROP whether it's 2xFP16 or 4xFP16. I'm pretty sure the ROPs contain fully redundant vector ALUs (they always process 4 scalars).