I'm convinced gather will take three micro-ops (one to extract the mask, one to perform the actual gather, and one to perform the final blend), with a maximum throughput of one instruction per cycle when all elements come from the same cache line. The micro-op breakdown follows from two things: Haswell won't support FMA4, and the current vblendv instruction can already take three 256-bit source registers because a movmsk micro-op extracts the mask and passes it as an immediate to a vblend micro-op.
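To make the three-step split concrete, here is a scalar C model of what I mean. It is only a sketch of the speculated decomposition, not how any real hardware does it. The function names (extract_mask, gather_lanes, blend_lanes, gather_model) are made up for illustration, indices are treated as element indices (a fixed scale of 4 bytes), and the clearing of the mask register that the real instruction performs is left out.

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 8  /* eight 32-bit elements in one 256-bit register */

/* Step 1 (movmsk-like): collapse the sign bit of each mask element
   into one bit per lane. */
static unsigned extract_mask(const int32_t mask[LANES]) {
    unsigned bits = 0;
    for (int i = 0; i < LANES; i++)
        if (mask[i] < 0)
            bits |= 1u << i;
    return bits;
}

/* Step 2 (gather): load only the selected lanes. Unselected lanes
   perform no memory access at all. */
static void gather_lanes(float tmp[LANES], const float *base,
                         const int32_t idx[LANES], unsigned bits) {
    for (int i = 0; i < LANES; i++)
        if (bits & (1u << i))
            tmp[i] = base[idx[i]];
}

/* Step 3 (blend): merge the loaded lanes into the destination,
   keeping the old destination value where the mask bit is clear. */
static void blend_lanes(float dst[LANES], const float tmp[LANES],
                        unsigned bits) {
    for (int i = 0; i < LANES; i++)
        if (bits & (1u << i))
            dst[i] = tmp[i];
}

/* The whole (hypothetical) three-micro-op sequence. */
static void gather_model(float dst[LANES], const float *base,
                         const int32_t idx[LANES],
                         const int32_t mask[LANES]) {
    float tmp[LANES] = {0};
    unsigned bits = extract_mask(mask);
    gather_lanes(tmp, base, idx, bits);
    blend_lanes(dst, tmp, bits);
}
```

Calling gather_model(dst, table, idx, mask) with a mask whose elements are -1 in the lanes to load and 0 elsewhere overwrites only those lanes of dst.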
I'm also expecting Haswell to feature one regular 256-bit load unit and one 256-bit gather unit per core. It needs the extra L1 bandwidth for the FMA instructions, and this setup would allow the gather unit to have a slightly higher latency and thus a reasonably power efficient implementation.
It would be irrational of Intel to aim for anything less. LRBni supports 512-bit gather, and they wouldn't add gather to AVX2 if it weren't efficient. 2 x 256-bit execution begs for a high-performance gather instruction. Also, there's no reasonable middle ground between sequential extract/insert and gathering from one cache line at a time, so it has to be the latter. And lastly, note that the coherency rules for the instruction are consistent with such an implementation.
I think the decoding granularity for gather will be somewhere between 1 uop per cache line and 1 uop per page in the first implementation. Intel has a track record of exposing a new instruction in the ISA first and then gradually speeding it up in later generations.
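For a feel of what those two granularities would mean for a given index vector, here is a small C sketch that counts how many distinct cache lines and pages one 8-element gather touches. The 64-byte line size, the 4096-byte page size and the helper name count_blocks are my assumptions for illustration, and each element is assumed to be naturally aligned so it never straddles a line.

```c
#include <stdint.h>
#include <stddef.h>

#define LANES      8      /* eight 32-bit elements per 256-bit gather */
#define LINE_BYTES 64u    /* assumed cache line size */
#define PAGE_BYTES 4096u  /* assumed page size */

/* Count the distinct aligned blocks of block_bytes that the gather
   touches, given a base address, element indices and element size.
   Hypothetical helper: a cost model, not a description of hardware. */
static int count_blocks(uintptr_t base, const int32_t idx[LANES],
                        size_t elem_bytes, uintptr_t block_bytes) {
    uintptr_t seen[LANES];
    int n = 0;
    for (int i = 0; i < LANES; i++) {
        /* Signed offset so that negative indices work as well. */
        intptr_t offset = (intptr_t)idx[i] * (intptr_t)elem_bytes;
        uintptr_t block = (base + (uintptr_t)offset) / block_bytes;
        int found = 0;
        for (int j = 0; j < n; j++) {
            if (seen[j] == block) {
                found = 1;
                break;
            }
        }
        if (!found)
            seen[n++] = block;
    }
    return n;
}
```

Under a 1 uop per cache line scheme the cost would track count_blocks(base, idx, 4, LINE_BYTES), and under a 1 uop per page scheme it would track count_blocks(base, idx, 4, PAGE_BYTES). Eight consecutive floats from a 64-byte-aligned base give 1 in both cases, while eight floats spaced 64 bytes apart give 8 lines but still 1 page.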
Also, there is no reason whatsoever to believe that Haswell will have 2 x 256-bit FP hardware per core. Quite the contrary.