"Two 256-bit FMA pipes per core could be an overkill..."
Why would a Hyper-Threaded core with two FMA units be overkill, while a Bulldozer module with two FMA units is not?
There's no need for Intel to settle for anything less. 22 nm Tri-Gate gives them plenty of power-efficient transistors. What else would they use them for anyway? More cores would mean more threads and thus worse scaling, as well as more of the other power-hungry components. Compared to the alternatives, 2 x 256-bit FMA offers the best performance per watt, even if they have to widen a few things.

"...but we still don't know anything about the Haswell's load/store pipeline capabilities. Probably Intel could settle for asymmetric ALU design with FMA + MUL 'co-issue' organization, or FMA + ADD."
An FMA + ADD configuration doesn't make sense, since multiplications are more prevalent. And if you have FMA + MUL, you may as well make it a second FMA, since it's not much bigger and you already have three source operands for other operations.
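
To put rough numbers on the three configurations, here is a back-of-the-envelope sketch of peak single-precision FLOPs per cycle per core. The assumptions are mine, not from any Intel disclosure: 32-bit lanes, each pipe issuing one 256-bit operation per cycle, and an FMA counted as two FLOPs per lane. The function and table names are made up for illustration.

```python
# Peak single-precision FLOPs per cycle for the pipe mixes discussed above.
# Assumptions (illustrative only): 32-bit lanes, one vector op issued per
# pipe per cycle, FMA = 2 FLOPs per lane, MUL and ADD = 1 FLOP per lane.
VECTOR_BITS = 256
LANE_BITS = 32
LANES = VECTOR_BITS // LANE_BITS  # 8 single-precision lanes per pipe

FLOPS_PER_LANE = {"FMA": 2, "MUL": 1, "ADD": 1}


def peak_flops_per_cycle(pipes):
    """Sum the per-cycle FLOPs of every pipe in the configuration."""
    return sum(FLOPS_PER_LANE[pipe] * LANES for pipe in pipes)


CONFIGS = {
    "FMA + ADD": ("FMA", "ADD"),
    "FMA + MUL": ("FMA", "MUL"),
    "2 x FMA": ("FMA", "FMA"),
}

for name, pipes in CONFIGS.items():
    print(f"{name}: {peak_flops_per_cycle(pipes)} FLOPs/cycle")
```

Under these assumptions the asymmetric designs peak at 24 FLOPs per cycle and 2 x FMA at 32. These are theoretical peaks only; whether the load/store side can feed them is a separate question.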