That only needs one AGU op for the split load. Each independent address calculation gets its own uop in the memory pipeline. Do you envision a gather pipe that would handle the normally separate memory uops internally?
Yes, the load unit would receive a single gather uop and all the source operands. It computes the full (virtual) address of the first element, and checks which of the other indices point to the same cache line. In the next cycle, it computes the address of the next element for which the cache line fetch hasn't already been queued (if any), again comparing which indices of other elements are stored within the same cache line. This is repeated till every unique cache line fetch has been queued up (1 to 8 cycles).
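To make the cycle count concrete, here is a rough Python sketch of that loop. It is only a model of the idea described above, not a claim about how any real load unit is built. The 64-byte line size, the 4-byte default scale, and the name `gather_line_schedule` are my own assumptions, and elements are assumed not to straddle a cache line.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def gather_line_schedule(base, indices, scale=4, line_bytes=CACHE_LINE_BYTES):
    """Model the per-cycle cache line queuing of a hypothetical gather pipe.

    Returns a list of (cycle, line_address, lanes) tuples, one per unique
    cache line, in the order the lines would be queued.
    """
    pending = list(range(len(indices)))  # lanes whose line is not queued yet
    schedule = []
    cycle = 0
    while pending:
        cycle += 1
        lead = pending[0]
        # One address computation per cycle: the first still-pending element.
        line = (base + indices[lead] * scale) // line_bytes
        # Compare every other pending lane against that cache line.
        same = [lane for lane in pending
                if (base + indices[lane] * scale) // line_bytes == line]
        schedule.append((cycle, line * line_bytes, same))
        pending = [lane for lane in pending if lane not in same]
    return schedule


if __name__ == "__main__":
    for cycle, addr, lanes in gather_line_schedule(
            0x1000, [0, 1, 2, 3, 16, 17, 100, 0]):
        print(f"cycle {cycle}: line {addr:#x} serves lanes {lanes}")
```

With eight indices that all fall in one line the loop finishes in a single cycle, and with eight indices that are each a full line apart it takes eight, which matches the 1 to 8 cycle range.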
How would Haswell feed the gather instruction with its operands?
From the description, there is a source/destination register, a mask register, a base register, and an index register.
There would be a move from the GPR file as well as the FP file for the combined VSIB:Base and VSIB:Index values.
In total, the hardware would see 4 separate operands in the uop.
If vgather has special treatment to expand the number of inputs, why stop at FMA3? FMA4 was promised and then dropped, potentially because of the number of operands needed in the uop.
The most straightforward solution would be to perform the source/destination register blending in a separate uop corresponding to a vblend instruction. There are two ports capable of executing vblend on Sandy Bridge, so this wouldn't have a noticeable effect on gather performance. However, considering that the other ports are already fed with up to three 256-bit input operands, it might not be that big of a deal to have the load port take the same number of inputs.
Note that vblend with an immediate operand as the mask requires only a single uop, while vblendv, which takes a fourth register, requires two uops. This indicates that the 8-bit immediate is a free extra operand, but it takes a uop (the same one as vmovmsk) to extract it from a register.
The choice to go with FMA3 instead of FMA4 may simply be about uop encoding size. Since uops store physical register indexes, FMA4 would have allowed significantly fewer uops in an equal-sized uop cache versus FMA3. Note however that a fused FMA3 uop with a memory operand encodes four registers. Same for gather.
So in the worst case a gather instruction may end up taking three uops:
vmovmsk imm, mask
vgather temp, [base+vindex*scale], imm
vblend dest, dest, temp, imm
Despite that, the peak throughput would still be 1 full gather instruction every clock cycle since each of these uops takes a different port.
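For what it's worth, the three-uop split above can be checked functionally with a small Python emulation. This only models the data flow, not ports or timing. The helper names, the 32-bit lanes, and the dictionary used as memory are all made up for illustration, and I'm assuming the mask is taken from the sign bit of each lane, as with the vmovmsk/vblendv family.

```python
def movmsk(mask_reg):
    """uop 1: pack the sign bit of each 32-bit lane into an integer mask."""
    imm = 0
    for lane, value in enumerate(mask_reg):
        if value & 0x80000000:
            imm |= 1 << lane
    return imm


def gather(memory, base, vindex, scale, imm):
    """uop 2: load only the lanes selected by imm into a temporary.

    Unselected lanes are never read from memory and are left as 0 here.
    """
    return [memory[base + idx * scale] if (imm >> lane) & 1 else 0
            for lane, idx in enumerate(vindex)]


def blend(dest, temp, imm):
    """uop 3: take temp where the mask bit is set, keep dest elsewhere."""
    return [t if (imm >> lane) & 1 else d
            for lane, (d, t) in enumerate(zip(dest, temp))]


def gather_three_uops(memory, dest, mask_reg, base, vindex, scale):
    imm = movmsk(mask_reg)                           # vmovmsk imm, mask
    temp = gather(memory, base, vindex, scale, imm)  # vgather temp, [...], imm
    return blend(dest, temp, imm)                    # vblend dest, dest, temp, imm


if __name__ == "__main__":
    # Hypothetical memory image: 16 dwords starting at address 100.
    memory = {100 + 4 * i: i * 10 for i in range(16)}
    result = gather_three_uops(
        memory,
        dest=[-1, -1, -1, -1],
        mask_reg=[0x80000000, 0, 0xFFFFFFFF, 0],
        base=100,
        vindex=[3, 999, 0, 999],  # masked-off lanes may hold junk indices
        scale=4,
    )
    print(result)
```

Because the gather step skips masked-off lanes entirely, an out-of-range index in such a lane never touches memory, which is the behaviour you would want from the merged result.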
I'm not sure if a fourth uop would be needed, possibly for the write to the second destination in the suspend case?
An exception starts the execution of a micro-coded routine, so that routine may contain a uop to have the load unit write back the mask register.