If gather is implemented as little more than N standard loads chained together, another bit of work is needed: a way to evaluate the index values to see whether any of them are neighbors on the same cache line, based on their values and the value of the base register loaded from the scalar pipeline.
If we don't rely on a stream of microcoded instructions to do this, we need a logic block that can spit out a list of cache line neighbors in order to elide duplicate loads. There is currently no guarantee that Haswell will attempt this, but it would help reduce the latency and power cost of using the instruction.
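To make the elision concrete, here is a minimal C sketch (my own illustration, not anything confirmed about Haswell) of what such a logic block would compute: given a base address and the already-scaled offsets, decide which loads can reuse a line that an earlier element already fetched. The function name and the 64-byte line size are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_BYTES 64  /* assumed cache line size */

/* Hypothetical sketch: given a base address and N scaled offsets,
 * mark which of the N loads can reuse a cache line already fetched
 * by an earlier load. Returns the number of distinct lines needed. */
static size_t plan_gather(uint64_t base, const uint64_t *offsets,
                          size_t n, int *reuse_earlier)
{
    size_t lines = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t line_i = (base + offsets[i]) / LINE_BYTES;
        reuse_earlier[i] = 0;
        for (size_t j = 0; j < i; j++) {
            if ((base + offsets[j]) / LINE_BYTES == line_i) {
                reuse_earlier[i] = 1;  /* elide: neighbor on a fetched line */
                break;
            }
        }
        if (!reuse_earlier[i])
            lines++;
    }
    return lines;
}
```

For example, with base 0 and offsets {0, 4, 8, 200}, the first three elements share line 0 and only two lines (0 and 3) need to be fetched.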
The comparison would be an N-way check based on the arithmetic difference of each pair of index values, plus the least significant bits of the base register needed to pick a cache line. The resulting values would be checked to see where they lie relative to a cache line.
This could be done with a stream of instructions, though it would be faster if specialised hardware helped.
You don't have to compute the full arithmetic difference of each index value. You just have to compare the upper bits for equality, and check whether adding the low six bits of the base and index produces a carry into the next cache line (the seventh bit of that small sum). Since the results aren't needed until a later clock cycle, it seems trivial to provide small and power-efficient dedicated hardware for it.
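As a sketch of that shortcut (my own illustration; the helper names and the 64-byte line size are assumptions), the pairwise test can avoid a full-width add per pair: only a 6-bit add for the carry and an equality compare on the upper bits are needed, and the result matches the naive effective-address comparison.

```c
#include <stdint.h>

#define LINE_SHIFT 6  /* assumed 64-byte cache lines */

/* Reference check: full effective-address computation. */
static int same_line_full(uint64_t base, uint64_t i1, uint64_t i2)
{
    return ((base + i1) >> LINE_SHIFT) == ((base + i2) >> LINE_SHIFT);
}

/* Cheaper check in the spirit of the text: per index, only a 6-bit
 * add against the low bits of the base is needed to get the carry
 * into the line-select bits; the rest is an equality compare. */
static int same_line_cheap(uint64_t base, uint64_t i1, uint64_t i2)
{
    uint64_t lo_base = base & 63;
    uint64_t c1 = (lo_base + (i1 & 63)) >> LINE_SHIFT; /* carry into bit 6 */
    uint64_t c2 = (lo_base + (i2 & 63)) >> LINE_SHIFT;
    return (i1 >> LINE_SHIFT) + c1 == (i2 >> LINE_SHIFT) + c2;
}
```

The two agree because base + i splits into ((base >> 6) + (i >> 6)) << 6 plus a sub-128 sum of the two 6-bit low parts, so the full line number is (base >> 6) + (i >> 6) + carry, and the common (base >> 6) term cancels in the comparison.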
I'm slightly more concerned about the hardware needed to extract multiple elements simultaneously. Considering that cache line sizes of 64 bytes have been common for a decade, it might not be that big a deal to have multiple of these shift units though. Larrabee is even supposed to be capable of gathering up to 16 elements per cycle, and has smaller cores, so I'm optimistic. Latency might be affected, but there's actually an elegant solution for that; see below.
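A rough C model of that extraction step (purely illustrative; the function name, 4-byte elements, and 64-byte line are my assumptions): once a line has been latched, each element is produced by a byte shift of the line, so replicating the shift unit is what buys multiple elements per cycle.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 64  /* assumed cache line size */

/* Hypothetical sketch of extraction: each iteration models one
 * shift/extract unit picking a 32-bit element out of an already
 * fetched 64-byte line (memcpy stands in for the byte shifter).
 * Offsets are masked to 4-byte-aligned positions within the line. */
static void extract_dwords(const uint8_t line[LINE_BYTES],
                           const uint8_t *offsets, size_t n,
                           uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        memcpy(&out[i], line + (offsets[i] & (LINE_BYTES - 4)), 4);
}
```

With N copies of the shifter the loop body becomes N parallel units, which is the cost being weighed here.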
The next question would be the number and width of Haswell's memory ports, which could change some details as well.
Two read ports, one write port, all 256-bit. It's pointless to feature two 256-bit FMA units if the caches can't provide sufficient bandwidth. x86's limited number of registers, and the fact that non-destructive AVX instructions help increase throughput, make this an absolute necessity.
Indeed, this affects the gather implementation. I imagine that either each load unit has a lightweight gather implementation, that the two cooperate in some way, or that only the second one has an advanced gather implementation. The latter would be an interesting option, since it allows the second unit to have a higher latency. If most other load operations use the first, low-latency port, it would hardly affect legacy workloads.
If we look at Bulldozer with 2 128-bit FMAs and a vector permute block, we see an FPU that is something shy of half the size of the 2 integer cores in a module.
Within the FPU, the XBAR block sits between the two register file halves and would be about 20-25% of the area of the rest of the FPU (and it is only 128 bits wide). Of course there are other ops in that block (horizontal ops, etc.), which Haswell will have as well, so the growth is not due to the permute block alone.
An FMA unit is somewhat smaller than separate FMUL and FADD units, so it follows that two of them would be somewhat smaller than doubling the ALU pipes in Haswell's FPU.
Bulldozer's FlexFP unit can execute up to four instructions each cycle. Sandy Bridge, on the other hand, borrows some of the integer data paths for 256-bit operations. So it seems cheaper to me to equip Haswell with two 256-bit FMA units than to extend the FlexFP unit to sustain two 256-bit operations.
Haswell is also a new core design, so the integer side would be growing as well.
According to some rumors, they might extend macro-op fusion to pairs of mov and ALU instructions, turning them into a single non-destructive operation where applicable (e.g. fusing "mov eax, ebx" followed by "add eax, ecx" into one three-operand add). This would merely affect the decoders. The ironic part is that current compilers actually avoid emitting such code, so legacy code may not observe much of a benefit. Recompiled code (which remains perfectly compatible with older x86 chips) could run a bit faster though, and just like test-and-branch macro-op fusion this technique would slightly improve performance/Watt.