Anyway, if it takes, as you say, 3 clock cycles to bring data from L1, and the optimization manual says a register compare (CMP, TEST) takes only 1 clock cycle with a new compare able to start every 0.5 cycles, then you could perhaps still gain something, at least in terms of saved bandwidth.
Which architecture are you counting on for that compare bandwidth? It's risky to rely on that for an implementation.
It would make things take about the same amount of time, or even longer in the best case (all four pointers identical), assuming that 0.5-cycle compare throughput.
Four pointers means six possible pairings, so six compares, or 3 cycles.
If that first load goes out to the cache, the op takes 6 cycles for just that load plus the branches.
The first load is always required, so that cost is a given.
If the loads are fired off sequentially, the other three take three additional cycles to complete, or fewer if the cache can handle two loads per cycle.
It might save bandwidth, but that won't particularly matter unless the processor has SMT. Unfortunately, the instruction as you've described it is atomic, so the safest way to keep it that way would be to block the other thread from issuing any instructions, meaning the bandwidth would end up reserved for roughly the same amount of time anyway.
When I said atomic before, I only meant that no other reads come in between. Surely there is no need to worry about writes to the same memory, since gather is a read-only operation anyway; otherwise it would be called scatter.
Atomic in its strictest sense means nothing interrupts the instruction. If another thread issues a write, it may hit a gather location, which it shouldn't be able to do until after the last load has completed. Atomicity means nothing can tell that the instruction was implemented sequentially. To guarantee this, some additional ops might be needed to block other threads.