"What happens when it causes 8 TLB misses, costing 10^3-10^4 cycles for one instruction?"
Exactly the same thing as 8 TLB misses when you emulate gather with extract/insert. Nothing to worry about.

"And what happens to the coherency protocol while a core is stalled on a single memory uop?"
Gather is a set of load operations. They can execute in any order, and some may execute simultaneously. Nothing changes in the coherency protocol, since you're still only queuing up one cache line access per cycle per port.
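
To make the TLB point concrete, here is a rough scalar sketch in portable C. It is not how the hardware implements gather. It only models an 8-wide gather as eight independent loads and counts how many distinct pages those loads touch. The function names `gather8_model` and `distinct_pages8` are made up for illustration, and the 4096-byte page size used in the usage note is an assumption. The page count depends only on the addresses, so it comes out the same whether the loads are issued by one gather instruction (for example via the AVX2 intrinsic `_mm256_i32gather_ps`) or by eight extract/insert pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an 8-wide gather: eight independent loads.
   Illustrative only; real hardware may issue them in any order
   and may overlap some of them. */
void gather8_model(const float *base, const int32_t idx[8], float out[8])
{
    for (int lane = 0; lane < 8; ++lane)
        out[lane] = base[idx[lane]];
}

/* Count the distinct pages touched by the eight element addresses.
   This bounds the number of address translations the operation needs,
   regardless of whether it is one gather or an emulated sequence.
   page_size is supplied by the caller (an assumption, not queried). */
int distinct_pages8(const float *base, const int32_t idx[8], size_t page_size)
{
    uintptr_t pages[8];
    int count = 0;

    assert(page_size > 0);

    for (int lane = 0; lane < 8; ++lane) {
        uintptr_t page = (uintptr_t)(base + idx[lane]) / page_size;
        int seen = 0;

        /* Linear search is fine for at most eight entries. */
        for (int k = 0; k < count; ++k) {
            if (pages[k] == page) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            pages[count++] = page;
    }
    return count;
}
```

With indices spaced 1024 floats apart and 4096-byte pages, every lane lands on its own page, which is the worst case from the question. With consecutive indices, the eight loads span 32 bytes and touch one page, or two if they straddle a page boundary.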