"As 3D and others pointed out, trying to add wide scatter/gather hurts serial performance, so they are unlikely to be added."
Why does that have to be the case? Intel already doubled the number of load units in SNB to two, without compromising single-threaded scalar performance. So I don't see any reason why a gather instruction with a maximum throughput of one every four cycles can't be implemented in a straightforward way without negatively impacting anything else.
Note that these load units have been capable of handling misaligned loads which straddle a cache line boundary for ages. One line could be in L1, the other in a swapped-out page (though typically they'll be close together). The above implementation would merely require extending this to keep track of four cache lines instead of two.
From that point on, it seems relatively simple to me to add the ability to fetch up to four 32-bit elements from each cache line. Of course each of these things will require additional transistors and/or latency, but as the transistor budget continues to increase exponentially and clock frequencies increase only at a modest pace, it seems to me this will soon enough pose little problem. Doubling the number of load units in SNB also wasn't free, but 32 nm made it feasible without noticeable compromises...
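To make the idea a bit more concrete, here is a rough Python model of what I mean. It is only a sketch, not a description of any actual hardware: the 64-byte cache line size, the naturally aligned 32-bit elements, and the function names `gather` and `lines_touched` are all my own assumptions for illustration. The first function gives the reference semantics of a gather (one independent load per index); the second counts how many distinct cache lines the load unit would have to track for a given set of indices.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size
ELEM_BYTES = 4         # 32-bit elements, assumed naturally aligned


def gather(memory, base, indices):
    """Reference semantics of a gather: one independent load per index.

    `memory` is modeled as a list of 32-bit values and `base` as an
    element offset into it (both hypothetical, for illustration only).
    """
    return [memory[base + i] for i in indices]


def lines_touched(base_addr, indices):
    """Return the sorted cache line numbers touched by a gather.

    `base_addr` is a byte address; each index selects one 32-bit element.
    """
    return sorted({(base_addr + i * ELEM_BYTES) // CACHE_LINE_BYTES
                   for i in indices})


if __name__ == "__main__":
    memory = list(range(256))
    clustered = [0, 1, 2, 3, 16, 17, 18, 19]
    scattered = [0, 16, 32, 48, 64, 80, 96, 112]
    print(gather(memory, 0, clustered))
    print(len(lines_touched(0, clustered)), "cache lines for clustered indices")
    print(len(lines_touched(0, scattered)), "cache lines for scattered indices")
```

With clustered indices like the first set, all eight elements come from just two cache lines, which is the case where fetching up to four elements per line pays off. Fully scattered indices touch one line per element, so they would simply take more passes.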
"Tex units are non-negotiable, and in all likelihood, to make a competitive GPU you'll need more FF HW as well."
For a competitive GPU, sure, but this was about making the CPU competitive with the IGP (and at the same time turning it into a generic high-performance computing device like Larrabee). Very different design rules apply and everything is negotiable.