"Scatter/gather must take a mask register... afterwards, mask will be all-zeros". This is their answer to allowing vector loads/stores to be interruptible. Say one of the memory locations has a page fault? They can actually interrupt the vector memory operation mid-way through by just adjusting the mask to reflect what has been done and what is left to be done. When they come back from the page fault, they just restart the instruction and it just handles it. Very simple; very clever.
It's a simplifying design choice, one that has been discussed as being a likely choice.
Larrabee does not implement scatter and gather as single instructions, if I recall some text I read (possibly in an article by Abrash).
Rather they have a microcode component.
The implementation details can be interesting. The microcode engine in other CPUs can potentially block instruction fetch completely for a thread, and the serialization of gathers and stores also means there are questions on the atomicity of the process.
As nice as gather and scatter are, some people did hope for atomic vector operations, as difficult as that might be to implement.
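For what it's worth, the mask-based restart from the quote can be sketched in plain C. This is only a software model of the idea, not how the hardware or its microcode actually does it: the 16-lane width, `PAGE_ELEMS`, the `resident` array standing in for the page tables, and the name `masked_gather` are all my own assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16        /* assumed vector width: 16 x 32-bit lanes */
#define PAGE_ELEMS 16   /* assumed "page" size in elements, for the model only */

/*
 * Gather base[offsets[i]] into dst[i] for every lane whose mask bit is set.
 * Each completed lane clears its mask bit. If a lane touches a page that is
 * not resident (resident[] is a hypothetical stand-in for the page tables),
 * the function returns false with the mask still marking the unfinished
 * lanes, so calling it again with the same arguments resumes where it stopped.
 */
bool masked_gather(int32_t dst[LANES], const int32_t *base,
                   const uint32_t offsets[LANES], uint16_t *mask,
                   const bool *resident)
{
    assert(mask != NULL);
    for (int lane = 0; lane < LANES; ++lane) {
        uint16_t bit = (uint16_t)(1u << lane);
        if (!(*mask & bit))
            continue;                 /* lane already done or never requested */
        if (!resident[offsets[lane] / PAGE_ELEMS])
            return false;             /* "page fault": mask records what is left */
        dst[lane] = base[offsets[lane]];
        *mask &= (uint16_t)~bit;      /* lane complete */
    }
    return true;                      /* mask is now all zeros */
}
```

The caller (playing the role of the fault handler) just makes the page resident and reissues the same call with the same mask. Nothing here says anything about atomicity: another thread could modify memory between the two calls.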
It also says: "L1$ [first-level cache] can only handle a few access per clock." Since they didn't say exactly "two" or something, what I bet they did is bank it four ways. That would give "a few accesses per clock" on average.
I'm curious whether it has something to do with there being both a vector memory pipe and the standard 32-bit ports used by the x86 front end. Perhaps certain operations that have been masked to only use a certain number of bits can borrow the other ports?
It also says: "Offsets referring to the same cache line can happen on the same cycle". So, it sounds like if it does a dense vector load or store it actually goes pretty quickly, but even if you do a scatter/gather, it can take advantage of some locality, too.
It seems to indicate that the vector memory pipe uses whole cache-line loads as an input, and that it has a very wide memory port.
Sorting offsets after that through the already present multiplexing hardware would seem to follow.
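To put a rough number on the locality point: if each distinct cache line costs one L1 access, the cost of a gather is about the number of distinct lines its offsets touch. A minimal sketch, assuming 64-byte lines and byte offsets (the name `lines_touched` and the one-access-per-line cost model are my assumptions, not anything from the slides):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed cache-line size */

/*
 * Count the distinct cache lines touched by n byte offsets.
 * Under the assumed cost model (one L1 access per line), this is a rough
 * estimate of how many accesses a gather or scatter would need.
 */
size_t lines_touched(const uint32_t *offset, size_t n)
{
    assert(offset != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t line = offset[i] / LINE_BYTES;
        int seen = 0;
        for (size_t j = 0; j < i; ++j) {      /* quadratic, but n is tiny */
            if (offset[j] / LINE_BYTES == line) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            ++count;
    }
    return count;
}
```

Sixteen consecutive 4-byte elements starting on a line boundary come out as one line, while sixteen offsets with a 64-byte stride come out as sixteen, which is the dense versus worst-case scatter/gather difference being discussed.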
When you add four threads per core to suck up various pipeline latencies (branch mis-predictions, dependent operations, cache misses) plus scatter/gather prefetching instructions (giving lots of memory-level parallelism), it seems like this could be reasonably easy to program (perhaps not to get absolute peak performance, but getting reasonable performance doesn't sound so hard).
It potentially can and potentially can't.
The software rasterizer being used has some interesting restrictions that indicate areas where Larrabee's supposedly generalist architecture enforces, through practicality, certain outlines of the implementation.
Larrabee won't necessarily fail on other schemes (unless they need atomic scatter/gather for some reason), but they could still be suboptimal.
Threads of a certain type should run together and be fixed to a core.
Threads should try to maintain locality as much as possible.
Threads should try to minimize the amount of invalidation of shared data.
Threads should not migrate; hardware context switches are ruinous.
It is better to have redundant work than try to arbitrate between cores.
The binned renderer design takes all of these concerns into account. It pins threads to a single core. It keeps raster threads associated with their tiles, which keeps their data set local both physically and in terms of screen space and screen data. Primitives at the edge of tiles are worked over multiple times if need be. It is possibly the optimal scheme for Larrabee.
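The "redundant work instead of arbitration" part of binning is easy to sketch. In the C below, a primitive whose bounding box straddles a tile boundary is simply appended to every bin it overlaps, so each tile's thread can work from its own list without talking to its neighbours. The tile size, grid dimensions, bin capacity, and all the names are assumptions for illustration; this is not the actual Larrabee renderer's data layout.

```c
#include <assert.h>

#define TILE     64   /* assumed tile edge in pixels */
#define TILES_X  4    /* assumed 4x4 tile grid (256x256 screen) */
#define TILES_Y  4
#define BIN_CAP  32   /* assumed fixed bin capacity, for the sketch only */

typedef struct { int x0, y0, x1, y1; } BBox;   /* inclusive pixel bounds */

typedef struct {
    int count[TILES_Y][TILES_X];
    int prim[TILES_Y][TILES_X][BIN_CAP];
} Bins;

/*
 * Append primitive `id` to the bin of every tile its bounding box overlaps.
 * Returns the number of bins it landed in; anything above 1 is work that
 * will be repeated by more than one tile.
 */
int bin_primitive(Bins *bins, int id, BBox b)
{
    /* The sketch assumes the box is already clipped to the screen. */
    assert(0 <= b.x0 && b.x0 <= b.x1 && b.x1 < TILE * TILES_X);
    assert(0 <= b.y0 && b.y0 <= b.y1 && b.y1 < TILE * TILES_Y);

    int added = 0;
    for (int ty = b.y0 / TILE; ty <= b.y1 / TILE; ++ty) {
        for (int tx = b.x0 / TILE; tx <= b.x1 / TILE; ++tx) {
            assert(bins->count[ty][tx] < BIN_CAP);
            bins->prim[ty][tx][bins->count[ty][tx]++] = id;
            ++added;
        }
    }
    return added;
}
```

A primitive sitting on a tile corner ends up in four bins, so it gets set up four times, but no tile ever has to coordinate with another to rasterize it.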