Who ever said this is fully general? No massively multicore beast (be it a traditional GPU, a multi-core CPU, or Larrabee) is fully general. In fact, programming these multicore things and using wide vectors is tricky.
It's implied by various statements that deemphasize any possible Larrabee performance shortfall by saying "it'll be more flexible".
The question is what that flexibility means if the architecture's limitations make the binning rasterizer Larrabee uses one of only a few optimal solutions.
It's kind of bleh if programmers are given all the flexibility to implement the same rasterizer + an extra bell or whistle.
At least the real time raytracer proponents would have their sights set on something interesting.
My point was that Larrabee has several attributes that will make it easier to realistically achieve high performance.
In various situations, I'd also believe this to be the case.
The claim was qualified with the word "reasonable", which I'll get to later.
The relative utility of those cases won't be determined until an actual implementation is put through its paces.
Many common parallel idioms can be directly mapped to a task-based model. A good task scheduler isn't a limitation as much as an enabler.
There are certain underlying assumptions that a task model on Larrabee would suggest to a programmer that are not universally necessary, but reflect quirks of the design.
I just feel it's more honest to state that while we can decry how GPUs constrain the programmer in various places, the alternative is not unconstrained either.
I would say if cache-to-cache misses are as fast as a miss to main memory, that would be "reasonably fast".
I think it'd be faster than that. The texture units and cores talk to each other over the same interconnect the caches use to transfer data.
It wouldn't do to have the mere act of asking for a texture access take several hundred cycles.
Depending on the latency of the off-chip DRAM access, it might be significantly faster. Also, multithreading helps hide the latency of cache-to-cache misses as well, so even if it's slower than we'd like, a modest amount of sharing shouldn't have that much of an impact on performance.
It can, though 4 threads isn't an embarrassment of riches when it comes to latency hiding, particularly since the threads on a core would tend to be running the exact same code and so would hit the same hot spots in the same time frame.
That was just a non-serious point about the use of the word "reasonable", which I feel can be a rather squishy and pliable term.
There was a paper from Intel at ISCA 2008 titled "Atomic Vector Operations on Chip Multiprocessors". At least in this paper, the atomic operations they are talking about allow for atomic read-modify-writes on a per-lane basis (rather than requiring all the loads or stores of a single gather/scatter to occur atomically). The key thing this lets you do is vectorize a loop with some sort of synchronization in it (such as a lock or a lock-free data structure). Basically, the proposal is a load-linked/store-conditional pair (as used on MIPS, Alpha, and PowerPC), but in vector form.
If that is the sort of atomic vector operation you're talking about, I too would like to see them have it. Perhaps Larrabee II will add such support.
I think that's the one.