I totally agree.
Let me clarify my previous post. What I intended to say is that there are new instructions *and* special-purpose ALUs to support those instructions. However, unlike current GPUs, there is no *other* special hardware. No dedicated fragment pipeline or special z-buffering frame buffer (or whatever else GPUs have today). Just many x86 cores with extra vector ALUs for executing the new instructions tailored for graphics processing.
The hardware internals for GPUs are often more flexible than is exposed by graphics APIs and graphics drivers, hence CUDA and CTM.
While there are a number of design wrinkles and bits of hardware meant to help emulate the state machine of the graphics pipeline, a significant amount of the hardware is workload-agnostic.
It's sort of like how Transmeta's VLIW processors had a fair number of design decisions that made little sense for VLIW, but were added to allow them to perform the job of emulating an x86 state machine.
The setup engine and a few caches optimized for 2-dimensional access patterns are examples that come to mind, and they may or may not translate well to other workloads.
Without a die plot and better numbers, the penalty of the specialized hardware is difficult to quantify.
For GPGPU, this might be a problem.
For consumer graphics, and for quite some time into the future, the lack of specialized hardware for emulating the graphics pipeline may be a detriment to Larrabee.
The key difference is the programming model. For Larrabee, a program can just use inline assembly (or library calls) to insert these vector operations into a regular program. There is no special setup or other low-level, implementation-specific poking of the hardware to get the special-purpose units going. Just as SSE isn't conceptually difficult to add to a program (assuming it has the right sort of data parallelism), these vector instructions will be similarly easy to use.
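To make that concrete, here's a minimal sketch in C using SSE intrinsics as a stand-in, since Larrabee's actual vector ISA hasn't been published. The point is just that the vector operations drop into an ordinary program with no driver or device setup:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add two float arrays four elements at a time. For brevity,
       n is assumed to be a multiple of 4 and the pointers
       16-byte aligned. */
    void vec_add(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&dst[i], _mm_add_ps(va, vb));
        }
    }

Presumably a Larrabee compiler would expose the new instructions the same way, just with wider vectors.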
In the consumer graphics space, Larrabee's flexibility would be blunted by the fact that it would be hiding behind an API and a driver, just like the GPUs.
In systems where it is allowed to function as a primary processor, it would have an advantage.
GPUs do suffer from not having a similar capability, though consumer graphics has adapted to this model well enough.
Another key point is that Larrabee has coherent caches (just like Intel's other multi-core systems). Unlike a GPU, which requires explicit commands to move data around the system and/or flush caches at the right time, all of that is done seamlessly on Larrabee. Instead of burdening the programmer with worrying about all these issues, Larrabee really is just a shared-memory multiprocessor on a chip.
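A minimal sketch of what that buys you, using ordinary pthreads (the buffer size and worker count here are made up): each worker writes its slice of a shared array, and the results are simply visible to the main thread after the join, with no explicit copies or cache flushes anywhere:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1024
    #define WORKERS 4

    static float shared_buf[N];  /* plain memory, kept coherent by hardware */

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        /* Each thread fills its own slice; no flush or DMA needed. */
        for (int i = id * (N / WORKERS); i < (id + 1) * (N / WORKERS); i++)
            shared_buf[i] = (float)i * 2.0f;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[WORKERS];
        for (long i = 0; i < WORKERS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < WORKERS; i++)
            pthread_join(t[i], NULL);
        /* The results are just there; coherency is the hardware's problem. */
        printf("%f\n", shared_buf[N - 1]);
        return 0;
    }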
GPUs, SLI and Xfire aside, usually operate with partially shared internal caches. G80 exposes an explicit parallel data cache, which complicates matters.
R600 has a number of shared caches.
Slides on R7xx seem to indicate a more transparent sharing of separate memory controllers. It seems likely that by 2010 a fair amount will have changed in this area.
R600 was already equipped with TLBs and an internally distributed memory-client model.
Memory coherency and synchronization between clusters tend to be weakly defined on GPUs, but signs point to an evolution toward a model closer to x86 (though likely still distinct).
I don't understand why under SMT one thread would block the other threads. The whole point of threading is to allow the other threads to continue.
I was reinforcing the point that Larrabee's units are likely fully pipelined for most instructions.
If an execution unit is not fully pipelined for a given operation, it cannot start on the next operation until the first instruction has cleared whatever interlock is in place.
If a unit takes 5 cycles for an operation but is not fully pipelined, then for some number of cycles it cannot allow any instruction issue at all, irrespective of data dependencies. As a structural hazard, it also spans threads, whereas data dependencies do not.
If it is fully pipelined, an operation can begin stage 1 as an earlier operation enters stage 2.
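Back-of-the-envelope: for a unit like the hypothetical 5-cycle one above, issuing N independent ops takes latency + (N - 1) * interval cycles, where the initiation interval is 1 if fully pipelined and the full latency if not (the op count here is made up):

    #include <stdio.h>

    /* Cycles to push n independent ops through one unit:
       full latency for the first op, then one op per initiation interval. */
    static int cycles(int n, int latency, int interval)
    {
        return latency + (n - 1) * interval;
    }

    int main(void)
    {
        /* Hypothetical 5-cycle unit, 100 independent ops. */
        printf("fully pipelined: %d cycles\n", cycles(100, 5, 1)); /* 104 */
        printf("not pipelined:   %d cycles\n", cycles(100, 5, 5)); /* 500 */
        return 0;
    }

That roughly 5x gap in issue bandwidth is what all the threads end up fighting over.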
For common instructions, a lack of full pipelining is highly undesirable.
For common instructions, the contention is compounded by having multiple threads. This is quite possibly worst in consumer graphics, where many threads can be assumed to be working through similar instruction mixes at any given instant.
For SMT, it can result in stalls, especially at 4-way threading.
For FMT, it is an extra scheduling headache because the threads are supposed to cycle regularly.
Although most systems have a hard time reaching peak performance, having 4 threads per processor to suck up ALU bandwidth will help Larrabee get much closer to peak performance than systems without threads (such as Intel's current multi-core chips).
Of course, the big downside is that now the programs need to generate 128 threads, which isn't a trivial task.
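For a data-parallel workload, at least the carving-up is mostly bookkeeping. A hedged sketch (the 128 follows from the 4 threads per core above on an assumed 32 cores, and the frame size is made up); the hard part is keeping all 128 contexts usefully busy, not computing the slices:

    #include <stdio.h>

    #define CORES    32
    #define SMT      4
    #define NTHREADS (CORES * SMT)  /* 128 hardware contexts to fill */

    /* Divide total_pixels among NTHREADS workers; the last worker
       picks up any remainder. */
    static void slice(int id, int total_pixels, int *start, int *count)
    {
        int chunk = total_pixels / NTHREADS;
        *start = id * chunk;
        *count = (id == NTHREADS - 1) ? total_pixels - *start : chunk;
    }

    int main(void)
    {
        int start, count;
        slice(127, 1920 * 1080, &start, &count);  /* hypothetical 1080p frame */
        printf("thread 127: pixels %d..%d\n", start, start + count - 1);
        return 0;
    }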
For graphics work, it isn't too hard to generate threads.
The "thread" counts are in the many hundreds to maybe a thousand for GPUs right now.
A full x86 thread's context is definitely heavier than a single primitive running through a GPU, however.
For consumer graphics, I am unsure heavy threading is the way to go in the long run, and it does run up against a fair amount of inertia in the near and medium term.