From what I've heard about Larrabee, the vector extensions are wildly different (which is one of the reasons that the rest of Intel doesn't like it).
Sounds like the ones most angry are the mainline x86 designers, who might have wanted to integrate Larrabee cores alongside the big cores without ending up with a heterogeneous multicore solution. Both Intel and AMD initially thought specialized, software-incompatible cores were the next step, but both have backed off on that for the time being.
Larrabee doesn't support MMX or any of the SSE instructions. They actually went back to the original Pentium core. They did extend it to 64-bit x86, but without SSE.
Sounds like somebody needed the opcode space and didn't want to follow AMD's Another Damned Prefix Byte approach.
Most of my speculation works so long as Larrabee's new instruction set has not done the following:
1) Gone totally Load/Store
2) Gone with non-destructive operands
3) Junked x86 memory and addressing semantics
4) Added 2-4 bytes to every instruction, to the detriment of its instruction cache (which would rather defeat the point of avoiding Another Damned Prefix Byte).
1, 2, and 3 would go a long way toward invalidating much of the work in the established x86 vectorizing compilers Intel was supposedly leveraging for Larrabee to entice software developers; a loop like the one sketched below is their bread and butter.
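For illustration, a minimal sketch of what I mean; the generated-code comments are my own assumption of typical SSE output, not anything out of Intel's compiler documentation:

#include <stddef.h>

/* The bread-and-butter case a current x86 vectorizing compiler handles:
 * a unit-stride loop it can turn into packed SSE operations. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    /* With SSE enabled, the compiler typically emits something along the
     * lines of (assumed pattern, not actual compiler output):
     *   movaps xmm1, [x + offset]   ; load 4 packed floats
     *   mulps  xmm1, xmm0           ; xmm1 *= a  -- destructive two-operand form
     *   addps  xmm1, [y + offset]   ; reg-memory operand, classic x86
     *   movaps [y + offset], xmm1
     * Points 1-3 above would each break one of these assumptions: load/store
     * kills the reg-memory forms, non-destructive operands change the
     * encoding, and new memory semantics change the addressing. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}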
Such a departure isn't unprecedented. The IBM Cell SPEs use a different sort of SIMD instructions (and different number of registers) than the normal PowerPC Altivec stuff.
The particulars vary, but the overall tenor is not massively different, save one interesting exception.
It's not like the SPE decided to go CISC with destructive operands and reg-memory operations.
The big difference is that the SPEs and their local store and DMA to main memory required a reworking of a lot of the memory addressing behavior and a removal of all software permissions checking.
Larrabee, if it supports any x86 at all, cannot do that (unless there's some wacky scheme with two decoders and two control units: an x86 mode that can handle virtual memory and permissions, and a vector mode that does something entirely different).
They basically re-designed the vector instructions from the ground up to be graphics-specific. That is how they plan to get away with not having any other specialized graphics hardware on the chip: just these special vector ALUs. Seems like a big gamble, but I am convinced by the pitch, frankly.
If that means they broke x86 semantics and are moving away from two decades of handling x86's baggage, I'd be happy from a philosophical standpoint.
That doesn't jibe with the idea that Intel wants to leverage the weight of x86 to make headway in new markets, though.
Actually, with a very GPU-like extension, it might even open up an avenue for GPUs to partly mitigate Larrabee's x86 advantage, especially since Larrabee doesn't support the x86 instructions still covered by patents. Perhaps Intel Legal and Intel Marketing are a little cheesed at Larrabee's team because of this, too.
In many ways, perhaps Larrabee is Cell "done right".
No, 1-2 Gesher cores and 8-16 fully compatible Larrabee cores on the ring bus would be Cell "done right". Larrabee's designers apparently went out of their way to screw that up.
Transistor count isn't the most relevant issue anymore. The two most important issues are (1) power and (2) die area. Granted, these are related to transistor count, but not always one-for-one.
Any differential in transistor count per core is going to have an effect on die area and power consumption. It isn't one-for-one, but the relationship is almost never negative; the one exception I can think of is sleep transistors, which cut power consumption for idle units.
edit: To complete this point: any differential is scaled by a factor of 24. Unless you really expect the additional transistors to buy a significant benefit, leaving them off is sometimes the better option; rough numbers below.
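A back-of-envelope version of that, with numbers I'm making up purely for scale:

#include <stdio.h>

int main(void)
{
    const double cores          = 24;
    const double extra_per_core = 3e6;   /* hypothetical: 3M extra transistors per core */
    const double core_budget    = 40e6;  /* hypothetical: ~40M transistors per core     */

    double total_extra = cores * extra_per_core;
    printf("extra transistors across the die: %.0fM\n", total_extra / 1e6);
    printf("equivalent whole cores:           %.1f\n", total_extra / core_budget);
    /* With these invented numbers: 72M transistors, roughly 1.8 cores' worth.
     * Unless the feature buys more than ~2 extra cores would, leave it off. */
    return 0;
}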
Intel's 45nm process has very small SRAMs and it has very low power SRAM transistors (by using special low-leak transistors). You can get lots of L2 cache on a chip without burning much power or taking up that much die area. Once you're basically power limited by your ALUs, why not throw some extra cache on the chip if you have enough die area?
The question becomes why you're power-limited at a certain level by the ALUs.
There's a difference between working hard and working smart.
If a certain amount of specialized hardware for a common workload task, with the footprint of one general ALU and its supporting network, can do the work of 10 ALUs at the same or lower power, that's enough to either cut 10 ALUs or add 9 more units' worth of throughput.
edit: to continue
A specialized unit might have a massive effect on the target workload.
A few million extra L2 transistors in each core that wind up yielding a few percent in a few workloads (thank you, diminishing returns), not so much.
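To put rough numbers on the working-smart point (invented figures again, just to show the shape of the trade-off):

#include <stdio.h>

int main(void)
{
    /* Option A: a specialized unit with the footprint of one general ALU
     * that does the work of 10 ALUs on the target workload. */
    double specialized_work = 10.0, specialized_area = 1.0;

    /* Option B: spend the same footprint on extra L2 and get a few percent. */
    double cache_work = 1.03, cache_area = 1.0;

    printf("specialized unit: %.1fx work per unit area\n", specialized_work / specialized_area);
    printf("extra cache:      %.2fx work per unit area\n", cache_work / cache_area);
    /* 10x vs 1.03x for the same footprint on the targeted workload: you can
     * either drop ~10 general ALUs or add ~9 more units' worth of throughput. */
    return 0;
}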
Conventional wisdom is that caches don't work for graphics computations. Perhaps Intel has found more locality in graphics applications (in the multi-MB range of caching) than previously thought.
That's an interesting question, and I'm sure research is active in this area.
The conventional wisdom has worked well thus far.
The low-hanging fruit with large caches is perhaps loading up whole tables and textures, but in many complex scenarios the amount of data needed for a given pixel can swing wildly.
I was going to make this same point. I think the point is even stronger considering the caches in Larrabee are all kept coherent by the hardware. No need to perform explicit extra data movement or flushing of caches (or whatever). In that sort of system, any part of the address space can become a flexible buffer for storing intermediate results on chip.
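A minimal sketch of what that buys you, assuming nothing about Larrabee beyond ordinary coherent shared memory; on Cell you'd instead have to stage the tile through an SPE's local store with explicit DMA:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TILE 1024
static float tile[TILE];          /* intermediate results; lives wherever the caches put it */
static atomic_int tile_ready = 0;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < TILE; i++)
        tile[i] = (float)i * 0.5f;           /* plain stores; coherency hardware does the rest */
    atomic_store_explicit(&tile_ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&tile_ready, memory_order_acquire))
        ;                                    /* spin until the tile is published */
    float sum = 0.0f;
    for (int i = 0; i < TILE; i++)
        sum += tile[i];                      /* plain loads; no flush or invalidate needed */
    printf("sum = %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}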
I hope that can be turned off at will, given how much of the workload doesn't need full coherency. Those nifty scatter and gather operations would, in really bad cases, generate 24 times their number of accesses in coherency updates, to the exclusion of actual data being passed around.
If you know you are working on intermediate results, you shouldn't have to care about updating 24 separate caches and 24 separate TLBs.
On the flip side, Intel would likely have to implement some way of maintaining processor affinity. Passing any amount of thread context is enough to saturate a good fraction of the bus for a good amount of time.
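Worst-case doodle of the concern, where the lane count and the probe-per-cache behaviour are my assumptions, not anything published about Larrabee:

#include <stdio.h>

int main(void)
{
    int lanes  = 16;   /* assumed: one gather touches up to 16 distinct cache lines */
    int caches = 24;   /* one private cache per core that might need to be probed   */

    printf("data accesses per gather:    %d\n", lanes);
    printf("worst-case coherency probes: %d\n", lanes * caches);
    /* 384 probe messages to move 16 elements is the "24 times" pathology;
     * hence wanting a way to opt out for known-private intermediates. */
    return 0;
}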
I think this is also a key point. The more specialized units you have, the more likely one of them will become the bottleneck while others go idle. You've statically allocated the resources. With a multi-core CPU, you can dynamically load balance to apply computation to just where you need it.
That is the one bottleneck the generalized hardware does have: that shared ring bus. It's likely specced as high as 256B/cycle for exactly this reason.
Try load balancing by passing an 8KiB vector thread context between cores more than a couple times.
At 24 cores, even if each core has a low probability of requiring this, the aggregate probability is higher.
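Two quick numbers behind that, with the per-core probability being an assumed figure and the 8KiB context and 256B/cycle ring taken from the posts above:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* If any one core wants to migrate its thread with probability p in some
     * interval, the chance that at least one of 24 cores does is 1-(1-p)^24. */
    double p = 0.02;   /* assumed per-core probability, for illustration */
    printf("P(at least one migration) = %.2f\n", 1.0 - pow(1.0 - p, 24.0));

    /* Moving an 8KiB vector context over a 256B/cycle ring takes at least
     * 8192 / 256 = 32 bus cycles per hop, before any protocol overhead,
     * and that bandwidth is shared with all the real data traffic. */
    printf("minimum ring cycles per context move = %d\n", 8192 / 256);
    return 0;
}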
God forbid either the OS or the driver software does the willy-nilly thread thrashing current x86 multicore does.