It's obvious Larrabee won't be a full-fledged Core 2, so let's speculate on what it will and won't have.
We already know, I think, that it will be in-order.
Branch prediction: Larrabee's in many-core territory, so speculative execution of any sort is going to raise eyebrows.
It seems like a small sacrifice in this case to dump or severely restrict it.
Separate predictors impose a pretty hefty storage penalty, though basic prediction can simply be a 2-bit saturating counter stored alongside a cache line.
It would seem workable, but might be pointless for target workloads that are so rich in non-speculative work.
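For anyone who hasn't seen one, here's a minimal Python sketch of the 2-bit saturating counter idea. The class name, the starting state, and the `run` helper are my own choices for illustration, and nothing here reflects how Larrabee actually does it (if it predicts at all).

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=1):
        # Starting at "weakly not taken" is an arbitrary choice for this sketch.
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def run(outcomes):
    """Feed a sequence of branch outcomes (True = taken); return correct predictions."""
    counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct


# A loop branch taken nine times, then falling through once.
loop_branch = [True] * 9 + [False]
print(run(loop_branch), "of", len(loop_branch), "predicted correctly")
```

The point of the two bits is that a single loop exit only knocks the counter down one step, so the next trip through the loop still starts out predicting taken.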
Superscalar execution: Larrabee's vector processor is 512 bits wide, or eight 64-bit DP lanes, which is too narrow to support the DP flop count per cycle unless there are at least two FP computation pipes.
That might mean each core is at least 2-wide superscalar.
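The lane arithmetic behind that guess, as a quick Python sketch. I'm assuming each lane retires one DP operation per pipe per cycle, and the target figure passed to `pipes_needed` is a made-up placeholder, not a known Larrabee number.

```python
VECTOR_BITS = 512  # vector width from the post
DP_BITS = 64       # one double-precision value


def dp_lanes():
    """Number of DP values that fit in one vector register."""
    return VECTOR_BITS // DP_BITS


def dp_flops_per_cycle(pipes):
    """Peak DP flops per cycle, assuming one op per lane per pipe per cycle."""
    return dp_lanes() * pipes


def pipes_needed(target_flops_per_cycle):
    """Smallest pipe count that reaches a (hypothetical) per-core target."""
    lanes = dp_lanes()
    return -(-target_flops_per_cycle // lanes)  # ceiling division


print(dp_lanes())             # lanes per vector
print(dp_flops_per_cycle(1))  # one pipe
print(dp_flops_per_cycle(2))  # two pipes
print(pipes_needed(16))       # hypothetical target of 16 DP flops/cycle
```

Any target above eight DP flops per cycle per core forces a second pipe under these assumptions, which is where the 2-wide guess comes from.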
Hardware data prefetching: Sounds iffy with many-core. I'd expect it to be software-only (with x86 compatibility, it would have to support what's already there), perhaps augmented by the rumored new cache and data control instructions that should be coming along.
Register renaming: It can be done with in-order cores, but its utility is much less than if they were OoO and speculative. x86 has reg/mem ops, which reduces some of the pressure on the register file, though at the expense of beating up on the L1.
The rather large amount of L1 that is private to each core seems to indicate it will be leaning pretty heavily on the cache.
Cache:
If the core is superscalar, it would most likely need a dual-ported data cache.
The data paths would be pretty huge, too.
512 bits is the width of one cache line, so the vector engine would be pulling in an entire cache line to fill a register or memory operand.
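To put numbers on that: 512 bits is 64 bytes, so an aligned vector operand maps onto exactly one line. A rough Python sketch follows. The unaligned case is my own addition to show why alignment would matter, and I have no idea whether Larrabee would even allow unaligned vector accesses.

```python
LINE_BYTES = 512 // 8  # 64-byte cache line, same width as one vector register


def lines_touched(addr, size=LINE_BYTES):
    """Count the cache lines covered by an access of `size` bytes at `addr`."""
    first = addr // LINE_BYTES
    last = (addr + size - 1) // LINE_BYTES
    return last - first + 1


print(lines_touched(0))     # aligned full-width load
print(lines_touched(128))   # aligned, a different line
print(lines_touched(32))    # unaligned full-width load straddles a boundary
print(lines_touched(8, 8))  # a scalar 8-byte load inside one line
```

An aligned full-width load is one line in, one register filled. Anything that straddles a boundary doubles the traffic on an already wide data path.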
The L2 is apparently private-write to each core, which is good since the L1 was said to be write-through.
I'm curious to see where a stop on the ring bus would fit into this. It's supposed to interface the core with its neighbors, which means it would be charged with coherency, memory, and remote read traffic.
That sounds like it would sit next to the load/store hardware and might subsume the cache controller.
Threading: 4-way is known. Has it been said whether it's SMT or something else?
I'd hazard it would be some variation of fine-grained, switch-on-event threading, like Niagara or GPU threading.
SMT might not be worth it.
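Here's a toy Python model of what I mean by fine-grained, switch-on-event threading: one issue slot per cycle, round-robin across whichever threads aren't parked on a miss. The op names, the `simulate` function, and the miss latency are all invented for illustration, and it's in no way a model of the real hardware.

```python
def simulate(threads, miss_latency=10):
    """Issue one op per cycle from the next ready thread, round-robin.

    threads: list of op lists; each op is "alu" or "miss".
    A "miss" issues in one cycle, then parks its thread for miss_latency cycles.
    Returns the cycle count until every thread has issued all of its ops.
    """
    n = len(threads)
    pc = [0] * n        # next op index per thread
    ready_at = [0] * n  # first cycle at which each thread may issue again
    cycle = 0
    last = -1           # thread that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        for offset in range(1, n + 1):
            t = (last + offset) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                op = threads[t][pc[t]]
                pc[t] += 1
                if op == "miss":
                    ready_at[t] = cycle + 1 + miss_latency
                last = t
                break
        cycle += 1  # an idle cycle if no thread was ready
    return cycle


work = ["miss", "alu"]
print(simulate([work], miss_latency=3))      # one thread eats the whole stall
print(simulate([work] * 4, miss_latency=3))  # four threads hide each other's stalls
```

With four threads the stalls overlap completely in this toy case, so the pipe never idles. That's the whole appeal over SMT for a simple in-order core: no need to issue from several threads in the same cycle, just have somebody else ready when one thread blocks.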
The big unknown is the vector instruction set.
It supposedly goes as far as implementing specialized control and branch instructions.
That means Larrabee for graphics might be able to run mostly on the extension set, with almost no conventional x86 instructions getting in the way.
In theory, that could mean significant changes to the ISA, since it would be a lot more self-contained than current extensions.
In graphics vector mode, it might dispose of a lot of the cruft found in the rest of the ISA, which means it could do some extra things with the empty opcode space.
(More registers, other stuff?)
Intel might not be that adventurous, though.