What "architectural baggage" would you remove, and exactly how many transistors would that save (as a percentage of the entire die)?
How about a fixed-length load/store instruction set with a few addressing modes?
I've gone over guesstimates before.
The original Pentium had merely 3.1 million transistors, and only a fraction of that can be considered legacy overhead.
The P5 core weighed in at 3.1 million transistors; a contemporaneous Alpha weighed in at 1.68 million. We don't even need to single out legacy overhead to see a significant dent.
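As a rough back-of-the-envelope check on that dent, here is a minimal sketch using only the two transistor counts quoted above. The variable names are mine, and the counts are taken at face value: nothing here normalizes for cache size, process, or feature set, so treat the result as a ballpark rather than a measurement of x86 overhead.

```python
# Transistor counts as quoted in this post (not normalized for cache or features).
p5_transistors = 3.1e6      # original Pentium (P5) core
alpha_transistors = 1.68e6  # contemporaneous Alpha

# Alpha's transistor count as a fraction of the P5's.
ratio = alpha_transistors / p5_transistors

# Relative reduction going from the P5 count to the Alpha count.
reduction = 1.0 - ratio

print(f"Alpha / P5 transistor ratio: {ratio:.1%}")
print(f"Reduction relative to P5:    {reduction:.1%}")
```

On these figures the Alpha comes in at a bit over half the P5's transistor count, which is the "significant dent" I mean.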
Larrabee's cores will be much faster than the original Pentium, close to an Atom perhaps (which runs lots of modern regular applications).
For loads that can be put through vector resources, Larrabee cores should be faster.
For regular x86 code, Atom has the advantage in cache, clock speed, and issue width; Larrabee's issue width is significantly more restricted.
This is where the tiny bit of x86 baggage starts paying off.
We've gone over this canard as well. The minuscule effort saved by using x86 over any other established ISA is dwarfed by the fact that you need to learn how to massively multithread, properly use the shared cache, and use the new vector ISA.
Their live prototype demonstration indicates none of that. My guess is still software complications.
The live demonstration showed a non-overclocked SGEMM running at ~800 GFLOPS.
Other x86 chips have hit over 90% of theoretical peak running SGEMM.
Larrabee's DP target was at least 1 TFLOPS, which also meant 2 TFLOPS or more for SP, with a clock range between 1.5 and 2.5 GHz.
When overclocked, the >600 mm² chip barely outperformed the best known score for a stock RV770 (in this forum).
At the very least, the chip that was demoed was not ready for prime-time.
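To put numbers on that gap, here is a minimal sketch using only the figures above. It assumes the 2 TFLOPS SP target can stand in for theoretical peak (the actual peak of the demoed part isn't given here), and it uses 90% as the efficiency other x86 chips reach on SGEMM. The function and variable names are just for illustration.

```python
def fraction_of_peak(measured_gflops: float, peak_gflops: float) -> float:
    """Return measured throughput as a fraction of the assumed peak."""
    return measured_gflops / peak_gflops

# Figures quoted in this post.
demo_sgemm_gflops = 800.0    # live demo, non-overclocked SGEMM (approximate)
sp_target_gflops = 2000.0    # assumption: SP target (2 TFLOPS) treated as peak
typical_efficiency = 0.90    # other x86 chips on SGEMM

# How much of the assumed peak the demo reached.
achieved = fraction_of_peak(demo_sgemm_gflops, sp_target_gflops)

# What the demo would have needed to show at 90% of the assumed peak.
expected_gflops = typical_efficiency * sp_target_gflops

print(f"Demo reached {achieved:.0%} of the assumed SP peak")
print(f"At 90% efficiency it would have shown about {expected_gflops:.0f} GFLOPS")
```

Under those assumptions the demo lands well under half of the SP target, against roughly 1800 GFLOPS if it had matched the efficiency other x86 chips manage. If the demoed part's real peak was lower than the target, the efficiency looks better, but then the hardware itself is short of the target.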