Modifying a Pentium MMX doesn't seem like a bad idea to me. Discounting the effects of instruction set extensions like SSE, the Pentium III wasn't a whole lot faster per clock (maybe 20-30%?).
But per clock comparisons can be quite misleading.
The P-MMX (P55c) had a 6-stage pipeline vs the 10 stages of the PPro/P2/P3.
More importantly, it had a single-cycle, dual-ported data cache. The P3 has a single-ported data cache with a 3-cycle load-to-use latency. D$ access is likely to define the critical path (and so the cycle time). Load-to-use latency is very important for in-order CPUs because they can't schedule around data dependency hazards. That goes double for x86, where a significant number of instructions have memory operands.
Looking back, even at the same frequency the gap was significant:
P55c @ 233MHz
SpecInt95: 7.12
SpecFP95: 5.21
P-2 @ 233MHz
SpecInt95: 9.47
SpecFP95: 7.04
That makes the P-2 33% faster at SpecInt and 35% faster at SpecFP.
Of course, at the same time there were 300MHz versions of the P2 available, which scored 11.7 and 8.15 in SpecInt/FP. So in even terms (same availability date, same process) the P2 was 64% and 56% faster than the P-MMX.
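For anyone who wants to check the percentages, here is a quick sketch that recomputes them from the scores quoted above. The helper name `percent_faster` is just something I made up for illustration.

```python
# SPEC95 scores as quoted in this post: (SpecInt95, SpecFP95).
scores = {
    "P55c @ 233MHz": (7.12, 5.21),
    "P2 @ 233MHz": (9.47, 7.04),
    "P2 @ 300MHz": (11.7, 8.15),
}


def percent_faster(new, base):
    """Return how much faster `new` is than `base`, in percent."""
    return (new / base - 1.0) * 100.0


base_int, base_fp = scores["P55c @ 233MHz"]
for name in ("P2 @ 233MHz", "P2 @ 300MHz"):
    s_int, s_fp = scores[name]
    print(f"{name}: +{percent_faster(s_int, base_int):.0f}% int, "
          f"+{percent_faster(s_fp, base_fp):.0f}% fp")
# P2 @ 233MHz: +33% int, +35% fp
# P2 @ 300MHz: +64% int, +56% fp
```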
Regarding Larrabee:
For Intel to reach 1 teraflop/s with a 4GHz device, they'll need 32 cores each doing 8 floating-point ops per cycle. If we assume they will use the existing SSE/2/3/4 instruction set, that dictates a dual-issue core (two 4-wide single-precision ops per cycle). IMO, that's the only resemblance it will have to the original Pentiums.
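The back-of-the-envelope arithmetic, as a rough sketch. The 4 lanes per op is my assumption (128-bit SSE, single precision); the function and constant names are made up for illustration.

```python
def peak_flops(cores, flops_per_cycle, clock_hz):
    """Peak throughput, assuming every core is busy every cycle."""
    return cores * flops_per_cycle * clock_hz


CLOCK_HZ = 4_000_000_000   # 4GHz, as in the post
FLOPS_PER_CYCLE = 8        # per core, as in the post
TARGET = 10**12            # 1 teraflop/s

# Cores needed to reach the target (integer ceiling division).
per_core = FLOPS_PER_CYCLE * CLOCK_HZ
cores_needed = -(-TARGET // per_core)
print(cores_needed)                                       # 32
print(peak_flops(cores_needed, FLOPS_PER_CYCLE, CLOCK_HZ))  # 1024000000000

# Assumption: one SSE op covers 4 single-precision lanes (128 bits).
SSE_LANES = 4
vector_ops_per_cycle = FLOPS_PER_CYCLE // SSE_LANES
print(vector_ops_per_cycle)                               # 2, i.e. dual issue
```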
Guesstimate: a Larrabee core is likely to:
1. Implement enough threads that simple round-robin scheduling of threads will cover most execution latencies.
2. Not spend any effort dealing with simultaneous partial register updates (serialise the updates).
3. Not implement any x87 gunk (trap and emulate).
4. Not implement any sophisticated branch prediction, instead using multiple threads to cover the latency of calculating the condition used in the branch (as in 1.).
5. Be absolutely streamlined for vector op execution (superfast SSEx, slow muls, shifts, divides, etc on regular registers).
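To illustrate points 1 and 4, here is a toy model of round-robin threading hiding latency. It is purely a sketch under my own assumptions: a single-issue core, strict round-robin (one thread considered per cycle), and each thread running a chain of dependent instructions with a fixed latency. None of the numbers come from Intel.

```python
def utilization(num_threads, latency, cycles=10_000):
    """Fraction of cycles in which the toy core issues an instruction.

    After a thread issues at cycle t, its next (dependent) instruction
    is not ready until cycle t + latency. Each cycle the core looks at
    exactly one thread, in strict round-robin order, and issues only if
    that thread is ready; otherwise the cycle is wasted.
    """
    ready_at = [0] * num_threads
    issued = 0
    for cycle in range(cycles):
        thread = cycle % num_threads      # strict round-robin pick
        if ready_at[thread] <= cycle:
            ready_at[thread] = cycle + latency
            issued += 1
    return issued / cycles


# Hypothetical 4-cycle latency: utilization climbs with thread count
# and saturates once there are as many threads as cycles of latency.
for n in (1, 2, 4, 8):
    print(n, utilization(n, latency=4))
```

With enough threads the core never has to wait on any one of them, which is why it could get away without clever scheduling or branch prediction.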
Cheers