In absolute transistor counts, sure, but not relative to the transistor budget. A Core i7 measures 263 mm² and has four cores, while the P5 measures 294 mm² and has one core. And I'm pretty sure that i7's branch predictors combined don't take up as much area as a single P5 branch predictor.
It might be close.
Going by rough math (730M transistors / 263 mm² for the i7 and 3.1M / 294 mm² for the P5), the i7's transistor density works out to over 260 times that of the P5.
The P5's predictor held 256 entries, and that was the extent of its dynamic branch prediction capabilities.
I've not seen solid numbers for the i7, but RWT postulated 256-512 entries for the first-level predictor and anywhere from 2K to 8K for the second level.
That does not include the possible indirect branch target buffers, the loop detectors, or the return stack buffer, which also carries renaming logic.
For just the second-level BTB: dividing the 8K upper bound by the ~263 density factor gives ~31, meaning an 8K-entry table in the i7's process takes about as much area as a 31-entry table would in the P5's process.
And 8K is 32 times larger than 256, the size of the P5's predictor.
The lower end of the estimate is 2K, which is 1/4 the upper bound.
Luckily, we have 4 cores. Going by the large 2nd-level predictor alone, it's possibly close to even with the P5's predictor, or possibly several times smaller at the low end of the estimate.
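The scaling math works out like this in a few lines (a back-of-envelope sketch; the 2K-8K entry counts are RWT's guesses, and I'm assuming entry sizes are comparable between the two designs):

```python
# Back-of-envelope BTB area comparison, using the figures from this thread:
# Core i7: ~730M transistors in 263 mm^2; P5: ~3.1M in 294 mm^2.
density_i7 = 730e6 / 263
density_p5 = 3.1e6 / 294
scale = density_i7 / density_p5        # ~263x density advantage for the i7

p5_entries = 256                       # the P5's entire dynamic predictor
for entries in (2048, 8192):           # RWT's guess for the 2nd-level table
    p5_equiv = entries / scale         # area, in "P5-process entry" units
    frac = p5_equiv / p5_entries       # fraction of the P5 predictor, per core
    print(f"{entries:5d} entries -> ~{p5_equiv:4.1f} P5-equivalent entries, "
          f"4 cores = {4 * frac:.2f} of one P5 predictor")
```

At the 8K upper bound, the four second-level tables together land at roughly half of one P5 predictor's area, before counting the first-level tables and the other structures.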
It doesn't seem likely that Nehalem would lag K8, which already has 2K entries.
It's the same with x86 decoders. They haven't really gotten all that much bigger. And while Core i7 has 16 of them, P5 had only 2 (and Larrabee should have 64).
I don't know of any numbers that would hint at the overall growth like there are for things like BTBs.
I do know that not everything involved in superscalar decoding scales linearly, especially not for variable-length ISAs.
K8, as an older example, has a pre-decode stage needed to resolve instruction lengths in its 16-byte instruction fetch, which involves 16 parallel predecoders for a 3-issue architecture.
In addition, every byte in the instruction cache is accompanied by 3 extra bits of stored pre-decode information, an increase of more than a third in the number of bits per cache line.
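The storage cost is easy to tally (a sketch assuming a 64-byte instruction cache line and the 3 predecode bits per byte described above):

```python
# Predecode storage overhead in a K8-style instruction cache:
# every instruction byte carries 3 extra bits of predecode information.
line_bytes = 64                        # assumed cache line size
data_bits = line_bytes * 8             # 512 bits of instruction data
predecode_bits = line_bytes * 3        # 192 bits of predecode metadata
overhead = predecode_bits / data_bits  # 0.375, i.e. 37.5% more bits per line
print(f"{predecode_bits} extra bits per {data_bits}-bit line "
      f"({overhead:.1%} overhead)")
```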
I'm not sure P5 needed to bother with this.
The i7 also carries the additional burden of generating the micro-ops that come out of its decoders, and all the extra bit-work needed for that.
This didn't happen until the PPro, so both P5 and Larrabee are probably slimmer as a result.
I'm not sure. On the CPU side, Hyper-Threading helps keep the pipelines full after a branch. And on the GPU side, not speculating anything means they have to keep all intermediate results in registers and make sure they have plenty of other work to do.
SMT isn't the same as speculation. Speculation is going past branches, possible exceptions, and unresolved memory addresses, where the chip must be able to reverse or discard calculations.
Can you clarify what the exact difference is?
Any register value that is stored in a program is going to be stored by either a non-speculative or speculative processor.
Do you mean that a speculative processor tends to spread this out in time, while a non-speculative one does it all at once?
For developers it has become increasingly difficult to avoid stalls when latencies are so high. So even when theoretical efficiency should be excellent, in practice performance can collapse due to dependencies. Speculation is not as bad as it sounds when you have a 99% hit rate and it makes your register file smaller and avoids stalls...
As far as speculation goes, the i7 has a 16-stage pipeline and is 4-wide. That means up to 64 instructions can be decoded and issued in sustained fashion (peak is higher) before a branch is resolved.
The quoted error rate for OoO chips is about 30%. That's 30% of execution wasted, and that wasted execution is work the power-saving logic is fooled into running at peak.
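The in-flight numbers work out like this (a rough model; the 30% figure is the waste rate quoted above, not a measured Nehalem value):

```python
# How much work can be speculative, and how much gets thrown away:
depth = 16              # i7 pipeline stages before a branch resolves (roughly)
width = 4               # instructions decoded/issued per cycle
window = depth * width  # up to 64 instructions in flight, sustained
print(f"up to {window} instructions issued before a branch resolves")

waste = 0.30            # quoted fraction of execution wasted on bad paths
useful = 1 - waste
print(f"for every {useful:.0%} of useful work, {waste:.0%} is discarded "
      f"-- and still burns full power")
```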
Larrabee is far milder. It might have at most 10 instructions in flight, and the error rate is probably lower because it stalls much more often.
Larrabee also confines branch prediction to the scalar x86 side. There are no vector branch predictors, for example.
Wide non-speculative processors have their own problems, particularly with wide vectors. There, the execution is over-determined, but it is usually much clearer to the silicon where it can save power.