How big was P4 trace cache?

msxyz

Newcomer
Just a historical curiosity... how big (in bits or bytes) was the Pentium 4 cache that contained the decoded x86 instructions? Official documents state it holds 12K micro-ops, with 8-way associativity.

As for the micro-op size, their format, and how many micro-ops were generated per x86 instruction, I don't think I've ever seen a definitive answer. Some sources say each micro-op was 100 bits long. That would mean the equivalent of 150 kilobytes for 12K micro-ops. Looking at this picture of a 180nm P4, however, the trace cache (left side) seems smaller than each half of the 256KB L2 cache.

www.tayloredge.com/museum/processor/2000_Pentium4.jpg
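
For reference, the size estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 100 bits per micro-op is the unconfirmed figure quoted in this post, not an official number, and the variable names are made up for illustration.

```python
# Back-of-the-envelope size of the P4 trace cache.
# ASSUMPTION: 100 bits per micro-op (an unconfirmed figure from "some sources").
UOPS = 12 * 1024        # "12K" micro-ops, per the official documents
BITS_PER_UOP = 100      # assumed, not an official number

total_bits = UOPS * BITS_PER_UOP
total_bytes = total_bits // 8
total_kib = total_bytes / 1024

print(f"{total_bits} bits = {total_bytes} bytes = {total_kib} KiB")
```

Under that assumption the storage comes out at 150 KiB, which does not include tags or any other per-line overhead.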

It's also not clear to me why the cache holds a number of instructions that is not a "nice" power of 2. I would understand it if the cache had 3- or 6-way associativity. Was part of the cache deactivated for the sake of improving yields?
 
The trace cache had to serve a number of purposes, with a likely constraint being the number of uops Intel found optimal to deliver, or found it was capable of delivering, in a cycle, which per Agner's document turned out to be three per cycle. The trace cache served as the high-throughput instruction fetch method, and it would physically be adapted to match what Intel needed along those lines.
The number of lines was a power of two, but the number of entries per line was not. The number of bits per entry was initially not a power of two, which might have been needed to keep the cache physically small enough with the process at the time.

Longer trace lines would suffer more from poorer utilization when traces failed to properly fit, and the cache needed to be fast enough to service instruction fetch for a high-clock processor. The goal was likely to create a fetch path that was fast, had decent storage utilization, had good fetch bandwidth with the software that would be running on it, and had as much capacity as they could deliver. It seems that what Intel found it could do wasn't a power of 2. Cutting off extra capacity just to make the numbers round out would be detrimental, whereas upping the capacity to make it the next power of two might have been costlier than the expected gain.
 
Maybe they used trits. 3^5 = 243 ~= 2^8 = 256. It's common in compression, used for more discrete coders than arithmetic coding.
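
To illustrate the arithmetic behind that suggestion: five base-3 digits fit in one byte because 3^5 = 243 is at most 256. The sketch below only shows that packing trick; nothing in this thread indicates the P4 actually stored anything in ternary, and `pack_trits` / `unpack_trits` are hypothetical helper names.

```python
def pack_trits(trits):
    """Pack exactly five trits (each 0, 1 or 2) into one value in 0..242."""
    if len(trits) != 5 or any(t not in (0, 1, 2) for t in trits):
        raise ValueError("expected five trits, each 0, 1 or 2")
    value = 0
    for t in trits:
        value = value * 3 + t   # treat the list as a base-3 number
    return value


def unpack_trits(value):
    """Inverse of pack_trits: recover the five trits from a packed value."""
    if not 0 <= value < 3 ** 5:
        raise ValueError("value out of range for five trits")
    trits = []
    for _ in range(5):
        value, t = divmod(value, 3)
        trits.append(t)
    return trits[::-1]          # digits were produced least significant first


packed = pack_trits([1, 0, 2, 1, 0])
print(packed, unpack_trits(packed))
```

The largest packed value is 242, so 13 of the 256 byte values go unused, which is the small waste implied by 243 ~= 256.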
 
Per Agner, there are 2K lines in the trace cache. So far, it's power of 2.
There are 6 entries per line, which is where we lose the nice math.
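
A quick sketch of that arithmetic, using only the two figures quoted above (2K lines, 6 entries per line); the helper `is_power_of_two` is just an illustrative name:

```python
LINES = 2048            # 2K lines, per Agner
ENTRIES_PER_LINE = 6    # entries (micro-op slots) per line, per Agner


def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


total_entries = LINES * ENTRIES_PER_LINE

print(total_entries, total_entries // 1024)   # total entries, and in units of K
print(is_power_of_two(LINES))                 # line count
print(is_power_of_two(ENTRIES_PER_LINE))      # entries per line
print(is_power_of_two(total_entries))         # overall capacity
```

The total is 12288 = 3 x 4096, so the factor of 3 in the entries per line is what keeps the "12K" capacity from being a power of 2.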
 
Thanks for the links. They answered many questions I had and more. I'm still amazed by the complexity of the design of the P4. In hindsight, if Intel wanted to go for higher frequencies above everything else, why not use a simpler design like the PPE? You'd lose some efficiency with older code optimized for other architectures, but the CPU would be smaller and run cooler. Are there any factors I'm neglecting?
 
The PPE was not a particularly good processor. Its main merit was that it was the sort of core that could be designed on a truncated timeline and a reduced budget, in order to fit a general-purpose core alongside Sony and Toshiba's eventually unsuccessful SPE philosophy.
 
Does anybody have a copy of Intel presentations held at the IDF 2000/2001 about the P4 architecture? I see they're mentioned more than once but all the links I've found (pointing to an Intel FTP) are now dead. :/
 