Intel resurrects the Pentium MMX for Larrabee

Modifying a Pentium MMX doesn't seem like a bad idea to me. Discounting the effects of instruction-set extensions like SSE, the Pentium III wasn't a whole lot faster per clock (maybe 20-30%?).

But per clock comparisons can be quite misleading.

The P-MMX (P55C) had a 6-stage pipeline vs. the PPro/P2/P3's 10 stages.

More importantly, it had a single-cycle, dual-ported data cache. The P3 had a single-ported data cache with a 3-cycle load-to-use latency. D$ access is likely to be critical-path defining. Load-to-use latency is very important for in-order CPUs because they can't schedule around data dependency hazards. That goes double for x86, where a significant fraction of instructions have memory operands.
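To make the cost concrete, here's a toy in-order pipeline model in Python. The instruction mix and latencies are invented for illustration; it is not a simulation of either chip.

```python
# Toy in-order pipeline model: count cycles for a dependent
# load -> add chain under different load-to-use latencies.
# Purely illustrative numbers, not a simulation of any real core.

def cycles(ops, load_to_use):
    """ops: list of ('load', dst) or ('add', dst, src) tuples.
    An in-order core stalls until a source register is ready."""
    ready = {}          # register -> cycle its value is available
    cycle = 0
    for op in ops:
        if op[0] == "load":
            _, dst = op
            ready[dst] = cycle + load_to_use
        else:
            _, dst, src = op
            cycle = max(cycle, ready.get(src, 0))  # stall on RAW hazard
            ready[dst] = cycle + 1
        cycle += 1          # issue one op per cycle
    return cycle

chain = [("load", "a"), ("add", "b", "a")] * 4
print(cycles(chain, load_to_use=1))  # 8: P55C-like, no stalls
print(cycles(chain, load_to_use=3))  # 16: P6-like, stalls on every use
```

With single-cycle load-to-use the dependent chain runs stall-free; at 3 cycles the very same code takes twice the cycles, and an in-order core has no way to hide that.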

Looking back, even at the same frequency the gap was significant:

P55c @ 233MHz
SpecInt95: 7.12
SpecFP95: 5.21

P-2 @ 233MHz
SpecInt95: 9.47
SpecFP95: 7.04

That's the P2 being 33% faster at SpecInt and 35% faster at SpecFP.

Of course, at the same time there were 300MHz versions of the P2 available, which scored 11.7 and 8.15 in SpecInt/FP. So on even terms (same availability date, same process) the P2 was 64% and 56% faster than the P-MMX.
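Quick sanity check of those percentages against the SPEC95 scores quoted above:

```python
# Speedup figures derived from the SPEC95 numbers in this post.
p55c_int, p55c_fp = 7.12, 5.21
p2_233_int, p2_233_fp = 9.47, 7.04
p2_300_int, p2_300_fp = 11.7, 8.15

print(f"P2/233 vs P55C: int {p2_233_int/p55c_int - 1:+.0%}, fp {p2_233_fp/p55c_fp - 1:+.0%}")
print(f"P2/300 vs P55C: int {p2_300_int/p55c_int - 1:+.0%}, fp {p2_300_fp/p55c_fp - 1:+.0%}")
# -> +33% / +35% at equal clock, +64% / +56% at equal date
```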

Regarding Larrabee:

For Intel to reach 1 teraflop/s with a 4GHz device, they'll need 32 cores each doing 8 floating point ops per cycle. If we assume they will use the existing SSE/2/3/4 instruction set, that dictates a dual-issue core. IMO, that's the only resemblance it will have with the original Pentiums.
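The arithmetic behind that guess, spelled out (the per-cycle figure assumes 128-bit SSE with a mul and an add issued each cycle, which is my reading of the dual-issue assumption above):

```python
# TFLOP/s budget: cores * flops_per_cycle * clock.
cores = 32
flops_per_cycle = 8      # assumed: 128-bit SSE mul + add per cycle (4+4 SP flops)
clock_hz = 4e9
print(cores * flops_per_cycle * clock_hz / 1e12, "TFLOP/s")  # -> 1.024
```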

Guesstimate: a Larrabee core is likely to:
1. Implement enough threads that simple round-robin scheduling of threads will cover most execution latencies.
2. Not spend any effort dealing with simultaneous partial register updates (serialise the updates).
3. Not implement any x87 gunk (trap and emulate).
4. Not implement any sophisticated branch prediction, instead use multiple threads to cover the latency of calculating the condition used in the branch (like 1.).
5. Be absolutely streamlined for vector op execution (superfast SSEx, slow muls, shifts, divides, etc on regular registers).

Cheers
 
Pentium III wasn't a whole lot faster per clock (maybe 20-30%?)

A factor some might be missing is the difference in memory latency at the time of the Pentium MMX, the P3, and even now with the C2D. The P5 core would likely see horrible performance at a couple of GHz with modern memory when compared to the P6 derivatives. Well, that is, without threading.
 
Regarding Larrabee:

For Intel to reach 1 teraflop/s with a 4GHz device, they'll need 32 cores each doing 8 floating point ops per cycle. If we assume they will use the existing SSE/2/3/4 instruction set, that dictates a dual-issue core. IMO, that's the only resemblance it will have with the original Pentiums.

Some Intel slides peg Larrabee at release to be 2.5 GHz max.
The vector unit is also supposedly 512 bits wide, which means each extended SSE operation is capable of 16 single-precision ops.
If Larrabee is capable of dual-issue vector ops, and this seems like a likely possibility for it to do well in code that is heavily dependent on MADDs, then that throughput is doubled.
At single precision, Larrabee may be four times as wide as you say, and twice as wide at double precision.
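Putting the slide-based assumptions (512-bit vectors, possible dual issue) next to the earlier estimate:

```python
# Per-core throughput under the 512-bit-vector assumption.
# 'issue = 2' is the speculative dual-issue case for MADD-heavy code.
vector_bits = 512
sp_lanes = vector_bits // 32          # 16 single-precision lanes
dp_lanes = vector_bits // 64          # 8 double-precision lanes
issue = 2                             # assumed dual issue of vector ops
print(sp_lanes * issue)               # 32 SP flops/cycle: 4x the earlier 8
print(dp_lanes * issue)               # 16 DP flops/cycle: 2x
```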

Guesstimate: a Larrabee core is likely to:
1. Implement enough threads that simple round-robin scheduling of threads will cover most execution latencies.
Perhaps, though I wonder if a Niagara-type scheme that more proactively demotes threads that hit long-latency events would be useful.

3. Not implement any x87 gunk (trap and emulate).
Perhaps, perhaps not.
The Larrabee core is meant for a broader market than just graphics, and some Intel slides show it operating alone, without a host CPU. The expectations for Larrabee's ability to function as a pure CPU seem to be higher than trapping instructions.

The Intel slides also showed a throughput of 2 non-SSE floating point ops per cycle. I'm not sure a chip that relies on trapping x87 can maintain that kind of throughput.

The cost of such legacy hardware at this point is likely small on such a simple core.
 
Some Intel slides peg Larrabee at release to be 2.5 GHz max.
The vector unit is also supposedly 512 bits wide, which means each extended SSE operation is capable of 16 single-precision ops.
If Larrabee is capable of dual-issue vector ops, and this seems like a likely possibility for it to do well in code that is heavily dependent on MADDs, then that throughput is doubled.
At single precision, Larrabee may be four times as wide as you say, and twice as wide at double precision.

Yeah, just read the slides on it.

Perhaps, though I wonder if a Niagara-type scheme that more proactively demotes threads that hit long-latency events would be useful.

Sorry, I could've been clearer: I was thinking of round-robin scheduling of the threads that haven't stalled. Long-latency events are typically cache misses of some sort; by lowering the priority of a thread with many cache misses, your long-latency events become even longer-latency events (i.e. you want the memory ops to start as soon as possible).
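A minimal sketch of that policy, just to illustrate: a stalled thread keeps its miss in flight and simply isn't picked until the data comes back. All thread states here are invented.

```python
# Round-robin issue over the threads that haven't stalled.

def pick(threads, last):
    """threads: list of ready flags. Return the next ready thread after
    `last` in round-robin order, or None if all are stalled."""
    n = len(threads)
    for i in range(1, n + 1):
        t = (last + i) % n
        if threads[t]:
            return t
    return None

ready = [True, False, True, True]   # thread 1 is waiting on a cache miss
order = []
last = -1
for _ in range(4):
    last = pick(ready, last)
    order.append(last)
print(order)  # thread 1 is skipped, not demoted: [0, 2, 3, 0]
```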

The expectations for Larrabee's ability to function as a pure CPU seem to be higher than trapping instructions.

The Intel slides also showed a throughput of 2 non-SSE floating point ops per cycle. I'm not sure a chip that relies on trapping x87 can maintain that kind of throughput.

The cost of such legacy hardware at this point is likely small on such a simple core.

Hmm, it says two scalar ops per cycle, not specifically 2 x87 ops. There are scalar ops in SSE2. Two scalar SSE2 ops per cycle would jibe with your guess above that it can issue mul and add instructions every cycle.

Implementing x87 isn't that simple. You'd need to support 80-bit FP in hardware. And being in-order, you'd need explicit renaming of some sort to get around the top-of-stack bottleneck of the braindead x87 stack architecture (reintroducing FXCHs? :( )

Cheers
 
Hmm, it says two scalar ops per cycle, not specifically 2 x87 ops. There are scalar ops in SSE2. Two scalar SSE2 ops per cycle would jibe with your guess above that it can issue mul and add instructions every cycle.
The slide I see that compares Larrabee and Gesher specifically says non-SSE.

Implementing x87 isn't that simple. You'd need to support 80-bit FP in hardware. And being in-order, you'd need explicit renaming of some sort to get around the top-of-stack bottleneck of the braindead x87 stack architecture (reintroducing FXCHs? :( )
Cheers

FXCH has been present all along.
The first Pentium could do it (not free, but present), and by extension, the MMX could as well.

I doubt x87 will be driven to high levels of performance, but trapping to microcode or worse software would significantly impact throughput.
Going to microcode blocks the decoders, which would seem to rule out a 2 FP (non-SSE) per clock throughput.

I also don't see how x87 should be difficult for Intel to implement. It's already been done for over a decade.
 
FXCH has been present all along.
The first Pentium could do it (not free, but present), and by extension, the MMX could as well.

I know. FXCHs are necessary on Pentiums (being in-order) to achieve good performance, because otherwise the FPU stalls on RAW hazards on the TOS register. On OoO CPUs FXCH is essentially a nop because the TOS is just renamed by the ordinary register-renaming apparatus.

I don't even know how you would go about issuing two x87 stack arithmetic operations per cycle without register renaming to get around the TOS bottleneck. Issuing two arithmetic (mul+add) ops and two FXCHs every cycle seems highly unlikely.
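A toy model of why the TOS bottleneck blocks dual issue without renaming (illustrative only, not a model of any real core):

```python
# Every x87 arithmetic op implicitly targets ST(0), so on a machine
# without renaming, two independent ops still collide on the same
# architectural register and can't pair.

def dual_issue_cycles(ops, renamed):
    """ops: list of destination registers. Pair two ops per cycle
    only if they write different (physical) registers."""
    cycles = 0
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and (renamed or ops[i] != ops[i + 1]):
            i += 2          # the pair issues together
        else:
            i += 1          # collision on ST(0): issue alone
        cycles += 1
    return cycles

# 8 independent mul/add results, all architecturally written to ST(0):
stream = ["st0"] * 8
print(dual_issue_cycles(stream, renamed=False))  # 8: fully serialized
print(dual_issue_cycles(stream, renamed=True))   # 4: renaming restores pairing
```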

On top of that, your multiplier array needs to support the 64-bit mantissa of extended precision instead of the 53 bits of doubles.

I just don't see Intel spending that kind of effort to support legacy applications in what is very clearly a new architecture focused on throughput computing; legacy single-threaded apps will suck on this thing regardless.

I doubt x87 will be driven to high levels of performance, but trapping to microcode or worse software would significantly impact throughput.
Going to microcode blocks the decoders, which would seem to rule out a 2 FP (non-SSE) per clock throughput.

I see two modes of operation: one where you just don't do extended math and instead do the math with 64-bit precision (remember, extended-precision registers are stored as 64-bit doubles), and another where you trap to a handler and do the whole nine yards on the precision/quirks of x87.
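The difference between the two modes is observable from plain doubles: a 53-bit mantissa can't represent 2^53 + 1, which the 80-bit format's 64-bit mantissa holds exactly.

```python
# Why "just do the math in doubles" is not bit-identical to real
# x87 extended precision. Python floats are 64-bit IEEE doubles.
exact = 2**53 + 1
as_double = float(exact)
print(as_double == 2**53)          # True: the +1 is rounded away
print(exact - int(as_double))      # 1: one unit of information lost
```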

Cheers
 
I don't even know how you would go about issuing two x87 stack arithmetic operations per cycle without register renaming to get around the TOS bottleneck. Issuing two arithmetic (mul+add) ops and two FXCHs every cycle seems highly unlikely.
Speaking right out of turn here, but why not gang two Larrabee cores together? Literally dupe the instruction streams and provide some behind the scenes tagging/juggling so that the two cores cooperate.

Jawed
 
Speaking right out of turn here, but why not gang two Larrabee cores together? Literally dupe the instruction streams and provide some behind the scenes tagging/juggling so that the two cores cooperate.

Jawed

Something tells me that it would take way more effort to get two cores to work together than to just add Tomasulo-style OoO with only a few reservation stations.
 
I know. FXCHs are necessary on Pentiums (being in-order) to achieve good performance, because otherwise the FPU stalls on RAW hazards on the TOS register. On OoO CPUs FXCH is essentially a nop because the TOS is just renamed by the ordinary register-renaming apparatus.
Only in some OOO x86s was FXCH "free".
The P4, for example, had a cycle penalty associated with FXCH.

I don't even know how you would go about issuing two x87 stack arithmetic operations per cycle without register renaming to get around the TOS bottleneck. Issuing two arithmetic (mul+add) ops and two FXCHs every cycle seems highly unlikely.
In-order does not rule out some register renaming. Some Intel clone processors used register renaming in some limited circumstances.

Other in-order chips can use forms of register renaming.

I'm still going by an Intel slide that says 2 DP non-SSE ops per cycle.
It seems that Intel currently intends to try something along those lines.
I find the possibility of a 3rd FP instruction set in x86 less likely.

I just don't see Intel spending that kind of effort to support legacy applications in what is very clearly a new architecture focused on throughput computing; legacy single-threaded apps will suck on this thing regardless.
That would hurt the draw of x86 compatibility, which Intel has trumpeted as an advantage.

I see two modes of operation: one where you just don't do extended math and instead do the math with 64-bit precision (remember, extended-precision registers are stored as 64-bit doubles), and another where you trap to a handler and do the whole nine yards on the precision/quirks of x87.

Cheers

That would mean a third FP mode in x86. I would consider that to be unlikely.
 
In-order does not rule out some register renaming. Some Intel clone processors used register renaming in some limited circumstances.

Other in-order chips can use forms of register renaming.

I'd like to hear more about non-OoO register renaming, because in the context I was taught, renaming and OoO go pretty much hand in hand.

But all I can come up with off the top of my head is a checkpoint-based shadow-register system, or any number of variations on it, whether that involves hardware loop unrolling or just recovering quickly from a branch mispredict. Another in-order renaming technique I think could be useful would be renaming when pushing and popping register values temporarily "to memory".

Though I'd like to learn about other methods/reasons for doing renaming on an in-order processor, along with what processors did that.
 
The Cyrix M1 used register renaming. It allowed out-of-order completion, but was in-order issue.

The PowerPC 603 and other early POWER chips also had limited forms of renaming.
These seemed to restrict renaming to a few instructions and pointer registers.

The PowerPC 604 was the design that was OoO.

A software controlled example would be Itanium, which compresses unrolled loops by using instructions that control register renaming.

The registers past the first 32 are accessed through an explicit form of renaming.
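Roughly, the rotation is just a moving base added to the register number. A hypothetical sketch (on Itanium the rotating register base decrements on each software-pipelined loop branch, so a value written to r32 in one iteration is read back as r33 in the next):

```python
# Sketch of Itanium-style register rotation. Sizes match the GR file
# (r32..r127 can rotate), but the mapping is simplified for illustration.

NUM_ROTATING = 96      # rotating general registers r32..r127

def physical(arch, rrb):
    """Map an architectural register number to a physical slot
    given the current rotating register base (RRB)."""
    return (arch + rrb) % NUM_ROTATING

# Iteration 0 (rrb = 0): a value produced into r32 lands in slot 32.
# Iteration 1 (rrb = -1 after rotation): reading r33 finds that slot.
print(physical(32, 0))    # 32
print(physical(33, -1))   # 32: same physical register, new name
```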
 