Larrabee: 16 Cores, 2GHz, 150W, and more...

The 68000 instruction set is much closer to x86 than to ARM: 68000 is cleaner than x86 in that it doesn't have such strong limitations wrt register usage, but that part can be avoided with register renaming, which is not all that hard.

Hehe, great! I managed to stir up a bit of 68k vs x86 debate in a 2007 thread about a GPGPU :)

Yeah, I know the mythical '68090' (or whatever) that powers Macs in the "Apple skipped PPC" parallel universe wouldn't be much less convoluted than x86; prettier assembler doesn't make it a RISC machine.

One thought - does the combination of many cores/threads and small GPGPU programs allow the 'CISC-to-RISC JIT hardware' to be reused, i.e. caching the uop generation? (Is that sounding too 'Crusoe'?)

I.e., share the result of decoding one piece of program code across many cores/threads, decreasing its impact on the system (and decreasing my fear of the satanic ISA...).

I suppose it will all be about their extensions, and they may not need anything so exotic.
 
There's no indication that Larrabee's cores have front ends that run off anything but raw x86, and they've been described as fully independent.

It also looks like the instruction caches are private, so there's no disclosed mechanism by which a JIT would be able to broadcast decoded instructions.

Uops are also very large; they do more than just indicate operation and operands.
They are not encoded very densely, and they contain execution state and exception status, something that would be bloated and useless if shared across cores.
 
Uops are also very large; they do more than just indicate operation and operands.
They are not encoded very densely, and they contain execution state and exception status, something that would be bloated and useless if shared across cores.
That depends on how fine-grained they would want to go. VLIW in its purest form.

But I agree that it's unlikely they'll go that far.
 
Intel might not be that adventurous, though.

Isn't using x86 for a GPU already adventurous enough? Tim Sweeney will probably love it.

An interesting side effect could be that this is the first GPU that would be upgraded to a new DX version with only a driver.
 
All the current x86 CPUs aren't really using the x86 ISA internally. They use a RISC CPU with an x86 JIT compiler on top.

No, they absolutely do not. Most instructions are mapped one-to-one from the x86 ISA to the micro-architecture.

Core and Core 2 CPUs crack instructions that have a memory operand into more uops. That's to try to advance the load to save execution latency. The penalty is that the x86 instruction takes up two (or three for a mem,reg op?) slots in the scheduler.

Athlons don't even do that. They map x86 instructions with a memory operand into a macro-op, which only takes up one slot in the scheduler. These macro-ops are executed as a whole in the integer execution units. The advantage of this approach is better scheduler density (not trying to extract parallelism from the inherently sequential nature of the sub-operations of a macro-op). The downside is that the load can't be started until the register operand of the macro-op is available, which means loads can't be advanced the way they are in Core/Core 2 CPUs.
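To make the difference concrete, here's a small, hypothetical C example (the function and variable names are made up, and the exact instruction a compiler emits will vary):

/* Summing an array: compilers typically turn the loop body into an x86
 * read-modify instruction with a memory source operand, something along
 * the lines of "add eax, [esi]" (registers and addressing are assumptions).
 *
 * Core/Core 2: that one instruction is cracked into a load uop plus an
 * add uop, so the load can issue as soon as the address is known.
 * K7/K8 Athlon: it stays a single macro-op in the scheduler and is
 * executed as a whole by an integer unit, as described above.
 */
int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i];      /* add with a memory source operand */
    return s;
}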

Only the really CISCy instructions are executed from a microcode ROM/state machine, and that's hardly a JIT compiler.

The 68000 and ARM are/were neat because they don't need all that to do well.

Huh? The 68K ISA was revolutionary because it presented a 32-bit CPU (and a 24-bit address space) to the programmer, which was heaps easier to work with than the overlapping segments of the 8086.

However, Motorola completely b0rked the ISA with the 68020, because every instruction could be extended with advanced memory operand modes, and these extensions to an instruction could themselves be extended, which resulted in serial decoding of each instruction, making a superscalar implementation almost unfeasible.

It pains me to say this (being an old 68K assembly Amiga coder), but the 68K was late, power hungry (more transistors, bigger die) and, after 1982, lower performing than all its x86 implementation counterparts.

ARM is made for a completely different market segment but carries its own baggage that makes fast implementations tricky.

Cheers
 
However, Motorola completely b0rked the ISA with the 68020, because every instruction could be extended with advanced memory operand modes, and these extensions to an instruction could themselves be extended, which resulted in serial decoding of each instruction, making a superscalar implementation almost unfeasible.

It pains me to say this (being an old 68K assembly Amiga coder), but the 68K was late, power hungry (more transistors, bigger die) and, after 1982, lower performing than all its x86 implementation counterparts.
Agreed, but with those crazy mem access modes you could fill a triangle and texture map it (point filtering) with a single asm instruction per pixel; it was so cooooool ;)
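For anyone who never wrote that kind of inner loop, here's a rough C sketch of a point-filtered textured span (the names and the 16.16 fixed-point layout are made up); the claim above is that the 68020's scaled-index and memory-indirect addressing modes let the texel fetch and pixel store in the loop body collapse into a single instruction:

/* Point-sampled texture mapping of one horizontal span, illustrative only.
 * u/v step across the texture in 16.16 fixed point; each pixel is a single
 * texel fetch plus a store.
 */
void textured_span(unsigned char *dst, int count,
                   const unsigned char *texture, int tex_pitch,
                   long u, long v, long du, long dv)
{
    for (int x = 0; x < count; ++x) {
        dst[x] = texture[(v >> 16) * tex_pitch + (u >> 16)];
        u += du;
        v += dv;
    }
}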

Marco
 
Isn't using x86 for a GPU already adventurous enough? Tim Sweeney will probably love it.
Adventurous from the point of view of fiddling with the x86 ISA.
These extensions may be a straightforward widening of the SSE registers and some additional mask and compare instructions, or they could try to do something more drastic.

An interesting side effect could be that this is the first GPU that would be upgraded to a new DX version with only a driver.
Not any more than how the shader cores in G80 and R600 could be.

The fixed-function block that Intel has left pretty much blank would not upgrade to match the DX version.
 
However, Motorola completely b0rked the ISA with the 68020
Heh, I guess there must have been many good reasons why they were so keen to move on to PPC.
Rose-tinted spectacles. I think I'm just permanently psychologically scarred by the moment when I realised that to continue graphics coding I would have to lose 8 registers :)

I suppose it's also ironic that I've just named a processor that pretty much resurrects the concept of "Memory Segmentation" as my "favourite piece of silicon" :) I was commenting to one of my colleagues a while back that we needed near & far keywords in the compiler.
 
It supposedly goes as far as implementing specialized control and branch instructions.
Without sophisticated branch prediction you pretty much need those. Maybe finally after decades we will be able to use loops on x86 without knowing for a fact the predictor will get every branch wrong at least once ... goodbye loop unrolling?
 
Unrolling is a little more involved than just register renaming.

I don't see how the concepts are related at all: loop unrolling is a software technique, OOO is a hardware technique. That said, OOO can be used to implement loop unrolling, and that requires rename registers, but it's not the same thing.

Loop unrolling is a technique which has a number of benefits. It reduces the number of instructions used, reduces the number of branches and makes better use of the CPU's pipeline. It can boost performance by a hefty amount - the first time I ever used the technique it boosted performance by 5X. You can also use the technique on things like memory moves and the performance boost can also be big.
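To make that concrete, here's a minimal sketch in C of the memory-move case (the function names are made up and the unroll factor of four is arbitrary):

/* A plain byte copy: one load, one store, one increment and one
 * compare-and-branch per byte copied.
 */
void copy_simple(unsigned char *dst, const unsigned char *src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* The same copy unrolled by four: the loop overhead (increment, compare,
 * branch) is paid once per four bytes instead of once per byte, and the
 * four independent copies give the pipeline more to work on. The leftover
 * 0-3 bytes are handled by a short epilogue loop.
 */
void copy_unrolled(unsigned char *dst, const unsigned char *src, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; ++i)          /* epilogue for the remainder */
        dst[i] = src[i];
}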
 
Loop unrolling is a technique which has a number of benefits. It reduces the number of instructions used, reduces the number of branches and makes better use of the CPU's pipeline. It can boost performance by a hefty amount - the first time I ever used the technique it boosted performance by 5X. You can also use the technique on things like memory moves and the performance boost can also be big.

Loop unrolling does not reduce the number of instructions used.

The increase in code size is one of the big drawbacks to loop unrolling, since it takes up instruction cache space.
 
The increase in code size is one of the big drawbacks to loop unrolling, since it takes up instruction cache space.
And memory bandwidth.

Then again, the unrolled loop doesn't mispredict. And, as most loops are simple counters, it would be best and simplest to have a form of branch prediction that lets you specify the iteration count.
 
Loop unrolling does not reduce the number of instructions used.

I think ADEX meant the number of instructions scheduled is reduced, because of the reduced loop overhead.

The increase in code size is one of the big drawbacks to loop unrolling, since it takes up instruction cache space.

Not only does the unrolled loop take up n times more instructions, but you usually have a preamble to preload the first data needed, and an epilogue to finalize the loop (storing results without prefetching more data). The bloat can be substantial.
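A rough C sketch of that shape (not unrolled, to keep it short, and with made-up names): the preamble issues the first load, the steady-state body works on one element while fetching the next, and the epilogue finishes the last element without loading past the end of the array:

/* Software-pipelined scale-by-k loop, illustrative only. */
void scale_pipelined(float *dst, const float *src, int n, float k)
{
    if (n <= 0)
        return;

    float cur = src[0];              /* preamble: preload the first element */
    int i;
    for (i = 0; i < n - 1; ++i) {
        float next = src[i + 1];     /* fetch ahead for the next iteration  */
        dst[i] = cur * k;            /* work on the element loaded earlier  */
        cur = next;
    }
    dst[i] = cur * k;                /* epilogue: no further prefetching    */
}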

One of the few redeeming features of IPF is that you can normally collapse the preamble, unrolled loop and epilogue into a compact loop body by using predicated instructions and the rotating integer register stack.

Loop unrolling really is a clumsy way of avoiding data-dependency stalls.

Cheers
 
Without sophisticated branch prediction you pretty much need those. Maybe finally after decades we will be able to use loops on x86 without knowing for a fact the predictor will get every branch wrong at least once ... goodbye loop unrolling?

Core2 and Pentium M already have forms of loop detection.
Conroe goes as far as having a loop cache that is some number of instructions in size (64 I think).

It was rumored at one point Barcelona would have something similar, but it's not in the documentation.
 
If it were limited to one core, it seems possible to run, though I don't think it would be an option anyone would like.

The top clock speed in the presentation was 2.5 GHz.

There are a few unknowns.

The threading used wasn't listed. If SMT, then it's possible for one thread to use the core most of the time.
If fine-grained, then a 4-threaded core at 2.5 GHz is going to look like it's 625 MHz to a single thread.
OSs are more threaded than most desktop applications, but it would have an impact.

The integer width was not disclosed. It was a ??? in the slide.
If the minimum FP width is 2 ops, then it may be that the minimum integer width is two ops.
It may look like a Pentium (pre-Pro).

The next question is how much branch prediction there would be.
That wasn't disclosed, and it is unlikely it would be anywhere near the huge predictors for P4 and Core2.
Having a branch predictor as limited as the pre-Pro Pentium might be possible, and Intel's more generalized vision for Larrabee may mean it will have some prediction, even though such speculation is usually a waste for a GPU.

Best case, we see it perform like a 2.5 GHz Pentium MMX + SSE.
Not so good, maybe something like half that.
Worse, we see it perform like a ~800 MHz Pentium MMX with SSE.
Worst (unlikely?) case, we might see a single core perform like a ~800 MHz Pentium MMX with no branch prediction, which would cut performance significantly.

I don't think it would be acceptable for most users, but it could chug along fine with no other programs running.

The spooky part about this is that a number of plausible high-level design decisions would make it look a lot like a Pentium with MMX and SSE multiplied 16 to 32 times.

edit: And there's even a thread in the Hardware forum about that, which would have been nice had I seen it before all this writing.
 