Larrabee: 16 Cores, 2GHz, 150W, and more...

Is there even a need to point out that a 49.5mm x 49.5mm die would be the most insane thing ever?

As said elsewhere, that's probably the package, though that itself implies a very large die.

However, compared to some chips, 49mm x 49mm is *small*. There are chips so big they can only make one per 200mm wafer. They're used in things like medical imaging or big telescopes; one has a 120 megapixel image sensor!
 
It doesn't seem a stretch to assume that a 16-core variant would be a chip with another row of cores disabled.

I'm curious about the arrangement shown, where one row of cores is right next to the cache and the other row is more distant.
The other slides didn't have that detail.
 
Larrabee...blech...don't say that word... The people running that project in Germany are a bunch of lying dipshits... at least the ones I was working with....
 
Larrabee will be used for gamer-cards:
[image: vcgrb6.png] (source)
 
So I gather you're ad-blocking on our site then? :devilish: As those ads have been here for MONTHS.
 
Now I know which "selected web-sites" INQ is talking about. X-D
I don't think ours are operated via Google Ads; I could be wrong, though, since I'm not seeing them right now due to region restrictions...
 
Back to the question of ISA, registers, in-orderness

Looking back through microprocessor history, isn't it the case that pre-OoOE the load/store machines had a bigger advantage through their large register file ISAs?

Though I suppose the SIMD enhancements and hyperthreading may shrink this factor...

I suppose I've got a deep-rooted "religious hatred" of x86 :) (yes, I grew up on 68K assembler). Maybe it's unfounded now, what with x86 having 16 GP registers these days.

Cell is my favourite piece of silicon these days, with explicit cache control and unified SIMD/GP regs being its selling points. But I can just see the wheels in motion; it's going to be destroyed by an x86 derivative, isn't it...

(Heh, for all I know, maybe Intel will overlap the SIMD/GP registers in Larrabee and add more. Then I'll have no reason to complain, will I :) )
 
Back to the question of ISA, registers, in-orderness
Looking back through microprocessor history, isn't it the case that pre-OoOE the load/store machines had a bigger advantage through their large register file ISAs?

Yes, those extra registers would be used for unrolling your code to avoid false dependencies, which are completely removed by the register renaming apparatus of an OOO CPU.

Think of unrolling as an explicit (and clumsy) form of register renaming.
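
A minimal C sketch of that idea (my own illustration, nothing from Intel or the posters above): a reduction with a single accumulator serializes on that one register, while unrolling with several accumulators "renames" it by hand, which is exactly the false-dependency breaking an OoO renamer does automatically:

```c
/* Rolled: every iteration depends on the previous value of `sum`,
   so an in-order core without renaming stalls on that chain. */
double dot_rolled(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4 with 4 accumulators: four independent chains,
   which is what renaming hardware would have created for us.
   (Note: this reassociates the FP adds, so the last bits can differ.) */
double dot_unrolled(const double *a, const double *b, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```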

No real reason to hate x86 more than 68K. Yeah, from an assembly programmer's POV the 68K was really neat, but from a CPU-architecture POV it's an absolute disaster: the instruction extensions introduced with the 68020 are mind-bendingly hard to handle in a superscalar implementation, and the 68060 was years later than the Pentium, had more bugs, and ran a lot slower.

Cheers
 
Unrolling is a little more involved than just register renaming.

It's also a limited form of branch prediction, prefetch hint, and a power-saving strategy for running loops.

Doing even limited loop unrolling fully in hardware requires larger branch predictors, renaming and scheduling logic, and extra retirement logic.

On the other hand, for tight long-running loops, keeping the unrolling in hardware can decrease code size.

Even OoO processors benefit from loop unrolling, though it has to be balanced against filling up the instruction cache and register pressure.
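
That balancing act is usually the compiler's job rather than the programmer's; e.g. recent GCC takes a per-loop hint (the pragma below is GCC-specific and only a hint, an assumption worth checking against your toolchain):

```c
/* A hint, not a guarantee: the compiler still weighs the unroll factor
   against icache footprint and register pressure for the target. */
double sum_array(const double *a, int n) {
    double s = 0.0;
#pragma GCC unroll 4
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```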
 
Yes, those extra registers would be used for unrolling your code to avoid false dependencies, which are completely removed by the register renaming apparatus of an OOO CPU.

Think of unrolling as an explicit (and clumsy) form of register renaming.

No real reason to hate x86 more than 68K. Yeah, from an assembly programmer's POV the 68K was really neat, but from a CPU-architecture POV it's an absolute disaster: the instruction extensions introduced with the 68020 are mind-bendingly hard to handle in a superscalar implementation, and the 68060 was years later than the Pentium, had more bugs, and ran a lot slower.

Cheers
None of the current x86 CPUs really use the x86 ISA internally: they use a RISC core with an x86 JIT compiler on top. The 68000 and ARM are/were neat because they don't need all that to do well.
 
The 68000 and ARM are/were neat because they don't need all that to do well.

The 68000 instruction set is much closer to x86 than to ARM: variable-length instructions and the ability to use complex destinations for calculations instead of just registers.

This is what makes it hard to design an efficient CPU for it. ARM has none of that.

A high-end 68000 CPU would basically need tricks similar to those used in current x86 CPUs.

The 68000 is cleaner than x86 in that it doesn't have such strong limitations wrt register usage, but that part can be worked around with register renaming, which is not all that hard.
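
To make "not all that hard" concrete, here's a toy rename-map sketch in C (my own illustration; a real renamer also recycles physical registers at retirement, checkpoints the map across branches, etc.):

```c
#include <assert.h>

enum { ARCH_REGS = 8, PHYS_REGS = 32 };    /* toy sizes */

static int map[ARCH_REGS];                 /* architectural -> physical  */
static int free_list[PHYS_REGS];
static int n_free;

/* Call once at reset: identity-map each architectural register and
   put the remaining physical registers on the free list. */
static void rename_init(void) {
    for (int a = 0; a < ARCH_REGS; a++)
        map[a] = a;
    n_free = 0;
    for (int p = ARCH_REGS; p < PHYS_REGS; p++)
        free_list[n_free++] = p;
}

/* A write to an architectural register gets a fresh physical register,
   so back-to-back reuses of the same name no longer falsely depend. */
static int rename_dest(int arch) {
    assert(n_free > 0);                    /* real HW would stall instead */
    int phys = free_list[--n_free];
    map[arch] = phys;                      /* later readers see this copy */
    return phys;
}

/* A read just looks up the current mapping. */
static int rename_src(int arch) {
    return map[arch];
}
```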
 
It's obvious Larrabee won't be a full-fledged Core2, so let's speculate on what it will and won't have.

We already know, I think, that it will be in-order.

Branch prediction: Larrabee's in many-core territory, so speculative execution of any sort is going to raise eyebrows.
It seems like a small sacrifice in this case to dump or severely restrict it.
Separate predictors impose a pretty hefty storage penalty, though basic prediction can simply be a 2-bit saturating counter stored alongside a cache line.
It would seem workable, but might be pointless for target workloads that are so rich in non-speculative work.
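
For reference, that 2-bit saturating counter is about as small as prediction state gets; a sketch in C:

```c
#include <stdbool.h>
#include <stdint.h>

/* Classic 2-bit saturating counter: 0,1 = predict not-taken,
   2,3 = predict taken. Two bits per branch is all the storage
   needed, which is why it could live alongside a cache line. */
typedef uint8_t ctr2;   /* holds 0..3 */

static bool predict(ctr2 c) {
    return c >= 2;
}

static ctr2 update(ctr2 c, bool taken) {
    if (taken) return c < 3 ? c + 1 : 3;   /* saturate at 3 */
    else       return c > 0 ? c - 1 : 0;   /* saturate at 0 */
}
```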

Superscalar execution: Larrabee's vector processor is 512 bits wide, which is too narrow to support the DP flop count per cycle unless there are at least two FP computation pipes: a 512-bit register holds only eight doubles, so a single pipe tops out at 8 DP flops per cycle (16 counting a multiply-add as two).
That might mean each core is at least 2-wide superscalar.
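
Back-of-the-envelope with the 16 cores / 2GHz from the thread title; the FMA and pipe counts below are my own assumptions, purely to show the arithmetic:

```c
#include <stdio.h>

int main(void) {
    const double cores    = 16;         /* per the thread title          */
    const double ghz      = 2.0;        /* per the thread title          */
    const double dp_lanes = 512 / 64;   /* 8 doubles per 512-bit register */
    const double fma      = 2;          /* assume multiply-add = 2 flops  */

    for (int pipes = 1; pipes <= 2; pipes++)
        printf("%d pipe(s): %.0f DP GFLOPS peak\n",
               pipes, cores * ghz * dp_lanes * fma * pipes);
    /* 1 pipe: 512 DP GFLOPS; 2 pipes: 1024. Whether either matches the
       intended target is exactly the open question above. */
    return 0;
}
```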

Hardware data prefetching: Sounds iffy with many-core. I'd expect it to be software-only (with x86 compatibility, it would have to support what's already there), perhaps augmented by the rumored new cache and data control instructions that should be coming along.
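
For what "what's already there" looks like: the SSE prefetch hint, exposed in C as `_mm_prefetch` from `<xmmintrin.h>` (the distance below is a made-up tuning value):

```c
#include <xmmintrin.h>

/* Sum with an explicit software prefetch running ahead of the loads.
   PF_DIST is hypothetical; real code would tune it per target. */
#define PF_DIST 64   /* elements ahead = 512 bytes = 8 cache lines */

double sum_prefetched(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```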

Register renaming: It can be done with in-order cores, but its utility is much less than if they were OoO and speculative. x86 has reg/mem ops, which reduces some of the pressure on the register file, though at the expense of beating up on the L1.
The rather large amount of L1 that is private to each core seems to indicate it will be leaning pretty heavily on the cache.

Cache:
If the core is superscalar, it would most likely need a dual-ported data cache.
The data paths would be pretty huge, too.
512 bits is the width of one 64-byte cache line, so the vector engine would be pulling in an entire cache line to fill a register or memory operand.
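
One consequence, assuming 64-byte lines: a 512-bit load only stays within a single line if it's 64-byte aligned, otherwise it straddles two. A quick check:

```c
#include <stddef.h>
#include <stdint.h>

/* True if a `size`-byte access at `addr` stays within one 64-byte line.
   A 64-byte (512-bit) load needs 64-byte alignment to avoid a split. */
static int within_one_line(uintptr_t addr, size_t size) {
    return (addr / 64) == ((addr + size - 1) / 64);
}
```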

The L2 is apparently private-write to each core, which is good since the L1 was said to be write-through.

I'm curious to see where a stop on the ring bus would fit into this. It's supposed to interface the core with its neighbors, which means it would be charged with coherency, memory, and remote read traffic.
That sounds like it would sit next to the load/store hardware and might subsume the cache controller.

Threading: 4-way is known. Has it been said if it was SMT or something else?
I'd hazard it would be some variation of fine-grained and switch on event, like Niagara or GPU threading.
SMT might not be worth it.
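
As a toy model of what fine-grained / switch-on-event issue means (my own sketch, not Larrabee's actual mechanism): each cycle the core issues from the next ready thread context, and a thread that takes a cache miss drops out of the rotation until it resolves.

```c
#include <stdbool.h>

enum { NTHREADS = 4 };  /* 4-way threading, per the known figure */

typedef struct {
    bool ready;         /* false while waiting on an event (cache miss) */
    int  stall_cycles;  /* cycles until that event resolves             */
} hw_thread;

/* Round-robin over ready threads; -1 means every context is stalled
   (an in-order core would simply idle that cycle). */
static int pick_next(const hw_thread t[], int last_issued) {
    for (int i = 1; i <= NTHREADS; i++) {
        int c = (last_issued + i) % NTHREADS;
        if (t[c].ready)
            return c;
    }
    return -1;
}
```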

The big unknown is the vector instruction set.
It supposedly goes as far as implementing specialized control and branch instructions.
That means Larrabee for graphics might be able to run mostly on the extension set, with almost no conventional x86 instructions getting in the way.
In theory, that could mean significant changes to the ISA, since it would be a lot more self-contained than current extensions.
In graphics vector mode, it might dispose of a lot of the cruft found in the rest of the ISA, which means it could do some extra things with the empty opcode space.
(More registers, other stuff?)
Intel might not be that adventurous, though.
 
Just a thought:

Intel might not need the best architecture to compete; after all, they're the industry leader in process tech, and that can't be denied.
 
It's x86, so its being a less than optimal architecture is already a given.

The interesting part is going to be how Intel compensates, and the process advantage is likely going to be a big part of it.
 