Larrabee: 16 Cores, 2GHz, 150W, and more...

ADEX · Jun 12, 2007

Geeforcer said:
Is there even a need to point out that a 49.5mm x 49.5mm die would be the most insane thing ever?

As said elsewhere, that's probably the package, though that in itself implies a very large die.

However, compared to some chips, 49x49mm is *small*. There are chips so big that only one fits on a 200mm wafer. They're used in things like medical imaging or big telescopes - one has a 120 megapixel image sensor!

Megadrive1988 · Jun 18, 2007

Larrabee roadmap

32 and 48 core variants.

It would be interesting if a Larrabee variant made its way into one of the next-gen consoles.

nutball · Jun 18, 2007

The third slide is Polaris, not Larrabee.

3dilettante · Jun 18, 2007

It doesn't seem a stretch to assume that a 16-core variant would be a chip with another row of cores disabled.

I'm curious about the arrangement shown, where one row of cores is right next to the cache and the other row is more distant.
The other slides didn't have that detail.

ausername · Jun 20, 2007

Larrabee...blech...don't say that word... The people running that project in Germany are a bunch of lying dipshits... at least the ones I was working with....

AnarchX · Jun 20, 2007

source translated by Google

AnarchX · Jun 21, 2007

Larrabee will be used for gamer-cards:

source

Geo · Jun 21, 2007

So I gather you're ad-blocking on our site then?

As those ads have been here for MONTHS.

Arun · Jun 21, 2007

Geo said:
So I gather you're ad-blocking on our site then? As those ads have been here for MONTHS.

Actually, it's region-based I think...

(which is kinda logical anyway)

AnarchX · Jun 21, 2007

Geo said:
So I gather you're ad-blocking on our site then? As those ads have been here for MONTHS.

Now I know who these "selected web-sites" are that INQ is talking about. X-D

Arun · Jun 21, 2007

AnarchX said:
Now I know who these "selected web-sites" are that INQ is talking about. X-D

I don't think ours are operated via Google Ads, I could be wrong though since I'm not seeing them right now due to region restrictions...

ebola · Jun 22, 2007

Back to the question of ISA, registers, in-orderness

Looking back through microprocessor history, isn't it the case that pre-OoOE, the load/store machines had a bigger advantage through their large-register-file ISAs?

Though I suppose the SIMD enhancements and hyperthreading may shrink this factor...

I suppose I've got a deep-rooted "religious hatred" of the x86.

(Yes, I grew up on 68k assembler.) Maybe it's unfounded now, what with x86 having 16 GP registers these days.

The Cell is my favourite piece of silicon these days, with explicit cache control and unified SIMD/GP regs being its selling points. But I can just see the wheels in motion: it's going to be destroyed by an x86 derivative, isn't it...

(Heh, for all I know, maybe Intel will overlap the SIMD/GP registers in Larrabee and add more. Then I'll have no reason to complain, will I?)

aaaaa00 · Jun 22, 2007

3dilettante said:
True.

I wanted to reserve my list to first tries after which Intel didn't give up.

Small piece of trivia:

The i860 was the first architecture Windows NT booted on.

The second was MIPS.

Gubbi · Jun 22, 2007

ebola said:
Back to the question of ISA, registers, in-orderness
Looking back through microprocessor history, isn't it the case that pre-OoOE, the load/store machines had a bigger advantage through their large-register-file ISAs?

Yes, those extra registers would be used for unrolling your code to avoid false dependencies, which are completely removed by the register renaming apparatus of an OOO CPU.

Think of unrolling as an explicit (and clumsy) form of register renaming.

No real reason to hate x86 more than 68K. Yeah, from an assembly programmer's POV the 68K was really neat, but from a CPU architecture POV it's an absolute disaster: the instruction extensions introduced with the 68020 are mind-bendingly hard to handle in a superscalar implementation. The 68060 came out later than the Pentium, had more bugs and ran a lot slower.

Cheers

3dilettante · Jun 22, 2007

Unrolling is a little more involved than just register renaming.

It's also a limited form of branch prediction, prefetch hint, and a power-saving strategy for running loops.

The logic needed to do even limited loop unrolling in hardware includes larger branch predictors, renaming and scheduling logic, and extra retirement logic.

On the other hand, for tight long-running loops, leaving the unrolling to the hardware can decrease code size.

Even OoO processors benefit from loop unrolling, though it has to be balanced against filling up the instruction cache and register pressure.
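To make the "unrolling as explicit renaming" point concrete, here is a rough sketch in Python. Python has no registers, so this only illustrates the dependency structure: the function names and the unroll factor of 4 are made up for the example, and nothing here reflects how any particular compiler or CPU does it.

```python
def dot_rolled(a, b):
    """Single accumulator: every add depends on the result of the previous add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_unrolled4(a, b):
    """Unrolled by 4 with four independent accumulators.

    Each accumulator stands in for a separate architectural register, so the
    four multiply-adds in the loop body do not depend on each other.
    """
    n = len(a)
    acc0 = acc1 = acc2 = acc3 = 0.0
    limit = n - (n % 4)          # largest multiple of 4 not exceeding n
    i = 0
    while i < limit:
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
        i += 4
    while i < n:                 # remainder loop for the leftover elements
        acc0 += a[i] * b[i]
        i += 1
    return (acc0 + acc1) + (acc2 + acc3)
```

The unrolled version needs four live accumulators instead of one and a longer loop body, which is the register pressure and code size trade-off mentioned above. With floating point the two versions can also differ in rounding, since the additions happen in a different order.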

Frank · Jun 22, 2007

Gubbi said:
Yes, those extra registers would be used for unrolling your code to avoid false dependencies, which are completely removed by the register renaming apparatus of an OOO CPU.

Think of unrolling as an explicit (and clumsy) form of register renaming.

No real reason to hate x86 more than 68K. Yeah, from an assembly programmer's POV the 68K was really neat, but from a CPU architecture POV it's an absolute disaster: the instruction extensions introduced with the 68020 are mind-bendingly hard to handle in a superscalar implementation. The 68060 came out later than the Pentium, had more bugs and ran a lot slower.

Cheers

None of the current x86 CPUs really use the x86 ISA internally. They use a RISC CPU with an x86 JIT compiler on top. The 68000 and ARM are/were neat because they don't need all that to do well.

silent_guy · Jun 22, 2007

Frank said:
The 68000 and ARM are/were neat because they don't need all that to do well.

The 68000 instruction set is much closer to x86 than to ARM: variable length instructions and the ability to use complex destinations for calculations instead of just registers.

This is what makes it hard to design an efficient CPU for it. ARM has none of that.

A high-end 68000 CPU would basically need tricks similar to those used in current x86 CPUs.

68000 is cleaner than x86 in that it doesn't have such strong limitations wrt register usage, but that part can be avoided with register renaming, which is not all that hard.

3dilettante · Jun 22, 2007

It's obvious Larrabee won't be a full fledged Core2, so let's speculate on what it will and won't have.

We already know, I think, that it will be in-order.

Branch prediction: Larrabee's in many-core territory, so speculative execution of any sort is going to raise eyebrows.
It seems like a small sacrifice in this case to dump or severely restrict it.
Separate predictors impose a pretty hefty storage penalty, though basic prediction can simply be a 2-bit saturating counter stored alongside a cache line.
It would seem workable, but might be pointless for target workloads that are so rich in non-speculative work.
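For anyone who hasn't met the 2-bit saturating counter, a minimal sketch of one follows. The class and function names are invented for illustration, the starting state is an arbitrary choice, and this says nothing about what Larrabee actually implements.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=1):
        self.state = state  # assumed start: 1, "weakly not taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outcomes, counter=None):
    """Count how often the counter's prediction disagrees with the outcomes."""
    if counter is None:
        counter = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses
```

The two bits of hysteresis mean a single loop exit does not flip the prediction, so a loop branch that is taken many times and then falls through once costs roughly one miss per trip through the loop after warm-up.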

Superscalar execution: Larrabee's vector processor is 512 bits wide, which is too narrow to support the DP flop count per cycle, unless there are at least two fp computation pipes.
That might mean each core is at least 2-wide superscalar.

Hardware data prefetching: Sounds iffy with many-core. I'd expect it to be software-only (with x86 compatibility, it would have to support what's already there), perhaps augmented by the rumored new cache and data control instructions that should be coming along.

Register renaming: It can be done with in-order cores, but its utility is much less than if they were OoO and speculative. x86 has reg/mem ops, which reduces some of the pressure on the register file, though at the expense of beating up on the L1.
The rather large amount of L1 that is private to each core seems to indicate it will be leaning pretty heavily on the cache.

Cache:
If the core is superscalar, it would most likely need a dual-ported data cache.
The data paths would be pretty huge, too.
512 bits is the width of one cache line, so the vector engine would be pulling in an entire cache line to fill a register or memory operand.
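The arithmetic behind the 512-bit figure, as a quick sketch. The 64-byte line size follows from the "512 bits is one cache line" statement above; the `lines_touched` helper and the idea of unaligned accesses are assumptions added for illustration, not anything stated about Larrabee.

```python
VECTOR_BITS = 512

lanes_dp = VECTOR_BITS // 64      # double-precision lanes per vector
lanes_sp = VECTOR_BITS // 32      # single-precision lanes per vector
vector_bytes = VECTOR_BITS // 8   # bytes moved per full-width load or store

CACHE_LINE_BYTES = 64             # assumed: one line == one 512-bit vector


def lines_touched(addr, nbytes=vector_bytes, line=CACHE_LINE_BYTES):
    """Number of cache lines covered by an access of nbytes starting at addr."""
    first = addr // line
    last = (addr + nbytes - 1) // line
    return last - first + 1
```

So one vector holds 8 doubles or 16 singles, and a line-aligned vector load pulls in exactly one cache line. If unaligned accesses were allowed, a full-width access would straddle two lines, e.g. `lines_touched(8)` gives 2.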

The L2 is apparently private-write to each core, which is good since the L1 was said to be write-through.

I'm curious to see where a stop on the ring bus would fit into this. It's supposed to interface the core with its neighbors, which means it would be charged with coherency, memory, and remote read traffic.
That sounds like it would sit next to the load/store hardware and might subsume the cache controller.

Threading: 4-way is known. Has it been said if it was SMT or something else?
I'd hazard it would be some variation of fine-grained and switch on event, like Niagara or GPU threading.
SMT might not be worth it.

The big unknown is the vector instruction set.
It supposedly goes as far as implementing specialized control and branch instructions.
That means Larrabee for graphics might be able to run mostly on the extension set, with almost no conventional x86 instructions getting in the way.
In theory, that could mean significant changes to the ISA, since it would be a lot more self-contained than current extensions.
In graphics vector mode, it might dispose of a lot of the cruft found in the rest of the ISA, which means it could do some extra things with the empty opcode space.
(More registers, other stuff?)
Intel might not be that adventurous, though.

compres · Jun 22, 2007

Just a thought:

Intel might not need the best architecture to compete. After all, they are industry leaders in process tech; that cannot be denied.

3dilettante · Jun 22, 2007

It's x86, so its being a less than optimal architecture is already a given.

The interesting part is going to be how Intel compensates, and the process advantage is likely going to be a big part of it.
