Larrabee at GDC 09

Discussion in 'Architecture and Products' started by bowman, Feb 16, 2009.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Nominally you would write a program that loops over a set of work items. The body of that loop is concerned with a single work-item. The indexing just lets multiple instances of your program (thread) point to different work items. Not sure what you're trying to say here. Of course it's vectorized for you since you're not doing it yourself - there's no explicit mapping of groups of work items to the hardware configuration.
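A minimal sketch of that model in Python (the names kernel and global_id are illustrative, not any particular API):

```python
# You write only the loop body (the "kernel"); an index identifies
# which work item each running instance of the program handles.
work_items = [1.0, 2.0, 3.0, 4.0]
out = [0.0] * len(work_items)

def kernel(global_id, items, out):
    # The body is concerned with a single work item.
    out[global_id] = items[global_id] * 2.0

# The runtime/hardware performs this fan-out across lanes/threads;
# you never write the loop or map work items to hardware yourself.
for global_id in range(len(work_items)):
    kernel(global_id, work_items, out)

print(out)  # [2.0, 4.0, 6.0, 8.0]
```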
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
I was playing on the built-in joke of "scalar" in your text, i.e., scalar things aren't parallel by definition.
     
  3. JasonCross

    Newcomer

    Joined:
    Jul 14, 2005
    Messages:
    39
    Likes Received:
    4
    I think 12 cores must be WAY off.

    Consider that the P54C (which the Larrabee core is based on) had 16K of L1 and no onboard L2 cache, and weighed in at 3.1 million transistors. It was 4.5M transistors for the Pentium with MMX, which had twice the L1 and 256K of L2, if memory serves.

    LRB's core is based on P54C, with 4X the L1 cache, and 256K of L2 cache, and of course the really big vector unit. How much bigger does the vector unit make each core, in terms of transistor count? Twice as big? Three times? Maybe it adds 10M transistors to each core?

Let's assume that, making each core some ~13M transistors. You could still pack 64 cores in 832M transistors. Add in the texture units, memory controller, and ring bus between the L2 caches, and it's not hard to predict that LRB would be smaller on a 45nm process than GT200 was on 65nm. Way smaller. Intel has also stated that they're keeping the pipelines really short.
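The arithmetic, spelled out (all the per-core figures are the guesses from the paragraph above, not Intel numbers):

```python
# Back-of-the-envelope budget: P54C-derived core plus a big vector unit.
p54c = 3.1e6        # original P54C, ~3.1M transistors
vpu_guess = 10e6    # guessed cost of the wide vector unit
core = 13e6         # rounded per-core total, bigger caches included

print(64 * core / 1e6, "M transistors for 64 cores")  # 832.0
```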

I also highly doubt the 512-bit external memory bus being floated around. One of the key differentiating features of LRB, going back to the first time Intel publicly talked about it, is what they claim is FAR less bandwidth needed to render a scene. Chalk that up to a tiled, deferred renderer and lots of cache per core (big enough that some of the necessary data may still be in there when the next frame comes around). Intel would talk about how Larrabee uses 1/4 to 1/2 the bandwidth per frame of conventional graphics products, averaging somewhere around 1/3.

    Assuming it still requires 1/2 the bandwidth per frame as most graphics cards, you really only need something like a 384-bit GDDR3 memory interface with reasonable memory speeds, or a 256-bit GDDR5 interface with modest speeds, to keep the thing fed.
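As a sanity check, rough peak numbers for those interfaces (the per-pin data rates are illustrative 2009-era values, not figures from the post):

```python
def gb_per_s(bus_bits, gbps_per_pin):
    # bytes per second = (bus width in bits / 8) * per-pin data rate
    return bus_bits / 8 * gbps_per_pin

print(gb_per_s(384, 2.0))  # 384-bit GDDR3 @ 2.0 Gbps/pin -> 96.0 GB/s
print(gb_per_s(256, 3.6))  # 256-bit GDDR5 @ 3.6 Gbps/pin -> 115.2 GB/s
print(gb_per_s(512, 2.2))  # GT200-style 512-bit GDDR3    -> ~141 GB/s
```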

    I was at Abrash's talk at GDC, and he said, "I can't talk about how many cores there will be, but we're talking teraflops." He didn't say "around a teraflop" or anything, he said "teraflops."

    Granted, that leaves a lot of wiggle room. Colloquially, someone could say "teraflops" if it's 1.5 teraflops, or if it's 8 teraflops. I don't think LRB is going to be 8 teraflops, but I could see 2. 64 cores at 2GHz would be 2 teraflops, right?
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Ah, how could I forget :lol:
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Double precision, yeah!

    This spreads its FLOPs around, seemingly having a pair of MADs per processor to achieve 1TFLOP (4 FLOPs per core, 2 floating point units):

    http://techresearch.intel.com/articles/Tera-Scale/1449.htm

    Wider SIMDs and 45nm, at least, should make it smaller. It's 100M transistors on 65nm. It's also very toasty at 3+GHz :shock:
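For reference, the 80-tile Tera-Scale chip (Polaris) behind that link reaches its headline figure roughly like this (the tile count and ~3.16 GHz clock are from Intel's published Tera-Scale material, not this post):

```python
tiles = 80      # processing elements on Polaris
fpus = 2        # two floating-point MAC units per tile
fmac = 2        # a multiply-add counts as 2 FLOPs
clock = 3.16e9  # the "3+ GHz" operating point

print(tiles * fpus * fmac * clock / 1e12, "TFLOPS")  # ~1.01
```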

    Jawed
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,332
    Likes Received:
    119
    Location:
    San Francisco
That would be 4 teraflops -> 64 cores * 16 lanes * 2 ops (1 madd per clock) * 2 GHz.
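Spelled out:

```python
cores = 64
lanes = 16    # 512-bit SIMD / 32-bit floats
madd = 2      # one multiply-add per lane per clock = 2 FLOPs
clock = 2e9

print(cores * lanes * madd * clock / 1e12, "TFLOPS")  # 4.096
```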
     
  7. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
L2 was still on the motherboard in the Pentium days (including Pentium MMX). Only L1 was on-die. But yes, L1 was doubled from 16K to 32K on P55C.
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,332
    Likes Received:
    119
    Location:
    San Francisco
IIRC P54C L1 was 8 KB, not 16 KB.
     
  9. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I think it was 8k code + 8k data, and then expanded to 16k code + 16k data.
    So respectively 16k and 32k in total.
     
  10. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
Disclaimer: gross approximation.
Say the "scalar part" of Larrabee is ~6 million transistors (to support the extra features); that would put the whole core at around 18 million transistors (2/3 of the core being the vector unit). To be safer we could say 20 million. How many transistors do 256K of L2 plus the associated logic amount to? ~2 million?
Could we state that 25 million transistors is a reasonable figure for a complete Larrabee core?
     
  11. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    Only the memory cells of 256 KiB of L2 using 6T SRAM would be over 12 million transistors:

    256 * 1024 (bytes) * 8 (bits) * 6 (transistors) ~= 12.6 million transistors

That's for a non-ECC-protected, non-redundant L2. In practice you cannot do without some kind of data protection and some redundancy, so the actual number will certainly be north of that.
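The same arithmetic as a quick script (raw 6T cells only; tags, sense amps and control logic all come on top):

```python
bytes_l2 = 256 * 1024   # 256 KiB of data
bits = bytes_l2 * 8
print(bits * 6 / 1e6, "M transistors")  # ~12.58 for the 6T cells alone
```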
     
  12. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
Oops, too bad. I made a rough calculation, 256*1000*7 plus healthy rounding (7 instead of 6 to take the extra logic into account), but I didn't convert bytes to bits... :oops:
     
    #232 liolio, Apr 16, 2009
    Last edited by a moderator: Apr 16, 2009
  13. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,505
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Before ECC and redundant cells, there is the basic parity bit per byte, so...:
    256 * 1024 (bytes) * 9 (bits) * 6 (transistors) :wink:
     
  14. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,018
    Likes Received:
    114
You don't need any additional parity bits if you have ECC. ECC typically requires 8 bits per 64 bits, so the same amount as parity (usually 1 bit per byte). Anyway, if you're willing to skip any kind of error detection you wouldn't need either, so 8 bits per byte is correct.
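Both schemes carry the same 12.5% overhead, which a quick check confirms (a sketch of the bit counting, not a statement about Larrabee's actual array):

```python
data_bits = 256 * 1024 * 8        # 256 KiB of data

parity = data_bits // 8           # 1 parity bit per byte
ecc = (data_bits // 64) * 8       # 8 SECDED check bits per 64-bit word
assert parity == ecc              # identical overhead: 262144 bits

print((data_bits + ecc) * 6 / 1e6, "M transistors")  # ~14.16 with 6T cells
```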
     
  15. Heinrich4

    Regular

    Joined:
    Aug 11, 2005
    Messages:
    596
    Likes Received:
    9
    Location:
    Rio de Janeiro,Brazil
    #235 Heinrich4, Apr 16, 2009
    Last edited by a moderator: Apr 16, 2009
  16. JasonCross

    Newcomer

    Joined:
    Jul 14, 2005
    Messages:
    39
    Likes Received:
    4
    Ah, yes. The one op that is two ops. ;)

Okay, even if you cut it back to 32 cores because of that, 12 cores just seems way, way off. Somebody somewhere is basing their estimate on more modern out-of-order Intel CPU cores or something.

    Even with L2 cache at 12M transistors per core, and the rest of the core at like 20M transistors (we're starting to highball it here), you're still at "only" a billion transistors for 32 cores. You could double that to 64 cores and include the ring bus and texture units and still be, on 45nm, the same size or smaller than GT200 was on 65nm.
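That budget, run through for a few core counts (the per-core figures are the highball guesses above; GT200's ~1.4B transistors on 65nm is the yardstick):

```python
per_core = 12e6 + 20e6   # highballed L2 + rest of core

for cores in (12, 32, 64):
    print(cores, "cores:", cores * per_core / 1e9, "B transistors")
# 12: 0.38B, 32: 1.02B, 64: 2.05B -- and 45nm's roughly 2x density
# over 65nm keeps even the 64-core figure at or below GT200's die area.
```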

    If I had to guess, and this is only a guess, I would think Larrabee will have 64 cores or more in its high-end configuration. Now, clock speeds are a real interesting bit. Who knows? A big, hot, dense, power-hungry chip with deliberately short pipelines spells slower clock speeds. But it's Intel.

    As for graphics performance...I've spoken with Intel folks in the know and they won't tell me how it's looking. They will say that they know they have to be faster than the current cards, because they'll be up against the next-gen from Nvidia and ATI. I believe their catchphrase for their chances against those is, "It would be arrogant of us to say we'll be faster when we don't even know how well those cards will perform."

To me, the real issue Intel will need to step up to the plate on is drivers. The GMA products are famously problematic with lots of games, not just in performance but in compatibility and rendering glitches and such. Their control panels and other desktop software for EVERYTHING (motherboards, etc.) are just TERRIBLE. Ugly, poor interfaces, poor options, and so on. They need to catch up to the years and years of experience and relationships Nvidia and ATI have built with developers and publishers in testing games and fixing graphics glitches, even taking a game that does something "wrong" and making it look right. And they need to deliver the kind of control panel software an enthusiast would expect. It's a tall order from a company that really hasn't done end-user software well, like, ever.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    I hope your friends are hedging because that doesn't sound very promising. It seems like the software guys have a long road ahead of them even after hardware is ready. Nobody knows what kind of numbers the thing is going to put up in the end.
     
  18. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Not sure if you've been paying attention recently... but I happen to have a laptop with Intel X3100 graphics... and I found that they have been releasing new drivers almost every month.
    Last year they added full DX10 support (I dedicated a thread to this milestone), and since then they've gradually been fixing bugs and improving performance. It's not perfect yet, but it's improving at an impressive rate.
    The DX10 support is good enough to run Crysis (although my X3100 and 1.5 GHz processor are way underpowered... I guess a nice Core2 Quad with X4500 or so would do much better).
    Aside from that, Intel also offers ClearVideo video acceleration, which looks very nice in PowerDVD.
They are still very weak in OpenGL, mainly because the wgl extensions simply aren't implemented, which means most software can't even initialize. Aside from that, I believe they do support OpenGL 2.1. But I can't run something like Doom3; it just won't start with the missing functions.

At any rate, for me it works fine. I get full Vista Aero, I can even develop my DX10 code on the laptop now without having to resort to refrast, and watching DVDs or other video stuff works great as well.
    I think Intel has been cleaning up its act in the past 2 years.
     
    #238 Scali, Apr 16, 2009
    Last edited by a moderator: Apr 16, 2009
  19. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
I wonder why the Windows people complain so much about Intel IGP drivers. Their Linux drivers, at least, are about as good as drivers get.
     