Larrabee at GDC 09

If you have a "scalar" program, don't count on it getting vectorized by any hardware, be it Nvidia, ATI, or Intel. threadId/blockId are effectively just array index variables and can be handled in either software or hardware within the context of the CUDA API.

Nominally you would write a program that loops over a set of work items. The body of that loop is concerned with a single work-item. The indexing just lets multiple instances of your program (thread) point to different work items. Not sure what you're trying to say here. Of course it's vectorized for you since you're not doing it yourself - there's no explicit mapping of groups of work items to the hardware configuration.
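To make that concrete, here's a minimal sketch of the single-work-item model in CUDA terms (the kernel and names are made up for illustration, not anything vendor-specific):

__global__ void scale(float *out, const float *in, float k, int n)
{
    // blockIdx/threadIdx are just the array index that tells this
    // instance which work item it owns
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];  // the body only touches one work item
}

// The "scalar" version is just the loop the indexing replaces:
// for (int i = 0; i < n; ++i) out[i] = k * in[i];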
 
Nominally you would write a program for a single work-item. The indexing just lets multiple instances of your program (thread) point to different work items. Not sure what you're trying to say here. Of course it's vectorized for you since you're not doing it yourself.

I was playing on the built-in joke of "scalar" in your text, i.e., scalar things aren't parallel by definition.
 
I think 12 cores must be WAY off.

Consider that the P54C (which the Larrabee core is based on) had 16K of L1 and no onboard L2 cache, and weighed in at 3.1 million transistors. It was 4.5M transistors for the Pentium with MMX, which had twice the L1 and 256K of L2, if memory serves.

LRB's core is based on P54C, with 4X the L1 cache, and 256K of L2 cache, and of course the really big vector unit. How much bigger does the vector unit make each core, in terms of transistor count? Twice as big? Three times? Maybe it adds 10M transistors to each core?

Let's assume that, making each core some ~13M transistors. You could still pack 64 cores into 832M transistors. Add in the texture units, memory controller, and ring bus between the L2 caches, and it's not hard to predict that LRB would be smaller on a 45nm process than GT200 was on 65nm. Way smaller. Intel has also stated that they're keeping the pipelines really short.
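Rough sanity check on that (the per-core figure is my guess above, the GT200 numbers are the public ones): 64 cores * ~13M ~= 832M transistors, plus maybe another couple hundred million for texture units, memory controller, and the ring bus, lands around 1B total. GT200 is roughly 1.4B transistors and ~576 mm^2 on 65nm, and 65nm to 45nm is about a full node of density, so a ~1B transistor LRB really should come out well under GT200's die size.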

I also highly doubt the 512-bit external memory bus figure being floated around. One of the key differentiating features of LRB, going back to the first time Intel publicly talked about it, is what they claim is FAR less bandwidth needed to render a scene. Chalk that up to a tiled, deferred renderer and lots of cache per core (big enough that some of the necessary data may still be in there when the next frame comes around). Intel would talk about how Larrabee uses 1/4 to 1/2 the bandwidth per frame of conventional graphics products, averaging somewhere around 1/3.

Assuming it still requires 1/2 the bandwidth per frame as most graphics cards, you really only need something like a 384-bit GDDR3 memory interface with reasonable memory speeds, or a 256-bit GDDR5 interface with modest speeds, to keep the thing fed.
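Putting made-up but plausible speeds on those (bandwidth = bus width / 8 * effective data rate): a 384-bit bus at 2.0 Gbps GDDR3 is 48 * 2.0 ~= 96 GB/s, and a 256-bit bus at 3.6 Gbps GDDR5 is 32 * 3.6 ~= 115 GB/s. For comparison, GT200's 512-bit GDDR3 at ~2.2 Gbps is ~142 GB/s, so half the per-frame bandwidth need really does fit on a narrower bus.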

I was at Abrash's talk at GDC, and he said, "I can't talk about how many cores there will be, but we're talking teraflops." He didn't say "around a teraflop" or anything, he said "teraflops."

Granted, that leaves a lot of wiggle room. Colloquially, someone could say "teraflops" if it's 1.5 teraflops, or if it's 8 teraflops. I don't think LRB is going to be 8 teraflops, but I could see 2. 64 cores at 2GHz would be 2 teraflops, right?
 
Consider that the P54C (which the Larrabee core is based on) had 16K of L1 and no onboard L2 cache, and weighed in at 3.1 million transistors. It was 4.5M transistors for the Pentium with MMX, which had twice the L1 and 256K of L2, if memory serves.

L2 was still on the motherboard in the Pentium days (including Pentium MMX). Only L1 was on-die. But yes, L1 was doubled from 16K to 32K on the P55C.
 
Disclaimer: gross approximation.
Say the "scalar part" of Larrabee is ~6 million transistors (to support the extra features); that would put the whole core at around 18 million transistors (2/3 of the core being the vector unit). To be safe we could call it 20 million. How many transistors do 256K of L2 plus the associated logic come to? ~2 million?
Could we state that 25 million transistors is a reasonable figure for a complete Larrabee core?
 
How many transistors do 256K of L2 plus the associated logic come to? ~2 million?
Only the memory cells of 256 KiB of L2 using 6T SRAM would be over 12 million transistors:

256 * 1024 (bytes) * 8 (bits) * 6 (transistors) ~= 12.6 million transistors

That's for a non-ECC-protected, non-redundant L2. In practice you can't get away without some kind of data protection and some redundancy, so the actual number will certainly be north of that.
 
Oops, too bad. I made a rough calculation, 256*1000*7 plus healthy rounding (7 instead of 6 to take the extra logic into account), but I didn't convert bytes to bits... :oops:
 
Only the memory cells of 256 KiB of L2 using 6T SRAM would be over 12 million transistors:

256 * 1024 (bytes) * 8 (bits) * 6 (transistors) ~= 12.6 million transistors
Before ECC and redundant cells, there is the basic parity bit per byte, so...:
256 * 1024 (bytes) * 9 (bits) * 6 (transistors) ;)
 
Before ECC and redundant cells, there is the basic parity bit per byte, so...:
256 * 1024 (bytes) * 9 (bits) * 6 (transistors) ;)
You don't need any additional parity bits if you have ECC. ECC typically requires 8 bits per 64 bits, so the same amount as parity (1 bit per byte, usually). Anyway, if you're willing to skip any kind of error detection you certainly wouldn't need either, so 8 bits is correct.
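To put numbers on the two options (just the arithmetic, using the figures above): parity at 1 bit per byte is 256 * 1024 * 9 (bits) * 6 ~= 14.2 million transistors for the cells, and ECC at 8 bits per 64 (72-bit words) works out to the same, 256 * 1024 * 8 * 72/64 * 6 ~= 14.2 million. Either way the protected array costs a couple of million transistors over the bare 12.6M figure.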
 
That would be 4 Teraflops -> 64 cores * 16 lanes * 2 ops (1 madd per clock) * 2 GHz.

Ah, yes. The one op that is two ops. ;)

Okay, even if you cut it back to 32 cores because of that, 12 cores just seems way, way off. Somebody somewhere is doing estimation thinking of more modern out-of-order Intel CPU cores or something.

Even with L2 cache at 12M transistors per core, and the rest of the core at like 20M transistors (we're starting to highball it here), you're still at "only" a billion transistors for 32 cores. You could double that to 64 cores and include the ring bus and texture units and still be, on 45nm, the same size or smaller than GT200 was on 65nm.

If I had to guess, and this is only a guess, I would think Larrabee will have 64 cores or more in its high-end configuration. Now, clock speeds are a real interesting bit. Who knows? A big, hot, dense, power-hungry chip with deliberately short pipelines spells slower clock speeds. But it's Intel.
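For reference on where the 16 lanes and the flops numbers come from (just arithmetic, not inside information): the vector unit is 512 bits wide, so 512/32 = 16 single-precision lanes, and counting the madd as 2 flops gives 64 * 16 * 2 * 2 GHz ~= 4 TFLOPS, or half that for a 32-core part at the same clock.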

As for graphics performance...I've spoken with Intel folks in the know and they won't tell me how it's looking. They will say that they know they have to be faster than the current cards, because they'll be up against the next-gen from Nvidia and ATI. I believe their catchphrase for their chances against those is, "It would be arrogant of us to say we'll be faster when we don't even know how well those cards will perform."

To me, the real issue Intel will need to step up to the plate on is drivers. The GMA products are famously problematic with lots of games, not just in performance but in compatibility and rendering glitches and stuff. Their control panels and other desktop software for EVERYTHING (motherboards, etc.) are just TERRIBLE. Ugly, poor interfaces, poor options, and so on. They need to catch up to the years and years of experience and relationships Nvidia and ATI have built with developers and publishers in testing games and fixing graphics glitches, even taking a game that does something "wrong" and making it look right. And they need to deliver the kind of control panel software an enthusiast would expect. It's a tall order from a company that really hasn't done end-user software well, like, ever.
 
As for graphics performance...I've spoken with Intel folks in the know and they won't tell me how it's looking. They will say that they know they have to be faster than the current cards, because they'll be up against the next-gen from Nvidia and ATI. I believe their catchphrase for their chances against those is, "It would be arrogant of us to say we'll be faster when we don't even know how well those cards will perform."

I hope your friends are hedging because that doesn't sound very promising. It seems like the software guys have a long road ahead of them even after hardware is ready. Nobody knows what kind of numbers the thing is going to put up in the end.
 
To me, the real issue Intel will need to step up to the plate on is drivers. The GMA products are famously problematic with lots of games, not just in performance but in compatibility and rendering glitches and stuff.

Not sure if you've been paying attention recently... but I happen to have a laptop with Intel X3100 graphics... and I found that they have been releasing new drivers almost every month.
Last year they added full DX10 support (I dedicated a thread to this milestone), and since then they've gradually been fixing bugs and improving performance. It's not perfect yet, but it's improving at an impressive rate.
The DX10 support is good enough to run Crysis (although my X3100 and 1.5 GHz processor are way underpowered... I guess a nice Core2 Quad with X4500 or so would do much better).
Aside from that, Intel also offers ClearVideo video acceleration, which looks very nice in PowerDVD.
They are still very weak in OpenGL, mainly because the wgl extensions simply aren't implemented. This means that most software can't even initialize. Aside from that they do support OpenGL 2.1, I believe. But I can't run something like Doom3; it just won't start because of the missing functions.

At any rate, for me it works fine. I get full Vista Aero, I can even develop my DX10 code on the laptop now without having to resort to refrast, and watching DVDs or other video stuff works great as well.
I think Intel has been cleaning up its act in the past 2 years.
 
I wonder why the Windows people complain so much about Intel IGP drivers. Their Linux drivers, at least, are about as good as drivers get.
 