Larrabee at GDC 09

fellix · Apr 12, 2009

Nay -- 675 is too large for 85 pieces.

Jawed · Apr 12, 2009

trinibwoy said:
Any clue how strands and fibers are actually represented?

The simplest scenario is in pixel shading, where 4 quads of fragments are grouped to form what Intel calls a qquad.

Depending on the kind of code contained in the pixel shader and on the number of render targets written, each of the hardware threads that undertake pixel shading will take "batches" of qquads.

The program that actually runs on Larrabee will be a loop, "for each qquad: shade".

This is similar to the discussion we've been having recently about making a kernel produce more than one result: logically making one invocation compute multiple work items.

So, a fibre corresponds with a qquad. "Fibre" is purely software-implemented multi-threading. "Strand" is then the number of elements that share a program counter.

In Larrabee it appears that the normal way fibres will be constructed is from 16 strands. There's no reason not to use more (and for double-precision the minimum would be only 8) but this would prolly be a tweak for performance.
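A rough sketch of the "for each qquad: shade" loop described above, assuming a qquad is 16 pixels (4 quads of 2x2) and that each pixel is one strand. The names `shade`, `shade_qquad` and `run_fibres` are made up for illustration; this is not Intel's code or API, and a Python list stands in for a 16-wide vector register.

```python
# Illustrative sketch only: names and sizes are assumptions, not Intel's API.
STRANDS = 16  # elements sharing one program counter (4 quads x 4 pixels)

def shade(pixel):
    """Hypothetical per-pixel shader: invert a greyscale value."""
    return 1.0 - pixel

def shade_qquad(qquad):
    """One fibre's work: the same shader code applied to all 16 strands."""
    assert len(qquad) == STRANDS
    return [shade(p) for p in qquad]

def run_fibres(batch):
    """What one hardware thread runs: 'for each qquad: shade'."""
    return [shade_qquad(qquad) for qquad in batch]

# A batch of 3 qquads, i.e. 3 fibres handled by one hardware thread.
batch = [[i / 16 for i in range(STRANDS)] for _ in range(3)]
result = run_fibres(batch)
```

On real hardware the inner list comprehension would be a single vector instruction stream; the outer loop is the part that is purely software.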

Can I take a single hardware thread and do anything I want

Yep. If there's 4GB of memory then you have 4GB/64 bytes (64 bytes = 16 scalars packed) ~67M variables that you can store in memory. If you have 32 cores and 4 hardware threads per core, that's ~0.5M such variables (32MB) of private storage per hardware-scheduled context.
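The arithmetic can be checked with a few lines, assuming 4GB means 4 GiB and that every hardware thread gets an equal share of memory (both are assumptions, as are the variable names):

```python
# Back-of-envelope check; equal split between threads is an assumption.
GIB = 2**30
memory_bytes = 4 * GIB
vector_bytes = 64            # 16 packed 4-byte scalars
cores = 32
threads_per_core = 4

vector_vars = memory_bytes // vector_bytes          # 16-wide variables in total
hw_threads = cores * threads_per_core               # hardware-scheduled contexts
vars_per_thread = vector_vars // hw_threads         # variables per context
bytes_per_thread = vars_per_thread * vector_bytes   # private bytes per context

print(vector_vars, hw_threads, vars_per_thread, bytes_per_thread // 2**20, "MiB")
```

So the per-context figure is about half a million 64-byte variables, which is 32 MiB of storage.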

How you split up memory (and allocate some for shared storage, e.g. as textures usable by all threads) is up to you, depending on how coarse- or fine-grained you want to make your work-items.

I'm also curious as to how hardware and software switching are going to work together.

As far as I can tell the 4 hardware threads (contexts) run symmetrically, by default. This may be how Intel hides read-after-write latency in the register file.

It appears that individual contexts can be put to sleep - they may request it. Not sure. This happens when a long-latency operation is started and they have no instructions that can be issued in the shadow of that latency. Ideally the contexts make their requests as early as possible in order to fill the latency-shadow with work, but obviously the randomness of memory operations makes that unpredictable.
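A toy model of that behaviour, assuming plain round-robin issue across the awake contexts. This is a guess at the policy for illustration, not a documented description of the hardware, and `issue_order` is a made-up name.

```python
def issue_order(n_contexts, cycles, asleep=frozenset()):
    """Return which context issues on each cycle, rotating over awake contexts.

    Assumption: strict round-robin, and a sleeping context is simply skipped.
    """
    order = []
    ctx = 0
    for _ in range(cycles):
        # Look for the next awake context, trying each one at most once.
        for _ in range(n_contexts):
            if ctx not in asleep:
                break
            ctx = (ctx + 1) % n_contexts
        else:
            order.append(None)  # every context is asleep: a pipeline bubble
            continue
        order.append(ctx)
        ctx = (ctx + 1) % n_contexts
    return order

all_awake = issue_order(4, 8)
one_asleep = issue_order(4, 8, asleep={2})
```

With all four contexts awake, each context sees three other instructions issue between two of its own. Every context that goes to sleep shortens that gap, which is why there is less latency hidden when several contexts are stalled on memory at once.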

Jawed

Jawed · Apr 12, 2009

fellix said:
Nay -- 675 is too large for 85 pieces.

Are you using a dies-on-wafer calculator that assumes square dies?
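To see why the die's aspect ratio matters, here is a minimal grid-counting sketch. It assumes a 300mm wafer, no scribe lines, no edge exclusion by default, and only four candidate grid alignments; the die dimensions in the usage lines are hypothetical shapes that both come to about 675mm², not measured values.

```python
import math

def count_dies(wafer_d, die_w, die_h, edge=0.0):
    """Count whole rectangular dies that fit on a circular wafer (all in mm).

    Tries four grid alignments (grid line or die centre on the wafer centre,
    per axis) and returns the best count. A die counts only if all four of
    its corners lie inside the usable radius.
    """
    r = wafer_d / 2 - edge
    best = 0
    nx = int(r // die_w) + 2
    ny = int(r // die_h) + 2
    for ox in (0.0, die_w / 2):
        for oy in (0.0, die_h / 2):
            n = 0
            for i in range(-nx, nx):
                for j in range(-ny, ny):
                    x0 = ox + i * die_w
                    y0 = oy + j * die_h
                    corners = [(x0, y0), (x0 + die_w, y0),
                               (x0, y0 + die_h), (x0 + die_w, y0 + die_h)]
                    if all(math.hypot(x, y) <= r for x, y in corners):
                        n += 1
            best = max(best, n)
    return best

# Hypothetical shapes with (roughly) the same area:
square = count_dies(300, 26.0, 26.0)   # 676 mm^2
oblong = count_dies(300, 22.5, 30.0)   # 675 mm^2
print(square, oblong)
```

Two dies of equal area can give different counts depending on their shape, so a calculator that silently assumes a square die can be off by a handful of dies either way.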

Jawed

rpg.314 · Apr 12, 2009

Considering how secretive they have been, I'll bet that those pics are intended to create and spread FUD. And by the looks of it, they are succeeding.

Tridam · Apr 12, 2009

bowman said:
He only held up two wafers during the keynote so it's either one or the other. The one on the bench behind him does look like Jasper Forest so it's pretty sure in my book.

The picture on Intel's website is of the same wafer as the one in my picture. It's Larrabee. At least Gelsinger said so. The Jasper Forest wafer is the one you can see in the background. Its die is much smaller, and it is also the one computerbase said was Larrabee.

CarstenS · Apr 12, 2009

Is Jasper Forest the same as the polaris proof of concept thingie? The wafer in the background sure looks a lot like polaris.

Tridam · Apr 12, 2009

CarstenS said:
Is Jasper Forest the same as the polaris proof of concept thingie? The wafer in the background sure looks a lot like polaris.

Not at all. Jasper Forest is a Nehalem-based CPU for embedded market.

iwod · Apr 13, 2009

I am starting to think that Larrabee's die size, huge as it is, doesn't matter at all, especially in the current economy. Intel has lots of 45nm wafer capacity unused, or at least not at 100% production capacity. Using it to make Larrabee would help smooth the transition to 32nm. They can sell it at a competitive price, still make use of their production capacity, and iron out all the software bugs...

rpg.314 · Apr 15, 2009

Here

"What you saw is the 'extreme' version, let me put it that way," said Otellini, adding that the GPU is in the debug stage now. "I would expect volume introduction of this product to be early next year."

128 LRB cores

What say you?

BTW, does anyone have any idea about the die size (and the process, of course) of the original Pentiums on which this is supposedly based?

hoho · Apr 15, 2009

rpg.314 said:
128 LRB cores
What say you?

I'd be very surprised if it was >32 cores.

rpg.314 said:
BTW, does any one have any idea about the die size (and the process, of course) of the original pentiums on which this is supposedly based?

I don't but it has little use anyway, the size of the "real" CPU part of Larrabee cores should be insignificant compared to the vector ALU so making any conclusions about the number of cores based on it is pretty meaningless.

rpg.314 · Apr 15, 2009

The lrb paper said that the vpu takes ~2/3 the die area of one lrb core.

3dilettante · Apr 15, 2009

It stated 1/3 of the core.

hoho · Apr 15, 2009

3dilettante said:
It stated 1/3 of the core.

Did that definition of "core" include caches and (part of) ringbus?

3dilettante · Apr 15, 2009

Intel stated the ring bus runs above the L2 to save space.
Whether the L2 tiles count towards that estimation, I'm not sure.
Intel typically does not include the L2 as part of the core.

Scali · Apr 15, 2009

Jawed said:
So, a fibre corresponds with a qquad. "Fibre" is purely software-implemented multi-threading. "Strand" is then the number of elements that share a program counter.

In Larrabee it appears that the normal way fibres will be constructed is from 16 strands. There's no reason not to use more (and for double-precision the minimum would be only 8) but this would prolly be a tweak for performance.

So let me get this straight...
A fibre is a piece of SIMD code, where you have 16 scalar (or 8 in the case of DP) strands operating in parallel, all running on a logical 'x86' core?

That sounds quite similar to what nVidia does.

trinibwoy · Apr 15, 2009

Jawed said:
The program that actually runs on Larrabee will be a loop, "for each qquad: shade".

This is similar to the discussion we've been having recently about making a kernel produce more than one result: logically making one invocation compute multiple work items.

So, a fibre corresponds with a qquad. "Fibre" is purely software-implemented multi-threading. "Strand" is then the number of elements that share a program counter.

Thanks, that's the impression I got from the slides. I'm still wary of LRB's hard-wired latency hiding capacity within a fiber given the variability of cache and global memory latencies.

How you split up memory (and allocate some for shared storage, e.g. as textures usable by all threads) is up to you, depending on how coarse- or fine-grained you want to make your work-items.

That should make for some fun/nerve-wracking performance tuning

Scali said:
So let me get this straight...
A fibre is a piece of SIMD code, where you have 16 scalar (or 8 in the case of DP) strands operating in parallel, all running on a logical 'x86' core?

That sounds quite similar to what nVidia does.

On the surface yes. But that's where the similarities end it seems. Nvidia assumes more responsibility for latency hiding than LRB does. I'm also not clear on whether the developer is also responsible for predication within a strand group.

Jawed · Apr 15, 2009

How many copies of SSE could be fitted within 700mm²?

Then apply these scalings:

1/2 for LRBni
1/4 due to 16-wide in Larrabee
1/3 due to the stated proportion of VPU to Larrabee core
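The three scalings multiply out to 1/24, so the estimate is (copies of SSE that fit in the die) / 24. A small sketch of that arithmetic follows; the SSE unit area is left as a parameter because no figure is given here, and the value used in the usage line is a pure placeholder, not a measurement.

```python
from fractions import Fraction

# The three scalings from the list above.
scalings = {
    "LRBni": Fraction(1, 2),
    "16-wide": Fraction(1, 4),
    "VPU share of core": Fraction(1, 3),
}

combined = Fraction(1)
for factor in scalings.values():
    combined *= factor  # ends up as 1/24

def estimated_cores(die_area_mm2, sse_unit_area_mm2):
    """Copies of an SSE unit that fit in the die, scaled down to Larrabee cores."""
    sse_copies = Fraction(die_area_mm2) / Fraction(sse_unit_area_mm2)
    return sse_copies * combined

PLACEHOLDER_SSE_AREA_MM2 = 1  # NOT a real figure; substitute a measured area
print(combined, float(estimated_cores(700, PLACEHOLDER_SSE_AREA_MM2)))
```

The result only means something once a real SSE unit area on the same process is plugged in.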

Jawed

nAo · Apr 15, 2009

trinibwoy said:
I'm also not clear on whether the developer is also responsible for predication within a strand group.

What do you exactly mean by that?

trinibwoy · Apr 15, 2009

nAo said:
What do you exactly mean by that?

How are the vector masks generated? Is it always handled by the hardware based on branches encountered in the code? I guess I'm missing the part where my scalar program would get strung across the vector unit for 16 data items at a time. It's pretty clear what happens in CUDA with the warp/block construct. But does LRB also have a similar setup where it automatically breaks down your data set into strand groups and maps them across the VPU? I thought those were just concepts and it was up to the developer to structure the data and code appropriately?
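For reference, this is what mask-based predication of a branchy scalar program looks like in general. It is a generic sketch using Python lists in place of 16-wide registers; it does not use LRBni and says nothing about whether hardware, compiler or developer builds the mask on Larrabee. All function names are made up.

```python
LANES = 16  # one strand per lane

def branchy_scalar(x):
    """The program as written for a single data item."""
    return x * 2 if x > 7 else x + 100

def vector_compare_gt(xs, threshold):
    """A vector compare: produces one mask bit per lane."""
    return [x > threshold for x in xs]

def masked_select(mask, if_true, if_false):
    """Per-lane select: take if_true where the mask is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]

def branchy_vector(xs):
    """The same program strung across 16 lanes: both sides run, mask picks."""
    mask = vector_compare_gt(xs, 7)
    then_side = [x * 2 for x in xs]     # computed for every lane
    else_side = [x + 100 for x in xs]   # computed for every lane
    return masked_select(mask, then_side, else_side)

xs = list(range(LANES))
vector_result = branchy_vector(xs)
scalar_result = [branchy_scalar(x) for x in xs]
```

Both sides of the branch are executed for all lanes and the mask decides which result each lane keeps. Whoever generates the mask, the cost of divergence is the same: lanes pay for the path they did not take.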

Scali · Apr 15, 2009

trinibwoy said:
On the surface yes. But that's where the similarities end it seems. Nvidia assumes more responsibility for latency hiding than LRB does.

Yes, doesn't every x86 core get 4x SMT, running sequentially?
So that would mean that you get 4 fibres on a core, so you have '3 instructions latency' before the first fibre comes round again.

Aside from that, I suppose the compiler can fudge some PREFETCH instructions into the code, to hide some of the latency of memory access... and whatever the texture hardware may be capable of...?
