Larrabee at GDC 09

Nay -- 675 mm² is too large for 85 pieces.

[attached image: 69259642.jpg]
 
Any clue how strands and fibers are actually represented?
The simplest scenario is in pixel shading, where 4 quads of fragments are grouped to form what Intel calls a qquad.

Depending on the kind of code contained in the pixel shader and on the number of render targets written, each of the hardware threads that undertake pixel shading will take "batches" of qquads.

The program that actually runs on Larrabee will be a loop, "for each qquad: shade".

This is similar to the discussion we've been having recently about making a kernel produce more than one result: logically making one invocation compute multiple work items.

So, a fibre corresponds with a qquad. "Fibre" is purely software-implemented multi-threading. "Strand" is then the number of elements that share a program counter.

In Larrabee it appears that the normal way fibres will be constructed is from 16 strands. There's no reason not to use more (and for double precision the minimum would be only 8), but using more would probably only be a performance tweak.
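
Very roughly, and with all names invented by me rather than Intel, the per-hardware-thread inner loop could be pictured like this in C++ -- one fibre per qquad, 16 strands filling the 16 VPU lanes:

    // Hypothetical sketch only: one fibre = one qquad = 16 strands (fragments),
    // all sharing a program counter and occupying the 16 lanes of the VPU.
    #include <cstddef>

    constexpr int kStrandsPerFibre = 16;   // 8 would be the minimum for double precision

    struct QQuad {
        float r[kStrandsPerFibre];         // one slot per strand/fragment
        float g[kStrandsPerFibre];
        float b[kStrandsPerFibre];
        float a[kStrandsPerFibre];
    };

    // "for each qquad: shade" -- each iteration is one fibre; the inner loop is
    // what the 16-wide vector unit executes in lockstep across the strands.
    void shade_batch(QQuad* batch, std::size_t qquad_count) {
        for (std::size_t i = 0; i < qquad_count; ++i) {
            QQuad& q = batch[i];
            for (int s = 0; s < kStrandsPerFibre; ++s) {
                q.r[s] *= q.a[s];          // placeholder shading math
                q.g[s] *= q.a[s];
                q.b[s] *= q.a[s];
            }
        }
    }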

Can I take a single hardware thread and do anything I want
Yep. If there's 4GB of memory then you have 4GB/64 bytes (64 bytes = 16 scalars packed) ~67M variables that you can store in memory. If you have 32 cores and 4 hardware threads per core, that's roughly 0.5M such variables (~32MB of private storage) per hardware-scheduled context.
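
Spelling the arithmetic out (the core and thread counts are assumed, not confirmed):

    // Back-of-the-envelope storage-per-context arithmetic from above.
    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::uint64_t mem_bytes   = 4ull << 30;  // 4GB of memory
        const std::uint64_t vector_size = 64;          // 16 packed scalars = 64 bytes
        const std::uint64_t contexts    = 32 * 4;      // assumed: 32 cores x 4 hardware threads

        const std::uint64_t variables   = mem_bytes / vector_size;   // ~67M 16-wide variables
        const std::uint64_t per_context = variables / contexts;      // ~524K variables
        std::printf("%llu variables total, %llu per context (~%llu MB each)\n",
                    (unsigned long long)variables,
                    (unsigned long long)per_context,
                    (unsigned long long)((per_context * vector_size) >> 20));  // ~32MB
        return 0;
    }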

How you split up memory (and allocate some for shared storage, e.g. as textures usable by all threads) is up to you, depending on how coarse- or fine-grained you want to make your work-items.

I'm also curious as to how hardware and software switching are going to work together.
As far as I can tell the 4 hardware threads (contexts) run symmetrically, by default. This may be how Intel hides read-after-write latency in the register file.

It appears that individual contexts can be put to sleep - they may request it. Not sure. This happens when a long-latency operation is started and they have no instructions that can be issued in the shadow of that latency. Ideally the contexts make their requests as early as possible in order to fill the latency-shadow with work, but obviously the randomness of memory operations makes that unpredictable.
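
A purely illustrative sketch (the runtime, names and structure are all mine, not Intel's) of how a hardware thread might keep itself busy by issuing one fibre's memory access early and running another fibre in the latency shadow:

    // Illustrative only: software fibre switching to cover a long-latency access.
    #include <xmmintrin.h>   // _mm_prefetch
    #include <cstddef>
    #include <vector>

    struct Fibre {
        const float* data;   // memory this fibre will need soon
        int          stage;  // software "program counter" for the fibre
    };

    void run_hardware_thread(std::vector<Fibre>& fibres) {
        for (std::size_t i = 0; i < fibres.size(); ++i) {
            // Kick off the long-latency access for fibre i as early as possible...
            _mm_prefetch(reinterpret_cast<const char*>(fibres[i].data), _MM_HINT_T0);

            // ...then do useful work on a different fibre while it's in flight.
            Fibre& other = fibres[(i + 1) % fibres.size()];
            other.stage++;   // stand-in for actually shading the other fibre
        }
    }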

Jawed
 
Considering how secretive they have been, I'll bet that those pics are intended to create and spread FUD. And by the looks of it, they are succeeding.
 
He only held up two wafers during the keynote so it's either one or the other. The one on the bench behind him does look like Jasper Forest, so it's pretty much settled in my book.

The picture on Intel's website is of the same wafer as my picture. It's Larrabee. At least Gelsinger said so. The Jasper Forest wafer is the one you can see in the background; its die is much smaller, and it's also the one ComputerBase said was Larrabee.
 
Is Jasper Forest the same as the Polaris proof-of-concept thingie? The wafer in the background sure looks a lot like Polaris.
 
I'm starting to think Larrabee's huge die size doesn't matter at all, especially in the current economy. Intel has plenty of 45nm wafer capacity sitting unused, or at least not at 100% utilisation, so using it to make Larrabee will help smooth the transition to 32nm. They can sell it at a competitive price, still make use of their production capacity, and iron out all the software bugs...
 
Here

"What you saw is the 'extreme' version, let me put it that way," said Otellini, adding that the GPU is in the debug stage now. "I would expect volume introduction of this product to be early next year."

128 LRB cores :)
What say you?

BTW, does anyone have any idea about the die size (and the process, of course) of the original Pentiums on which this is supposedly based?
 
128 LRB cores :)
What say you?
I'd be very surprised if it was >32 cores.
BTW, does anyone have any idea about the die size (and the process, of course) of the original Pentiums on which this is supposedly based?
I don't, but it's of little use anyway: the size of the "real" CPU part of a Larrabee core should be insignificant compared to the vector ALU, so drawing any conclusions about the number of cores from it is pretty meaningless.
 
Intel stated the ring bus runs above the L2 to save space.
Whether the L2 tiles count towards that estimation, I'm not sure.
Intel typically does not include the L2 as part of the core.
 
So, a fibre corresponds with a qquad. "Fibre" is purely software-implemented multi-threading. "Strand" is then the number of elements that share a program counter.

In Larrabee it appears that the normal way fibres will be constructed is from 16 strands. There's no reason not to use more (and for double precision the minimum would be only 8), but using more would probably only be a performance tweak.

So let me get this straight...
A fibre is a piece of SIMD code, where you have 16 scalar (or 8 in the case of DP) strands operating in parallel, all running on a logical 'x86' core?

That sounds quite similar to what nVidia does.
 
The program that actually runs on Larrabee will be a loop, "for each qquad: shade".

This is similar to the discussion we've been having recently about making a kernel produce more than one result: logically making one invocation compute multiple work items.

So, a fibre corresponds with a qquad. "Fibre" is purely software-implemented multi-threading. "Strand" is then the number of elements that share a program counter.

Thanks, that's the impression I got from the slides. I'm still wary of LRB's hard-wired latency hiding capacity within a fiber given the variability of cache and global memory latencies.

How you split up memory (and allocate some for shared storage, e.g. as textures usable by all threads) is up to you, depending on how coarse- or fine-grained you want to make your work-items.
That should make for some fun/nerve-wracking performance tuning :)

So let me get this straight...
A fibre is a piece of SIMD code, where you have 16 scalar (or 8 in the case of DP) strands operating in parallel, all running on a logical 'x86' core?

That sounds quite similar to what nVidia does.

On the surface, yes. But that's where the similarities end, it seems. Nvidia assumes more responsibility for latency hiding than LRB does. I'm also not clear on whether the developer is responsible for predication within a strand group.
 
How many copies of SSE could be fitted within 700mm²?

Then apply these scalings:
  • 1/2 for LRBni
  • 1/4 due to 16-wide in Larrabee
  • 1/3 due to the stated proportion of VPU to Larrabee core
:p
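
Or, as a one-liner with the area of a single SSE unit left as a placeholder for you to fill in:

    // Back-of-the-envelope core-count guess from the scalings above;
    // sse_unit_mm2 is a placeholder, plug in whatever area you believe.
    double larrabee_core_guess(double die_mm2, double sse_unit_mm2) {
        double sse_copies = die_mm2 / sse_unit_mm2;  // copies of SSE in the die area
        return sse_copies * 0.5                      // 1/2 for LRBni
                          * 0.25                     // 1/4 due to 16-wide
                          * (1.0 / 3.0);             // 1/3 VPU-to-core proportion
    }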

Jawed
 
What exactly do you mean by that?

How are the vector masks generated? Is it always handled by the hardware based on branches encountered in the code? I guess I'm missing the part where my scalar program would get strung across the vector unit for 16 data items at a time. It's pretty clear what happens in CUDA with the warp/block construct. But does LRB also have a similar setup where it automatically breaks down your data set into strand groups and maps them across the VPU? I thought those were just concepts and it was up to the developer to structure the data and code appropriately?
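
To make concrete what I mean by "strung across the vector unit": I'm picturing a per-strand branch turning into something like this (plain C++ sketch, not actual LRBni):

    // 16 strands share a program counter, so both sides of a branch execute
    // and a per-lane mask picks which result each strand keeps.
    constexpr int kLanes = 16;

    void shade_with_branch(const float x[kLanes], float out[kLanes]) {
        bool mask[kLanes];
        for (int i = 0; i < kLanes; ++i)
            mask[i] = (x[i] > 0.0f);               // the "branch" becomes a mask

        for (int i = 0; i < kLanes; ++i) {
            float taken     = x[i] * 2.0f;         // then-side, computed for every lane
            float not_taken = -x[i];               // else-side, also computed for every lane
            out[i] = mask[i] ? taken : not_taken;  // per-lane predication
        }
    }

What I don't know is who builds that mask -- the hardware, the compiler, or me.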
 
On the surface, yes. But that's where the similarities end, it seems. Nvidia assumes more responsibility for latency hiding than LRB does.

Yes, doesn't every x86 core get 4x SMT, running sequentially?
So that would mean you get 4 fibres on a core, and you have '3 instructions of latency' before the first fibre comes round again.

Aside from that, I suppose the compiler can fudge some PREFETCH instructions into the code, to hide some of the latency of memory access... and whatever the texture hardware may be capable of...?
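
Something along these lines, say (the prefetch distance of 8 is an arbitrary placeholder):

    // Prefetch the data for a later iteration so it's (hopefully) in cache
    // by the time the loop gets there.
    #include <xmmintrin.h>   // _mm_prefetch
    #include <cstddef>

    void scale(float* data, std::size_t n, float k) {
        const std::size_t kPrefetchAhead = 8;      // arbitrary prefetch distance
        for (std::size_t i = 0; i < n; ++i) {
            if (i + kPrefetchAhead < n)
                _mm_prefetch(reinterpret_cast<const char*>(&data[i + kPrefetchAhead]),
                             _MM_HINT_T0);
            data[i] *= k;                          // the actual work
        }
    }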
 