Larrabee at GDC 09

How are the vector masks generated? Is it always handled by the hardware based on branches encountered in the code? I guess I'm missing the part where my scalar program would get strung across the vector unit for 16 data items at a time. It's pretty clear what happens in CUDA with the warp/block construct. But does LRB also have a similar setup where it automatically breaks down your data set into strand groups and maps them across the VPU? I thought those were just concepts and it was up to the developer to structure the data and code appropriately?

My brief look at the instruction set gave me the impression that it will probably be done 'manually'.
As in, you can generate a mask, and then repack your strands based on the mask, but you'd have to specifically inject instructions to do that. But perhaps I missed something very significant there :)
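Roughly, I mean something like this - a scalar C++ emulation of the idea, not actual LRB code, and the names are just made up:

Code:
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWidth = 16;        // Larrabee VPU width for single precision
using Vec  = std::array<float, kWidth>;   // one vector register's worth of strands
using Mask = std::uint16_t;               // one bit per strand

// Build a mask from a per-strand comparison (what a vector compare would produce).
Mask compare_gt(const Vec& a, const Vec& b) {
    Mask m = 0;
    for (std::size_t i = 0; i < kWidth; ++i)
        if (a[i] > b[i]) m |= Mask(1u << i);
    return m;
}

// Repack: gather the strands whose mask bit is set into the low lanes of dst
// and return how many are live - the part you'd have to code up explicitly.
std::size_t compact(const Vec& src, Mask m, Vec& dst) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < kWidth; ++i)
        if (m & (1u << i)) dst[n++] = src[i];
    return n;
}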
 
I still haven't seen the threading model fully characterized.

The phrase "4-way SMT running sequentially" is a contradiction.
If it's sequential, we have some kind of fine-grained or hybrid round-robin scheduling.

Given the narrowness of Larrabee, it looks like the core can only physically issue from two threads. If the order is not dynamically determined, the core is not using SMT.
 
So let me get this straight...
A fibre is a piece of SIMD code, where you have 16 scalar (or 8 in the case of DP) strands operating in parallel, all running on a logical 'x86' core?
"Fibre" is just nomenclature for multi-threading within the hardware threads that Larrabee offers, a soft context if you like, like an operating system thread.

You can have any number of fibres in a hardware thread. The code running in that thread is repeated for each fibre. The fun starts with control flow and un-packing/packing related to irregular data structures.

For something like pixel shading the same instructions are repeated on the same-sized blobs of fragments. In Larrabee's case, 16.
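In rough pseudo-C++ (purely illustrative - the Fibre type and the shade() stand-in are mine, not anything from Larrabee), the structure is something like:

Code:
#include <cstdint>
#include <vector>

// Stand-in for the per-strand 'scalar' work - purely illustrative.
float shade(float x) { return x * 0.5f; }

// One fibre = one 16-strand slice of work resident in a hardware thread.
struct Fibre {
    float input[16];
    float output[16];
    std::uint16_t mask;   // which strands are still live after control flow
};

// The hardware thread just repeats the same 16-wide body for each fibre it owns.
void run_thread(std::vector<Fibre>& fibres) {
    for (Fibre& f : fibres) {                      // any number of fibres per thread
        for (int lane = 0; lane < 16; ++lane) {
            if (f.mask & (1u << lane))             // masked-out strands do nothing
                f.output[lane] = shade(f.input[lane]);
        }
    }
}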

That sounds quite similar to what nVidia does.
Yes, it's very similar to what the GPUs do. In essence NVidia has a minimum fibre size of 32 and ATI 64.

The issue is that Larrabee has very little in the way of automation for latency hiding, as it only has 4 hardware threads. So most latency-hiding has to be performed within each thread by grouping fibres together.

So pixel shading code, for example, depends on an asymmetric mix of code running on a core to successfully hide latency, e.g. setup and blending/z-testing/resolve running on one hardware thread. That work is very low latency because all its data lies in L2. The size of the tile of pixels in L2 (i.e. the portion of the render target) also affects the ratio of latency-hiding to math.

So, on their own, cores aren't up to the task of hiding typical worst-case texturing latency. But with task asymmetry and the effectively-zero latency of the RBE, the total latency experienced by a core in pixel shading is manageable.

Jawed
 
How are the vector masks generated? Is it always handled by the hardware based on branches encountered in the code? I guess I'm missing the part where my scalar program would get strung across the vector unit for 16 data items at a time. It's pretty clear what happens in CUDA with the warp/block construct. But does LRB also have a similar setup where it automatically breaks down your data set into strand groups and maps them across the VPU? I thought those were just concepts and it was up to the developer to structure the data and code appropriately?
I am not sure what you are comparing. I don't know how close the CUDA programming model is to the real hardware (I guess very close, but it's probably on a slowly diverging path..), so there's a lot of it that is automatically handled by the hardware right now, but that doesn't mean it's going to stay that way in the future.
On LRB your scalar program has to be 'vectorized' and/or fibered, which is something a compiler can do for you (for instance a shader compiler), unless you want to do it yourself, as you're obviously free to do so.
 
Yes, doesn't every x86 core get 4x SMT, running sequentially?
So that would mean that you get 4 fibres on a core, so you have '3 instructions latency' before the first fibre comes round again.

I got the impression that the 4xSMT was at a very high level. But each thread could be managing multiple fibers and multiple strand groups per fiber.

So, to Jawed's point about the fiber just being a big for loop over all the strand groups, it looks like the "manual latency hiding" within a fiber could be accomplished by breaking up the loop.

Code:
for (auto& group : strand_groups) {
    // issue the memory requests (e.g. texture fetches) for this strand group
}

for (auto& group : strand_groups) {
    // independent ALU ops that hide the outstanding memory latency
}

for (auto& group : strand_groups) {
    // dependent ALU ops that consume the fetched data
}
Or maybe latency hiding doesn't happen at that level and it's all done by switching between entire fibers. But I imagine it'd be a similar approach. I have no clue!
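For what it's worth, the 'switch between entire fibers' version I'm imagining would look something like this (same illustrative pseudo-C++; the helper names are made up):

Code:
#include <cstddef>
#include <vector>

struct Fibre { /* 16 strands' worth of registers, addresses, etc. */ };

void issue_fetches(Fibre&) { /* kick off this fibre's texture/memory loads */ }
void do_alu_work(Fibre&)   { /* math that consumes the loaded data */ }

// Software round-robin across whole fibres: get the next fibre's loads in
// flight before doing this fibre's math, so each fibre's memory latency is
// covered by the ALU work of its neighbours.
void run_thread(std::vector<Fibre>& fibres) {
    if (fibres.empty()) return;
    issue_fetches(fibres[0]);                    // prime the first fibre
    for (std::size_t i = 0; i < fibres.size(); ++i) {
        if (i + 1 < fibres.size())
            issue_fetches(fibres[i + 1]);        // next fibre's loads go in flight...
        do_alu_work(fibres[i]);                  // ...while this one does its math
    }
}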
 
The phrase "4-way SMT running sequentially" is a contradiction.
If it's sequential, we have some kind of fine-grained or hybrid round-robin scheduling.

Depends on how you look at it.
If you take the 'simultaneous' in SMT in the most literal sense, then no... the instructions of multiple threads aren't issued at the same time.
However, there ARE multiple threads running 'at the same time' on a single core (multiple program counters etc., no timeslices, but threading at the instruction level). And the instructions of multiple threads are actually in execution at the same time; they just aren't started at the exact same moment.

I believe Sun's SMT works in the same round-robin way... as does Atom?
 
I am not sure what you are comparing. I don't know how close the CUDA programming model is to the real hardware (I guess very close, but it's probably on a slowly diverging path..), so there's a lot of it that is automatically handled by the hardware right now, but that doesn't mean it's going to stay that way in the future.

True, but there are some things set in stone. For example, your code can access built-in constructs like threadIdx, blockIdx, etc., so those things are presumably going to stay around. And you don't really need to know the SIMD size to write well-performing code. I was just wondering whether LRB provided this sort of abstraction as well, where instances of my scalar program would automatically get vectorized across the VPU.

On LRB your scalar program has to be 'vectorized' and/or fibered, which is something a compiler can do for you (for instance a shader compiler), unless you want to do it yourself, as you're obviously free to do so.

So the answer is yes. Thanks.
 
"Fibre" is just nomenclature for multi-threading within the hardware threads that Larrabee offers, a soft context if you like, like an operating system thread.

Makes sense, in a way.
In Windows terms, a 'fiber' is a thread that is not scheduled by the OS. Larrabee doesn't really have an 'OS' as such... so it leaves its thread scheduling to drivers/applications I suppose.
 
As in, you can generate a mask, and then repack your strands based on the mask, but you'd have to specifically inject instructions to do that. But perhaps I missed something very significant there :)
That concept is something that's been called dynamic warp formation.

I've discovered that split-and-segment is the generic name for this, section 2.3.3:

http://www.idav.ucdavis.edu/func/return_pdf?pub_id=915

which can then be used as the basis to construct new fibres. But it's fairly intensive, so you really have to want to do this, it seems. The memory architecture of Larrabee (big caches with trivial across-chip/core sharing) may make the performance of this very practical.

So, dynamic warp (fibre) formation, done purely in the hardware scheduler, could be a big win. But it seems unlikely this will appear any time soon in Larrabee.

Larrabee's minimum fibre size of 16 is theoretically an advantage, but of course there's still the fundamental problem of how much latency needs to be hidden, which requires more fibres. The work scheduler that generates fibres for processing by threads may not have complete freedom to generate replacement fibres for those that have completed, if fibres still executing impose an ordering constraint - e.g. pixel-rendering order.
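A rough pseudo-C++ sketch of how I read split-and-segment (my own illustration, not code from the paper): strands from a pool are first split by branch direction, then repacked into new 16-wide fibres so every lane stays busy.

Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Strand { std::uint32_t id; bool takes_branch; };

// Split a pool of strands by branch direction, then segment each side into
// new 16-wide fibres so every lane is doing useful work.
std::vector<std::vector<Strand>> split_and_segment(const std::vector<Strand>& pool) {
    std::vector<Strand> taken, not_taken;
    for (const Strand& s : pool)
        (s.takes_branch ? taken : not_taken).push_back(s);        // split

    std::vector<std::vector<Strand>> fibres;
    for (const std::vector<Strand>* side : { &taken, &not_taken })
        for (std::size_t i = 0; i < side->size(); i += 16)         // segment into 16s
            fibres.emplace_back(side->begin() + i,
                                side->begin() + std::min(i + 16, side->size()));
    return fibres;
}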

Jawed
 
Depends on how you look at it.
If you take the 'simultaneous' in SMT in the most literal sense, then no... the instructions of multiple threads aren't issued at the same time.
That leaves the realm of SMT and goes to fine-grained multithreading or barrel processing.


I believe Sun's SMT works in the same round-robin way... as does Atom?
Sun's multithreading isn't SMT, it is a hybrid round-robin fine-grained design.
Atom, from what the web articles on it have stated, is SMT.

I'm still looking for an Intel presentation on Larrabee that uses the term SMT, as opposed to just pointing out that there are multiple hardware threads.
 
The phrase "4-way SMT running sequentially" is a contradiction.
If it's sequential, we have some kind of fine-grained or hybrid round-robin scheduling.
I suspect it's something like round-robin to hide cache-line and register latencies.

Jawed
 
Yeah, your pseudo-code is pretty much how I envisage this working.

Or maybe latency hiding doesn't happen at that level and it's all done by switching between entire fibers. But I imagine it'd be a similar approach. I have no clue!
OK, maybe it's just better to link:

http://en.wikipedia.org/wiki/Simultaneous_multithreading


You can do pretty much anything within a hardware thread on Larrabee, including having sets of fibres each progressing separately.

For D3D/OpenGL programmers this complexity should not be visible.

Jawed
 
You can do pretty much anything within a hardware thread on Larrabee, including having sets of fibres each progressing separately.

Yeah I wasn't really questioning the hardware's capabilities or imposing any artificial restrictions. It was more just thinking out loud about how developers might actually approach the problem.
 
Sun's multithreading isn't SMT, it is a hybrid round-robin fine-grained design.

I'm not sure what Sun calls it... people generally refer to it as SMT, although that perhaps may not be the proper term. I suppose barrel processing is the proper term, but I guess for simplicity's sake, many websites just use the term SMT.

I'm still looking for an Intel presentation on Larrabee that uses the term SMT, as opposed to just pointing out that there are multiple hardware threads.

Intel never uses the term SMT as far as I know. They generally just use the term HyperThreading, which can mean various things.
 
I'm not sure what Sun calls it... people generally refer to it as SMT, although that perhaps may not be the proper term. I suppose barrel processing is the proper term, but I guess for simplicity's sake, many websites just use the term SMT.
Many sites gobble up the idea that every lane in a GPU is a core...

Intel never uses the term SMT as far as I know. They generally just use the term HyperThreading, which can mean various things.
Intel's architects have used the term simultaneous multithreading when discussing Atom.

edit:
A quick Google search found this:
http://pcworld.about.com/od/businesscenter/Chasing-New-Markets-With-Intel.htm
 
True, but there are some things set in stone. For example, your code can access built-in constructs like threadIdx, blockIdx, etc., so those things are presumably going to stay around. And you don't really need to know the SIMD size to write well-performing code. I was just wondering whether LRB provided this sort of abstraction as well, where instances of my scalar program would automatically get vectorized across the VPU.

if you have a "scalar" program, don't count on it getting vectorized by any hardware be it Nvidia, ATI, or Intel. threadId/blockId are effectively just array index variables and can be done via either software or hardware within the context of the CUDA API.
 
I'm not sure what Sun calls it... people generally refer to it as SMT, although that perhaps may not be the proper term. I suppose barrel processing is the proper term, but I guess for simplicity's sake, many websites just use the term SMT.

There are various disagreements in academia and industry about the strict meaning of SMT. Some propose that it should be restricted to simultaneous-issue architectures, while others believe that any design that allows more than one thread to be in some stage of execution qualifies for the term SMT.

This is basically to differentiate them from SoEMT and some forms of barrel threading, which are generally fairly restricted in both their execution slotting and their issue (in the strictest sense, a barrel-threaded design MUST slot an issue from an active thread even if that thread has no work to do, effectively injecting a nop slot in hardware, while an execution-only SMT will only issue from threads that have instructions to execute).

So in general the term SMT has devolved from the original UW/DEC work, where effective co-issue was a requirement, to any architecture that is capable of dynamically selecting between multiple threads for issue based on instruction availability and is also capable of having more than one thread in some state of execution.
 
Many sites gobble up the idea that every lane in a GPU is a core...

Pfft, everyone knows those are threads :)

Intel's architects have used the term simultaneous multithreading when discussing Atom.

Yeah, I tried to look up some info on Atom's HT, and according to Ars Technica, it can issue up to 2 instructions per cycle, either both coming from 1 thread, or coming from 2 different threads.
So it can dispatch instructions from 2 threads in the same cycle, making it as simultaneous as it gets :)
 