Any clue how strands and fibers are actually represented?
The simplest scenario is in pixel shading, where 4 quads of fragments are grouped to form what Intel calls a qquad.
Depending on the kind of code contained in the pixel shader and on the number of render targets written, each of the hardware threads that undertake pixel shading will take "batches" of qquads.
The program that actually runs on Larrabee will be a loop, "for each qquad: shade".
This is similar to the discussion we've been having recently about making a kernel produce more than one result: logically making one invocation compute multiple work items.
So, a fibre corresponds to a qquad. "Fibre" is purely software-implemented multi-threading. A "strand" is then one of the elements that share a program counter.
In Larrabee it appears that the normal way fibres will be constructed is from 16 strands. There's no reason not to use more (and for double-precision the minimum would be only 8) but this would probably be a tweak for performance.
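To make the "for each qquad: shade" loop concrete, here's a minimal sketch of what one hardware thread would be doing with its batch. All the names (shade, run_fibres, make_batch) and the trivial "shader" are made up for illustration; this isn't Intel's API, just the shape of the loop, with each qquad treated as 16 strands that go through the same operation together.

```python
# Hypothetical sketch: names and the dummy shader are illustrative, not Intel's API.
STRANDS_PER_QQUAD = 16  # 4 quads x 4 fragments, one fragment per SIMD lane


def shade(qquad):
    """Stand-in pixel shader: the same operation applied to all 16 strands."""
    return [0.5 * fragment for fragment in qquad]


def run_fibres(batch):
    """What one hardware thread runs: 'for each qquad: shade'."""
    return [shade(qquad) for qquad in batch]


def make_batch(num_qquads):
    """Build a batch of qquads filled with dummy fragment values."""
    return [
        [float(q * STRANDS_PER_QQUAD + lane) for lane in range(STRANDS_PER_QQUAD)]
        for q in range(num_qquads)
    ]


shaded = run_fibres(make_batch(3))
print(len(shaded), len(shaded[0]))
```

On real hardware the inner list would be a single 16-wide vector register operation rather than a Python loop; the outer loop over qquads is the software-implemented part.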
Can I take a single hardware thread and do anything I want?
Yep. If there's 4GB of memory then you have 4GB/64 bytes (64 bytes = 16 scalars packed) ~67M variables that you can store in memory. If you have 32 cores and 4 hardware threads per core, that's 128 contexts, so roughly 0.5M of those variables (32MB) of private storage per hardware-scheduled context.
How you split up memory (and allocate some for shared storage, e.g. as textures usable by all threads) is up to you, depending on how coarse- or fine-grained you want to make your work-items.
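The arithmetic behind those numbers, as a quick sketch. It assumes 4GB means 4 GiB, that a "variable" is one 16-wide vector of 32-bit scalars, and that all of memory is divided evenly with nothing set aside for shared storage.

```python
# Back-of-envelope figures only; the even split is an assumption.
MEMORY_BYTES = 4 * 2**30   # 4GB, taken as 4 GiB
VECTOR_BYTES = 16 * 4      # 16 packed 32-bit scalars = 64 bytes
CORES = 32
THREADS_PER_CORE = 4

total_variables = MEMORY_BYTES // VECTOR_BYTES
hardware_threads = CORES * THREADS_PER_CORE
variables_per_thread = total_variables // hardware_threads
bytes_per_thread = variables_per_thread * VECTOR_BYTES

print(total_variables, hardware_threads, variables_per_thread, bytes_per_thread // 2**20)
```

So that's about 67M vector variables in total, or about half a million (32 MiB worth) per hardware thread before you carve anything out for textures and other shared data.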
I'm also curious as to how hardware and software switching are going to work together.
As far as I can tell the 4 hardware threads (contexts) run symmetrically, by default. This may be how Intel hides read-after-write latency in the register file.
It appears that individual contexts can be put to sleep, possibly at their own request; I'm not sure. This happens when a long-latency operation is started and they have no instructions that can be issued in the shadow of that latency. Ideally the contexts make their requests as early as possible in order to fill the latency-shadow with work, but obviously the randomness of memory operations makes that unpredictable.
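As a toy model of the idea (not a description of Larrabee's real scheduler, which I'm only guessing at), here's a round-robin issue loop over hardware contexts where a context goes to sleep for a fixed number of cycles after issuing a "load". The simulate function, the op names and the fixed latency are all assumptions made for illustration.

```python
def simulate(programs, latency=4):
    """Round-robin issue over contexts; a context sleeps after issuing a 'load'.

    programs: one list of ops per context, each op being 'alu' or 'load'.
    Returns the issue trace: the context index per cycle, or None when idle.
    """
    n = len(programs)
    pc = [0] * n      # index of the next op for each context
    wake = [0] * n    # first cycle at which each context may issue again
    trace, cycle, last = [], 0, -1
    while any(pc[c] < len(programs[c]) for c in range(n)):
        issued = None
        # Start from the context after the last one that issued.
        for step in range(1, n + 1):
            c = (last + step) % n
            if pc[c] < len(programs[c]) and wake[c] <= cycle:
                issued = c
                break
        if issued is not None:
            if programs[issued][pc[issued]] == "load":
                wake[issued] = cycle + 1 + latency  # asleep for the latency shadow
            pc[issued] += 1
            last = issued
        trace.append(issued)
        cycle += 1
    return trace


one_context = simulate([["load", "alu"]])
four_contexts = simulate([["load", "alu"]] * 4)
print(one_context.count(None), four_contexts.count(None))
```

With a single context the whole latency shadow is idle cycles; with four contexts the others' instructions fill most of it. Real memory latencies vary, which is exactly why the filling is unpredictable in practice.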
Jawed