Looks like Larrabee would need at least 24 times the number of threads per core to be able to hide texture cache misses?
If a Larrabee core has one thread (let's call it fragment thread C) stalled due to a pending texture request (a bilinearly filtered texture result), then while C is stalled, fragment threads A, B and D can run their bilinear texture filter program. Since the bilinear program takes a whole pile of instructions to produce a single texture result, Larrabee will happily fill in C's stall time with work for A, B and D.
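As a back-of-the-envelope sketch (all the figures here are my own guesses, not Larrabee specs): with 4 hardware threads per core, one stalled thread leaves three runnable, and a software bilinear filter is long enough that those three can plausibly soak up a cache-miss round trip.

```c
#include <stdio.h>

/* Rough latency-hiding sketch for one Larrabee-style core.
 * Every number below is an illustrative assumption, not a published spec. */
int main(void)
{
    const int hw_threads_per_core   = 4;    /* threads A, B, C, D                    */
    const int stalled_threads       = 1;    /* C is waiting on a texture miss        */
    const int bilinear_instructions = 80;   /* guess: instructions per filtered result in SW */
    const int miss_latency_cycles   = 200;  /* guess: cache-miss round trip          */

    /* Work the runnable threads can issue while C waits, assuming roughly
     * one instruction issued per cycle from some ready thread. */
    int runnable = hw_threads_per_core - stalled_threads;
    int covered  = runnable * bilinear_instructions;

    printf("cycles of work available: %d, miss latency: %d -> %s\n",
           covered, miss_latency_cycles,
           covered >= miss_latency_cycles ? "hidden" : "exposed");
    return 0;
}
```

The point isn't the exact numbers, just that the bilinear program is a long-running consumer of issue slots, so a single stalled thread doesn't leave the core idle.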
Not all "programs" in a graphics system have the same "effective latency". Since there's a wide variety of latencies available in the mix of these programs, Larrabee should be able to find a set of threads, somewhere, that can hide the latency of a single "ALU instruction" - which is the worst-case scenario for a contemporary GPU. So you have a rasteriser program (1 thread, somewhere?), a vertex fetcher program (4 threads?), a Z-cull/Z-test program (8 threads?), a blend program (4 threads?), an MSAA program (8 threads?), a shader-ALU program (64 threads?), a texturing program (16 threads?) etc. and spread them round Larrabee as loading demands, presumably moving threads amongst cores to suit the workload of all the cores, jointly.
Obviously that's quite an entertaining load-balancing algorithm - sorta what current GPUs do, but magnified, because now a context can float to any core, whereas in the load-balanced part of a GPU (the ALU pipes) a context never goes anywhere else (though the data that makes up the context might find itself swapped off-die to VRAM).
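For flavour, here's a toy version of that kind of balancer - entirely my own sketch, nothing to do with how Intel would actually do it: each "program" contributes some number of thread contexts with a made-up per-context cost, and contexts are handed greedily to the least-loaded core. Because nothing pins a context to a particular core, the same loop could just as well be re-run to migrate contexts as the workload shifts.

```c
#include <stdio.h>

#define NUM_CORES 16   /* arbitrary core count for illustration */

/* Toy greedy balancer: NOT Larrabee's scheduler, just an illustration of
 * floating heterogeneous "program" contexts across identical cores. */
struct program { const char *name; int threads; int cost; };

int main(void)
{
    /* Thread counts echo the guesses above; per-context costs are invented. */
    struct program progs[] = {
        {"rasteriser",   1, 10},
        {"vertex fetch", 4,  3},
        {"z-cull/test",  8,  2},
        {"blend",        4,  3},
        {"msaa",         8,  2},
        {"shader ALU",  64,  6},
        {"texturing",   16,  8},
    };
    int load[NUM_CORES] = {0};

    for (size_t p = 0; p < sizeof progs / sizeof progs[0]; p++) {
        for (int t = 0; t < progs[p].threads; t++) {
            /* Place this context on the currently least-loaded core. */
            int best = 0;
            for (int c = 1; c < NUM_CORES; c++)
                if (load[c] < load[best]) best = c;
            load[best] += progs[p].cost;
        }
    }
    for (int c = 0; c < NUM_CORES; c++)
        printf("core %2d load: %d\n", c, load[c]);
    return 0;
}
```

The fact that most of these programs have predictable execution times (next paragraph) is what makes even a crude greedy placement like this workable.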
Just like in a current GPU from ATI or NVidia, most of these "programs" have deterministic execution time (e.g. the rasteriser is a fixed pipeline, the ALU pipeline knows how many ALU instructions it can execute before having to swap the thread out). That makes the load-balancing algorithm's job a lot easier - even if there is some give and take due to the irregularities of off-chip bandwidth/latency. Bear in mind current GPUs aren't perfect in this regard.
On average, as the typical shader's effective ALU:TEX ratio increases, it becomes easier and easier to hide texturing latency (though you can argue what the real-world effective ratio will be in 2009 for a vec4 pixel ALU - 4:1, 10:1, etc. R600 is happy with about a 9:1 ratio as far as I can tell, on complex shaders, but that's not necessarily a reflection of the average for all shaders in the application).
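To put rough numbers on that (again, purely illustrative guesses): if a texture fetch has L cycles of latency and the shader supplies R cycles of ALU work per fetch, you need on the order of ceil(L / R) independent batches in flight to keep the ALUs busy, and that count drops quickly as R grows.

```c
#include <stdio.h>

/* Batches in flight needed to hide a texture fetch of 'latency' cycles when
 * each batch supplies 'alu_per_tex' cycles of ALU work per fetch.
 * Illustrative only; real figures depend on the chip and the shader. */
static int batches_needed(int latency, int alu_per_tex)
{
    return (latency + alu_per_tex - 1) / alu_per_tex;  /* ceiling division */
}

int main(void)
{
    const int latency = 200;          /* guess: texture-miss round trip in cycles */
    const int ratios[] = {1, 4, 9, 10, 20};

    for (int i = 0; i < 5; i++)
        printf("ALU:TEX = %2d:1 -> ~%d batches in flight to hide %d cycles\n",
               ratios[i], batches_needed(latency, ratios[i]), latency);
    return 0;
}
```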
When Larrabee is not set up to "simulate" the D3D graphics pipeline but is running more general-purpose code, it should still be able to apply the same kind of load-balancing - developers should still be expressing their application in terms of scalar, MIMD, SIMD, latency-bound, compute-bound, whatever programs that all share n cores...
Jawed