Larrabee: Samples in Late 08, Products in 2H09/1H10

I wasn't thinking of any particular time limit on code being self-modifying, other than its persistence in the instruction cache.

I was curious if the dynamic configuration idea meant Larrabee's threads would alter code paths as they monitored performance events, or if new code would be built in different memory locations and then Larrabee would direct execution to them.

The whole built-in compiler thing isn't particularly new for graphics, so I thought this would be different.
It's going to be an interesting balancing exercise between having interchangeable graphics pipeline stages and static 'uber' stages that can modify their behavior by changing some bits here and there.
 
I wasn't thinking of any particular time limit on code being self-modifying, other than its persistence in the instruction cache.
Ever since x86 cores got separate L1 caches for data and instructions, they invalidate any instruction cache line that is written to. They also flush the decode pipelines. This comes at a significant penalty, so self-modifying code is no longer an alternative to short branches (like it used to be). I don't think Larrabee would be very forgiving about self-modifying code either, even though it uses in-order execution.
I was curious if the dynamic configuration idea meant Larrabee's threads would alter code paths as they monitored performance events, or if new code would be built in different memory locations and then Larrabee would direct execution to them.
Could you elaborate on that idea?
 
Could you elaborate on that idea?
I was thinking out loud (or in print, as it were) that a built-in compiler could utilize hardware event monitoring that is becoming more pervasive with multi-core devices.

Hardware conditions could give hints to the compiler about optimal cache blocking or load balancing.
The most immediate method would involve self-modifying code, which would have a large performance penalty.

The other option would be to have the compiler use hardware feedback to compile different variants of the program into different pages in memory, then use hardware monitoring to direct each core to the code page with the more optimal formulation.
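
Something like that second option can be sketched in plain C++ today (a toy illustration only: the two variants are hand-written stand-ins for compiler output, and std::chrono timing stands in for real hardware event counters). Compile a few variants of the hot kernel, sample each one, and repoint a function pointer at the winner; a monitor thread could redo the sampling whenever the workload shifts.

Code:
#include <chrono>
#include <cstddef>

// Toy sketch of "build code variants in different locations, then steer
// execution to the best one". A real system would emit the variants with a
// built-in compiler and read hardware event counters instead of wall-clock time.
using Kernel = void (*)(float*, const float*, std::size_t);

static void variant_scalar(float* d, const float* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) d[i] = s[i] * 2.0f;
}

static void variant_unrolled(float* d, const float* s, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        d[i] = s[i] * 2.0f;         d[i + 1] = s[i + 1] * 2.0f;
        d[i + 2] = s[i + 2] * 2.0f; d[i + 3] = s[i + 3] * 2.0f;
    }
    for (; i < n; ++i) d[i] = s[i] * 2.0f;
}

static Kernel pick_best_variant(float* d, const float* s, std::size_t n) {
    Kernel variants[] = { variant_scalar, variant_unrolled };
    Kernel best = variants[0];
    auto best_time = std::chrono::steady_clock::duration::max();
    for (Kernel k : variants) {
        auto t0 = std::chrono::steady_clock::now();
        k(d, s, n);                              // sample run ("hardware feedback")
        auto dt = std::chrono::steady_clock::now() - t0;
        if (dt < best_time) { best_time = dt; best = k; }
    }
    return best;                                 // later calls go through this pointer
}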
 
CPUs rely on prefetching to avoid stalls.
Modern GPUs already prefetch data when possible, as shader compilers typically schedule fetch instructions early in the program while trying to keep the number of live registers needed to a minimum.
 
What is this assertion based on?
Every GPU to date?

Four thread contexts are never going to be enough for texturing. Prefetching gets you nowhere if you can't schedule the prefetch far enough ahead of time (which you can't), and running faster in between the stalls gets you nowhere if the stalls occur too close together (which they do). They will either need to provide some secondary form of hardware multithreading (for instance, if three threads are stalled, push the dirty registers for the latest one into L1 and get a fresh thread) or have a texturing engine with many more threads and some computational capability (then the shader compiler could split the shader into a texturing phase and a computation phase, a little like it was in DX8).
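
To make that split-phase idea a bit more concrete, here is a toy software sketch (my own illustration, not anything proposed in the thread; the batch size, point-sampled single-channel texture and prefetch call are all assumptions): run an address-generation/prefetch pass over a batch of fragments first, then come back and do the arithmetic once the texels have had time to arrive.

Code:
#include <xmmintrin.h>   // _mm_prefetch
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct Fragment { float u, v, shade; };

// Toy split-phase shading: a texturing phase that generates addresses and
// prefetches them for a whole batch, then a computation phase that does the
// math. Batch size and texture layout are illustrative assumptions.
void shade_batch(Fragment* frags, std::size_t n,
                 const std::uint8_t* tex, int w, int h) {
    constexpr std::size_t BATCH = 64;
    std::size_t addr[BATCH];
    for (std::size_t base = 0; base < n; base += BATCH) {
        const std::size_t count = std::min(BATCH, n - base);
        // Texturing phase: address generation + prefetch for every fragment.
        for (std::size_t i = 0; i < count; ++i) {
            const Fragment& f = frags[base + i];
            const int x = std::min(std::max(int(f.u), 0), w - 1);
            const int y = std::min(std::max(int(f.v), 0), h - 1);
            addr[i] = std::size_t(y) * std::size_t(w) + std::size_t(x);
            _mm_prefetch(reinterpret_cast<const char*>(tex + addr[i]), _MM_HINT_T0);
        }
        // Computation phase: by now the texels are (hopefully) already in cache.
        for (std::size_t i = 0; i < count; ++i)
            frags[base + i].shade = tex[addr[i]] / 255.0f;
    }
}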
 
Not to quote myself but,

I'm guessing nAo is referring to the fact that current GPU hardware already caches texture accesses about as optimally as you could expect Larrabee to, and that even with a good read-only texture cache, 128 threads is simply not enough to hide the latency of cache misses for texture gathers, given the number of vector elements Larrabee is going to have on each core.

Larrabee: 32KB split L1 -> 256KB L2 per core (16-way SIMD per core, 4 threads)

8800 Ultra: 8KB L1 texture -> 128KB? L2 per multiprocessor element (8-way SIMD per multiprocessor, 96 8-way SIMD threads)

(The 128KB figure is speculation gathered from B3D.)

Looks like Larrabee would need at least 24 times the number of threads per core to be able to hide texture cache misses?
 
Stop thinking about DRAM utilization for a moment: do you want your cores to be able to do some work, or to stall half of the time?

My point about DRAM utilization is that once you can tolerate enough latency to make your problem bandwidth bound, at that point hiding additional latency won't make the problem compute bound. So, you just need enough latency tolerance to cover memory latency (which, as we discussed above, will increase as the memory becomes more busy). If you're not bandwidth bound, then you need to tolerate enough latency to make your program compute bound.

Of course, you want to build a balanced system. My argument is that Larrabee's 128 threads isn't that far off of being balanced (especially if you consider software prefetching instructions, non-blocking loads, and scatter/gather).
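
As a back-of-the-envelope check on that balance (all numbers below are purely illustrative assumptions, not Larrabee or G80 figures): Little's Law says the concurrency you need is just bandwidth times latency, and it doesn't care whether that concurrency comes from more threads or from more outstanding loads per thread.

Code:
#include <cstdio>

// Little's Law sketch: outstanding requests = bandwidth * latency / request size.
// All figures below are illustrative assumptions, not specs from this thread.
int main() {
    const double bandwidth_bytes_per_s = 100e9;   // assume ~100 GB/s of DRAM bandwidth
    const double latency_s             = 500e-9;  // assume ~500 ns average miss latency
    const double line_bytes            = 64.0;    // one cache line per request
    const double lines_in_flight = bandwidth_bytes_per_s * latency_s / line_bytes;
    // ~780 cache lines must be in flight chip-wide to saturate that bandwidth;
    // whether they come from more hardware threads or from prefetches and
    // non-blocking loads per thread is exactly the trade-off being argued here.
    std::printf("outstanding cache lines needed: %.0f\n", lines_in_flight);
    return 0;
}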

Regarding the number of threads we need to hide memory latency... do you have a rough idea of how many threads current GPUs can handle at the same time in order to hide such latency? Hint: many more than 128 :)

The G80 has 24 blocking threads for each of 16 cores. That is 384 threads. That is 3x Larrabee's thread count. With some scatter/gather, software prefetching, and non-blocking loads, Larrabee should be able to obtain similar latency tolerance.
 
...prefetching gets you nowhere if you can't schedule the prefetch far enough ahead of time (which you can't)...
You can prefetch ahead as far as you like (*). If your previous reads were at address x+0 and x+12, then there's a good chance the following read will be at x+24 or very near to it (you have a cache line width of tolerance anyway). So on every idle memory cycle the hardware can speculatively read ahead data it will very likely need later on. Modern x86 CPUs even have several access pattern detectors built in and do hardware prefetching.

(*) The chance of prefetching data that will actually be used just diminishes if you prefetch too far ahead. But graphics has very predictable access patterns.
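
For reference, non-speculative software prefetching on x86 looks roughly like the sketch below (a toy streaming loop; the prefetch distance is a made-up tuning parameter, not a recommendation).

Code:
#include <xmmintrin.h>   // _mm_prefetch
#include <cstddef>

// Toy streaming sum with an explicit software prefetch a fixed distance ahead.
// PREFETCH_AHEAD is an illustrative tuning knob: too small and the data is not
// in cache yet, too far ahead and it may be evicted before use -- the
// diminishing returns mentioned in the footnote above.
float sum_with_prefetch(const float* data, std::size_t n) {
    constexpr std::size_t PREFETCH_AHEAD = 64;   // elements, roughly 4 cache lines
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + PREFETCH_AHEAD),
                         _MM_HINT_T0);           // a hint only; it never faults
        sum += data[i];
    }
    return sum;
}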
 
Looks like Larrabee would need at least 24 times the number of threads per core to be able to hide texture cache misses?
If a Larrabee core has one thread (let's call it fragment thread C) stalled due to a pending texture request (bilinearly filtered texture result) then while C is stalled, fragment threads A, B and D can run their bilinear texture filter program. Since the bilinear program takes a whole pile of instructions to produce a single texture result, Larrabee will happily fill in C's stall time with work for A, B and D.

Not all "programs" in a graphics system have the same "effective latency". Since there's a wide variety of latencies available in the mix of these programs, Larrabee should be able to find a set of threads, somewhere, that can hide the latency of a single "ALU instruction" - which is the worst-case scenario for a contemporary GPU. So you have a rasteriser program (1 thread, somewhere?), a vertex fetcher program (4 threads?), a Z-cull/Z-test program (8 threads?), a blend program (4 threads?), an MSAA program (8 threads?), a shader-ALU program (64 threads?), a texturing program (16 threads?) etc. and spread them round Larrabee as loading demands, presumably moving threads amongst cores to suit the workload of all the cores, jointly.

Obviously that's quite an entertaining load-balancing algorithm, sorta what current GPUs do but magnified because now a context can float to any core, whereas in the load-balanced part of GPUs (the ALU pipes) the context is never going anywhere else (though the data that makes up the context might find itself swapped off die to VRAM).

Just like in a current GPU from ATI or NVidia, most of these "programs" have deterministic execution time (e.g. the rasteriser is a fixed pipeline, the ALU pipeline knows how many ALU instructions it can execute before having to swap the thread out). That makes the load-balancing algorithm's job a lot easier - even if there is some give and take due to the irregularities of off-chip bandwidth/latency. Bear in mind current GPUs aren't perfect in this regard.

On average, as the typical shader's effective ALU:TEX ratio increases, it becomes easier and easier to hide texturing latency (though you can argue what the real-world effective ratio will be in 2009 for a vec4 pixel ALU, 4:1, 10:1, etc - R600 is happy with about a 9:1 ratio as far as I can tell, on complex shaders - but that's not necessarily a reflection of the average for all shaders in the application).

When Larrabee is not setup to "simulate" the D3D graphics pipeline but is running some more general purpose code, it should still be able to apply the same kind of load-balancing - developers should still be expressing their application in terms of scalar, MIMD, SIMD, latency-bound, compute-bound whatever programs that all share n cores...

Jawed
 
If bilinear texture filtering takes up all the time a single pixel shader thread can stall, that presents a whole different kind of problem for Larrabee :) (It would be pretty much all the processing time.) Shaders should be running most of the time, so only shaders are relevant for covering latency ... everything else should be small fry or on dedicated units.
 
You can assume I meant non-speculative software prefetching.
Dunno about "graphics" in general, but small-polygon rendering certainly doesn't have predictable access patterns.
Yep, and also in 2010 you don't want a GPU that slows down to a crawl because you're doing this:

output.colour = tex2D(sampler0, tex2D(sampler1, uv).xy);
 
If bilinear texture filtering takes up all the time a single pixel shader thread can stall, that presents a whole different kind of problem for Larrabee :) (It would be pretty much all the processing time.) Shaders should be running most of the time, so only shaders are relevant for covering latency ... everything else should be small fry or on dedicated units.
If you broke out the effective programs for G92, I bet you'd find that texturing consumes way way over 50% of the "effective instructions" running across the entire chip (I choose G92 because of its symmetry in TA/TF).

I'm not saying that Larrabee will be high performance. Oh no - I'm merely pointing out that hiding texturing latency when all you have is lots of threads that can each only run a single shader ALU instruction (as current GPUs generally can) will become less and less relevant to the overall performance of the "D3D pipeline" - ALU:TEX is increasing.

Larrabee is trading worst-case texture-latency hiding with short shaders (e.g. 1 instruction) against the poor utilisation you get on current GPUs with longer shaders, as various fixed-function units (e.g. ROPs) sit idle.

I don't know where the cut-over point is - and I'm hardly a Larrabee fanboy - based on what we're currently hearing, I think its performance will be embarrassing unless it runs at about 6GHz.

Jawed
 
If a Larrabee core has one thread (let's call it fragment thread C) stalled due to a pending texture request (bilinearly filtered texture result) then while C is stalled, fragment threads A, B and D can run their bilinear texture filter program. Since the bilinear program takes a whole pile of instructions to produce a single texture result, Larrabee will happily fill in C's stall time with work for A, B and D.

I certainly hope I don't have to compute my bilinear coefficients by hand. Even on X360 right now you have a fixed-function that gives you the lerp factors from a UV pair. Combined with a scatter/gather that can fetch 4 components in a single instruction, that could be workable. Similar to how Fetch4 currently works, I guess.
Also, fixed-function address generation would be a must. For example, you feed it normalized/unnormalized UV or UVW coordinates and you get back an address; maybe four at the same time would be ideal.
All combined it breaks down to: fixed-function address generation (1 cycle), fixed-function bilinear/trilinear lerp factors (1 cycle), Fetch1 to Fetch4 (1 cycle if in cache) and finally the bilinear lerps (or any custom processing like PCF for shadow maps, etc.).
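
For scale, here is roughly what a single software bilinear sample boils down to per lane when none of those fixed-function helpers exist (a minimal scalar sketch, assuming a single-channel 8-bit texture with unnormalized, clamp-to-edge coordinates; LOD selection and format conversion are left out entirely).

Code:
#include <algorithm>
#include <cmath>
#include <cstdint>

// Minimal scalar bilinear fetch: single-channel 8-bit texture, unnormalized
// (u, v), clamp-to-edge addressing -- everything a TMU normally hides.
float bilinear_fetch(const std::uint8_t* tex, int w, int h, float u, float v) {
    // Address generation: integer texel coordinates and fractional lerp factors.
    const float uf = u - 0.5f, vf = v - 0.5f;
    int x0 = static_cast<int>(std::floor(uf));
    int y0 = static_cast<int>(std::floor(vf));
    const float fx = uf - x0, fy = vf - y0;
    const auto clampx = [&](int x) { return std::min(std::max(x, 0), w - 1); };
    const auto clampy = [&](int y) { return std::min(std::max(y, 0), h - 1); };
    const int x1 = clampx(x0 + 1), y1 = clampy(y0 + 1);
    x0 = clampx(x0); y0 = clampy(y0);
    // Four fetches (what a Fetch4/gather instruction would do in one go).
    const float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
    const float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];
    // Three lerps using the fractional weights.
    const float top = t00 + (t10 - t00) * fx;
    const float bot = t01 + (t11 - t01) * fx;
    return top + (bot - top) * fy;
}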
 
I certainly hope I don't have to compute my bilinear coefficients by hand.
I don't see why you should - this would be a "special graphics instruction".

Even on X360 right now you have a fixed-function that gives you the lerp factors from a UV pair. Combined with a scatter/gather that can fetch 4 components in a single instruction, that could be workable. Similar to how Fetch4 currently works, I guess.
Yeah, that's various kinds of "unbundling" of traditional texturing concepts to give the programmer more freedom.

Also, fixed-function address generation would be a must. For example, you feed it normalized/unnormalized UV or UVW coordinates and you get back an address; maybe four at the same time would be ideal.
All combined it breaks down to: fixed-function address generation (1 cycle), fixed-function bilinear/trilinear lerp factors (1 cycle), Fetch1 to Fetch4 (1 cycle if in cache) and finally the bilinear lerps (or any custom processing like PCF for shadow maps, etc.).
G92 (again, because of its convenient TA:TF symmetry) "unbundles" its TMUs. Per clock these can function as 64 bilinears, 32 fp16 bilinears, 32 trilinears, etc.

The way I see it, each Larrabee core consists of a quad of vec4 fp32 ALUs. If these ALUs have extensions in them (say to fp40/int32 - I have to admit my texturing math knowledge is too shaky to know the required precisions across all stages of a TMU), they can perform a similar kind of unbundling to what we see in G92 TMUs. So the baseline is T fp32 bilinears per clock, with 2T fp16 bilinears and 4T int8 bilinears. If G8x/G9x can do this kind of thing, why can't Larrabee?

I presume, for example, that the complexity of LOD and bias calculations won't scale with the format of the texture, e.g. LOD/bias will cost the ~same for both int8 and fp32 filtering.

Now, the question is, what is T in Larrabee?

Jawed
 
The G80 has 24 blocking threads for each of 16 cores. That is 384 threads. That is 3x Larrabee's thread count. With some scatter/gather, software prefetching, and non-blocking loads, Larrabee should be able to obtain similar latency tolerance.

Oops (sorry for my bad logic), looks like you are right in that the G80 schedules warps (4x 8-way SIMD) at a time. So my factor of 24x is more like a factor of 6x for the 16-core Larrabee (64 threads) and a factor of 4x for the 24-core version (96 threads). Of course I'm not taking clock rate and memory bandwidth into consideration here, simply the number of threads.

Also, we are comparing Larrabee, which is a 2009/2010 product, to the G80 of 2006/2007.
 
The G80 has 24 blocking threads for each of 16 cores. That is 384 threads. That is 3x Larrabee's thread count. With some scatter/gather, software prefetching, and non-blocking loads, Larrabee should be able to obtain similar latency tolerance.
According to the CUDA documentation, each multiprocessor processes one block of threads at a time (independently from the other multiprocessors), and a block can have up to 512 threads.
Since G80 has 16 multiprocessors it can handle ~8000 threads at the same time (obviously not in the same clock cycle).
This is 64 times the number you gave us for Larrabee, which is why I believe Larrabee should have a mechanism to quickly switch from one group of threads to another.
 