Larrabee: Samples in Late 08, Products in 2H09/1H10

Anyone care to speculate just what these new DX11 features will be? Programmable ROP perhaps?

I'd imagine that the problem with reaching peak performance on Larrabee is going to be the awfully asymmetric instruction set. Given that you still really have to program in assembler to extract the performance potential of SSE2/3, Larrabee's x86 + 2 DP/clk + 512-bit vector ops sounds like a nightmare for a compiler to output good code for. It was very refreshing when NVidia went scalar for its unified shader arch; it's very easy to program optimally in Cg or GLSL compared to any CPU-side SSE2/3 stuff.
 
It would be a lot easier if we plotted a timeline graph of GPU speed increases over the years. I propose starting from NV40 to get a somewhat better idea. Unfortunately I don't know the exact GFLOP ratings of the earlier series, so someone else will have to provide them. Information about the later series has already been provided.

If it "has" to be NV40 at any price here you go:


NV40
released in Q2 2004 with 222M Transistors at 130nm IBM
16 SIMD channels * 12 FLOPs (Vec4 MADD + Vec4 MUL) * 0.4GHz = 76.8 GFLOPs

G70
released in Q3 2005 with 300M Transistors at 110nm TSMC (334mm^2)
24 SIMD channels * 16 FLOPs (Vec4 MADD + Vec4 MADD) * 0.43GHz = 165.0 GFLOPs

G71
released in Q1 2006 with 278M Transistors at 90nm TSMC (196mm^2)
24 SIMD channels * 16 FLOPs * 0.65GHz = 250.0 GFLOPs

(and yes before you say it I obviously miscalculated G71 in my former post)

G80 released in Q4 2006 with 681M Transistors (w/o NVIO) at 90nm TSMC (484mm^2)
128 SPs * 3 FLOPs (scalar MADD + scalar MUL) * 1.35GHz = 518 GFLOPs

The bottom line remains the same; those are mere theoretical peak GFLOP rates. Even if you entirely discount the scalar MUL unit on G80, texturing has been de-coupled from the ALUs, unlike on former architectures, as illustrated here:

http://www.beyond3d.com/content/reviews/36/10

It won't be too hard to find corner cases in real games where G80 ends up as much as 3x faster than G71; with NV40 the gap is so huge that a comparison between those two would be ridiculous. But since you really insisted on focusing on a timeline since NV40, here you are: it's 518 GFLOPs vs. 77 GFLOPs in two years and one quarter.

***edit: by the way, while it's highly interesting to follow debates about arithmetic efficiencies, internal architectural aspects and the like, I'm more than just worried about IQ aspects.
 
I'd imagine that the problem with reaching peak performance on Larrabee is going to be the awfully asymmetric instruction set. Given that you still really have to program in assembler to extract the performance potential of SSE2/3, Larrabee's x86 + 2 DP/clk + 512-bit vector ops sounds like a nightmare for a compiler to output good code for. It was very refreshing when NVidia went scalar for its unified shader arch; it's very easy to program optimally in Cg or GLSL compared to any CPU-side SSE2/3 stuff.

Perhaps that is why Intel is developing Ct? Ct is described in a paper from Intel researchers/developers:

Future-Proof Data Parallel Algorithms and Software on Intel® Multi-Core Architecture
 
I totally agree.

Let me clarify my previous post. What I intended to say is that there are new instructions *and* the special purpose hardware ALUs to support those instructions. However, unlike current GPUs, there is no *other* special hardware. No other fragment pipeline or special z-buffering frame buffer (or whatever else GPUs have today). Just many x86 cores with extra vector ALUs for executing the new instructions tailored for graphics processing.

The hardware internals for GPUs are often more flexible than is exposed by graphics APIs and graphics drivers, hence CUDA and CTM.

While there are a number of design wrinkles and hardware designed to help emulate the state machine of the graphics pipeline, a significant amount of the hardware is pretty much agnostic.
It's sort of like how Transmeta's VLIW processors had a fair number of design decisions that made little sense for VLIW, but were added to allow them to perform the job of emulating an x86 state machine.
The setup engine and a few caches optimized for 2-dimensional accesses are examples that come to mind that may or may not translate well to other workloads.

Without a die plot and better numbers, the penalty of the specialized hardware is difficult to quantify.

For GPGPU, this might be a problem.
For consumer graphics for quite some time into the future, the lack of specialized hardware for emulating the consumer graphics pipeline may be a detriment to Larrabee.

The key difference is the programming model. For Larrabee, a program can just use inline assembly (or library calls) to insert these vector operations into a regular program. There is no special setup or other low-level implementation-specific poking of the hardware to get the special purpose hardware going. Just as SSE isn't conceptually difficult to add to a program (assuming it has the right sort of data parallelism), these vectors will be similarly easy to use.
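
As a rough illustration of that style, here's a minimal C sketch using ordinary SSE intrinsics (Larrabee's actual vector ISA and intrinsics aren't public, so this only shows the flavor; the saxpy_vec helper is made up for illustration). The vector work drops straight into a normal function with no device setup or driver calls:

#include <xmmintrin.h>  /* SSE intrinsics, used here purely as an analogy */

/* y[i] += a * x[i], four floats at a time; n is assumed to be a multiple of 4 */
void saxpy_vec(float *y, const float *x, float a, int n)
{
    __m128 va = _mm_set1_ps(a);                 /* broadcast the scalar */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);        /* unaligned vector load */
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(y + i, vy);               /* plain store back to memory */
    }
}

The rest of the program just calls this like any other C function, which is exactly the point being made above.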

Larrabee's flexibility would be mitigated in the consumer graphics scene by the fact that it would be hiding behind an API and driver like the GPUs.

In systems where it is allowed to function as a primary processor, it would have an advantage.

GPUs do suffer from the fact that they do not have similar capability, though consumer graphics has adapted to this model well enough.

Another key point is that Larrabee has coherent caching (just like Intel's other multi-core systems). Unlike a GPU that requires explicit commands to move data around the system and/or flush caches at the right time, all that is done seamlessly in Larrabee. Instead of burdening the programmer in worrying about all these issues, Larrabee really is just a shared-memory multiprocessor on a chip.
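
To make that concrete, a minimal pthreads sketch of the shared-memory model being described (plain C, nothing Larrabee-specific): both threads touch the same array through ordinary pointers, and coherent caches mean no explicit copies or flushes are needed:

#include <pthread.h>
#include <stdio.h>

#define N 1024
static float data[N];                      /* shared by all threads, no copies */

static void *scale_range(void *arg)
{
    int start = *(int *)arg;
    for (int i = start; i < start + N / 2; i++)
        data[i] *= 2.0f;                   /* plain stores; hardware keeps caches coherent */
    return NULL;
}

int main(void)
{
    pthread_t worker;
    int lo = 0, hi = N / 2;

    for (int i = 0; i < N; i++) data[i] = (float)i;

    pthread_create(&worker, NULL, scale_range, &hi);  /* second half on another core */
    scale_range(&lo);                                 /* first half on this one */
    pthread_join(worker, NULL);

    printf("data[%d] = %.1f\n", N - 1, data[N - 1]);  /* 2 * 1023 = 2046.0 */
    return 0;
}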
GPUs, SLI and CrossFire aside, usually operate with internally partly shared caches. G80 has exposed an explicit parallel data cache, which complicates matters.
R600 has a number of shared caches.

Slides on R7xx seem to indicate a more transparent sharing of separate memory controllers. It seems likely that by 2010 a fair amount will have changed in this area.
R600 was already equipped with TLBs and possesses an internally distributed memory client model.
Memory coherency and synchronization tends to be weakly defined with GPUs between clusters, but signs point towards an evolution towards a model that will be closer to x86 (though likely still distinct).

I don't understand why under SMT one thread would block the other threads. The whole point of threading is to allow the other threads to continue.
I was reinforcing the point that Larrabee's units are likely fully pipelined for most instructions.

If an execution unit is not fully pipelined for a given operation, it cannot start on the next operation until the first instruction has cleared whatever interlock is in place.
If a unit takes 5 cycles for an operation, but is not fully pipelined, then for some number of cycles it cannot allow any instruction issue at all, irrespective of data dependency. As a hardware hazard, it also spans between threads, whereas data dependencies do not.

If it is fully pipelined, an operation can begin stage 1 as an earlier operation enters stage 2.
For common instructions, a lack of full pipelining is highly undesirable.
For common instructions compounded by having multiple threads, contention is worse. This is quite possibly worse in consumer graphics, where a lot of threads can be assumed to be working with similar instruction mixes at a given instant.

For SMT, it can result in stalls, especially at 4-way threading.
For FMT, it is an extra scheduling headache because the threads are supposed to cycle regularly.
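
A toy model of that issue-rate difference, assuming a made-up 5-cycle operation (the numbers are illustrative, not Larrabee figures):

#include <stdio.h>

/* Cycles to complete n independent operations on one unit, where "interval"
 * is the number of cycles between successive issues (1 if fully pipelined,
 * equal to the latency if not pipelined at all). */
static long cycles(long n, int latency, int interval)
{
    return (n - 1) * (long)interval + latency;
}

int main(void)
{
    const long n = 1000;    /* independent ops, e.g. spread across 4 threads */
    const int latency = 5;  /* assumed operation latency */

    printf("fully pipelined: %ld cycles\n", cycles(n, latency, 1));        /* 1004 */
    printf("not pipelined:   %ld cycles\n", cycles(n, latency, latency));  /* 5000 */
    return 0;
}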

Although most systems have a hard time reaching peak performance, having 4 threads per processor to suck up ALU bandwidth will help Larrabee get much closer to peak performance than systems without threads (such as Intel's current multi-core chips).

Of course, the big down side is that now the programs need to generate 128 threads, which isn't a trivial task.

For graphics work, it isn't too hard to generate threads.
The "thread" counts are in the many hundreds to maybe a thousand for GPUs right now.
A full x86 thread's context is definitely heavier than a single primitive running through a GPU, however.
For consumer graphics, I am unsure heavy threading is the way to go in the long run, and it does run up against a fair amount of inertia in the near and medium term.
 
Memory coherency and synchronization tends to be weakly defined with GPUs between clusters, but signs point towards an evolution towards a model that will be closer to x86 (though likely still distinct).

I agree with most of what was said above, especially this. GPUs are becoming more and more general in terms of computation resources, caching, and memory model. I think Intel looked at the trend, tried to extrapolate 10 years out, and built Larrabee in that vision. It might have some deficiencies in the first iteration, but as things evolve it will make more and more sense.

Interestingly, the Larrabee project actually started out as an internal Intel Venture Capital project. Even now, on the organization chart, the visual computing group and the rest of the x86 projects don't share a common boss until pretty high up. Maybe only a step or two away from the CEO. From what I've heard, some of the other groups at Intel aren't so happy with what Larrabee has done.

I was reinforcing the point that Larrabee's units are likely fully pipelined for most instructions.

Ah, I see. Yep.

For graphics work, it isn't too hard to generate threads.
The "thread" counts are in the many hundreds to maybe a thousand for GPUs right now.
A full x86 thread's context is definitely heavier than a single primitive running through a GPU, however.

Yes, an x86 context is pretty heavy weight. For example, the vector register file on each Larrabee core is probably pretty big. Four contexts x n registers x 64 bytes. If each thread has, say, 32 vector registers, that would be an 8KB register file! Considering the vector register file would need multiple read and write ports, that will likely be a big structure (much larger than an 8KB single-ported cache).
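
A quick sanity check of that arithmetic, with the register count being the guess above rather than a disclosed figure:

#include <stdio.h>

int main(void)
{
    const int contexts  = 4;    /* hardware threads per core */
    const int vregs     = 32;   /* assumed vector registers per thread */
    const int reg_bytes = 64;   /* 512-bit vector registers */

    printf("vector register file per core: %d KB\n",
           contexts * vregs * reg_bytes / 1024);    /* prints 8 */
    return 0;
}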
 
Considering the vector register file would need multiple read and write ports, that will likely be a big structure (much larger than an 8KB single-ported cache).
Sadly, I think that's right. I can't see any way for Larrabee to be single-ported, which is pretty bad, since that is a significant disadvantage against GPUs, which *are* single-ported. Otherwise they couldn't have 512KiB register files, obviously...
 
If it "has" to be NV40 at any price here you go:


NV40
released in Q2 2004 with 222M Transistors at 130nm IBM
16 SIMD channels * 12 FLOPs (Vec4 MADD + Vec4 MUL) * 0.4GHz = 76.8 GFLOPs

G70
released in Q3 2005 with 300M Transistors at 110nm TSMC (334mm^2)
24 SIMD channels * 16 FLOPs (Vec4 MADD + Vec4 MADD) * 0.43GHz = 165.0 GFLOPs

G71
released in Q1 2006 with 278M Transistors at 90nm TSMC (196mm^2)
24 SIMD channels * 16 FLOPs * 0.65GHz = 250.0 GFLOPs

(and yes before you say it I obviously miscalculated G71 in my former post)

G80 released in Q4 2006 with 681M Transistors (w/o NVIO) at 90nm TSMC (484mm^2)
128 SPs * 3 FLOPs (scalar MADD + scalar MUL) * 1.35GHz = 518 GFLOPs

You're missing vertex shader FLOPs there. That's an extra 50 GFLOPs for G71 in the comparison to G80.
 
Interestingly, the Larrabee project actually started out as an internal Intel Venture Capital project. Even now, on the organization chart, the visual computing group and the rest of the x86 projects don't share a common boss until pretty high up. Maybe only a step or two away from the CEO. From what I've heard, some of the other groups at Intel aren't so happy with what Larrabee has done.
That is interesting.
I can imagine there are a number of groups that would feel uncomfortable with Larrabee being the cheap-FLOPS design that it is.
The throughput and FP performance may impinge on elements of IA64 and standard x86 in HPC.

It would also potentially affect various IBM, Sun, Nvidia, and AMD (especially AMD) products as well, though.


Yes, an x86 context is pretty heavy weight. For example, the vector register file on each Larrabee core is probably pretty big. Four contexts x n registers x 64 bytes. If each thread has, say, 32 vector registers, that would be an 8KB register file! Considering the vector register file would need multiple read and write ports, that will likely be a big structure (much larger than an 8KB single-ported cache).

There's also the integer file and the whole mess of process status registers and other bits of state that must be duplicated.
GPUs tend to keep a coarser shared context, though this can have downsides when it comes to debugging.

It's not clear if Intel plans to go beyond the 16 vector registers it already has in long mode, though specifying the expanded vector ops would in theory allow it to add another bit somewhere to point to new registers.


Speculation:

It could be partially mitigated if the L1 data cache is pseudo or fully dual-ported.
x86 mem-reg ops could pull one operand from the cache, and the cache lines are the same length as the longer registers.
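
As a plain-SSE analogy for that reg-mem form (only an analogy, since the real encoding is unknown; scale_stream is a made-up example), the compiler is free to fold the memory load into the multiply as something like mulps xmm, [mem], so only one register source is needed:

#include <xmmintrin.h>

/* dst[i] = k * src[i], 16 bytes at a time; src and dst assumed 16-byte aligned */
void scale_stream(float *dst, const float *src, __m128 k, int n)
{
    for (int i = 0; i < n; i += 4) {
        /* The aligned load below can be folded into the multiply's memory
         * operand by the compiler, so only one register source is needed. */
        __m128 v = _mm_mul_ps(k, _mm_load_ps(src + i));
        _mm_store_ps(dst + i, v);
    }
}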

The Larrabee slides point to the possibility of a dual-issue core, possibly two vector pipelines.
The peak DP numbers are kind of odd, since they are given as a range: 8-16.

I thought this might have meant that they hadn't decided, but it also makes sense if Larrabee lacks enough read ports to access 4 separate registers and instead relies on an operand from memory.
A two-register operation would cause one pipe to block the other.
If both use reg-mem, then peak execution is possible.
That would leave 2 read and 2 write ports to the reg file, which is still rather hefty, but possibly less so if Intel sticks with 16 registers.

If the core's threading uses round-robin, almost all of the L1 load-to-use penalty would be hidden.
Larrabee would need some kind of decoupled pipeline to mostly hide the cycle penalty that results from a memory access.

This approach unfortunately does put pressure on the cache, and I don't know the latencies for all of the various GPU storage locations (general registers, special registers, various caches, indexes, buffers, etc.).
GPUs would be running much of their peak operations wholly within the register file, while Larrabee would need to rely on the less predictable L1 that the threads are fighting over.

Sadly, I think that's right. I can't see any way for Larrabee to be single-ported, which is pretty bad, since that is a significant disadvantage against GPUs, which *are* single-ported. Otherwise they couldn't have 512KiB register files, obviously...

Going by the vector registers alone, 8KiB x 24 would be 192 KiB. Adding the rest of the registers, it still leaves Larrabee's aggregate register storage at less than last year's GPUs.
The cache capacity is huge (32 KiB x 24 L1, 256 KiB x 24 L2), but GPUs have done so well without ultra-massive caches that this may or may not be past diminishing returns for consumer graphics.
 
a couple of thoughts:

1) If Intel is not suicidal they will put some sort of fixed-function rasterizer on Larrabee to help speed up rasterization. Unless they don't care about games.
They need something that can perform coarse- and fine-grained rasterization so that it can also efficiently run early-rejection algorithms.
I guess all sorts of different rendering pipelines will be possible on Larrabee; the question is how efficient such an open/reconfigurable architecture is going to be compared to what the competition will offer in a couple of years.

2) It's nice that a core can process a 16-way vector and 4 threads, but this is nowhere near enough to hide texture fetch latency.
A core will need a mechanism to rapidly switch between groups of 4 threads every time a long-latency data dependency is encountered.
 
2) It's nice that a core can process a 16-way vector and 4 threads, but this is nowhere near enough to hide texture fetch latency.
A core will need a mechanism to rapidly switch between groups of 4 threads every time a long-latency data dependency is encountered.

That would fall under the thread context problem. Being x86, unless Larrabee invents a whole new class of thread, the four threads it runs at a time are going to have to go through a full context switch to move each one out. The scheduler in the driver would have to do more as well, as GPUs seem to be more free to push their lightweight threads around.

Moving just the integer registers out will take a few cycles. If there are 16 int-64 registers, it would take 2 cycles over an L1 bus wide enough to load the extended vector operands.
The vector regs would take as many cycles as there are vector registers.

Each thread's vector context would be 1KiB if there are only 16 vector registers.
A 32 KiB data cache is going to get cramped if too many threads are left waiting.
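
Putting rough numbers on that, using the figures from this post (16 int-64 registers plus 16 vector registers of 64 bytes each; real architectural state such as flags and control registers would add more):

#include <stdio.h>

int main(void)
{
    const int int_bytes = 16 * 8;    /* 128 B of integer registers per thread */
    const int vec_bytes = 16 * 64;   /* 1 KiB of vector registers per thread  */
    const int per_thread = int_bytes + vec_bytes;

    printf("per-thread context: %d bytes\n", per_thread);           /* 1152 */
    printf("four parked threads: %.1f KiB of a 32 KiB L1\n",
           4 * per_thread / 1024.0);                                /* 4.5 */
    return 0;
}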
 
...the cache lines are the same length as the longer registers.

This seems pretty likely. You wouldn't want 32B cache blocks for 64B registers. 128B cache blocks are a bit too large, so 64B cache blocks sound about right.

I thought this might have meant that they hadn't decided, but it also makes sense if Larrabee lacks enough read ports to access 4 separate registers and instead relies on an operand from memory.

I actually think that Larrabee has some three-input or even four-input vector instructions. Some of those inputs might be non-vector registers (such as a mask register to predicate the vector operation). The good news is that read ports are cheaper than write ports. You can always double the number of read ports by replicating the structure. This might seem wasteful, but it can actually take less area than making a single structure with lots of ports (the wires begin to dominate).

If Larrabee is dual issue (which seems plausible), it would likely replicate the register file for each of the two issue pipelines. This should help the layout, as each ALU can be close to its own copy of the register file. Such clustered designs have already been used (The Alpha 21264 is the classic example), so that might work well.

This approach unfortunately does put pressure on the cache, and I don't know the latencies for all of the various GPU storage locations (general registers, special registers, various caches, indexes, buffers, etc.).
GPUs would be running much of their peak operations wholly within the register file, while Larrabee would need to rely on the less predictable L1 that the threads are fighting over.

I think making the cache dual ported is probably more expensive than the register file.

The cache capacity is huge (32 KiB x 24 L1, 256 KiB x 24 L2), but GPUs have done so well without ultra-massive caches that this may or may not be past diminishing returns for consumer graphics.

24 * 256KB is 6MBs of L2 cache. That is similar to the amount of cache on a Core 2 Duo (4MBs) or some of today's newer 45nm chips from Intel (6MBs). This may be huge by GPU standards, but it seems reasonable from a multi-core chip perspective. But yes, the largest version of Larrabee they ship (be that 16, 24, or 32 cores) is going to be a big die.
 
2) It's nice that a core can process a 16-way vector and 4 threads but this is not nearly enough what it's needed to hide texture fetches latency.
A core will need a mechanism to rapidly switch between groups of 4 threads everytime a long latency data dependency is encountered.

This quickly becomes a memory *bandwidth* problem, not a latency hiding problem.

Let's assume the memory latency is, say, 50ns. Each of the 128 threads (32 cores x 4 threads) can have an outstanding miss for a 64B cache block. That means, on average, every 50ns the 128 threads will have generated misses for 128*64B = 8192 bytes. That is 163 bytes/ns, which is 163 GBytes/second. I'm not sure how many memory controllers Larrabee will have, but memory bandwidth in the 100GB/second to 200GB/second range is likely.

As such, 128 threads should be enough to saturate the available off-chip bandwidth. Once you've saturated your off-chip bandwidth, there is nothing else you can really do (unless someone fixes the bandwidth bottleneck).
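
A back-of-envelope check of those figures, with all inputs taken from the assumptions above:

#include <stdio.h>

int main(void)
{
    const int threads    = 32 * 4;  /* 32 cores x 4 threads */
    const int line_bytes = 64;      /* one outstanding 64B miss per thread */
    const double lat_ns  = 50.0;    /* assumed memory latency */

    double bytes_per_ns = threads * line_bytes / lat_ns;  /* bytes/ns == GB/s */
    printf("bandwidth needed to keep all threads fed: %.1f GB/s\n", bytes_per_ns);
    return 0;
}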
 
I actually think that Larrabee has some three-input or even four-input vector instructions. Some of those inputs might be non-vector registers (such as a mask register to predicate the vector operation).
I'm aware of the newer SSE4 extensions that have an implied operand, but it's one thing to have an implied mask operand that does not affect the maximum FLOP count and another to have an arithmetic operand that does.

My base assumption, which could very well be wrong but seems plausible, is that Intel's idea of an x86-based multicore providing a familiar software target is moot if the vector extensions are wildly different.
To keep the decoder simple, Intel would try to reuse as much of the x86 decoder as possible for the extensions.
That encourages similar semantics and similar code behavior.

If Intel wants a huge break, more power to them and their modal instruction decoder.

That doesn't mean it's not possible. AMD posited something along those lines, but Intel is thus far not interested in SSE5.

The good news is that read ports are cheaper than write ports. You can always double the number of read ports by replicating the structure. This might seem wasteful, but it can actually take less area than making a single structure with lots of ports (the wires begin to dominate).
There are already large physical register files that run with greater port count and higher clocks than what Larrabee will target on a much smaller process.
Replication may not be necessary.

If Larrabee is dual issue (which seems plausible), it would likely replicate the register file for each of the two issue pipelines. This should help the layout, as each ALU can be close to its own copy of the register file. Such clustered designs have already been used (The Alpha 21264 is the classic example), so that might work well.
With Larrabee, it's doubling a 4-way replicated register file.
This extra expenditure will be scaled by a factor of 24 for Larrabee overall.

I think making the cache dual ported is probably more expensive than the register file.
Pseudo-dual ported would reduce much of the penalty.
If the vector extensions keep any x86 semantics at all, they will have the ability to reference memory operands. If the architecture is dual-issue, then it is plausible the load/store units are already present.
The pseudo- or fully dual-ported cache would avoid duplicating the 4 register files and be in keeping with common SSE code practice.
Keeping the code compact would save the effort of doubling the size of the instruction cache and register files as well.
This means a more compact core and easier x86 support for non-extended code, which seemed to be the point of Larrabee in the first place.

24 * 256KB is 6MBs of L2 cache. That is similar to the amount of cache on a Core 2 Duo (4MBs) or some of today's newer 45nm chips from Intel (6MBs). This may be huge by GPU standards, but it seems reasonable from a multi-core chip perspective. But yes, the largest version of Larrabee they ship (be that 16, 24, or 32 cores) is going to be a big die.

Compared to logic-heavy GPUs, even more modest CPU caches are very large.
The ratio of cache per functional unit in the Larrabee slides is much higher than that of current GPUs. The L1 and L2 caches for Larrabee would dominate the transistor count.

GPU caches are trending upwards in size, but the pursuit of ALU density has successfully held the rate of increase down.

This quickly becomes a memory *bandwidth* problem, not a latency hiding problem.

As such, 128 threads should be enough to saturate the available off-chip bandwidth. Once you've saturated your off-chip bandwidth, there is nothing else you can really do (unless someone fixes the bandwidth bottleneck).
Actually, the slides put it at 128 GB/sec. Not quite worst-case, but pretty good.
 
My base assumption, which could very well be wrong but seems plausible, is that Intel's idea of an x86-based multicore providing a familiar software target is moot if the vector extensions are wildly different.

From what I've heard about Larrabee, the vector extensions are wildly different (which is one of the reasons that the rest of Intel doesn't like it). Larrabee doesn't support MMX or any of the SSE instructions. They actually went back to the microcode from the Pentium. They did extend it to 64-bit x86, but without SSE.

Such a departure isn't unprecedented. The IBM Cell SPEs use a different sort of SIMD instructions (and different number of registers) than the normal PowerPC Altivec stuff.

They basically re-designed the vector instructions from the ground up to be graphics-specific. That is how they plan to get away with not having any other specialized graphics hardware on the chip. Just these special vector ALUs. It seems like a big gamble, but I am convinced by the pitch, frankly.

In many ways, perhaps Larrabee is Cell "done right".

Compared to logic-heavy GPUs, even more modest CPU caches are very large.
The ratio of cache per functional unit in the Larrabee slides is much higher than that of current GPUs. The L1 and L2 caches for Larrabee would dominate the transistor count.

Transistor count isn't the most relevant issue anymore. The two most important issues are (1) power and (2) die area. Granted, these are related to transistor count, but not always one-for-one.

Intel's 45nm process has very small SRAMs and it has very low power SRAM transistors (by using special low-leak transistors). You can get lots of L2 cache on a chip without burning much power or taking up that much die area. Once you're basically power limited by your ALUs, why not throw some extra cache on the chip if you have enough die area?

Conventional wisdom is that caches don't work for graphics computations. Perhaps Intel has found more locality in graphics applications (in the multi-MB range of caching) than previously thought.
 
That is how they plan to get away with not having any other specialized graphics hardware on the chip. Just these special vector ALUs. It seems like a big gamble, but I am convinced by the pitch, frankly.

No fixed-function/specialized hardware at all seems like a good way to lose their gamble if they plan to be competitive in the games department.
 

No fixed-function/specialized hardware at all seems like a good way to lose their gamble if they plan to be competitive in the games department.

If you look at the progression of GPUs, less and less of the area is being spent on fixed-function units, and more and more of it is taken up by programmable units.

As shaders become more sophisticated, they spend less of their overall time doing the actual fixed function part of the computation. Advanced features like game physics acceleration probably make little use of fixed function hardware.

Also, let me quote from TomF's blog:

"The SuperSecretProject is of course Larrabee, and while it's been amusing seeing people on the intertubes discuss how sucky we'll be at conventional rendering, I'm happy to report that this is not even remotely accurate"

But perhaps the really interesting thing he says is:

Frankly, we're all so used to working within the confines of the conventional GPU pipeline that we barely know where to start with the new flexibility. It's going to be a lot of fun figuring out which areas to explore first, which work within existing game frameworks, and which things require longer-term investments in tools and infrastructure - new rendering techniques aren't that much use if artists can't make content for them.

I'm not sure, but I think the general-purpose cache-coherent nature of Larrabee is what TomF is likely referring to. I think that Larrabee will allow gaming frameworks to transcend the GPU straightjacket, letting some really interesting innovation occur.

We should all revisit this thread in two years and re-evaluate then. :)
 
If you look at the progression of GPUs, less and less of the area is being spent on fixed-function units, and more and more of it is taken up by programmable units.
That's exactly why you want to spend a very small area of the chip on something that doesn't map well at all to your programmable units.
I wouldn't be surprised if Larrabee doesn't have specialized hardware to perform texture filtering or frame buffer blending; at the same time it makes much more sense to me to have specialized units to evaluate edge equations and walk over triangles, or compute texture LODs and generate texture sample addresses.
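
For reference, the per-pixel edge-equation test such a unit would accelerate is the standard rasterization one (nothing Larrabee-specific is implied here), roughly:

#include <stdbool.h>

/* Signed area (times two) of triangle A,B,P via the 2D cross product
 * (B-A) x (P-A); positive when P lies to the left of the directed edge A->B. */
static float edge_eq(float ax, float ay, float bx, float by, float px, float py)
{
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

/* Point-in-triangle test for one pixel sample, assuming counter-clockwise
 * vertex winding; a rasterizer walks this over the triangle's bounding box. */
static bool covers(const float v[3][2], float px, float py)
{
    return edge_eq(v[0][0], v[0][1], v[1][0], v[1][1], px, py) >= 0.0f
        && edge_eq(v[1][0], v[1][1], v[2][0], v[2][1], px, py) >= 0.0f
        && edge_eq(v[2][0], v[2][1], v[0][0], v[0][1], px, py) >= 0.0f;
}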
I'm not sure, but I think the general-purpose cache-coherent nature of Larrabee is what TomF is likely referring to. I think that Larrabee will allow gaming frameworks to transcend the GPU straightjacket, letting some really interesting innovation occur.
IMHO he refers to the fact that Larrabee lets you implement new and exotic rendering pipelines while still being competitive with existing graphics hardware solutions.
I'm all for increased flexibility as long as it makes sense; I don't think in two years we will have reached the point where we can waste ALU cycles just to inefficiently run a rasterizer.
 
Why do smart ppl need to re-iterate the same %#^@ all the time?
[in before AndyTX comments ;)]
 