Larrabee: Samples in Late 08, Products in 2H09/1H10

Interesting, is 40 double-precision GFLOPS/core for real (16 DP ops/clock at 2.5 GHz)? If so we are talking about retiring two 8-wide SIMD DP operations per clock, or perhaps it is really just 8 DP ops/clock (20 GFLOPS) with a Mul+Add instruction counted as two flops per lane to reach the 40 GFLOPS/core.
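
Just working the arithmetic, both readings land on the same headline number, which is why the quoted figure alone doesn't settle it (the 2.5 GHz clock is only the number from the slides, nothing confirmed):

Code:
#include <stdio.h>

int main(void)
{
    double clock_ghz = 2.5;

    /* Reading 1: two 8-wide DP SIMD operations retired per clock. */
    double two_pipes = 2 * 8 * clock_ghz;      /* = 40 GFLOPS */

    /* Reading 2: one 8-wide DP operation per clock (20 GFLOPS), with a
       Mul+Add counted as two flops per lane to get to 40. */
    double fma_counted = 8 * 2 * clock_ghz;    /* = 40 GFLOPS */

    printf("%.0f vs %.0f GFLOPS/core\n", two_pipes, fma_counted);
    return 0;
}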
 
There are more details about Nehalem, Sandy Bridge, and Larrabee floating around.

Nehalem looks like it is working towards better handling of lock instructions, and Sandy Bridge should be given a new vector instruction set: one with 3-operand non-destructive instructions and up to 256-bit vectors.
(No detail yet on how the reg/mem operand handling will work with three operands.)
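
For the curious, here is roughly what the 3-operand, 256-bit forms buy you over today's destructive 2-operand SSE, written with intrinsics; the _mm256_* names are from Intel's published AVX material, but how Sandy Bridge actually implements them is still guesswork on my part:

Code:
#include <immintrin.h>   /* SSE + AVX intrinsics */

__m128 sse_add(__m128 a, __m128 b)
{
    /* 2-operand destructive form (addps): one source register gets
       overwritten, so the compiler must insert a copy if both sources
       are still live afterwards. */
    return _mm_add_ps(a, b);
}

__m256 avx_add(__m256 a, __m256 b)
{
    /* 3-operand non-destructive form (vaddps dst, src1, src2): no
       extra copy needed, and the vector is 256 bits wide. */
    return _mm256_add_ps(a, b);
}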

Larrabee and Sandy Bridge should be relatively contemporaneous, though if Larrabee is as independent as stated earlier it might not share the exact same extensions.

Sandy Bridge's new extensions have vectors that are too narrow to fit Larrabee, though if Larrabee can use 3-operand instructions and the IEEE standard for FMAC is ratified, FMAC may show up in Larrabee's extra-wide vector extensions.
FMAC hardware could also be shared with the special-function operations common in graphics workloads, increasing performance and saving space.
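
To make the FMAC point concrete, here's a scalar sketch using plain C99 fma(); a dot product, like most filtering and the polynomial approximations behind special functions, is nothing but repeated multiply-accumulate, so a fused unit does one operation per term:

Code:
#include <math.h>

/* Scalar illustration only; a hardware FMAC would do the same thing per
   SIMD lane, with a single rounding per operation. */
double dot(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc = fma(a[i], b[i], acc);   /* acc += a[i] * b[i] in one op */
    return acc;
}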

The shinier the future gets for the new ISA extensions, though, the less I see the x86 portion bringing in any real advantage by leveraging current tools.
We'll see how much bigger these instructions get with the extra features tacked on, and just how alien AMD's SSE5 instructions turn out to be.
 
Sandy Bridge should be given a new vector instruction set: one with 3-operand non-destructive instructions and up to 256-bit vectors.

This helps explain the rumblings I've heard internally from various engineers at Intel about Larrabee's vector extensions. It seems as if Sandy Bridge's new SSE instructions and Larrabee's extensions are contemporary designs, and thus competing designs.

If both Sandy Bridge's 256-bit SSE instructions and Larrabee's 512-bit vector instructions get added to future x86 cores, it will just further the strangeness that is x86.
 
It's going to be hard to leverage x86 toolsets if those toolsets bifurcate. Larrabee and Sandy Bridge competing could potentially cause Intel's x86 division to cannibalize itself.

Developers could try to code towards Larrabee, though if the extensions are too incompatible they'll wind up weakening Sandy Bridge.
They could code for Sandy Bridge, and in the process leave Larrabee out in the cold.
They could code for AMD's SSE5...but probably will not.

It's a little rough going with Intel's attempts to bring x86 dominance into the realm of GPUs when there seems to be a measure of eating one's young going on.
 
Yeah, I'm missing the "why use x86" angle for Larrabee. It is not as if Larrabee is going to be running Windows, or perhaps that's where I am wrong. Perhaps the goal is exactly that: to replace both the CPU and the GPU with Larrabee, perhaps even to go as far as the next Xbox being just Larrabee with no other CPU or GPU. It wouldn't be out of the realm of possibility, going from a {3 core / 2 hyperthread / 4-wide SIMD} design to a {32 core / 4 hyperthread / 16-wide SIMD} one.
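
Just to put numbers on that jump (the {cores / threads / width} figures are only the ones quoted above):

Code:
#include <stdio.h>

int main(void)
{
    /* {3 cores / 2 threads / 4-wide} vs {32 cores / 4 threads / 16-wide} */
    printf("hardware threads: %d -> %d\n", 3 * 2,  32 * 4);   /*  6 -> 128 */
    printf("SIMD lanes/clock: %d -> %d\n", 3 * 4,  32 * 16);  /* 12 -> 512 */
    return 0;
}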
 
I thought Larrabee was to be the first generation in a series of steps consolidating CPU and vector/stream/GPU functionality into one device. But if Sandy Bridge and Larrabee have different SIMD extensions, it quite clearly isn't.

*scratches head*

Cheers
 
Yeah, I'm missing the "why use x86" angle for Larrabee.
Because it's an experiment for what CPUs in the distant future will look like. Intel is trying to answer the question of what CPU people will buy in 5-10 years' time. 20 GHz single-cores are pretty much out of the question, but how many cores is right? Should they be identical or hybrid, complex cores or simplified ones? Currently opinions are divided and roadmaps show a bit of everything. Intel would much rather sell the same chip to everyone. That's the only way to keep their dominant position and have x86 survive the next decade. Larrabee will answer many of the questions needed to converge on something they can put in each and every system.

I'm pretty sure Intel isn't that interested in discrete GPU sales. But Larrabee will pay for its own R&D costs by being sold as a streaming/graphics processor. They've got nothing to lose. I'm sure the Larrabee division has its own ambitions, and if something extra comes out of it that's great. But it's still Intel we're talking about. They sell CPUs, and divergence would be their doom. So x86 is key for Larrabee.

For the same reason I'd be really surprised if Larrabee's 512-bit SIMD isn't in fact 2 x 256-bit AVX.
 
Intel would much rather sell the same chip to everyone. That's the only way to keep their dominant position and have x86 survive the next decade.
Actually, the Sandy Bridge/Larrabee split indicates that not everyone at Intel is on the same page.

They've got nothing to lose. I'm sure the Larrabee division has its own ambitions, and if something extra comes out of it that's great. But it's still Intel we're talking about. They sell CPUs, and divergence would be their doom. So x86 is key for Larrabee.
x86 will be defined as whatever goes into the dominant CPU, and that will be Sandy Bridge.
Larrabee's going to be an almost x86, and I'm wondering if there are those who aren't trying too hard to allow Larrabee to succeed.

For the same reason I'd be really surprised if Larrabee's 512-bit SIMD isn't in fact 2 x 256-bit AVX.
The rumors I've run across indicate it's not.
There is overlap in functionality, but the encoding, internal state, and instruction behavior do not match.
 
Actually, the Sandy Bridge/Larrabee split indicates that not everyone at Intel is on the same page.
Well, I can definitely see how a division experimenting with teraflop architectures turned itself into a GPU division. That doesn't take away from the fact that the CPU division(s) need answers for the next decade. At least it explains why x86 was used as a starting point.
x86 will be defined as whatever goes into the dominant CPU, and that will be Sandy Bridge. Larrabee's going to be an almost x86, and I'm wondering if there are those who aren't trying too hard to allow Larrabee to succeed.
Correct me if I'm wrong but it looks like the first iteration will be aimed at developers. They need/want working hardware by the end of the year. That's two years before Sandy Bridge. So it's quite unavoidable that there will be more differences than intended. It doesn't mean that later iterations won't closely match mainstream x86.
The rumors I've run across indicate it's not. There is overlap in functionality, but the encoding, internal state, and instruction behavior do not match.
Minor differences can easily be bridged with compiler changes. It doesn't fundamentally alter the software developed for Larrabee, if at all.
 
Out of curiosity what fraction of x86 chip space is typically used for the x86 instruction decode ... or to be more specific, all the stuff needed to translate x86 into the internal hardware instructions actually executed by the core?

Might as well include register rename in this as well under the assumption that combined ALU+MEM opcodes probably get decoded into 2 or more actual operations depending on address mode.
 
Minor differences can easily be bridged with compiler changes. It doesn't fundamentally alter the software developed for Larrabee, if at all.

Irrelevant to whether Larrabee's vector extensions are just doubled AVX.
There's evidence that they are not.

The two sets of extensions are (allegedly) not encoded the same, don't behave the same, and may have functionality present in one that is not found in the other.
 
I don't see AVX supporting fixed function texture sampling :)
edit: not in 1-2 years at least
 
Out of curiosity what fraction of x86 chip space is typically used for the x86 instruction decode ... or to be more specific, all the stuff needed to translate x86 into the internal hardware instructions actually executed by the core?

Might as well include register rename in this as well under the assumption that combined ALU+MEM opcodes probably get decoded into 2 or more actual operations depending on address mode.
The Pentium III Katmai core was 128 mm² at 0.25 micron and featured out-of-order execution and SSE. And since it obviously didn't spend all of its die size decoding instructions, I think it's fair to say that instruction decoding won't take a major amount of die space for an in-order processor at 45/32 nm. Besides, x86 decoding also wins you code size, and I doubt next-generation GPUs will do without any form of instruction decoding either.
 
Irrelevant to whether Larrabee's vector extensions are just doubled AVX. There's evidence that they are not.
I must have missed that.
The two sets of extensions are (allegedly) not encoded the same, don't behave the same, and may have functionality present in one that is not found in the other.
That's not an insurmountable obstacle. Intel has years of experience running vertex shaders using SSE.
 
Speaking of fixed function, I had a thought about Larrabee repurposing or extending the IO instructions for calling fixed function units.

The basic data unit is a register, which in Larrabee is already one cache line, and the IO space is more than big enough, with room for 64K units.
In a GPU setup, Larrabee's ring bus could snatch those messages up and route them to the right unit.
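
Something along these lines is what I mean; lrb_out512()/lrb_in512() and the port numbers are entirely made-up stand-ins for an OUT/IN extended to move a whole 512-bit register, nothing of the sort has been disclosed:

Code:
#include <stdint.h>

typedef struct { uint32_t lane[16]; } vec512;        /* one register = one cache line */

#define PORT_TEXTURE_SAMPLER 0x0100                  /* hypothetical unit IDs in IO space */
#define PORT_RASTERIZER      0x0200

extern void   lrb_out512(uint16_t port, vec512 msg); /* hypothetical wide OUT */
extern vec512 lrb_in512(uint16_t port);              /* hypothetical wide IN  */

vec512 sample_texture(vec512 coords)
{
    lrb_out512(PORT_TEXTURE_SAMPLER, coords);  /* message rides the ring bus to the unit */
    return lrb_in512(PORT_TEXTURE_SAMPLER);    /* filtered texels come back the same way */
}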
 
I don't see AVX supporting fixed function texture sampling :)
edit: not in 1-2 years at least
GPUs are moving towards programmable texture sampling. CPUs are already there. A parallel gather instruction would be great, though it's not worthless without one.
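
For reference, this is the scalar fallback a parallel gather would replace (generic C, not tied to any announced instruction):

Code:
/* One indexed load per SIMD lane -- what you're stuck doing in a loop
   when there is no parallel gather instruction. */
void gather_f32(float *dst, const float *base, const int *idx, int lanes)
{
    for (int i = 0; i < lanes; ++i)
        dst[i] = base[idx[i]];
}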

None of it matters that much as long as AVX is "equivalent" to Larrabee's SIMD ISA. It's going to use dynamic compilation anyway, so one executable can run on multiple platforms as long as the compiler itself is x86 compatible (the same reason software renderers don't care whether you have SSE version X or not). What matters more is how the software deals with the differences in architecture. If the number of cores is just another parameter you can choose, great. Otherwise there's little to no reason why Larrabee should be x86.
 
Out of curiosity what fraction of x86 chip space is typically used for the x86 instruction decode ... or to be more specific, all the stuff needed to translate x86 into the internal hardware instructions actually executed by the core?

Might as well include register rename in this as well under the assumption that combined ALU+MEM opcodes probably get decoded into 2 or more actual operations depending on address mode.

http://www.realworldtech.com/page.cfm?ArticleID=RWT021300000000&p=4

In 2000, with the chips then on hand, Paul DeMone estimated the x86 decoders of the day were in the realm of 1-2 million transistors, not including other architectural features elsewhere on the chip that were required to keep x86's wrinkles from killing performance.
(Katmai, the P3 without the L2 on-die, was about 9.5 million transistors.)
edit: The K7 core + L1 prior to Thunderbird was 22 million transistors; a large portion of that was in its significantly larger L1.

More modern x86 decoders have a much wider range of instructions to decode and are potentially wider themselves, though the cores they feed are significantly larger as well.
A big contributor to their size would be going superscalar with them, which is significantly more difficult to manage with variable-length, unaligned instructions.
 
GPUs are moving towards programmable texture sampling. CPUs are already there. A parallel gather instruction would be great, though it's not worthless without one.

I think the point of programmable texture sampling on GPUs has to do with two things: texture offsets and manual filtering (i.e. the sampler returns the 4 texels and you filter them, or use them directly, yourself). The fixed-function texture unit is still doing texture decompression, address translation (swizzle, etc.), and handling the cache. And manual filtering isn't something you would be doing in the general case, only perhaps where you need more than the 8-9 bits of precision that the fixed-function filtering gives you. I wouldn't want to do any of the fixed-function texture work in software, just like you probably wouldn't want to emulate all the x86 addressing modes and fused ALU+memory opcodes in software either...
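
To show what the "manual filtering" half amounts to, here's the bilinear blend from 4 already-fetched texels in plain C (single channel, float; decompression, swizzling and caching are deliberately left out, since that's exactly the part I'd still hand to the fixed-function unit):

Code:
/* Bilinear filter from 4 texels the sampler already fetched; fx, fy are
   the sub-texel fractions in [0,1).  Only the programmable half of
   sampling -- no decompression, addressing or caching here. */
float bilinear(float t00, float t10, float t01, float t11,
               float fx, float fy)
{
    float top    = t00 + fx * (t10 - t00);  /* lerp along x, top row    */
    float bottom = t01 + fx * (t11 - t01);  /* lerp along x, bottom row */
    return top + fy * (bottom - top);       /* lerp along y             */
}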
 
Out of curiosity what fraction of x86 chip space is typically used for the x86 instruction decode ... or to be more specific, all the stuff needed to translate x86 into the internal hardware instructions actually executed by the core?
I'll ask a slightly different question: how many x86 CPU pipeline stages are wasted on being x86?
I remember an old Ars Technica article comparing the latest x86 generation from AMD and Intel with PowerPC (which was out-of-order too). One of the outstanding differences was the number of pipeline stages (IIRC).

By the way, is there any serious comparative study that proves x86 does indeed compress code? The only support for this claim I've ever seen runs along the lines of "it's CISC and variable-length, therefore it's more compact, period".

I always found this hard to believe, because coding for x86 involves things that defy code-space efficiency: inserting instructions that do no useful work, like register swapping due to ISA asymmetry, register spills, register copies due to destructive updates, register clears to avoid false dependencies, etc. And let's not forget that x86 CPU vendors recommend a non-trivial subset of the actual x86 ISA to feed their modern CPUs; the rest is considered slow and kept for compatibility. This fact alone suggests wasted opcode space. Also, there is still no way to write FP code in a sane way: you either deal with the ancient x87 or with the gluttonous prefixes of SSE.
 