I'm going to hedge on the reverse coming to pass: the x86 manufacturers are trying to build lightweight hardware monitoring and virtualization into their chips, while OS vendors are looking to virtualize.
All parties want to make all those cores useful without constraining hardware evolution or fragmenting the software base.
A lightweight control layer and VM that allocates computation might be the final result.
(Minor quibble: there is sort of a driver for the SpeedStep functionality.)
I'll buy that. Having some sort of lower-level VMM certainly makes sense. Yet I wouldn't call that a "driver", either in the sense of something you add on to an operating system or in the sense of the pretty sophisticated software that translates high-level DX/OpenGL into hardware commands. Certainly virtualizing and managing parallelism is a real issue for future VMMs and operating systems. I personally am really interested to know more about this "Grand Central" technology that Apple is building into the next version of its OS.
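For what it's worth, what has been said publicly suggests Grand Central is exactly this kind of layer: the OS owns a pool of worker threads sized to the hardware, and applications submit blocks of work to queues instead of managing threads themselves. Here's a minimal sketch in C assuming a libdispatch-style queue API (dispatch_get_global_queue and dispatch_apply are real libdispatch entry points; the chunk count and the work per chunk are purely illustrative):

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        /* A global concurrent queue; the runtime, not the app,
           decides how many cores actually service it. */
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* Fan 16 chunks of a data-parallel job out over the queue;
           dispatch_apply returns once every chunk has finished. */
        dispatch_apply(16, q, ^(size_t i) {
            printf("chunk %zu done\n", i);
        });
        return 0;
    }

The interesting part is what the application doesn't do: it never asks how many cores exist and never spawns threads itself, so the same binary scales as core counts grow.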
I hope someone tries an ARM Larrabee, just for the comparison.
That is an interesting idea. Adding a big Larrabee-like vector unit to ARM makes lots of sense.
Atom's design was rumored to carry an estimated 15% transistor overhead just for x86 compatibility.
I'd believe that. When I asked some Intel designers about Atom, they told me they saw x86 as costing them extra transistors but a negligible amount in terms of power or performance. Basically, it came down to a fabrication cost issue, and Intel has the edge there.
Yet, Atom is clearly a disappointment. It is a chip aimed at a market that doesn't yet exist: something between a mobile PDA/phone/iPhone device and a full-blown laptop. Unless this new market segment takes off, I can't see Atom doing very well.
My question to Intel is: where is the really low-power x86 to compete with ARM? (But I guess that isn't really on-topic...)
Larrabee's design might reduce that percentage, but we wouldn't know without the comparison.
I'm sure x86 does cost Larrabee something. Without such a big vector unit, the overhead would likely have killed them. At least with big vectors, they are able to amortize the cost of x86 support over a pretty big vector and texture unit. Yet, I see diminishing returns beyond 16-element vectors, so if Larrabee is going to scale, it needs to scale in the number of cores. If I were designing Larrabee, I would have been tempted to rip out a lot of the legacy x86 stuff, but for some reason they really wanted all the Larrabee cores to be fully compatible x86 cores.
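To make the amortization argument concrete, here's a back-of-envelope sketch in C. The transistor budgets are invented purely for illustration (a fixed per-core cost for the legacy x86 front end, plus a cost per vector lane); the point is only the shape of the curve:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical numbers, in millions of transistors:
           a fixed x86 legacy tax per core, plus a per-lane
           cost for the vector unit. */
        const double x86_fixed = 1.0;
        const double per_lane  = 0.5;

        for (int width = 1; width <= 32; width *= 2) {
            double core = x86_fixed + per_lane * width;
            printf("width %2d: x86 tax = %4.1f%% of the core\n",
                   width, 100.0 * x86_fixed / core);
        }
        return 0;
    }

With these made-up numbers, the fixed tax falls from about 67% of the core at one lane to about 6% at 32 lanes, which is why a wide vector unit makes x86 compatibility tolerable within a core. But scaling by adding cores pays the full tax again on every core, which is exactly the tension noted above.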