Larrabee at GDC 09

Yep, agree with all that - nothing looks difficult in the context of Larrabee. It has similarities to the way in which a thread requests TEX operations, and receives status updates or results. Though I suspect a TU is dedicated to a core, which is simpler than the one:many gather:worker setup I'm contemplating. Still, message passing between threads across Larrabee seems pretty fundamental.

For added fun, the gather thread(s) could sort addresses from extant requests to achieve some degree of coalescing :p

Jawed
 
True, one thing that's not clear yet is whether there's any LOD, bias and addressing computation in the TUs. The justification for making them dedicated was based upon decompression and filtering.
I don't think the TUs will be just dumb samplers and filters (i.e. limited to bilinear), because that would force anisotropic filtering into software. If they want to be competitive with GPUs they will need adaptive anisotropic filtering done on the TUs, and thus the TUs will have to accept more parameters than just UV coordinates.
 
So what does Larrabee do badly in terms of general-purpose code? (Sorry for being noob and vague.) If one were comparing a general-purpose CPU to one like this, what sacrifices in terms of general-purpose computing have been made to speed it up the way it has been, and how far could specific targeted programming overcome that shortfall?
 
So what does Larrabee do badly in terms of general-purpose code? (Sorry for being noob and vague.) If one were comparing a general-purpose CPU to one like this, what sacrifices in terms of general-purpose computing have been made to speed it up the way it has been, and how far could specific targeted programming overcome that shortfall?

I wonder if it could even be used as a general-purpose CPU, or if it's limited to graphics and GPGPU only?

I.e. could you run Windows on this thing as well as all the other regular PC software, and how would it compare to something like Core i7? I'm assuming not favourably, if it's possible at all?
 
Windows 7 requires a minimum of a 1 GHz processor to 'run'. So I'm wondering if it requires any instructions which may not be present in the P1/Larrabee architecture, as I'm assuming they haven't updated the general-purpose instructions with SSE etc. - or have they?
 
I wonder if it could even be used as a general-purpose CPU, or if it's limited to graphics and GPGPU only?

I.e. could you run Windows on this thing as well as all the other regular PC software, and how would it compare to something like Core i7? I'm assuming not favourably, if it's possible at all?

I think the very first Larrabee .pdf had some sort of comparison between a theoretical Larrabee core and a Conroe/Penryn core.

According to the slides from this presentation (here) Larrabee is 'fully capable of running operating systems'.

Depending on the frequency of the actual chip, I guess it will compare best to an Intel Atom, given that the wide vector unit will be relatively useless for regular desktop use like web browsing, text editing and such.
 
How will Larrabee cope with MMX/SSE?

That functionality seems to be completely missing and would require some kind of translation into LRBni - or, it seems, it just won't work.

Jawed
 
Yes, but since it isn't there in hardware, LRB probably isn't backward compatible with code containing SSE/MMX instructions.
 
Intel at one time showed a potential design where Larrabee sat on a board as a system processor.

Later statements indicate that Intel does not currently want to allow Larrabee to be visible outside of an expansion slot.
Any physical implementation of Larrabee at this point might not have the necessary connections for interrupts and system signals present or enabled for actual use as a host.

This appears to be a largely artificial restriction. Intel might not want to hurt its margins in HPC, or to fragment the ISA even further with yet another incompatible extension set.
 
The curious thing being, though, that LRBni appears to be the kind of destination that AVX is a step towards. I know so little about MMX/SSE though...

In the scalar core, what kind of instruction differences are there between Larrabee and Core-2 or Pentium 4?

Jawed
 
Wikipedia has a listing of x86 instructions.

The bulk of the scalar integer instructions are there.
CPUID might be there.

Things like conditional moves and certain fast system call instructions didn't appear until the Pentium Pro or Pentium MMX.
Conditional moves aren't necessary for code to run, but any software using them would have to be refactored to split each CMOV into a branch statement.
I don't know how often SYS type instructions are used in current code.

MMX and SSE are the bulk of the later instructions, which are of course not present in Larrabee.
 
If the head of Intel's HPC division has any say, no.
Intel does not seem to want Larrabee exposed to the system, or at least not Larrabee I.

Intel's positions on Larrabee have shifted several times, and it has been hard to get a coherent position from the company.

Based on public quotes, Larrabee has transitioned from a mid-range solution, to a high-end power-hungry enthusiast solution, then to a more modest solution with an emphasis on power efficiency (an argument one doesn't make unless there is no higher pinnacle left to reach).
Let's not forget the schizophrenic ray-tracer/rasterizer PR stance back in the day.

The latest change to Larrabee's target bracket hints to me that somebody's bubble got popped, either by the economy or by tapeout results.
 
Based on public quotes, Larrabee has transitioned from a mid-range solution, to a high-end power-hungry enthusiast solution, then to a more modest solution with an emphasis on power efficiency (an argument one doesn't make unless there is no higher pinnacle left to reach).

Sounds spookily similar to their aims/goals/projections for Itanium a decade ago.
 
Conditional moves aren't necessary for code to run, but any software using them would have to be refactored to split each CMOV into a branch statement.

A better replacement is the SET instruction, which conditionally sets a register to 0 or 1 and has been available since the 386. It'll take a few more instructions than CMOV, but at least it avoids the branch.
 
I'm really curious to see how/if LRB scales downward. With all the talk of doing rasterization and triangle setup in software that's gotta slow things down tremendously when the architecture is stripped down to an entry level configuration. I'm assuming that those fixed function bits don't scale much in either direction in current GPU families.
 
With all the talk of doing rasterization and triangle setup in software that's gotta slow things down tremendously when the architecture is stripped down to an entry level configuration.
Why? It simply rebalances itself. There are no bottlenecks, just hotspots. As long as utilization of the cores stays very high, that's better than a GPU (where fixed-function hardware is either a bottleneck or wasted silicon - often both).
 