Larrabee at GDC 09

Impress Watch has lots of details (with what I assume are slides from Tom Forsyth's presentation)

http://pc.watch.impress.co.jp/docs/2009/0330/kaigai498.htm

Google Translate mangles it a bit, but the gist is:
mul, add, sub, adc, sbb, subr, and, or, xor, madd (multiply-add), multiply-sub and other general arithmetic instructions, plus vector compare instructions, aligned/unaligned load/store, scatter/gather (which came out as "SUKYATTA / Magic"), and bit manipulation instructions.

Larrabee to support scatter/magic!
 
http://pc.watch.impress.co.jp/docs/2009/0330/kaigai498_018.jpg said:
Relies heavily on software to get it to live up to its potential
Why is that listed as if it's a good thing? We all know how well it turned out the last time Intel released a new architecture where that was true.
 
Why is that listed as if it's a good thing? We all know how well it turned out the last time Intel released a new architecture where that was true.

Well I think that's why they apparently have 500 programmers working on this? ;)

Seriously though, I think this is kind of what we need to evolve the graphics space through the transition from a pure rasterizer to new technologies and hybrid forms blending all sorts of physics and rendering techniques.
 
They can do more than 1 FMAC/cycle if they build sufficiently wide ALUs. By this I mean that, since the ALUs are pipelined, if they build 2 vector ALUs per core they could still dual-issue an FMAC instruction. Whether they do it is an entirely different matter.

Every disclosure from Intel has referred to the vector unit in the singular sense, and all the diagrams only show one.
It doesn't look likely at this point.
 
They can do more than 1 FMAC/cycle if they build sufficiently wide ALUs. By this I mean that, since the ALUs are pipelined, if they build 2 vector ALUs per core they could still dual-issue an FMAC instruction. Whether they do it is an entirely different matter.

Either I don't get your point or we're talking about different things here. I did not want to say that the entire core or the entire chip cannot do more than 1 FMAC/cycle, but rather that their FMACs would need at least 1 cycle to complete, as do everyone else's.

And surely, if they can do any number of FMACs in parallel, that'd be factored into their peak TFLOPS rate, wouldn't it?
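(For a rough sense of scale, using purely hypothetical numbers of 32 cores at 2 GHz: one 16-wide FMAC per core per clock is 32 × 16 × 2 flops × 2 GHz = 2 TFLOPS, so a second vector unit per core would show up as a straight doubling of any quoted peak.)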
 
While they can do more than 1 FMAC/cycle, I don't think they will. I was merely demonstrating a hypothetical possibility, not that I expect it to happen with LRB.
 
The paper from the less hyped "Physics on Larrabee" talk is online here:

http://software.intel.com/en-us/articles/game-physics-performance-on-larrabee/
EDIT:[strike]Aha, that paper's been updated from the version previously published.[/strike]

Interestingly enough it seems to say that gather isn't really performing any kind of nice packing:

To cope with indirect memory accesses, Larrabee includes gather/scatter instructions, which gather/scatter data from non-contiguous memory locations, based on the addresses contained in the vector of addresses; 16 elements are loaded from or stored to up to 16 distinct memory addresses. The speedup of gather/scatter is limited by the cache, which can usually access one cache line per cycle, which may result in 16 accesses to the cache in the worst case. However, if the gathered or scattered data is co-located in the single cache line, a frequently observed scenario in many graphics and non-graphics workloads, these elements can be gathered in a single access to L1, thus substantially reducing number of cache accesses and improving application performance.
So it's quite possible for a gather to waterfall over 16 successive fetches from L1. But at least L1 fetch latency can be hidden by hardware-thread switching, so it's a half-way house solution. I suspect NVidia can also hide, or partially hide, this kind of fetch latency, but I'm not sure.
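To make that concrete, here's a toy C model of what a 16-wide gather does per lane (purely illustrative; the loop is a software model of the semantics, not how the hardware is built):

[code]
/* Toy model of a 16-wide gather: each active lane loads 4 bytes from
   base + 4*index[i]. Worst case, the 16 indices hit 16 different 64-byte
   cache lines (16 L1 accesses); best case they all sit in one line and a
   single L1 access services the whole gather. */
#include <stdint.h>

void gather16(float dst[16], const float *base,
              const int32_t index[16], uint16_t mask)
{
    for (int i = 0; i < 16; ++i)
        if (mask & (1u << i))         /* per-lane predication */
            dst[i] = base[index[i]];  /* each lane may touch a different line */
}
[/code]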

Jawed
 
The load hardware could determine relatively quickly how many cache lines will need to be accessed based on the calculated addresses.
Determining this per gather/scatter might inject some latency in the process.
 
So it's quite possible for a gather to waterfall over 16 successive fetches from L1. But at least L1 fetch latency can be hidden by hardware-thread switching, so it's a half-way house solution. I suspect NVidia can also hide, or partially hide, this kind of fetch latency, but I'm not sure.

Gather / Scatter performance between LRB and NV will be interesting.

If I remember correctly, NVidia's GT2xx in the worst-case "waterfall" of 16 independent fetches per gather can reduce each of those fetches to 32 bytes (vs 64 bytes for LRB). Latency hiding works like texture fetch latency hiding, so as long as there's enough ALU work in there, it seems easy enough to keep the ALUs busy.

With LRB, if you do hit this worst-case waterfall, you are effectively killing 1/8 of this thread's L1 cache for just one worst-case gather (32KB cache / 4 hyperthreads / 64B line / 16 lines = 8). I don't know LRB's set associativity, but if you have 4 hyperthreads doing this at the same time, things might get very bad for the L1... not sure how useful the gather prefetch will be in such a condition. It would seem to me that with 4 threads doing worst-case waterfall gathering, the ALUs would stall on L1, even if the data locality fit into L2?
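Spelling that out: 32KB L1 / 4 hyperthreads = 8KB per thread, while a fully scattered gather touches 16 lines × 64B = 1KB, i.e. 1/8 of that thread's share of the cache touched by a single instruction.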

Jawed you have any thoughts here?
 
If Larrabee decomposes gathers and scatters as serialized accesses, then perhaps a minimum of 4-way associativity should keep things relatively civil between threads.

It might be more. The Pentium had a 2-way 8KiB cache, and Larrabee's is up by a factor of 4.
Intel has shown a preference for increasing associativity with the increase of capacity in other designs.
 
With LRB, if you do hit this worst-case waterfall, you are effectively killing 1/8 of this thread's L1 cache for just one worst-case gather (32KB cache / 4 hyperthreads / 64B line / 16 lines = 8). I don't know LRB's set associativity, but if you have 4 hyperthreads doing this at the same time, things might get very bad for the L1... not sure how useful the gather prefetch will be in such a condition. It would seem to me that with 4 threads doing worst-case waterfall gathering, the ALUs would stall on L1, even if the data locality fit into L2?
Nice analysis. I honestly don't know what to add.

Someone needs to show a Larrabee raytracer with rays intersecting all over the place :p

I'm in over my head to be honest. In graphics, intense multi-level dependent texturing, e.g. as often seen in Perlin noise shaders, is a recipe for lag.

Jawed
 
If Larrabee decomposes gathers and scatters as serialized accesses, then perhaps a minimum of 4-way associativity should keep things relatively civil between threads.
It would keep the misses down, but the pure cycles necessary even on hits is still a problem. For that, the only solution is having multiple banks with smaller-width ports (or just plain adding more ports, but that's obviously a bit too expensive). A little like the texture cache, in fact. It would make sense to put a separate scatter/gather unit in there.
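Something like this, as a hypothetical illustration of why banking would help (bank count and widths are made up, not Larrabee figures):

[code]
/* Hypothetical banked L1: if the data array were split into independently
   ported banks selected by low address bits, a gather would only serialise
   on bank conflicts rather than doing one full-width access per element.
   This just computes that serialisation factor for a given address vector. */
#include <stdint.h>

#define NUM_BANKS 16          /* assumed bank count, purely illustrative */

int gather_cycles(const uint32_t addr[16])
{
    int per_bank[NUM_BANKS] = {0}, worst = 0;
    for (int i = 0; i < 16; ++i) {
        int b = (addr[i] >> 2) % NUM_BANKS;   /* 4-byte-wide banks */
        if (++per_bank[b] > worst)
            worst = per_bank[b];
    }
    return worst;   /* cycles needed if each bank serves one access per cycle */
}
[/code]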
 
It would keep the misses down, but the pure cycles necessary even on hits is still a problem.
The tenor of Intel statements is that the design assumes that most gathers will have high locality, and that those that don't will incur the penalties of moving cache lines around.

For that the only solution is having multiple banks with smaller width ports (or just plain add more ports, but that's obviously a bit too expensive).
The memory architecture, such as the way the ring bus seems to be optimized for 64-byte transfers, might not be the best fit for such an arrangement.
Separate ports mean separate memory coherence requests. Fragmenting traffic at the vector memory pipe would have knock-on effects all the way down the memory hierarchy.
 
Would it be practical to have one or more hardware threads dedicated to "grouping and assembling" L1 lines? Give this thread 16-way sets of indices and let it fetch and build single cache lines in response, which then get copied to your local L1?

This wouldn't solve latency, but it would at least keep all the L1-trashing to a corner of Larrabee, out of the way of real work. And the effective gain in L1 space for worker threads would allow them to hide the longer latency of this technique.
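As a toy sketch of the division of labour (all names and the request structure are invented for illustration; this obviously isn't LRB code):

[code]
/* Toy model of the proposed "gather thread": workers hand over a base
   address plus 16 indices, the dedicated thread chases the scattered
   addresses (trashing its own L1, not the worker's) and writes back one
   packed, contiguous block the worker can pull in with a single load. */
#include <stdint.h>

typedef struct {
    const float *base;       /* table being gathered from */
    int32_t      index[16];  /* element indices from the worker */
    float        packed[16]; /* result: one contiguous, line-sized block */
    volatile int done;       /* completion flag */
} gather_request;

void gather_thread_service(gather_request *req)
{
    for (int i = 0; i < 16; ++i)
        req->packed[i] = req->base[req->index[i]];
    req->done = 1;
}
[/code]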

Such dedicated threads don't sound particularly different from texturing hardware and its own private texture cache. So the next question is, can the texture units take on this kind of workload?

Jawed
 
Would it be practical to have one or more hardware threads dedicated to "grouping and assembling" L1 lines? Give this thread 16-way sets of indices and let it fetch and build single cache lines in response, which then get copied to your local L1?
In other words, we have a thread doing memory copies from throughout memory into a single contiguous buffer space.

Larrabee's flexible, so it could probably be done.
The scheme is pretty complex. We're either adding a whole other layer of indirection per cache line so that shader code can properly determine what it has accessed, or we've regimented the software renderer to operate on the collected data format.

Such dedicated threads don't sound particularly different from texturing hardware and its own private texture cache. So the next question is, can the texture units take on this kind of workload?
The methodology described for texturing is more of a command/response relationship between a core and the texturing unit, and the texture unit sends back filtered data as opposed to heavily manipulating the cache lines.
The exact amount of intelligence present in the texturing units is unclear.

Since texture units don't handle faults that might pop up while striding through memory constructing the optimum L1 arrangement, the texture hardware might not be flexible enough.
 
In other words, we have a thread doing memory copies from throughout memory into a single contiguous buffer space.

Larrabee's flexible, so it could probably be done.
The scheme is pretty complex. We're either adding a whole other layer of indirection per cache line so that shader code can properly determine what it has accessed, or we've regimented the software renderer to operate on the collected data format.
It doesn't seem particularly complex to me: the originating thread has to generate a vector of 16 indices for a gather, so all that's happening here is that the vector is being sent to another thread.

LRBni seems to have all the bit-wise masking/shifting support necessary for the gather thread to extract and pack the requested data into a vector to send back to the worker thread.

The methodology described for texturing is more of a command/response relationship between a core and the texturing unit, and the texture unit sends back filtered data as opposed to heavily manipulating the cache lines.
The exact amount of intelligence present in the texturing units is unclear.
True, one thing that's not clear yet is whether there's any LOD, bias and addressing computation in the TUs. The justification for making them dedicated was based upon decompression and filtering.


Seiler said:
Larrabee includes texture filter logic because this operation cannot be efficiently performed in software on the cores. Our analysis shows that software texture filtering on our cores would take 12x to 40x longer than our fixed function logic, depending on whether decompression is required. There are four basic reasons:
  • Texture filtering still most commonly uses 8-bit color components, which can be filtered more efficiently in dedicated logic than in the 32-bit wide VPU lanes.
  • Efficiently selecting unaligned 2x2 quads to filter requires a specialized kind of pipelined gather logic.
  • Loading texture data into the VPU for filtering requires an impractical amount of register file bandwidth.
  • On-the-fly texture decompression is dramatically more efficient in dedicated hardware than in CPU code.
The Larrabee texture filter logic is internally quite similar to typical GPU texture logic. It provides 32KB of texture cache per core and supports all the usual operations, such as DirectX 10 compressed texture formats, mipmapping, anisotropic filtering, etc. Cores pass commands to the texture units through the L2 cache and receive results the same way. The texture units perform virtual to physical page translation and report any page misses to the core, which retries the texture filter command after the page is in memory. Larrabee can also perform texture operations directly on the cores when the performance is fast enough in software.

So, ahem, it sounds like the texture units would be no help. Though the "pipelined gather logic" for a quad implies some kind of walk over disparate addresses to gather the texels required.

Jawed
 
It doesn't seem particularly complex to me: the originating thread has to generate a vector of 16 indices for a gather, so all that's happening here is that the vector is being sent to another thread.
The data has to come back.
Sending off the vector doesn't tell the original thread what address the compiled cache line is in, nor does it tell the shader thread when the worker thread is done.
The shader thread has to be told which address contains the desired results, and then the shader has to perform a read at this unknown address to make the data migrate back.

edit: And the base address needs to be sent. There's not enough bit space per index to derive the address otherwise.
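A worker-side view of that round trip might look like this (reusing the hypothetical gather_request sketched earlier in the thread; again, invented for illustration, not LRB code):

[code]
/* Worker side of the hypothetical scheme: the base address and indices go
   out with the request (indices alone aren't enough, as noted above), and
   the data only migrates back when the worker reads the agreed-upon result
   location after the gather thread signals completion. */
#include <stdint.h>

typedef struct {              /* same layout as the earlier sketch */
    const float *base;
    int32_t      index[16];
    float        packed[16];
    volatile int done;
} gather_request;

void worker_gather(gather_request *req, const float *base,
                   const int32_t index[16], float out[16])
{
    req->base = base;
    for (int i = 0; i < 16; ++i)
        req->index[i] = index[i];
    req->done = 0;
    /* ...notify the gather thread, then find other work to hide the latency... */
    while (!req->done)
        ;                         /* in practice you'd switch threads, not spin */
    for (int i = 0; i < 16; ++i)
        out[i] = req->packed[i];  /* reading the known address pulls the line back */
}
[/code]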
 