C++ implementation of Larrabee new instructions

Why would you unroll the latter loops? Unless they are incredibly short it doesn't make much sense.

Say you have a loop that reads a (small) number of samples and then averages them (with or without some weighting factor). A filter, basically. I think it would make sense here to unroll to have multiple reads in flight. The issue is probably not how short the shader is but how much data it needs to retain in registers.
 
Say you have a loop that reads a (small) number of samples and then averages them (with or without some weighting factor). A filter, basically. I think it would make sense here to unroll to have multiple reads in flight. The issue is probably not how short the shader is but how much data it needs to retain in registers.
Nah, because it would make more sense to just fiber across different pixels/invocations rather than trying to decide where to unroll internally in a shader... it's a similar concept, but with fewer special cases when applied to the pixel loop. It's also similar to instruction-level parallelism vs. "scalar" parallelism across pixels, the latter being easier and more robust in most cases.
 
Say you have a loop that reads a (small) number of samples and then averages them (with or without some weighting factor). A filter, basically. I think it would make sense here to unroll to have multiple reads in flight. The issue is probably not how short the shader is but how much data it needs to retain in registers.
I don't see why you need to unroll a loop in order to have multiple reads in flight. Loop unrolling makes sense when it lets you remove some of the constant per-loop overhead and, more importantly, when it gives an opportunity to extract more ILP. Multiple reads are entirely orthogonal to loop unrolling.
 
I don't see why you need to unroll a loop in order to have multiple reads in flight. Loop unrolling makes sense when it lets you remove some of the constant per-loop overhead and, more importantly, when it gives an opportunity to extract more ILP. Multiple reads are entirely orthogonal to loop unrolling.
Code:
loop(4)
   sum += value(x,y)
   x = ...
   y = ...
end

If sum is a register and you have a simple in-order pipeline with no renaming but non-blocking reads, then you can only have one read op in flight. By unrolling, you'll have four. How are you going to get multiple reads in flight for the case above without unrolling?
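To spell out what I mean in plain C++ (just a toy sketch; value() stands in for whatever read/sample the shader is doing, not any real API):

Code:
#include <cstdio>

// Toy stand-in for a sample read that has some latency behind it.
static float value(int x, int y) { return static_cast<float>(x * 7 + y); }

int main() {
    int xs[4] = {0, 1, 2, 3};
    int ys[4] = {0, 2, 4, 6};

    // Rolled: every add depends on the previous sum, so an in-order core
    // without renaming can only keep one read outstanding at a time.
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += value(xs[i], ys[i]);

    // Unrolled with independent accumulators: the four reads carry no
    // dependency on one another, so all four can be in flight at once.
    float s0 = value(xs[0], ys[0]);
    float s1 = value(xs[1], ys[1]);
    float s2 = value(xs[2], ys[2]);
    float s3 = value(xs[3], ys[3]);
    float unrolled = (s0 + s1) + (s2 + s3);

    std::printf("%f %f\n", sum, unrolled);
    return 0;
}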

I must be missing something?

Andrew Lauritzen said:
Nah, because it would make more sense to just fiber across different pixels/invocations rather than trying to decide where to unroll internally in a shader... it's a similar concept, but with fewer special cases when applied to the pixel loop. It's also similar to instruction-level parallelism vs. "scalar" parallelism across pixels, the latter being easier and more robust in most cases.
Agreed, when you're specifically talking GPU architectures. But even there, there may be cases where you can get benefits from unrolling. E.g. it may be a worthwhile trade-off in case of short, simple shaders when you have some kind of restriction wrt the max number of fibers. Or if the reads are localized, a smaller number of fibers with an unrolled burst of consecutive reads may be more cache efficient.
 
Code:
loop(4)
   sum += value(x,y)
   x = ...
   y = ...
end

If sum is a register and you have a simple in-order pipeline with no renaming but non-blocking reads, then you can only have one read op in flight. By unrolling, you'll have four. How are you going to get multiple reads in flight for the case above without unrolling?

I must be missing something?
If your compiler is smart enough to unroll this loop and store value(xn, yn) in different registers (to be accumulated later), I don't see why it couldn't do exactly the same thing while keeping all the work inside the loop, except for the final sum.
"Renaming" sum is obviously not tied to unrolling the whole thing. Moreover, reading value(x, y) might require a lot of cycles (texture sampling?); in that case loop unrolling won't buy you a lot of mileage, you would need some serious fibering, and then unrolling again doesn't make much sense (unless you want to take part in an I$ miss party...)
 
In fact I think 32 registers per HW thread is a pretty reasonable number for most shaders.
That's what I've been saying. ;)
Ah I see... in this context I'd consider that more directly related to software fibering (i.e. "unrolling" statically or dynamically the implicit outer loop over pixels), rather than unrolling shader loops, which is how I interpreted your original comment. And yes, that can certainly increase register pressure, but it's not exactly hard to spill/restore the register context at a really low cost on typical fiber switches.
Actually the term I was looking for is software pipelining, not loop unrolling (although they are often used together). Sorry for the confusion. Anyway, if the latencies of the instructions are really that low then it probably doesn't help much. In which case 32 registers would indeed be hard to exhaust. 16 is too few though, so I guess it's not a terrible waste of die space either. I assume it's not that different from modern GPUs in the end, though you're not getting the balancing effect between shaders that use many registers and those that use few.
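For what it's worth, a toy C++ sketch of what I mean by software pipelining the filter loop (issue the next read before consuming the current one; purely illustrative, not tied to any real ISA):

Code:
#include <cstdio>

// Toy stand-in for a sample read with some latency behind it.
static float value(int x, int y) { return static_cast<float>(x + 2 * y); }

int main() {
    const int N = 8;
    float sum = 0.0f;

    // Software pipelining: the read for iteration i is issued before the
    // accumulate of iteration i-1, so the load latency overlaps the add
    // without fully unrolling the loop.
    float pending = value(0, 0);      // prologue: first read
    for (int i = 1; i < N; ++i) {
        float next = value(i, i);     // issue read for iteration i
        sum += pending;               // consume the read from iteration i-1
        pending = next;
    }
    sum += pending;                   // epilogue: last read

    std::printf("%f\n", sum);
    return 0;
}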
 
LRB related GDC presentations (page bottom):
http://software.intel.com/en-us/articles/intel-at-gdc/

Michael Abrash's article about LRBni on Dr. Dobb's Journal:
http://www.ddj.com/architect/216402188

LRB sure is beautiful and it is amazing to see the bright geniuses involved with its creation... this was a case of CPU designers really involving software people who are experts in the field (Abrash, Sweeney, etc.) and listening to them, working with them instead of imposing an architecture on them.
 
Some notable bits from the Dr Dobb's article:

Besides, if a program can be successfully parallelized across all those Larrabee cores, it shouldn't in principle be any more difficult to parallelize it across the threads as well. However, while this is true to a considerable extent, in actual practice issues arise because there is only one set of most core resources -- most notably caches and TLBs -- so the more threads there are performing independent tasks on a core, the more performance can suffer due to cache and TLB pressure. The graphics pipeline code on Larrabee works around this by having all the threads on each core work on the same task, using mostly shared data and code; in general, this is a fertile area for future software architecture research.
Does this imply that the model we've discussed, before, of binning/shading/back-end all running on a core, with screen-space effectively tiled per core is not what they're planning to do? This paragraph makes it sound like they'll have binning cores, shading cores and back-end cores.

vbitinterleave11pi and vbitinterleave21pi allow the interleaving of bits from two registers; vbitinterleave11pi alternates bits from the two sources, starting with bit 0 of the last source, and vbitinterleave21pi alternates one bit from the last source with two bits from the first source. Bit-interleaving is useful for generating swizzled addresses, particularly in conjunction with vinsertfield, for example in preparation for texture sample fetches (volume textures in the case of vbitinterleave21pi). The following sequence generates 16 offsets into a fully-swizzled 256x256 four-component float texture from 16 8-bit x coordinates stored in v1 and 16 8-bit y coordinates stored in v2, as in Figure 16:
Is this saying that texel addresses are formed by the core, not the texture-sampler/filter unit?
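The scalar math behind that bit interleave is just Morton-order swizzling; per lane it amounts to something like the sketch below (assuming a 256x256 texture of 16-byte four-component float texels, as in the article's example; which operand lands in the even bit positions depends on operand order, so treat this as illustrative):

Code:
#include <cstdint>
#include <cstdio>

// Interleave the low 8 bits of x and y (one in the even bit positions,
// the other in the odd ones), i.e. what vbitinterleave11pi does per element.
static uint32_t interleave_8x8(uint32_t x, uint32_t y) {
    uint32_t out = 0;
    for (int b = 0; b < 8; ++b) {
        out |= ((x >> b) & 1u) << (2 * b);
        out |= ((y >> b) & 1u) << (2 * b + 1);
    }
    return out;
}

int main() {
    uint32_t x = 3, y = 5;
    // Swizzled texel index times 16 bytes per four-component float texel
    // gives the byte offset into the fully swizzled 256x256 texture.
    uint32_t offset = interleave_8x8(x, y) * 16u;
    std::printf("offset = %u\n", offset);
    return 0;
}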

Jawed
 
Some notable bits from the Dr Dobb's article:


Does this imply that the model we've discussed, before, of binning/shading/back-end all running on a core, with screen-space effectively tiled per core is not what they're planning to do? This paragraph makes it sound like they'll have binning cores, shading cores and back-end cores.
I'm not sure if it was ever discussed in depth how the threads would be pinned.
The statement makes sense.

The more unique tasks each core attempts to do at once, the more that those tasks will thrash the shared resources.

Raster threads that are closely aligned in screen space would lead to high locality and a maximum utilization of L1 and TLB space.
Tossing in the binning code would reduce Icache capacity and send the binning thread into separate regions of memory at the expense of thrashing the data structures being used by the raster threads.
The bin setup code might be kept separate because of this.


Is this saying that texel addresses are formed by the core, not the texture-sampler/filter unit?

The exact manner by which the texture units are controlled hasn't been discussed.
Perhaps the texture units are very limited in the duties they perform.
Address calculation might not be the big consumer of time for Larrabee compared to the other areas where the texture units do shine: unaligned accesses, format handling, and low-precision math.
 
Larrabee's scatter/gather

The new Larrabee vectors look pretty interesting. I was actually surprised there were not more vector operations that were special purpose (ones that were really graphics-specific). There are perhaps some, but not many.

I found the scatter/gather instructions pretty interesting. As someone mentioned above, they do seem limited to a 32-bit offset, which could hurt them at some point down the road.

From the slides somebody pointed to above: http://pc.watch.impress.co.jp/docs/2009/0330/kaigai498_p108.jpg and http://pc.watch.impress.co.jp/docs/2009/0330/kaigai498_p125.jpg

"Scatter/gather must take a mask register... afterwards, mask will be all-zeros". This is their answer to allowing vector loads/stores to be interruptible. Say one of the memory locations has a page fault? They can actually interrupt the vector memory operation mid-way through by just adjusting the mask to reflect what has been done and what is left to be done. When they come back from the page fault, they just restart the instruction and it just handles it. Very simple; very clever.

It also says: "L1$ [first-level cache] can only handle a few access per clock." Since they didn't say exactly "two" or something, what I bet they did is bank it four ways. That would give "a few accesses per clock" on average.

It also says: "Offsets referring to the same cache line can happen on the same cycle". So, it sounds like if it does a dense vector load or store it actually goes pretty quickly, but even if you do a scatter/gather, it can take advantage of some locality, too.

When you add four threads per core to suck up various pipeline latencies (branch mis-predictions, dependent operations, cache misses) plus scatter/gather prefetching instructions (giving lots of memory-level parallelism), it seems like this could be reasonably easy to program (perhaps not to get absolute peak performance, but getting reasonable performance doesn't sound so hard).
 
The back end, Figure 8 in Seiler et al, shows a core doing a screen-space-tiled render based upon binned primitives: doing interpolation, early/late-Z, pixel shading and output merge. Looking at this list right now it seems pretty coherent in data use and so innocuous.

So I was incorrectly including binning there as it's performed by other cores/earlier.

Figure 7, the front end, shows vertex/geometry shading and assembly with rasterisation/binning, again the text implies these tasks are running within a core (though it places no limit on the number of cores thus configured). Again the combination seems reasonably innocuous.

Section 4.3 also says that rasterisation can occur in either front or back end or split across both.

Overall the paper seems to describe cores doing more than one kind of task simultaneously.

Whereas this article seems to caution that it's relatively easy to overload a core's infrastructure. At the end of the article are we seeing a reflection of this kind of overloading?:

For example, real pixel-shader code running on simulated Larrabee hardware is getting 80% of theoretical maximum performance, even after accounting for work wasted by pixels that are off the triangle but still get processed due to the use of 16-wide vector blocks.
Or should we be impressed by 80%? The simulation in this paper doesn't seem to have done any kind of on-die or off-die memory modelling. They counted bandwidth. Are they now simulating memory properly? Has that resulted in the cautionary tone?

So it's down to a question of what a task really means in these two contexts :???:

Jawed
 
I found the scatter/gather instructions pretty interesting. As someone mentioned above, they do seem limited to a 32-bit offset, which could hurt them at some point down the road.
http://www.ddj.com/architect/216402188?pgno=4 :

The address for a gather or scatter is formed from the sum of a base register and the elements of a scaled index vector register, as in Figure 12. This is the only case in which a vector register can be used to address memory. More precisely, for each element to be loaded, the address is the sum of the base register and the sign-extension to 64 bits of the corresponding element of the index vector register, optionally scaled by 2, 4, or 8. Note that the 32-bit size of the elements used for the index results in a 4 GB limit on the range for gather/scatter (or larger if scaling by 2, 4 or 8).

What if your gather targets aren't all contained within a 4 GB range? Then you need to wrap another loop around the basic gather loop, in order to step through the 4 GB ranges touched by the gather addresses, which is somewhat more complicated, but not unduly so.
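In other words, the per-element address math (a scalar C++ sketch of what the article describes):

Code:
#include <cstdint>

// Effective address for one gather/scatter element, per the article:
// the base register plus the sign-extension to 64 bits of a 32-bit
// index element, optionally scaled by 2, 4 or 8.
uint64_t gather_address(uint64_t base, int32_t index_elem, uint64_t scale) {
    int64_t extended = static_cast<int64_t>(index_elem);   // sign-extend
    return base + static_cast<uint64_t>(extended) * scale;
}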
Jawed
 
Does this imply that the model we've discussed, before, of binning/shading/back-end all running on a core, with screen-space effectively tiled per core is not what they're planning to do? This paragraph makes it sound like they'll have binning cores, shading cores and back-end cores.
I think the important message here is: they don't know yet. It doesn't come as a surprise to me; they don't have to know yet and it wouldn't be wise to try only one approach. Larrabee is a pretty unique architecture and in the end they'll just use whatever works best. Fortunately it's flexible enough to guarantee that the end solution will have high efficiency. It depends a lot on inter-core bandwidth and latency, but unlike 'fixed' GPU architectures they can work around bottlenecks at a later point and even adapt to future applications.
Is this saying that texel addresses are formed by the core, not the texture-sampler/filter unit?
It makes sense. Address generation is straightforward vector code.

I'm actually starting to think that even the fixed-function texture samplers might go away in the future. Programmable filtering is coming one day and Larrabee could already be using its generic cores for floating-point texture sampling (which mainly requires gather and MAC). If you split textures into their RGB components the (swizzled) gather reads can be highly efficient. So the things left for which you need fixed-function are very limited (and even for those I can think of instructions that would make it doable).
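To make the "gather and MAC" point concrete, here's a scalar sketch of bilinear filtering over one single-channel float plane (roughly what each of the split R/G/B planes would look like); purely illustrative, not LRBni code:

Code:
#include <cmath>

// Bilinear sample from a W x H single-channel float plane at (u, v) given
// in texel coordinates (assumes 0 <= u < W and 0 <= v < H): four
// "gathered" reads plus a handful of multiply-accumulates.
float bilinear(const float* plane, int W, int H, float u, float v) {
    int x0 = static_cast<int>(std::floor(u));
    int y0 = static_cast<int>(std::floor(v));
    float fx = u - static_cast<float>(x0);
    float fy = v - static_cast<float>(y0);
    int x1 = (x0 + 1 < W) ? x0 + 1 : x0;   // clamp at the right/bottom edge
    int y1 = (y0 + 1 < H) ? y0 + 1 : y0;

    // The four reads are independent, so in vector form they map onto a gather.
    float t00 = plane[y0 * W + x0];
    float t10 = plane[y0 * W + x1];
    float t01 = plane[y1 * W + x0];
    float t11 = plane[y1 * W + x1];

    // The weighting is just MACs.
    float top = t00 + (t10 - t00) * fx;
    float bot = t01 + (t11 - t01) * fx;
    return top + (bot - top) * fy;
}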
 
I'm actually starting to think that even the fixed-function texture samplers might go away in the future. Programmable filtering is coming one day and Larrabee could already be using its generic cores for floating-point texture sampling (which mainly requires gather and MAC).
Programmable texture filtering is already there (i.e. you do it in the pixel shader) but it doesn't make the fixed-function filtering obsolete.
If you split textures into their RGB components the (swizzled) gather reads can be highly efficient. So the things left for which you need fixed-function are very limited (and even for those I can think of instructions that would make it doable).
The fixed functionality in the texture units is already very limited but very hard to replace in software unless you are prepared to pay a huge area and power cost. Filtering is easy to replace but very hard to make cheap in software. Decompression is probably the hardest nut to crack, and with pretty much every texture - besides the ones used as render targets - being compressed, you cannot ignore it. Adaptive filtering is also nasty: you cannot limit yourself to bi/tri-linear filtering, and brute-force anisotropic filtering is too costly. If you're doing it in software you have to take a variable number of samples depending on the spatial compression, which obviously requires branching and, more often than not, incoherent branching.
 
"Scatter/gather must take a mask register... afterwards, mask will be all-zeros". This is their answer to allowing vector loads/stores to be interruptible. Say one of the memory locations has a page fault? They can actually interrupt the vector memory operation mid-way through by just adjusting the mask to reflect what has been done and what is left to be done. When they come back from the page fault, they just restart the instruction and it just handles it. Very simple; very clever.
It's a simplifying design choice, one that has been discussed as being a likely choice.
Larrabee does not implement scatter and gather as single instructions, if I recall some text I read (possibly in an article by Abrash).
Rather they have a microcode component.

The implementation details can be interesting. The microcode engine in other CPUs can potentially block instruction fetch completely for a thread, and the serialization of gathers and stores also means there are questions on the atomicity of the process.
As nice as gather and scatter are, some people did hope for atomic vector operations, as difficult as that might be to implement.

It also says: "L1$ [first-level cache] can only handle a few access per clock." Since they didn't say exactly "two" or something, what I bet they did is bank it four ways. That would give "a few accesses per clock" on average.
I'm curious if it has something to do with the fact there's a vector memory pipe and the standard 32-bit ports used by the x86 front end. Perhaps certain operations that have been masked to only use a certain number of bits can borrow the other ports?

It also says: "Offsets referring to the same cache line can happen on the same cycle". So, it sounds like if it does a dense vector load or store it actually goes pretty quickly, but even if you do a scatter/gather, it can take advantage of some locality, too.
It seems to indicate that the vector memory pipe uses whole cache-line loads as an input, and that it has a very wide memory port.
Sorting offsets after that through the already present multiplexing hardware would seem to follow.

When you add four threads per core to suck up various pipeline latencies (branch mis-predictions, dependent operations, cache misses) plus scatter/gather prefetching instructions (giving lots of memory-level parallelism), it seems like this could be reasonably easy to program (perhaps not to get absolute peak performance, but getting reasonable performance doesn't sound so hard).
It potentially can and potentially can't.

The software rasterizer being used has some interesting restrictions that do indicate areas where Larrabee's supposedly generalist architecture enforces, through practicality, certain outlines of the implementation.
Larrabee won't necessarily fail on other schemes (unless they need atomic scatter/gather for some reason), but they could still be suboptimal.

Threads of a certain type should run together and fixed to a core.
Threads should try to maintain locality as much as possible.
Threads should try to minimize the amount of invalidation of shared data.
Threads should not migrate, hardware context switches are ruinous.
It is better to have redundant work than try to arbitrate between cores.

The binned renderer design takes all of these concerns into account: it pins threads to a single core, it keeps raster threads associated with their tiles (which keeps their data set local both physically and in terms of screen space and screen data), and primitives at the edge of tiles are worked over multiple times if need be. It is possibly the optimal scheme for Larrabee.
 
...and the serialization of gathers and stores also means there are questions on the atomicity of the process.
As nice as gather and scatter are, some people did hope for atomic vector operations, as difficult as that might be to implement.

What would the use case be for truly atomic scatter/gather vector ops? I can't currently think of what computation would benefit.

It seems to indicate that the vector memory pipe uses whole cache-line loads as an input, and that it has a very wide memory port.

I'm not sure, but I think for a non-gather (that is, consecutive) vector load, it can read an entire cache block in parallel. It sounds like there is a pretty wide interface to that cache. And why not? Sure, it adds some wires and would complicate the cache layout and such, but it doesn't otherwise have much impact on the power consumption (if you're not using the extra width in a given cycle, those transistors don't fire up). Seems like a reasonable design choice to make.

It potentially can and potentially can't.

Threads of a certain type should run together and fixed to a core.
Threads should try to maintain locality as much as possible.
Threads should try to minimize the amount of invalidation of shared data.
Threads should not migrate, hardware context switches are ruinous.
It is better to have redundant work than try to arbitrate between cores.

The above list is basically true of any multi-core programming going forward. However, I wouldn't focus so much on "threads". In a task-based system (like Intel's Thread Building Blocks) the worker threads do not migrate or context switch (as there aren't any other threads in the system, there is no need to switch them). The task scheduler already helps in capturing locality. I suspect that cache-to-cache sharing latency on Larrabee is reasonably fast (perhaps faster than a miss to memory), so sharing of data shouldn't totally kill performance. Using such a task-based programming model would make it reasonable to extract reasonable performance from a Larrabee-like system without lots of crazy sophisticated fine-grained locking and such.

The binned renderer design takes all of these concerns into account... ... It is possibly the optimal scheme for Larrabee.

I'm fairly certainly this isn't a coincidence. :)
 
What would the use case be for truly atomic scatter/gather vector ops? I can't currently think of what computation would benefit.
I'm trying to track down a paper where an implementation of atomic vector operations helped increase performance significantly in certain parallel operations in a multiprocessor setup.

In more complex vector processing situations, perhaps it enables the use of more lockless programming?


The above list is basically true of any multi-core programming going forward. However, I wouldn't focus so much on "threads". In a task-based system (like Intel's Thread Building Blocks) the worker threads do not migrate or context switch (as there aren't any other threads in the system, there is no need to switch them). The task scheduler already helps in capturing locality. I suspect that cache-to-cache sharing latency on Larrabee is reasonably fast (perhaps faster than a miss to memory), so sharing of data shouldn't totally kill performance.
This raises the question of the utility of a fully generalized approach if it is then stipulated that there is only a limited number of ways it can be best used.

After all, a hammer can be used to do virtually anything so long as it's applying blunt force to a localized point.

Having a task scheduler already enforces certain assumptions about what will and will not be useful, and it likely is one of the factors used to determine what is considered "reasonably fast" when it comes to cache to cache sharing.

Using such a task-based programming model would make it reasonable to extract reasonable performance from a Larrabee-like system without lots of crazy sophisticated fine-grained locking and such.
From a reasonably detached standpoint that reasonably overlooks peculiarities of implementation that appear to be reasonable to discard because it provides reasonable performance over a reasonably wide subset of other reasonable schemes that are reasonable because they provide reasonable performance on this reasonable scheme.
 
I suspect that cache-to-cache sharing latency on Larrabee is reasonably fast (perhaps faster than a miss to memory), so sharing of data shouldn't totally kill performance.
I don't think Larrabee cache-to-cache latency is going to be low. In fact I think it's going to be fairly high (probably only slightly lower than a memory access) and depend heavily on the load, as they will contend for ring-bus access with regular memory <-> core, memory <-> TU and TU <-> core transfers and the associated CC-protocol overhead.
 
This raises the question of the utility of a fully generalized approach if it is then stipulated that there is only a limited number of ways it can be best used.

Who ever said this is fully general? No massively multicore beast (be it a traditional GPU, a multi-core, or Larrabee) is fully general. In fact, programming these multicore things and using wide vectors is tricky.

My point was that Larrabee has several attributes that will make it easier to realistically achieve high performance. I think Larrabee's inclusion of multithreading (to help hide pipeline and memory latency), gather/scatter vector operations, and cache-coherent shared memory will also make it easier on the programmer. I would say that it will be easier to extract performance from Larrabee than from, for example, Cell. Are both Cell and Larrabee "fully general"? Of course not. Yet, I would rather program Larrabee than a Cell-like system.


Having a task scheduler already enforces certain assumptions about what will and will not be useful...

I think you're underestimating how natural and flexible the task-based model is. Tasks can spawn sub-tasks, each core has its own local work queue, but threads can steal work from remote queues when they run out. Many common parallel idioms can be directly mapped to a task-based model. A good task scheduler isn't a limitation as much as an enabler.
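To be concrete about what "task-based" means here, a minimal sketch using Intel's Thread Building Blocks (just the shape of it, with made-up work in the loop body):

Code:
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<float> pixels(1920 * 1080, 0.0f);

    // The range is recursively split into tasks; each worker thread pulls
    // tasks from its own local queue and only steals from other queues when
    // it runs dry, which tends to keep neighbouring chunks (and their cache
    // lines) on the same core.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, pixels.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              pixels[i] = static_cast<float>(i) * 0.5f;
                      });
    return 0;
}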

...and it likely is one of the factors used to determine what is considered "reasonably fast" when it comes to cache to cache sharing.

I would say if cache-to-cache misses are as fast as a miss to main memory, that would be "reasonably fast". Depending on the latency of the off-chip DRAM access, it might be significantly faster. Also, multithreading helps hide the latency of cache-to-cache misses as well, so even if it is slower than we'd like, a modest amount of sharing shouldn't have that much of an impact on performance.

From a reasonably detached standpoint that reasonably overlooks peculiarities of implementation that appear to be reasonable to discard because it provides reasonable performance over a reasonably wide subset of other reasonable schemes that are reasonable because they provide reasonable performance on this reasonable scheme.

Wow.
 