C++ implementation of Larrabee new instructions

I'm trying to track down a paper where an implementation of atomic vector operations helped increase performance significantly in certain parallel operations in a multiprocessor setup.

In more complex vector processing situations, perhaps it enables the use of more lockless programming?

There was a paper from Intel at ISCA 2008 titled "Atomic Vector Operations on Chip Multiprocessors". At least in this paper, the atomic operations they are talking about allow for atomic read-modify-writes on a per-lane basis (rather than saying all the loads or stores of a single gather/scatter occur atomically). The key thing this lets you do is vectorize a loop with some sort of synchronization in it (such as a lock or a lock-free data structure). Basically, the proposal is a load-linked/store-conditional pair (as used by MIPS, Alpha, and PowerPC), but in vector form.

If that is the sort of atomic vector operation you're talking about, I too would like to see them have it. Perhaps Larrabee II will add such support.
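To make the per-lane semantics concrete, here is a scalar reference sketch of what such a vector atomic might do: each lane performs its own independent atomic read-modify-write at its own address. The function name, lane count, and calling convention are all illustrative, not anything Intel has specified; a real vector LL/SC would additionally let you retry only the lanes whose reservations were lost.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical per-lane vector atomic add, sketched as a scalar loop.
// Each lane's read-modify-write is atomic on its own; the lanes are not
// atomic with respect to each other (that is the per-lane, as opposed to
// whole-gather/scatter, interpretation discussed above).
constexpr std::size_t kLanes = 16;

inline void vatomic_add(std::atomic<std::int32_t>* const addrs[kLanes],
                        const std::int32_t vals[kLanes],
                        std::int32_t old_out[kLanes]) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        // fetch_add returns the value before the addition, like a
        // scalar atomic RMW would.
        old_out[lane] = addrs[lane]->fetch_add(vals[lane]);
    }
}
```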
 
Who ever said this is fully general? No massively multicore beast (be it a traditional GPU, a multi-core, or Larrabee) is fully general. In fact, programming these multicore things and using wide vectors is tricky.
It's implied by various statements that deemphasize any possible Larrabee performance shortfall by saying "it'll be more flexible".
The question is what they mean by that if the architecture's limitations mean that the binning rasterizer Larrabee uses is one of a few optimal solutions.

It's kind of bleh if programmers are given all the flexibility to implement the same rasterizer + an extra bell or whistle.

At least the real time raytracer proponents would have their sights set on something interesting.

My point was that Larrabee has several attributes that will make it easier to realistically achieve high performance.
In various situations, I'd also believe this to be the case.
The claim was qualified with the word "reasonable", which I'll get to later.
The relative utility of those cases won't be determined until an actual implementation is put through its paces.

Many common parallel idioms can be directly mapped to a task-based model. A good task scheduler isn't a limitation as much as an enabler.

There are certain underlying assumptions that a task model on Larrabee would suggest to a programmer that are not universally necessary, but reflect quirks of the design.

I just feel it's more honest to state that while we can decry how GPUs in various places constrain the programmer, the alternative is not unconstrained either.

I would say if cache-to-cache misses are as fast as a miss to main memory, that would be "reasonably fast".
I think it'd be faster than that. The texture units and cores talk to each other over the same interconnect the caches use to transfer data.
It wouldn't do to have the act of just asking for a texture access take several hundred cycles.

Depending on the latency of the off-chip DRAM access, it might be significantly faster. Also, multithreading helps hide the latency of cache-to-cache misses as well, so even if it's slower than we'd like, a modest amount of sharing shouldn't have that much of an impact on performance.
It can, though 4 threads isn't an embarrassment of riches when it comes to latency hiding, particularly since the threads on a core would tend to be running the exact same code and so would hit the same hot spots in the same time frame.


That was just a non-serious point about the use of the word "reasonable", which I feel can be a rather squishy and pliable term.

There was a paper from Intel at ISCA 2008 titled "Atomic Vector Operations on Chip Multiprocessors". At least in this paper, the atomic operations they are talking about allow for atomic read-modify-writes on a per-lane basis (rather than saying all the loads or stores of a single gather/scatter occur atomically). The key thing this lets you do is vectorize a loop with some sort of synchronization in it (such as a lock or a lock-free data structure). Basically, the proposal is a load-linked/store-conditional pair (as used by MIPS, Alpha, and PowerPC), but in vector form.

If that is the sort of atomic vector operation you're talking about, I too would like to see them have it. Perhaps Larrabee II will add such support.

I think that's the one.
 
Are we sure atomic vector operations are not supported on LRB, because these are core to DX11 compute?

I'll give a very practical example of why one would want atomic vector gather/scatter. If LRB's strength is the flexibility to move beyond traditional GPU-like rendering, two likely paths are raycasting and point based rendering, or a mix of both: say, raycasting for hole filling, and point based g-buffer reprojection (plus some special image space magic to deal with the problems of reprojection) to cut the number of raycasts. I'm a year or two into research on these hybrid methods. Anyway, one example of where atomic vector scatter is quite important is point rendering for g-buffer reprojection. Points get batched up by locality (from some hierarchical data structure, or screen tiles in my reprojection example above) into SIMD width groups. Then one does the following,

(1.) For each point, pack (integer_z << 32)+(point_address) into one 64bit integer. Then for each point, atomicMin vector scatter this 64bit int into a 64bit z-buffer using DX11 CS5.0 InterlockedMin().

(2.) After scattering all points, unpack the point_address, and vector gather the g-buffer contents from the previous frame.

This is just one tiny example. Vector atomics are very important IMO for the new DX11 CS or OpenCL massively parallel programming.
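The two-step scheme above can be sketched in plain C++. This is my illustration, not code from the post: pack_z_addr, unpack_addr, and atomic_min64 are made-up names, and the CAS loop stands in for InterlockedMin() since std::atomic has no fetch-min, which also matters later in the thread when emulating a missing 64-bit AtomicMin.

```cpp
#include <atomic>
#include <cstdint>

// Step (1): pack a 32-bit integer depth and a 32-bit point address into
// one 64-bit key. Putting Z in the high bits means an unsigned min over
// the packed keys picks the nearest point, with the address as tiebreak.
inline std::uint64_t pack_z_addr(std::uint32_t integer_z,
                                 std::uint32_t point_address) {
    return (static_cast<std::uint64_t>(integer_z) << 32) | point_address;
}

// Step (2): recover the winning point's address from the z-buffer entry.
inline std::uint32_t unpack_addr(std::uint64_t key) {
    return static_cast<std::uint32_t>(key);  // low 32 bits
}

// Emulated 64-bit atomic min (what InterlockedMin would do in hardware):
// retry the compare-exchange until our key is stored or a smaller key is
// already present.
inline void atomic_min64(std::atomic<std::uint64_t>& zbuf_entry,
                         std::uint64_t key) {
    std::uint64_t cur = zbuf_entry.load(std::memory_order_relaxed);
    while (key < cur &&
           !zbuf_entry.compare_exchange_weak(cur, key,
                                             std::memory_order_relaxed)) {
        // compare_exchange_weak reloads cur on failure; loop re-tests.
    }
}
```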
 
Not just sorting, but concurrent lock-free hashmaps and queues are extremely useful and fundamental for many algorithms. See Cliff Click's implementation for example (http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf) which provides a marked improvement on a 768-core machine.
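For a flavor of what these lock-free structures look like at their simplest, here is a minimal Treiber-style stack push using compare-and-swap. This is a generic textbook sketch, not Click's hashmap: pop is omitted because it additionally needs ABA protection (hazard pointers, tags, or GC), which is exactly the kind of subtlety that makes these structures hard to get right.

```cpp
#include <atomic>

// Minimal lock-free (Treiber) stack: push-only sketch.
struct Node {
    int value;
    Node* next;
};

struct LockFreeStack {
    std::atomic<Node*> head{nullptr};

    void push(Node* n) {
        // Publish n as the new head; if another thread changed head in
        // between, compare_exchange_weak refreshes n->next and we retry.
        n->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }
};
```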

In general, people using shared memory make fatal mistakes whether they're using lock-free or lock-based synchronization, and deadlocks are all too common an occurrence. There's no real 'hassle free' concurrency model except for 'shared nothing' message passing schemes, like the Actor model in Erlang, or of course, pixel shaders. :) Transactional Memory is an interesting development that permits shared atomic modifications with database-like isolation levels, but without hardware support it's not very efficient (Sun's Rock processor will have HW support, but their CPU architecture makes it easier, since it is based on speculative threading with out-of-order commits).

I'm of the opinion that 95% of developers should not be using shared memory. It is often overused in cases where process isolation or message passing would be just as good, but with a much lower defect rate. Then, for the 5% of the time where shared memory in an SMP architecture would dramatically speed up your algorithm, commit to implementing it very very carefully.

What happens when you give programmers an easy way to do concurrency (like Java threads and the synchronized keyword) is that they end up using threads to fire off tasks for everything, even in cases where it is of little to no benefit, greatly increasing the risk of race conditions and deadlocks.
 
That's why I'm not really enthusiastic about anything making lock-free programming more efficient ... replacing buggy threading code with buggy transactional code doesn't seem like progress to me, yet that seems to be where everyone is headed.

As I said before, it's disappointing Intel isn't doing much with languages.
 
(1.) For each point, pack (integer_z << 32)+(point_address) into one 64bit integer. Then for each point, atomicMin vector scatter this 64bit int into a 64bit z-buffer using DX11 CS5.0 InterlockedMin().
FUGLY! Is this really faster than simply rendering points with a 32 bit z-buffer?

In Larrabee's case, would an instruction which checks a vector of addresses for collisions and returns a collision free bitmask be enough? (Which you'd use in a loop until all collisions were resolved.) You'd still have to render in tiles of course.
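The proposed collision-check instruction can be sketched as a scalar loop: given a vector of addresses, set a bit only for the first lane touching each distinct address, so the masked-off lanes get retried on the next iteration. This is my own illustration of the idea (the quadratic lane-vs-lane compare is also roughly why, as noted below, a per-lane collision test isn't cheap in ALU terms).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical collision check: one result bit per lane, set only for the
// first lane that touches each distinct address. Lanes with a cleared bit
// collided with an earlier lane and would be retried in the next pass.
constexpr std::size_t kLanes = 16;

inline std::uint16_t collision_free_mask(const std::uint32_t addrs[kLanes]) {
    std::uint16_t mask = 0;
    for (std::size_t i = 0; i < kLanes; ++i) {
        bool first = true;
        for (std::size_t j = 0; j < i; ++j) {
            if (addrs[j] == addrs[i]) { first = false; break; }
        }
        if (first) mask |= static_cast<std::uint16_t>(1u << i);
    }
    return mask;
}
```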
 
FUGLY! Is this really faster than simply rendering points with a 32 bit z-buffer?

Actual point rendering is likely either triangle setup or vertex output limited, which leaves lots of free ALU or TEX work in the vertex shader, but a low number of points per sec (in the best case limited by a function of GPU MHz, since triangle setup is serial). Also, ALU is wasted by what ends up being a pass-through pixel shader. I'm doing point rendering now for DX10 generation cards, using the trick of doing all the math in the vertex shader to avoid the minimum 4x ALU cost (2x2 quad minimum) in the pixel shader.

At least with atomic operations, one has a chance of returning to being bandwidth bound. I've been waiting for DX11/OpenCL to transition to compute shaders (though I could start testing with CUDA if I went out and bought a GT2xx card). On GT2xx I guess you'd only be utilizing about 75% of your bandwidth for fully divergent scatters of 64bit (8byte) values (32byte minimum transaction size). Perhaps add another 15% reduction for doing an atomic operation. So the fugly packing trick is free bandwidth-wise. Without testing, I'd guess that on current top of the line NV cards the fugly atomic scatter could draw 10x more points per sec than pure rendering (might change with DX11 however)!

In Larrabee's case, would an instruction which checks a vector of addresses for collisions and returns a collision free bitmask be enough? (Which you'd use in a loop until all collisions were resolved.) You'd still have to render in tiles of course.

Yes, I guess so, as long as all hyperthreads on all cores are working on output which is ensured to not be screen space intersecting (i.e. working on different tiles, and clipping all point output to the tile). But a per lane collision test doesn't seem cheap ALU wise.
 
Ok, just to be clear, are we talking about "atomic vector ops" in the sense of "a single gather/scatter performs all its operations atomically", or "vector atomic ops" as in "support for vector operations that do a compare-and-swap or read-modify-write on each vector element"?

If we are talking about the second option above (which from some of the recent comments above seems likely), I can see the use cases. Basically, anywhere you'd want to do a scalar atomic op (lock-free data structures, accumulating, acquiring a lock, etc.) would also apply to vector atomic ops.

P.S. transactional memory is an interesting research proposal (caveat: I've published on it as well). Yet, I've never seen an idea go from some academics saying "hey this is an interesting idea" to such universal hype. Now the backlash has really kicked in. We're certainly in the "Trough of Disillusionment" of the hype cycle. Yet, if we're going to debate the merits of transactional memory, perhaps we should start a new thread...
 
Yes, I guess so, as long as all hyperthreads on all cores are working on output which is ensured to not be screen space intersecting (i.e. working on different tiles, and clipping all point output to the tile). But a per lane collision test doesn't seem cheap ALU wise.
Without a scalar (banked) interface to the memory/cache atomics don't seem cheap either. With a scalar interface things all of a sudden become a lot easier, which brings me back again to questioning Intel's decision to go for vector wide ports :)

PS. how does InterlockedCompareStore work exactly? What you really want is an atomic way to store an arbitrary amount of data based on the comparison (or at least a floating point Z value together with pixels in an arbitrary amount of multitarget framebuffers). If your solution is the best you can manage with DX11 then it's far from ideal.
 
Compare And Swap works like an atomic version of this (semi C like),

CompareAndSwap( address, compare, new ) {
    old = address[0];
    address[0] = ( old == compare ? new : old );
    return old;
}
 
If your solution is the best you can manage with DX11 then it's far from ideal.

BTW, I'm not 100% sure that AtomicMin() of 64-bit ints works on GT2xx. They do have a 64-bit Compare And Swap however according to the CUDA 2.2 docs.

The solution I'm talking about (for point drawing) is effectively to do a pre-Z pass with atomic scatter of {Z,index}, followed by a second pass where rendering of the point is conditional on gather of {Z,index}. It isn't too far from what we do now with triangles. In fact it would be easy to do hierarchical coarse Z cull with this method in software (coarse cull batches of points at a time).

The difference is that the index must be packed with the Z to avoid collisions between equal Z values. Two threads could in theory have the same Z at a point, and be running at the same time, writing to the same position in multiple places (say multiple virtual render targets). Output might then end up as a mix of parts of both points. Might be able to do a 24bit Z with an 8bit salt value to stay within a 32bit AtomicMin() operation.
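The 24bit Z + 8bit salt idea can be sketched like this (my illustration, with made-up helper names): the salt, e.g. the low bits of a thread or point id, breaks ties between equal depths so a single 32-bit AtomicMin picks exactly one winner. Keeping Z in the high bits preserves min-ordering regardless of the salt.

```cpp
#include <cstdint>

// Pack a 24-bit integer depth and an 8-bit salt into one 32-bit key.
// Z occupies the high bits, so an unsigned min over packed keys still
// orders by depth first; the salt only disambiguates equal depths.
inline std::uint32_t pack_z24_salt8(std::uint32_t z24, std::uint8_t salt) {
    return ((z24 & 0xFFFFFFu) << 8) | salt;
}

inline std::uint32_t unpack_z24(std::uint32_t key) { return key >> 8; }

inline std::uint8_t unpack_salt(std::uint32_t key) {
    return static_cast<std::uint8_t>(key & 0xFFu);
}
```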

If anyone has a better way I would love to hear about it. In theory this should be exactly the type of thing that LRB would excel at, given its caches.

There are other possible crazy ideas, such as having a deep Z buffer and adding fragments to per pixel queues, but that still involves an atomic operation per pixel.
 
I know what it does but having to give up floating point for Z and wasting time comparing irrelevant bits? Fugly.

Efficient on Larrabee? Points will as a rule not map to successive addresses ... you are word addressing on a memory subsystem optimized for vector addressing ... it can scatter/gather but it's not optimized for it. This maps better to a GPU with local shared memory (you'd have to tile of course) than Larrabee.

PS. how do you know for sure InterlockedCompareStore is Compare and Swap? That sounds more like InterlockedCompareExchange to me.
 
Efficient on Larrabee? Points will as a rule not map to successive addresses ... you are word addressing on a memory subsystem optimized for vector addressing ... it can scatter/gather but it's not optimized for it. This maps better to a GPU with local shared memory (you'd have to tile of course) than Larrabee.

You'd be word addressing if you were writing out to, say, multiple 32-bit render targets; however, if you pack into one very deep render target (perhaps 128-bit), scatter is only 1/4 as bad (on LRB). Still have to pack/unpack in the shader however... neither case is ideal. Texture units provide the cached word addressing for vector gather, hence the original suggestion of packing Z+id so that a second pass could do an efficient gather on the data for proper fully vectorized post-gather computation.

PS. how do you know for sure InterlockedCompareStore is Compare and Swap? That sounds more like InterlockedCompareExchange to me.

I don't. I was thinking InterlockedCompareStore was InterlockedCompareExchange without the return value.
 
If the designers opted to make the LOCK prefix do so, they could.

Given that only a subset of x86 allows the use of the prefix, the designers could also have decided to not use it as they saw fit.
 