Larrabee: Samples in Late 08, Products in 2H09/1H10

That sounds truly fascinating! I wonder though, how does this compare to lock-free software algorithms? They too don't acquire a real lock: they check for 'collisions' before committing their results, or roll back and try again.

Lock-free algorithms are a really good idea in some contexts, and they are the best you can do in many cases without additional hardware support. However, they are fiendishly complicated. For example, you can actually publish an academic paper if you come up with a new lock-free data structure. It is that tricky. (Then you can publish a second paper in which you formally prove it correct, but I digress.)

Recall all the x86 memory ordering issues we were talking about earlier in this thread? Well, if you write lock-free code, you really need to consider all those nasty memory ordering issues. There is also an issue called the "ABA" problem, in which a pointer to an object has been recycled and reused at just the wrong time. In such cases, doing a simple comparison of the "old" and "new" values isn't enough; you need an epoch timestamp as well.
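To make the ABA point concrete, here's a minimal sketch (my own illustration in C++, not from any particular paper) of a lock-free stack pop and the usual fix of pairing the pointer with a version counter:

```cpp
#include <atomic>
#include <cstdint>

struct Node { int value; Node* next; };

// ABA-prone pop: between loading 'top' and the CAS, another thread may pop
// 'top', free it, and push a recycled node at the same address. The CAS then
// succeeds even though 'top->next' is stale.
Node* pop_aba_prone(std::atomic<Node*>& head) {
    Node* top = head.load();
    while (top && !head.compare_exchange_weak(top, top->next)) { /* retry */ }
    return top;
}

// The usual fix: tag the pointer with a version ("epoch") counter so a
// recycled pointer no longer compares equal. This needs a double-width CAS;
// whether std::atomic does it lock-free is platform-dependent.
struct TaggedPtr { Node* ptr; std::uint64_t tag; };

Node* pop_tagged(std::atomic<TaggedPtr>& head) {
    TaggedPtr old_top = head.load();
    while (old_top.ptr) {
        TaggedPtr new_top{ old_top.ptr->next, old_top.tag + 1 };
        if (head.compare_exchange_weak(old_top, new_top))
            return old_top.ptr;   // success; on failure old_top is refreshed
    }
    return nullptr;
}
```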

Also, since the programmers of Larrabee are very experienced I'm not sure if it's necessary to make it software friendly to that degree (or will it be opened to the public)?

We won't know until more hardware with speculative locking exists, but I bet it would help even experienced programmers. The problem is, if you make locks too fine grained, the lock acquire/release overhead kills you. If you make locks too coarse, the lock contention kills you. In fact, the right granularity of locking likely depends on the number of cores and the relative cost of acquiring a lock (and cache-to-cache memory sharing). Even for experienced programmers, this is a tough balance to strike.

The promise of speculative locking is to break this fine-grain/coarse-grain locking tradeoff. The hope is that it will give the parallelism of fine-grain locking with the low overhead, programming simplicity, and maintainability of coarse-grain locking.

Consider a simple binary tree. Sure, you could put locks on every node and do hand-over-hand locking to get lots of parallelism. In fact, you could use reader/writer locks to allow multiple read-only lookups of the tree to proceed in parallel.

But, in a system with speculative locking, a simple tree with a single lock will likely be both faster and more concurrent.

Let's just consider the read-only case. Even if you're just grabbing a reader/writer lock for read-only access, you're still writing a shared memory location. Thus, you're going to take misses on those blocks as you obtain write permission to them while walking the tree. Under SLE, all of the tree locations could be cached in read-only mode by all the processors' caches, allowing each of them, in essence, to walk its own private copy of the tree. Of course, if an occasional update occurs, speculative locking will still do the right thing.

In the case of mixed updates and lookups, speculative locking still eliminates the overhead of actually acquiring the locks as you traverse down the tree. It also gives more parallelism, because it is unlikely that two threads are modifying the exact same part of the tree at the same time (assuming a big enough tree).
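To make that concrete, here's roughly the structure I have in mind (just an illustrative sketch with a made-up tree type). The code is the dead-simple single-lock version; the speculation hardware is what would make concurrent lookups of it scale:

```cpp
#include <mutex>

struct TreeNode { int key; int value; TreeNode* left; TreeNode* right; };

struct CoarseLockedTree {
    std::mutex lock;          // one lock for the whole tree
    TreeNode*  root = nullptr;

    // Under conventional locking, every lookup serializes on 'lock' and the
    // line holding it ping-pongs between cores. Under speculative lock
    // elision the hardware would skip the lock write, let many cores walk
    // read-only cached copies of the tree in parallel, and fall back to
    // really acquiring the lock only when a conflict is detected.
    bool lookup(int key, int& out) {
        std::lock_guard<std::mutex> guard(lock);
        for (TreeNode* n = root; n != nullptr; n = (key < n->key) ? n->left : n->right) {
            if (key == n->key) { out = n->value; return true; }
        }
        return false;
    }
};
```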
 
Scatter/gather vector operations do not need to be made 'fast'. They are not there for that purpose, they are there to make extracting memory level parallelism cheap. ... It might not be faster but it's certainly much cheaper both from a power and transistor-budget perspective.

Ah, yes, of course, that makes perfect sense to me (once pointed out to me, that is). It does seem that scatter/gather would be a great way to get lots of memory-level parallelism (MLP) cheaply.

In a word: texturing. Texture access patterns are not regular especially when using higher levels of anisotropic filtering.

I see. Certainly texturing is a key GPU operation. Maybe someday I should learn about what texturing actually does... :) It really isn't something for which I have good intuition.
 
Perhaps not now, but the average programmer of the future (? years from now) will have to. Even in single processor machines, it hasn't been a good idea to stay single-threaded (or single processing) since the advent of time sharing. Eventually your single thread/process is going to block on IO (perhaps even just a virtual memory page read), and the CPU will go idle (when you could still be doing useful work).

While threaded (or multi-processing via fork) programming might be a relatively "new" topic for game developers, it has been a way of life for most "server" programmers for a long time. There is also a definite trend towards lighter-weight locking. Linux, for example, provides the futex system call: in the uncontended case locking is done purely in userland (read: extremely fast), and an actual system call (read: slow) is only made on contention. BTW, while Unix-related, here's a great overview of methods for concurrent programming: http://www.kegel.com/c10k.html.
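For the curious, the fast path really is just a couple of user-space atomic operations; the kernel only gets involved through futex() when there is contention. Here's a rough Linux-only sketch (simplified from the scheme in Drepper's "Futexes Are Tricky" paper, so treat it as illustrative rather than production code):

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
static std::atomic<int> state{0};

static long futex(std::atomic<int>* addr, int op, int val) {
    // Relies on std::atomic<int> having the same layout as int (true on Linux).
    return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val, nullptr, nullptr, 0);
}

void lock() {
    int c = 0;
    if (state.compare_exchange_strong(c, 1))
        return;                         // uncontended fast path: no syscall
    if (c != 2)
        c = state.exchange(2);          // mark contended (may acquire if just freed)
    while (c != 0) {                    // c == 0 means the exchange grabbed the lock
        futex(&state, FUTEX_WAIT, 2);   // sleep in the kernel until woken
        c = state.exchange(2);
    }
}

void unlock() {
    if (state.exchange(0) == 2)         // only enter the kernel if someone waits
        futex(&state, FUTEX_WAKE, 1);
}
```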

As I have done both (multithreading to hide IO latency for a plant data processing system; multithreading to get the most out of a multicore system for our upcoming game), I need to say that even if it sounds similar, there are big differences. Another lesson I learned is that non-blocking IO channels outperform blocking IO channels with multiple threads on a single-core CPU.

Sure, you can take some knowledge from one side to the other, but it is better to question everything you believe you know about multithreading.
 
Sure, you can take some knowledge from one side to the other, but it is better to question everything you believe you know about multithreading.

Any gems to share about what you learned about multithreaded programming for multicores? What was hard? What was easy? I'd love to hear some war stories from the trenches. (Seems relevant enough to the GPU vs CPU discussion in this thread in the context of Larrabee's programming model)
 
I see. Certainly texturing is a key GPU operation. Maybe someday I should learn about what texturing actually does... :) It really isn't something for which I have good intuition.

It does a lot of work, and always in the same sequence of steps (a rough sketch of one of these steps follows the list):

- Resolve the addressing mode for coordinates outside the 0-1 range.
- Use the texture coordinates and gradients to calculate the LOD and the line of anisotropy.
- Calculate the number of bilinear samples needed to fill the line of anisotropy (across 2 LOD levels if necessary) and their weights.
- Calculate the memory addresses of all 4 texels that are needed for one sample.
- Fetch the texels. Decompress and/or convert (data formats/gamma) if necessary.
- Blend the texels together.
- Blend the samples together.
- Convert to the shader's floating-point format if necessary.
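To give a feel for why this is left to fixed-function hardware, here is a rough software sketch of just one of those steps, a single bilinear sample (addressing modes other than repeat, LOD selection, anisotropy, and format decompression are all omitted):

```cpp
#include <cstdint>
#include <cmath>

struct Texture {
    const std::uint32_t* texels;  // RGBA8, row-major
    int width, height;
};

static std::uint32_t fetch_wrapped(const Texture& t, int x, int y) {
    // "Repeat" addressing mode: wrap coordinates outside the texture.
    x = ((x % t.width)  + t.width)  % t.width;
    y = ((y % t.height) + t.height) % t.height;
    return t.texels[y * t.width + x];
}

static float channel(std::uint32_t texel, int c) {
    return float((texel >> (8 * c)) & 0xFF);
}

// One bilinear sample at normalized coordinates (u, v): 4 texel fetches and
// 3 lerps per channel. A real texture unit does this (and much more) for
// many pixels in parallel, every clock.
void bilinear_sample(const Texture& t, float u, float v, float rgba[4]) {
    float x = u * t.width  - 0.5f;
    float y = v * t.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;

    std::uint32_t t00 = fetch_wrapped(t, x0,     y0);
    std::uint32_t t10 = fetch_wrapped(t, x0 + 1, y0);
    std::uint32_t t01 = fetch_wrapped(t, x0,     y0 + 1);
    std::uint32_t t11 = fetch_wrapped(t, x0 + 1, y0 + 1);

    for (int c = 0; c < 4; ++c) {
        float top    = channel(t00, c) * (1 - fx) + channel(t10, c) * fx;
        float bottom = channel(t01, c) * (1 - fx) + channel(t11, c) * fx;
        rgba[c] = top * (1 - fy) + bottom * fy;
    }
}
```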

That is why the texturing units are among the last large fixed-function blocks in modern GPUs. The next fixed block is the rasterizer, which converts the vertices to pixels and subsamples. Finally we have the output mergers (aka raster operators/ROPs), which do the final depth test, blending and stencil operations.

But it looks like the output mergers could be the next fixed-function unit to be dropped.
 
Any gems to share about what you learned about multithreaded programming for multicores? What was hard? What was easy? I'd love to hear some war stories from the trenches. (Seems relevant enough to the GPU vs CPU discussion in this thread in the context of Larrabee's programming model)

I don't think that my war stories are different from what other teams have learned. Maybe I had an easier start, as I already knew enough about threads and synchronization. But when it came to squeezing the most out of the CPU, it was more the classical trial and error with different approaches like OpenMP, thread pools, …
 
It is similar, but speculative locking is much more general. In load-link/store-conditional (LL/SC), you're acting on a single address, and checking that single address before performing the conditional store. In addition, some ISAs (such as Alpha, but not PowerPC) don't even allow any other memory operations between the LL and SC, so it is really limited to constructing simple atomic operations (swap, compare-and-swap, atomic increment), the kind that are primitives under other ISAs such as x86.

In speculative locking (and transactional memory), you can start speculating and perform hundreds or thousands of instructions on hundreds of cache blocks. For example, a search of a binary tree or a lookup in some more sophisticated linked-list or graph structure. If any other processor modifies something you've read (or reads something you've modified), the hardware detects it and recovers by doing the right thing.
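To put the contrast into code form (an illustrative sketch; spec_begin/spec_commit are made-up placeholders for whatever interface speculative hardware would expose, not a real API):

```cpp
#include <atomic>

// What LL/SC (or x86 lock cmpxchg) buys you: one atomic read-modify-write
// on a single memory location.
int atomic_increment(std::atomic<int>& counter) {
    int old_val = counter.load();
    while (!counter.compare_exchange_weak(old_val, old_val + 1)) { /* retry */ }
    return old_val;
}

// What speculative locking / transactional memory is after: an arbitrary
// region touching many locations and cache blocks. The two functions below
// are empty placeholders standing in for "begin hardware speculation" and
// "commit (or abort and retry)" -- no single CAS can make these two writes
// atomic together.
void spec_begin()  { /* placeholder */ }
void spec_commit() { /* placeholder */ }

void transfer(long& from, long& to, long amount) {
    spec_begin();     // hardware starts tracking the blocks we read/write
    from -= amount;   // a conflicting access by another core to 'from'...
    to   += amount;   // ...or 'to' triggers an abort and a replay
    spec_commit();    // otherwise both updates become visible atomically
}
```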

The key observation is that locks are mostly not needed, but they have to be used just in case.

"Doing the right thing" is a deceptively simple way to describe it, no ?

What do you do on a conflict in the critical section ? Replay from the entry point ? What about modified memory state?

This paper adds support in the write buffers for speculative stores. Write-buffers are expensive and thus a limited resource. That's why we have special stores today that bypass them (non-temporal stores). Using speculative lock elision makes non-temporal stores illegal instructions within the critical section.

It also relies heavily on the cache coherence protocol; specifically, it forces all cache lines touched in the critical section into an exclusive state in order to be able to observe accesses to these lines from other contexts. That makes it at least as expensive as a simple test-and-set instruction to acquire a lock in the first place for all but the most trivial (academic) cases.

Note that the real cost of test-and-set is acquiring exclusive access to the cache line holding the lock in the first place; the subsequent read-modify-write is dirt cheap in comparison. Speculative lock elision has the exact same cost in the best case.

Transactional memory as a software concept is valid IMO. Fast locking (in hardware) is essential.

Cheers
 
I think the key difference is that speculative locking uses the familiar threads-and-locks model, but makes it faster by (1) reducing lock contention and (2) reducing lock acquisition overhead.

One obvious question here is how much fine-grain locking is really needed, or whether it just serves the purpose of making things easier for programmers. Even with non-GPU programming, fine-grain locking is a sure sign of an inefficient (and non-scalable) parallel algorithm.

The Intel approach seems to be geared towards better latency and fine-grained synchronization at the expense of a reduction in computational power for a given number of transistors. Whereas the GPU approach has favored computational power at the expense of high latency and coarse-grained synchronization.

IMHO, in some ways Intel has been a victim of its own success, and has previously been unable to break away from the mainstream trends it helped establish. IA-64, for example, broke binary x86 compatibility and never really caught on. I think we have already established that the ISA of Larrabee is bound to be different from x86 (16-wide vectors, must have some kind of texture gather, etc). While Larrabee might have an advantage in performance/watt for some kinds of GPGPU work, it sure isn't going to have a performance/watt advantage in rendering, simply because rendering doesn't require many of the features on which Larrabee is spending its transistors.

Likewise it seems as if all GPU vendors are currently tied to the path they have already established (mainly the DX standard). From a programming perspective, can any major developers venture into a new programming model for Larrabee to extract the true performance of the chip?

It is somewhat sad, really, that we will see 3 primary awesome GPGPU APIs from the 3 major vendors (CUDA, CTM, and whatever Intel builds), none of which is portable, leaving each as only a solution for niche markets.
 
"Doing the right thing" is a deceptively simple way to describe it, no ?

I was intentionally being vague as I was trying to describe the high-level idea. I'll give more details in this post.

There are two issues in speculative locking (and transactional memory): conflict detection and version management.

Conflict detection leverages the cache coherence protocol. When you read a block while speculating, you set a bit in the tag array that indicates it has been "speculatively read". You only need read-only permission to read such a block. When you write a block, you set the corresponding "speculatively written" bit. A conflict occurs when another processor comes along and tries to read a block you've written or write a block you've read. In *both* cases, the coherence protocol would need to send a message that invalidates the block in the cache of the processor that has accessed it speculatively. Once a conflict happens, some sort of abort likely follows (who should abort to give the highest performance is also a consideration, but not something I'm going to discuss right now). That is conflict detection.
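If it helps to see the bookkeeping spelled out, here is a grossly simplified software model of what each cache would track (my own sketch; the real thing is a couple of extra tag bits plus coherence-protocol plumbing, not code):

```cpp
#include <unordered_map>

enum class CoherenceState { Invalid, Shared, Exclusive };

struct LineMeta {
    CoherenceState state = CoherenceState::Invalid;
    bool spec_read  = false;   // set on a read inside the speculative region
    bool spec_write = false;   // set on a write inside the speculative region
};

struct SpeculatingCache {
    std::unordered_map<long, LineMeta> lines;   // keyed by block address

    // Called when a coherence request from another core arrives. Returns
    // true if the request conflicts with our ongoing speculation.
    bool incoming_request(long block_addr, bool other_is_write) const {
        auto it = lines.find(block_addr);
        if (it == lines.end()) return false;
        const LineMeta& m = it->second;
        // Conflict: they write something we've read, or they touch (read or
        // write) something we've written. Someone then has to abort.
        return (other_is_write && m.spec_read) || m.spec_write;
    }
};
```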

The second component is version management. Here you have two choices.

One choice is to do lazy versioning. In such a system all speculative writes are written to a "speculation buffer". If the speculation aborts, you haven't modified anything in the cache, so you just restore the register checkpoint and discard what was in the speculation buffer. On a commit, you atomically copy everything from the speculation buffer into the data cache. This is done atomically by not ingesting any incoming coherence protocol messages until the commit is complete. Note that the primary data cache will already have write permissions to all the blocks (something that was obtained before the store retired into the speculation buffer), so the commit should be reasonably quick.

The other choice for version management is eager version management. One way to implement this is to write all speculative store values into the cache. The first time you write to a block in the speculative region, you check to see if the block is dirty. If the block is dirty, the processor "cleans" it by writing it back to the second level of cache. This allows the processor to then write the speculative value right into the cache. On a commit, you just flash-clear all the read/write bits in the cache, and the newly-committed data is already in the cache. On an abort, you flash invalidate all speculatively written blocks in the cache. This means that any read to these blocks will miss and force the pre-speculation value to be pulled in from the second-level cache. In this approach, you don't need any sort of speculation buffer (the L1 and L2 caches are working together to buffer new and old values, respectively).
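Continuing the same toy model, eager version management amounts to something like this (again just a sketch of the idea, with the L2 write-back elided):

```cpp
#include <cstdint>
#include <unordered_map>

struct L1Line {
    std::uint8_t data[64] = {};
    bool valid = true, dirty = false, spec_read = false, spec_write = false;
};

struct EagerVersioning {
    std::unordered_map<long, L1Line> l1;   // new (speculative) values live here

    void speculative_store(long addr, const std::uint8_t* bytes) {
        L1Line& line = l1[addr];
        if (line.dirty && !line.spec_write) {
            // First speculative write to a dirty block: "clean" it by writing
            // the old value back to the L2 (elided) so it can be recovered.
        }
        for (int i = 0; i < 64; ++i) line.data[i] = bytes[i];
        line.dirty = line.spec_write = true;
    }

    void commit() {   // flash-clear the bits; committed data is already in the L1
        for (auto& kv : l1) kv.second.spec_read = kv.second.spec_write = false;
    }

    void abort() {    // flash-invalidate speculatively written blocks; the next
        for (auto& kv : l1)                 // read refills the old value from L2
            if (kv.second.spec_write) kv.second.valid = false;
    }
};
```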

Ok, with all that said, let me respond to your specific comments:

What do you do on a conflict in the critical section ? Replay from the entry point ? What about modified memory state?

You throw away the speculative state and restore the pre-speculative state. Register values are restored from a checkpoint. You then replay again starting at the entry point.

This paper adds support in the write buffers for speculative stores. Write-buffers are expensive and thus a limited resource.

Write buffers are just used to hold blocks while they are gathering permissions from the coherence protocol. (It's one of the hardware optimizations that cause the instruction re-ordering and memory ordering issues discussed earlier in this thread.) There aren't that many of them, but they aren't an especially scarce resource either. That said, even the paper you reference does talk about the possibility of buffering the speculative state in the cache (rather than a separate buffer).

That's why we have special stores today that bypass them (non-temporal stores).

Non-temporal stores bypass the primary *data cache*. They are just a hint to the hardware to not bother caching them.

Using speculative lock elision makes non-temporal stores illegal instructions within the critical section.

Nope. The hardware is always allowed to just ignore the non-temporal nature of a store if it wishes. It really is just a performance hint to the hardware.
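(For reference, this is the sort of thing I mean; the SSE streaming-store intrinsic is explicitly just a hint, and nothing stops hardware from treating it as an ordinary store. A trivial sketch, nothing Larrabee-specific:)

```cpp
#include <cstddef>
#include <xmmintrin.h>   // _mm_stream_ps: SSE non-temporal store

// Fill a large, 16-byte-aligned buffer without (hopefully) polluting the
// data cache. The "non-temporal" part is purely a hint; an implementation
// may ignore it and cache the stores anyway.
void fill_nt(float* dst, std::size_t count, float value) {
    __m128 v = _mm_set1_ps(value);
    for (std::size_t i = 0; i + 4 <= count; i += 4)
        _mm_stream_ps(dst + i, v);
    _mm_sfence();   // order the streaming stores before later stores
}
```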

It also relies heavily on the cache coherence protocol; specifically, it forces all cache lines touched in the critical section into an exclusive state...

Nope. You only need the block in read-only (aka "shared") state if you only read it. Only the blocks that are *written* speculatively need to be in exclusive state.

...in order to be able to observe accesses to these lines from other contexts.

If you have a read-only copy of the block, before some other thread writes it, it must invalidate you. That is the fundamental rule of coherence. As such, you will see an invalidation (allowing you to detect a conflict) if some other processor attempts to write the block.

Note that the real cost of test-and-set is acquiring exclusive access to the cache line holding the lock in the first place; the subsequent read-modify-write is dirt cheap in comparison.

This I agree with. It is the miss that is expensive. However, as I explained, you only need read-only access to blocks that haven't been written. As all the cores can keep blocks in read-only mode, that means you don't need these locations to bounce back and forth (as you would need with a lock).

Speculative lock elision has the exact same cost in the best case.

As described above: Nope.
 
Likewise it seems as if all GPU vendors are currently tied to the path they have already established (mainly the DX standard). From a programming perspective, can any major developers venture into a new programming model for Larrabee to extract the true performance of the chip?

The “DX standard” has changed in the past. It can change again to fit the hardware. The only problem for new hardware is that it needs to be at least compatible in some way with the older DX versions.

I am sure that you can gain more performance if you go deep, but this may only be an option if your software is special enough to dictate the hardware it needs.

It is somewhat sad, really, that we will see 3 primary awesome GPGPU APIs from the 3 major vendors (CUDA, CTM, and whatever Intel builds), none of which is portable, leaving each as only a solution for niche markets.

Right. We need a common API for this kind of processor, sooner rather than later.
 
One obvious question here is how much fine-grain locking is really needed, or whether it just serves the purpose of making things easier for programmers. Even with non-GPU programming, fine-grain locking is a sure sign of an inefficient (and non-scalable) parallel algorithm.

Let me give a simple example. Consider a software queue between two threads doing some sort of pipelined computation (or whatever). It seems you have a few choices to make this fast: (1) you could build a special-purpose hardware queue, (2) use a software queue with a single lock, (3) use lock-free algorithms to build a fast concurrent queue, or (4) use a single lock for the queue with hardware support for speculative locking.

Option #1 requires special-purpose hardware. Fast, but special-purpose. Option #2 prevents concurrent enqueue and dequeue on the queue. Even worse, the lock variable will repeatedly ping-pong back and forth between the caches. Option #3 is a pretty good solution, but is somewhat complicated to get right. Like option #3, option #4 will give you concurrent enqueue/dequeue and reasonably low overheads. It is also the easiest of all four options for the programmer.
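For concreteness, option #2 is nothing more than this (a bare-bones sketch; for speculation to really let both ends proceed concurrently you'd also want the hot fields of the container on separate cache lines, but that's a detail):

```cpp
#include <deque>
#include <mutex>
#include <optional>

template <typename T>
class LockedQueue {
    std::mutex    lock;      // option #2: one lock for the whole queue
    std::deque<T> items;
public:
    void enqueue(T value) {
        std::lock_guard<std::mutex> guard(lock);
        items.push_back(std::move(value));
    }
    std::optional<T> dequeue() {
        std::lock_guard<std::mutex> guard(lock);
        if (items.empty()) return std::nullopt;
        T value = std::move(items.front());
        items.pop_front();
        return value;
    }
};
```

Under plain locking, the mutex serializes both ends and its cache line ping-pongs; under speculative locking, the hope is that exactly this code gets the concurrency of option #3 essentially for free.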

The Intel approach seems to be geared towards better latency and fine-grained synchronization at the expense of a reduction in computational power for a given number of transistors. Whereas the GPU approach has favored computational power at the expense of high latency and coarse-grained synchronization.

Let me fan the flames further. Perhaps Intel believes that better latency and fine-grained synchronization can enable algorithms that are more efficient in terms of the raw computation they need. Maybe they can reduce the number of operations needed by doing more sophisticated Z-sorting, better culling, or whatever else. If they can save enough of the raw FLOPS (or memory bandwidth) needed to do the same computation, maybe the 10% of the die area and power budget spent on hardware-managed caches, speculative locking, and cache coherence really *does* make sense (for GPUs, not just for GPGPU).

The GPUs are really tackling graphics by brute forcing them using lots of FLOP throughput, fixed-function hardware blocks, and lots of memory bandwidth. Perhaps there is a "work smarter, not harder" way of doing graphics?

Ok, I'm out of my area of expertise here. Flame away. :)

IA-64, for example, broke binary x86 compatibility and never really caught on.

IA-64 was a disaster from the start. Back in graduate school in 1996 or 1997 when they released some of the ISA, we all just scratched our heads and said: "WHAT are they THINKING?". It was clear to the academics I knew at the time that IA-64---at least from a technical point of view---solved none of the important problems in chip design. That said, IA-64 single-handedly scared many of the high-end RISC vendors out of business, so maybe that is a little silver lining to what is otherwise an unmitigated disaster.

From a programming perspective, can any major developers venture into a new programming model for Larrabee to extract the true performance of the chip? ... It is somewhat sad, really, that we will see 3 primary awesome GPGPU APIs from the 3 major vendors (CUDA, CTM, and whatever Intel builds), none of which is portable, leaving each as only a solution for niche markets.

I suspect that we're going to see lots of innovation in terms of GPGPU and even GPU APIs over the next few years. I think we're going to see lots of different options, and who knows which model (or, more likely, models) will be most embraced. It will be interesting.
 
The GPUs are really tackling graphics by brute forcing them using lots of FLOP throughput, fixed-function hardware blocks, and lots of memory bandwidth. Perhaps there is a "work smarter, not harder" way of doing graphics?
GPUs try hard to be efficient at what they do. While it's certainly true that you can always do better, modern GPUs employ many techniques to avoid unnecessary memory reads/writes and computations.
 
I don't think that my war stories are different from what other teams have learned. Maybe I had an easier start, as I already knew enough about threads and synchronization. But when it came to squeezing the most out of the CPU, it was more the classical trial and error with different approaches like OpenMP, thread pools, …

You definitely have to think differently when programming on Windows vs Linux/Solaris/Unix, in that Windows is heavyweight for many of its kernel interfaces (minimizing threads can become important). Just the number of options is bound to confuse anyone at the beginning: Mutex vs CriticalSection, Threads vs Fibers, Overlapped IO vs blocking, etc. With server programming you are trying to maximize your IO throughput; with games you are mostly trying to ensure a stable FPS (more of a scheduling-ordering and latency problem; isn't this why CELL developers tend to lock down SPUs to a specific process?). On Windows there are many tools to coerce the scheduler to do what you want, such as adjusting thread priority, adjusting thread CPU affinity, forcing thread CPU affinity, and ensuring a thread blocks or yields before the thread's quantum elapses.
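For instance, the kind of scheduler coercion I mean looks roughly like this (Win32, minimal sketch, error handling omitted):

```cpp
#include <windows.h>

// Pin the calling thread to core 1 and bump its priority, so the scheduler
// is less likely to migrate or preempt it in the middle of a frame.
void pin_and_boost_current_thread() {
    SetThreadAffinityMask(GetCurrentThread(), 1 << 1);   // core 1 only
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
}

// Give up the rest of the quantum instead of spinning when there is nothing
// left to do this frame.
void be_polite() {
    SwitchToThread();   // yields to any ready thread on this processor
}
```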

Be interesting to see what they do with Larrabee's GPGPU API for threading and synchronization...
 
For those drawing parallels to SGI you might want to consider that SGI served a low-volume niche market.

NVIDIA, on the other hand, is a mass-market company and an extremely efficient one at that. Compare NVIDIA's gross margin to ATI's pre-AMD and you will note gross margin differences rarely exhibited by two direct competitors.

Another way of looking at Arun's insightful ASP analysis is to note that the ASP for an NVIDIA chip is somewhere in the low $30s. I don't know about the average die size of GPUs versus CPUs, but it would be surprising if GPUs weren't bigger. So as of now Intel might be a node or a half-node ahead, but it is likely far behind in cost competitiveness against the combination of NVIDIA and TSMC. Put another way, consider how high NVIDIA's margins would be if they were selling the same chips on average for $150 or $200 (stratospheric).

Then there is probably the desire in the industry for a real competitor to Intel. So as long as NVIDIA is helping add technological and economic value to the industry, it's hard to imagine PC makers being completely steered by Intel in terms of bundling at this juncture. NVIDIA is, after all, the only financially strong alternative to Intel. This is why I'm betting NVIDIA will see strong uptake in IGPs for Intel systems well before Intel launches Larrabee.
 
Hi Architecture Professor!

I really don't know much about the fine details you are talking about, at least not yet :)... And what I ask will seem a bit off-topic, but I would really appreciate it if you could provide some insight :)

I can see that you are really insightful about Larrabee, but it would be really nice if you had a similar product to compare it against. So, do you know anything about Fusion? If you do, can you compare the architectures of both? It would be really nice, since up to now Larrabee seems so innovative, as you present it, that it seems doomed to total failure. So, by comparing the two competitors (Fusion vs. Larrabee), it would be nice to know if this is really the market trend, or just another delusional grand project from Intel, à la IA-64.

PS: The way you talk, it seems that Larrabee will be a competitor to Fusion :S
 
Hi Architecture Professor!

I really don't know much about the fine details you are talking about, at least not yet :)... And what I ask will seem a bit off-topic, but I would really appreciate it if you could provide some insight :)

I can see that you are really insightful about Larrabee, but it would be really nice if you had a similar product to compare it against. So, do you know anything about Fusion? If you do, can you compare the architectures of both? It would be really nice, since up to now Larrabee seems so innovative, as you present it, that it seems doomed to total failure. So, by comparing the two competitors (Fusion vs. Larrabee), it would be nice to know if this is really the market trend, or just another delusional grand project from Intel, à la IA-64.

PS: The way you talk, it seems that Larrabee will be a competitor to Fusion :S
From what I know about both, they are completely different. Fusion is a CPU and a GPU on one die (or is it one package? I don't remember), but still separate units. Larrabee is a different beast: each core on Larrabee will be far weaker than your average CPU core, but it is still a CPU core capable of doing CPU things as well as GPU things via wide vector operations.
 
For IGPs, the 'easy' solution is embedded memory in my opinion (ideally Z-RAM, but hey, I'll survive if it's 'just' eDRAM). Of course, knowing me, I'd be capable of degenerating this thread into an embedded memory lovefest, so I'll just keep my mouth shut here instead...

Or something else that doesn't necessarily need embedded memory and would derail the thread just the same. If you need X amount of transistors for embedded RAM in order to give a hypothetical IGP more RAM and bandwidth, it's not necessarily the easier solution.
 
I can see that you are really insightful about Larrabee, but it would be really nice if you had a similar product to compare it against.
Actually Larrabee will look a lot more like the UltraSPARC T2 than Fusion processors. You can find more about it here:

http://www.opensparc.net/pubs/preszo/06/04-Sun-Golla.pdf
http://www.sun.com/processors/UltraSPARC-T2/index.xml
http://www.opensparc.net/opensparc-t2/index.html

The main difference between the T2 and Larrabee is that the T2 is heavily geared towards server usage so it integrates encryption engines, dual 10-Gigabit ethernet ports and an 8x PCI-Express port. Larrabee is much more likely to integrate graphics-related hardware like fixed-function texture samplers/filters and the like.
 
Or something else that doesn't necessarily need embedded memory and would derail the thread just the same. If you need X amount of transistors for embedded RAM in order to give a hypothetical IGP more RAM and bandwidth, it's not necessarily the easier solution.
Indeed - let's end this discussion right here before it's too late though! :)

Talking of such things: is it just me or might compression (both DXTC & framebuffer) be pretty hard to do efficiently on Larrabee? I sure as hell hope they're planning on fixed-function hardware for that, because it seems to me using programmable cores for that isn't a particularly bright idea due to the sheer overhead. But then again, maybe the scalar path in Larrabee is good enough to handle that decently, hmm. Would help to know exactly what tricks are employed in a modern GPU, I guess!
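For what it's worth, the DXT1 side is small-but-fiddly integer work per 4x4 block; a quick decode sketch from the public format description (details worth double-checking) gives a feel for what the cores would be doing per block:

```cpp
#include <cstdint>

// Expand a 16-bit RGB565 color to 8 bits per channel.
static void rgb565_to_rgb888(std::uint16_t c, std::uint8_t rgb[3]) {
    rgb[0] = std::uint8_t(((c >> 11) & 0x1F) * 255 / 31);
    rgb[1] = std::uint8_t(((c >> 5)  & 0x3F) * 255 / 63);
    rgb[2] = std::uint8_t(( c        & 0x1F) * 255 / 31);
}

// Decode one 8-byte DXT1 block into a 4x4 tile of RGBA8 texels.
void decode_dxt1_block(const std::uint8_t* block, std::uint8_t out[16][4]) {
    std::uint16_t c0 = std::uint16_t(block[0] | (block[1] << 8));
    std::uint16_t c1 = std::uint16_t(block[2] | (block[3] << 8));
    std::uint32_t indices = std::uint32_t(block[4])
                          | (std::uint32_t(block[5]) << 8)
                          | (std::uint32_t(block[6]) << 16)
                          | (std::uint32_t(block[7]) << 24);

    std::uint8_t palette[4][4];                  // 4 RGBA palette entries
    rgb565_to_rgb888(c0, palette[0]);
    rgb565_to_rgb888(c1, palette[1]);
    for (int ch = 0; ch < 3; ++ch) {
        if (c0 > c1) {                           // 4-color mode: two interpolated colors
            palette[2][ch] = std::uint8_t((2 * palette[0][ch] + palette[1][ch]) / 3);
            palette[3][ch] = std::uint8_t((palette[0][ch] + 2 * palette[1][ch]) / 3);
        } else {                                 // 3-color mode: midpoint + transparent black
            palette[2][ch] = std::uint8_t((palette[0][ch] + palette[1][ch]) / 2);
            palette[3][ch] = 0;
        }
    }
    palette[0][3] = palette[1][3] = palette[2][3] = 255;
    palette[3][3] = std::uint8_t((c0 > c1) ? 255 : 0);

    for (int i = 0; i < 16; ++i) {               // 2-bit palette index per texel
        std::uint32_t idx = (indices >> (2 * i)) & 0x3;
        for (int ch = 0; ch < 4; ++ch) out[i][ch] = palette[idx][ch];
    }
}
```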
 
Indeed - let's end this discussion right here before it's too late though! :)

Talking of such things: is it just me or might compression (both DXTC & framebuffer) be pretty hard to do efficiently on Larrabee? I sure as hell hope they're planning on fixed-function hardware for that, because it seems to me using programmable cores for that isn't a particularly bright idea due to the sheer overhead. But then again, maybe the scalar path in Larrabee is good enough to handle that decently, hmm. Would help to know exactly what tricks are employed in a modern GPU, I guess!

What kind of algorithms are used for DXTC and framebuffer based compression?

I'm assuming this isn't GZIP, since the data should have lower entropy than plain ol' crap data.

DK
 