Larrabee: Samples in Late 08, Products in 2H09/1H10

Perhaps not now, but the average programmer of the future (? years from now) will have to. Even on single-processor machines, it hasn't been a good idea to stay single-threaded (or single-process) since the advent of time sharing. Eventually your single thread/process is going to block on IO (perhaps even just a virtual memory page read), and the CPU will go idle when you could still be doing useful work.
No, I'm really not convinced that the average programmer will master low-level multi-core programming in the future. Note that by "average programmer" I mean the bulk of desktop application programmers, and by low-level multi-core programming I'm referring to understanding MESI, the ABA problem, read-write reordering, etc. I think it's comparable to assembly programming (or even lower level): something many just don't want to deal with. And they shouldn't have to...

Asynchronous IO is a different story. It's an easy-to-understand abstraction, and like you say it's more related to time sharing than to multi-core. Furthermore, a millisecond is a small delay for asynchronous IO, while for the kind of multi-core programming I'm talking about it's an eternity.
While threaded (or multi-processing via fork) programming might be a relatively "new" topic for game developers, it has been a way of life for most "server" programmers for a long time.
Again, it's not the same thing we're talking about here. In a server, threads are largely independent, just serving different users. The only locking that happens is when threads access shared resources outside of the processes (e.g. databases). Also, these resources (or their drivers) are written by highly experienced programmers. The programmers of the services generally don't have to deal with the details of having multiple processors; they essentially write single-threaded code and pass data through highly abstracted messages.

The kind of multi-core programming game, multimedia and driver developers have to deal with is a single application running as many threads as there are cores, striving to advance execution as fast as possible. They have to deal with things like spin loops doing no useful work and context switching having a significant overhead. There's a lot of fine-grain inter-thread communication that requires a deep understanding of hardware and O.S. mechanisms.
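As a trivial illustration of that trade-off (just a sketch, not production code):

Code:
#include <atomic>

// A spin loop does no useful work, but avoids the comparatively huge cost of
// a context switch when the wait is expected to be very short.
std::atomic<bool> dataReady{ false };

void waitForData() {
    while (!dataReady.load(std::memory_order_acquire)) {
        /* spin: a real implementation would issue a CPU "pause" hint here and,
           past some threshold, fall back to blocking in the OS instead */
    }
}

void publishData() {
    // ... produce the data, then make it visible to the spinning thread ...
    dataReady.store(true, std::memory_order_release);
}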

Anyway, my main point was and still is that most programmers need frameworks to abstract locks-and-threads into something like dependencies-and-tasks.
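To illustrate what I mean by dependencies-and-tasks, here's a deliberately tiny C++ sketch using standard futures (a real framework would offer a much richer dependency graph, work stealing, etc.):

Code:
#include <future>
#include <iostream>

// Each function is a "task"; the programmer states what a task needs (its
// inputs), and the runtime decides when it may run, instead of the programmer
// juggling threads and locks directly.
int produceA() { return 2; }                    // independent task
int produceB() { return 3; }                    // independent task
int combine(int a, int b) { return a * b; }     // depends on both results

int main() {
    auto a = std::async(std::launch::async, produceA);
    auto b = std::async(std::launch::async, produceB);
    // The dependency is expressed simply by consuming the futures; no explicit
    // locks, condition variables or memory-ordering reasoning is needed here.
    auto c = std::async(std::launch::async,
                        [&a, &b] { return combine(a.get(), b.get()); });
    std::cout << c.get() << "\n";               // prints 6
    return 0;
}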
 
What kind of algorithms are used for DXTC
DXTC is a block based algorithm (for which you only need to do decompression on the GPU), and is also known as S3TC. Here's the wikipedia article on that: http://en.wikipedia.org/wiki/S3TC

While the operations are extremely simple, we're talking about 2-bit indices being used to linearly interpolate two 32-bit integer values, and that's obviously incredibly cheap from a hardware perspective. On the other hand, that is a fairly high number of operations and using 32x32 INT (or FP32!) units to do it would be rather, let us say, inefficient. So my guess is that'll be done in the texture sampler unit, and shouldn't be too much of a problem then.
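For reference, a simplified C++ sketch of decoding one DXT1 block per the article above (opaque mode only, and skipping the usual low-bit replication when expanding the 5:6:5 endpoints); this is just the textbook algorithm, not how any particular GPU's sampler implements it:

Code:
#include <cstdint>

// Simplified DXT1/S3TC block decode. One block is 8 bytes and covers 4x4 texels.
struct RGB { uint8_t r, g, b; };

static RGB expand565(uint16_t c) {                       // RGB565 -> RGB888 (roughly)
    return { uint8_t((c >> 11) << 3),                    // 5 bits of red
             uint8_t(((c >> 5) & 0x3F) << 2),            // 6 bits of green
             uint8_t((c & 0x1F) << 3) };                 // 5 bits of blue
}

void decodeDXT1Block(const uint8_t* block, RGB out[16]) {
    uint16_t c0 = uint16_t(block[0] | (block[1] << 8));  // endpoint colors
    uint16_t c1 = uint16_t(block[2] | (block[3] << 8));
    uint32_t bits = uint32_t(block[4]) | (uint32_t(block[5]) << 8) |
                    (uint32_t(block[6]) << 16) | (uint32_t(block[7]) << 24);

    RGB e0 = expand565(c0), e1 = expand565(c1);
    RGB palette[4] = { e0, e1, {}, {} };
    if (c0 > c1) {                                       // 1/3 and 2/3 interpolants
        palette[2] = { uint8_t((2 * e0.r + e1.r) / 3), uint8_t((2 * e0.g + e1.g) / 3),
                       uint8_t((2 * e0.b + e1.b) / 3) };
        palette[3] = { uint8_t((e0.r + 2 * e1.r) / 3), uint8_t((e0.g + 2 * e1.g) / 3),
                       uint8_t((e0.b + 2 * e1.b) / 3) };
    } else {                                             // midpoint + black
        palette[2] = { uint8_t((e0.r + e1.r) / 2), uint8_t((e0.g + e1.g) / 2),
                       uint8_t((e0.b + e1.b) / 2) };
        palette[3] = { 0, 0, 0 };
    }
    for (int i = 0; i < 16; ++i)                         // 2-bit index per texel
        out[i] = palette[(bits >> (2 * i)) & 3];
}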

Compression works like this: http://www.sjbrown.co.uk/?article=dxt

and framebuffer based compression?
Those algorithms are proprietary, and what is used on the latest GPUs is very much anyone's guess. It's a kind of secret sauce that is constantly being perfected; compression ratios in the 8800 GT (G92) have been improved a fair bit compared to G80, for example.

There are many ways to achieve compression, including storing triangle planes, but also much more exotic techniques that likely do plenty of operations on relatively small integer sizes and wouldn't be very efficient on a programmable core. Whether Larrabee would be at a major disadvantage without fixed-function hardware for this depends on whether the best algorithms could run on its cores with even mild efficiency. And that's anyone's guess given how little of this is public knowledge...

I'm assuming this isn't GZIP, since the data should have lower entropy than plain ol' crap data.
DXTC and depth buffer compression are lossy fwiw, and GZIP wouldn't work anyway because it assumes knowledge of all prior bits. If you want pixel 459,321, you don't want to have to read the 459,320 pixels before it!
 
Lock free algorithms are a really good idea in some contexts, and they are the best you can do in many cases without additional hardware support. However, they are fiendishly complicated.
Yes, but "fiendishly complicated" doesn't necessarily justify adding hardware. If they bite the bullet and use lock-free algorithms where necessary and balance the granularity they might be able to reach the same performance at a lower hardware cost.
Recall all the x86 memory ordering issues we were talking about earlier in this thread? Well, if you write lock-free code, you really need to consider all those nasty memory ordering issues. There is also an issue called the "ABA" problem, in which a pointer to an object has been recycled and reused at just the wrong time. In such cases, doing a simple comparison to confirm the "old" and "new" values isn't enough; you need an epoch timestamp as well.
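To make that concrete, the usual software fix is to pack a generation counter ("epoch") next to the pointer so a recycled address no longer passes the compare-and-swap. A rough C++ sketch of a lock-free stack pop (illustrative only; it glosses over safe memory reclamation and needs a double-width CAS to be truly lock-free):

Code:
#include <atomic>
#include <cstdint>

struct Node { int value; Node* next; };

struct TaggedPtr {
    Node*    ptr;
    uint64_t tag;    // generation counter, bumped on every successful update
};

std::atomic<TaggedPtr> head{ TaggedPtr{ nullptr, 0 } };

bool pop(int& out) {
    TaggedPtr old_head = head.load();
    while (old_head.ptr != nullptr) {
        // Note: this still assumes the node hasn't been freed yet; safe
        // reclamation is a separate, hard problem.
        TaggedPtr new_head{ old_head.ptr->next, old_head.tag + 1 };
        // A recycled node at the same address no longer passes the compare,
        // because the tag has moved on: this is what defeats ABA.
        if (head.compare_exchange_weak(old_head, new_head)) {
            out = old_head.ptr->value;
            return true;
        }
        // On failure, old_head now holds the current value; retry.
    }
    return false;
}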
Been there. :) I had a number of lock-free algorithms that worked fine with two threads but failed with three or more. The problem was exactly that the third thread breaks a lot of assumptions (e.g. having only one possible writer while reading the data) that were crucial to their correctness. This is one of the reasons I believe that going multi-core isn't necessarily as simple as going dual-core.

I can see that speculative locking makes things a whole lot easier, but I'm not sure yet if it can make something essentially lock-free that would have been impossible (or highly inefficient) without the hardware support. But maybe I'm overestimating the hardware cost and it's the best idea to keep scaling the number of cores.
 
It is somewhat sad really in that we will see 3 primary awesome GPGPU APIs from the 3 major vendors (CUDA, CTM, and whatever Intel builds), all of which are not portable, leaving each as only a solution for niche markets.
Larrabee will be the most portable of all. By sticking to a CISC ISA that has been around for decades they can easily transfer code from multi-core CPUs to Larrabee, and vice-versa. So it becomes attractive for HPC software developers to develop only for x86.
 
What kind of algorithms are used for DXTC and framebuffer based compression?

I'm assuming this isn't GZIP, since the data should have lower entropy than plain ol' crap data.

DK

The compression schemes are optimized for coherent random accesses.

These papers summarize some of the compression methods used in GPUs for framebuffer (color and Z buffer) compression, based on information in the patents.

http://graphics.cs.lth.se/research/papers/2007/cbc/cbc2007.pdf
http://graphics.cs.lth.se/research/papers/depth2006/dbc.pdf

Also, Z buffer compression is lossless. What many Z compression implementations seem to be doing is storing the buffers in small tiles and compressing them with all the compressors they have, and selecting the compressor for that tile that gives the best lossless compression. I think it makes sense that all these compressors are tried on the tile in parallel; this is something that should be done in a fixed-function block.
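In software terms the selection logic would look something like this (purely conceptual C++ sketch; the real compressors are proprietary fixed-function hardware, and all the names here are made up):

Code:
#include <cstddef>
#include <cstdint>
#include <vector>

// A candidate compressor returns true only if it could encode the tile exactly
// (losslessly); "out" then holds the encoded bits.
struct Encoding { int formatId; std::vector<uint8_t> bits; };
using Compressor = bool (*)(const uint32_t* tile, std::size_t n,
                            std::vector<uint8_t>& out);

Encoding compressTile(const uint32_t* tile, std::size_t n,
                      const std::vector<Compressor>& compressors) {
    const uint8_t* raw = reinterpret_cast<const uint8_t*>(tile);
    // Start with the uncompressed tile as the fallback.
    Encoding best{ -1, std::vector<uint8_t>(raw, raw + n * sizeof(uint32_t)) };
    for (std::size_t i = 0; i < compressors.size(); ++i) {   // in hardware: all at once
        std::vector<uint8_t> out;
        if (compressors[i](tile, n, out) && out.size() < best.bits.size())
            best = Encoding{ static_cast<int>(i), std::move(out) };
    }
    return best;   // formatId == -1 means "tile stored uncompressed"
}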
 
Will Larrabee processor run Windows, or will it be just be a video card?

From what I understand, Larrabee is just a video card (GPU). Perhaps that is the biggest difference between Larrabee and AMD's Fusion.

There are two inter-related but different trends going on here: integrating the GPU on the CPU chip (AMD Fusion) and using a many-core CPU-like chip for GPU computations (Larrabee). The first trend (integration) is less controversial, especially in embedded or laptop domains in which power and space are at a premium. The second trend (using many general-purpose cores for GPU computations) is much more controversial.

Fusion makes sense as a first step for AMD because of its ATI heritage. Larrabee's approach, if it makes sense at all, makes the most sense for Intel (because of its all-x86-all-the-time heritage).

The next post-Larrabee step for Intel will likely look more like Fusion (GPU & CPU together), but I would bet that in a post-Larrabee product, you'd have some big x86 cores together with some smaller Larrabee cores on the same chip. As they would all be x86, it would be a many-core x86 chip that would also excel at graphics.

In this post-Larrabee world (if it materializes) the lines between GPU/many-core CPU/GPGPU blur to the point of disappearing.

Ironically, I think Intel is internally divided as to what this post-Larrabee world should actually look like. Different parts of Intel, I think, have lots of different opinions on that. Internal Intel politics could still screw things up enough (or delay them) to prevent this from materializing.
 
Also, Z buffer compression is lossless.
[strike]That's a bit optimistic; I think it's much more fair to say it's "barely lossy".[/strike]
EDIT: Oops, see below posts.
What many Z compression implementations seem to be doing is storing the buffers in small tiles
Yes, that's also to handle memory bursts really. G8x, for example, does everything in 16x16 blocks (the batch size is actually 16x2).
and compressing them with all the compressors they have, and selecting the compressor for that tile that gives the best lossless compression. I think it makes sense that all these compressors are tried on the tile in parallel
Hmm, maybe. That doesn't seem like a very area-efficient way to handle the problem though! :) But since bandwidth is so expensive, it may be justified.
 
No, I'm really not convinced that the average programmer will master low-level multi-core programming in the future.

Sadly, your statement above is likely true even if the word "multi-core" is removed. :)

I agree that application-level programmers will never master multi-core programming. There are too many subtle correctness and performance issues. However, the hope is that CS majors aren't only application-level programmers, but also systems programmers (among which I would include operating systems, database internals, compilers, and game engines). I'm not sure that undergraduates (or even MS students) are being well prepared for programming today's multi-core chips.
 
Will Larrabee processor run Windows, or will it be just be a video card?
It might run Windows, but if it does it won't be something to write home about. From the little information available at the moment it won't be a video card either. What it really looks like is an x86 heavily geared toward multi-threaded workloads and with extensions tailored specifically for graphics. I.e. a specialized processor just like Sun's UltraSPARC T1 & T2.
 
From what I understand, Larrabee is just a video card (GPU). Perhaps that is the biggest difference between Larrabee and AMD's Fusion.

From the little information available at the moment it won't be a video card either. What it really looks like is an x86 heavily geared toward multi-threaded workloads and with extensions tailored specifically for graphics.

These are apparently contradictory statements. That's probably because crystall is going by the publicly available information, whereas ArchitectureProfessore sounds like an informed voice from the underground world of chips.

Anyway, if ArchitectureProfessore is right, it will be just one more reason why Larrabee will fail. It is just a powerful add-on, just like the 32X was in its time when attached to a Genesis console, without any significant existing or potential market for it (WINDOWS GAMES!).
 
I can see that speculative locking makes things a whole lot easier, but I'm not sure yet if it can make something essentially lock-free that would have been impossible (or highly inefficient) without the hardware support.

Ok, try this one. It is a bit contrived, but it gives the idea. You have two "set" data structures. Each supports just insertKey(), isKeyPresent(), and removeKey(). Now, you want to remove one item from set A and insert it into set B, while preserving the property that any other thread will always see the item in either set A or set B (never in neither, and never in both). How do you make a lock-free data structure for that? Sounds hard.

With speculative locking, you can just acquire locks on both sets and do the updates. The speculative locking will handle the nesting and let you do this operation totally in parallel with other non-conflicting threads:

Code:
Set A, B;
A.lock();
B.lock();          // watch out for deadlock!
A.removeKey(key);
B.insertKey(key);
B.unlock();
A.unlock();

This is actually the poster-boy for transactional memory (which speculative locking closely resembles). Under transactional memory, you would just say:

Code:
Set A, B;
atomic {
  A.removeKey(key);
  B.insertKey(key);
}

One big advantage of the transactional memory version is that you don't need to worry about the deadlock that can occur if you don't acquire the locks on A and B in the same order each time (for example, if some thread is doing the inverse operation on a different key, the lock-based version I gave above could actually deadlock).
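For comparison, the conventional software-only way to dodge that deadlock is to make every thread grab the two locks in a single global order; a rough C++ sketch (the Set wrapper here is hypothetical, just to make it self-contained):

Code:
#include <functional>
#include <mutex>
#include <set>

struct Set {
    std::mutex m;
    std::set<int> keys;
    void removeKey(int k) { keys.erase(k); }
    void insertKey(int k) { keys.insert(k); }
};

// Every thread acquires the two locks in the same global order (here, by
// address), so two threads moving keys in opposite directions cannot deadlock.
// Assumes from and to are distinct sets, as in the example above.
void moveKey(Set& from, Set& to, int key) {
    Set* first  = std::less<Set*>()(&from, &to) ? &from : &to;
    Set* second = (first == &from) ? &to : &from;
    std::lock_guard<std::mutex> l1(first->m);
    std::lock_guard<std::mutex> l2(second->m);
    from.removeKey(key);
    to.insertKey(key);
}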

But maybe I'm overestimating the hardware cost and it's the best idea to keep scaling the number of cores.

In terms of die area (and power), I really think the cost is minimal (if you already have cache coherence). The biggest cost I see is really the design complexity issues. Of course, design complexity is a fixed cost (as opposed to a per-chip cost), so the design cost can be amortized over millions of units (if successful).
 
Anyway, if ArchitectureProfessore is right, it will be just one more reason why Larrabee will fail. It is just a powerful add-on, just like the 32X was in its time when attached to a Genesis console, without any significant existing or potential market for it (WINDOWS GAMES!).

I think the idea is that you'll be able to go to your favorite computer store and see NVIDIA, ATI/AMD, and Intel discrete GPUs all in boxes on the same shelf. You'll be able to buy one and just plug it into a PCI Express slot. Games that target DX or OpenGL will (via software drivers) run on Larrabee, just as the same interfaces are supported by the different internals of the ATI and NVIDIA cards. To MS Windows, Larrabee will look just like another discrete GPU card.

All the fancy internals we're talking about? They will be mostly invisible, even to many game developers. It will just be YAGPU (yet another GPU).

Intel is really engaging in full-on battle with NVIDIA on NVIDIA's home turf of high-end discrete GPUs. Brilliant or crazy? Likely both. :) I guess we'll see. It is certainly a bold move on Intel's part.
 
It might run Windows, but if it does it won't be something to write home about. From the little information available at the moment it won't be a video card either. What it really looks like is an x86 heavily geared toward multi-threaded workloads and with extensions tailored specifically for graphics. I.e. a specialized processor just like Sun's UltraSPARC T1 & T2.

Intel call Larrabee an Intel Architecture (that is x86) processor optimised for Visual Computing and HPC.

SGI Ultraviolet systems may use Larrabee alongside future Xeons and Itaniums, using QuickPath to connect Larrabee to a UV node.
 
That's a bit optimistic; I think it's much more fair to say it's "barely lossy".
If they can't achieve lossless compression on a tile at any bit rate lower than full uncompressed z bit rate, they simply don't compress the tile. I think the compression rates they give are best/average case.
 
Lossy z-buffer compression doesn't make much sense to me, unless you can guarantee that you are never going to have rendering errors introduced by your compressed z-buffer, which is not really doable on a generic immediate-mode renderer.
 
Indeed, forget everything I said there. I just realized that even though there might be error thresholds in some algorithms, those have to be incredibly conservative and are effectively lossless given the bitrate.
 
What AMD/ATI is thinking

Here is an image I found on-line that shows what AMD/ATI is thinking:

[Image: AMD roadmap slide (amdfad01.jpg)]


This is from the article AMD outlines plans for future processors at The Tech Report.

I think the "dedicate xPU" and "generalized xPU" on the far right is the most interesting part of the slide. Not sure what the means, but it sounds interesting.
 
Lossy z-buffer compression doesn't make much sense to me, unless you can guarantee that you are never going to have rendering errors introduced by your compressed z-buffer, which is not really doable on a generic immediate-mode renderer.

Lossy z-buffer compression could be used for early, optimistic, coarse-granularity z-culling only ... for example, hierarchical Z is basically a lossy Z compression (keeping the min or max of the 4 values from the level below at each level, depending on the depth compare mode).
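In code terms, the coarse representation is only ever consulted conservatively; a toy C++ sketch assuming a LESS depth test:

Code:
// A hierarchical-Z tile keeps only the farthest depth in the tile. That coarse
// value is "lossy", but it is used only to reject a fragment when it is
// provably behind everything; otherwise we fall through to the exact test.
struct HiZTile { float maxZ; };

bool canReject(const HiZTile& tile, float fragmentZ) {
    // Farther than the farthest sample in the tile => fails the test everywhere.
    return fragmentZ >= tile.maxZ;
}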
 