Analyst: Intel's Larrabee chip needs reality check

Not so fast. That's the gist of a report that says Intel's future graphics chip will face a grueling battle to gain ground against entrenched and very capable competitors.

When Larrabee was disclosed in August, it sent shockwaves through the chip industry. After all, this was Intel, the world's largest chipmaker, announcing its intention to assail one of the last major PC chip markets it has yet to dominate: standalone "discrete" graphics processors--or GPUs. Larrabee is due to ship in the 2009-2010 time frame.

Intel already dominates the market for integrated graphics silicon: graphics functions integrated into Intel chipsets come virtually free on tens of millions of PCs shipped worldwide every year, an offer that many PC vendors find hard to refuse. The resulting less expensive PCs are, in turn, welcomed by consumers.

But the discrete graphics market is a different creature. It is dominated by Nvidia and AMD's ATI graphics chip unit. Both companies supply chips that easily rival--or best--any Intel chip in complexity. Nvidia's latest chip, the GTX 280, boasts 1.4 billion transistors and 240 stream processors. In short, it is an extremely complex parallel-computing engine.

News Source: http://news.cnet.com/8301-13924_3-10039973-64.html

Mod edit: Trimmed quote down to "Fair Use" size.
 
Good to see both the analysts and Intel getting it pretty much on the nose, although it's still a bit of a struggle to see why there's so much chatter without any real idea about how they'll approach productisation.
 
I'd say the analyst is right. Although at the same time, I'm not sure the analyst understands all the technical detail either. For a GPU the choice of ten simple "antiquated" cores over two modern cores makes a lot of sense. I don't think this is where Intel went wrong. I think they went wrong when they decided to make it x86. They are starting a brand new design that will have a lot of baggage on day one, and by exposing x86 directly they are forced to maintain binary compatibility for all future generations, making the problem worse over time as new features are added, whereas ATI and Nvidia can redesign their stuff from the ground up with essentially no backward compatibility at the hardware level. At the same time I understand why they made that decision. Larrabee is mostly motivated by business reasons rather than technical ones. GPGPU is eating into their most profitable business, and they need to fight back. The quickest thing they can do is reuse existing technology. If they were to design a better architecture from scratch they would lose another year or two, which they can't afford.
 
I think they went wrong when they decided to make it x86. They are starting a brand new design that will have a lot of baggage on day one...
How much baggage are we talking about really? The first x86 CPU with most of this baggage is the 386, which has a measly 275,000 transistors. Sure, a modern 64-bit SMT architecture needs more, but if Intel was able to keep x86 afloat in the 386 days, when there was still lots of competition from other so-called superior architectures, how bad can it really be?

x86 isn't particularly elegant, but I really wonder where the idea comes from that it has too much baggage or is even crippled. Every ISA has its curiosities, but unless you are a compiler back-end writer that shouldn't affect you much if at all.

If you were to design a massively parallel multi-purpose CPU, what ISA do you believe would do significantly better than x86? Give me numbers...
...and by exposing x86 directly they are forced to maintain binary compatibility for all future generations, making the problem worse over time as new features are added, whereas ATI and Nvidia can redesign their stuff from the ground up with essentially no backward compatibility at the hardware level.
As cores become more complex and ever more programmable, I doubt that the choice of x86 really makes things worse. People will always want more features, and unless there's some revolutionary hardware or design breakthrough I doubt you'll be able to do more with less. Heck, NVIDIA and ATI still have to invest a lot more transistors to be able to do everything Larrabee will be capable of on day one.
At the same time I understand why they made that decision. Larrabee is mostly motivated by business reasons rather than technical ones. GPGPU is eating into their most profitable business, and they need to fight back. The quickest thing they can do is reuse existing technology. If they were to design a better architecture from scratch they would lose another year or two, which they can't afford.
And you think that for ATI and NVIDIA it's really a good idea to ditch the ISA every generation and start over from scratch? I can already hear a cheer of joy from the code generation teams and others who have to inspect assembly or binary from time to time... I don't know what the future will bring, but it does look to me like with G80 and CUDA NVIDIA opted for high compatibility between generations. It will be interesting to see ATI's approach.

Anyway, note that Intel will be able to present new Larrabee products at the same pace as CPUs. Having existing x86 tools and knowing that software written today will also run unmodified on future generations also helps keep up a high pace. Not to mention compatibility with systems not equipped with a Larrabee card but still having a powerful multi-core x86 CPU...

The software side is clearly becoming ever more important, so it doesn't necessarily hurt to settle on an extensible ISA even if it has its flaws. Intel has already proven that poor choices made in the past can be corrected to a large degree.
 
Well, it would be nice if they used an ISA which could do without speculative execution of branches when unnecessary ... although they could simply extend x86 for that.
 
And you think that for ATI and NVIDIA it's really a good idea to ditch the ISA every generation and start over from scratch? I can already hear a cheer of joy from the code generation teams and others who have to inspect assembly or binary from time to time...
They are certainly cheering, without changes they wouldn't have a job after all ;) Seriously, from what I can tell from AMD's documentation releases, the shader ISAs seem to have gone through an evolutionary process without starting from scratch every time.
 
They are certainly cheering, without changes they wouldn't have a job after all ;) Seriously, from what I can tell from AMD's documentation releases, the shader ISAs seem to have gone through an evolutionary process without starting from scratch every time.

The problem you get when you publish your ISA and allow 3rd parties to use it is that you can hardly remove things as part of the evolutionary process. With every generation this requires more and more transistors in the instruction decoder to translate the old ISA to the real ISA that the chip uses.

CUDA, for example, uses a byte code (like Direct3D) that needs to be translated by the driver. This way the ISA of the GPU can be changed without breaking compatibility with current software (rough sketch of that flow at the end of this post).
Another problem I see is that we all know it would take a long time before we see another GPU that can execute x86. And if we consider that Larrabee even uses special extensions, we might never see a GPU from any other IHV that can execute these programs.

Therefore this “Direct Mode” isn’t that much different from CUDA.
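To make the byte code point a bit more concrete, here is a minimal sketch in C of how that flow looks from the application side, using the CUDA driver API. Assumptions on my part: a kernel.ptx file produced offline with something like "nvcc --ptx kernel.cu", a kernel in it called my_kernel (just an example name), and error handling mostly omitted.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>   /* CUDA driver API */

int main(void)
{
    /* Read the PTX text that nvcc emitted offline. */
    FILE *f = fopen("kernel.ptx", "rb");
    if (!f) { perror("kernel.ptx"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *ptx = malloc(size + 1);
    fread(ptx, 1, size, f);
    ptx[size] = '\0';
    fclose(f);

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* The interesting step: the driver translates the PTX byte code
       into whatever native ISA the installed GPU actually uses. */
    cuModuleLoadDataEx(&mod, ptx, 0, NULL, NULL);
    cuModuleGetFunction(&func, mod, "my_kernel");

    /* ... set up arguments and launch func ... */

    cuCtxDestroy(ctx);
    free(ptx);
    return 0;
}
```

Nothing in that program depends on the GPU's real instruction encoding, which is exactly the freedom the hardware people keep by not exposing the native ISA.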
 
Well, it would be nice if they used an ISA which could do without speculative execution of branches when unnecessary ... although they could simply extend x86 for that.
As far as I know Larrabee doesn't do speculative execution, just like the original Pentium. Besides, it's a hardware feature rather than a software one. Or did I miss your point?
 
The problem you get when you publish your ISA and allow 3rd parties to use it is that you can hardly remove things as part of the evolutionary process. With every generation this requires more and more transistors in the instruction decoder to translate the old ISA to the real ISA that the chip uses.
It doesn't matter as long as the transistor budget grows significantly faster than the number of transistors needed to decode the extra instructions. Intel certainly hasn't had too much trouble changing the architecture significantly (pipelining, superscalar, out-of-order, speculative execution, wider execution units, etc.) while still extending the instruction set. Instructions that have become almost irrelevant have been turned into microcode.

It's the old RISC versus CISC discussion again. :)
 
It may not matter in the CPU business, where the target has always been to improve single-thread performance and each core is large. But these massively parallel computation chips use many small cores, so you want as many of them as possible. This puts you at a disadvantage if the competition can build smaller cores with the same computational power per core. Adding a backward-compatible instruction decoder may not take much space in a single core, but it adds up.
 
It may not matter in the CPU business, where the target has always been to improve single-thread performance and each core is large. But these massively parallel computation chips use many small cores, so you want as many of them as possible. This puts you at a disadvantage if the competition can build smaller cores with the same computational power per core. Adding a backward-compatible instruction decoder may not take much space in a single core, but it adds up.
Sure, the rules change somewhat when going multi-core. But with Larrabee Intel succeeded in cramming many more cores into the space of one CPU core, supporting the same legacy x86 instructions, and adding a wide vector unit through yet another ISA extension. So again, how big is this baggage really?

Having a thousand cores that can each execute only a handful of instructions clearly isn't the answer either. So the benefit of software compatibility might really offset the complexity of keeping the same ISA. Even something like CUDA needs another translation pass and consumes additional bandwidth. They might eventually integrate it into the hardware as well, so what do you win? You pay the price of compatibility either way, and Intel's approach hasn't been so bad for them so far.

Looking at the long-term future, some extra transistors for the GPU's instruction decoders really shouldn't matter. Lots of product development cycles are already dominated by the software, so not having to rewrite some key components can really reduce time to market.
 
As far as I know Larrabee doesn't do speculative execution
Branch prediction == speculative execution, and the Pentium certainly did that. As for it being a pure hardware issue, no ... I'm talking about data dependent branch hints; the programmer always has more information than the dumb processor.
 
Branch prediction == speculative execution, and the Pentium certainly did that. As for it being a pure hardware issue, no ... I'm talking about data dependent branch hints; the programmer always has more information than the dumb processor.
Ah, you're right, the original Pentium did do basic branch prediction.

The Pentium 4 also added branch prediction hints in the form of prefix bytes for conditional jump instructions; 2Eh and 3Eh (also serving as segment override prefixes when used with memory operands).

Edit: I guess you already knew that... What are data dependent branch hints?
Edit 2: I found a document describing it as a hint that refers to the loop counter register, which makes branching out of a loop entirely predictable. Interesting. I believe modern x86 CPUs already do this automatically though, by detecting loops and keeping track of the counter register used.
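For illustration, this is roughly how a programmer can still feed a static branch hint to the compiler today; a small C sketch using GCC's __builtin_expect. As far as I know most compilers use this to lay out the hot and cold blocks rather than to emit those 2Eh/3Eh prefix bytes, so take it as an example of the "programmer knows more than the processor" idea, not of the P4 feature itself.

```c
#include <stddef.h>

/* Convenience macros commonly defined on top of GCC's __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The error path is cold, so tell the compiler we almost never take it.
   The compiler typically moves the cold block out of the hot
   fall-through path. */
int sum_positive(const int *data, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(data[i] < 0))
            return -1;          /* cold: bad input */
        sum += data[i];
    }
    return sum;
}
```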
 
Well, LRB supports multiple threads per core, so it could defer a branch until the result of the condition is known - or at least almost.

Cheers
 
Edit 2: I found a document describing it as a hint that refers to the loop counter register, which makes branching out of a loop entirely predictable. Interesting. I believe modern x86 CPUs already do this automatically though, by detecting loops and keeping track of the counter register used.
There's more to life than loops. While this kind of special case circuitry makes a lot of sense for a serially optimised processor executing a legacy ISA ... it doesn't make a lot of sense for Larrabee IMO. I'd rather have an ISA extension.
 
How much baggage are we talking about really?

I don't have Intel's chip layouts, but I would say it's probably not an insignificant portion of the CPU logic. If it were, I don't think Intel would have chosen to go with explicitly parallel instruction computing for the Itanium.

x86 isn't particularly elegant, but I really wonder where the idea comes from that it has too much baggage or is even crippled.

x86 is nasty in many ways, and all kinds of weird, useless ways to execute have to be supported. You have instructions with odd side effects. For the Pentium this is less of an issue since it's in-order, but for out-of-order superscalar CPUs the dependencies between instructions are a bowl of spaghetti. You also have variable-size instructions. Unaligned memory accesses. Self-modifying code. Branching to any address, including into the middle of an instruction, potentially turning the code into something completely different than the previous time you executed through those bytes. Loads of old useless instructions (hello decimal arithmetic, string instructions, etc.). A very odd register model.
Then you need interrupts, I/O and all kinds of elaborate stuff that CPUs do.
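To make the "branching into the middle of an instruction" point concrete, here's a hand-picked toy byte sequence (my own example, not from any real program) that decodes as two completely different instruction streams depending on where you enter it:

```c
#include <stdio.h>

/* The same six bytes, as a 32-bit x86 decoder sees them:
 *
 * entering at offset 0:
 *   B8 C3 90 90 90      mov  eax, 0x909090C3
 *   C3                  ret
 *
 * entering at offset 1 (i.e. jumping into the middle of the mov):
 *   C3                  ret
 *   90                  nop
 *   90                  nop
 *   90                  nop
 *   C3                  ret
 */
static const unsigned char code[] = { 0xB8, 0xC3, 0x90, 0x90, 0x90, 0xC3 };

int main(void)
{
    for (size_t i = 0; i < sizeof code; i++)
        printf("%02X ", (unsigned)code[i]);
    printf("\n");
    return 0;
}
```

The decoder (and anything caching decoded instructions) has to cope with both views of the same bytes.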

Heck, NVIDIA and ATI still have to invest a lot more transistors to be able to do everything Larrabee will be capable of on day one.

The question is whether they want to. More flexibility, sure, getting close to a CPU eventually, but do we really need to be able to call fopen() from a shader core? I don't think so, and I would rather have those transistors spent on something useful.

And you think that for ATI and NVIDIA it's really a good idea to ditch the ISA every generation and start over from scratch?

Probably not all the time, and they don't have to, but they have the choice to redesign things when it makes sense. R600 wouldn't be able to run a R580 shader. If it was forced to for binary compatibility's sake, that would be a waste of silicon.

Anyway, note that Intel will be able to present new Larrabee products at the same pace as CPUs.

Which is another problem. For GPUs you can't have the same time to market as with CPUs. Even if Larrabee is competitive on launch it'd quickly become irrelevant if they don't present another generation for another three years or so, ending up several generations behind ATI and Nvidia.
 
What are data dependent branch hints?
Edit 2: I found a document describing it as a hint that refers to the loop counter register, which makes branching out of a loop entirely predictable. Interesting. I believe modern x86 CPUs already do this automatically though, by detecting loops and keeping track of the counter register used.
Proper DDBH is basically the ability to say "Next branch if r0 < r1" and then some cycles later take the branch. So you can retire the evaluation of the condition a long time before you take the branch and already know which way you're going.

I thought MIPS had this, but I've been reading up on that ISA recently and it doesn't, it only has static hinting.

I think there is quite a bit of baggage. AFAIK it's things like prefixes and variable instruction length that really start to cost you in the decoders - you need to spend quite a lot of silicon just trying to work out where the next instruction starts (or where the next three instructions start, for a superscalar decoder). The large number of partial write cases (requiring resolves) is also a cost that you wouldn't take on board if you had the chance.
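On the partial writes, here's a quick C sketch (GCC inline asm, x86) of the kind of pattern that forces the hardware to merge a byte-sized write back into the full register before a full-width read can proceed. Whether it actually stalls or just costs a merge uop depends on the microarchitecture, so treat it purely as an illustration of the shape of the problem:

```c
/* Write only AL, then read all of EAX: the hardware has to combine the
 * 8-bit write with the upper 24 bits before the add can complete.
 * That merge is bookkeeping a cleaner register model wouldn't need. */
unsigned int partial_write_merge(unsigned int x)
{
    unsigned int r;
    __asm__ volatile (
        "xorl  %%eax, %%eax\n\t"   /* start from a known EAX value   */
        "movb  $0x7f, %%al\n\t"    /* partial write: low 8 bits only */
        "addl  %1, %%eax\n\t"      /* full-width read of EAX         */
        "movl  %%eax, %0\n\t"
        : "=r"(r)
        : "r"(x)
        : "eax", "cc");
    return r;
}
```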
 
In a sea of cache and 26-bit multipliers it's not all that important.

Although personally I think Larrabee is rather tame. What I really want is a multiprocessor with really compact cores, more compact than even SPUs ... dual-issue in-order cores with only a single FMAC and small private caches :) Then the ISA would start to matter.
 
Branch prediction == speculative execution, and the Pentium certainly did that. As for it being a pure hardware issue, no ... I'm talking about data dependent branch hints; the programmer always has more information than the dumb processor.

While compiler researchers always like to point this out, they've been proven wrong over and over and over and over again and again and again and again.

This even included application space, where one would think that the compiler could do a good job. The problem is really that the processor does have more information than the programmer, especially in any application of reasonable size.
 
Which is another problem. For GPUs you can't have the same time to market as with CPUs. Even if Larrabee is competitive on launch it'd quickly become irrelevant if they don't present another generation for another three years or so, ending up several generations behind ATI and Nvidia.

Don't know how much you've been following things, but GPU design cycles have been increasing lately and CPU design cycles have been shrinking. Both are releasing new designs/shrinks roughly every year.

Realistically both are bound by the underlying fab cycles, which run roughly every 18 to 30 months depending on vendor and luck.
 