Analyst: Intel's Larrabee chip needs reality check

I thought MIPS had this, but I've been reading up on that ISA recently and it doesn't; it only has static hinting.

MIPS has branch delay slots. An architecture wart on the side of a somewhat clean architecture. BDS are in general a very bad design.
 
While compiler researchers always like to point this out, they've been proven wrong over and over and over and over again and again and again and again.
I said programmer, not compiler (among other things the good programmer will also know when the limited amount of runtime information available to a specific processor will be good enough ... branch hints are a cheap option to add, but you don't have to use them all the time). There is nothing wrong with branch delay slots on a one-off ISA without branch prediction, they just weren't very forward looking.
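To make the idea concrete: in C the usual way for a programmer to express a static hint is GCC's __builtin_expect, which the compiler can turn into a hint bit on ISAs that actually have them, or just use for block layout. A minimal sketch, nothing Larrabee-specific:

[code]
#include <stdlib.h>

/* The programmer asserts that allocation failure is the rare path.
 * The compiler can lay the code out accordingly, or emit a static
 * branch hint on ISAs that have hint bits.  No dynamic predictor
 * involved. */
void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (__builtin_expect(p == NULL, 0))   /* "expect this to be false" */
        abort();
    return p;
}
[/code]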
 
Sean Maloney said:
"This is something that will unfold over a few years. I don't think anything's going to happen overnight."

Best thing they've said yet. It's got to be all about continuing to execute every day, week, month, etc over a significant period of time. That's the *only* model that has been proven to work in the non-bundled GPU industry over the last 15 years. Dilettantes, weekend warriors, and "every three years whether we need to or not" (Yes, Matrox, I'm looking at YOU) need not apply.
 
Whee, just saw the Richard Huddy interview where he promised that ATI/AMD & Nvidia would kick Larrabee's rear by 2-4x on rasterization performance per mm2. Presumably that's even with the process advantage we all assume Intel will have initially.

"We'll thrash them --absolutely thrash them" is what he said. Don't hold back, Richard, tell us what you really think. :smile:
 
I think they went wrong when they decided to make it x86. They are starting a brand new design that will have a lot of baggage on day one, and by exposing x86 directly they are forced to maintain binary compatibility for all future generations, making the problem worse over time as new features are added[...]
No, the majority of Larrabee is the L2 cache and the 16-way SIMDs, which are specifically jigged for throughput computing - both of these are fresh, shiny and new and are not x86. For graphics, the x86 scalar core is "demoted" to being just a control processor architecture, much like the command processor and the sequencer in R6xx.

Jawed
 
I said programmer, not compiler (among other things the good programmer will also know when the limited amount of runtime information available to a specific processor will be good enough ... branch hints are a cheap option to add, but you don't have to use them all the time). There is nothing wrong with branch delay slots on a one-off ISA without branch prediction, they just weren't very forward looking.

The problem with data dependent branch hints (or targets) is that you get data at the end of your pipeline, but the branch unit needs the information at the front. In latency sensitive situations you're normally better off with prediction, since that breaks the dependence. In situations where latency is less significant just add hardware contexts to your multithreaded core and switch to another thread whenever a branch is encountered.
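Roughly what I mean, as a toy software model (made-up numbers, nothing like a real pipeline): park a thread when it hits a branch and issue from the next ready one, so the branch-resolution latency gets hidden instead of predicted.

[code]
#include <stdio.h>

#define NTHREADS       4
#define BRANCH_LATENCY 3   /* cycles until the branch outcome is known */

struct ctx {
    int pc;           /* next "instruction" for this thread */
    int stall_until;  /* cycle at which the thread is ready again */
};

/* pretend every 5th instruction is a branch */
static int is_branch(int pc) { return pc % 5 == 4; }

int main(void)
{
    struct ctx t[NTHREADS] = {{0}};
    for (int cycle = 0; cycle < 32; ++cycle) {
        int issued = 0;
        for (int i = 0; i < NTHREADS && !issued; ++i) {
            struct ctx *c = &t[(cycle + i) % NTHREADS];  /* round robin */
            if (cycle < c->stall_until)
                continue;                 /* still waiting on its branch */
            printf("cycle %2d: thread %d issues op %d%s\n", cycle,
                   (int)(c - t), c->pc,
                   is_branch(c->pc) ? " (branch, thread parked)" : "");
            if (is_branch(c->pc))
                c->stall_until = cycle + BRANCH_LATENCY;
            c->pc++;
            issued = 1;
        }
        if (!issued)
            printf("cycle %2d: bubble, all threads waiting\n", cycle);
    }
    return 0;
}
[/code]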

Cheers
 
Don't know how much you've been following things, but GPU design cycles have been increasing lately and CPU design cycles have been shrinking. Both are releasing new designs/shrinks roughly every year.

Realistically both are bound by the underlying fab cycles, which are roughly every 18 to 30 months depending on vendor and luck.


Since the release of the first C2D (Nehalem hasn't launched yet), what exactly has changed in Intel's CPUs other than FSB, cache and speeds? Not much as far as I can tell; hopefully someone with a bit more knowledge can explain. Meanwhile, during that same time period we have seen 3 generations of GPUs.
 
Three 'generations'? What's a generation, then? The GPU guys are just more happy to slap a new name on an old can of dog food. G92? G80 die shrink. GT200? A lot of G80 on a single die with a few functional extensions.

Kinda reminds me of that whole 45nm shrink with SSE4.2 and various core enhancements, frequency improvements.. Except it retained the Core 2 name. With all the improvements in Penryn they could easily have kept the Penryn SKUs at high frequencies and given them new names, but that's not their model.
 
Three 'generations'? What's a generation, then? The GPU guys are just more happy to slap a new name on an old can of dog food. G92? G80 die shrink. GT200? A lot of G80 on a single die with a few functional extensions.

Kinda reminds me of that whole 45nm shrink with SSE4.2 and various core enhancements, frequency improvements.. Except it retained the Core 2 name. With all the improvements in Penryn they could easily have kept the Penryn SKUs at high frequencies and given them new names, but that's not their model.

Core 2 might have been a "revolution" compared to the Pentium 4, but when you compare it to its real ancestor, Core Duo/Yonah, Pentium M and Pentium III, etc, it looks very much like an evolution.
 
The problem with data dependent branch hints (or targets) is that you get data at the end of your pipeline, but the branch unit needs the information at the front. In latency sensitive situations you're normally better off with prediction, since that breaks the dependence.
It's not an either-or situation; whether it's the assembly programmer or the compiler, they will both know very well when it makes sense to use them.

I kinda doubt Larrabee will have branch predictors which autodetect loops, that would be a far cry from the Pentium ... on the other hand, maybe they will finally support the LOOP instruction for real again ... that would be nice too.
 
The problem you get when you publish your ISA and allow third parties to use it is that you can hardly remove things as part of the evolutionary process. With every generation this will require more and more transistors in the instruction decoder to translate the old ISA to the real ISA that the chip uses.

CUDA, for example, uses a byte code (like Direct3D) that needs to be translated by the driver. This way the ISA of the GPU can be changed without breaking compatibility with current software.
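To make that concrete, this is roughly what the indirection looks like from the driver API side: the application hands the driver PTX text (the virtual ISA) and the driver compiles it for whatever GPU is actually installed. Sketch only, error handling stripped; "kernel.ptx" and "my_kernel" are placeholder names for a file produced offline with nvcc --ptx.

[code]
#include <stdio.h>
#include <cuda.h>          /* CUDA driver API */

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* kernel.ptx holds PTX, not native GPU code.  The driver translates
     * it here for the installed chip, so the same file keeps working
     * even if the hardware ISA underneath changes completely. */
    cuModuleLoad(&mod, "kernel.ptx");
    cuModuleGetFunction(&fn, mod, "my_kernel");

    printf("PTX translated and loaded for this GPU\n");

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
[/code]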
Another problem I see is that we all know it would take a long time before we see another GPU that can execute x86. And if we consider that Larrabee even uses special extensions, we might never see a GPU from any other IHV that can execute these programs.

Exactly. Demirug sees clearly.

This is the real reason to use the x86 instruction set. Lock in.
The only other reason to use x86 would be the (somewhat questionable) ability to execute legacy Windows code directly on Larrabee, although for the life of me I can't see why anyone would want to be that wasteful.
 
How much baggage are we talking about really? The first x86 CPU with most of this baggage is the 386, which has a measly 275,000 transistors. Sure, a modern 64-bit SMT architecture needs more, but if Intel was able to keep x86 afloat in the 386 days, when there was still lots of competition from other so-called superior architectures, how bad can it really be?
There were many non-technical reasons for x86's success.
x86's reputation for poor performance wasn't entirely unearned, given how badly it lagged in performance against other architectures until manufacturing might and economies of scale turned the tide against more performant but more niche machines.

A lot of structures that are not in the decoder are affected by supporting x86 at speed.
Until we know more, it is more likely that the P54 core is the appropriate starting point.
That chip had 3.3 million transistors in its core + L1.
Comparisons are difficult to make, but a dual issue Alpha weighed in at 1.68 million.

If you were to design a massively parallel multi-purpose CPU, what ISA do you believe would do significantly better than x86? Give me numbers...
If you mean a many-core design, an ISA that can yield a core with similar issue width and features with a smaller transistor or area cost would be sufficient to do better.
There were RISC designs contemporaneous to the Pentium that did as much.

As cores become more complex and ever more programmable, I doubt that the choice of x86 really makes things worse.
In manycore, the cores are not becoming more complex...


It doesn't matter as long as the transistor budget grows significantly faster than the number of transistors needed to decode the extra instructions. Intel certainly hasn't had too much trouble changing the architecture significantly (pipelining, superscalar, out-of-order, speculative execution, wider execution units, etc) while still extending the instruction set. Instructions that have become almost irrelevant have been turned into microcode.
That argument doesn't quite work with Larrabee.
It worked when large monolithic IPC-oriented cores could amortize the cost of their large decoder blocks over the exploding transistor budgets for their many execution units and issue logic.
It took a long time to jump to a wider issue decoder: the PPro went 3-wide, and it wasn't until Conroe that Intel went 4-wide.
The transistor budget was much bigger by that point.
Many-core forces a much more linear relationship in decoder count versus everything else.
Many-core's emphasis on area and power efficiency also means all those transistors that formerly masked the overhead of compatibility are no longer present.
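A back-of-envelope makes the point (all figures made up purely for illustration): the fixed per-core decode cost is noise in one huge core, but in an array of small cores it stays a fixed fraction of every core you add.

[code]
#include <stdio.h>

/* Illustrative numbers only, not real transistor counts. */
int main(void)
{
    double decode     = 0.5e6;   /* hypothetical x86 decode overhead per core */
    double big_core   = 100e6;   /* one large out-of-order core               */
    double small_core = 3e6;     /* one simple in-order core                  */
    int    n_small    = 32;

    printf("one big core  : decode is %4.1f%% of the core\n",
           100.0 * decode / big_core);
    printf("%d small cores: decode is %4.1f%% of the array\n",
           n_small, 100.0 * (n_small * decode) / (n_small * small_core));
    return 0;
}
[/code]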

No, the majority of Larrabee is the L2 cache and the 16-way SIMDs, which are specifically jigged for throughput computing - both of these are fresh, shiny and new and are not x86. For graphics, the x86 scalar core is "demoted" to being just a control processor architecture, much like the command processor and the sequencer in R6xx.
In a GPU setup, there might be 2-4 of those.
For Larrabee 16,32,64?
It makes for interesting tradeoffs.

One would imagine some graphics designers in the Larrabee III timeframe wistfully looking at the vast stretches of opcode space they can never use... :(
 
In manycore, the cores are not becoming more complex...
Umh? I look at GPUs historical trends and I've seen cores getting more and more complex (not just growing in number), I don't see why this trend will suddenly stop.
 
Umh? I look at GPUs historical trends and I've seen cores getting more and more complex (not just growing in number), I don't see why this trend will suddenly stop.

The "cores" are straightforward in-order pipelines that run instructions from a very straightforward collection of threads.
Half or more of a core's work isn't even in the shader units, instead it's in the decoupled TMU and separate ROP partitions.

AMD's superscalar execution is done through VLIW.
Nvidia's superscalar is a bastardization of the term.
Intel's in-order superscalar is a limited 2-way, and it might be the high-water mark for complexity, since Intel likes the idea of just slathering on a billion of them as opposed to making any single one much more than a unit to run a tight software loop.

The more interesting parts of the system are in the stuff that binds them together, but that doesn't contribute to each core's complexity.
 
The "cores" are straightforward in-order pipelines that run instructions from a very straightforward collection of threads.
Half or more of a core's work isn't even in the shader units, instead it's in the decoupled TMU and separate ROP partitions.
We might disagree on what is straightforward or not but GPU cores complexity is going up, not down.
Do you really expect the next major architectural shift from NVIDIA and AMD to not have more complex cores? It will eventually happen that more and more fixed function units will be absorbed by programmable cores, and complexity will go up just to accommodate extra features.

From what we know LRB won't be out for another 12-18 months and in that time frame I expect NVIDIA to offer something more similar to LRB than to G8x (x86 aside). Just because NVIDIA is not talking about G300 it doesn't mean they are not exactly on the same route. Do you remember what they were saying about unified shading? I do :)
Intel's in-order superscalar is a limited 2-way, and it might be the high-water mark for complexity, since Intel likes the idea of just slathering on a billion of them as opposed to making any single one much more than a unit to run a tight software loop.
While you might be right, this is an aspect where I expect Intel to shine, especially with regards to NVidia. Since no one expects Intel to be initially as good as NVIDIA at graphics, I don't see why we should expect NVidia to be as good as Intel at designing such general-purpose cores.
And they WILL eventually get there, we just don't know when, but I suspect that day is coming very soon.
 
Don't know how much you've been following things, but GPU design cycles have been increasing lately and CPU design cycles have been shrinking. Both are releasing new designs/shrinks roughly every year.

The GPU market is still operating at twice the speed of the CPU market. Core 2 is over two years old now.

No, the majority of Larrabee is the L2 cache and the 16-way SIMDs, which are specifically jigged for throughput computing - both of these are fresh, shiny and new and are not x86.

I expect the 16-way SIMD to be a significant chunk of the logic. If the L2 cache on the other hand is large I count that as mistake number two for a GPU design.

Speaking of the 16-way SIMD. If Larrabee II comes with even newer and shinier 32-way SIMD, old apps will likely operate at half the speed of what the chip could potentially do. For ATI and Nvidia this is not a problem.
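Rough illustration of why (intrinsics left out, the lane count is the whole point): the width is frozen into the binary at compile time, whereas a driver-translated shader can simply be recompiled for the wider machine.

[code]
/* Compiled for a 16-lane machine: the binary always issues 16-wide
 * operations.  Run it on a hypothetical 32-lane Larrabee II and half
 * the lanes sit idle until someone rebuilds it. */
#define LANES 16                      /* frozen when the binary was built */

void scale(float *x, int n, float k)  /* n assumed a multiple of LANES    */
{
    for (int i = 0; i < n; i += LANES)
        for (int j = 0; j < LANES; ++j)   /* stands in for one SIMD op    */
            x[i + j] *= k;
}
[/code]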
 
We might disagree on what is straightforward or not but GPU cores complexity is going up, not down.
Considering that GPU "cores" are crap cores that (IMHO) don't fully count as cores in the first place, that's not saying much.
If we're going to rate things in the general-purpose core continuum, the shader clusters have to be compared against real general-purpose cores like Penryn.

Do you really expect the next major architectural shift from NVIDIA and AMD to not have more complex cores? It will eventually happen that more and more fixed function units will be absorbed by programmable cores, and complexity will go up just to accommodate extra features.
If the fixed function is absorbed into the cores directly, it's just another unit.
I can put a swiss army knife in my pocket, but I'm not any more complex than I was prior.

If the designers don't want to constrain the generality of the design, more work will be done using simple operations to synthesize complex operations. That lends itself to simpler hardware.
The stated desire to eliminate a lot of the peculiarities of data structure, fixed pipeline, data formats, and specialized storage in many ways makes the hardware's job simpler.
If it's all "fetch from generic memory, execute generic operation, store to generic memory", then we're going back what Von Neumann did.

From what we know LRB won't be out for another 12-18 months and in that time frame I expect NVIDIA to offer something more similar to LRB than to G8x (x86 aside). Just because NVIDIA is not talking about G300 it doesn't mean they are not exactly on the same route. Do you remember what they were saying about unified shading? I do :)
It would be quite a gamble to copy Intel when not even Intel knows how well its gamble will pay off once present in final silicon.

While you might be right, this is an aspect where I expect Intel to shine, especially with regards to NVidia. Since no one expects Intel to be initially as good as NVIDIA at graphics, I don't see why we should expect NVidia to be as good as Intel at designing such general-purpose cores.
In my opinion, general purpose cores are generally being overrated, as far as graphics is concerned. A shader array isn't a massive failure if it can't run string comparison instructions or transition between software privilege levels.

The P54 is decades-old and nowhere near cutting-edge as a CPU.
I don't see anyone having a problem matching it within the confines of consumer graphics.
The cores themselves are not as important as what connects and coordinates them.

I'm reserving judgement until I see the physical implementation of Larrabee, from a manufacturer with a history of cheaper but in many ways uncompetitive and overweight x86 cores (of which Larrabee is a direct descendant), of so-far uninspired many-core integration, and of nascent massively parallel software.

Those parts are untried, fetch/decode/execute is not.
 
The only other reason to use x86 would be the (somewhat questionable) ability to execute legacy Windows code directly on Larrabee, although for the life of me I can't see why anyone would want to be that wasteful.
It's about the tool chain. You can create absolutely anything for Larrabee from the day it launches, you just have to write the application (or scale an existing app). You already have the O.S. type functionality, runtime libraries, compilers, powerful debug and profiling tools, frameworks, etc. And developers are already well acquainted with them. Compared to that, something like CUDA is still in its infancy, and NVIDIA still has over a decade of work ahead of it simply to write the pile of software developers expect to be available. Just think of the man-years needed to write a full-fledged O.S. kernel from scratch, and I think you get the idea.

Now, obviously, Larrabee doesn't need all this for classic rasterization. But I believe things do look really exciting once we look at other applications and other approaches to 3D rendering. It's not like future generations of GPUs won't be capable of the same things at some point, but you'll have to write much of the software from scratch. As the applications get more complex, you'll need several layers of software abstraction, and for Larrabee most of those already exist or would be much easier to supply. Having an architecture as generic as x86 really means you don't have to work around hardware limitations all the time.

And even if NVIDIA and ATI launch products as generic as Larrabee at around the same time, they still have to convince the world to buy the products and start coding for it. By sticking to x86 Intel ensured that it's already leaps ahead of that. And developers won't have to wait five years before their software can run on consumer systems. They can go right ahead and write for instance a physics engine that runs both on multi-core CPUs and on Larrabee.
 
It's about the tool chain. You can create absolutely anything for Larrabee from the day it launches, you just have to write the application (or scale an existing app). You already have the O.S. type functionality, runtime libraries, compilers, powerful debug and profiling tools, frameworks, etc.
Larrabee's launching as a PCI-E add-in board. What OS allows that kind of application activity and software/hardware system access from an expansion slot?

As the applications get more complex, you'll need several layers of software abstraction, and for Larrabee most of those already exist or would be much easier to supply.
Sounds interesting. Which current abstraction layers do you mean?

And even if NVIDIA and ATI launch products as generic as Larrabee at around the same time, they still have to convince the world to buy the products and start coding for it. By sticking to x86 Intel ensured that it's already leaps ahead of that.
AMD has a few old x86 cores lying around, so at least one competitor could try it, assuming it still exists at that point.

And developers won't have to wait five years before their software can run on consumer systems. They can go right ahead and write for instance a physics engine that runs both on multi-core CPUs and on Larrabee.
They'd have to wait a few years before Larrabee and sufficiently parallel CPUs had enough market share to make the effort economical.
 
Considering that GPU "cores" are crap cores that (IMHO) don't fully count as cores in the first place, that's not saying much.
I'm not talking about NVIDIA's cores; I'm referring to what NVIDIA calls multiprocessors. Those parts are obviously getting incrementally more complex.

If the fixed function is absorbed into the cores directly, it's just another unit.
Absorbed in this case means that a specific task carried out by a fixed-function unit gets re-implemented, completely or partially, in software. See LRB's ROPs.
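For example, the blend a ROP does in fixed hardware is only a few integer ops per pixel when done in code; a toy single-channel version (no sRGB, no dithering) just to show the flavour:

[code]
#include <stdint.h>

/* "src over dst" alpha blend on one 8-bit channel -- the sort of
 * per-pixel back-end work a hardware ROP does, and that a Larrabee-style
 * design would run as ordinary vectorised code. */
static inline uint8_t blend_over(uint8_t src, uint8_t dst, uint8_t alpha)
{
    /* result = src*alpha + dst*(255-alpha), rounded, in 8-bit fixed point */
    return (uint8_t)((src * alpha + dst * (255 - alpha) + 127) / 255);
}
[/code]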

In my opinion, general purpose cores are generally being overrated, as far as graphics is concerned. A shader array isn't a massive failure if it can't run string comparison instructions or transition between software privilege levels.
It's quite clear that it's not just about graphics anymore, at least for a while.

The P54 is decades-old and nowhere near cutting-edge as a CPU.
I don't see anyone having a problem matching it within the confines of consumer graphics.
No one is saying that P54 is a state-of-the-art CPU, but last time I checked NVIDIA has designed ZERO of them, so while I'm sure they are very smart and they can do it, I don't see them being as good as Intel from the get-go. The argument <company X has no experience with Y> has to be a reflexive one. On the other hand you might say that NVIDIA doesn't need to design such a general-purpose core anyway, but I disagree on this: it will eventually happen, probably sooner rather than later.
 