Larrabee delayed to 2011?

Whoah. Really?! Wow, seems like an awful lot of architectural baggage (~3 decades' worth) to drag around in a GPU. Surely those trannies could have been put to better use somewhere else... I mean, it's not as if those CPU cores would have any real integer performance anyway, certainly not enough so you'd consider actually running regular applications on them.
 
I thought Larrabee doesn't support SSE/SSE2 instructions, only LRBNI. Am I wrong?
 
I wouldn't really worry a lot about wasted transistors. That really isn't the problem.

The issue is power consumption, and it's hard to say precisely how that relates to legacy support. A lot of the old cruft in x86 is microcoded, which doesn't necessarily require a lot of power.

DK
 
Whoah. Really?! Wow, seems like an awful lot of architectural baggage (~3 decades' worth) to drag around in a GPU. Surely those trannies could have been put to better use somewhere else...
What "architectural baggage" would you remove and how many transistors would that save exactly (in percentage of the entire die size)?

The original Pentium had merely 3.1 million transistors, and only a fraction of that can be considered legacy overhead.
I mean, it's not as if those CPU cores would have any real integer performance anyway, certainly not enough so you'd consider actually running regular applications on them.
Larrabee's cores will be much faster than the original Pentium, close to an Atom perhaps (which runs lots of modern regular applications). But of course the true power comes from having a couple dozen of these. It's great for developers to get their application running on Larrabee without any effort (no matter the initial performance). They can immediately start multi-threading it and making use of vector operations. Doing these things incrementally is a big plus over having to rewrite things from scratch. This is where the tiny bit of x86 baggage starts paying off.
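
To make the incremental path concrete, here's a minimal sketch of what that first vectorization step might look like. AVX-512 intrinsics stand in for LRBNI (both are 16 floats per register), the function names are made up for illustration, and this is not actual Larrabee code:

// Hypothetical sketch, not Larrabee code: AVX-512 intrinsics as a stand-in
// for LRBNI's 16-wide single-precision vectors. Step 0 is the unmodified
// scalar routine that runs on any x86 core; step 1 rewrites only the hot loop.
#include <immintrin.h>
#include <cstddef>

// Step 0: plain scalar code -- works immediately, however slowly.
void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Step 1: the same loop, 16 elements per iteration (n assumed to be a
// multiple of 16 to keep the sketch short).
void saxpy_vector(float a, const float* x, float* y, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);       // broadcast a to all 16 lanes
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);    // load 16 floats
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);      // 16 multiply-adds at once
        _mm512_storeu_ps(y + i, vy);           // store 16 results
    }
}

Step 2 would then be splitting the index range across threads, again without touching any code that hasn't been ported yet.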
 
So where are they burning excessive power? Is it hw bugs? inefficient implementation?
Their live prototype demonstration indicates none of that. My guess is still software complications.

In my experience planning an optimized software architecture ahead of having the hardware available is futile. It's quite possible they had to go back to the drawing board to take an entirely different approach. But any significant software change also shifts the hardware bottlenecks. So that may take a redesign as well.
 
I'd speculate that we won't see anything from "LRB-recharged" before 2012. If that turns out to be true, it doesn't suggest minor changes, IMO. It depends on what exactly one understands by "minor or major changes", though. I seriously doubt they'd drop backwards compatibility with LRB Prime wherever it's essential.
 
What "architectural baggage" would you remove and how many transistors would that save exactly (in percentage of the entire die size)?
How about a fixed-length load/store instruction set with a few addressing modes?
I've gone over guesstimates before.

The original Pentium had merely 3.1 million transistors, and only a fraction of that can be considered legacy overhead.
The P5 core weighed in at 3.1 million transistors, while a contemporaneous Alpha weighed in at 1.68 million. We don't even need legacy overhead to see a significant dent.

Larrabee's cores will be much faster than the original Pentium, close to an Atom perhaps (which runs lots of modern regular applications).
For loads that can be put through vector resources, Larrabee cores should be faster.
For pure x86 code, Atom has the advantage in cache, clock, and issue width; Larrabee's x86 issue width is significantly more restricted.

This is where the tiny bit of x86 baggage starts paying off.
We've gone over this canard as well. The minuscule effort saved by using x86 over any other established ISA is dwarfed by the fact that you still need to learn how to massively multithread, properly use the shared cache, and use the new vector ISA.

Their live prototype demonstration indicates none of that. My guess is still software complications.
The live demonstration showed non-overclocked SGEMM running at ~800 GFLOPS.
Other x86 chips running SGEMM have hit over 90% of theoretical peak.
Larrabee's target was at least 1 TFLOP DP, which also meant 2 or more TFLOPS SP, with a clock range between 1.5 and 2.5 GHz.
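
For some rough context (all inputs are assumptions on my part, since Intel hasn't published the demo chip's core count or clock): peak SP is cores x 16 lanes x 2 flops per FMA x clock.

#include <cstdio>

int main() {
    // Assumed configuration -- not published figures.
    const double cores        = 32;
    const double lanes        = 16;   // 16 x 32-bit SP per vector register
    const double flops_per_op = 2;    // a fused multiply-add counts as 2 flops
    const double clock_ghz    = 1.0;  // sweep this between 1.0 and 2.5

    const double peak_sp = cores * lanes * flops_per_op * clock_ghz; // GFLOPS
    const double demo    = 800.0;     // the SGEMM number from the demo above
    std::printf("peak %.0f GFLOPS SP, demo = %.0f%% of peak\n",
                peak_sp, 100.0 * demo / peak_sp);
    return 0;
}

Under those assumptions, ~800 GFLOPS is about 78% of peak at 1 GHz but under 40% at 2 GHz, which is why the number looks weak next to the 90%+ SGEMM efficiencies other x86 chips manage.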

When overclocked, the >600 mm² chip barely outperformed the best known score for a stock RV770 (in this forum).


At the very least, the chip that was demoed was not ready for prime-time.
 
It's great for developers to get their application running on Larrabee without any effort (no matter the initial performance). They can immediately start multi-threading it and making use of vector operations. Doing these things incrementally is a big plus over having to rewrite things from scratch. This is where the tiny bit of x86 baggage starts paying off.

How about an alternative approach? They run their apps on existing CPUs, profile which bits need perf, and rewrite those in OpenCL to target GPUs/whichever-LRB-is-available.

Do you really need x86 in your GPU with this approach? Is it in any way worse off than your suggestion? If anything, existing CPUs will blow LRB out of the water when it comes to the serial bits. And those are important for overall perf.
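
As a rough sketch of what that looks like in practice (device choice, kernel, and sizes are all illustrative, and error checking is omitted for brevity):

// Minimal sketch of "keep the app on the CPU, move only the profiled hot
// loop to OpenCL". Real code would check every error value.
#include <CL/cl.h>
#include <vector>
#include <cstdio>

static const char* kSrc =
    "__kernel void saxpy(float a, __global const float* x, __global float* y) {\n"
    "    size_t i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main() {
    const size_t n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    // Ask for any device: a GPU today, an LRB board if a driver ever ships.
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

    cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), x.data(), nullptr);
    cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), y.data(), nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "saxpy", nullptr);

    float a = 3.0f;
    clSetKernelArg(k, 0, sizeof(float), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &bx);
    clSetKernelArg(k, 2, sizeof(cl_mem), &by);

    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, n * sizeof(float), y.data(),
                        0, nullptr, nullptr);
    std::printf("y[0] = %f\n", y[0]);   // expect 3*1 + 2 = 5

    // Cleanup omitted to keep the sketch short.
    return 0;
}

The serial 99% of the app never leaves the host CPU; only the profiled kernel cares what the accelerator's ISA is.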
 
32 times that is ~100M trannies. Negligible in comparison to chips that have ~1B trannies???

But you ignored the second part of his sentence: out of those 3.1 million transistors, how much is dedicated to legacy overhead?
 
But you ignored the second part of his sentence: out of those 3.1 million transistors, how much is dedicated to legacy overhead?

In a GPU, ALL of x86 is legacy. Who needs mmap() and ioctl() and cousins in a shader? Existing apps run fine on existing CPUs.

x86 is there in LRB for many reasons. Some of them make sense. None of them have got anything to do with performance. At this time, LRB1 seems to be lacking in performance.
 
Wait, I thought Atom was dual issue just like the Pentium core in LRB1.
I've been asking that question, and the answers I've gotten indicate that Larrabee can issue one traditional x86 instruction and one VPU instruction per cycle (the exception being that vector stores can run in place of an x86 instruction).
In pure x86 code, it is apparently half the width of Atom.
Intel slides have also been released that seem to support this.
 
At this time, LRB1 seems to be lacking in performance.

But you haven't convinced me that removing legacy support would solve (or significantly help) their problems. If legacy support only consumes a small percentage of die space (or transistor count), what could they have possibly done differently in that small space that would have saved LRB?
 
I've been asking that question, and the answers I've gotten indicate that Larrabee can issue one traditional x86 instruction and one VPU instruction per cycle (the exception being that vector stores can run in place of an x86 instruction).
In pure x86 code, it is apparently half the width of Atom.
Intel slides have also been released that seem to support this.

AFAIK, the Pentium was a dual-issue core. It could issue two instructions per clock, one on its U pipe and one on its V pipe. As per my understanding, they just expanded (from 80k feet) the set of instructions that can run on the V pipe. So now, one VPU instruction can dual-issue with x86 every clock. Vector stores can dual-issue with vector ALU operations. The rest should stay the same.

I haven't seen any slide that says x86 dual-issue can't be done. Forbidding x86 dual-issue makes no sense, IMHO.
 
AFAIK, the Pentium was a dual-issue core. It could issue two instructions per clock, one on its U pipe and one on its V pipe. As per my understanding, they just expanded (from 80k feet) the set of instructions that can run on the V pipe. So now, one VPU instruction can dual-issue with x86 every clock. Vector stores can dual-issue with vector ALU operations. The rest should stay the same.

I haven't seen any slide that says x86 dual-issue can't be done. Forbidding x86 dual-issue makes no sense, IMHO.

Larrabee is based on the core, but it is highly modified.

Forsyth's GDC 09 Larrabee presentation, slide 50 describes Larrabee as this:

Two Pipelines
One x86 scalar pipe, one LNI vector
Every clock, you can run an instruction each
Similar to Pentium U/V pairing rules
Mask operations count as scalar ops

Vector stores are special
They can run down the scalar pipe
Can co-issue with a vector math op
etc..
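
Hand-waving a bit, here's how one iteration of a vectorized loop might map onto those rules. AVX-512 intrinsics again stand in for LNI, and the pipe assignments in the comments are just my reading of the slide, not a documented schedule:

#include <immintrin.h>
#include <cstddef>

// y[i] += x[i], but only where x[i] > 0 (a per-lane predicate).
void add_positive(const float* x, float* y, std::size_t n) {
    const __m512 zero = _mm512_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);                      // vector pipe
        __m512 vy = _mm512_loadu_ps(y + i);                      // vector pipe
        __mmask16 m = _mm512_cmp_ps_mask(vx, zero, _CMP_GT_OQ);  // writes a mask register
        vy = _mm512_mask_add_ps(vy, m, vy, vx);                  // vector math op
        _mm512_storeu_ps(y + i, vy);                             // vector store: per the
                                                                 // slide, can go down the
                                                                 // scalar pipe and co-issue
                                                                 // with a vector math op
        // The loop's own i += 16 / compare / branch are plain x86 scalar ops
        // that can pair with the vector work, one each per clock.
    }
}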
 
But you haven't convinced me that removing legacy support would solve (or significantly help) their problems. If legacy support only consumes a small percentage of die space (or transistor count), what could they have possibly done differently in that small space that would have saved LRB?

Well, the 10% figure is right there in front of you. Who knows how much other pointless stuff is in there for which we don't have numbers (cache coherency overhead, anyone?). If you look at the overall bogo-flops/mm² and bogo-flops/W, LRB1 was hardly any good.
 