Larrabee delayed to 2011?

Whoah. Really?! Wow, seems like an awful lot of architectural baggage (~3 decades' worth) to drag around in a GPU. Surely those trannies could have been put to better use somewhere else... I mean, it's not as if those CPU cores would have any real integer performance anyway, certainly not enough so you'd consider actually running regular applications on them.
 
I thought Larrabee doesn't support SSE/SSE2 instructions, only LRBNI. Am I wrong?
 
I wouldn't really worry a lot about wasted transistors. That really isn't the problem.

The issue is power consumption, and it's hard to say precisely how that relates to legacy support. A lot of the old cruft in x86 is microcoded, which doesn't necessarily require a lot of power.

DK
 
Whoah. Really?! Wow, seems like an awful lot of architectural baggage (~3 decades' worth) to drag around in a GPU. Surely those trannies could have been put to better use somewhere else...
What "architectural baggage" would you remove and how many transistors would that save exactly (in percentage of the entire die size)?

The original Pentium had merely 3.1 million transistors, and only a fraction of that can be considered legacy overhead.
I mean, it's not as if those CPU cores would have any real integer performance anyway, certainly not enough so you'd consider actually running regular applications on them.
Larrabee's cores will be much faster than the original Pentium, close to an Atom perhaps (which runs lots of modern regular applications). But of course the true power comes from having a couple dozen of these. It's great for developers to get their application running on Larrabee without any effort (no matter the initial performance). They can immediately start multi-threading it and making use of vector operations. Doing these things incrementally is a big plus over having to rewrite things from scratch. This is where the tiny bit of x86 baggage starts paying off.
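
To make the incremental path concrete, here's a minimal sketch of what that first vectorization step might look like. AVX-512 intrinsics stand in for LRBNI (both are 16 floats per register), the function names are made up for illustration, and this is not actual Larrabee code:

// Hypothetical sketch, not Larrabee code: AVX-512 intrinsics as a stand-in
// for LRBNI's 16-wide single-precision vectors. Step 0 is the unmodified
// scalar routine that runs on any x86 core; step 1 rewrites only the hot loop.
#include <immintrin.h>
#include <cstddef>

// Step 0: plain scalar code -- works immediately, however slowly.
void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Step 1: the same loop, 16 elements per iteration (n assumed to be a
// multiple of 16 to keep the sketch short).
void saxpy_vector(float a, const float* x, float* y, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);       // broadcast a to all 16 lanes
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);    // load 16 floats
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);      // 16 multiply-adds at once
        _mm512_storeu_ps(y + i, vy);           // store 16 results
    }
}

Step 2 would then be splitting the index range across threads, again without touching any code that hasn't been ported yet.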
 
So where are they burning excessive power? Is it hw bugs? inefficient implementation?
Their live prototype demonstration indicates none of that. My guess is still software complications.

In my experience planning an optimized software architecture ahead of having the hardware available is futile. It's quite possible they had to go back to the drawing board to take an entirely different approach. But any significant software change also shifts the hardware bottlenecks. So that may take a redesign as well.
 
I'd speculate that we won't see anything from "LRB-recharged" before 2012. If that turns out to be true, it doesn't suggest minor changes, IMO. It depends on what exactly one understands by "minor or major changes", though. I seriously doubt they'd drop backwards compatibility with LRB Prime wherever it's essential.
 
What "architectural baggage" would you remove and how many transistors would that save exactly (in percentage of the entire die size)?
How about a fixed-length load/store instruction set with a few addressing modes?
I've gone over guesstimates before.

The original Pentium had merely 3.1 million transistors, and only a fraction of that can be considered legacy overhead.
The P5 core weighed in at 3.1 million transistors, while a contemporaneous Alpha weighed in at 1.68 million. We don't even need legacy overhead to see a significant dent.

Larrabee's cores will be much faster than the original Pentium, close to an Atom perhaps (which runs lots of modern regular applications).
For loads that can be put through vector resources, Larrabee cores should be faster.
For pure x86 code, Atom has the advantage in cache, clock, and issue width; Larrabee's x86 issue width is significantly more restricted.

This is where the tiny bit of x86 baggage starts paying off.
We've gone over this canard as well. The minuscule effort saved by using x86 over any other established ISA is dwarfed by the fact that you still need to learn how to massively multithread, properly use the shared cache, and use the new vector ISA.

Their live prototype demonstration indicates none of that. My guess is still software complications.
The live demonstration showed non-overclocked SGEMM running at ~800 GFLOPS.
Other x86 chips running SGEMM have hit over 90% of theoretical peak.
Larrabee's target was at least 1 TFLOP DP, which also meant 2 or more TFLOPS SP, with a clock range between 1.5 and 2.5 GHz.
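
For some rough context (all inputs are assumptions on my part, since Intel hasn't published the demo chip's core count or clock): peak SP is cores x 16 lanes x 2 flops per FMA x clock.

#include <cstdio>

int main() {
    // Assumed configuration -- not published figures.
    const double cores        = 32;
    const double lanes        = 16;   // 16 x 32-bit SP per vector register
    const double flops_per_op = 2;    // a fused multiply-add counts as 2 flops
    const double clock_ghz    = 1.0;  // sweep this between 1.0 and 2.5

    const double peak_sp = cores * lanes * flops_per_op * clock_ghz; // GFLOPS
    const double demo    = 800.0;     // the SGEMM number from the demo above
    std::printf("peak %.0f GFLOPS SP, demo = %.0f%% of peak\n",
                peak_sp, 100.0 * demo / peak_sp);
    return 0;
}

Under those assumptions, ~800 GFLOPS is about 78% of peak at 1 GHz but under 40% at 2 GHz, which is why the number looks weak next to the 90%+ SGEMM efficiencies other x86 chips manage.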

When overclocked, the >600 mm² chip barely outperformed the best known score for a stock RV770 (in this forum).


At the very least, the chip that was demoed was not ready for prime-time.
 
It's great for developers to get their application running on Larrabee without any effort (no matter the initial performance). They can immediately start multi-threading it and making use of vector operations. Doing these things incrementally is a big plus over having to rewrite things from scratch. This is where the tiny bit of x86 baggage starts paying off.

How about an alternative approach? They run their apps on existing CPUs, profile which bits need perf, and rewrite those in OpenCL to target GPUs/whichever-LRB-is-available.

Do you really need x86 in your GPU with this approach? Is it in any way worse off than your suggestion? If anything, existing CPUs will blow LRB out of the water when it comes to the serial bits. And those are important for overall perf.
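
As a rough sketch of what that looks like in practice (device choice, kernel, and sizes are all illustrative, and error checking is omitted for brevity):

// Minimal sketch of "keep the app on the CPU, move only the profiled hot
// loop to OpenCL". Real code would check every error value.
#include <CL/cl.h>
#include <vector>
#include <cstdio>

static const char* kSrc =
    "__kernel void saxpy(float a, __global const float* x, __global float* y) {\n"
    "    size_t i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main() {
    const size_t n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    // Ask for any device: a GPU today, an LRB board if a driver ever ships.
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

    cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), x.data(), nullptr);
    cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), y.data(), nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "saxpy", nullptr);

    float a = 3.0f;
    clSetKernelArg(k, 0, sizeof(float), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &bx);
    clSetKernelArg(k, 2, sizeof(cl_mem), &by);

    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, n * sizeof(float), y.data(),
                        0, nullptr, nullptr);
    std::printf("y[0] = %f\n", y[0]);   // expect 3*1 + 2 = 5

    // Cleanup omitted to keep the sketch short.
    return 0;
}

The serial 99% of the app never leaves the host CPU; only the profiled kernel cares what the accelerator's ISA is.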
 
32 times that is ~100M trannies. Negligible in comparison to chips that have ~1B trannies???

But you ignored the second part of his sentence: out of those 3.1 million transistors, how much is dedicated to legacy overhead?
 
But you ignored the second part of his sentence: out of those 3.1 million transistors, how much is dedicated to legacy overhead?

In a GPU, ALL of x86 is legacy. Who needs mmap() and ioctl() and cousins in a shader? Existing apps run fine on existing CPUs.

x86 is there in LRB for many reasons. Some of them make sense. None of them have got anything to do with performance. At this time, LRB1 seems to be lacking in performance.
 
Wait, I thought Atom was dual issue just like the Pentium core in LRB1.
I've been asking that question, and the answers I've gotten indicate that Larrabee can issue one traditional x86 instruction and one VPU instruction per cycle (the exception being that vector stores can run in place of an x86 instruction).
In pure x86 code, it is apparently half the width of Atom.
Intel slides have also been released that seem to support this.
 
At this time, LRB1 seems to be lacking in performance.

But you haven't convinced me that removing legacy support would solve (or significantly help) their problems. If legacy support only consumes a small percentage of die space (or transistor count), what could they have possibly done differently in that small space that would have saved LRB?
 
I've been asking that question, and the answers I've gotten indicate that Larrabee can issue one traditional x86 instruction and one VPU instruction per cycle (the exception being that vector stores can run in place of an x86 instruction).
In pure x86 code, it is apparently half the width of Atom.
Intel slides have also been released that seem to support this.

AFAIK, the Pentium was a dual-issue core. It could issue two instructions per clock, one on its U pipe and one on its V pipe. As per my understanding, they just expanded (from 80k feet) the set of instructions that can run on the V pipe. So now, one VPU instruction can dual-issue with x86 every clock. Vector stores can dual-issue with vector ALU operations. The rest should stay the same.

I haven't seen any slide that says x86 dual-issue can't be done. Forbidding x86 dual-issue makes no sense, IMHO.
 
AFAIK, the Pentium was a dual-issue core. It could issue two instructions per clock, one on its U pipe and one on its V pipe. As per my understanding, they just expanded (from 80k feet) the set of instructions that can run on the V pipe. So now, one VPU instruction can dual-issue with x86 every clock. Vector stores can dual-issue with vector ALU operations. The rest should stay the same.

I haven't seen any slide that says x86 dual-issue can't be done. Forbidding x86 dual-issue makes no sense, IMHO.

Larrabee is based on the core, but it is highly modified.

Forsyth's GDC 09 Larrabee presentation, slide 50 describes Larrabee as this:

Two Pipelines
One x86 scalar pipe, one LNI vector
Every clock, you can run an instruction each
Similar to Pentium U/V pairing rules
Mask operations count as scalar ops

Vector stores are special
They can run down the scalar pipe
Can co-issue with a vector math op
etc..
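
Hand-waving a bit, here's how one iteration of a vectorized loop might map onto those rules. AVX-512 intrinsics again stand in for LNI, and the pipe assignments in the comments are just my reading of the slide, not a documented schedule:

#include <immintrin.h>
#include <cstddef>

// y[i] += x[i], but only where x[i] > 0 (a per-lane predicate).
void add_positive(const float* x, float* y, std::size_t n) {
    const __m512 zero = _mm512_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);                      // vector pipe
        __m512 vy = _mm512_loadu_ps(y + i);                      // vector pipe
        __mmask16 m = _mm512_cmp_ps_mask(vx, zero, _CMP_GT_OQ);  // writes a mask register
        vy = _mm512_mask_add_ps(vy, m, vy, vx);                  // vector math op
        _mm512_storeu_ps(y + i, vy);                             // vector store: per the
                                                                 // slide, can go down the
                                                                 // scalar pipe and co-issue
                                                                 // with a vector math op
        // The loop's own i += 16 / compare / branch are plain x86 scalar ops
        // that can pair with the vector work, one each per clock.
    }
}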
 
But you haven't convinced me that removing legacy support would solve (or significantly help) their problems. If legacy support only consumes a small percentage of die space (or transistor count), what could they have possibly done differently in that small space that would have saved LRB?

Well, the 10% figure is right there in front of you. Who knows how much other pointless stuff is in there for which we don't have numbers (cache coherency overhead, anyone?). If you look at the overall bogo-flops/mm² and bogo-flops/W, LRB1 was hardly any good.
 