Larrabee delayed to 2011?

Anteru · Nov 18, 2009

It seems Larrabee was presented at SC09 with 1006 GFLOPS of SGEMM performance; however, it was overclocked for this. Source (in German): http://www.heise.de/newsticker/meld...rt-Larrabee-mit-ueber-1-Teraflops-862305.html

"Normal" performance is supposed to be 417 GFLOPS, with a peak of 712 (I don't quite get this. Is 417 the performance when running on a matrix that does not fit into memory, while 712 is when running fully from caches?)

MfA · Nov 18, 2009

712 is presumably peak theoretical FLOPS ... which is not unexpected, but will leave some people disappointed. That's one hell of an overclock though: 1006 is roughly 240% of 417.

Ailuros · Nov 18, 2009

http://www.theregister.co.uk/2009/11/17/sc09_rattner_keynote/page2.html

MfA · Nov 18, 2009

That does make more sense ... but on the other hand they talk about 80 cores, which sounds like they demo'd the old Terascale chip and not Larrabee.

Ailuros · Nov 18, 2009

MfA said:
That does make more sense ... but on the other hand they talk about 80 cores, which sounds like they demo'd the old Terascale chip and not Larrabee.

Most likely yes. The real question is what's the point of demo'ing the old Terascale chip in any case?

Jaaanosik · Nov 18, 2009

Ailuros said:
Most likely yes. The real question is what's the point of demo'ing the old Terascale chip in any case?

To show you that they are far from being ready.

Ailuros · Nov 18, 2009

Jaaanosik said:
To show you that they are far from being ready.

I was aiming for a more serious answer, but thanks anyway for the good laugh.

rpg.314 · Nov 18, 2009

MfA said:
That does make more sense ... but on the other hand they talk about 80 cores, which sounds like they demo'd the old Terascale chip and not Larrabee.

If they demo'ed the cloth simulation, then it is assuredly not on the old 80-core chip. That chip was great for experimentation and as a proof of concept, but was utterly, hopelessly, pathetically useless in the real world.

dkanter · Nov 18, 2009

It's actually more likely that the reporter just got the number of cores wrong.

DK

Pete · Nov 20, 2009

Anteru said:
"Normal" performance is supposed to be 417 GFLOPS, with a peak of 712 (I don't quite get this. Is 417 the performance when running on a matrix that does not fit into memory, while 712 is when running fully from caches?)

In case ppl don't follow Ail's link, the relevant bit:

On the SGEMM single precision, dense matrix multiply test, Rattner showed Larrabee running at a peak of 417 gigaflops with half of its 80 cores activated; and with all of the cores turned on, it was able to hit 805 gigaflops. As the keynote was winding down, Rattner told the techies to overclock it, and was able to push a single Larrabee chip up to just over 1 teraflops, which is the design goal for the initial Larrabee co-processors.

Ailuros · Nov 20, 2009

Pete,

There's a lot not adding up in that article. Example:

Here's the next problem. Sparse matrix math is what is commonly needed in simulations involving cloth and water. And on that test, a Larrabee chip that was not overclocked was able to do between 7.9 and 8.1 gigaflops, depending on the test and the size of the matrices.

If I divide 417 by 40 or 805 by 80, the result is in the >10 GFLOPS range. What am I missing?
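
The division above is easy to check. A minimal sketch, taking the article's 40- and 80-core figures at face value (other posts in this thread dispute them); the function name is made up for illustration.

```python
def gflops_per_core(total_gflops: float, cores: int) -> float:
    """Average throughput per core, assuming the work is spread evenly."""
    return total_gflops / cores


# Figures as quoted from the article: 417 GFLOPS with half the cores,
# 805 GFLOPS with all of them. The core counts are the article's claim.
half = gflops_per_core(417, 40)
full = gflops_per_core(805, 80)

print(f"half: {half:.2f} GFLOPS/core, full: {full:.2f} GFLOPS/core")
```

Both results land slightly above 10 GFLOPS per core. Note that the 7.9 to 8.1 GFLOPS in the quoted passage is given for the sparse test on a whole chip, so it is a different benchmark rather than a per-core SGEMM number.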

dkanter · Nov 20, 2009

Ailuros said:
Pete,

There's a lot not adding up in that article. Example:

If I divide 417 by 40 or 805 by 80, the result is in the >10 GFLOPS range. What am I missing?

Why 40 or 80? Where did those numbers come from?

SGEMM may have been 805 GFLOP/s but that doesn't necessarily say a whole lot about SpMV.
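
One way to see why the two numbers need not track each other is arithmetic intensity, i.e. FLOPs per byte of memory traffic. This is a rough sketch under idealized assumptions: the CSR storage format, the 4-byte indices, the traffic model and the example sizes are all assumptions for illustration, not details from the keynote or the article.

```python
def sgemm_intensity(n: int, elem_bytes: int = 4) -> float:
    """FLOPs per byte for C = A * B with n x n single-precision matrices.

    Idealized: A and B are each read once and C is written once.
    """
    flops = 2 * n ** 3                   # n^3 multiplies plus n^3 adds
    traffic = 3 * n * n * elem_bytes     # A, B and C
    return flops / traffic


def spmv_csr_intensity(rows: int, nnz: int,
                       elem_bytes: int = 4, index_bytes: int = 4) -> float:
    """FLOPs per byte for y = A * x with A stored in CSR.

    Pessimistic: every nonzero fetches its value, its column index and
    one element of x from memory (no cache reuse of x).
    """
    flops = 2 * nnz                                       # one multiply and one add per nonzero
    per_nonzero = nnz * (elem_bytes + index_bytes + elem_bytes)
    per_row = rows * (index_bytes + elem_bytes)           # row pointer read, y write
    return flops / (per_nonzero + per_row)


dense = sgemm_intensity(1024)
sparse = spmv_csr_intensity(rows=100_000, nnz=1_000_000)
print(f"SGEMM: {dense:.1f} FLOPs/byte, SpMV: {sparse:.2f} FLOPs/byte")
```

With these assumptions the dense case comes out around 170 FLOPs per byte and the sparse case around 0.16, so SpMV tends to be limited by memory bandwidth rather than by the vector units that SGEMM exercises.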

David

Ailuros · Nov 20, 2009

dkanter said:
Why 40 or 80? Where did those numbers come from?

SGEMM may have been 805 GFLOP/s but that doesn't necessarily say a whole lot about SpMV.

David

From the same article itself: http://www.theregister.co.uk/2009/11/17/sc09_rattner_keynote/page2.html

Once more here's the entire passage from it:

On the SGEMM single precision, dense matrix multiply test, Rattner showed Larrabee running at a peak of 417 gigaflops with half of its 80 cores activated; and with all of the cores turned on, it was able to hit 805 gigaflops. As the keynote was winding down, Rattner told the techies to overclock it, and was able to push a single Larrabee chip up to just over 1 teraflops, which is the design goal for the initial Larrabee co-processors.

Here's the next problem. Sparse matrix math is what is commonly needed in simulations involving cloth and water. And on that test, a Larrabee chip that was not overclocked was able to do between 7.9 and 8.1 gigaflops, depending on the test and the size of the matrices.

As for where the supposed 80 or 40 cores came from, that's something one should ask the author or, by extension, Rattner to clarify.

dkanter · Nov 20, 2009

Ailuros said:
From the same article itself: http://www.theregister.co.uk/2009/11/17/sc09_rattner_keynote/page2.html

Once more here's the entire passage from it:

As for where the supposed 80 or 40 cores came from, that's something one should ask the author or, by extension, Rattner to clarify.

It's obvious the author is mistaken. LRB ring is designed for multiples of 16 cores.

David

Ailuros · Nov 20, 2009

As I said, there's more than one spot in that article that doesn't make sense. 80 is a multiple of 16; it's just an odd multiplier (5), and no, it shouldn't mean anything.

It's both the author's and Rattner's responsibility to review content before an article gets published.

keritto · Nov 20, 2009

compres said:
I sure hope they ditch the abominable x86 ISA. Why is this needed if the chip will be incompatible with their CPUs anyway?

They need something to point to that makes them look better than the competition.

Not that x86 is really any better, but I know 90% of people will claim that x86 support, especially with an Intel logo on it, is the best that money can buy.

Anyway, it'll be the best-performing x86 CPU that doesn't have a socket of its own, unless Intel reconsiders that decision and makes a socketed version available in time for the first launch.

Grall said:
I'd say it's completely irrelevant whether other chips can't raytrace such a scene in real-time if larrabee's raytracing doesn't look one iota better (or run faster) than other chips' classic rasterizing - and looking at this video it DOESN'T.

I'd say it all depends on programmers' will and skill. We could have seen all kinds of magic even on DX 9.0c if, every time a new (crappy) DirectX version came out with a new function set, new engines had actually used it and good old games had been patched for the new instructions, with better-optimized textures and rendering.
After all, Intel has the right amount of money to make programmers willing to do something they haven't done before.

And in fact, no real skill lies behind all the early hype about "Lara's Bee" that we've seen and heard over the last 18 months or so.

As far as the TFLOPS hearsay goes, I hope ATi will have the money and the drive not to rest on its laurels, and that the HD5000 (R800/Cypress) series was just the thing that bought ATi enough time (and enough cash inflow to be virtually without losses) for an R900 that brings FMA to the already great VLIW engine. That's the feature that caused nVidia so much trouble, along with their decision to pump the same old G70-style chip full of steroids and new number-crunching features, which paid off in the G80 era but not now, 5 years later. I still haven't given up on my wishes (originally for R800), now for R900: the same 16 SIMD engines, just with 8 SP per engine (instead of today's 5 SP), 4 of them so-called SFUs, on the same R800-style core, with the VLIW units capable of running FMA code at 1000 MHz. That would give us 2560 SP capable of 2.56 TFLOPS in double floating point (80-bit mode) and a stunning 5.1 TFLOPS in raw 32-bit mode, with a 128 TMU / 32 RBE setup, virtually the same as in today's R800. And this could be done at the same die size as today's R800 or even smaller ... 330 mm2, on only 32nm bulk, which GF claimed a few months back is already available. So we could see an FMA/R900 refresh from ATi in almost the same time frame as the delayed Fermi, let's say June 2010, if everything goes OK for ATi. They'd then have the same kind of refresh launch nVidia had in the NV40-G70 days: less than a year later, with a highly revamped architecture.
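
The peak figures in this wish list follow from simple arithmetic. A small sketch to check them, assuming every stream processor retires one FMA (2 FLOPs) per cycle and that double precision runs at half the single-precision rate, which is what the quoted numbers imply; the function name is illustrative only.

```python
def peak_tflops(stream_processors: int, clock_ghz: float,
                flops_per_cycle: int = 2) -> float:
    """Theoretical peak, assuming each SP issues one FMA (2 FLOPs) per cycle."""
    return stream_processors * flops_per_cycle * clock_ghz / 1000.0


sps = 2560                       # SP count as stated in the post
single = peak_tflops(sps, 1.0)   # 1000 MHz clock as stated in the post
double = single / 2              # assumption: half-rate double precision

print(f"single: {single:.2f} TFLOPS, double: {double:.2f} TFLOPS")
```

This is a theoretical ceiling only; it says nothing about how well a compiler could fill an 8-wide VLIW bundle.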

Grall said:
Apart from the surprisingly fluid framerate this realtime rendering completely underwhelms. The lighting is extremely flat, surfaces look very flat and matte even when close up, the water looks incredibly sluggish and unrealistic (kind of what I'd imagine a sea of transparent mercury would look like).

That's something I noticed in that first ray-traced rendering of the forest demo a few years back. All the trees, hills and plains looked like they were diffused in some liquid, and it all had an unrealistic glow. (The best part was when they claimed ray-tracing realism for those teapot captures. The same thing was done, maybe even better, in the first DX 9.0 demos. Just Intel marketing-blog talk.) If we're talking about the same thing.

MfA · Nov 20, 2009

dkanter said:
It's obvious the author is mistaken. LRB ring is designed for multiples of 16 cores.

It's such a strange mistake to make though and then to pick the number 80 ... I personally would not discount the possibility that they simply have no Larrabee to show so they dug up the Terascale chip instead.

3dilettante · Nov 20, 2009

MfA said:
It's such a strange mistake to make though and then to pick the number 80 ... I personally would not discount the possibility that they simply have no Larrabee to show so they dug up the Terascale chip instead.

Intel did show off an 80-core chip last year, it just wasn't Larrabee. This isn't the first time people have confused the two. The author presumed it to be 80 cores. I didn't see a quote from Intel saying it was.

At any rate, 80 cores makes no sense with respect to the design goals for clock rate and FP throughput.

Larrabee at 1 GHz can hit 1 TF with 32 cores.

If Larrabee had 80 cores (and going by the die shots that have been out on the net, it barely squeaks in 32), hobbling it to drop below 1 TF would indicate some very serious problems or some extreme sandbagging. SGEMM utilization would have to be something horrible for Larrabee if that core count were accurate.
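
The arithmetic behind this can be sketched as follows. The 16 single-precision lanes per core and one multiply-add per lane per cycle come from public descriptions of Larrabee's vector unit, not from this thread, so treat them as assumptions; the 1 GHz clock is the round number used above.

```python
LANES_PER_CORE = 16      # assumption: 16-wide single-precision vector unit
FLOPS_PER_LANE = 2       # assumption: one multiply-add per lane per cycle


def peak_gflops(cores: int, clock_ghz: float) -> float:
    """Theoretical single-precision peak for a given core count and clock."""
    return cores * LANES_PER_CORE * FLOPS_PER_LANE * clock_ghz


peak_32 = peak_gflops(32, 1.0)    # the "1 TF with 32 cores" case
peak_80 = peak_gflops(80, 1.0)    # what 80 such cores would offer at 1 GHz
utilization_80 = 805 / peak_80    # the article's 805 GFLOPS against that peak

print(f"32 cores: {peak_32:.0f} GFLOPS, 80 cores: {peak_80:.0f} GFLOPS")
print(f"implied SGEMM utilization with 80 cores: {utilization_80:.0%}")
```

Under these assumptions 32 cores give 1024 GFLOPS, while 80 cores would give 2560 GFLOPS, so 805 GFLOPS would mean SGEMM running at under a third of peak.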

fellix · Nov 20, 2009

keritto said:
As far as the TFLOPS hearsay goes, I hope ATi will have the money and the drive not to rest on its laurels, and that the HD5000 (R800/Cypress) series was just the thing that bought ATi enough time (and enough cash inflow to be virtually without losses) for an R900 that brings FMA to the already great VLIW engine. That's the feature that caused nVidia so much trouble, along with their decision to pump the same old G70-style chip full of steroids and new number-crunching features, which paid off in the G80 era but not now, 5 years later. I still haven't given up on my wishes (originally for R800), now for R900: the same 16 SIMD engines, just with 8 SP per engine (instead of today's 5 SP), 4 of them so-called SFUs, on the same R800-style core, with the VLIW units capable of running FMA code at 1000 MHz. That would give us 2560 SP capable of 2.56 TFLOPS in double floating point (80-bit mode) and a stunning 5.1 TFLOPS in raw 32-bit mode, with a 128 TMU / 32 RBE setup, virtually the same as in today's R800. And this could be done at the same die size as today's R800 or even smaller ... 330 mm2, on only 32nm bulk, which GF claimed a few months back is already available. So we could see an FMA/R900 refresh from ATi in almost the same time frame as the delayed Fermi, let's say June 2010, if everything goes OK for ATi. They'd then have the same kind of refresh launch nVidia had in the NV40-G70 days: less than a year later, with a highly revamped architecture.

Meddling with the ALU width in a VLIW architecture is not a very good idea, especially when there is a multi-million-dollar R&D investment in the existing compiler base. It would be like starting from scratch.
AMD should push for a symmetric 5D setup, e.g. make all five units 40-bit devices (like the current T-unit) and lift the limitations on loading data from the register file. This alone, along with the usual set of new instructions, would be enough to extract even more ILP and better overall utilization, without dipping into fancy experiments that would cost an arm and a leg.

dkanter · Nov 20, 2009

MfA said:
It's such a strange mistake to make though and then to pick the number 80 ... I personally would not discount the possibility that they simply have no Larrabee to show so they dug up the Terascale chip instead.

Seeing as I know what the demo was ... I can say for certain that the author was wrong.

It's not the first time that this particular author has erroneously stated that LRB has 80 cores.

David
