Why do you keep harping on FMA? Single-cycle-throughput FMA doesn't matter for peak FLOPS, because it goes no faster than the single-cycle-throughput MAD that GPUs have been using forever ... also, ATI has said R800 supports FMA anyway.
Did you mean that it doesn't help single-precision (32-bit) throughput? I'm doubtful about that, because FMA could improve core utilization, and I also doubt a simple VLIW instruction sequence could get beyond good/very good emulation of FMA if it isn't implemented at the hardware level. So is ATI hiding that capability in the RV870 chip until Fermi is released, or is it just marketing blurb? Or will we never see it on Radeon-class cards, only FireStream-class ones? And ATI still hasn't even announced any FireStream cards based on RV870.
This is a piece from 2000, but it gives numbers and reasons.
http://www.realworldtech.com/page.cfm?ArticleID=RWT021300000000&p=1
This is a great but horrifying history lesson in how Intel's push onto newer process nodes killed an outperforming architecture in favor of their x86 CISC. I just hope that now that the Taiwanese semiconductor manufacturers have reached Intel's node shrink levels, we won't see history repeat itself in favor of some hyped monster like Larrabee. And hopefully it was a good decision for AMD to go Asset Smart and hand off its expensive manufacturing plants to a joint venture with ATIC, so we'll get newer nodes without AMD's lack of cash burdening them.
And I just hope that NVIDIA will finally develop some genuinely new architecture after NV40 from 2004, which was finalized in the G70. It's time to do that after Fermi sees daylight in early 2010; cycles usually last 18 months, so they need to have had some NGGA in mind for a long while now. I hoped G300 would be really new as promised (before the GT200 launch, when they talked about their DX11 inventions), but it's still just transistor pumping and praying to outperform the competition on a transistor-count advantage. And I hope ATI won't sleep on the successful R600 design reiterated into R800; there are simply too many things they could still upgrade in it.
Slides from a while back indicated that Larrabee could perform 2 non-SSE DP FLOPs a cycle.
That would seem to indicate x87, though the slides are pretty old at this point.
SSE wouldn't be an option anyway, as it appears Larrabee does not support it.
So we're in fact once again being cheated by Intel, this time on the performance of what amounts to a 487SX-class math unit. We're all supposed to be glad that these "features" are, once again, easier to implement than a fully capable transcendental math coprocessor ... only 20 years after Cyrix introduced its FasMath 83D87 (and the improved EMC87). It's hilarious, because I want to stay optimistic about this fraud.
So it's definitely not so simple..
Why isn't it simple? You'd simply waste 20% less die space, resulting in cheaper production (not a reality in Intel's case, I know). And it would be better for all of us if we had 20% less leakage, and maybe as little as 40% of the original power requirements, once you ditch all that x86 ISA pre-decode.
And the best thing: Larrabee is an IN-ORDER chip, AFAIK, so all the advantages OOO chips had over RISC (which Intel obliterated with their marketing FUD) are gone there. In-order chips need recompilation of the applications that were compiled for OOO cores, which is what we've had for the last 10 years. And apps also need to be aware of all the masking going on when they're executed on a chip that carries on the illusion of x86 compliance. I'd give more credit to Fermi on that x86-compatibility front when it comes out, even though it isn't an x86 chip at all. Larrabee is a multi-core, per-core-multithreaded chip, and in this proto-Larrabee age they need to figure out what kinds of optimizations they can do to outperform higher-clocked OOO CPUs at the core level. But the real question is what kind of HPC performance Larrabee can provide when its math is based on an old 487SX-style engine.
--