Larrabee delayed to 2011?

I'm questioning the idea that the x86 side of Larrabee gives it any advantage, with regard to compilers and optimizations, over any other CPU ISA when the target is a single-issue in-order processor.

That part of the chip is what would benefit from the long line of x86 compilers and the years of research.
That part is so limited in Larrabee that any other chip on any other ISA using a similar arrangement would not need significant effort to get equivalent results in a very short time frame. There's just not enough potential to waste, and the parts that have potential are too new to benefit.

Again it depends on what you want to do with the chip. If one does decide to write an engine from the ground up designed around the programmability Larrabee brings to the table then you can't rule out x86 optimizations as being beneficial. How beneficial is certainly a valid question and I don't have the answer for that.

I guess the point is that in a worst case scenario in which you gain nothing from x86 support, you also lose nothing and at least you have the option.
 
Most optimizations are algorithmic and completely ISA agnostic. Algorithmic improvements almost always boost performance far more than ISA-specific optimizations do. Sometimes ISA-specific is the most reasonable way forward, for instance if that matrix multiply that's used everywhere can be SSE optimized. But when you're profiling and find that some for-loop somewhere is consuming a surprisingly large percentage of the CPU time because it's looping over thousands of objects every frame, you don't begin by adding SSE code, inserting prefetch calls, etc. The first step is perhaps to separate active and inactive objects into separate lists, so you only have to loop over the handful of active objects instead of checking the enable flag on all of them. ISA-specific optimizations might have halved the CPU time, but the algorithmic optimization could perhaps shave off 90% of it.
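Something like this, roughly (the Object/Scene names are just made up for illustration):

```cpp
#include <vector>

struct Object {
    bool enabled = false;
    void Update() { /* per-frame work */ }
};

// Naive version: walks every object every frame just to check a flag.
void UpdateAllNaive(const std::vector<Object*>& all) {
    for (Object* o : all)
        if (o->enabled)
            o->Update();
}

// Algorithmic version: keep the active objects in their own list, so the
// per-frame loop only touches the handful that actually need work.
struct Scene {
    std::vector<Object*> active;
    std::vector<Object*> inactive;

    void Enable(Object* o) {
        o->enabled = true;
        active.push_back(o);
        // (removing it from 'inactive' is omitted for brevity)
    }

    void Update() {
        for (Object* o : active)
            o->Update();
    }
};
```

No SSE and no prefetching anywhere; the win comes from not touching the thousands of objects that have nothing to do this frame.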

I am lumping SSE-like ISA extensions into the general term "ISA".

You are of course right about algorithmic optimization being more important than ISA-specific optimizations. Even with my limited Pascal and C programming experience a decade ago I know that much :D I didn't mean to imply that ISA-specific optimizations are any more important than any other type of optimizations, I just meant that they exist, and shouldn't be ignored. Of course, most ISA-specific optimizations are handled by the compiler.
 
At least the fast to market part is demonstrably false for Larrabee.
So you think Larrabee would have been available by now and have optimized drivers if it wasn't x86? I really doubt that, but feel free to "demonstrate" otherwise.
I forget how many compiler back ends have been optimized for a half-width P54 core with a strap-on 512-bit masked vector ISA.
The scalar unit is dual-issue. To compilers it's a P54 with 64-bit support. So Intel will have no problem generating optimized code for that. Furthermore, they already had very powerful debugging, profiling and thread analysis tools that can all be used with minimal changes. Also their engineers have extensive expertise using these tools to write optimized x86 code. By the way, Michael Abrash wrote the assembly optimizations for the Quake renderer, perfectly tuned for P54...

Using x86 instead of any other ISA was the fastest and cheapest choice for Intel. The advantages far outweigh the disadvantages.
No. This has been discussed already, and the penalty is much more significant, particularly if the core has no additional OoO hardware to hide the penalty.
Contemporaneous RISC cores to the original Pentium tended to have at least a 1/3 die size and power advantage for the performance offered.
I wasn't comparing x86 decoders directly to RISC decoders. I was talking about x86 decoders compared to the entire Larrabee die, and how much you'd save when using ARM decoders. It's not like you'll be able to fit many more cores onto the same die.
I thought your work was an example of something that already does this.
Getting it running is not the issue. Getting it fast is the real challenge. On a CPU, and on Larrabee, most GPU design parameters become software parameters. And dedicated silicon becomes an algorithm that can run anywhere on the chip. It also puts you in control of data flow and task scheduling. All this offers huge potential to ensure high utilization at all times. But this same potential also means there are hundreds of ways to perform the same task, with each choice often influencing the entire architecture.

The software will no doubt go through many iterations, each offering significant performance improvements. So it's likely they just haven't hit their initial goals yet to launch it as a competitive GPU.

A small anecdote from the development of SwiftShader 1.0: I would get an iPod if I matched the performance of Pixomatic in Unreal Tournament 2004 (I was still a student back then). At that time Pixomatic was using a 'code stitching' technique: creating new specialized functions at run-time by putting together blocks of highly optimized assembly code. But it was limited to only two textures per pass, which was all UT2004 needed. With SwiftShader we also aimed to support shaders. To achieve this I had to use SoftWire, my run-time code generator. But that meant I had less control over optimizations. Even a full year after getting UT2004 running it seemed impossible to achieve higher performance while supporting more features at the same time. But then I stopped focusing on the pixel pipeline and started looking at the bigger picture. I minimized overdraw by first rendering the entire scene into the z-buffer only, and then rendering color. I also developed a new clipping algorithm which was ten times faster than anything I had read about before. In the end SwiftShader 1.0 achieved 110% of Pixomatic's performance, at higher quality. I got two iPods.
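For what it's worth, the overdraw part boils down to something like this (hypothetical renderer interface, stubbed out so it compiles; not SwiftShader's actual code):

```cpp
#include <cstdio>

// Hypothetical renderer interface, stubbed so the sketch compiles and runs.
struct Renderer {
    void SetColorWrites(bool on)       { std::printf("color writes: %d\n", on); }
    void SetDepthCompareEqual(bool eq) { std::printf("depth EQUAL:  %d\n", eq); }
    void DrawScene()                   { std::printf("draw scene\n"); }
};

void RenderFrame(Renderer& r) {
    // Pass 1: z only. No shading and no color writes, so this pass is cheap;
    // it leaves the z-buffer holding the nearest depth at every pixel.
    r.SetColorWrites(false);
    r.SetDepthCompareEqual(false);   // ordinary LESS/LEQUAL test
    r.DrawScene();

    // Pass 2: color. With the depth compare switched to EQUAL, only the
    // surface that won the depth test reaches the expensive pixel pipeline,
    // so each pixel is shaded at most once no matter how high the overdraw.
    r.SetColorWrites(true);
    r.SetDepthCompareEqual(true);
    r.DrawScene();
}

int main() { Renderer r; RenderFrame(r); }
```

In a real renderer those two knobs just map to the usual render states (color write enable, depth compare function), but the structure is the same.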
From a commercial perspective, this is likely far from ideal and not the apparent direction the development world is taking.
No it's not. The Direct3D API is allowing more control over the hardware with every new generation. OpenCL is also a relatively thin API, putting a lot of things into the hands of the developer. Yet at the same time we see many game studios use off-the-shelf engines. This acts like middleware which saves the application programmer from the increasingly complex task of putting pixels on the screen. The engine becomes the new API. And the exciting thing is that there are many engines, each suited for certain genres of games and often also abstracting between platforms.

With Larrabee the ability to develop custom middleware is boundless, offering application developers a very wide range of possibilities. The complexity of direct access to the CPU never stopped anyone either. There are many software layers to offer abstraction, while still allowing you to create anything imaginable. This increases application diversity and creativity. So all Larrabee needs is a number of standard API implementations and some libraries with basic functionality to create your own.
So you're saying that until the tools exist to abstract away concurrent concepts, the devs will have to settle on using tools that abstract away concurrent concepts that we already have.
Yes. Just like there are many programming languages for the CPU, each with a specific purpose, we need new ways to program devices such as Larrabee, each offering different levels of abstraction and control.
As mentioned before, unless you have an interest in integrating a Larrabee core into a chip with an x86 CPU socket, the x86 is of no real import.
Any other CPU ISA could be substituted and it would change absolutely nothing, although the implementation may be ~10% smaller and possibly tens of percent cooler/faster.
Absolutely. But nobody could bring that to the market. Intel is saving itself a huge amount of time and money by reusing what it already masters. Soon we'll have Direct3D, OpenGL, and x86/LRBni. There won't be room for another API/ISA. And even though x86 has some overhead, so does supporting Direct3D and OpenGL on a GPU. There are tons of features that are not commonly used but still take die space. Only after many more generations of Direct3D might everything become programmable. At that point it's no different from a programming language or an abstract ISA. That's the slow route to what Intel is doing with Larrabee, using a native ISA. But by that time Intel could dominate the market, thanks to x86. Any other ISA is worthless when there's already a lot of middleware written for x86. And as proven by the CPU market, such an edge on the software front makes it impossible for any other ISA to gain market share, especially not with just a 10% performance advantage.

If Larrabee fails it will be because the initial performance/price can't get them enough market share to create momentum in the software world. So getting that initial renderer software right is critical. It might of course also take some hardware revisions.
 
Middleware already written? You mean all the stuff which has to run on consoles too? Or maybe HPC software? (Which is also generally portable.)

Windows and binary software is what kept x86 entrenched, not middleware.
 
I think so, but not sure. However, could you be more clear with the point you are making?

You were talking about the decoder on the A64, and how it related to LRB. The x86 core of a LRB does not have an x86 decoder. LRB also uses a lot of tricks to hide latency, so OoO is not as big a thing as you might imagine.

-Charlie
 
Middleware already written? You mean all the stuff which has to run on consoles too? Or maybe HPC software? (Which is also generally portable.)

Middleware game engines are multi-platform (not portable) by design.
I'm no expert in HPC software, but somehow I get the feeling they probably code close to the metal (i.e., definitely not portable).
 
I've discovered x86 is not always the silver bullet.

I bought a very cheap, fanless x86 netbook, built on a brand new 90nm system-on-chip made by a Taiwanese company (DM&P). It's the Northec Edubook.
It comes with a "lite" version of Ubuntu which is decent, but hardly leaves any space on the slow 8GB flash storage. Usable, but I don't feel it's "my" system. Perfect time to try Arch Linux then; it looks perfect for building your own barebones lightweight system.

But... Arch Linux only supports i686 and amd64! My CPU is an i586 (so are the VIA C3 and K6-2, I've learnt). Larrabee looks like it's an i586 too.
 
Actually at one point in history seaplanes were flying higher and further than any other commercial airline planes. The reason was the lack of good runways...
at one point in history x86 was an ok ISA (about 20 years ago). also, at the point in history when seaplanes were holding the height/distance records, strato-flyers were but a blink in the imagination of the progressive aero-engineering thought of the day. i'm sure you're already seeing the parallel forming..

Larrabee is a revolutionary new design, and the overhead of x86 is negligible compared to the advantages. Fast to market, an abundance of existing tools, workload migration, extendability, etc.
i'm sorry but i entirely disagree here. I'm with 3dilettante: what abundance of existing sw tools are you referring to? do you believe LRB devs would be mostly concerned with tracing u/v-pairing issues in their LRB routines? from where i stand, LRB designers chose a base ISA most modern compilers still manage to handle mainly thanks to the enormous effort of the whole industry to shoe-horn contemporary backends into it. heck, that even holds true for the 'modern' parts of it - gcc's autovectorizer has a custom SSE part that works outside the standard autovectorizing backend (no wonder, we're speaking of a simd ISA which still lacks data-dependent permute, some 10 years after the rest of the world discovered it) - so how's that ISA playing nicely with the compiler, and thus enjoying a progressive sw tools base?

yes, x86 in the LRB would give intel an advantage in the central socket. should i be glad about it? please, explain to me, why i, as a potential LRB customer, should care about intel's entrenchment, and not about the fact that their part uses some 5%-10% (hypothetically) more juice than it could have otherwise, if only intel were not so concerned about their little trench war?

The reason for Larrabee's delay is definitely not x86. On the contrary, any other ISA choice would take far longer to achieve competitive performance.
so performance/watt means nothing now?

Any theoretical performance advantage would be totally nullified by an initial lack of software optimization. We're only talking about a few percent of x86 decoder overhead anyway. The reason Larrabee is delayed is because it's still a revolutionary new approach to use a fully generic device for rasterization.
tell that to 3dlabs who are currently selling a potent mobile chip, using the same principles LRB was meant to demonstrate, and on a much older shrink node at that. while offering industry-standard APIs and all that jazz.

Most of the complexity shifts to the software. Forward API compatibility comes at the price of less hardware specialization though. This can be offset by the effects of unification and caching, but that's no small task. They simply need more time to perfect it.
oh, they clearly do. /sarcasm

Ideally Larrabee should be programmed directly by the application developer. The potential is huge (as proven by FQuake). The problem is it will take many years to go that route.
no. ideally LRB should have shipped with the industry-standard APIs, and only optionally allowed the user to get down and dirty, on those rare occasions the user felt like it. anything else would be impractical. unless, of course, LRB would've been so uncompetitive on the standard API front that people would've seen no reason to use it. hmm, i wonder..

Application developers are still struggling to tame a quad-core CPU, let alone manage dozens of cores running hundreds of threads with explicit SIMD operations. We still need a lot of progress in development tools (such as explicitly concurrent programming languages - inspired by hardware description languages). Till the day this becomes as obvious and advantageous as object-oriented programming, application developers expect APIs to handle the hardest tasks. This however presents a double overhead: once because the ideal algorithm has to be modified to fit the API, and once because the hardware doesn't exactly match the API. Programming the hardware directly eliminates both overheads and puts performance above that of classical GPUs.
it seems to me you neglect the fact that sw development is a game of finite resources - developers don't have the resources to create machine-optimal code. at least, not most of the time. you can't blame developers for not sitting down and optimizing every piece of their code to be clock-by-clock optimal. it's just not viable.

Of course GPUs are also evolving toward greater programmability, and APIs are getting thinner to allow more direct access to the hardware. But Intel is attempting to skip ahead. Even though we won't see a Larrabee GPU in 2010, x86 is enabling Intel to get to the convergence point much faster than anyone else. And since x86 is yet to be dethroned in the CPU market, there's no reason to assume that the overhead will play any significant role when the competition presents its own fully programmable hardware.
friendly jab: you really should pay more attention to the mobile/embedded market. you know, that multi-billion market where most of the EE progress is occurring these days and where intel are desperately trying to set their foot, rather unsuccessfully so far ; )
 
friendly jab: you really should pay more attention to the mobile/embedded market. you know, that multi-billion market where most of the EE progress is occurring these days and where intel are desperately trying to set their foot, rather unsuccessfully so far ; )

Even if they will in the future, chances are high they'll use 3rd party IP for that. I am obviously missing the relevance to what he said in his paragraph, but that's not important either.

I'd rather point in another direction here:

"I do think we will see plans from them next fall, but not a product until sometime later. I would also expect that it might have some significant changes as a result of the lessons learned," McGregor wrote in an e-mail, without speculating what kinds of changes might be made.


http://news.yahoo.com/s/pcworld/200...ays?utm_source=twitterfeed&utm_medium=twitter

Now would be a good time to speculate about which significant changes could be made and what valuable lessons they might have learned. IF they should dump x86 in future LRB variants, I'm sure I'll be quite busy passing hats out with 'bon appétit' cards on the side.

Of course there's always the chance I'll start reading that there were NO wrong design decisions at all for LRB and everything is as it should be... and along those long-winded explanations I hope no one minds if I take a nap ;)
 
I'm no expert in HPC software, but somehow I get the feeling they probably code close to the metal (i.e., definitely not portable).

I can tell you HPC is now C or Fortran with MPI. Most code bases are very portable. You may have to modify the blocking strategy (for the memory hierarchy) as well as the communication strategy (given the specific topology/BW), but that has nothing to do with the ISA.
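To make that concrete, the skeleton of such a code tends to look something like this (MPI's C API, with made-up names and sizes); the only machine-specific knob here is the block size:

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Toy example: each rank sums its slice of a big array in cache-sized blocks,
// then the partial sums are combined with a reduction. Nothing here depends
// on the ISA; porting to a new machine means retuning BLOCK (memory
// hierarchy) and, in real codes, the communication pattern (topology/BW).
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const std::size_t N = 1u << 20;   // elements per rank
    const std::size_t BLOCK = 4096;   // tuned for the cache hierarchy
    std::vector<double> data(N, 1.0);

    double local = 0.0;
    for (std::size_t b = 0; b < N; b += BLOCK)              // blocked traversal
        for (std::size_t i = b; i < b + BLOCK && i < N; ++i)
            local += data[i];

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("sum = %.0f (%d ranks)\n", global, size);

    MPI_Finalize();
    return 0;
}
```

Swap the reduction for halo exchanges and the sum for a stencil and you have the shape of a lot of production codes; none of it cares what ISA sits underneath.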
 
Middleware game engines are multi-platform (not portable) by design.
I'm no expert in HPC software, but somehow I get the feeling they probably code close to the metal (i.e., definitely not portable).

Nah, very few scientists have the knowledge to go down the assembly route.
 
Depends on what the software dev wants to do with their app. Isn't that the whole point of Larrabee? Programmability above all else (with decent performance). I agree that the easiest way to get any sort of usefulness out of Larrabee would be to target its vector extensions rather than to attempt to write a 3d engine in x86 from the ground-up, but the option is there.

Programmability != x86.

Backward compatibility = x86.

Fermi is almost as programmable as Lrb, and it does not have x86.
 
Compatibility of legacy software aside, there's no need for heterogeneous processors on the same die/socket to use a different ISA. Most of the people I've spoken to consider a more unified ISA to be the end goal here, whether it be x86 or something else entirely, and consider the current state of having to target a pile of different ISAs far from ideal.

Game programmers have been juggling god-knows-how-many ISAs for about 6 years now, without so much as noticing, thanks to the magic of JIT. :smile:

Even with JIT (which is great of course) it's still a problem and definitely non-ideal. Sure you can make do on Cell-like models and such, but there's no question that it's harder and less flexible.

To be fair, Cell has never had the JIT compilers to ease the pain. OCL for cell is still in beta. If your concerns about multiple ISA's stem from a Cell like situation then I agree. A cell-like multiple compilers model is undesirable.

Having a standardized ISA obviously also doesn't preclude the existence of virtual ISAs, and indeed they still provide utility. That said, the existence of these virtual ISAs does not invalidate all of the benefits of having a standard hardware ISA.

I think your hostility towards this concept is a little unwarranted. While there obviously is a continuum of cost vs. benefit, you're dismissing all benefits out of hand and assuming a huge cost. Without knowing the solid numbers here (or maybe you do have inside info?) I think it's a bit naive for you to assume that the people involved haven't run the real numbers themselves. Of course you're welcome to your own opinions, but I'd ask for some moderation unless you have real numbers to back up your claims.

Fine, show me a real world example where a unified ISA scores over the JIT model. My opinion on this matter is definitely shaped by intuition and individual taste. But if you can show me a single good point in favor of unified ISA over JIT, I'd be happy to have learnt more. I think this way because I can't think of any points myself.

And binary compatibility is not identical to unified ISA. I am in favor of binary compatibility for CPUs. I don't see the need for gpu's to suffer the brain-damages of x86 to integrate tightly with cpu's on the programming model level. I don't see the need to run Word 97 on my gpu's either.

For my part as a software developer, it would be great if I had some throughput cores that were directly accessible the same as any other core in the OS. It would be awesome if they shared a memory space and consumed the same binaries as the bigger cores since that would allow my schedulers to do on-the-fly load balancing and keep the system fed all the time. Obviously there's a hardware cost to this and the cost vs. benefit is the ultimate goal, but I really don't think you can make the argument that there's no benefit when there clearly is.
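A rough sketch of what "same binaries, shared memory" buys the scheduler (everything here is hypothetical/illustrative): one task format, one queue, and any core, fat or thin, can pull from it.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One shared queue of ordinary tasks. Because every core would run the same
// binary and see the same memory, any core can pop any task, so load
// balancing happens on the fly instead of being partitioned up front per
// core type.
struct TaskPool {
    std::deque<std::function<void()>> tasks;
    std::mutex m;

    bool TryRunOne() {
        std::function<void()> t;
        {
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            t = std::move(tasks.front());
            tasks.pop_front();
        }
        t();
        return true;
    }
};

int main() {
    TaskPool pool;
    for (int i = 0; i < 1000; ++i)
        pool.tasks.push_back([i] { volatile int x = i * i; (void)x; });

    // In the scenario above these workers would be pinned to big cores and
    // throughput cores alike. With split ISAs you'd instead need two task
    // formats and an up-front guess about which core type gets which work.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&pool] { while (pool.TryRunOne()) {} });
    for (auto& w : workers) w.join();
}
```

The interesting part is what's absent: no second compiler, no marshalling of data between address spaces, no static split of the workload.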

Unifying cpu and gpu mem pools will help. Big time. No 2 ways about it. But I am afraid I don't see the necessity of unifying ISA's to integrate gpu's tightly in the os.
 
So you think Larrabee would have been available by now and have optimized drivers if it wasn't x86? I really doubt that, but feel free to "demonstrate" otherwise.

Fermi will debut before lrb with relatively well optimized drivers, and it is almost as programmable as lrb itself.

I wasn't comparing x86 decoders directly to RISC decoders. I was talking about x86 decoders compared to the entire Larrabee die, and how much you'd save when using ARM decoders. It's not like you'll be able to fit many more cores onto the same die.
That depends on the amount of x86 overhead.
No it's not. The Direct3D API is allowing more control over the hardware with every new generation. OpenCL is also a relatively thin API, putting a lot of things into the hands of the developer. Yet at the same time we see many game studios use off-the-shelf engines. This acts like middleware which saves the application programmer from the increasingly complex task of putting pixels on the screen. The engine becomes the new API. And the exciting thing is that there are many engines, each suited for certain genres of games and often also abstracting between platforms.

With Larrabee the ability to develop custom middleware is boundless, offering application developers a very wide range of possibilities. The complexity of direct access to the CPU never stopped anyone either. There are many software layers to offer abstraction, while still allowing you to create anything imaginable. This increases application diversity and creativity. So all Larrabee needs is a number of standard API implementations and some libraries with basic functionality to create your own.

Nobody is doubting the potential of lrb to open up opportunities. Some people here, myself included, are not sure how the x86 helps and are speculating that it brings down the perf/area.
 
Game programmers have been juggling god-knows-how-many ISAs for about 6 years now, without so much as noticing, thanks to the magic of JIT. :smile:
Oh certainly some are hidden behind APIs, but the engine developers in particular do complain about it and wish they didn't have to target so many - that was my point :) This is particularly true as they see the list increasing on the horizon as things converge...

Fine, show me a real world example where a unified ISA scores over the JIT model.
The two are not mutually exclusive. Heterogeneous cluster computing benefits greatly from unified binary compatibility across nodes, which is not dissimilar to what the future may look like on a single chip. JIT is obviously still useful even in these cases for its ability to aggressively optimize data-dependent execution paths though, so it's definitely not going away or anything.

And binary compatibility is not identical to unified ISA. I am in favor of binary compatibility for CPUs. I don't see the need for gpu's to suffer the brain-damages of x86 to integrate tightly with cpu's on the programming model level. I don't see the need to run Word 97 on my gpu's either.
Sure, but this conversation is less about the ability to run Word on your GPU than the ability for a global OS scheduler to own heterogeneous throughput cores down the road, and have the ability to preemptively multitask and load balance across these different cores.

But I am afraid I don't see the necessity of unifying ISA's to integrate gpu's tightly in the os.
As above, it gets hard to do preemption and thread migration without a unified ISA for one, which is desirable for efficient scheduling. Having multiple separate "types" of compute cores that need to be specially handled at the application level will produce less efficient schedules and load imbalances.

Obviously a lot of this is still unclear until we actually start to see these sorts of beasts, but I don't think you can dismiss it out of hand this early. I think it's clear that there's some value there, and it just remains to be seen how much.

Fermi will debut before lrb with relatively well optimized drivers, and it is almost as programmable as lrb itself.
Could you point me to a link for the "almost as programmable as LRB itself"? I've seen the NVIDIA press and such, but applying the filter that was necessary to map their PR for G80/CUDA down to reality ("full C support guys!") I'm still predicting some significant gaps as far as general-purpose programmability goes. I'd love to be happily surprised though :)
 
Could you point me to a link for the "almost as programmable as LRB itself"? I've seen the NVIDIA press and such, but applying the filter that was necessary to map their PR for G80/CUDA down to reality ("full C support guys!") I'm still predicting some significant gaps as far as general-purpose programmability goes. I'd love to be happily surprised though :)

To be fair you'd have to apply the PR filter on all sides. Although it isn't related to the programmability topic you two are debating here, there was probably a serious miscalculation by rpg.314 a couple of posts ago concerning LRB's DP precision abilities. Everyone read "half the rate" in Intel's documents; it's just that I'm afraid that half could lead to an unpleasant real-time throughput surprise. And yes, that's marketing/PR in essence too.
 
Again it depends on what you want to do with the chip. If one does decide to write an engine from the ground up designed around the programmability Larrabee brings to the table then you can't rule out x86 optimizations as being beneficial. How beneficial is certainly a valid question and I don't have the answer for that.

I guess the point is that in a worst case scenario in which you gain nothing from x86 support, you also lose nothing and at least you have the option.
But what would those be for a very basic in-order chip?
It wouldn't take very long to exhaust them, and it's no advantage because these optimizations have been well explored for every other chip of this type regardless of ISA.

So you think Larrabee would have been available by now and have optimized drivers if it wasn't x86? I really doubt that, but feel free to "demonstrate" otherwise.
New graphics product lines in various market segments have been conceived, deployed, and replaced in the time frame it's taken from initial design to this delay, and a generation or two more will have elapsed by the time Larrabee III comes out.

All I need to demonstrate is that the rest of the graphics world has not taken its sweet time.


The scalar unit is dual-issue. To compilers it's a P54 with 64-bit support.
Intel slides and Larrabee dev statements have been consistently touting a scalar x86+VPU issue, with the possibility of performing a vector store using the scalar issue slot.
This question has been brought up before, and the answers so far have been pointing to a more restricted scalar pipe.
I've asked this question repeatedly, and while I admit I have no definitive way of determining if I've asked the right people, the answers so far have been consistent.
I'm quite willing to revise my estimate if an official source confirms dual-issue for the scalar side.
Hiroshige Goto's diagrams are not official Intel sources, as far as I know.

I wasn't comparing x86 decoders directly to RISC decoders. I was talking about x86 decoders compared to the entire Larrabee die, and how much you'd save when using ARM decoders. It's not like you'll be able to fit many more cores onto the same die.
The penalty for x86 applies to the entire x86 pipeline. There are other wrinkles to the design that bloat the rest of the chip.
As a result, contemporaneous RISC chips were in total about a third smaller than the original Pentium. There may conceivably be more obscure RISCs that could have done the same in even less space.

Given that 2/3 of each core is not the vector unit, even if the vector pipe was in no way oversized because of some additional x86 complexity, close to half of Larrabee in total (assuming it devotes about 2/3 of its area to cores) is possibly a third too large.

I have to add a lot of "possibly" and "maybe" caveats because we don't have a physical exemplar to analyze and compare, which is why I am very disappointed about the latest Larrabee announcements.

No it's not. The Direct3D API is allowing more control over the hardware with every new generation. OpenCL is also a relatively thin API, putting a lot of things into the hands of the developer. Yet at the same time we see many game studios use off-the-shelf engines. This acts like middleware which saves the application programmer from the increasingly complex task of putting pixels on the screen.
This permits greater algorithmic control. So long as the abstraction layer exists, potentially behind a compiler layer, x86 itself doesn't affect the developer.

With Larrabee the ability to develop custom middleware is boundless, offering application developers a very wide range of possibilities.
The market reality is that there is not a boundless need for custom middleware. Vast segments of the market get along fine and do not care about the lack of custom middleware, or content themselves with a few significant abstraction layers.

The complexity of direct access to the CPU never stopped anyone either.
That's because most don't care to go that far. The extra effort and the risk of tying the product to a single physical implementation yields too little reward.

There won't be room for another API/ISA. And even though x86 has some overhead, so does supporting Direct3D and OpenGL on a GPU. There are tons of features that are not commonly used but still take die space.
Larrabee's target performance put it on par with GPUs between 2/3 to 1/2 its size, and it failed to even hit that mark.
I think we know where the winning side of that argument lies at present.

If Larrabee fails it will be because the initial performance/price can't get them enough market share to create momentum in the software world. So getting that initial renderer software right is critical. It might of course also take some hardware revisions.
So the design decision that bloats a massive chip by 10+ percent and possibly caps its power-constrained performance by tens of percent right from the outset has no bearing on this?

I do believe x86 alone is not the primary contributor, though it probably contributed to the decision to have a 1:1 relationship between the number of fully featured cores and hardware cores. That decision probably bloated things by tens of percent over whatever x86 already contributed.
 
New graphics product lines in various market segments have been conceived, deployed, and replaced in the time frame it's taken from initial design to this delay, and a generation or two more will have elapsed by the time Larrabee III comes out.

All I need to demonstrate is that the rest of the graphics world has not taken its sweet time.

Yup. Larrabee 1 competed with 4890 and assuming it launches next year, it will have to go against 6870. ;) If it is delayed to 2011, then Intel has even bigger mountains to climb. If anyone can climb them, intel can, but that doesn't reduce the challenge before them in any way.

Let's consider the sgemm demo that was shown recently. Assuming it was using all 32 cores at full blast and the standard 2GHz clock, it huffed and puffed to ~50% of peak performance at sgemm. Sgemm is about the nicest workload you can find to optimize; considering the calibre and track record of the people at Intel, it should be running at >90% of its peak performance without breaking a sweat.
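Back-of-envelope, assuming one 16-wide single-precision multiply-add per core per clock (2 flops per lane per clock):

\[
32\ \text{cores} \times 16\ \text{lanes} \times 2\ \frac{\text{flops}}{\text{lane}\cdot\text{clock}} \times 2\ \text{GHz} = 2048\ \text{GFLOPS peak}.
\]

If the demo's roughly-1-TFLOPS headline number was indeed reached at those stock settings, that's where the ~50% figure comes from.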

Now, tbh, considering all the smoke and mirrors that have been going on wrt lrb since that siggraph paper, there is a good chance that Intel wasn't running this at full blast. But you cannot dismiss the possibility that atm the hw/sw combo is running at 50% of its peak. :rolleyes:
Intel slides and Larrabee dev statements have been consistently touting a scalar x86+VPU issue, with the possibility of performing a vector store using the scalar issue slot.
This question has been brought up before, and the answers so far have been pointing to a more restricted scalar pipe.
I've asked this question repeatedly, and while I admit I have no definitive way of determining if I've asked the right people, the answers so far have been consistent.
I'm quite willing to revise my estimate if an official source confirms dual-issue for the scalar side.

As per my understanding of lrb's software threading, it needs the ability to dual-issue a vector load-store with a vector alu instruction to achieve zero-overhead software-thread switching.

The penalty for x86 applies to the entire x86 pipeline. There are other wrinkles to the design that bloat the rest of the chip.
As a result, contemporaneous RISC chips were in total about a third smaller than the original Pentium. There may conceivably be more obscure RISCs that could have done the same in even less space.

Given that 2/3 of each core is not the vector unit, even if the vector pipe was in no way oversized because of some additional x86 complexity, close to half of Larrabee in total (assuming it devotes about 2/3 of its area to cores) is possibly a third too large.


Larrabee's target performance put it on par with GPUs between 2/3 to 1/2 its size, and it failed to even hit that mark.
I think we know where the winning side of that argument lies at present.

So the design decision that bloats a massive chip by 10+ percent and possibly caps its power-constrained performance by tens of percent right from the outset has no bearing on this?

I do believe x86 alone is not the primary contributor, though it probably contributed to the decision to have a 1:1 relationship between the number of fully featured cores and hardware cores. That decision probably bloated things by tens of percent over whatever x86 already contributed.

Me neither, and since x86 contributes to overhead without any visible (at least immediately visible) benefits, and it is competing in a space where perf/area/W is king, it is definitely not worth the trouble.
 
Let's consider the sgemm demo that was shown recently. Assuming it was using all 32 cores at full blast and the standard 2GHz clock, it huffed and puffed to ~50% of peak performance at sgemm. Sgemm is about the nicest workload you can find to optimize; considering the calibre and track record of the people at Intel, it should be running at >90% of its peak performance without breaking a sweat.
I'm not sure how many such demos there were.
The stories I saw said there was a demo with significant overclocking, and I couldn't tell whether 16 or 32 cores were on, whether those runs were at OC speeds or not, and there were no power figures.

It didn't help that the stories I ran across had writers that didn't know what they were talking about.
Given that RV770 is nearly in that realm for a dense matrix multiply, as shown in this forum, it's not a good mm2 comparison.

If it were OCed to get there, and if it were 32 cores, and if those would have been the release specs (all big assumptions), Larrabee would have been in massive trouble, but I don't know if it was that bad.

Many of my questions would be answered if Intel actually released something, so we're back on the lack of an implementation to analyze.

As per my understanding of lrb's software threading, it needs the ability to dual-issue a vector load-store with a vector alu instruction to achieve zero-overhead software-thread switching.
I wasn't saying that the store capability on the scalar pipe was uncertain, as that was stated outright in Intel slides, but rather that the software could opt to use it or not.
 
Oh certainly some are hidden behind APIs, but the engine developers in particular do complain about it and wish they didn't have to target so many - that was my point :) This is particularly true as they see the list increasing on the horizon as things converge...

IIUC, you are saying that game engine people complain about writing shaders that are performance-portable across IHVs. Did I understand your point correctly? I was referring to GPU ISAs only in that comment of mine.

Sure, but this conversation is less about the ability to run Word on your GPU than the ability for a global OS scheduler to own heterogeneous throughput cores down the road, and have the ability to preemptively multitask and load balance across these different cores.

As above, it gets hard to do preemption and thread migration without a unified ISA for one, which is desirable for efficient scheduling. Having multiple separate "types" of compute cores that need to be specially handled at the application level will produce less efficient schedules and load imbalances.

AFAICS, you are looking at a future where there will be say 6 sandy bridge cores and 32 lrb cores on a single die, and your os will be able to kick threads around freely. If that is what you are looking at, then I am afraid that is not going to happen for a very long time.

nehalem=x86+x86-64+mmx+sse3+ssse3+sse4.1+sse4.2+avx+x87 fpu

lrb=x86+x86-64(?)+LRBni {the question mark is there because atm, it is not clear whether lrb will run SSE and SSE2, even though it is supposed to have 64 bit extensions. x87 fpu's status is also unclear, though my hunch is that it will be there}

Now you see why a thread cannot be kicked from a sandy bridge core to a lrb core (and vice-versa) without risking SIGILL.
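A concrete way to picture the SIGILL risk (sketch only; __builtin_cpu_supports is a GCC builtin, and which of these features lrb would actually report is exactly the open question above):

```cpp
// Plain fallback path.
void process(const float* src, float* dst, int n) {
    for (int i = 0; i < n; ++i) dst[i] = src[i] * 2.0f;
}

// Stand-in for a routine built with -mssse3 and SSE intrinsics; stubbed here
// because the point is the dispatch, not the kernel itself.
void process_ssse3(const float* src, float* dst, int n) {
    process(src, dst, n);
}

// A thread checks the core's features once and then commits to a code path.
// If the OS later migrated that thread to a core with a smaller feature set
// (say, a core without SSSE3), the already-chosen SSE path would hit an
// unimplemented opcode and the process would die with SIGILL.
void run(const float* src, float* dst, int n) {
    static const bool has_ssse3 = __builtin_cpu_supports("ssse3");
    if (has_ssse3)
        process_ssse3(src, dst, n);  // only safe while we stay on such a core
    else
        process(src, dst, n);
}

int main() {
    float in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    run(in, out, 8);
}
```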

Now do you see why all the talk of lrb sitting in a cpu socket has vanished, to be replaced by "lrb is a compute accelerator for xeon and core processors"?

Even if it goes on die, asking lrb to become binary compatible with an intel cpu is too much.

Sure intel could add all these to lrb cores and lrbni (and texture sampling instructions :D) to sandy bridge, but such a chip would make even Itanic blush when compared to a contemporary amd (or via + umm..., maxwell ;)) fusion chip.

Obviously a lot of this is still unclear until we actually start to see these sorts of beasts, but I don't think you can dismiss it out of hand this early. I think it's clear that there's some value there, and it just remains to be seen how much.

Fair enough, let's wait for time to settle the issue.

Could you point me to a link for the "almost as programmable as LRB itself"? I've seen the NVIDIA press and such, but applying the filter that was necessary to map their PR for G80/CUDA down to reality ("full C support guys!") I'm still predicting some significant gaps as far as general-purpose programmability goes. I'd love to be happily surprised though :)

AFAIK,

cuda has 2 restrictions on pure C,

1) no function pointers
2) no recursion

Both have been definitely relaxed in fermi and you can do c++ exception handling inside kernels. I do not know if you can call new and delete from kernels just yet, but I think you can. Of course, malloc() from within a kernel may be a PR balloon. :???:
 