Intel Larrabee set for release in 2010

x86 compatibility is the killer feature for Larrabee that makes it truly interesting from both a graphics and HPC standpoint.
Graphics application programmers using it for something mass market will never program the chip using x86 directly, so I think that's moot. The graphics programming landscape shifted away from raw, low-level coding to high-level languages some time ago, and I can't see it going back.

As for HPC, that space has always had a multitude of architectures and ISAs to consider depending on what's going on, so x86 isn't a killer feature there either. NVIDIA offer a GPU programming model that's not unlike what HPC has to deal with already, using extended C no less (which is a killer feature in my opinion), and it's only going to improve in terms of ease and hardware features to help over time.

So as a parallel programmable architecture, I don't think being x86 is a big advantage for HPC either.
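
For reference, here's a rough sketch of what that "extended C" looks like in practice (the kernel, names and launch parameters below are purely illustrative, not taken from any shipping code):

```cuda
// Minimal sketch of CUDA's "extended C": a SAXPY kernel plus its launch.
// The only additions to plain C are the __global__ qualifier, the built-in
// thread/block indices, and the <<<...>>> launch syntax.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void run_saxpy(int n, float a, float *d_x, float *d_y) {
    // d_x and d_y are assumed to be device pointers allocated with cudaMalloc.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();  // wait for the kernel to finish
}
```

If you can already write C, the extensions are a fairly small step, which is why I rate it as a killer feature.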
 
NVIDIA's GPUs are, after all, increasing in programmability. Does anybody (Hannibal?) expect that trend to stop?

Oh yeah, I'm sure they'll upgrade to a real ISA, and they'll keep pushing CUDA and other tools... but that won't stop them from being a fabless semi company with a niche, boutique ISA that's in direct competition with a commodity x86 part from Intel.
 
Oh yeah, I'm sure they'll upgrade to a real ISA
What's a real ISA?

Apart from that I partially agree with the rest. NVIDIA will probably need to address CUDA's shortcomings (as it's not very easy and straightforward to code for) in the next couple of years if they want to be competitive with Intel, programming-model-wise.
Can NVIDIA do that? No doubt about it, they certainly will.
 
Graphics application programmers using it for something mass market will never program the chip using x86 directly, so I think that's moot. The graphics programming landscape shifted away from raw, low-level coding to high-level languages some time ago, and I can't see it going back.

My point is not that people will program these chips in assembler. My point is that x86 compatibility brings with it a ton of tools and binaries that can run on a chip with zero tweaking and recompilation. It just works. That's why x86 continues to be compelling in whatever new form factor or usage scenario that Intel can shoehorn it into, from UMPC to HPC.

As for HPC, that space has always had a multitude of architectures and ISAs to consider depending on what's going on, so x86 isn't a killer feature there either. NVIDIA offer a GPU programming model that's not unlike what HPC has to deal with already, using extended C no less (which is a killer feature in my opinion), and it's only going to improve in terms of ease and hardware features to help over time.

So as a parallel programmable architecture, I don't think being x86 is a big advantage for HPC either.
Are you honestly suggesting that x86 isn't making major headway in the HPC market?

Also, for what it's worth, Intel showcased a set of C extensions (Ct) for data parallel (i.e. Larrabee) programming... and of course, the tools being x86 and all, it was pretty painless for them to get it all up and going... the same is true for the OpenMP stuff they were demoing across the room... again, if you're just extending an x86 app or tool then you have to do so much less work.
 
Also, I hear that Tim Sweeney and Michael Abrash are NDA'd on Larrabee and that it's rocking their world.

Sweeney seems like he has been at least somewhat on point with regard to future hardware trends and programmability and CPU and GPU convergence.

As for Sweeney, he said dedicated video hardware would have been obsoleted years ago, circa Voodoo 2 timeframe. He also said NV30 rocked his world and would crush the competition. We know what happened there. He's been more off than on when it comes to future hardware trends.
 
My point is not that people will program these chips in assembler. My point is that x86 compatibility brings with it a ton of tools and binaries that can run on a chip with zero tweaking and recompilation. It just works. That's why x86 continues to be compelling in whatever new form factor or usage scenario that Intel can shoehorn it into, from UMPC to HPC.
Point taken, but I was coming at it from a 3D graphics standpoint, where Intel will have to provide the expected programming infrastructure, mostly on Windows.

Are you honestly suggesting that x86 isn't making major headway in the HPC market?
Of course not. What I'm suggesting is that x86 is far from being the major player in HPC, and I can't see why yet another player with C programming for their hardware (on x86 hosts let's not forget) can't make headway into HPC too.

Also, for what it's worth, Intel showcased a set of C extensions (Ct) for data parallel (i.e. Larrabee) programming... and of course, the tools being x86 and all, it was pretty painless for them to get it all up and going... the same is true for the OpenMP stuff they were demoing across the room... again, if you're just extending an x86 app or tool then you have to do so much less work.
Good, I'm happy to hear that. But then I also don't think it's that big of a deal when your programming model supports C and requires you run on x86 anyway, which is the case for NVIDIA and CUDA. They're in the same boat there, from a programmer's perspective, as Intel and Larrabee (IMO).

I'm thinking about it as someone who might want to develop and deploy a HPC app targeted at a massively parallel FP machine. I don't think the difference there is in Larrabee's favour in any way, but I'll wait and see how Intel expect folks to program the thing in a parallel fashion (for non-graphics apps) before I truly make my mind up.

Edit: Beware that I'm coming at it from the perspective of Larrabee being parked on an add-in board (if that wasn't obvious). Obviously if it's the main processor, the dynamic changes a bit.
 
My point here is that Talmudically parsing second-hand summaries from people like Tom Krazit to get time-frame information (or any kind of real information, for that matter) is a complete waste of time, and gets you nowhere.
I agree with that, although then we're just back in the domain of uncertainty.
My point is that x86 compatibility brings with it a ton of tools and binaries that can run on a chip with zero tweaking and recompilation. It just works.
If Larrabee is optimal or near-optimal at FP-heavy x86 applications without recompiling or modifying a single line of code (and without code morphing), then I will eat my hat, and I can assure you that many other posters on this board would also be willing to do so.

Being near-optimal without recompilation would indeed be a killer feature. But if you need to recompile (which I'm sure you will), x86 compatibility has essentially become as much of a gimmick as it was on the original Itanium. In fact, I'd be tempted to point out a few other parallels there, but no matter.
Also, you can think what you will about Sweeney and Abrash. I only mention their names because I heard about what they think second-hand. As for people who I've heard from first-hand who're NDA'd on Larrabee and are impressed, I won't get into that... but I do know three of them, for what it's worth.
Well, that's nice. I'd be more impressed if they were also under NDA for NVIDIA and AMD GPUs in that timeframe (whatever that may be), however, and could compare the two. Because as it is, that's roughly similar to being impressed by NDA'd NV30 specs when ATI had just released the original Radeon.
They can give their parts a real ISA and have a go at it.
Real ISAs are always (and will always be) inferior to schemes such as NVIDIA's PTX when the one-time compile/optimization is not a problem, which it obviously is not for data-parallel HPC applications. If you think otherwise, I'd gladly counter any argument you might have there! :)
 
NVIDIA offer a GPU programming model that's not unlike what HPC has to deal with already, using extended C no less (which is a killer feature in my opinion), and it's only going to improve in terms of ease and hardware features to help over time.

And NVIDIA have a 3+ year head-start in getting people using their C extensions.

Are you honestly suggesting that x86 isn't making major headway in the HPC market?

The HPC crowd don't need binary compatibility, we just recompile our codes. Yes x86 is making headway but the x86-ness isn't the thing that makes it attractive, it's the price-performance ratio.

Also, for what it's worth, Intel showcased a set of C extensions (Ct) for data parallel (i.e. Larrabee) programming... and of course, the tools being x86 and all, it was pretty painless for them to get it all up and going...

This is very true.

the same is true for the OpenMP stuff they were demoing across the room... again, if you're just extending and x86 app or tool then you have to do so much less work.

OpenMP is a mixed bag in my experience. My personal feeling is that one of its major failings is that it's too easy to use (paradoxical as that may seem). It has some serious shortcomings in certain areas (e.g. memory placement). It's a great way to scale from 1 to 10 threads, but it's not so hot going from 10 to 100. IMO it hides too much from the programmer for it to be viable for extreme scalability on a NUMA architecture.
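
To make the "too easy" point concrete, here's a minimal sketch (plain C with OpenMP, the loop is purely illustrative): one pragma and the loop is parallel, but nothing in the code says anything about where the data actually lives, which is exactly the memory-placement detail that bites you at high thread counts on NUMA boxes.

```c
// One pragma turns the serial loop into a parallel one -- that's the appeal
// of OpenMP, and also why it hides NUMA placement from the programmer.
#include <omp.h>

void scale(double *data, long n, double factor) {
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        data[i] *= factor;   // which node's memory holds data[i]? OpenMP won't say
}
```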
 
Oh yeah, I'm sure they'll upgrade to a real ISA, and they'll keep pushing CUDA and other tools... but that won't stop them from being a fabless semi company with a niche, boutique ISA that's in direct competition with a commodity x86 part from Intel.

"niche fabless" has nothing to do with market acceptance of products.

Also, perhaps you have noticed that NVIDIA has sold half a billion chips in the past 10 years and is now selling at a rate of over 100 million chips a year and growing pretty rapidly. So I really don't think "niche" is applicable.

If you are implying all fabless semiconductor companies are "niche" I am not sure you are correct there either. Just maybe the economics that TSMC brings to the table are far superior to Intel's, in spite of Intel's process lead. Indeed, that is what allows NVIDIA to build comparatively larger chips than Intel and earn almost equivalent gross margins at somewhere between 1/5 and 1/10 the price of an Intel chip on average (as you are surely aware, memory and other components make up a substantial portion of a graphics board cost). Those are very powerful economics to deal with. So Intel has some advantages (at least for now) by operating fabs, but NVIDIA has a big one by not operating a fab.
 
Voltron,

I didn't say or imply that fabless == "niche." I said that their /ISA/ is niche.

The fabless part becomes a problem when you're trying to design a really complex, high-performance part and make it fit someone else's process. This is where Intel as an IDM has an advantage: they design processors for their specific process technology. Their fab engineers and architects are under one corporate roof, so to speak, and they share a lot of specific knowledge that lets the architects design things that Intel can produce in volume at good yields. But this foundry vs. IDM tangent is off-topic...
 
True - I think I read that with a bit of dyslexia.

But Hannibal, you did bring up NVIDIA being fabless as a point. And while in-house manufacturing clearly has its advantages, it would be naive to think that NVIDIA and TSMC aren't working extremely hard to close that gap. Meanwhile, in spite of those advantages, the proof of the power of NVIDIA and TSMC's cost advantage is in these companies' margins, financial statements, and ASPs. So people can talk all they want about Intel this and Intel that, but economically the evidence is there.
 
If Larrabee is optimal or near-optimal at FP-heavy x86 applications without recompiling or modifying a single line of code (and without code morphing), then I will eat my hat, and I can assure you that many other posters on this board would also be willing to do so.

Of course it won't be optimal without a recompile. Indeed, it has been known for a while that Larrabee programming is x86 + some GPU-specific extensions. So certainly nothing will run optimally on Larrabee without a recompile any more than code written for SSE2 hardware will run "optimally" on SSE4 hardware without a recompile. But that's not the point.
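
To put the analogy in concrete terms, here's an illustrative sketch (not from any real codebase) of a loop written against SSE2 intrinsics: the resulting binary runs unchanged on SSE4 hardware, but it will never issue the newer instructions until somebody recompiles it.

```c
// Written against SSE2 intrinsics: this binary runs as-is on any later x86
// part (SSE4, AVX, ...), but it only ever issues SSE2 instructions.
// Getting the newer instructions requires new intrinsics or a rebuild with
// a compiler targeting the newer ISA level.
#include <emmintrin.h>  // SSE2

void add_doubles(double *dst, const double *a, const double *b, int n) {
    int i = 0;
    for (; i + 2 <= n; i += 2) {                   // two doubles per 128-bit register
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
    for (; i < n; ++i)                             // scalar tail
        dst[i] = a[i] + b[i];
}
```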

The point is that you don't /have/ to do a recompile to get the world's largest installed base of software and tools up and running on it. You only do a recompile when you need a specific piece of new functionality or a performance boost.

As I said in my response to Rys, it's all about the x86 tool chain, and the relative painlessness of extending it to support new ISA features, and the fact that the same hardware will run both legacy code (albeit sub-optimally) and code that has been recompiled with the extended tools optimally.

You guys are looking at this from a lone developer's point of view, but I'm talking about the wider x86 ecosystem picture.

Ultimately, x86 becomes attractive in any given niche--whether it's HPC or ultra-mobile--at the exact moment that you're no longer at a real performance disadvantage for using it. Now, this is a subtly different claim than saying that x86 has some inherently attractive features from an individual code geek's point of view--it doesn't. But what it has is enormous scale, because it's backed by this huge installed base of tools and expertise.

So the moment that Moore's Law makes it possible to use x86 in an area without suffering too badly from a relative performance standpoint, then it becomes a compelling choice for these scale-based, ecosystem reasons.

Real ISAs are always (and will always be) inferior to schemes such as NVIDIA's PTX when the one-time compile/optimization is not a problem, which it obviously is not for data-parallel HPC applications.

I'm not really sure how to respond to this... other than that I just completely disagree. I mean, nobody really /wants/ to use an intermediary ISA, or JIT, or anything like this, if they could just as easily use a product that natively implements the world's most popular ISA. I'd love to hear your arguments in favor of investing, say, a substantial portion of a large company's developer resources in a proprietary, intermediary ISA when there's an x86 solution that gets you, say, 80% there.

I think the only time in the history of computing that the industry has looked at x86 on the one hand and a JIT or BT solution on the other (where you code to an intermediary ISA that's not actually implemented in hardware) and said, "I'll take the non-x86 ISA" is with Java.
 
True - I think I read that with a bit of dyslexia.

But Hannibal, you did bring up NVIDIA being fabless as a point. And while in-house manufacturing clearly has its advantages, it would be naive to think that NVIDIA and TSMC aren't working extremely hard to close that gap. Meanwhile, in spite of those advantages, the proof of the power of NVIDIA and TSMC's cost advantage is in these companies' margins, financial statements, and ASPs. So people can talk all they want about Intel this and Intel that, but economically the evidence is there.

Understand that I'm speaking strictly long-term here. I don't think NVIDIA is going to go under next month or next year. When Larrabee comes out, even if Intel does knock everyone's socks off with some kind of RTRT + Raster badness to the point every gamer on earth must immediately rush out and buy an Intel GPU, it's going to take the market a while to figure out that the tectonic plates have shifted, and that a "GPU" that's really a many-core x86 part is fundamentally a game-changer.

I mean, all the RISC workstation vendors didn't go out of business when Intel launched the PPro. Of course, I think that GPUs vs. Larrabee will play out on a much more compressed time horizon than RISC vs. x86 did.
 
You only do a recompile when you need a specific piece of new functionality or a performance boost.
Well, if you don't care about the overall performance or the vector FPUs, why not just run your code on a vanilla ARM core? :) I get your point, but you're falling into the same trap that plagued the Itanium design team, IMO. Binary compatibility is only appealing if it delivers 'good enough' performance. For a chip that is exclusively aimed at high-performance workloads (whether that is HPC or Graphics), there is no such thing as 'good enough'.

As I said in my response to Rys, it's all about the x86 tool chain, and the relative painlessness of extending it to support new ISA features
That is one thing I completely and utterly agree with. The tremendous investments in toolchains for x86 are certainly an advantage, although I'd also like to point out that it's not perfect yet for multithreaded applications, and that debugging programs with tens of threads can be a nightmare right now IMO. I would certainly hope and expect this to be much easier in 2009+ than today, however.

Ultimately, x86 becomes attractive in any given niche--whether it's HPC or ultra-mobile--at the exact moment that you're no longer at a real performance disadvantage for using it.
Which corresponds to the exact second when performance becomes 'good enough', because x86 can never be optimal. While this indeed makes it attractive for the ultra-mobile market in the long-term, the very definition of HPC tends to be that there is no such thing as 'good enough'. The only reason (except the toolchains) why x86 is attractive in HPC today is that it has better economies of scale in terms of *production* and R&D. You know, the exact same ones GPUs also enjoy today...

So the moment that Moore's Law makes it possible to use x86 in an area without suffering too badly from a relative performance standpoint, then it becomes a compelling choice for these scale-based, ecosystem reasons.
Moore's Law implies nothing regarding relative performance penalties. If you are 50% less efficient, that won't magically change when you're thinking 32B transistors vs 16B transistors compared to when it was 32M vs 16M. As such, x86 only becomes attractive when it either has economies of scale (for production + R&D) that other solutions do not enjoy or when performance has become 'good enough'. In the case of traditional GPUs vs x86, neither of these potential advantages exists.

I mean, nobody really /wants/ to use an intermediary ISA, or JIT, or anything like this, if they could just as easily use a product that natively implements the world's most popular ISA.
That is correct, but if and only if perf/$ and perf/watt are roughly similar.

I'd love to hear your arguments in favor of investing, say, a substantial portion of a large company's developer resources in a proprietary, intermediary ISA when there's an x86 solution that gets you, say, 80% there.
First, let me counter some of the negative aspects you're pointing out. It should be noted that PTX is not proprietary, so whether AMD and Intel support it is really up to them. In the end, there is nothing that prevents interested parties from writing an efficient PTX-to-x86 converter. And if NVIDIA feels that would actually put their hardware in a good light, they could even easily do it themselves.

Really, your entire argument there is based around three points, so I'll answer them one by one: a) x86 has a much better toolchain today. b) x86 is easier than alternatives because everyone is used to it. c) JIT-like techniques nearly never worked before, so why would they suddenly make sense?

- A: NVIDIA and AMD have every interest in the world to invest aggressively to reduce the gap there between now and 2009/2010. I doubt they'll get there, but I don't think anyone can deny that it will be less of a problem (or advantage, from Intel's point of view) in that timeframe.
- B: Everyone is used to the latest architectures implementing x86, not the ISA itself. Optimizing for Larrabee and optimizing for Conroe are such fundamentally different tasks that you'll basically have to relearn everything, as far as I can tell. Abrash might be at an advantage here for various reasons, but I'm very skeptical about the rest of us.
- C: Traditional JIT languages only execute each code fragment a small number of times. Just doing the final stages of optimizations and compilation before running the program on a GPU is not the same thing at all, because that exact same code will be run thousands, or millions, or even billions of times. The overhead is pretty much negligible, and you can gain a lot from that extra bit of optimization.
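
To illustrate point C, here's a rough sketch using the standard CUDA driver API (the file and kernel names are placeholders): the PTX-to-native translation happens exactly once, when the module is loaded, and every subsequent launch reuses the result.

```c
// One-time JIT of an intermediate-ISA (PTX) module via the CUDA driver API.
// "kernel.ptx" and "my_kernel" are placeholder names for illustration.
#include <cuda.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // The PTX is translated to the installed GPU's native ISA here, once.
    cuModuleLoad(&mod, "kernel.ptx");
    cuModuleGetFunction(&fn, mod, "my_kernel");

    // ... set up arguments and launch fn as many times as the workload needs;
    // the one-time JIT cost above is amortized over all of those launches ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```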

In the end, I think many of your arguments pretty much fly out of the window when you consider how large the GPGPU market will likely be by 2H09, because Intel won't have anything to compete with it before then. We'll see how fast that goes, though.
 
I agree with the premise that Larrabee will be a serious threat to GPUs moving into HPC, or rather, the segment of HPC GPUs are initially targeting.

GPUs are going into the cheap flops segment, the segment that cheap x86 clusters have basically conquered.

x86 compatibility can be a powerful draw in some cases, though it is mitigated by the fact that HPC has a lot less inertia than x86's desktop stronghold.

There are things x86 compatibility requires or has brought along in the periphery that GPUs in HPC will have to combat.

Exceptions:

I've seen anecdotal evidence that precise or relatively precise exceptions (or just having exceptions at all) are apparently very interesting to those who work in HPC.
x86 compatibility does enforce such capability, though in theory any other full-blooded ISA would as well.

Exceptions and other non-performance features are signs of an architecture that is designed with the idea that what it computes matters.
GPUs have a well-known legacy for not being all that rigorous, and the first instantiations of their GPGPU products aren't far enough away from that legacy.

Flexibility:
Larrabee can simply do more than GPUs can. The benefits of x86 compatibility mean the cores are capable of existing independently of a master CPU: Larrabee can be its own master.
Some of the released slides indicate that this will be the case for some potential systems.
That would make Larrabee an almost drop-in replacement for some applications, though the performance drop without at least a quick recompilation would be horrendous.
Then again, a little performance is infinitely more than GPUs that can do none.

It's also where Larrabee can bring a cost advantage. GPUs will not escape the necessity of having the CPU along as a master device.
There are likely workloads where Larrabee could dispense with a separate CPU and either not care or actually gain performance.

If the desire is for cheap flops, and this is the market GPUs are targeting, it may help to cut out the middleman.

It may be possible Larrabee could position itself in other segments GPUs cannot touch. That means even if they hold position in the cheap flops sector, Larrabee can strike from safe harbor on the other side.
This assumes Intel's other chip designs don't get in the way.

Efficiency:
This is a wild card, and not a guarantee for Larrabee.
There are signs that it will have an edge.
Taking Folding@Home as an example: it was pointed out that flops results for GPU work units were in some ways inflated versus those run by Cell (an idea somewhat closer to Larrabee). GPUs in FAH have a lot of throwaway computation, while Larrabee might be able to get away with a little less. This is something that is going to be much more critical in the future and in my next point.

If Larrabee can manage greater efficiency, and GPUs have some pretty drastic fall-off in non-ideal workloads, then something like a factor of two or three shortfall in peak performance may not be enough to keep GPGPUs ahead.

If Larrabee is capable of twice the SP flops as it has DP, then there may not be a gap at all.

Power Power Power Power:
Intel has a history with power management.
Intel has had a power-aware philosophy beaten into it in the last few years.
GPUs currently are very coarse in their management (hey we have 2 speed modes!).
Regardless of peak performance, power is currently one of the most dominant factors in high-performance design, if not the most dominant.
I have not been impressed with some of the indications of the attitude various GPU execs have about that subject.
If they have changed their minds in recent months, they are still several years late to the party.
GPUs are efficient compared to hefty high-IPC cores trying to spit out peak flops. They will not be so lean if faced with a design that is far more aware of power than they are.
Whether Larrabee will be power efficient is a fine question. It may very well not be.
If GPUs progress as they have done, it won't matter because they'll be no better at best and likely worse.

Intel:
A company with the size and resources of Intel can handle long protracted fights, can establish or force beachheads, and can suffer more setbacks.

Unknown spoilers:
The x86 ISA.
Without knowing more, we don't know how much of a penalty Larrabee will pay for working with this ISA.

The vector extensions.
Without knowing more, we won't know what omissions might hold the chip back.

Larrabee's design quirks:
We may find out its early rendition will be an interesting design, but with some set of failings and shortfalls. For any ambitious new beginning, I'd bet heavily this will be the case.

Larrabee's focus:
Design is still king, even with a process lead and massive engineering resources.
Choices in Larrabee's design will do more to help or hinder it than having an in-house fab will.

GPUs:
A lot of my speculation is predicated on the idea that GPU designers don't change anything. This is incredibly unlikely.
They have been getting feedback for some time, so they should be smart enough to know to work on fixing their shortcomings.

Time:
Intel's timing will be important.
Too much time means GPUs have time to evolve.
Too little, and Larrabee will be rushed and come out in an environment that still favors GPUs or standard x86.

Intel:
A company with the size and resources of Intel has a lot of other concerns.
Competition from its own chip lines will be a threat to Larrabee's viability.
Intel also has a spotty history in sticking to its guns after an initial setback.
 
As for Sweeney, he said dedicated video hardware would have been obsoleted years ago, circa Voodoo 2 timeframe.

http://www.beyond3d.com/content/interviews/18/4

Feb-2004

Tim Sweeney said:
I think CPU's and GPU's are actually going to converge 10 years or so down the road. On the GPU side, you're seeing a slow march towards computational completeness. Once they achieve that, you'll see certain CPU algorithms that are amicable to highly parallel operations on largely constant datasets move to the GPU. On the other hand, the trend in CPU's is towards SMT/Hyperthreading and multi-core. The real difference then isn't in their capabilities, but their performance characteristics.

When a typical consumer CPU can run a large number of threads simultaneously, and a GPU can perform general computing work, will you really need both? A day will come when GPU's can compile and run C code, and CPU's can compile and run HLSL code -- though perhaps with significant performance disadvantages in each case. At that point, both the CPU guys and the GPU guys will need to do some soul searching!
 
Mmmm, but isn't that his second (or more) shot at providing a timeframe for that prediction to come true? Sort of like how commercially viable nuclear fusion seems to always be 10 years away whenever you check? :smile:

Tho Fusion and Larrabee are better ammunition for him than anything he had in 2004... or 2000... or 1998... or whenever he made that prediction for the first time.
 
If power constraints remain as they are, Sweeney's Age of Convergence will be a short one.

If his time frame were to come true, I predict that by 2020 we'll have some newfangled "graphics-specific CPU" and everyone will be singing the praises of some "revolutionary" separate and specialized silicon.
 
Mmmm, but isn't that his second (or more) shot at providing a timeframe for that prediction to come true?

Yes. The time I was referring to was way back when, not the more recent 2004 timeframe. The first time Sweeney said those things was around 1998-99. He even repeated it on the Motley Fool boards -- before B3D was even around.

A snippet of Sweeney's "The Sky is Falling", proclaimed on 11/11/1999 and archived on GameSpy
Tim Sweeney - I don't think voxels are going to be applicable for a while. My thinking on the evolution of realtime computer graphics is as follows:
1999: Large triangles as rendering primitives, software T&L.
2000: Large triangles, with widespread use software-tesselated curved surfaces, limited hardware T&L.
2001: Small triangles, with hardware continuous tesselation of displacement-mapped surfaces, massive hardware T&L.
2002-3: Tiny triangles, full hardware tesselation of curved and displacement-mapped surfaces, limited hardware pixel shaders a la RenderMan.
2004-5: Hardware tesselation of everything down to anti-aliased sub-pixel triangles, fully general hardware pixel shaders. Though the performance will be staggering, the pipeline is still fairly traditional at this point, with straightforward extensions for displacement map tesselation and pixel shading, which fit into the OpenGL/Direct3D schema in a clean and modular way.
2006-7: CPU's become so fast and powerful that 3D hardware will be only marginally benfical for rendering relative to the limits of the human visual system, therefore 3D chips will likely be deemed a waste of silicon (and more expensive bus plumbing), so the world will transition back to software-driven rendering. And, at this point, there will be a new renaissance in non-traditional architectures such as voxel rendering and REYES-style microfacets, enabled by the generality of CPU's driving the rendering process. If this is a case, then the 3D hardware revolution sparked by 3dfx in 1997 will prove to only be a 10-year hiatus from the natural evolution of CPU-driven rendering.
 