Larrabee delayed to 2011?

If they can't afford it, then the whole project was doomed either way. This ~10% is nothing compared with the massive challenge of implementing everything (except texture sampling) on generic programmable cores.

If they fail, x86 isn't to blame. Any other IHV with another ISA would fail just the same. It would just mean the time isn't right yet for this approach. However, if they succeed, the entire computing world is in their hands.

I agree, considering that lrb sw involves bigger magic than the hw, prolly sw is a bigger problem for them right now. But that doesn't make x86 any less of a problem.

No. Most of the code will not need a recompile. The cores are Pentium compatible, and on average 90% of every project is not performance critical. The remaining 10% will require a recompile (actually a complete redesign) to achieve high performance, but it's far easier to be able to do that incrementally than rewriting an entire application from scratch using unfamiliar tools.

I totally disagree with this line of thought. Almost all the apps which-make-sense-to-run-on-lrb will need a recompile as almost all of them use intrinsics. And that is assuming they don't have any inline assembly. If they used a JIT internally, well then good luck, 'cause now you need a brand new *performant* JIT backend.

The tools bit is also misguided, IMHO. If by familiar tools you mean gcc/msvc/icc/vtune etc., then no one outside intel will use it to write code for lrb, at least initially. The *tools* they will use are ocl/ogl/dx, and lrb doesn't get a pass over cypress/fermi there.
 
i, for one, believe that choosing a hapless ISA has not helped intel to deliver this part.
Intel has to deliver a GPU to create market share. There are two aspects to this: hardware and software. The hardware itself is finished, has demonstrated good synthetic benchmark results, and is ready to deliver for the HPC market. So clearly they are confident about that aspect. My "belief" is there must be a software issue.

Any other ISA would have made it much harder to deliver a part which would receive good reception from the software world. In the long run that's more critical than anything else. One year of delay is nothing compared to the competition's ongoing struggle to gain some acceptance.
not the performance per se. abysmal power efficiency (ie. performance/wattage) is what most likely killed this project (or 'delayed' it, as you put it).
The scalar cores have poor power efficiency, but again, most work takes place in the vector units. The texture samplers and memory controllers are also no different from other architectures. So another ISA for the scalar cores isn't suddenly going to dramatically improve overall power efficiency.
i strongly doubt it would've taken intel months to spot a sw performance anomaly in their own chip. IMO, performance has likely been relatively on-par with the projections.
Because Intel has done many projects like this before... ?

It's a hardware company. They didn't hire Abrash and other software rendering gurus for nothing. It's a very complex task and even for these gurus it takes a lot of time to try different approaches and achieve high performance. Like I said before, the parameter space is massive and there isn't one straightforward way to do things.

You really think another ISA would have solved the abysmal performance of Microsoft's reference rasterizer? First and foremost it's a software issue. Other software renderers are over a hundred times faster. Although REF clearly wasn't written with performance in mind, they didn't make it slow on purpose either. So this illustrates that there's a massive difference between optimized and unoptimized code.

Even for code that was written for performance from the start there can still be a lot of potential left for optimization. Pixomatic 3 claims full support for DX9, but is four times slower than Pixomatic 2. SwiftShader 2.0 shows no such performance drop compared to SwiftShader 1.0. In fact a lot of things that were believed to be optimal on the day of the 1.0 launch became significantly faster. And it would be foolish to believe that SwiftShader 2.0 is the final answer.
how does owning something that does not fit a job save you time and money toward that job?
You're starting with the wrong assumption. Any ISA would do. So it saves time and money to use the ISA and tools you already have. And where it will really save time and money is to create a software ecosystem.
i really doubt the bolded part ('hey! my C2D-targeting code runs faster on a p5! woot!').
Who said anything about running it faster? It's about being able to run it at all with little or no changes. This is really going to motivate developers. Nobody likes spending months rewriting all code for GPGPU to get a first result, only to realize that their approach doesn't perform as expected and they'll have to rewrite some of it. There's another 90/10 rule here: half of the time to create a project is spent on the first 90%, the second half is spent on the last 10%. With x86 compatibility developers can skip rewriting 90% of the code and stay motivated to tackle performance issues, which is the only thing they were really interested in anyway.
you totally lost me here. what x86 binary libraries you would like to run on a LRB, and why would you want to run them there, and not on your dual/quad-core monster CPU that was designed to eat old x86 code for breakfast?
Memory management, string manipulation, data conversion, containers, system methods, search algorithms, math routines, etc. You want to run them on Larrabee because the latency of a round-trip to the CPU is much higher.
again, everything you've argued for so far has been based on the premise that x86 fits the task nicely. but we don't agree there. and lo and behold, the 'competition' has material, currently-sold-on-the-market parts - how's that for doing better?
Where's the software for these parts? Can I open my C# compiler and start coding for it?

Larrabee will initially need experts to optimize various libraries and tools, but once available application developers can use them to add new functionality and create end-user applications. To dominate the computing world you have to cater for every developer. That's only possible by allowing direct access to the hardware and building various layers of software on top of it. Any ISA can be used for that, but x86 accelerates it by having a solid starting ground.

Note that they didn't make Larrabee partially x86 compatible. They made it Pentium compatible.
zms-08 tapes out Q1/Q2 '10 (no LRB will be on the market by then). the 'choppy motorcycle' demo is some pre-production code on zms-05 - final code reportedly runs better.
Which proves exactly my point above...
also, the part offers much more than GL ES - all in the form of APIs (Creative are actually opposed to the idea of giving direct access to the GPGPU part to application programmers, at least not until they have a sound compiler targeting that).
What? A delay due to software issues?
cool. and how big is the LRBni ecosystem?
Apart from a few developers who already started using the prototype primitives, it's non-existent. But that's really irrelevant. LRBni is an extension, just like for instance SSE3. Did Intel face major issues when trying to introduce SSE3? No, the only thing of critical importance for those processors was the support of Pentium instructions. The existing ecosystem made it easy for developers to incrementally adapt their software for SSE3, where it mattered, instead of having to recreate everything from scratch.
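To make the "incremental adoption" point concrete, here's a minimal sketch of the usual pattern: detect the extension at runtime and fall back to baseline code otherwise. It assumes GCC/Clang on x86 (<cpuid.h>); the kernel names are placeholders, not anything Intel actually ships.

```cpp
// Minimal sketch of how x86 code adopts a new extension incrementally:
// detect it at runtime, use it where it pays off, otherwise keep running
// the baseline path. Assumes GCC/Clang on x86 (<cpuid.h>).
#include <cpuid.h>
#include <cstdio>

static bool has_sse3()
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;           // CPUID leaf 1 unavailable
    return (ecx & 1u) != 0;     // ECX bit 0 = SSE3 ("PNI")
}

int main()
{
    // Only the hot loop gets an SSE3 path; everything else stays untouched.
    std::printf(has_sse3() ? "dispatch to SSE3 kernel\n"
                           : "dispatch to baseline x86 kernel\n");
}
```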
last time i checked, the market i was speaking about was spread a tad beyond netbooks. basically, it works like this - the moment desktop windows becomes a non-factor, that same moment atom develops weak knees (or simply follows the way of desktop windows).
I can't see how that relates to Larrabee. In the mobile/embedded market something like a 300% difference in performance/power is disastrous. For Larrabee the ISA choice has an impact of about 10% but it's offset by a massive advantage on the software front.
 
I agree, considering that lrb sw involves bigger magic than the hw, prolly sw is a bigger problem for them right now. But that doesn't make x86 any less of a problem.
Sure, x86 will continue to have some overhead. But looking at the CPU market that hasn't exactly stopped them.
I totally disagree with this line of thought. Almost all the apps which-make-sense-to-run-on-lrb will need a recompile as almost all of them use intrinsics. And that is assuming they don't have any inline assembly. If they used a JIT internally, well then good luck, 'cause now you need a brand new *performant* JIT backend.
Even if they use intrinsics or inline assembly, most binaries have a fallback path using plain Pentium code. For those that don't, it's typically in a module for which the application developers have the source code. So yes a recompile would be necessary, but it's pain free. My whole point was that developers would be able to get their code running on Larrabee with minimal effort, and work incrementally from that. They'll be able to release Larrabee-ready software very early. As soon as it runs faster than on a CPU, it's a win, and they can continue to improve it every version. Rewriting everything from scratch to run on a GPU has a much less attractive pain/gain ratio.

Note that several x86 compilers already have some pretty powerful auto-vectorization features. And quite a few developers are familiar with x86 intrinsics so they'll have little difficulty using LRBni intrinsics if necessary. Explicit parallelism is also easier to reason about than the implicit parallelism of for example OpenCL.
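For illustration, here's roughly what that incremental split looks like: a plain scalar loop as the portable fallback, and an explicitly vectorized path swapped in only where it matters. SSE intrinsics stand in for LRBni here, since the LRBni prototype headers aren't assumed; the structure, not the ISA, is the point.

```cpp
// Sketch of the incremental model: the scalar loop is the fallback that
// runs on any Pentium-class x86 core, and the explicitly vectorized path
// is used where the profiler says it matters. SSE stands in for LRBni.
#include <xmmintrin.h>

void scale_scalar(float* v, int n, float s)
{
    for (int i = 0; i < n; ++i)
        v[i] *= s;                         // baseline: runs anywhere
}

void scale_sse(float* v, int n, float s)
{
    __m128 factor = _mm_set1_ps(s);
    int i = 0;
    for (; i + 4 <= n; i += 4) {           // 4 floats per iteration
        __m128 x = _mm_loadu_ps(v + i);
        _mm_storeu_ps(v + i, _mm_mul_ps(x, factor));
    }
    for (; i < n; ++i)                     // scalar tail
        v[i] *= s;
}
```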
The tools bit is also misguided, IMHO. If by familiar tools you mean gcc/msvc/icc/vtune etc., then no one outside intel will use it to write code for lrb, at least initially. The *tools* they will use are ocl/ogl/dx, and lrb doesn't get a pass over cypress/fermi there.
A lot of people express interest in developing software for Larrabee in C, C++, C#, etc. Especially for GPGPU applications, people don't choose DX, GL, CL or CUDA because they want to, but because they have to. Many applications ask for very different tools. So it makes a lot of sense to provide direct access to the hardware and allow developers to create anything they need.

Why did GPUs even become programmable, when every application at that time was fixed-function? Why did they become unified, when every application had balanced vertex/pixel workloads? It's not always easy to see the future benefits when you're facing the current overhead. But I think we all agree we don't want to go back to fixed-function or non-unified hardware. Imagine what developing on a CPU would be like when all you got is APIs from Intel. One day we'll look back at hardware that only supports a few APIs as stupid.
 
The scalar cores have poor power efficiency, but again, most work takes place in the vector units. The texture samplers and memory controllers are also no different from other architectures. So another ISA for the scalar cores isn't suddenly going to dramatically improve overall power efficiency.

The scalar bits take up 2x the area of lrb vpu.:???: Are you sure that a leaner ISA w/o support for system calls, bcd, x87 and crazy instruction encoding could not have done this better?

Are you sure Intel will be able to scale this monster to 256 cores while preserving full cache coherency?

To be sure, ~35% of non-pad area in fermi is in ff hw. But we all know that it is destined to go away in favor of alu's/fermi cores (the "things" fermi has 16 of, not the crazy "cuda cores"). Some of that is prolly texturing, but are you sure intel has got some magic bullets up its sleeve that they'll be able to unleash tomorrow to counter that super-linear increase?
 
Sure, x86 will continue to have some overhead. But looking at the CPU market that hasn't exactly stopped them.

The CPU market has very different dynamics than the gpu market, so the cpu example is not really relevant. Today Intel is in the place of the RISC vendors during the CISC vs. RISC days, and the established players have a huge head start.

Even if they use intrinsics or inline assembly, most binaries have a fallback path using plain Pentium code. For those that don't, it's typically in a module for which the application developers have the source code. So yes a recompile would be necessary, but it's pain free. My whole point was that developers would be able to get their code running on Larrabee with minimal effort, and work incrementally from that. They'll be able to release Larrabee-ready software very early. As soon as it runs faster than on a CPU, it's a win, and they can continue to improve it every version. Rewriting everything from scratch to run on a GPU has a much less attractive pain/gain ratio.

I hadn't thought of the fallback paths. Thanks for bringing that up. So the devs who are willing to write intrinsics over portable shaders, who are happy to target 1 out of 3 IHVs, and who don't mind restricting their market to a product which right out of the gate has less perf/W/area than its competitors and which is taking a somewhat risky market strategy (at least initially), will be happy to write code for lrb. And if they are gonna recompile to use lrbni intrinsics anyway, then how does x86 help? If they are going to be recompiling code (many times) to do incremental development, how does x86 help?


Note that several x86 compilers already have some pretty powerful auto-vectorization features. And quite a few developers are familiar with x86 intrinsics so they'll have little difficulty using LRBni intrinsics if necessary. Explicit parallelism is also easier to reason about than the implicit parallelism of for example OpenCL.

To each his own, I guess. I find writing shader-like code (with implicit parallelism) easier to think about.

A lot of people express interest in developing software for Larrabee in C, C++, C#, etc. Especially for GPGPU applications, people don't choose DX, GL, CL or CUDA because they want to, but because they have to. Many applications ask for very different tools. So it makes a lot of sense to provide direct access to the hardware and allow developers to create anything they need.

I don't think much C, C++, or C# code for gpus will be written unless there is an industry-standard modification to these languages. How many will be willing to ignore 2 out of 3 IHVs with better perf/W characteristics?

Why did GPUs even become programmable, when every application at that time was fixed-function? Why did they become unified, when every application had balanced vertex/pixel workloads? It's not always easy to see the future benefits when you're facing the current overhead. But I think we all agree we don't want to go back to fixed-function or non-unified hardware. Imagine what developing on a CPU would be like when all you got is APIs from Intel. One day we'll look back at hardware that only supports a few APIs as stupid.

No doubt, the constraints of present-day gpus will be relaxed, but IHV-specific code? I am not buying it right now.
 
Intel has to deliver a GPU to create market share. There are two aspects to this: hardware and software. The hardware itself is finished, has demonstrated good synthetic benchmark results, and is ready to deliver for the HPC market. So clearly they are confident about that aspect. My "belief" is there must be a software issue.

I think it is a combination of hardware and software issues.

By now releasing a new GPU architecture without DX11 is pretty problematic IMHO.
Larrabee lacks several DX11 vector instructions, like for example bit reversal, bit count and bit search. (there are some scalar but no vector)
Its texture units also would not have support for the new block compression formats. Emulating this in software, resulting in abysmal performance, would not be an option.
In short, the hardware is lacking in the DX11 department.

Next I would have expected major performance issues with heavy texturing, mainly due to insufficient latency hiding and L2 cache thrashing.
Also, knowing Intel's previous efforts, I cannot imagine that AF texturing quality would be good enough.

Anti-aliased rendering performance would likely take a much steeper nosedive compared to GPUs, due to the lack of compressed Z/color, shrinking tile size, lack of MSAA hardware, blending/Z hardware, etc.

Finally, with the advent of tessellation and the growing importance of high triangle rates,
Larrabee is likely not able to come anywhere close to the 500+ Mtris/s rate of current GPUs, due to the lack of hardware rasterizers and software rasterization becoming hopelessly inefficient for small triangles.
 
Are you hinting at some kind of runtime profiling by the OS, to schedule this stuff better? *Perhaps* physics makes sense to be scheduled across, but not many others do.

This is one part about LRB that has not come out yet. I would suggest you guys look into it a lot more closely, a lot of the magic is there.

-Charlie
 
The scalar bits take up 2x the area of lrb vpu.:???: Are you sure that a leaner ISA w/o support for system calls, bcd, x87 and crazy instruction encoding could not have done this better?

The P54c core that it is based on was an 83mm^2 chip on .35µm. Intel process roadmaps went .35µm -> .25µm -> .18µm -> .13µm -> 90nm -> 65nm -> 45nm, so that would be 6 shrinks, or 1/64th the area. The 32 cores @1.3mm^2 of that chip would be effectively half the original area, or, rounding up, 42mm^2.
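Spelling out that back-of-the-envelope math (assuming each full shrink roughly halves area, which is an idealization):

```cpp
// Back-of-the-envelope version of the scaling argument above: each full
// process shrink roughly halves area, so six shrinks divide it by 2^6.
#include <cstdio>

int main()
{
    const double p54c_area = 83.0;                        // mm^2 at 0.35 um
    const int    shrinks   = 6;                           // 0.35 um -> 45 nm
    const double per_core  = p54c_area / (1 << shrinks);  // ~1.3 mm^2
    const double total     = 32 * per_core;               // ~41.5 mm^2
    std::printf("per core: %.2f mm^2, 32 cores: %.1f mm^2\n",
                per_core, total);
}
```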

On a 600+mm^2 chip, 42mm^2 isn't all that much; how much of that 1.3mm^2 per core is waste/expendable? Is it worth it to try and save .2mm^2 per chip, and possibly hose some obscure compiler function in the process?

-Charlie
 
In light of the cancellation of the initial variant and the delay of the unveiling, it's more like that's where the magic was, or rather, where the magic was until such time in the future that the past tense becomes is, which presently puts Intel as not only the creator of a new graphics architecture, but also the pioneering developer of a new tense for which I lack the wordiness and non-linear concept of time to truly describe or comprehend.
Or maybe I currently didn't, until such time as I do in the future.
Past conditional future resolved tense, or (will?) somesuch.

Back in normal time, I've wondered before if it were possible for performance counters to be monitored by the run-time and the shader compiler that now (will, did?) runs on Larrabee.
Part of my uncertainty is how much of the P54 core would remain at that point, since the elder architecture didn't exist with the modern range of counters and performance monitors, and how aggressive the compiler could be.

Maybe a range of different code fragments that can be branched to, if the run time detects a fall-off in performance, would help. The question is how one updates the threads in a timely fashion and how frequently this can be done before other penalties become excessive.
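Something like this, at its crudest (wall-clock timing standing in for the performance counters, and the paths, thresholds and names purely made up for illustration):

```cpp
// Rough sketch of the idea above: time the current code path per batch and
// hop to an alternative fragment when throughput falls off. A real scheme
// would use hardware counters and could also hop back.
#include <chrono>

using Kernel = void (*)(float*, int);

void fast_path(float* v, int n) { for (int i = 0; i < n; ++i) v[i] *= 2.0f; }
void safe_path(float* v, int n) { for (int i = 0; i < n; ++i) v[i] += v[i]; }

void run_adaptive(float* data, int n, int batches)
{
    Kernel current = fast_path;
    const double budget_us = 50.0;                 // assumed per-batch budget
    for (int b = 0; b < batches; ++b) {
        auto t0 = std::chrono::steady_clock::now();
        current(data, n);
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > budget_us)
            current = safe_path;                   // fall-off detected: switch fragments
    }
}
```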
 
The P54c core that it is based on was an 83mm^2 chip on .35µm. Intel process roadmaps went .35µm -> .25µm -> .18µm -> .13µm -> 90nm -> 65nm -> 45nm, so that would be 6 shrinks, or 1/64th the area. The 32 cores @1.3mm^2 of that chip would be effectively half the original area, or, rounding up, 42mm^2.
As cheery as that math is, if this is a valid starting point, then the VPU aggregate area would be 1/3 of the 42 (going by Intel statements), so the core area of Larrabee would be around 56 mm2 in total. What the other 550mm2 is doing there would be a major question.

The fuzzy die photo for Larrabee doesn't make me think this simple scaling is all there is to the story.
 
The scalar bits take up 2x the area of lrb vpu.
Sure, but it's not like a GPU's VPU can operate in a vacuum either. The overhead you can actually attribute to x86, for the entire die, is likely in the order of 10%.
Are you sure that a leaner ISA w/o support for system calls, bcd, x87 and crazy instruction encoding could not have done this better?
Those features only account for a minor amount of overhead, and some might actually prove useful.

Transmeta never succeeded in offering any substantial advantage, despite being free from most of x86's so-called inefficiencies.
Are you sure Intel will be able to scale this monster to 256 cores while preserving full cache coherency?
They already have 48-core x86 processors for research purposes. When transistor budgets allow, I don't think they'll have a lot of trouble scaling to 256 cores or more. I wouldn't worry too much about Intel's ability to create such hardware. Again the real challenge will be on the software front. At that scale Amdahl's Law really demands that every bit of parallelism is extracted. What's needed is programming languages and compilers which take care of most of that.
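A quick Amdahl's Law illustration of why, at 256 cores, even a tiny serial fraction dominates:

```cpp
// Amdahl's Law: speedup = 1 / (s + (1 - s) / N), for a few serial
// fractions s at N = 256 cores.
#include <cstdio>

int main()
{
    const int N = 256;
    const double serial[] = { 0.10, 0.01, 0.001 };
    for (double s : serial) {
        double speedup = 1.0 / (s + (1.0 - s) / N);
        std::printf("serial %.1f%% -> %.1fx speedup on %d cores\n",
                    s * 100.0, speedup, N);  // ~9.7x, ~72x, ~204x
    }
}
```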
To be sure, ~35% of non-pad area in fermi is in ff hw. But we all know that it is destined to go away in favor of alu's/fermi cores (the "things" fermi has 16 of, not the crazy "cuda cores"). Some of that is prolly texturing, but are you sure intel has got some magic bullets up its sleeve that they'll be able to unleash tomorrow to counter that super-linear increase?
They actually have quite a bit of freedom. Modern x86 processors are very different internally from the earliest x86 processors. And also on the ISA side they can always add new extensions, and deprecate or emulate old ones. x86 has faced numerous issues during its lifetime, but has always survived despite people predicting its end. This really illustrates again that the ISA doesn't have that much of an impact.
 
Transmeta never succeeded in offering any substantial advantage, despite being free from most of x86's so-called inefficiencies.
Analysis of the internal ISA for Crusoe revealed that there were kinks to the ISA and physical architectures that existed solely for the sake of x86 emulation.
It would have been interesting to see in these many-core times what could have been done, as power budgets have become much more dominant than they were back in those days.

They already have 48-core x86 processors for research purposes. When transistor budgets allow, I don't think they'll have a lot of trouble scaling to 256 cores or more.
But those are not coherent. It's a science project that doesn't attempt to scale that part of the equation.

They actually have quite a bit of freedom. Modern x86 processors are very different internally from the earliest x86 processors. And also on the ISA side they can always add new extensions, and deprecate or emulate old ones. x86 has faced numerous issues during its lifetime, but has always survived despite people predicting its end. This really illustrates again that the ISA doesn't have that much of an impact.
The K7 core had nearly twice the transistor count of the Alpha EV67.
x87 floating point butchered FP performance for years, despite masses of transistors thrown at it.

If an ISA is well-behaved, it may not have much of an impact over another well-behaved ISA.
x86 is at least a little bit naughty.
 
By now releasing a new GPU architecture without DX11 is pretty problematic IMHO.
Larrabee lacks several DX11 vector instructions, like for example bit reversal, bit count and bit search. (there are some scalar but no vector)
I really doubt that Intel hasn't kept a close eye on the DX11 specs and updated LRBni with new instructions when necessary.

And it wouldn't be unreasonable to emulate some of these instructions. They take extra space to implement in hardware, while being hardly ever used. I doubt any DX11 game would suffer from not having fast implementations for these operations. Heck, I expect other IHVs to also consider implementing them only in the special function units or emulating them with a number of regular instructions.
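For what it's worth, the operations in question reduce to a handful of regular ALU instructions each (and the same masks-and-shifts tricks work per SIMD lane just as well); a scalar sketch:

```cpp
// Sketch of "emulating with regular instructions" for the bit operations
// mentioned above, using nothing beyond plain 32-bit shifts, masks and adds.
#include <cstdint>

uint32_t bit_reverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);  // swap bits
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);  // swap pairs
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);  // swap nibbles
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);  // swap bytes
    return (x << 16) | (x >> 16);                             // swap halves
}

uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;   // sum the per-byte counts
}
```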
Next I would have expected major performance issues with heavy texturing, mainly due to insufficient latency hiding and L2 cache thrashing.
Also, knowing Intel's previous efforts, I cannot imagine that AF texturing quality would be good enough.
Considering that Larrabee is ready as an HPC part but not as a GPU it's indeed possible they faced some texturing issues. Latency hiding is another software issue though; scheduling and adjusting the number of qquads per thread can make a big difference.
Anti-aliased rendering performance would likely take a much steeper nosedive compared to GPUs, due to the lack of compressed Z/color, shrinking tile size, lack of MSAA hardware, blending/Z hardware, etc.
Using a tile-based rendering approach makes compression unnecessary. However it does require a very calculated use of cache memory and carefully orchestrated scheduling of tasks to prevent thrashing. Again most of the burden falls onto the software side and there are many implementations to explore and parameters to tweak.
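As a rough sketch of what the binning step of such a tile-based approach looks like (tile size and data layout here are just assumptions for illustration):

```cpp
// Minimal sketch of the binning step in a tile-based software renderer:
// each triangle is assigned to the screen tiles its bounding box touches,
// so per-tile work (and its Z/color data) can stay resident in one core's
// cache. Purely illustrative.
#include <algorithm>
#include <vector>

struct Tri { float x[3], y[3]; };

constexpr int TILE = 64;   // assumed 64x64-pixel tiles

void bin_triangles(const std::vector<Tri>& tris, int width, int height,
                   std::vector<std::vector<int>>& bins)  // bins[tile] -> triangle ids
{
    const int tilesX = (width  + TILE - 1) / TILE;
    const int tilesY = (height + TILE - 1) / TILE;
    bins.assign(tilesX * tilesY, std::vector<int>{});

    for (int i = 0; i < (int)tris.size(); ++i) {
        const Tri& t = tris[i];
        float minX = std::min({t.x[0], t.x[1], t.x[2]});
        float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        float minY = std::min({t.y[0], t.y[1], t.y[2]});
        float maxY = std::max({t.y[0], t.y[1], t.y[2]});

        int tx0 = std::max(0, (int)minX / TILE);
        int tx1 = std::min(tilesX - 1, (int)maxX / TILE);
        int ty0 = std::max(0, (int)minY / TILE);
        int ty1 = std::min(tilesY - 1, (int)maxY / TILE);

        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(i);  // triangle touches this tile
    }
}
```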
Finally, with the advent of tessellation and the growing importance of high triangle rates,
Larrabee is likely not able to come anywhere close to the 500+ Mtris/s rate of current GPUs, due to the lack of hardware rasterizers and software rasterization becoming hopelessly inefficient for small triangles.
Wrong.
 
Using a tile-based rendering approach makes compression unnecessary. However it does require a very calculated use of cache memory and carefully orchestrated scheduling of tasks to prevent thrashing. Again most of the burden falls onto the software side and there are many implementations to explore and parameters to tweak.
Unnecessary? Let's say the bus is 1024 bits. Without compressed Z, at 32 bits per sample you are limited to 32 Zs per clock.
 
Analysis of the internal ISA for Crusoe revealed that there were kinks to the ISA and physical architectures that existed solely for the sake of x86 emulation.
They must be there for a reason. Else they would have just emulated those as well. So people clearly actually put some CISC instructions to good use.
It would have been interesting to see in these many-core times what could have been done, as power budgets have become much more dominant than they were back in those days.
Transmeta mainly attempted to improve efficiency using VLIW. The Pentium 4 processors it competed with at that time had a very low IPC. So extracting static instruction parallelism made perfect sense. However, today's x86 CPUs have much improved IPC. And there's only so much static instruction parallelism that can be extracted. So today VLIW wouldn't offer any substantial benefit. You can even look at modern x86 cores as using dynamic VLIW, which comes with extra performance benefits which offset the overhead in the decoders and schedulers.

I think a lot of people underestimate how hard it would be to beat today's x86 processors. Even with total ISA freedom, at least 90% of the die space will have to be identical to achieve that level of performance. Just look at this Phenom II die shot. The four cores themselves only account for 25% of the entire die space, and even that still contains the L1 caches. Heck it would even be a massive accomplishment to shave off 5% without loss of performance. Of course the caches don't consume as much as the actual pipelines, but still I doubt there's much to gain using a different ISA.
But those are not coherent. It's a science project that doesn't attempt to scale that part of the equation.
I wasn't aware of that. It looks like it offers high-bandwidth low-latency message passing instead.

I wonder if they could offer the best of both worlds: scalable directory-based coherency, and message passing? The directory-based coherency would be slower than snooping but using speculation would hide latency. It's only there for backward compatibility and when you really need it.
The K7 core had nearly twice the transistor count of the Alpha EV67. x87 floating point butchered FP performance for years, despite masses of transistors thrown at it.
The Alpha was exceptional mainly due to being almost completely custom designed. It was also exceptionally expensive, and exceptionally power hungry due to extensive use of dynamic logic.

Early Alphas also didn't adhere to IEEE-754. The EV6 was the first to offer full conformance in hardware. x87 offered full conformance more than ten years earlier. It also offered an 80-bit format, and trigonometric functions.
 
Z stays on the chip, so you're not bus limited.
??? The internal bus to the cache is 1024 bits, that is what I am referring to. The external memory bus is probably far smaller than 1024 bits. And Z doesn't always stay on chip, even for a TBDR. Shadow buffers are used in nearly every game these days. Yes, the overdraw will happen at chip BW speeds, but that will still be a limiting factor if you don't compress as I mentioned in my initial post.
 
Wasn't transmeta using VLIW + dynamic recompilation to run x86 code? If I'm not mistaken about that, then I'm not sure how this qualifies as being "mostly free" from x86 baggage. It seems to me that the dynamic recompilation engine would be limited in the scope and expense of the optimizations it could apply relative to a static compiler. Such a recompiler also has no higher level source it can derive valid transformations from. So when you claim that it failed to demonstrate substantial advantages over x86, do you mean it failed when executing x86 code, or that even when targeted by a native compiler, no substantial advantages were to be had?
 
Intel has to deliver a GPU to create market share. There are two aspects to this: hardware and software. The hardware itself is finished, has demonstrated good synthetic benchmark results, and is ready to deliver for the HPC market. So clearly they are confident about that aspect. My "belief" is there must be a software issue.
maybe. then again, there may be hw-posed issues for which you may never find a satisfactory solution in sw. as you said, since it's always the combination of hw and sw, a potential design issue discovered late in the hw (due to late sw, etc) could amplify the burden on the final sw by orders of magnitude. but since you brought up Abrash' role in the project - if Mike could not deliver a satisfactory rasterizer running on this hw, then maybe the hw was not meeting its GPU-domain targets?.. just speculating here, of course.

Any other ISA would have made it much harder to deliver a part which would receive good reception from the software world. In the long run that's more critical than anything else. One year of delay is nothing compared to the competition's ongoing struggle to gain some acceptance.
i really don't understand your logic about ISA acceptance - ISAs today are mostly judged by how well they play with their compilers (from coders' perspective), and meet their price/performance/wattage envelopes (from EE/business perspectives) - not by their proximity to their 40th anniversaries. look around you - chances are you'll find more devices hosting 'young' ISAs than such hosting x86 or older. so what reception, and by whom, are you concerned about?

The scalar cores have poor power efficiency, but again, most work takes place in the vector units. The texture samplers and memory controllers are also no different from other architectures. So another ISA for the scalar cores isn't suddenly going to dramatically improve overall power efficiency.
dramatically - no. but then again LRB's scalar ISA did not have to copy in-verbatim any existing ISA - intel could have used that to their best advantage if they hadn't been so concerned with the central socket. heck, they could've used a modernized form of their dear x86 (say, x86-64) and shaved off the legacy, trimmed the pipeline, retooled the protection mechanism - any/all of the above, just to make LRB a better GPGPU core, while staying in familiar waters. but nooo - it had to be the word97-compliant GPU.

Because Intel has done many projects like this before... ?
creating a quirks/bugs-free design off the bat, and tracing such issues fast in your own designs are two different things. yes, intel have done a few CPU designs, so at least one could expect them to be able to identify an issue quickly, once the latter has been registered, and the symptoms - understood. of course, spotting an issue and resolving it are yet other two different things - there are potential issues that could invalidate fundamental design decisions.. like a shit-choice of a housekeeping ISA (just pulling a wild example here).

It's a hardware company. They didn't hire Abrash and other software rendering gurus for nothing. It's a very complex task and even for these gurus it takes a lot of time to try different approaches and achieve high performance. Like I said before, the parameter space is massive and there isn't one straightforward way to do things.
i agree, the task could become prohibitively complex if they were seeking that head-shot at the GPU - 'behold, we're the new GPU masters!' - again, i don't think anybody (clinically sane) expected LRB to dethrone any GPU heavyweights - if intel themselves had such expectations then maybe they were not familiar with the problem domain they were getting involved in. again, not saying they had such expectations - just trying to carelessly speculate on the events we've witnessed lately.

You really think another ISA would have solved the abysmal performance of Microsoft's reference rasterizer? First and foremost it's a software issue. Other software renderers are over a hundred times faster. Although REF clearly wasn't written with performance in mind, they didn't make it slow on purpose either. So this illustrates that there's a massive difference between optimized and unoptimized code.
i'd venture to guess ms' ref performance issues are mostly algorithmic, and only secondly - of insufficient clock-counting. IOW, one cannot say that had they 'coded to the metal' of any ISA they'd have achieved much better performance. how does that prove the (non-)fitness-to-a-task of an ISA, though?

You're starting with the wrong assumption. Any ISA would do. So it saves time and money to use the ISA and tools you already have. And where it will really save time and money is to create a software ecosystem.
any ISA? really? tell you what, let's pick the good old 65816, slap a non-braindead SIMD extension on it and set a strong foot on the GPU turf overnight! as a proof of our dedication to the software world, we'll give them free binary-level appleGS emulation! what say you?

Who said anything about running it faster? It's about being able to run it at all with little or no changes. This is really going to motivate developers. Nobody likes spending months rewriting all code for GPGPU to get a first result, only to realize that their approach doesn't perform as expected and they'll have to rewrite some of it.
maybe i misunderstood you, but i believe you mentioned something about 'incremental performance improvements of present code' in your original post, ergo my performance reference. pardon me if i've erred.

There's another 90/10 rule here: half of the time to create a project is spent on the first 90%, the second half is spent on the last 10%. With x86 compatibility developers can skip rewriting 90% of the code and stay motivated to tackle performance issues, which is the only thing they were really interested in anyway.
so let me know if i got you right here: developers would really like to spare themselves a trivial recompile of their existing scalar code to a new scalar ISA, but they would eagerly face the challenges of a brand new VPU which, apropos, is only meant to carry the bulk of the workload? hmm..

Memory management, string manipulation, data conversion, containers, system methods, search algorithms, math routines, etc. You want to run them on Larrabee because the latency of a round-trip to the CPU is much higher.
i wonder how often a developer uses those in binary these days. i know for myself that i haven't used much of a container, a string, or anything of this caliber in binary form in my code for the past 10, if not more, years. the little that has still happened to be in binary has been deeply entrenched in OS services. oops, did i just lock my LRB development to some dated OS? sorry, clumsy me.

Where's the software for these parts? Can I open my C# compiler and start coding for it?
you can open your whatever compiler and start using the APIs these parts prudently offer. occasionally, you might have to resort to heresies like OCL, CUDA or native compilers *gasp*. regardless, it would be still a tad better than what you could do with a LRB today.

Larrabee will initially need experts to optimize various libraries and tools, but once available application developers can use them to add new functionality and create end-user applications. To dominate the computing world you have to cater for every developer. That's only possible by allowing direct access to the hardware and building various layers of software on top of it. Any ISA can be used for that, but x86 accelerates it by having a solid starting ground.
a solid ground of 99% HL-language-encoded libraries? the same libraries that run on every ISA and their dog today? these are indeed some fine 'solid starting grounds', for which anybody would gladly trade off whatever little ISA sanity they might have had.

What? A delay due to software issues?
what delay - the parts in question are on the market. the APIs are on the market. the native compilers are coming last, but you can ask intel how that generally goes. *wink*

LRBni is an extension, just like for instance SSE3. Did Intel face major issues when trying to introduce SSE3? No, the only thing of critical importance for those processors was the support of Pentium instructions. The existing ecosystem made it easy for developers to incrementally adapt their software for SSE3, where it mattered, instead of having to recreate everything from scratch.
issues with what - putting their next SIMD into their next CPU? no - they just sent it out the door - no issues. or do you mean issues for the developers - easy use of the new extension without having to (re)learn a new ISA? let me see - no auto-utilizing/auto-vectorizing compilers for generations upon generations of the ISA, instead some rudimentary in-house libraries, but clear delegation of the onus of making use of the new extension to the app developers. which, combined with the lack of basic capabilities commonly found in other SIMD ISAs, can only mean no issues for the app developers. *grin*

let me ask you something: why, in your sincere opinion, developers embraced the GPU shader architectures, particularly after the advent of the HL shading languages? and why do you think intel was so determined to come up with a GPGPU of their own. i mean, after all they had the, erm.. fine GMA line of GPUs (more than half of the pc market - that's what we call a developers' embrace, right?), and they held the key to the central socket. so why?

I can't see how that relates to Larrabee. In the mobile/embedded market something like a 300% difference in performance/power is disastrous. For Larrabee the ISA choice has an impact of about 10% but it's offset by a massive advantage on the software front.
the only reason i brought it up was to improve your awareness of the current state of the GPGPU market.

so, another Q from me: what, again in your sincere opinion, failed this project? maybe Abrash & co's inability to pull a half-decent D3D/OGL implementation for it?

i mean, according to you, the performance/power ratio should've been ok (unless there was something deeply screwed up in LRBni, since the rest of the chip - namely x86 and caches - were just fine), the adoption of the programming model would've been fine (x86? - woot!). pretty much everything would've been roses. and yet, no LRB on the shelves after 3 years of focused effort (by some pretty smart individuals at that, where we totally agree). so why*?

* clearly it couldn't have been a delay in the native compiler /deep sarcasm
 
Whether LRB Prime was put on ice for X timeframe just because of sw issues is something we'll find out in the future, when LRB-next details surface.

However since Intel clearly stated that LRB was behind in terms of its hw and sw schedule, it's rather naive to blame sw alone.
 