Larrabee delayed to 2011?

I know. I was just putting the difficulty of programming with intrinsics in context with the harder issues of gpu programming. I've spent far longer trying to tease out better performance because of the memory architecture than I did writing intrinsics code.
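
For anyone who hasn't written this kind of code, here's a minimal sketch of what "intrinsics code" looks like in practice - just a 4-wide SSE multiply-add over float arrays (the function and names are purely illustrative, not from the workload being discussed):

[code]
// Minimal SSE intrinsics sketch (illustrative only): dst = a*x + y, 4 floats at a time.
// Assumes n is a multiple of 4 and the pointers are 16-byte aligned.
#include <xmmintrin.h>

void saxpy_sse(float* dst, const float* x, const float* y, float a, int n)
{
    __m128 va = _mm_set1_ps(a);                          // broadcast a into all 4 lanes
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);                  // aligned 4-float load
        __m128 vy = _mm_load_ps(y + i);
        __m128 r  = _mm_add_ps(_mm_mul_ps(va, vx), vy);  // a*x + y
        _mm_store_ps(dst + i, r);                        // aligned store
    }
}
[/code]

The intrinsics themselves are the easy part; keeping those loads fed (alignment, cache blocking, avoiding remote-L2 trips) is where the time goes, which is exactly the point above.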

Which workload was it? Could you briefly describe it here? I am curious to know in what situation intrinsics on lrb could be faster than gpu.
I had an issue with it being described as 'best' without qualification. For some people performance is everything.

Join the club!
 
cores that are flexible enough to run everything are abundant on the market - those are usually referred to as CPUs : ) by stating that you want to have a homogeneous config of such cores (so you would not need to discriminate in your scheduler) you basically negate all the advantages an 'inherent knowledge of the task domain (aka specialization for a given task domain)' would give you - which is all (GP)GPUs have been doing for the past 15-20 years.
Not true at all... there are plenty of things that can run decently well in both domains. Sometimes the tweakables change, sometimes the algorithms even change, but there's lots of opportunity for load balancing there already. Physics is a great example that has a good mix of stuff that runs pretty decently in both places. And let's remember, this is a question of running *something* vs *nothing* on idle parts of the machine... obviously the scheduler would first try to put things in the most optimal places.

In order: a single-issue x86 core with a tiny cache (accessing the L2 of a remote core has higher latency) makes for laughable performance if you are the competition. It makes for really long coffee breaks if you have to use it yourself.
Doesn't matter if it's not optimal for it, we're talking about cases where it's otherwise going to sit completely idle. More flexibility in a scheduler is always a good thing... you can always get better global schedules with fewer constraints - that's not really up for debate.

It is with this specialization that GPUs get an order of magnitude better perf/W/mm on graphics.
Obviously (come now, do you honestly think that I don't understand the domain here?)! But the point is the incremental cost of doing "ok" on other types of tasks is often pretty small, and it opens up much more flexible scheduling and programming models.

Because any core that is optimized to run Word will suck at graphics and vice versa, no 2 ways about it. Word and graphics are different workloads, and different kinds of hw architecture are needed to run them fast.
You could apply the same argument against the move to unified shaders, but guess what... it actually turns out that running a vertex shader is similar *enough* to a pixel shader that it makes sense to have the flexibility to load balance them, regardless of the fact that the more general units are less efficient at either task than dedicated hardware.

As I mentioned above, the same applies for stuff like physics, and many other tasks. They work pretty decently in both domains.

Just because I can run a thread of mysql on my Intel GPU, doesn't mean that it makes any sense whatsoever for me to run it there. It is a luxury, which definitely has an impact on perf/mm/W, without any obvious benefits. There may be unanticipated benefits down the line, but at the moment I can see only marketing BS as the reason for putting x86 into LRB.
You're just being naive now. Databases are actually a great example of stuff that has been shown to have the possibility of being greatly accelerated on throughput parts. But wait, it runs pretty well on CPUs too...

For sure parts can't necessarily be totally generalist, but there's no way that being able to load balance between heterogeneous cores where it makes sense (to the scheduler) is a bad thing. Obviously we need a smart scheduler that understands the heterogeneity, but it's still a good step up from not even having the ability in the first place, and guaranteeing idle hardware, such as is the case right now.
 
Not true at all... there are plenty of things that can run decently well in both domains. Sometimes the tweakables change, sometimes the algorithms even change, but there's lots of opportunity for load balancing there already. Physics is a great example that has a good mix of stuff that runs pretty decently in both places. And let's remember, this is a question of running *something* vs *nothing* on idle parts of the machine... obviously the scheduler would first try to put things in the most optimal places.

You might want to edit this post somewhat. You have attributed some of my comments to darkblu.

Doesn't matter if it's not optimal for it, we're talking about cases where it's otherwise going to sit completely idle. More flexibility in a scheduler is always a good thing... you can always get better global schedules with fewer constraints - that's not really up for debate.

Scheduling is an interesting point. But with an OoM perf imbalance, throwing a Crysis pixel shader at a Sandy Bridge core will choke it, likely bringing the OS scheduler to its knees, and taking the system with it. My dual core laptop sometimes jams up with 1 heavy thread. :rolleyes: Maybe it's different with more cores. :???:

Also, latency sensitive tasks are often >90% of our daily apps, and for sure scheduling a firefox thread on a lrb core will bring anyone's patience to its knees, if not the system. :smile:

Since the cores are now specializing, scheduling them around freely is not likely prudent. And in the event you speak of, i.e. some of the throughput/latency cores idling, the few milliseconds saved are unlikely to be worth it. It might be more sensible to throttle the idling cores and save power than to make them do stuff they weren't meant to do. After all, running cross-jobs will burn way more power, and power is the real limitation today.

[Somewhat related] We already have support for situations like this, where one core can be overclocked by slowing down others.

Obviously (come now, do you honestly think that I don't understand the domain here?)! But the point is the incremental cost of doing "ok" on other types of tasks is often pretty small, and it opens up much more flexible scheduling and programming models.
Yes, except "ok" is ~20-30%. Not an OoM.
http://forum.beyond3d.com/showpost.php?p=1338048&postcount=112

And rest assured, throughput cores are so far ahead and rocketing away so fast that it is virtually impossible for latency cores to catch up any time soon.

You could apply the same argument against the move to unified shaders, but guess what... it actually turns out that running a vertex shader is similar *enough* to a pixel shader that it makes sense to have the flexibility to load balance them, regardless of the fact that the more general units are less efficient at either task than dedicated hardware.

Both vs and ps were highly math intensive and relatively branch free. Both needed fp32 and texturing by the time sm3.0 came along. Unification was a logical step for another reason as well: by running 2 stages of a pipeline on the same hw, the possibility of load imbalance is reduced. This is another of lrb's strengths - pure sw rasterization means waay better load balancing.

This example is not really aligned with your overall argument.

To get the spirit of my argument, think of it this way. It is cheaper in terms of silicon and time-to-market to decode video in software, but more expensive in terms of power. And guess what, we now have dedicated video decode cores. Power is a bigger constraint today, so it makes sense to spend some area to save overall power. Does it make sense to have an x86 based DSP decode video and also make it run Excel to save a few milliseconds, or to let it specialize fully for video decoding? Yes, that video core takes much less area wrt these fusion chips, but the spirit of the argument remains.

You're just being naive now. Databases are actually a great example of stuff that has been shown to have the possibility of being greatly accelerated on throughput parts. But wait, it runs pretty well on CPUs too...
Fine, replace database with powerpoint.

For sure parts can't necessarily be totally generalist, but there's no way that being able to load balance between heterogeneous cores where it makes sense (to the scheduler) is a bad thing. Obviously we need a smart scheduler that understands the heterogeneity, but it's still a good step up from not even having the ability in the first place, and guaranteeing idle hardware, such as is the case right now.

You might be onto something, but with my limited foresight, I am skeptical that scheduling threads (to save a few ms at best) across cores that carry an OoM (2 OoMs by the time of Haswell+2?) performance penalty when the workload characteristics don't match the hw is in any way a win, especially wrt power.
 
Not true at all... there are plenty of things that can run decently well in both domains. Sometimes the tweakables change, sometimes the algorithms even change, but there's lots of opportunity for load balancing there already. Physics is a great example that has a good mix of stuff that runs pretty decently in both places. And let's remember, this is a question of running *something* vs *nothing* on idle parts of the machine... obviously the scheduler would first try to put things in the most optimal places.
i think we fundamentally differ here in our views. from my perspective, this 'something vs nothing' argument turns quickly into a non-factor due to the traditionally overwhelming performance discrepancy between the domain-optimized processors, on one side, and the CPUs on the other. the latter, by virtue of their universality, could interfere in all domains, but is that desirable? the moment real-time/run-time constraints step in, the very same moment you feel the urge to stick with the domain-optimized processors.

remember the times when the OGL core specs included things not commonly supported by the GPUs of the day? remember the OGL model of 'every feature that's in the core set must be satisfied by the ICD, even at the price of an on-the-CPU implementation'? how many times have devs been sifting for that single element of their draw jobs that would get silently delegated to the CPU by the ICD, effectively killing the fps in the process?

the simple reality is, domain-optimised processors give certain advantages that people often actively seek to utilize (e.g. real-time processing latencies for the typical job packet), by your 'but let's utilize that idling cpu there' you automatically give up those advantages, since the processing latencies in your heterogeneous compute farm become a wild factor. IOW, by leaving no transistor idling, you subject yourself to the latencies of the slowest transistors. which again, due to the traditionally huge performance discrepancies between the processor types, can be detrimental to your overall timing expectations. unless, of course, your scheduler is so darn intelligent that it could counter all those things (AI perhaps?)
 
from my perspective, this 'something vs nothing' argument turns quickly into a non-factor due to the traditionally overwhelming performance discrepancy between the domain-optimized processors, on one side, and the CPUs on the other. the latter, by virtue of their universality, could interfere in all domains, but is that desirable? the moment real-time/run-time constraints step in, the very same moment you feel the urge to stick with the domain-optimized processors.

I mostly agree.

And I believe the power wastage argument of running tasks on a processor "just because we can" should be brought up. If a latency sensitive task runs on a throughput processor, it might prevent that processor from switching to lower power modes, while not providing any tangible advantages.

Besides, because the scheduler is not able to determine when a particular task will finish, it might assign a pending one (inefficiently) to a throughput processor. Let's say the latency processor is "freed" quickly afterwards. What will the scheduler do? Copy the partial computations from the throughput cores back to a latency one? Restart the computation on the latency cores? Wait for the throughput cores to finish? All of these scenarios use a non-optimal amount of power (maybe to a lesser extent in the case of the first one).
 
There's definitely a continuum here, as neither endpoint makes sense. Having fully domain-specific cores has been proven to not make sense, even though it does use less power and is technically more "efficient". If you don't buy the unified shaders example, just look at what happened to dedicated physics hardware. I said it then when few shared the view, but physics just isn't different *enough* to justify giving up load balancing for dedicated hardware.

On the other end of the scale, texture sampling has proven itself to be something that *is* worth spending dedicated hardware on... hardware that indeed goes idle in the cases when it is not used.

In terms of scheduling, the schedules for these processors nowadays contain *tons* of small tasks... you're rarely going to completely tank performance by one task running long (unless of course your whole chip can only do one thing at once ;)). In fact in the GPU space this is absolutely critical, and it's the reason they buffer and schedule *several frames* ahead of work; it's necessary to keep large arrays of cores busy.

When these are a heterogeneous array of cores you just change the scheduling heuristics. Actually some schedulers can adjust for stuff like this automatically by watching the runtimes of various tasks on various compute resources.

Really though, just look at the example of physics. It runs well in both domains, and a pretty significant problem nowadays is that you effectively have to choose one place to run it - GPU or CPU - up front, and depending on the user's hardware configuration the choice can be a really good or a really bad one. In other cases the load varies enough from frame to frame that it justifies moving it around (physics in particular can actually be really spiky). A scheduler with the flexibility to move around even sub-parts of the physics tasks between different compute resources would be a big win here.
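
To make the 'watch the runtimes' idea above concrete, here's a very rough sketch (every name here is hypothetical - this is not any real runtime's API) of a scheduler that keeps a moving-average cost estimate per task class per device and dispatches each task to whichever device is expected to finish it first:

[code]
// Rough illustrative sketch of runtime-profiled dispatch between heterogeneous devices.
// TaskClass, Device and the EWMA constant are all made up for illustration.
#include <array>

enum Device    { CPU = 0, GPU = 1, NUM_DEVICES = 2 };
enum TaskClass { RIGID_BODY = 0, CLOTH = 1, NUM_CLASSES = 2 };

struct Scheduler {
    // Estimated cost (ms) of one task of each class on each device.
    // In practice these would be seeded from a quick calibration pass.
    std::array<std::array<double, NUM_DEVICES>, NUM_CLASSES> cost{};
    // Work (ms) already queued on each device.
    std::array<double, NUM_DEVICES> backlog{};

    Device pick(TaskClass c) const {
        // Choose the device expected to *finish* this task soonest:
        // current backlog plus the estimated cost of the task itself.
        double cpu_finish = backlog[CPU] + cost[c][CPU];
        double gpu_finish = backlog[GPU] + cost[c][GPU];
        return (gpu_finish < cpu_finish) ? GPU : CPU;
    }

    void on_dispatch(TaskClass c, Device d) { backlog[d] += cost[c][d]; }

    void on_complete(TaskClass c, Device d, double measured_ms) {
        backlog[d] -= cost[c][d];
        // Exponentially weighted moving average of observed runtimes,
        // so the estimates adapt as the workload (or the hardware) changes.
        cost[c][d] = 0.8 * cost[c][d] + 0.2 * measured_ms;
    }
};
[/code]

A real scheduler would also weigh in data-transfer cost and power state, but the point is just that the CPU-vs-GPU decision doesn't have to be hard-coded up front, per the physics example.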
 
please, explain to me, why i, as a potential LRB customer, should care about intel's entrenchment
I was primarily responding to those who believe x86 is the cause of Larrabee's delay. Replacing the x86 scalar cores with something using a leaner ISA would have given them 10% higher performance max. Do either NVIDIA or ATI blow off the release of a new GPU because they've only reached 90% of the performance they've hoped for? Don't think so.

They are facing a much more serious issue. And I strongly believe it's a software issue. I've seen my own software renderer reach 30% higher performance by changing only a few lines of code, several times in a row. But it took me months of research, experiments and analysis for each of those changes.

Intel chose x86 because it saves them a lot of time and money. They own the ISA, they own the core IP, and they own lots of powerful development tools. Even though they're facing a delay, any other ISA would have cost them a lot more time and money. On top of that, every PC developer has software that will compile for Larrabee with little or no issue. Don't underestimate how awesome it is to have something running on day zero and be able to incrementally improve performance. There are also many binary libraries that won't require any recompile. Yes, they won't run efficiently, but that's not always necessary. You can have some very useful routines that perform complex tasks but that you only need to call a couple of times.

So x86 allows them to create momentum much more quickly than anyone else. It still won't be easy, but the potential is huge and if they succeed it will be nearly impossible for the competition to do better with another ISA.
tell that to 3dlabs who are currently selling a potent mobile chip, using the same principles LRB was meant to demonstrate, and on a much older shrink node at that. while offering industry-standard APIs and all that jazz.
You mean the ZMS-08? It only supports OpenGL ES as far as I know. The choppy 3D demo with the motorcycle also couldn't convince me that it has competitive performance compared to a similar sized dedicated graphics chip.
no. ideally LRB should have shipped with the industry-standard APIs, and only optionally allowed the user to get down and dirty.
They'll certainly have to provide highly optimized implementations of every version of Direct3D and OpenGL, but it's also intended to let developers have direct access to the hardware. That's the whole point of using x86. They want people to develop just as much software for it as for their CPUs. NVIDIA is trying to achieve the same thing with CUDA, but is having a lot of trouble creating momentum because it's hard to migrate existing code and distribute binaries. There simply is no software ecosystem for CUDA yet, and it will take a very long time to create and expand it. x86's ecosystem is already massive.
developers don't have the resources to create machine-optimal code. at least, not most of the time. you can't blame developers for not sitting down and optimizing every piece of their code to be clock-by-clock optimal. it's just not viable.
They don't have to optimize every piece of code. 90% of execution time is spent in 10% of the code. The top hotspots are even smaller.
friendly jab: you really should pay more attention to the mobile/embedded market. you know, that multi-billion market where most of the EE progress is occurring these days and where intel are desperately trying to set their foot, rather unsuccessfully so far ; )
Last time I checked, netbooks with Atom processors were selling like hotcakes...
 
I was primarily responding to those who believe x86 is the cause of Larrabee's delay. Replacing the x86 scalar cores with something using a leaner ISA would have given them 10% higher performance max. Do either NVIDIA or ATI blow off the release of a new GPU because they've only reached 90% of the performance they've hoped for? Don't think so.

On the time-to-market issue, you are right. Intel will prolly find it easier/faster to make lrb by basing it off x86.

On the raw perf/W issue, x86 is clearly a luxury. And it is not at all clear that lrb can afford this.

On the sw development (by external devs) side, most of the code you speak of will need a recompile. And you never know when simple recompiles make porting tricky. My knowledge of CISC vs RISC wars is limited, but if recompiles were so simple, I imagine RISC vendors would not have had so much of a lack-of-sw problem.
 
There's definitely a continuum here, as neither endpoint makes sense. Having fully domain-specific cores has been proven to not make sense, even though it does use less power and is technically more "efficient". If you don't buy the unified shaders example, just look at what happened to dedicated physics hardware. I said it then when few shared the view, but physics just isn't different *enough* to justify giving up load balancing for dedicated hardware.

On the other end of the scale, texture sampling has proven itself to be something that *is* worth spending dedicated hardware on... hardware that indeed goes idle in the cases when it is not used.

In terms of scheduling, the schedules for these processors nowadays contain *tons* of small tasks... you're rarely going to completely tank performance by one task running long (unless of course your whole chip can only do one thing at once ;)). In fact in the GPU space this is absolutely critical, and it's the reason they buffer and schedule *several frames* ahead of work; it's necessary to keep large arrays of cores busy.

When these are a heterogeneous array of cores you just change the scheduling heuristics. Actually some schedulers can adjust for stuff like this automatically by watching the runtimes of various tasks on various compute resources.

Really though, just look at the example of physics. It runs well in both domains, and a pretty significant problem nowadays is that you effectively have to choose one place to run it - GPU or CPU - up front, and depending on the user's hardware configuration the choice can be a really good or a really bad one. In other cases the load varies enough from frame to frame that it justifies moving it around (physics in particular can actually be really spiky). A scheduler with the flexibility to move around even sub-parts of the physics tasks between different compute resources would be a big win here.

Are you hinting at some kind of runtime profiling by the OS, to schedule this stuff better? *Perhaps* physics makes sense to be scheduled across, but not many do.

Also, GPU workloads are context heavy. Suppose you have a block of 256 threads, 16 regs each, with 8KB of shared mem. That is 24KB of context. CPU threads are very cheap to reschedule: on x86-64 it is 16 * 8 bytes + 16 * 16 bytes = 384 bytes. The GPU workload has ~64x more context. Not to mention the amount of cache pollution it will cause in the already small caches of these gpu cores.
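
As a back-of-the-envelope check of those numbers (assuming 4-byte GPU registers and counting only GPR + XMM state on the x86-64 side - illustrative figures, not any particular part):

[code]
// Context-size arithmetic from the paragraph above.
#include <cstdio>

int main() {
    const int gpu_ctx = 256 * 16 * 4   // 256 threads x 16 regs x 4 bytes = 16KB
                      + 8 * 1024;      // + 8KB shared memory             = 24KB total
    const int cpu_ctx = 16 * 8         // 16 GPRs x 8 bytes
                      + 16 * 16;       // 16 XMM regs x 16 bytes          = 384 bytes
    std::printf("GPU block: %d bytes, x86-64 thread: %d bytes (~%dx)\n",
                gpu_ctx, cpu_ctx, gpu_ctx / cpu_ctx);   // prints ~64x
    return 0;
}
[/code]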

You should also consider the cache management instructions of the lrb cores. They assume a thread owns a core and lock down cache lines. What happens to them when you kick its thread around? Why do you think such instructions are not there in CPUs? Because, afaics, gpu threads are not meant to be rescheduled.

I should mention, though, that pre-emptive multitasking is on nvidia's overall roadmap and will come sooner rather than later - but pre-emption will likely be a method of last resort.
 
I was primarily responding to those who believe x86 is the cause of Larrabee's delay. Replacing the x86 scalar cores with something using a leaner ISA would have given them 10% higher performance max.
Perhaps it's not entirely because of x86. Perhaps it's because the Larrabee core is too much Pentium, of which the x86 overhead is only a part?

Larrabee not only has to compete against supposedly lean ISA of GPUs, but also against fixed function hardware. Consider all rasterisation specific stuff, out-of-order warp scheduler, gigantic register file etc. Lack of these in Larrabee = more work in software. Overhead of x86 is therefore amplified by architectural differences not related to the ISA itself.

I'd take your 10% and square it.

Intel chose x86 because it saves them a lot of time and money.
Do you mean that without x86 Larrabee would cost Intel even more than 3 billion and it would be delayed beyond 2011?

They own the ISA, they own the core IP, and they own lots of powerful development tools.
I don't think the reasons were technical at all. I see it more like this:

"Yay, we have manufacturing process advantage over everyone else. We can afford to sacrifice some % of transistors and Watts for our ancient ISA, and the ISA will win us political goals. x86 everywhere(*) FTW! One ISA to rule them all!"

(*) except Itanium (R).

Alternatively:

"We have tried non-x86 ISA already, and it ended up in double embarassment (Itanium and AMD-64). Whichever of you, punks, mentions ISA-word once again, is getting 40 lashes"
 
Are you hinting at some kind of runtime profiling by the OS, to schedule this stuff better? *Perhaps* physics makes sense to be scheduled across, but not many do.
...
I should mention, though, that pre-emptive multitasking is on nvidia's overall roadmap and will come sooner rather than later - but pre-emption will likely be a method of last resort.
I'm talking generally about scheduling - both cooperative, as is common in games, and also preemptive, which is still useful at a coarser granularity, even if you're not dedicating a full heavy-weight "thread" and all the associated context to each pixel you shade. In both of these cases it's useful for the scheduler to have as much flexibility as possible in assigning tasks to various cores.
 
On the time-to-market issue, you are right. Intel will prolly find it easier/faster to make lrb by basing it off x86.

On the raw perf/W issue, x86 is clearly a luxury. And it is not at all clear that lrb can afford this.
If they can't afford it then the whole project was doomed either way. This ~10% is nothing in comparison with the massive challenges to implement everything (except texture sampling) using generic programmable cores.

If they fail, x86 isn't to blame. Any other IHV with an other ISA would fail just the same. It would just mean the time isn't right yet for this approach. However, if they succeed the entire computing world is in their hands.
On the sw development (by external devs) side, most of the code you speak of will need a recompile. And you never know when simple recompiles make porting tricky. My knowledge of CISC vs RISC wars is limited, but if recompiles were so simple, I imagine RISC vendors would not have had so much of a lack-of-sw problem.
No. Most of the code will not need a recompile. The cores are Pentium compatible, and on average 90% of every project is not performance critical. The remaining 10% will require a recompile (actually a complete redesign) to achieve high performance, but it's far easier to do that incrementally than to rewrite an entire application from scratch using unfamiliar tools. Also, once some key libraries have been rewritten, a ton of applications can transition much more easily. I expect quite a few software developers are anxiously looking forward to creating and selling various pieces of software, ranging from compilers for alternate languages, to optimized libraries, to full-blown applications. It's a new market that could turn out pretty big.
 
Larrabee not only has to compete against supposedly lean ISA of GPUs, but also against fixed function hardware. Consider all rasterisation specific stuff, out-of-order warp scheduler, gigantic register file etc. Lack of these in Larrabee = more work in software. Overhead of x86 is therefore amplified by architectural differences not related to the ISA itself.
It's not amplified by it, it's just additional overhead (largely compensated by higher utilization).

The real work goes on in the vector units anyway. The dual-issue x86 parts are relatively powerful so they shouldn't be a bottleneck. The real overhead of x86 is the extra area, which can no longer be used for additional cores. But we're only talking about a couple of cores. A small price for all the software benefits x86 offers.
Do you mean that without x86 Larrabee would cost Intel even more than 3 billion and it would be delayed beyond 2011?
Definitely. I don't see too many competitors jumping ahead and using fully generic cores with almost no fixed-function units. The investments are very high, and it takes very long to create a software base. Yet the one to succeed at doing that first, is likely to dominate the future of computing. x86 offers the best chances because it already has a huge software ecosystem to spark off from.
 
The real work goes on in the vector units anyway. The dual-issue x86 parts are relatively powerful so they shouldn't be a bottleneck. The real overhead of x86 is the extra area, which can no longer be used for additional cores. But we're only talking about a couple of cores. A small price for all the software benefits x86 offers.
The overhead for the given vector width might be small enough ... what if they could manage a narrower vector width with a different ISA though? (Ignoring for a moment that their method for coherency scales really poorly, let's assume that was improved too.)
 
I was primarily responding to those who believe x86 is the cause of Larrabee's delay.
i, for one, believe that choosing a hapless ISA has not helped intel to deliver this part. nothing more than that.

Replacing the x86 scalar cores with something using a leaner ISA would have given them 10% higher performance max. Do either NVIDIA or ATI blow off the release of a new GPU because they've only reached 90% of the performance they've hoped for? Don't think so.
not the performance per se. abysmal power efficiency (ie. performance/wattage) is what most likely killed this project (or 'delayed' it, as you put it). and yes, the poor choice of an ISA in a seriously-parallel GP part can have a devastating effect on the power efficiency, crucially more so on the turf of specialized parts - don't forget it's not a CISC-vs-RISC case we're having here, it's CISC-vs-specialized. and nobody's bringing up the question of LRB's performance - i don't think a sane person expected LRB to take the GPU crown by storm - sane people expected it to be just in the ballpark, thus a 10% performance drop is not worth discussing. 10% of the projected wattage, OTOH, is yet another nail in the coffin for a GP part trying to fight specialized parts - ie. a GP part that is already multiple times worse than the specialized competition.

They are facing a much more serious issue. And I strongly believe it's a software issue. I've seen my own software renderer reach 30% higher performance by changing only a few lines of code, several times in a row. But it took me months of research, experiments and analysis for each of those changes.
i strongly doubt it would've taken intel months to spot a sw performance anomaly in their own chip. IMO, performance has likely been relatively on par with the projections.

Intel chose x86 because it saves them a lot of time and money.
ok, screw the dumb seaplane analogies: how does owning something that does not fit a job save you time and money toward that job?

They own the ISA, they own the core IP, and they own lots of powerful development tools.
...where the majority of those tools were developed in the course of the LRB project? unless you'd argue intel meant to use LRB as a p5 farm.

Even though they're facing a delay, any other ISA would have cost them a lot more time and money. On top of that, every PC developer has software that will compile for Larrabee with little or no issue. Don't underestimate how awesome it is to have something running on day zero and be able to incrementally improve performance.
i really doubt the bolded part ('hey! my C2D-targeting code runs faster on a p5! woot!'). unless you meant 3d API code, in which case i don't see the basis for the improvement vs a GPU.

There are also many binary libraries that won't require any recompile. Yes, they won't run efficiently, but that's not always necessary. You can have some very useful routines that perform complex tasks but that you only need to call a couple of times.
you totally lost me here. what x86 binary libraries you would like to run on a LRB, and why would you want to run them there, and not on your dual/quad-core monster CPU that was designed to eat old x86 code for breakfast?

So x86 allows them to create momentum much more quickly than anyone else. It still won't be easy, but the potential is huge and if they succeed it will be nearly impossible for the competition to do better with another ISA.
again, everything you've argued for so far has been based on the premise that x86 fits the task nicely. but we don't agree there. and lo and behold, the 'competition' has material, currently-sold-on-the-market parts - how's that for doing better?

You mean the ZMS-08? It only supports OpenGL ES as far as I know. The choppy 3D demo with the motorcycle also couldn't convince me that it has competitive performance compared to a similar sized dedicated graphics chip.
zms-08 tapes out Q1/Q2 '10 (no LRB will be on the market by then). the 'choppy motorcycle' demo is some pre-production code on zms-05 - final code reportedly runs better. and i wonder how choppy LRB's first GPU-originated demos would have been, alas that's something we have no means to know. FYI, zms-08 should decimate zms-05 judging from the paper specs (i promise to report on that once i get my hands on it). also, the part offers much more than GL ES - all in the form of APIs (Creative are actually opposed to the idea of giving direct access to the GPGPU part to application programmers, at least not until they have a sound compiler targeting that).

They'll certainly have to provide highly optimized implementations of every version of Direct3D and OpenGL, but it's also intended to let developers have direct access to the hardware. That's the whole point of using x86. They want people to develop just as much software for it as for their CPUs. NVIDIA is trying to achieve the same thing with CUDA, but is having a lot of trouble creating momentum because it's hard to migrate existing code and distribute binaries. There simply is no software ecosystem for CUDA yet, and it will take a very long time to create and expand it. x86's ecosystem is already massive.
cool. and how big is the LRBni ecosystem?

They don't have to optimize every piece of code. 90% of execution time is spent in 10% of the code. The top hotspots are even smaller.
sorry, i meant to say 'every piece of code that matters' - clearly optimising every last op would be nonsensical.

Last time I checked, netbooks with Atom processors were selling like hotcakes...
last time i checked, the market i was speaking about spread a tad beyond netbooks. basically, it works like this - the moment desktop windows becomes a non-factor, that same moment atom develops weak knees (or simply goes the way of desktop windows).
 