So now the argument is that it is easy to add LRBni to AVX, so there won't be a 22nm Larrabee?
It doesn't make sense to have x86 cores with different features. It looks like they plan to add LRBni type instructions to AVX though.
AVX is specified to support register widths up to 1024 bits. So they could relatively easily execute 1024-bit vector operations on the existing 256-bit execution units, at a throughput of one instruction per four cycles. The obvious benefit of this is power efficiency. Then all that's left to add is gather/scatter support, and the IGP can be eliminated, leaving a fully generic architecture that offers both low latency and high throughput. Larrabee in your CPU socket, without compromises.
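As a rough illustration of what that execution model would look like (this is just a sketch; the 1024-bit type and the four-pass sequencing are my own assumptions, not anything Intel has specified):

```c
#include <immintrin.h>

/* Hypothetical 1024-bit vector, represented here as four 256-bit quarters.
 * A single AVX-1024 add could be sequenced over today's 256-bit units in
 * four passes: one instruction issued, four cycles of throughput. */
typedef struct { __m256 q[4]; } v1024;

static inline v1024 add_ps_1024(v1024 a, v1024 b) {
    v1024 r;
    for (int pass = 0; pass < 4; ++pass)
        r.q[pass] = _mm256_add_ps(a.q[pass], b.q[pass]);
    return r;
}
```
The power argument is that front-end and scheduling work is amortized over four times as much data per instruction, without widening the execution units.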
> Well, Larrabee without compromises wouldn't be a Larrabee. x86 as a platform for GPGPU was a compromise to begin with.
What I meant was, no compromises for legacy CPU workloads; retaining high single-threaded (scalar) performance.
> In the picture you painted, Larrabee's original compromise is compromised even more. Now all you have is 2-way hyperthreading and a huge OoO engine designed for single-threaded workloads.
2-way SMT is not set in stone. But yes, leveraging AVX is likely a compromise compared to Larrabee. That's probably fine though. Larrabee lost the fight against the competition's high-end GPUs, but in the low-end market things aren't that critical, and you get the entire chip to do the job instead of just the tiny area where the IGP resides. It's limited by bandwidth anyway.
> So now the argument is that it is easy to add LRBni to AVX, so there won't be a 22nm Larrabee?
You tell me. I'm just exploring the possibilities.
> If they plan to add LRBni to AVX, then AFAIK they haven't indicated that at all, at least in public.
The AVX spec already reserves the encoding bits to extend it up to 1024-bit registers, and FMA is already on the roadmap. So that mainly leaves gather/scatter to get a nearly equivalent instruction set. Intel engineers have already admitted to exploring the possibility of implementing gather/scatter.
> I'm not sure if it would be possible to add LRBni to the already very complicated ISA.
I didn't mean LRBni itself (which wouldn't even be possible due to encoding collisions). But AVX isn't lacking a whole lot to make the CPU much better at throughput computing.
> As much as I would like it to be there, that is wishful thinking. Even if it were there, you would have maybe 4 cores, perhaps 8 at 22 nm.
8 is perfectly feasible (note that on 32 nm Sandy Bridge the IGP takes up the area of two more cores). Also, each of these cores can have multiple AVX execution units with FMA support. That adds up to 1 TFLOP. Yes, it's wishful thinking, but it's not out of reach.
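For reference, here is one way the 1 TFLOP figure can be reached. The two-FMA-units-per-core and 4 GHz numbers are my assumptions for the sake of the arithmetic, not a product spec:

```c
/* Back-of-the-envelope peak-FLOPS estimate for a hypothetical 8-core CPU
 * with two 256-bit FMA units per core (assumed numbers). */
#include <stdio.h>

int main(void) {
    const int    cores         = 8;
    const int    fma_units     = 2;    /* assumed 256-bit FMA units per core   */
    const int    floats_per_op = 8;    /* 256-bit vector = 8 single-precision  */
    const int    flops_per_fma = 2;    /* multiply + add                       */
    const double clock_ghz     = 4.0;  /* assumed clock frequency              */

    double gflops = cores * fma_units * floats_per_op * flops_per_fma * clock_ghz;
    printf("Peak: %.0f GFLOPS\n", gflops);  /* 8 * 2 * 8 * 2 * 4 = 1024 GFLOPS */
    return 0;
}
```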
> And what about texture mapping? Doing this in software would be a waste of cycles. Using all of this hypothetical CPU for rendering, you would still have a hard time matching the current integrated GPU.
Nonsense. SwiftShader is not that far behind the IGP, and it's not even using AVX. So add in AVX, FMA, and gather/scatter, and it would easily beat it. Power consumption can be lowered by using 1024-bit instructions executed on 256-bit units.
> But AVX isn't lacking a whole lot to make the CPU much better at throughput computing.
No, it just lacks scatter/gather. Arguably the most important part of a vector ISA.
> You tell me. I'm just exploring the possibilities. It makes sense to me to try both ways to enter the graphics market and offer unique advantages. Maybe 22 nm is indeed too soon for a discrete chip; my main point was they appear to be almost two generations ahead now, and that's not going to change soon.
The two-generation lead is more evident in low-voltage applications. FinFET's advantages are sizable, but not as commanding in the ranges that GPUs and CPUs tend to operate in.
> In any case, graphics hardware becomes ever more generic, and Intel is in a unique position to dominate the market once things converge to the point where everything is fully programmable.
The process gap is a strong reason why competitors will resist going to fully generic architectures. Once the hardware is essentially nondescript, it becomes a process contest. The lessons from trying to beat Intel on a process basis have already been learned.
> Which IGP are you comparing SwiftShader with? Nehalem/Westmere/Sandy Bridge?
HD Graphics 3000 (Sandy Bridge).
> Besides, even if SwiftShader could match the IGP in performance, it sure as hell would lose in perf/W.
It would not just match it, but beat it. AVX and FMA would increase the GFLOPS by a factor of four, while replacing the IGP with CPU cores increases performance by another 50%. Also don't underestimate the power of gather/scatter. It could speed up texel fetching by up to 18x, and it is also very useful in other graphics pipeline stages (vertex attribute fetch, primitive assembly, rasterization, table lookups for transcendentals, etc.).
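To make the texel-fetch point concrete, here is a minimal sketch of the addressing pattern involved. The scalar loop is what a software renderer has to do lane by lane today; the gather intrinsic shown for comparison is the AVX2-style one that appeared later, used purely to illustrate what a hardware gather replaces (the linear texture layout and the names are invented for the example):

```c
#include <immintrin.h>

/* Fetch 8 texels whose offsets were computed per SIMD lane.
 * 'texture' is a simple linear array of floats; 'offsets' holds 8 indices. */

/* Without gather: the index vector must be taken apart lane by lane. */
static void fetch_texels_scalar(const float *texture, const int offsets[8], float out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = texture[offsets[i]];   /* 8 separate scalar loads */
}

/* With a hardware gather (AVX2-style intrinsic, shown for illustration):
 * one instruction loads all 8 texels addressed by the index vector. */
static __m256 fetch_texels_gather(const float *texture, __m256i offsets) {
    return _mm256_i32gather_ps(texture, offsets, 4);  /* scale = sizeof(float) */
}
```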
> And the matching-the-IGP argument assumes that the IGP remains static, contrary to the existing trend of an increasing IGP area budget and a slowly decreasing die area devoted to CPU cores + caches. So I would expect the IGP to pull ahead in the near future, given the expected rate of CPU throughput growth and the expected rate of IGP area growth.
That's doubtful, since first of all the IGP is limited by bandwidth. Bandwidth is increasing more slowly than transistor counts, so sooner or later any hardware will be fully programmable, just because it can be. In particular, low-latency main memory won't allow any rapid growth in IGP performance, and nobody's going to pay extra for a more expensive motherboard just to give the IGP extra memory channels (this hasn't happened in the past decade of chipset IGPs either).
> Let's try to be objective here.
> From this link: http://nl.hardware.info/reviews/194...0k-i5-2300-sandy-bridge-review-gpu-benchmarks
> Intel HD Graphics 3000 / Core i7 2600K: 3DMark06 score 4225.
> From this link: http://transgaming.com/business/swiftshader
> "a modern quad-core Core i7 CPU at 3.2 GHz running SwiftShader scores 620 in 3DMark06".
> This indicates SwiftShader is 7 times slower on a recent Sandy Bridge, compared to the integrated GPU.
That's hardly objective. First of all, it achieves a 3DMark06 score of 820 on an i7-2600, so that's 5 times slower, not 7 times. Furthermore, benchmarks with Crysis show that x86-64 is 32% faster than x86-32. Now we're down to a factor of 4. And that's still before making use of AVX, let alone FMA and gather/scatter.
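Just to make the arithmetic explicit (the 820 score and the 32% Crysis figure are the numbers claimed above, taken at face value):

```c
/* Ratio arithmetic behind the "factor 4" claim above. */
#include <stdio.h>

int main(void) {
    const double igp_3dmark06 = 4225.0;  /* HD Graphics 3000 score             */
    const double sws_3dmark06 = 820.0;   /* claimed SwiftShader score (x86-32) */
    const double x64_speedup  = 1.32;    /* claimed x86-64 gain in Crysis      */

    double gap_32bit = igp_3dmark06 / sws_3dmark06;   /* ~5.15x */
    double gap_64bit = gap_32bit / x64_speedup;       /* ~3.9x  */
    printf("Gap (x86-32): %.2fx, estimated gap (x86-64): %.2fx\n",
           gap_32bit, gap_64bit);
    return 0;
}
```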
> There are likely additional changes to the architecture at 22 nm. This seems evident even from the core count. 50+ falls very short of the idealized doubling of cores with Moore's Law. Since Larrabee had 32 at 45 nm, the cores are potentially bulkier and the die size probably trimmed from the Itanium-scale bloat of the original chip.
It's impossible to tell without at least knowing the die size. Given that Knights Corner should make up for the investment into Larrabee, my expectation is that they want a high-yield product. They may also have borrowed a thing or two from the Atom architecture to make it more power efficient while achieving higher frequencies.
> The process gap is a strong reason why competitors will resist going to fully generic architectures. Once the hardware is essentially nondescript, it becomes a process contest.
Not entirely. First of all, I don't think these competitors have a choice. They can't "resist" implementing new APIs like, for instance, OpenCL. If it gains importance, they'll have to make their hardware more generic. And graphics is absolutely not at a standstill either.
> Their best opportunity is to allow Intel to exhaust its process advantage by spending its good transistors getting a non-optimal design past its overhead.
I really don't think x86 will exhaust Intel's process advantage. The RISC versus CISC battle didn't really result in a winner. It's a very complex interaction of factors that makes one architecture more successful than another. And if there's one thing x86 has proven, it's that there are no limits to its instruction set nor to its execution model.
> Personally I think one of the most critical questions is how the hardware will help the software orchestrate the execution of thousands of tasks. A core scaling efficiency of 99% or 98% makes a massive difference when you have hundreds of cores, making the ISA choice much less critical. The memory model is very important, but it's not clear yet that x86 is at a disadvantage there.
x86 is not at a disadvantage per se... but the x86 programmer mindset of fine-grained, low-overhead coherency as the solution to all communication and caching problems does force you into the Larrabee model.
> It's impossible to tell without at least knowing the die size. Given that Knights Corner should make up for the investment into Larrabee, my expectation is that they want a high-yield product. They may also have borrowed a thing or two from the Atom architecture to make it more power efficient while achieving higher frequencies.
Going by the descriptions of the original implementation, Larrabee could have borrowed from any number of architectures made in the last 8 years to be more power efficient and achieve higher clocks. Let's hope it doesn't borrow Atom's transistor density.
> It's also worth noting that GPUs don't achieve the idealized doubling of cores either. For a while they appeared to exceed Moore's Law, while in fact they really just dedicated an ever larger percentage of die size to shading cores. This core explosion has come to a halt and they're now fully at the mercy of process technology.
The core count scaling was a Larrabee supporter talking point. Larrabee is a full node behind what some proponents predicted, and that is with an alleged two-node lead. The extra shrink allows for a more modest die size and still leaves room to make changes.
> Not entirely. First of all, I don't think these competitors have a choice. They can't "resist" implementing new APIs like, for instance, OpenCL. If it gains importance, they'll have to make their hardware more generic. And graphics is absolutely not at a standstill either.
They help define the APIs and influence the directions they take. They will fight to bend the transition in their favor however they can, until they can be certain that a competitor like Intel is not at a massive advantage. For now at least, they also have the advantage that their specialized products are the only ones with any history of not embarrassing themselves in graphics.
> I really don't think x86 will exhaust Intel's process advantage. The RISC versus CISC battle didn't really result in a winner. It's a very complex interaction of factors that makes one architecture more successful than another. And if there's one thing x86 has proven, it's that there are no limits to its instruction set nor to its execution model.
I've already made my argument that x86 could have contributed about 10-20% overhead, but while this is a sizeable deficit, I do not think it is where the bulk of the disadvantage comes from.
> Personally I think one of the most critical questions is how the hardware will help the software orchestrate the execution of thousands of tasks. A core scaling efficiency of 99% or 98% makes a massive difference when you have hundreds of cores, making the ISA choice much less critical. The memory model is very important, but it's not clear yet that x86 is at a disadvantage there.
Intel did not simulate efficiencies that high. I forget where the line dropped below 98%, but it definitely did in the range of 64 cores.
> So the competition really shouldn't rely on Intel to exhaust its process advantage. Intel has lots of experience with multi-CPU, multi-core, many-core, and OS interaction. On the other hand, this also means they have every chance to make the right design decisions, gain market share, and eventually reduce or even close the process gap.
It's worked for at least two Larrabee generations. Past performance is no predictor of future success, but it does give a nice hint.
> It would not just match it, but beat it. AVX and FMA would increase the GFLOPS by a factor of four, while replacing the IGP with CPU cores increases performance by another 50%. Also don't underestimate the power of gather/scatter. It could speed up texel fetching by up to 18x, and it is also very useful in other graphics pipeline stages (vertex attribute fetch, primitive assembly, rasterization, table lookups for transcendentals, etc.).
Add lots of hardware that very few apps use, and compare with an IGP denied equivalent progress and engineering effort. That way even I can make any architecture smack anything else out there.
> Even if it still loses at performance/Watt, that's not a terrible thing as long as the absolute power consumption is at an acceptable level (and at Intel's 22 nm process it should be).
It flies in the face of EVERY CPU and EVERY GPU vendor's direction. So much so that I honestly don't know what to say, except that LRB1 missed its clock target by ~40%, very likely due to excessive power consumption.
> Consumers care a lot about features, and a mainstream CPU with all of the above would enable a whole new era of complex high-performance computing. The possibilities are only limited by the developer's imagination.
Consumers care about apps, not features.
> (this hasn't happened in the past decade of chipset IGPs either)
Intel's garbage in the name of IGPs of the past is not evidence of anything either way. And the IGPs of the future have been speculated to have stacked DRAM on package to increase bandwidth.
> Someone who cares little about gaming, who previously picked a system with a good CPU but the cheapest possible IGP which offers adequate graphics, won't shell out to buy an APU with a faster IGP.
Are the consumers who don't care for gaming somehow planning to cut themselves off from the next-gen GPU-powered HTML5 websites? Or have they stopped caring about the battery life of their laptops/tablets as well?
> So it's not likely that CPU manufacturers will invest more area into the IGP both for features and performance beyond Moore's Law, unless of course as a more expensive part in higher market segments (i.e. it won't be a trend within the same market segment).
This is again devoid of any reality. You need only look at the progress of the worst offender's IGPs over the last three years, let alone the rest of the vendors'.
So while "objectively" you can indeed find applications which run much slower on the CPU than the IGP, I think a subjective comparison is much more meaningful to see where things are going.
> Add lots of hardware that very few apps use, and compare with an IGP denied equivalent progress and engineering effort. That way even I can make any architecture smack anything else out there.
It's not a lot of hardware at all. Like I said, AVX already reserves the encoding bits to extend it to 1024-bit operations, FMA instructions are already specified, and gather/scatter requires little more than two 512-bit to 128-bit shuffle networks. Yet these minor things would make a major difference in SIMD efficiency (both effective performance and power consumption).
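For a sense of what that hardware would replace: without a scatter instruction, a vectorized loop has to unpack the data and index vectors and issue one scalar store per lane, along the lines of this sketch (plain C with AVX, names invented for the example):

```c
#include <immintrin.h>

/* Emulated 8-lane scatter: store each element of 'values' to table[idx[i]].
 * This lane-by-lane unpacking is what dedicated scatter hardware would
 * collapse into a single instruction. */
static void scatter_emulated(float *table, __m256i idx, __m256 values) {
    int   i32[8];
    float f32[8];
    _mm256_storeu_si256((__m256i *)i32, idx);   /* spill indices to memory  */
    _mm256_storeu_ps(f32, values);              /* spill values to memory   */
    for (int lane = 0; lane < 8; ++lane)
        table[i32[lane]] = f32[lane];           /* 8 separate scalar stores */
}
```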
> If you must compare the IGP with a hypothetical CPU core of your choice, you should compare it with an IGP expected in that time frame. Besides, you should not forget that the IGP area budget is increasing, but CPU core counts are not going to scale beyond 4 in the consumer market in the foreseeable future.
There's no evidence of that. The latest games already specify quad-core CPUs in their recommended system requirements, and AMD will soon launch a highly anticipated 8-core CPU. The software world is slow to adopt multi-core programming techniques, but it's really a one-time investment. Once you have a scalable software architecture, more cores give you a direct benefit. Even NVIDIA's Kal-El processor is betting on 4 cores becoming the norm soon. It would be foolish to think that once the majority of software is making use of 4 cores, it's not going to evolve beyond that.
> Besides, the IGP is clocked ~3x lower and has ~4x less area than the cores in SB. Assuming the IGP and SwiftShader are close enough, and contrary to your own admission w.r.t. 3DMark06, that's not a near miss. That's a Mount Everest of power efficiency to climb.
That IGP is helpless on its own. So you have to take into account the power consumption of the API and driver layers running on the CPU, as well as the power consumption of the L3 cache and memory controller.
> It flies in the face of EVERY CPU and EVERY GPU vendor's direction. So much so that I honestly don't know what to say, except that LRB1 missed its clock target by ~40%, very likely due to excessive power consumption.
That's an entirely different situation. LRB1 was supposed to compete in the high-end market; missing its target by 40% was completely unforgivable. If instead it takes 70 W for a system to achieve the same legacy graphics performance as a system with an IGP achieves at 50 W, that's not nearly as disastrous. Power consumption is a limiting factor in the high end, but not so much in the low end. Price and features are at least as important for commercial success.
> Consumers care about apps, not features.
New features lead to new apps.
> Where are the consumer apps that scale with cores and vector width? Multi-core has been around for years now, and vector ISAs for more than a decade. No, games aren't the answer, as they scale much more with GPU area than CPU area.
SIMD is used a lot in drivers and low-level libraries. You wouldn't get the same desktop/laptop/netbook experience without it.
> Intel's garbage in the name of IGPs of the past is not evidence of anything either way.
They're evidence that a large number of people care more about CPU performance than GPU performance.
> And the IGPs of the future have been speculated to have stacked DRAM on package to increase bandwidth.
These speculations talk of 1 GB of low-power, high-latency, high-bandwidth memory. It would clearly also drive up costs and lower yields. There are a few too many challenges here to make this realistic.
> Are the consumers who don't care for gaming somehow planning to cut themselves off from the next-gen GPU-powered HTML5 websites? Or have they stopped caring about the battery life of their laptops/tablets as well?
I said cares little, not doesn't care. Consumers who are not hardcore gamers don't shell out extra for a powerful GPU. However, they do expect everything to Just Work. And since they don't upgrade often, future-proof features can be more valuable than optimal performance and power consumption for legacy applications.
> Where are they going? The IGP apologists have picked up their game and are devoting an increasingly larger share of their precioussss leading-edge fabs to IGPs at the expense of CPU cores. CPU core counts have stalled while GPU core counts continue to scale.
You're going to have to show us some proof of that. It's very early days, since single-die CPU+IGP chips have only just appeared. So far I've only seen evolutionary progress, while quad-core and wider vectors are entering the mainstream CPU market.
> Every research publication in graphics these days is devoted to getting more and more irregular algorithms to scale on GPUs, with increasingly impressive results, while scatter/gather and multiple vector units remain vaporware, even on roadmaps.
Don't be silly, this entire site is about speculation, wishful thinking, and vaporware. So let's try not to take either opinion too seriously. You're hoping Intel and AMD will move heaven and earth to increase IGP performance, while I'm hoping they'll improve the CPU's throughput efficiency. Looking at the cost/gain balance of each, the latter seems like the better deal. And just because you read a lot of research publications about graphics doesn't mean generic computing performance is getting any less attention!
> GPUs are even more essential to consumers today than, say, two years ago, as they have invaded the UI, YouTube, Flash, and HTML5.
3D Flash will feature SwiftShader support, so just how "essential" the GPU will be for this kind of new technology remains to be seen. It's not like they require a lot of performance, and meanwhile CPUs are getting faster too.
> And last but not least, by and large CPU design remains beholden to serial performance.
Absolutely, but neither gather/scatter nor AVX-1024 has to interfere with that. On the contrary: gather/scatter allows more loops to execute their iterations in parallel, and AVX-1024 would enable high throughput without excessive power consumption.