So now the argument is that it is easy to add LRBni to AVX, so there won't be a 22nm Larrabee?
It doesn't make sense to have x86 cores with different features. It looks like they plan to add LRBni type instructions to AVX though.
AVX is specified to support register widths up to 1024 bits. So they could relatively easily execute 1024-bit vector operations on the existing 256-bit execution units, at a throughput of one instruction per four cycles. The obvious benefit of this is power efficiency. Then all that's left to add is gather/scatter support, and the IGP can be eliminated, leaving a fully generic architecture that offers both low latency and high throughput. Larrabee in your CPU socket, without compromises.
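As a rough illustration of what that execution model would look like (this is just a sketch; the 1024-bit type and the four-pass sequencing are my own assumptions, not anything Intel has specified):

```c
#include <immintrin.h>

/* Hypothetical 1024-bit vector, represented here as four 256-bit quarters.
 * A single AVX-1024 add could be sequenced over today's 256-bit units in
 * four passes: one instruction issued, four cycles of throughput. */
typedef struct { __m256 q[4]; } v1024;

static inline v1024 add_ps_1024(v1024 a, v1024 b) {
    v1024 r;
    for (int pass = 0; pass < 4; ++pass)
        r.q[pass] = _mm256_add_ps(a.q[pass], b.q[pass]);
    return r;
}
```
The power argument is that front-end and scheduling work is amortized over four times as much data per instruction, without widening the execution units.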
> Well, Larrabee without compromises wouldn't be a Larrabee. x86 as a platform for GPGPU was a compromise to begin with.
What I meant was, no compromises for legacy CPU workloads; retaining high single-threaded (scalar) performance.
> In the picture you painted, Larrabee's original compromise is compromised even more. Now all you have is 2-way hyperthreading and a huge OoO engine designed for single-threaded workloads.
2-way SMT is not set in stone. But yes, leveraging AVX is likely a compromise compared to Larrabee. That's probably fine though. Larrabee lost the fight against the competition's high-end GPUs, but in the low-end market things aren't that critical, and you get the entire chip to do the job instead of just the tiny area where the IGP resides. It's limited by bandwidth anyway.
> So now the argument is that it is easy to add LRBni to AVX, so there won't be a 22nm Larrabee?
You tell me. I'm just exploring the possibilities.
> If they plan to add LRBni to AVX, then AFAIK they haven't indicated that at all, at least in public.
The AVX spec already reserves the encoding bits to extend it up to 1024-bit registers, and FMA is already on the roadmap. So that mainly leaves gather/scatter to get a nearly equivalent instruction set. Intel engineers have already admitted to exploring the possibility of implementing gather/scatter.
> I'm not sure if it would be possible to add LRBni to the already very complicated ISA.
I didn't mean LRBni itself (which wouldn't even be possible due to encoding collisions). But AVX isn't lacking a whole lot to make the CPU much better at throughput computing.
> As much as I would like it to be there, that is wishful thinking. Even if it were there, you would have maybe 4 cores, perhaps 8 at 22 nm.
8 is perfectly feasible (note that on 32 nm Sandy Bridge the IGP takes up the area of two more cores). Also, each of these cores can have multiple AVX execution units with FMA support. That adds up to 1 TFLOP. Yes, it's wishful thinking, but it's not out of reach.
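For reference, here is one way the 1 TFLOP figure can be reached. The two-FMA-units-per-core and 4 GHz numbers are my assumptions for the sake of the arithmetic, not a product spec:

```c
/* Back-of-the-envelope peak-FLOPS estimate for a hypothetical 8-core CPU
 * with two 256-bit FMA units per core (assumed numbers). */
#include <stdio.h>

int main(void) {
    const int    cores         = 8;
    const int    fma_units     = 2;    /* assumed 256-bit FMA units per core   */
    const int    floats_per_op = 8;    /* 256-bit vector = 8 single-precision  */
    const int    flops_per_fma = 2;    /* multiply + add                       */
    const double clock_ghz     = 4.0;  /* assumed clock frequency              */

    double gflops = cores * fma_units * floats_per_op * flops_per_fma * clock_ghz;
    printf("Peak: %.0f GFLOPS\n", gflops);  /* 8 * 2 * 8 * 2 * 4 = 1024 GFLOPS */
    return 0;
}
```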
> And what about texture mapping? Doing this in software would be a waste of cycles. Using all of this hypothetical CPU for rendering, you would still have a hard time matching the current integrated GPU.
Nonsense. SwiftShader is not that far behind the IGP, and it's not even using AVX. So add in AVX, FMA, and gather/scatter, and it would easily beat it. Power consumption can be lowered by using 1024-bit instructions executed on 256-bit units.
> But AVX isn't lacking a whole lot to make the CPU much better at throughput computing.
No, it just lacks scatter/gather. Arguably the most important part of a vector ISA.
> You tell me. I'm just exploring the possibilities. It makes sense to me to try both ways to enter the graphics market and offer unique advantages. Maybe 22 nm is indeed too soon for a discrete chip; my main point was they appear to be almost two generations ahead now, and that's not going to change soon.
The two-generation lead is more evident in low-voltage applications. FinFET's advantages are sizable, but not as commanding in the ranges that GPUs and CPUs tend to operate in.
> In any case, graphics hardware becomes ever more generic, and Intel is in a unique position to dominate the market once things converge to the point where everything is fully programmable.
The process gap is a strong reason why competitors will resist going to fully generic architectures. Once the hardware is essentially nondescript, it becomes a process contest. The lessons from trying to beat Intel on a process basis have already been learned.
> Which IGP are you comparing SwiftShader with? Nehalem/Westmere/Sandy Bridge?
HD Graphics 3000 (Sandy Bridge).
> Besides, even if SwiftShader could match the IGP in performance, it sure as hell would lose in perf/W.
It would not just match it, but beat it. AVX and FMA would increase the GFLOPS by a factor of four, while replacing the IGP with CPU cores increases performance by another 50%. Also don't underestimate the power of gather/scatter. It could speed up texel fetching by up to 18x, and it is also very useful in other graphics pipeline stages (vertex attribute fetch, primitive assembly, rasterization, table lookups for transcendentals, etc.).
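To make the texel-fetch point concrete, here is a minimal sketch of the addressing pattern involved. The scalar loop is what a software renderer has to do lane by lane today; the gather intrinsic shown for comparison is the AVX2-style one that appeared later, used purely to illustrate what a hardware gather replaces (the linear texture layout and the names are invented for the example):

```c
#include <immintrin.h>

/* Fetch 8 texels whose offsets were computed per SIMD lane.
 * 'texture' is a simple linear array of floats; 'offsets' holds 8 indices. */

/* Without gather: the index vector must be taken apart lane by lane. */
static void fetch_texels_scalar(const float *texture, const int offsets[8], float out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = texture[offsets[i]];   /* 8 separate scalar loads */
}

/* With a hardware gather (AVX2-style intrinsic, shown for illustration):
 * one instruction loads all 8 texels addressed by the index vector. */
static __m256 fetch_texels_gather(const float *texture, __m256i offsets) {
    return _mm256_i32gather_ps(texture, offsets, 4);  /* scale = sizeof(float) */
}
```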
> And the matching-the-IGP argument assumes that the IGP remains static, contrary to the existing trend of an increasing IGP area budget and a slowly decreasing die area devoted to CPU cores + caches. So I would expect the IGP to pull ahead in the near future, given the expected rate of CPU throughput growth and the expected rate of IGP area growth.
That's doubtful, since first of all the IGP is limited by bandwidth. Bandwidth is increasing more slowly than transistor counts, so sooner or later any hardware will be fully programmable, just because it can be. In particular, low-latency main memory won't allow any rapid growth in IGP performance, and nobody's going to pay extra for a more expensive motherboard just to give the IGP extra memory channels (this hasn't happened in the past decade of chipset IGPs either).
> Let's try to be objective here.
> From this link: http://nl.hardware.info/reviews/194...0k-i5-2300-sandy-bridge-review-gpu-benchmarks
> Intel HD Graphics 3000 / Core i7 2600K: 3DMark06 score 4225.
> From this link: http://transgaming.com/business/swiftshader
> "a modern quad-core Core i7 CPU at 3.2 GHz running SwiftShader scores 620 in 3DMark06".
> This indicates SwiftShader is 7 times slower on a recent Sandy Bridge, compared to the integrated GPU.
That's hardly objective. First of all, it achieves a 3DMark06 score of 820 on an i7-2600, so that's 5 times slower, not 7 times. Furthermore, benchmarks with Crysis show that x86-64 is 32% faster than x86-32. Now we're down to a factor of 4. And that's still before making use of AVX, let alone FMA and gather/scatter.
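Just to make the arithmetic explicit (the 820 score and the 32% Crysis figure are the numbers claimed above, taken at face value):

```c
/* Ratio arithmetic behind the "factor 4" claim above. */
#include <stdio.h>

int main(void) {
    const double igp_3dmark06 = 4225.0;  /* HD Graphics 3000 score             */
    const double sws_3dmark06 = 820.0;   /* claimed SwiftShader score (x86-32) */
    const double x64_speedup  = 1.32;    /* claimed x86-64 gain in Crysis      */

    double gap_32bit = igp_3dmark06 / sws_3dmark06;   /* ~5.15x */
    double gap_64bit = gap_32bit / x64_speedup;       /* ~3.9x  */
    printf("Gap (x86-32): %.2fx, estimated gap (x86-64): %.2fx\n",
           gap_32bit, gap_64bit);
    return 0;
}
```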
> There are likely additional changes to the architecture at 22 nm. This seems evident even from the core count. 50+ falls very short of the idealized doubling of cores with Moore's Law. Since Larrabee had 32 at 45 nm, the cores are potentially bulkier and the die size probably trimmed from the Itanium-scale bloat of the original chip.
It's impossible to tell without at least knowing the die size. Given that Knights Corner should make up for the investment into Larrabee, my expectation is that they want a high-yield product. They may also have borrowed a thing or two from the Atom architecture to make it more power efficient while achieving higher frequencies.
> The process gap is a strong reason why competitors will resist going to fully generic architectures. Once the hardware is essentially nondescript, it becomes a process contest.
Not entirely. First of all, I don't think these competitors have a choice. They can't "resist" implementing new APIs like, for instance, OpenCL. If it gains importance, they'll have to make their hardware more generic. And graphics is absolutely not at a standstill either.
> Their best opportunity is to allow Intel to exhaust its process advantage by spending its good transistors getting a non-optimal design past its overhead.
I really don't think x86 will exhaust Intel's process advantage. The RISC versus CISC battle didn't really result in a winner. It's a very complex interaction of factors that makes one architecture more successful than another. And if there's one thing x86 has proven, it's that there are no limits to its instruction set nor to its execution model.
> Personally I think one of the most critical questions is how the hardware will help the software orchestrate the execution of thousands of tasks. A core scaling efficiency of 99% or 98% makes a massive difference when you have hundreds of cores, making the ISA choice much less critical. The memory model is very important, but it's not clear yet that x86 is at a disadvantage there.
x86 is not at a disadvantage per se... but the x86 programmer mindset of fine-grained, low-overhead coherency as the solution to all communication and caching problems does force you into the Larrabee model.
> It's impossible to tell without at least knowing the die size. Given that Knights Corner should make up for the investment into Larrabee, my expectation is that they want a high-yield product. They may also have borrowed a thing or two from the Atom architecture to make it more power efficient while achieving higher frequencies.
Going by the descriptions of the original implementation, Larrabee could have borrowed from any number of architectures made in the last 8 years to be more power efficient and achieve higher clocks. Let's hope it doesn't borrow Atom's transistor density.
> It's also worth noting that GPUs don't achieve the idealized doubling of cores either. For a while they appeared to exceed Moore's Law, while in fact they really just dedicated an ever larger percentage of die size to shading cores. This core explosion has come to a halt and they're now fully at the mercy of process technology.
The core count scaling was a Larrabee supporter talking point. Larrabee is a full node behind what some proponents predicted, and that is with an alleged two-node lead. The extra shrink allows for a more modest die size and still leaves room to make changes.
> Not entirely. First of all, I don't think these competitors have a choice. They can't "resist" implementing new APIs like, for instance, OpenCL. If it gains importance, they'll have to make their hardware more generic. And graphics is absolutely not at a standstill either.
They help define the APIs and influence the directions they take. They will fight to bend the transition in their favor however they can, until they can be certain that a competitor like Intel is not at a massive advantage. For now at least, they also have the advantage that their specialized products are the only ones with any history of not embarrassing themselves in graphics.
> I really don't think x86 will exhaust Intel's process advantage. The RISC versus CISC battle didn't really result in a winner. It's a very complex interaction of factors that makes one architecture more successful than another. And if there's one thing x86 has proven, it's that there are no limits to its instruction set nor to its execution model.
I've already made my argument that x86 could have contributed about 10-20% overhead, but while this is a sizeable deficit, I do not think it is where the bulk of the disadvantage comes from.
> Personally I think one of the most critical questions is how the hardware will help the software orchestrate the execution of thousands of tasks. A core scaling efficiency of 99% or 98% makes a massive difference when you have hundreds of cores, making the ISA choice much less critical. The memory model is very important, but it's not clear yet that x86 is at a disadvantage there.
Intel did not simulate efficiencies that high. I forget where the line dropped below 98%, but it definitely did in the range of 64 cores.
> So the competition really shouldn't rely on Intel to exhaust its process advantage. Intel has lots of experience with multi-CPU, multi-core, many-core, and OS interaction. On the other hand, this also means they have every chance to make the right design decisions, gain market share, and eventually reduce or even close the process gap.
It's worked for at least two Larrabee generations. Past performance is no predictor of future success, but it does give a nice hint.
> It would not just match it, but beat it. AVX and FMA would increase the GFLOPS by a factor of four, while replacing the IGP with CPU cores increases performance by another 50%. Also don't underestimate the power of gather/scatter. It could speed up texel fetching by up to 18x, and it is also very useful in other graphics pipeline stages (vertex attribute fetch, primitive assembly, rasterization, table lookups for transcendentals, etc.).
Add lots of hardware that very few apps use, and compare with an IGP denied equivalent progress and engineering effort. That way even I can make any architecture smack anything else out there.
> Even if it still loses at performance/Watt, that's not a terrible thing as long as the absolute power consumption is at an acceptable level (and at Intel's 22 nm process it should be).
It flies in the face of EVERY CPU and EVERY GPU vendor's direction. So much so that I honestly don't know what to say, except that LRB1 missed its clock target by ~40%, very likely due to excessive power consumption.
> Consumers care a lot about features, and a mainstream CPU with all of the above would enable a whole new era of complex high-performance computing. The possibilities are only limited by the developer's imagination.
Consumers care about apps, not features.
> (this hasn't happened in the past decade of chipset IGPs either)
Intel's garbage in the name of IGPs of the past is not evidence of anything either way. And the IGPs of the future have been speculated to have stacked DRAM on package to increase bandwidth.
> Someone who cares little about gaming, who previously picked a system with a good CPU but the cheapest possible IGP which offers adequate graphics, won't shell out to buy an APU with a faster IGP.
Are the consumers who don't care for gaming somehow planning to cut themselves off from the next-gen GPU-powered HTML5 websites? Or have they stopped caring about the battery life of their laptops/tablets as well?
> So it's not likely that CPU manufacturers will invest more area into the IGP both for features and performance beyond Moore's Law, unless of course as a more expensive part in higher market segments (i.e. it won't be a trend within the same market segment).
This is again devoid of any reality. You need only look at the progress of the worst offender's IGPs over the last three years, let alone the rest of the vendors'.
So while "objectively" you can indeed find applications which run much slower on the CPU than the IGP, I think a subjective comparison is much more meaningful to see where things are going.
> Add lots of hardware that very few apps use, and compare with an IGP denied equivalent progress and engineering effort. That way even I can make any architecture smack anything else out there.
It's not a lot of hardware at all. Like I said, AVX already reserves the encoding bits to extend it to 1024-bit operations, FMA instructions are already specified, and gather/scatter requires little more than two 512-bit to 128-bit shuffle networks. Yet these minor things would make a major difference in SIMD efficiency (both effective performance and power consumption).
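For a sense of what that hardware would replace: without a scatter instruction, a vectorized loop has to unpack the data and index vectors and issue one scalar store per lane, along the lines of this sketch (plain C with AVX, names invented for the example):

```c
#include <immintrin.h>

/* Emulated 8-lane scatter: store each element of 'values' to table[idx[i]].
 * This lane-by-lane unpacking is what dedicated scatter hardware would
 * collapse into a single instruction. */
static void scatter_emulated(float *table, __m256i idx, __m256 values) {
    int   i32[8];
    float f32[8];
    _mm256_storeu_si256((__m256i *)i32, idx);   /* spill indices to memory  */
    _mm256_storeu_ps(f32, values);              /* spill values to memory   */
    for (int lane = 0; lane < 8; ++lane)
        table[i32[lane]] = f32[lane];           /* 8 separate scalar stores */
}
```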
> If you must compare the IGP with a hypothetical CPU core of your choice, you should compare it with an IGP expected in that time frame. Besides, you should not forget that the IGP area budget is increasing, but CPU core counts are not going to scale beyond 4 in the consumer market in the foreseeable future.
There's no evidence of that. The latest games already specify quad-core CPUs in their recommended system requirements, and AMD will soon launch a highly anticipated 8-core CPU. The software world is slow to adopt multi-core programming techniques, but it's really a one-time investment. Once you have a scalable software architecture, more cores give you a direct benefit. Even NVIDIA's Kal-El processor is betting on 4 cores becoming the norm soon. It would be foolish to think that once the majority of software is making use of 4 cores, it's not going to evolve beyond that.
> Besides, the IGP is clocked ~3x lower and has ~4x less area than the cores in SB. Assuming the IGP and SwiftShader are close enough, and contrary to your own admission w.r.t. 3DMark06, that's not a near miss. That's a Mount Everest of power efficiency to climb.
That IGP is helpless on its own. So you have to take into account the power consumption of the API and driver layers running on the CPU, as well as the power consumption of the L3 cache and memory controller.
> It flies in the face of EVERY CPU and EVERY GPU vendor's direction. So much so that I honestly don't know what to say, except that LRB1 missed its clock target by ~40%, very likely due to excessive power consumption.
That's an entirely different situation. LRB1 was supposed to compete in the high-end market; missing its target by 40% was completely unforgivable. If instead it takes 70 W for a system to achieve the same legacy graphics performance as a system with an IGP achieves at 50 W, that's not nearly as disastrous. Power consumption is a limiting factor in the high end, but not so much in the low end. Price and features are at least as important for commercial success.
> Consumers care about apps, not features.
New features lead to new apps.
> Where are the consumer apps that scale with cores and vector width? Multi-core has been around for years now, and vector ISAs for more than a decade. No, games aren't the answer, as they scale much more with GPU area than CPU area.
SIMD is used a lot in drivers and low-level libraries. You wouldn't get the same desktop/laptop/netbook experience without it.
> Intel's garbage in the name of IGPs of the past is not evidence of anything either way.
They're evidence that a large number of people care more about CPU performance than GPU performance.
> And the IGPs of the future have been speculated to have stacked DRAM on package to increase bandwidth.
These speculations talk of 1 GB of low-power, high-latency, high-bandwidth memory. It would clearly also drive up costs and lower yields. There are a few too many challenges here to make this realistic.
> Are the consumers who don't care for gaming somehow planning to cut themselves off from the next-gen GPU-powered HTML5 websites? Or have they stopped caring about the battery life of their laptops/tablets as well?
I said cares little, not doesn't care. Consumers who are not hardcore gamers don't shell out extra for a powerful GPU. However, they do expect everything to Just Work. And since they don't upgrade often, future-proof features can be more valuable than optimal performance and power consumption for legacy applications.
> Where are they going? The IGP apologists have picked up their game and are devoting an increasingly larger share of their precioussss leading-edge fabs to IGPs at the expense of CPU cores. CPU core counts have stalled while GPU core counts continue to scale.
You're going to have to show us some proof of that. It's very early days, since single-die CPU+IGP chips have only just appeared. So far I've only seen evolutionary progress, while quad-core and wider vectors are entering the mainstream CPU market.
> Every research publication in graphics these days is devoted to getting more and more irregular algorithms to scale on GPUs, with increasingly impressive results, while scatter/gather and multiple vector units remain vaporware, even on roadmaps.
Don't be silly, this entire site is about speculation, wishful thinking, and vaporware. So let's try not to take either opinion too seriously. You're hoping Intel and AMD will move heaven and earth to increase IGP performance, while I'm hoping they'll improve the CPU's throughput efficiency. Looking at the cost/gain balance of each, the latter seems like the better deal. And just because you read a lot of research publications about graphics doesn't mean generic computing performance is getting any less attention!
> GPUs are even more essential to consumers today than, say, two years ago, as they have invaded the UI, YouTube, Flash, and HTML5.
3D Flash will feature SwiftShader support, so just how "essential" the GPU will be for this kind of new technology remains to be seen. It's not like they require a lot of performance, and meanwhile CPUs are getting faster too.
> And last but not least, by and large CPU design remains beholden to serial performance.
Absolutely, but neither gather/scatter nor AVX-1024 has to interfere with that. On the contrary: gather/scatter allows more loops to execute their iterations in parallel, and AVX-1024 would enable high throughput without excessive power consumption.