22 nm Larrabee

Those charts are a good representation of cumulative sales over the past 15 years.
GMA 950 launched in 2005...
Regarding sales for the past 2-3 years (which is what matters the most for OEMs and developers of demanding software), they're a bit useless, as the top 5 GPUs aren't even on the market anymore.
Sure, but it illustrates that there's not much incentive at all for developers to invest in GPGPU development. There's a huge range of GPU performance levels, and the very low end dominates. The CPU chart on the other hand shows that CPU performance varies relatively little. Developers can right now rely on SSE for any purpose, and easily transition to AVX-128, AVX-256, AVX2, etc.
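As a minimal sketch of how painless that transition is in practice (assuming a simple saxpy-style loop rather than any particular codebase), the exact same source can target SSE, AVX or AVX2 purely through compiler flags:

[code]
// The compiler auto-vectorizes this loop for whatever SIMD width the target
// supports: -msse2 gives mulps/addps, -mavx gives 256-bit ops, -mavx2 -mfma
// gives fused multiply-adds. The source doesn't change at all.
#include <cstddef>

void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
[/code]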
OEMs know that a better GPU drastically enhances gaming, video and web-browsing performance.
Even the most demanding casual game doesn't require anything more than Intel's IGPs. And we've been watching videos and surfing the web for ages, so better GPUs don't "drastically" enhance performance in a way that matters. Soon mainstream CPUs will be capable of taking over all of the IGP's tasks, and much more.
So if OEMs value better-performing iGPUs and prefer the option to bundle AMD APUs, more PCs with AMD APUs will be on the market, more people will buy AMD APUs, and more developers will put a nice, big, shiny stamp on their latest software claiming it takes full advantage of the iGPU in people's newly bought PCs.
That's a lot of wishful thinking for one sentence.

The best way to ensure that a system can run future applications is to buy one with a powerful CPU. Developers are currently investing heavily in multi-core programming since that's guaranteed to pay off across all markets. And a CPU with AVX2 will be capable of adequately running anything an APU can run, but not the other way around. So OEMs should really think twice about the true value of having a weak CPU and a slightly faster IGP. High-throughput power-efficient homogeneous CPUs have a brighter future due to fewer compromises.
 
I read news about Intel managing to get it working in CMOS at 32 nm; they now expect this technology to scale along with their process progress (they were stuck at 65 nm until then).
Are the ring buses made using this technique, or are they done some other way?
Knight's Corner will be fabricated using a 22 nm Tri-Gate process.
About Larrabee/K'sC: basically, can we expect changes from the original Larrabee, texture unit removal aside?
There are strong indications that the gather/scatter implementation has been improved. I also expect it to reach higher clock frequencies at lower power consumption, beyond the process advantage.
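For reference, this is the kind of operation being discussed — a rough sketch of a vector gather using the AVX2 intrinsic (the LRBni/MIC instruction differs in encoding and mask handling, but the idea is the same; assumes an AVX2-capable compiler and CPU):

[code]
#include <immintrin.h>

// Load 8 floats from non-contiguous addresses in a single instruction.
// Scalar equivalent: for (i = 0; i < 8; ++i) out[i] = base[index[i]];
__m256 gather8(const float* base, __m256i indices) {
    return _mm256_i32gather_ps(base, indices, 4);  // scale = sizeof(float)
}
[/code]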
I have the feeling that K'sC is clearly a "filler product", something Intel is pushing out to somewhat compete with GPGPU and make some money out of their investments.
I'm sure it will do reasonably well in the HPC market, since Intel will be able to sell complete systems with Xeon CPUs and Knight's Corner MICs which all run x86 based code and use the same toolchain.

But yeah, it might merely win them back their original investment. Only a couple years later many-core CPUs with AVX2 will compete against their own MICs. Most HPC applications are cache size sensitive and Amdahl's Law favors architectures with high sequential performance and good scaling behavior.
As Nick is saying, Intel is putting its strength into the AVX2 instruction set (and a proper implementation). It doesn't make much sense to launch something next year that uses a completely different instruction set (hence my feeling that K'sC is a filler product).
Switching from LRBni to AVX2+ shouldn't be very hard. And the Larrabee architecture can still evolve in any direction they like. Also, AVX-1024 will be required to lower the power consumption of CPUs and make them competitive at performance/Watt. So many years may pass before AVX proves to be superior. There's no need for Intel to rush anything. They've got all grounds covered.
We know really little about Haswell, but I don't believe it's the architecture that will allow Intel to do it all. It may allow software rendering with acceptable results for casual gamers, and do marvels for physics, AI, etc. for the others, but that's it. GPUs (and GPGPUs) will still be a compelling target for the workloads that map well to their architectures. 500 GFLOPS won't cut it against modern GPUs.
Indeed, but that would be for a mainstream quad-core Haswell CPU, so you basically get it for free. And if the next architecture implements AVX-1024, the IGP can be ditched and you can probably choose between 8 and 32 cores, depending on the market/budget. Also, I expect GPUs won't be able to scale performance as fast as before. They're hitting the same physical limitations as everyone else. Eventually it will all converge to the same peak performance/Watt and things like effective ILP will determine which architecture is superior.
Honestly I don't know much, but after reading some stuff about the UltraSPARC CPU line and the upcoming IBM PowerPC A2, it looks to me like the way Larrabee was designed is no longer adapted to the goals Intel may be pursuing now. Maybe it's nothing, but I noticed that in all those designs the cores can access a "shared L2" (as I understand it, the difference versus Larrabee's local subset of the L2 is that they can read and write anywhere in the L2 cache, whereas a Larrabee core can only read and write its local subset of the L2 and read from the others).
Larrabee has fully coherent caches.
There is also the focus on power consumption: 16-wide SIMD may not be workable within the design, as it supposedly consumes a lot and puts terrible constraints on the memory system. A move to AVX2 as Nick is proposing sounds like a win to me; actually I wonder whether it's worth it for them to push 4 FLOPS per cycle (Haswell is supposed to do 2 FMAs per cycle, so twice 2 FLOPS, right? I'm not sure I got this right while reading).
AVX execution units are 256-bit wide, so with a pair of them that's 16 single-precision floating-point numbers per core. Larrabee and GPUs have essentially the same physical width.
It could be a win for Intel as Haswell might be awesome, but I don't believe the silicon budget will allow a proper do-it-all architecture (if that is to happen); they could have their way with heterogeneous designs, different cores but using the same ISA(s).
The problem with that is unpredictability. How do you allocate threads to heterogeneous cores? Do you pin them to a specific core, risking that you're leaving a faster core unused and other threads are waiting on the slow core, or risking that a latency insensitive thread is occupying a fast core? Or do you let the O.S. schedule threads, causing context switching and data migration overhead?

Heterogeneous architectures may look good on paper but for developers it creates a number of complex issues not unlike livelock and priority inversion. A homogeneous architecture is far easier to program and since high performance software development is very expensive it is worthwhile sacrificing a bit of theoretical throughput for ease of programmability, future scalability, and maintenance.
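To make the pinning option concrete, here is a minimal sketch assuming Linux and pthreads, with a hypothetical split where core 0 is the fast core; note how crude the control actually is:

[code]
// Pin the calling thread to one core. Predictable placement, but a
// latency-insensitive worker pinned to the fast core can now starve
// threads that would have benefited from it, and vice versa.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

void pin_current_thread_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// e.g. a worker calls pin_current_thread_to(0) for the assumed fast core, or
// pin_current_thread_to(1) for a slow one; the trade-off is fixed at that point.
[/code]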
 
If I could choose between 4 cores with AVX-1024 or 8 cores with AVX-512, I would choose the latter. The two execution units with AVX-512 end up with double the L1/L2 cache and bandwidth. And the 4 additional cores will be much more useful in general. AVX-1024 sounds like overkill with 256-bit execution units.
 
That would also pretty much double the entire die size. From what I understand it takes very few transistors to go from 256-bit to 1024-bit AVX.
 
If I could choose between 4 cores with AVX-1024 or 8 cores with AVX-512, I would choose the latter. The two execution units with AVX-512 end up with double the L1/L2 cache and bandwidth. And the 4 additional cores will be much more useful in general. AVX-1024 sounds like overkill with 256-bit execution units.
CPUs with AVX-1024 support would also support AVX-512. It's merely a matter of allocating 4 or 2 physical 256-bit registers, and the same sequencing logic can be used. There's no core size benefit to only supporting AVX-512, so the number of cores that fit on a die doesn't depend on it. Sandy Bridge E is expected to offer 6-core models, so an 8-core Haswell at 22 nm should be available as well, but likely at a price premium.

I really don't think AVX-1024 is overkill. GPUs have a logical width of 1024 or 2048 bits. A fourfold reduction in instruction rate will offer much greater opportunity for power gating than AVX-512. And with two AVX execution units per Hyper-Threaded core, it would be able to hide at least 4 cycles of latency. That's not a whole lot compared to GPUs, and in theory it only suffices to always eliminate L1 cache latency. But together with out-of-order execution it should in practice prove very effective at hiding much longer latencies as well.

So I don't see any benefit in not supporting AVX-1024. It's exactly where Intel wants to go; offering the throughput and efficiency of a GPU within the superior framework of the CPU.
 
That would also pretty much double the entire die size. From what I understand it takes very few transistors to go from 256-bit to 1024-bit AVX.

OK, but you also need 4 times the bandwidth to move those 1024 bits, and you need to store them too. They won't magically teleport there. For peak sustained throughput this will be a challenge.
 
If your execution units stay at 256 bits then you won't need any extra bandwidth to handle 1024-bit chunks, as you'll simply be iterating over them; you don't have to load them all into the execution units simultaneously. Sure, extra bandwidth would help, but it isn't needed.
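As an illustration of that iteration, here is a rough sketch (hypothetical 1024-bit type, plain AVX intrinsics) of one logical 1024-bit operation being issued as four 256-bit ops through the same execution unit:

[code]
#include <immintrin.h>

// Hypothetical 1024-bit logical register, stored as four 256-bit chunks.
struct v1024 { __m256 chunk[4]; };

v1024 add1024(const v1024& a, const v1024& b) {
    v1024 r;
    for (int i = 0; i < 4; ++i)                               // sequenced over 4 issue slots,
        r.chunk[i] = _mm256_add_ps(a.chunk[i], b.chunk[i]);   // 256 bits at a time
    return r;
}
[/code]

Only one 256-bit chunk is in flight per issue slot, so the data paths and register file ports don't need to get any wider.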
 
GMA 950 launched in 2005...
Point being?
I said the list you mentioned represented cumulative sales for the last 15 years. The same list shows the S3 ViRGE, Intel Extreme Graphics and ATI Rage. My statement still stands.

That list won't be used as a guide for developers writing demanding applications. It's useless as such.



Sure, but it illustrates that there's not much incentive at all for developers to invest in GPGPU development. There's a huge range of GPU performance levels, and the very low end dominates. The CPU chart on the other hand shows that CPU performance varies relatively little. Developers can right now rely on SSE for any purpose, and easily transition to AVX-128, AVX-256, AVX2, etc.

No, they can't.
You're forgetting about the rise of the ARM architecture in tablets, phones and smart/netbooks.
Even ARM is betting on GPU computing, with the OpenCL-oriented Mali-T604.


Even the most demanding casual game doesn't require anything more than Intel's IGPs.

This isn't true. Try playing any 3D browser game (RuneScape, Quake Live, Free Realms, FusionFall, etc.) and see how your 945G will fare, no matter what CPU you're using.



And we've been watching videos and surfing the web for ages, so better GPUs don't "drastically" enhance performance in a way that matters.

And your opinion is that web browsing and video watching have stagnated for the past 10 years?

I went from an Atom N270 to an Athlon Neo L310. Both CPUs suck really hard.
On the Atom, web browsing turned sluggish after opening more than 2 tabs of Flash-heavy websites, I couldn't really watch any YouTube video over 480p, and CPU usage was always at 100%.


But my Athlon Neo is paired with a Radeon HD 3200, and I can watch 720p YouTube videos on the main screen while I have 4 tabs with Flash-heavy websites on a 1080p secondary screen, with no lag whatsoever.
I'm pretty sure the Athlon 64 X2 @ 1.2GHz isn't responsible for that, because it isn't even reaching 100% usage.



Soon mainstream CPUs will be capable of taking over all of the IGP's tasks, and much more.
(....)
That's a lot of wishful thinking for one sentence.

Talk about wishful thinking...



The best way to ensure that a system can run future applications is to buy one with a powerful CPU

That would be true... in 1998...

I change graphics cards about 3x more often than I change CPUs in my desktop.
I always end up buying either more RAM, an SSD, a new peripheral, etc., as most of the time they represent a much more useful upgrade than the CPU.
I bought this really cheap Phenom II X3 that I unlocked into an X4, clocked to 3GHz. It has served me perfectly for the past 2 years, and it will probably serve me equally well for more than a year down the road.

I will probably change graphics card again (for the 3rd time in my current system) before I change CPU.


So OEMs should really think twice about the true value of having a weak CPU and a slightly faster IGP. High-throughput power-efficient homogeneous CPUs have a brighter future due to fewer compromises.

So you disagree with pretty much all of the scientific community related to computing hardware, as you seem to neglect the huge disparity in FLOPS between architectures.
Let's just agree to disagree then.
 
So you disagree with pretty much all of the scientific community related to computing hardware, as you seem to neglect the huge disparity in FLOPS between architectures.

The newest top500 list sees a h[strike]eter[/strike]omogeneous architecture in front. Both in peak FLOPS, ratio of peak/max FLOPS and FLOPS per watt: http://www.top500.org/list/2011/06/100
 
The newest top500 list sees a heterogeneous architecture in front. Both in peak FLOPS, ratio of peak/max FLOPS and FLOPS per watt: http://www.top500.org/list/2011/06/100

I wasn't comparing supercomputers, I was comparing single chips for the domestic market.
I'm pretty sure Nick was on the same subject, as he referred to web browsing and video playback.


BTW, of course a heterogeneous architecture will be in front. There are no autonomous computing-oriented GPUs afaik (until Maxwell comes out, at least).
 
I'm sorry - I meant homogeneous archs. And I'm doubly sorry to have mistaken „pretty much all of the scientific community related to computing hardware” for HPC/Supercomputing, when you meant domestic scientific computing. It's just I don't hear much about people doing domestic-scientific work.
 
I'm sorry - I meant homogeneous archs.
Really? Then what you said isn't really true. Within the top 5 supercomputers in your list, 3 of them are using nVidia GPUs.
And except for the Japanese one in 1st place, the nVidia-powered supercomputers are more recent too.


And I'm doubly sorry to have mistaken „pretty much all of the scientific community related to computing hardware” for HPC/Supercomputing, when you meant domestic scientific computing. It's just I don't hear much about people doing domestic-scientific work.

I didn't mean domestic-scientific work either (as it's a bit nonsensical too).

I meant "demanding" applications for domestic computers, like playing 1080p (later stereo-3D) videos, 3d games, video+image editing software, opening several tabs with "heavy" web pages, eventually complex WebGL games, etc.

All of the above tasks run a lot faster and/or more power-efficiently when assisted by a decent GPU.
 
Knight's Corner will be fabricated using a 22 nm Tri-Gate process.
OK, so the wiring is not related to Intel's latest breakthrough.
There are strong indications that the gather/scatter implementation has been improved. I also expect it to reach higher clock frequencies at lower power consumption, beyond the process advantage.
OK, so clock speeds closer to what Intel aimed for at first (>2 GHz).
I'm sure it will do reasonably well in the HPC market, since Intel will be able to sell complete systems with Xeon CPUs and Knight's Corner MICs which all run x86 based code and use the same toolchain.
But yeah, it might merely win them back their original investment. Only a couple years later many-core CPUs with AVX2 will compete against their own MICs. Most HPC applications are cache size sensitive and Amdahl's Law favors architectures with high sequential performance and good scaling behavior.
Indeed, priced properly, Intel has quite some stuff on its side.
Switching from LRBni to AVX2+ shouldn't be very hard. And the Larrabee architecture can still evolve in any direction they like. Also, AVX-1024 will be required to lower the power consumption of CPUs and make them competitive at performance/Watt. So many years may pass before AVX proves to be superior. There's no need for Intel to rush anything. They've got all grounds covered.
They have all grounds covered, but I'm not so sure about them having that much time. ARM CPUs may bite into their share sooner rather than later.
Indeed, but that would be for a mainstream quad-core Haswell CPU, so you basically get it for free. And if the next architecture implements AVX-1024, the IGP can be ditched and you can probably choose between 8 and 32 cores, depending on the market/budget. Also, I expect GPUs won't be able to scale performance as fast as before. They're hitting the same physical limitations as everyone else. Eventually it will all converge to the same peak performance/Watt and things like effective ILP will determine which architecture is superior.
That's where I get lost. If Intel really wants to kill GPUs (whatever their real purpose, graphics or compute), they need to deliver more performance. 8 Haswell cores (or Skylake something) would be big and still far away from GPUs in throughput. But I somewhat agree with you that today's GPU standards for power consumption are unacceptable by today's CPU standards. Still, 8 Haswell cores would not do it, and 32 cores would be pretty big and hot.
AVX execution units are 256-bit wide, so with a pair of them that's 16 single-precision floating-point numbers per core. Larrabee and GPUs have essentially the same physical width.
I thought width was calculated based on 32-bit FP numbers for SP (64-bit for DP). To me AVX is 8 wide. That's why I don't get your 500 GFLOPS figure; assuming two FLOPS per cycle on 8 elements I find half that value, >200 GFLOPS (assuming a 3+ GHz clock speed).
The problem with that is unpredictability. How do you allocate threads to heterogeneous cores? Do you pin them to a specific core, risking that you're leaving a faster core unused and other threads are waiting on the slow core, or risking that a latency insensitive thread is occupying a fast core? Or do you let the O.S. schedule threads, causing context switching and data migration overhead?

Heterogeneous architectures may look good on paper but for developers it creates a number of complex issues not unlike livelock and priority inversion. A homogeneous architecture is far easier to program and since high performance software development is very expensive it is worthwhile sacrificing a bit of theoretical throughput for ease of programmability, future scalability, and maintenance.
It looks like Intel is working on this; it may fail, but that should be easier to achieve with fairly similar cores than with CPU and GPU cores. I don't remember the name of the project.
 
Really? Then what you said isn't really true. Within the top 5 supercomputers in your list, 3 of them are using nVidia GPUs.
And except for the Japanese one in 1st place, the nVidia-powered supercomputers are more recent too.
Yes, I know. Please have a look at the respective overall figures (Rpeak/Rmax) as well as the power efficiency. Clearly, the new Japanese supercomputer is highlighting where GPUs are lacking most at the moment.

I meant "demanding" applications for domestic computers, like playing 1080p (later stereo-3D) videos, 3d games, video+image editing software, opening several tabs with "heavy" web pages, eventually complex WebGL games, etc.

All of the above tasks run a lot faster and/or more power-efficiently when assisted by a decent GPU.
Ah, yes, sure they do. I was a little distracted when I saw you mentioning the scientific community all of a sudden. Sorry 'bout that.
 
Yes, I know. Please have a look at the respective overall figures (Rpeak/Rmax) as well as the power efficiency. Clearly, the new Japanese supercomputer is highlighting where GPUs are lacking most at the moment.
But that is "just" a distinct weakness of Fermi in this respect. They don't handle the matrix operations (which are the base for the top500 list) with very high efficiency. AMD GPUs are actually currently better in this and nvidia promised to improve that considerably with Kepler too (i.e. they aim for parity with CPUs). And you always get a better power efficiency when choosing low voltage parts, that is true for GPUs too.

The more general problem is actually what those scores tell us about the code these computers will actually encounter in reality. It's almost nothing. Just solving huge systems of linear equations isn't what most of these systems do as their daily work. :rolleyes:
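For context, the kernel Linpack ultimately spends its time in is a dense matrix-matrix multiply update; a naive sketch of that (real HPL uses a heavily tuned BLAS, this is only meant to show the shape of the workload):

[code]
#include <cstddef>

// C += A * B for n x n matrices: ~2*n^3 FLOPs of almost perfectly regular,
// cache-friendly work -- which is exactly why it flatters wide SIMD machines.
void dgemm_naive(const double* A, const double* B, double* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
[/code]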
 
I thought width was calculated based on 32-bit FP numbers for SP (64-bit for DP). To me AVX is 8 wide. That's why I don't get your 500 GFLOPS figure; assuming two FLOPS per cycle on 8 elements I find half that value, >200 GFLOPS (assuming a 3+ GHz clock speed).
Try doing that calculation with two 8-wide units per core ;)
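Spelling the arithmetic out (assumed numbers: a quad-core part at roughly 3.5 GHz with two FMA-capable 256-bit units per core — estimates, not confirmed Haswell specs):

[code]
#include <cstdio>

int main() {
    const double lanes     = 8;    // 256-bit / 32-bit single precision
    const double fma_flops = 2;    // one fused multiply-add = 2 FLOPs
    const double units     = 2;    // two FMA-capable AVX units per core (assumed)
    const double cores     = 4;
    const double ghz       = 3.5;  // assumed clock
    std::printf("%.0f GFLOPS\n", lanes * fma_flops * units * cores * ghz);  // ~448
    return 0;
}
[/code]

Which lands in the ~500 GFLOPS ballpark mentioned earlier; dropping either the second unit or FMA halves it, which is where the >200 GFLOPS figure comes from.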
 
And your opinion is that web browsing and video watching have stagnated for the past 10 years?

I went from an Atom N270 to an Athlon Neo L310. Both CPUs suck really hard.
On the Atom, web browsing turned sluggish after opening more than 2 tabs of Flash-heavy websites, I couldn't really watch any YouTube video over 480p, and CPU usage was always at 100%.


But my Athlon Neo is paired with a Radeon HD 3200, and I can watch 720p YouTube videos on the main screen while I have 4 tabs with Flash-heavy websites on a 1080p secondary screen, with no lag whatsoever.
I'm pretty sure the Athlon 64 X2 @ 1.2GHz isn't responsible for that, because it isn't even reaching 100% usage.

In contrast, I've run an Athlon II X2 245 with an ATI Rage PCI, and now a Radeon 7000 PCI, on Linux, so the drivers work but aren't quite first grade. (I sold a great little video card to buy food.)

This CPU is incredibly fast and the second core means my PC isn't brought to a crawl by a runaway task, but any full-screen Flash video is a no-go (720p non-Flash plays fine).

With the Radeon my PC even crawls and may crash when a web page does a fade-out or fade-in thing in JavaScript. The ATI Rage works better, but the Radeon does fast OpenGL for 2D purposes :) : an output surface for emulators and VLC.

So, that's an extreme experiment here; it made me appreciate the great stability and performance I would get from a modern IGP or something like a GeForce 6200 and up.
I was pissed when nVidia discontinued their IGP lines (and switchable graphics).

Funnily, they still have a strong presence on the low-end desktop, with the GeForce 7025.
 
You have to think of a high throughput homogeneous CPU as the unification of a legacy CPU and an IGP.

I don't believe x86 is the center of the universe. Legacy CPU support isn't a hard requirement in HPC and consumer software may follow suit soon enough.

The compute density isn't necessarily much higher than that of a whole APU. But the high throughput AVX units benefit from having access to the same cache hierarchy and from out-of-order execution. You save a lot of communication overhead and certain structures don't have to be duplicated. And as I've detailed before, executing AVX-1024 on 256-bit execution units drastically reduces the power consumption of the CPU's front-end and schedulers, and hides latency by implicitly allowing access to four times more registers.

I agree integration is the future. At least for now. The moment homogeneous architectures become good enough we'll come up with new workloads that require dedicated hardware. It's a cycle. In any case I'm not seeing any indication that AVX will provide competitive performance to discrete GPUs regardless of efficiency advantages. The raw advantage in throughput for GPUs is still too great.

So there are no compromises to legacy scalar execution, and it also exploits DLP in practically the same way as a GPU!

The full support of legacy scalar ISAs and techniques is an albatross when it comes to throughput computing, not a benefit.

Besides, there is no viable alternative. You said you agree they will converge but wonder whether CPUs or GPUs are more representative (i.e. closer to the result of the convergence)? GPUs have a very long way to go to offer acceptable sequential performance. Some form of out-of-order execution, and a comprehensive cache hierarchy are an absolute must to be able to compete with CPUs.

Sorry if I missed it but why exactly do GPUs need to match CPUs in scalar performance? We already have enough performance for most sequential tasks and GPUs don't have the expensive burden of squeezing every last ounce of ILP from sequential workloads.

For CPUs to compete with GPUs the only thing lacking is AVX-1024...

That easy eh? :)

Hardware thread switches are rare when you use software fiber scheduling.

OK, that's fair, but still unproven.
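For readers wondering what software fiber scheduling refers to here: a rough, hypothetical sketch of the idea — many small tasks interleaved cooperatively on a single hardware thread, so a "switch" is just a function return instead of an OS context switch:

[code]
#include <cstddef>
#include <vector>

struct Fiber {
    std::size_t pc = 0;     // where this fiber resumes next time
    bool done = false;
    void step() {           // run a slice of work up to a natural yield point
        // ... e.g. issue a prefetch or a batch of pixels, then return ...
        if (++pc == 4) done = true;
    }
};

void run(std::vector<Fiber>& fibers) {
    bool any_left = true;
    while (any_left) {      // round-robin over fibers; no OS involvement
        any_left = false;
        for (auto& f : fibers)
            if (!f.done) { f.step(); any_left = true; }
    }
}
[/code]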

Anything more as in anything larger? Once IGPs have been replaced by software rendering nothing is holding Intel from selling CPUs with more cores and more bandwidth. If they can increase their revenue by keeping people from buying low-end and mid-end discrete graphics cards, they won't let that opportunity slip.

There's no free pass there. Intel is still constrained by the thermal and power limits of the CPU socket.

Software rendering is not limited by the API so once developers start using the CPU more directly it would even compete with high-end discrete cards. It will take many years, but the convergence isn't stopping so this is bound to happen. Perhaps by the end of this decade buying a discrete graphics card may seem as silly as buying a discrete sound card. They'll still exist but for the majority of consumers won't offer any worthwhile benefit.

Given that AVX is just a wider SSE and the latter did squat for software rendering performance I'm still not seeing reasons to be excited. It would be great to have no need for dedicated hardware and cumbersome APIs but the fact is that CPUs are just too slow at throughput apps. I hope you realize that a comparison of sound and video cards is silly given the vast difference in workload complexity. We are far, far away from "good enough" graphical reproduction of the world.
 
So, that's an extreme experiment here; it made me appreciate the great stability and performance I would get from a modern IGP or something like a GeForce 6200 and up.
I was pissed when nVidia discontinued their IGP lines (and switchable graphics).

Funnily, they still have a strong presence on the low-end desktop, with the GeForce 7025.

Careful -> only DX10 IGPs (HD 3200, GeForce 8200) and later will get you DXVA, Flash and browser acceleration.

Looking at your experience, I think you'd be better served with an E-350.
 
Flash is really an awful thing; it's incredible the amount of resources YouTube uses versus the same videos played via VLC, for example. /OT
My main computer for now is a laptop (won't change before I reach the US) which runs a Pentium M @ 1.86 GHz; it can crawl on some Flash-heavy websites. My real main computer is off now, as there is an issue with the Serial ATA port, which works... whenever it wants... which is bothersome when I actually want to boot... an OS... It ran an Athlon X2 2500+ and managed better (still...).

Anyway, back to (~)Larrabee: OK, I did not get that Haswell was to use two 8-wide SIMD units. It looks like Bulldozer will be at a 4x disadvantage vs. Haswell (I don't believe BD will get refreshed before the Haswell launch... sadly...).
 