OK, I lost a big post while answering Keldor, so I won't go through all of it just now.
Just a quick generic answer. Who will buy throughput-optimized devices? The answer is not that many people at all: the market for proper discrete GPUs is set to shrink and it is not that big compared to the CPU market. As an integrated part of the CPU they won't grow much further, because nobody cares. As for the mobile world, we are not yet past diminishing returns for the average user (though we should get there fast, and neither are CPUs), and as far as usefulness is concerned it seems all the manufacturers are more willing to spend their power budget on the CPU.
Nvidia made no mystery about what translates into improvements for end users in the embedded realm. You don't need that much of a GPU to accelerate the UI.
Anyway, as I see it, there is one type of core (not really a core, more on that below) that can cover 10% of existing workloads and another that can cover the other 90% (which includes a lot of parallel ones). I don't see how the latter are going to lose just because they constrain themselves to serial performance (/the serial programming model) or whatnot (actually I wonder if 90/10 isn't already a generous split).
The power issue is overblown: it rests on FLOPS-per-watt figures from not-that-relevant benchmarks (as pointed out by several members who I know hold relevant positions in the industry). Nick or Andrew pointed to supposedly heavily parallel workloads where GPUs are miles away from realizing their potential; it is still not good enough.
I think it is better, especially for someone like me who can't present facts and has no specific knowledge of the matter, only a recollection of opinions with lots of dissonance (like that Nvidia presentation where the trick to make it all happen is proper programming and compiler improvements... are we back to "VLIW is going to win", Itanium style?), yes, it is better to quit the discussion and keep reading points of view here and there.
EDIT: my post sounds harsh, but it's nothing personal. I'm angry at myself because I wrote something for quite a while and lost it stupidly... frustrating.
GPUs are not monolithic. Consider the Radeon 7970. On any given clock cycle each of the 128 SIMDs can be executing different instructions. Many from different programs. I don't know exactly how many different programs can be active at once, but it's a lot.
I will try to present my point of view on why I think the CPU is better (but also why I don't think one size fits all), and why I may not properly get what Keldor, 3dcgi or Gipsel, who patiently try to get something into my stubborn brain, are saying.
While I'm at it, thank you for your responses and efforts.
First, I've been re-reading an article on RealWorldTech about performance per watt and per mm^2.
It is about DP performance, and it is unclear to me whether D. Kanter based his graph on peak/paper FLOPS or on sustained performance (I would think the former, as he gives no information about which benchmarks or real-world tasks he would have used for his graph).
At least in DP, the power advantage of GPUs is imaginary versus something like Blue Gene/Q. GPUs still lead in perf per mm^2 against such a design, but as Power A2 cores are flexible I'm still not sure GPUs have a significant advantage there either (for the workloads IBM aimed at, I guess they found out that a lot of cache was better than extra DP FLOPS).
Another issue is indeed that the graphs are based on paper FLOPS, not sustained performance on real HPC workloads. I have no data, but a bias of mine (also looking at some posts from Aaron Spink here and on the RealWorldTech forum) is that CPU designs (even throughput-oriented ones like the Power A2) actually get closer to their peak performance than their GPU counterparts do.
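Just to put rough numbers on the "paper FLOPS" point, here is a little back-of-the-envelope sketch (my own toy code; the clocks and wattages are the approximate published figures, nothing measured by me):

```cpp
// Back-of-the-envelope peak-DP comparison from paper specs (approximate figures):
//   peak GFLOPS = FMA lanes * 2 flops per FMA * clock in GHz
#include <cstdio>

int main() {
    // Blue Gene/Q chip: 16 compute cores x 4-wide DP FMA, 1.6 GHz, roughly 55 W chip
    double bgq_gflops    = 16 * 4 * 2 * 1.6;        // ~205 GFLOPS DP
    // Tahiti (HD 7970): 2048 SP lanes, DP at 1/4 rate, 0.925 GHz, roughly 250 W board
    double tahiti_gflops = (2048 / 4) * 2 * 0.925;  // ~947 GFLOPS DP
    std::printf("BG/Q   : %6.0f GFLOPS DP, %.1f GFLOPS/W\n", bgq_gflops, bgq_gflops / 55.0);
    std::printf("Tahiti : %6.0f GFLOPS DP, %.1f GFLOPS/W\n", tahiti_gflops, tahiti_gflops / 250.0);
    // Both land around 3-4 peak GFLOPS/W on paper; sustained efficiency on real
    // HPC codes is what would actually separate them.
    return 0;
}
```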
So, to stay on throughput-optimized cores for a while: I see a clear advantage over shader cores. If I look at the Power A2, IBM has in its hands an IP that can end up either in Blue Gene/Q or in the PowerEN, where it is backed by a lot of accelerators. I see a "win" here in multiple ways.
To some extent I would not say the Power A2 is throughput-oriented the way Larrabee was; I would say it looks like a power-optimized CPU set up to deal with pretty extensive multitasking, versus "throughput" read as optimized for maximum FLOPS. Actually I think IBM could go as far as producing a Power A2 without the FPU/SIMD if they thought that would serve a hypothetical product well.
Even though the Power A2 is a "throughput"-oriented design, it could run any code that runs on a Power7, for example. As I said in the spoiler, they may even pass on the FPU/SIMD, or bump SP throughput for example if they feel they need to; it has been designed so it can operate alongside various accelerators. One of the "wins" I see is on the business side: you have one IP whose usability exceeds what shader cores can do, so you can deploy the architecture in more markets.
Another thing I 'see' as a difference (and I don't get your point, 3dcgi) is that many-core CPUs are indeed many cores, where each core does whatever it wants. In my view the 'cores' in a GPU are not there yet; they still need help from other hardware on the GPU.
Maybe I misunderstand, but it makes a lot of difference in how I see things. For example, as CPU cores are autonomous, if they have support for dynamic clocking or power gating they can react instantly and adapt (turbo up, lower their clock speed, go to sleep). In a GPU (maybe a simplistic view) I picture the command processor as the accountant: it keeps track of everything at its own speed, which induces latency in how fast the shader cores can 'adapt', and I wonder if that somehow explains why GPUs modulate the clock speed of all their cores at once (it could simply be that in their current usage you don't need to implement such a thing on average / a misconception of mine).
Still going down that line of thinking (sorry again if it is erroneous), I see the shader cores as somehow slaves of other parts of the GPU, whereas if I look at the PowerEN, the dedicated hardware is slave to the CPU cores.
Maybe the way I see it is wrong, but looking forward to more complex designs I think the CPU is better. As chips get bigger, I would think that the shader cores' reliance on hardware outside the core just to function (I don't question the relevance of dedicated hardware, whatever its function is) should create more and more problems: introducing more and more latency, costing power, limiting the speed at which cores can adapt to workloads or power constraints (I'm speaking of advanced power management features like turbo mode, dynamic clocks, etc.). Alternatively the shader cores grow completely autonomous, but that is going to have a cost, I would think, both in hardware and in power, which would further erode their advantage in raw horsepower.
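To illustrate the kind of contrast I have in mind (purely a toy of mine, not how any real chip implements it): autonomous cores each pick their clock from their own load, while a centralized controller samples everybody and moves one shared clock for all units at once.

```cpp
// Toy sketch, my own illustration only: per-core governors versus one global one.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Core { double load; double clock_ghz; };

// Per-core: every core adapts on its own, no round trip through a central block.
void per_core_dvfs(std::vector<Core>& cores) {
    for (auto& c : cores)
        c.clock_ghz = c.load > 0.9 ? 2.4 : (c.load < 0.2 ? 0.8 : 1.6);
}

// Centralized: one decision for everybody, driven by whatever the controller saw.
void global_dvfs(std::vector<Core>& cores) {
    double max_load = 0.0;
    for (const auto& c : cores) max_load = std::max(max_load, c.load);
    const double clk = max_load > 0.9 ? 2.4 : 1.6;  // all units move together
    for (auto& c : cores) c.clock_ghz = clk;
}

int main() {
    std::vector<Core> cores{{0.95, 1.6}, {0.05, 1.6}, {0.5, 1.6}, {0.1, 1.6}};
    per_core_dvfs(cores);
    std::printf("per-core: %.1f %.1f %.1f %.1f GHz\n",
                cores[0].clock_ghz, cores[1].clock_ghz, cores[2].clock_ghz, cores[3].clock_ghz);
    global_dvfs(cores);
    std::printf("global  : %.1f %.1f %.1f %.1f GHz\n",
                cores[0].clock_ghz, cores[1].clock_ghz, cores[2].clock_ghz, cores[3].clock_ghz);
}
```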
As a side note, with regard to Nvidia supporting C++ and Python (Keldor's post), strangely I read that as bad news. To me it is a sign that they can't secure/force their customers into using their proprietary languages, a bit like the way IBM locks customers into its software environment (the hardware being secondary). If people run standard programs on GPUs, it means they are not locked into one company's offering/ecosystem and can move at any time to other architectures without much headache. To me it is more a sign of Nvidia's strategy (locking customers into their offering) failing than good news; it will make it easier for competitors... to compete on an equal footing.
I use the Power A2 as a reference because, for me, it is closer to a well-rounded CPU than the Larrabee/Xeon Phi cores, which really aim at high FLOPS throughput. Also because IBM is more advanced than Intel in delivering "many-core" products and in showing what many-core designs could look like in the near future.
I'm not sure that, looking forward, IBM will try to significantly increase the maximum throughput per core for, let's call it, a Power A3 (just the generation after the Power A2, not looking too far into the future).
A newer lithography offers various options for the extra silicon and power savings it brings:
+ they can have more of the same cores
+ they can slightly increase the clock speed
+ they can widen the SIMD
+ they can increase single-thread performance
+ they can implement more advanced power management schemes.
For some reason I think IBM will try to increase the sustained IPC: say it is 0.8 instructions per cycle now, to 0.9 or 1 for example. I wonder if they could use fairly simplistic OoO execution to do so.
It will cost power, but it also raises throughput; more importantly, I would think that is the kind of improvement that 'shows' under any workload. Even if a core were all of a sudden dealing with a single thread, performance would benefit from it.
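To put a rough number on it (my own illustration, with made-up core counts and clocks):

```cpp
// Rough illustration of why sustained IPC is the improvement that shows everywhere:
// sustained instructions/s ~= cores * sustained IPC * clock.
#include <cstdio>

int main() {
    const int    cores    = 16;     // hypothetical A2-class chip, numbers are mine
    const double clock_hz = 1.6e9;
    const double before = cores * 0.8 * clock_hz;  // sustained IPC 0.8
    const double after  = cores * 1.0 * clock_hz;  // sustained IPC 1.0
    std::printf("gain: %.0f%%\n", (after / before - 1.0) * 100.0);  // 25%
    // Unlike adding cores or widening the SIMD, this 25% also shows up when a
    // core is suddenly stuck with a single thread.
}
```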
Next are power management features; along with a slight increase in sustained IPC, they might help improve performance without touching the power budget.
Then there is the number of cores, which is set to grow according to whatever silicon and power budget IBM has for a given product where it plans to use that architecture. That would be closer to what I would call "well-rounded CPU cores".
Again there is a "win" here, a business one, that I can't see GPUs touching anytime soon: it is a pretty flexible IP that can be deployed on quite different chips intended to deal with quite different workloads.
If I look at the people that offer "online" services, pretty much compute cycles and storage, I can see that flexibility being a massive win. If you have a fleet of many-core chips that can do pretty much any task, it means you maximize the use of your fleet and are in a position to serve any kind of "cycle" your customers demand: with the same hardware you can sell cycles for data mining, run an advanced office-like service online, or run an HPC workload, and you can have different web servers based on different OSes on the same chip.
That is where I think that, even taking power into account, the "well-rounded CPU" will win over the GPU. With a fleet of servers, no matter the power savings at the chip level, once you take into account chip-to-chip communication, access to storage, etc., it will burn a lot of power; you want to be able to serve a lot of different usages for what you invest in both power and hardware.
Then there is dedicated hardware within the chip/SoC/many-core. I don't think it is going anywhere; to some extent I could describe a latency-optimized / high single-thread-performance CPU (Haswell, Bulldozer, Power7) as dedicated hardware in a world of many power-efficient "well-rounded cores".
I don't believe GPUs are going to extend their usage far and fast enough to win against CPU cores. I don't think they are going to disappear either, but there is a clear limit to the GPU power your average Joe needs, and it is pretty low. I would think Intel is there already; they are pushing further to try to attract more customers to their CPUs while the PC market slows versus the ARM realm.
I would not say that, because mobile devices (even laptops) need a low-power GPU and we may have 400 GFLOPS in a high-end phone in two years, the GPU therefore has to win in a power-constrained environment that is still completely different on many accounts (size, usage; chip-level power consumption is only part of the equation, and it can be lowered competitively, as the Power A2 shows, and that is only the beginning).
Honestly, I have a really hard time believing that a GPU design stands a chance of winning exascale, or of winning against CPUs in the broader sense for computing (maybe GPUs could keep a niche, but Nvidia opening its GPUs to standard languages is a bad sign, and even that may not happen).
I would think that if GPUs were really going to win, there would be rumors about an Intel discrete GPU and about IBM working on a GPU-like architecture, as I can't see both of those companies making the same mistake at the same time (one of them could make a mistake like underestimating the GPGPU threat, but both? I can't believe it).
Don't mistake business decisions for technical decisions. Intel likely doesn't care much about discrete GPUs because the market isn't fast growing and it isn't big enough to be substantial for them.
Yes, GPUs have a more centralized command processor to dispatch work generated by the CPU, but CPUs have uncore logic as well. Traditionally GPUs haven't needed many command processors because they were designed to receive work via a single CPU thread and the amount of work generated by each command made it unnecessary to have many command processors. This doesn't mean there's a technical reason a GPU can't have a command processor per CU or SMX.
The way I see it the main, over simplified, difference between a UPU as Nick defines it and a HSA style GPU is where the SIMD units sit. A UPU has the CPU and SIMD close together and HSA has them farther apart. Ignoring the memory subsystem the uncore is just glue to make it all work.
Point granted on not mistaking business decisions for technical ones, though I have a possibly weak counter-argument (as I have no idea of the volumes we are speaking about, nor how, if the trend were to intensify, it would affect CPU manufacturers' volumes). I've read that some companies now rely on GPUs to do data mining (on Twitter, for example). I would think that was a traditional target market for CPU manufacturers, so while I have to acknowledge that CPUs are late to the game and that the dynamics of the mobile world make it so GPUs have to be leveraged, I wonder to what extent it is not a threat if the CPU keeps being "late". Business and market dynamics can both defeat technical merit, and it works both ways.
Intel is really late and lost a lot of time on Larrabee, which went against the market dynamic. I would think Intel should have focused more all along on its Atom line; the silicon budget has allowed for years for what could have been early many-core designs, which might have prevented GPUs from gaining traction (there is no disputing that they did) in some market segments.
That (the UPU versus HSA framing, and the point that nothing technically prevents a command processor per CU or SMX) is a great way to put it. It definitely reached me, and it reminds me of things my brain insisted on ignoring for some reason, for example the introduction of the ACEs in GCN and the extension of their capabilities in GCN 1.1.
There is a picture in my mind when I think of many (CPU) cores: an army of ants. Ants are all the same, autonomous, etc. I used to picture the GPU as an army of lobotomized ants under the control of a central planner. Now I think I have a much fairer picture in mind (for the GPU):
I would picture a genius ant with no legs or teeth leading a carriage of simple-minded ants, a bit rough around the edges but extremely athletic.
So in a many-core I would count the number of ants as the number of cores, and in a GPU I would actually count the number of "carriages" as the number of cores, if that makes sense (even though it is not that simple).
At this point one may wonder why I bring ants into the discussion... but I actually read some of those books by Bernard Werber called "The Ants", and while it overstates ants' capabilities (it is a tale), it still does a great job of presenting the achievements of those tiny things we so easily ignore.
I like the ants comparison for a reason other than the one in the spoiler. I have read a few things about what I think is called biomimicry (i.e. copying the solutions mother nature chose for problems we also face in engineering), and while ants are more complex than silicon chips, I still believe the analogy is not completely out of place.
Digging further into the "bug's life" analogy, I could compare the sum of the tasks presented to a colony of ants to the sum of the tasks in computing. Looking at the number of ants in an anthill, there is definitely a lot of parallelism to exploit; the issue is how you exploit it.
Overall, across the various ant species, the bulk of the colony (by a massive margin, usually) is made up of your "average" ant; some species even rely only on the 'average' ant, with no specialization at all.
Looking at what I read here and there, the computing world doesn't look that different: there is plenty of parallelism to be exploited. Though if mother nature is any hint, I would think the bulk of the tasks presented to a colony (or in computing) is not where the GPU's comfort zone is; the GPU is a specialized unit designed to deal with the extreme cases of parallelism. Somehow I could believe that where GPUs operate is not where the bulk of the parallelism is (a GPU would be a soldier ant taking down ten 'average' ants in a fight against another colony).
So where am I going with this? I think the "unit" of computing going forward is neither the soldier (say the GPU, though the "carriage" analogy is more accurate to me) nor the queen (the sexually differentiated ants => the big CPU core), but indeed your average ant.
I feel sorry to oppose your pretty analytic views based on actual knowledge (yours, 3dcgi, but the others' too) with my attempt at a more synthetic approach to the issue, blending an (attempted) synthesis of what I read here and there, comparisons, etc. But the sad fact is that I can't do any better on the topic, so we don't speak exactly the same language, though part of the message still gets through. Another sad fact is that I will dig further into my "bug's life" type of analogy...
The ants analogy, if it ever made sense, is still not quite right. CPU cores are not "social" animals the way ants are. Ants have pretty advanced ways to communicate; the CPU evolved more as a solitary animal. To some extent the GPU, if I compare it to a carriage of ants, is actually more of a social animal than the CPU cores and than what would be the colony in a many-core design.
What I'm trying to say here is that CPU cores have, as their only means of communication between cores/threads, coherent reads and writes. Ants can exchange information through direct antenna contact, or leave trails (odors); the analogy breaks a bit here, but I guess you guys will still get what I mean.
The GPU has more means to share and exchange information: non-coherent access to some memory pools, etc. It won't go through a time- and energy-consuming antenna contact when it is not needed. I could extend the comparison to other (social) species that release pheromones under specific circumstances to alert the colony / the neighboring members of the colony.
To some extent, CPUs would be ants that rely on antenna contact for everything.
If I transpose back into the silicon world, I could say that with the SCC Intel is trying to find a way around that fact and to search for more efficient, scalable means of communication between many cores. It is still too CPU-centric in its view, though.
Ultimately, if I do a synthesis of what I wrote, I'm left with a dilemma: whether I look at the CPU "guys" or the GPU "guys", I see some sort of insulation. I'm not sure the CPU guys acknowledge what is good in GPUs, and the other way around.
I think that CPUs (on top of some network grid, as in the SCC) should borrow some of the GPU's means of communication:
+ non-coherent reads and writes to a local or global data share / scratchpad
+ some type of read-only memory pool
To put it 'in technical terms', CPUs should more seriously consider (and already implement) some of the solutions GPUs adopted to deal efficiently with the SPMD style of programming, as it seems developers won't get anything better in the near future. As the number of cores scales, I would think that relying only on "coherent" communication is going to hinder what many-cores can achieve. To me that could be more important than actually matching the raw throughput of the GPU. So to speak, the magic might not be in the GPU's SIMDs but in how they can communicate.
I also have in mind what a top Nvidia architect said in a presentation about "auto-tuning" with regard to the scratchpad; it seems to bring great benefits for the compiler to do its thing.
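To make it a bit more concrete, here is a toy sketch (my own code, not any existing API) of what that could look like transposed to a CPU cluster: the "scratchpad" is just a small array standing in for a local data share, the relaxed atomics stand in for non-coherent writes, and the threads only pay for one explicit synchronization at the barrier instead of contending on a coherent shared counter for every update.

```cpp
// Toy C++20 sketch, my own illustration of a software "local data share" on a CPU cluster.
#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kClusterThreads = 4;
std::atomic<long> scratchpad[kClusterThreads];  // one slot per thread in the "local data share"
std::barrier      sync_point(kClusterThreads);  // the single explicit sync point

void worker(int tid, const std::vector<int>& data) {
    long local = 0;
    for (std::size_t i = tid; i < data.size(); i += kClusterThreads)
        local += data[i];
    // Relaxed store: no ordering requested beyond the write itself.
    scratchpad[tid].store(local, std::memory_order_relaxed);
    sync_point.arrive_and_wait();               // the barrier makes all slots visible
    if (tid == 0) {
        long total = 0;
        for (auto& slot : scratchpad)
            total += slot.load(std::memory_order_relaxed);
        std::printf("sum = %ld\n", total);
    }
}

int main() {
    std::vector<int> data(1 << 20, 1);
    std::vector<std::thread> threads;
    for (int t = 0; t < kClusterThreads; ++t)
        threads.emplace_back([t, &data] { worker(t, data); });
    for (auto& th : threads)
        th.join();
}
```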
My analogies lead me to think that the ant colony is still the right model and that "well-rounded" CPU cores should still be the basis of computing, though CPUs as they are now are not social enough, or are too strict in their means of interaction (they should widen their options without giving up their benefits).
I took the Power A2 as an example earlier, but I don't think it is that close to what I picture; it is still too simplistic a core. The AXU port is a good idea though: it would allow some specialization of the cores, not that different from what ants do.
At their "core", specialized ants are still ants; so to speak, in the silicon world the "front end + scalar pipeline" does not change.
My belief is that it would be easier for a well-rounded CPU to evolve into that silicon ant that could be the "unit of compute" going forward. It needs to get more social, to grow more antennae, etc., but I think that is where the parallelism is, and if nature is any hint, that is where people should be heading.
I would push it further: the model is not the ant or even the colony but the whole anthill, with the uncore and some fixed-function units (ants farm fungus, and raise and milk aphids).
If I were to describe such an ant (and I shouldn't, since the purpose of the analogy was to avoid getting there and quite possibly to avoid nonsense due to my lack of technical background), I would think of a mix of what AMD calls a "compute cluster" for its Jaguar cores; I think topology is important.
For the core, a Jaguar/Krait/Swift that supports 2-way SMT and has 2 FMA units could be the right balance. It could have the ability to operate on vectors twice the width of its SIMD to lower power consumption (hiding latency on the cheap), and it would have a range of clock speeds at which it can operate (depending on the circumstances, the workload and the power budget available).
X of these cores would be packed together sharing a last level of cache (the L2 in a Jaguar-based compute cluster). But it would also borrow from the GPU, with support for non-coherent reads/writes and scratchpad memory (and give the compiler room not to use a machete when a knife is enough), and those clusters could be linked as in the SCC.
There would be dedicated hardware, depending on its type, linked to / within a "compute cluster" (even a "big" CPU core, or some general-purpose kind of texture unit in charge of gathering data for the cores to process) or accessible through the grid. I'm speaking of really specialized hardware that grants a massive gain in achieved perf per watt (and I don't think the GPU, going further, will provide enough of a win versus "well-rounded" cores, especially if those can benefit from the support of proper accelerators).
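Just to make that building block concrete, here it is spelled out as a parameter list (every number is mine, purely hypothetical, picked only to give the picture some shape):

```cpp
// The hypothetical "silicon ant" cluster as a parameter list. All numbers are mine.
#include <cstdio>

struct CoreConfig {
    int    smt_threads     = 2;    // 2-way SMT
    int    fma_units       = 2;    // two FMA pipes per core
    int    simd_width_bits = 128;  // executes 256-bit vectors over two cycles
    double min_clock_ghz   = 0.8;  // per-core dynamic clock range
    double max_clock_ghz   = 2.4;
};

struct ClusterConfig {
    CoreConfig core;
    int  cores           = 4;     // Jaguar-style compute cluster
    int  shared_l2_kib   = 2048;  // last-level cache shared by the cluster
    int  scratchpad_kib  = 64;    // GPU-style local data share, software managed
    bool non_coherent_rw = true;  // relaxed / non-coherent loads and stores
    bool mesh_link       = true;  // SCC-style grid between clusters
};

int main() {
    ClusterConfig c;
    const double lanes = c.core.simd_width_bits / 32.0;  // SP lanes per FMA unit
    const double peak_sp_gflops =
        c.cores * c.core.fma_units * lanes * 2 * c.core.max_clock_ghz;
    std::printf("one cluster, peak SP: %.1f GFLOPS\n", peak_sp_gflops);  // 153.6
}
```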
I think some people will say "but that is where GPUs are heading"; I feel like GPUs/shader cores, like big CPUs, are already "too big" and "too wide", and that there are existing CPUs that look like a good basis to start from.
To some extent I think the "GPU guys" could actually get there faster (outside of ISA and legacy constraints) if, instead of building GPUs, they were to create a many-core based on CPUs, while evolving the CPU slightly into a more social animal.
I would think a lot of GPU guys are damned bright CPU people who explored parallel computing, and I wonder if at this point they should settle on the kind of relatively tiny and power-efficient CPU core I'm thinking of and build from that, with all they've learned from both realms.
PS: really sorry if all this makes no sense; I tried my best at least.
EDIT
A footnote: mother nature never chose the "carriage model", even in the pretty complex organizations that exist among mammals. Actually only mankind attempts it, in those twisted forms of work organization where HR wants people to be mostly instantly replaceable, chooses not to leverage what makes humans what they are, and encases their cunning in absolutely rigid processes, etc., for the sake of achieving what could be called pretty inhuman tasks.