The ISA for a UPU

As CPUs tend to crawl through large (parallel) data sets much slower than GPUs, it isn't such a constraint for pure CPUs as it is for GPUs (or APUs).
Well, I guess large data sets greatly reduce cache usefulness, so bandwidth becomes more important.
There is also the GPU's capability to hide latencies while working on many requests at a time.
What I don't get is why a CPU would not benefit from the extra bandwidth.
As for the number of pending requests handled in parallel, I guess GPU behaviour can be mimicked in software.
Another thing I'm not sure I get is "pure CPU"; single-core, single-thread chips are pure CPUs, I guess.
Multi-core chips are also pure CPUs, I would guess, so why would many-core chips not be considered "pure CPU"?
CPUs are just entering the many-core era; the range of what they can do efficiently is going to widen, so I would guess their bandwidth requirements are going up.

Because for certain workloads you need that bandwidth. If you scale CPU bandwidth up, your FLOPs/watt number is going to drop. If you want to see how disingenuous that is, imagine Nvidia released a TITAN GPU with only a single 64-bit GDDR5 memory channel clocked very low (iow modern CPU bandwidth). The FLOPs/watt of such a chip would be very good on paper, but I doubt they would have customers lining up for such a device.

Indeed, the power GPUs spend per GB/s of bandwidth will be significantly reduced, making comparisons more accurate.
That's not what I meant; the memory technology used by CPUs and GPUs is going to converge soon.
So it is not relevant to compare CPUs (not to mention 2/4 big-core chips) with GPUs by looking back at things like DDR3 or GDDR5.
And the power cost will be the same for the CPU and the GPU.

Now, I do not believe that one size fits all. The nice thing is that I could see somebody like Intel shipping multiple 8+ core products for different workloads, based on different cores but supporting the same functionality (same ISA) and able to run the same code.

Now if the point is to compare Ivy Bridge to a modern GPU, it is a bit moot, as imo CPUs are about to enter a new era.
Again, I don't think that using only big cores is the way to go, but on the plus side it is not like Intel (or IBM, or the competitors on the ARM side) can only design one architecture.
I could see Intel dealing with three types of cores for a while:
the big cores (SB, IB, Haswell, etc.)
the middle-of-the-road cores (the Atom line)
the throughput cores (Xeon Phi)
 
Last edited by a moderator:
This thread is from an era where power was not THE issue. Well, it turns out it is THE issue :)
Well, to what extent do GPUs have a power advantage over something like the Power A2 or Xeon Phi (though they don't seem designed to handle the same type of workloads)?

I would wait to see how this first generation of products fares; actually I would even wait until the second generation.

As for graphics, imo if they get the throughput cores right, it should not be that much trouble to keep a relatively tiny, extremely power-efficient GPU somewhere on the die.
One could do what Intel used to do with geometry on its early IGPs (process geometry on the CPU cores) and what DICE did on the PS3 (some shading on the CPU cores).
 
As a side note, reading this thread one would think that the GPU has already won "the war": Nvidia got the contract and will be powering exascale and so on, AMD's approach to the APU (with a "big" GPU relative to the competition) won, and so on.

None of that has happened yet; actually the first generation of many-core chips pretty much just shipped (Xeon Phi and Power A2). The fight is ahead of us, not behind.
The disregard toward many-core CPUs is not too surprising on a forum focused on GPUs, but I think the pretense (or what sounds like it to me, or close to it) that GPUs have already "won" is quite a stretch.

Xeon Phi is a lot like a GPU. Much closer to a GPU than to a CPU, in fact.
 
Coming from a layman's perspective and reading this thread, I quickly saw a pattern arise:

1. Points made seem reasonable but examples used are heavily biased to prove a hypothesis
2. Peers point out that the examples are not reasonable for reasons x and y, making the hypothesis an ideal that is not sound or based on current reality
3. Counters are made insisting the examples are fine and people just don't get it.

This is repeated several times over.

In the end the thread, and excuse me for the language, seems like a lot of intellectual ego stroking.

Also, I have read this same thread a few times over the past few years... it begins to get a bit pointless after a while, unless I am missing something and we are on the verge of a great intellectual discovery?!

This thread hasn't thrown up anything new. The points repeated here have been recycled from older threads on B3D. Since no new data has been presented, this thread wasn't going to amount to anything.
 
Xeon Phi is a lot like a GPU. Much closer to a GPU than to a CPU, in fact.
Well, in many regards that is true: it sits on the wrong side of the PCI Express bus and has a GDDR5 memory interface.
But at bottom it is still a many-core CPU, to me at least.
This thread hasn't thrown up anything new. The points repeated here have been recycled from older threads on B3D. Since no new data has been presented, this thread wasn't going to amount to anything.
Indeed, that is completely true. I was just pointing out that, when addressing Nick, people acted as if it were all settled and CPUs had lost.
As far as I know, neither Intel nor IBM has given up on many-core chips; the situation is more that the first products just shipped.
Intel has a lot of know-how on the topic; they have already designed quite a few chips: Polaris, Larrabee, SCC and Xeon Phi. I would be surprised if those experiments don't translate into neat products in the coming years.
I would also be surprised if Intel were beating a dead horse; they must think they have a shot.

As for graphics, I'm not sure it is worth giving up on the GPU, especially as the needs of average users are not that high and GPUs have a power advantage.

Reading this thread, or all the others as you point out, I would think that the real issue is how much "GPU" we need. Sebbbi had an interesting post showing that lots of parts of rendering map pretty well to CPU cores.
Ultimately, what do we do? Do we have a massive GPU and a few serial-optimized CPU cores, or do we keep specialized hardware only for the problematic parts of the graphics pipeline (problematic both in performance and in perf/Watt) and have lots of well-rounded CPU cores?

To me the latter sounds better, as I feel GPU SIMD arrays will have a hard time being good at anything but the most data-parallel workloads without losing their perf and perf/Watt advantage.

My gut feeling as an outsider / forum warrior is that the basis of computing, in a world indeed bound by power-consumption concerns, should be a bunch of "well-rounded" cores + specialized hardware. By "well-rounded" cores I mean CPU cores with sane single-thread performance and nice throughput. They could look like Atom, Swift or Krait type CPU cores with improved SIMD and more support for SPMD-type languages.

The whole thing would be mostly homogeneous, with support from specialized "parts" which could actually include a few "big" cores, a GPU, a VPU, any accelerator that makes sense. But the bulk of the computing should happen on those "well-rounded" cores. It may imply giving up some performance on the most (data-)parallel workloads, but also on the serial ones.
Not that manufacturers could not blend things however they see fit to target a given market and the matching workloads, but I think the focus for the likes of Intel should be on those middle-of-the-road cores, not the super throughput-oriented ones à la Xeon Phi / Larrabee, or the big ones à la SNB, IB, etc.

Then, when it comes to servers and HPC, it seems there is more to scaling than the performance of a given chip in isolation, with IO, etc. For example, I would guess there is more to the performance of Blue Gene/Q than the performance of the chip in isolation.

Sorry for the rant; indeed, let's wait and see, but I think people here are a bit too enthusiastic about GPUs. Consoles are about to set a ceiling (and not a crazy high one) on GPU parts, or on the extent to which they will be used beyond increased resolution, for the next 5+ years.
The scaling of GPUs is already slowing down. With the ceiling put on game development by consoles and the improvements on the memory front, IGPs are going to get competitive, and I believe the market for discrete parts will shrink significantly mid-term.

Then it is quite open (and to some extent a financial match) which will generate enough money to support crazy expensive development: GPUs that are foremost only good at massively parallel computations, or CPU cores (and you can have different types of those, but running the same code, using the same compiler, the same toolchain, etc.).

It is a massive stretch to assume that CPUs are going to lose, imo, and even though I don't know much, I don't feel like I'm overdoing it on the matter.

Edit
Actually, I could see the "big" GPU mostly finding a niche in some professional markets where the hardware is already a match for the needs of those markets; further GPU improvements may even hurt perf/Watt in markets which could very well be just fine with more of the same.
Nvidia and the like could sell the platform and software à la IBM.
Anyway, I think the bulk of computation will be handled by CPUs in the not-so-distant future.
 
Last edited by a moderator:
Well, in many regards that is true: it sits on the wrong side of the PCI Express bus and has a GDDR5 memory interface.
But at bottom it is still a many-core CPU, to me at least.

The terms sure are getting muddied..

Seems like there are two major criteria at hand - whether or not a device uses a bunch of fixed function hardware specifically geared for traditional graphics rasterization and whether or not the compute is latency optimized vs throughput optimized. Xeon Phi is obviously not using GPU fixed function hardware but is also pretty obviously throughput optimized by having a bunch of low clocked CPU cores that dedicate most of their area to wide execution units and storage instead of scheduling logic.

Most of this thread (and the last thread about this) has been arguing about the latter point, despite the original topic being more about the former one. Most of the argument has also not been about an approach winning or losing, but about whether or not it makes sense to have big separate hardware blocks that are optimized for these different types of workloads vs one unified design that handles both.
 
Well, if CPUs really turn out to be more power & area efficient, then we should expect CPU only designs to show up first in mobile devices as that is where perf/watt is at the highest premium.
 
Well, if CPUs really turn out to be more power & area efficient, then we should expect CPU only designs to show up first in mobile devices as that is where perf/watt is at the highest premium.
Well, I don't want to get rid of the GPU completely either, especially when it comes to devices supposed to burn only a couple of Watts.
I'm not sure it is a good idea either to inflate the CPU cores (and make them even more complicated, costly to design, etc.) to the point where they can handle the critical parts of the graphics pipeline competitively.
I'm not sure either that CPUs can win on the most data-parallel workloads. But the issue here is: are the latest evolutions of GPUs a win in perf/Watt on those workloads?
I wonder whether, as they try to do more, GPUs may lose their edge (or, let's be fair, part of it) in perf/Watt on workloads for which they are already quite optimal / where there is not much to be won by architectural improvements.

Something that works against GPUs, to me, is that manufacturers are stuck to some extent with the market they foremost address: graphics. Could they make such a trade-off: OK, I give up 20-30% of performance in 3D rendering and the most data-parallel workloads, but in exchange I land a massive win in "compute performance"?
Producing chips that are already at the limit in both die size and power consumption, I think they can't. Nobody would have an incentive to buy that generation of products.

But there are other hurdles. In a many-core chip each core is obviously autonomous, and cores can be shut down or turboed independently. Wouldn't it be difficult to do that on a GPU, where the "central planner" has to get the memo, make the decision, etc.?
So far I've seen no hint of an upcoming architecture that would be able to clock the SIMD arrays at different speeds (or shut one down). At the core of the GPU paradigm is the assumption that you never run out of parallelism and work, which is true, but only for a few workloads.

They are trying to depart from that, but to me they are bound to that paradigm at the lowest level.

To me it comes down to ants and what Mother Nature chose: ant colonies as they are, or "centrally planned colonies of lobotomized ants"? She chose the former.
That is, to me, the main issue for GPUs: a many-core CPU is still many CPUs, whereas a GPU "core" is completely dumb. I read a bit about Nvidia's plans and I see no change: dumb units and central planning. I believe there is some truth in "bio-engineering" (looking at the choices made by nature); it is an ugly model, and I can't bring myself to see such a thing "win".
 
I didn't really understand much of that, but allow me to sum up my thoughts on the matter in the simplest form possible.

Don't say it, do it.
 
Well, in many regards that is true: it sits on the wrong side of the PCI Express bus and has a GDDR5 memory interface.
But at bottom it is still a many-core CPU, to me at least.
Not sure why you consider it to be a CPU. It's optimized for throughput, and shows few, if any, latency optimizations.
Reading this thread, or all the others as you point out, I would think that the real issue is how much "GPU" we need. Sebbbi had an interesting post showing that lots of parts of rendering map pretty well to CPU cores.
It's a question of power.

A data-parallel algorithm is going to map equally well to a wide vector unit, irrespective of the existence of a wide OoO engine nearby. You should execute it where it consumes less power, and that means a throughput-optimized core.
 
As a cuda/opencl developer, I'd argue that there is no significant difference between a GPU and a many-core CPU. Sure, the GPU has specialized graphics stuff on the chip, but so what? CPUs also have some very specialized stuff, such as Sandy Bridge's video encoding hardware.

Mind you, OpenCL has a very primitive programming model, but that's more due to lowest-common-denominator concerns and vendor wars than anything else. Contrast it with CUDA, which has full C++ support (with a few minor missing features, such as varargs). Heck, there's even a Python compiler (numbapro) that targets CUDA now.
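For the curious, here is a minimal sketch of what that C++-on-GPU model looks like (my own toy example, not something from the post): a templated kernel compiled by nvcc, where the device code is ordinary C++ plus a launch syntax.

```cpp
// Minimal sketch (illustrative only): a templated CUDA kernel showing that
// device code is plain C++ with a few extensions (__global__, the <<<>>> launch).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

template <typename T>
__global__ void saxpy(int n, T a, const T* x, T* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<float><<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                    // expect 5.0
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```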

The real question is whether throughput optimized hardware can coexist with latency optimized hardware at the level that Nick has suggested. I'm thinking not, if for no other reason than that it would be difficult to put them in the same clock domain. High clock speeds are quite nice for serial performance, but actually hurt parallel performance, since the die area and power needed to run at that speed are unfavorable, i.e. you can fit in enough extra low-clocked cores that you get better aggregate performance than from the larger, more power-hungry high-clocked cores.
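To put rough numbers on that trade-off, here is a back-of-envelope sketch (mine, not from the post above), using the usual first-order assumptions that dynamic power scales as P ∝ C·V²·f and that supply voltage scales roughly with frequency:

```latex
% Back-of-envelope sketch, assuming P \propto C V^2 f and V \propto f.
P \propto C V^2 f \propto f^3
\quad\Rightarrow\quad
\text{1 core at } f:\ \text{throughput} \propto f,\ P \propto f^3
\qquad
\text{2 cores at } f/2:\ \text{throughput} \propto f,\ P \propto 2\left(\tfrac{f}{2}\right)^3 = \tfrac{f^3}{4}
```

Under those assumptions, two half-clocked cores deliver the same aggregate throughput at roughly a quarter of the dynamic power (ignoring leakage and the extra area), which is the "slow but wide" argument in a nutshell.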

Another issue is the memory interface. There's a reason that CPUs use DDR3 while GPUs use GDDR5. GDDR5 has better bandwidth, but worse latency than DDR3, meaning it's a win for a highly parallel system, but a loss for a serial system. This divide is likely to only get deeper, since moving to a wider bus, needed for high bandwidth, will force the clock speeds down due to power constraints. Again, the low-clocked and wide approach has more bandwidth than the narrow, high-clocked approach, but the lower clock hurts latency.

Basically, there's a fundamental difference between fast but narrow and slow but wide.

Finally, fusing the GPU to the CPU at the ISA level would mean that the developer can either use the full CPUish performance, or the full GPUish performance, but never both. Consider that graphics is quite separate from game logic, hence little synchronization is required. It seems unlikely that the extra cores you could gain from merging would offset the contention. Case in point: tablet/smartphone SOCs have a large assortment of special function hardware - for something like the Apple A6, the CPU, including cache hierarchy, is only around 20% of the die. The GPU is slightly larger, maybe 25%. The rest is filled with various special function hardware. Clearly the designers thought an assortment of special purpose hardware outperformed general purpose hardware, even when only a fraction of it is used at any given time.

A final note is that serial performance seems to have hit a wall. For instance, early benchmarks of Haswell show a whopping 10% performance increase (and in some cases an actual regression!) for serial workloads. It's also unlikely that extensions such as wide SIMD can be used much for true serial performance, since you just don't have enough ILP to utilize them. Sure, maybe you can work on multiple iterations, but that would be a good sign that the problem is highly parallel after all. The future is either stagnant or else increasingly parallel. Developers will just have to learn to deal with it.
 
Last edited by a moderator:
Another issue is the memory interface. There's a reason that CPUs use DDR3 while GPUs use GDDR5. GDDR5 has better bandwidth, but worse latency than DDR3, meaning it's a win for a highly parallel system, but a loss for a serial system. This divide is likely to only get deeper, since moving to a wider bus, needed for high bandwidth, will force the clock speeds down due to power constraints. Again, the low-clocked and wide approach has more bandwidth than the narrow, high-clocked approach, but the lower clock hurts latency.
Actually, GDDR5 has the same latency as DDR3 in absolute terms; the DRAM cells are the same after all.
And future wider memory interfaces (WideIO2 / HBM) won't change that. Whether you use a prefetch of 8 and transmit the data over a narrow interface running at 8 times the speed, or use 8 times the width at the speed of the DRAM array, won't influence the latency much. ;)
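A rough sketch of why, with generic timing terms rather than vendor numbers (my illustration, under the assumption that the array timings dominate):

```latex
% Read latency is dominated by the DRAM array timings, which DDR3 and GDDR5 share;
% the interface only changes the burst-transfer term.
t_{\text{read}} \approx
  \underbrace{t_{RCD} + t_{CL}}_{\text{array access (same cells)}}
  + \underbrace{\frac{\text{burst beats}}{f_{\text{bus}}}}_{\text{burst transfer}},
\qquad
\frac{8 \text{ beats}}{8f} = \frac{1 \text{ beat}}{f}
```

That last equality is the prefetch-of-8 point: shipping eight narrow beats at 8x the clock takes the same time as one beat over a bus eight times as wide at the base clock.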
 
I didn't really understand much of that, but allow me to sum up my thoughts on the matter in the simplest form possible.

Don't say it, do it.
Well, I have to agree with that; Intel (as they were the only ones to give it a try) has been pretty late to the party.
Larrabee ate a lot of development time; Intel may have learned a lot, but as far as market penetration is concerned... :LOL:

Anyway, now that it is all said and done, I think Intel had no proper business plan for the product; they needed a console deal, imo. Assuming the part was competitive (which it could have been; according to people who have played with it, it wasn't bad), it would have meant MSFT loosening the API grip on the market, and as for Sony, after the Cell I'm not sure they were willing to try something exotic.
So in fact the product had no market to speak of outside the PC world, where the hardware could not be competitive (emulating the fixed graphics pipeline with a software renderer doesn't sound like a win to me, and it comes at a cost).
Not that I believe the product was perfect either. It was sort of a CPU, but mostly without any of the strong points of a CPU => really sucky single-thread performance per cycle and a really low clock speed => sucky.
While I do get the people saying that the ISA for the core is irrelevant, I'm not sure x86 was the best choice for such an architecture, i.e. an in-order CPU. x86 doesn't provide many registers; something like ARMv8 could allow not-too-sucky performance as far as serial execution is concerned.

Still, I'm willing to see what follow-up they give the Xeon Phi and how all their previous experiments (Polaris, SCC) come together. Though I don't think they will ditch their GPU line for quite a few years, since beyond technical merit the market is still bound to APIs and the matching, quite specialized, hardware.
 
I keep thinking we all live in separate and non communicating universes. In the universe where I live power is the N.1 concern, from phones to peta or exascale machines. Fixed function HW won't be replaced by programmable HW in any significant way any time soon. One day the (in)famous wheel of reincarnation will spin again but that day doesn't seem to be exactly around the corner.
 
Not sure why you consider it to be a CPU. It's optimized for throughput, and shows few, if any, latency optimizations.
It is indeed completely optimized for throughput, but I don't see how that makes it a GPU.
It is still a bunch of autonomous CPU cores, albeit with wide SIMD units.
I mean, if I use the same criteria, early CPUs were not CPUs back when they had no L2, execution was in-order, etc.
For me the difference is that a CPU (or, for that matter, a proper VPU) is autonomous; to some extent I'm wary of calling a GPU "many-core", as the GPU is still quite a monolithic entity as I understand it.

It's a question of power.

A data-parallel algorithm is going to map equally well to a wide vector unit, irrespective of the existence of a wide OoO engine nearby. You should execute it where it consumes less power, and that means a throughput-optimized core.
I agree with that, though the issue GPUs seem to face now is that they are trying to extend their use to workloads that offer a lesser amount of data parallelism, right?
I think CPUs have a shot; they won't ever win on the most data-parallel workloads, but they could get close enough. My view is that they should try to land a win "by convenience".

I think Nick has some points on lowering the power usage of CPU cores while dealing with not-that-latency-sensitive workloads. Still, I don't think the big-core approach can fit every usage.
If I were to bet, the best chance for many-core chips would be to go with well-rounded cores, with acceptable single-thread performance and sane throughput too; that sounds to me like the most convenient way to win "by convenience"... Sorry for the lame wording :oops:
 
I keep thinking we all live in separate and non communicating universes. In the universe where I live power is the N.1 concern, from phones to peta or exascale machines. Fixed function HW won't be replaced by programmable HW in any significant way any time soon. One day the (in)famous wheel of reincarnation will spin again but that day doesn't seem to be exactly around the corner.
Well, I agree with your POV, but "shader cores" are already programmable and, I would guess, burn quite some power.
Does Intel have figures on which parts of Larrabee were most offensively less efficient than their GPU counterparts, for example?

I mean, I read a few Nvidia papers (focused on general-purpose use of GPUs); they speak a lot about data movement and how costly it is. But when I look at a GPU, data in fact moves around quite a lot, from the shader cores to or from the ROPs for example. Looking at the topology of a GPU vs something like Larrabee, I would assume that moving data around is more costly than in Larrabee, where a tile doesn't move further than the L2 (and, I guess, is processed from the L1).

So down the road (or soon), are ROPs goners? And if they are out, how much of a perf (and thus perf/Watt) advantage remains vs something like Larrabee?

I think the real question is which parts are the performance (perf/Watt) critical ones that need to be "saved", along with considerations like chip topology: do the engineers put all the fixed/specialized hardware together in blocks, with quite some energy used to move data around, or do they try to spread those units across the die (even if it means fewer functional units)?

I could see some form of "grape/cluster" approach, not that different from the SCC, just pushing the number of cores further and including relevant accelerators able to work efficiently with your "main" compute resources (be they shader cores or some form of CPU), with the whole thing on a grid of many "clusters".

I think the argument is not fixed/specialized units vs programmable ones, but more shader cores vs CPU cores. I can't help but think that CPUs are more flexible; they may give up a bit, or quite a bit, of raw throughput, but to what extent and under which circumstances?

CPUs are somewhat expensive with their big front ends (vs GPUs), but they are autonomous: they can more easily adapt their clock speed to demand and power consumption, they can steal jobs from each other, etc. GPUs are still not there; they rely on a central planner. To me, something doesn't square with the talk about locality, data movement, etc. that I read in Nvidia's papers, for example: you want locally capable units, able to react on the go without sending messages to central planner(s) quite far away on the chip.
That is a big advantage I see for CPUs: they are "autonomous" in many ways. To get there, GPUs will have to pay a price, and to what extent that "price" will differ from the "price" CPUs are already paying is, for me, an interesting question.
 
Last edited by a moderator:
I said something very simple which is very clear to everyone that pays a bit of attention to this industry, how you could extrapolate all of that stuff from it I don't know. Also I'd appreciate if you would refrain from speculating on statements I never made. I don't post much anymore also because of stuff like that.
 
I said something very simple which is very clear to everyone that pays a bit of attention to this industry, how you could extrapolate all of that stuff from it I don't know. Also I'd appreciate if you would refrain from speculating on statements I never made. I don't post much anymore also because of stuff like that.
I was not trying to put words in your mouth; that was a joke, admittedly possibly a bad one (I removed it completely).

Still, I'm not sure how to read your (previous) posts: so down the line (and it was also the consensus between Andrew Richard and Tim Sweeney in an interview they gave) there will be mostly specialized hardware, and wrt graphics we could be back to fixed-function hardware?

But to me that is really far down the line; for now we speak of many cores (CPU or not) plus various other units, and looking at the needs (ours as a civilization) we speak of many, many chips. We are concerned with data movement between units within a chip or its closest pool of memory, but when it comes to servers and clouds, the overhead/cost (in all regards: power and performance) of communication between chips, memory pools and storage is crushing.

Focusing on on-chip power consumption, I clearly see the benefit of devices (CPU-type ones) that can do whatever work is needed on some data without having to move the data around; as Nvidia points out, execution itself doesn't cost much. But then they are a bit unfair and compare GPUs with big CPUs, which burn a lot of energy before they can execute anything.
It is not intellectually fair, as the CPU doesn't deal with the same work in the first place, and more importantly it is not a universal rule that CPUs have to behave that way. CPU designers have pursued increases in serial performance, and it came at that cost, but as the focus shifts I don't see why there is a pretense that if CPUs were to take a few steps back down that road they should be considered GPUs (not your statement). Because GPUs want to deal with more than the most obviously data-parallel workloads, are they no longer GPUs?

Overall, for me, GPUs will have to pay part of the price CPUs pay (in power) so that they don't rely only on many threads to keep their ALUs busy and achieve their throughput. Then you will have two types of units, with maybe a 2x or 3x gap in throughput in favour of one on the most parallel workloads, and the same ratio in favour of the other on less parallel workloads (and it gets blurrier "in between", depending on how much of your die you spend on a given type of those resources).

Another advantage I see for "well-rounded cores" (so neither Haswell-type nor Larrabee-type) is that, if done right, you can easily play with the amount of cache on die (depending on the effectiveness of cache under a given workload).
You can tweak, but you keep the same tools, compilers, etc.
For example, IBM could put a lot more Power A2 cores with less cache on a chip if there were a need for that. Though the Power A2 is not what I would qualify as a well-rounded core.

So, for a relatively tiny difference once you start connecting many chips, is it sane to deal with two types of resources? Wouldn't load balancing be problematic?

And if you can't elaborate (or "explanify" for people like me) further than your previous posts, I'll deal with it, whatever your reasons ;)
 
Last edited by a moderator:
It is indeed completely optimized for throughput, but I don't see how that makes it a GPU.
It is still a bunch of autonomous CPU cores, albeit with wide SIMD units.
I mean, if I use the same criteria, early CPUs were not CPUs back when they had no L2, execution was in-order, etc.
For me the difference is that a CPU (or, for that matter, a proper VPU) is autonomous; to some extent I'm wary of calling a GPU "many-core", as the GPU is still quite a monolithic entity as I understand it.

So what, then, is the difference between a "GPU" like GCN and a "CPU" like MIC with some specialized hardware tacked on? The architectures are pretty much identical...

As for autonomy, this is much more related to the driver model used by most OSes than to any sort of hardware shortcoming. Fixing this properly would require hardware tweaks, but also an overhaul of the OS. Now, hardware like Kepler (well, GK110) is capable of running fully autonomously (it can allocate memory and launch kernels without any input from the CPU), but only up to the 2-second watchdog timer limit, after which the OS decides that the GPU has become unresponsive, restarts the driver and reboots the GPU.
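For a flavour of what that device-side autonomy looks like in code, here is a hedged sketch using CUDA dynamic parallelism as it existed for GK110-class hardware (assumes sm_35 and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true; the kernels are made up for illustration):

```cpp
// Sketch of device-side autonomy via CUDA dynamic parallelism: the GPU allocates
// memory and launches further kernels with no round trip to the host CPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;                          // trivial work item
}

__global__ void parent(int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int* data = (int*)malloc(n * sizeof(int));    // device-side heap allocation
        for (int i = 0; i < n; ++i) data[i] = i;
        child<<<(n + 127) / 128, 128>>>(data, n);     // device-side kernel launch
        cudaDeviceSynchronize();                      // wait for the child grid (CUDA 5.x-era API)
        printf("child done, data[1] = %d\n", data[1]);
        free(data);
    }
}

int main() {
    parent<<<1, 32>>>(1024);      // the host only kicks off the parent once
    cudaDeviceSynchronize();
    return 0;
}
```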

ROPs aren't exactly going anywhere, since their main purpose is to synchronize pixel writeback when you may or may not have different threads working on overlapping triangles. What is possible is that they will eventually be exposed as/through atomic intrinsics.
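For a flavour of what "ROP work through atomics" could look like, here is a hypothetical sketch (my own, not how any real driver does it): a compare-and-swap retry loop that serializes the read-modify-write on a packed RGBA8 pixel, which is the synchronization job ROPs do in fixed-function hardware.

```cpp
// Hypothetical sketch: emulating ROP-style blend synchronization with atomicCAS.
#include <cstdio>
#include <cuda_runtime.h>

__device__ void blend_pixel(unsigned int* pixel, unsigned int src, float alpha) {
    unsigned int old = *pixel, assumed;
    do {
        assumed = old;
        unsigned int blended = 0;
        for (int c = 0; c < 4; ++c) {                       // blend each 8-bit channel
            unsigned int d = (assumed >> (8 * c)) & 0xFFu;
            unsigned int s = (src     >> (8 * c)) & 0xFFu;
            unsigned int b = (unsigned int)(alpha * s + (1.0f - alpha) * d);
            blended |= (b & 0xFFu) << (8 * c);
        }
        old = atomicCAS(pixel, assumed, blended);           // retry if another thread won the race
    } while (old != assumed);
}

__global__ void splat(unsigned int* pixel) {
    blend_pixel(pixel, 0xFFFFFFFFu, 0.5f);                  // every thread hits the same pixel
}

int main() {
    unsigned int* pixel;
    cudaMalloc(&pixel, sizeof(unsigned int));
    cudaMemset(pixel, 0, sizeof(unsigned int));
    splat<<<1, 256>>>(pixel);
    unsigned int result;
    cudaMemcpy(&result, pixel, sizeof(result), cudaMemcpyDeviceToHost);
    printf("pixel = 0x%08X\n", result);
    cudaFree(pixel);
    return 0;
}
```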
 
Last edited by a moderator:
Anyway, now that it is all said and done, I think Intel had no proper business plan for the product; they needed a console deal, imo. Assuming the part was competitive (which it could have been; according to people who have played with it, it wasn't bad)
You mean competitive as in offering half the performance or twice the power consumption? Where did you hear that it was competitive for a console deal?
It is indeed completely optimized for throughput, but I don't see how that makes it a GPU.
The discussion in the thread was mainly about latency-optimized cores and throughput-optimized cores. A GPU is just an example of the latter (with the addition of some graphics-related fixed-function hardware). A throughput-optimized architecture doesn't have to be a GPU, of course.
We are concerned with data movement between units within a chip or its closest pool of memory, but when it comes to servers and clouds, the overhead/cost (in all regards: power and performance) of communication between chips, memory pools and storage is crushing.

Focusing on on-chip power consumption, I clearly see the benefit of devices (CPU-type ones) that can do whatever work is needed on some data without having to move the data around; as Nvidia points out, execution itself doesn't cost much.
Usually, most data is not moved between units, but within a unit or a core. And if I have a task where most of the data to process is still in DRAM or some shared level of cache, it makes no difference whatsoever which unit I move the data to; it has to be moved either way. I should just process it in whichever unit (if one has different ones) minimizes the energy consumption of the processing (which includes the effort for internal data movement as well as scheduling and the actual processing). And if I need different cores to communicate and exchange larger amounts of data, one should try to place those cores physically close and do the sharing through a physically close on-chip memory array.
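To make that last point concrete, a toy sketch in CUDA terms (my own illustration, with made-up names): each block stages its working set once in on-chip shared memory and does all further reads from that physically close array instead of going back to DRAM.

```cpp
// Illustrative sketch: stage data in on-chip shared memory so reuse hits a
// physically close memory array instead of DRAM. Assumes a block size of 256.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void sum_neighbours(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];                   // block-local, on-chip storage
    int g = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int l = threadIdx.x + 1;                          // local index, leaving room for halos

    tile[l] = (g < n) ? in[g] : 0.0f;                 // one DRAM read per element
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;         // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f; // right halo
    __syncthreads();                                  // tile now visible to the whole block

    if (g < n)                                        // all reuse comes from shared memory
        out[g] = tile[l - 1] + tile[l] + tile[l + 1];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f), r(n);
    float *din, *dout;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMemcpy(din, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    sum_neighbours<<<(n + 255) / 256, 256>>>(din, dout, n);
    cudaMemcpy(r.data(), dout, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("r[1] = %f\n", r[1]);                      // expect 3.0 for all-ones input
    cudaFree(din);
    cudaFree(dout);
    return 0;
}
```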
It is not intellectually fair, as the CPU doesn't deal with the same work in the first place, and more importantly it is not a universal rule that CPUs have to behave that way. CPU designers have pursued increases in serial performance, and it came at that cost, but as the focus shifts I don't see why there is a pretense that if CPUs were to take a few steps back down that road they should be considered GPUs (not your statement). Because GPUs want to deal with more than the most obviously data-parallel workloads, are they no longer GPUs?
Who will buy a processor optimized for parallel throughput and lacking performance on latency-sensitive tasks, which traditionally clearly dominate in the personal computing space? Having a few cores around offering the best possible performance for these kinds of tasks is hardly a wrong decision.
 