Designing for 4 GHz doesn't necessarily mean it has to run at 4 GHz. Haswell is a 4 GHz design but there will be chips based on it that are suitable for tablets.
Well, it still seems a bit of a waste to me; it's a conceptual thing. If there has to be something like a "one size fits all" design, it has to be a middle-ground type of approach. That's popular wisdom, but it usually proves true: you can't have it both ways. Or, since you're Flemish, maybe you know a bit of French: "tu ne peux pas avoir le beurre, l'argent du beurre, et la crémière" (you can't have the butter, the money from the butter, and the dairymaid too).
I am proposing to unify mainstream CPUs and their integrated GPU, not a power hungry discrete GPU. Both can be quite power efficient for their respective tasks. So although it's definitely a challenge to unify them, it's not really two extremes. The GPU has a relatively high number of wide SIMD units that run at a low frequency, and my proposed unified architecture shares those characteristics while not sacrificing scalar performance.
I'm actually thinking the contrary: outside of real-time graphics rendering, power-efficient GPGPU is more desirable on the "client side" of things. Accelerating the UI, spreadsheets, etc. doesn't take much power and should be done better (more power efficiently) on an iGPU. At the same time, lots of users don't even need 4 fast cores. I can't see, say, a quad-core Haswell being competitive (iso-power and at roughly the same die size) against two cores plus an iGPU, especially once you take into account the non-graphics hardware within the iGPU and throw in hardware acceleration from the GPU.
On the other hand, on the server and cloud side, looking at what the IBM POWER A2 is achieving and what Intel is aiming at, I think you are right that the homogeneous approach is going to win in the majority of cases.
Going for the middle ground doesn't work. It would fail as a CPU and fail as a GPU. Instead an architecture is needed which achieves high performance regardless of whether the workload contains high ILP, TLP, or DLP, or any mix of them.
You obviously know more than me, but to me it sounds like a pipe dream; you have to give up something. Sorry for the car analogy, but it's a bit like the Porsche Cayenne: one wants a fast car with pretty high-end performance plus the characteristics of an SUV, and you end up with a Porsche Cayenne.
Not that it's a failed product, but it's a really high-end one: super expensive, pretty power hungry, and far removed from what an average "do it all" car should be.
And that was just an attempt at getting the mods to finally forbid car analogies on the forum...
I'll try something better. On the server side (at large) you have to deal with a bunch of different workloads; there is a lot of parallelism to be exploited, but plenty of workloads also have relatively low achievable IPC. Overall I'm not sold that either the wide-SIMD approach or the high-IPC design is what fits the bulk of those workloads best. I think it shows in IBM's results with the POWER A2, though one could argue that if it were that great it would be everywhere / sell more; I would counter that IBM probably asks a lot of money for it.
I don't think that those new Atom, Jaguar, or A57 types of cores are failed CPUs; on the contrary, they capture a lot of the low-hanging fruit with respect to serial performance within an impressive power budget. If you look at some benchmarks of the new Kabini from AMD, you see that it both gets hammered by the ULV Core i3 and matches it on workloads where you can't extract that much ILP and where the number of cores is important. The comparison is iso-power, but the chips are not equal as far as die size is concerned; some workloads (or running multiple workloads) could use more cores, and within a given power budget you could do that with Jaguar cores but not with IB cores (I guess the new Atom would be a better comparison, but it is yet to be released). There are also FP-intensive workloads, and Kabini does well there (it should look worse against Haswell, though).
There is die size, and there is also how "standard" those chips are: ULV parts are highly binned, Kabini parts are not; most of the chips on a wafer are good enough to operate within an extremely constrained power budget.
Then there are the workloads: outside the client space, I would think that lots (if not most) of them are more like the ones where a Jaguar setup with more cores should do better, because there isn't much room to use pretty wide SIMD units and not much ILP to be extracted.
I still think that the GPU and the high-serial-performance CPU are both corner cases. Now, as Andrew L. stated a couple of times, any increase in IPC is a miracle by itself; long-term reliance on high sustained IPC is heading into a dead end.
On the GPU side it's a bit of the same: scaling is no longer as optimal as it used to be (you would have to push the resolution really high). Again I'll quote Andrew and what he stated in a tweet:
"I would argue that the big GPUs require too much parallelism", or something along those lines.
On the other end, and with a part on a 45nm process, IBM rules in power efficiency.
History has shown that CPUs which can extract a high amount of ILP from badly written code are more successful than CPUs which look better on paper but require polished code. Note also that while Haswell doubles the DLP with AVX2, it still increases ILP with more execution ports. Compromises to single-threaded scalar performance are simply unacceptable. And that's not just for practical commercial reasons, but it's even a strong theoretical demand of Amdahl's law.
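To put some numbers on the Amdahl's law argument, here is a minimal sketch in C; the 0.9 parallel fraction and the speedup factors are illustrative assumptions, not measurements of any real workload:

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction p of the work is
   parallel and that part is sped up by a factor s. */
static double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void)
{
    double p = 0.9; /* assumed parallel fraction, purely illustrative */
    printf("s = 8   -> %.2fx overall\n", amdahl(p, 8.0));
    printf("s = 64  -> %.2fx overall\n", amdahl(p, 64.0));
    printf("s = 1e9 -> %.2fx overall\n", amdahl(p, 1e9));
    /* Even with near-infinite parallel speedup, the 10% serial part
       caps the overall gain at 10x, which is why scalar performance
       can't be sacrificed. */
    return 0;
}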
Well, my understanding of that is that workloads for which Amdahl's law sets massive constraints will no longer see much improvement; the thing is, there are plenty of other workloads and plenty of parallelism, and the real question is where it is.
As a side note, I could see Intel giving up on 4-way SMT in its next Xeon Phi cores; indeed, the benefits of OoOE seem greater to me than those of multithreading. The 4-way SMT is sort of a relic of Larrabee and Intel's attempt to shoehorn a massively parallel workload (an almost perfect case of data parallelism) into CPU cores, along with the constraint of texturing. Now that those cores aim at HPC, I think OoOE will be their call, for the reasons you state.
I'm also wondering if the next generation of Xeon Phi cores will be as wide as the cores they replace. I could see them looking like a blend of the upcoming new Atom cores, the current Xeon Phi cores, and Haswell:
If you compare Haswell to IB you see that not having proper data paths cripples performance. I think I read that in the Larrabee cores the data paths are not 512-bit wide; that has to be inefficient.
If the data paths are 256-bit, that should be the width of the SIMD units.
So those new Xeon Phi cores could look like this:
Based on the new Atom cores, with the data paths increased to 256-bit, dual FMA units (as in Haswell), and 8-wide SIMD. The ISA, AVX 3.1 it seems, should offer better support for gather and introduce masking as in the Larrabee (LRBni) ISA. I actually think that, if doable, your idea of having it run on 16-wide "wavefronts" (SP) at half speed could be great. That is (if I don't mess up) 16 DP FLOPS per cycle.
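Just to sanity-check that figure, here is a quick sketch under the assumed configuration above (dual FMA units fed by 256-bit data paths; these numbers come from the speculation in this post, not from any announced spec):

#include <stdio.h>

int main(void)
{
    /* Assumed core configuration: two FMA units on 256-bit data paths.
       One FMA counts as two FLOPs. */
    const int fma_units = 2;
    const int sp_lanes  = 256 / 32;   /* 8 single-precision lanes */
    const int dp_lanes  = 256 / 64;   /* 4 double-precision lanes */

    printf("Peak SP FLOPS/cycle: %d\n", fma_units * sp_lanes * 2);  /* 32 */
    printf("Peak DP FLOPS/cycle: %d\n", fma_units * dp_lanes * 2);  /* 16 */
    return 0;
}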
Those cores should be able to "flex their muscles" on a lot more types of workloads than their predecessors. They should clock better, and I would bet Intel will use more of them, in a different "topology". Atom cores can be linked in groups of 2 to 8, and I don't expect Intel to rework that. However, I don't expect them to tie those groups of 8 directly to each other, but rather to a system agent via high-bandwidth links, which will be connected to the memory controller and one or two Crystal Well interfaces.
A cost in what sense? Remember that the die area that used to be dedicated to the integrated GPU becomes available for giving the unified architecture two 512-bit SIMD clusters. And the cost in power consumption is addressed by running them at half frequency.
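As a rough illustration of why half frequency addresses the power cost, here is a sketch with assumed numbers; the 4 GHz base clock and one 512-bit FMA issued per cluster per cycle are assumptions made for the example, not figures from the proposal:

#include <stdio.h>

int main(void)
{
    /* Illustrative comparison only: one 512-bit FMA cluster at full
       clock versus two such clusters at half clock. One FMA = 2 FLOPs. */
    const double base_ghz = 4.0;
    const int    sp_lanes = 512 / 32;   /* 16 SP lanes per cluster */

    double one_full = 1 * sp_lanes * 2 * base_ghz;          /* GFLOPS */
    double two_half = 2 * sp_lanes * 2 * (base_ghz / 2.0);  /* GFLOPS */

    printf("1 cluster  @ %.1f GHz: %.0f peak SP GFLOPS\n", base_ghz, one_full);
    printf("2 clusters @ %.1f GHz: %.0f peak SP GFLOPS\n", base_ghz / 2.0, two_half);

    /* Same peak throughput, but dynamic power scales roughly with
       C*V^2*f, and the lower clock typically allows a lower voltage,
       so the wider, slower configuration reaches that throughput more
       power-efficiently. */
    return 0;
}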
Same as above: for extremely parallel workloads, CPUs still lag GPUs; they lose badly in performance and performance per watt, and raw throughput alone is not that interesting (the same applies when you run a not-so-optimal workload on a big GPU: efficiency crumbles).