Bulldozers are faster in reverse

The front end is usually the most power-hungry part of any x86 architecture, since there's a lot of data movement back and forth at all times.
Intel claims that the front-end only consumes a few percent of total core power these days, with FP execution taking up to 75% in compute-heavy workloads. I'm skeptical of that, but my proposed architecture offers a solution for both the front-end and the execution units. Executing AVX-1024 instructions on 512-bit units means the instruction rate is reduced so the front-end can be clock gated more often, while running the SIMD units at half the frequency should allow a more power efficient design even though two clusters are required to keep throughput high (cf. Kepler abandoning Fermi's hot clock).
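To make the double-pumping idea concrete, here's a toy model of what I have in mind: a single AVX-1024 instruction is fetched and decoded once, then occupies a 512-bit unit for two cycles (low half, then high half), so the front end only has to deliver an instruction every other cycle. All the widths and names below are just assumptions for illustration, not a real ISA definition.
Code:
#define LANES_512 16   /* 16 x 32-bit lanes in a 512-bit half */
#define HALVES     2   /* a 1024-bit register = two 512-bit halves */

typedef struct { float half[HALVES][LANES_512]; } reg1024;

/* One decoded AVX-1024 add keeps a 512-bit unit busy for HALVES cycles,
   which is what lets the front end be clock gated in between. */
static void avx1024_add(reg1024 *dst, const reg1024 *a, const reg1024 *b)
{
    for (int cycle = 0; cycle < HALVES; ++cycle)       /* cycle 0: low half, cycle 1: high half */
        for (int lane = 0; lane < LANES_512; ++lane)
            dst->half[cycle][lane] = a->half[cycle][lane] + b->half[cycle][lane];
}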
 
Aren't you answering your own question there...? :)

I must be missing something.

My point is you're jumping up and down about a Bulldozer module sharing execution resources. Yet AMD isn't doing a thing to increase FPU execution resources: no 256-bit units, no additional units, etc. So AMD is in effect directly contradicting your position; they are increasing instruction throughput, which, if the FPU really were so burdened with work, would do nothing to increase performance.

So who understands the performance bottlenecks better, AMD or Grall? :LOL:
 
Nick said:
I think a key part of the solution is to do the reverse of what Bulldozer does: Have one scalar execution cluster shared between two threads, and two vector execution clusters each dedicated to one of the threads. The vector clusters would run at half the frequency of the scalar cluster, and as a better alternative to Hyper-Threading they could support AVX-1024 instructions which are executed on 512-bit SIMD units to help hide latency.

This seems reasonable, though it might be better to go with 4 threads/scalar eu with 2 threads for each vector eu.
 
Can you clarify some elements of this design?
Does this design, like Bulldozer, keep the retire buffer, exception handling, and memory pipeline in the integer domain?
Is this a unified or non-unified scheduler, since the separate large vector domains at a clock divisor seems more applicable to AMD's coprocessor model?
Are there some assumed changes to the integer side, since Bulldozer's current integer cluster format would be rather constraining?
Are there some other assumed changes, such as cache port width, number of accesses per clock (which clock?), cache arrangement/size, and so on?
Is this assuming 512-bit or 1024-bit registers?

Does this logic SIMD unit run at the integer core frequency? Is it still in the same domain as the half-clock units?
I don't have strong opinions about these things yet. I first wanted to get a sense of whether the idea of two SIMD clusters at half frequency is even feasible at all. I do assume the single integer cluster to be similar to the one in Haswell, with Hyper-Threading, not like the ones in Bulldozer. Scalar sequential performance is critically important in light of Amdahl's law (Bulldozer's concept of sacrificing ILP for TLP didn't work out in practice due to the software challenges). Separate register files and schedulers for the SIMD clusters probably make the most sense. The SIMD registers I imagine to be 512-bit, but allocated/renamed in pairs for the AVX-1024 instructions. I don't see how 1024-bit registers would work on an out-of-order architecture (assuming you want these wide registers to reduce the number of register file ports required like on a GPU).
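To illustrate the pairing, here's a toy free-list sketch of how the rename stage might map one 1024-bit architectural destination onto an even/odd pair of 512-bit physical registers. The register count and all the names here are made up, purely for illustration:
Code:
#include <stdbool.h>

#define PHYS_REGS 168                 /* assumed number of 512-bit physical registers */

static bool reg_busy[PHYS_REGS];      /* zero-initialized: all registers start out free */

/* Allocate an aligned even/odd pair so an AVX-1024 destination maps onto
   physical registers {p, p+1}; return -1 when no pair is free (rename would stall). */
int alloc_pair(void)
{
    for (int p = 0; p < PHYS_REGS; p += 2)
        if (!reg_busy[p] && !reg_busy[p + 1]) {
            reg_busy[p] = reg_busy[p + 1] = true;
            return p;
        }
    return -1;
}

/* Release the pair once the overwriting instruction retires. */
void free_pair(int p)
{
    reg_busy[p] = reg_busy[p + 1] = false;
}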
 
My point is you're jumping up and down about a Bulldozer module sharing execution resources. Yet AMD isn't doing a thing to increase FPU execution resources: no 256-bit units, no additional units, etc. So AMD is in effect directly contradicting your position; they are increasing instruction throughput, which, if the FPU really were so burdened with work, would do nothing to increase performance.

So who understands the performance bottlenecks better, AMD or Grall? :LOL:
AMD is trying to promote heterogeneous computing. They're not going to strengthen the homogeneous throughput computing resources even if it's a performance bottleneck. They really want developers to start using HSA instead.

This is a risky gamble. History has proven over and over again that most developers are not willing to put much effort into achieving a speedup. Unified computing offers a reliable speedup at little or no effort, so in the end I expect it to be more successful.
 
This seems reasonable, though it might be better to go with 4 threads/scalar eu with 2 threads for each vector eu.
While I'm sure that would further increase utilization, the real question is: does it improve performance/Watt? Intel removed Hyper-Threading support from Silvermont, probably because it doesn't offer enough of a speedup while it prevents clock gating parts of the core as long as one of the threads is still making progress. Also, transistors are cheap enough to just add more cores.

With one thread but AVX-1024 instead, I think the SIMD clusters would behave more optimally in terms of performance/Watt. A cluster won't easily run out of independent work, but when it does it can be clock gated, instead of having to stay active for another thread that isn't making optimal progress because it shares a lot of resources with a stalled thread.

These days all that matters is the average performance/Watt, so I think keeping the thread count minimal is key.
 
AMD's bizarre refusal to implement Hyper-Threading in their desktop CPUs seems to have played a role in leading them down the path towards bulldog; it's really weird.
What is so special about HT? My simple understanding is that it's just a mere doubling of the control unit. It COULD help with data stalls on the main thread, but it has nothing to do with throughput or increasing data processing.
 
...And half the performance of an equivalent CPU with a dedicated vector unit per core. ...Which is partly why Intel is stomping all over bulldog, performance-wise.

No, it's not. Vector loads are one of BD's strong points. On SSE code (which most of it still is), BD is competitive. BD is bad at integer, and especially memory ops. When it comes to vector computation, it punches well above its weight.

Doesn't bother intel. Their core chips have - as you know - a (big, expensive, power hungry) vector unit per core and the chip still draws less than bulldog. Amazing, I know. :)

That's because they have the best process in the world.

Anyway, what software are you running that is integer-limited? Anything at all?

Goddamn everything I do. From databases to compilation to playing NS2. Most contemporary software is limited by scalar integer computation. However, since scalar integer performance has plateaued, software is designed to work with what we have. You cut the scope until it runs. If Intel or AMD were able to restore the huge yearly improvements in scalar loads, you bet your CPU would get outdated in a jiffy.

A question I have wondered about is what would be better: 4x 128-bit FMA or 2x 256-bit FMA.

4x 128 would definitely perform better per cycle, however...

4x 128-bit FMA from the same register file would require 12 read ports and 4 write ports. Even if you duplicate the RF once, each copy still needs 6 read ports and 4 write ports. Multi-ported register files are hard, expensive, hot, and limit frequency.
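For reference, the back-of-envelope arithmetic behind those numbers (assuming the usual 3 source reads and 1 result write per FMA):
Code:
#include <stdio.h>

int main(void)
{
    int fma_units   = 4;               /* the 4x 128-bit FMA case */
    int read_ports  = fma_units * 3;   /* 3 sources per FMA -> 12 read ports on one shared RF */
    int write_ports = fma_units * 1;   /* 1 result per FMA  -> 4 write ports */

    /* Duplicating the RF halves the read ports needed per copy, but each copy
       still needs all the write ports so the two copies stay identical. */
    printf("single RF: %dR/%dW, duplicated RF: %dR/%dW per copy\n",
           read_ports, write_ports, read_ports / 2, write_ports);
    return 0;
}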

This is the unification of a CPU and integrated GPU. The silicon that used to be spent on the GPU is distributed between the modules. So there's no extra cost. Also, the utilization of the SIMD units would be no less than that of the legacy integrated GPU.

I disagree. By execution time, most code simply has no DLP that could be exploited by SIMD. In a non-MT design, the SIMD units will sit idle most of the time, no way around that. Widening the units just makes the imbalance worse. HT, or sharing them like BD, is a great way to improve utilization. Your proposal cannot possibly get close to that.

Also keep in mind that GPGPU is kind of a failure in the consumer market, despite plenty of unexploited DLP in most applications. That's because heterogeneous programming is too hard and too unpredictable. By fully unifying the CPU and GPU the developers can rely on tools to auto-vectorize their code without bad surprises. So the utilization of the SIMD units will go up for uses beyond 3D.

I think GPGPU fails more because the DLP isn't there than it does because of the inherent hardness. Anyway, I'm not advocating for GPGPU on PC hardware; I'm advocating for MT/BD-style sharing that would allow you to increase the total data-parallel throughput of the chip while keeping costs low. I seriously think that if AMD were able to fix the caches on BD, it would be a pretty good CPU.
 
What is so special about HT? My simple understanding is that it's just a mere doubling of the control unit. It COULD help with data stalls on the main thread, but it has nothing to do with throughput or increasing data processing.

Vector code is bursty. Vector units are big and hot. The best possible utilization you can get from the vector unit is typically limited more by the code than the features of the unit. With two threads running, you can feed the vector unit better. And since most data-parallel code is also thread-parallelizable, this helps a lot more than it does in integer.
 
Vector code is bursty. Vector units are big and hot. The best possible utilization you can get from the vector unit is typically limited more by the code than the features of the unit. With two threads running, you can feed the vector unit better. And since most data-parallel code is also thread-parallelizable, this helps a lot more than it does in integer.
You are saying you can have vector code on the HT thread and integer code on the main thread at the same time? I thought a single thread can have both vector and integer code simultaneously (correct me if I am wrong).
 
You are saying you can have vector code on the HT thread and integer code on the main thread at the same time? I thought a single thread can have both vector and integer code simultaneously (correct me if I am wrong).

They can. Doesn't mean they will. Basically, in the Intel implementation, HT is almost entirely a frontend detail. The in-order frontend reads up to 4 ops a clock from a single thread, then these go through the renamer, and that's the last point where it matters which thread they came from. All the buffers after that are shared, and the schedulers don't know or care which thread the ops are from.
 
This is the unification of a CPU and integrated GPU. The silicon that used to be spent on the GPU is distributed between the modules. So there's no extra cost. Also, the utilization of the SIMD units would be no less than that of the legacy integrated GPU.
I disagree. By execution time, most code simply has no DLP that could be exploited by SIMD. In a non-MT design, the SIMD units will sit idle most of the time, no way around that. Widening the units just makes the imbalance worse. HT, or sharing them like BD, is a great way to improve utilization. Your proposal cannot possibly get close to that.
There's nothing to disagree on here. Utilization of the SIMD units is a non-issue for a unified architecture. Graphics code is almost purely SIMD code and the integrated GPU's SIMD units are 'moved' into the unified cores. So there is no additional silicon cost for the SIMD units and overall utilization is the same, or higher once more software takes advantage of them.
Also keep in mind that GPGPU is kind of a failure in the consumer market, despite plenty of unexploited DLP in most applications. That's because heterogeneous programming is too hard and too unpredictable. By fully unifying the CPU and GPU the developers can rely on tools to auto-vectorize their code without bad surprises. So the utilization of the SIMD units will go up for uses beyond 3D.
I think GPGPU fails more because the DLP isn't there than it does because of the inherent hardness.
The DLP is there in most compute-limited software. Any loop with independent iterations is a candidate for vectorization. It's just not being exploited yet because, until now, the SIMD instruction set did not have a vector equivalent of every scalar operation, and the SIMD width was too narrow to make a big difference. AVX2 fixes most of that but it will still take time to get adopted. GPGPU offers the right instruction set but it's heterogeneous and thus too hard and unreliable for most developers to invest in. So there are no good options for extracting DLP, even though it's there. A unified architecture will offer the best of both worlds for exploiting it.
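To give a trivial, made-up example of such a loop: each iteration only touches its own elements, so there's no loop-carried dependency and a compiler with AVX2/FMA enabled can auto-vectorize it at -O3 without any developer effort.
Code:
/* Each iteration only touches its own elements, so there is no loop-carried
   dependency and the whole loop maps directly onto SIMD lanes (a vector
   multiply-add per 8 floats with AVX2/FMA). */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}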
Anyway, I'm not advocating for GPGPU on PC hardware; I'm advocating for MT/BD-style sharing that would allow you to increase the total data-parallel throughput of the chip while keeping costs low. I seriously think that if AMD were able to fix the caches on BD, it would be a pretty good CPU.
Bulldozer does not increase the total data-parallel throughput. And even if they improved the caches, we're only talking about a few tens of percent. That's nowhere near enough to catch up with Haswell, and it certainly isn't a suitable approach to CPU-GPU unification.
 
It appears that the biggest concern about unifying the CPU and (integrated) GPU is that sequential scalar workloads demand designing for ~4 GHz operation, while having wide SIMD units for graphics and compute workloads operate at such a frequency is not power efficient. So I've been thinking about an architecture that has its SIMD units running at half the base frequency, while still being homogeneous and offering plenty of throughput...

Thoughts?
Well, I would start by questioning to what extent those CPUs running @4GHz are power efficient.

For me (not a technical person) it looks like trying to reconcile two extremes. Both those CPUs and GPUs are extreme cases; both burn quite a lot of power, etc.
So I'm not sure how even successfully mixing the specifics of both those types of chips (high-performance CPUs and high-performance GPUs) could end up remotely close to something that would qualify as power efficient.

A bias of mine (though honestly based on gut feeling and on how things go in nature and other fields) is that going for a middle ground would be an easier and more successful path.

I could see an architecture that captures all the "low hanging fruit" being more successful than one that tries to reconcile two extremes, which are in some regards pretty specialized devices.

For example, the latest Atom seems to capture the low hanging fruit with regard to single-thread/serial performance at an overall extremely reasonable footprint, be it power or die area.
It is still lacking in the raw throughput department, though I'm not sure trying to widen the SIMD is the way to go. It seems to me that it has a hefty cost on the design as a whole.

It's a bit like going past Jaguar/Silvermont-type serial performance: you start to face diminishing returns, investing more and more power and die area for lesser and lesser gains. As you widen the SIMD there are fewer and fewer cases where you can use it efficiently, which in turn translates into "fat" one way or another (complex power management schemes and whatnot).

If the goal is computing in the broad sense (workloads with plenty of parallelism, though not the extreme cases GPUs deal with, and where serial execution speed is still important), then in a world where the primary concern is power consumption I wonder if Silvermont/Jaguar-type cores augmented with FMA units are likely to be more power efficient (and overall efficient on more metrics).

For me the big CPU cores and, at the other end of the spectrum, the high-performance GPUs are pretty much specialized devices. Conceptually I'm doubtful about the feasibility of a specialized device that would efficiently cover two, somewhat opposite, extremes.

Say Intel adds an FMA unit to Silvermont in its next rendition; we'd be speaking of 3 FLOPS per cycle. Those cores should burn really little power and be really tiny, you could pack a bunch of them, and an eDRAM L4 as in Haswell should provide the bandwidth to keep them busy. I don't believe the big cores could rival that iso-power.
 
Well, I would start by questioning to what extent those CPUs running @4GHz are power efficient.
Designing for 4 GHz doesn't necessarily mean it has to run at 4 GHz. Haswell is a 4 GHz design but there will be chips based on it that are suitable for tablets.
For me (not a technical person) it looks like trying to reconcile two extremes. Both those CPUs and GPUs are extreme cases; both burn quite a lot of power, etc. So I'm not sure how even successfully mixing the specifics of both those types of chips (high-performance CPUs and high-performance GPUs) could end up remotely close to something that would qualify as power efficient.
I am proposing to unify mainstream CPUs and their integrated GPU, not a power hungry discrete GPU. Both can be quite power efficient for their respective tasks. So although it's definitely a challenge to unify them, it's not really two extremes. The GPU has a relatively high number of wide SIMD units that run at a low frequency, and my proposed unified architecture shares those characteristics while not sacrificing scalar performance.
A bias of mine (though honestly based on gut feeling and on how things go in nature and other fields) is that going for a middle ground would be an easier and more successful path.
Going for the middle ground doesn't work. It would fail as a CPU and fail as a GPU. Instead an architecture is needed which achieves high performance regardless of whether the workload contains high ILP, TLP, or DLP, or any mix of it.

History has shown that CPUs which can extract a high amount of ILP from badly written code are more successful than CPUs which look better on paper but require polished code. Note also that while Haswell doubles the DLP with AVX2, it still increases ILP with more execution ports. Compromises to single-threaded scalar performance are simply unacceptable. And that's not just for practical commercial reasons, but it's even a strong theoretical demand of Amdahl's law.
I'm not sure trying to widen the SIMD is the way to go. It seems to me that it has a hefty cost on the design as a whole.
A cost in what sense? Remember that the die area that used to be dedicated to the integrated GPU becomes available for giving the unified architecture two 512-bit SIMD clusters. And the cost in power consumption is addressed by running them at half frequency.
 
Designing for 4 GHz doesn't necessarily mean it has to run at 4 GHz. Haswell is a 4 GHz design but there will be chips based on it that are suitable for tablets.
Well, it is still a bit of a waste; it is sort of a conceptual thing for me. If there has to be something like "one size fits all", it has to be a middle-ground type of approach. That is popular wisdom but it usually proves true: you can't have it both ways, or, even though you are Flemish, maybe you know a bit of French: "tu ne peux pas avoir le beurre, l'argent du beurre, et la crémière" (you can't have the butter, the money for the butter, and the dairymaid) :LOL:
I am proposing to unify mainstream CPUs and their integrated GPU, not a power hungry discrete GPU. Both can be quite power efficient for their respective tasks. So although it's definitely a challenge to unify them, it's not really two extremes. The GPU has a relatively high number of wide SIMD units that run at a low frequency, and my proposed unified architecture shares those characteristics while not sacrificing scalar performance.
I'm actually thinking the contrary (outside of real-time graphics rendering): GPGPU power efficiency is most desirable on the "client side" of things. Accelerating the UI, spreadsheets, etc. doesn't take much power and should be done better (more power efficiently) on the iGPU. At the same time lots of users don't even need 4 fast cores; I can't see, say, a quad-core Haswell being competitive (iso-power and at mostly the same die size) against a dual core + iGPU (especially once you take into account the non-graphics-related hardware within the iGPU) once you throw in hardware acceleration from the GPU.
On the other end, on the server/cloud side, looking at what the IBM Power A2 is achieving and what Intel is aiming at, I think you are right that the homogeneous approach is (in the majority of cases) going to win.
Going for the middle ground doesn't work. It would fail as a CPU and fail as a GPU. Instead an architecture is needed which achieves high performance regardless of whether the workload contains high ILP, TLP, or DLP, or any mix of it.
You obviously know more than me, but to me it sounds like a pipe dream; you have to give up something. Sorry for the car analogy, but it is a bit like the Porsche Cayenne: one wants a fast car with pretty high-end performance and the characteristics of an SUV, and you end up with a Porsche Cayenne.
Not that it is a failed product, but it is a really high-end one: super expensive, pretty power hungry, and far removed from what an average "do it all" car should be.

As that was just an attempt at getting the mods to finally forbid car analogies on the forum... :LOL: I'll try something better. On the server side (at large) you have to deal with a bunch of different workloads; there is a lot of parallelism to be exploited, but plenty of workloads also have relatively low achievable IPC. Overall I'm not sold that either the wide-SIMD approach or the high-IPC design is what fits the bulk of those workloads best. I think it shows in IBM's results with the Power A2, though one can argue that if it were that great it would be everywhere / sell more; I would argue that IBM might be asking a lot of money for it.

I don't think those new Atom, Jaguar, or A57 types of core are failed CPUs; on the contrary, they capture a lot of the low hanging fruit with regard to serial performance within an impressive power budget. If you look at some benchmarks of the new Kabini from AMD, you see that it gets hammered by a ULV Core i3 but also matches it on workloads where you can't extract that much ILP and where the number of cores is important. The comparison is iso-power, but those chips are not equal as far as die size is concerned; some workloads (or running multiple workloads) could use more cores, and within a given power budget you could do that with Jaguar cores but not with IB cores (I guess the new Atom would be a better comparison, but it has yet to be released). There are also FP-intensive workloads, and Kabini does well there (it should look worse against Haswell though).
There is die size, and there is also how "standard" those chips are: ULV parts are highly binned, Kabini parts are not; most of the chips on a wafer are good enough to operate within extremely constrained power budgets.
Then you have the workloads: if you are not on the client side, I would think that lots (if not most) of the workloads are more like the ones where a Jaguar setup with more cores should do better, because there isn't that much room to use pretty wide SIMD units and not that much ILP to be extracted.

I still think that both the GPU and the high-serial-performance CPU are corner cases. Now, as Andrew L. stated a couple of times, any increase in IPC is a miracle by itself; relying for a long, long time on high sustained IPC is heading into a dead end.
On the GPU side it is a bit the same: scaling is no longer as optimal as it used to be (you would have to push the resolution really high). Again I quote Andrew and what he stated in a tweet:
"I would argue that the big GPUs require too much parallelism", or something along those lines.
On the other end, and on a 45nm process part, IBM rules in power efficiency.

History has shown that CPUs which can extract a high amount of ILP from badly written code are more successful than CPUs which look better on paper but require polished code. Note also that while Haswell doubles the DLP with AVX2, it still increases ILP with more execution ports. Compromises to single-threaded scalar performance are simply unacceptable. And that's not just for practical commercial reasons, but it's even a strong theoretical demand of Amdahl's law.
Well, my understanding of that is that workloads for which Amdahl's law sets massive constraints will no longer see much improvement. Still, there are plenty of other workloads and plenty of parallelism; the real question is where it is.
As a side note, I could see Intel giving up on 4-way SMT in its next Xeon Phi cores, and indeed vouching for the benefit of OoOE, whose usefulness seems greater to me than multi-threading. The 4-way SMT is sort of a relic of Larrabee and Intel's attempt to shoehorn a massively parallel workload (an almost perfect case of data parallelism) into CPU cores, along with the constraint of texturing. Now that those cores aim at HPC, I think that OoOE will be their call, for the reason you state.
I'm also wondering if the next gen of Xeon Phi will be as wide as the cores they replace. I could see them looking like a blend of the upcoming new Atom, the Xeon Phi cores, and Haswell:
If you compare Haswell to IB you see that not having proper datapaths cripples performance. I think I read that in the Larrabee cores the datapaths are not 512 bits wide; that has to be inefficient.
If the datapaths are 256-bit, that should be the width of the SIMD.

So those new Xeon Phi cores could be like this:
Based on the new Atom cores, with the datapaths increased to 256-bit, dual FMA units (as in Haswell), and 8-wide SIMD. The ISA, AVX 3.1, it seems, should offer better support for gather and introduce masking as in the LRBni ISA. I actually think that, if doable, your idea of having it run on 16-wide "wavefronts" (SP) at half speed could be great. That is (if I don't mess up) 16 DP FLOPS per cycle.
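To make the gather + masking part concrete, here's a minimal sketch of what it could look like from the programmer's side, using AVX-512-style intrinsic names purely as a stand-in for whatever AVX 3.1 ends up exposing (the kernel itself is made up for illustration):
Code:
#include <immintrin.h>

/* Hypothetical kernel: out[i] = table[idx[i]] + bias for the 16 lanes where
   idx[i] >= 0; lanes with a negative index stay masked off and are written as 0.0f. */
void masked_gather_add(float *out, const float *table, const int *idx, float bias)
{
    __m512i   vidx = _mm512_loadu_si512(idx);                                /* 16 indices */
    __mmask16 m    = _mm512_cmpge_epi32_mask(vidx, _mm512_setzero_si512());  /* idx >= 0 ? */

    /* Gather only the active lanes; inactive lanes keep the 0.0 from the source operand. */
    __m512 g = _mm512_mask_i32gather_ps(_mm512_setzero_ps(), m, vidx, table, 4);
    __m512 r = _mm512_mask_add_ps(g, m, g, _mm512_set1_ps(bias));            /* masked add */

    _mm512_storeu_ps(out, r);
}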

Those cores should be able to "flex their muscles" on a lot more types of workload than their predecessors. They should clock better, and Intel will use more of them, I would bet in a different "topology". Atom cores can be linked in groups of 2 up to 8, and I don't expect Intel to rework that. I don't expect them to connect those groups of 8 to each other, though, but rather to a system agent via a high-bandwidth link, which would be connected to the memory controller and one or two Crystal Well interfaces.

A cost in what sense? Remember that the die area that used to be dedicated to the integrated GPU becomes available for giving the unified architecture two 512-bit SIMD clusters. And the cost in power consumption is addressed by running them at half frequency.
Same as above: for extremely parallel workloads CPUs still lag GPUs; they lose badly in performance and performance per Watt, so the raw throughput is not that interesting (the same applies when you run a not-so-optimal workload on a big GPU: efficiency crumbles).
 
I think a key part of the solution is to do the reverse of what Bulldozer does: Have one scalar execution cluster shared between two threads, and two vector execution clusters each dedicated to one of the threads. The vector clusters would run at half the frequency of the scalar cluster, and as a better alternative to Hyper-Threading they could support AVX-1024 instructions which are executed on 512-bit SIMD units to help hide latency.
Doubling the latency increases register pressure. Assuming you want to write software that doesn't stall waiting for the result, instead of
Code:
a0=b0+c0;
d0=a0+c0;
you'd have to write
Code:
a0=b0+c0;
a1=b1+c1;
d0=a0+c0;
d1=a1+c1;
you'd effectively halve the register count.


But aligned with your thoughts, I believe Intel will at some point merge the iGPU and the CPU vector units; there is no point in having half the die area sit unutilized while the other side is burning.
Of course, that would lead to higher latency on the CPU side, but I think, just as for memory reads, HT could be used in that case to hide the doubled ALU execution latency. If you'd have to double the register count anyway, why not double some more temporary buffers and share the rest between 2 or 4 threads (e.g. branch prediction, decoding...)?


Oh well, there we are, Larrabee :)
 
Unutilized die does act as a heatsink. It's been theorized that dark silicon, as it's called, will be needed in the future due to power and thermal density concerns.

You can already see the effect on Sandy, Ivy and Haswell. The middle two cores run hottest, while the two outer cores run cooler, thanks to the PCI-E system agent and graphics nearby.

Of course, it's not necessarily needed or ideal right now, but it does have some use.
 
Sorry, I missed your reply when given -- only noticed because the thread got resurrected.

There's nothing to disagree on here. Utilization of the SIMD units is a non-issue for a unified architecture. Graphics code is almost purely SIMD code and the integrated GPU's SIMD units are 'moved' into the unified cores. So there is no additional silicon cost for the SIMD units and overall utilization is the same, or higher once more software takes advantage of them.

Right now, when I'm playing a game on a system that has a separate GPU and CPU, the CPU handles the game logic and the GPU handles the graphics. Ideally, they could both be fully utilized. On your proposed system, with, say, 8 threads and 8 FPUs, with 4 threads doing logic and 4 doing graphics, 4 FPUs would be fully utilized, and 4 would be almost completely dark.

Should you instead share the FPUs, ideally the OS could pair GPU-like and CPU-like threads per core, achieving full utilization. That gain in utilization could be used to buy wider real units, a better memory pipeline, or whatever. If you have to choose which resources to share and which to dedicate, the FPU should be the first one shared and the last one dedicated.

Unutilized die does act as a heatsink. It's been theorized that dark silicon, as it's called, will be needed in the future due to power and thermal density concerns.

For this purpose, SRAM not running at top frequency is effectively dark. Rather than lots of rarely utilized specialized execution resources, I expect we'll just see cores swimming in a sea of cache. The stuff is already on die; we just have to wait until the thermals are convincing enough to justify the increase in latency from reshaping it around the cores.
 
Going for the middle ground doesn't work. It would fail as a CPU and fail as a GPU. Instead an architecture is needed which achieves high performance regardless of whether the workload contains high ILP, TLP, or DLP, or any mix of it.
You obviously know more than me, but to me it sounds like a pipe dream; you have to give up something...
That doesn't seem to be the case so far. Haswell quadrupled the FLOPS per clock over Westmere, while increasing the clock frequency, increasing IPC, and yet lowering the power consumption. So nothing was given up. Haswell is still extraordinary in its role as a classical CPU, while bringing us a big step closer to being a high throughput architecture at the same time. If Skylake features 512-bit SIMD units, then that brings us another very significant step closer, and I doubt Intel would make compromises this time either.
Same as above: for extremely parallel workloads CPUs still lag GPUs; they lose badly in performance and performance per Watt...
It's not about how big the gap is today. You have to think about how small it could get, and whether it would still be relevant at that point, given that such cores are far more flexible and don't suffer from any heterogeneous overhead. Current high-frequency Haswell chips achieve 6 GFLOPS/Watt, while GPUs achieve up to 18 GFLOPS/Watt. But with 512-bit SIMD units, two clusters of them, at half frequency, running AVX-1024 instructions, how much could it achieve?

That's what I'd really like to know. Or better ways to get there.
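For reference, the peak-throughput side of that back-of-envelope, under my own assumptions (one FMA issue per 512-bit cluster per cycle, clusters at half of a 4 GHz base clock, four cores):
Code:
#include <stdio.h>

int main(void)
{
    double clusters  = 2;         /* two SIMD clusters per core                      */
    double lanes_sp  = 16;        /* 512-bit / 32-bit = 16 single-precision lanes    */
    double fma_flops = 2;         /* multiply + add                                  */
    double issue     = 1;         /* assumed: one FMA issued per cluster per cycle   */
    double ghz       = 4.0 / 2;   /* SIMD clusters at half of a 4 GHz base clock     */

    double per_core = clusters * lanes_sp * fma_flops * issue * ghz;  /* peak SP GFLOPS */
    printf("%.0f SP GFLOPS per core, %.0f for a quad core; divide by the actual power\n"
           "budget to compare against the 6 vs. 18 GFLOPS/Watt figures above.\n",
           per_core, per_core * 4.0);
    return 0;
}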
 
Doubling the latency increases register pressure.
What makes you think the latency (in cycles) would be doubled? If anything, halving the frequency would make it possible to lower the latency.
But aligned with your thoughts, I believe Intel will at some point merge the iGPU and the CPU vector units; there is no point in having half the die area sit unutilized while the other side is burning.
Utilization is not the issue. Power consumption isn't scaling optimally with transistor size any more, and at 14 nm and below, there will be so many transistors that you need a good portion of them to be inactive at any given time. So from a utilization point of view, there is nothing wrong with heterogeneous computing. Where things do go wrong is that moving data around is getting more expensive than computing things. Heterogeneous computing implies moving a lot of data back and forth, and GPU architectures have poor data locality due to running many threads. CPU architectures have good data locality and are gaining in SIMD capabilities, so you don't have to move data to heterogeneous cores and back.
 