Indeed it's not a totally new concept. I already mentioned Turbo Boost. But I expect this to be extended to also be able to adjust the frequency and voltage based on workload type, not just on workload intensity.

That's pretty much what they do these days. You're basically writing out the same words Qualcomm uses to describe the power management for Krait in its marketing.
A core with dynamic voltage and frequency control is able to get information from activity counters, firmware heuristics, and possibly the OS scheduler to determine what the workload demands are.
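As a rough illustration of the kind of decision logic involved, here's a minimal sketch assuming made-up counter-read stubs and a hypothetical set_dvfs_state() hook; real governors live in firmware and OS drivers and use far richer heuristics:

[code]
#include <stdint.h>
#include <stdio.h>

typedef enum { P_STATE_LOW, P_STATE_MID, P_STATE_HIGH } p_state;

/* Placeholder counter reads standing in for real PMU/firmware interfaces. */
static uint64_t read_instructions_retired(void) { return 900000; }
static uint64_t read_vector_uops(void)          { return 700000; }
static uint64_t read_stall_cycles(void)         { return 100000; }
static void set_dvfs_state(p_state s)           { printf("P-state -> %d\n", (int)s); }

static void dvfs_tick(uint64_t interval_cycles)
{
    uint64_t retired = read_instructions_retired();
    uint64_t vector  = read_vector_uops();
    uint64_t stalls  = read_stall_cycles();

    double utilization  = (double)retired / (double)interval_cycles;
    double vector_share = retired ? (double)vector / (double)retired : 0.0;
    double stall_share  = (double)stalls / (double)interval_cycles;

    /* Throughput-bound (high DLP) or memory-bound phases gain little from
     * high clocks, so drop to a more efficient operating point; bursty,
     * latency-sensitive phases race to idle at the high operating point. */
    if (vector_share > 0.5 || stall_share > 0.4)
        set_dvfs_state(P_STATE_LOW);
    else if (utilization > 0.75)
        set_dvfs_state(P_STATE_HIGH);
    else
        set_dvfs_state(P_STATE_MID);
}

int main(void)
{
    dvfs_tick(1000000);  /* evaluate one sampling interval of 1M cycles */
    return 0;
}
[/code]

The thresholds are arbitrary; the point is only that workload type (vector share), not just intensity (utilization), can steer the operating point.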
Aggressively integrated gating and dynamic frequency adjustments have made their way into every power-constrained environment.

I do not follow why we're on this tangent as if this is a new concept, or how a core that can vary its voltage or clock gate isn't just the exact same set of circuits, only at a different voltage and with some of its clocks disabled.
Today's integrated GPUs have a lower absolute power consumption than the CPU cores. So to create a unified CPU, which takes over the role of said integrated GPU, the cores that run a highly parallel workload should be throttled down. This also leaves more power budget for sequential tasks, which are more important for the responsiveness of the system. Today's CPUs treat all workloads the Hurry-Up-and-Get-Idle (HUGI) way. But this is only suitable for sequential workloads that do go idle. Parallel workloads need a different strategy for the best user experience.
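A back-of-the-envelope sketch of why that matters, using the textbook dynamic-power relation P ≈ C·V²·f with purely illustrative numbers (not measurements): racing to idle at the top voltage/frequency point burns more energy on a throughput-bound task than finishing later at a lower operating point, while also hogging the power budget.

[code]
/* Rough energy comparison for a fixed amount of parallel work:
 * "hurry up and get idle" at full V/f versus a throttled operating point.
 * Capacitance, voltages and frequencies are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double work_cycles = 4e9;     /* cycles of work to retire        */
    const double cap         = 1e-9;    /* effective switched capacitance  */

    /* Operating point A: 4.0 GHz at 1.1 V (race to idle). */
    double f_a = 4.0e9, v_a = 1.1;
    /* Operating point B: 2.0 GHz at 0.8 V (throughput mode). */
    double f_b = 2.0e9, v_b = 0.8;

    double p_a = cap * v_a * v_a * f_a;   /* dynamic power, P = C*V^2*f */
    double p_b = cap * v_b * v_b * f_b;

    double t_a = work_cycles / f_a;       /* time to finish the work */
    double t_b = work_cycles / f_b;

    printf("A: %.2f W for %.2f s -> %.2f J\n", p_a, t_a, p_a * t_a);
    printf("B: %.2f W for %.2f s -> %.2f J\n", p_b, t_b, p_b * t_b);
    /* B takes twice as long but uses roughly half the energy, and frees
     * power budget for latency-sensitive cores in the meantime. */
    return 0;
}
[/code]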
Nice analogy, except that today's processors have very little that resembles those gears yet. A truck with few gears isn't very efficient at climbing slopes, at cruising on the highway, or at both. So we need processors that not only throttle the gas, but can also switch gears to adapt to different conditions. We have multi-core and SIMD for basic TLP and DLP extraction, but there's lots more that can be done to increase throughput and lower power consumption. Together with adapting to the workload type with things like long-running wide vector instructions and lowering the voltage/frequency, CPUs would offer plenty of the reconfiguration you're looking for. The rest of the reconfiguration for different uses can be handled in software.

I'm talking about computing devices that literally reconfigure themselves. A truck that shifts from second to third gear is still the same truck.
Which is why my power consumption argument had four parts. Don't isolate one and say it's not good enough.

Properly supplying quadruple SIMD throughput is more expensive than you let on, and I've stated the position that, for the desired Exascale performance goals by 2018 or 2020, the default power budget is too high to begin with.
To reiterate, the proposed gains are modest and the baseline not good enough.
My point was that despite substantially increasing the throughput per core, Haswell's power consumption will still be lower (per core). And they're only one process node apart.

There are reports that there may be high-performance Haswell SKUs with 160 W+ TDPs. Westmere stopped at 130 W.
Can you add some clarification on what you mean by this?
Which is exactly why during a throughput-oriented workload the frequency should be reduced.

FinFET is quite impressive in the lower-voltage domain, especially in more modestly clocked designs.
The improvement in the 4 GHz, 1 V+ realm is back to the modest tens of percent.
I'm not saying to completely discount it. But the problem with NTV today is that it comes with severe compromises. Having to drop the clock frequency by an order of magnitude just isn't commercially viable outside of some niche markets. It's valuable to drop the voltage a little with every new node, but this doesn't demand the full set of changes required for true NTV operation. In essence, NTV technology is a last resort and we want as little of it as possible.

I'm not sure why it's fine to pin hopes on one lab's silicon nanowires that may someday be looked at, while a whole NTV Pentium that physically exists and has been manufactured has to be discounted.
Junctionless transistors, on the other hand, have very promising scaling characteristics. It's a more desirable technology than being forced to go the NTV route. But I'm not pinning all my hopes on junctionless transistors. Intel is also experimenting with III-V TFETs, for instance, which might operate at 0.3 V without sacrificing clock frequency. So my main point was that with so much R&D now going into lowering the power consumption of transistors, there's bound to be some progress that makes current trends look too pessimistic.
It wouldn't fare worse. It would fare just as badly.

Why would a mobile GPU with a short pipeline, relatively simple design and operating point in the hundreds of MHz fare worse with NTV than a Pentium with a short pipeline, relatively simple design, and an operating point in the hundreds of MHz?
There are no long-running vector instructions in consumer CPUs. So even though the gating mechanisms are already largely in place, they're not being taken full advantage of during high-DLP workloads.

Which steps do you think are left that aren't already heavily gated? Because yes, extensive gating is being done already. This is why I've stated your supposed gains are modest: they seem to include things that have been done for five years.
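To make the idea concrete, here's a small sketch of what a long-running vector instruction amounts to, assuming a hypothetical 1024-bit logical vector executed as four 256-bit slices; the loop plays the role of the hardware sequencer, keeping one quarter-wide unit busy for several cycles instead of issuing four independent instructions through the front end:

[code]
/* Illustration of a "long-running" vector operation: one logical 1024-bit
 * add executed as four 256-bit slices. In hardware, a sequencer would keep
 * a quarter-wide SIMD unit busy for four cycles, letting the front end,
 * schedulers and other ILP machinery stay idle (and gated) in between.
 * The widths and slice counts are illustrative assumptions. */
#include <stdio.h>

#define LOGICAL_BITS    1024
#define SLICE_BITS      256
#define LANES_PER_SLICE (SLICE_BITS / 32)           /* 8 x 32-bit lanes */
#define SLICES          (LOGICAL_BITS / SLICE_BITS) /* 4 passes         */

static void long_vadd(const float *a, const float *b, float *out)
{
    for (int s = 0; s < SLICES; s++)                /* one "cycle" per slice */
        for (int lane = 0; lane < LANES_PER_SLICE; lane++)
            out[s * LANES_PER_SLICE + lane] =
                a[s * LANES_PER_SLICE + lane] + b[s * LANES_PER_SLICE + lane];
}

int main(void)
{
    float a[32], b[32], out[32];
    for (int i = 0; i < 32; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
    long_vadd(a, b, out);
    printf("lane 5 = %.1f (expected 15.0)\n", out[5]);
    return 0;
}
[/code]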
The transistor cost has already been going to zero for decades, but the transistor count has been increasing at roughly the same pace. So area cost isn't going to zero. And even though not all of the transistors can be active at the same time, there's still an increasing absolute number that can. Also, there are large portions that can be gated off temporarily even during high-performance operation. All of this means it's still a substantial waste to have duplicate functionality and use only half.

Duplicated logic is duplicated stuff with costs going to zero.
I'm not denying the problem of dark silicon, but once again it looks like you're interpreting something that is in fact highly undesirable as something that's somehow an advantage to GPUs. They suffer from this just as much. It may be a hurdle for CPU-GPU unification, but it's definitely not reversing it. And again, there are substantial advantages to unification aside from transistor (area) cost. It improves data locality and makes writing efficient code much easier.
Managing and coalescing the data movement doesn't come for free. It worsens the data locality and you'll end up running latency-sensitive code on the GPU. It works from the point of view of minimizing the data transfer overhead, but it makes things less efficient elsewhere.

Data movement can be managed and coalesced so that a design can intelligently weigh the occasional burst in consumption when starting the offload process against the ongoing economies in power consumption.
This may sound like a simple matter of optimizing it until you reach the right balance, but it's a veritable programming nightmare. There are tons of heterogeneous system configurations; some with powerful discrete GPUs, some with feeble integrated GPUs, some with lots of registers per thread, some with few, some with good shared memory bandwidth, some with severe bottlenecks, some with crippled double-precision performance, some with limited integer performance, etc.
To make matters worse, things are not getting any better for heterogeneous systems. Bandwidth and latency don't increase at the same rate as computing power. So moving data between the CPU and GPU becomes ever more costly. The only solution is unification.
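To put a (made-up) number on that cost, here's a hedged break-even sketch with assumed transfer and compute rates; offloading only pays off when the GPU's compute advantage outweighs the round-trip data movement, and that bar keeps rising as bandwidth lags compute:

[code]
/* Break-even estimate for offloading a kernel to a discrete GPU.
 * All rates and sizes are illustrative assumptions, not benchmarks. */
#include <stdio.h>

int main(void)
{
    const double bytes        = 256e6;   /* data moved to and from the GPU */
    const double pcie_bw      = 16e9;    /* assumed transfer bandwidth, B/s */
    const double flops_needed = 4e9;     /* work in the kernel              */
    const double cpu_flops    = 200e9;   /* assumed CPU throughput, FLOP/s  */
    const double gpu_flops    = 2e12;    /* assumed GPU throughput, FLOP/s  */

    double t_cpu     = flops_needed / cpu_flops;
    double t_gpu     = flops_needed / gpu_flops;
    double t_xfer    = 2.0 * bytes / pcie_bw;        /* there and back */
    double t_offload = t_xfer + t_gpu;

    printf("CPU:            %.1f ms\n", 1e3 * t_cpu);
    printf("GPU + transfer: %.1f ms (%.1f ms of it is data movement)\n",
           1e3 * t_offload, 1e3 * t_xfer);
    printf("Offload %s worth it for this kernel.\n",
           t_offload < t_cpu ? "is" : "is not");
    return 0;
}
[/code]

With these assumed numbers the transfer alone already exceeds the CPU's time to just do the work locally, which is the whole argument in a nutshell.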
Yes, the hardware prefetcher isn't currently aware of long-running SIMD instructions and their latency-hiding qualities. So it would have to be made aware of that. This isn't hard; it's just a gear shift.

Why does the memory hierarchy see things as being significantly different? The data cache and the memory controller have very little awareness of what instructions are doing beyond the memory accesses they generate. Long-running SIMD instructions basically make a quarter-wide SIMD unit demand the same amount of operand data.
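A quick sanity check of that claim, with assumed widths (a 1024-bit logical vector issued over four cycles on a 256-bit datapath): the per-cycle operand demand the cache sees is unchanged.

[code]
/* Operand demand seen by the L1 data cache per cycle, with assumed widths:
 * one 256-bit AVX instruction per cycle versus one hypothetical 1024-bit
 * long-running instruction issued over four cycles. */
#include <stdio.h>

int main(void)
{
    const int avx_bits = 256,  avx_cycles = 1;
    const int lrv_bits = 1024, lrv_cycles = 4;   /* long-running vector */

    printf("AVX:          %d bits/cycle of operands\n", avx_bits / avx_cycles);
    printf("long-running: %d bits/cycle of operands\n", lrv_bits / lrv_cycles);
    /* Same per-cycle demand; the difference is that each long-running
     * instruction spans four cycles, giving the prefetcher more slack
     * to cover memory latency for the next one. */
    return 0;
}
[/code]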
Did unifying the GPU cores negate their design features for vertex and pixel shaders? No, they just cater for both now, and leave features unused when not needed by the shader. Likewise, long-running vector instructions would just make the CPU cater for high-efficiency DLP extraction, which allows gating some features aimed at ILP extraction. Those ILP features are still highly necessary for sequential scalar workloads. Again, just another gear shift.

This would make me start to question why this is on a big OoO core when it seems all its design features are negated, yet it has to jump through hoops to appear simpler.
But that appears to be about homogeneous cores. Sure, it can be extended to heterogeneous ones, but it's not solving the inherent problem that bandwidth and latency between heterogeneous cores are scaling more slowly than computing power. And in fact, with a homogeneous ISA (virtual or not) across heterogeneous cores and a unified address space across disjoint memories, the developer becomes less aware of where the code is running and where the data is located. Just for argument's sake, he could be running the OS on the GPU and graphics on the CPU, with both pulling data from the other side. I really don't think that disguising a heterogeneous system as a homogeneous one fixes things. It might be an improvement on average, but it's really just another convergence step toward full unification.

Intel's power control unit has been subtly overriding OS power state demands since Nehalem, and possibly since one of the Atom designs at the time.
It's already clear that the discrete GPU's days are numbered, so the RAM, the memory controllers and part of the cache hierarchy all become physically, and not just virtually, unified. And while this isn't optimal for specific types of workloads, it's way better than a CPU pulling data out of graphics RAM or a discrete GPU pulling data out of system RAM, and the risk of that bringing down performance is only getting worse.
It's only a matter of time before the cores have to be physically unified as well, to prevent having to run code on the wrong type of core because bandwidth and/or latency don't allow migrating it. This problem doesn't occur when all cores are equally capable of extracting ILP, DLP and TLP.
What standards?

Going forward, Intel has been putting forward standards to allow system components to communicate guard bands on latency requirements, so that their next SoC will be able to coalesce activity periods at its discretion to better enable power gating.
Improve graphics performance at the expense of GPGPU?

It does exactly what I want it to do, and exactly what the customer would want it to do.
Removing memory space barriers means adding hardware features. So you can call it removal all you want, it's still convergence.

Or what Daly said, removing things incidental to the real problem.
If it quacks like a duck...

When you have a hammer...
Seriously, what you described is definitely closer to software rendering and further away from heterogeneous hardware. Compilation and scheduling are latency-sensitive tasks, and since you'll want each core to adapt individually, you want each core to have CPU-like qualities. In theory you could just pair up each GPU core with a CPU core, but since you want a shared ISA as well, we can at least unify instruction fetch and decode. Assuming that at this point the CPU side of each core uses out-of-order execution and the GPU side uses in-order execution, you still need ways to synchronize data between them. So the memory subsystem also has to be tightly interwoven, especially since you also want a uniform memory space. This practically just leaves scheduling. But scheduling instructions and scheduling threads aren't horrendously dissimilar. You may as well have just one generic out-of-order SMT scheduler and use things like long-running SIMD instructions to lower the switching activity.