But it's not adjusted based on the workload, which is what I was asking about. To balance a CPU's responsiveness and throughput for a given TDP, it has to adjust things based on the workload, per core.
That's pretty much what they do these days. You're basically writing out the same words Qualcomm uses to describe the power management for Krait in its marketing.
A core with dynamic voltage and frequency control is able to get information from activity counters, firmware heuristics, and possibly the OS scheduler to determine what the workload demands are.
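As a rough sketch of the kind of decision loop that implies (counter names, thresholds, and the P-state table are hypothetical, not any vendor's actual firmware interface):

```c
/* Minimal DVFS governor sketch: pick a frequency/voltage operating point per
 * core from a utilization estimate plus a scheduler hint. Counter names,
 * thresholds, and the P-state table are hypothetical. */
#include <stdint.h>

struct pstate { uint32_t mhz; uint32_t mv; };

static const struct pstate pstates[] = {
    { 300, 750 }, { 800, 850 }, { 1400, 950 }, { 1900, 1050 },
};

/* Called periodically, per core, by firmware or an OS governor. */
static const struct pstate *select_pstate(uint64_t busy_cycles,
                                          uint64_t total_cycles,
                                          int sched_boost_hint)
{
    if (total_cycles == 0)
        return &pstates[0];

    unsigned util = (unsigned)(busy_cycles * 100 / total_cycles);

    if (sched_boost_hint || util > 85) return &pstates[3]; /* ramp up quickly */
    if (util > 60)                     return &pstates[2];
    if (util > 30)                     return &pstates[1];
    return &pstates[0];                                     /* race to idle    */
}
```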
Aggressively integrated gating and dynamic frequency adjustment have made their way into every power-constrained environment.
I do not follow why we're on this tangent as if this were a new concept, or how a core that can vary its voltage or clock-gate itself isn't just the exact same set of circuits running at a different voltage with some of its clocks disabled.
I'm talking about computing devices that literally reconfigure themselves. A truck that shifts from second to third gear is still the same truck.
But you asked specifically: "how many more silicon nodes do you think we have left to hide this future in?"
Silicon nodes are the primary means for increasing the core count of aggressive OoO straight-line cores in a consumer device with one or possibly two dies per socket (or BGA package, if it's Broadwell).
I've posited the use of technologies and design choices that can allow designers to work around this, particularly in areas where silicon nodes are providing less than the necessary scaling despite the progression of Moore's law.
The power consumption might indeed be the trickiest part. But I don't see any reason for despair. First of all, CPU cores can double or quadruple the SIMD throughput (again) without costing an equal increase in power consumption, because the SIMD units represent only a fraction of the power budget.
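To put hypothetical numbers on the "fraction of the power budget" claim (the 20% share is an assumption for illustration, not a measurement):

```c
/* Hypothetical illustration: if the SIMD datapath were 20% of core power and
 * its power scaled linearly with width, quadrupling it would not quadruple
 * core power. The 20% figure is an assumption, not a measurement. */
#include <stdio.h>

int main(void) {
    double simd_share = 0.20;                        /* assumed SIMD fraction */
    double rest       = 1.0 - simd_share;
    double new_power  = rest + simd_share * 4.0;     /* 0.8 + 0.8 = 1.6       */

    printf("4x the SIMD throughput for ~%.0f%% more core power\n",
           (new_power - 1.0) * 100.0);
    return 0;
}
```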
Properly supplying quadruple SIMD throughput is more expensive than you let on, and I've stated the position that for the Exascale performance goals desired by 2018 or 2020, the baseline power budget is too high to begin with.
To reiterate, the proposed gains are modest and the baseline not good enough.
There's no way Haswell will consume more than Westmere, and that trend will most likely continue.
There are reports that there may be Haswell high performance SKUs with 160W+ TDPs.
Westmere stopped at 130W.
Can you add some clarification on what you mean by this?
And then there's the opportunity for long-running wide vector instructions which allow a further reduction in power consumption. Next, there's the piecemeal introduction of NTV technology and adjusting the clock frequency based on the workload.
Those are modest gains, insufficient for the order(s) of magnitude of scaling desired, and adjusting clock frequency is officially old hat at this point.
Do you mean something more to adjusting clock frequency than I am interpreting?
And lastly, tons of research is going into lowering the transistors' power consumption now. Multigate transistors were an important breakthrough, and junctionless transistors could be the next major leap, which would make the ITRS projections highly conservative.
FinFET is quite impressive in the lower voltage domain, especially in more modestly clocked designs.
The improvement in the 4 GHz, 1V+ realm is back to modest tens of percent.
I'm not sure why it's fine to pin hopes on one lab's silicon nanowires that may someday be looked at, while a whole NTV Pentium that physically exists and has been manufactured has to be discounted.
We don't have to wait and see. The facts are already known. At the "optimal" operating point, the clock frequency is ~9x lower, while the power consumption is ~45x lower. To compensate for this loss in absolute performance, you'd need an order of magnitude more transistors. And that's just to keep the performance level. It offers a nice 5x reduction in power consumption, but at an insane increase in die size. Note that I haven't even factored in the transistor/area increase due to NTV technology itself, nor any performance loss due to Amdahl's Law.
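Back-of-the-envelope, using the figures above (~9x lower clock, ~45x lower power at the "optimal" NTV point, ignoring NTV area overhead and Amdahl's Law):

```c
/* Back-of-the-envelope check of the NTV trade-off described above: ~9x lower
 * clock and ~45x lower power at the "optimal" point. Ignores NTV area
 * overhead and Amdahl's Law, as noted in the text. */
#include <stdio.h>

int main(void) {
    double freq_ratio  = 1.0 / 9.0;    /* relative clock at the NTV point      */
    double power_ratio = 1.0 / 45.0;   /* relative power at the NTV point      */

    double cores_needed = 1.0 / freq_ratio;            /* ~9x the silicon      */
    double total_power  = cores_needed * power_ratio;  /* ~0.2x, i.e. ~5x less */

    printf("~%.0fx the cores for the same throughput, at ~%.1fx the power\n",
           cores_needed, total_power);
    return 0;
}
```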
A design that targets density, an economy in complexity and transistors, and has latency-tolerant and highly parallel workloads should be very interested in this. It's just not handy for aggressive OoO desktop cores, but that's not enough to make me discount it.
A design that cuts transistors at the expense of general performance can still appeal to power-limited parallel computation, and the low absolute power consumption is very helpful if using high levels of integration. Die stacking can bring multiple layers of low-power silicon to bear, while also allowing stacking with memory, whereas a die with an order of magnitude more power consumption can severely constrain it.
Transistors may be getting cheaper, but only at the rate of Moore's Law, at best. Only niche markets where low power consumption is far more important than absolute performance can afford to have chips that nominally run at NTV voltage. The only commercially viable use for consumer products is for low idle power consumption.
So it's only an exponential curve. This does point to a widening of the scope, since we've moved beyond products that run solely on harvested energy.
Slow in absolute speed. A GPU running at NTV voltage will decimate the framerate. That does compromise the user experience.
Why would a mobile GPU with a short pipeline, relatively simple design and operating point in the hundreds of MHz fare worse with NTV than a Pentium with a short pipeline, relatively simple design, and an operating point in the hundreds of MHz?
Wide out-of-order execution CPUs and DirectX 11 GPUs are coming to mobile devices. So the desktop is still the trendsetter. Regardless, the majority of people aren't gamers. They rarely use the GPU to its fullest. Again just look at the distribution of HD 2500s and HD 4000s. Business desktops benefit more from a quad-core than from a more powerful integrated GPU aimed at gaming.
They're coming to Windows mobile devices, and the Haswell mobile CPUs with integrated GPUs are deliberately expanding the area devoted to the GPU and lowering clocks to derive better performance per watt.
There's nothing wrong about adjusting to the workload. And there's nothing artificial or constraining about it either. When the buffers are full of long-running instructions, the previous stage(s) can be clock gated for a certain number of cycles since there's plenty of work anyway. They might even do this already (they certainly do something similar for the uop cache and decoders, although that's in the in-order part).
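A minimal sketch of that kind of gating decision (structure names and thresholds are hypothetical, and in real silicon this is a hardware control circuit, not software):

```c
/* Sketch of "gate the front end while the back end is saturated with
 * long-running work". Structure names and thresholds are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

struct backend_state {
    unsigned scheduler_occupancy;   /* entries holding pending uops           */
    unsigned scheduler_capacity;
    unsigned long_latency_uops;     /* e.g. long-running wide vector ops      */
};

/* Returns how many cycles fetch/decode could be clock-gated this interval. */
static uint32_t frontend_gate_cycles(const struct backend_state *be)
{
    bool nearly_full  = be->scheduler_occupancy * 10 >
                        be->scheduler_capacity * 9;      /* >90% occupied     */
    bool long_running = be->long_latency_uops * 2 > be->scheduler_occupancy;

    if (nearly_full && long_running)
        return 16;   /* the back end has plenty of work; stop feeding it      */
    if (nearly_full)
        return 4;
    return 0;        /* keep the front end running                            */
}
```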
Which steps do you think are left that aren't already heavily gated? Because yes, extensive gating is being done already. This is why I've stated your supposed gains are modest: they seem to include things that have been done for five years.
And yes, you could have different architectures for each and every workload, but then there's duplication of logic, extra data movement, and more programming troubles.
Duplicated logic is duplication of something whose cost is trending toward zero.
Data movement can be managed and coalesced so that a design can intelligently weigh the occasional burst in consumption when starting the offload process against the ongoing economies in power consumption.
The absolute size of the specialized logic can also play in its favor, since multiple divisions can fit in the same area as a single large core. Their transfer costs would need to be evaluated in this regard as well, on top of power savings related to specialization at the design and silicon level.
Indeed, with long-running SIMD instructions there is potential for making things like prefetching less aggressive. I've mentioned that before in the Larrabee thread.
Why does the memory hierarchy see things as being significantly different? The data cache and the memory controller have very little awareness of what instructions are doing outside of the memory accesses they generate. The long-running SIMD instructions basically make a quarter-width SIMD unit demand the same total amount of operand data, just spread over more cycles.
They might not be using the Haswell method of gathering vector data, which is apparently what I posited a while back as a microcoded loop of reads.
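For reference, a gather implemented as a microcoded loop of scalar reads behaves roughly like this software model (width, types, and masking are illustrative):

```c
/* Software model of a gather expanded into a loop of scalar reads, i.e. the
 * "microcoded loop" idea. Width, types, and masking are illustrative. */
#include <stdint.h>

static void gather8_f32(float dst[8], const float *base,
                        const int32_t idx[8], uint8_t mask)
{
    for (int lane = 0; lane < 8; lane++) {
        if (mask & (1u << lane))           /* per-lane predicate              */
            dst[lane] = base[idx[lane]];   /* one scalar load per active lane */
    }
}
```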
No. Scalar instructions are interleaved with the long-running SIMD instructions. So they execute at a slower pace as well.
This would make me start to question why this is on a big OoO core when it seems all its design features are negated and it has to jump through hoops to appear simpler.
Please elaborate on these design choices. And what do their SoCs do behind the scenes that the software isn't aware of (that other designs don't do)?
Intel's power control unit has been subtly overriding OS power state demands since Nehalem and possibly one of the Atom designs at the time. That might not have been an SoC, so that may have been an inaccurate recollection on my part. Going forward, Intel has been putting forward standards to allow system components to communicate guard bands on latency requirements, so that their next SoC will be able to coalesce activity periods at its discretion to better enable power gating.
Parallel to this are firmware and hardware changes by upcoming ARM designs that will make even core assignment something much more fluid under the hood than would be visible to software.
But we weren't talking about Kepler's primary function.
I'm talking about doing anything and everything to get the most performance per Watt, including using silicon that is dedicated to subsets of different workloads that may be underutilized or gated fully off at other times.
It does exactly what I want it to do, and exactly what the customer would want it to do.
Crying over potentially underutilized transistors that have halved in price for almost 50 years is not high on my list of priorities. Most of the transistors on a chip have been aggressively kept off for about a decade, so what's a few more to add to the pile?
Bringing it one step closer to unification.
Or what Dally said, removing things incidental to the real problem.
Sounds like software rendering to me.
When you have a hammer...