You missed the point. Everyone's concentrating on lowering power consumption now.
I've pointed to NTV as an interesting tool for HPC and mobile graphics. For HPC especially, the goals put forward for exascale computing set a rough time frame of 2018 for Intel, with others aiming for 2020.
NTV has been implemented at 32nm, so it can be in the development pipeline prior to 2018 or 2020.
I'm not disputing that awesome things are in the pipeline for after 2020, but it's almost trivial to point out that there's nearly always something awesome just around the bend.
Until then, there may be years for which NTV is a decent enough choice.
Whenever tunnelling heterojunction nanowire transistors become viable, NTV may be jettisoned if it no longer provides a significant benefit. It's a tool for the period in which it's needed and can be thrown out when it goes stale. It's not a spouse.
The original Pentium running at several hundred MHz was a design from 1993. So you'll have to wait about 20 years for NTV technology to give your GPU the same reduction in power consumption achieved with that Pentium, but still at today's performance level.
Should I point out that the pipeline complexity of modern mobile GPUs is many years behind that of high-speed modern OoO CPUs?
To clarify, incarnations of the Pentium core at several hundred MHz were still being introduced as late as 1997. Not that it matters, since I think you're making a more fundamental assumption: that design X at clock speed N, translated to NTV, becomes design Y at clock speed N/10.
That isn't the case any more than a Pentium ported to 22nm is going to be running at 3 GHz.
There isn't a linear relationship between the difficulties and demands of running at a given speed. The curve gets a lot steeper at the upper end and is much more forgiving at the lower end.
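To put a rough shape on that curve, here's the usual first-order sketch. It ignores leakage, which actually matters a great deal near threshold, so treat it as illustrative only:

```latex
% First-order CMOS scaling (leakage ignored; it dominates near threshold):
\[
  P_{\text{dyn}} \approx \alpha\, C\, V_{dd}^{2}\, f,
  \qquad
  E_{\text{op}} \propto C\, V_{dd}^{2}
\]
% Above threshold, achievable frequency falls roughly linearly with V_dd,
% while energy per operation falls quadratically, so trading clock speed
% for supply voltage saves more than it costs, until gate delay blows up
% as V_dd approaches V_th. That steep low-voltage regime is exactly what
% NTV has to engineer around.
```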
Intel is already on top of all of this for its CPUs: using 8T SRAM and early adoption of FinFET.
Neither of those is in the same vein as NTV. FinFET isn't about NTV either, although it probably makes NTV more compelling, since the subthreshold slope is steeper than what was available for the 32nm Rosemont.
GPUs suddenly adopting full NTV technology doesn't let them catch up with this in any way.
If GPUs suddenly sprouted NTV capability today, started delivering 40-50 DP GFLOPS/W for HPC, and could run games at 2560x1600 in under 100 W, you can bet people would notice.
It has to be much better than that. The Radeon 7970 has 65% more transistors than the 6970, and offers 40% higher performance in practice (with only 50% higher bandwidth).
I've found later reviews at Anandtech, using more recent drivers, putting it in the ~50-60% range. The GHz Edition adds further margin, but it increases power consumption to do so.
And it will surely be extended upon. Gating is a hot research topic (pun unintended). It's obviously disingenuous to look at an Alpha 21264 when discussing the power consumption per instruction of modern CPUs, but it's equally pointless to only look at today's CPU designs when discussing their future scaling potential. For instance, branch prediction confidence estimation is said to save up to 40% in power consumption while costing only 1% in performance due to false negatives.
It's not readily comparable, but the Alpha 21264 is also, at this point, conservative in many respects compared to a modern OoO core. Its scheduling and issue capabilities are significantly more restricted, and its branch predictor is somewhere between 2 and 8 times smaller. The error bars on that estimate would be narrower were it not for the proliferation of predictors in the front end since then and a reluctance in the latest generations to disclose hard numbers. It's probably somewhat over 4 times in terms of global history, and then possibly another 1-2K of storage in different predictor types past that.
Since the branch predictor in particular is a very large component of the energy cost, it's something to be aware of. On size alone, even a 40% power saving from confidence prediction leaves a modern branch predictor's power consumption significantly higher than that of a small predictor from a decade ago.
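For readers following along, the confidence-estimation mechanism referenced above usually looks something like this: a small table of resetting counters that tracks how long each branch's prediction streak has been, so low-confidence branches can gate deep speculation. The table size, threshold, and gating policy below are placeholders of mine, not anything a shipping core is known to use:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical confidence estimator: a table of "resetting" counters
 * indexed by branch PC. A counter increments on every correct
 * prediction and resets on a misprediction, so a high count means the
 * predictor has been right many times in a row for that branch.
 * Low-confidence branches could then throttle fetch or speculation to
 * avoid burning power on likely-wrong paths. */

#define CONF_ENTRIES   1024   /* placeholder table size     */
#define CONF_THRESHOLD 12     /* placeholder confidence bar */
#define CONF_MAX       15     /* 4-bit saturating counters  */

static uint8_t conf_table[CONF_ENTRIES];

static inline unsigned conf_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) & (CONF_ENTRIES - 1));
}

/* Called when a branch resolves: train the estimator. */
void conf_update(uint64_t pc, bool prediction_was_correct)
{
    unsigned i = conf_index(pc);
    if (prediction_was_correct) {
        if (conf_table[i] < CONF_MAX)
            conf_table[i]++;
    } else {
        conf_table[i] = 0;   /* resetting counter */
    }
}

/* Called at prediction time: is it worth speculating aggressively? */
bool conf_is_high(uint64_t pc)
{
    return conf_table[conf_index(pc)] >= CONF_THRESHOLD;
}
```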
I'm obviously just scratching the surface here. There are hundreds if not thousands of researchers and engineers working on stuff like this. Besides, both the CPU and the GPU have the same switching-activity reduction problem.
One of them has transistors switching at or below 1 GHz (mobile and integrated GPUs generally run below that), while the other clocks several times higher. This makes a very large difference.
Developers are not very willing to jump through many hoops for extra performance. The failure of GPGPU in the consumer market clearly illustrates this.
GPGPU is not a primary concern for me. If silicon doesn't like being pushed too far out of its sphere, I'm not averse to gating it off. Perhaps one of the things that stymied it was the introduction of the fixed-function encode and decode hardware now present in most modern consumer CPUs and mobile SoCs.
Keeping track of the ratio of long-running SIMD instructions that are executed seems pretty straightforward. And while I never said that only one type of core could have access to such knowledge, I don't see what a non-unified architecture could do with it. Care to elaborate?
The hand-off latency on chip is a fixed cost of some number of cycles. If we're relying on accumulated instruction data or software hints, that indicates startup latencies or periods of misjudged demand.
Hand-off can probably be shaved down to tens of cycles, and in a lot of corner cases, accumulating enough data to change the direction taken by the heuristics also takes tens of cycles.
Once in a steady state, both methods will stabilize, so I suppose I'd need more data to consider them that different.
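For what it's worth, the bookkeeping I had in mind for the "ratio of long-running SIMD instructions" point is no more involved than the following. A real implementation would be a couple of counters in hardware rather than C, and the window size and threshold here are made-up numbers:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-thread activity monitor: count retired instructions
 * and how many of them were long-running (wide SIMD) ops over a fixed
 * window, then expose a hint a scheduler could use when deciding where
 * the thread should run. All constants are placeholders. */

#define WINDOW_INSNS   100000u   /* sample window, in retired instructions */
#define SIMD_THRESHOLD 25u       /* percent of the window that must be SIMD */

struct simd_monitor {
    uint32_t total;      /* instructions retired this window  */
    uint32_t simd;       /* of which long-running SIMD        */
    bool     simd_heavy; /* latched hint from the last window */
};

/* Called (conceptually) at retirement for every instruction. */
void monitor_retire(struct simd_monitor *m, bool is_long_simd)
{
    m->total++;
    if (is_long_simd)
        m->simd++;

    if (m->total >= WINDOW_INSNS) {
        m->simd_heavy = (m->simd * 100u) >= (SIMD_THRESHOLD * m->total);
        m->total = 0;
        m->simd  = 0;
    }
}

/* The scheduler (hardware or OS) reads this when picking a core. */
bool monitor_prefers_throughput_core(const struct simd_monitor *m)
{
    return m->simd_heavy;
}
```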
The penalty for mixing SSE and AVX instructions is easy to avoid, and mostly just a guideline for compiler writers. It's on a completely different level than GPU manufacturers telling you to minimize data transfers between the CPU and GPU, which is nowhere near trivial to achieve.
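For reference, "easy to avoid" mostly comes down to not leaving the upper halves of the YMM registers dirty before legacy SSE code runs, which is what VZEROUPPER (_mm256_zeroupper()) is for; compilers targeting AVX insert it automatically around such transitions. A minimal hand-written sketch, where legacy_sse_tail stands in for some hypothetical routine built without AVX:

```c
#include <immintrin.h>

/* Hypothetical routine compiled without AVX (legacy, non-VEX SSE). */
extern void legacy_sse_tail(float *tail, int n, float k);

void scale_then_call_legacy(float *data, int n, float *tail, int tn, float k)
{
    __m256 kv = _mm256_set1_ps(k);

    /* 256-bit AVX loop: leaves the upper halves of the YMM registers dirty. */
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        _mm256_storeu_ps(&data[i], _mm256_mul_ps(v, kv));
    }

    /* Clear the upper YMM state before handing control to non-VEX SSE
     * code; without this, Sandy Bridge-class cores save/restore the
     * upper halves and the mixing penalty appears. Compilers emit
     * VZEROUPPER at points like this on their own. */
    _mm256_zeroupper();

    legacy_sse_tail(tail, tn, k);
}
```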
This penalty, at the very least, is not likely to persist, except possibly on enthusiast gamer systems for another one or two hardware generations.
It also remains to be seen whether Sandy Bridge's penalty will still exist for Haswell, since each SIMD unit will be capable of 256-bit integer and floating-point operations. And finally, the warmup period is very well balanced to ensure that code without AVX instructions doesn't waste any power and that it's unnoticeable to code that does use AVX.
Agner Fog measured the transition between the cold and warm 256-bit vector states as taking hundreds of cycles; until the transition completes, throughput is halved.
This is generally tolerable because over time the initial cost is amortized over millions of cycles of full throughput as long as no cooling off occurs.
Several hundred cycles of warmup is a lot of margin to fit alternative design choices into. Signalling and the pertinent parameters can be transferred on-chip in tens of cycles, and once running, the specialized silicon will save power for millions of cycles.
That's hindsight. Today it's plainly obvious that we'll never go back to a non-unified GPU architecture. But several years ago it really wasn't cut and dried that vertex and pixel processing should use a unified architecture.
The physical argument remains the same. Embarrassingly parallel vertex shaders and embarrassingly parallel fragment shaders don't make significantly different demands on the units and the silicon implementing them, even when not unified.
"It’s not clear to me that an architecture for a good, efficient, and fast vertex shader is the same as the architecture for a good and fast pixel shader." - David Kirk, NVIDIA chief architect, 2004
G80 was already underway at the time.
If it wasn't clear to Kirk, it wasn't too cloudy, either.
Today's out-of-order execution architectures consume far less power than those of several years ago. It's even becoming standard in mobile devices (the iPhone 5's CPU is faster than the 21264 in every single metric). And things are only getting better, so let's not exaggerate the cost of out-of-order execution.
It would be interesting to see a process-normalized comparison of power consumption against a design with the same reordering and execution capabilities as the EV6.
It's a small core these days and, as you said, inferior.
A "growing trend" of software and hardware fighting over control isn't a sustainable solution.
There's no more of a battle over this than there is when developers targeting OoO designs fret over the exact order in which their instructions execute.
If the end result is the same, the answer to software is that it's none of its business.
In the cases where the results aren't the same (typically low-level timing issues), the answer is that the OS, compiler, or developer needs to get over it or buy an in-order core.
Developers would have to deal with the unexpected behavior of several configurations of several generations of several architectures from several vendors, with several driver and firmware versions.
At the same time, software is challenged by difficulties in load-balancing highly parallel HPC systems with complex code.
Hardware can know a lot of things that software cannot, and that dynamic information can be used to decide which software functions or paths to use (or to remove, if they have already been generated), or whether an optimization pass can begin.
It's somewhat similar to some shader optimizations in the hardware and software realms; it would just be done faster.
You can't measure what kind of specialized core a thread might need, before you execute it.
Software is allowed to submit pertinent data to the hardware or OS scheduler.
A brief history of performance events can in the space of hundreds of cycles out of billions provide a decent enough clue.
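As a user-space analogue of what I mean (hardware could do the same thing far more cheaply), a thread can sample its own counters and use a short history to pick a path. Here's a sketch using Linux perf_event_open; the event choice, the switching threshold, and the do_*_phase routines are placeholders of mine:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <stdbool.h>

/* Open a per-thread counter; branch misses here, but any event would
 * do. The point is the feedback loop, not the specific event. */
static int open_branch_miss_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_HARDWARE;
    attr.size           = sizeof(attr);
    attr.config         = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled       = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv     = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Hypothetical work phases provided by the application. */
extern void do_branchy_phase(void);
extern void do_branchless_phase(void);

void adaptive_loop(int iterations)
{
    int fd = open_branch_miss_counter();
    if (fd < 0)
        return;   /* counters unavailable; a real version would fall back */

    bool use_branchless = false;

    for (int i = 0; i < iterations; i++) {
        uint64_t misses = 0;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        if (use_branchless)
            do_branchless_phase();
        else
            do_branchy_phase();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &misses, sizeof(misses)) != sizeof(misses))
            misses = 0;

        /* Arbitrary placeholder policy: switch paths when the data-
         * dependent branches turn out to be unpredictable in practice. */
        use_branchless = (misses > 100000);
    }
    close(fd);
}
```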
And threads can switch between fibers that have very different characteristics. So a better solution is to have only one type of core which can handle any type of workload and adapts to it during execution. It doesn't have to do any costly migrations (and explicitly "manage" that at minimal latency), it just adjusts on the spot.
This is the asserted ideal, but it's not likely to be borne out in practice to meet the demands of the market.
This indicates that homogenizing the ISA and memory model at a logical level will eventually lead to unifying fetch/decode and the memory subsystem at the physical level.
Why would I need to drive a 4-wide high-speed decoder, and multiple instruction queues and caches if a workload doesn't need them?
The same ALU doesn't have to perform the same operation twice. The result is typically bypassed to all other ALUs that can operate on it, and for any operation they support.
Only if the dependent instructions are in the position to snoop the bus at the right cycle. Later accesses go back to the register file.
Would it interest you to know it was claimed that one of the core designs AMD discarded before Bulldozer didn't bypass results?
Also, since the scheduler wakes up dependent instructions in the cycle before the result becomes available, the chances of executing an instruction which can pick operands from the bypass network is very high.
If they are in the scheduler by the time the instruction executes, or are able to pull their operand from the bus and execute immediately with no other data or structural hazards.
It's easier with reservation stations that actually snoop the bus and store things over time, but that's not power-efficient anymore.
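To make the snooping point concrete, here's a toy model of operand pickup at issue: a source comes off the bypass network only if its producer's tag is on the result bus that same cycle, otherwise it falls back to a register file read. The structure and sizes are illustrative only, not any real machine's:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of operand pickup at execute time. The "result bus" holds
 * the tags and values produced this cycle; a consumer issuing exactly
 * when its producer broadcasts can take the operand off the bypass
 * network, anything later must read the register file instead. */

#define BUS_WIDTH  4      /* results broadcast per cycle (placeholder) */
#define NUM_REGS   64

struct result_bus {
    int      tag[BUS_WIDTH];    /* destination physical register, -1 if idle */
    uint64_t value[BUS_WIDTH];
};

struct regfile {
    uint64_t r[NUM_REGS];
};

/* Fetch one source operand for an instruction issuing this cycle. */
uint64_t read_operand(const struct result_bus *bus,
                      const struct regfile *rf,
                      int src_reg,
                      bool *from_bypass)
{
    /* Snoop the bus: hit only if the producer writes back this cycle. */
    for (int i = 0; i < BUS_WIDTH; i++) {
        if (bus->tag[i] == src_reg) {
            *from_bypass = true;
            return bus->value[i];
        }
    }
    /* Missed the window: the value was already written back, so it
     * costs a (more expensive) register file read. */
    *from_bypass = false;
    return rf->r[src_reg];
}
```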
Note also that writeback can be gated when the value's corresponding register is overwritten by a subsequent instruction.
For an OoO design, that couldn't be guaranteed until all intervening instructions have been found free of exceptions at retirement and no interrupts have come up. It isn't apparent until then which values can be safely skipped.
Again, things like tag checking are independent of instruction width. So it becomes insignificant for very wide SIMD instructions.
Additional checking or some built-in shifters or adders are necessary for multi-cycle SIMD instructions, so it's more linear unless everything is single-cycle.
So I'm not arguing that the ORF and the bypass network are the exact same thing, but from a power consumption point of view they can serve similar purposes. And while the ORF saves the cost of tag checking, it doesn't help reduce thread count to improve memory locality and thus causes more power to be burned elsewhere.
They're not related at all. The ORF is the top tier of a multi-level register file, with the upper tier aliased with the ones below. It saves tag checking and reads/writes to the larger register file. It's read like any other register file because it is a register file.
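Since we seem to be talking past each other on what the ORF is, here's a toy model of the structure as I described it: a small top tier holding the most recently written registers, aliased over the full register file beneath it, so a hit avoids touching the big array. The sizes and the replacement policy are placeholders, not any particular design's:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy two-level register file. The small top tier ("ORF") caches the
 * most recently written registers and is aliased over the full file,
 * so a hit avoids reading the larger, more power-hungry array. */

#define ORF_ENTRIES  8
#define NUM_REGS     64

struct orf_entry { int reg; uint64_t value; bool valid; };

struct regfile2 {
    struct orf_entry orf[ORF_ENTRIES];  /* small, cheap top tier   */
    int              orf_next;          /* FIFO replacement cursor */
    uint64_t         full[NUM_REGS];    /* large backing tier      */
};

/* Writes land in both tiers; an already-cached register is updated in
 * place, otherwise the oldest ORF entry is evicted (FIFO placeholder). */
void rf_write(struct regfile2 *rf, int reg, uint64_t value)
{
    rf->full[reg] = value;

    for (int i = 0; i < ORF_ENTRIES; i++) {
        if (rf->orf[i].valid && rf->orf[i].reg == reg) {
            rf->orf[i].value = value;
            return;
        }
    }
    struct orf_entry *e = &rf->orf[rf->orf_next];
    e->reg = reg; e->value = value; e->valid = true;
    rf->orf_next = (rf->orf_next + 1) % ORF_ENTRIES;
}

/* Reads check the top tier first; it's still just a register file
 * read, only against a much smaller array. */
uint64_t rf_read(const struct regfile2 *rf, int reg)
{
    for (int i = 0; i < ORF_ENTRIES; i++)
        if (rf->orf[i].valid && rf->orf[i].reg == reg)
            return rf->orf[i].value;
    return rf->full[reg];
}
```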