Waiting on III-V is 8 or more years. TFET is more speculative and may be a decade or more away.
You missed the point. Everyone's concentrating on lowering power consumption now. There will be short-term results and long-term results: short-term results that keep driving the convergence between the CPU and GPU, and long-term results that make unification an achievable and desirable goal, not just for integrated graphics but also to scale beyond that.
The original Pentium ran at several hundred MHz and continued to run in the same ballpark with NTV.
Mobile GPUs run at several hundred MHz.
Niche HPC hardware and FPGAs run at several hundred MHz.
They can all do as horribly as they do right now, just many times more power efficiently.
The original Pentium running at several hundred MHz was a design from 1993. So you'd have to wait about 20 years for NTV technology to give your GPU the same reduction in power consumption that was achieved with that Pentium, while still delivering today's performance level. Of course the increasing transistor budget can offset this, but that reduces the gain in power efficiency to only about 4x. Obviously all of this is highly dependent on what happens to semiconductor technology in the next two decades...
In any case, you clearly don't want to adopt full NTV technology any time soon. You want to pick the design changes that let you lower the supply voltage with the least impact on performance, and only to augment what newer processes and design techniques don't already offer. Intel is already on top of all of this for its CPUs: using 8T SRAM and adopting FinFET early. GPUs suddenly adopting full NTV technology wouldn't let them catch up with this in any way.
20-30% per node, optimistically. That means that the general-purpose silicon can get its 20-30% increase in transistor count, and then there's 70-80% of the chip that they can put something in or just have unexposed wafer.
It has to be much better than that. The Radeon 7970 has 65% more transistors than the 6970, and offers 40% higher performance in practice (with only 50% higher bandwidth). Even if you account for the slightly higher power consumption and clock frequency, that's reasonably good utilization of those extra transistors. Yes, it's possible that on average only 20-30% of those extra transistors are switching, but the 6970 doesn't have transistors that switch 100% of the time either, so you have to look at the relative increase. And if power were the limiting factor for using more of them, you wouldn't expect the clock frequency to increase. Also, this is with a process that's approaching the limits of planar bulk transistors. FinFET offers a significant improvement in the short term, and there are multiple promising technologies for the longer term.
And it will surely be extended upon. Gating is a hot research topic (pun not intended). It's obviously disingenuous to look at an Alpha 21264 when discussing the power consumption per instruction of modern CPUs, but it's equally pointless to look only at today's CPU designs when discussing their future scaling potential. For instance, branch prediction confidence estimation is said to save up to 40% in power consumption while costing only 1% in performance due to false negatives. When you have multiple cores and you're optimizing for throughput/Watt, this should be tuned to allow a slightly larger single-threaded performance impact in exchange for a substantially bigger power saving on badly predictable, branchy code (and it could also be tuned dynamically depending on the number of active threads).
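To make that concrete, here's a minimal sketch of what a confidence estimator boils down to, in C, with hypothetical table sizes and thresholds: a small table of saturating counters remembers how often recent predictions for a branch were right, and when confidence drops the front end can throttle speculation instead of burning power fetching down a probably-wrong path.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical 1K-entry table of 2-bit saturating confidence counters.
 * A high count means recent predictions for this branch were mostly right. */
#define CONF_ENTRIES   1024
#define CONF_THRESHOLD 2                       /* below this: low confidence */
static uint8_t conf[CONF_ENTRIES];             /* values 0..3 */

static unsigned conf_index(uint64_t pc) { return (pc >> 2) & (CONF_ENTRIES - 1); }

/* Called when a branch resolves: train the confidence counter. */
void conf_update(uint64_t pc, bool prediction_was_correct)
{
    unsigned i = conf_index(pc);
    if (prediction_was_correct) { if (conf[i] < 3) conf[i]++; }
    else                        conf[i] = 0;   /* mispredict: lose confidence */
}

/* Consulted at fetch: when confidence is low, the front end could fetch less
 * aggressively (or gate idle resources) rather than speculate deeply on a
 * branch that is likely to be mispredicted anyway. */
bool speculate_aggressively(uint64_t pc)
{
    return conf[conf_index(pc)] >= CONF_THRESHOLD;
}
```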
Wider SIMD units would increase the relative number of transistors that can be gated when running scalar workloads. And long-running SIMD instructions enable more gating in the rest of the pipeline during parallel workloads.
I'm obviously just scratching the surface here. There are hundreds if not thousands of researchers and engineers working on things like this. Besides, both the CPU and the GPU have the same switching activity reduction problem, so it's not like a unified architecture would in theory be worse off. It's definitely a challenge to ensure that this unification doesn't completely cancel out the power optimizations, but there are clearly ways forward, and it solves all of the efficiency and programmability issues that heterogeneous computing is facing.
The lack of a free lunch doesn't seem like a strong detraction from anything.
There is absolutely nothing we've discussed that comes for free.
Developers are not very willing to jump through many hoops for extra performance. The failure of GPGPU in the consumer market clearly illustrates this. AVX2 on the other hand marks an inflection point for auto-vectorization to extract DLP from generic code. That's a free lunch. Likewise TSX is mostly about enabling the creation of tools and frameworks which assist or automate multi-threaded development. That's lowering the developer cost of TLP extraction. Of course it's not completely free on the hardware end, but I'm sure GPU manufacturers wished they could pay that low a price to make GPGPU a selling point.
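As a simple illustration of what "free lunch" means here, take a loop like the one below (hypothetical function, plain C): with gcc or clang at -O3 -mavx2 the compiler can vectorize it on its own, mapping the indexed loads onto AVX2 gathers and the 32-bit integer math onto 256-bit integer ops, without the developer writing a single intrinsic.

```c
#include <stddef.h>

/* Plain scalar C -- no intrinsics, no pragmas. A compiler targeting AVX2
 * can auto-vectorize this: the table lookup maps onto gather instructions
 * and the integer math onto full-width 256-bit integer operations. */
void remap_and_scale(int *dst, const int *src, const int *table, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = table[src[i]] * 3 + 1;
}
```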
It strikes me as extra baffling because the earlier part of this discussion concerning long-running SIMD is a design that somehow knows the workload it's running and sort of adjusts itself.
It strikes me as very disingenuous to say that only one type of core can have access to this knowledge.
Keeping track of the ratio of long-running SIMD instructions that are executed seems pretty straightforward. And while I never said that only one type of core could have access to such knowledge, I don't see what a non-unified architecture could do with it. Care to elaborate?
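For what it's worth, "keeping track of the ratio" can be as simple as a decaying counter updated at retirement; a sketch of the idea, with hypothetical names and fixed-point arithmetic just to keep it cheap:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-core counter: a decaying estimate of the fraction of
 * retired instructions that are long-running SIMD ops, in Q16 fixed point. */
static uint32_t simd_ratio_q16;                /* 0 .. 65536 */

void on_retire(bool is_long_running_simd)
{
    int32_t sample = is_long_running_simd ? 65536 : 0;
    int32_t delta  = sample - (int32_t)simd_ratio_q16;   /* simple low-pass   */
    simd_ratio_q16 = (uint32_t)((int32_t)simd_ratio_q16 + delta / 256);
}

/* A power-management policy (or the core itself) could read this to decide
 * how aggressively to gate scalar resources, or to tune other heuristics. */
uint32_t long_running_simd_ratio_q16(void) { return simd_ratio_q16; }
```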
It's particularly true since major shifts in unit activity do incur costs, as we see with Sandy Bridge and its warmup period and known performance penalty for excessively switching between SIMD widths.
There are a lot of transfers and costs that can be considered acceptable if they are within the same ballpark in terms of latency and overhead for incidental events such as that.
If Intel is free to caution programmers not to do X, or risk wrecking performance on SB, the same leeway can be granted elsewhere.
The penalty for mixing SSE and AVX instructions is easy to avoid, and mostly just a guideline for compiler writers. It's on a completely different level than GPU manufacturers telling you to minimize data transfers between the CPU and GPU, which is nowhere near trivial to achieve. It also remains to be seen whether Sandy Bridge's penalty will still exist for Haswell, since each SIMD unit will be capable of 256-bit integer and floating-point operations. And finally, the warmup period is very well balanced to ensure that code without AVX instructions doesn't waste any power and that it's unnoticeable to code that does use AVX.
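To show how mild that guideline is in practice: the fix is essentially to clear the upper YMM halves before 256-bit AVX code can hand control to legacy (non-VEX) SSE code. Compilers targeting AVX already emit vzeroupper at function boundaries; hand-written code just does something like this sketch (hypothetical function):

```c
#include <stddef.h>
#include <immintrin.h>

/* 256-bit AVX loop; the vzeroupper at the end clears the upper YMM halves
 * so that any legacy SSE code the caller runs next doesn't trigger the
 * Sandy Bridge state-transition penalty. */
void scale_avx(float *data, size_t n, float factor)
{
    __m256 f = _mm256_set1_ps(factor);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), f));
    for (; i < n; i++)
        data[i] *= factor;                     /* scalar tail */

    _mm256_zeroupper();                        /* avoid the SSE/AVX penalty */
}
```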
Or just moving them within millimeters of each other and using physical integration to provide growth in bandwidth.
Whatever gets the job done works for me.
Sure, and every step along the way that gets the job done, happens to be a convergence step.
I would say that the caches, interconnect, and memory controller would be in the same boat.
And that extra coalescing doesn't come for free, if you consider that a valid argument.
Not free, but cheaper than relying on the developer to handle data locality.
The actual hardware demands for the two types in terms of units and data paths weren't that dissimilar.
That's hindsight. Today it's plainly obvious that we'll never go back to a non-unified GPU architecture. But several years ago it really wasn't cut-and-dried that vertex and pixel processing should use a unified architecture.
"It’s not clear to me that an architecture for a good, efficient, and fast vertex shader is the same as the architecture for a good and fast pixel shader." - David Kirk, NVIDIA chief architect, 2004
There will come a day in the not too distant future when all programmable computing will be handled by one architecture. And from then on it will be considered plainly obvious that all workloads are not too dissimilar; they're just a mix of ILP, DLP, and TLP.
The physical circuits and units involved are massively overdetermined for that use case, and the cost of that is significantly higher than zero.
Every load and store would be rammed through a memory pipeline designed for 4GHz OoO speculation and run by a scheduler, retirement logic, and bypass networks specified to provide peak performance and data transfer to portions of the core you declare are fine to be unused.
Today's out-of-order execution architectures consume far less power than those of several years ago. Out-of-order execution is even becoming standard in mobile devices (the iPhone 5's CPU beats the 21264 in every single metric). And things are only getting better, so let's not exaggerate the cost of out-of-order execution. The 21264 days are long gone. Also, the scheduling cost is the same regardless of whether it's an 8-bit or a 1024-bit arithmetic instruction, or an 8-bit or a 1024-bit load or store. That, together with long-running instructions to increase the gating opportunities, really means that out-of-order execution will become a non-issue for high-DLP workloads. And in fact it improves data locality, thus saving on data transfer power consumption.
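A back-of-the-envelope model makes the width argument clearer. The energy figure below is made up, but the point is structural: if the out-of-order bookkeeping (rename, schedule, retire) costs roughly the same per instruction regardless of data width, its overhead per element shrinks linearly as the SIMD width grows.

```c
#include <stdio.h>

/* Toy model, made-up energy figure: per-instruction OoO bookkeeping cost
 * divided by the number of 32-bit elements the instruction carries. */
int main(void)
{
    const double ooo_pj_per_uop = 25.0;        /* assumed bookkeeping energy */
    const int widths_bits[] = { 32, 128, 256, 1024 };

    for (int i = 0; i < 4; i++) {
        int elems = widths_bits[i] / 32;       /* 32-bit elements per op */
        printf("%4d-bit op: %2d elements -> %.2f pJ of OoO overhead each\n",
               widths_bits[i], elems, ooo_pj_per_uop / elems);
    }
    return 0;
}
```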
I guess it is true that if a programmer has more power than he should have and does pointless things, there could be a problem. It's sort of solved by the growing trend of the chip's cores, microcontrollers, and firmware very quietly overriding what software thinks is happening.
A "growing trend" of software and hardware fighting over control isn't a sustainable solution. Developers would have to deal with unexpected behavior of several configurations of several generations of several architectures of several vendors with several driver and several firmware versions. With ever more abstraction layers to run ever more complex code on this very wide variety of hardware, with each layer thinking it knows what's best and most of them not under the application developer's control, it becomes incredibly hard to write high performance software. Heck, it's a nightmare just to get acceptable stability and provide bug fixes after shipping.
A unified architecture would eliminate these issues. There's no need to balance workloads between different core types of unknown capability. There's no unexpected data transfer behavior. And the 'driver' is whatever software libraries you decide to use and ship.
Threads migrate all the time, even between homogenous cores. The costs are measurable and can be scheduled and managed if they aren't explicitly spelled out by the software so that the chip knows what kind of core a thread needs.
You can't measure what kind of specialized core a thread might need before you execute it. And threads can switch between fibers that have very different characteristics. So a better solution is to have only one type of core, which can handle any type of workload and adapts to it during execution. It doesn't have to do any costly migrations (or explicitly "manage" them at minimal latency); it just adjusts on the spot.
Unifying fetch and decode might be a choice, but it too doesn't seem to be strictly necessary since the fetch and decode requirements can be different between cores. There would be no software-visible difference.
Nothing is "strictly" necessary. Having the same shader capabilities for vertices and pixels while having dedicated cores for each is
perfectly possible. That doesn't mean its recommendable though. Even mobile GPUs are going unified.
This indicates that homogenizing the ISA and memory model at a logical level will eventually lead to unifying fetch/decode and the memory subsystem at the physical level.
The value on the bypass bus is whatever value came out of the ALU, irrespective of the destination register, and unless the ALU performs the same operation twice, that value is gone afterwards and further accesses will need to come from the register file.
The same ALU doesn't have to perform the same operation twice. The result is typically bypassed to all other ALUs that can operate on it, and for any operation they support. Also, since the scheduler wakes up dependent instructions in the cycle before the result becomes available, the chances of executing an instruction which can pick its operands off the bypass network are very high. Note also that writeback can be gated when the value's corresponding register is overwritten by a subsequent instruction. This can cause a large portion of instructions to execute without even touching the register file.
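A tiny trace-driven toy (hypothetical 4-register machine, made-up dependency chain) shows the two effects together: operands caught on the bypass because the consumer issues right after the producer, and writebacks that could be gated because the register is overwritten before anything ever reads it from the file.

```c
#include <stdio.h>

/* An instruction's result is "bypassed" if a consumer issues in the next
 * cycle; its register-file writeback is "gateable" if the register is
 * overwritten before anything reads it from the file (i.e. every consumer
 * already got the value via the bypass network). */
typedef struct { int dst, src1, src2; } Insn;

int main(void)
{
    /* r1=r0+r0; r2=r1+r1; r1=r2+r2; r3=r1+r1; r1=r3+r3; r0=r1+r1 */
    Insn t[] = { {1,0,0}, {2,1,1}, {1,2,2}, {3,1,1}, {1,3,3}, {0,1,1} };
    int n = (int)(sizeof t / sizeof t[0]);
    int last_writer[4] = { -1, -1, -1, -1 };    /* which insn last wrote reg  */
    int read_from_rf[8] = { 0 };                /* was result i read from RF? */
    int bypassed = 0, gateable = 0;

    for (int i = 0; i < n; i++) {
        int srcs[2] = { t[i].src1, t[i].src2 };
        for (int s = 0; s < 2; s++) {
            int w = last_writer[srcs[s]];
            if (w == i - 1)      bypassed++;          /* caught on the bypass */
            else if (w >= 0)     read_from_rf[w] = 1; /* had to come from RF  */
        }
        int prev = last_writer[t[i].dst];
        if (prev >= 0 && !read_from_rf[prev]) gateable++; /* dead writeback   */
        last_writer[t[i].dst] = i;
    }
    printf("operands caught on bypass: %d, gateable writebacks: %d\n",
           bypassed, gateable);
    return 0;
}
```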
The ORF is a software-managed set of registers used to keep excessive evictions from occurring from the RFC, and can service multiple accesses across multiple cycles. It is guided by the compiler's choices in register ID usage, and because the source value is in the instruction, no tag checking is needed. It's very much not a bypass network.
Again, things like tag checking are independent of instruction width, so their cost becomes insignificant for very wide SIMD instructions. Fermi used tag checking in the RFC. So I'm not arguing that the ORF and the bypass network are exactly the same thing, but from a power consumption point of view they can serve similar purposes. And while the ORF saves the cost of tag checking, it doesn't help reduce thread count to improve memory locality, and thus causes more power to be burned elsewhere.