You are massively underestimating the demands of video processing in the next 10 years. I think 3D 4Kx4K HEVC could take 40x the performance of 2D 1080p H.264 High Profile. Even including the large one-time bump with AVX2, I'm very skeptical that CPU performance will scale by 40x in 10 years while remaining in the same thermal envelope. And even if it did, it will most certainly not be "low enough power consumption". The other side of the coin is that the PowerVR VXD392 and VXE382 can both handle 3D 4Kx4K H.264 High Profile today (scaling to HEVC in the future) at significantly less than 1W.
Let's first get something straight. There are three main discussions going on right now:
1) GPGPU versus high-throughput CPU
2) Dedicated versus software graphics
3) Fixed-function versus programmable video
Even if, hypothetically, the fixed-function hardware wins (3) hands down (which I'll debate below), that doesn't mean a thing for (1), and (2) can still go in either direction depending on platform, market segment, semiconductor technology breakthroughs, etc. Can we at least agree on that to some degree?
Now, while there's indeed plenty of room for scaling the video processing workload, I sincerely doubt that the majority of consumers want or need 3D 4Kx2K HEVC in the next ten years. Case in point: Blu-ray sales still have to overtake DVD sales. Quad Full HD screens are humongous and currently cost both your kidneys. And while I'm sure one day they'll become somewhat affordable, the question remains whether people actually see much need for them. This situation is once again not unlike audio processing: although higher qualities exist, 16-bit at 44 kHz has proven adequate for the masses for decades.
And yes, Sandy Bridge's video encoding doesn't achieve very high quality, but that is not an intrinsic limitation of hardware encoding (even if it's practically never going to beat x264's quality).
It's an intrinsic limitation of fixed-function hardware that it's not forward compatible. Today's H.264 hardware is worthless for tomorrow's HEVC material, even at low resolution. No amount of power savings makes up for not being able to run something.
The handheld market is moving towards a 'reverse Turbo Boost' mechanism, rather similar to what AMD implements on Cayman: you have a maximum frequency, and the chip monitors its total power consumption and temperature at a variety of likely hotspots. It automatically reduces the frequency and voltage of different blocks as required to fit within the power and thermal budget specified by the OEM.
As the number of cores increases further on CPUs and both peak power and hotspots become a problem even at the default frequency, Intel will be forced to move to something closer to this even on the desktop. I suspect the CPU will have to be clocked lower when all 8 cores are doing full-throttle AVX2 work.
Whether you increase the frequency with a single-threaded workload, or reduce the frequency with a multi-threaded workload, in the end that's exactly the same thing. It's just semantics whether you call the low or the high frequency the base frequency and boost or reverse boost from there.
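Either way, the underlying control loop is the same. Here's a minimal sketch of such a budget-driven governor; all names, numbers and the control policy are my own illustration, not AMD's, Intel's or any OEM's actual implementation:

```cpp
#include <algorithm>

// Illustrative only: inputs from hypothetical on-die monitors.
struct Sensors {
    double package_watts;   // total power draw reported by the power monitor
    double hottest_spot_c;  // hottest of the monitored thermal sensors
};

// One iteration of a power/thermal governor: stay at the maximum frequency
// unless the power or thermal budget set by the OEM is exceeded, then back
// off frequency (and, implicitly, voltage) until the chip fits again.
int next_frequency_mhz(const Sensors& s, int current_mhz,
                       double power_budget_w, double temp_limit_c) {
    const int min_mhz = 800, max_mhz = 2000, step_mhz = 100;
    if (s.package_watts > power_budget_w || s.hottest_spot_c > temp_limit_c)
        return std::max(min_mhz, current_mhz - step_mhz);   // throttle down
    return std::min(max_mhz, current_mhz + step_mhz);       // recover headroom
}
```

The real thing operates per block and adjusts voltage along with frequency, as described above; the sketch only captures the feedback principle.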
Just like High-K was at 45nm, Tri-gate is a one-time improvement at 22nm. It pre-emptively prevents some problems that would have become much more severe at future nodes, but it does not magically fix the power scaling problem in any general sense. Intel's 14nm process will take a fair bit more power per mm² of silicon than their 22nm process at the default voltage, and there's no way around that. I'm not aware of anything revolutionary at 14nm, and I haven't even heard anything very exciting for 10/8nm despite having my ears pretty close to the ground for process technology in general.
That's still many years from now. Back when 90 nm caused Intel to ditch the NetBurst architecture, people didn't have the slightest clue about 22 nm tri-gate devices. And even if 14 nm itself doesn't bring anything new for lowering power consumption, once again note that AVX-1024 creates a 3/4 clock gating opportunity. That's some dark silicon for you right there.
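To put a number on that 3/4: this assumes 1024-bit instructions are executed over multiple cycles on the existing 256-bit execution units (my reading of the AVX-1024 idea, not something spelled out above). The arithmetic is simply:

```cpp
// Back-of-the-envelope only, under the assumption stated above.
constexpr int vector_width_bits   = 1024;
constexpr int datapath_width_bits = 256;
constexpr int cycles_per_op       = vector_width_bits / datapath_width_bits;  // 4
// The front end only needs to issue a new uop once every 4 cycles, so fetch,
// decode and scheduling logic can be clock-gated the remaining fraction:
constexpr double gated_fraction   = 1.0 - 1.0 / cycles_per_op;                // 0.75
```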
Everyone is very well aware that new innovation in power efficiency will be required to continue Moore's Law. So a lot of effort goes into it, and it can come in many forms: not just semiconductor process breakthroughs and fine-grained clock gating, but also ISA enhancements and software itself. For example, gather replaces a power-hungry sequence of 26 uops, where each insert or extract instruction had to move more than 128 bits around instead of merely the individual elements! In a way the gather logic adds "dedicated" support for a common operation, but it's still quite generic (in the same way that other vector instructions are generic, at least). Again, it's all about covering ILP, TLP and DLP for generic workloads.

Software innovation also assists in increasing effective performance/Watt, through things like dynamic code generation and advanced work culling. And I've already mentioned the potential of out-of-order execution to actually assist in increasing cache hit ratios, and thus reducing the power consumption involved in fetching data from higher up in the hierarchy.
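To make the gather point concrete, here's a rough sketch (my own illustration with made-up function names, not taken from the post) of an eight-element gather emulated with scalar loads and insert instructions, versus a single AVX2 gather intrinsic:

```cpp
#include <immintrin.h>

// Pre-AVX2: each element is loaded individually and inserted into the vector,
// shuffling a whole 128-bit register just to place a single 32-bit value.
__m256 gather8_emulated(const float* base, const int idx[8]) {
    __m128 lo = _mm_setzero_ps(), hi = _mm_setzero_ps();
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[0]), 0x00);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[1]), 0x10);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[2]), 0x20);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + idx[3]), 0x30);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[4]), 0x00);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[5]), 0x10);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[6]), 0x20);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + idx[7]), 0x30);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

// AVX2: a single gather instruction takes the base pointer and an index vector.
__m256 gather8_avx2(const float* base, __m256i indices) {
    return _mm256_i32gather_ps(base, indices, /*scale=*/4);
}
```

Every insert in the emulated version moves a full 128-bit register around just to deposit one 32-bit element, which is exactly where the extra uops and wasted data movement come from.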
So I'm not counting on a miracle from the process engineers. There's plenty of opportunity to scale homogeneous architectures for the foreseeable future. That said, I'm curious which merely "exciting" things you have heard about, if none of them qualify as "very exciting" yet...
Once again, even if you were right (which you are not), it's not just a question of technical viability but also of political considerations inside Intel. The lesson that Intel seems to have learned from the failure of Larrabee is the exact opposite of everything you're saying.
Like I said several times before, the failure of Larrabee as a high-end GPU doesn't mean the IGP won't get replaced by CPU cores. Intel still has plenty of other reasons to turn the CPU into a power-efficient high-throughput device with AVX2 and AVX-1024. So far these "politics" you talk about have not prevented AVX from converging toward LRBni.
If you want to see this happen, your only chance is to implement a kickass AVX2-based DirectX11 renderer in SwiftShader and manage to prove to the world - including Intel itself - that you can be competitive with Intel's own IGP in Haswell in terms of both absolute performance and performance/watt.
This isn't about me; it's about empowering every developer with limitless capabilities. And so it's not about DX11 software rendering either. A software renderer that's no more than a drop-in alternative for restrictive hardware rendering APIs would be a failure. Where things get really interesting is when you leave the beaten path...