The reality is that mixing sequential processing with stream processing increases code and hardware complexity. It is well understood that the complexity of a system can be reduced significantly by dividing it into its problem classes.
Unification makes the code far simpler, not more complex. If you know how to write a loop with independent iterations, you know how to implicitly mix scalar and vector code. Heterogeneous computing is way more complicated than that if you want acceptable results. As for the hardware, how can a CPU+GPU be less complex than just a CPU? Even with the wider SIMD units, Haswell's abilities as a sequential scalar CPU seem unaffected, and they were able to lower the power consumption while increasing performance, all on the same process node.
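To make that concrete, here's a minimal sketch of what I mean by a loop with independent iterations (my own illustration, not taken from any particular product or codebase): plain C++ that a compiler with auto-vectorization enabled can compile to AVX instructions for the loop body, while everything around it stays ordinary scalar code.

    #include <cstddef>

    // Each iteration touches only its own x[i] and y[i], so iterations are
    // independent of one another. A vectorizing compiler can emit wide SIMD
    // math for this loop, while the call site, the loop counter handling and
    // the rest of the program remain scalar code.
    void saxpy(float a, const float* x, float* y, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

The same source runs as scalar code on hardware without SIMD and as vector code where it's available. That's the implicit mixing I'm talking about, as opposed to writing and maintaining a separate kernel for a separate device.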
That said, hardware complexity doesn't have to be an issue in and of itself. For instance, branch prediction is a complex but small piece of logic that is well worth the performance advantage and what it contributes to simplifying software. So while CPU-GPU unification will require a lot of effort from the hardware designers, that's what they're paid to do, and shifting the problems from software to hardware is a good thing once process technology allows for it. Again, unified shaders on the GPU didn't come for free either.
The fundamental flaw in your logic is that you are insisting on the complete opposite of this solely because you have recently observed a trend of increasing programmability of GPUs and a tighter integration into the system, without any qualified analysis of whether the trend implies a convergence.
First of all, I haven't just "recently" observed a trend of increasing programmability of GPUs. I've observed it since the Voodoo2 added dual-texturing capabilities. Secondly, the mere observation of that trend for the past 15 years isn't what makes me believe it will inevitably result in unification; it's just in support of that. The difference is in what you and I believe to be the driving force. You seem to believe that I assume the trend will continue simply because it has been a trend so far. Indeed, such reasoning is very fragile and wouldn't necessarily result in unification. But instead I believe there is a strong desire for full programmability, and hardware is just the limiting factor to get there without losing too much performance.
Of course I can't present much of a "qualified analysis" for what is mostly a desire. But it's pretty obvious that consumers are not going to buy new hardware unless it has novel capabilities. Higher performance of existing applications isn't a novelty for very long; you have to enable new applications. So GPU manufacturers have to increase programmability to stay in business. CPU manufacturers already have ultimate programmability but lack the performance to enable new applications, so they adopt things like multi-core and wide SIMD units. So they're really striving for the same things, without losing previous qualities. CPUs aren't going to let go of out-of-order execution to improve throughput computing, and GPUs aren't going to let go of wide SIMD units to become better at scalar workloads. The only solution is a unified architecture which combines all those qualities. It's just a matter of time before process technology enables that.
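As a rough sketch of what "multi-core plus wide SIMD without giving up the normal programming model" looks like in practice (assuming an OpenMP-capable compiler; the function itself is just an illustration, not anything vendor-specific):

    #include <cstddef>

    // The pragma spreads the iterations across cores, and within each thread
    // the compiler vectorizes its chunk with SIMD instructions. The source
    // stays an ordinary loop with independent iterations.
    void scale(float a, const float* x, float* y, std::size_t n)
    {
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i];
    }

Same programming model, more throughput; that's the direction CPUs are taking without abandoning their existing strengths.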
Intel has performed an analysis with Larrabee, which is a long shot away from your radical suggestion, and the result is that there won't be a convergence even for that. So I wonder if any evidence and reason could possibly convince you.
Larrabee isn't an analysis of that, because it's not a unified architecture. Instead of combining the qualities of the CPU and GPU, they made compromises to aim for something in the middle. The result is that it fails as a CPU and fails as a GPU. It was also a mistake to try to enter the high-end market: it would have cost a fortune to sell it cheap enough to gain sufficient market share for developers to want to target it. The only market where such an in-the-middle compromise makes some sense is the HPC market, which they've entered with considerable success. But for graphics in the consumer market they should instead focus on architectures that are already worth the money by still being successful as a CPU while slowly becoming capable of replacing a low-end GPU. With the AVX roadmap they seem to be very well on track for that.
So I'm sorry, but your evidence of the failure of Larrabee as a high-end discrete GPU is in no way an indication that the unification of the CPU and integrated GPU will also be a failure.
It is a completely different question whether this trend will continue in the future (>5 years) at all, or even turn around.
I really don't have the time to link to the panel talks and papers which conclude that IC performance scaling has been slowing down at an accelerating rate since the introduction of the 32 nm node. Just trust me that this is an established fact.
Intel claims that Broadwell (14 nm) will be at least 30% more power efficient. And after that they seem to want to widen the SIMD units to 512-bit. So if performance scaling has slowed down since 32 nm, I'm not seeing it. I see absolutely no reason why I should trust you that this is an established fact.
Sure, some companies can no longer continue the historic pace due to increasing costs. This is especially a problem for low-volume custom designs. CPUs, though, are produced in very high volume because they are so widely applicable. This is another reason why unification makes sense: one architecture takes care of all your computing needs.