OK, let's entertain the idea of a super-CPU. An 8-core AVX2 chip @ 4 GHz gives you a single-precision teraflop. Sounds impressive until you look at perf/W or perf/$. That CPU on 22nm will be $400+ unless Intel becomes a charity. The equivalent 28nm GPU will be $150, earlier to market, and will also offer higher performance. Not so impressive after all: a $120 CPU plus a $150 GPU will be cheaper and faster than your homogeneous setup.
An 8-core Haswell CPU doesn't have to cost $400+ at all. Sandy Bridge is a tiny chip compared to AMD's 6-core Thuban, and the latter currently sells for $160-180 (and has no IGP). So producing a cheap 8-core Haswell CPU will hardly be a challenge.
The actual pricing really just depends on the competition. If AMD has even the slightest success with its APUs, Intel can drop its prices at will and still make a bigger profit. And if NVIDIA threatens their position in the HPC market, cheap and highly efficient teraflop CPUs will fix that right away.
And as mentioned before, you have to spec for peak, not average. Texturing workloads aren't spread evenly across a frame, and no amount of CPU magic will change that. If you spec your CPU for average texture throughput, it will be severely bottlenecked during texture-heavy portions of the frame.
And as I've said before, a CPU with AVX2 support would have higher peak texturing performance than the IGP. And it really does magically balance any workload since it's all just different bits of code executing on the same cores. So really all that matters is the sum of the average required performance for each task. There are no bottlenecks in software rendering, only hotspots.
GPUs have a lot of FF silicon but much of it is active in parallel and not bottlenecking anything.
When the TEX:ALU ratio is too high, GPUs easily become bottlenecked by the texture samplers. And when the shaders are too short, they become ROP-limited.
CPUs are more flexible but also slow - it's a simple tradeoff.
As I've pointed out before, it took only ~5% of extra die space to double the throughput between AMD's Brisbane and Barcelona architectures. AVX and FMA each double the peak throughput again at a minor increase in size. So just because current CPUs are primarily optimized for ILP doesn't mean they can't cheaply achieve high throughput as well. It's not a simple tradeoff in which you have to completely sacrifice one quality to obtain another.
You say that CPUs don't need to burn a lot of FLOPS emulating FF hardware, so where are the software renderers that bear that out?
SwiftShader is faster than it 'should be' when you add up all the computing power it would take to emulate every piece of the GPU's fixed-function hardware at peak throughput. It currently uses only 128-bit SIMD with destructive two-operand instructions and no FMA; fetching texels takes 3 uops each before filtering even starts; it has no lookup tables for computing transcendental functions; and it has to compute gradients and pixel coverage for each polygon, explicitly read/test/write depth values, interpolate inputs, blend and convert output colors, schedule tasks, process API calls, etc. Yet despite that seemingly massive amount of work, it achieves 20 FPS in Crysis on a 100 GFLOP CPU that also has to run the game itself and some background processes.
So don't underestimate what a software renderer could do with four times the GFLOPS per core, and gather.
Sorry, it's just not going to happen.
Never say never, especially when you're just handwaving. Exactly what unique advantage would an APU have left over a homogeneous CPU with AVX2 and AVX-1024? Peak performance/area and performance/watt would be highly competitive. And I've already debunked the importance of fixed-function components. Add to this the effect of out-of-order execution on cache effectiveness, and I can't see what substantial advantage heterogeneous architectures could possibly have.