Uhm, ok, so a CPU releasing - hopefully - in what, 2015 or something, will do two TF. We have GPUs right NOW at four and a half. What do you think a GPU will do in '15; six, eight? More?
The part you're missing is that only a tiny minority buys such monster GPUs, while octa-core CPUs will enter the mainstream in 2015 (at 14 nm such a chip would measure only about 100 mm²). To put this into context, 2 TFLOPS would be more than the PlayStation 4's entire GPU. From a CPU!
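Just to show the back-of-the-envelope math behind that 2 TFLOPS figure (assuming two 512-bit FMA units per core and a clock around 4 GHz; those exact numbers are my assumptions, not anything announced):

```cpp
#include <cstdio>

int main() {
    // Assumed configuration for a hypothetical 2015-class octa-core:
    const int    cores          = 8;
    const int    fma_units      = 2;    // assumption: two 512-bit FMA ports per core
    const int    lanes_per_unit = 16;   // 512 bits / 32-bit floats
    const int    flops_per_lane = 2;    // an FMA counts as a multiply plus an add
    const double clock_ghz      = 4.0;  // assumed clock frequency

    double gflops = cores * fma_units * lanes_per_unit * flops_per_lane * clock_ghz;
    printf("Peak: %.0f GFLOPS single precision\n", gflops);  // 2048 GFLOPS, i.e. ~2 TFLOPS
    return 0;
}
```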
Discrete GPUs are becoming a niche product, and they're focusing on graphics again instead of GPGPU. So application developers have no interest in targeting them. Intel has increased FLOPS more than fourfold between Westmere and Haswell, and could do it again with Skylake. That's the kind of ROI developers are looking for. Not just because it could speed up high-throughput workloads 16-fold just five years later, but mainly because it takes relatively little effort to achieve that. You can have code in practically any programming language vectorized, without worrying about new threads to synchronize or shifts in data locality. Your loops just execute faster, period.
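To illustrate the "little effort" part: a plain scalar loop like the toy example below is exactly the kind of thing compilers already auto-vectorize (e.g. at -O3). When the SIMD units get wider, the same source just runs faster, with no new threads and no data shuffling:

```cpp
#include <cstddef>

// Plain scalar code: the compiler can map this loop onto SSE, AVX or wider
// SIMD units without any change to the source (toy example of mine).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];  // one FMA per element once vectorized
}
```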
Trying to target NVIDIA, AMD and Intel GPUs instead is a nightmare, and you only get a fraction of the peak performance due to the heterogeneous synchronization and data transfer overhead, the GPU's poor handling of complex workloads, and the many pitfalls of dealing with different bottlenecks on different architectures.
Regardless, it'll be enough to squish any CPU available on the market right then, which in and of itself makes it worth targeting by developers. I know you're a big software/microprocessor fan, but discrete graphics processors aren't going to go away just because 8-core chips become mainstream.
I'm not saying they're going away altogether any time soon. I'm saying GPGPU is going the way of the dodo in the consumer market. Anything that involves sending data back and forth between the CPU and a discrete GPU is doomed to fail due to the bandwidth wall. Integrated GPUs suffer less from that (though they don't eliminate it), but they're far weaker to begin with and they still suffer from all the cumbersome heterogeneous programming issues.
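Here's a rough sketch of why the back-and-forth is so costly, using ballpark assumptions of ~16 GB/s for a PCIe 3.0 x16 link and ~200 GB/s for the GPU's local memory:

```cpp
#include <cstdio>

int main() {
    // Ballpark assumptions, purely for illustration:
    const double pcie_gb_s    = 16.0;   // ~PCIe 3.0 x16
    const double gpu_mem_gb_s = 200.0;  // high-end GPU local memory
    const double data_gb      = 1.0;    // working set shipped to the GPU and back

    double bus_ms = 2.0 * data_gb / pcie_gb_s * 1000.0;  // round trip over the bus
    double gpu_ms = data_gb / gpu_mem_gb_s * 1000.0;     // one pass over the data locally

    printf("PCIe round trip: %.0f ms, one local pass on the GPU: %.0f ms\n", bus_ms, gpu_ms);
    // Unless the kernel does a lot of work per byte, the bus dominates completely.
    return 0;
}
```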
Homogeneous computing has a much brighter future, with CPUs getting better at high throughput every generation, while offering a straightforward and well-behaved programming model.
Also, will CPU memory interfaces be able to keep up with 8-core, 512-bit SIMD units...? And in any case, high-end GPUs already have several times more bandwidth than any CPU releasing in the next few years.
You're looking at it all wrong. GPUs are wasteful with bandwidth and therefore they need a lot. This is not a good thing! Because they run thousands of threads, lots of on-die storage is required just to hold thread contexts. All these threads have to share tiny caches, so the miss rate is high and they have to reach out for data further away from the execution units. This costs a lot of power.
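To put some rough numbers on that (ballpark figures for a Kepler-class SMX, from the public specs as I recall them):

```cpp
#include <cstdio>

int main() {
    // Ballpark figures for a Kepler-class SMX (assumptions from public specs):
    const int threads_per_sm   = 2048;  // maximum resident threads
    const int register_file_kb = 256;   // 64K 32-bit registers
    const int l1_shared_kb     = 64;    // L1 + shared memory, combined

    printf("Registers per thread: %d bytes\n", register_file_kb * 1024 / threads_per_sm);
    printf("L1/shared per thread: %d bytes\n", l1_shared_kb * 1024 / threads_per_sm);
    // ~128 bytes of registers and ~32 bytes of cache per thread: misses are the norm,
    // so data keeps coming from further away, and that is what burns the power.
    return 0;
}
```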
This problem is only getting worse. It's not the ALUs that consume the most power; it's getting data into them. And power is the limiting factor in scaling performance. CPUs aren't affected by this nearly as much. Haswell quadrupled FLOPS per cycle over Westmere, while increasing clock frequency and actually reducing power consumption. So we're seeing a stark convergence in FLOPS/Watt.
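Concretely, the per-core, per-cycle single-precision numbers behind that claim:

```cpp
#include <cstdio>

int main() {
    // Single-precision FLOPS per core per cycle:
    // Westmere: 128-bit SSE, one multiply port plus one add port.
    const int westmere = 4 /*mul lanes*/ + 4 /*add lanes*/;      // = 8
    // Haswell: two 256-bit FMA ports, each FMA counting as two operations.
    const int haswell  = 2 /*ports*/ * 8 /*lanes*/ * 2 /*FMA*/;  // = 32

    printf("Westmere: %d, Haswell: %d -> %dx per clock\n",
           westmere, haswell, haswell / westmere);
    return 0;
}
```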
And again, the discrete GPU market is in decline. Integrated GPUs have to share memory bandwidth with the CPU cores. DDR4 and on-package or on-die DRAM can increase that bandwidth, but they come at a cost that is delaying their widespread adoption. And because GPUs need more bandwidth per FLOP than the CPU cores do, things are evolving in favor of the CPU cores.