GT200b: 55 nm, 1400 Mtr, 470 mm², 1063 GFLOPS
GF110: 40 nm, 3000 Mtr, 520 mm², 1581 GFLOPS
Factoring out the process shrink, that's a 0.63× reduction in computing density. And yes, it got more efficient, but I think that barely compensates for the loss.
i7-990X: 32 nm, 239 mm², 166 GFLOPS
i7-2600: 32 nm, 175 mm² (w/o IGP), 218 GFLOPS
That's a factor 1.8 increase in computing density, and FMA2 should nearly double it again. Of course FMA doesn't actually double effective performance, but note that the peak performance per mm² would then be quite close to that of the GPU.
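The densities behind these ratios can be checked directly from the quoted figures; here's a minimal sketch (the 0.63 process-normalized number is left out, since it depends on which scaling convention you apply to the 55 nm → 40 nm shrink):

```cpp
// Recomputes GFLOPS/mm^2 from the numbers quoted above.
#include <cstdio>

struct Chip { const char* name; double gflops; double area_mm2; };

int main() {
    const Chip chips[] = {
        { "GT200b",  1063.0, 470.0 },
        { "GF110",   1581.0, 520.0 },
        { "i7-990X",  166.0, 239.0 },
        { "i7-2600",  218.0, 175.0 },  // die area excluding the IGP
    };
    for (const Chip& c : chips)
        std::printf("%-8s %5.2f GFLOPS/mm^2\n", c.name, c.gflops / c.area_mm2);

    // The "factor 1.8" density increase from i7-990X to i7-2600.
    std::printf("i7-2600 / i7-990X density ratio: %.2f\n",
                (218.0 / 175.0) / (166.0 / 239.0));

    // Hypothetical FMA doubling of peak FLOPS (assumption: FMA counts as two
    // FLOPs per lane, everything else unchanged), compared against GF110.
    std::printf("i7-2600 + FMA: %.2f GFLOPS/mm^2 vs. GF110: %.2f\n",
                2.0 * 218.0 / 175.0, 1581.0 / 520.0);
    return 0;
}
```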
Please don't quote the bogus FLOPS for GT200. I was under the impression we were talking about MADD/FMA(C) FLOPS, at least I was.
With real, usable FLOPS we arrive at 3.04 versus 2.85 GFLOPS per mm², with GF110 getting the longer end of the stick. And that's including additional functionality and programmability.
Also, I cannot follow your reasoning for subtracting SB's IGP from the die area. By the same logic, I could argue for removing the ROPs, texture filtering and even the fixed-function rasterization from the GPUs' die area.
So you're probably asking why some benchmarks indicate a much wider gap. The answer is gather support, and FMA2 is going to fix that too.
"The answers" costs die area, as will it on future CPUs too. Less maybe, but free lunch is over.
Last but not least, a GPU is worthless by itself, so you have to take the area of its accompanying CPU into account as well. So again, the conclusion is very simple: GPGPU has no future.
A CPU is equally worthless without some kind of display device somewhere... And of course a tiny little chip on a remote terminal would suffice, but equally, a GPU doesn't need a massive modern multi-core OoO CPU to run its operating system. A small Atom- or ARM-like device would do.
Plus, CPUs have had the benefit of many more hardware iterations over which to perfect programming and compilers. GPUs, on the other hand, have only had five or so years and even fewer generations in which to evolve both their cores' potential and their programming models. I definitely see more headroom for improvement here than there is on the software side for CPUs.
I am not blind; I don't think GPUs are the be-all and end-all of processing, but neither are CPUs. CPUs buy their strength at general workloads with larger die area per core; GPUs buy their strength at more specialized workloads with their limited flexibility. CPUs that add more and wider vectors lose efficiency at their historical roots, while GPUs that add more and more flexibility lose their efficiency at delivering immense speed for specialized tasks. So while I won't make a sweeping statement like "CPU HPC is dead", I'd rather say compute is going to converge at some point where the optimum balance of flexibility and compute density is found. CPU manufacturers approach it from the high-speed serial side of things, GPU manufacturers from the massively parallel side of things.