CPU cores: 217.6 GFLOPS / (4 x 3.15 mm x 5.46 mm) = 3.16 GFLOPS/mm²
IGP: 129.6 GFLOPS / (4.71 mm x 8.70 mm) = 3.16 GFLOPS/mm²
That's for a Core i7-2600 as-is, at baseline frequencies. Haswell adds FMA to the mix...

Just for comparison, I made up some numbers based on the ancient RV770 (built on TSMC's 55nm process):
1200 GFLOPS in 255 mm² (including the 256-bit memory controller, which could be shared with a CPU) => 4.7 GFLOPS/mm²
The shader core alone (which still includes L1 caches and texturing hardware) measures ~105 mm² => 11.4 GFLOPS/mm²
Just the SIMD engines without TMUs (but of course with their 2.5 MB of register files) are roughly 74 mm² => 16.2 GFLOPS/mm²
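Perfect linear scaling to 32nm (without taking any frequency changes into account) shrinks the area by (32/55)², so the density goes up by about 2.95x. That would get you about 33.7 GFLOPS/mm² for the whole shader core (~48 GFLOPS/mm² for the SIMD engines alone). That is an order of magnitude more than the 3.16 GFLOPS/mm² of the Core i7-2600.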
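
For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the numbers above. It uses only the peak GFLOPS and die measurements quoted in this thread. The helper names `density` and `area_scale` are made up for illustration, and the shrink assumes ideal scaling (area goes with the square of the feature size, clocks unchanged), which real processes never quite deliver.

```python
# Back-of-the-envelope compute density from the figures quoted in this thread.
# Assumption: ideal shrink, i.e. area scales with the square of the feature size.

def density(gflops, area_mm2):
    """Peak GFLOPS per mm² of die area."""
    return gflops / area_mm2

def area_scale(from_nm, to_nm):
    """Ideal density gain when shrinking from one node to another."""
    return (from_nm / to_nm) ** 2

# Core i7-2600 (32nm): four cores of 3.15 mm x 5.46 mm each, IGP of 4.71 mm x 8.70 mm
cpu = density(217.6, 4 * 3.15 * 5.46)
igp = density(129.6, 4.71 * 8.70)

# RV770 (55nm): 1200 GFLOPS peak, measured over different parts of the die
rv770_peak = 1200.0
whole_die = density(rv770_peak, 255.0)    # incl. memory controller
shader_core = density(rv770_peak, 105.0)  # incl. L1 caches and texturing hardware
simd_only = density(rv770_peak, 74.0)     # SIMD engines and register files only

# Hypothetical perfect shrink from 55nm to 32nm, no frequency change
shrink = area_scale(55, 32)
shader_core_32 = shader_core * shrink
simd_only_32 = simd_only * shrink

print(f"CPU cores:          {cpu:.2f} GFLOPS/mm²")
print(f"IGP:                {igp:.2f} GFLOPS/mm²")
print(f"RV770 whole die:    {whole_die:.1f} GFLOPS/mm²")
print(f"RV770 shader core:  {shader_core:.1f} GFLOPS/mm²")
print(f"RV770 SIMD only:    {simd_only:.1f} GFLOPS/mm²")
print(f"Shader core @32nm:  {shader_core_32:.1f} GFLOPS/mm²")
print(f"SIMD only @32nm:    {simd_only_32:.1f} GFLOPS/mm²")
print(f"Shader core vs CPU: {shader_core_32 / cpu:.1f}x")
```

The results agree with the figures above to within rounding. The scaled values depend on whether you round the 55nm densities before or after applying the shrink factor.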