I thought it was already discussed how power hungry that eventually becomes.
16 registers per thread is plenty to ensure that the vast majority of accesses to reused data are register accesses. I sincerely doubt that having any more registers would have a significant effect on power consumption.
Also, you can't force data accesses to be register accesses. What makes it even more ironic is that on a GPU you want to minimize the number of registers per thread to maximize the number of wavefronts in flight, and that in turn also worsens cache contention.
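To illustrate the first point, here's a minimal C sketch (scalar code, my own example) of why 16 registers go a long way: the four accumulators, two pointers and the loop counter of this register-blocked dot product all fit comfortably in the register file, so the reused values never touch the cache.

#include <stddef.h>

/* Register-blocked dot product: the hot state (4 accumulators, 2
 * pointers, 1 counter) fits easily within 16 registers, so the
 * compiler keeps all reused data in the register file. A GPU compiler
 * squeezing the per-thread register count to raise the wavefront
 * count may spill exactly this kind of state to memory instead. */
float dot(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    float acc = acc0 + acc1 + acc2 + acc3;
    for (; i < n; i++)          /* remainder elements */
        acc += a[i] * b[i];
    return acc;
}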
I remember when the FMA specification first turned up years ago. AMD announced more than two years ago that Bulldozer would support FMA4 (and before that, SSE5, proposed in 2007, also included differently encoded 3-operand FMA instructions). The confusion about FMA4 versus FMA3 is also already two years old. So anyone who didn't know that Intel would implement FMA3 must have been living under a stone in a cave somewhere in the middle of nowhere for the past few years.
Some people still doubted that Intel would introduce FMA with Haswell, because Sandy Bridge had already doubled the ALU width.
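For reference, the two variants compute exactly the same thing; only the encoding and operand count differ. A quick C sketch using the usual compiler intrinsics (GCC/Clang names; FMA3 needs -mfma, FMA4 needs -mfma4 and only ever shipped on AMD):

#include <immintrin.h>   /* FMA3: _mm256_fmadd_ps */
#include <x86intrin.h>   /* FMA4: _mm256_macc_ps  */

/* FMA3 (Intel Haswell, later AMD too): 3-operand, destructive --
 * at the instruction level the destination overwrites one source. */
__m256 fma3(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c);   /* a * b + c */
}

/* FMA4 (AMD Bulldozer): 4-operand, non-destructive -- the result
 * goes to a separate destination register. */
__m256 fma4(__m256 a, __m256 b, __m256 c)
{
    return _mm256_macc_ps(a, b, c);    /* a * b + c */
}

From source code you'd never notice the difference; the intrinsics hide the destructive destination, and the compiler inserts a register copy when a source needs to survive.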
For conventional GPUs it does not matter, as you can stream with higher bandwidth directly from memory, so a huge LLC is basically wasted. But look at Intel's iGPU in Sandy Bridge! It already shares the L3. Why do you think it will be any different in future versions of it (or AMD's)?
RAM bandwidth scales closely with compute power, for discrete cards as well as IGPs. However, developers won't create GPGPU applications for something as weak as an IGP, regardless of whether it has access to an L3 cache. In other words, even though a large L3 cache can have a profound effect on the performance of a CPU, you don't fix all of a GPU's problems by throwing an L3 cache at it.
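To put rough numbers on why a cache doesn't change that (illustrative figures of my own, not any particular chip):

#include <stdio.h>

/* Back-of-envelope machine balance with made-up but plausible numbers:
 * a kernel is bandwidth-bound whenever its arithmetic intensity
 * (FLOPs per byte of memory traffic) falls below the machine balance. */
int main(void)
{
    double peak_gflops = 1000.0;  /* hypothetical GPU compute, GFLOPS   */
    double peak_gbs    = 150.0;   /* hypothetical memory bandwidth, GB/s */
    double balance = peak_gflops / peak_gbs;          /* ~6.7 FLOP/byte  */

    /* SAXPY, y[i] = a*x[i] + y[i]: 2 FLOPs against 12 bytes moved. */
    double saxpy = 2.0 / 12.0;                        /* ~0.17 FLOP/byte */

    printf("machine balance: %.1f FLOP/byte\n", balance);
    printf("SAXPY intensity: %.2f FLOP/byte\n", saxpy);
    /* 0.17 << 6.7: a streaming kernel is bandwidth-bound no matter how
     * large the LLC is; a cache only pays off when data is reused. */
    return 0;
}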
And how would compilation benefit from wide vector units in that case?
Again, it doesn't. And it doesn't have to. The CPU is already great at tasks of this complexity. It's the GPU that has a long way to go to become any better at ILP and TLP.
Yes, in the same 65nm process the core area (i.e. just the core, including L1) grew from 19.7mm² to 31.5mm², and the complete die was 57% larger. Doesn't look like a doubling of compute density to me.
Aside from supporting x64, Core 2 also widened from a 3-wide to a 4-wide issue architecture and implemented many other features. Unfortunately these major changes make it impossible to assess the isolated cost of widening the SSE paths to 128-bit.
It turns out a much better comparison is Brisbane versus Barcelona. Together with a slew of other changes, which probably don't take a lot of space each, the widening of the SSE path made the core grow by only 23%. That's only 8% of Barcelona's entire die. So 5% for doubling the throughput probably isn't a bad approximation.
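A trivial sanity check on those figures (my own arithmetic, using only the numbers quoted above):

#include <stdio.h>

int main(void)
{
    /* Core -> Core 2, same 65nm node: core area including L1. */
    printf("core growth: %.0f%%\n", (31.5 / 19.7 - 1.0) * 100.0); /* ~60% */

    /* Brisbane -> Barcelona: a 23% core growth that amounts to 8% of
     * the die implies the cores occupy roughly 8/23, i.e. about a
     * third, of the total die area. */
    printf("core share of die: ~%.0f%%\n", 8.0 / 23.0 * 100.0);   /* ~35% */
    return 0;
}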
Doubling it again obviously costs more in absolute terms, but the rest of the core has grown and become more powerful as well. Sandy Bridge already widened part of the execution paths to 256-bit. Suffice it to say that implementing AVX2 in Haswell will be relatively cheap, and we can consider it to deliver twice the throughput at negligible cost.
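To make "twice the throughput" concrete: AVX2 takes the integer SIMD paths from 128-bit to 256-bit, so the same instruction handles twice the elements (GCC/Clang intrinsics; the second function needs -mavx2):

#include <immintrin.h>

/* Pre-Haswell: integer SIMD tops out at 128 bits (SSE2/AVX). */
__m128i add_4x32(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);      /* 4 x 32-bit adds */
}

/* Haswell AVX2: the same operation at 256 bits on the widened paths. */
__m256i add_8x32(__m256i a, __m256i b)
{
    return _mm256_add_epi32(a, b);   /* 8 x 32-bit adds */
}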
That's absolutely not the case for GPUs, unless they start trading fixed-function hardware for programmable cores...