Another thing to consider is that x86 cores have fairly lousy floating-point performance because they're focused on correctness, IEEE, double precision, the works.
(I assume we're not talking about the stack-based x86 FPU, which is broken for obvious reasons.)
The x86 FP performance is worse than some other processors (especially Itanium and Power5), but this has nothing to do with correctness, DP, or IEEE compliance: those other processors have to be compliant as well. Which makes me assume that you're comparing against non-traditional processors...
Then, of course, CELL comes to mind. But the trend is against you here. I can't think of a single FPU that hasn't evolved, over time, from non-compliant single precision FP to compliant double precision FP. There are good reasons for that:
- Double precision is a very important checkmark in the scientific and engineering community. Correct rounding even more so: they simply laugh you away if you don't have it (see the small SP-vs-DP sketch after this list). I don't expect CELL as it is to be deployed a lot in the S & E community for exactly this reason.
- The incremental cost of double precision and rounding is relatively small. And shrinking. Clearly the multipliers and adders are a solved problem. Additional cost close to zero. The rounding is fairly complicated but not excessive.
- Register files need to double in size, but that's also a fairly small area hit.
- These days, with interconnect now the major delay factor, the rounding logic is typically not in the main critical path of the overall pipeline anymore.
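To put a number on the "laugh you away" point, here's a minimal, purely illustrative C sketch (my own, not from any benchmark) that sums the same series in SP and DP. The SP result drifts visibly; that's exactly the kind of thing the S & E crowd won't accept.

#include <stdio.h>

int main(void)
{
    float  sum_sp = 0.0f;
    double sum_dp = 0.0;
    for (int i = 0; i < 10000000; i++) {
        sum_sp += 0.1f;   /* 0.1 has no exact binary representation; error piles up */
        sum_dp += 0.1;
    }
    printf("single precision: %f\n", sum_sp);   /* visibly off from 1000000 */
    printf("double precision: %f\n", sum_dp);   /* off only far behind the decimal point */
    return 0;
}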
The other logic stays largely the same. There's no real need to beef up the memory system: sure, your FLOPS will be lower in DP mode than in SP when going to external memory (see the back-of-the-envelope sketch below), but that's the programmer's choice, just the way it currently is for a CPU.
The other GPU-specific engines that aren't used during GPGPU mode can safely remain SP.
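A back-of-the-envelope sketch of that memory point, with a made-up 50 GB/s bandwidth figure and an assumed one flop per operand streamed from memory: the halving comes purely from the operand size, not from the FPU.

#include <stdio.h>

int main(void)
{
    double bw = 50e9;                /* assumed external bandwidth in bytes/s (made-up figure) */
    double flops_per_operand = 1.0;  /* assumed: one flop per operand fetched from memory */

    /* bandwidth-limited throughput: operands are 4 bytes in SP, 8 bytes in DP */
    printf("bandwidth-limited SP: %.2f GFLOPS\n", bw / 4.0 * flops_per_operand / 1e9);
    printf("bandwidth-limited DP: %.2f GFLOPS\n", bw / 8.0 * flops_per_operand / 1e9);
    return 0;
}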
So assuming you're still talking about GPUs, I don't believe for a second that the overall area impact will be double.
Surely it can't be long before Intel decides it's time to put in some single-precision pipelines and steal back some of the glory that Cell's taken?
Don't count on it. It's not going to happen. For a pure processor, it'd be commercial suicide.
At the same time, ATI knows it is practically impossible to compete on Intel's home turf: double precision. That complexity may never come to GPUs; it's utter overkill and immensely costly in terms of transistors (well over 2x, I think).
If double precision never comes to the GPU, it will be because the grand vision of the GPU as a massive parallel general purpose calculation engine didn't materialize.
If, however, that vision does come true and becomes a feature that can make good money, expect fully compliant double precision to arrive quickly. DP is a solved problem. It's not complicated (and, with a standard that's been frozen for years, not getting more complicated), it's not overkill, and it's not very costly.