You cannot just rip out a bunch of logic from the FPUs and achieve anything. You'd really need to redesign the FPUs from scratch and re-layout the execution resources to save power and area.
I thought it would be somewhat simpler because the FPUs appear to be running along the edges of each core, leaving the other hardware in the center.
It's not like Lego blocks where you can magically disconnect them, or even like a multi-core where you just lop off one core.
A large part of the core is agnostic to the capability of the FPUs, and Nvidia has a history of maintaining both DP and SP pipeline designs.
I did not mean to imply one can wave a wand over the DP pipeline and it magically becomes SP, just that this is but one component of the core and that the rest is very much unaware of FP capability.
Anything that is not an ALU, decoder, or scheduler will not care (and the latter two can, in naive implementations, almost not care), and it looks like a fair amount of the rest of the core is kept isolated from the parts that do.
To do DP you need more bits for your operands (mantissa, exponent, etc.) and you have to store and play with them somewhere. That will be very close to the logic that does single precision (which is just a smaller mantissa and exponent).
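To make that concrete, here is a small C sketch (nothing vendor-specific, just the standard IEEE 754 layouts) that unpacks the sign/exponent/mantissa fields of a float and a double. The two formats differ only in field widths, 8/23 bits of exponent/mantissa for SP versus 11/52 for DP, which is why the SP datapath is structurally a narrower version of the DP one.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* IEEE 754 field widths -- the formats differ only in how many bits
 * the exponent and mantissa fields get:
 *   single:  1 sign |  8 exponent | 23 mantissa
 *   double:  1 sign | 11 exponent | 52 mantissa
 */
int main(void) {
    float  f = 1.5f;
    double d = 1.5;
    uint32_t fb;
    uint64_t db;
    memcpy(&fb, &f, sizeof fb);   /* reinterpret the raw bits */
    memcpy(&db, &d, sizeof db);

    printf("SP: sign=%u exp=%u  mant=0x%06x\n",
           (unsigned)(fb >> 31),
           (unsigned)((fb >> 23) & 0xFF),
           (unsigned)(fb & 0x7FFFFF));
    printf("DP: sign=%u exp=%u mant=0x%013llx\n",
           (unsigned)(db >> 63),
           (unsigned)((db >> 52) & 0x7FF),
           (unsigned long long)(db & 0xFFFFFFFFFFFFFULL));
    return 0;
}
```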
Nvidia has undertaken the expense of designing both a double precision and single precision floating point pipeline before.
As you've said, one is an incremental evolution over the other.
Taking what amounts to a decrement for consumer hardware could have been planned for and budgeted into the development effort, and there is potential for a significant amount of reuse.
Since Nvidia has gone and applied full IEEE compliance to both SP and DP, there doesn't appear to be anything disjoint enough to make a core with just the FPUs replaced impractical.
Frankly, at that point you need to significantly redesign almost the whole thing: the scheduler would be somewhat different, the dispatch as well, and so on.
Changing those pieces seems handy, but I'd be curious why modifying the scheduler or dispatch would be either prohibitive or strictly necessary.
The current DP-capable issue hardware is fully capable of not running DP instructions, and if the SP-only hardware is not modified to change any of the SP instruction latencies, what would the scheduler notice?
Not that a front end that dispensed with the extra instructions entirely wouldn't probably be smaller.
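To illustrate the point, here is a toy model in C, not a claim about Nvidia's actual issue logic: if the scheduler's dependency checks are driven purely by a per-opcode latency table, deleting the DP rows changes nothing about how SP instructions get scheduled. The opcodes and latencies below are invented for the sketch.

```c
#include <stdio.h>

/* Hypothetical opcodes; the DP entry simply would not exist on an
 * SP-only part. */
enum opcode { OP_FADD_SP, OP_FMUL_SP, OP_FADD_DP };

/* Made-up latencies in cycles, keyed by opcode. */
static const int latency[] = {
    [OP_FADD_SP] = 4,
    [OP_FMUL_SP] = 4,
    [OP_FADD_DP] = 8,
};

/* A dependent instruction may issue once its producer has cleared its
 * latency.  Nothing in this check depends on whether DP opcodes exist:
 * removing the DP rows leaves every SP scheduling decision identical. */
static int can_issue(enum opcode producer, int cycles_since_issue) {
    return cycles_since_issue >= latency[producer];
}

int main(void) {
    printf("SP add result usable after 4 cycles? %s\n",
           can_issue(OP_FADD_SP, 4) ? "yes" : "no");
    return 0;
}
```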
Taking the hardware a step down, within the same constraints the heftier and more complex multi-precision DP units already satisfy, would seem to impose a lesser burden.
You really cannot easily remove DP, any more than you could easily remove x87 from a CPU.
That's pulling in concerns outside of the engineering difficulty.
An x86 host processor that suddenly blows up on code that ran on older chips is worse than useless.
What a slave chip does behind a driver layer that obscures all or part of the internals is much less constrained (or let's hope this is the case for Larrabee).
It's pretty deeply integrated in there, and you'd need to redo your layout anyway to take advantage.
The layout of the core would possibly change, or at least some of it would need to change.
Given that this is a fraction of one component of the whole chip, and that a reduced mass-market variant would likely need global layout changes anyway, is this necessarily prohibitive or unthinkable?