I thought it would be somewhat simpler because the FPUs appear to be running along the edges of each core, leaving the other hardware in the center.
I can't really tell from the die photo; perhaps someone with a better knack for reading them can.
A large part of the core is agnostic to the capability of the FPUs, and Nvidia has a history of maintaining both DP and SP pipeline designs.
So what happens when you try to run some code that uses DP on such a design?
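If I remember the toolchain behaviour correctly, nvcc already answers this for parts that have no DP hardware at all: compile for such a target and it demotes double to float with a warning, while a DP-capable target gets real 64-bit FP instructions. A trivial example (the kernel is my own, purely for illustration):

// Minimal CUDA kernel using double precision. Compiled for a target without
// DP units (e.g. nvcc -arch=sm_11), the compiler demotes double to float and
// warns; compiled for a DP-capable target (sm_13) it emits 64-bit FP ops.
__global__ void axpy(double a, const double *x, double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

Whether a hypothetical SP-only derivative of this design would take the demotion route or something else is exactly the open question.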
I did not mean to imply that one can wave a wand over the DP pipeline and have it magically become SP-only, just that it is one component of the core and that the rest is largely unaware of FP capability.
Anything that isn't an ALU, decoder, or scheduler won't care (and in a naive implementation even the latter two barely care), and it looks like a fair amount of the rest of the core is kept isolated from the parts that do.
Register files care as well, since they have to allocate and access 64-bit registers differently.
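You can see a hint of this even at the ISA level: in the PTX nvcc generates, a double lives in a 64-bit register class, and as far as I know the hardware carries it as a pair of 32-bit registers, which constrains allocation. A trivial illustration (compile each with nvcc -ptx and compare):

// Two otherwise identical kernels: the float version uses .f32 ops on 32-bit
// %f registers in the PTX, while the double version uses .f64 ops on 64-bit
// %fd registers, which (as far as I know) map to pairs of 32-bit hardware regs.
__global__ void scale_sp(float  *v, float  k) { v[threadIdx.x] *= k; }
__global__ void scale_dp(double *v, double k) { v[threadIdx.x] *= k; }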
I guess my point is that a totally modular design would be really inefficient in many regards. GT200 was modular, but that's because its DP implementation was a total kludge and *inefficient*.
It's true that large portions of the SM don't care - caches, TMUs, etc. - but without knowing how the DP was designed, it's hard to say anything intelligent about how easy or hard it would be to remove, or what the engineering effort, die area and power implications are...
Nvidia has undertaken the expense of designing both a double precision and single precision floating point pipeline before.
As you've said, one is an incremental evolution over the other.
Taking what amounts to a decrement for consumer hardware could have been planned for and budgeted into the development effort, and there is potential for a significant amount of reuse.
Since Nvidia has applied full IEEE compliance to both SP and DP, there doesn't appear to be anything disjoint enough to make a core with just the FPUs swapped out impractical.
You certainly can do an SP-only core. But I'm saying that you would have to redo the physical design to really compact things. DP is not just an add-on; it's deeply entangled with SP for good performance.
Changing this seems handy, but I'd be curious why modifying the scheduler or dispatch would be either prohibitive or strictly necessary.
The current DP-capable issue hardware is fully capable of not running DP instructions, and if the SP-only hardware is not modified to change any of the SP instruction latencies, what would the scheduler notice?
Granted, a front end that dispensed with the extra instructions entirely would probably be smaller.
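To put the intuition in toy form: if the scheduler amounts to a scoreboard driven by a per-opcode latency table, an SP-only instruction stream exercises exactly the same paths whether or not DP entries exist in that table. A deliberately simplistic host-side sketch (plain C++, my own made-up opcodes and latencies, and obviously nothing to do with Nvidia's actual scheduler):

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Toy in-order issue model: each opcode has a result latency, and an
// instruction stalls until its (single) source operand is ready.
struct Insn { std::string op; int dst, src; };

static int run(const std::vector<Insn>& prog,
               const std::map<std::string, int>& latency)
{
    std::map<int, int> ready_at;   // register -> cycle its value becomes ready
    int cycle = 0;
    for (const Insn& in : prog) {
        auto it = ready_at.find(in.src);
        if (it != ready_at.end() && cycle < it->second)
            cycle = it->second;                        // stall on a RAW hazard
        ready_at[in.dst] = cycle + latency.at(in.op);  // book the result
        ++cycle;                                       // one issue per cycle
    }
    return cycle;
}

int main()
{
    // An SP-only program: it never names a DP opcode.
    std::vector<Insn> prog = { {"fadd32", 1, 0}, {"fmul32", 2, 1}, {"fadd32", 3, 2} };

    std::map<std::string, int> sp_only = { {"fadd32", 4}, {"fmul32", 4} };
    std::map<std::string, int> with_dp = sp_only;
    with_dp["fadd64"] = 16;   // extra entries the SP stream never touches

    // Same issue timing either way: the "scheduler" notices nothing.
    std::printf("%d %d\n", run(prog, sp_only), run(prog, with_dp));
    return 0;
}

Of course the real question is how far the real hardware is from that idealized picture.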
Scheduling DP is kind of complicated since it requires both pipelines. You could just use the same scheduler and turn off DP, but then you're wasting area...
It all depends on the implementation - is it one scheduler or two, how do they communicate, and so on.
That's pulling in concerns outside of the engineering difficulty.
An x86 host processor that suddenly blows up on code that ran on older chips is worse than useless.
What a slave chip does behind a driver layer that obscures all or part of the internals is much less constrained (or let's hope that's the case for Larrabee).
So do you think they will do DP in a library as part of the driver?
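Meaning something like the classic double-single (float-float) trick, where a "double" is carried as an unevaluated sum of two floats and everything is computed in SP? A rough sketch of the addition path (my own helper names, deliberately simplified - a real library would also handle products, specials and rounding modes):

// Double-single emulation sketch: a value is carried as hi + lo, two floats.
// Addition uses Knuth's two-sum to recover the SP rounding error exactly;
// everything here is plain single-precision arithmetic.
// NOTE: relies on strict IEEE SP adds, i.e. no reassociation by the compiler.
struct dsfloat { float hi, lo; };

__device__ dsfloat ds_add(dsfloat a, dsfloat b)
{
    // two-sum of the high parts
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);

    // fold in the low-order parts, then renormalize
    e += a.lo + b.lo;
    dsfloat r;
    r.hi = s + e;
    r.lo = e - (r.hi - s);
    return r;
}

__global__ void accumulate(const float *x, dsfloat *sum, int n)
{
    // Naive single-thread accumulation, just to show the type in use.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        dsfloat acc = {0.0f, 0.0f};
        for (int i = 0; i < n; ++i) {
            dsfloat xi = {x[i], 0.0f};
            acc = ds_add(acc, xi);
        }
        *sum = acc;
    }
}

It gets you extended precision, not full IEEE double, and at a hefty instruction-count cost, so I'm curious whether you think that would be acceptable behind a driver.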
The layout of the core would possibly change, or at least some of it would need to change.
Given that this is a fraction of one component of a chip that would likely need global layout changes anyway for a reduced mass-market variant, is this necessarily prohibitive or unthinkable?
No, my point is that it requires a global layout change, and that doesn't happen overnight!
DK