At this stage it's just a theory, but I can't help thinking that they did not go for dedicated DP units in Maxwell. I'd love to stand corrected, but revamping clusters with smaller and more efficient datapaths and theoretically going for hybrid units (which burns more power) doesn't sound like it's enough to reach twice the perf/W.
I agree -- that's where I would place my bets as well. The numbers don't work favorably for dp anything. If we assume 40W for the GPU, we wind up with ~10W for dp, and for the alu section, a LOT less. Full-rate dp is 20pJ/op, which would be ~13W. Area-wise, we'd be looking at ~64mm^2 for 640 units, which also seems too large. From a business perspective, this is a scaled-up mobile design, and dp alus in mobile are pointless. I don't see room for full-rate dp at all.
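For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. The 20pJ/op, 640 units and 64mm^2 figures are the ones quoted above; the ~1GHz clock is my own assumption (it is not stated above, it is simply the clock at which 20pJ/op across 640 units lands near 13W), and the function names are made up for illustration.

```python
# Back-of-envelope check of the full-rate dp power and area figures.
# ASSUMPTION: a ~1 GHz shader clock; the post does not state a clock.

def dp_power_watts(units, pj_per_op, clock_hz):
    """Power if every unit retires one dp op per clock."""
    joules_per_op = pj_per_op * 1e-12
    return units * joules_per_op * clock_hz

def area_per_unit_mm2(total_area_mm2, units):
    """Implied area of a single dp unit."""
    return total_area_mm2 / units

UNITS = 640             # from the post
PJ_PER_OP = 20.0        # from the post
TOTAL_AREA_MM2 = 64.0   # from the post
CLOCK_HZ = 1.0e9        # assumed, not from the post

power = dp_power_watts(UNITS, PJ_PER_OP, CLOCK_HZ)
area = area_per_unit_mm2(TOTAL_AREA_MM2, UNITS)

print(f"full-rate dp power: {power:.1f} W")
print(f"area per dp unit:   {area:.2f} mm^2")
```

Under that clock assumption the power comes out at about 12.8W, in line with the ~13W figure, and the area works out to roughly 0.1mm^2 per unit. A different clock scales the power number linearly.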
In theory, quarter-rate dp mul is pretty cheap on top of sp mads, but it's a pain to optimize for power using that kind of design for HPC (hence the dedicated units in Kepler, presumably). NV's two optimization problems are at the extremes, and they have two different market needs -- one would expect the designs in the middle to borrow from one side or the other rather than have their own market-specific optimizations. Partial-rate dp seems unlikely from that perspective.
One is left to wonder what the marketing guy was smoking when they used a GK110 block diagram instead of a GK107 one. The Anand article is interesting as well, because I don't see how HPC is really a scaled mobile design given the different market needs. There's a tension between optimizing for mobile and HPC on the one hand, and reusing work across the product line on the other. Given they're stuck on 28nm for a while, I'd be surprised if they pushed aggressively on the reuse side at this point. Pleasantly surprised, but surprised nonetheless.