Although, now thinking about it in the context of this whole "real threads" .. umm .. discussion, I wonder if DWF isn't breathing life into the quaint and aging assumption that everyone and their brother is always running the same instruction. Seems like the cost of DWF and MIMD might be similar (the difference between finding 32 runnable "threads" and finding 32 runnable "threads" all at the same PC is ..?), and we'd get more benefit from MIMD, even if we do need larger instruction caches?
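To put that parenthetical in concrete terms, here's a toy software sketch of the two selection problems (purely illustrative, under the assumption that each thread context exposes just a ready flag and a PC; none of the names correspond to real hardware). The MIMD-style pick only needs any 32 ready threads, while the DWF-style pick additionally buckets ready threads by PC and compacts the fullest bucket into a warp.

```python
# Toy model of the two scheduler selection problems. Illustrative only.
from collections import defaultdict

WARP_WIDTH = 32

class ThreadCtx:
    def __init__(self, tid, pc, ready):
        self.tid = tid      # thread id
        self.pc = pc        # current program counter
        self.ready = ready  # operands available / not stalled

def pick_mimd(pool):
    """MIMD-style: any 32 ready threads will do, regardless of PC."""
    ready = [t for t in pool if t.ready]
    return ready[:WARP_WIDTH]

def pick_dwf(pool):
    """DWF-style: bucket ready threads by PC, issue from the fullest bucket."""
    buckets = defaultdict(list)
    for t in pool:
        if t.ready:
            buckets[t.pc].append(t)
    if not buckets:
        return []
    best_pc = max(buckets, key=lambda pc: len(buckets[pc]))
    return buckets[best_pc][:WARP_WIDTH]
```

The extra PC-matching and compaction step is roughly what a DWF scheduler has to pay for on top of plain readiness tracking; full MIMD trades that step for replicated front ends.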
Full MIMD would mean larger instruction caches and a multiplication of the resources at the front end: a physical SIMD of width 16 split into 16 MIMD units would need 16x the decoders, issue ports, and scheduling.
It's not necessarily 16x the hardware, since each of those units could be simpler than the single wide SIMD unit they replace.
Regardless, Fermi is already plenty big.
The primary argument for DWF was that Nvidia's scheduling and register hardware was already oddly complex for what it was doing, and that DWF would be an incremental addition that could yield throughput decently close to what MIMD could offer for the targeted workloads.
This came up in the old G300 speculation thread. It's quite a trip down memory lane to go back there.
A lot of what Fermi turned out to be matched the grumblings at the time, and the apparent die size bore out some of the fears.