Bandwidth costs power.
What's notable about the Fermi architecture is that, supposedly, a variety of dedicated local buffers have been removed in favour of buffering through the L1/L2 cache hierarchy. This generalises the producer-consumer relationships of a pipelined, parallelising throughput processor, and that generalisation allows greater flexibility across workloads. For example, it should lead to higher minimum frame rates, since buffers shouldn't run out of space and halt the producer that feeds into them. But it also means that some kinds of data travel further in this architecture than they did in prior GPUs.
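As a rough illustration of the buffering point, here's a toy model in C of a producer that stalls when a small dedicated FIFO fills up during a consumer hiccup, versus the same traffic going through a much larger, cache-backed pool. The sizes and rates are invented for illustration; they're not Fermi's numbers.

/* Toy model of one pipeline stage feeding another, loosely analogous to the
 * dedicated-FIFO vs cache-backed buffering trade-off above. All sizes and
 * rates are invented for illustration; they're not Fermi's numbers. */
#include <stdio.h>

static int simulate(int capacity, int cycles) {
    int occupancy = 0, stalls = 0;
    for (int t = 0; t < cycles; t++) {
        /* Consumer drains in bursts: 16 items every 16th cycle (avg 1/cycle). */
        if (t % 16 == 0) {
            occupancy -= 16;
            if (occupancy < 0) occupancy = 0;
        }
        /* Producer tries to emit 1 item per cycle; a full buffer halts it. */
        if (occupancy < capacity)
            occupancy++;
        else
            stalls++;
    }
    return stalls;
}

int main(void) {
    printf("small dedicated FIFO (8 entries): %d producer stalls in 1000 cycles\n",
           simulate(8, 1000));
    printf("large cache-backed pool (4096):   %d producer stalls in 1000 cycles\n",
           simulate(4096, 1000));
    return 0;
}

With matched average rates, the small FIFO stalls the producer for roughly half the cycles in this toy while the large buffer never does - which is the minimum-frame-rate argument in miniature.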
Parallel setup across the 4 GPCs in GF100 is another source of power overhead, because two-way synchronisation of the workload is needed for setup to function in a distributed fashion.
Overall, though, NVidia's architecture is not particularly power-efficient, even if the problems of older chips (reliance upon a surfeit of TMUs and ROPs that never delivered their rated throughputs) have been tackled quite effectively. Fermi's TMUs are doing very well as far as gaming is concerned, even if their absolute throughput isn't exciting.
For example, the memory controllers are placed at the centre of the die, as far as possible from the I/O pads that communicate with memory, whereas in ATI's chips the memory controllers sit as close as possible to the pads (admittedly we've only seen RV770's die). It seems to me NVidia does this because the ALUs, which consume a lot of power, need to be spread out across as much area as possible to minimise localised heating problems. The very high data rates that GDDR5 pins run at mean the long distance from the MCs to the pads must use up a fair amount of power.
This layout problem also exists in G80 and GT200.
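A back-of-envelope in C suggests the scale involved. Every constant below (wire capacitance, signalling swing, toggle rate, extra routing length) is my own guess for illustration, not anything measured on GF100; only the 384-bit bus width is real.

/* Back-of-envelope for the MC-to-pad routing cost, using P = a * C * V^2 * f
 * per wire. Every constant here is an assumption for illustration - wire
 * capacitance, swing, toggle rate, routing length - not a GF100 measurement. */
#include <stdio.h>

int main(void) {
    double c_per_mm  = 0.2e-12; /* assumed on-die wire capacitance, F per mm  */
    double voltage   = 1.0;     /* assumed internal signalling swing, V       */
    double freq      = 2.0e9;   /* assumed toggle rate for ~4 Gbps/pin data   */
    double activity  = 0.5;     /* assumed signal activity factor             */
    double bus_width = 384;     /* GF100's external memory bus width, bits    */
    double extra_mm  = 8.0;     /* assumed extra MC-to-pad distance vs ATI    */

    double watts_per_wire_mm = activity * c_per_mm * voltage * voltage * freq;
    double total = watts_per_wire_mm * bus_width * extra_mm;
    printf("~%.2f mW per wire per mm, ~%.2f W extra across the bus\n",
           watts_per_wire_mm * 1e3, total);
    return 0;
}

Even with those invented figures it comes out at over half a watt just for the extra routing, and it scales directly with distance and data rate, so the centre-of-die placement isn't free.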
Scalar ALU processing with fine-grained scheduling of individual instructions, a feature since G80, also gives NVidia relatively high power overheads. ATI's VLIW approach saves power and area, but at the cost of increased compilation complexity - though not all of ATI's compilation problems are centred on VLIW usage.
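To make the utilisation side of that trade-off concrete, here's a toy VLIW packer in C - deliberately naive, in-order and greedy over a made-up 5-slot bundle in the spirit of VLIW5, not ATI's actual compiler - showing how a dependent chain of operations leaves issue slots empty while independent operations pack densely.

/* Toy VLIW bundle packer vs scalar issue. Deliberately naive: in-order,
 * greedy, a made-up 5-slot bundle in the spirit of VLIW5 - this is not
 * ATI's compiler, just an illustration of slot utilisation.            */
#include <stdio.h>

#define NOPS  8
#define WIDTH 5

int main(void) {
    /* dep[i] = index of an earlier op that op i reads, or -1 if independent.
     * Ops 0,4,5,6,7 form a dependent chain; ops 1,2,3 are independent.      */
    int dep[NOPS] = { -1, -1, -1, -1, 0, 4, 5, 6 };

    int bundle_of[NOPS];
    int bundles = 0, slots_used = 0, i = 0;

    while (i < NOPS) {
        int slots = 0;
        /* Fill one bundle with ops whose inputs come from earlier bundles. */
        while (i < NOPS && slots < WIDTH &&
               (dep[i] < 0 || bundle_of[dep[i]] < bundles)) {
            bundle_of[i] = bundles;
            slots++;
            i++;
        }
        slots_used += slots;
        bundles++;
    }

    printf("scalar issue: %d cycles (one op per cycle, no empty slots)\n", NOPS);
    printf("VLIW-5 issue: %d bundles, %.0f%% of slots filled\n",
           bundles, 100.0 * slots_used / (bundles * WIDTH));
    return 0;
}

Here eight ops, five of them in a dependent chain, still take five bundles with only 32% of the slots filled; a scalar machine just issues one op per cycle and never wastes issue width, but pays for that flexibility in scheduling hardware and power.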
So attributing GF100's problems solely to the Fermi architecture is tricky. And the 40nm fuck-up plays a part too, handicapping bigger chips, simply because the process is naturally leaky and bigger chips and higher temperatures act as positive feedback on that leakage.
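That feedback loop is easy to see in a toy fixed-point model. All the constants below are invented for illustration - they are not 40nm process data.

/* Toy fixed-point model of the leakage/temperature feedback. The constants
 * are invented for illustration (not 40nm data): leakage is assumed to
 * double for every 40 degC above 50 degC, and die temperature to rise
 * linearly with total power.                                              */
#include <stdio.h>
#include <math.h>

static void settle(const char *name, double p_dyn, double leak_at_50c) {
    double r_th = 0.25;  /* assumed thermal resistance, degC per W */
    double temp = 50.0;
    double p_leak = leak_at_50c;
    for (int i = 0; i < 100; i++) {
        p_leak = leak_at_50c * pow(2.0, (temp - 50.0) / 40.0);
        temp   = 30.0 + r_th * (p_dyn + p_leak);  /* 30 degC coolant */
    }
    printf("%s: settles near %.0f degC with %.0f W of leakage (%.0f W nominal)\n",
           name, temp, p_leak, leak_at_50c);
}

int main(void) {
    /* Assumed dynamic power and nominal (50 degC) leakage for two die sizes. */
    settle("smaller chip", 100.0, 15.0);
    settle("bigger chip", 180.0, 40.0);
    return 0;
}

With these made-up numbers the smaller chip's leakage barely moves off its nominal value, while the bigger, hotter chip ends up leaking more than twice what it would at 50 degC - the shape of the problem, if not the real magnitudes.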
Though it seems to me NVidia dug its own grave on two key points: leaving its first GDDR5 implementation on any chip until too late, and targeting a meaningless performance level when a notably smaller chip (GF104) would get to within 20-30% of it.
But then, for some reason, NVidia has been failing to execute for years now - in that environment GF100 looks kinda doomed anyway.
Jawed