I've been doing some non-trivial CUDA programming on GK110 lately, and I have to say, Fermi is a very efficient design: it is much easier to reach optimal efficiency on Fermi than on GK110.
On GK110, instruction-level parallelism (ILP) has to be at a pretty good level to get maximum efficiency, and that's not an easy task for non-trivial applications.
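A minimal sketch of what "raising ILP" can look like in practice, assuming a simple sum reduction as the workload. The kernel name `sum_ilp4`, the helper `gpu_sum`, and the launch configuration (64 blocks of 128 threads) are all made up for illustration and are not tuned for GK110. The idea is only that each thread keeps four independent accumulators, so the hardware has four independent add chains to issue from instead of one dependent chain.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread sums a grid-strided slice of `in`.
// Four independent accumulators expose four independent add chains
// per thread (more ILP) instead of a single dependent chain.
__global__ void sum_ilp4(const float* in, float* out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;

    int i = tid;
    // Main loop: four independent loads and adds per iteration.
    for (; i + 3 * stride < n; i += 4 * stride) {
        a0 += in[i];
        a1 += in[i + stride];
        a2 += in[i + 2 * stride];
        a3 += in[i + 3 * stride];
    }
    // Tail: leftover elements that belong to this thread.
    for (; i < n; i += stride) {
        a0 += in[i];
    }
    out[tid] = (a0 + a1) + (a2 + a3);
}

// Hypothetical host helper: sums h_in[0..n) on the GPU, then adds up
// the per-thread partial sums on the host.
double gpu_sum(const float* h_in, int n)
{
    const int threads = 128, blocks = 64;   // illustrative, not tuned
    const int total   = threads * blocks;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sum_ilp4<<<blocks, threads>>>(d_in, d_out, n);

    float* h_out = (float*)malloc(total * sizeof(float));
    // This blocking copy on the default stream waits for the kernel.
    cudaError_t err = cudaMemcpy(h_out, d_out, total * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(1);
    }

    double sum = 0.0;
    for (int k = 0; k < total; ++k) sum += h_out[k];

    free(h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return sum;
}

int main()
{
    const int n = 1000003;  // deliberately not a multiple of the thread count
    float* h_in = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    double s = gpu_sum(h_in, n);
    printf("sum = %.1f (expected %d)\n", s, n);

    free(h_in);
    return 0;
}
```

Whether this kind of manual unrolling actually pays off depends on the kernel, the register budget, and what the compiler already does on its own, so it is something to measure rather than assume.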
But the reward sometimes justifies the effort: after careful tuning, CUDA code on GK110 can achieve a very significant speed-up compared to running it on a CPU or even MIC, etc.
It really reminds me of the good old days when people programmed in machine code or assembly. Kepler is really a rough, low-level system: it gives you so much more room to achieve maximum efficiency, or to mess things up.