Good question OP.
I'd recommend reading my Fermi article (http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932) and thinking about which features are useful for graphics and which are not.
Generally speaking, NV made a decision to invest resources in what they consider to be general purpose workloads (i.e. HPC). In some cases, those resources also help graphics (e.g. caches); in others, they are totally useless for graphics (e.g. DP, double precision floating point).
ATI generally focuses only on graphics and has not put nearly the area/power into HPC workloads and programmability...because they want to focus on gaining market share in graphics.
As part of that, it is easier to achieve high utilization of resources on NV's architecture. In essence, NV spends area/power on control logic and features which do not improve raw throughput, but instead improve average utilization. In some sense, NV made a more 'robust' and 'flexible' microarchitecture.
ATI instead decided to focus on a microarchitecture which matched very well with graphics workloads (i.e. vec4 works fine), but was not very 'robust' or 'flexible', and instead had extremely good peak performance.
Consequently, ATI and NV both have good utilization for graphics due to the explicitly parallel nature of the workload.
However, the utilization falls off substantially for ATI on non-graphics workloads (not to mention, programming is a pain). So their performance there will generally be inferior to NV's.
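To make the utilization point concrete, here is a toy sketch in Python. It is my own illustration, not a model of any real ATI or NV hardware: the 4-wide issue width, the greedy packing, and the names `pack_bundles`, `utilization`, `vec4_like` and `scalar_chain` are all assumptions made up for the example. It packs independent operations into fixed-width bundles and reports what fraction of the issue slots get filled.

```python
def pack_bundles(deps, width=4):
    """Greedy packing: each cycle issues up to `width` ops whose inputs are done.

    `deps[i]` lists the ops that op `i` depends on (must be acyclic).
    """
    done = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles


def utilization(deps, width=4):
    """Fraction of issue slots that hold useful work."""
    bundles = pack_bundles(deps, width)
    return len(deps) / (len(bundles) * width)


# Graphics-like: two vec4 steps, each lane depends only on its own lane.
vec4_like = [[], [], [], [], [0], [1], [2], [3]]

# Non-graphics-like: a scalar chain where every op needs the previous result.
scalar_chain = [[]] + [[i] for i in range(7)]

print(utilization(vec4_like))     # 1.0
print(utilization(scalar_chain))  # 0.25
```

With these made-up inputs the vec4-style work fills every slot, while the dependent scalar chain fills one slot in four. Real compilers and hardware are far more sophisticated, but the direction of the effect is the one described above.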
In essence, it's a case of what each company is optimizing for. ATI is focused on the workloads of the present, and NV on the workloads of the future.
There are some other issues as well - ATI genuinely has better physical design and implementation than NV. They also have more experience with TSMC 40nm and with GDDR5. All of those combined give them a much more compact and power efficient product.
DK