> 1. Shared Memory (yes it's used to hold operands for interpolation but that's not its primary purpose)

16KB of shared memory should be used quite regularly, I expect. Worst case: 24 warps, each with one triangle, each with 3 vertices, each with 16 vec4 attributes = 18KB.
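Just to make that arithmetic explicit, here's a tiny worked sketch of the worst-case estimate - every figure in it is simply the assumption stated above (24 warps, one triangle per warp, 3 vertices, 16 vec4 attributes, 16 bytes per vec4), not a number from any documentation:

```
// Worked version of the worst-case shared-memory estimate above.
// All figures are the assumptions stated in the post, nothing more.
#include <cstdio>

int main()
{
    const int warps          = 24;  // resident warps (assumed)
    const int tris_per_warp  = 1;   // one triangle per warp
    const int verts_per_tri  = 3;
    const int attrs_per_vert = 16;  // vec4 attributes per vertex
    const int bytes_per_attr = 16;  // 4 floats x 4 bytes

    const int bytes = warps * tris_per_warp * verts_per_tri
                    * attrs_per_vert * bytes_per_attr;
    std::printf("Worst-case attribute storage: %d bytes (%.0f KB)\n",
                bytes, bytes / 1024.0);  // 18432 bytes = 18 KB
    return 0;
}
```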
> 2. Going scalar, and the extra overhead that comes with it. We've seen in the past few days examples of CS code that required explicit vectorization to take full advantage of VLIW hardware. At the same time you've put in a lot of work demonstrating VLIW's higher efficiency in game shaders.

NLM Denoise has also shown that vectorisation (of PS code, though the CS code should benefit equally) is worth doing not only on ATI but also on NVidia - but that's because it's memory bound.
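For anyone who hasn't seen the kind of explicit vectorisation being talked about, here's a rough sketch of a memory-bound kernel written scalar versus with float4 loads - the kernel names and the scale operation are purely illustrative, not taken from NLM Denoise or any of the other code mentioned:

```
// Scalar version: one float load/store per thread per element.
__global__ void scale_scalar(const float* in, float* out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * k;
}

// Vectorised version: one float4 per thread, so each thread issues a
// 16-byte load/store. On VLIW hardware the four multiplies can also pack
// into one instruction group; on scalar hardware the win comes mainly
// from the wider memory transactions.
__global__ void scale_vec4(const float4* in, float4* out, int n4, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        v.x *= k; v.y *= k; v.z *= k; v.w *= k;
        out[i] = v;
    }
}
```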
I'm convinced NVidia thought scalar was more efficient/easier than building another VLIW - after all, there's a widely held view that VLIWs are doomed to fail because compilers are terrible and absolute utilisation, or utilisation per mm², is the pits. ATI's compiler still needs work. Then again, NVidia's compiler still has issues with things like 3DMark Vantage feature tests (hardly an obscure application) bouncing up and down in performance across driver revisions.
In my view NVidia made a bet that it could build a more area-/power-efficient "scalar" architecture than VLIW - it still had to be considerably faster at graphics than G71 - NVidia's consistent on the subject of scaling graphics performance with each generation. I don't know if NVidia still thinks pixel shaders are tending towards scalar, but that was the "marketing" at launch. Generally a lot of NVidia's "marketing" has been about the inherent utilisation benefits of scalar, including for graphics. Scalar freed NVidia and graphics programmers from worries about "coding for an architecture" - everything "just works, optimally". Of course, one only finds out about the officially recognised failings in utilisation with each new GPU (i.e. the increased register file in GT200, the revamp of Fermi...). Anyway, NVidia appeared to think it could only win with scalar in terms of efficiency, however it's measured and regardless of workload.
Fermi appears to indicate NVidia has realised the wastefulness of G80-style per-thread out-of-order instruction issue. But such fine-grained scheduling appears to go hand-in-hand with the texturing architecture, which raises questions about how texturing (or load/store) instructions are going to be scheduled in Fermi - i.e. how the latency-hiding is scheduled/tracked. Fermi might be relaxed about this solely because L1 is huge in comparison with previous architectures - but I'm dubious, because texturing has historically been very happy with very small caches. Bit of a puzzler, that.
Maybe the benefits of out-of-order per-thread scheduling tail off as register file size increases (i.e. as the count of threads in flight increases), so it was always going to be an interim solution.
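A back-of-the-envelope way to see why the benefit might tail off (purely illustrative numbers, not measurements of G80/GT200/Fermi): latency hiding needs roughly latency x issue rate independent instructions in flight, so the more warps that are resident, the fewer of those each individual warp has to supply by reordering its own instruction stream.

```
// Little's-law style sketch: concurrency needed ~= latency x issue rate.
// All numbers here are assumptions for illustration only.
#include <cstdio>

int main()
{
    const double latency_cycles  = 400.0;  // assumed texture/memory latency
    const double issue_per_cycle = 1.0;    // assumed issue rate per scheduler
    const double in_flight = latency_cycles * issue_per_cycle;  // independent instructions needed

    // With few warps resident, each warp must supply several independent
    // instructions (so per-thread reordering earns its keep); with many
    // warps resident, one ready instruction per warp is enough.
    for (int warps = 8; warps <= 64; warps *= 2)
        std::printf("%2d warps -> %.1f independent instructions per warp\n",
                    warps, in_flight / warps);
    return 0;
}
```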
Jawed