BTW... I'd take coherent r/w caches over more than one triangle per clock any time of the day.
Agreed, but Fermi's cache actually creates a really awkward situation when comparing these cards, one that exposes something we haven't seen much of on GPUs up to this point: you can now write very platform-specific code in HLSL (or OpenCL, etc.).
From the Fermi docs it appears that caching will be enabled by default on DirectCompute global UAV accesses, which is a big deal. Beyond helping algorithms that genuinely have unpredictable memory access patterns, it raises the question of how important the local data store is now.
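To make that concrete, here's roughly the situation (a made-up sketch - buffer names, register slots and the tile size are all just mine): the same neighbourhood gather written two ways. Version A leans entirely on whatever global UAV caching the hardware provides; version B stages the data through the local data store the way you'd write it today if you care about ATI performance.

    RWStructuredBuffer<float> gInput  : register(u0);
    RWStructuredBuffer<float> gOutput : register(u1);

    // Version A: hit the UAV directly and let the cache (if any) absorb the reuse.
    [numthreads(64, 1, 1)]
    void CSGather_GlobalOnly(uint3 dtid : SV_DispatchThreadID)
    {
        float sum = 0.0f;
        for (int i = 0; i < 9; ++i)
            sum += gInput[dtid.x + i];   // neighbouring threads re-read the same elements
        gOutput[dtid.x] = sum;
    }

    // Version B: stage a tile (plus apron) into the local data store first.
    groupshared float sTile[64 + 8];

    [numthreads(64, 1, 1)]
    void CSGather_LocalStore(uint3 dtid : SV_DispatchThreadID, uint3 gtid : SV_GroupThreadID)
    {
        sTile[gtid.x] = gInput[dtid.x];
        if (gtid.x < 8)
            sTile[64 + gtid.x] = gInput[dtid.x + 64];
        GroupMemoryBarrierWithGroupSync();

        float sum = 0.0f;
        for (int i = 0; i < 9; ++i)
            sum += sTile[gtid.x + i];
        gOutput[dtid.x] = sum;
    }

If the docs are right, on Fermi version A might land close enough to version B that nobody bothers writing B anymore; on current ATI parts the gap between the two should still be huge.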
In the more cheesy department, it means that NVIDIA (or anyone) can now write DC code that runs well on Fermi but terribly on ATI parts - even artificially - simply by avoiding explicit use of local memory, even in places with predictable memory access patterns. Conversely, I'm guessing that the "globallycoherent" modifier was added to DC with Fermi in mind (it doesn't appear to do anything on ATI parts), so ATI (or anyone) could artificially disable this caching in their own code by putting it on every global buffer, whether it's "needed" or not.
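And "disabling" it really is a one-word change on the declaration (name and register slot made up here, obviously):

    // Presumably this forces UAV accesses to stay coherent across the whole device,
    // i.e. bypass/write through the per-core cache on Fermi; on ATI parts it doesn't
    // appear to change anything.
    globallycoherent RWStructuredBuffer<float> gInput : register(u0);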
More realistically, this means that code "optimized" on one card will not necessarily run well on the other (particularly in the developed-on-NVIDIA, run-on-ATI case for now). That has been true for some specific operations for a little while, but we're talking about orders of magnitude now. We're in the range where you might need to start writing genuinely different code for different architectures... "performance portability" is turning into a bit of a thing of the past. The same goes for code scaling to future architectures: it's amusing to see NVIDIA already noting several "legacy" code problems in their newest CUDA programming guides (use of texture lookups for caching, local data store sizes, bank conflict patterns, etc.).
Sorry for being a bit OT, but I figured this was a good place to throw down some thoughts given the impending release of competitive benchmarks between these two architectures. It's going to be increasingly tough to declare overall "winners", since the answer is going to depend a lot more on how the code is written in the future (more like CPUs - though more along the lines of Core i7 vs. Cell or similarly fuzzy comparisons).
[Aside: globallycoherent is a particularly weird modifier in that it seems to apply only to a case that is unsafe to write code against in the first place, namely one CS group communicating with another. The problem is that with parallelism and CS group execution ordering left completely undefined, I'm not sure there is a "safe" use of this functionality. Maybe I'm misreading its usefulness though... see the sketch below for the kind of thing I mean.]
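For reference, this is the sort of pattern I'm talking about (entirely made up, names and all, and exactly the kind of thing I'd hesitate to ship): group 0 publishes a value and every other group spins waiting for it. With group scheduling undefined, nothing stops the consumer groups from occupying the machine before group 0 ever runs, at which point you may have hung the GPU.

    globallycoherent RWStructuredBuffer<uint> gMailbox : register(u0);  // [0] = flag, [1] = payload

    [numthreads(64, 1, 1)]
    void CSHandshake(uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID)
    {
        if (gid.x == 0 && gtid.x == 0)
        {
            gMailbox[1] = 42;           // publish the payload...
            DeviceMemoryBarrier();      // ...and make sure it's visible before the flag
            gMailbox[0] = 1;
        }
        else if (gtid.x == 0)
        {
            uint ready = 0;
            [allow_uav_condition]
            while (ready == 0)          // spinning on another group's write - dubious at best
                ready = gMailbox[0];
        }
    }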