Well, the problem is that AMD's current design, although better in peak performance (density) and in many image-processing workloads, is not necessarily better in other workloads. It's much easier to write and optimize for a scalar model than for a vector model.
Only as long as your algorithm has absolutely no dependency on memory performance. If there is any dependence, of course it's a matter of how much marginal improvement there is to be had in vectorising the data accesses.
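To make that concrete, here's a kernel-only CUDA sketch of my own (the SAXPY example and the kernel names are just illustrations, nothing from either vendor): the scalar-model version is trivial to write, and whether it's worth packing the work into vectors depends almost entirely on the memory accesses, not the ALUs.

```cuda
// Illustrative only: a memory-bound SAXPY in the "scalar per work-item" style
// that CUDA encourages. Each thread does one MAD that waits on two loads, so
// extra ALU width (e.g. filling 5 VLIW slots) buys very little here.
__global__ void saxpy_scalar(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // bound by the two loads, not the MAD
}

// Vectorising the *data accesses* (float4 loads/stores) is what can actually
// help when memory performance is the limiter; the four MADs per thread are a
// side effect, not the point.
__global__ void saxpy_vec4(int n4, float a, const float4 *x, float4 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 xv = x[i], yv = y[i];
        yv.x = a * xv.x + yv.x;
        yv.y = a * xv.y + yv.y;
        yv.z = a * xv.z + yv.z;
        yv.w = a * xv.w + yv.w;
        y[i] = yv;
    }
}
```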
In double-precision, ATI's ALUs are scalar for MULs and MADs and vec2 for ADDs. GF100 at 1.5GHz will be slower for DP-ADD than HD5870.
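Back of the envelope (my numbers, assuming the usually-quoted 320 VLIW units at 850MHz for Cypress and a 512-lane GF100 running DP at half rate):

```cuda
// Rough peak DP-ADD rates implied by the figures above; plain host code.
#include <cstdio>

int main()
{
    double hd5870_dp_add = 320 * 2 * 0.85e9;    // vec2 DP ADD per VLIW unit per clock
    double gf100_dp_add  = (512 / 2) * 1.5e9;   // half-rate DP across 512 lanes
    std::printf("HD5870 DP-ADD: ~%.0f Gops/s\n", hd5870_dp_add / 1e9);  // ~544
    std::printf("GF100  DP-ADD: ~%.0f Gops/s\n", gf100_dp_add / 1e9);   // ~384
    return 0;
}
```

So even at a 1.5GHz hot clock, GF100's DP-ADD rate comes out around 384Gops/s against roughly 544Gops/s for HD5870.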
In the end, it's possible that many GPGPU programs will run faster on NVIDIA's GPU than on AMD's, even though AMD's GPU has higher peak performance.
Yes, you can always wait a year or two for NVidia's performance to catch up; in the meantime you've got a cosy coding environment and all the other niceties of NVidia's solution. And with CUDA specifically (in theory, anyway), NVidia is going places that AMD won't be bothered about for a couple of years - and there's a decent chance those features will improve performance by enabling better algorithms, so your loss is likely lower if your problem is at all complex.
Though right now I'm hard-pressed to name anything in Fermi that makes for better performance because it allows for more advanced algorithms (that's partly because I don't know whether AMD has done lots of compute-specific tweaks - there are only hints, and D3D11/OpenCL leave plenty of room). Gotta wait and see.
If this turns out to be true, there's no reason why NVIDIA should go AMD's route. Of course, AMD may try to go NVIDIA's route, but even so they are not going to have very similar architectures, at least in the near future.
Broad comparison of compute at the core level:
- ATI (mostly ignoring control flow processor and high level command processor)
  - thread size 64
  - in-order issue 5-way VLIW
  - slow double-precision
  - "statically" allocated register file, with spill (very slow spill?) and strand-shared registers
  - large register file (256KB) + minimal shared memory (32KB) + small read-only L1 (8KB?)
  - high complexity register file accesses (simultaneous ALU, TU and DMA access), coupled with in-pipe registers
  - separate DMA in/out of registers instead of load-store addresses in instructions
  - stack-based predication (limited capacity of 32) for stall-less control flow (zero-overhead)
  - static calls, restricted recursion
  - 128 threads in flight
  - 8 (?) kernels
- Intel (ignoring the scalar x86 part of the core)
  - thread size 16
  - in-order purely scalar-issue (no separate transcendental unit - but RCP, LOG2, EXP2 instructions)
  - half-throughput double-precision
  - entirely transient register file
  - small register file, large cache (256KB L2 + 32KB L1) (+ separate texture cache inaccessible by core), no dedicated shared memory
  - medium complexity register file (3 operand fetches, 1 resultant store)
  - branch prediction coupled with 16 predicate registers (zero-overhead apart from mis-predictions)
  - dynamic calls, arbitrary recursion
  - 4 threads in flight
  - 4 kernels
- NVidia (unknown internal processor hierarchy)
  - thread size 32
  - in-order superscalar issue across a three-SIMD vector unit: 2x SP-MAD + special function unit (not "multi-function interpolator")
  - half-throughput double-precision
  - "statically" allocated register file, with spill (fast, cached?)
  - medium register file + medium-sized multi-functional cache/shared-memory
  - super-scalar register file accesses (for ALUs; TUs too?)
  - predicate-based stall-less branching (with dedicated branch evaluation?)
  - dynamic calls, arbitrary recursion
  - 32 threads in flight
  - 1 kernel
I don't think it's been commented on explicitly so far in this thread, but NVidia has got rid of the out-of-order instruction despatch that scoreboards each instruction (to assess dependencies on prior instructions). Now NVidia is scoreboarding threads, which should save a few transistors, as operand readiness is evaluated per thread and instructions are issued purely sequentially.
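A toy sketch of the idea (entirely my own construction, not a description of the real scheduler): readiness is tracked with a single wake-up time per warp rather than a scoreboard entry per in-flight instruction, and each warp's instructions issue strictly in program order.

```cuda
// Toy per-warp scoreboard: plain host code, purely conceptual.
#include <cstdio>
#include <vector>

struct Warp {
    int  pc           = 0;      // next instruction, issued strictly in order
    int  wakeup_cycle = 0;      // when this warp's outstanding operands arrive
    bool done         = false;
};

int main()
{
    const int kInstructions = 8;   // pretend each warp runs 8 instructions
    const int kLoadLatency  = 4;   // pretend every other one is a slow load
    std::vector<Warp> warps(4);    // 4 resident warps on the "core"

    for (int cycle = 0; ; ++cycle) {
        bool all_done = true;
        for (auto &w : warps) {
            if (w.done) continue;
            all_done = false;
            // Per-warp readiness test: no per-instruction dependency matrix,
            // just "have this warp's outstanding operands arrived yet?"
            if (cycle < w.wakeup_cycle) continue;
            if (w.pc % 2 == 0) w.wakeup_cycle = cycle + kLoadLatency;
            if (++w.pc == kInstructions) w.done = true;
            break;                 // one issue slot per cycle in this toy
        }
        if (all_done) {
            std::printf("all warps retired at cycle %d\n", cycle);
            break;
        }
    }
    return 0;
}
```

The only point of the toy is that the bookkeeping scales with the number of resident warps rather than with the number of in-flight instructions, which is presumably where the transistor saving comes from.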
Jawed