I imagine this'll depend on who writes the OpenCL - from what I've seen so far of CUDA/Brook+ programming, one has to tweak the algorithm extensively for the granularities of the architecture (at least when the memory system is a dominant factor in performance).
Thing is, you can't run CUDA code on ATi GPUs and you can't run Brook+ code on nVidia GPUs.
So you have no way to make a direct comparison whatsoever. OpenCL will work on both.
Besides, the point behind OpenCL should be: write once, run anywhere.
Of course you'll always be able to tweak for a certain architecture, but that only goes so far.
With CPUs it's common to have multiple code paths for various architectures. This can and probably will be done with OpenCL as well, if required... But even if you have tweaked paths for CPU A and CPU B, only one of them will be the fastest.
There are even plenty of examples of CPUs that were slower DESPITE having tweaked code. Tweaking code is no guarantee that a CPU will deliver best-in-class performance.
So the argument 'who wrote the code' probably won't hold in practice (except for developers with a hidden agenda). Decent programmers will either write blended code that avoids performance pitfalls on both architectures, or they will supply two different versions, each tweaked for one architecture.
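To make that concrete, here's a minimal host-side sketch of how an application might pick between two per-architecture kernel variants at runtime in OpenCL. The kernel names ("sgemm_lds" / "sgemm_plain") and the vendor check are purely illustrative assumptions, not anything from a real codebase:

```c
/* Sketch: pick a kernel variant at runtime based on the device.
 * The kernel names are hypothetical; real code would also look at
 * local memory size, wavefront/warp width, etc. */
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

const char *pick_kernel_variant(cl_device_id dev)
{
    char vendor[256] = {0};
    cl_device_local_mem_type lm_type;

    clGetDeviceInfo(dev, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(lm_type), &lm_type, NULL);

    /* Only take the explicitly tiled (local-memory) path if the device
     * reports dedicated local memory; the vendor match is illustrative. */
    if (lm_type == CL_LOCAL && strstr(vendor, "NVIDIA"))
        return "sgemm_lds";      /* hypothetical shared-memory-tiled kernel */
    return "sgemm_plain";        /* hypothetical straight-through kernel    */
}

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL GPU found\n");
        return 1;
    }
    printf("using kernel variant: %s\n", pick_kernel_variant(device));
    return 0;
}
```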
Given those circumstances (which I deem to be both fair and realistic), I wonder if nVidia will be able to show an advantage in most tasks (I'm sure there will always be exceptions... even the mighty Core i7 can't win *every* benchmark out there). After all, that's what they've focused on for the past few years, and it's largely the reason why their chips are so much bigger than ATi's.
That's where the payoff needs to be.
A simple example is SGEMM. On NVidia you can only get decent performance if you use shared memory. On ATI you get better performance (in absolute terms) without it, and if you do use LDS you get worse performance - ATI's caches work well enough that staging data through LDS actually slows things down. That's an extreme example, I'm sure - but still, it's entertaining.
Well, if ATi has performance problems with shared memory, that could be an issue... Shared memory (local memory, in OpenCL terms) is a standard feature of OpenCL. There could be a bit of a payoff for nVidia there.
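For reference, this is roughly what a shared-memory SGEMM kernel looks like in OpenCL C - a simplified, untuned sketch (plain C = A*B with alpha=1, beta=0, square row-major matrices, N a multiple of the tile size, work-group size TILE x TILE), just to show the __local staging and barriers being discussed:

```c
// Simplified illustration of local-memory tiling, not a tuned SGEMM.
#define TILE 16

__kernel void sgemm_tiled(const int N,
                          __global const float *A,
                          __global const float *B,
                          __global float *C)
{
    __local float Asub[TILE][TILE];   // tile of A staged in local memory
    __local float Bsub[TILE][TILE];   // tile of B staged in local memory

    const int lx = get_local_id(0);   // column within the tile
    const int ly = get_local_id(1);   // row within the tile
    const int gx = get_global_id(0);  // global column of C
    const int gy = get_global_id(1);  // global row of C

    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each work-item loads one element of each tile.
        Asub[ly][lx] = A[gy * N + (t * TILE + lx)];
        Bsub[ly][lx] = B[(t * TILE + ly) * N + gx];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; ++k)
            acc += Asub[ly][k] * Bsub[k][lx];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[gy * N + gx] = acc;
}
```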
On the other hand, there's folding@home, which runs 2-3x better on NVidia because of shared memory. But this factor tails off as the molecule increases in size.
Yea, funny how something like that can completely turn the performance-per-mm2 argument upside down.
That's basically what I'm talking about here... Sure, in terms of graphics, nVidia seems to have dies that are 'too large'... But what if they are 2-3x faster than ATi's in OpenCL applications?
Then suddenly the die size is completely justified, and ATi will look like underpowered, inefficient, outdated junk.
Not that I expect this to happen, but still... GPGPU may make nVidia's architecture look better than it does today.
In the experimentation I've been doing with Brook+ (which compiles to IL), and the resulting "experimentation" with compilation from IL into machine code, I see lots of immaturity and brokenness. For example, the machine code sometimes doesn't pack scalar registers into vectors (though it usually does) - vector registers are the only allocation granularity, even though they can be freely accessed as scalars. When the packing fails, register allocation is wasted on a massive scale.
So it seems to me that AMD's compilers are going to look immature. EDIT: I suspect we can see evidence of this in games like Far Cry 2, where performance takes months to get to where it should be.
That would be another payoff then... nVidia has invested quite some resources in its CUDA compilers over the past few years.