The evidence is simple: gaming tests with shader-heavy games (where the limited texture filtering capacity doesn't handicap R600). Without AA, the 2900XT keeps up with, and is sometimes a bit ahead of, the 8800GTX; switch on 4xAA and there's suddenly a 10-20% gap in favour of the GTX. (I'll dig up a link or two if you need them - these plagued the web in May-June, but right now it takes a little longer to find them.)
When you're digging, bear in mind two things: first, broken performance in R600 drivers; second, games that do Z-only passes will see a much lower Z-only pixel rate on R600 in comparison with G80 - and Z-only passes can be quite costly in terms of their percentage of frame rendering time (e.g. doing early-Z or rendering shadow maps).
You should include R580 in your comparisons, too, though beware: the Z-only rate on that GPU is lower when AA is off.
As for the explanation, I've got a lightweight theory; here goes. I assume that when the shader core does AA resolve, all 64 superscalar units have to do it - no mixing-in of PS/VS/GS code is possible.
Yep:
http://forum.beyond3d.com/showpost.php?p=1021653&postcount=867
While I know the basics of how MSAA works, I'm not familiar with the exact operation - still, I assume the 16 ROPs severely underfeed the shader core, thereby not only "stealing" shader capacity but also wasting a fair amount of it.
How would these ROPs not "underfeed" a ROP-based AA-resolve, if R600 didn't do shader resolve?
Anyway, it seems likely the ROPs are involved in "decompressing" an AA'd render target (because the compression information is managed by the ROPs) before the shader resolve can do anything. The resolve uses texture operations to read this render target, but AMD alludes to a "quick path". Perhaps this consists of the ROPs transmitting compression data to the ALU pipeline (via the memory cache?) - the instantaneous bandwidth consumed by AA resolve, whose samples are otherwise uncompressed, is phenomenal...
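To be clear about what I mean by "shader resolve": as I understand it, it's nothing fancier than a per-pixel box filter over the samples. Here's a toy sketch in Python (purely illustrative - this is the sum(sample * 0.25) idea, not R600's actual microcode or its "quick path"):

```python
# Toy sketch of a 4xMSAA box-filter resolve - illustrative only,
# not R600's actual resolve code or its "quick path".

def resolve_pixel(samples):
    """Average the 4 MSAA colour samples of one pixel.

    Each sample is an (r, g, b) tuple; the resolve is just
    sum(sample * 0.25) per channel.
    """
    assert len(samples) == 4
    return tuple(sum(channel) * 0.25 for channel in zip(*samples))

# e.g. a pixel straddling an edge: two red samples, two black
samples = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0),
           (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(resolve_pixel(samples))  # (0.5, 0.0, 0.0)
```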
R600 can fetch 16 AA samples per clock through its texture hardware, which means 1/4 rate for the 64 ALU pipelines, i.e. 4 cycles per AA sample per pixel, or 16 cycles per pixel for 4xAA. This assumes no pixel has compressed AA samples; compression would speed this up dramatically (say 2x or 4x). Assuming the arithmetic on these samples is nothing more than sum(sample * 0.25), which is 4 cycles, the effective total shader duration is 16 cycles worst case - the time to fetch the samples. At 742MHz, that's 724fps for a 2560x1600 render target (2.968G pixels/s).
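For what it's worth, that arithmetic checks out as a quick script (assuming R600's 742MHz core clock and the fetch/pipeline figures above):

```python
# Worst-case (uncompressed) 4xAA shader-resolve rate on R600.
# Assumes a 742MHz core clock; other figures as in the post.

clock_hz   = 742e6   # R600 core clock (assumed)
pipes      = 64      # superscalar ALU pipelines
fetch_rate = 16      # AA samples fetched per clock
aa_samples = 4       # 4xMSAA

cycles_per_sample = pipes // fetch_rate              # 4: ALUs run at 1/4 fetch rate
cycles_per_pixel  = cycles_per_sample * aa_samples   # 16, fetch-limited

pixels_per_sec = pipes * clock_hz / cycles_per_pixel  # 2.968G pixels/s
fps = pixels_per_sec / (2560 * 1600)

print(cycles_per_pixel, round(pixels_per_sec / 1e9, 3), int(fps))  # 16 2.968 724
```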
As you say, the ALUs are bottlenecked by the fetches, running at 15% utilisation (= 25% due to the fetch rate x 60% because RGB occupies only 3 of the 5 ALUs).
At 60fps and 2560x1600, there are 193 ALU clocks per pixel (shared by vertex and pixel work). 16 clocks of AA resolve amounts to 8.3%, worst case (no compression) - in fact, if no pixels were compressed, the limited bandwidth available would roughly double that cost.
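Same back-of-envelope style for the per-pixel ALU budget (again assuming the 742MHz core clock):

```python
# ALU clocks available per pixel at 60fps, 2560x1600, on R600.
# Assumes a 742MHz core clock and 64 ALU pipelines, as above.

clock_hz = 742e6
pipes    = 64
pixels   = 2560 * 1600
fps      = 60

alu_clocks_per_pixel = pipes * clock_hz / (pixels * fps)  # ~193
resolve_share = 16 / alu_clocks_per_pixel                 # worst case, no compression

print(round(alu_clocks_per_pixel), round(resolve_share * 100, 1))  # 193 8.3
```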
So, AA resolve on its own isn't very costly. If there are multiple 2048x2048 MSAA'd shadow maps being resolved per frame, then it starts adding up, hence my earlier comment.
I also think that R600 cannot use all of its shading power on pixel shading (which is still quite dominant in games) - I have a nasty feeling that the fifth sub-unit in each superscalar shader remains mostly idle, as today's shader code is not parallelisation-friendly. If I'm right, then the practical PS capacity of R600 is only about 10% more than G80's (let's forget about G80's extra MUL per SP per cycle for now) - and this is why the wasted shader capacity is so painful. In practice, R600 walks away beaten from a number of battles it could easily have won...
Games (on G80/R600) are rarely ALU-limited. It seems fillrate or bandwidth or texture rate always gets you first.
Also, G80's apparent "efficiency" is not all gain: e.g. it uses its ALUs to interpolate attributes (texture coordinates), whereas R600 has dedicated hardware for that. It's good for ALU utilisation in G80, but it can easily hurt pixel throughput too - e.g. bilinear texturing on G84 and G92 appears to show a severe ALU bottleneck (when compared against theoretical rates).
Jawed