Maybe I'm being a little incorrect in my usage of terminology here, not to mention probably misunderstanding something. I had thought AA was primarily memory-bandwidth bound, due to the extra texture data fetches, and not particularly decided by fillrate. In R600's case this then has the extra overhead of shader execution time, which is where I had thought the performance penalty comes in when doing 'simple' MSAA compared to the G80 series.
Is that right?
The texture fetches required to do shader AA aren't using "extra" bandwidth as such. The data fetched to perform a hardware AA resolve is the same as that fetched to perform a shader AA resolve.
This may not be 100% true, because "something" may be happening with the compression tags in order to support shader AA resolve. Whatever that something is may well carry a bandwidth overhead. But then again, it might not.
e.g. the compression tags may be data that gets dumped into the 8KB R/W memory cache (per SIMD) to be fetched by the AA-resolve shader as it progresses across the render target. Judging by that patent document it's possible that the compression data is located in two places: in an on-die tag table and as per-tile status stored in VRAM. Anyway, if the R/W cache is used, then no extra bandwidth is consumed.
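To make that concrete, here's a speculative sketch of how per-tile compression status could gate the work an AA-resolve shader actually does. The tile size, tag values and single-channel "colours" are all made up for illustration; none of this is documented R600 behaviour.

```python
# Speculative: per-tile compression tags deciding how much resolve work is
# needed. Tags, tile size and scalar colours are assumptions, not R600 fact.

TILE_W, TILE_H = 8, 8      # assumed tile dimensions
CLEAR_COLOUR = 0.0         # assumed clear value (one channel for simplicity)

def resolve_tile(tag, samples, num_samples):
    """samples[y][x] is a list of num_samples colours for one pixel."""
    if tag == "clear":
        # Whole tile still holds the clear colour: no sample fetches at all.
        return [[CLEAR_COLOUR] * TILE_W for _ in range(TILE_H)]
    if tag == "compressed":
        # All samples in each pixel are identical: one fetch per pixel.
        return [[samples[y][x][0] for x in range(TILE_W)]
                for y in range(TILE_H)]
    # "expanded": every sample is unique, fetch and average them all (box).
    return [[sum(samples[y][x]) / num_samples for x in range(TILE_W)]
            for y in range(TILE_H)]
```

If the tags themselves ride along in that R/W cache, the shader gets the tag lookup without touching VRAM.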
So the "overhead" is mostly on the ALU units. Let's do a worst-case guesstimate: say an average of 50 ALU cycles per pixel to perform an 8xMSAA resolve for a 2560x1600 render target at 60fps:
- 4 ALU clocks available per pixel drawn on screen (64 ALU pipes / 16 RBEs)
- theoretical fillrate of 742 MHz * 16 RBEs = 11.872 Gpixels/s
- equals 47.488 G ALU clocks per second of capacity
- 2560 * 1600 = 4,096,000 pixels
- at 60 fps that's 245.76 Mpixels/s
- AA resolve at 50 ALU clocks per pixel equals 12.288 G ALU clocks per second
- AA resolve costs 25.9% of R600's ALU capacity
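Here's that arithmetic as a quick script, with the assumptions spelled out (the 50 clocks/pixel resolve cost is a guess; the clock and unit counts are R600's published specs):

```python
core_hz    = 742e6
alu_pipes  = 64
rbes       = 16
alu_budget = core_hz * alu_pipes            # 47.488 G ALU clocks/s
fillrate   = core_hz * rbes                 # 11.872 G pixels/s

pixels_per_sec = 2560 * 1600 * 60           # 245.76 M pixels/s at 60 fps
resolve_clocks = pixels_per_sec * 50        # 12.288 G ALU clocks/s (guessed cost)

print(resolve_clocks / alu_budget)          # ~0.259, i.e. 25.9% of ALU capacity
```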
That's a lot.
But that still leaves ~350 GFLOPs for all other shading, approximately the same as the 8800 GTX's total available GFLOPs...
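Checking that comparison, counting a MAD as two FLOPs and ignoring G80's co-issued MUL (both simplifications on my part):

```python
r600_total = 320 * 2 * 0.742       # ~474.9 GFLOPs total for R600
print(r600_total * (1 - 0.259))    # ~351.9 GFLOPs left after the AA resolve

g80_total = 128 * 2 * 1.350        # ~345.6 GFLOPs for 8800 GTX
```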
http://www.bit-tech.net/hardware/2007/06/15/xfx_geforce_8800_ultra_650m_xt/9
That page shows 2560x1600 with 4xMSAA at >60 fps on R600. But 4xMSAA probably needs fewer clocks for the AA resolve, say 35 for the sake of argument... Prey isn't known for being particularly heavy in ALU instructions per pixel. Guessing ~20 per pixel (before, say, 5x overdraw), so maybe 100 ALU clocks per screen pixel of actual shader code? Should add a bit more in there for vertex shading...
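Plugging those guesses in, the frame appears to fit within R600's ALU budget even before the vertex work:

```python
# All three workload numbers (35-clock resolve, 20 ALU instructions/pixel,
# 5x overdraw) are the guesses from the paragraph above.

budget_per_px = 47.488e9 / (2560 * 1600 * 60)   # ~193 ALU clocks per pixel
cost_per_px   = 35 + 20 * 5                     # resolve + shading = 135
print(cost_per_px, budget_per_px)               # 135 vs ~193: it fits
```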
I dunno about the ALU clock cost of shader AA. I'm thinking that decoding the compression tags and unpacking the compressed samples is fairly costly. Hmm, comparing box, narrow-tent and wide-tent AA resolves should give some indication of the ALU clock cost...
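For reference, here's roughly what that comparison would be measuring: a box resolve just averages a pixel's own samples, while the tent filters also pull in neighbouring samples and weight each by distance from the pixel centre. The sample layout and radii here are illustrative only.

```python
def box_resolve(samples):
    # samples: list of (colour, distance) pairs; box ignores the distance.
    return sum(c for c, _ in samples) / len(samples)

def tent_resolve(samples, radius):
    # Linear (tent) falloff with distance from the pixel centre; a wider
    # radius means a bigger footprint and more samples to fetch and weight.
    total, weight = 0.0, 0.0
    for colour, dist in samples:
        w = max(0.0, 1.0 - dist / radius)
        total += colour * w
        weight += w
    return total / weight if weight else 0.0
```

Per sample, box is a single accumulate while tent adds a weight calculation and a multiply on a larger footprint, so the frame-time deltas between the three modes should bracket the per-sample ALU cost.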
Having said all that, the cost in terms of texturing rate is notable (since texturing rate is a scarce resource in R600). Assuming that all 80 samplers can be used for the AA resolve, but being generous and saying that texturing rate is defined merely by the 16 filtering units, the 25.9% ALU cost translates into a 5.2% bilinear texturing cost.
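That's just the ALU fraction rescaled by the ratio of filtering units to samplers:

```python
print(0.259 * 16 / 80)    # ~0.052, i.e. 5.2% of bilinear texturing rate
```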
In the end, R600's AA sample rate of 32 samples per clock is the real hindrance: 23.744 G samples/s versus the 8800 GTS-640's ability to do 40 G samples/s.
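For the record, those sample rates come out of the clocks like so (the GTS figure assumes its 500 MHz core clock, implying 80 samples per clock):

```python
print(32 * 0.742)    # 23.744 G samples/s for R600
print(80 * 0.500)    # 40.0   G samples/s for 8800 GTS-640
```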
Jawed