Disclaimer: I base this solely on the publicly available information on Xenos, most notably this B3D article. I have no developer manual and certainly no dev kit.
In terms of computational resources, Xenos' AA really is absolutely and honestly free.
Per "pixel", Xenos computes one color and a depth gradient, and determines a subpixel coverage mask. The mask is just four bits. The depth gradient is sufficient because all potentially covered subpixels, while they have different depth values, come from the same triangle. Hence the connection to the daughter die doesn't need much bandwidth (actually less bandwidth than an equivalent PC part's path to memory, because blending is also "free", even without any AA).
Inside the daughter die, the color and z values are replicated to all covered samples according to the mask bits, then the z test and blending are done. So in the worst case, for one incoming pixel the daughter die needs to read/modify/write four subpixel depth values (for a subpixel-precise depth test) and read/modify/write four subpixels' colors (for blending). The eDRAM daughter die has very high internal bandwidth and can cope with all of this just fine, and this is exactly why it is deemed "free".
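To make the replicate/test/blend step concrete, here is a minimal sketch in Python. It is only an illustration of the description above, not how the hardware is actually built: the function name, the data layout, the "less than" depth comparison and the caller-supplied blend function are all my assumptions.

```python
def write_pixel(samples, color, z_at_sample, mask, blend):
    """Apply one incoming pixel to its four stored samples.

    samples      -- list of 4 [color, depth] pairs already in the buffer
    color        -- the single color computed for the pixel
    z_at_sample  -- 4 depth values derived from the depth gradient
    mask         -- 4-bit coverage mask, bit i set means sample i is covered
    blend        -- function(src_color, dst_color) -> new color (hypothetical)
    """
    for i in range(4):
        if not (mask >> i) & 1:
            continue                      # sample not covered by the triangle
        dst_color, dst_z = samples[i]     # read
        if z_at_sample[i] < dst_z:        # assumed "less than" depth test
            # modify + write: same color for every covered sample,
            # but a per-sample depth value
            samples[i] = [blend(color, dst_color), z_at_sample[i]]
    return samples
```

The worst case mentioned above is a mask of 0b1111 where every sample passes the depth test: four depth and four color read/modify/writes for a single incoming pixel.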
The catch:
The eDRAM daughter die, while having that very high bandwidth, only has limited storage space. Rendering in high resolutions with AA will exhaust this space. Doing 4xMSAA requires four times as much space to be set aside as rendering without AA. If you don't have that space, you can't hold the whole backbuffer at once.
The proposed solution is to split the scene up into tiles that do fit the eDRAM space limits.
E.g. instead of rendering a complete 1280x720 frame with 4xMSAA (which you can't), the rendering process can be split up into three 1280x240 partitions (roughly 9.8MB each) which are rendered sequentially. You flush out each finished partition to system memory to make room for the next one. Once all three partitions are down in system memory the frame is done; you can point the RAMDAC there to scan it out and start building up the next frame in the same way.
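The arithmetic behind the three partitions can be sketched like this. The 8 bytes per sample (32-bit color plus 32-bit depth/stencil) is my assumption, chosen because it reproduces the roughly 9.8MB figure; treating the 10MB of eDRAM as 10 MiB is an assumption too, and the function names are made up for the example.

```python
EDRAM_BYTES = 10 * 1024 * 1024   # 10MB of eDRAM, taken as 10 MiB here
BYTES_PER_SAMPLE = 4 + 4         # assumption: 32-bit color + 32-bit depth/stencil


def backbuffer_bytes(width, height, samples):
    """Storage needed for color and depth of every sample."""
    return width * height * samples * BYTES_PER_SAMPLE


def partitions_needed(width, height, samples):
    """Smallest number of equal horizontal strips that each fit in eDRAM."""
    n = 1
    # -(-height // n) is ceiling division: the height of the tallest strip
    while backbuffer_bytes(width, -(-height // n), samples) > EDRAM_BYTES:
        n += 1
    return n
```

Under these assumptions a 1280x720 frame without AA fits in one go, while the same frame with 4xMSAA needs about 29.5MB and has to be split into three 1280x240 strips of 9,830,400 bytes each.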
But this is not free. Rendering the whole scene in one go is more efficient. Now let me base the explanation on a regular PC GPU (IMR) for the moment:
You'll have to resubmit and hence retransform at least some geometry. You should not need to render a triangle in partition 1 if you know it will only be visible in partition 2, but a triangle that overlaps two partitions will have to be rasterized twice for correct results. "Knowing" is a problem though. For determining resubmission at the triangle level you'd need impractical amounts of (CPU) preprocessing, so you'll end up resubmitting huge gobs of geometry, if not your entire scene geometry, three times for three partitions. I.e. you do three times the vertex processing work and three times the trisetup work compared to non-partitioned rendering. As the submission isn't free either, you'll also pay three times the geometry bandwidth costs (shared system memory in the case of Xenos) and three times the CPU costs associated with traversing the scene graph, setting up render states and queuing up the draw calls to the hardware. Overall, the only cost that doesn't triple is fillrate (because each partition has fewer pixels than a whole frame).
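As a back-of-the-envelope model of that paragraph, assuming the worst case where the entire scene is resubmitted for every partition: everything except fill work is paid once per partition. The cost names and the function are hypothetical and the units are arbitrary; this is just the reasoning above written down.

```python
def relative_frame_cost(per_frame, partitions):
    """Scale per-frame costs for naive full resubmission per partition.

    per_frame  -- dict of cost name -> cost when rendering the frame in one go
    partitions -- number of partitions the frame is split into

    Geometry- and CPU-related costs are paid once per partition. Fill work
    is assumed to stay constant, since each partition only covers its own
    share of the frame's pixels.
    """
    scaled = {}
    for name, cost in per_frame.items():
        scaled[name] = cost if name == "fill" else cost * partitions
    return scaled
```

For example, relative_frame_cost({"vertex": 10, "setup": 5, "cpu": 8, "fill": 20}, 3) triples the first three entries and leaves "fill" at 20.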
There are several conceivable ways in which Xenos could assist this partitioned rendering process at the hardware level.
1) Xenos might well support nothing more than a reconfigurable viewport, i.e. nothing worth talking about. The same costs as set forth in the above PC based explanation apply.
2) Xenos might buffer up the entire untransformed scene description. This would remove repeated scene graph traversal from the equation but costs storage space for that buffer. Replaying such a display list can be made slightly more efficient than "talking to the driver" again from the application side.
3) Xenos might do the same as #2 but with transformed geometry. Costs for retransforming geometry are eliminated.
4) Xenos might do #3 and additionally sort the triangles in the display list to eliminate or at least reduce the amount of "useless for the current partition" triangles.
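To illustrate what the sorting in #4 could look like, here is a minimal sketch that bins already transformed triangles by the horizontal partitions their vertical extent touches. This is purely my illustration of the idea, not a claim about what Xenos does; the function name and the triangle representation are made up.

```python
def bin_triangles(triangles, frame_height, partitions):
    """Sort screen-space triangles into per-partition lists.

    triangles -- list of triangles, each a tuple of three (x, y) vertices
                 in pixel coordinates (already transformed)
    Returns one list of triangle indices per horizontal partition.
    """
    strip = frame_height / partitions
    bins = [[] for _ in range(partitions)]
    for index, tri in enumerate(triangles):
        ys = [y for _, y in tri]
        if max(ys) < 0 or min(ys) >= frame_height:
            continue                                  # entirely off screen
        first = max(0, int(min(ys) // strip))
        last = min(partitions - 1, int(max(ys) // strip))
        for p in range(first, last + 1):
            bins[p].append(index)                     # overlaps partition p
    return bins
```

A triangle that straddles a partition boundary still lands in two bins and gets rasterized twice, but triangles that sit entirely inside one partition are only replayed once instead of once per partition.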
I don't know what's really going on with Xenos here, but either way this should explain why performance is going to be lost if you use AA at higher resolutions, even though from a different point of view it truly is free.
Joke:
5) Xenos might be a TBDR, and as such it would do #4 but get even more bang out of the work done during the sorting process.
(This is nonsense: if it were true, Xenos wouldn't need 10MB of eDRAM, it could make do with a few kilobytes of on-chip tile storage.)