ATI and Microsoft decided to take advantage of the Z-only rendering pass, which is the expected performance path regardless of tiling, and found a way to use it to assist with tiling the screen so as to optimise eDRAM utilisation. During the Z-only rendering pass the maximum screen-space extents of each object are calculated and saved, to avoid having to calculate the geometry multiple times. Each command is tagged with a header indicating which screen tile(s) it will affect. After the Z-only rendering pass the Hierarchical Z buffer is fully populated for the entire screen, so render order is no longer an issue. When rendering a particular tile, the command-fetching processor looks at the header applied during the Z-only pass to see whether the command's resultant data will fall into the tile currently being processed; if so the command is queued, and if not it is discarded until the next tile is ready to render. This process is repeated for each tile that requires rendering. Once the first tile has been fully rendered it can be resolved (FSAA down-sample) and that tile of back-buffer data can be written to system RAM; the next tile can begin rendering whilst the first is still being resolved. In essence this process has similarities with tile-based deferred rendering, except that it is not deferring for a frame and the "tile" it operates on is orders of magnitude larger than those most other tilers have used before.
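The tagging and per-tile filtering described above can be illustrated with a minimal software sketch. This is not how the hardware or its driver is implemented; the names `Command`, `tag_commands` and `commands_for_tile` are hypothetical, and the sketch assumes the screen is split into equal horizontal bands and that each command's screen-space extents are already known from the Z-only pass.

```python
from dataclasses import dataclass


@dataclass
class Command:
    """Hypothetical draw command with the extents saved during the Z-only pass."""
    name: str
    x0: int  # screen-space bounding box: min inclusive, max exclusive (pixels)
    y0: int
    x1: int
    y1: int
    tile_mask: int = 0  # the "header": bit t is set if the command touches tile t


def tag_commands(commands, screen_h, num_tiles):
    """Tag each command with a bitmask of the horizontal bands it overlaps.

    Assumption: tiles are equal-height horizontal bands covering the screen.
    """
    tile_h = -(-screen_h // num_tiles)  # ceiling division
    for cmd in commands:
        first = max(cmd.y0, 0) // tile_h
        last = (min(cmd.y1, screen_h) - 1) // tile_h
        cmd.tile_mask = 0
        # An empty range (fully off-screen command) leaves the mask at zero.
        for t in range(first, last + 1):
            cmd.tile_mask |= 1 << t
    return commands


def commands_for_tile(commands, tile):
    """Return the commands whose header says they affect the given tile."""
    return [c for c in commands if c.tile_mask & (1 << tile)]


cmds = tag_commands(
    [
        Command("sky", 0, 0, 1280, 100),
        Command("character", 400, 200, 600, 300),  # straddles tiles 0 and 1
        Command("ground", 0, 500, 1280, 720),
    ],
    screen_h=720,
    num_tiles=3,
)
for tile in range(3):
    print(tile, [c.name for c in commands_for_tile(cmds, tile)])
```

In this sketch only the command that straddles a band boundary is submitted twice, which is the extra cost of tiling; the others are processed for a single tile.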
There is going to be an increase in cost here, as the resultant data of some objects in the command queue may intersect multiple tiles, in which case the geometry will be processed once for each tile (note that once it is transformed and set up, the pixels that fall outside the current rendering tile can be clipped and no further processing is required). However, because the tiles are very large, relatively few commands should span multiple tiles and need to be processed more than once. Bear in mind that going from one FSAA depth to the next one up at the same resolution shouldn't affect Xenos too much in terms of sample processing, as the ROPs and bandwidth are designed to operate with 4x FSAA all the time; there is no extra cost in terms of sub-sample reads, writes or blends, although there is a small cost in the shaders, where extra colour samples need to be calculated for pixels that cover geometry edges. So in terms of supporting FSAA, developers really only need to care about whether they wish to use this tiling solution when deciding what depth of FSAA to use (with consideration for the depth of the buffers they require as well). ATI have been quoted as suggesting that 720p with 4x FSAA, which would require three tiles, has about 95% of the performance of 2x FSAA.
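The three-tile figure can be checked with some rough arithmetic. The sketch below is only an estimate under stated assumptions: 10 MB of eDRAM, a 32-bit colour buffer and a 32-bit Z/stencil buffer per sample, and no allowance for any alignment constraints on tile dimensions that real hardware may impose. The function name `tiles_needed` is hypothetical.

```python
import math

# Assumption: 10 MB of eDRAM available for the back buffer and Z buffer.
EDRAM_BYTES = 10 * 1024 * 1024


def tiles_needed(width, height, samples, colour_bytes=4, depth_bytes=4,
                 edram_bytes=EDRAM_BYTES):
    """Estimate how many tiles a multisampled frame must be split into.

    Assumes every sample stores colour_bytes of colour and depth_bytes of
    Z/stencil, and ignores tile alignment restrictions.
    """
    footprint = width * height * samples * (colour_bytes + depth_bytes)
    return math.ceil(footprint / edram_bytes)


for samples in (1, 2, 4):
    print(f"720p, {samples}x FSAA: {tiles_needed(1280, 720, samples)} tile(s)")
```

Under these assumptions 720p needs one tile without FSAA, two tiles at 2x and three tiles at 4x, which agrees with the three-tile figure quoted above. Deeper buffer formats would raise the per-sample footprint and hence the tile count, which is why buffer depth has to be considered alongside FSAA depth.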