What *exactly* is the cost of Xenos AA?

Shifty Geezer

In the console games section this debate is back, and it seems people are still unclear exactly how much the 'free' 2xAA of Xenos costs. After all, it's called free, so surely it comes at no extra cost whatsoever to the render pipeline? The GPU can render the same scene with 2xAA without expending any extra effort or taking any longer than without AA? That's not actually the case, is it?

As I understand it the pixel shaders are applied once per pixel, but there are two vertex samples per pixel, plus a 5% (or whatever it is) tiling overlap. The actual sample averaging can be considered 'free' as that's handled by the eDRAM logic. So the actual cost of 2xMSAA on Xenos over normal no-AA rendering is twice the vertex transform work plus '5%' for the tiling, as an AA'd buffer doesn't fit neatly in eDRAM (FP10 HDR being considered as the color data type). The term 'free' is only true in the sense of bandwidth consumption. On a typical GPU the same AA processes are involved, but the hit on BW due to blending is where the price is really paid, and where the eDRAM provides its unique benefit. Right?
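To put rough numbers on the "doesn't fit neatly" part, here's my back-of-the-envelope arithmetic, assuming FP10 color packs to 4 bytes per sample and Z/stencil takes another 4 bytes per sample (my assumption from the B3D article, not official figures):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // Assumed per-sample footprint: 4 bytes FP10 color + 4 bytes Z/stencil.
    const double edram_mib = 10.0;
    const long long w = 1280, h = 720, bytes_per_sample = 4 + 4;

    for (int samples : {1, 2, 4}) {
        double mib = w * h * samples * bytes_per_sample / (1024.0 * 1024.0);
        printf("%dxAA 1280x720: %5.1f MiB -> %s\n", samples, mib,
               mib <= edram_mib ? "fits in eDRAM" : "needs tiling");
    }
}
// 1x: ~7.0 MiB fits; 2x: ~14.1 MiB and 4x: ~28.1 MiB both need tiling.
```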

Can some really knowledgeable person post up definitive coverage of what's involved, in simple terms that people can hear, understand, and not forget? I'm ashamed that this far after hearing all about Xenos I'm still hazy as to what 'free' AA really means, and I don't think I'm alone here :oops:
 
Wouldn't this mean that if a game became vertex-transform limited, Xenos's AA could no longer be used? Though with its USA I don't see this happening much.
 
IMHO, I would measure the cost to the Xbox 360 here in nanoseconds, not in chip computation costs, mostly because most of the work is done off-chip, and partly because cycle counting is mostly useful in multicore environments, as any programmer can tell you.

I believe total latency should be low, because of the low-latency nature of embedded memory.

All hail 360!
 
There have been so many posts on this and several actual answers.

There is no way to generalise to an X% overhead.

It requires you to use tiling for HD resolutions; this in turn requires that the entire display list is available to the graphics chip, which may or may not have significant memory implications depending on how you do it.

Any batch of primitives that straddles multiple tiles will be transformed for every tile it's in, although there will be no extra pixel shading cost, and there will be one extra sync of the graphics pipe for every tile rendered.

You can likely construct artificial cases that are close to 0 overhead and artificial cases that have large overhead; the actual penalty is going to be very game dependent. It's going to be a lot cheaper in almost any case than current PC architectures.
 
Disclaimer: I base this solely on the publicly available information on Xenos, most notably this B3D article. I have no developer manual and certainly no dev kit.

In terms of computational resources, Xenos' AA really is absolutely and honestly free.
Per "pixel", Xenos computes one color and a depth gradient, and determines a subpixel coverage mask. The mask is just four bits. The depth gradient is sufficient because all potentially covered subpixels, while they have variable depth values, are from the same triangle. Hence this connection doesn't need much bandwidth (actually less bandwidth than an equivalent PC part's road to memory, because blending is also "free", even without any AA).

Inside the daughter die, the color and Z values are replicated to all covered samples according to the mask bits, the Z test is done, and blending is done. So in the worst case, for one incoming pixel the daughter die needs to read/modify/write four subpixel depth values (for a subpixel-precise depth test) and read/modify/write four subpixels' colors (for blending). The eDRAM daughter die has very high internal bandwidth and can cope with all this just fine, and this is exactly the reason why this is deemed "free".
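As a sketch in code (hypothetical names and sample positions; the real hardware is fixed-function, not C++), the worst case per incoming pixel looks like this:

```cpp
#include <cstdint>

struct Sample { float z; uint32_t color; };

// Placeholder subsample positions relative to the pixel center.
static const float kOffX[4] = {-0.25f,  0.25f, -0.25f, 0.25f};
static const float kOffY[4] = {-0.25f, -0.25f,  0.25f, 0.25f};

static uint32_t blend(uint32_t src, uint32_t /*dst*/) {
    return src;  // placeholder: the hardware applies the configured blend op
}

// One incoming shaded pixel: one color, one depth plus gradient, a 4-bit mask.
void write_pixel(Sample (&samples)[4], uint8_t coverage_mask,
                 float z_center, float dzdx, float dzdy, uint32_t src_color) {
    for (int i = 0; i < 4; ++i) {
        if (!(coverage_mask & (1u << i))) continue;
        // Per-subsample depth reconstructed from the gradient: this is why a
        // single depth value plus a gradient per pixel is sufficient.
        float z = z_center + kOffX[i] * dzdx + kOffY[i] * dzdy;
        if (z < samples[i].z) {               // subpixel-precise depth test
            samples[i].z = z;                 // read/modify/write Z
            samples[i].color = blend(src_color, samples[i].color);  // and color
        }
    }
}
```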

The catch:
The eDRAM daughter die, while having that very high bandwidth, only has limited storage space. Rendering at high resolutions with AA will exhaust this space. Doing 4xMSAA requires four times as much space to be set aside as rendering without AA. If you don't have that space, you can't hold the whole backbuffer at once.

The proposed solution is to split the scene up into tiles that fit within the eDRAM space limits.
E.g. instead of rendering a complete 1280x720 frame with 4xMSAA (which you can't), the rendering process can be split up into three 1280x240 partitions (roughly 9.8 MB each) which are rendered sequentially. You flush each finished partition out to system memory to make room for the next one. Once all three partitions are down in system memory the frame is done; you can point the RAMDAC there to scan it out and start building up the next frame in the same way.
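Checking that partition arithmetic (same per-sample size assumption as before: 4 bytes color plus 4 bytes Z/stencil, which is my guess rather than a documented figure):

```cpp
#include <cstdio>

int main() {
    // One 1280x240 partition at 4xMSAA, 8 assumed bytes per sample.
    long long tile_bytes = 1280LL * 240 * 4 /*samples*/ * 8 /*bytes*/;
    printf("one tile: %lld bytes (~%.1f MB)\n", tile_bytes, tile_bytes / 1e6);
    // -> 9830400 bytes, ~9.8 MB: just under the 10 MB of eDRAM, so three
    //    such partitions stacked vertically cover the full 1280x720 frame.
}
```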

But this is not free. Rendering the whole scene in one go is more efficient. Now let me base the explanation on a regular PC GPU (IMR) for the moment:
You'll have to resubmit and hence retransform at least some geometry. You shouldn't need to render a triangle in partition 1 if you know it will only be visible in partition 2, but a triangle that overlaps two partitions will have to be rasterized twice for correct results. "Knowing" is a problem though: for determining resubmission at the triangle level you'd need impractical amounts of (CPU) preprocessing, so you'll end up resubmitting huge gobs of geometry, if not your entire scene geometry, three times for three partitions. I.e. you do three times the vertex processing work and three times the trisetup work compared to non-partitioned rendering. As the submission isn't free either, you'll also pay three times the geometry bandwidth costs (shared system memory in the case of Xenos) and three times the CPU costs associated with traversing the scene graph, setting up render states and queuing up the draw calls to the hardware. Overall, the only cost that doesn't triple is fillrate (because each partition has fewer pixels than a whole frame).
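As a sketch, that brute-force partitioned loop might look like the following (hypothetical function names standing in for whatever the real API provides):

```cpp
struct Scene { /* scene graph, render states, draw calls ... */ };

// Hypothetical stand-ins for the real submission/resolve API.
void set_viewport_and_scissor(int x, int y, int w, int h) { /* ... */ }
void submit_entire_scene(const Scene&) { /* full transform + setup each time */ }
void resolve_edram_to_system_memory(int x, int y, int w, int h) { /* ... */ }

// Naive partitioned rendering: the whole scene is resubmitted once per
// partition, so vertex, trisetup, bandwidth and CPU costs all triple.
void render_frame_in_three_partitions(const Scene& scene) {
    const int kPartitionHeight = 240;
    for (int y = 0; y < 720; y += kPartitionHeight) {
        set_viewport_and_scissor(0, y, 1280, kPartitionHeight);
        submit_entire_scene(scene);
        resolve_edram_to_system_memory(0, y, 1280, kPartitionHeight);
    }
}
```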

There are multiple conceivable ways Xenos could assist this partitioned rendering process at the hardware level.
1) Xenos might well support nothing more than a reconfigurable viewport, i.e. nothing worth talking about. The same costs as set forth in the above PC-based explanation apply.
2) Xenos might buffer up the entire untransformed scene description. This would remove repeated scene graph traversal from the equation, but costs storage space for that buffer. Replaying such a display list can be made slightly more efficient than "talking to the driver" again from the application side.
3) Xenos might do the same as #2 but with transformed geometry. Costs for retransforming geometry are eliminated.
4) Xenos might do #3 and additionally sort the triangles in the display list to eliminate, or at least reduce, the amount of "useless for the current partition" triangles.

I don't know what's really going on with Xenos here, but either way this should explain why performance is going to be lost if you use AA at higher resolutions, even though from a different point of view it truly is free.

Joke:
5)Xenos might be a TBDR, and as such it would do #4 but get even more bang out of the work done during the sorting process.
(this is nonsense because if it were true Xenos wouldn't need 10MB of eDRAM -- it could make do with some kilobytes of on-chip tile storage)
 
ERP, as always, thanks for the info.

ERP said:
...It's going to be a lot cheaper in almost any case than current PC architectures.

This raises the question: why isn't AA in all X360 games, when it can be so easily applied in PC games running at similar resolutions? I guess the answer is that devs are opting to use the available resources for other effects.
ERP,
Generally speaking, do you anticipate future X360 games will incorporate more AA?
 
zeckensack said:
There are multiple conceivable ways Xenos could assist this partitioned rendering process at the hardware level. [...] I don't know what's really going on with Xenos here, but either way this should explain why performance is going to be lost if you use AA at higher resolutions, even though from a different point of view it truly is free.


It's public information how this works; look for predicated tiling in Dave's article.

Basically, though, it can mark a primitive as interesting/not interesting during an initial Z rendering pass, then simply jump over the primitive if it won't contribute to the scene.
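In rough sketch form (my illustration with hypothetical data structures; the real mechanism lives in the command processor, per Dave's article, not in application code):

```cpp
#include <cstdint>
#include <vector>

struct Primitive { /* vertices, state ... */ };

// Imagined output of the initial Z pass: bit t of mask[i] says whether
// primitive i contributes anything to tile t.
struct TileVisibility { std::vector<uint8_t> mask; };

// Replaying the display list for one tile: primitives marked "not
// interesting" for this tile are simply jumped over.
void replay_for_tile(const std::vector<Primitive>& prims,
                     const TileVisibility& vis, int tile) {
    for (size_t i = 0; i < prims.size(); ++i) {
        if (!(vis.mask[i] & (1u << tile))) continue;  // predicated skip
        // draw(prims[i]);  // only primitives that touch this tile
    }
}
```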
 
RobHT said:
This raises the question: why isn't AA in all X360 games, when it can be so easily applied in PC games running at similar resolutions? [...] Generally speaking, do you anticipate future X360 games will incorporate more AA?

I obviously can't comment on every game; I just don't have real information.

But on a PC you just turn it on and it's transparent; on Xenos you have to render your scene in a particular way to make it work.
 
ERP said:
I obviously can't comment on every game; I just don't have real information.

But on a PC you just turn it on and it's transparent; on Xenos you have to render your scene in a particular way to make it work.
I don't think you guys can get a much better answer than that.
Thank you ERP.
 
To keep it simple: the cost seems to be very high, otherwise developers would use higher AA levels, or use AA at all; right now it's hardly used. The reason might be the tiling performance loss, or simply running out of memory.
 
ERP said:
I obviously can't comment on every game; I just don't have real information.

But on a PC you just turn it on and it's transparent; on Xenos you have to render your scene in a particular way to make it work.

ERP,

What is the coding "cost" of moving from alpha (9800 Pro) --> beta --> final (Xenos); serious rewrites (weeks/months)? I know you've stated before that, since there is no comparable PC part, most weren't willing to take a chance on the tiling. But is there a chance that even the "just after launch window" titles would be willing to code specifically for the Xenos?
 
sorry to go slightly off-topic here (hardly) but I think that the original Xbox and Xbox games should've been set with 4X FSAA mandatory. Then Xbox2 / Xbox360 and its games should've had FSAA at either 4X or 8X, with 4X being mandatory and 8X optional.


2x or 4x FSAA is barely acceptable IMO, but I will take what I can get, and I have every intention of supporting Xbox360 (I also own an Xbox).

I know it comes down to cost of silicon, bandwidth and fillrate limitations as far as how much FSAA we can have. I'm just disappointed in the level of anti-aliasing on consoles. That doesn't mean Xbox360 games won't look great. I realize that 720p with 4X FSAA is enough to eliminate most of the jagged edges, unless you look carefully.
 
zeckensack said:
E.g. instead of rendering a complete 1280x720 frame with 4xMSAA (which you can't), the rendering process can be split up into three 1280x240 partitions (roughly 9.8 MB each) which are rendered sequentially.

Note: depending on the type of game, you may wish to go with vertical tiles instead. That's because things like monsters in an FPS, tall buildings, trees, etc. would usually reach into all of the horizontal tiles, requiring them to be sent to the GPU three times. With vertical tiles you have a better chance that something will only reside in one or two of the tiles, especially at a 16:9 aspect ratio.

Of course, with the player looking around, the efficiency could change within a few frames as well...
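A toy calculation of that point (my own illustration; the tile sizes assume a 1280x720 target split three ways either horizontally or vertically):

```cpp
#include <cstdio>

// How many tiles does an axis-aligned screen-space extent [lo, hi] touch,
// given the tile size along that axis?
int tiles_touched(int lo, int hi, int tile_size) {
    return hi / tile_size - lo / tile_size + 1;
}

int main() {
    // A tall, narrow object (tree, building): x in [600, 700], y in [0, 719].
    printf("1280x240 horizontal tiles touched: %d\n", tiles_touched(0, 719, 240));
    printf("427x720 vertical tiles touched: %d\n", tiles_touched(600, 700, 427));
    // -> 3 vs 1: the tall object is resubmitted for every horizontal slice,
    //    but for only one vertical slice.
}
```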
 
Nemo80 said:
To keep it simple: the cost seems to be very high, otherwise developers would use higher AA levels, or use AA at all; right now it's hardly used. The reason might be the tiling performance loss, or simply running out of memory.

This is bulls**t, simply and seriously.

We've already heard it several times: launch games aren't using it because there wasn't enough time to test the implementation.
 
Megadrive1988 said:
sorry to go slightly off-topic here (hardly) but I think that the original Xbox and Xbox games should've been set with 4X FSAA mandatory. [...]

This is obviously just my opinion, but I'm on the exact opposite side of the fence......

Developers should be free to decide what features they support, including resolutions, AA, and what texture filtering. And that's not to say I wouldn't, given the choice.

If a developer wants to ship a game at 160x100 with no AA and point-sampled textures, he should be able to do that and have the marketplace decide if that's a good thing.

TRCs are there to protect the consumer to a point, ensuring a consistent experience on things like memory cards; I'm not sure I agree with their extension to include things like AA and mandatory HD... If HD or AA become strong selling points, developers will adopt them. The cool thing about consoles is watching what developers can eke out of the fixed resources; the more constraints you put in place, the less eking there is.
 
I'm a little curious as to why their "temporal AA" was not included or featured. Is it too high a cost in transistors? Supposing the developer were shooting for a constant 60fps, they could get a decent approximation of 4xMSAA with 2xTMSAA.
 
Alstrong said:
I'm a little curious as to why their "temporal AA" was not included or featured. Is it too high a cost in transistors? Supposing the developer were shooting for a constant 60fps, they could get a decent approximation of 4xMSAA with 2xTMSAA.

Pretty much any dev could implement temporal AA on top of the library if they thought it would be worthwhile.
At a previous job we messed around with the idea on an Xbox game a long time before ATI "invented" the feature.
Our opinion was that at TV resolutions the artifacts were too irritating to make it worthwhile, so we didn't proceed with it.
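For anyone unfamiliar with the technique being discussed, the gist is roughly this (a sketch of the general idea only, not ATI's or anyone's actual implementation; the sample positions are made up):

```cpp
// 2x temporal MSAA: flip between two complementary 2-sample patterns every
// frame. At a steady 60fps the eye averages consecutive frames,
// approximating 4xMSAA; when the frame rate drops or the image is static
// with high contrast, the flip shows up as the flicker mentioned above.
struct SamplePos { float x, y; };
struct Pattern2x { SamplePos s[2]; };

static const Pattern2x kPatterns[2] = {
    { { {0.25f, 0.25f}, {0.75f, 0.75f} } },  // even frames
    { { {0.75f, 0.25f}, {0.25f, 0.75f} } },  // odd frames (complementary)
};

const Pattern2x& msaa_pattern_for(unsigned frame_index) {
    return kPatterns[frame_index & 1];       // alternate every frame
}
```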
 