R6xx AA

That is certainly a big part of the problem, but it doesn't explain the big performance drop when compared to R580. I know there's no hardware AA resolve, but that hit should be very minimal.
Why is performance drop relevant?

As long as R600 is faster than R580, per clock, at AA, then who cares about the squillion fps R600 can do with AA off?

The more fundamental issue with AA/Z is that there are only 2 samples generated per clock. This especially hurts in fillrate-limited scenarios such as shadow rendering and early-Z. With the boom in shadow resolution and light count in recent games, this design decision is tragic.
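
To put a rough number on that (a back-of-the-envelope sketch only; I'm assuming 16 RBEs and taking the 2-samples-per-RBE-per-clock figure at face value):

[code]
# Back-of-the-envelope MSAA sample fill rate. The 16 RBEs and the
# 2-samples-per-RBE-per-clock figure are assumptions for illustration.

def aa_pixels_per_clock(rbes, samples_per_rbe_per_clock, aa_level):
    """Fully covered pixels completed per clock at a given MSAA level."""
    return rbes * samples_per_rbe_per_clock / aa_level

for aa in (2, 4, 8):
    print(f"{aa}xAA: {aa_pixels_per_clock(16, 2, aa):.0f} pixels/clock")
# 2xAA: 16, 4xAA: 8, 8xAA: 4 pixels/clock - well short of one pixel per
# RBE per clock, which is why fillrate-bound passes feel it hardest.
[/code]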

Jawed
 
Assuming ATI is not braindead, they load and uncompress AA samples into the texture cache, and use the filtering units (which have gamma correction hardware) to average them and feed the result to the shader, where it's written out to the framebuffer.
This can't work.

The whole point of AA sample compression is that you only load 1 sample per pixel if possible. So loading all samples into cache, filtering (averaging) and then returning them to the ALUs saves zero bandwidth.

While the TUs are fetching pixels to be resolved, the resulting pixels generated by the ALUs have to be written by the RBEs back into memory, consuming bandwidth and adding latency.

I propose that R600 automatically orders the MSAA'd render target as Primary + Secondary textures. If R600 has 3 levels of texture compression (perhaps dependent upon the format of pixels in memory, e.g. fp16) then there might be a Tertiary texture, too. The "texture array" feature (of D3D10 though obviously usable by the GPU even in DX9) may be relevant here, too.

The resolve shader then walks the render target consuming all Primary samples. The RBEs simultaneously supply compression tagging data to the ALU pipes in the form of shader inputs (a register array, say - perhaps communicated via the memory cache). Because compression works on a tiled basis, the texture fetches that are required from the Secondary (and Tertiary etc.) surfaces are bandwidth-efficient; that is, they match the burst-size of the memory system and the cache-line size in the texture cache.
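
To make the idea concrete, here's a toy CPU-side model of such a resolve walk (greyscale float colours, 4xAA, and the Primary/Secondary naming and per-tile flags are all my own assumptions, not anything confirmed about R600):

[code]
# Toy model of the proposed shader resolve, driven by per-tile compression
# tags supplied alongside the Primary surface. Only uncompressed tiles
# cause any Secondary fetches, which is where the bandwidth saving comes from.

def resolve(primary, secondary, tile_compressed, tile_size=4):
    """primary[y][x]   : one colour per pixel (sample 0)
       secondary[y][x] : list of the remaining samples, only valid/fetched
                         when the pixel's tile is uncompressed
       tile_compressed[ty][tx] : True if every pixel in the tile has all
                                 of its samples equal to the primary one"""
    height, width = len(primary), len(primary[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if tile_compressed[y // tile_size][x // tile_size]:
                # Compressed tile: the primary sample *is* the resolved pixel.
                out[y][x] = primary[y][x]
            else:
                samples = [primary[y][x]] + secondary[y][x]
                out[y][x] = sum(samples) / len(samples)
    return out
[/code]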

Jawed
 
Why is performance drop relevant?
Because it tells you whether AA is a problem or not.

This can't work.

The whole point of AA sample compression is that you only load 1 sample per pixel if possible. So loading all samples into cache, filtering (averaging) and then returning them to the ALUs saves zero bandwidth.
What are you talking about? For a compressed pixel, why do you need to load all four samples from memory when 3 of the samples have useless, indeterminate contents?

DX10 support requires the ability to load individual AA samples into the pixel shader, so clearly the texture units know when a tile is compressed or not. All they have to do is uncompress the compressed tiles into the L1 cache, just like they do for DXTC.
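
A minimal sketch of what that cache fill might look like (the tile layout, flag storage and 4-sample count are my guesses; the point is just that a compressed tile expands on fill, like a DXTC block does):

[code]
# Toy model of expanding a compressed AA tile into the texture cache.
# Tile layout, flag storage and the 4-sample count are assumptions.

def fill_cache_line(memory, tile_id, tile_is_compressed, samples=4):
    data = memory[tile_id]
    if tile_is_compressed[tile_id]:
        # Compressed tile: memory holds one colour per pixel; replicate it
        # across all samples as the line is filled (cf. DXTC decode on fill).
        return [[colour] * samples for colour in data]
    # Uncompressed tile: memory already holds every sample of every pixel.
    return data

cache_line = fill_cache_line({7: [0.2, 0.5]}, 7, {7: True})
print(cache_line)   # [[0.2, 0.2, 0.2, 0.2], [0.5, 0.5, 0.5, 0.5]]
[/code]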

While the TUs are fetching pixels to be resolved, the resulting pixels generated by the ALUs have to be written by the RBEs back into memory, consuming bandwidth and adding latency.
So? That's true for resolving on all graphics cards. Even your method consumes precisely the same bandwidth per resolved pixel.
 
Because it tells you whether AA is a problem or not.
So when the theoretical rates of R600 are the same as R580 per clock, and yet R600 has faster AA, what is the performance drop telling you about "R600's AA problem"?

What are you talking about? For a compressed pixel, why do you need to load all four samples from memory when 3 of the samples have useless, indeterminate contents?
Don't texture compression formats require you to fetch an entire "compression block" in order to decode the value of any one texel?

DX10 support requires the ability to load individual AA samples into the pixel shader, so clearly the texture units know when a tile is compressed or not. All they have to do is uncompress the compressed tiles into the L1 cache, just like they do for DXTC.
Perhaps you can find some synergy with your "texture unit resolve" idea in the compression schemes described in this patent application:

Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same

Paragraphs 88 to 92 seem to hint in the direction you're talking about, with per tile compression flags stored in memory as part of the tile.

Jawed
 
So 6 months after its release, people are still unclear about why the 2900xt sucks with AA? I thought not having dedicated hardware was the main reason. Why do we not know for certain what's causing its poor performance with AA?
 
I thought not having dedicated hardware was the main reason.
[strike]That's my opinion too[/strike]. People latch on to the fact there's no hardware resolve and decide, without any evidence or reasoning, that it's slow. Then the worst of these people conclude that there's a bug in there, just for that extra little bit of NV30-II frisson.

EDIT: erm, I read that as you saying "not having enough dedicated hardware". Whoops :oops:

Jawed
 
So when the theoretical rates of R600 are the same as R580 per clock, and yet R600 has faster AA, what is the performance drop telling you about "R600's AA problem"?
Not only are the rates not the same (R600 definitely has a big edge in real shader throughput, and R580 will be BW limited on occasion), but R600 isn't always faster per clock with AA.

However, I did just think of a reason that the bigger drop with R600 could be deceiving: it has twice the Z-only rate of R580 when AA is disabled, but the same rate when AA is enabled. Right? That throws a wrench into the comparison.

Don't texture compression formats require you to fetch an entire "compression block" in order to decode the value of any one texel?
Yup, and a compressed framebuffer requires you to fetch a block as well. I don't know how big it is, but maybe a 4x4 block is stored together. A compressed tile needs 16 bytes read (for 32bpp), and an uncompressed tile needs 64.
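
For what it's worth, those byte counts fall out of the obvious arithmetic if you assume 4xAA and one stored sample per pixel when compressed (the tile dimensions are still a guess):

[code]
# Bytes read per tile at 32bpp with 4xAA; "compressed" = one stored sample
# per pixel. The tile dimensions are guesses, the arithmetic is not.

def tile_bytes(pixels, samples=4, bytes_per_sample=4, compressed=False):
    return pixels * (1 if compressed else samples) * bytes_per_sample

for w in (2, 4):
    px = w * w
    print(f"{w}x{w} tile: {tile_bytes(px, compressed=True)} B compressed, "
          f"{tile_bytes(px)} B uncompressed")
# 2x2: 16 B vs 64 B; 4x4: 64 B vs 256 B - a 4:1 saving either way.
[/code]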

Paragraphs 88 to 92 seem to hint in the direction you're talking about, with per tile compression flags stored in memory as part of the tile.
I don't think you'll ever have to keep track of multiple unresolved buffers floating around. You should be able to use the compression flags stored on chip.

Anyway, the point is that the pixel shaders can read a compressed AA framebuffer, as that's part of the DX10 spec. It makes no sense to feed the PS this data through anything other than the texture units. Therefore the TU can read data from a compressed framebuffer. From there, it's a short leap to be able to filter that data also.
 
I don't think you'll ever have to keep track of multiple unresolved buffers floating around. You should be able to use the compression flags stored on chip.

Anyway, the point is that the pixel shaders can read a compressed AA framebuffer, as that's part of the DX10 spec. It makes no sense to feed the PS this data through anything other than the texture units. Therefore the TU can read data from a compressed framebuffer. From there, it's a short leap to be able to filter that data also.
If you are doing the framebuffer resolve you only have one unresolved buffer around. That's not true any more if you're using arbitrary multisample textures. So either the TU can read compression flags from off-chip memory or multisample textures have to be uncompressed.

Thus it may be quite possible that only the RBEs can read compressed multisample buffers.

btw, DX10 doesn't require multisampling at all, so it obviously doesn't require support for multisample textures either.
 
Not only are the rates not the same
The AA sample creation rate is the same...

(R600 definitely has a big edge in real shader throughput, and R580 will be BW limited on occasion),
And R600 has double-rate Z with AA off ...

but R600 isn't always faster per clock with AA.
When this happens it's prolly a driver bug. Yet people rely on this as evidence of the failure of shader AA-resolve or as evidence of a hardware bug.

However, I did just think of a reason that the bigger drop with R600 could be deceiving: it has twice the Z-only rate of R580 when AA is disabled, but the same rate when AA is enabled. Right? That throws a wrench into the comparison.
:cry: I guess I should have been more explicit about this fundamental problem, but then it's no different from the fact that R600 also has more bandwidth, fp16 texture filtering, better texture caches, better hierarchical-Z, independent hierarchical-stencil or more ALU throughput - "the no-AA case on R600 has a squillion fps"...

Yup, and a compressed framebuffer requires you to fetch a block as well. I don't know how big it is, but maybe a 4x4 block is stored together. A compressed tile needs 16 bytes read (for 32bpp), and an uncompressed tile needs 64.
I'm still struggling to understand the patent application in terms of an int8 formatted render target.

For example, in one embodiment the patent application describes compression as being based upon 2x2 pixel blocks, with those 2x2 blocks aggregated into 4x4 blocks that are stored in memory. Here it seems that 2x2 blocks are a cache optimisation trick within the RBEs, so that when the RBEs are manipulating samples they do the minimum work. So the nature of compression for 4x4 blocks is different...

So, I'm still trying to get my head round it.

I don't think you'll ever have to keep track of multiple unresolved buffers floating around. You should be able to use the compression flags stored on chip.
http://forum.beyond3d.com/showpost.php?p=1021653&postcount=867

There are quite a few useful posts by OpenGL guy there, so make sure to check them.

Additionally, D3D10 allows the programmer to access MSAA'd render targets in the form of a "texture resource" where each texel corresponds with a sample - it is no longer a render target. The programmer might choose to access this "MSAA texture" ages after it was generated (in the next frame, for example). So there's no possibility of the compression tags inside the RBEs being kept for some indeterminate time, waiting for the programmer to access "MSAA samples".

My :???: understanding :???: of the patent currently indicates that all the compression information corresponding to the block size is stored in the render target, in memory. The compression tags inside the RBEs give the GPU a fast path to the compression information (i.e. they're just a "copy" of what's in the render target), so that it doesn't have to read memory in order to find out the compression status of each block.
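
In other words (a toy sketch of my reading, with made-up names), the flags in memory are authoritative and travel with the surface, while the RBE tags are just a cached copy:

[code]
# Sketch of my reading of the patent: per-block compression flags live in
# memory as part of the render target; the RBEs hold a copy as a fast path.
# All names here are made up for illustration.

class BlockCompressionInfo:
    def __init__(self, flags_in_memory):
        self.flags_in_memory = flags_in_memory          # part of the surface
        self.rbe_tag_copy = dict(flags_in_memory)       # on-chip fast path

    def is_compressed(self, block, via_rbe_tags=True):
        if via_rbe_tags and block in self.rbe_tag_copy:
            return self.rbe_tag_copy[block]             # no memory read
        # e.g. an "MSAA texture" read long after the RBE tags were recycled
        return self.flags_in_memory[block]
[/code]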

Jawed
 
If you are doing the framebuffer resolve you only have one unresolved buffer around. That's not true any more if you're using arbitrary multisample textures. So either the TU can read compression flags from off-chip memory or multisample textures have to be uncompressed.
I can see that being a good fallback for poorly written applications, but since we're only talking about one bit per tile, it's cheap to keep track of compression on chip for 4MPix or so. There's no reason you can't divide that among multiple multisample textures to handle 99% of the applications. The remaining issue, then, is whether compression flags are on chip or not, and if they are, you think that multisample textures couldn't use them.
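
The storage cost of that is easy to put numbers on (the tile sizes below are assumptions, since we don't know what R600 actually uses):

[code]
# On-chip cost of one compression bit per tile for ~4 MPix of surface,
# for a few assumed tile sizes.

MPIX = 4 * 1024 * 1024
for side in (2, 4, 8):
    bits = MPIX // (side * side)
    print(f"{side}x{side} tiles: {bits // 1024} Kbit = {bits // 8 // 1024} KiB")
# 2x2: 1024 Kbit (128 KiB), 4x4: 256 Kbit (32 KiB), 8x8: 64 Kbit (8 KiB).
[/code]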

Fair enough. I just find it weird that R600 could use a different mechanism for fetching samples from a multisample texture in user defined shaders and fetching them when running the standard resolve shader. There's no indication that anything other than the TU is used to feed data into the pixel shader, is there?

That's what my logic is based on. The TU must be able to read compressed samples, even if I'm wrong in assuming on-chip flags.
 
:cry: I guess I should have been more explicit about this fundamental problem, but then it's no different from the fact that R600 also has more bandwidth, fp16 texture filtering, better texture caches, better hierarchical-Z, independent hierarchical-stencil or more ALU throughput - "the no-AA case on R600 has a squillion fps"...
It is different, because all those other things affect AA speed just as much as no-AA speed (except BW, which is supposed to help AA more than no-AA). That's why performance drop is relevant, because it's a metric that filters out all the performance improvements that affect both AA and no-AA.

I just forgot that Z-only speed was only increased for no-AA. So that along with resolve speed are the two things that would change the AA performance drop compared to R580.
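
A quick made-up example of why the drop metric works that way (the numbers are purely illustrative, not measurements):

[code]
# Why "% drop with AA" filters out speedups that help AA and no-AA alike.
# Every number below is made up purely for illustration.

def drop(fps_no_aa, fps_aa):
    return 1 - fps_aa / fps_no_aa

no_aa, aa = 100.0, 70.0
print(f"baseline drop:        {drop(no_aa, aa):.0%}")              # 30%
# A uniform 1.5x speedup (ALU, caches, bandwidth) leaves the drop unchanged:
print(f"uniform 1.5x drop:    {drop(1.5 * no_aa, 1.5 * aa):.0%}")  # 30%
# Something that only helps the no-AA case (double-rate Z-only without AA)
# inflates the apparent drop even if the AA path itself is unchanged:
print(f"no-AA-only 1.3x drop: {drop(1.3 * no_aa, aa):.0%}")        # 46%
[/code]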

My :???: understanding :???: of the patent currently indicates that all the compression information corresponding to the block size is stored in the render target, in memory. The compression tags inside the RBEs give the GPU a fast path to the compression information (i.e. they're just a "copy" of what's in the render target), so that it doesn't have to read memory in order to find out the compression status of each block.
That makes some sense, as it keeps the latency low for the RBEs which need to worry about read-modify-write, while the TUs can access compression info for both shaders w/ sample load and shader resolving.
 
I can see that being a good fallback for poorly written applications, but since we're only talking about one bit per tile, it's cheap to keep track of compression on chip for 4MPix or so. There's no reason you can't divide that among multiple multisample textures to handle 99% of the applications. The remaining issue, then, is whether compression flags are on chip or not, and if they are, you think that multisample textures couldn't use them.

Fair enough. I just find it weird that R600 could use a different mechanism for fetching samples from a multisample texture in user defined shaders and fetching them when running the standard resolve shader. There's no indication that anything other than the TU is used to feed data into the pixel shader, is there?

That's what my logic is based on. The TU must be able to read compressed samples, even if I'm wrong in assuming on-chip flags.
An application might render several high-res multisampled shadow maps exceeding 4 MPix to use them in a single pass afterwards. I don't see what's "poorly written" about that.

The RBEs need to be able to read compressed framebuffer data. They could be capable of feeding that to the shaders. I'm not saying the TUs definitely can't read compressed samples, but it is a possibility.
 
An application might render several high-res multisampled shadow maps exceeding 4 MPix to use them in a single pass afterwards. I don't see what's "poorly written" about that.
That's not a very efficient way to do shadow mapping. The Load function used to access samples uses integer texture coordinates, and you'd have a tough time doing proper PCF too without slowing everything down to a crawl.

MSAA with shadow mapping is something you'll see primarily with VSM, and there you can resolve immediately after sampling. Either that or you could transfer the MSAA samples into a traditional texture for faster regular shadow mapping.

R6xx is feeding the PS multi-sampled buffer data via the RBEs, not texture units.
:???: Really? Is this for all shaders using the multisample Load function, or just the standard resolve?

I guess that means everything I said above is wrong. This just doesn't make sense to me on so many levels.
 
It is different, because all those other things affect AA speed just as much as no-AA speed (except BW, which is supposed to help AA more than no-AA). That's why performance drop is relevant, because it's a metric that filters out all the performance improvements that affect both AA and no-AA.
All those advancements I listed also affect bandwidth, since it's being used more efficiently. So R6xx AA (as well as all other functions, ultimately) gains not merely from having extra bandwidth, but also in using that bandwidth more efficiently.

Jawed
 
All those advancements I listed also affect bandwidth, since it's being used more efficiently. So R6xx AA (as well as all other functions, ultimately) gains not merely from having extra bandwidth, but also in using that bandwidth more efficiently.
Which would suggest that R600 has a smaller perf drop, not bigger. Hence the evidence that something is wrong, since the difference is more than what you can explain with shader resolve.

Anyway, the Z/stencil-only boost w/o AA explains a lot, so let's forget about this now.
 
I think Rv670 launch might provide some more illumination into this whole thing.

After all, isn't it rumored that Rv670 will have faster AA resolve even though it has less bandwidth than R600?

I'm also guessing that, if this rumor does prove to be true, we probably won't know for sure what key architectural differences between R600 and Rv670 lead to the increased speed of AA resolve.

Regards,
SB
 
Can't you guys look at games where the R600 does very well when AA is turned on, and ones where it does horribly, and based on that figure out the most plausible explanation for why AA takes such a hit?
 
isn't it rumored that Rv670 will have faster AA resolve even though it has less bandwidth than R600?
It's rumored that Rv670 will have faster AA even though it has less bandwidth than R600.

Resolve is only a part of AA.

I don't really understand the technical mumbo-jumbo, but I get the impression that it's the 2 AA samples per clock per ROP (and/or only 16 ROPs) which is the main issue.
 