Custom resolve demo

Humus

I've added a custom resolve demo to my site.

[Screenshot: CustomResolve.jpg]


The demo lets you compare AA quality between a standard resolve and a custom post-tonemapping resolve done in the shader. Not surprisingly, the quality of the custom resolve is much better, especially where the contrast is high.

Download here
 
Ah, very cool! It certainly looks a hell of a lot better... out of curiosity, what tone mapping function are you using? It seems to me that the more non-linear it is, the worse a standard resolve (and even texture filtering) would fare. Have you noticed any problems with texture aliasing when using extremely aggressive tone mapping functions? Personally I haven't really, but I tend to use pretty simple tone mapping functions and haven't played with it too much...

Second question... why is the custom resolve slightly *faster* (160 vs 150 fps) than the hardware resolve on G80? Are you maybe just doing a shader resolve in both cases? No biggie either way, just curious :)

[Edit] Is there normal mapping in the demo? I managed to find a bit of aliasing on the ground at mid-level exposures and very anisotropic angles, particularly in the region where there are two splotches of light and the rest is more shadowed. I'm wondering whether this is the above-mentioned filtering artifact or something more ordinary...
 
Post-tonemapping resolve is about 59% slower than without one (340 vs. 840 FPS).
This is with 8x AA enabled on an HD 3870.
 
Now I'm really confused... the results between G80 and R6xx seem backwards from what I would expect. Note that I grabbed my result on an 8800GTX at 1920x1200, 8x AA.
 
1920x1200
Way too much for my trusty old CRT, but I've re-tested at 1600x1200, as follows:

680/235 - CrossFire on;
353/121 - CrossFire off;

Both cases show over a 50% performance drop.
 
I updated to the latest beta drivers (174.74) and tried it at 1600x1200:

4x AA: 450/450
8x AA: 301/275

So now at least the performance direction seems consistent, but the results are still a bit puzzling. I would have expected G80 to take a much larger hit than R6xx from a custom resolve but it appears to be the opposite.
 
Maybe the 'fast path' for MSAA samples isn't enabled in DX10 (only under the driver's own resolve implementation), while G80 actually uses the same compressed data paths as input whether it's read in the shader core or outside it? Certainly this should be bandwidth limited more than anything else, so perhaps the fact that you're comparing a 384-bit GPU to a 256-bit GPU plays a role too?
 
Have you considered detecting edges via a prepass and fetching all N samples only at edges/discontinuities? I've found this to be a huge improvement. And using stencil instead of dynamic flow control offers a further benefit (at least on the ATI HD 2000 series).
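
For what it's worth, a rough sketch of the flow-control variant of this idea in D3D10 HLSL follows below; a stencil-based version would instead mark edge pixels in a prepass and run a cheap and an expensive resolve pass masked by stencil. None of this is code from the demo - the resource names, sample count, tonemap operator and edge threshold are all assumptions for illustration.

Code:
Texture2DMS<float4, 8> HDRTarget;   // multisampled HDR render target
float Exposure;

static const int SAMPLES = 8;

float3 Tonemap(float3 c)
{
    // whatever operator the resolve applies per sample; a photographic-style
    // curve is assumed here purely as a placeholder
    return 1.0 - exp2(-c * Exposure);
}

float4 main(float4 pos : SV_Position) : SV_Target
{
    int2 coord = int2(pos.xy);

    float3 s0 = HDRTarget.Load(coord, 0).rgb;
    float3 sN = HDRTarget.Load(coord, SAMPLES - 1).rgb;

    // Crude discontinuity test: if two widely spaced samples agree, assume the
    // pixel is fully covered by one surface and skip the remaining fetches.
    [branch]
    if (all(abs(s0 - sN) < 0.001))
        return float4(Tonemap(s0), 1.0);

    // Edge pixel: fetch and tonemap every sample.
    float3 sum = Tonemap(s0) + Tonemap(sN);
    for (int i = 1; i < SAMPLES - 1; i++)
        sum += Tonemap(HDRTarget.Load(coord, i).rgb);

    return float4(sum / SAMPLES, 1.0);
}

The win obviously depends on how many pixels take the cheap path; on scenes with lots of geometric detail the branch (or stencil mask) may not save much.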
 
So I got some numbers from my sister to compare to mine... at 1680x1050, 8x AA:

8800GTX (174.74): 350/310
8800GTS 512 (169.25): 244/287

So two notes: first, the 169.25s have some weird issues with standard resolves compared to custom ones; there's no reason why the default one should ever be *slower*, which makes it hard to compare since something is clearly messed up. Second, it seems like the extra memory bandwidth of the GTX does help in this case, but I'd still expect the percentage differences to stay roughly the same even with G92 if my sister were using the 174.74s (her copy of Vista 64 is bitching about them being unsigned... mine doesn't seem to mind for some reason).

Maybe Humus can clear this up a bit. I'm also really curious as to why R6xx takes such a huge hit... isn't the whole point that it implements even the "standard" resolve in software as well?
 
The standard resolve in R6xx has access to the compressed format of the MSAA'd render target and uses the RBEs to transmit that data to the ALUs at 32 samples per clock.

The custom resolve has to read the uncompressed MSAA'd data as a texture at 16 samples per clock.

Jawed
 
So the excess bilerp rate on G80/90 works to its advantage here?
Partly, yes. There's also the small matter of converting the compressed-format MSAA'd render target into a texture. This is only required with a custom resolve, and it entails a trip through the ROPs and back out to memory to create the texture. The texture is then read in by the custom resolve shader program.

I don't know how these GPUs perform when converting from a compressed MSAA'd render target into a texture - but I presume G80/92 does this at twice RV670's rate simply because during normal MSAA'd rendering it runs its AA-compression hardware at 4 samples per clock per pixel (of input) whereas RV670 runs at 2 samples.

Jawed
 
Sorry for the somewhat drunk responses, I've been partying all night, but I'll try to get this right. :p

curiously, what tone mapping function are you using?

It's a pretty standard photographic tonemap operator: 1 - exp2(-color * exposure).

It seems to me that the more non-linear it is, the worse doing a standard resolve (and even texture filtering) would be.

Yup. Linear tonemap operators on the other hand generally suck. Still a bunch of games use them (but I think that's a mixture of ignorance and technical limitations).

Have you noticed any problems with texture aliasing when using extremely aggressive tone mapping functions?

Not texture aliasing. However, antialiasing using a standard resolve generally works worse the more extreme the contrast gets. When I first experimented with custom resolves I was surprised, though, by how little contrast was needed before the antialiasing was diminished or even killed entirely with a standard resolve. I didn't have to use extreme values, like in the thousands or anything; the effect was clearly visible even at contrast ratios of around 4-8. One reason this isn't as visible in games is that they often use bloom effects that hide the worst artifacts.
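
To put rough numbers on that (my own back-of-the-envelope, not figures from the demo): take the operator above with exposure = 1, and a silhouette-edge pixel where half the samples see an HDR value of 8.0 and the other half see 0.0.

Code:
tonemap(x) = 1 - 2^(-x)

resolve first, tonemap after:     tonemap((8.0 + 0.0) / 2) = 1 - 2^-4          ≈ 0.94  (nearly white)
tonemap per sample, then average: (tonemap(8.0) + tonemap(0.0)) / 2 ≈ 0.996 / 2 ≈ 0.50  (mid-gray)

The standard resolve leaves the edge pixel almost as bright as the interior of the bright surface, so the gradient the MSAA samples were supposed to provide is mostly gone.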

Second question... why is the custom resolve slightly *faster* (160 vs 150 fps) than the hardware resolve on G80? Are you maybe just doing a shader resolve in both cases? No biggie either way, just curious :)

This is normal. Unfortunately, it's not true for the ATI cards at the moment, but I hope future cards will show the same performance characteristics. The reason why it's faster is that you save a lot of bandwidth. Instead of first resolving to a render target and then reading that render target back to tonemap it, you just read the multisampled target once, tonemap and resolve in the same pass, and write to the backbuffer. Basically you save a read-write cycle of an entire fullscreen render target, at the expense of more ALU work, but generally the bandwidth savings should matter more than the extra ALU on modern hardware.
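
For readers who want to picture that pass, here's a minimal sketch of what such a post-tonemapping resolve shader could look like in D3D10 HLSL. It is not the demo's actual code - the resource names, sample count and lack of any gamma handling are assumptions for illustration.

Code:
Texture2DMS<float4, 8> HDRTarget;   // the multisampled HDR render target
float Exposure;

static const int SAMPLES = 8;

float4 main(float4 pos : SV_Position) : SV_Target
{
    int2 coord = int2(pos.xy);

    float3 sum = 0;
    [unroll]
    for (int i = 0; i < SAMPLES; i++)
    {
        // Tonemap each raw HDR sample *before* averaging
        float3 hdr = HDRTarget.Load(coord, i).rgb;
        sum += 1.0 - exp2(-hdr * Exposure);
    }

    // The average of the tonemapped samples goes straight to the backbuffer,
    // so the intermediate resolved render target is never written or read.
    return float4(sum / SAMPLES, 1.0);
}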

Edit: When I first implemented this I saw the same performance characteristics on the HD 2900 too. Later the performance of the standard resolve was significantly improved through a bunch of clever driver tricks, and it turned out faster in the end. Theoretically it should be possible to make this technique faster on the HD 2000/3000 as well by applying similar optimizations to it. That's not possible from a plain D3D10 app, but with an app-detect I would assume the driver could implement a similar fast path for this technique that would outperform a standard resolve.

Is there normal mapping in the demo?

Yes.
 
Have you considered detecting edges via a prepass and fetching all n samples only at edges/discontinuities? I've found this to be a huge improvement. And using stencil instead of dynamic flow control offers further benefit ( at least on ATI HD 2000 series ).

I did try that, but found the performance difference was very small. That was quite a while ago though and I haven't tried again.
 
Humus said:
The reason why it's faster is that you save a lot of bandwidth. Instead of first resolving to a render target then reading that render target and tonemapping it, you just read the multisampled target once, tonemap and resolve in the same pass and write to the backbuffer. Basically you save a read-write cycle of an entire fullscreen render target, at the expense of more ALU operations though, but generally the bandwidth gains should matter more than the ALU on modern hardware.
8800GTX (174.74): 350/310
8800GTS 512 (169.25): 244/287

So, in the end, G92 is predominantly constrained by its memory BW, compared to the mighty G80. ;)

Maybe an additional test with the latest drivers will clear this up.

I have to dig out my 2900XT for a quick test.
 
Yup. Linear tonemap operators on the other hand generally suck.
Yes, no doubt :) The fact remains, however, that if we start using increasingly non-linear ones, other assumptions of linearity (such as texture filtering) are going to become a problem. That said, it may be the case that it just doesn't matter "enough" to cause trouble in the majority of cases. I can definitely see problems for bright, shadow-mapped lights though!

The reason why it's faster is that you save a lot of bandwidth. Instead of first resolving to a render target then reading that render target and tonemapping it, you just read the multisampled target once, tonemap and resolve in the same pass and write to the backbuffer.
Ahhh yes. I should have thought of that :)

Later the performance of the standard resolve was significantly improved through a bunch of clever driver tricks, and it turned out faster in the end.
I take it you can't comment on these "clever tricks"? I'd be interested to know what they are, and why G80 appears either not to implement them or not to need them.

Cheers, and cool demo!
 