Sorry for the somewhat drunk responses, I've been partying all night, but I'll try to get this right.
Out of curiosity, what tone mapping function are you using?
It's a pretty standard photographic tonemap operator: 1 - exp2(-color * exposure).
It seems to me that the more non-linear it is, the worse doing a standard resolve (and even texture filtering) would be.
Yup. Linear tonemap operators, on the other hand, generally suck. Still, a bunch of games use them (but I think that's a mixture of ignorance and technical limitations).
Have you noticed any problems with texture aliasing when using extremely aggressive tone mapping functions?
Not texture aliasing. However, antialiasing with a standard resolve generally works worse the more extreme the contrast gets. When I first experimented with custom resolves, though, I was surprised by how little contrast was needed before antialiasing was diminished or even killed entirely by a standard resolve. I didn't have to use extreme values, like in the thousands or anything; the effect was clearly visible even at contrast ratios like 4-8. One reason this isn't as visible in games is that they often use bloom effects that hide the worst artifacts.
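A quick numeric illustration of why the standard resolve fails here (the sample values are made up, but in the same 4-8 contrast range mentioned above): averaging in linear HDR space and then tonemapping is not the same as tonemapping each sample and then averaging.

```python
def tonemap(x, exposure=1.0):
    # The photographic operator from this thread: 1 - exp2(-color * exposure)
    return 1.0 - 2.0 ** (-x * exposure)

# 4x MSAA samples straddling a silhouette edge: two bright samples
# (HDR value 8.0) and two dark ones (0.0).
samples = [8.0, 8.0, 0.0, 0.0]

# Standard resolve: average the samples in linear space, then tonemap.
standard = tonemap(sum(samples) / len(samples))           # tonemap(4.0) = 0.9375

# Custom resolve: tonemap each sample first, then average.
custom = sum(tonemap(s) for s in samples) / len(samples)  # ~0.498
```

The custom resolve produces the expected half-covered middle gray, while the standard resolve pushes the edge pixel to near full brightness, almost indistinguishable from the bright interior, so the intermediate gradient that antialiasing relies on largely disappears.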
Second question... why is the custom resolve slightly *faster* (160 vs 150 fps) than the hardware resolve on G80? Are you maybe just doing a shader resolve in both cases? No biggie either way, just curious
This is normal. Unfortunately, it's not true for the ATI cards at the moment, but I hope future cards will show the same performance characteristics. The reason it's faster is that you save a lot of bandwidth. Instead of first resolving to a render target and then reading that render target back to tonemap it, you read the multisampled target once, tonemap and resolve in the same pass, and write straight to the backbuffer. Basically you save a read-write cycle of an entire fullscreen render target, at the expense of more ALU operations; on modern hardware the bandwidth gains generally matter more than the extra ALU cost.
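As a rough back-of-envelope for the bandwidth saved (the resolution and format here are assumptions, not the demo's actual settings; I'm picking a 1280x1024 RGBA16F target, i.e. 8 bytes per pixel):

```python
# Hypothetical fullscreen HDR render target: 1280x1024, RGBA16F (8 B/pixel).
width, height, bytes_per_pixel = 1280, 1024, 8
rt_bytes = width * height * bytes_per_pixel  # 10 MiB

# The two-pass path pays an extra full-target write (the resolve output)
# plus an extra full-target read (the tonemap pass's input).
extra_traffic = 2 * rt_bytes

print(extra_traffic / (1024 * 1024), "MiB of extra traffic per frame")  # 20.0
```

At 150+ fps that's several GiB/s of memory traffic avoided, which is why folding the tonemap into the resolve can come out ahead despite the added ALU work.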
Edit: When I first implemented this I saw the same performance characteristics on the HD 2900 too. Later the performance of the standard resolve was significantly improved through a bunch of clever driver tricks, and it turned out faster in the end. Theoretically it should be possible to make this technique faster on the HD 2000/3000 as well by applying similar optimizations to it, but that's not possible from a plain D3D10 app. With an app detect, though, I would assume the driver could implement a similar fast path for this technique that would outperform a standard resolve.
Is there normal mapping in the demo?
Yes.