MSAA & CSAA

This method lowers the image quality (very) slightly, as you only store one depth value per pixel from each polygon (the shader is run only once for each pixel inside the polygon). The MSAA depth buffer stores one depth value per sample for every pixel (it's not replicated like the color outputs are).
... that's only assuming you're going to execute your light shaders at sample frequency, which is a bit excessive in most cases! Normally you only *want* to execute it at sample frequency for incoherent pixels, as I believe Humus' latest demo does.

Furthermore I can't see the difference being at all significant in practice... plus isn't there a way in D3D10.1 to declare an interpolant as being at sample frequency (or maybe just to run the whole shader at sample frequency) IIRC?

Anyways it's convenient to reuse the depth buffer as part of your G-buffer since you're generating it anyways, but it's certainly not crippling on D3D10 to not have that convenience. In fact there are some (small) advantages to storing linear view space Z rather than post-projection Z, not that they necessarily outweigh the bandwidth savings of one less G-buffer value, small as those may be too.
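
For concreteness, here is a minimal sketch of what writing that linear view-space Z into the G-buffer might look like. The structure/resource names and the far-plane normalization are just illustrative assumptions, not anything from the posts above:

```hlsl
// Minimal G-buffer pixel shader sketch (D3D10-level HLSL). Names like
// GBufferOut, viewPos and g_FarPlane are illustrative assumptions.
float g_FarPlane;

struct GBufferOut
{
    float4 normalSpec : SV_Target0;   // whatever else the G-buffer carries
    float  linearZ    : SV_Target1;   // linear view-space Z, the one extra value
};

GBufferOut GBufferPS(float4 pos : SV_Position,
                     float3 viewPos : TEXCOORD0,
                     float3 normal  : NORMAL)
{
    GBufferOut o;
    o.normalSpec = float4(normalize(normal), 1.0f);
    // Store view-space Z normalized by the far plane instead of relying on
    // the post-projection depth buffer (the DX10.0-friendly path).
    o.linearZ = viewPos.z / g_FarPlane;
    return o;
}
```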
 
... that's only assuming you're going to execute your light shaders at sample frequency, which is a bit excessive in most cases! Normally you only *want* to execute it at sample frequency for incoherent pixels, as I believe Humus' latest demo does.

With existing DX10 hardware dynamic branching granularity, executing light shaders at sample frequency only for edge pixels should not yield much (if any) performance improvement for real game scenes (complex, high-polygon-count scenes). Dynamic branching itself causes a small perf hit, and you also have to compare the samples. And hi-stencil culling is not pixel precise either (4x4 blocks or larger) if you want to go that route.
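
For reference, roughly what that per-pixel edge test and branch looks like in a deferred lighting pass. Reading the MSAA depth buffer like this needs D3D10.1; the resource names, 4x sample count, depth threshold and the stub lighting function are all assumptions:

```hlsl
// Sketch: branch between per-pixel and per-sample lighting based on
// how coherent the subsample depths are (4xMSAA assumed).
Texture2DMS<float, 4>  g_DepthMS;    // illustrative resource names
Texture2DMS<float4, 4> g_NormalMS;

float4 ShadeSample(int2 pixel, int s)
{
    // Placeholder for the actual lighting evaluation of one G-buffer sample.
    float4 n = g_NormalMS.Load(pixel, s);
    return n; // stub
}

float4 LightPS(float4 pos : SV_Position) : SV_Target
{
    int2 pixel = int2(pos.xy);

    // Detect "incoherent" pixels by comparing the subsample depths.
    float d0   = g_DepthMS.Load(pixel, 0);
    bool  edge = false;
    [unroll]
    for (int i = 1; i < 4; ++i)
        edge = edge || (abs(g_DepthMS.Load(pixel, i) - d0) > 0.001f); // assumed threshold

    [branch]
    if (!edge)
    {
        // Coherent pixel: shade once using sample 0.
        return ShadeSample(pixel, 0);
    }
    else
    {
        // Edge pixel: shade every sample and average.
        float4 sum = 0;
        [unroll]
        for (int s = 0; s < 4; ++s)
            sum += ShadeSample(pixel, s);
        return sum * 0.25f;
    }
}
```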

Furthermore I can't see the difference being at all significant in practice...

Agreed completely. The biggest difference is at the polygon edges (both color and depth buffers are multisampled similarly at depth boundaries).
 
With existing DX10 hardware dynamic branching granularity, executing light shaders at sample frequency only for edge pixels should not yield much (if any) performance improvement for real game scenes (complex, high-polygon-count scenes). Dynamic branching itself causes a small perf hit, and you also have to compare the samples. And hi-stencil culling is not pixel precise either (4x4 blocks or larger) if you want to go that route.
If branching turns out to be of little benefit it could be a good idea to introduce some random dithering into the per-sample lighting calculation. Might as well use that processing power for some proper lighting anti-aliasing.
 
If branching turns out to be of little benefit it could be a good idea to introduce some random dithering into the per-sample lighting calculation. Might as well use that processing power for some proper lighting anti-aliasing.
One other option to move back towards shading at pixel frequency without the hit is to use so-called "interleaved" deferred lighting evaluation, wherein you only evaluate a subset of the lights for each sample. This should generally work pretty well for low-frequency lighting effects, but may have issues with high-frequency normals/specular or similar if one isn't careful. That said it won't be any worse than the typical unfiltered normal maps that we're seeing everywhere in games now anyways ;)
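
A rough sketch of that interleaving idea, assuming 4xMSAA, per-sample execution via SV_SampleIndex, and a simple modulo assignment of lights to samples. The light layout, light count, fetch helpers and diffuse-only model are all assumptions:

```hlsl
// "Interleaved" deferred lighting sketch, run at sample frequency (4xMSAA
// assumed): each sample evaluates only every 4th light, and the ordinary
// MSAA resolve then averages the partial results back together.
#define NUM_LIGHTS 64

cbuffer Lights
{
    float4 g_LightPosRadius[NUM_LIGHTS];  // xyz = view-space position, w = radius
    float4 g_LightColor[NUM_LIGHTS];
};

Texture2DMS<float4, 4> g_GBufNormalMS;    // illustrative G-buffer resources
Texture2DMS<float4, 4> g_GBufPosMS;

float3 FetchViewPos(int2 p, uint s) { return g_GBufPosMS.Load(p, s).xyz; }
float3 FetchNormal (int2 p, uint s) { return g_GBufNormalMS.Load(p, s).xyz; }

float4 InterleavedLightPS(float4 pos       : SV_Position,
                          uint   sampleIdx : SV_SampleIndex) : SV_Target
{
    int2   pixel    = int2(pos.xy);
    float3 posVS    = FetchViewPos(pixel, sampleIdx);
    float3 normalVS = FetchNormal(pixel, sampleIdx);

    float3 sum = 0;
    // Each sample handles lights sampleIdx, sampleIdx+4, sampleIdx+8, ...
    for (uint i = sampleIdx; i < NUM_LIGHTS; i += 4)
    {
        float3 toLight = g_LightPosRadius[i].xyz - posVS;
        float  att     = saturate(1.0f - length(toLight) / g_LightPosRadius[i].w);
        sum += g_LightColor[i].rgb * att * saturate(dot(normalVS, normalize(toLight)));
    }

    // The resolve averages the 4 samples, so scale by 4 to approximate the full sum.
    return float4(sum * 4.0f, 1.0f);
}
```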
 
Full depth write to each subsample (instead of color replication) is also good for shadowmap rendering. You can basically render the shadow map at 2x2 smaller resolution with 4xMSAA enabled to gain a 4x fill-rate boost for shadow map rendering. Unfortunately on PC you cannot just "cast" the 4xMSAA buffer to a 2x2 larger non-multisampled buffer, but with DX10.1 you can access all four subsamples and pick the nearest when you sample the shadow map texture. This trick could bring some performance improvements if the shadow map generation is fill-rate bound... but I am still waiting to get real typeless memory buffer support in the PC world.
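
A sketch of what sampling that half-resolution 4xMSAA shadow map might look like with the D3D10.1 subsample access described above. The resource names, bias value and the "smaller depth = nearer to the light" convention are assumptions:

```hlsl
// Sketch: shadow map rendered at 2x2 lower resolution with 4xMSAA, read back
// per-subsample via D3D10.1-level Texture2DMS access, picking the nearest
// of the four subsamples for the comparison.
Texture2DMS<float, 4> g_ShadowMapMS;

float SampleShadow(float3 shadowUVZ, float2 shadowMapSize)
{
    int2 texel = int2(shadowUVZ.xy * shadowMapSize);

    // Pick the subsample nearest to the light out of the four.
    float nearest = g_ShadowMapMS.Load(texel, 0);
    [unroll]
    for (int i = 1; i < 4; ++i)
        nearest = min(nearest, g_ShadowMapMS.Load(texel, i));

    const float bias = 0.0005f;   // assumed depth bias
    return (shadowUVZ.z - bias <= nearest) ? 1.0f : 0.0f;
}
```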
 
isn't there a way in D3D10.1 to declare an interpolant as being at sample frequency (or maybe just to run the whole shader at sample frequency) IIRC?

If you use SV_SampleIndex the shader will run at sample frequency.
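
For reference, the minimal form of that (D3D10.1 / shader model 4.1): simply declaring the input is enough, whether or not it is actually used.

```hlsl
// Declaring SV_SampleIndex forces the pixel shader to run once per sample.
float4 PerSamplePS(float4 pos       : SV_Position,
                   uint   sampleIdx : SV_SampleIndex) : SV_Target
{
    // Shading here executes at sample frequency; sampleIdx identifies the
    // sample being shaded (useful for Texture2DMS.Load, for example).
    return float4(1, 1, 1, 1); // stub
}
```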

Anyways it's convenient to reuse the depth buffer as part of your G-buffer since you're generating it anyways, but it's certainly not crippling on D3D10 to not have that convenience.

Depends on if you need to support other platforms as well that don't have any such restrictions on MSAA depth.
 
Full depth write to each subsample (instead of color replication) is also good for shadowmap rendering.
That's true, although with "filterable" shadow formats you can actually do a standard "resolve" and get a typical non-MSAA buffer with similar quality advantages.

If you use SV_SampleIndex the shader will run at sample frequency.
Right... is there no way to declare specific interpolants as being at sample frequency though? i.e. to give the implementation a chance to partition your shader by frequency appropriately? I'm guessing not, but it's an obviously useful feature for complex shaders.

Depends on if you need to support other platforms as well that don't have any such restrictions on MSAA depth.
You mean if you can use the same-ish implementation in two places that both read the MSAAed depth buffer? Yeah, sure that's convenient, but again I wouldn't say it's crippling to have to write out an extra fp32 worst case in DX10.0.
 
That's true, although with "filterable" shadow formats you can actually do a standard "resolve" and get a typical non-MSAA buffer with similar quality advantages.

Yes, that's true. We are using ESM and 4xMSAA on all our shadowmaps in our soon-to-be-released console title. This method works very well with EDRAM, since 4xMSAA rendering does not consume any extra bandwidth.
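
For anyone unfamiliar with it, roughly how the ESM part of that works: because the stored value is an exponentially warped depth, ordinary filtering (including a standard 4xMSAA resolve) averages in the warped domain and stays "filterable". The exponential constant and resource names below are assumptions:

```hlsl
// Exponential shadow map (ESM) sketch.
static const float ESM_C = 80.0f;   // assumed exponential constant

// Shadow-map pass: write the warped depth (rendered with 4xMSAA, then
// resolved like any color target).
float ShadowDepthPS(float4 pos : SV_Position, float lightDepth : TEXCOORD0) : SV_Target
{
    return exp(ESM_C * lightDepth);
}

// Shading pass: compare against the (filtered) warped occluder depth.
Texture2D    g_ESMShadowMap;
SamplerState g_LinearSampler;

float ShadowTerm(float2 uv, float receiverDepth)
{
    float warpedOccluder = g_ESMShadowMap.Sample(g_LinearSampler, uv).r;
    // exp(c * (occluder - receiver)) -> ~1 when lit, falls off when occluded.
    return saturate(warpedOccluder * exp(-ESM_C * receiverDepth));
}
```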
 
You mean if you can use the same-ish implementation in two places that both read the MSAAed depth buffer? Yeah, sure that's convenient, but again I wouldn't say it's crippling to have to write out an extra fp32 worst case in DX10.0.

Still much less expensive to reuse the depth buffer, and slightly better quality. 1600*1200*4*4 (4xMSAA) is (as you very well know) about 30MB, and a lot of bandwidth...

Even if you consider the average computer according to Steam stats, you have 1280*1024*4*4 (4xMSAA), so 20MB. On a 256MB card that's a lot of (compressed) textures and/or render targets lost, especially since you have already allocated that amount of memory for the exact same use anyway...
 
Still much less expensive to reuse the depth buffer, and slightly better quality. 1600*1200*4*4 (4xMSAA) is (as you very well know) about 30MB, and a lot of bandwidth...
I think you're overstating that a bit moving forward... sure, it's a chunk of memory, but one up-to-30MB chunk (in reality more like 15MB if you're pressed for memory, since if you're storing your own depth metric you can usually get away with 16 bits) that is unlikely to scale much in the coming years. There aren't a whole lot of 256MB DX10 cards (other than laptops) and for 512MB+ cards it's fairly insignificant.

[...] especially since you have already allocated that amount of memory for the exact same use anyway...
Don't get me wrong, it's convenient to read the MSAAed depth buffer! All I'm saying is that you can implement an efficient deferred renderer + MSAA just fine without that ability - and I have. Furthermore the post-projected depth is actually *not* exactly what you want, although it can often be converted back to view space Z or similar without too much precision loss.

Anyways I think we're all in violent agreement here. All I'm saying is that I wouldn't let that lack of DX10.1 support or similar prevent me from implementing a deferred renderer with MSAA. In reality NVIDIA can implement this feature as well with some hackery, so it's a moot point in any case.
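
(For reference, roughly what that conversion back to view-space Z looks like, assuming a standard left-handed D3D perspective projection; the function and parameter names are just illustrative.)

```hlsl
// Recovering view-space Z from a post-projection depth-buffer value d,
// assuming a standard LH D3D perspective projection where
//   d = Q - Q * zNear / viewZ,   with Q = zFar / (zFar - zNear).
float ViewZFromDepth(float d, float zNear, float zFar)
{
    float Q = zFar / (zFar - zNear);
    return (-Q * zNear) / (d - Q);   // equivalently proj._43 / (d - proj._33)
}
```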
 
Anyways I think we're all in violent agreement here.

Indeed, my fault though, I shouldn't have tried to advertise ATi hardware for once :p
(Only some ATi HD3 & HD4 support DX10.1 though ;p)

PowerVR power !!! (what ?! It is my favorite GPU brand after all ;p)

Maybe we can answer more CSAA & MSAA questions now ^^;
 
For CSAA, in your example, what will happen if there are more than 4 colors? Which 4 colors will be stored? The first 4? How are the other colors dealt with (use the closest color instead)? How is the number of colors being stored decided? Is it a fixed number (like 4), or can this number be changed?

So the answer to this question WRT NVIDIA's implementation is still 'we don't know'?
 
Right... is there no way to declare specific interpolants as being at sample frequency though? i.e. to give the implementation a chance to partition your shader by frequency appropriately? I'm guessing not, but it's an obviously useful feature for complex shaders.

How would that work exactly? You mean like you get an array of values for that interpolator instead of a single value? Not sure how else you'd be able to specify just an interpolator to be sample frequency.

You mean if you can use the same-ish implementation in two places that both read the MSAAed depth buffer? Yeah, sure that's convenient, but again I wouldn't say it's crippling to have to write out an extra fp32 worst case in DX10.0.

I mean that if you need to support other platforms as well, like consoles and/or DX10.1, it's quite a bit of a pain sometimes to put in extra code just about everywhere only to work around this limitation in DX10.0.
 
How would that work exactly? You mean like you get an array of values for that interpolator instead of a single value? Not sure how else you'd be able to specify just an interpolator to be sample frequency.
The shader compiler would have to split up the portions of the shader that need to be run at pixel vs sample frequency based on what data is used. So for instance, instead of declaring an unused SV_SampleIndex to implicitly "force" your entire shader to run at sample frequency, you could instead declare - say - your texture coordinates as sample frequency. The code that samples your alpha texture and discards would then be automatically determined to need to run at sample frequency as well, while any parts of the shader that only depend on pixel-frequency inputs can be run only once for the pixel, with only live registers replicated where necessary to loop over samples. This sort of analysis is actually really trivial to do in a shader compiler, and has some neat implications.

There are admittedly a few weirdnesses with respect to derivatives and things like centroid sampling of course, but isn't that always the case? :)
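
Purely to illustrate the idea (this is hypothetical, not valid D3D10/10.1 HLSL): the compiler would see which code depends on the sample-frequency input and split the shader accordingly. The "samplefreq" qualifier, the resources and ComputeLighting are all made up for the sketch:

```hlsl
// HYPOTHETICAL syntax sketch of per-interpolant sample frequency; the
// "samplefreq" qualifier does not exist, it just illustrates the proposed
// compiler partitioning described above.
Texture2D    g_AlphaTex;    // illustrative resources
SamplerState g_Sampler;

float3 ComputeLighting(float3 n) { return saturate(n.z).xxx; } // placeholder

struct PSIn
{
    float4            pos    : SV_Position;
    float3            normal : NORMAL;     // pixel frequency
    samplefreq float2 uv     : TEXCOORD0;  // hypothetical: evaluated per sample
};

float4 AlphaTestedPS(PSIn i) : SV_Target
{
    // Depends only on pixel-frequency inputs -> could run once per pixel.
    float3 lighting = ComputeLighting(i.normal);

    // Depends on the sample-frequency 'uv' -> the compiler would replicate
    // just this part (and the live 'lighting' register) across samples.
    clip(g_AlphaTex.Sample(g_Sampler, i.uv).a - 0.5f);

    return float4(lighting, 1.0f);
}
```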

I mean that if you need to support other platforms as well, like consoles and/or DX10.1, it's quite a bit of a pain sometimes to put in extra code just about everywhere only to work around this limitation in DX10.0.
Ah yes, of course, although I'd blame this more on the inadequacies of the binding layer between shaders and "host" programs. It *shouldn't* be hard to add or remove an element from a struct and abstract the encoding/decoding after all, but we're not there yet.
 
The shader compiler would have to split up the portions of the shader that need to be run at pixel vs sample frequency based on what data is used. So for instance, instead of declaring an unused SV_SampleIndex to implicitly "force" your entire shader to run at sample frequency, you could instead declare - say - your texture coordinates as sample frequency. The code that samples your alpha texture and discards would then be automatically determined to need to run at sample frequency as well, while any parts of the shader that only depend on pixel-frequency inputs can be run only once for the pixel, with only live registers replicated where necessary to loop over samples. This sort of analysis is actually really trivial to do in a shader compiler, and has some neat implications.

There are admittedly a few weirdnesses with respect to derivatives and things like centroid sampling of course, but isn't that always the case? :)

That's an interesting idea. Although, I wonder how much those "weirdnesses" would break things and how much special care you'd need for edge pixels where only a subset of the samples are covered.
 
That's an interesting idea. Although, I wonder how much those "weirdnesses" would break things and how much special care you'd need for edge pixels where only a subset of the samples are covered.
It's true and I haven't given it a ton of detailed thought yet. Still, given that centroid sampling + derivatives and a few other weird things tend to work okay in practice in most applications that I've played with, one would think that a reasonable policy could be written.
 
Running the shader per-sample always seemed like a non-feature to me. Who wants to run at 1/4 or 1/8th the speed for a marginal increase in quality?

IMHO, it's far better to have a more flexible interpolation scheme, and the ability to query where edges are (or whether arbitrary sample points are inside the primitive or not). Then you could code a shader that runs at full speed for most cases, but can slow down as needed for the "hard" cases. The hard cases need not be polygon edges; they can also be specular highlights, for example.

This also allows for natural scaling of performance for low-end GPUs, and disconnects the shader math from the multisample mode.
 
Would something like that require API support, or could the IHVs force it on like they do with regular old MSAA now?
 
That would require API and developer support. It's not something you can just turn on.
 
NVIDIA's been doing some pretty crazy things without developer support...

For future cards (think GT300) the ability to force RGSS would be nice. Especially for the myriad PC games these days that lack proper MSAA support. :devilish:
 