SSAA vs. MSAA debate

poly-gone said:
On the contrary, I wrote the code, so I know perfectly well what's going on and what it's limited by. And in fact, IT IS FILLRATE LIMITED, because I'm using both HDR rendering and soft-edged shadow mapping (1024x1024 shadow map).
HDR is more memory bandwidth-limited than traditional rendering. And your results speak for themselves: if you are implementing 2x supersampling with a small performance hit, then you are not fillrate-limited.

Edit:
Made a mistake.
 
Chalnoth said:
HDR is more memory bandwidth-limited than traditional rendering. And your results speak for themselves: if you are implementing 2x supersampling with a small performance hit, then you are not fillrate-limited (you're most likely memory bandwidth-limited, but I can't tell for certain without more information).
OK, time out. You guys really have to read my posts more carefully. I think there has been a major misunderstanding here. PLEASE READ MY POST AGAIN: it says "5 fps less with 2x SSAA THAN with 4x MSAA", not "small hit with SSAA". Does that make sense?
 
Chalnoth said:
HDR is more memory bandwidth-limited than traditional rendering. And your results speak for themselves: if you are implementing 2x supersampling with a small performance hit, then you are not fillrate-limited (you're most likely memory bandwidth-limited, but I can't tell for certain without more information).
HDR becomes a bandwidth limitation at high resolutions, but at 1024x768 (with the SSAA map being 2048x1536) it's not that much of a bandwidth thing. Once you combine this with soft-edged shadow mapping (at 1024x1024), you'd be burning your fillrate like hell.
 
poly-gone said:
HDR becomes a bandwidth limitation at high resolutions, but at 1024x768 (with the SSAA map being 2048x1536) it's not that much of a bandwidth thing. Once you combine this with soft-edged shadow mapping (at 1024x1024), you'd be burning your fillrate like hell.
Erm, high resolution actually slightly reduces bandwidth limitations with respect to fillrate limitations. This is because at high resolution, your textures are more likely to be magnified, resulting in higher texture cache efficiency, and z-buffer compression is more effective, as more pixels are covered by single triangles. If you notice more of a performance hit from enabling FSAA at high resolutions, it's because your lower-resolution scores are limited by something other than fillrate or memory bandwidth, like vertex throughput or the CPU.

Your observation that 4x MSAA performs 5 fps better than 2x SSAA doesn't show that supersampling can still give good performance in fillrate-limited scenarios. It shows that the hardware you're using could still do better at optimizing its multisampling performance. Additionally, since multisampling is much more taxing on memory bandwidth than on fillrate (optimizations reduce the memory bandwidth hit, but they can't eliminate it entirely), the fact that the performance hit between the two is so close again seems to indicate a memory bandwidth limitation.

And please note that even in the presence of a heavy memory bandwidth limitation, 2x supersampling should still produce roughly a 50% performance hit (sorry, I misspoke in my previous post: it may be slightly less than a 50% hit for the same reasons that higher resolutions are less memory bandwidth-limited, but this difference may be negligible or overshadowed by the downsampling).
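For reference, here's a toy cost model of that fillrate/bandwidth split, as a C++ sketch. It only encodes the assumptions stated above (shader invocations track fillrate, samples stored track uncompressed bandwidth); the names are illustrative, not any actual driver's bookkeeping.

Code:
// Toy model: supersampling shades AND stores every sample, while
// multisampling shades once per pixel but still stores N samples.
struct AAMode {
    int shaderInvocationsPerPixel;  // scales fillrate / ALU cost
    int samplesStoredPerPixel;      // scales (uncompressed) bandwidth cost
};

constexpr AAMode kNoAA   {1, 1};
constexpr AAMode kSsaa2x {2, 2};  // 2x SSAA: double both costs
constexpr AAMode kMsaa4x {1, 4};  // 4x MSAA: fillrate unchanged, bandwidth
                                  // up to 4x before framebuffer compression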
 
Chalnoth said:
Erm, high resolution actually slightly reduces bandwidth limitations with respect to fillrate limitations. This is because at high resolution, your textures are more likely to be magnified, resulting in higher texture cache efficiency, and z-buffer compression is more effective, as more pixels are covered by single triangles.
Theoretically, yes, but bandwidth limitation most likely occurs when you have large textures (with a correspondingly high number of random fetches), high-poly scenes and a frigging huge framebuffer.

None of those are present in my scene. I use 3 textures for the terrain, 4(+2) for my water (2 normal maps (2 reads each) and 2 maps for reflection/refraction), a cubemap for a couple of glass objects and a 3D noise map (64x64x64) for the clouds. The whole scene renders about 50000-60000 polygons per frame, so clearly it's not bandwidth limited.

Chalnoth said:
Your observation that 4x MSAA performs 5 fps better than 2x SSAA doesn't show that supersampling can still give good performance in fillrate-limited scenarios. It shows that the hardware you're using could still do better at optimizing its multisampling performance. Additionally, since multisampling is much more taxing on memory bandwidth than on fillrate (optimizations reduce the memory bandwidth hit, but they can't eliminate it entirely), the fact that the performance hit between the two is so close again seems to indicate a memory bandwidth limitation.
As I mentioned above, my scene is NOT bandwidth limited. My Quadro FX 4000 certainly can meet higher bandwidth requirements than what's required by my scene.

Chalnoth said:
And please note that even in the presence of a heavy memory bandwidth limitation, 2x supersampling should still produce roughly a 50% performance hit.
The fps drops from 75 to 43 with 2x SSAA, and to 48 with 4x MSAA; and without AF, MSAA alone doesn't give the quality of SSAA. With that in mind, I'd prefer SSAA. That's the point I was putting across.
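For reference, a quick back-of-envelope check of those numbers; a minimal C++ sketch under the idealized assumption that a purely fillrate-bound scene scales inversely with the number of samples shaded. Note that the 2048x1536 SSAA target mentioned earlier holds 4x the pixels of 1024x768, so both idealized predictions are shown.

Code:
#include <cstdio>

int main() {
    const double base = 75.0;  // reported fps without AA
    printf("observed 2x SSAA hit: %.1f%%\n", (1 - 43.0 / base) * 100);  // ~42.7
    printf("observed 4x MSAA hit: %.1f%%\n", (1 - 48.0 / base) * 100);  // ~36.0
    // Idealized, purely fillrate-bound predictions:
    printf("2x the samples -> %.2f fps\n", base / 2);  // 37.50 (a 50% hit)
    printf("4x the pixels  -> %.2f fps\n", base / 4);  // 18.75 (a 75% hit)
    return 0;
}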
 
poly-gone said:
Theoretically, yes, but bandwidth limitation most likely occurs when you have large textures (with a correspondingly high number of random fetches), high-poly scenes and a frigging huge framebuffer.

None of those are present in my scene. I use 3 textures for the terrain, 4(+2) for my water (2 normal maps (2 reads each) and 2 maps for reflection/refraction), a cubemap for a couple of glass objects and a 3D noise map (64x64x64) for the clouds. The whole scene renders about 50000-60000 polygons per frame, so clearly it's not bandwidth limited.
This does not follow. You're still only speaking theoretically. To really know that it's a fillrate wall you're hitting and not a bandwidth wall, you would need to do more testing. But considering your performance numbers for enabling 2x supersampling, I'd say you're limited by something else at that resolution.

The fps drops from 75 to 43 with 2x SSAA, and to 48 with 4x MSAA; and without AF, MSAA alone doesn't give the quality of SSAA. With that in mind, I'd prefer SSAA. That's the point I was putting across.
Ah, but as I said, the performance hit for supersampling is pretty much static. Just due to the nature of the algorithm, there's basically no way to significantly improve it. Multisampling, on the other hand, can be done with almost zero performance hit. The primary issue with current video cards is imperfect framebuffer compression when multisampling is enabled. ATI does better than nVidia at the moment, but I'd be surprised if there wasn't still room left for improvement in ATI's algorithm.

So your results aren't really a good comparison between 2x supersampling and 4x multisampling: they're only a comparison between a particular implementation of 2x supersampling and a particular implementation of 4x multisampling.
 
poly-gone said:
Theoretically, yes, but bandwidth limitation most likely occurs when you have large textures (with a correspondingly high number of random fetches), high-poly scenes and a frigging huge framebuffer.
A huge framebuffer decreases the bandwidth requirements per pixel slightly.

None of those are present in my scene. I use 3 textures for the terrain, 4(+2) for my water (2 normal maps (2 reads each) and 2 maps for reflection/refraction), a cubemap for a couple of glass objects and a 3D noise map (64x64x64) for the clouds. The whole scene renders about 50000-60000 polygons per frame, so clearly it's not bandwidth limited.
How many arithmetic instructions do your shaders use? Do you use texture compression on all textures (except the 3D-texture)? Fetching from any texture format that has 32 bpt or more is almost always bandwidth limited (that includes FP16 and depth textures) if your texture to arithmetic ratio is too high.

The fps drops from 75 to 43 with 2x SSAA, and to 48 with 4x MSAA; and without AF, MSAA alone doesn't give the quality of SSAA. With that in mind, I'd prefer SSAA. That's the point I was putting across.
That is an unusually big hit for MSAA. Do you have render passes that can be completed in a single cycle? Do you render to multisampled rendertargets?
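For context on the bits-per-texel (bpt) figures Xmas mentions, a small C++ sketch of the storage costs involved. The format numbers are standard; the ratio check at the end is a made-up illustration of the texture-to-arithmetic idea, not a real threshold.

Code:
// Storage cost per texel for common D3D9-era formats.
constexpr int kDxt1Bpt     = 4;   // 64-bit block / 16 texels
constexpr int kDxt5Bpt     = 8;   // 128-bit block / 16 texels
constexpr int kRgba8Bpt    = 32;  // uncompressed 8-bit RGBA
constexpr int kFp16RgbaBpt = 64;  // A16B16G16R16F, the usual HDR format

// Hypothetical rule of thumb: fetches from 32+ bpt formats (FP16, depth)
// push toward a bandwidth limit unless enough ALU work hides them.
constexpr bool likelyBandwidthBound(int wideFetches, int aluInstructions) {
    return aluInstructions < 4 * wideFetches;  // illustrative 4:1 cutoff
}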
 
Xmas said:
A huge framebuffer decreases the bandwidth requirements per pixel slightly.
Could you elaborate on that, please? I'd like to know how :).

Xmas said:
How many arithmetic instructions do your shaders use? Do you use texture compression on all textures (except the 3D-texture)? Fetching from any texture format that has 32 bpt or more is almost always bandwidth limited (that includes FP16 and depth textures) if your texture to arithmetic ratio is too high.
About 35-40 on average, more in the case of shadow receivers. All textures (except the normal maps) use DXT5 compression.

That is an unusually big hit for MSAA. Do you have render passes that can be completed in a single cycle? Do you render to multisampled rendertargets?
If done manually in a shader instead of "on the hardware", it is taxing to fetch 16 samples. The same holds true for every kind of downsampling operation I've seen on my hardware (both my 6800 and Quadro), so I'm not too surprised by the performance hit. The shadowing, water and HDR post-processing obviously require more than one render pass, but all the other "straight" rendering ops are done in one pass (there's no offscreen "storage").
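To make the cost of a manual resolve concrete, here's a minimal CPU reference of one plausible 16-tap kernel shape: a plain 4x4 box filter on a single-channel image. The real thing runs in a pixel shader with 16 texture fetches per output pixel, and the actual kernel weights and tap layout used in the demo aren't specified here.

Code:
#include <vector>

// 16-tap box downsample, 4x reduction per axis, on a single-channel image.
// Each output pixel costs 16 fetches -- the per-pixel load described above.
std::vector<float> downsample16tap(const std::vector<float>& src,
                                   int srcW, int srcH) {
    const int dstW = srcW / 4, dstH = srcH / 4;
    std::vector<float> dst(dstW * dstH);
    for (int y = 0; y < dstH; ++y)
        for (int x = 0; x < dstW; ++x) {
            float sum = 0.0f;
            for (int ty = 0; ty < 4; ++ty)      // 4x4 = 16 taps
                for (int tx = 0; tx < 4; ++tx)
                    sum += src[(4 * y + ty) * srcW + (4 * x + tx)];
            dst[y * dstW + x] = sum / 16.0f;    // uniform weights; a custom
        }                                       // kernel would vary these
    return dst;
}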
 
poly-gone said:
Could you elaborate on that, please? I'd like to know how :).
Total bandwidth required is vertex fetch + texturing + framebuffer read & write + scanout.
A bigger framebuffer means more pixels per triangle, which means less vertex data per rendered pixel. It also means increased efficiency of block-based framebuffer access (filled tiles to partially covered tiles ratio). And it means texture magnification will occur more often.

Scanout bandwidth is different because it is independent of framerate and increases linearly with framebuffer size (given a fixed refresh rate). This means an increase in bandwidth per rendered pixel.

I'm not sure which effect is bigger, though. All of them are fairly small.

If done manually in a shader instead of "on the hardware", it is taxing to fetch 16 samples. The same holds true for every kind of downsampling operation I've seen on my hardware (both my 6800 and Quadro), so I'm not too surprised by the performance hit. The shadowing, water and HDR post-processing obviously require more than one render pass, but all the other "straight" rendering ops are done in one pass (there's no offscreen "storage").
How do you enable AA? Through the driver panel, or do you create the multisampled/upsized rendertargets yourself?
How do you do the post-processing? Render to multisampled framebuffer, StretchRect to a texture, then render a fullscreen quad onto the framebuffer again?
 
Xmas said:
Total bandwidth required is vertex fetch + texturing + framebuffer read & write + scanout.
A bigger framebuffer means more pixels per triangle, which means less vertex data per rendered pixel. It also means increased efficiency of block-based framebuffer access (filled tiles to partially covered tiles ratio). And it means texture magnification will occur more often.

Scanout bandwidth is different because it is independent of framerate and increases linearly with framebuffer size (given a fixed refresh rate). This means an increase in bandwidth per rendered pixel.

I'm not sure which effect is bigger, though. All of them are fairly small.
Thanks :). So if the scanout bandwidth hit were bigger than the others, wouldn't the bandwidth hit increase as the framebuffer size increases?

Xmas said:
How do you enable AA? Through the driver panel, or do you create the multisampled/upsized rendertargets yourself?
How do you do the post-processing? Render to multisampled framebuffer, StretchRect to a texture, then render a fullscreen quad onto the framebuffer again?
AA is all custom. In the case of SSAA, I render to a bigger target and downsample it using StretchRect (I use the pixel shader downsampling method for rotated-grid sampling). For MSAA, I use a custom 16 tap filter kernel.

The post-processing is done in the usual way: I take the downsampled (antialiased) target, downsample it further, blur it and composite it back onto the original target along with tone mapping.
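For anyone following along, a hedged Direct3D 9 sketch of that SSAA path: render to an oversized target, then downsample with StretchRect. poly-gone's version replaces the StretchRect filtering with a pixel-shader downsample for rotated-grid sampling; error handling, the matching depth-stencil surface and the render loop are omitted, and filtered StretchRect from an FP16 surface depends on hardware support.

Code:
#include <d3d9.h>

// Assumes an initialized IDirect3DDevice9* device with a 1024x768 backbuffer.
void renderWithSsaa(IDirect3DDevice9* device) {
    IDirect3DSurface9* bigRT = NULL;  // 2x-per-axis supersample target
    device->CreateRenderTarget(2048, 1536, D3DFMT_A16B16G16R16F,
                               D3DMULTISAMPLE_NONE, 0, FALSE, &bigRT, NULL);

    IDirect3DSurface9* backBuffer = NULL;
    device->GetBackBuffer(0, 0, D3DBACKBUFFER_TYPE_MONO, &backBuffer);

    device->SetRenderTarget(0, bigRT);
    // ... render the whole scene at 2048x1536 here ...

    // Downsample into the backbuffer (the simple variant; a fullscreen
    // quad with a downsampling pixel shader goes here instead for RGSS).
    device->SetRenderTarget(0, backBuffer);
    device->StretchRect(bigRT, NULL, backBuffer, NULL, D3DTEXF_LINEAR);

    backBuffer->Release();
    bigRT->Release();
}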
 
poly-gone said:
Thanks :). So if the scanout bandwidth hit were bigger than the others, wouldn't the bandwidth hit increase as the framebuffer size increases?
Yes. But consider the following scenario:
Rendering of a scene is fillrate limited. Now you double the resolution, which halves the fps. At the same refresh rate, scanout bandwidth doubles as well. Halving the fps means halving the vertex bandwidth requirements (same data per frame, fewer frames).

So if you start out with 1024x768x32@85Hz and 60fps, and switch to 1600x1200x32@85Hz and 24.5 fps, you need 367.5 MiB/s more scanout bandwidth, but save 35.5 * <vertex data per frame> /s. This means, if you have more than 10.4 MiB of vertex data per frame, bandwidth per rendered pixel decreases. That's a realistic number for today's games, and that is not counting the savings from texturing and improved framebuffer access efficiency.

AA is all custom. In the case of SSAA, I render to a bigger target and downsample it using StretchRect (I use the pixel shader downsampling method for rotated-grid sampling). For MSAA, I use a custom 16 tap filter kernel.
I don't see how you can do "all custom" MSAA.

The post-processing is done in the usual way: I take the downsampled (antialiased) target, downsample it further, blur it and composite it back onto the original target along with tone mapping.
Back onto the original, multisampled target?
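A quick check of the arithmetic in the 1024x768 -> 1600x1200 example above, as a minimal C++ sketch; the figures do come out to roughly 367.5 MiB/s and 10.4 MiB per frame.

Code:
#include <cstdio>

int main() {
    const double MiB = 1024.0 * 1024.0, hz = 85.0, bytesPerPixel = 4.0;
    // Scanout bandwidth = width * height * bytes per pixel * refresh rate.
    const double lo = 1024 * 768  * bytesPerPixel * hz / MiB;  // 255.0 MiB/s
    const double hi = 1600 * 1200 * bytesPerPixel * hz / MiB;  // ~622.6 MiB/s
    // Dropping from 60 to 24.5 fps saves 35.5 frames' worth of vertex
    // fetch per second; break-even is where that equals the extra scanout.
    printf("extra scanout: %.1f MiB/s\n", hi - lo);     // the ~367.5 above
    printf("break-even vertex data: %.1f MiB/frame\n",
           (hi - lo) / (60.0 - 24.5));                  // the ~10.4 above
    return 0;
}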
 
Xmas said:
Yes. But consider the following scenario:
Rendering of a scene is fillrate limited. Now you double the resolution, which halves the fps. At the same refresh rate, scanout bandwidth doubles as well. Halving the fps means halving the vertex bandwidth requirements (same data per frame, fewer frames).

So if you start out with 1024x768x32@85Hz and 60fps, and switch to 1600x1200x32@85Hz and 24.5 fps, you need 367.5 MiB/s more scanout bandwidth, but save 35.5 * <vertex data per frame> /s. This means, if you have more than 10.4 MiB of vertex data per frame, bandwidth per rendered pixel decreases. That's a realistic number for today's games, and that is not counting the savings from texturing and improved framebuffer access efficiency.
Hmmm... true. But that would apply to a game, not my demo though :).

Xmas said:
I don't see how you can do "all custom" MSAA.
It's not "fully custom MSAA" per se, since I obviously cannot employ z-compression or tiled/memory optimized offscreen buffers, it's just that the sampling is done that way. This "subsitute MSAA" is of course what games implementing custom MSAA solutions (due to the lack of hardware MSAA on floating point buffers) would do.

Xmas said:
Back onto the original, multisampled target?
No, the backbuffer.
 
Xmas said:
The V5 didn't waste fillrate on downsampling, IIRC. In the benches I've seen it generally didn't drop as much in performance as the GF2. The V5 kept the LOD at the same level as without AA because it rendered separate images and a negative LOD bias decreases the texture cache hit rate.

No, it didn't keep the same LOD bias, at least on my PC, and here's why: you can set it in the 3dfx tools, no 3rd-party app needed :). I used LOD -1.5 for FSAA 4x, -0.75 for 2x and -0.25 for no AA (as I read somewhere that VSA/100 has a conservative default LOD).
Performance decrease is rather negligible, maybe you lose 1%, but you'll never notice it.

4x and LOD -1.5 looks tremendous, like anisotropic filtering and with filtered alpha textures. And there wasn't texture aliasing to complain about (LOD -2 was crisper but aliased).

It was slow, but great for N64/PSX emulators, old games and CounterStrike (60 fps at 800x600)
 
It's not "fully custom MSAA" per se, since I obviously cannot employ z-compression or tiled/memory optimized offscreen buffers, it's just that the sampling is done that way. This "subsitute MSAA" is of course what games implementing custom MSAA solutions (due to the lack of hardware MSAA on floating point buffers) would do.
Rofl. Do you even know what modern MSAA implies? Your implementation has ZERO similarity to it. Obviously with such an implementation, you'd consider so-called SSAA to be better than so-called MSAA... duh. May I congratulate you for not even pointing this out earlier, and daring to start a debate based on such BS?

Uttar
 
Blazkowicz_ said:
No, it didn't keep the same LOD bias, at least on my PC, and here's why: you can set it in the 3dfx tools, no 3rd-party app needed :). I used LOD -1.5 for FSAA 4x, -0.75 for 2x and -0.25 for no AA (as I read somewhere that VSA/100 has a conservative default LOD).
Performance decrease is rather negligible, maybe you lose 1%, but you'll never notice it.

4x and LOD -1.5 looks tremendous, like anisotropic filtering and with filtered alpha textures. And there wasn't texture aliasing to complain about (LOD -2 was crisper but aliased).

It was slow, but great for N64/PSX emulators, old games and CounterStrike (60 fps at 800x600)

Exactly how I used to use my V5 :). God, I thought those days of LOD adjustment discussions, 16-bit dithered to 22-bit vs. 32-bit, AA, etc. etc. were gone forever...
 
poly-gone said:
It's not "fully custom MSAA" per se, since I obviously cannot employ z-compression or tiled/memory optimized offscreen buffers, it's just that the sampling is done that way. This "subsitute MSAA" is of course what games implementing custom MSAA solutions (due to the lack of hardware MSAA on floating point buffers) would do.
Then that is not MSAA, and your claim that 2x SSAA is better than 4x MSAA is useless. I'm pretty sure game developers who go to the trouble of implementing some form of AA on non-FP16-MSAA-capable GPUs will most likely make use of the MSAA capabilities of 8-bit-per-channel rendertargets and compose a non-antialiased tonemapped HDR render on top of it. Or take the easy way out and do supersampling.
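A hedged D3D9 sketch of the split Xmas describes: hardware 4x MSAA on an 8-bit backbuffer, with the HDR portion rendered to a separate non-multisampled FP16 target and tonemapped/composited on top. Resolution, formats and sample count are assumptions, the render loop is omitted, and in real code the present parameters feed CreateDevice (after a CheckDeviceMultiSampleType query) before the texture is created.

Code:
#include <d3d9.h>

// Request hardware 4x MSAA on an 8-bit-per-channel backbuffer.
// The caller should zero-initialize pp first.
void fillPresentParams(D3DPRESENT_PARAMETERS& pp) {
    pp.BackBufferWidth        = 1024;
    pp.BackBufferHeight       = 768;
    pp.BackBufferFormat       = D3DFMT_A8R8G8B8;          // 8 bpc: MSAA works
    pp.MultiSampleType        = D3DMULTISAMPLE_4_SAMPLES;
    pp.SwapEffect             = D3DSWAPEFFECT_DISCARD;    // required for MSAA
    pp.Windowed               = TRUE;
    pp.EnableAutoDepthStencil = TRUE;
    pp.AutoDepthStencilFormat = D3DFMT_D24S8;
}

// Separate non-MSAA FP16 target for the HDR pass (NV40-class hardware can't
// multisample FP16). Render HDR here, tonemap, then composite the result
// over the multisampled backbuffer so geometry edges keep their AA.
void createHdrTarget(IDirect3DDevice9* device, IDirect3DTexture9** hdrTarget) {
    device->CreateTexture(1024, 768, 1, D3DUSAGE_RENDERTARGET,
                          D3DFMT_A16B16G16R16F, D3DPOOL_DEFAULT,
                          hdrTarget, NULL);
}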
 
poly-gone, surely you have to realize how worthless your performance measurements are. MSAA in software compared to MSAA in hardware? They're not the same at all. Why you expect usable performance metrics from your program is beyond me.

I knew something was up when you said you had SSAA and MSAA working with HDR nonchalantly.


Anyway, on the topic: what do offline renderers do? Do they take multiple samples per pixel, do they do the AA within the shader, or is it partly both? I think that's important because they've already solved the problem, and the solution might be applicable to real-time graphics.
 
I agree with Randell and Blazkowicz_.

[Graph: vsa-100_lod.png — Voodoo 5 performance at different LOD bias settings]

(Voodoo 5 PCI, Athlon XP @2100MHz (~2600/2700+), 512MB DDR/400MHz, nForce2 -> Voodoo 5 was the weakest piece in this test, so this graph shows the worst-case situation)

Even in Serious Sam SE (which is probably the most hardware-demanding game optimized for 3dfx Voodoo), the performance drop (LOD 0.00 -> -0.25) is about 1%. The difference is even lower in other games. I don't think the lower default LOD was a performance-enhancing hack.
 
no-X said:
(Voodoo 5 PCI, Athlon XP @2100MHz (~2600/2700+), 512MB DDR/400MHz, nForce2 -> Voodoo 5 was the weakest piece in this test, so this graph shows the worst-case situation)
No, it doesn't, because AA is not enabled. That makes a significant difference.
 
Chalnoth: Not in my experience (at FSAA 2x, which is the same as FSAA off on a Voodoo 4, so I don't see any reason why the result should be different).

I could install SSSE again and prepare a new graph. I don't promise anything, but I'll try to find a bit of time for it.
 