Summed-Area Variance Shadow Maps (Demo)

Andrew Lauritzen

Hi all!

I've been doing some more research involving Variance Shadow Maps over the past few months and the results have been very positive. Of course there's always more work to be done, but I'm pretty happy with the implementation at this point, so I figured that it's time to release it and get feedback.

The original Variance Shadow Maps paper (you can find it here) and implementation had a few things that I wanted to clean up. In particular:

  • Light bleeding could occur due to the sometimes-loose upper bound Chebyshev approximation employed.
  • High-precision hardware filtering (mipmapping, trilinear, anisotropic) was required.
  • Blurring the shadow map can become expensive, even with the separable O(n) filter.
  • The filter width could not be changed dynamically per-pixel, which is desirable both for standard texture filtering, and plausible soft shadows (with contact hardening and so forth).

The new implementation addresses all of these issues by modifying the approximation function slightly and using summed-area tables.

It's worth noting first that light bleeding cannot be eliminated without over-darkening some regions, because the upper bound that Chebyshev's inequality provides is tight. Put simply, VSM is really the best that you can do with N pieces of information (here, two moments).

Still, we are willing to accept some over-darkening in certain non-objectionable regions if we can get rid of light bleeding. A simple way is to just clip off the tail of Chebyshev's Inequality. This can be trivially implemented by taking the result of evaluating the inequality and passing it through a simple function like linstep or smoothstep.
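As a rough sketch of the idea (in Python rather than the demo's HLSL), here is the Chebyshev evaluation with the tail clipped via linstep. The two moments are assumed to come from the filtered shadow map; the `min_variance` clamp and the parameter names are my own illustration, not necessarily what the demo uses:

```python
def linstep(lo, hi, v):
    # Linear remap of v from [lo, hi] to [0, 1], clamped.
    return min(max((v - lo) / (hi - lo), 0.0), 1.0)

def chebyshev_upper_bound(moments, t, light_bleed_threshold, min_variance=1e-5):
    mean, mean_sq = moments  # E[x], E[x^2] from the filtered shadow map
    # Fully lit if the receiver is closer to the light than the mean occluder depth.
    if t <= mean:
        return 1.0
    variance = max(mean_sq - mean * mean, min_variance)
    # One-sided Chebyshev inequality: P(x >= t) <= var / (var + (t - mean)^2)
    p_max = variance / (variance + (t - mean) ** 2)
    # Clip off the tail to reduce light bleeding: anything below the
    # (artist-editable) threshold is pushed to full shadow, and the
    # remaining range is rescaled back to [0, 1].
    return linstep(light_bleed_threshold, 1.0, p_max)
```

Setting the threshold to 0 gives the unmodified inequality back; raising it darkens the low end of the penumbra, trading light bleeding for some over-darkening.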

The threshold here is artist-editable, and a threshold to completely eliminate all light bleeding can be computed from the ratio of overlapping occluder distances from the light's perspective. The worst case occurs where there is one occluder very close to the light whose penumbra is cast onto two overlapping surfaces that are very close to each other, but both far from the light. I can draw some pictures if that last paragraph doesn't make much sense.

In practice this technique works very well and is exposed in the demo with a corresponding threshold slider for people to play with and see the effect. (The demo uses a simple linstep, but smoothstep looks good as well, although it naturally changes the falloff function).

To solve all of the remaining problems, I implemented summed-area tables. This allows one to sample arbitrary rectangular regions of the shadow map at constant cost with no dynamic branching.
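The constant-cost property is easy to sketch. Assuming an inclusive SAT (each entry holds the sum of everything at or above-left of it), the average over any rectangle takes exactly four table reads regardless of its size — this toy Python version is my own illustration, not the demo's GPU code:

```python
def build_sat(img):
    # Inclusive summed-area table: sat[y][x] = sum of img over [0..x] x [0..y].
    h, w = len(img), len(img[0])
    sat = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y][x] = row_sum + (sat[y - 1][x] if y > 0 else 0.0)
    return sat

def box_average(sat, x0, y0, x1, y1):
    # Average over the inclusive rectangle [x0, x1] x [y0, y1] using
    # only 4 reads, independent of the rectangle's area.
    def read(x, y):
        return sat[y][x] if x >= 0 and y >= 0 else 0.0
    total = (read(x1, y1) - read(x0 - 1, y1)
             - read(x1, y0 - 1) + read(x0 - 1, y0 - 1))
    return total / ((x1 - x0 + 1) * (y1 - y0 + 1))
```

For VSM, the same four-corner lookup is done for both moments; with manual filtering of the corner fetches this is presumably where the demo's fixed read count per pixel comes from.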

The demo uses hardware derivatives to compute a rectangular filter region (the same way the hardware does it for mipmapping... actually a little bit more accurate), and optionally clamps the minimum filter size to get a softer shadow.

There are several shadowing techniques in the demo which I'll outline briefly here:

  • Shadow Map is a standard shadow mapping implementation. Cringe at the ugly aliased shadows!
  • PCF implements percentage closer filtering to sample the filter rectangle. The results are good, but heavy biasing is required (causing "peter-panning" of the shadow) and the cost grows as O(n^2) as the filter width increases (whether via softness, viewing from shallow angles, etc).
  • Hardware VSM uses hardware texture filtering (mipmapping, trilinear, anisotropic) and sets the texture Max LOD to soften the shadow. Since mipmapping is a rather coarse approximation for magnification, boxy artifacts are clearly visible when "softening".
  • Summed-Area VSM uses summed area tables to do the filtering, not relying on any hardware texture filtering. This implementation provides excellent quality and softening and does exactly 16 texture reads regardless of the filter area.

Anyways enough talk, grab the demo here: Summed-Area Variance Shadow Maps (January 30, 2007)
Note that I will soon release the source for the demo for people to play around with, but I've submitted this work for potential inclusion in GPU Gems 3, and I'm waiting to hear back on that before I post code.

Please note the requirements (as detailed in the included Readme):

  • Any reasonably modern CPU/RAM
  • Windows XP or Windows XP x64 Edition
  • A shader model 3.0 capable video card
    NVIDIA GeForce 8 series card highly recommended
  • DirectX Redist December 2006
    Available free from http://www.microsoft.com (search for the above)
  • Visual C++ 2005 Redistributable Package
    Available free from http://www.microsoft.com (search for the above)
Also note that the demo was really designed and optimized for the G80. Your mileage may vary on other platforms. There are a few "known issues" on other cards detailed in the Readme. In particular PCF is somewhat broken with ATI's latest drivers, although that's an improvement from when loading the shader on ATI would instantly reboot the computer!

Here are some performance results from a GeForce 8800GTX at 1600x1200, 4xMSAA with the view from the first screenshot below. As you can see, the crossover beyond which SAVSM becomes more efficient than PCF is as low as 2x2! Note that at shallow angles (like the second screenshot below), SAVSM is always faster by a large factor since many pixels will require a large filter region, due to derivatives alone. Indeed even in the best (and rare) case for PCF where the entire shadow map is magnified for every pixel, the cross-over is still only at 3x3.

For those of you who don't meet the requirements - or are just too lazy to download the demo - some screenshots follow. Click the image for the high-resolution, uncompressed version.

Car:


Car (shallow angle - note the nice filtering):


Commando:


Commando again:


Spheres (hard):


Spheres (soft):


For those of you still reading, here are a few other notes and future work:

  • One big problem with SAVSM is numeric stability, since both summed-area tables and variance shadow maps eat precision for breakfast. However the error is unbiased, meaning that simply increasing the minimum filter width ("softness") will get rid of it. Once doubles are supported on GPUs, there won't be an issue, but currently large shadow maps can cause numeric trouble. The demo uses several methods to greatly improve precision, and there are plenty more ways. In any case smaller shadow maps work extremely well and look great when filtered properly.
  • Enabling multi-sampling while rendering the variance shadow map works fairly well, but it becomes insignificant once the "Softness" is even a few notches up. It also hurts numeric stability a bit and is thus disabled in the current demo. It was a good idea, but summed-area tables are better.
  • I've played a bit with combining SAVSMs with "Percentage Closer Soft Shadows" and the results are quite promising (constant-time, efficient plausible soft shadows!). However there are quite a few details and boundary conditions to sort out in order to make it robust... I simply do not have time right now. Hopefully someone will get the time to combine this technique with PCSS, or one of the more recent rear-projection algorithms.
  • DirectX 10 should improve the speed of summed-area table generation quite a bit (although it is already pretty fast at <2ms for a 512x512 shadow map on an 8800GTX). Once my new machine with Vista arrives in a few weeks I'm going to port the demo, and I'll post the results if there are significant changes.

Anyways sorry for the long post... I got rambling. Please feel free to ask any questions if I didn't make anything clear. Rest assured if my article is accepted into GPU Gems 3 I will cover variance shadow maps, summed-area tables and all of the details that I've only hinted at here in depth.

Enjoy!
Andrew Lauritzen
University of Waterloo / RapidMind Inc.
 
Firstly, works just fine on Vista (8800 GTX, 100.54, x64), sweet!

Does filtering the result of the Cheb. inequality with the threshold ultimately mean the shadow map lookup is mathematically less accurate than before, even though light bleeding is reduced? (Apologies if that's completely n00b-worthy, this is the first time I've thought about VSM properly).

Where's the speedup with D3D10 come from, if you want to say?

And Hubert still hasn't let you know if you're in Gems 3 or not? :oops:
 
Nice stuff, particularly the low-angle car.

I can't run the demo: why aren't there any shadows cast onto the spheres in the soft-shadowed screenshot?

Jawed
 
I can't run the demo: why aren't there any shadows cast onto the spheres in the soft-shadowed screenshot?
I'm guessing the test for whether the pixel is in shadow fails because of the threshold applied to linstep, when it's that soft.
 
I'm guessing the test for whether the pixel is in shadow fails because of the threshold applied to linstep, when it's that soft.
Oh, I was assuming that shadows were softening with distance from occluder to receiver - and with the relatively close spacing of the spheres the shadows would be relatively hard.

Jawed
 
Does filtering the result of the Cheb. inequality with the threshold ultimately mean the shadow map lookup is mathematically less accurate than before, even though light bleeding is reduced?
It just means that it's no longer an upper bound. Thus it can "over-darken" in places. However because of the specific modification that we're making, it'll never darken regions that were completely in light before, so this is often less objectionable than light bleeding in practice. I personally think it's a pretty cool solution especially considering that it's essentially free :)

Where's the speedup with D3D10 come from, if you want to say?
Generating the summed area table takes log(n) passes in each dimension. D3D10 should have significantly less pass overhead, specifically when getting down to the last few pixels and thus speed should be improved. With reduced pass overhead it's also possible to decrease the radix of the algorithm (it is currently 4 as that is fastest on current hardware, but the ideal complexity of the algorithm occurs at 2).
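For concreteness, here's the pass arithmetic behind that, sketched under the assumption of ceil(log_radix(n)) passes per dimension (my own sketch; the demo's actual pass structure may differ):

```python
def passes_per_dim(n, radix):
    # Each pass multiplies the span already summed by `radix`.
    passes, reach = 0, 1
    while reach < n:
        reach *= radix
        passes += 1
    return passes

def sat_passes(n, radix):
    # Horizontal sweep, then vertical sweep.
    return 2 * passes_per_dim(n, radix)

# For a 512x512 map: radix 4 needs 10 passes vs. 18 for radix 2, so radix 4
# wins when pass overhead is high, even though radix 2 does fewer total ops.
```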

And Hubert still hasn't let you know if you're in Gems 3 or not? :oops:
Yeah he tells me that they are running a bit behind schedule. Should know in the next few days though.

Jawed said:
I can't run the demo: why aren't there any shadows cast onto the spheres in the soft-shadowed screenshot?
There are, they're just so "soft" that you can't notice the contribution. This is physically accurate (consider a large light source near a small occluder - it will cast no discernible shadow) and the same thing happens with PCF, etc.

[Edit] I see you mean the "hardening" of close shadows. This isn't implemented in this demo although as I mentioned I do have a prototype using an algorithm similar to PCSS. SAVSM is a great base to work with for plausible soft shadows though. I hope to see people start to combine it with their plausible soft shadows algorithms. The current demo simply allows a constant softening amount.
 
Vista x86 here with 8800GTX, and running fine after dl'ing Dec D3D9 SDK. :smile:
 
Vista x86 here with 8800GTX, and running fine after dl'ing Dec D3D9 SDK. :smile:
Awesome, Vista should be arriving for me in a week or so. Then the porting to D3D10... I have a 64-bit version of the demo as well but it crashes in the NVIDIA driver on XP x64. I'll try again on Vista.
By the way, the D3D9 December redistributable should be fine rather than the whole SDK.

Oh by the way I forgot to mention that you can run a benchmark by setting up your camera/settings and pressing ALT-B (then wait until the UI comes back) - the results will be put into "Bench.csv". The performance results that I posted were from an 8800GTX, 1600x1200, 4xAA, default car angle. I'd be interested to know whether SAVSM is memory bandwidth, ALU or otherwise bottlenecked, as there is room to change/optimize the algorithm. I would have used NVPerfHUD to optimize, but it doesn't appear to work properly with my G80 on XP x64 :(
 
Wow, it looks like VSMs are actually more than barely feasible for general scenarios now :LOL:

I am curious though, in general how does it scale with screen resolution (esp. relative to PCF) and/or shadow map resolution? I guess what I'm asking is, how much work is involved on the shadow map lookup side, and how much on the shadow map generation side?

Also, why do you recommend 8800GTX's? Is it because you're doing the filtering on ABGR32F surfaces instead of 16F (which, iirc, the 6000 and 7000 series cards can filter)?

AndyTX said:
Jawed said:
I can't run the demo: why aren't there any shadows cast onto the spheres in the soft-shadowed screenshot?
There are, they're just so "soft" that you can't notice the contribution. This is physically accurate (consider a large light source near a small occluder - it will cast no discernible shadow) and the same thing happens with PCF, etc.
Are you sure? I'm tabbing between the two screenshots, and it looks like they're completely gone. I even did a quick comparison of some pixels very near the shadow-receiving edges on one of the spheres that would be affected by the broadening/softening of the shadow, and there was no difference, despite the fact that there should be one.
 
Wow, it looks like VSMs are actually more than barely feasible for general scenarios now :LOL:
Hey, they weren't that bad before ;) In any case most of this follows naturally from the original paper so I'm somewhat surprised that people didn't do it themselves.

I am curious though, in general how does it scale with screen resolution (esp. relative to PCF) and/or shadow map resolution? I guess what I'm asking is, how much work is involved on the shadow map lookup side, and how much on the shadow map generation side?
SAT generation scales linearly with the shadow map size (total number of pixels). For example, an 8800GTX has the following SAT generation times:

128x128 => 0.3ms
256x256 => 0.5ms
512x512 => 1.8ms
1024x1024 => 7.1ms

[EDIT: Not sure these timings are reliable (they may be too high)... see later posts.]

I suspect these will improve by a (significant) constant factor with D3D10. They could probably be made pretty fast on the 360 as well due to the EDRAM.

Lookup into the shadow map for SAVSM is exactly 16 texture reads (or 4 with bilinear filtering), regardless of the filter rectangle size. PCF's cost equals the number of pixels in the filter rectangle, and it also requires dynamic branching.

SAVSM will do better and better the higher the framebuffer resolution, since it is on average cheaper than PCF. In particular for typical scenes with many surfaces at shallow angles, etc. SAVSM will dominate. The only case that PCF will win in is if most of the shadow map is offscreen, in which case the shadow map projection selection/warping scheme used is very poor :)

Also, why do you recommend 8800GTX's? Is it because you're doing the filtering on ABGR32F surfaces instead of 16F (which, iirc, the 6000 and 7000 series cards can filter)?
No actually, no hardware filtering is done. The reasons why I recommend G80s (GTS/GTX/whatever) are:
  • They are by far the fastest card that I've tested with. The 1900XT and 7900GTX are less than half the speed.
  • Some of the code was written with their scalar processors in mind. Data transpositions can be avoided in several cases with a G80.
  • Secondarily, they support fp32 filtering for hardware VSM.
Still, I think the technique is quite usable on previous generation cards. I suspect the R600 will produce results comparable to the G80.

Are you sure? I'm tabbing between the two screenshots, and it looks like they're completely gone. I even did a quick comparison of some pixels very near the shadow receiving edges on one of the spheres that would be affected by the broadening/softening of the shadow and there was no difference, despite the fact taht there should be one.
To some extent the shadow will fade out as we approximate a larger and larger area light source. However VSM still provides an upper bound, and thus can certainly over-brighten with a complex occluder distribution. This is actually desirable as shadows where none should be is a very bad thing...

Also note that VSM can be mixed with PCF trivially by sampling several VSM rectangles (or texels) and combining the results. This will reintroduce the severe biasing issues of PCF though, and lessen performance. I don't think this is necessary in most cases, but it's an available option if needed.
 
Very cool stuff! :)

Now, personally, I think that the SAT generation times are too long there to make this really useful in real-world scenarios yet. For large outdoor areas, I'd likely want 3 or 4 1024x1024 textures for cascaded shadow maps, and 21-28ms generation time just for that feels a tad abusive... ;) If this could get a few times faster with D3D10 (or better, CUDA!) then this would be even more awesome!


Uttar
 
Now, personally, I think that the SAT generation times are too long there to make this really useful in real-world scenarios yet. For large outdoor areas, I'd likely want 3 or 4 1024x1024 textures for cascaded shadow maps, and 21-28ms generation time just for that feels a tad abusive... ;)
Agreed, and I'm hoping that either D3D10 or some other method will fix the ridiculous pass overhead that we're seeing here. My computer with Vista has now shipped, so I should be getting it in a few days. First up on the list is to port the demo to D3D10 :)

The SAT generation times that I quoted were for 4 components (i.e. "Distribute Precision" ON). They are slightly better for 2 components, although for higher shadow map resolutions, this sort of precision distribution will probably be desirable. My CPU is also a bit weak compared to my GPU, so the pass overhead might be too high here... I'll requote the numbers when my Core 2 arrives :)

At the same time, it should be noted that generating a SAT isn't all that much more expensive (if any) than blurring, or doing PCF for that matter. Additionally it only has to be done once, and then arbitrarily many queries of arbitrary rectangular sizes can be made. Furthermore you may not need as high resolution shadow maps when you have good filtering. Note that even 128x128 can look pretty decent in the demo.

Regarding SATs in general, I think they're quite suitable for shadows, since we want to be able to (potentially non-uniformly) blur the shadow map in addition to standard filtering. Furthermore over-blurring (at 45 degree angles) isn't that bad in shadows, and quite preferable compared to the aliasing and swimming that occurs in all PCF implementations that I've seen (albeit not this demo).

The only real remaining problem other than generation time is numeric stability. Doubles on GPUs are coming down the pipe soon enough though and even if they're slow, we only need them for a few adds and muls - then back to single precision land :)

I'll have to play with CSM a bit, but my feeling is that one should be able to get away without needing gigantic shadow maps, especially with a good projection warping scheme like TSM.

However even with the previous implementation of VSM I've gotten feedback that people were using it successfully with smaller, local lights. These are the cases that would want significant softening really, however it'd be nice to at least filter the sun correctly. Still, a small blur and mipmapping/aniso might work very well for the hard, sun shadows (especially on the G80 with its fp32 filtering). Note that the hardware VSM in this demo is barely slower than basic shadow mapping.
 
After some more thought, I don't think there's really an inefficiency here at all. In particular if you're generating 3 or 4 1024x1024 shadow maps for your scene, you're clearly not using all of those pixels directly in the framebuffer. This means one of two things:

1) Your projection warping scheme is bad. Fix it!

2) Large filter regions are being used, in which case SAT will destroy PCF, which has to redo a ton of work for every fragment in the framebuffer.

I'm not convinced that there's a faster way to get the quality that the demo is producing here. In particular, it's rare to see a shadows implementation that does proper LOD and filter region computation even though that was what Reeves et al. described in the original PCF paper. Most people just do a NxN texel neighborhood PCF kernel which is not comparable at all in quality.

As the performance results posted earlier show, PCF will almost always lose in a performance competition in all but the most theoretical (and impractical due to ugliness) of cases.
 
128x128 => 0.3ms
256x256 => 0.5ms
512x512 => 1.8ms
1024x1024 => 7.1ms

I suspect these will improve by a (significant) constant factor with D3D10. They could probably be made pretty fast on the 360 as well due to the EDRAM.
Since it's scaling with the number of pixels, doesn't that mean the pass overhead is pretty small? I thought for a NxN SAT, you have 2N passes and 2N^2 pixels. Did a quick regression in Excel, and it seems the N term is dominated by the N^2 term. Maybe the G80 drivers are already good at rendertarget changes. You're doing it one line at a time, right?

The demo is pretty neat, and I like how SAT's eliminate the need for mipmapping and hardware filtering. Still, I'm a bigger fan of your original VSMs with a separable gaussian blur (which you unfortunately didn't include). Fewer stability problems, faster, and anisotropically filtered in all directions. SAVSM is almost like software ripmapping ;)

To avoid the boxy artifacts you mentioned, what I did was first autogen the mipmaps, then make a slightly blurred version where I sampled from the N-th original VSM mipmap to generate the (N+1)-th blurred mipmap. The top level was a straight copy.

I'm stuck with I16 for now, though, so I'm wondering if I want to buy a XB360 or an 8800. My aging computer is pointing me to the latter, but Xenos is so awesome for VSMs -- 32-bit fixed point!
 
I'm stuck with I16 for now, though, so I'm wondering if I want to buy a XB360 or an 8800. My aging computer is pointing me to the latter, but Xenos is so awesome for VSMs -- 32-bit fixed point!
I thought 8800 can do 32-bit fixed point.

Jawed
 
Since it's scaling with the number of pixels, doesn't that mean the pass overhead is pretty small?
The thing that makes me believe we're still losing something significant to pass overhead is the fact that the radix 4 version is still faster than the radix 2 (by a decent margin). I'm hoping that in D3D10 I can switch to the radix 2 implementation, which has a better complexity (and fewer operations overall). Still, it isn't really scaling terribly as you note.

I thought for a NxN SAT, you have 2N passes and 2N^2 pixels. [...] You're doing it one line at a time, right?
Nope, I'm doing it with recursive doubling which takes 2*log(n) passes. The overall complexity is actually higher with this algorithm, but the significant reduction in number of passes should more than make up for that. It's tempting to try a "one by one" implementation as well, but I doubt that it'll be as efficient unless the pass overhead is *very* low (1024 passes is a lot!). Maybe it will be in D3D10, I don't know. Shouldn't be hard to prototype.
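A scalar sketch of the recursive-doubling scan at radix 2 (my own illustration, not the demo's shader code): each "pass" reads the whole array in parallel (modeled here by rebuilding the list) and adds the element `offset` positions back, so n elements need log2(n) passes per dimension.

```python
def prefix_sum_recursive_doubling(values):
    # Recursive doubling (Hillis-Steele style) inclusive prefix sum.
    data = list(values)
    offset, passes = 1, 0
    while offset < len(data):
        # One "pass": every element adds its neighbour `offset` back.
        data = [data[i] + (data[i - offset] if i >= offset else 0.0)
                for i in range(len(data))]
        offset *= 2
        passes += 1
    return data, passes
```

A higher radix would add `radix - 1` neighbours per pass instead of one, cutting the pass count at the cost of more total work.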

Still, I'm a bigger fan of your original VSMs with a separable gaussian blur (which you unfortunately didn't include). Fewer stability problems, faster, and anisotropically filtered in all directions. SAVSM is almost like software ripmapping ;)
I still like the original implementation too actually, especially for large shadow maps like the sun. Still I think SAVSM will work nicely for smaller, local lights that one wants to soften significantly. Also they will tend to have lower resolution shadow maps and thus the SAT generation pass will be quite cheap, and numeric problems won't occur (due both to smaller SAT and significant averaging).

It is interesting to note however that I don't notice much advantage in the hardware's anisotropic filtering, even at "bad" angles for the SAT. SAT always looks at least as good as hardware aniso in my experience (and this is the G80, which apparently has quite good aniso filtering), although I'd be happy to be proven wrong by a screenshot on that one. Plus one could screw with the projection to maximize the chance that the camera will view the shadow map along the SAT axes, which means amazingly awesome (and cheap) super-long anisotropic kernels :D

The advantage of SATs over mip/ripmapping is that they don't have arbitrary "fault" lines at the power-of-two coordinates, which makes them ideal for magnification and blurring. The other main reason that I implemented SATs is to facilitate per-pixel filter width selection, which makes it an ideal algorithm to use with plausible soft shadows.

I'll also consider adding the gaussian-blurred implementation back into the demo, although I'm somewhat put off by the need to hard-code kernels into HLSL... everything was so much nicer with Sh/RapidMind :) Meta-programming rules. Still, this time I'm hoping for less whining as the code is pure Direct3D with no additional dependencies.

To avoid the boxy artifacts you mentioned, what I did was first autogen the mipmaps, then make a slightly blurred version where I sampled from the N-th original VSM mipmap to generate the (N+1)-th blurred mipmap. The top level was a straight copy.
That's cool, so basically just a better mipmap generation kernel (gaussian based I assume). Does it completely get rid of the blocky artifacts? Can you enlarge arbitrarily via clamping the LOD and get as nice results as SAT? I'd love to see some screenshots/demos :) I'm really interested in other peoples' results and improvements to VSMs!

but Xenos is so awesome for VSMs -- 32-bit fixed point!
I believe the 8800 can do 32-bit integer filtering as well, but it's not exposed in DirectX 9. Yet another thing to mess with when Vista arrives. Besides, you don't want to be stuck coding in C#/XNA ;)

One disadvantage of integers that I've found though is that you need to be quite aggressive on clamping depth ranges or else you're simply wasting precision. Furthermore scenes that are not nicely distributed (in depth) will tend to work better with FP than FX. This is the standard fixed vs. float depth buffer trade-off as well, although with a linear depth metric. Floats are often nice in that they are fairly scale-independent.

Another problem with fixed point with SAVSM is that you'd have to throw out a ton of precision right off the bat to ensure that no overflow occurs in SAT generation (one does not know the shadow map values ahead of time, although one could probably do better by using the previous frame's mean/variance over the whole shadow map). Still, distributing precision would work a lot better with integers, potentially allowing a full 32 more bits to be gained. This alone warrants investigation...
 
Nope, I'm doing it with recursive doubling which takes 2*log(n) passes.
Ahh, okay. I'm fitting the wrong values then (for operation count). Still, the cost is quite high at around 4 cycles (of the whole GPU! :oops: i.e. 1200 shader cycles) per SAT pixel. Can you elaborate on what you mean by "radix n"? It seems like there are quite a few ways you can balance between pass count and pixel/fetch count.

Plus one could screw with the projection to maximize the chance that the camera will view the shadow map along the SAT axes, which means amazingly awesome (and cheap) super-long anisotropic kernels :D
I was thinking the same thing, but there are a couple of drawbacks:
- You'll only be able to align it in the centre of the screen. The left and right edges will lose the benefit completely. For a low viewing angle where high anisotropy is needed, only a few pixels would get super-long kernels.
- A view dependent shadow map alignment would probably create some swimming (well, I suppose object motion does it anyway), though VSMs should help eliminate that.

The advantage of SATs over mip/ripmapping is that they don't have arbitrary "fault" lines at the power-of-two coordinates, which makes them ideal for magnification and blurring. The other main reason that I implemented SATs is to facilitate per-pixel filter width selection, which makes it an ideal algorithm to use with plausible soft shadows.
The "fault lines" problem is why I did the blurring the way I did. I'm always sampling from a higher resolution map and rendering to a lower one, so the error is minimal between this method and a true (expensive) filter directly from the original level 0 VSM.

As for the per-pixel softness selection, that's the biggest reason I did this fancy blurring for all mipmaps: I can enlarge them without ugly mipmap blockiness. I've seen some demos do DOF using mipmaps, and needless to say it ain't pretty. :smile: I just figure out how blurry I want it, and then select the appropriate mipmap. For the most part the blockiness is gone with a single sample, and if not, a few samples from the next higher-res mipmap get rid of them.

I believe the 8800 can do 32-bit integer filtering as well, but it's not exposed in DirectX 9. Yet another thing to mess with when Vista arrives. Besides, you don't want to be stuck coding in C#/XNA ;)
If you're right, that'll make my decision easier.

Does it support MSAA of 32-bit fp or integer as well? It really helps with aliasing. I know you were saying that with SAT's MSAA during VSM rendering doesn't make much difference because of the softness, but if that is the case, then you're rendering a larger VSM than you have to!

MSAA is perfect for VSM because you get 4x the samples almost for free, especially on XB360 with its EDRAM. If you can get access to the unresolved MSAA and Z buffers, my guess is ordered grid 4xMSAA on a 512x512 VSM should look identical to a 1024x1024 VSM rendered without MSAA.

One disadvantage of integers that I've found though is that you need to be quite aggressive on clamping depth ranges or else you're simply wasting precision. Furthermore scenes that are not nicely distributed (in depth) will tend to work better with FP than FX. This is the standard fixed vs. float depth buffer trade-off as well, although with a linear depth metric. Floats are often nice in that they are fairly scale-independent.
FP may be easier for your demo scenes, but what I've found is that in real scenes with shadows coming from nearby objects (e.g. character self shadowing) and far ones (e.g. buildings), integer is much better. With FP, one of the two shadow outlines will have horrible banding in the variance term regardless of where you put depth=zero.

Another problem with fixed point with SAVSM is that you'd have to throw out a ton of precision right off the bat to ensure that no overflow occurs in SAT generation (one does not know the shadow map values ahead of time, although one could probably do better by using the previous frame's mean/variance over the whole shadow map). Still, distributing precision would work a lot better with integers, potentially allowing a full 32 more bits to be gained. This alone warrents investigation...
Yeah, even scaling by 1/1024 will only knock off 10 bits, so you're guaranteed ~22-bits precision after subtracting your SAT samples. With FP, you only have 23-bits in the mantissa to start with, and if you subtract similar numbers (as is often the case using a SAT, especially sampling on the large number sides) you're going to wind up with a lot less precision.
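A quick way to see the cancellation (the numbers below are illustrative, not from the demo), rounding through IEEE-754 single precision the way a 32-bit float SAT would:

```python
import struct

def f32(x):
    # Round a Python float (fp64) through IEEE-754 single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

# Two adjacent SAT entries far from the origin: the running sums are
# large, but their true difference (one texel's value) is small.
texel = 0.37
base = 1.0e7                 # accumulated sum near the "large" corner
a = f32(base + texel)        # sat[x]
b = f32(base)                # sat[x-1]
recovered = f32(a - b)       # what the shader reconstructs for the texel
# The fp32 ulp near 1e7 is 1.0, so base + texel rounds to base and
# recovered == 0.0: the texel's contribution is lost entirely.
```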

IMHO, integer is definitely the way to go for both SAT and VSM. The artifacts are both smaller and more consistent.

(EDIT: Whoops, 1/1024 could be inadequate for even a 64x64 SAVSM. Still think FP is worse, though.)
 
Ahh, okay. I'm fitting the wrong values then (for operation count). Still, the cost is quite high at around 4 cycles (of the whole GPU! :oops: i.e. 1200 shader cycles) per SAT pixel. Can you elaborate on what you mean by "radix n"? It seems like there are quite a few ways you can balance between pass count and pixel/fetch count.
Yeah I'm really hoping that I can bring that down, potentially with D3D10, but if you guys have ideas, that would be awesome too.

What I'm doing is pretty similar to ATI's paper here. By "radix", I'm referring to the number of texture samples that I'm taking per pass. So for example radix 2 will sample the current pixel, and one other one, add those two and write. Radix 4 will sample 3 other pixels, add all four and write and so on.

Radix 2 has the best computational complexity, but the greatest number of passes (and thus more writes as well). I've tested radix 2, 4, 8, and 16 and in all cases, radix 4 has been the fastest, although sometimes only by a hair. I believe ATI's paper quotes the same results with older hardware, so I'm assuming the API pass overhead is coming into play here as well.

[Note: the number of passes is log_base_radix(n) in each dimension, that's where I was getting the 2*log(n) passes from.]
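To make the radix idea concrete, here's a CPU sketch (Python, with my own naming) of the multi-pass scan along one dimension; each `while` iteration corresponds to one rendering pass, and a full SAT is just this scan applied first to rows and then to columns, hence the 2*log(n) passes:

```python
import math
import numpy as np

def prefix_scan_radix(a, radix):
    """Inclusive prefix sum built the way a GPU would: one full-array
    pass per iteration, each output texel summing `radix` input texels
    at strides 0, s, 2s, ..., (radix-1)*s."""
    x = np.asarray(a, dtype=np.int64).copy()
    stride, passes = 1, 0
    while stride < len(x):
        y = x.copy()                 # "render" into the other buffer
        for k in range(1, radix):
            off = k * stride
            if off < len(x):
                y[off:] += x[:-off]  # the extra fetches for this pass
        x, stride, passes = y, stride * radix, passes + 1
    return x, passes

sums, passes = prefix_scan_radix(range(256), 4)
# passes == 4 == log_4(256); radix 2 needs log_2(256) == 8 passes,
# with fewer fetches per pass but more total writes
```

So radix trades pass count against per-pass fetch count exactly as described: higher radix means fewer passes but more texture samples (and adds) per pixel in each pass.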

If pass overhead were *really* low, the "line by line" approach would be theoretically the best. Unfortunately without assuming some evil things about the way that the GPU's cache works, it would technically need some very expensive ping-ponging ;)

I was thinking the same thing, but there are a couple of drawbacks: [...]
Both true, but even "near rectangular" pixels should benefit. In any case it shouldn't be any *worse* on average than a naive projection, at least if the blurring can eliminate any swimming. I'm also assuming that *some* projection warping scheme will be used in any production shadows implementation.

The "fault lines" problem is why I did the blurring the way I did. I'm always sampling from a higher resolution map and rendering to a lower one, so the error is minimal between this method and a true (expensive) filter directly from the original level 0 VSM.
That sounds really good then. Do you use tex2Dlod or MAX_LOD (state) then to clamp the minimum filter width, or do you perhaps blur the original shadow map slightly before even generating the original mipmaps?

I'm really interested to see how this would look. Do you have screenshots/demo that you're willing to share? Would you mind if I prototype something similar (ideally with your help) in the current shadows demo (with all due credit to you of course)?

For the most part the blockiness is gone with a single sample, and if not, a few samples from the next higher-res mipmap gets rid of them.
Neat. That seems like a really good solution for high-end hardware then. Does it work well enough/look good enough to work with different per-pixel softnesses?

Does it support MSAA of 32-bit fp or integer as well? It really helps with aliasing. I know you were saying that with SAT's MSAA during VSM rendering doesn't make much difference because of the softness, but if that is the case, then you're rendering a larger VSM than you have to!
Yes, MSAA works with all formats that I've tried, which is handy. It was a bit buggy on the initial G80 driver release, but that's probably been cleaned up by now. Still, I dropped it for a few reasons:
  • It made numeric stability worse. Probably not an issue if not using SATs.
  • Really bumping up the Max LOD even one or two notches (which is the minimum softening I'd want anyways) completely hides any noticeable changes with MSAA on.
  • The improved quality was not enough to let me decrease the softness or shadow resolution and get similar results.

I'd love to get an implementation where it really makes a difference though, and I'm sure that NVIDIA would as well :)

If you can get access to the unresolved MSAA and Z buffers, my guess is ordered grid 4xMSAA on a 512x512 VSM should look identical to a 1024x1024 VSM rendered without MSAA.
It seems that way in theory to me too, but I was never able to get impressive results. It's worth noting that one also has to fight with the stupid 'always-on' MSAA gamma correction of the drivers. On NVIDIA this can be disabled globally (in the CP), but I'm not sure about ATI.

FP may be easier for your demo scenes, but what I've found is that in real scenes with shadows coming from nearby objects (e.g. character self shadowing) and far ones (e.g. buildings), integer is much better.
Hmm ok, I'm willing to accept that. In any case I'm certainly happy to support both in the demo (once I have D3D10 up and running). I am still a little concerned about wasting a lot of precision for scenes that have a foreground, background and void between them, but maybe it's not that much of an issue.

Yeah, even scaling by 1/1024 will only knock off 10 bits, so you're guaranteed ~22-bits precision after subtracting your SAT samples.
It's worse than that though... it's log(w) + log(h), so for 1024x1024 you technically have to drop 20 bits of precision! In practice it's not this bad, especially with a mean bias, but FP handles "overflow" and scaling a bit more gracefully than integer. Still, I'm certainly interested in the very promising precision distribution possible with integers (with floats, tons of precision is just wasted).
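The worst-case headroom requirement is easy to tabulate (a back-of-envelope sketch assuming 32-bit integer channels; the function name is my own):

```python
import math

def usable_depth_bits(w, h, channel_bits=32):
    # Worst case: every texel holds the maximum depth value, so the far
    # corner of the SAT can be w*h times larger than a single entry.
    headroom = math.ceil(math.log2(w)) + math.ceil(math.log2(h))
    return channel_bits - headroom

print(usable_depth_bits(1024, 1024))  # 12 (20 bits of headroom, as above)
print(usable_depth_bits(64, 64))      # 20 (needs 12 bits of headroom,
                                      # more than scaling by 1/1024 gives)
```

Even so, the guaranteed 12 bits for a 1024x1024 map is a hard floor, whereas fp32's effective precision after SAT subtraction degrades unpredictably with distance from the origin.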

With FP, you only have 23-bits in the mantissa to start with, and if you subtract similar numbers (as is often the case using a SAT, especially sampling on the large number sides) you're going to wind up with a lot less precision.
That's certainly true. One neat trick that I've not yet implemented though is to center your SAT around [0.5, 0.5], playing nice tricks with the sign bits. The advantage here is that the lowest-precision areas are now outside of a standard circular light :)
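Here's my reading of that trick in 1D (a sketch with my own naming, not anyone's actual implementation): accumulate outward from the middle of the axis so that entry magnitudes, and hence precision loss, grow toward the borders instead of toward one corner; interval sums still fall out as a difference of two table entries:

```python
import numpy as np

def centered_scan(a):
    """Prefix table with its zero at the center of the axis: s[j] - s[i]
    still equals sum(a[i:j]), but |s| is smallest near the middle."""
    n = len(a)
    c = n // 2
    s = np.zeros(n + 1, dtype=np.float64)
    for i in range(c, n):            # accumulate rightward from center
        s[i + 1] = s[i] + a[i]
    for i in range(c - 1, -1, -1):   # accumulate leftward (sums go negative)
        s[i] = s[i + 1] - a[i]
    return s

a = np.random.rand(64)
s = centered_scan(a)
# any interval sum is s[j] - s[i] == a[i:j].sum(); entries near the
# center stay small, so sampling near the middle of the map touches
# the best-conditioned part of the table
```

In 2D this would be applied per axis, with the sign handling giving four quadrants whose magnitudes all grow away from the center -- which is how the worst-precision regions end up in the corners, outside a circular light.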

In any case I'm really interested in your ideas and approach as well. I'd love to integrate some more techniques (and scenes, if these are bad examples :)) into the demo if possible, although if you have a demo up and running, that's even better! Still I'm always interested in different shadowing methods, both for my own personal projects, and to promote to others who ask :D

Thanks a lot for all of the good replies so far (everyone)!
 