Summed-Area Variance Shadow Maps (Demo)

On the MSAA note, I made a series of screenshots to demonstrate the quality of MSAA at a few shadow map resolutions and softness settings (min filter widths).

Please find the (uncompressed) image here.

As can be seen from the GUI, the first row is 256x256 with "full MSAA". I'm not sure exactly how many samples as this is simply the highest "quality" setting of non-maskable multisampling in D3D9. The second row is 512x512 with no MSAA and the third row is 512x512 with full MSAA.

The columns represent minimum filter widths of 0, 1, and 2 respectively. Note that much beyond this the difference that MSAA makes is indiscernible, and so it's just a waste to enable it. Thus MSAA seems more useful for hard shadows than (uniformly) soft ones.

Also please ignore the performance numbers... I was taking screenshots in quick succession in a very small window, so they are pretty unreliable. Do note however that fp32 MSAA is not exactly "free", although it's not terribly expensive either on the G80.
 
Contact Obsidian and teach them a thing or two about shadow rendering, as their implementation in NWN2 is ass-tastic to say the least, at least in terms of performance ;)

Interesting demo by the way. If you were to target the R5XXs and the GF7s directly, do you think you could get some more oomph out of them, or are they inherently bad (well, much worse than 8800s) for your algorithm? (The question is more food for thought for me, not a suggestion to you in any way :))
 
Contact Obsidian and teach them a thing or two about shadow rendering, as their implementation in NWN2 is ass-tastic to say the least, at least in terms of performance ;)
Part of the reason that I've been doing shadows research is indeed that I've been pretty unhappy with the shadows implementations that I've seen in many games. The rest of graphics is marching merrily towards photorealism, and yet even the latest games have ugly, aliased, swimming shadows. I was really unhappy to see that even Crysis' shadows look terrible at shallow angles!

Now that's not the developers' fault - shadows are a difficult problem to solve. Thus I'm hoping that VSMs will provide people with a good tool to get relatively high-quality shadows without swallowing an absurd performance cost.

If you were to target the R5XXs and the GF7s directly, do you think you could get some more oomph out of them, or are they inherently bad (well, much worse than 8800s) for your algorithm
Yes, the demo could certainly be optimized for other hardware, although sometimes at the expense of performance on the G80. Thus rather than write different code paths, I just tended to pick the ways that were fastest on the 8800 series.

Theoretically the technique can work at similar quality all the way back to shader model 2.0 hardware, but it would probably be unacceptably slow there.
 
Nice demo AndyTX! :)

Have you considered extending this technique to do real soft shadows by adjusting the filter size dynamically depending on the distance between the occluder and the light and the distance between the receiver and the light? I'm thinking you could do a SAT lookup with a fixed filter size in the shadow map first to find a reasonable estimate of the distance to the occluder so that you don't get a sharp change in filter size when you go across an edge in the shadow map, then use that depth and the distance to the surface to compute the size of the final filter. I don't think this would add that much to the cost.
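Roughly, I'm picturing something like the following (just an illustrative C++ sketch with made-up names and parameters; the two callbacks stand in for SAT lookups at a given filter width, and the width formula is the parallel-planes estimate, which is only one possible choice):

```cpp
#include <algorithm>
#include <functional>

// Illustrative sketch of the two-step lookup described above (made-up names
// and parameters, not anything from the actual demo).
struct SoftShadowParams {
    double searchWidth;  // fixed blocker-search filter width (shadow-map texels)
    double lightSize;    // light size expressed in shadow-map texels
    double minWidth;     // minimum filter width, to avoid aliasing
    double maxWidth;     // maximum filter width, to keep the lookup well conditioned
};

double softShadow(double receiverDepth,
                  const std::function<double(double)>& sampleMeanDepth,          // width -> mean depth
                  const std::function<double(double, double)>& sampleVisibility, // width, zR -> visibility
                  const SoftShadowParams& p)
{
    // 1. Fixed-width lookup to get a smooth estimate of the occluder distance.
    double blockerDepth = sampleMeanDepth(p.searchWidth);
    if (blockerDepth >= receiverDepth)
        return 1.0;  // average depth is at/behind the receiver: treat as fully lit

    // 2. Parallel-planes penumbra estimate (as in "Percentage Closer Soft Shadows").
    double width = p.lightSize * (receiverDepth - blockerDepth) / blockerDepth;
    width = std::clamp(width, p.minWidth, p.maxWidth);

    // 3. Final constant-time SAT filter at the estimated width.
    return sampleVisibility(width, receiverDepth);
}
```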
 
I'm thinking you could do a SAT lookup with a fixed filter size in the shadow map first to find a reasonable estimate of the distance to the occluder so that you don't get a sharp change in filter size when you go across an edge in the shadow map, then use that depth and the distance to the surface to compute the size of the final filter. I don't think this would add that much to the cost.
Actually I prototyped exactly that idea, although I went a bit further and computed the blocker lookup window by rear-projecting the current fragment onto the shadow plane. Then assuming only two planes in the filter region (a huge assumption, but probably fine in general), and knowing the second plane location (the current fragment), one can compute the depth of the first plane by solving the moments equations.

Using that blocker depth and the parallel planes approximation from "Percentage Closer Soft Shadows" (Randy Fernando IIRC), one can get a new filter width to plug into the sampling algorithm, be it percentage closer filtering (as in the original paper) or SAVSM here.
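For the curious, the two-plane solve boils down to something like this (a CPU-side sketch of the math with my own variable names, not the prototype's shader code):

```cpp
#include <algorithm>

// Sketch of the two-plane blocker solve (my own notation). Assume the
// blocker-search window contains only two depths: an unknown blocker plane zB
// covering a fraction p of the window, and the current fragment at zR covering
// the rest. With the filtered moments
//   m1 = p*zB + (1 - p)*zR
//   m2 = p*zB^2 + (1 - p)*zR^2
// solving for zB gives zB = (m2 - m1*zR) / (m1 - zR).
struct BlockerEstimate { double depth; double coverage; bool found; };

BlockerEstimate estimateBlocker(double m1, double m2, double zR, double eps = 1e-6)
{
    double denom = m1 - zR;
    if (denom > -eps)                       // mean depth at/behind receiver: no blocker
        return { zR, 0.0, false };
    double zB = (m2 - m1 * zR) / denom;
    double p  = denom / (zB - zR);          // fraction of the window the blocker covers
    // zB then feeds the parallel-planes filter-width estimate mentioned above.
    return { zB, std::clamp(p, 0.0, 1.0), true };
}
```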

I only worked on this for an hour or two, but I actually got some reasonable results, and great performance! Unfortunately there were a bunch of details left to work out regarding numeric stability, border conditions and so forth, and I haven't had time to look at it since then.

It's a great idea though, and really uses the advantages of a summed area table. Besides, constant-time plausible soft shadows is enough to make anyone drool :)

Hopefully I'll be able to find some time to work out some of the details in the next little while before things get too busy at school again. If not, I'm really hoping that someone else carries on with the work (or related ideas, as Mintmaster has suggested), especially when I release the source.

Thanks for the feedback Humus. Your demos have always inspired me, so it's good to be able to give something back for a change :)
 
I'm really interested to see how this would look. Do you have screenshots/demo that you're willing to share? Would you mind if I prototype something similar (ideally with your help) in the current shadows demo (with all due credit to you of course)?
I do, but I'd prefer to polish it up a bit. I'm a bit fired up by this discussion, so maybe I'll get around to getting it ready to share in the near future.

Neat. That seems like a really good solution for high-end hardware then. Does it work well enough/look good enough to work with different per-pixel softnesses?
Per pixel softness is really tricky. It's unfortunately not as easy as finding the distance to the nearest occluder because sometimes the softness comes from objects that don't intersect the line of sight to the centre of the light source. My solution is a bit of a hack where I use a heavily blurred VSM sample to get an idea of the z distribution for nearby samples, but it didn't come out quite like I wanted it to.

Regarding MSAA, I think the reduction in swimming and edge crawling for moving objects is huge. I don't have MSAA on high precision formats with my 9800, but I've tried simulating it via supersampling and it makes a difference. Maybe the hardware resolving is somehow messing it up for you. I guess I notice it more because I'm really trying to get the dynamic softness working, so naturally there will be some areas with hard edges where MSAA really helps.

It's worse than that though... it's log(w) + log(h)
Yup, I edited my post when I realized that. Still, I'm not really convinced that FP could be better. Integer just seems like a much better fit to this application, and it really shows in theoretical calculations.
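As a quick illustration of what I mean by the theoretical side (my own back-of-the-envelope numbers, not anything measured):

```cpp
#include <cmath>
#include <cstdio>

// Back-of-the-envelope SAT numbers (illustrative only).
int main()
{
    const int w = 1024, h = 1024;  // shadow-map resolution
    const int depthBits = 16;      // bits per stored depth value

    // Recursive doubling: one pass per power of two in each dimension.
    const int passes = int(std::log2(double(w))) + int(std::log2(double(h)));

    // An exact integer SAT: the largest entry can reach (2^n - 1) * w * h,
    // so it needs about n + log2(w*h) bits, whereas fp32 always has a
    // 24-bit significand no matter how big the sums get.
    const int bitsNeeded = depthBits + int(std::ceil(std::log2(double(w) * double(h))));

    std::printf("passes: %d, integer bits needed: %d (fp32 significand: 24)\n",
                passes, bitsNeeded);
    return 0;
}
```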
 
Nice demo AndyTX! :)

I'm thinking you could do a SAT lookup with a fixed filter size in the shadow map first to find a reasonable estimate of the distance to the occluder so that you don't get a sharp change in filter size when you go across an edge in the shadow map, then use that depth and the distance to the surface to compute the size of the final filter. I don't think this would add that much to the cost.


There's so much fun to be had with occluder distance! But you have to start looking at 'deep' or layered shadow maps to capture multiple occluders. The method I've used in the past is to construct an fp depth buffer with the 3 occluder distances furthest from the light and the single (traditional) occluder distance closest to the light. You then walk through the values to determine where your fragment lies and which occluder is between it and the light. That's usually enough unless you get pathological with the scene (think sphere/cube/grid ;))


[Attached image: deepshadow.jpg]


Notice how the area occluded by the red sphere uses the sphere as its occluder rather than the bridge. The sphere surface is correctly occluded by the bridge.
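The walk itself is roughly this (an illustrative C++ sketch; the exact layer packing here is just one way to lay it out, not necessarily what I used):

```cpp
#include <array>

// Illustrative sketch of walking a 4-entry "deep" shadow texel. layer[0] is
// the traditional depth closest to the light; layer[1..3] are the three
// occluder depths furthest from the light, stored front to back.
struct DeepTexel { std::array<float, 4> layer; };

// Returns the depth of the occluder lying between the fragment and the light
// (preferring the one nearest the fragment), or -1 if the fragment is lit.
float occluderDepth(const DeepTexel& t, float fragmentDepth, float bias = 1e-3f)
{
    if (fragmentDepth <= t.layer[0] + bias)
        return -1.0f;                       // in front of everything: lit
    float best = t.layer[0];
    for (int i = 1; i < 4; ++i)
        if (t.layer[i] + bias < fragmentDepth)
            best = t.layer[i];              // deepest layer still in front of the fragment
    return best;
}
```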
 
There's so much fun to be had with occluder distance! But you have to start looking at 'deep' or layered shadow maps to capture multiple occluders.
That's true, although this itself is still an approximation! To get the proper projection happening, you really need to render the light from several points sampled on the area light source's plane.

I'm certainly not looking to do something that complicated. Really it would just be nice to get the first penumbrae reasonably plausible, and anything with a complex occluder distribution to look "decent".

Mintmaster said:
It's unfortunately not as easy as finding the distance to the nearest occluder because sometimes the softness comes from objects that don't intersect the line of sight to the centre of the light source.
Certainly, that's where the rear-projection and blocker search steps come in. Have you read the papers "Percentage Closer Soft Shadows" and "Real-Time Soft Shadow Mapping by Back Projection" (EGSR 2006 IIRC)? I'd like to try something similar to them, except instead of the costly sampling algorithms that they employ, replace that with solving the moments system to find blockers, and using VSM to sample the resulting area.

The method I've used in the past is to construct an fp depth buffer with the 3 occluder distances furthest from the light and the single (traditional) occluder distance closest to the light. You then walk through the values to determine where your fragment lies and which occluder is between it and the light.
Right, so a deep shadow map clamped to 4 distances. Seems like that would give reasonable results. I've considered depth peeling or something similar for VSM, but the problem is that a discontinuity in the *first* layer will propagate all the way down to the Nth. I haven't come up with a good way of getting subsequent occluder "planes", as I'm pretty certain that it would depend on the filter size, which is no good.

Mintmaster said:
Regarding MSAA, I think the reduction in swimming and edge crawling for moving objects is huge.
I certainly agree for small filters, but once you get to 5x5 or higher, the contribution of those few extra MSAA resolved pixels just isn't significant. With SATs at least, it seems to be cheaper to just increase the minimum filter width rather than using MSAA. With soft shadows however, I can certainly see it helping a lot. In any case, the option is there for people to play with if they desire and figure out how it works in their scenes/engines.

Mintmaster said:
Still, I'm not really convinced that FP could be better. Integer just seems like a much better fit to this application, and it really shows in theoretical calculations.
I tend to agree, and look forward to trying it when I get my DX10 capable computer :)

I'm really interested in your plausible soft shadows stuff now Mintmaster, even unpolished screenshots ;) I'm not sure that I'm going to have the time to follow through on that research, but I really did design/implement the current algorithm with that in mind, hoping that someone would do something similar. VSM seems like such a perfect fit when you can get reasonable results with O(1) lookups of arbitrarily large filters!

So go, go, go; I can't wait to see the result! :D
 
Part of the reason that I've been doing shadows research is indeed that I've been pretty unhappy with the shadows implementations that I've seen in many games. The rest of graphics is marching merrily towards photorealism, and yet even the latest games have ugly, aliased, swimming shadows. I was really unhappy to see that even Crysis' shadows look terrible at shallow angles!
I'm 100% in agreement that shadows suck in most/all games.
Not knocking what you've done, but looking at your shaders you're doing heaps of texture lookups per fragment == expensive; for a demo you have this luxury, but with a game you don't. Also, as has been mentioned, the shadows don't take the distance from caster -> receiver into consideration.
Very interesting anyway.
 
looking at your shaders you're doing heaps of texture lookups per fragment == expensive; for a demo you have this luxury, but with a game you don't
The workload necessary for this at a given resolution is 100% predictable. Research tends to aim to be usable for next-gen games, not last-gen ones, which would have been released before the research's publication anyway! ;)
 
Certainly, that's where the rear-projection and blocker search steps come in. Have you read the papers "Percentage Closer Soft Shadows" and "Real-Time Soft Shadow Mapping by Back Projection" (EGSR 2006 IIRC)? I'd like to try something similar to them, except instead of the costly sampling algorithms that they employ, replace that with solving the moments system to find blockers, and using VSM to sample the resulting area.
Actually, my first taste of soft shadows was from Akenine-Möller and Assarsson and their more recent penumbra wedge stuff. I did some work on reducing the shader length and the draw call count (they had 4 calls per silhouette edge :oops: ). The nice thing about geometry-based techniques is that they don't have the problem of overlapping geometry from the light's POV. These soft shadows are absolutely gorgeous with perfect penumbrae, and the technique is mathematically beautiful also. Going down this route I really saw the limitations of shadowmaps that arise from being single-valued. Of course, the downside of being geometry based is, well, being geometry based. ;) Silhouette finding and extrusion are expensive.

However, VSMs gave me renewed hope for shadow maps working with realistic soft shadows. That's why I've been really blown away by your work.

I certainly agree for small filters, but once you get to 5x5 or higher, the contribution of those few extra MSAA resolved pixels just isn't significant. With SATs at least, it seems to be cheaper to just increase the minimum filter width rather than using MSAA.
Yeah, but 5x5 is pretty big. Even with a 1024x1024 map, you're not going to cover much scenery with that if you want at least a little detail in your shadows. I want to see how access to the unresolved buffer pans out. Part of the reason that MSAA still gives you some blockiness with VSMs is that you can't get any mixing of samples between pixels when downsampling. It's sort of like the "fault lines" phenomenon you mentioned. I know NVidia used to let you get the unresolved buffer with StretchRect() in DX9.

VSM seems like such a perfect fit when you can get reasonable results with O(1) lookups of arbitrarily large filters!
Exactly. That's what made me so excited about your work. Remember my first PM to you? I really do believe you've optimally solved realtime shadows for all practical purposes.

Sure, 1200 shader cycles per SAT pixel and 16 lookups per screen pixel is feasible on an 8800GTX for demo scenes, but the potential for equally good results with one lookup and way faster VSM generation is what really gets me drooling.

I'll see what I can do this weekend.
 
However, VSMs gave me renewed hope for shadow maps working with realistic soft shadows. That's why I've been really blown away by your work.
Can't possibly agree more.
I'm confident there's a full family of algorithms between VSM and deep shadow maps (I consider those two ideas to be at the antipodes...) waiting to be discovered/developed.
 
looking at your shaders you're doing heaps of texture lookups per fragment == expensive; for a demo you have this luxury, but with a game you don't
Well only four texture lookups are really needed (with bilinear filtering), which is quite reasonable I think. Furthermore with deferred lighting/shadowing, occluded pixels are free and thus at least the fragment shading cost of a shadowing algorithm in a confined tech demo is comparable to its cost in an arbitrarily complex scene. Many engines are moving to this sort of deferred rendering design for this reason and others.
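For reference, the per-pixel lookup math amounts to the following (a CPU-side C++ sketch with my own names, not the demo's actual HLSL):

```cpp
#include <algorithm>

// Sketch of the SAVSM lookup math. sat[] holds inclusive 2D prefix sums of
// the two moments (z, z^2) over the shadow map.
struct Moments { double m1, m2; };

// Average moments over the texel rectangle (x0, x1] x (y0, y1] via the
// classic four-corner SAT evaluation: each corner is one texture lookup.
Moments satAverage(const Moments* sat, int width,
                   int x0, int y0, int x1, int y1)
{
    auto S = [&](int x, int y) -> Moments {
        if (x < 0 || y < 0) return { 0.0, 0.0 };   // outside the table
        return sat[y * width + x];
    };
    const double area = double(x1 - x0) * double(y1 - y0);
    const Moments a = S(x1, y1), b = S(x0, y1), c = S(x1, y0), d = S(x0, y0);
    return { (a.m1 - b.m1 - c.m1 + d.m1) / area,
             (a.m2 - b.m2 - c.m2 + d.m2) / area };
}

// Chebyshev upper bound on the fraction of the filter region that is unoccluded.
double chebyshevVisibility(const Moments& avg, double receiverDepth,
                           double minVariance = 1e-5)
{
    if (receiverDepth <= avg.m1) return 1.0;        // fully lit
    const double variance = std::max(avg.m2 - avg.m1 * avg.m1, minVariance);
    const double d = receiverDepth - avg.m1;
    return variance / (variance + d * d);
}
```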

Also note the performance results: SAVSM becomes cheaper than PCF at very low sample counts, so it's not as if games aren't already paying this price for a different technique...

Mintmaster said:
Going down this route I really saw the limitations of shadowmaps that arise from being single-valued.
I couldn't agree more, and that's why I think some hybrid VSM/deep shadow map algorithm will eventually be necessary. That or we get enough performance to just render lots of shadow maps to get area lights.

Mintmaster said:
Yeah, but 5x5 is pretty big. Even with a 1024x1024 map, you're not going to cover much scenery with that if you want at least a little detail in your shadows.
That's true, although I have it on good authority that several next gen games are using PCF filters as big as 7x7 to eliminate aliasing and swimming. Personally I'd much rather get rid of aliasing than have detailed shadows, although getting both is certainly ideal :)

Mintmaster said:
Sure, 1200 shader cycles per SAT pixel and 16 lookups per screen pixel is feasible on an 8800GTX for demo scenes, but the potential for equally good results with one lookup and way faster VSM generation is what really gets me drooling.
I'm certainly one of the first to get excited about that prospect as well! Still, I think SATs are valuable in this context due to the really good filtering results and performance, as well as the ability to easily scale up to higher-degree filtering (if box filtering isn't good enough). I'm also convinced that performance can be greatly improved for this technique - with D3D10, and potentially with just a better programmer than me ;)

I'd be really happy to see a faster implementation with equal or better quality. Until then, though, I think there's a place for SAVSM.
 
Upon further reflection, I'm not sure those SAT generation time numbers are reliable. In particular, a 1024x1024 SAT apparently takes ~7ms just to generate, and yet runs at 100Hz @ 1920x1200... that doesn't seem correct.

It's possible that my timing method isn't working as expected. It uses the DXSDK's recommended method of inserting new command tokens to sync the GPU. Does anyone know of a better way to time composite operations like this? I'd like to get an idea of the combined CPU and GPU cost.
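For reference, the pattern I'm using is roughly the following (a minimal C++ sketch of the D3D9 event-query flush approach; names and structure are illustrative, not the demo's exact code). Note it measures combined CPU + GPU wall-clock time for the submitted work, not isolated GPU time:

```cpp
#include <d3d9.h>
#include <windows.h>

// Minimal sketch of timing a block of GPU work by flushing/blocking on an
// event query before and after it (illustrative, not the demo's exact code).
double timeWorkMs(IDirect3DDevice9* device, void (*issueWork)(IDirect3DDevice9*))
{
    IDirect3DQuery9* query = nullptr;
    if (FAILED(device->CreateQuery(D3DQUERYTYPE_EVENT, &query)))
        return -1.0;

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    // Drain any previously queued work so it doesn't pollute the measurement.
    query->Issue(D3DISSUE_END);
    while (query->GetData(nullptr, 0, D3DGETDATA_FLUSH) == S_FALSE) {}

    QueryPerformanceCounter(&start);
    issueWork(device);                        // the draw calls being timed

    // Block until the GPU has consumed everything just submitted.
    query->Issue(D3DISSUE_END);
    while (query->GetData(nullptr, 0, D3DGETDATA_FLUSH) == S_FALSE) {}
    QueryPerformanceCounter(&end);

    query->Release();
    return 1000.0 * double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
}
```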
 
Upon further reflection, I'm not sure those SAT generation time numbers are reliable. In particular, a 1024x1024 SAT apparently takes ~7ms just to generate, and yet runs at 100Hz @ 1920x1200... that doesn't seem correct.
Doesn't seem that impossible considering you're running a simple demo of a single feature with one light and there isn't much else going on... and you're rendering very basic lighting on an 8800, which suggests that the other rendering tasks are probably quite trivial. I wouldn't be surprised if the real figure is only slightly below what you're seeing. Certainly 7 ms is way too slow for a game, but for a simple demo, running that fast sounds plausible. If it were running in a real game environment in a full-blown game, I'd certainly expect less than 5 fps.
 
Doesn't seem that impossible considering you're running a simple demo of a single feature with one light and there isn't much else going on...
The thing that makes it implausible is that that's *only* the SAT generation time, not including sampling. Sampling should be the most expensive part, especially @ 1920x1200. Of course everything but the shadowing is trivial (as it should be in a shadows demo!), but the SAT generation is only a small part of the shadowing cost, and thus I'm not sure that I trust the timings. I'll do a few sanity checks when I get home on Sunday/Monday.

If it were running in a real game environment in a full-blown game, I'd certainly expect less than 5 fps.
I'm pretty sure that SAVSM could be made to run fast enough in a full-blown engine, at least on G80 class hardware.
 
The thing that makes it implausible is that that's *only* the SAT generation time, not including sampling. Sampling should be the most expensive part, especially @ 1920x1200.
Is the sampling really that expensive though? Can't G80 properly hide an fp32 texture fetch, whereas G7x couldn't, as long as you have some non-dependent math?

I know this doesn't solve the conundrum, and I'm not qualified to discuss this stuff, but I was under the impression that fp32 texels work much better on G80 than prior NVidia GPUs...

Jawed
 
The thing that makes it implausible is that that's *only* the SAT generation time, not including sampling. Sampling should be the most expensive part, especially @ 1920x1200. Of course everything but the shadowing is trivial (as it should be in a shadows demo!), but the SAT generation is only a small part of the shadowing cost, and thus I'm not sure that I trust the timings. I'll do a few sanity checks when I get home on Sunday/Monday.
Hmmm... that does sound wrong. But bizarre results using Microsoft's "suggestions" and QueryPerformance****() routines seem about par for the course. Even otherwise, texture fetches aren't the most expensive thing. Granted, I've gotten used to seeing shaders that just slam through dozens of fetches without any major problem, but since you're talking about a different scale of problems...

I'm pretty sure that SAVSM could be made to run fast enough in a full-blown engine, at least on G80 class hardware.
I'm sure of that too, not that G80 class hardware is the hardware of concern for me. Personally, I'm thinking a lot could be done faster by just letting the CPU handle it.
 
Is the sampling really that expensive though? Can't G80 properly hide an fp32 texture fetch, whereas G7x couldn't, as long as you have some non-dependent math?
Sampling *is* expensive, which is why SAVSM beats PCF in a lot of cases. In particular, the current implementation does 16 4xfp32 texture lookups... at 1920x1200 that's gotta be significant.

ShootMyMonkey said:
Even otherwise, texture fetches aren't the most expensive thing.
In which case, I'm seeing *lots* of CPU overhead and SAT generation should be able to get a *lot* faster :)
 
Sampling *is* expensive, which is why SAVSM beats PCF in a lot of cases. In particular, the current implementation does 16 4xfp32 texture lookups... at 1920x1200 that's gotta be significant.

In which case, I'm seeing *lots* of CPU overhead and SAT generation should be able to get a *lot* faster :)
Out of curiosity, are those 16 lookups generally dominating your shader? I didn't really download anything myself since I don't have any hardware at home that could possibly run it (may have to try at work unless you have an NV20 codepath :p ), so I wouldn't have been able to check. I know I mentioned that I've gotten used to shaders that just eat through loads of lookups (far more than 16 when you include shadows), and that's on G70-class hardware. But the main thing is they pale in comparison to the number of computational instructions, so there is plenty of non-dependent stuff in there to cover up all the latency. On the other hand, things like a post-process pass are comparatively less efficient (though not costly) because there are just 2 or 3 texture reads and 2 or 3 computational instructions.

If your case is closer to the latter than the former, you could probably throw on some complexity at no cost.
 