Layered Variance Shadow Maps

TBH I'm still a bit skeptical that GS amplification/deamplification is something that can be implemented much more efficiently in hardware than done any other way. It always seemed like a bit of an unnecessary feature to me, given the presence of so-called "transform feedback" anyway. That said, maybe they'll make it super-fast next generation, although I'd be wary of anyone devoting too much hardware to that end.
The answer will probably seem obvious once I hear it, but what do you mean by "transform feedback"?
 
Certainly GS performance is a bit weird, but even with minimal or no amplification, it's still too slow. At one point I was getting a 50% performance hit on a simple shader just by *having* a GS, even if it did nothing and just passed the triangles straight through. Now I can expect some overhead, but 50% is absurd, even for a simple shader. Perhaps drivers have improved since then, but my initial experience was not particularly positive.

The answer will probably seem obvious once I hear it, but what do you mean by "transform feedback"?
aka "Stream Out" in DX10. It's the silly name they decided to give to buffer output after the GS.
 
Certainly GS performance is a bit weird, but even with minimal or no amplification, it's still too slow. At one point I was getting a 50% performance hit on a simple shader just by *having* a GS, even if it did nothing and just passed the triangles straight through. Now I can expect some overhead, but 50% is absurd, even for a simple shader. Perhaps drivers have improved since then, but my initial experience was not particularly positive.

aka "Stream Out" in DX10. It's the silly name they decided to give to buffer output after the GS.

You can do transform feedback without a GS (previous testing on NVidia cards showed this as a faster hardware path). I was doing GL_POINT to GL_TRIANGLE conversion in the VS alone (obviously requires a 2nd draw call to draw the triangles).

Also in this GS testing, I tried two situations: (a) an "empty" VS stage which simply passed VBO data to the GS stage for full computation there, and (b) moving as many computations as possible from the GS stage to the VS stage. The operations were exactly the same, and on NVidia cards, (b) was quite a bit faster than (a). This hints that there are some limitations on how well the GS stage can be parallelized (perhaps the GS stage wasn't being SIMDed well by the driver). Transform feedback in the VS only is obviously easy to make fast because you have a fixed number of inputs and outputs which are the same for all verts (matches the design of the hardware quite well).

GS is SIMD unfriendly. The common case of having only a fraction of GS invocations with expensive divergent branching is an obvious problem. Having a variable number of outputs is an obvious problem. Not sure how the hardware handles this? It could pre-allocate the maximum number of outputs per call and later coalesce the buffer in the case of stream out (which could be expensive), or simply pre-clear the buffers and later process the degenerates and toss them (in the case without stream out), or do something really awful and use atomic operations to do proper packing (early 8-series cards didn't have atomic ops in CUDA, so I'm guessing this wasn't the case).
 
You can do transform feedback without a GS (previous testing on NVidia cards showed this as a faster hardware path). I was doing GL_POINT to GL_TRIANGLE conversion in the VS alone (obviously requires a 2nd draw call to draw the triangles).
Yes certainly, I didn't mean to imply otherwise. All I meant to say was that the presence of a stage that outputs results directly after the VS/GS, or being able to render to VBO, or similar makes the GS amplification/deamplification even less interesting to me, particularly with its current speed.

GS is SIMD unfriendly.
That's the root of my biggest beef with it. It's an operation that can't be "obviously" done faster in hardware as far as I can see, and thus I see no problem with letting the programmer choose an algorithm best suited to the data set. I would write a pack very differently if I expected to be throwing out 99% of the elements rather than if I expected to be throwing out 1%.

I'm really not sure how the hardware handles it, but I've certainly heard some things that hint that at some level they have to allocate the maximum buffer size, then pack it (or simply index it I guess, in the case of resubmitted geometry) later. This may be done at a finer granularity than the entire buffer (for instance, buffers for each SM or something), but I would be surprised if it was implemented strictly in terms of atomics or similar, which would make the "min/max hints" to the API - and even amplification limits - unnecessary.
 
I'm really not sure how the hardware handles it, but I've certainly heard some things that hint that at some level they have to allocate the maximum buffer size, then pack it (or simply index it I guess, in the case of resubmitted geometry) later. This may be done at a finer granularity than the entire buffer (for instance, buffers for each SM or something), but I would be surprised if it was implemented strictly in terms of atomics or similar, which would make the "min/max hints" to the API - and even amplification limits - unnecessary.
Yep. There's a reason you specify the max vertex count when writing a GS. ;)
 
In the beginning they'd run GS code on just one cluster, and (maybe) stop running all threads on the rest of the chip too, until the GS was done for that particular stream. So performance was shit. Not sure if that's still the case, haven't written any GS code in a while (and neither has the games industry :LOL: ).
 
Error in ESM

Hi Andrew, I'm very curious about the sentence you wrote here. From my point of view, ESM will unavoidably introduce error if the filter size is big and several occluders exist. When the parameter C is large (say >20), the shadow result easily gets over-exposed after bilinear filtering, resulting in light bleeding.

Can you explain a little more about how to use the negative warp -e^(-c*depth) in conjunction to avoid the problem? Or can you show some simple shader code so that I can test it myself? Gracious thanks!
:D

Effectively yes, but you have to remember to warp the fragment depth using e^(c*depth) as well. Furthermore as described in the paper you can also use the "negative" warp -e^(-c*depth) in conjunction to avoid some more problems (i.e. store 4 components total). [Edit] I can post some code for this if you guys want... it's neither hard nor complicated but probably easier to understand in code than otherwise.
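Roughly, the storage side looks like this (a minimal sketch rather than the paper's actual code; the 4-channel fp32 target, the names and the constants are my own assumptions):

Code:
// Write the four EVSM "moments" for one shadow map texel.
// Assumes a 4x fp32 render target and a linear light-space depth in [0,1].
static const float c_pos = 42.0;   // positive warp constant
static const float c_neg = 42.0;   // negative warp constant (can be chosen independently)

float4 EvsmDepthPS(float depth : TEXCOORD0) : COLOR0
{
    float pos = exp(c_pos * depth);     // positive warp e^(c*depth)
    float neg = -exp(-c_neg * depth);   // negative warp -e^(-c*depth)
    // Store each warped depth plus its square, i.e. two moments per warp.
    return float4(pos, pos * pos, neg, neg * neg);
}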

But you can cut that down to 4 shadow passes with GS-cloning or instancing. If you're gonna do silhouette extraction on the GPU using GS, you have to give the same benefit to the shadow map algorithms.


I'm not 100% convinced of how "efficient" it is, particularly for complex geometry. GS amplification/deamplification does involve either memory allocation, a "pack" operation, or both, and that's not cheap, even when implemented in hardware.


Much less though, which is key. Remember that even though you may need more "passes" to more render targets, the rendering itself is extremely cheap due to very few state changes (really only vertex shader and depth output). In any case it would be an interesting comparison, and I'm certainly willing to use whatever is fastest for the job! :)


Certainly true that LVSM/CSM/ESM/PCF attack *filtering*, not soft shadows. The whole "edge softening by clamping the minimum filter width" is really just a side-effect rather than the goal IMHO. This is a really important thing to remember, because if you start thinking of the edge softening as the *goal*, then it's both a physically incorrect approach and a potentially inefficient way to do it.


Well, that's *one* goal of real-time rendering. I think if you ask any game developers though they don't give a damn about "solving the rendering equation" and rightfully so. Hell, even movies spend more time fudging stuff than doing it physically correctly. Physical correctness is another tool IMHO, not the end goal.

That said, please do realize that VSM et al. are *filtering* algorithms. Ray traced shadows, shadow volumes and DCS do not address shadow filtering *at all* - thus you are forced to super-sample in screen space to avoid aliasing. Thus DCS isn't really the end goal/answer for shadows either IMHO since I think VSM shows pretty conclusively that we can do a good job on shadow filtering and avoid inefficiently super sampling the whole screen buffer. This is the same case as using texture filtering in ray tracers... technically you can handle it via screen space super sampling, but in reality it's a hell of a lot more efficient to do some prefiltering.

Let me reiterate: edge softening is a "bonus" of PCF/VSM/etc., not the goal. It's even presented that way in the original PCF paper, which I highly suggest that everyone working in shadows should read.
 
Hi Andrew, I'm very curious about the sentence you wrote here. From my point of view, ESM will unavoidably introduce error if the filter size is big and several occluders exist.

Yes and no. You can have any depth complexity and zero errors if your receiver is planar within the filtering window.
Imagine a tree casting a shadow over the ground: the shadow on the ground won't contain any error (no matter how complex the depth function is due to the thousands of leaves...), while the self-shadowing on the leaves will contain errors.

When the parameter C is large (say >20), the shadow result easily gets over-exposed after bilinear filtering, resulting in light bleeding.
These are accuracy and range issues; try log filtering (btw, c = 20 is HIGH!)


 
Hi, thanks for your quick response :)

Could you please provide the download URL for your ESM implementation? I'm a poor guy and have no money to buy the book :)

For the parameter C, my feeling is that since the Heaviside shadow comparison function is very high frequency, we usually want to use a big value (>30) to try to approximate the Heaviside curve. I strongly hope ESM can get good practical results since its cost is really low. I implemented ESM before and experienced some errors when the bilinear filter size is big. That confuses me. It would be great to see some simple shader code so I can implement and test it myself :)

Cheers





 
Hi Andrew, I'm very curious about the sentence you wrote here. From my point of view, ESM will unavoidably introduce error if the filter size is big and several occluders exist. When the parameter C is large (say >20), the shadow result easily gets over-exposed after bilinear filtering, resulting in light bleeding.
ESM works when the furthest samples in your filter kernel are the same distance from the light as the pixel you're shadowing. In nAo's example, nothing is further away than the ground, so the shadows drawn there are fine. For the leaves, even if a tiny filter weight goes to a texel from the ground, you won't get any shadow.

If you think about the Heaviside function, an exponential is a very close match until x=0. After that the error is enormous. That's why anything even slightly farther away than your shadow receiver makes everything light. A higher value of C won't help the bleeding.
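To make that concrete, the basic ESM lookup boils down to something like the following sketch (my own naming, not code from the ESM paper):

Code:
// The filtered map stores the average of exp(c * occluderDepth) over the kernel.
float EsmShadow(sampler2D expMap, float2 uv, float receiverDepth, float c)
{
    float filteredExp = tex2D(expMap, uv).x;
    // exp(c * (occluder - receiver)) approximates the step function only while
    // occluder <= receiver; any filtered-in sample beyond the receiver explodes
    // and drags the whole result toward fully lit, regardless of how big c is.
    return saturate(filteredExp * exp(-c * receiverDepth));
}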

Can you explain a little more about how to use the negative warp -e^(-c*depth) in conjunction to avoid the problem? Or can you show some simple shader code so that I can test it myself? Gracious thanks!
:D
Andy isn't really talking about ESM here. He's talking about using VSM with a positive and negative exponential warp.

VSM bleeds when the ratio of distances between occluders gets too small, i.e. if A overlaps B and B overlaps C, you get light bleeding (B's shadow on C) when B-A approaches or exceeds C-B. The exponential warp makes it so that B-A is almost always smaller, but unfortunately you run into the same problems as with ESM (A's shadow on B bleeds, because C is farther), but at least the shadow on C is okay now.

The negative warp is kind of neat. You do the opposite, so light always bleeds through B and you basically isolate the shadow of A only. Now you take the min of this shadow and the above, and all your shadows look pretty good now.
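In shader terms, a rough sketch of the lookup side might look like this (my own naming; the 0.0001 bias constant is just a guess, and the minimum variance is scaled by the squared derivative of each warp):

Code:
// Standard one-tailed Chebyshev upper bound on the lit fraction.
float ChebyshevUpperBound(float2 moments, float warpedDepth, float minVariance)
{
    float variance = max(moments.y - moments.x * moments.x, minVariance);
    float d = warpedDepth - moments.x;
    return (warpedDepth <= moments.x) ? 1.0 : variance / (variance + d * d);
}

float EvsmShadow(sampler2D evsmMap, float2 uv, float depth, float cPos, float cNeg)
{
    float4 m = tex2D(evsmMap, uv);       // (pos, pos^2, neg, neg^2)
    float posWarp =  exp( cPos * depth); // warp the receiver depth the same way
    float negWarp = -exp(-cNeg * depth);
    // Minimum variance bias, scaled by the squared derivative of each warp.
    float scalePos = 0.0001 * cPos * posWarp;
    float scaleNeg = 0.0001 * cNeg * exp(-cNeg * depth);
    float pPos = ChebyshevUpperBound(m.xy, posWarp, scalePos * scalePos);
    float pNeg = ChebyshevUpperBound(m.zw, negWarp, scaleNeg * scaleNeg);
    // Each warp gives an upper bound on visibility; the min is the tighter one.
    return min(pPos, pNeg);
}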
 
Gracious thanks for your reply! That's basically my feeling about ESM as well.

In fact, several months ago I tried quite hard to build soft shadows on top of ESM. I eventually gave it up; ESM seemed to introduce lots of artifacts even with a constant filter size for each screen pixel. The reason I chose it was its low memory cost.

We can test ESM with a simple scene of several square quads, overlapped a little and distributed evenly from near to far (relative to the light position). We should be able to see the shadow error on the middle quad. Of course, for complex stuff like trees in games you will not notice the self-shadowing problem easily, and the shadow on the ground plane is the most important thing. From the theory side, though, I somehow feel ESM is difficult to extend to fake soft shadow effects.



 
Wow, entire conversations have begun and ended before I even noticed... tsk tsk me... that's what I get for writing theses and moving out of my house ;)

Anyways I would answer something... but Mintmaster and nAo already answered everything better than I could. I'll keep an eye on the thread again though if you have any further questions!

Cheers,
Andrew
 
Hi,
I implemented VSM. It works all right, except for light bleeding. The edges are soft.
My question is how to implement EVSM... I simply replaced the depths with exp(depth); it looks like:
before:
PixelShadow()
{
Color.x = depth;
Color.y = depth * depth;
Color.z = 0;
Color.w =1;
}

PixelScene()
{
d = current_d;
mm = tex2D(g_VSM, tex);
if ( d < mm.x )
LightAmount = 1;
else
{
sigma2 = mm.y - mm.x * mm.x;
LightAmount = sigma2 / (sigma2 + (d - mm.x) * (d - mm.x));
}
}

Now:
PixelShadow()
{
Color.x = exp(depth);
Color.y = exp(2*depth);
Color.z = 0;
Color.w = 1;
}

PixelScene()
{
d = exp(current_d);
mm = tex2D(g_VSM, tex);
if ( d < mm.x )
LightAmount = 1;
else
{
sigma2 = mm.y - mm.x * mm.x;
LightAmount = sigma2 / (sigma2 + (d - mm.x) * (d - mm.x));
}
}

However, there are many artifacts: the edges are jagged, and there is shadow acne...
Help me please....

btw: the shadow map is 1024 x 1024, and there is a Gaussian filtering pass after generating the shadow maps...

Thanks..
 
Color.x = exp(depth);
That should be exp(c*depth) where c is a parameter. Use approximately 42ish if you're using fp32 VSMs. Same change goes for the warping of your fragment depth later.

Color.y = exp(2*depth);
Change this to Color.y = Color.x*Color.x

However, there are many artifacts, the edges are zigzag, and there are shadow acnes...
There should not be too many artifacts after those changes. You may get a tiny bit of acne - same with VSM without clamping the minimum variance - but it can be handled in the same way as VSM. However, note that you need to scale the minimum variance constant by the squared derivative of your depth scaling... i.e. (c*exp(c*x))^2 in this case.

Note that the above will give you exponentially warped VSMs, which will be fine for small values of "c". For larger values (like 10+ probably...) you'll want to implement the full EVSM with both warps (i.e. exp(c*x) and -exp(-c*x)) as described in the paper and answered in the above posts. The latter warp will help to avoid ESM-like multi-receiver filter region artifacts.
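Spelled out against your original snippets, the changes look roughly like this (the 0.0001 bias constant is just my guess for a starting point):

Code:
// Shadow pass (corrected):
Color.x = exp(c * depth);        // c ~= 42 for fp32
Color.y = Color.x * Color.x;

// Scene pass (corrected): warp the receiver depth the same way,
// and clamp the variance using the squared derivative of the warp.
d = exp(c * current_d);
float scale = 0.0001 * c * d;
sigma2 = max(mm.y - mm.x * mm.x, scale * scale);
LightAmount = (d <= mm.x) ? 1.0 : sigma2 / (sigma2 + (d - mm.x) * (d - mm.x));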
 
Hi, I apologize for this stupid question, my English level is a bit limited...

You recommend using something like 42.0f for c, right? (Not sure what "-ish" means.)

In that case, both warps need to be implemented according to your recommendation - did I understand correctly?
 
You recommend using something like 42.0f for c, right? (Not sure what "-ish" means.)
Yes, c = 42 approximately uses your entire 32-bit floating point range without overflow with a little "wiggle room". Feel free to fiddle with the exact value, but you certainly can't go over 44 without artifacts.
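A quick sanity check on that number, assuming standard fp32 limits: the second moment stores exp(c*depth)^2 = exp(2*c*depth) with depth in [0,1], and the largest finite float is about 3.4e38, i.e. roughly e^88.7. So you need 2*c < 88.7, which puts the ceiling right around c = 44.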

In that case, both warps need to be implemented according to your recommendation - did I understand correctly?
Actually you can use different c values for each warp if you want, they have no dependence on one another (they are just two different estimators). I don't see much of a need to though.
 
If you don't want to run out of range remember that you can always use log filtering :)
It works well even with half precision floats!
 
Actually I would like to try this with a parallel-split VSM implementation. I had a similar idea some time ago to reduce bleeding, which is in one way similar to storing the exp. I used the following method to re-project the depth [0..1] in a non-uniform manner (to "flatten" some of the range). It helped reduce huge Z differences, for instance when you had two shadows overlapping on the ground, one from a tree and the other from an aircraft that was very far away (big light bleeding).

Instead of storing the linear Z, I reproject the linear Z to:
- from 0.0 to 0.1: objects behind the camera
- from 0.1 to 0.9: objects whose Z is in the frustum range
- from 0.9 to 1.0: objects whose Z is beyond the frustum (or actually, the shadowed part of the frustum)

With that scheme the difference between the aircraft Z and the tree Z won't be that big.

Code:
// Range boundaries, all relative to the nearest caster depth.
vDepthLightSpace.Set(0.0f, fFrustumMinZ-fMinZCaster, fFrustumMaxZ-fMinZCaster, fMaxZCaster-fMinZCaster);

float AdjustDepthInterval(in float fDepth, in float4 vDepthLightSpace)
{
	float fRet;

	// Width of each sub-range: behind camera / inside frustum / beyond frustum.
	float4 vDiv = float4(vDepthLightSpace[1] - vDepthLightSpace[0],
			vDepthLightSpace[2] - vDepthLightSpace[1],
			vDepthLightSpace[3] - vDepthLightSpace[2],
			1.0f);
	// Fraction of each sub-range that fDepth has passed through, clamped to [0,1].
	float4 vDif = saturate((fDepth - vDepthLightSpace) / vDiv);
	// Weight the three fractions by 0.1 / 0.8 / 0.1 to build the remapped depth.
	fRet = dot(vDif, float4(0.1f, 0.8f, 0.1f, 0.0f));

	return fRet;
}

I guess using exp() would have the same effect. I will try.



I asked whether one needs to use both warps because of this:

Note that the above will give you exponentially warped VSMs, which will be fine for small values of "c". For larger values (like 10+ probably...) you'll want to implement the full EVSM with both warps (i.e. exp(c*x) and -exp(-c*x)) as described in the paper and answered in the above posts. The latter warp will help to avoid ESM-like multi-receiver filter region artifacts.

As 42 is greater than 10, I thought one must use both warps (I haven't read the paper, so I don't know what the second warp is for).



nAo, what is log filtering?
 
nAo, what is log filtering?
It's about filtering an exponential shadow map in log space.
You are filtering exponential values which rapidly go out of range; to avoid this issue you can filter the logarithm of these values (and then go back to exp space).

For example, if you are averaging two exponential values exp(A) and exp(B), you have:

a*exp(A) + b*exp(B) (a and b are some filter weights)

but you can rewrite the same expression as:

exp(A) * (a + b*exp(B-A)),

exp(A) * exp(log(a + b*exp(B-A))),

and:

exp(A + log(a + b*exp(B-A)))

Now your sum of exponentials is written as a single exponential; if you take the logarithm of it you can then just work on its argument:

A + log(a + b*exp(B-A))

Basically you end up filtering the argument of your exponential functions, which are just linear depth values, so you enjoy the same range you'd have with less exotic techniques. Just don't forget to go back to exp space when you use the final filtered value.
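A two-tap version of that in shader form might look like this (my naming; the map is assumed to store c*depth, i.e. the exponent itself):

Code:
// Weighted average of exp(logA) and exp(logB), computed entirely in log space.
// Only the difference of the two exponents is ever exponentiated, so the
// intermediate values stay small even with fp16 storage.
float LogFilter2(float logA, float logB, float a, float b)
{
    // Using the larger exponent as the base keeps exp() <= 1.
    float lo  = min(logA, logB);
    float hi  = max(logA, logB);
    float wHi = (logA >= logB) ? a : b;
    float wLo = (logA >= logB) ? b : a;
    return hi + log(wHi + wLo * exp(lo - hi));
}

You would use this in place of a plain weighted sum inside the blur passes, keeping everything as c*depth values, and only exponentiate the final filtered result when you do the shadow test.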

You can also find the same info here (slide 24)

Marco
 
Instead of storing the linear Z, I reproject the linear Z to:
- from 0.0 to 0.1: objects behind the camera
- from 0.1 to 0.9: objects whose Z is in the frustum range
- from 0.9 to 1.0: objects whose Z is beyond the frustum (or actually, the shadowed part of the frustum)
There's actually no need to maintain any depth ranges beyond those that you're going to query (i.e. in the camera frustum). You can safely just clamp those to 0 or 1 and remap the center range. This is actually done automatically if you do the Lloyd relaxation step of Layered VSMs in camera space rather than light space.

The problem is that objects within the frustum can still cause significant light bleeding in some instances. For these cases, layers or an exponential warp will reduce or eliminate the bleeding more uniformly.

Still, your observation is apt and even applies to normal shadow maps: when using frustum partitioning, be sure to remap the depth range of each split to the absolute minimum required range, both to maximize precision and (in the case of probabilistic methods) to improve the approximation.
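For the record, that remap is basically a one-liner per split (a sketch, using my own names for the split bounds):

Code:
// Map light-space depth so that [splitMinZ, splitMaxZ] (the range the split can
// actually query) fills [0,1]; casters outside it simply clamp to 0 or 1.
float RemapSplitDepth(float lightDepth, float splitMinZ, float splitMaxZ)
{
    return saturate((lightDepth - splitMinZ) / (splitMaxZ - splitMinZ));
}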

PS: I got the log filtering stuff working with EVSM today, Marco... just gotta fiddle with it a bit more and then see how well it works with very high C values. Results to come...
 