Optimal way to render PSSM shadowed scene?

sebbbi

Veteran
We are using Parallel Split Shadow Maps / Cascaded Shadow Maps in our new game, and I have been pondering the best way to sample the shadow map(s). I have read about using stencil, depth bounds, etc. to speed up the process when multipassing the scene (one depth region at a time). But multipassing the scene costs performance, so I personally implemented PSSM like this:

1. I create a single depth texture that holds all three of my PSSM cascades (3072x1024)
2. I render each 1024x1024 view cone partition into the same texture (shifting the viewport or resolve position accordingly)
3. Instead of scaling the shadow map projection matrices to produce a [0,1] range like in normal shadow mapping (or when using multiple textures for PSSM), I scale the x of the view cone partitions into the [0,1/3], [1/3,2/3] and [2/3,1] ranges (see the sketch below)
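For illustration, a minimal vertex shader sketch of that x-range remapping, assuming a standard per-cascade [0,1] texture-space projection is already available (the matrix array and variable names here are illustrative, not our exact code):

Code:
for (int i = 0; i < 3; ++i)
{
    // Standard [0,1] texture-space shadow projection for cascade i
    float4 sc = mul(float4(worldPosition, 1.0), lightViewProjTex[i]);
    // Remap x into this cascade's third of the 3072x1024 atlas; offset through w
    // so the perspective divide in the pixel shader stays correct.
    sc.x = (sc.x + i * sc.w) / 3.0;
    psOut.posPSSM[i] = sc;
}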

My pixel shader is like this (pseudo code):

Code:
// Select one of the 3 cascades from the view space depth (logarithmic split scheme)
int shadowDepthIndex = (int)(saturate(log(psIn.posVS.z * 0.1) * 0.5) * 2);
// Interpolated shadow coordinates for that cascade (x already remapped to its third of the atlas)
float4 posSM = psIn.posPSSM[shadowDepthIndex];
float3 texSM = posSM.xyz / posSM.w;
// Single fetch and depth compare against the stored light space depth
float lightZ = tex2D(shadowMapSampler, texSM.xy).x;
float lightMul = (lightZ <= texSM.z);

With this setup I only have to sample one shadow map texel per rendered pixel, and there is no need for dynamic branching. I am still a bit concerned about the texture cache behaviour of this method, as the texture coordinates can jump between adjacent pixels at partition boundaries. However, a single rendered object usually lies inside a single depth partition, so most of the sampled texels come from the same cascade.

Is this a good way to implement the PSSM texture sampling?

(Currently the shadowmap is stored in an R16 (fixed point) texture, but this might change in the future as I implement support for a better soft shadow technique.)
 
Sounds like quite a nice way to implement things, possibly similar to what a few other developers are doing on the next-gen consoles. I think the degree to which the cache gets thrashed would vary greatly depending on the material you are using (and how many textures it samples).

Have you tried implementing one of the other (more standard) sampling methods and simply profiling? :) It's difficult to say which would be better without actually measuring, but my gut feeling is that sampling from one texture, albeit a large one, is always going to be better than sampling from one of three smaller textures selected by a dynamic branch. On some hardware it's possible that all three textures would actually be sampled anyway, with the results being discarded depending on the outcome of the branch (please correct me if this is not the case).

Have you considered compositing the three into a viewport-sized shadow texture (similar to the deferred CSM technique mentioned by Engel in X5)? This assumes, of course, that you can spare the memory for an additional render target, which may very well not be the case. This technique is becoming increasingly popular as well :)
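A rough sketch of how such a full-screen compositing pass might look, assuming linear depth is available and the view space position is reconstructed from a per-pixel view ray (all names, including the SelectSplit helper, are hypothetical and not from the article):

Code:
// Full-screen pass: reconstruct the view space position from the depth buffer,
// project it into the chosen cascade and write a single shadow factor.
// The main shading pass then just samples this screen-sized texture once.
float4 ShadowCompositePS(float2 uv : TEXCOORD0, float3 viewRay : TEXCOORD1) : COLOR0
{
    float  linearZ = tex2D(sceneDepthSampler, uv).x;   // assumed linear view space depth
    float3 posVS   = viewRay * linearZ;                // view space position
    int    split   = SelectSplit(posVS.z);             // hypothetical helper: compare z against split distances
    float4 posSM   = mul(float4(posVS, 1.0), shadowMatrix[split]);   // view space -> shadow texture space
    float3 texSM   = posSM.xyz / posSM.w;
    float  lit     = (tex2D(shadowMapSampler, texSM.xy).x > texSM.z);   // lit if closer than the stored occluder
    return lit.xxxx;                                    // broadcast to all channels
}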
 
Sebbi,

That's what I'm doing now. But instead of using the shadowDepthIndex to select among precomputed shadow map coords, I use the technique described in ShaderX6 in the article "Stable rendering of cascaded shadow maps", which basically notices that each split is a scale/offset from the base split. So you can just pass those scale/offset values to the pixel shader and then use the view space z distance as a mask to select among the scale/offset values. The big advantage here is that you don't need as many interpolators. Also, your shader code is the same for 2/3/4 splits, which is nice.
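In other words, something along these lines (a sketch with illustrative constant names, not the ShaderX6 listing):

Code:
// splitDistancesVS.xyz = view space far distance of splits 0..2 (3-split case)
// g_SplitScaleBias[i]  = scale (xy) and offset (zw) from the base split's texture space
// baseShadowCoord      = the single interpolated [0,1] coordinate of the base split
float3 beyond = (psIn.posVS.z > splitDistancesVS.xyz);  // 1.0 for every split plane we are past
int    split  = (int)dot(beyond, float3(1, 1, 1));      // number of planes passed = split index
float4 sb     = g_SplitScaleBias[split];                // (clamp the index if the last split doesn't cover the full view distance)
float2 tc     = baseShadowCoord.xy * sb.xy + sb.zw;     // one mad into the selected cascade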

Another issue I've run into with this method is when you want to enable mip-maps (for VSM, ESM). You get lines at each split in the scene. The mipmap selection gets screwed up when switching between partitions. So you have to manually calculate the mipmap lod and use tex2dlod to fix it.
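For reference, a sketch of that kind of manual LOD calculation (assuming a square SHADOW_MAP_SIZE cascade and derivatives taken on the continuous base-split coordinates; names are illustrative, not my actual shader):

Code:
// Derivatives of the continuous base-split coordinate, expressed in texels
float2 dx = ddx(baseTC) * SHADOW_MAP_SIZE;
float2 dy = ddy(baseTC) * SHADOW_MAP_SIZE;
// Standard isotropic LOD estimate
float lod = 0.5 * log2(max(dot(dx, dx), dot(dy, dy)));
// Explicit-LOD fetch avoids the per-quad LOD divergence at split boundaries
float2 moments = tex2Dlod(shadowMapSampler, float4(tc, 0.0, lod)).xy;   // e.g. VSM moments

(Scaling the derivatives by the cascade's scale, as discussed below, gives the correct footprint inside the finer splits.)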

I don't feel like I've optimized it fully yet, I'd be interested in what bottlenecks you end up running into.

Brian Richardson
http://bzztbomb.com/noise/
 
I use the technique described in ShaderX6 in the article "Stable rendering of cascaded shadow maps", which basically notices that each split is a scale/offset from the base split. So you can just pass those scale/offset values to the pixel shader and then use the view space z distance as a mask to select among the scale/offset values. The big advantage here is that you don't need as many interpolators. Also, your shader code is the same for 2/3/4 splits, which is nice.

I was actually thinking about implementing something very similar last week, but I had so many other things to experiment with (TSM, VSM, ESM, my own soft shadowing technique and my own TSM-esque screen space projection technique) that I totally forgot about this. I agree that this is the way to go. Fewer interpolators and less vertex shader work is always a good thing to have (and it's only one extra mad in the pixel shader).

Our ShaderX6 is still in the mail. I ordered it last week. It seems to have a lot of good articles.

Another issue I've run into with this method is when you want to enable mip-maps (for VSM, ESM). You get lines at each split in the scene. The mipmap selection gets screwed up when switching between partitions. So you have to manually calculate the mipmap lod and use tex2dlod to fix it.

We had this same problem with our old terrain renderer using large texture atlases (instead of texture arrays). This issue can be solved.
 
So you can just pass those scale/offset values to the pixel shader and then use the view space z distance as a mask to select among the scale/offset values.
Definitely use scale/offsets, as storing independent matrices for the splits is total overkill and complicates the filtering calculations. One mad is all you need to transform into any of the cascade spaces.

One improvement that you can make to the z-select scale/bias scheme is to select the most detailed cascade that the current fragment actually falls into. Since each projection is just a mad, and a few compares can tell you whether you're inside a split, this is actually pretty cheap to implement. On lower end hardware the extra math may be overkill, but on higher-end stuff math is pretty much free :) Switching from a simple z-based selection to a selection based on which region the projected shadow map sample (including derivatives, etc.) falls into in each of the cascades lowered my frame-rate by 0fps and increased quality quite a bit (on a 4870).
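A hypothetical sketch of that selection (the border padding and constant names are just for illustration):

Code:
// Walk from the coarsest candidate down to the finest; the last split whose
// projected coordinates (with a little border padding) contain the fragment wins.
int split = NUM_SPLITS - 1;
[unroll]
for (int i = NUM_SPLITS - 2; i >= 0; --i)
{
    float2 tc = baseTC * g_SplitScaleBias[i].xy + g_SplitScaleBias[i].zw;   // one mad per split
    if (all(abs(tc - 0.5) < 0.5 - BORDER))                                  // inside this cascade?
        split = i;
}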

Also of course it's probably obvious but if you're on DX10, texture arrays are the obvious choice over atlases. They solve border and mipmap issues nicely.

Another issue I've run into with this method is when you want to enable mip-maps (for VSM, ESM). You get lines at each split in the scene. The mipmap selection gets screwed up when switching between partitions. So you have to manually calculate the mipmap lod and use tex2dlod to fix it.
Yep this works, although the potentially easier way to solve it is just by taking derivatives of the original shadow map projected texture coordinates, scaling them by the cascade scale (bias is irrelevant) and using TEXD (sample with explicit derivatives). That way anisotropic filtering still works which is a huge quality improvement over simple trilinear for filterable shadow maps.
 
Yep this works, although the potentially easier way to solve it is just by taking derivatives of the original shadow map projected texture coordinates, scaling them by the cascade scale (bias is irrelevant) and using TEXD (sample with explicit derivatives). That way anisotropic filtering still works which is a huge quality improvement over simple trilinear for filterable shadow maps.

Heh, good point. I answered this before my second cup of coffee. ;) I just took the derivatives of the original shadow map coordinates and fed them to tex2dgrad. That got rid of the lines, but I'm pretty sure that is not sampling the shadowmap optimally. So I scaled the coordinates by the cascade scale, but then the lines came back. I guess it's because there's still a discontinuity. My plan was to calculate a line of best fit through the scale factors, then use the view space z and this line to figure out how much to scale the uv's by. I haven't tried it yet, and I'm not sure how well it'd work.

Would that all be unnecessary if I were using the highest detail cascade of the fragment?

Thanks for the tips!

Brian Richardson
http://bzztbomb.com/noise/
 
That got rid of the lines, but I'm pretty sure that is not sampling the shadowmap optimally.
Yes indeed, that will overestimate the filter width which will give you over-blurry shadows.

So I scaled the coordinates by the cascade scale, but then the lines came back.
Uhhh... weird... there shouldn't be any issues with that. Does this still happen even when you add a bit of extra padding around the edges of the cascades (to ensure larger filters work)? I have had that scheme up and running for a while in my code and never noticed any artifacts whatsoever. If you take derivatives before any divergent calculations/control flow they should be consistent. Double-check the scaling perhaps... note that you should be doing something like:

Code:
float4 ScaleBias = g_SplitScaleBias[Split];
// One mad moves from the base split's texture space into this cascade's space
tc = ScaleBias.xy * Orig_tc + ScaleBias.zw;
// Scale the screen-space derivatives by the same factor (the bias is irrelevant for derivatives)
dtdx = ScaleBias.xy * Orig_dtdx;
dtdy = ScaleBias.xy * Orig_dtdy;

...

// Explicit-gradient fetch keeps trilinear/anisotropic filtering consistent across splits
texShadow.SampleGrad(sampAnisotropicClamp, float3(tc, Split), dtdx, dtdy);

Obvious, I know, but I just want to make sure I'm being clear. If you're still seeing artifacts with that then I'm confused, and wondering whether it might have something to do with something other than pixel quad/derivative divergence...

One other thing: do note that if you're packing things into a texture atlas you'll need to take that scaling into account too. Just make sure you take the derivatives of your texture coordinates in their "real" atlas space (i.e. [0, 1/4] in this case) before scaling by the cascade values and you should be fine.

(Aside: I also described a totally brutal hack to make sure all of the pixels in a quad choose the same split that works fine too, but is much less nice than the derivative scaling.)
 
One improvement that you can make to the z-select scale/bias scheme is to select the most detailed cascade that the current fragment actually falls into. Since each projection is just a mad, and a few compares can tell you whether you're inside a split, this is actually pretty cheap to implement. On lower end hardware the extra math may be overkill, but on higher-end stuff math is pretty much free :) Switching from a simple z-based selection to a selection based on which region the projected shadow map sample (including derivatives, etc.) falls into in each of the cascades lowered my frame-rate by 0fps and increased quality quite a bit (on a 4870).

In my first prototype (which had only 2 shadow maps and used multiple matrices) I implemented the cascade selection like this (if not inside the first, use the second). With the scale/bias system the calculation will be usable for any number of splits. It also seems to be really efficient, as your tests show.

This is the code I have used with 2 splits:
Code:
// Project into the first (most detailed) cascade
float3 texLmap1 = psIn.posPSSM[0].xyz / psIn.posPSSM[0].w;
// Any coordinate outside [0,1) makes floor() non-zero -> fall back to the second cascade
float pssmIndex = any(floor(texLmap1));

For some reason I still dislike adding "if" statements to my shaders. I have used shader assembler for such a long time that I still do not fully trust HLSL compilers to create optimal code (especially when statements that can cause dynamic branching are used).

Also of course it's probably obvious but if you're on DX10, texture arrays are the obvious choice over atlases. They solve border and mipmap issues nicely.

The hardware we are developing for supports texture arrays. I am going to switch the system to use texture arrays when I implement VSM/ESM or a similar shadow sampling technique that requires mipmapping.

Another quick question: Do you use TSM/PSM/LISPSM or other shadow map projection altering techniques in your view cone split partitions? In my testing I get visible undersampling/oversampling at the beginning/end of each cascade (the view cone is split into 3 partitions). Or should I just split the view cone into more partitions?
 
For some reason I still dislike adding "if" statements to my shaders. I have used shader assembler for such a long time that I still do not fully trust HLSL compilers to create optimal code (especially when statements that can cause dynamic branching are used).
Hehe, then just use "[flatten]" before the if statement (and similarly "[unroll]" for loops), which forces the compiler to convert the control flow into the equivalent predication. It's a simple conversion, so there's no reason not to trust it ;) Feel free to look at the generated byte-code if you still don't believe it. That said, do profile... on DX10-class cards coherent dynamic branching is plenty fast!
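For example (variable names hypothetical):

Code:
float4 scaleBias;
// [flatten] makes the compiler evaluate both sides and pick the result with
// predication/conditional moves instead of issuing a real branch.
[flatten]
if (psIn.posVS.z > splitDistance0)
    scaleBias = g_SplitScaleBias[1];
else
    scaleBias = g_SplitScaleBias[0];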

Another quick question: Do you use TSM/PSM/LISPSM or other shadow map projection altering techniques in your view cone split partitions?
I haven't because the quality of the standard partition has been "good enough" and further warping just complicates any prefiltering (blurring for edge softening, etc) somewhat. It's not intractable by any means, but the simple solution has worked well for me so far, and with a few 1024 splits w/ 4xMSAA for each I don't really have any issues with magnification aliasing. Of course YMMV depending on your scene and so forth.
 
Yes indeed, that will overestimate the filter width which will give you over-blurry shadows.


Uhhh... weird... there shouldn't be any issues with that. Does this still happen even when you add a bit of extra padding around the edges of the cascades (to ensure larger filters work)? I have had that scheme up and running for a while in my code and never noticed any artifacts whatsoever. If you take derivatives before any divergent calculations/control flow they should be consistent. Double-check the scaling perhaps... note that you should be doing something like:

Code:
float4 ScaleBias = g_SplitScaleBias[Split];
tc = ScaleBias.xy * Orig_tc + ScaleBias.zw;
dtdx = ScaleBias.xy * Orig_dtdx;
dtdy = ScaleBias.xy * Orig_dtdy;

...

texShadow.SampleGrad(sampAnisotropicClamp, float3(tc, Split), dtdx, dtdy);

Obvious, I know, but I just want to make sure I'm being clear. If you're still seeing artifacts with that then I'm confused, and wondering whether it might have something to do with something other than pixel quad/derivative divergence...

One other thing: do note that if you're packing things into a texture atlas you'll need to take that scaling into account too. Just make sure you take the derivatives of your texture coordinates in their "real" atlas space (i.e. [0, 1/4] in this case) before scaling by the cascade values and you should be fine.

(Aside: I also described a totally brutal hack to make sure all of the pixels in a quad choose the same split that works fine too, but is much less nice than the derivative scaling.)

I'll have to try a few more things and let you know what happens. I was scaling the point and then taking the ddx/ddy of it afterwards; it looks like you scale the result of ddx/ddy instead. Thanks for giving me stuff to think about!
 
But instead of using the shadowDepthIndex to select among precomputed shadow map coords, I use the technique described in ShaderX6 in the article "Stable rendering of cascaded shadow maps", which basically notices that each split is a scale/offset from the base split. So you can just pass those scale/offset values to the pixel shader and then use the view space z distance as a mask to select among the scale/offset values. The big advantage here is that you don't need as many interpolators. Also, your shader code is the same for 2/3/4 splits, which is nice.

Got my ShaderX6 and studied the article. There are some nice optimizations there for the cascade selection (the move and scale/bias system). However, I cannot directly translate all of the optimizations to my system, as I am no longer using a depth based cascade selection (I need to calculate the pixel's texture coordinate inside each cascade shadow map to find the most detailed cascade it is included in).

I also noticed that my bounding area calculation for the cascade frustum is tighter (less wasted space) than the MEC (minimal enclosing circle) method used in the article. I calculate an optimal light space bounding rectangle for the eight cascade frustum corner vertices. At first I used the optimal up vector (tightest light space bounding box), but I settled on using the camera up vector to reduce the sub-pixel error (moving stair steps / shimmering).
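Roughly, in HLSL-style pseudocode (this is CPU-side setup in practice; the names are illustrative, and the texel snapping at the end is a common companion trick rather than something described above):

Code:
// lightView is built from the light direction with the *camera* up vector as up.
float3 lsMin = 1e30;
float3 lsMax = -1e30;
for (int i = 0; i < 8; ++i)
{
    float3 p = mul(float4(cascadeCorner[i], 1.0), lightView).xyz;   // frustum corner in light space
    lsMin = min(lsMin, p);
    lsMax = max(lsMax, p);
}
// Snapping the bounding rectangle to whole shadow map texels removes most of
// the frame-to-frame shimmering ("moving stair steps").
float2 texel = (lsMax.xy - lsMin.xy) / SHADOW_MAP_SIZE;
lsMin.xy = floor(lsMin.xy / texel) * texel;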

With three 1024x1024 shadow maps, the shadow quality is very good for the resolution (1280x720) and graphics content we use (indoors with limited vision range).
 