more memory efficient deferred rendering idea?

Graham

Hello :-)
It's been a while since I've posted here, too much work... but I thought I'd share this idea I had the other night


I have not had the time to test this out, but at least in my head it works.


The last time I implemented deferred rendering, I had three render targets using MRT, one for normals, one for material colour, one for position. Because of the hardware limits and position requirements, they were all FP16 render targets. AFAIK this is the 'standard' way to do DR.

Now, I was thinking: if you don't need the position, then you don't need to use FP16. Neither normals nor colours need high precision.
I was also thinking that getting around not having position is actually much easier than I thought it would be. Simply use a depth surface texture for the Z-buffer. Then when you need the position, take the pixel's on-screen position plus the read-back depth, multiply by the inverse of the original projection/view matrix, and you get the original position.
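For what it's worth, a minimal HLSL sketch of that reconstruction, assuming a full-screen pass that gets the pixel's clip-space XY from the vertex shader; InvViewProj and DepthMap are just illustrative names:

Code:
// Rebuild world-space position from the depth texture.
// clipXY is the pixel's clip-space position (x,y in [-1,1]),
// DepthMap holds the stored Z/W depth in [0,1], and
// InvViewProj is the inverse of the combined view * projection matrix.
float4x4 InvViewProj;
sampler  DepthMap;

float3 ReconstructPosition(float2 clipXY, float2 uv)
{
    float  depth    = tex2D(DepthMap, uv).r;
    float4 clipPos  = float4(clipXY, depth, 1.0);
    float4 worldPos = mul(clipPos, InvViewProj);
    return worldPos.xyz / worldPos.w;   // undo the perspective divide
}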

Does this make sense? At least in my mind you would go from 3 FP16 render targets to 2 int8 targets without any significant side effects...?

As I say, I have yet to actually implement it. Anyone see any major problems with this?
 
Graham said:
Now, I was thinking: if you don't need the position, then you don't need to use FP16. Neither normals nor colours need high precision.

I was also thinking that getting around not having position is actually much easier than I thought it would be. Simply use a depth surface texture for the Z-buffer. Then when you need the position, take the pixel's on-screen position plus the read-back depth, multiply by the inverse of the original projection/view matrix, and you get the original position.

Does this make sense? At least in my mind you would go from 3 FP16 render targets to 2 int8 targets without any significant side effects...?

As I say, I have yet to actually implement it. Anyone see any major problems with this?
Well, the normal render target should probably be two int16's, but int8 would work as well.

Also, because of how the Z-buffer distributes its precision (most of it ends up close to the camera), it's possible that you won't get enough precision at far distances. Hence, you might want to stick with three RTs, but they can all be 32 bits/pixel.

Aside from that, it may be restrictive to have only the diffuse color per surface. Some engines treat specular color separately. In general, there are other fields (shininess, ID tag) that you might need in a real game but could get away without in a demo.

But it should work, yes. The question is over how acceptable the result would be.



My idea, which has likely been thought of and tossed aside by smarter people, is to render depth and normal (and possibly, but not necessarily, more) to MRT and then accumulate each light's contribution to each pixel. The idea is similar to deferred shading, but would sum raw light contribution rather than percentages of final pixel color. You'd still get good batching, you could still support a ton of dynamic lights and you could still support hundreds of individualized pixel shaders.

But as I said, it's probably already been thrown out. :p
 
I originally was thinking the depth precision might be a problem: it's 24-bit and you lose some accuracy in the matrix multiplication. However, I guessed you'd still end up with better depth accuracy than a 16-bit FP value in a render target. But the precision bias towards close objects I had not thought of. I'll need to test this when I get time. My previous attempts have not shown any obvious problems. All I guess you would see would be less accurate light z-falloff in the very far distance?

As for storing the normal in higher precision, you could always store the normal X/Y in RG/BA, as you can safely assume the Z component will face the viewer. Afaik there are some shader assembly commands to do this, but no HLSL commands. Although you could just use fmod() to get the two components I guess.

Material index in the alpha of the diffuse map? Of course you can just use another map. I was just using my original DR implementation as an example (I wasn't using any specular calculations at all - it wasn't appropriate for the environments being rendered).

That's the other question: how well supported are 24-bit depth surface textures? From memory, Radeon X800s and lower don't support these?

I actually did just accumulate lighting information and then, as a final step, multiplied by the colour texture (using alpha as an overbrighting factor for lights, lasers, whatnot). Worked very well.
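For reference, a minimal sketch of that combine pass; LightAccum, ColourMap and OverbrightScale are made-up names, and the exact overbright formula is just a guess at the idea:

Code:
sampler LightAccum;      // accumulated lighting
sampler ColourMap;       // material colour, alpha = overbright factor
float   OverbrightScale;

float4 FinalCombine(float2 uv : TEXCOORD0) : COLOR0
{
    float4 light  = tex2D(LightAccum, uv);
    float4 colour = tex2D(ColourMap, uv);
    // alpha boosts the result for lights, lasers, etc.
    float3 result = light.rgb * colour.rgb * (1.0 + colour.a * OverbrightScale);
    return float4(result, 1.0);
}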
 
Graham said:
As for storing the normal in higher precision, you could always store the normal X/Y in RG/BA, as you can safely assume the Z component will face the viewer. Afaik there are some shader assembly commands to do this, but no HLSL commands. Although you could just use fmod() to get the two components I guess.
Since the normal will be of length 1 and normal.z will be positive, why can't you just store normal.xy? Then decoding would be:

normal.z = 1/( rsqrt(1 - dot(normal.xy, normal.xy)) )

Which should compile to 4 asm instructions, I would think. If GPUs load 0 into unused channels, you could remove the ".xy" inside the dot product and save a cycle.
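In HLSL the decode is basically this (a sketch; NormalMap is a placeholder name, and the [0,1] to [-1,1] remap assumes an int8 target):

Code:
sampler NormalMap;

float3 DecodeNormal(float2 uv)
{
    float2 nxy = tex2D(NormalMap, uv).xy * 2.0 - 1.0;  // int8 [0,1] -> [-1,1]
    float  nz  = sqrt(saturate(1.0 - dot(nxy, nxy)));  // z always faces the viewer
    return float3(nxy, nz);
}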

I actually did just accumulate lighting information and then, as a final step, multiplied by the colour texture (using alpha as an overbrighting factor for lights, lasers, whatnot). Worked very well.
Hmm... surely there must be some downside to it if no one's using it, right? I mean, it can't have occurred to just us.
 
assen said:
How will you calculate specular? You need position to calculate the eye vector.
Reread the OP. You can easily compute position if you know the screen XY position, the projection matrix, the viewport transform and have access to the Z-buffer. If you only want to know the direction of the eye vector (you only need direction, not length, for specular), you can even skip the Z-buffer part.
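A sketch of the direction-only case, to illustrate why the Z-buffer isn't needed; InvViewProj and CameraPos are made-up names:

Code:
float4x4 InvViewProj;
float3   CameraPos;

// Every depth along a pixel's view ray shares the same direction,
// so any point on the ray will do - here the far plane (z = 1).
float3 EyeDirection(float2 clipXY)
{
    float4 farPos = mul(float4(clipXY, 1.0, 1.0), InvViewProj);
    return normalize(CameraPos - farPos.xyz / farPos.w);
}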
 
assen said:
How will you calculate specular? You need position to calculate the eye vector.
Specular would be a problem, but not because of position. Position can be generated easily. If you only have the surface's normal and position, though, you can't tell how shiny it is, and therefore adding diffuse + specular light to one buffer would not work. Surfaces respond to those two colors independently. Writing diffuse incoming light to one RT while writing specular incoming light to another RT might work. The specular light would have to be accumulated before the exponent is applied, but the system should hold together.
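As a rough sketch of what that per-light accumulation pass might output, with all names invented and the specular term left pre-exponent as described:

Code:
struct PS_OUT
{
    float4 Diffuse  : COLOR0;   // incoming diffuse light
    float4 Specular : COLOR1;   // incoming specular light, pre-exponent
};

PS_OUT AccumulateLight(float3 N, float3 L, float3 H,
                       float3 lightColour, float atten, float shadow)
{
    float3 incoming = lightColour * atten * shadow;
    PS_OUT o;
    o.Diffuse  = float4(incoming * saturate(dot(N, L)), 0.0);
    o.Specular = float4(incoming * saturate(dot(N, H)), 0.0);
    return o;
}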


I just realized that Epic is doing a sliver of this in UE3.0. They're just deferring the shadows, if I'm right.
 
I read that paper and I thought it missed a prime batching opportunity. It has to set scissor state and make a Draw( ) call for every light. Why not just render the 3D bounding geometry of the light and use the vertex shaders for something? You could render more than one light using geometry instancing or the like. Seems to me you'd get less overdraw that way, and if you had a shadowing light, you might render the bounding geometry with stencil first to determine which pixels are actually close to the light.
 
Mate Kovacs said:
It definitely does. Here's a tutorial using the very same method.

Awesome, so it does work. :)

Although I haven't seen many, all actual implementations of DR I've seen in games/tech demos have appeared to use what I'd done in the past, i.e. multiple FP16 targets. So it's nice to know it can be done better.
If only I'd known this time last year.

Since the normal will be of length 1 and normal.z will be positive, why can't you just store normal.xy? Then decoding would be:

normal.z = 1/( rsqrt(1 - dot(normal.xy, normal.xy)) )

Which should compile to 4 asm instructions, I would think. If GPUs load 0 into unused channels, you could remove the ".xy" inside the dot product and save a cycle.

Yes, but to get higher than int8 precision I was suggesting using RG for X and BA for Y, so effectively int16. I'm pretty sure there are byte-packing asm commands (at least NVidia has some, I think) but the same could be done with fmod.
Unless you can actually use an int16 RG target with int8 RGBA targets? I thought most hardware did not allow this...
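A rough sketch of that fmod-style packing into an int8 RGBA target, assuming the normal's x/y are already remapped to [0, 1]:

Code:
// Pack one [0,1] value into two 8-bit channels (high + low byte).
float2 PackToTwoBytes(float v)
{
    float high = floor(v * 255.0) / 255.0;  // coarse part, exact in 8 bits
    float low  = frac(v * 255.0);           // remainder for the second channel
    return float2(high, low);
}

float UnpackFromTwoBytes(float2 p)
{
    return p.x + p.y / 255.0;               // ~16-bit effective precision
}

// Normal x -> RG, normal y -> BA.
float4 PackNormalXY(float2 nxy)
{
    return float4(PackToTwoBytes(nxy.x), PackToTwoBytes(nxy.y));
}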
 
Graham said:
Yes, but to get higher than int8 precision I was suggesting using RG for X and BA for Y, so effectively int16. I'm pretty sure there are byte-packing asm commands (at least NVidia has some, I think) but the same could be done with fmod.
Unless you can actually use an int16 RG target with int8 RGBA targets? I thought most hardware did not allow this...
You could indeed pack into an RGBA render target, or you could use a G16R16 or G16R16F render target. Given that you're only storing two values, I'd go with the latter (if possible).
 
Inane_Dork said:
I read that paper and I thought it missed a prime batching opportunity. It has to set scissor state and make a Draw( ) call for every light. Why not just render the 3D bounding geometry of the light and use the vertex shaders for something? You could render more than one light using geometry instancing or the like. Seems to me you'd get less overdraw that way, and if you had a shadowing light, you might render the bounding geometry with stencil first to determine which pixels are actually close to the light.
Hehe, that's exactly the argument I was making to nAo in the console forum. Apparently Gears of War uses partially deferred rendering - no normal maps, and only uses the depth for determining position in a shadow mapping pass. (EDIT: Whoops, you mentioned you already knew that.)

To me, the only way you seem to get noticeable savings is by using this very idea. Draw the backfaces of a simple convex bounding volume (e.g. a low-poly geosphere), setting a stencil bit on z-fail. Then draw the frontfaces, running your shadow mapping routine only where the bit was set, and clearing the stencil also.
 
BTW, I realized my above solution for specular lighting would only work if each surface had the same (simple) specular model. So no microfacets or anisotropy unless you wanted to output to another G-buffer.


Mintmaster said:
Hehe, that's exactly the argument I was making to nAo in the console forum. Apparently Gears of War uses partially deferred rendering - no normal maps, and only uses the depth for determining position in a shadow mapping pass. (EDIT: Whoops, you mentioned you already knew that.)
Actually, thanks for bringing this up, because I had a question about it. I've read in several places that Epic argued against mandatory MSAA on X360 because their shadowing gets run in this deferred shadowing method. So, if true, why can't they do 1/N the quality of shadow mapping where 'N' is the number of AA samples per pixel? Most shadow map shaders I've seen can easily be split into smaller chunks, and reading from different parts of the shadow map per AA sample should not be too difficult.

To me, the only way you seem to get noticeable savings is by using this very idea. Draw the backfaces of a simple convex bounding volume (e.g. a low-poly geosphere), setting a stencil bit on z-fail. Then draw the frontfaces, running your shadow mapping routine only where the bit was set, and clearing the stencil also.
You might restrict pixels even more if you set the Z function to pass when you draw the frontfaces. Or you could do the two-sided stencil operations (like stencil shadowing) before drawing the frontfaces.
 
Mintmaster said:
Hehe, that's exactly the argument I was making to nAo in the console forum. Apparently Gears of War uses partially deferred rendering - no normal maps, and only uses the depth for determining position in a shadow mapping pass. (EDIT: Whoops, you mentioned you already knew that.)

So as a guess, because there are no stored normals, are they using the X/Y deltas of neighbouring depth values to approximate the surface normal? Or am I missing something?
 
Graham said:
So as a guess, because there are no stored normals, are they using the X/Y deltas of neighbouring depth values to approximate the surface normal? Or am I missing something?
I think UE3.0 is computing only the shadow amount in its deferred pass, hence, it needs just the Z buffer. How they handle multiple shadowing lights per frame, I don't know.
 
Thanks for all the replies everyone.
The various comments have certainly given me a few more ideas I want to try out, and have led me on to some new ideas of my own. All good :)
Don't expect a demo anytime soon though :p

Once I do get tests of these ideas into my engine project, then all I'll have to do is come up with a catchy name.... hmmm...
 
Graham said:
Thanks for all the replies everyone.
The various comments have certainly given me a few more ideas I want to try out, and have led me on to some new ideas of my own. All good :)
Don't expect a demo anytime soon though :p

Once I do get tests of these ideas into my engine project, then all I'll have to do is come up with a catchy name.... hmmm...

I'll consider adding it to my engine too; no ETA either.
 
Graham said:
So as a guess, because there are no stored normals, are they using the X/Y deltas of neighbouring depth values to approximate the surface normal? Or am I missing something?
I made a bit of a mistake. I didn't mean no normal maps, but rather no render target for normals.

Here's how I think it's done:
-For each light, a shadow map is rendered.
-The first pass is probably just ambient lighting, populating the Z-buffer also.
-The Z-buffer is resolved into a texture (possibly unnecessary for an uncompressed Z buffer on non-Xenos architectures)
-Using the method I described, the magnitude of incoming light (from attenuation, shadow mapping, and maybe a lightmap too) is stored in the alpha channel.
-The scene geometry is sent again, with the pixel shader calculating the lighting using normal maps; this is modulated by the value in the alpha channel and added to RGB (rough sketch after this list).
-The previous two steps are repeated for each light. It's probably useful to keep the stencil bit set until both of these steps are done, and then clear it afterwards.
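A very rough guess at what that last geometry pass might look like; the modulation by the light magnitude already sitting in destination alpha would come from the blend state (something like SRCBLEND = DESTALPHA, DESTBLEND = ONE) rather than from the shader itself, and every name below is invented:

Code:
sampler NormalMap;
float3  LightDir;        // single directional light for simplicity
float3  LightColour;

float4 LightGeometryPass(float2 uv : TEXCOORD0,
                         float3x3 tangentToWorld : TEXCOORD1) : COLOR0
{
    float3 n = tex2D(NormalMap, uv).xyz * 2.0 - 1.0;
    n = normalize(mul(n, tangentToWorld));
    float3 lit = LightColour * saturate(dot(n, -LightDir));
    return float4(lit, 0.0);  // RGB added to the frame buffer, scaled by dest alpha
}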

I still think it's more efficient to do it all in one pass with dynamic branching (DB) for limiting lights if you really want to, but nAo has repeatedly said that shadow mapping is much more efficient this way. I can see how that's the case if the stencil culling helps, but otherwise I don't see the reason for it. Deferred shadow mapping would hurt coherency if anything. :???:
 
Inane_Dork said:
Actually, thanks for bringing this up, because I had a question about it. I've read in several places that Epic argued against mandatory MSAA on X360 because their shadowing gets run in this deferred shadowing method. So, if true, why can't they do 1/N the quality of shadow mapping where 'N' is the number of AA samples per pixel? Most shadow map shaders I've seen can easily be split into smaller chunks, and reading from different parts of the shadow map per AA sample should not be too difficult.
That's a good idea, assuming they are indeed doing M shadow map samples for each pixel. However, they could be doing DB to vary the number of samples. Both NVidia and ATI are preaching this.

You'd need access to the unresolved Z-buffer, though. When you take bandwidth into account, the work per chunk isn't really 1/N. Intriguing nonetheless.

You might restrict pixels even more if you set the Z function to pass when you draw the frontfaces.
Naturally. :smile: Similar to stencil volumes, but the convex volume allows us to collapse the shadowing pass and frontface stencil pass.
 