Deferred vs Forward+ etc. *spin-off*

sebbbi

Lighting target: RGBA16f
Normals: RG16
Diffuse albedo + BRDF ID: RGBA8
Specular albedo + roughness: RGBA8
Tangents: RG16
Depth: D32

6 RTs lol. Even if they could fit, I think you'd use up virtually all of the ESRAM's bandwidth (if not more). But how much of a performance hit are we expecting when placing RTs (2 or 3 of them) in the main X1 RAM?
MJP's g-buffer layout is actually only two RTs in the g-buffer rendering stage and one RT in the lighting stage. And a depth buffer, of course. Quite normal stuff.

On GCN you want to pack your data into 64 bpp (4 x 16-bit integer) render targets, because that doubles your fill rate compared to using more traditional 32 bpp RTs (GCN can do 64-bit fills at the same ROP rate as 32-bit fills).

I assume that the packing is like this:
Gbuffer1 = normals + tangents (64 bits)
Gbuffer2 = diffuse + BRDF ID + specular + roughness (64 bits)
Depth buffer (32 bits)

Without any modifications this takes 40 megabytes of memory at 1080p: 1920 x 1080 pixels x (8 + 8 + 4) bytes ≈ 39.6 MB.

The lighting step doesn't need an extra 16 MB for the 4x16f RT, because a compute shader can simultaneously read and write the same resource, allowing you to do the lighting "in-place", writing the output over the existing g-buffer. This is also very cache friendly, since the read pulls the cache lines into L1 and the write thus never misses L1 (GCN has fully featured read & write caches).
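Something like this, as a rough CPU-side sketch of the read-modify-write pattern (the packing and lighting here are made up for illustration; the real thing is a compute shader with the g-buffer bound as a read/write UAV):

Code:
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

// Hypothetical 3 x 16-bit unorm packing, standing in for the real layout.
static float unorm16(uint64_t v, int shift) {
    return float((v >> shift) & 0xFFFF) / 65535.0f;
}

static uint64_t packUnorm16x3(Float3 c) {
    auto q = [](float f) {
        return uint64_t((f < 0.0f ? 0.0f : (f > 1.0f ? 1.0f : f)) * 65535.0f + 0.5f);
    };
    return q(c.x) | (q(c.y) << 16) | (q(c.z) << 32);
}

void lightInPlace(std::vector<uint64_t>& gbuffer) {
    for (uint64_t& texel : gbuffer) {
        // Read: pulls the cache line into L1.
        Float3 albedo = { unorm16(texel, 0), unorm16(texel, 16), unorm16(texel, 32) };
        // Trivial stand-in for the actual lighting evaluation.
        Float3 lit = { albedo.x * 0.8f, albedo.y * 0.8f, albedo.z * 0.8f };
        // Write: overwrites the same texel, so the write never misses L1.
        texel = packUnorm16x3(lit);
    }
}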

It's also trivial to get this layout down from 40 MB to 32 MB. Replace gbuffer1 with a 32 bit RT (32 MB target reached at 1080p). Store the normal as 11+11 bits using the Lambert azimuthal equal-area projection; you can't see any quality difference. 5+5 bits for the tangents is enough (4 bits for exponent = mip level + 1 bit mantissa). 11+11+5+5 = 32. Also, if you only use the tangents for shadow mapping / other planar projections, you don't need them at all, since you can analytically calculate the derivatives from the stored normal vector.
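The 11+11-bit normal part looks roughly like this (a sketch; quantization details made up for illustration, and the 5+5-bit tangents would go into the free top 10 bits):

Code:
#include <cmath>
#include <cstdint>

struct Float3 { float x, y, z; };

// Lambert azimuthal equal-area projection of a unit normal into two [0,1]
// values, quantized to 11 bits each. Breaks down only at n.z == -1 (a
// normal pointing exactly away from the camera in view space).
uint32_t encodeNormal(Float3 n) {
    float f = std::sqrt(8.0f * n.z + 8.0f);
    uint32_t qx = uint32_t((n.x / f + 0.5f) * 2047.0f + 0.5f);
    uint32_t qy = uint32_t((n.y / f + 0.5f) * 2047.0f + 0.5f);
    return qx | (qy << 11);  // low 22 bits used; top 10 left for the tangents
}

Float3 decodeNormal(uint32_t packed) {
    float fx = float(packed & 0x7FF) / 2047.0f * 4.0f - 2.0f;
    float fy = float((packed >> 11) & 0x7FF) / 2047.0f * 4.0f - 2.0f;
    float f = fx * fx + fy * fy;
    float g = std::sqrt(1.0f - f / 4.0f);
    return { fx * g, fy * g, 1.0f - f / 2.0f };
}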

This layout is highly efficient for both g-buffer rendering and lighting. And of course also for post processing, since all your heavy data fits in the fast memory. Shadow maps obviously need to be sampled from main memory during the lighting, but this is actually a good thing, since the lighting pass wouldn't otherwise use any main memory BW at all (it would be completely unused = wasted).
 

At the time we had some banding trouble in low-roughness specular when using <16-bit precision for normals, but we certainly didn't exhaust all of our options regarding precision and packing techniques. I definitely don't think that we could have gotten away with only 5 bits for tangents, though. We were using them for anisotropic specular, which means they were just as important as the normals for avoiding artifacts in specular reflections.

As for the shadows, I wonder if it would actually be more beneficial to keep the shadow maps in ESRAM instead of G-Buffers. With tiled compute you only need to read from the G-Buffer once per pass as opposed to multiple fetches for PCF with a shadow map, so they might actually get more benefit from being in high-bandwidth memory. Lots of experiments would be required for sure!
 
We were using them for anisotropic specular, which means they were just as important as the normals for avoiding artifacts in specular reflections.
In that case 5+5 bits are definitely not good enough. You could instead store the whole tangent frame (including the normal) as a normalized quaternion. That can be stored as 10+10+10+1 = 31 bits (calculate the missing component in the shader), so it would fit in the same memory footprint.
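A sketch of that packing (assuming a normalized quaternion as input; helper types made up):

Code:
#include <cmath>
#include <cstdint>

struct Quat { float x, y, z, w; };

// 10+10+10+1 bits: x/y/z quantized to 10 bits each, plus the sign of w.
uint32_t packQuat(Quat q) {
    auto q10 = [](float f) {                 // [-1,1] -> 10-bit value
        return uint32_t((f * 0.5f + 0.5f) * 1023.0f + 0.5f);
    };
    uint32_t signW = q.w < 0.0f ? 1u : 0u;
    return q10(q.x) | (q10(q.y) << 10) | (q10(q.z) << 20) | (signW << 30);
}

Quat unpackQuat(uint32_t p) {
    auto dq = [](uint32_t v) { return float(v & 1023u) / 1023.0f * 2.0f - 1.0f; };
    Quat q = { dq(p), dq(p >> 10), dq(p >> 20), 0.0f };
    float t = 1.0f - (q.x * q.x + q.y * q.y + q.z * q.z);
    q.w = std::sqrt(t > 0.0f ? t : 0.0f);    // recompute the missing component
    if ((p >> 30) & 1u) q.w = -q.w;
    return q;
}

Since q and -q describe the same rotation, you could alternatively negate the quaternion so that w >= 0 and reuse the spare bit for something else (e.g. bitangent handedness); dropping the largest component instead of always w would also improve precision a bit.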

...multiple fetches for PCF with a shadow map, so they might actually get more benefit from being in high-bandwidth memory. Lots of experiments would be required for sure!
PCF doesn't use extra main memory BW, since the extra fetches mainly come from the L1 cache (the same texels are accessed multiple times by neighboring pixels; hardly any new texels are accessed).
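For example, with a plain 3x3 PCF kernel (a CPU-style sketch), the tap footprints of horizontally adjacent pixels overlap on 6 of the 9 texels, so most taps hit data a neighbor already pulled into the cache:

Code:
#include <algorithm>

// 3x3 PCF: average of nine depth comparisons around the receiver's texel.
float pcf3x3(const float* shadowMap, int width, int height,
             int tx, int ty, float receiverDepth)
{
    float lit = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int x = std::clamp(tx + dx, 0, width - 1);
            int y = std::clamp(ty + dy, 0, height - 1);
            // Pixel (tx+1, ty) will read 6 of these same 9 texels again,
            // which is why the extra taps come from L1, not main memory.
            lit += (receiverDepth <= shadowMap[y * width + x]) ? 1.0f : 0.0f;
        }
    }
    return lit / 9.0f;
}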
 
People tend to believe that forward rendering makes MSAA really easy and efficient. This is not correct anymore. Modern rendering pipelines have a large number of post processing steps, and tone mapping must be performed last. You can't resolve the MSAA after the first rendering pass. You need to keep the subsamples around for the whole rendering pipeline: read them, modify them, and process everything at subsample precision. If you take the common shortcut of resolving the subsamples before your post processing, you will have rendering errors on object/depth edges. Also, tone mapping at the end of the HDR pipeline will pretty much remove the antialiasing from the high contrast edges, since the tone mapping curve is highly nonlinear. This shortcut isn't good enough for next generation games, but some forward rendered games (such as Forza 5) still seem to use it. It causes a huge amount of problems (and practically eliminates AA from the most important edges), so I wouldn't regard it highly. Without it, forward rendering needs to pay for the extra MSAA subsample bandwidth, just like deferred does.

Forza 5 antialiasing issues (they use MSAA, and resolve it before the post processing):
http://www.vg247.com/2013/06/28/for...n-issue-unless-developing-at-4k-says-turn-10/

“When you have a really bright sky it’s going to create a high contrast which makes lines look jagged. So, what we had with Forza 4 was a game that was running at 720p, and now with Forza 5 we have 1080p and 60fps and there’s a lot of post processing going on as well.”
 
The tone mapping non-linearity issue is pretty easily solved with some smart filtering during the resolve step. A lot of tone mapping curves are invertible (or the inverse is easily approximated), and if you make use of that you can achieve high quality results. I've even experimented with more generalized filtering kernels that produce high-quality results regardless of the tone mapping being used. Edge issues in depth of field and motion blur are a little trickier to handle, since you're relying on depth data, but it's hardly unsolvable. In most cases I would say the artifacts aren't even very noticeable to begin with, but that's more a matter of opinion and content.
 
There are also downsides to doing tone mapping in a custom resolve step right after the rendering. If you write the tone mapped colors to the render target, they are no longer linear, so you can't use hardware alpha blending anymore for transparencies or particles. You can sidestep this by rendering the transparencies to off-screen buffers and combining the results, but it will cost extra. Also, in post process shaders you need to first convert the color value back to linear space and at the end of the shader convert it back to tone mapped space. The extra ALU cost can be quite big (especially if you do your tone mapping on luminance and need to separate/rebuild the chroma every time).

Writing linear values to the render target is of course also possible (and I expect this is the way you handle it). Custom resolve: tone map all MSAA samples, average the tone mapped values, inverse tone map the average, and write it to the render target. This is less correct, but doesn't need repeated tone mapping / inverse tone mapping steps later. I suspect this looks good in most cases (especially if the post processing pipeline is simple), but when you start adding lots of atmospheric effects (fog, volumetrics, etc.) the error will get worse (as the linear assumption doesn't hold for the MSAA edge pixels). But it shouldn't be that bad, since the absolute worst case is that the object edge gets one pixel narrower or wider (and you lose antialiasing for that edge). Not a biggie, unless it happens often. I would be more worried about losing the antialiased edges because of the transparencies (and volume fog rendering). If you have lots of big flying soft fog particles around, the image will pretty much lose all antialiasing.
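As a sketch, with per-channel Reinhard (x / (1 + x)) standing in for a real invertible tone mapping curve:

Code:
// Tone map each subsample, average in tone-mapped space, then inverse tone
// map the average so the render target stays (approximately) linear.
float tonemap(float x)    { return x / (1.0f + x); }   // [0,inf) -> [0,1)
float invTonemap(float y) { return y / (1.0f - y); }   // exact inverse

float resolveChannel(const float* samples, int sampleCount) {
    float sum = 0.0f;
    for (int i = 0; i < sampleCount; ++i)
        sum += tonemap(samples[i]);
    float avg = sum / float(sampleCount);   // filtering happens in TM space
    return invTonemap(avg);                 // back to linear for later passes
}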

I don't personally like modern forward rendering techniques (Forward+ descendants) in general, because you need to do a depth pre-pass. This doubles your geometry cost. Our new (64 bpp, full fill rate) g-buffer rendering pipeline is able to render the whole g-buffer approximately as fast as a depth pre-pass alone would take (it's primitive setup bound). This kind of deferred rendering is very difficult to beat with forward techniques that need to render their geometry twice.

2xMSAA is not enough by itself. If you resolve the MSAA at the beginning of the pipeline, you don't have separated (sample precision) values anymore for the edges, and thus you cannot use this data to improve the PPAA quality. Deferred pipelines can combine MSAA and PPAA better. Pure 2xMSAA isn't enough anymore (it has too many problem cases).
 
Writing linear values to the render target is of course also possible (and I expect this is the way you handle it).

Indeed, the latter approach is what I was referring to. I don't think that there's any advantage to writing out tone-mapped values to your render target(s).

I suspect this looks good in most cases (especially if the post processing pipeline is simple), but when you start adding lots of atmospheric effects (fog, volumetrics, etc.) the error will get worse...

Sure: you get "correct" results (compared to performing the resolve after all post-processing/tone mapping) only for steps that were rendered into your MSAA render target prior to resolve. In our case our volumetrics are performed in a deferred pass at lower resolution, so we have to make sure that our upscale/composite is MSAA-aware in order to ensure that we get antialiased edges. All of our "normal" transparents are rendered at full resolution into the MSAA target before resolve, so they are handled. Really the only issue we have is depth of field, where we have a sharp foreground against an out-of-focus background. In those cases, if you're not MSAA-aware, you can easily "overwrite" your antialiased edges and re-introduce aliasing. I don't have any silver bullets for this, but I'm pretty sure that by using the MSAA depth buffer you could perform a smarter composite that doesn't completely destroy the edges.

Indeed, I don't think you can beat an optimized deferred renderer when it comes to geometric complexity, and perhaps lighting complexity as well. It's cool that you guys got your G-Buffer pass to be so efficient! In our case we were trying to develop and upgrade our shading models so quickly, and we had so much diversity, that it was just simpler to not always have to go through a G-Buffer. However, I think this is starting to get a bit off-topic from our original point regarding MSAA performance.

2xMSAA is not enough by itself.

Sure, but why would you only do plain 2xMSAA? It's trivial to add additional PPAA, and with a little more work you can get drastically improved results over a simple box filter resolve. Of course there are also the EQAA modes on GCN, which add a lot of interesting possibilities (especially when coupled with a custom resolve) even if you don't want to pay for a heavy sample footprint.

If you resolve the MSAA at the beginning of the pipeline, you don't have separated (sample precision) values anymore for the edges, and thus you cannot use this data to improve the PPAA quality.

Why can't you use the MSAA depth buffer (or any other MSAA information) as part of PPAA? That alone should be helpful for better identifying triangle edges.

Deferred pipelines can combine MSAA and PPAA better.

I don't see why that would be the case, outside of having a bit of extra data available from your G-Buffer (if you've kept it around at MSAA resolution).
 
All of our "normal" transparents are rendered at full resolution into the MSAA target before resolve, so they are handled.
Isn't this quite heavy on bandwidth if you are using an RGBA16F render target? Transparencies are often BW bound even without MSAA. RGB11F/11F/10F seems quite handy for this kind of usage. I was pretty skeptical about this format at first, because the 10 bit float formats were quite badly implemented in last generation hardware, but it seems to work fine for our physically based rendering output (just enough to keep the banding away). 16 bit floats are of course easier to work with, because you don't need to care as much about losing some precision somewhere in your pipeline.
Really the only issue we have is depth of field, where we have a sharp foreground against an out-of-focus background. In those cases, if you're not MSAA-aware, you can easily "overwrite" your antialiased edges and re-introduce aliasing. I don't have any silver bullets for this, but I'm pretty sure that by using the MSAA depth buffer you could perform a smarter composite that doesn't completely destroy the edges.
Keeping the MSAA depth buffer around is a good idea. I didn't even think about that. Solves some problems nicely.

Indeed, I don't think you can beat an optimized deferred renderer when it comes to geometric complexity, and perhaps lighting complexity as well. It's cool that you guys got your G-Buffer pass to be so efficient!
The thing I have hated most about deferred rendering is that you need to decompress your DXT data and store that data into an uncompressed buffer, and then read that uncompressed buffer later. That increases the BW cost of each texel quite a lot. We solved that problem, so I am again happy about deferred :)

On Xbox 360 we always used a 64 bit g-buffer (2x32 bit = half fill rate), and on GCN you can increase your g-buffer up to 128 bits and still keep the fill rate acceptable (2x64 bit = half fill rate). I wouldn't personally go beyond that, but many 30 fps games are even willing to go to 1/4 fill rate, and that's quite bad for many things (especially foliage and trees). I would choose forward over a very fat g-buffer any day.
In our case we were trying to develop and upgrade our shading models so quickly, and we had so much diversity, that it was just simpler to not always have to go through a G-Buffer.
Agreed. Traditional g-buffers limit your data layout (as you store your data there). If you instead store a link to your data (or can calculate it somehow from your depth + pixel location + extra data), and you use some sort of clustered deferred (binning different lighting models into different bins), you don't have these problems. Obviously there are lots of ifs and lots of content pipeline considerations before a system like this is good enough for production.
However, I think this is starting to get a bit off-topic from our original point regarding MSAA performance.
We could start a new forward vs deferred thread, since the next gen is here and things have changed. Many new techniques have been invented on both sides of the fence. Maybe other developers would want to share their opinions as well.
Of course there are also the EQAA modes on GCN, which add a lot of interesting possibilities (especially when coupled with a custom resolve) even if you don't want to pay for a heavy sample footprint.
It's unfortunate that we have had coverage samples (without backing sample data) on PC hardware for a long time, but no rendering API has given the programmer proper access to this data. If this data had been exposed, I think we would have already seen SMAA and/or other high quality PPAA algorithms use it. It's going to be interesting to see if this extra coverage data alone is enough to boost PPAA quality enough to be the preferred choice at 1080p. Pixel density is now so much higher compared to last generation games that a very good PPAA algorithm (without subsample color data, but with subsample geometry information) could actually be enough.
Why can't you use the MSAA depth buffer (or any other MSAA information) as part of PPAA? That alone should be helpful for better identifying triangle edges.
Yes, you could do that, and it would be good enough for most games. Unfortunately in our case the depth data is not enough for antialiasing. Our levels are built from a huge number of small objects. We don't have big objects built purposefully for a single location, because we are heavily focused on user created content. For example, a rock wall in our game is built from a large number of separate rocks clipping into each other. We need to antialias the clipping seams as well. Obviously on consoles you could keep the coverage data around (as it's very small) and use it to help in those cases.
I don't see why that would be the case, outside of having a bit of extra data available from your G-Buffer (if you've kept it around at MSAA resolution).
Deferred rendering pipelines with MSAA are often optimized to use sample precision only in the areas that need it. With forward rendering you don't usually need this data, so you don't need to do any extra computation (and/or pay any extra storage cost) to calculate the area masks. Then again, if you have low level access to the GPU generated MSAA structures, you get all this for free...

Slide 31 of the BF4 Mantle presentation (http://developer.amd.com/wordpress/...attlefield-4-with-Mantle-Johan-Andersson.ppsx) tells us that FMASK, CMASK and HTILE are all exposed by Mantle. I would definitely be happy if we got access to these structures in cross vendor APIs as well.
 