Forward+

Also, I don't think it's a good idea to focus on average performance like this, as you'll get some bad framerate dips when you can't cull shadows well due to the scene. At least with thousands of lights you can cap how many lights could affect each pixel through your art assets.
Artists usually want somewhere between 2 and 3 lights hitting every surface in the 3d world (to make it not look flat; the exact light count is not important, the coverage is). Of course lights that are behind a wall (shadowed) do not help much, and neither does sunlight in indoor scenes. We have lots of partially indoor/outdoor scenes in Trials Evolution, and indoor sections (big areas where sunlight is blocked) cost more because you have to add lots of local lights (but the sun still has to be evaluated for the same pixels even if it's mostly shadowed out). We have fully dynamic lighting and user created content, so users can create lots of cases like this in their levels (which we have no way to prevent).

Improving minimum frame rate is of course the ("only") goal (we have 60 fps with vsync after all). We want the 2-3 lights per 3d surface to translate to 2-3 lights processed per pixel. Smaller tile sizes and good culling are very important to reach that constant frame time (fewer border pixels that require a "random", camera-angle-specific amount of processing). It's important that the "visible amount of lighting" translates well to processing cost (so that artists/users can have the amount of light they require on each surface).
Worst case performance is the most relevant, which is actually one of the reasons why I'm slightly less enthused with so-called "Forward+" than the very similar deferred variant. In my experience running complex shaders at the end of the raster pipeline results in a whole lot more variability (due to triangle sizes, occlusion, scheduling, etc) than doing it in image space.
I fully agree with you. The more processing you can move away from the raster pipeline to the (screen space) compute pipeline, the better. Everything in the raster pipeline has a fluctuating cost (variable amount of triangles/vertices on screen, variable overdraw and hi-z efficiency, variable quad efficiency, no way to control branching granularity = branching fluctuation, etc). Minimizing the fluctuating cost is the key to solid performance (60 fps with hard vsync has always been our goal).

With Forward+ those far away objects with tiny triangles (artists always have too little time to create perfect LODs) are hit with plenty of light sources (further away geometry has more z-fluctuation, so light culling has more false positives). Those tiny triangles have very bad quad efficiency (and often very bad texture cache efficiency as well). I would personally prefer to do as little as possible in the raster stage. I would be ready to go as far as simply storing the texture coordinate (into the virtual texture cache) for each pixel instead of sampling the material textures in the rasterization step. Our current Xbox 360 game has a 2000 meter view distance, and we are already getting bad quad efficiency for further away geometry. In next gen titles we of course want more (more draw distance, more distant geometry, more geometry with smaller details). Lots of things must be deferred to make all this happen (at a constant, non-fluctuating 60 fps).
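A minimal sketch of what such a slim g-buffer could look like (the struct and field names below are hypothetical, just to illustrate storing only a virtual texture coordinate plus depth/normal instead of full material data):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical "UV-only" g-buffer texel: instead of storing albedo, roughness,
// etc. per pixel, store just enough to fetch the material later from the
// virtual texture cache in a deferred (compute) pass.
struct SlimGBufferTexel
{
    uint32_t virtualUV;     // 16+16 bit quantized UV into the virtual texture cache
    uint32_t packedNormal;  // e.g. octahedron-encoded normal in 2x16 bits
    float    depth;         // for world/view-space position reconstruction
};

int main()
{
    // 12 bytes per pixel, versus the several 32-bit targets of a fat g-buffer.
    printf("%zu bytes per pixel\n", sizeof(SlimGBufferTexel));
    return 0;
}
```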
 
Would it be feasible to use texture space lighting, with virtual texturing, for things further away?

Frame rate independent lighting and very simple rendering of faraway objects might be nice.
Tiled deferred/forward rendering might also benefit from the fact that the min/max depth range could be restricted to a certain distance or set of objects, raising efficiency.
 
Yes, you can light the virtual texture cache instead of lighting the screen space g-buffer. Lighting in texture space actually provides better results as well (normals cannot be interpolated without causing errors). With texture space lighting, and sampling the lighted result, you get better mip transitions (lighted results are linearly blended from two mips), and you reduce a lot of shader aliasing (especially specular aliasing).

The only problem is the extra cost. For example the virtual texture cache in Trials Evolution is 4096x4096 pixels. The currently visible set of pages is around 35% of the cache (on average in our scenes), so we have roughly 5872k texels to light. A 720p screen is 921k pixels. That's roughly 6.4x the pixel count. Doable, but it's going to cost more than the image quality benefit is worth. Of course if the scene lighting is static, each virtual texture tile in the cache only needs to be lit once (when it becomes visible). In Trials Evolution we limit tile generation to 16 pages per frame. That's 128x128x16 = 262k pixels. So in a scene that has static lighting (but calculated at runtime), lighting the virtual texture cache instead of the screen g-buffer is actually more efficient (and provides better image quality to boot). Real scenes are of course a mix of static and dynamic lighting, so the performance gain/loss is game (and scene) specific.
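To put the arithmetic above in one place (the 35% residency and 16 pages/frame figures are the ones quoted for Trials Evolution; this is just the same unit math written out):

```cpp
#include <cstdio>

int main()
{
    const double cacheTexels   = 4096.0 * 4096.0;     // virtual texture cache size
    const double visibleTexels = cacheTexels * 0.35;   // ~35% of pages visible
    const double screenPixels  = 1280.0 * 720.0;       // 720p
    const double newPageTexels = 128.0 * 128.0 * 16.0; // 16 new 128x128 pages per frame

    // Lighting every visible cache texel vs. lighting the 720p screen:
    printf("full visible cache: %.1fx the screen pixel count\n", visibleTexels / screenPixels);
    // Lighting only freshly generated pages (the static lighting case):
    printf("new pages only:     %.2fx the screen pixel count\n", newPageTexels / screenPixels);
    return 0;
}
```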

Reusing processing with the virtual texture cache is one of the best things virtual texturing brings to the table. For example in Trials Evolution we do our complex terrain blending once into the cache (and only sample the baked data each frame - very important for 60 fps gameplay), and the same system is used for object decals (decals are burned into the VT cache during page generation). Of course you can burn things like SSAO and GI lighting (the non view dependent part of the lighting equation) into the cache as well. If you would like to calculate view dependent stuff (like specular lighting) into the VT cache, you would need to update it periodically. The good thing of course is that it can be made completely frame rate independent.

I actually wrote a post describing this kind of system (a g-buffer with texture coordinates into the VT cache instead of any color/material data) last year, but I haven't had much time to experiment with it (since Trials Evolution has taken all my time lately). It's a really big change to things, and thus we didn't want to risk adopting a system like that during our project. But I will surely experiment with it for our next major engine version. It seems that there are others who are also interested in texture space lighting (Timothy Lottes wrote about it in his blog), and virtual texturing is a key ingredient for systems like this... and many big developers have stated that they are implementing virtual texturing in their future engines, so we might actually see some texture space lighting in actual products in the future.
 
@Sebbbi: Huh, I never thought of using virtual textures for lighting. In one sense it sounds very cool, in another it also just doesn't sound terribly efficient, a bit like just supersampling. Point being there might be something more elegant, and more importantly more efficient. If only I could think of what it was :p Until then, did you ever publish that paper sebbbi? Or are you planning on investigating it some more before you publish?

As for using a tile-based scheme to cull out virtual point lights or something, well I don't think even with this next generation we're going to get anything as "realtime" as that or path tracing for GI. Doing it that way is an NP complete problem, and since it's going to grow exponentially with scene complexity I'm not imagining most developers would be willing to give up even bigger and more complex environments for something like that (admittedly something like Legend of Grimrock or other such tightly enclosed environments could work fine).

I do however like the idea of partially precalculated probes, in fact right now I'd say that is THE good idea when it comes to dynamic GI. Heck, it even works acceptably on the current generation! (See Far Cry 3, or the papers from the cancelled Milo game.) With a lot more power I don't see why that couldn't be scaled up to something very good indeed, and the "deferred shading" occlusion technique (whatever they called it) used in Infamous 2 looks promising as well. Get that to work as bent normal occlusion, combine it with the probes, and you'd have something closing in on what Hollywood used to use in the mid 2000's, which is damned solid to me.
 
Would someone be so kind as to summarize the pros and cons of Forward Rendering and Deferred Rendering (maybe split into fully deferred and deferred shading or whatnot) to highlight why someone would want to go with a Forward+ like approach?
 
As you can probably tell from this thread, that's not really a simple question :). Honestly though we're arguing details... tiled deferred and "forward+" (tiled forward) are really similar, and both are strictly better than conventional forward or conventional deferred.

Can someone explain texture space lighting briefly to me? Thanks.
The tl;dr version:

Instead of shading in screen space - i.e. for each pixel, you sample input textures then compute a lighting function based on those inputs - imagine doing the shading "on the surface of each object". So for each pixel in the *texture map*, compute the lighting at that place on the object, and save the result into a texture. Then to display you simply render the object using that pre-lit texture map, including texture filtering.

There are of course details... the fact that textures are reused in different places in the scene means that you need some form of virtual texturing, since the lighting results will actually be different for each of those places. Also you don't want to shade at the resolution of the textures themselves... you want to shade at some resolution that once projected into screen space ends up being roughly pixel frequency. Mipmapping sort of already does this so you can use a similar technique.
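A rough CPU-side sketch of that flow, assuming a page-based virtual texture cache (all the type and function names below are made up for illustration; a real implementation would run the shading loop in a compute shader):

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Material inputs for one texel of a resident virtual texture page, plus the
// surface point that texel maps to (all names here are hypothetical).
struct PageTexel  { Vec3 albedo; Vec3 worldNormal; Vec3 worldPos; };
struct PointLight { Vec3 pos; Vec3 color; float radius; };

// Stand-in for whatever lighting function the renderer actually uses;
// a real version would evaluate the full BRDF and shadowing per light.
static Vec3 shadeTexel(const PageTexel& t, const std::vector<PointLight>& lights)
{
    Vec3 sum{0, 0, 0};
    for (const PointLight& l : lights)
    {
        sum.x += t.albedo.x * l.color.x;
        sum.y += t.albedo.y * l.color.y;
        sum.z += t.albedo.z * l.color.z;
    }
    return sum;
}

// Texture-space shading: light every texel of a resident page once and store
// the result in a "lit" copy of the page. The screen pass then just renders
// geometry that samples this pre-lit page with ordinary trilinear/anisotropic
// filtering - no per-screen-pixel lighting math, and the texture filter
// band-limits the shaded signal, which is where the reduced specular
// aliasing comes from.
void shadePage(const std::vector<PageTexel>& pageTexels,
               const std::vector<PointLight>& lights,
               std::vector<Vec3>& litPage)
{
    litPage.resize(pageTexels.size());
    for (std::size_t i = 0; i < pageTexels.size(); ++i)
        litPage[i] = shadeTexel(pageTexels[i], lights);
}

int main() { return 0; }
```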

The potential advantages are that since you're applying texture filtering to the final shaded output you get very little aliasing (from stuff like specular) in screen space, since it gets band-limited to the right frequency automatically. Disadvantages sebbbi alluded to: among other things, you will end up shading more samples (up to ~4x) than shading in screen space since you need enough data to filter properly. Also it should be noted that using texture-space shading does *not* eliminate the need for stuff like normal map/BRDF filtering, as the function can still alias in motion if the input is not properly band-limited. Given that, the question is whether or not texture-space shading provides enough additional visual quality over a properly filtered BRDF (say LEAN mapping or similar) to be worth the additional complexity and cost. The answer to that is unclear :)

There's a gigantic discussion about this - including a nice RenderMonkey shader comparison by Stephen Hill! - over here if you're bored: http://timothylottes.blogspot.ca/2012/01/sbaa-paper-sraa-and-texture-aligned.html :)
 
Also you don't want to shade at the resolution of the textures themselves... you want to shade at some resolution that once projected into screen space ends up being roughly pixel frequency.
Luckily virtual texturing does this for you automatically. The virtual texture cache includes all the visible texture surfaces at the correct detail levels, so you can simply do 1:1 lighting. Point sample the texels like you do in screen space deferred lighting.

Like Andrew said earlier, this doesn't completely solve the normal map filtering issue. It just solves the part of trilinear filtering the normal map during screen buffer rendering (4 bilinear samples * 2 mip levels = 8 samples get blended together -> temporal aliasing caused by incorrect normal map linear filtering). You still have to generate the texture mip levels so that the lighting equation provides correct results in lower mip levels as well (recursive 2x2 averaging is wrong for normals). As long as the mip calculation is correct (not a trivial task since it requires extra data about distribution), you can simply trilinear (or anisotropic) filter the lighted virtual texture cache, and you should get no shader aliasing (from lighting) at all.

The lighting equation for texture space doesn't need to worry about linear filtering at all (screen space texturing requests "random" points that are approximated by filtering from spatial/mip neighbors; in texture space you can always just point sample and get the exact result). Without having to worry about getting the lighting and mip calculation formulas into a form that supports linear filtering, you have more options (starting from tight bit packing in textures, but possibly leading to completely new formulas). LEAN/CLEAN mapping is designed so that linear approximation produces good results. With texture space lighting you do not have that limitation (and likely wouldn't need to go to such fat texture formats to "solve" the issue).

There's a gigantic discussion about this - including a nice RenderMonkey shader comparison by Stephen Hill! - over here if you're bored: http://timothylottes.blogspot.ca/2012/01/sbaa-paper-sraa-and-texture-aligned.html :)
Anyone interested in methods to combat (shader) aliasing should definitely read this discussion.
 
Pretty big win for AMD, especially in 2560×1600, although I suppose NVIDIA may not have had much time to work on drivers. Perhaps bandwidth plays a part, too?
 
That game seems terribly optimized. A GTX680 can barely average 60fps at 1080p with 4xMSAA.

I know it's AMD sponsored, but the 7970 doesn't exactly blow the roof off either. Can anyone tell me if the game looks that good?
 
Clustered Deferred and Forward Shading
http://www.cse.chalmers.se/~olaolss/main_frame.php?contents=publication&id=clustered_shading
Video

Adds depth partitioning of lights to get better efficiency.
Very interesting article. Nice to see other people researching this area as well.

Some notes:
They use pretty big tiles for the tiled deferred renderer in their comparisons (and have test scenes where the big tile size hurts performance). Also they use a brute force algorithm to cull lights in the tiled deferred method. For better efficiency you would want to subdivide to smaller tiles (down to 8x8). With small tiles, depth partitioning doesn't add much benefit (as one tile is just two warps; subdividing further would kill warp-based branching coherency, so no technique can do that and expect good performance).

We do quadtree based partitioning on the CPU (in our old tiled deferred system), and a similar system would likely be good for the GPU as well. A full quadtree is not likely the best choice (doesn't fit in thread block shared memory), but a few of the most detailed tree levels (32x32, 16x16, 8x8) would improve the light culling performance nicely.
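As a very rough illustration of that kind of hierarchical refinement (this is not the actual Trials engine code, just a sketch of culling lights against a coarse tile first and only refining the survivors down to the finer levels):

```cpp
#include <utility>
#include <vector>

// Conservative screen-space bounds of one light: a 2D box plus a depth range.
struct LightBounds { float x0, y0, x1, y1, zMin, zMax; };

// One screen tile with its min/max depth (taken from the depth buffer).
struct Tile { float x0, y0, x1, y1, zMin, zMax; };

static bool overlaps(const Tile& t, const LightBounds& l)
{
    return l.x1 >= t.x0 && l.x0 <= t.x1 &&
           l.y1 >= t.y0 && l.y0 <= t.y1 &&
           l.zMax >= t.zMin && l.zMin <= t.zMax;
}

// Cull lights against a coarse tile first and only test the survivors against
// the four child tiles of the next (finer) level, stopping at the finest tile
// size (e.g. 8x8 pixels). This keeps per-light culling work roughly
// proportional to the screen area the light actually covers.
void cullHierarchical(const Tile& tile, float tileSize, float finestSize,
                      const std::vector<LightBounds>& lights,
                      const std::vector<int>& candidates,
                      std::vector<std::vector<int>>& finestLightLists)
{
    std::vector<int> survivors;
    for (int idx : candidates)
        if (overlaps(tile, lights[idx]))
            survivors.push_back(idx);

    if (tileSize <= finestSize)   // finest level reached: emit this tile's list
    {
        finestLightLists.push_back(std::move(survivors));
        return;
    }

    const float half = tileSize * 0.5f;
    for (int cy = 0; cy < 2; ++cy)
        for (int cx = 0; cx < 2; ++cx)
        {
            // Children reuse the parent's depth bounds here for brevity; a real
            // implementation would use each child tile's own min/max depth.
            Tile child{tile.x0 + cx * half,       tile.y0 + cy * half,
                       tile.x0 + (cx + 1) * half, tile.y0 + (cy + 1) * half,
                       tile.zMin, tile.zMax};
            cullHierarchical(child, half, finestSize, lights, survivors,
                             finestLightLists);
        }
}

int main()
{
    std::vector<LightBounds> lights = {{10, 10, 60, 60, 1.0f, 5.0f}};
    std::vector<std::vector<int>> lists;
    cullHierarchical({0, 0, 32, 32, 0.5f, 10.0f}, 32.0f, 8.0f, lights, {0}, lists);
    return 0;   // lists now holds one light index list per 8x8 tile
}
```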

Depth partitioning would be possible with the system described above. For each 8x8 tile (64 threads), split the pixels into two depth partitions, and process each with a single warp (32 threads). Of course this would require some data shuffling inside the thread block shared memory, but it wouldn't cost that much extra (as everything stays inside the fast local memory). Two depth partitions would be enough for all but pathological cases (as the 8x8 blocks are so small already). This would likely offer pretty much comparable efficiency to the clustered version, with much less setup work. Of course you could also check the pixel normals as well, and reject lights based on that information (just like in the clustered algorithm), but it didn't seem to be a clear cut advantage in the clustered algorithm either. Conservative shadow map checking based on depth bounds (rejecting fully shadowed lights from a block) would however be a big gain (assuming of course that a rather expensive soft shadowing / shadow antialiasing method is used). It's interesting how most academic research on new deferred lighting methods doesn't even mention shadow maps. Light sources that do not cast shadows look very "flat" and unnatural.

The biggest reason I liked stencil shadowing was that it automatically culled out all pixels that didn't receive light (backfacing pixels and pixels in shadow). You only paid for the pixels that actually received light (stencil culling handled the rest). It's nice to see similar benefits in new algorithms based on shadow maps (as the downsides of stencil shadows make them unusable in anything but super simple scenes).
 
It's interesting how most academic research on new deferred lighting methods doesn't even mention shadow maps. Light sources that do not cast shadows look very "flat" and unnatural.

The biggest reason I liked stencil shadowing was that it automatically culled out all pixels that didn't receive light (backfacing pixels and pixels in shadow). You only paid for the pixels that actually received light (stencil culling handled the rest). It's nice to see similar benefits in new algorithms based on shadow maps (as the downsides of stencil shadows make them unusable in anything but super simple scenes).
What are your thoughts on cone tracing of voxels, distance fields or surfels for shadows?

We already know the lights that affect the tile, wouldn't it be feasible to trace shadows for those lights instead of creating shadow maps for each of them?
 
Very interesting article. Nice to see other people researching this area as well.
Some notes: (...)
Thanks, you too! :)

Apologies for the slow response. I'd like to offer some comments on your notes.

Firstly, smaller tiles do come at a cost in storage, computation and bandwidth. Therefore it seems not without problems to suggest subdividing the entire screen in order to lessen the problem with discontinuities (as opposed to subdividing where they actually occur). Also, any 2D tiling can only reduce the problem, not solve it. For example, consider cases like foliage with alpha to coverage, debris and stochastic transparency. These kinds of "pathological cases" are just the thing that makes tiled shading highly view dependent. Also, with many lights, even one bad tile could potentially end up dominating the run time.

Secondly, it is not true that tiles smaller than 32 samples need pose a problem. You appear to assume we must schedule the tiles explicitly (e.g. X warps per tile), which is not true. We do not, in fact, which means that samples within the same warp may fetch different light lists, but as long as they are of similar length, little divergence will occur.

Thirdly, regarding partitioning into two sets within a tile. I see no reason to assume samples will split nicely into two chunks of 32 elements. Equally plausible is 1 at the far plane and 63 near the viewer. Thus, it is not obvious to me that performing this shuffling is going to be either particularly simple or very efficient. And, again, the fundamental problem remains unsolved. This should also be taken in view of clustering not actually being very complex/expensive to implement, especially if using the page tables approach.

Finally, I totally agree that shadows will be the next big challenge to tackle for many lights. And I also think that clusters may provide a very useful starting point for an efficient shadowing algorithm. The normal clustering and culling was indeed not competitive in our measurements, but even a fairly small increase in shader complexity / shading cost would shift the balance. Adding shadowing looks pretty sure to shift that balance.

I think voxel based occlusion methods could be used very well in conjunction with clusters. It's a very interesting possibility to investigate.

Again thanks for your comments, I hope to get time some day to run some comparisons with smaller tiles in our system, to put some numbers to the arguments. That time is, unfortunately, not now, it seems.

Cheers
.ola
 
Thirdly, regarding partitioning into two sets within a tile. I see no reason to assume samples will split nicely into two chunks of 32 elements. Equally plausible is 1 at the far plane and 63 near the viewer. Thus, it is not obvious to me that performing this shuffling is going to be either particularly simple or very efficient.
Sorry I wasn't clear: I didn't mean clustering in 2D, I meant two clusters *in depth*, rather than a full grid. i.e. compute the depth distribution over the tile, then split it at the center, etc. Then treat each of those "depth tiles" separately. For the vast majority of tiles, the depth distribution is unimodal or bimodal. In my tests this captures the vast majority of the benefit of depth tiling or 3D binning. Of course you can construct scenes in which it doesn't work as well, but I have yet to see a realistic scene that it didn't work well in.
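A minimal sketch of that per-tile split, under the assumption that "split it at the center" means the midpoint of the tile's [min, max] depth range, with tight bounds then computed for each half:

```cpp
#include <algorithm>
#include <vector>

struct DepthPartition { float zMin, zMax; };

// Split one tile's depth samples into two "depth tiles": take the midpoint of
// the tile's overall [min, max] depth range, then compute tight bounds for the
// samples on each side. Lights are subsequently culled against each
// partition's bounds separately.
void splitTileDepth(const std::vector<float>& tileDepths,
                    DepthPartition& nearPart, DepthPartition& farPart)
{
    const float zMin = *std::min_element(tileDepths.begin(), tileDepths.end());
    const float zMax = *std::max_element(tileDepths.begin(), tileDepths.end());
    const float mid  = 0.5f * (zMin + zMax);

    nearPart = {zMax, zMin};   // start with inverted (empty) bounds...
    farPart  = {zMax, zMin};
    for (float z : tileDepths) // ...then grow the bounds of the half each sample falls in
    {
        DepthPartition& p = (z < mid) ? nearPart : farPart;
        p.zMin = std::min(p.zMin, z);
        p.zMax = std::max(p.zMax, z);
    }
}

int main()
{
    std::vector<float> depths = {1.0f, 1.2f, 1.1f, 40.0f, 42.0f};  // bimodal tile
    DepthPartition nearPart{}, farPart{};
    splitTileDepth(depths, nearPart, farPart);
    // nearPart ends up ~[1.0, 1.2], farPart ~[40.0, 42.0]: lights sitting in the
    // empty gap between foreground and background get culled from both partitions.
    return 0;
}
```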

The normal clustering and culling was indeed not competitive in our measurements, but even a fairly small increase in shader complexity / shading cost would shift the balance. Adding shadowing looks pretty sure to shift that balance.
Right, although the issue here is that you're not talking about culling "the rest of the shader", but rather only the NdotL test and subsequent branch per-pixel. Any well-written shader will branch on the fully-computed occlusion value (NdotL > 0 && distance attenuation factor > 0) so the level of tile culling that you want to do is going to be limited solely by computing these values. Now larger tiles will swing the balance more towards better tile culling (since there are more pixels to compute these values at), but the complexity of the shading function doesn't actually affect it.
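For concreteness, the branch structure being described looks roughly like this (written as C++ to match the other sketches; expensiveBRDF is just a stand-in for the costly part of the shading function, and the attenuation formula is an arbitrary example):

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Stand-in for the expensive part of the shading function (full BRDF
// evaluation, shadow filtering, ...).
static float expensiveBRDF(Vec3 n, Vec3 l, Vec3 v)
{
    return 0.5f * (dot(n, l) + dot(n, v));
}

// Per pixel, per light: evaluate the cheap terms first and branch on the
// combined occlusion value, so the expensive work only runs when the light
// can actually contribute (front-facing and within range).
float shadeOneLight(Vec3 n, Vec3 v, Vec3 toLight, float lightRadius)
{
    const float dist  = std::sqrt(dot(toLight, toLight));
    const Vec3  l     = {toLight.x / dist, toLight.y / dist, toLight.z / dist};
    const float nDotL = dot(n, l);
    const float atten = std::max(0.0f, 1.0f - dist / lightRadius);

    if (nDotL > 0.0f && atten > 0.0f)          // the combined cheap test
        return expensiveBRDF(n, l, v) * nDotL * atten;
    return 0.0f;                               // culled: expensive path skipped
}

int main() { return 0; }
```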
 
Sorry I wasn't clear: I didn't mean clustering in 2D, I meant two clusters *in depth*, rather than a full grid. i.e. compute the depth distribution over the tile, then split it at the center, etc. Then treat each of those "depth tiles" separately. For the vast majority of tiles, the depth distribution is unimodal or bimodal. In my tests this captures the vast majority of the benefit of depth tiling or 3D binning. Of course you can construct scenes in which it doesn't work as well, but I have yet to see a realistic scene that it didn't work well in.
I may not have been clear either, as I intended to write a comment specifically about what sebbbi said. Perhaps I should have kept the full quote.

In any case, as I see it the main issue is that the view dependence of tiled shading remains. There is no way to ensure a certain number of lights per tile, e.g. by controlling placement, as this would require testing _every_ possible view point. So even though bimodal binning may work for the vast majority of cases, you have no way of telling beforehand when the minority is going to come and push your frame time over the edge, so to speak. And with thousands of lights, a single tile can get pretty expensive, especially given the parallelism needed to feed modern GPUs. With 3D clusters there are no pathological view points, and work size is closely related to light density.

Right, although the issue here is that you're not talking about culling "the rest of the shader", but rather only the NdotL test and subsequent branch per-pixel. Any well-written shader will branch on the fully-computed occlusion value (NdotL > 0 && distance attenuation factor > 0) so the level of tile culling that you want to do is going to be limited solely by computing these values. Now larger tiles will swing the balance more towards better tile culling (since there are more pixels to compute these values at), but the complexity of the shading function doesn't actually affect it.

I see what you are saying, I had not considered it that way and it does seem to put a damper on the whole cone culling thing. I'm not aware, however, that you can control branching on that level in GLSL or CUDA (as you can in HLSL), so results might vary in practice.
 