Forward+

Light culling on the GPU could likely be done quite efficiently, as it's a problem that can be parallelized. Doing it in a brute force way (per pixel or per block, as usually done with pixel shaders) is easy, but requires lots of duplicate work. Finding an efficient parallel algorithm for it that requires minimal work (and uses a minimal amount of bandwidth) is a pretty challenging problem.

I would personally like to try techniques similar to what the fastest GPU radix sorters use, and use a quadtree to split lights. Basically you could do a modified scan (prefix sum) based on culling results (four frustums). One thread processes one light, checks the four frustums and increments the counts of the bins the light belongs to. Scan the bins, and create the next iteration's data (this can have fewer or more items based on how many lights intersected frustums and how many got completely culled out). The next iteration (log2(r) in total, where r is screen resolution : block resolution) would then again go through all the lights (one thread per light) but now check each light against the four frustums of the sub-frustum it belongs to. So in total 4*n*log2(r) frustum checks... log2(r) = 9 (for 720p and 4x4 block size). So it would result in approx 18 frustum checks per light (if we disregard intersected frustums and lights completely culled out).
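Purely as an illustration of that scan-based scheme, here is a rough serial C++ sketch of one refinement pass (all names are made up, and a rectangle-vs-circle test stands in for real sub-frustum plane tests; on the GPU the count, scan and scatter phases would each run as parallel kernels):

```cpp
// Sketch of one refinement pass of the scan-based quadtree binning described
// above: count how many (node, light) pairs fall into each child bin, do an
// exclusive prefix sum (the "scan"), then scatter pairs into the next level.
// The rect-vs-circle overlap test is a stand-in for real sub-frustum tests.
#include <cstdint>
#include <vector>

struct Rect  { float x0, y0, x1, y1; };   // screen-space bounds of a node
struct Light { float x, y, radius; };     // projected light, as a circle
struct Pair  { uint32_t node, light; };   // light still alive in this node

static bool overlaps(const Rect& r, const Light& l) {
    float cx = l.x < r.x0 ? r.x0 : (l.x > r.x1 ? r.x1 : l.x);
    float cy = l.y < r.y0 ? r.y0 : (l.y > r.y1 ? r.y1 : l.y);
    float dx = l.x - cx, dy = l.y - cy;
    return dx * dx + dy * dy <= l.radius * l.radius;
}

static Rect child(const Rect& r, int i) { // split a node into four quadrants
    float mx = 0.5f * (r.x0 + r.x1), my = 0.5f * (r.y0 + r.y1);
    return i == 0 ? Rect{r.x0, r.y0, mx, my} : i == 1 ? Rect{mx, r.y0, r.x1, my}
         : i == 2 ? Rect{r.x0, my, mx, r.y1} :          Rect{mx, my, r.x1, r.y1};
}

std::vector<Pair> refine(const std::vector<Rect>&  nodes,
                         std::vector<Rect>&        childNodes,  // out: next level
                         const std::vector<Light>& lights,
                         const std::vector<Pair>&  in)
{
    childNodes.clear();
    for (const Rect& n : nodes)
        for (int i = 0; i < 4; ++i) childNodes.push_back(child(n, i));

    // Count phase ("one thread per pair" on the GPU).
    std::vector<uint32_t> count(childNodes.size(), 0);
    for (const Pair& p : in)
        for (int i = 0; i < 4; ++i)
            if (overlaps(childNodes[p.node * 4 + i], lights[p.light]))
                ++count[p.node * 4 + i];

    // Exclusive scan gives each bin its output offset.
    std::vector<uint32_t> offset(count.size() + 1, 0);
    for (size_t i = 0; i < count.size(); ++i) offset[i + 1] = offset[i] + count[i];

    // Scatter surviving pairs into the next level's work list.
    std::vector<Pair>     out(offset.back());
    std::vector<uint32_t> cursor(offset.begin(), offset.end() - 1);
    for (const Pair& p : in)
        for (int i = 0; i < 4; ++i) {
            uint32_t c = p.node * 4 + i;
            if (overlaps(childNodes[c], lights[p.light]))
                out[cursor[c]++] = Pair{c, p.light};
        }
    return out;
}
```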
 
I would personally like to try techniques similar to what the fastest GPU radix sorters use, and use a quadtree to split lights.
Yeah precisely, that's how I do it on the CPU and it's vastly faster since it does something like 10-100x fewer total tests (!). But you don't want to do it in "passes", because it's incredibly inefficient bandwidth-wise to dump your stack to DRAM every frame. What you need is the ability to fork/join with a work-stealing scheduler, i.e. depth-first traversal (most cache friendly), breadth-first stealing (best load balancing).

On the GPU right now you can kind of do this with persistent threads (this is what OptiX and others do), but not efficiently, elegantly or portably. The programming model and hardware need to expand to not lock in a static register count/stack/shared memory size, or else it won't be able to do anything more interesting than what it does today. Turns out hierarchical data structures and recursion are kind of important ;)
 
The "tiled deferred" implementation in that demo should be similar to mine and that's what I was comparing.
Ahh, I see.

So called "shader variety" is a totally red herring. Deferred can run arbitrary shaders just as efficiently as forward (sometimes more-so due to 2x2 quad scheduling from the rasterizer). Try it :)
At first I was going to rebut with an example of large amounts of data interacting with the light sources, but I guess you could defer that with a mega-texture type of technique where every piece of data passed to the shader during forward texturing can be accessed in the deferred path. It does complicate fixed-function amenities like LOD or cylindrical texture mapping, but I suppose I'm being a bit picky now.

Light culling on the GPU could likely be done quite efficiently, as it's a problem that can be parallelized. Doing it in a brute force way (per pixel or per block, as usually done with pixel shaders) is easy, but requires lots of duplicate work.
Does it really? I may not be understanding what's required, but the best way I can think of is downscaling the Z-buffer (tracking min/max) so that you have one pixel per tile and then render lights to an equal-sized buffer with a DX11-style linked list. I don't see a lot of redundancy or inefficiency going on there except the usual quad-based underutilization which shouldn't bring it down to CPU speeds.
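For what it's worth, the min/max downscale itself is trivial; a serial C++ sketch of it (the 16-pixel tile size and row-major depth layout are just assumptions - a compute-shader version would reduce through shared memory with one work group per tile):

```cpp
// Sketch of the per-tile Z min/max downscale: one output texel per tile,
// tracking the min and max depth of the pixels it covers. Serial for clarity;
// tile size and the [0,1] depth convention are assumptions.
#include <algorithm>
#include <vector>

struct TileZ { float zmin, zmax; };

std::vector<TileZ> downscaleDepth(const std::vector<float>& depth,
                                  int width, int height, int tileSize = 16)
{
    int tw = (width  + tileSize - 1) / tileSize;
    int th = (height + tileSize - 1) / tileSize;
    std::vector<TileZ> tiles(tw * th, TileZ{1.0f, 0.0f});

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            TileZ& t = tiles[(y / tileSize) * tw + (x / tileSize)];
            float  z = depth[y * width + x];
            t.zmin = std::min(t.zmin, z);
            t.zmax = std::max(t.zmax, z);
        }
    return tiles;
}
```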
 
It does complicate fixed-function amenities like LOD or cylindrical texture mapping, but I suppose I'm being a bit picky now.
Sure, and that's precisely why you tend to sample the diffuse texture in the first pass (i.e. not defer it), because you really have 6 floats as input (uvs and derivatives although this could be compressed somewhat), whereas the output is normally 4 unorms or similar. That said, as soon as you have a lot of textures using all the same coordinates it may make sense to defer that and store the derivatives... Also in other cases you can compute the derivatives analytically, like I do with shadows for instance. Storing Z derivatives is enough to recover any position-related derivatives in any space, and in fact you can use the shading normal as a reasonable approximation if G-buffer space is a problem.

Does it really? I may not be understanding what's required, but the best way I can think of is downscaling the Z-buffer (tracking min/max) so that you have one pixel per tile and then render lights to an equal-sized buffer with a DX11-style linked list. I don't see a lot of redundancy or inefficiency going on there except the usual quad-based underutilization which shouldn't bring it down to CPU speeds.
Yeah you need to create a min/max mip tree, although you shouldn't go down to 1:1 (GPUs are really inefficient at the last few levels). After that though given the very low resolutions (of tiles) it's normally not worth rasterizing, particularly since it's mostly spheres and cones, which are more efficient to "rasterize" in software. The biggest issue with this approach is that all your light lists go out to DRAM instead of sitting in caches, etc. As I mentioned, you really want depth-first traversal...

The ability that I described lets you implement this pretty close to as efficiently as possible in compute (i.e. a hierarchical rasterizer). You just need a stack and work stealing really, both of which GPU hardware can do decently (see OptiX and other work)... it's trivial to implement in something like Cilk or TBB for instance. In fact, if you download the ISPC compiler package, I wrote a CPU implementation of both the culling and shading, which is actually not as much slower than a modern GPU (iso-power) as you might think :) With culling only, it's actually faster.
 
The biggest issue with this approach is that all your light lists go out to DRAM instead of sitting in caches, etc.
I don't see why that's an issue. For each light in each tile, you'd need a few orders of magnitude more bandwidth to render it deferred w/o tiling: a 2-byte light index + linked list overhead vs. reading a G-buffer for a few hundred pixels. I make this comparison because according to your paper deferred w/o tiling is 5-7x more expensive per light than tiled. Bandwidth really should be negligible, unless you're implying that latency from linked lists is the issue. That would be weird given the AMD Mecha demo.
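Just to make that bandwidth argument concrete, a per-tile light-index list is not much more than this (a CPU sketch with std::atomic standing in for the UAV atomics; the node layout and pool sizing are purely illustrative):

```cpp
// Sketch of a per-tile light-index linked list, the CPU analogue of appending
// to a UAV with an atomic counter. On the GPU the head pointers and nodes
// would live in buffers and 'alloc' would be an atomic add on a global counter.
#include <atomic>
#include <cstdint>
#include <vector>

struct Node { uint16_t lightIndex; int32_t next; };

struct TileLightLists {
    std::vector<std::atomic<int32_t>> heads;   // one head per tile, -1 = empty
    std::vector<Node>                 nodes;   // preallocated node pool
    std::atomic<uint32_t>             alloc{0};

    TileLightLists(size_t tileCount, size_t maxNodes)
        : heads(tileCount), nodes(maxNodes) {
        for (auto& h : heads) h.store(-1);
    }

    // Called for every (tile, light) pair that survives culling.
    void append(uint32_t tile, uint16_t light) {
        uint32_t n = alloc.fetch_add(1);
        nodes[n].lightIndex = light;
        // Standard lock-free push: link to old head, then swap in the new head.
        int32_t old = heads[tile].load();
        do { nodes[n].next = old; }
        while (!heads[tile].compare_exchange_weak(old, (int32_t)n));
    }
};
```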

The way I see it, this method should be almost 100x faster than quad-based non-tiled deferred culling (due to several hundred times smaller render target), subject to triangle setup overhead from the lights (tiny fraction of Z-pass setup). Your tile frustum tests are taken care of in 2D by the rasterizer, and the shader does the Z test along with the LL.

I don't see why you need a full Z mip tree. Just sample all the pixels in a tile and output the min/max.
 
Bandwidth really should be negligible, unless you're implying that latency from linked lists is the issue. That would be weird given the AMD Mecha demo.
First, the linked list rendering thing is actually quite expensive. Particularly reading the lists gives you really incoherent gathers to global memory.

But we're talking at cross purposes... you're comparing tiled deferred/forward in general to conventional deferred; yes of course I agree the tiled ones will be massively faster. I'm comparing rasterizer-based light culling to recursive, work-stealing, compute-based depth-first traversal, which is pretty much provably as cache-coherent as it gets, and also approaches the minimum number of tile/light tests.

The mip tree is needed for hierarchical rasterization/tiling. It's basically Hi-Z implemented in software, so of course if you're using the hardware rasterizer it will do it for you. (And obviously if you're only doing static sized tiles and no recursion you only need one level - but that's inefficient in compute for the reasons that I mentioned.)
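Building the rest of a min/max tree on top of that tile-level buffer is then just repeated 2x2 reduction; a sketch, assuming power-of-two dimensions for brevity:

```cpp
// Sketch of building a software Hi-Z style min/max mip chain over the
// tile-level z bounds. Level 0 is the per-tile min/max buffer; each higher
// level takes the min of mins and max of maxes of a 2x2 block below it.
#include <algorithm>
#include <vector>

struct TileZ { float zmin, zmax; };

std::vector<std::vector<TileZ>> buildMinMaxTree(std::vector<TileZ> level0, int w, int h)
{
    std::vector<std::vector<TileZ>> levels;
    levels.push_back(std::move(level0));
    while (w > 1 && h > 1) {
        int nw = w / 2, nh = h / 2;
        const std::vector<TileZ>& src = levels.back();
        std::vector<TileZ> next(nw * nh);
        for (int y = 0; y < nh; ++y)
            for (int x = 0; x < nw; ++x) {
                TileZ a = src[(2 * y) * w + 2 * x],     b = src[(2 * y) * w + 2 * x + 1];
                TileZ c = src[(2 * y + 1) * w + 2 * x], d = src[(2 * y + 1) * w + 2 * x + 1];
                next[y * nw + x] = {
                    std::min(std::min(a.zmin, b.zmin), std::min(c.zmin, d.zmin)),
                    std::max(std::max(a.zmax, b.zmax), std::max(c.zmax, d.zmax)) };
            }
        levels.push_back(std::move(next));
        w = nw; h = nh;
    }
    return levels;
}
```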
 
First, the linked list rendering thing is actually quite expensive. Particularly reading the lists gives you really incoherent gathers to global memory.
I just figured that if AMD can get realtime with a fullscreen per-pixel linked list in the Mecha demo, then a ~100x50 one holding only indices should be a piece of cake.

But we're talking at cross purposes... you're comparing tiled deferred/forward in general to conventional deferred; yes of course I agree the tiled ones will be massively faster. I'm comparing rasterizer-based light culling to recursive, work-stealing, compute-based depth-first traversal, which is pretty much provably as cache-coherent as it gets, and also approaches the minimum number of tile/light tests.
I made that comparison just to fit with your data. Look at the few tiles each light will have to be inserted into with the tile-deferred light-culling process vs. all the pixels you have to do the full lighting for that same light in conventional deferred. If the latter is 20 μs/light (as in your paper), then the former should be maybe 0.2 μs/light.

Even with your CPU implementation, what are the advantages of hierarchical rasterization? You immediately know from the extents and position of the light which tiles it will cover, so all you need is the Z test. You could have 100 lights per tile and still fit in the CPU cache. What am I missing?
 
Even with your CPU implementation, what are the advantages of hierarchical rasterization? You immediately know from the extents and position of the light which tiles it will cover, so all you need is the Z test. You could have 100 lights per tile and still fit in the CPU cache. What am I missing?
You can still reject a light entirely at a higher level of the tree and then never have to test any of the lower levels - that's pretty important even just for Z tests, since so many lights are occluded in an average scene.

Now the efficiency of a specific way of culling will always depend on the configuration of the lights (many small vs. few large, etc.), in the same way that different ways of rasterizing are efficient for different sizes of triangles. But still, in a general-purpose implementation, I don't think you can really get away from wanting to cull lights over big swathes of the screen with a small number of tests.
 
Well, for that case, nothing is going to beat using the rasterizer for practical resolutions.

But what about my other assertion? Don't you think you can bin a light into, say, 10 tiles at least two orders of magnitude faster than you can light the 4000 pixels covered by those tiles?
 
Well, for that case, nothing is going to beat using the rasterizer for practical resolutions.
Not clear to me - when you need to "bin" with atomics/UAVs and there's no texturing or shading, the hardware loses a lot of its advantages. You don't really want to store a light index 5000 times for a light that covers 5000 pixels. You want to store the quad-tree. Even better, the quad-tree can just implicitly be on the stack during a depth-first traversal and never stored off-chip at all :)
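Concretely, the "quad-tree on the stack" amounts to a recursive cull where each node re-tests only the lights its parent kept; a serial sketch (the rect-vs-circle test stands in for real sub-frustum and z-range tests, emitLeaf is a placeholder for shading or writing out a leaf light list, and the loop over children is where a work-stealing scheduler such as TBB or Cilk would fork tasks):

```cpp
// Sketch of depth-first tile traversal with the quad-tree kept implicitly on
// the stack: only the leaf light lists are ever emitted, everything else lives
// in 'kept' vectors on the call stack. Tests and names are illustrative.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct Rect  { float x0, y0, x1, y1; };
struct Light { float x, y, radius; };

static bool overlaps(const Rect& r, const Light& l) {
    float cx = std::max(r.x0, std::min(l.x, r.x1));
    float cy = std::max(r.y0, std::min(l.y, r.y1));
    float dx = l.x - cx, dy = l.y - cy;
    return dx * dx + dy * dy <= l.radius * l.radius;
}

void cullNode(const Rect& node, int depth, int maxDepth,
              const std::vector<Light>&    lights,
              const std::vector<uint32_t>& parentList,
              const std::function<void(const Rect&, const std::vector<uint32_t>&)>& emitLeaf)
{
    std::vector<uint32_t> kept;                 // this node's surviving lights
    for (uint32_t i : parentList)
        if (overlaps(node, lights[i])) kept.push_back(i);
    if (kept.empty()) return;                   // whole subtree rejected at once

    if (depth == maxDepth) { emitLeaf(node, kept); return; }

    float mx = 0.5f * (node.x0 + node.x1), my = 0.5f * (node.y0 + node.y1);
    Rect children[4] = { {node.x0, node.y0, mx, my}, {mx, node.y0, node.x1, my},
                         {node.x0, my, mx, node.y1}, {mx, my, node.x1, node.y1} };
    for (const Rect& c : children)              // fork point for work stealing
        cullNode(c, depth + 1, maxDepth, lights, kept, emitLeaf);
}
```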

Of course you always have to "spill" something, so spill whatever is the smallest footprint in your application. That's why this is a bit of a meaningless discussion at this point; it's the same algorithm just implemented in software or via the graphics pipeline. Use whichever is fastest in your game :) (And I have the same advice for tiled forward vs tiled deferred, etc. Just test it; they are basically the same algorithm.)

But what about my other assertion? Don't you think you can bin a light into, say, 10 tiles at least two orders of magnitude faster than you can light the 4000 pixels covered by those tiles?
Uhh yes, definitely. But that's the assertion behind all tiled rendering, and applies to all of these techniques. So yes, as I mentioned earlier all of these are massively better than conventional deferred shading. You do want smaller tiles than you imply though. 8x8 or 16x16 is pretty good for most scenes, but as you can imagine has some culling overhead if you just do the brute force thing (in software) and don't add at least one or two levels to the hierarchy.
 
You don't really want to store a light index 5000 times for a light that covers 5000 pixels.
Not sure when that would happen. You mean if I had a tile size of 1x1? Or when <1% of the pixels in a 16x16 tile gets touched by a light?

Uhh yes, definitely. But that's the assertion behind all tiled rendering, and applies to all of these techniques.
Okay. So in summary, if
A) deferred light cost is 20μs/light,
B) GPU light culling is two orders of magnitude faster (~0.2 μs/light), and
C) tiled shading cost is 4μs/light,
then why is B not good enough for C? It would take some very large scene-related deviation to become a notable performance limitation. I also don't see why it would be slower than the CPU numbers you showed, or why there would be any significant 0-light setup cost.

FYI, I used 20x20 tile size in my discussion because that's what it looked like your paper used in the screenshots, which I assumed were taken at 1080p.
 
16x16 tiles are too big for various reasons. If your lights have shadows and you are doing MSAA, you would want to go all the way down to 4x4 (like Black Rock does in their "Screen Space Classification" tiled deferred system). With a fine-grained system, you can completely skip lights based on a conservative (once per tile) shadow map test, and you can completely accept lights that are fully "not in shadow". This way you only need to do per-pixel shadow map sampling for tiles that are on lights' shadow boundaries. It saves a huge amount of bandwidth and shading cost (fully shadowed blocks need no light calculation and fully lit blocks need no EVSM filtering). MSAA can be handled in a similar way (only tiles that contain edges need per-sample processing).
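That per-tile classification boils down to a conservative depth-range comparison in light space; a sketch (how the two ranges are obtained - projecting the tile's bounds into light space and sampling a min/max mip of the shadow map - is assumed to happen elsewhere, as is a depth convention that increases away from the light):

```cpp
// Sketch of conservative per-tile shadow classification: compare the tile's
// receiver depth range (as seen from the light) against the min/max occluder
// depth stored in the shadow map over the tile's footprint. Only "Boundary"
// tiles need per-pixel shadow filtering.
enum class TileShadow { FullyLit, FullyShadowed, Boundary };

struct DepthRange { float zmin, zmax; };

TileShadow classifyTile(DepthRange tileReceivers, DepthRange shadowOccluders)
{
    if (tileReceivers.zmax <= shadowOccluders.zmin)   // nearer than any occluder
        return TileShadow::FullyLit;                  // skip shadow filtering
    if (tileReceivers.zmin >  shadowOccluders.zmax)   // behind every occluder
        return TileShadow::FullyShadowed;             // skip the light entirely
    return TileShadow::Boundary;                      // needs per-pixel EVSM
}
```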

If 4x4 tiles are used, 16 threads per block is a natural choice, as that would lead to one thread per pixel at the final step of the processing. However, to combat branching granularity, you would like to have at least 32 threads (a single warp) processing one tile. And you would want even more threads to combat fetch latency. So having 16x16 macroblocks (256 threads per block) might actually be the way to go. All threads would first cull the lights to the sub-blocks (8 blocks, 8x4 pixels each, to ensure good branching performance). Each sub-block would then do conservative light tests (each thread could process a separate light), and finally pixels could be processed (one thread per pixel). This is however just a quick first draft (with no real-life performance benching to back it up). I haven't had much time to do CUDA/CS code yet, as our last game (Trials Evolution) just got out less than 2 weeks ago. In that game we do light tile culling on the CPU (it's very fast, but GPU->CPU latency basically prevents usage of the current frame's z-buffer, and this complicates things a lot).
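The index math for that layout is simple; purely as an illustration of mapping 256 threads onto a 16x16 macroblock split into eight 8x4 sub-blocks (one 32-wide warp each):

```cpp
// Sketch of mapping a 256-thread block onto a 16x16 macroblock made of eight
// 8x4 sub-blocks, so each sub-block is exactly one 32-wide warp. Purely
// illustrative index math; 'tid' is the thread index within the block.
struct ThreadMapping {
    int subBlock;            // 0..7, one warp per sub-block
    int lane;                // 0..31, lane within the warp
    int pixelX, pixelY;      // pixel within the 16x16 macroblock
};

ThreadMapping mapThread(int tid)                  // tid in [0, 256)
{
    ThreadMapping m;
    m.subBlock = tid / 32;                        // warp index
    m.lane     = tid % 32;
    int subX   = (m.subBlock % 2) * 8;            // sub-blocks laid out 2 wide,
    int subY   = (m.subBlock / 2) * 4;            // 4 tall (8x4 pixels each)
    m.pixelX   = subX + m.lane % 8;
    m.pixelY   = subY + m.lane / 8;
    return m;
}
```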

Back to the topic: The 16x16 macrotiles would either read the per-tile light list from memory, or they could cull all lights on their own. Culling would consume more bandwidth, because it needs to go through all lights, not just the small per-tile list. How much bandwidth the tile light list generation consumes is of course a topic that requires more research (there are so many approaches that could be used). Larger macrotiles would of course reduce bandwidth usage (and allow octree-style recursive culling), but the thread block shared memory size would become a limiting factor pretty quickly. The more threads you have accessing the same shared memory, the more conflicts you have (and the less parallelism can be exploited).
 
...
The way I see it, this method should be almost 100x faster than quad-based non-tiled deferred culling (due to several hundred times smaller render target), subject to triangle setup overhead from the lights (tiny fraction of Z-pass setup). Your tile frustum tests are taken care of in 2D by the rasterizer, and the shader does the Z test along with the LL.
...

Well, since a GPU is built for lots of parallelism and you are rasterizing to a tiny render target, there would probably be lots of idle resources and contention for the few resources that are used. So I'd assume it is a fairly poor idea. I think this might be what Andrew is getting at with the 5000 pixel/light thingo.

For my paper (Tiled Shading, Section 6.1) I did a quick test of this, by simply doing a deferred render to a small render target, which should be a reasonable estimator (simpler in most respects). The performance for drawing the light bounds was pretty poor. The test rendered to a 60x34 pixel render target, which corresponds to a light grid with tile size 32x32 at 1080p.

Also, to get a correct result conservative rasterization is needed, further complicating the implementation (and adding overhead).

So I think leveraging rasterization hardware for this job is a pretty tough sell. The only real benefit is that you can easily handle any shape of light in the same manner. Which is nice, but at the price of much worse performance for the common cases.

Cheers
.ola
 
I don't think that's really worth it, as you could just use dynamic branching during the shading pass on the lights with shadows.

Theoretically you could do the whole rendering with branching and not even worry about tile lists, but if you have thousands of lights and <10 hit each pixel on average, it's too costly to skip them this way. However, the number of shadow maps can't be very large, and they're not going to have this same skip/work ratio. Most importantly, you can skip a lot of tiles with zero cost during light culling by rasterization (i.e. you don't have to test in screen-space x or y), but you have to test each tile for shadowing. Finally, conservative Z tests are only feasible if you don't render the floor in the shadow map (which screws up VSM edge filtering) or if you have large planar surfaces (with the Z values stored as distances along the plane's normal). Using DB within a tile for fine-grained skipping of lighting is unlikely to significantly hurt perf over the use of smaller tiles.

Also, I don't think it's a good idea to focus on average performance like this, as you'll get some bad framerate dips when you can't cull shadows well due to the scene. At least with thousands of lights you can cap how many lights could affect each pixel through your art assets.

The real advantage of smaller tiles can be seen in Andy's presentation: On silhouette edges, all the lights between the min and max Z values get included, so if you're peering through a screen mesh or have dense foliage, it would murder light-culling efficiency (in a recurring theme of this thread, we see the scene dependency again). It would be interesting to see whether using the tile's Z-variance along with its max, min, and average to identify most tiles with this problem, and then taking action, could reduce the number of lights in them.
 
Not sure when that would happen. You mean if I had a tile size of 1x1? Or when <1% of the pixels in a 16x16 tile gets touched by a light?
Sorry, I meant covers N *tiles* of course (at whatever granularity is the leaves of the conceptual quad-tree).

Okay. So in summary, if
A) deferred light cost is 20μs/light,
B) GPU light culling is two orders of magnitude faster (~0.2 μs/light), and
C) tiled shading cost is 4μs/light,
then why is B not good enough for C?
In some cases it is, but those numbers were just from my specific test scenario where a good percentage of the lights were visible in that scene (not a ton of occlusion and not much of the scene was off-screen). Furthermore in my example I am culling the simplest type of lights: point lights with a radius. Cones are more complex to test and increase the culling cost relative to the shading cost.

I don't think it's unreasonable in "real" scenes for some high percentage of lights to be offscreen or occluded, at which point the culling speed could definitely become the bottleneck. To put it another way, sure, not every scene may need to do anything more complicated than the simple static tiles setup, but I also see no reason why we shouldn't pursue more efficient culling so that a game is *able* to throw massive numbers of lights at the algorithm with either the majority being culled or very simple shading.

That said, one of the uses I was pursuing when making the culling work more efficient was VPLs for dynamic GI. Unfortunately I eventually concluded that even with tens of thousands of VPLs, while you could get a good and reasonably efficiently-rendered static solution, as soon as lights/geometry moves, VPLs have such poor temporal coherence that you need literally millions of them, even with clustering, to get a stable solution.

So without that workload driving culling needs, I do imagine that for a lot of games the naive solution will be just fine. That said, I still think it's interesting to pursue the more efficient one, as light culling/software hierarchical rasterization is obviously not the only algorithm that benefits from work stealing and recursion :) In fact, I'd go as far as to propose that this is the most significant hurdle in writing efficient code in the GPU computing models today. It can be run efficiently on the hardware, just not efficiently expressed in the programming models.

I don't think that's really worth it, as you could just use dynamic branching during the shading pass on the lights with shadows.
Sure, and you do. In fact, the first thing I do in my demo is compute the light attenuation function and branch out the rest of the BRDF if it's zero, so in practice the culling is only avoiding the evaluation of that function, and yet it's still a win due to being over a reasonably large area. But as mentioned, there's a simultaneous desire for more efficient evaluation of that function (i.e. over larger areas) and smaller tiles. That to me leads to a tree evaluation, whether it be static with two or three levels as sebbbi describes, or dynamic with work stealing as in my approach.

I haven't really done proper testing of tile sizes, but I can say for certain that 32x32 is too big, and 16x16 is pushing it. The only reason that people tend to settle on 16x16 these days is because that fits work group sizing requirements for modern GPUs pretty well (i.e. ~256 items can still remain pretty efficient even if you use the maximum amount of shared memory).

Also, I don't think it's a good idea to focus on average performance like this, as you'll get some bad framerate dips when you can't cull shadows well due to the scene.
I agree; worst case performance is the most relevant, which is actually one of the reasons why I'm slightly less enthused with so-called "Forward+" than the very similar deferred variant. In my experience running complex shaders at the end of the raster pipeline results in a whole lot more variability (due to triangle sizes, occlusion, scheduling, etc.) than doing it in image space. The latter tends to be more predictable and have sometimes-significantly better worst cases.

It would be interesting to see whether using the tile's Z-variance along with its max, min, and average to identify most tiles with this problem, and then taking action, could reduce the number of lights in them.
Yes I actually tested this once by improving the representation of the tile Z distribution to be bimodal. It does indeed vastly cut down on the number of lights on edges (so much so that it doesn't seem worth doing anything more fancy than that) but in the cases that I tested it wasn't a win overall since it requires touching the Z data twice. It was around the same speed in the end. For a scene with lots of foliage/high-frequency geometry creating many more edges, it would probably be worth it though.
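For reference, the bimodal variant only changes the per-tile depth test; a sketch (how the two depth clusters are found - the second pass over the tile's Z values mentioned above - is assumed):

```cpp
// Sketch of culling against a bimodal tile depth representation: the tile
// stores two depth intervals (foreground/background clusters) instead of one
// min/max range, and a light passes only if it overlaps either interval.
struct DepthInterval { float zmin, zmax; };

struct BimodalTileZ {
    DepthInterval nearCluster;
    DepthInterval farCluster;
};

inline bool overlapsZ(DepthInterval a, float lightZmin, float lightZmax) {
    return lightZmax >= a.zmin && lightZmin <= a.zmax;
}

// Returns true if the light's view-space depth range can touch any geometry
// in the tile. With a single min/max interval, a light sitting in the empty
// gap between foreground and background would pass; here it is rejected.
bool lightTouchesTile(const BimodalTileZ& t, float lightZmin, float lightZmax) {
    return overlapsZ(t.nearCluster, lightZmin, lightZmax) ||
           overlapsZ(t.farCluster,  lightZmin, lightZmax);
}
```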

For the record, another thing I played with is something that I believe BF3 actually uses, at least on PS3: namely culling NdotL over the tile as well. There are several ways to do this (I think BF3 stores an average normal and cone radius or similar - you can also do it for any shader term generically with interval arithmetic), but again in my case the additional logic to compute the normal distribution over the tile didn't end up beating just computing NdotL per pixel and branching on it. The trade-off will of course vary depending on the scene, tile sizes, complexity of the culling function, etc.
 
In some cases it is, but those numbers were just from my specific test scenario where a good percentage of the lights were visible in that scene (not a ton of occlusion and not much of the scene was off-screen).
I'm not sure if you understand my idea correctly. Lights are culled by rendering them into a tiny buffer. For 1080p and 16x16 tiles, your rendertarget is 120x68. For a light that covers 10k pixels, you'll render a <100 pixel poly into this target (it's slightly enlarged to make sure it hits a pixel center when the actual light hits a tile frustum). If it is offscreen, it will be rejected as fast as a GPU can reject 1-2 polygons, and I doubt the CPU can match that (if it can, then do a frustum cull on the CPU first). If it's occluded, the GPU can reject 256 pix (i.e. tiles) per clock or something absurd like that for either Z-min or Z-max.

Rasterization is just so much faster and the tile count is so small that I can't envision anything but pathological cases working out better on a smart CPU algorithm, and those are unlikely to be realtime anyway.

That said, one of the uses I was pursuing when making the culling work more efficient was VPLs for dynamic GI. Unfortunately I eventually concluded that even with tens of thousands of VPLs, while you could get a good and reasonably efficiently-rendered static solution, as soon as lights/geometry moves, VPLs have such poor temporal coherence that you need literally millions of them, even with clustering, to get a stable solution.
Actually, the reason this topic piqued my interest is also for dynamic GI, except I have a different solution than VPLs, and something I've been wanting to do for almost a decade. At 720p and 30fps, we can now do up to 100k flops per pixel. Surely we can do better than the hack that is SSAO. I'll keep you posted...

Yes I actually tested this once by improving the representation of the tile Z distribution to be bimodal. It does indeed vastly cut down on the number of lights on edges (so much so that it doesn't seem worth doing anything more fancy than that) but in the cases that I tested it wasn't a win overall since it requires touching the Z data twice. It was around the same speed in the end. For a scene with lots of foliage/high-frequency geometry creating many more edges, it would probably be worth it though.
Ah, neat. You need the min/max Z data anyway, so why not just do a more thorough Z test (i.e. subtile frequency) for lights hitting tiles with a large difference between the values? You have to eventually do the per-pixel test anyway, so at worst you are doubling only the Z-test in the case that the lights truly do touch pixels.
 
That said, one of the uses I was pursuing when making the culling work more efficient was VPLs for dynamic GI. Unfortunately I eventually concluded that even with tens of thousands of VPLs, while you could get a good and reasonably efficiently-rendered static solution, as soon as lights/geometry moves, VPLs have such poor temporal coherence that you need literally millions of them, even with clustering, to get a stable solution.
Where was the problem with this? The rapid rebuild of spatial hierarchy of VPLs? The zmin/zmax computation of tiles?
 
Rasterization is just so much faster and the tile count is so small that I can't envision anything but pathological cases working out better on a smart CPU algorithm, and those are unlikely to be realtime anyway.
Yes I do get the concept... it's just binning on the GPU using UAVs and atomics. It's the list append (atomic) in global memory and the fact that the light data goes to DRAM and back that I'm not convinced ends up faster, but as I said it's the same algorithm really so might as well try both for a given application :)

Ah, neat. You need the min/max Z data anyway, so why not just do a more thorough Z test (i.e. subtile frequency) for lights hitting tiles with a large difference between the values? You have to eventually do the per-pixel test anyway, so at worst you are doubling only the Z-test in the case that the lights truly do touch pixels.
This is precisely the hierarchical thing I was saying is worthwhile ;) It's sort of annoying to express in compute shaders though, as it really is a tree/recursion. You can statically unroll a level or two into the shader if need be, but you don't gain anything in load balancing. There's really no reason that a GPU can't accommodate the elegant solution both in the programming model and the scheduler.

Where was the problem with this? The rapid rebuild of spatial hierarchy of VPLs? The zmin/zmax computation of tiles?
Nope, temporal coherence was entirely the problem. VPLs are "too" discrete a representation, and particularly once you want some visibility information for secondary bounces (but even before that too), having even a few dozen VPLs affecting a pixel is just not enough. Minor movement of the light changes where the VPLs get dumped, which has a drastic "flickering"/aliasing effect over a large area on the underlying geometry that is being lit. You need to be approaching error that is less than 1/256 of the final frame or similar before it's acceptable. The Lightcuts paper found similar results; unfortunately that means that even with lots of clustering cleverness applied, you end up needing hundreds of thousands or even millions of lights to produce a temporally stable result, at which point it is simply not an efficient way to compute indirect illumination.
 
This is precisely the hierarchical thing I was saying is worthwhile ;)
Yup, I realized it right as I typed it. It's almost as if you laid a trap for me ;) But since you're already talking about CPU culling, I figured that I might as well mention it. I wonder if an append/consume buffer for all these light-tiles would be fast enough on the GPU culler to be worth it. All these tiles would have the same load, so they should be balanced.

at which point it is simply not an efficient way to compute indirect illumination.
As cool as all the path tracing demos are, I was somewhat disappointed by them because it seems like brute force GI is just a generation or two away and there's no need for all this cleverness.
 
As cool as all the path tracing demos are, I was somewhat disappointed by them because it seems like brute force GI is just a generation or two away and there's no need for all this cleverness.
It's a definite possibility, but screen space noise is not necessarily as desirable as some other artifacts, so it may not be a sure thing. Voxels are interesting although they have some issues of their own. Still, constant-time cone tracing is pretty powerful, even if approximate.
 