Not sure when that would happen. You mean if I had a tile size of 1x1? Or when <1% of the pixels in a 16x16 tile get touched by a light?
Sorry, I meant covers N *tiles* of course (at whatever granularity forms the leaves of the conceptual quad-tree).
Okay. So in summary, if
A) deferred light cost is 20μs/light,
B) GPU light culling is two orders of magnitude faster (~0.2 μs/light), and
C) tiled shading cost is 4μs/light,
then why is B not good enough for C?
In some cases it is, but those numbers were just from my specific test scenario, where a good percentage of the lights were visible in that scene (not a ton of occlusion, and not much of the scene was off-screen). Furthermore, in my example I am culling the simplest type of light: point lights with a radius. Cones are more complex to test and increase the culling cost relative to the shading cost.
I don't think it's unreasonable in "real" scenes for some high percentage of lights to be offscreen or occluded, at which point the culling speed could definitely become the bottleneck. To put it another way, sure, not every scene may need anything more complicated than the simple static tiles setup, but I also see no reason why we shouldn't pursue more efficient culling so that a game is *able* to throw massive numbers of lights at the algorithm, with either the majority being culled or very simple shading.
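To put rough numbers on that (using the figures from above purely as an illustration): with 10,000 lights and a 90% cull rate, culling costs 10,000 × 0.2μs = 2ms while shading the surviving 1,000 lights costs 1,000 × 4μs = 4ms. Push the cull rate to 99% and the shading drops to ~0.4ms while the culling still sits at 2ms, at which point the "cheap" culling pass is the thing setting your frame time.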
That said, one of the uses I was pursuing when making the culling work more efficient was VPLs for dynamic GI. Unfortunately I eventually concluded that even with tens of thousands of VPLs, while you can get a good and reasonably efficiently rendered static solution, as soon as lights or geometry move, VPLs have such poor temporal coherence that you need literally millions of them, even with clustering, to get a stable solution.
So without that workload driving culling needs, I do imagine that for a lot of games the naive solution will be just fine. That said, I still think it's interesting to pursue the more efficient one, as light culling/software hierarchical rasterization is obviously not the only algorithm that benefits from work stealing and recursion.
In fact, I'd go as far as to propose that this is the most significant hurdle in writing efficient code in the GPU computing models today: this kind of work can be run efficiently on the hardware, it just can't be expressed efficiently in the programming models.
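To make that concrete, the kind of thing I mean looks roughly like the following in CUDA terms: a persistent-threads pool where threads keep pulling tile work items from a shared queue and can push refined child tiles back onto it. This is just an illustrative sketch (all names made up, termination and fencing simplified), not code from my demo:

```cuda
// Illustrative sketch only: a persistent-threads kernel that pulls tile work
// from a global queue and can push refined (child) tiles back onto it.
// A production version needs memory fences, per-SM local queues, and a real
// termination protocol rather than the simplified check below.
#include <cuda_runtime.h>

struct TileWork { int x, y, level; };

__device__ int g_head = 0;             // next item to consume
__device__ int g_tail = 0;             // next free slot (host seeds the initial tiles)
__device__ TileWork g_queue[1 << 16];  // fixed-size pool, sized for the worst case

// Cull lights against one tile; return true if the tile should be split further.
__device__ bool processTile(const TileWork& w)
{
    // ... per-tile light culling / refinement decision elided ...
    return false;
}

__global__ void persistentWorker()
{
    for (;;)
    {
        // Grab one item per thread to keep the sketch short; a real
        // implementation would grab per warp/block and steal between queues.
        int idx = atomicAdd(&g_head, 1);
        if (idx >= g_tail)             // simplified termination: assumes no
            return;                    // producer is still pushing work

        TileWork w = g_queue[idx];

        if (processTile(w))
        {
            // "Recursion": append the four child tiles as new work items.
            int base = atomicAdd(&g_tail, 4);
            for (int i = 0; i < 4; ++i)
                g_queue[base + i] = TileWork{ w.x * 2 + (i & 1),
                                              w.y * 2 + (i >> 1),
                                              w.level + 1 };
        }
    }
}
```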
I don't think that's really worth it, as you could just use dynamic branching on the lights with shadows during the shading pass.
Sure, and you do. In fact, the first thing I do in my demo is compute the light attenuation function and branch out the rest of the BRDF if it's zero, so in practice the culling is only avoiding the evaluation of that function, and yet it's still a win because it happens over a reasonably large area. But as mentioned, there's a simultaneous desire for more efficient evaluation of that function (i.e. over larger areas) and for smaller tiles. That to me leads to a tree evaluation, whether it's a static one with two or three levels as sebbbi describes, or my dynamic one with work stealing.
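For clarity, that early-out amounts to nothing more exotic than this; a CUDA-flavoured illustrative sketch rather than my actual shader, with a simple linear falloff standing in for whatever attenuation curve is actually used:

```cuda
#include <cuda_runtime.h>

// CUDA's float3 ships without operators, so define the two helpers used below.
__device__ inline float3 sub3(float3 a, float3 b) { return make_float3(a.x - b.x, a.y - b.y, a.z - b.z); }
__device__ inline float  dot3(float3 a, float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Illustrative sketch of the per-pixel attenuation early-out.
__device__ float shadePointLight(float3 P, float3 N,              // pixel position and normal
                                 float3 lightPos, float lightRadius)
{
    float3 L    = sub3(lightPos, P);
    float  dist = sqrtf(dot3(L, L));

    // Cheap attenuation term first: per-tile culling only saves this much work
    // per pixel, so everything heavier sits behind the branch.
    float atten = fmaxf(0.0f, 1.0f - dist / lightRadius);
    if (atten <= 0.0f)
        return 0.0f;                   // branch out the rest of the BRDF

    // ... full BRDF evaluation only for lights that actually reach this pixel ...
    float ndotl = fmaxf(0.0f, dot3(N, L) / fmaxf(dist, 1e-6f));
    return atten * ndotl;              // stand-in for the real shading result
}
```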
I haven't really done proper testing of tile sizes, but I can say for certain that 32x32 is too big, and 16x16 is pushing it. The only reason people tend to settle on 16x16 these days is that it fits the work group sizing requirements of modern GPUs pretty well (i.e. ~256 items can still remain pretty efficient even if you use the maximum amount of shared memory).
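For reference, the reason 16x16 lines up so nicely is that it gives you a 256-thread group for the usual tile kernel shape, roughly the following (an illustrative CUDA sketch of the standard approach, not anyone's actual code; names and details are made up):

```cuda
// Rough shape of a standard tiled culling + shading kernel, one 16x16 thread
// block per screen tile; launched e.g. as
// tiledShade<<<dim3((w + 15) / 16, (h + 15) / 16), dim3(16, 16)>>>(...).
#include <cuda_runtime.h>

#define TILE_SIZE       16
#define MAX_TILE_LIGHTS 256

struct PointLight { float x, y, z, radius; };

__device__ bool lightIntersectsTile(const PointLight& l, int tileX, int tileY)
{
    // ... frustum / depth-range test against this tile (elided) ...
    return true;
}

__global__ void tiledShade(const PointLight* lights, int numLights
                           /* , G-buffer and output surfaces ... */)
{
    __shared__ int s_lightCount;
    __shared__ int s_lightList[MAX_TILE_LIGHTS];    // ~1KB of shared memory

    int t = threadIdx.y * TILE_SIZE + threadIdx.x;  // 0..255 within the tile
    if (t == 0)
        s_lightCount = 0;
    __syncthreads();

    // All 256 threads cooperatively cull the global light list for this tile.
    for (int i = t; i < numLights; i += TILE_SIZE * TILE_SIZE)
    {
        if (lightIntersectsTile(lights[i], blockIdx.x, blockIdx.y))
        {
            int slot = atomicAdd(&s_lightCount, 1);
            if (slot < MAX_TILE_LIGHTS)
                s_lightList[slot] = i;
        }
    }
    __syncthreads();

    // Each thread then shades its own pixel against the per-tile light list.
    int count = min(s_lightCount, MAX_TILE_LIGHTS);
    for (int i = 0; i < count; ++i)
    {
        // ... evaluate lights[s_lightList[i]] for this pixel ...
    }
}
```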
Also, I don't think it's a good idea to focus on average performance like this, as you'll get some bad framerate dips when the scene doesn't let you cull shadows well.
I agree; worst-case performance is the most relevant, which is actually one of the reasons why I'm slightly less enthused with so-called "Forward+" than the very similar deferred variant. In my experience, running complex shaders at the end of the raster pipeline results in a whole lot more variability (due to triangle sizes, occlusion, scheduling, etc.) than doing it in image space. The latter tends to be more predictable and to have sometimes-significantly better worst cases.
It would be interesting to see if using the tile's Z-variance along with its max, min, and average to identify most tiles with this problem, and then taking action, could reduce the number of lights in them.
Yes, I actually tested this once by improving the representation of the tile Z distribution to be bimodal. It does indeed vastly cut down on the number of lights on edges (so much so that it doesn't seem worth doing anything fancier than that), but in the cases that I tested it wasn't a win overall, since it requires touching the Z data twice. It was around the same speed in the end. For a scene with lots of foliage/high-frequency geometry creating many more edges, it would probably be worth it though.
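Concretely, the bimodal test amounts to something like this (an illustrative sketch, not the exact code I tested; how the two ranges and the split point get computed is elided):

```cuda
// Sketch of a bimodal tile depth test: instead of one [zMin, zMax] range per
// tile, keep two sub-ranges and require a light to overlap at least one.
struct TileDepth2
{
    float nearMin, nearMax;   // depth range of pixels in front of the split point
    float farMin,  farMax;    // depth range of pixels behind the split point
};

__device__ bool lightOverlapsTileDepth(float lightZMin, float lightZMax,
                                       const TileDepth2& t)
{
    bool hitsNear = (lightZMax >= t.nearMin) && (lightZMin <= t.nearMax);
    bool hitsFar  = (lightZMax >= t.farMin)  && (lightZMin <= t.farMax);
    return hitsNear || hitsFar;   // reject lights that only cover the gap between the two
}
```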
For the record, another thing I played with is something that I believe BF3 actually uses, at least on PS3: namely, culling NdotL over the tile as well. There are several ways to do this (I think BF3 stores an average normal and cone radius or similar - you can also do it for any shader term generically with interval arithmetic), but again, in my case the additional logic to compute the normal distribution over the tile didn't end up being cheaper than just computing NdotL per pixel and branching on it. The trade-off will of course vary depending on the scene, tile sizes, complexity of the culling function, etc.
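For anyone curious, the cone variant boils down to a test along these lines (again just an illustrative sketch of the average-normal-plus-cone idea, not BF3's or my actual implementation):

```cuda
#include <cuda_runtime.h>

// Per-tile NdotL culling with an average normal plus cone half-angle.
// If the light direction is more than (coneHalfAngle + 90 degrees) away from
// the cone axis, no normal inside the cone can have NdotL > 0, so the light
// can be culled for the whole tile.
__device__ bool lightCanAffectNormalCone(float3 coneAxis,      // average tile normal, normalized
                                         float  coneHalfAngle, // radians, covers the tile's normals
                                         float3 toLightDir)    // normalized direction toward the light
{
    float cosToAxis = coneAxis.x * toLightDir.x
                    + coneAxis.y * toLightDir.y
                    + coneAxis.z * toLightDir.z;
    float angleToAxis = acosf(fminf(fmaxf(cosToAxis, -1.0f), 1.0f));

    // Best-case NdotL over the cone is cos(angleToAxis - coneHalfAngle);
    // it is positive only if that angle stays below 90 degrees.
    return (angleToAxis - coneHalfAngle) < 1.57079633f;  // pi/2
}
```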