Testing a spot light's cone against an object's AABB

shuipi

When doing lighting we usually want to do a rough culling between lights and objects on the CPU side, in order to save shader cost. Spot lights are a common type of light used in many games, but determining whether an object's AABB overlaps the spot light's cone doesn't seem to be trivial. What are the common practices used in games?
 
Cone vs plane tests are actually pretty cheap IIRC. Google (or check the above link) for the math but I'm pretty sure they aren't bad. And yeah you could always stick a frustum around the cone since AABB/frustum tests are definitely cheap.
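
For reference, a minimal sketch of the kind of AABB-vs-frustum-planes test being suggested, assuming you wrap the spot cone in a set of bounding planes (e.g. four sides plus near/far). The Vec3/Plane/AABB types and the convention that plane normals point into the volume are just assumptions for illustration, not from any particular engine:

// Hypothetical minimal math types.
struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };   // signed distance of point p is dot(n, p) + d; n points into the volume
struct AABB  { Vec3 min, max; };

// Conservative AABB-vs-plane-set test using the "positive vertex" trick:
// returns false only when the box is completely behind one of the planes.
bool AabbTouchesVolume(const AABB& box, const Plane* planes, int planeCount)
{
    for (int i = 0; i < planeCount; ++i)
    {
        const Plane& pl = planes[i];
        // Box corner furthest along the plane normal.
        const float px = (pl.n.x >= 0.0f) ? box.max.x : box.min.x;
        const float py = (pl.n.y >= 0.0f) ? box.max.y : box.min.y;
        const float pz = (pl.n.z >= 0.0f) ? box.max.z : box.min.z;
        if (pl.n.x * px + pl.n.y * py + pl.n.z * pz + pl.d < 0.0f)
            return false;   // fully outside this plane -> cull the object for this light
    }
    return true;            // potentially overlapping (conservative)
}

The test can return false positives near the frustum corners, but it never rejects a box that actually touches the volume, which is what you want for culling.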
 
Note: cone-plane intersection may not work, because the box could completely enclose the cone without any intersection yet still shouldn't be rejected.
 
Note: cone-plane intersection may not work, because the box could completely enclose the cone without any intersection yet still shouldn't be rejected.
That is easily solved by having your intersection function return "inside", "outside" or "intersects" instead of a simple boolean. If the cone is "inside" all 6 planes, it's enclosed in the box.
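
Along those lines, here's a rough sketch of a three-state cone-vs-plane classification for a finite spot cone, treating the cone as the convex hull of its apex and far cap. The types, names and conventions are assumptions for illustration:

#include <cmath>
#include <algorithm>

struct Vec3 { float x, y, z; };
static float Dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

enum class Side { Outside, Intersects, Inside };

// Finite spot cone: apex, unit direction, half-angle and range.
struct Cone { Vec3 apex; Vec3 dir; float halfAngle; float range; };

// Classify the cone against a plane with unit normal n and offset d
// (signed distance of point p is Dot(n, p) + d; "Inside" means fully on the positive side).
Side ClassifyConePlane(const Cone& c, const Vec3& n, float d)
{
    const float apexDist = Dot(n, c.apex) + d;

    // Far cap: center, radius, and the cap's extent along the plane normal.
    const Vec3  capCenter = { c.apex.x + c.dir.x * c.range,
                              c.apex.y + c.dir.y * c.range,
                              c.apex.z + c.dir.z * c.range };
    const float capRadius = c.range * std::tan(c.halfAngle);
    const float cosAxis   = Dot(n, c.dir);
    const float capExtent = capRadius * std::sqrt(std::max(0.0f, 1.0f - cosAxis * cosAxis));
    const float capDist   = Dot(n, capCenter) + d;

    // Extreme distances over the whole cone lie at the apex or on the far cap rim.
    const float minDist = std::min(apexDist, capDist - capExtent);
    const float maxDist = std::max(apexDist, capDist + capExtent);

    if (minDist > 0.0f) return Side::Inside;
    if (maxDist < 0.0f) return Side::Outside;
    return Side::Intersects;
}

Run this against the six AABB face planes with normals pointing into the box: Outside against any plane rejects the light, Inside against all six is exactly the box-encloses-cone case above, and anything else is treated as overlapping (conservative, so corner cases can produce false positives but never false rejections).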
 
Sure although if you go tiled deferred rendering (read: the best kind :)), you end up doing the frustum/tile->light volume intersection tests in code/math anyways.

mmh, valid point indeed.
No idea in which domain I'd try to solve that... (2D/3D).
I guess I'd approximate in 2D first and see how well it works. (some analysis)
 
Sure although if you go tiled deferred rendering (read: the best kind :)), you end up doing the frustum/tile->light volume intersection tests in code/math anyways.
Unless you rasterize the light bounding geometry to your tile buffer. If you mark each light with one bit (1, 2, 4, 8, 16, etc), you can additive blend up to 64 lights to a 16-16-16-16 int back buffer (one pixel per tile) in a single pass. Read the texture, and memexport/streamout the light counts and light indices to the tile constant buffers (to read the light properties, such as the light color and light shadow map matrix*). This is really fast if you have for example 32x32 tiles.

When rendering the light bounding geometry, set the max-z values as the z-buffer values of the 32x32-times smaller z-buffer (use basic z-greater culling). If you also want min-z culling, one extra draw call that marks the stencil is needed (and the second draw call reads and clears it). Two draw calls per light is not a problem, since the shadow map rendering most likely required hundreds of much more expensive draw calls.

*) Shadow maps are all stored in one big atlas, and the shadow map matrix includes the viewport add/mul to access the correct data (no extra instructions needed).
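
To illustrate the bit-per-light packing described above, here's a small CPU-side C++ sketch of just the decode step (in practice this would be the memexport/streamout pass reading the tile texture); the function name and layout are made up for the example:

#include <cstdint>
#include <vector>

// Each light owns one bit: lights 0..15 accumulate in channel 0, 16..31 in channel 1,
// and so on - which is what additively blending a distinct power of two per light into
// a 16-16-16-16 integer target produces.
std::vector<uint32_t> DecodeTileLightList(const uint16_t tileChannels[4])
{
    std::vector<uint32_t> lightIndices;
    for (uint32_t channel = 0; channel < 4; ++channel)
        for (uint32_t bit = 0; bit < 16; ++bit)
            if ((tileChannels[channel] >> bit) & 1u)
                lightIndices.push_back(channel * 16 + bit);   // global index of a light touching this tile
    return lightIndices;   // e.g. written to the tile's constant buffer as a count plus indices
}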
 
Unless you rasterize the light bounding geometry to your tile buffer. If you mark each light with one bit (1, 2, 4, 8, 16, etc), you can additive blend up to 64 lights to a 16-16-16-16 int back buffer (one pixel per tile) in a single pass.
Sure. When I tested any methods that go through memory though (even 1 bit per light) they were slower on DX11 parts vs a single-pass compute-shader. To get equivalent cleverness with the rasterizer you need at the very least depth bounds test (two-pass stencil is way too slow) which is not generally supported, and even when it is requires one draw / light volume :( Furthermore, the frustum math is actually really cheap... math is basically free on GPUs nowadays anyways :p

Of course you're probably stuck with rasterizer-based methods on pre-DX11 hardware.
 
Sure. When I tested any methods that go through memory though (even 1 bit per light) they were slower on DX11 parts vs a single-pass compute-shader. To get equivalent cleverness with the rasterizer you need at the very least depth bounds test (two-pass stencil is way too slow) which is not generally supported, and even when it is requires one draw / light volume :( Furthermore, the frustum math is actually really cheap... math is basically free on GPUs nowadays anyways :p

Of course you're probably stuck with rasterizer-based methods on pre-DX11 hardware.
Luckily draw calls are much faster on consoles (lightweight API closer to metal, UMA, etc). And you can use two-sided stencil to reduce draw calls (the back side marks the stencil for culling and the front side reads the stencil and draws the light). Just be sure to render the bounding sphere/cone so that the back sides are always rendered first (the mesh needs to be prepared for this order). You will lose hi-stencil this way, as the GPU will not be able to refresh it fast enough, but as you are rendering very simple untextured pixels to a 40x23 texture, fill rate is not really a concern.

The light tile culling step is really, really fast compared to the light rendering. The lighting step itself should be as fast with DX9-class hardware as with compute shaders (as all the light properties are neatly stored in a constant buffer - there is no need to read this data from textures). I of course assume here that the tile size is reasonably large (32x32). So for each tile pixel you set up, there are 1024 pixels to light (which of course includes shadow map reads with EVSM filtering).

I looked through your compute shader implementation and it looks really interesting. I haven't done much with compute shaders myself (basically the only thing I have coded with compute shaders was an experimental radix sorter for one of my university classes).
 
When doing lighting we usually want to do a rough culling between lights and objects on the CPU side, in order to save shader cost. Spot lights are a common type of light used in many games, but determining whether an object's AABB overlaps the spot light's cone doesn't seem to be trivial. What are the common practices used in games?
We have bounding spheres around our objects. Sphere-cone culling is really fast. Just dot product the sphere center with the four cone planes, and if the result is less than the object bounding sphere radius, it's inside the cone. Near and far planes can be added with a single extra dot product. Vectorize the test for best performance. It's even better to have some kind of acceleration structure, such as an octree, if you have plenty of objects in your game world. This kind of sphere-cone test also notices whether the sphere intersects the cone or is fully inside it (so you can skip the rest of the inside checks for the octree children that are fully inside or outside of the light cone).
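
For illustration, a minimal scalar sketch of that kind of sphere-vs-plane-set test, returning the three states mentioned. The Plane/Sphere types and the inward-facing normal convention are assumptions, and a real version would vectorize the four side planes:

struct Vec3   { float x, y, z; };
struct Plane  { Vec3 n; float d; };      // signed distance of point p is dot(n, p) + d; n points into the cone
struct Sphere { Vec3 center; float radius; };

enum class Cull { Outside, Intersects, Inside };

// Scalar reference for the sphere-vs-cone-planes test (four side planes, optionally near/far).
Cull TestSphereAgainstPlanes(const Sphere& s, const Plane* planes, int planeCount)
{
    Cull result = Cull::Inside;
    for (int i = 0; i < planeCount; ++i)
    {
        const float dist = planes[i].n.x * s.center.x
                         + planes[i].n.y * s.center.y
                         + planes[i].n.z * s.center.z
                         + planes[i].d;
        if (dist < -s.radius) return Cull::Outside;       // fully behind one plane: cull
        if (dist <  s.radius) result = Cull::Intersects;  // straddles this plane
    }
    return result;   // Inside lets an octree traversal skip further tests for the children
}

Note that four side planes really bound a pyramid around the cone rather than the cone itself, so the test is slightly conservative, which is fine for culling.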
 
Luckily draw calls are much faster on consoles (lightweight API closer to metal, UMA, etc).
Yes true, definite advantage.

Just be sure to render the bounding sphere/cone so that the back sides are always rendered first (the mesh needs to be prepared for this order).
Clever, I like it :) So you do this and render to a series of RTs with some bit magic blend state? I guess on 360 this is all in EDRAM so it's pretty cheap? My biggest concern here though is serialization from score-boarding on such a small texture. Obviously this serialization happens for the atomics in the compute shader as well but it's local memory vs global so atomics are cheaper there (at least on ATI).

The lighting step itself should be as fast with DX9-class hardware as with compute shaders (as all the light properties are neatly stored in a constant buffer - there is no need to read this data from textures). I of course assume here that the tile size is reasonably large (32x32). So for each tile pixel you set up, there are 1024 pixels to light (which of course includes shadow map reads with EVSM filtering).
Yes I'd expect it to be similar speed for non-MSAA, although the cleverness with re-scheduling MSAA pixels was a huge win in the compute shader version so I'd imagine that to beat the DX9 equivalent. 16x16 tiles seems to be a sweet-spot on high-end PC hardware with the compute shader implementation but obviously it will depend on the relative size of the light lists and expense of the culling step.

I looked through your compute shader implementation and it looks really interesting. I haven't done much with compute shaders myself (basically the only thing I have coded with compute shaders was an experimental radix sorter for one of my university classes).
Yeah it's a pretty handy tool for graphics and really nice that it's right in the API (and interacts with the same scheduler) vs. any of the interop solutions like CUDA and OpenCL.
 