Rendering with a feedback loop

Does anyone know of any projects in the research or commercial space for a 3D renderer (hardware and API, or software) with an occlusion feedback loop? To clarify what I mean, consider the following:

1. You can render your scene front to back, either perfectly, or you can signal when the ordering is not (or potentially not) front to back (to pause the feedback).
2. As you render (exactly how often is up in the air: per draw call or ...), or on demand, the renderer tells all relevant parties upstream whether or not the whole screen has been drawn to, and/or returns a bitmap of fully occluded screen-space tiles. This way you can stop feeding the renderer.

I am mainly interested in reducing the number of triangles sent to be rendered, but any info on similar projects would be welcome, seeing as Google has failed me.

Thanks in advance.
 
GPU and CPU run asynchronously. The GPU is often at least half a frame behind the CPU. You'd need to read a GPU-generated buffer on the CPU during the same frame (multiple times), which means you need to stall both the CPU and the GPU during the frame (multiple times). A feedback loop like this would not give you any performance gains. The alternative is to use the last frame's data to cull objects in the next frame. There are games that do this (for example, all our previous console games). However, this approach has well-known problems: visibility in the previous frame is not the same as visibility in the next frame (otherwise no new objects could ever become visible when the camera moves).

In the naive implementation, each object becomes visible one frame late, and stops being visible one frame late. The latter is not a problem (one frame of extra visibility doesn't cost much), but objects becoming visible one frame late is a major problem. If nothing is done to prevent this issue, sudden visibility changes (such as peeking around a corner) cause lots of visible popping. The worst case is that big level structures (such as building walls or terrain patches) disappear briefly, letting the player see through the level (killing the illusion).

One way to combat this is to enlarge object bounding spheres/boxes based on some conservative estimate. This way the object's bounding area will (most likely) become visible sooner than the object, and we hide the popping. This works quite nicely for smoothly moving cameras and objects, but if something happens suddenly near the camera (a wall explodes, a door opens quickly, etc.) the whole background pops briefly and the illusion of "being there" disappears. Because of these limitations, this technique works quite well for locked 60 fps games (minimum fps = 60). However, even at 60 fps you'll sometimes see popping issues, even with highly conservative bounding areas. The bigger the conservative bounding areas, the more performance you waste rendering things that are not visible.
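(As a rough illustration of the last-frame approach described above: a minimal OpenGL sketch, assuming a per-object occlusion query against an enlarged bounding proxy. `drawObject` and `drawBoundingProxy` are hypothetical application helpers, not anything from the posts.)

```cpp
// Sketch: last frame's occlusion-query result drives this frame's culling,
// with a conservatively enlarged bounding proxy to hide one-frame popping.
#include <GL/glew.h>
#include <vector>

struct Object {
    GLuint query = 0;              // occlusion query issued last frame
    bool visibleLastFrame = true;  // start visible to avoid first-frame popping
};

// Hypothetical helpers provided elsewhere by the application.
void drawObject(const Object& obj);
void drawBoundingProxy(const Object& obj, float scale);

void cullAndDraw(std::vector<Object>& objects, float conservativeScale /* e.g. 1.25f */)
{
    for (Object& obj : objects) {
        // 1. Read back last frame's result (one frame of latency, so usually no stall).
        if (obj.query) {
            GLuint anySamples = 0;
            glGetQueryObjectuiv(obj.query, GL_QUERY_RESULT, &anySamples);
            obj.visibleLastFrame = (anySamples != 0);
        } else {
            glGenQueries(1, &obj.query);
        }

        // 2. Draw the object only if its enlarged bounds passed last frame.
        if (obj.visibleLastFrame)
            drawObject(obj);

        // 3. Re-test the enlarged bounding proxy for next frame, with all writes off.
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glDepthMask(GL_FALSE);
        glBeginQuery(GL_ANY_SAMPLES_PASSED, obj.query);
        drawBoundingProxy(obj, conservativeScale);   // sphere/box scaled up to hide popping
        glEndQuery(GL_ANY_SAMPLES_PASSED);
        glDepthMask(GL_TRUE);
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    }
}
```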

If you want an intra-frame feedback loop, I suggest you move all your scene data structures into GPU memory and do your viewport culling completely on the GPU (using compute shaders). This way you can immediately read the GPU-rendered depth/occlusion buffers and perform as many iterative culling+rendering steps as you need. However, you need to be careful about how many passes you render. Reading the depth buffer in a compute shader forces the GPU to wait until the draw calls have all finished and the results have been written to memory (possibly also flushing the ROP cache on some GPUs). If you do this too many times per frame, you will end up being slower than just (brute force) rendering everything.
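(A sketch of what one such culling+rendering iteration could look like on the host side, with all scene data resident on the GPU. The shader programs and buffer layouts, `cullProgram`, `drawCommandBuffer`, the depth pyramid, are assumptions for illustration; only the GL calls themselves are standard.)

```cpp
#include <GL/glew.h>

void cullAndDrawPass(GLuint cullProgram,       // compute shader: tests object bounds vs. a reduced depth buffer
                     GLuint sceneBoundsBuffer, // SSBO with per-object bounds
                     GLuint drawCommandBuffer, // indirect draw commands written by the compute pass
                     GLuint depthPyramidTex,   // depth reduction built from the previous pass
                     GLuint drawProgram,
                     GLuint vertexArray,
                     GLsizei objectCount,
                     GLsizei maxDrawCount)
{
    // 1. GPU-side culling: one thread per object reads the depth pyramid and fills in
    //    an indirect draw command (rejected objects keep a zeroed command).
    glUseProgram(cullProgram);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, sceneBoundsBuffer);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, drawCommandBuffer);
    glBindTextureUnit(0, depthPyramidTex);
    glDispatchCompute((objectCount + 63) / 64, 1, 1);

    // 2. Make the indirect commands visible to the draw stage. This is the
    //    synchronization point warned about above: do it too many times per
    //    frame and brute-force rendering wins.
    glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT);

    // 3. Render the surviving objects without any CPU readback.
    glUseProgram(drawProgram);
    glBindVertexArray(vertexArray);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, drawCommandBuffer);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr,
                                maxDrawCount, 0);
}
```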

Unfortunately I can't yet discuss how we solve the occlusion culling problem in our GPU-driven renderer, but I can forward you to NVIDIA research: http://on-demand.gputechconf.com/gt...32-Advanced-Scenegraph-Rendering-Pipeline.pdf
 
Some applications don't have hard interactive or realtime requirements, but still want high performance. So if latency is not an issue, the inter-frame CPU-GPU feedback issue could be avoided by having several frames in the pipeline.
I guess the "has everything been covered yet" question could be answered efficiently by using the HiZ buffer, but generally that's not accessible. Dunno how it could be answered otherwise without letting the GPU gather the whole buffer, or doing likely-too-expensive bookkeeping in a low-res UAV from the pixel shaders.
 
GPU and CPU run asynchronously. The GPU is often at least half a frame behind the CPU. You'd need to read a GPU-generated buffer on the CPU during the same frame (multiple times), which means you need to stall both the CPU and the GPU during the frame (multiple times). A feedback loop like this would not give you any performance gains.

I guess the "has everything been covered yet" question could be answered efficiently by using the HiZ buffer, but generally that's not accessible. Dunno how it could be answered otherwise without letting the GPU gather the whole buffer, or doing likely-too-expensive bookkeeping in a low-res UAV from the pixel shaders.

I was thinking more along the lines of custom hardware and API (as compared to current and past mainstream ones). When I said "tells all relevant parties upstream" I was implying that there would be hardware on the GPU and more advanced command processors/buffers to handle the actual fully-occluded condition or sub-condition. The only thing on the CPU side would be an interrupt identifying which push thread or "virtual channel source" (if you prefer) can be killed.
 
Here is a newer version of the NVIDIA presentation held by myself http://on-demand.gputechconf.com/si.../SG4117-OpenGL-Scene-Rendering-Techniques.pdf
As hinted in the slides, this is still a topic of active research for us, and we are trying to make further improvements to overcome the deficits of readback/current indirect culling.

@Psycho the algorithm presented here is basically the reverse: from "was everything occluded" to "was one pixel visible", for which you can leverage the rasterization pipeline efficiently. I've also experimented with HiZ, but it wasn't as good (too many false positives) as the rasterization technique presented.

As sebbbi said, you'd want to avoid intra-frame CPU/GPU communication.
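(A very rough host-side sketch of the "was one pixel visible" idea: rasterize each object's bounding box against the existing depth buffer and let the fragment shader flag the object as visible in a buffer. `bboxProgram` and the visibility-SSBO layout are assumptions for illustration, not the presentation's actual code.)

```cpp
#include <GL/glew.h>

void markVisibleObjects(GLuint bboxProgram,      // VS expands per-instance bounds; FS does visibility[instanceID] = 1
                        GLuint bboxVertexArray,  // unit box, drawn instanced once per object
                        GLuint visibilityBuffer, // SSBO of uint flags, cleared to 0 beforehand
                        GLsizei objectCount)
{
    // Depth test on, all writes off: we only want the side effect in the SSBO.
    // The fragment shader should force early depth testing
    // (layout(early_fragment_tests) in;) so occluded fragments never run.
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_FALSE);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);

    glUseProgram(bboxProgram);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, visibilityBuffer);
    glBindVertexArray(bboxVertexArray);
    // 36 vertices per box, one instance per object; any fragment that survives
    // the depth test marks its instance as visible.
    glDrawArraysInstanced(GL_TRIANGLES, 0, 36, objectCount);

    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);
    glBindVertexArray(0);
}
```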
 
Here is a newer version of the NVIDIA presentation held by myself http://on-demand.gputechconf.com/si.../SG4117-OpenGL-Scene-Rendering-Techniques.pdf

Some interesting stuff there - always good to revisit old methods when new capabilities become available :)

But if the question is whether the whole screen has been filled yet: is it always faster to rasterize the far plane with writes disabled (I forgot you can disable everything and still get the query result), with lots of early-Z rejection etc., than to use the TMUs to minify the screen to a few pixels first? (In a case where I can do the pixel-written detection from the color buffer, it should only be limited by bandwidth.)
You can of course do the final readout with occlusion queries in both cases if you want.
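(A minimal sketch of that far-plane test, assuming an occlusion query and a hypothetical `drawFullScreenQuadAtFarPlane()` helper: draw one full-screen quad at maximum depth with all writes disabled; if any sample passes the depth test, some pixel is still untouched, i.e. the screen is not fully covered yet.)

```cpp
#include <GL/glew.h>

void drawFullScreenQuadAtFarPlane(); // hypothetical helper provided by the application

bool screenFullyCovered(GLuint query)
{
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_LEQUAL);            // a quad at z = 1.0 passes only where nothing has been drawn

    glBeginQuery(GL_ANY_SAMPLES_PASSED, query);
    drawFullScreenQuadAtFarPlane();
    glEndQuery(GL_ANY_SAMPLES_PASSED);

    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

    // Note: reading the result immediately stalls until the GPU gets there,
    // which is exactly the intra-frame cost discussed above.
    GLuint anySamples = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &anySamples);
    return anySamples == 0;
}
```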
 
So I take it nobody has tried a hardware implementation as such... sigh, oh well. Psycho, if you plan on implementing this in a non-realtime project, could you please share your benchmark results with me? Thanks.
 
So I take it nobody has tried a hardware implementation as such... sigh, oh well. Psycho, if you plan on implementing this in a non-realtime project, could you please share your benchmark results with me? Thanks.
I don't believe we need dedicated hardware for this. You can already create software solutions that are very fast on modern hardware. It would be nice if the PC graphics APIs exposed the GPU's hierarchical depth buffers, as that would speed up the software culling implementation. NVIDIA and AMD do not have any OpenGL extensions for this yet. AMD exposes their HTILE buffers in Mantle (source: DICE Mantle presentation).
 
I don't believe we need dedicated hardware for this.

You can already create software solutions that are very fast on modern hardware.
I disagree about not needing dedicated hardware for this technique. Even if you can make a fast implementation, you'd just be guessing about when to check on demand, and not necessarily cull a bunch of primitives. If it is done automatically on a per-primitive basis, you're likely to save many more primitives per render target. If you take the rest of my idea and add entities to handle the occluded state on behalf of the CPU, you can do some advanced portal techniques. Also, with hardware, if you add some extra gates (on top of the AND/OR gates) you can get "non-aligned" results, which might come in handy. I think this, in combination with conditional rendering techniques, will save on your primitive/vertex budget.
 