GPU-driven rendering (SIGGRAPH 2015 follow up)

sebbbi

We had a presentation about GPU-driven rendering pipelines in the SIGGRAPH 2015 "Advances in Real-Time Rendering in Games" course.
My part of the presentation was highly compressed to fit into my 25 minute time slot, so I had to cut some slides. Most importantly, I had no room to discuss the shortcomings of the technology and how we avoided them.

I got lots of questions after the presentation, lots of people came to talk to me during the SIGGRAPH conference, and I got some e-mails after the conference. Most of the questions were very well thought out and touched on real problems with techniques like this. The reason for this follow-up post is to clarify some things and to answer further questions in advance.

Advances in Real-Time Rendering in Games presentations:
http://advances.realtimerendering.com/s2015/index.html

Direct link to our presentation (my slides page 39->):
http://advances.realtimerendering.c...iggraph2015_combined_final_footer_220dpi.pptx

Since I don't have my own blog, I decided to post the follow-up here at beyond3d.com. Feel free to ask me questions if something is still unclear. I will only be answering questions regarding the GPU-driven rendering technology (including deferred texturing and virtual texturing). Game-related questions are out of the scope of this thread.

I also (finally) created a Twitter account for myself (@SebAaltonen). If you don't want to post your questions in this forum thread, you can send me a PM or use Twitter.
 
Q: What is the difference between the 64 bit and 128 bit G-buffer layouts?
The 64 bit layout is the minimal layout. It provides the bare minimum for most real time applications (UV and tangent frame). The UV allows you to fetch all the material textures (and to calculate screen space gradients for the virtual texturing page ID pass and anisotropic filtering). The tangent allows you to perform (anisotropic) lighting and parallax mapping. The UV also implicitly gives you the virtual texture page ID (UV divided by page size = VT page ID). This gives you the mip level (integer part) for free and allows you to fetch per material properties from the virtual texture, such as the texture renormalization mul+add, colorize color, lighting model id, etc.
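To make the layout concrete, here is a minimal HLSL sketch (illustrative names and render target formats, not the exact production layout) of a 64 bit G-buffer pixel shader output: one 32 bit UV target plus one 32 bit tangent target. With R16G16_UNORM and R10G10B10A2_UNORM targets the hardware does the actual bit packing.

Code:
struct GBuffer64
{
    float2 virtualUV   : SV_Target0; // R16G16_UNORM: UV inside the 8k^2 indirected cache space
    float4 tangentQuat : SV_Target1; // R10G10B10A2_UNORM: tangent frame quaternion (10-10-10-2)
};

GBuffer64 GBufferPS(float2 uv : TEXCOORD0, float4 tangent : TEXCOORD1)
{
    GBuffer64 o;
    o.virtualUV = uv;                                 // UV already remapped into the page cache space
    o.tangentQuat.xyz = tangent.xyz * 0.5 + 0.5;      // signed quaternion xyz -> unorm
    o.tangentQuat.w   = tangent.w >= 0.0 ? 1.0 : 0.0; // e.g. sign/handedness in the 2 bit channel
    return o;
}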

The 128 bit layout gives you some extra space to store data that can be linearly interpolated (assuming the MSAA trick is used). The most common thing to include would be per pixel motion vectors (for motion blur and temporal reprojection). Per triangle constant data (such as the instance id) can also be stored (as the MSAA trick reconstructs all per triangle constant data perfectly).

The MSAA trick reduces the color bandwidth on average by 2.7x. The 128 bit layout roughly matches a traditional G-buffer of 48 bits/pixel (+ depth) in bandwidth usage. As everybody seems to be using temporal reprojection, having a cheap way to store per pixel motion vectors is an additional advantage of the MSAA trick.


Q: How can everything be rendered with a single pixel shader?
The pixel shader receives the interpolated UV coordinate and the interpolated tangent frame (quaternion) from the vertex shader (*). Tangent is encoded (10-10-10-2) and virtual texture indirection is performed for the UV. There are not that many different ways to perform these two operations. A single pixel shader is enough for this task.

Material/lighting permutations can be handled with exactly the same methods used by other deferred rendering pipelines. With tiled deferred rendering, this means either tile classification or an uber-shader. Clustered deferred shading (pixel binning) can also be used if a high number of material/lighting permutations is needed.

(*) To ensure that the VS->PS quaternion interpolation works properly, use the mesh preprocessing trick described by Peter Sikachev in GPU-Pro 5: http://gpupro.blogspot.fi/2014/01/our-gpu-pro-5-chapter-describes.html


Q: Can deferred texturing support complex materials (mask textures to blend multiple material textures together)?

The virtual texture page cache provides storage for texture space transformations. Operations such as the one described above can be executed in a separate (artist authored/configured) shader that writes to the virtual texture page cache. The advantage of this method is that the complex blending operation only needs to be executed when a page becomes visible. On average, pages are visible for more than 2 seconds. At 60 fps, this means that the cost of a material blend operation is amortized on average over 100 frames. Decal rendering works similarly and has similar gains.

The main limitation of this technique is that it requires unique UV mapping for the object. We have unique UV mapping for most (90%+) objects and for the terrain, but unique UVs can be problematic for large structures with lots of repeated (tiled) texture surfaces.


Q: Does deferred texturing support light maps?
Light mapping needs two sets of UVs. Deferred texturing could be extended to support multiple UVs, but that would make it less efficient. I would not recommend using deferred texturing with light maps.

Modern light mapping techniques such as SG (spherical gaussian) look awesome. This technique gave The Order: 1886 a very nice look. The Ready At Dawn guys gave a wonderful presentation about it at SIGGRAPH: https://readyatdawn.sharefile.com/share#/download/s9979ff4b57c4543b

However, light mapping is a static technique. Geometry and lights (*) cannot move. This is a perfect trade-off for many games, since most environments are inherently static. However, static lighting (and the big offline generated data set per level) is not suitable for games that are built around dynamic content (lots of physics and/or structural destruction) and/or user generated content (levels loaded from a server).

Many next-gen AAA games have achieved convincing results without using light maps. Dynamic direct lights (with high quality soft shadow maps) + light probes + high quality screen space techniques (to mask out probe reflections -> much better object grounding) have proven to be good enough.

Tomasz Stachowiak's presentation at SIGGRAPH 2015 was a perfect example of the graphics fidelity achievable by combining high quality screen space techniques with light probes:
Presentation: http://advances.realtimerendering.com/s2015/Stochastic Screen-Space Reflections.pptx
Video: http://advances.realtimerendering.com/s2015/Stochastic Screen-Space Reflections.mp4

GPU-driven rendering can be used to speed up light probe rendering. You can render thousands of probes during level load time and do partial refreshes if the level geometry changes during gameplay. This makes it possible to combine good quality lighting with fully dynamic (& destructible) environments.

(*) Lights that are not baked to light maps can of course move. You don't get any GI from these lights (unless you combine light maps with probes). You could also bake the light transport, but that takes lots of extra space (meaning that lower light map resolution is needed). Dynamic geometry is not supported by light transport methods.


Q: Is per pixel discard (clip) possible with the MSAA trick?
Use ddx/ddy to calculate the four "virtual pixel center" UVs. Take four samples of the BC4 alpha mask texture (stored in the virtual texture). Output the SV_Coverage mask based on the results. This sounds expensive, but it is actually faster than traditional discard at full resolution, as you have a much smaller number of pixel shader invocations.
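A minimal HLSL sketch of that idea follows (resource names and the alpha threshold are illustrative; the bit-to-sample mapping depends on the custom sample pattern in use):

Code:
Texture2D<float> g_AlphaMask   : register(t0); // BC4 alpha mask stored in the virtual texture cache
SamplerState     g_LinearClamp : register(s0);

void AlphaCoveragePS(float2 uv : TEXCOORD0, out uint coverage : SV_Coverage)
{
    // Gradients span one macropixel (= two final resolution pixels), so quarter-gradient
    // offsets land on the four "virtual pixel center" UVs inside the macropixel.
    float2 dx = ddx(uv) * 0.25;
    float2 dy = ddy(uv) * 0.25;
    const float alphaRef = 0.5; // illustrative alpha test threshold

    coverage  = (g_AlphaMask.Sample(g_LinearClamp, uv - dx - dy) >= alphaRef) ? 0x1 : 0;
    coverage |= (g_AlphaMask.Sample(g_LinearClamp, uv + dx - dy) >= alphaRef) ? 0x2 : 0;
    coverage |= (g_AlphaMask.Sample(g_LinearClamp, uv - dx + dy) >= alphaRef) ? 0x4 : 0;
    coverage |= (g_AlphaMask.Sample(g_LinearClamp, uv + dx + dy) >= alphaRef) ? 0x8 : 0;
}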
 
Geometry clustering history and MDI
In my presenter notes there was a quick reference to Avalanche Studios' merge-instancing technique (page 19->): http://www.humus.name/Articles/Persson_GraphicsGemsForGames.pptx

Merge-instancing was a big inspiration for us 3 years ago. Our fixed size (64 vertex strip) clustering is similar to it. However, merge-instancing emulates the index buffering inside the vertex shader. This unfortunately means that merge-instancing always needs to execute 3 vertex shader invocations per triangle. Our strip based method only needs to execute one vertex shader invocation per triangle in the best case (a 64 vertex cluster that is a perfect strip). In reality, however, you need to insert degenerate triangles, causing some extra vertex shader work. On moderately high poly meshes we achieve around 1.5 transformed vertices per triangle. This is 2x better than merge-instancing, but still behind the best cache optimized index buffered methods.
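For reference, a minimal HLSL sketch of a single draw call cluster vertex shader (assumed buffer names and layout; the real shader fetches more attributes and branches on the animation type): the cluster index and the vertex index inside the cluster are derived from SV_VertexID, and the per cluster record produced by the GPU culling pass selects the vertex data window and the object transform.

Code:
struct ClusterInstance
{
    uint vertexStart;    // first vertex of this cluster in the global SoA vertex data
    uint transformIndex; // object transform used by this cluster
};

StructuredBuffer<ClusterInstance> g_VisibleClusters : register(t0); // output of the culling pass
StructuredBuffer<float3>          g_Positions       : register(t1); // SoA vertex position stream
StructuredBuffer<float3x4>        g_Transforms      : register(t2);

cbuffer ViewConstants : register(b0)
{
    float4x4 g_ViewProj;
};

float4 ClusterVS(uint vertexId : SV_VertexID) : SV_Position
{
    uint cluster = vertexId >> 6;  // 64 vertices per cluster
    uint vertex  = vertexId & 63;
    ClusterInstance ci = g_VisibleClusters[cluster];

    float3 localPos = g_Positions[ci.vertexStart + vertex];
    float3 worldPos = mul(g_Transforms[ci.transformIndex], float4(localPos, 1.0));
    return mul(g_ViewProj, float4(worldPos, 1.0));
}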

Multi-draw-indirect (and ExecuteIndirect in DX12) solves this issue, as each draw instance can have its own index start offset (meaning that each instance can have unique topology). However, MDI stresses the GPU command processor much more than the vertex shader based custom technique. Also, GPUs cannot pack multiple actual draws into a single vertex shader wave/warp. We have measured that MDI starts to lose efficiency (on current GPUs) when the cluster size becomes smaller than 256.

Conclusion: MDI gets you higher performance for high polygon objects, while the custom vertex shader clustering is better (on current GPUs) for small low poly objects (such as background objects).


Q: Is coarse culling needed in GPU-driven rendering?
We have tested scenes with up to 2 million objects without any coarse culling. This should be enough for "level based" games. However, if you are building a free roaming game in a huge open game world, you might want to perform coarse culling either on the CPU or the GPU. If your streaming system only keeps the objects around the camera in memory, you already have a solution (the GPU culls only the objects loaded into memory).

Important advice: Keep the object culling data small. You don't want to access the float3 position or the (quaternion) rotation of the object. Pack the object bounding sphere into a single 64 bit integer, for example using 18+18+18+10 bits for position.xyz + radius. This gives 6 centimeter precision for the culling bounds position in a 256 square kilometer (16 km x 16 km) game world. Reducing the bandwidth cost is the highest priority, since the culling shader will be memory bandwidth bound. The culling itself is just a few dot products per visibility test (one per plane). Load all viewport planes into LDS, and cull objects against all viewports at once. This way you only need to touch the full (2 million) object list once. We do sub-object (cluster) culling per viewport (in a separate compute shader).
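A minimal HLSL sketch of the unpack and frustum test (illustrative names and world scales; a single viewport is tested here for brevity, whereas the shader described above keeps all viewport planes in LDS and culls against every viewport in one pass):

Code:
StructuredBuffer<uint2>  g_PackedBounds  : register(t0); // 64 bits per object: 18+18+18+10
StructuredBuffer<float4> g_FrustumPlanes : register(t1); // xyz = plane normal, w = distance
RWStructuredBuffer<uint> g_VisibleList   : register(u0); // [0] = count, [1..] = object indices

cbuffer CullConstants : register(b0)
{
    uint g_ObjectCount;
};

static const float POS_SCALE    = 16384.0 / 262143.0; // ~16 km world, 2^18 steps -> ~6 cm precision
static const float RADIUS_SCALE = 64.0 / 1023.0;       // illustrative maximum radius of 64 m
static const uint  PLANE_COUNT  = 6;

float4 UnpackSphere(uint2 p)
{
    uint x = p.x & 0x3FFFF;                      // 18 bits
    uint y = (p.x >> 18) | ((p.y & 0xF) << 14);  // 18 bits split across the two words
    uint z = (p.y >> 4) & 0x3FFFF;               // 18 bits
    uint r = p.y >> 22;                          // 10 bits
    return float4(float3(x, y, z) * POS_SCALE, r * RADIUS_SCALE);
}

[numthreads(64, 1, 1)]
void CullObjectsCS(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= g_ObjectCount)
        return;

    float4 sphere = UnpackSphere(g_PackedBounds[dtid.x]);
    bool visible = true;
    [unroll]
    for (uint i = 0; i < PLANE_COUNT; ++i)
    {
        float4 plane = g_FrustumPlanes[i];
        visible = visible && (dot(plane.xyz, sphere.xyz) + plane.w > -sphere.w);
    }
    if (visible)
    {
        uint slot;
        InterlockedAdd(g_VisibleList[0], 1, slot);
        g_VisibleList[slot + 1] = dtid.x;
    }
}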
 
Two-phase occlusion culling tidbits
We don't reproject the last frame occlusion culling depth pyramid (as AC: Unity does). This has not been a problem for us, because our engine is designed solely for 60 fps locked rendering (the frame-to-frame difference is halved compared to 30 fps). Also, we use the last frame depth pyramid solely as an occlusion hint. Occlusion tests that would sample outside the screen are clamped to the screen border (using the clamp addressing mode). This gives surprisingly good culling results, way better than simply treating everything at the screen edges as visible. At 60 fps the clamped areas at the screen edges are small, making the clamped estimate surprisingly accurate.

Last frame pyramid occlusion tests need to be biased by the camera view_vector.z delta (how far the camera moved forward between the frames). At 60 fps, this is a small bias. We don't use any other biases or conservative estimates (such as bloated object bounds). This makes the culling super precise compared to traditional methods.

Tip: If your GPU supports min/max filtering (a DirectX 11.2 optional feature), you should use it for the occlusion pyramid sampling. A single sample is enough. This removes 3 min instructions from the shader.
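A minimal HLSL sketch of a single test against the last frame pyramid (assumed names; a reversed depth buffer with 1 = near is assumed here, which is why the reduction is a min — flip min/max and the comparison for standard depth):

Code:
Texture2D<float> g_LastFrameDepthPyramid : register(t0);
SamplerState     g_MinClampSampler       : register(s0); // clamp addressing; MINIMUM reduction filter if available

static const float2 PYRAMID_SIZE = float2(512.0, 256.0); // mip 0 size of the pyramid (illustrative)

bool IsPotentiallyVisible(float2 uvMin, float2 uvMax, float nearestDepth, float depthBias)
{
    // Choose the mip where the screen space bounds fit into a ~2x2 texel footprint.
    float2 texels = (uvMax - uvMin) * PYRAMID_SIZE;
    float  mip    = ceil(log2(max(max(texels.x, texels.y), 1.0)));

#if defined(HAS_MINMAX_FILTERING)
    // A MINIMUM reduction sampler returns the min of the bilinear footprint: one sample is enough.
    float farthestOccluder = g_LastFrameDepthPyramid.SampleLevel(g_MinClampSampler, (uvMin + uvMax) * 0.5, mip);
#else
    // Otherwise sample the four corners and reduce manually (the 3 extra min instructions).
    float4 d;
    d.x = g_LastFrameDepthPyramid.SampleLevel(g_MinClampSampler, uvMin, mip);
    d.y = g_LastFrameDepthPyramid.SampleLevel(g_MinClampSampler, float2(uvMax.x, uvMin.y), mip);
    d.z = g_LastFrameDepthPyramid.SampleLevel(g_MinClampSampler, float2(uvMin.x, uvMax.y), mip);
    d.w = g_LastFrameDepthPyramid.SampleLevel(g_MinClampSampler, uvMax, mip);
    float farthestOccluder = min(min(d.x, d.y), min(d.z, d.w));
#endif
    // depthBias accounts for the camera's forward (view space z) movement since the last frame.
    return nearestDepth + depthBias >= farthestOccluder;
}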


Virtual texturing tidbits
We have already used virtual texturing in two shipped games (Trials Evolution and Trials Fusion). These games do not use deferred texturing or GPU-driven rendering.

We use software indirection (similar to id Software).

Software indirection was chosen because:
- We can't change tile mapping on GPU side. UpdateTileMappingsIndirect doesn't exist in DirectX.
- Software indirection reduces the UV address space from 256k^2 --> 8k^2. This makes the UV coordinate smaller in the G-buffer.
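For readers unfamiliar with the technique, here is a minimal HLSL sketch of a software indirection lookup (assumed encoding; the shipping engine's exact layout is likely different). The indirection texture has one texel per virtual page, and each texel holds a scale and bias so that physicalUV = virtualUV * scale + bias lands inside the correct 128x128 page of the 8k x 8k cache.

Code:
Texture2D<float4> g_Indirection : register(t0); // one texel per virtual page
SamplerState      g_PointClamp  : register(s0);

float2 VirtualToCacheUV(float2 virtualUV)
{
    // entry.x  = scale (accounts for which mip of the page is actually resident)
    // entry.yz = bias  (physical page origin in normalized cache UV space)
    float4 entry = g_Indirection.SampleLevel(g_PointClamp, virtualUV, 0);
    return virtualUV * entry.x + entry.yz;
}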

In our virtual texture cache (in memory) we use outer borders (for filtering *). This means that the 128x128 pixel area of each page is pixel perfect (no stretching). This uses slightly more memory and some more hard drive space, but provides pixel perfect quality. Now that we have a physically based rendering pipeline, the quality of the textures is highly important. I would not recommend inner borders (stretch borders inside each 128x128 page) to anyone, as inner borders blur the texture data surprisingly much.

(*) Page outer border size is big enough for max anisotropy. We also have some additional page borders for parallax mapping, since we don't perform another indirection after the parallax offset.


Q: Doesn't the dynamic branching in the vertex shader result in high GPR usage?

Yes. However this degrades the performance much less than I anticipated. The advantage of being able to render everything in a single draw call (with almost perfect depth ordering) is larger than the (~2%) extra cost caused by the vertex shader.

In the future (with new GPUs and new APIs):
I would like to be able to spawn vertex waves directly from the culling compute shader. The culling compute shader would select the vertex shader (wave start instruction pointer) based on the animation type. The GPU would handle parameter passing from the CS -> VS using on-chip buffers and would immediately start executing the vertex work (while simultaneously continuing to run the culling compute shader). This would make it possible to select the vertex shader dynamically (without bloating the GPR usage). It would also reduce the total frame rendering latency (and remove the GPU stall between culling and rendering), and it would remove the need to write the culling data out to memory (saving bandwidth).


Q: How are the GPU scene data structures updated?
Most game developers use entity/component data models. Our engine has an optimized tag component (boolean) system that uses one bit of storage per entity. TransformUpdate is the most commonly used tag component for updating GPU buffers.

Animation and physics systems set the TransformUpdate tag component for entities that have moved. The GPU update task runs just before rendering starts. It performs a single compute dispatch that updates all the required GPU-side (SoA) arrays. Usually only the object transform changes, meaning that only the transform buffer is updated. The SoA layout is important for this reason (and because it actually speeds up the vertex fetch compared to AoS style vertex buffers on GCN).
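A minimal HLSL sketch of such an update dispatch (assumed data layout and names): the CPU gathers the entities whose TransformUpdate tag is set into a compact upload buffer, and one compute dispatch scatters the new transforms into the persistent SoA array.

Code:
struct TransformUpdate
{
    uint     objectIndex;  // destination slot in the SoA scene arrays
    float3x4 localToWorld; // new object transform
};

StructuredBuffer<TransformUpdate> g_Updates          : register(t0); // compact CPU-filled upload buffer
RWStructuredBuffer<float3x4>      g_ObjectTransforms : register(u0); // persistent SoA transform array

cbuffer UpdateConstants : register(b0)
{
    uint g_UpdateCount;
};

[numthreads(64, 1, 1)]
void UpdateTransformsCS(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= g_UpdateCount)
        return;
    TransformUpdate u = g_Updates[dtid.x];
    g_ObjectTransforms[u.objectIndex] = u.localToWorld; // scatter into the SoA array
}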

We have been able to simulate more than 20k physics objects and update them in real time (60 fps). The GPU scene data structure update is not a performance bottleneck; the (CPU based) physics engine becomes a bottleneck much sooner.
 
If anyone has more questions, please ask me :)
*waves hand* :)
I assume this estimate applies to consoles, when gaming with a controller, yes?

Because with a mouse, you can suddenly snap the view to an entirely different angle, causing huge changes in the screen layout. Or is this not an issue (like, your engine is console only for example), or do you have a different system to cope with such usage scenarios?
First person shooter camera handling on PC (fast mouse movement) requires quite a bit of extra logic for game/engine systems that are optimized based on the camera orientation: mesh/texture streaming, culling, animation systems, particle systems, AI systems (simpler AI when NPCs are not visible). Fortunately we haven't been doing first person games (for PC), so this hasn't been a problem for us.

The two-phase occlusion culling algorithm handles camera jumps and sudden turns with zero image corruption (unlike many other GPU-based culling methods that rely solely on last frame data). A frame that is highly different from the previous one will be slower to render, since the visibility prediction will be further from the truth. People can't see minor stalls during camera jumps, as the jump is sudden (not a smooth movement). Empirical results (from our past 60 fps games) show that checkpoint restart times of up to 200 milliseconds (~12 frames) feel instant. With the new system, I wouldn't expect to see more than a single frame (16.6 ms) stall even on a big camera jump. Our new virtual texturing system also has lots of improvements to reduce texture popping in scenarios like this.
 
Fetching instance and material data in the lighting shader
This section was also very briefly discussed in my slides. If you are not familiar with software virtual texturing (software indirection + page cache atlas) the information from the slides (and presenter notes) is hard to digest.

The page cache is an 8k * 8k texture. It contains N texture array slices to store all the necessary material properties (as four channels are not enough). All the material texture slices are fetched with a single UV. The page cache is split into 128x128 tiles (a regular grid). Each tile stores the texels of a single virtual texture page.

As the cache atlas is a grid of pages, it is easy to calculate the grid coordinates by dividing the UV.xy by 128. We have a 64x64 (8192/128 = 64) grid of material properties (*) that we point sample in the lighting shader. This lookup is highly cache friendly, as up to 128x128 (=16384) rendered pixels sample the same value (the lookup almost always hits the L1 texture cache).
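In HLSL this lookup is roughly the following (assumed names; the cache UV is assumed to be normalized over the 8k x 8k atlas):

Code:
Texture2D<float4> g_PageProperties : register(t0); // 64x64 grid, one texel per cache page

float4 FetchPageProperties(float2 cacheUV) // cacheUV in [0,1] over the 8k x 8k cache
{
    // 8192 / 128 = 64 pages per side; the integer page coordinate indexes the property grid.
    uint2 pageCoord = uint2(cacheUV * 64.0);
    return g_PageProperties.Load(int3(pageCoord, 0));
}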

The virtual texture page property grid solves the case of per material constants. Per object instance constant data is trickier to fetch if multiple instances can share a material. The simplest way to avoid this issue is to ensure that every object instance has a unique virtual UV rectangle in the virtual address space. This is how objects with unique decals are handled anyway, so it doesn't require much extra effort. If you want unique decals everywhere, this should be your choice. The downside of this method is that it wastes virtual texture cache space in scenes that have lots of objects that use identical materials and could otherwise share texture pages with each other.

If you don't need unique decals on most objects and want to conserve the VT cache space, you can store the instance id to the G-buffer. With the 128 bit layout this is trivial. The MSAA trick replicates per triangle constants in a lossless manner (perfect reconstruction).

(*) This grid also includes the per page mip level, page id, renormalization factors (to improve DXT compression quality), shading model type and other data.
 
Do you support dynamic decals driven by in-game events (eg. scorch marks, bullet holes, etc)?
I will answer this question tomorrow. It requires some extra details about our alpha blending pipeline.
Very nice talk! A few questions and comments, mostly about the “MSAA trick”:

- As you mention, the interpolation during the 1080p reconstruction pass is not perspective-correct. However, it is rather easy to compute a perspective correct interpolation in the shader, assuming that your G-Buffer stores also the depth. Did you try to do a perspective-correct interpolation, or the quality was already good enough that did not justify any extra effort/cost?

- Interestingly, the fact that only N pixels are perspective correct and linear interpolation is used in-between, reminds me the software rasterizer of Quake. Although in your case N=2, while Quake used N=8 if I recall correctly.
MSAA hardware provides pixel perfect depth (no need to reconstruct it, since the depth unit runs at subsample precision and stores subsample results). Yes, we could reconstruct perspective-correct results, but since we are interpolating just one pixel between two perspective correct pixels, the error is negligible. As you said, this is similar to some old software rasterizers, but we have significantly higher quality.
- It is worth noting that the “MSAA trick” can also be used to shade more “coarsely” on high-pixel-density (Hi-PPI) displays, where it is impractical (and rather unnecessary) to shade every display pixel, but you still want to preserve the clarity of geometric edges. I gave a quick talk about this technique at this year’s High Performance Graphics.
Anyone interested can find more info at the following page:
http://www.pmavridis.com/research/coarse_shading/

- Aside from Hi-PPI displays, coarse shading can also be beneficial when rendering distant or out-of-focus or motion blurred objects. Do you have any plans towards this direction or you are already doing it?
I will definitely read your research. This topic interests me as well, since our studio also ships games on mobile devices with very high DPI screens (but relatively low GPU performance). I have been toying around with an idea of decoupled shading with the MSAA trick and virtual texturing (with texture space shading). I could start another thread about that idea, once I have time to write a (very) long forum post :)
 
8xMSAA trick details
One of my benchmark slides mentioned an 8xMSAA trick, with four 2xMSAA pixels packed inside a single 8xMSAA pixel. This requires further clarification.

8xMSAA trick sampling pattern image:


(Thanks to HTupolev for drawing this image.)
Green is the 540p macropixel center, purple dots are the 1080p final pixel centers and red dots are the MSAA sample centers.

In this mode, we use a custom MSAA sampling pattern (*) to replicate the 2xMSAA pattern in all four quadrants of an 8xMSAA pixel. This gives each reconstructed pixel two actual subsamples, providing quality equivalent to 2xMSAA.

(*) Custom MSAA sampling patterns are supported by modern GPUs, but the API support is lacking on PC. At SIGGRAPH 2015, Khronos announced some new ARB standard extensions to OpenGL 4.5. These extensions included custom MSAA patterns (http://www.anandtech.com/show/9506/opengl-siggraph-2015-opengl-es-32-opengl-extensions-announced). I hope Microsoft introduces this feature to DirectX 12 soon.


Anti-aliasing with 8xMSAA trick

8xMSAA trick provides two actual subsamples per pixel. You would want to combine this data with temporal reprojection and with additional (EQAA/CSAA) coverage samples.

At SIGGRAPH 2015, MJP discussed the solution they used in The Order: 1886. Their AA solution also includes two real MSAA samples. His blog has lots of details about their implementation, including a source code (GitHub) link. I recommend implementing their antialiasing solution on top of the 8xMSAA trick.

MJP's blog post: http://mynameismjp.wordpress.com/2012/10/28/msaa-resolve-filters/
Ready At Dawn presentation: https://readyatdawn.sharefile.com/share#/download/s9d45411f2bf4958a



Deferred alpha blending

As everyone knows, alpha blending is a tricky topic for deferred renderers. The most generic solution is to use forward rendering for transparencies. This is what you should use if you need full flexibility (an unlimited number of alpha layers). However, the performance drop from even a single big transparent object near the camera is quite severe, assuming you use a full blown lighting/material model for the transparencies (including indirect lighting + sampling several shadow maps).

An alternative way to tackle this issue was presented by Alex Evans. Little Big Planet (1) stored transparencies as MSAA subsamples (page 21-->):
http://advances.realtimerendering.com/s2011/Evans, Kirczenow - Voxels in LBP2 (Siggraph 2011 Advances in Real-Time Rendering Course).pdf

Hardware support for this feature has existed for ages, and it is commonly known as alpha-to-coverage (a2c for short). This technique is quite limited, giving only 0%, 50% and 100% transparency levels (for 2xMSAA). This is not enough for modern games.

8xMSAA trick + deferred texturing provides us two UVs per pixel (one per sample). We use the alpha surface UV to sample the alpha surface transparency texture (which is stored in one of the virtual texture layers). We alpha blend the transparent and the opaque surface together according to the per pixel alpha fetched from the texture. This gives us 256 transparency levels (instead of just two).

Alpha blended objects go through the same pipeline as everything else, making life much easier. MSAA hardware generates sub-pixel precise depth, allowing post effect composition such as depth of field blur, motion blur and particles to work properly with alpha layers (and anti-aliased edges). Transparency rendering doesn't need back to front sorting, since the GPU depth buffering hardware works at sample frequency. The nearest alpha layer survives.

For pixels that are 100% opaque (such as foliage and tree leaf center pixels), the SV_Coverage mask is fully set, and for pixels that are 0% opaque (overdraw edges of leaf cards for example), the SV_Coverage mask is all zeros. The lighting compute shader tests whether both MSAA samples of a pixel are identical, and only applies the slow (double shaded) path to the pixels that need it (*). This is very nice for vegetation rendering, since only the alpha edge pixels (prefiltered antialiasing from the vegetation textures) need double shading. The G-buffer filling at 2x2 lower resolution with the MSAA trick is super fast, meaning that trees and foliage are very fast to render.
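A minimal HLSL sketch of that per pixel classification (assumed G-buffer format; for simplicity the reconstructed data is treated here as a plain 2 sample MSAA surface, whereas in the real pipeline the two samples live in quadrants of the 540p 8xMSAA surface):

Code:
Texture2DMS<uint2, 2> g_GBuffer       : register(t0); // packed UV + tangent, 2 samples per final pixel
RWTexture2D<uint>     g_NeedsSlowPath : register(u0);

[numthreads(8, 8, 1)]
void ClassifyPixelsCS(uint3 dtid : SV_DispatchThreadID)
{
    uint2 s0 = g_GBuffer.Load(dtid.xy, 0);
    uint2 s1 = g_GBuffer.Load(dtid.xy, 1);
    // Identical samples -> one unique surface -> shade once. Different samples -> shade both,
    // then blend the nearest (alpha) layer over the opaque layer using the per pixel alpha.
    g_NeedsSlowPath[dtid.xy] = all(s0 == s1) ? 0u : 1u;
}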

The downside of this technique is still severe: it can only support one alpha layer per pixel, just like the original LittleBigPlanet technique. This is enough for vegetation and other "alpha antialiased" geometry, but it is not good enough for scenes with lots of windows and other big transparencies. A 16xMSAA trick (with 4xMSAA per quadrant) could be used to increase the count to 3 layers per pixel. This should be enough for most games.

You can also use EQAA / CSAA tricks to support more alpha layers per pixel. Unfortunately, coverage samples are not exposed to the programmer in the PC DirectX/OpenGL APIs. It would be awesome to get an SV_Coverage mask that works with coverage samples and a Texture2DMS.Load instruction that supports coverage samples (with an "unknown" return value). Modern NVIDIA and AMD GPUs should support these features.

(*) I recommend using a work distributing trick in the tiled lighting shader to make this fast. See this SIGGRAPH 2010 presentation by Andrew Lauritzen (page 24-->) for more info:
http://bps10.idav.ucdavis.edu/talks/12-lauritzen_DeferredShading_BPS_SIGGRAPH2010.pdf


Dynamic decals
Do you support dynamic decals driven by in-game events (eg. scorch marks, bullet holes, etc)?
Terrain has guaranteed unique mapping in the virtual texture, making decaling easy. Trials Evolution and Trials Fusion both supported dynamic decals for the terrain.

See my old Digital Foundry technical interviews for more info about Trials Evolution and Fusion technology (including virtual texturing):
http://www.eurogamer.net/articles/digitalfoundry-trials-evolution-tech-interview
http://www.eurogamer.net/articles/digitalfoundry-2014-trials-fusion-tech-interview

Object decals can be rendered using a modified deferred decal algorithm with the aforementioned alpha blending trick. Because of the tight MSAA G-buffer, deferred decaling is actually very fast. It is, however, a little bit more limited compared to the traditional fat G-buffer based methods. The biggest limitation of course is that dynamic object decals count towards the transparency layer limit.

For uniquely mapped objects, you can of course also use virtual texture based decals, as those are significantly faster and don't stress the transparency layer limit. Trials Evolution and Trials Fusion also supported this, but it wasn't widely used in those games.
 
1.Since you can't access the HTILE buffer directly on PC, I'm assuming you recreate the data from the depth buffer. How much does this affect performance on PC? or is it a non-issue?
2.For the actual occlusion culling do you use a screen space bounding box for clusters?
3.What do you do for lighting? Have you considered clustered forward+? It seems to be a good fit for a GPU driven setup with async compute. If not any specific reason/s why?
1. This costs the same as any depth buffer downsampling shader. In general, I recommend using a compute shader for any recursive downsampling task and outputting multiple levels at once (downsample recursively through LDS); see the sketch after this list. Otherwise you have a data dependency between the dispatches, and you definitely don't want to stall the GPU between each mip level.
2. Yes. Screen space bounds rounded up to the next mip (= next power of two). This info was cut from the slides at the last moment (to prevent information overload). More precise tests would of course be possible (at a higher ALU & GPR cost).
3. Deferred texturing is "more deferred" than standard deferred rendering. It further reduces the overdraw cost (making the frame rendering time fluctuate less). It does not combine with forward rendering techniques (forward+, clustered forward+, etc). The MSAA trick is not compatible with forward rendering techniques either, since it doesn't even run the pixel shader for every pixel. The whole point of these techniques is to fully separate visibility sampling from the lighting and material processing.
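Regarding answer 1, here is a minimal HLSL sketch of such a multi-level downsample (assumed resource names): one 8x8 thread group produces three further mip levels of its tile in a single dispatch, keeping intermediates in LDS so there is no inter-dispatch dependency. Min = farthest occluder, assuming a reversed depth buffer.

Code:
Texture2D<float>   g_DepthSrc  : register(t0);
RWTexture2D<float> g_DepthMip1 : register(u0);
RWTexture2D<float> g_DepthMip2 : register(u1);
RWTexture2D<float> g_DepthMip3 : register(u2);

groupshared float lds[8][8];

[numthreads(8, 8, 1)]
void DownsampleDepthCS(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    // Mip 1: reduce a 2x2 quad of source depth (min = farthest with reversed depth).
    uint2 src = dtid.xy * 2;
    float d = min(min(g_DepthSrc[src], g_DepthSrc[src + uint2(1, 0)]),
                  min(g_DepthSrc[src + uint2(0, 1)], g_DepthSrc[src + uint2(1, 1)]));
    g_DepthMip1[dtid.xy] = d;
    lds[gtid.y][gtid.x] = d;
    GroupMemoryBarrierWithGroupSync();

    // Mip 2: reduce again from LDS (one thread per 2x2 quad of mip 1 texels).
    if ((gtid.x & 1) == 0 && (gtid.y & 1) == 0)
    {
        d = min(min(lds[gtid.y][gtid.x], lds[gtid.y][gtid.x + 1]),
                min(lds[gtid.y + 1][gtid.x], lds[gtid.y + 1][gtid.x + 1]));
        g_DepthMip2[dtid.xy >> 1] = d;
        lds[gtid.y][gtid.x] = d;
    }
    GroupMemoryBarrierWithGroupSync();

    // Mip 3: final reduction of this group's tile.
    if ((gtid.x & 3) == 0 && (gtid.y & 3) == 0)
    {
        d = min(min(lds[gtid.y][gtid.x], lds[gtid.y][gtid.x + 2]),
                min(lds[gtid.y + 2][gtid.x], lds[gtid.y + 2][gtid.x + 2]));
        g_DepthMip3[dtid.xy >> 2] = d;
    }
}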
So considering you render foliage using the 2 MSAA transparency layer trick, what happens when the edge of a "soft" alpha tested surface overlaps another one? Don't you get holes in those areas?
The closer edge pixel remains (depth buffering works at subsample precision); the one farther away is clipped. Blending works perfectly for the remaining edge pixel. Even in the worst case, this is still better than standard alpha-to-coverage.
 
Oh I see, so I guess non-edge pixels of alpha-mapped surfaces are rendered to the opaque sub-pixels in standard alpha-test fashion. Clever! I was initially thinking those surfaces were rendered entirely on the "alpha layer" only, so those overlaps would lead to terrible artifacts. But the way you do it, the artifacts are no worse than what you usually get with alpha testing, and only on the back surfaces of edge overlaps = pretty hard to notice.
With a secondary alpha buffer, you could do something similar using analytic AA for opaque surfaces. Very low bit depths would be good enough, I imagine.
 
Yeah, sorry for the stupid question... basically what happened was I wanted your opinion on clustered forward+ in general and in relation to GPU driven rendering, and in the effort to shoehorn the question to look like it had to do completely with the slides it wound up wrong (I completely lost sight of deferred texturing). So if you don't mind a slightly off topic question, what do you think of clustered forward+? and how would it fit with a GPU driven arch? (w/ async compute)
GPU-driven rendering works fine without deferred rendering / deferred texturing. I don't see any problems using (clustered) forward(+) rendering.

I would not personally choose the "+" techniques, as those need a double geometry pass. The geometry pass is the most fluctuating part of the frame cost (in addition to transparency rendering). If you are aiming at a locked 60 fps, you need to choose techniques whose cost is as close to constant as possible.

One way to prevent the double geometry pass is the Avalanche Studios practical clustered shading (coarse 3d screen space light culling):
Original PDF: http://www.humus.name/Articles/PracticalClusteredShading.pdf
New stuff (SIGGRAPH 2015) including Avalanche tech: https://newq.net/publications/more/s2015-many-lights-course

Their 3d screen space light culling technique is fully compatible with GPU-driven rendering. If you also use cluster (sub-object) sorting, you get near perfect depth ordering, meaning that overdraw (the biggest limitation of forward rendering) becomes much less of an issue. However, this means that you have to either use an "uber shader" (pixel shader) or do one draw call per shader type (making it practically impossible to have perfect depth ordering). Quad efficiency can also become a performance issue with high poly content (as with any forward rendered technique).

Avalanche Studios' practical clustered shading + GPU-driven rendering would be an interesting combination for tiling mobile GPUs (such as PowerVR). Tiled GPU rendering removes the overdraw issue of forward rendering, meaning that you could drop the cluster sorting and just push a single draw call per shader type (likely just dozens of draws in total). You could also use hardware MSAA (for actual antialiasing) efficiently on tiling GPUs. Practical clustered shading is fully compatible with MSAA. Unfortunately, OpenGL ES 3.1 support is still quite limited on the Android side (you need it for compute shaders and indirect draws). Fortunately, Metal has broad support on iOS.
 
GPU sorting and DirectX
Unfortunately (PC) DirectX doesn't support cross lane operations. All fast CUDA and OpenCL 2.0 (radix) sorting algorithms use cross lane operations. In (PC) DirectX (DirectCompute) you need to use LDS to communicate between threads, and this is significantly slower (more latency, bank conflicts, added ALU, etc) and results in worse occupancy.

Unfortunately, DirectX 12 didn't improve the DirectCompute HLSL language. The CPU side improvements of DirectX 12 are awesome. I really like the manual resource management, the pipeline state objects, async compute and ExecuteIndirect, among other things. But we need improvements to HLSL as well, especially on the compute side, since DirectCompute is still far behind CUDA. OpenCL has improved rapidly lately (almost catching up with the CUDA feature set). It really feels like we DirectCompute programmers are being left behind by all the recent GPU compute advances.
 