Lately, several presentations, white papers and blog posts have been written about deferred rendering techniques that do not store any texture data in the g-buffer. These techniques reduce memory bandwidth usage by storing less data in the g-buffer and by sampling the texture data only for visible pixels. Overdraw cost is minimized, resulting in a more stable frame rate.
All of these techniques provide big savings in texture bandwidth. Instead of storing uncompressed data to the g-buffer and reading it back in the lighting step, these techniques sample the BC compressed data directly in the lighting shader (a 2x-4x reduction in texturing BW).
The smallest amount of data you need to store is a triangle id per pixel (32 bits). The triangle id allows you to reconstruct the depth, so you don't need to read the depth buffer in the lighting step either. However, you still need to render the depth buffer, as depth buffering during the rendering step is still a win (no matter how cheap the pixel shader is): hierarchical depth culling saves a considerable amount of pixel shader invocations and bandwidth in complex scenes.
Contenders
1. Intel's "The Visibility Buffer"
http://jcgt.org/published/0002/02/04/
Stores a 32 bit packed triangle + instance id. 64 bit storage is needed when rendering a scene with > 65536 instances AND a max triangle count per mesh > 65536.
Runs the "vertex shader" per pixel -> transformed vertices. Intersects the triangle with the screen ray -> barycentrics. Interpolates the per-pixel values from the transformed vertices and barycentrics.
2. Tomasz Stachowiak's "Deferred Material Rendering System"
https://onedrive.live.com/view.aspx...0!115&app=PowerPoint&authkey=!AP-pDh4IMUug6vs
Similar to Intel's technique, but also stores barycentrics (2x32 bit) and the instance id (32 bit), 128 bits in total. "Unlimited" triangle count. With stored barycentric coordinates, you don't need to fetch the vertex positions from memory (and transform the positions again), and there is no need to calculate the triangle intersection with the screen ray.
3. RedLynx's "Virtual Deferred Texturing"
http://advances.realtimerendering.c...iggraph2015_combined_final_footer_220dpi.pptx (page 40+)
I talked about this technique at SIGGRAPH 2015. Stores a UV (16 + 16 bits) + tangent (32 bit encoded) instead of texture data. The texture data is fetched in the lighting pass directly from the virtual texture cache (an 8k^2 texture atlas containing a grid of 128x128 texture pages; all currently visible surfaces are guaranteed to be in the cache). The MSAA trick further reduces the G-buffer pixel shader wave count by ~3x and the color bandwidth by ~2x, making it comparable to Intel's 32 bit per pixel technique in G-buffer bandwidth.
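To make the per-pixel work shared by techniques 1 and 2 concrete, here is a small illustrative sketch (my own, not code from any of the papers) of computing barycentrics for a pixel from three screen-space vertex positions and doing perspective-correct attribute interpolation. All function names are hypothetical:

```python
def barycentrics_2d(p, a, b, c):
    """Barycentric weights of screen-space point p in triangle (a, b, c)."""
    def edge(u, v, q):  # twice the signed area of triangle (u, v, q)
        return (v[0] - u[0]) * (q[1] - u[1]) - (v[1] - u[1]) * (q[0] - u[0])
    area = edge(a, b, c)
    w_a = edge(b, c, p) / area
    w_b = edge(c, a, p) / area
    return w_a, w_b, 1.0 - w_a - w_b

def interpolate(weights, attrs, inv_w):
    """Perspective-correct interpolation: each attribute is weighted by 1/w."""
    num = sum(wt * at * iw for wt, at, iw in zip(weights, attrs, inv_w))
    den = sum(wt * iw for wt, iw in zip(weights, inv_w))
    return num / den

# A pixel at (1, 1) inside a triangle with screen positions (0,0), (4,0), (0,4):
w = barycentrics_2d((1.0, 1.0), (0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
print(w)  # (0.5, 0.25, 0.25)
```

Technique 1 computes the weights from refetched, retransformed vertices; technique 2 reads them straight from the g-buffer and skips the first function entirely.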
Improvements
First I will propose some improvements to techniques 1 and 2 to further reduce the G-buffer bandwidth cost of these techniques.
Compact triangle & instance id storage
Techniques 1 and 2 need 8 bytes to store the instance id + triangle id in complex scenes (Intel's is 32 bits in simple scenes).
All of these techniques are a perfect fit for GPU-driven pipelines, so I will assume that the geometry is drawn by DirectX12 ExecuteIndirect or Vulkan MultiDrawIndirect. I will also assume that the renderer uses compute shaders to perform viewport and occlusion culling at sub-object granularity.
If the culling is based on constant sized sub-object pieces (clusters), there is a simple way to uniquely number the triangles. If we assume 256 triangle culling granularity, we can pack the triangle id inside the cluster into 8 bits. This leaves 24 bits for the instance id, allowing us to draw 16 million independent pieces of geometry, or 8 instances per pixel (at 1080p).
Instead of using a global instance id, one should instead store an index into the culling output buffer (containing only visible clusters). 16 million visible pieces of geometry (= 4 billion visible rendered triangles) should be enough for everyone (at least for a while).
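As a sketch, the packing described above could look like this (hypothetical helper names; assumes the 256-triangle cluster granularity and the 24 bit visible-cluster index from the text):

```python
def pack_visibility(visible_cluster_index, tri_in_cluster):
    """Pack a 32 bit visibility value: 24 high bits = index into the
    culling output buffer of visible clusters, 8 low bits = triangle id
    inside that cluster."""
    assert 0 <= tri_in_cluster < 256              # 256-triangle clusters
    assert 0 <= visible_cluster_index < (1 << 24) # up to 16M visible clusters
    return (visible_cluster_index << 8) | tri_in_cluster

def unpack_visibility(packed):
    """Split a packed value back into (visible_cluster_index, tri_in_cluster)."""
    return packed >> 8, packed & 0xFF

print(unpack_visibility(pack_visibility(123456, 77)))  # (123456, 77)
```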
The material id (if needed) can be added (SoA layout) to the visibility culling output, so there is no need to store it per pixel either.
Result: Intel's technique is always 32 bits per pixel (no matter the scene complexity). Tomasz's technique is down to 96 bits per pixel.
16+16 bit barycentric coordinates
Tomasz's technique uses 2x 32 bit floating point barycentric coordinates. This is wasteful: a 24 bit normalized integer would give exactly the same quality, since a 32 bit float has a fixed 24 bit mantissa precision in the [0.5, 1.0] range. But I would want to go as low as 16+16 bits if possible.
Let's assume that we need 8x8 subpixel precision for the barycentric coordinates. The 8x subpixel precision takes 3 bits, leaving 13 bits for the integer part = 8192 values. If we assume a 2048 pixel vertical resolution (~1080p), 13 bits is enough for triangles up to 4x larger than the screen.
Let's assume a 90 degree vertical FOV and a 0.25 meter near plane. When a triangle is on the near plane and is 4x larger than the screen, its size is 0.25 meters * 2 * tan(45°) * 4 = 2 meters. This means that triangle edges longer than 2 meters are a problem. However, the problem only occurs close to the near plane, so it only concerns the highest LOD meshes. Triangles are never bigger than the screen when a LOD 1+ mesh is used.
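The bit budget and the near-plane bound above can be sanity checked in a few lines (all constants are taken from the text):

```python
import math

steps = 2 ** 16        # 16 bit barycentric = 65536 quantization steps
subpixel = 8           # target 8x8 subpixel precision = 3 fraction bits
max_span_px = steps // subpixel        # integer part covers 8192 pixels
screen_multiple = max_span_px / 2048   # = 4x a 2048 pixel tall screen

near = 0.25                                            # near plane (meters)
screen_height = 2 * near * math.tan(math.radians(45))  # 0.5 m at 90 deg vFOV
max_edge = screen_height * screen_multiple             # 4x larger triangle

print(max_span_px, screen_multiple, round(max_edge, 6))  # 8192 4.0 2.0
```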
This is not a real problem for most games. You might need to add some extra triangles to your placeholder box / flat floor meshes. Performance hit of these extra triangles will be practically zero.
Result: Tomasz's technique is down to 64 bits per pixel.
"MSAA Trick" for triangle id techniques
The "MSAA trick" was described in our SIGGRAPH 2015 presentation. The idea is to render the main viewport at a lower resolution but a higher MSAA sample count, using a custom MSAA pattern that matches the sampling points to pixel centers. For example, use 4xMSAA with a custom ordered grid pattern and render the viewport at 2x2 lower resolution. The MSAA hardware guarantees that each triangle hitting a subpixel is shaded separately (a separate pixel shader invocation and separate storage). Sample replication (instead of supersampling) only occurs for samples that belong to the same triangle.
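A sketch of the sample-to-pixel mapping this relies on (the ordered grid places the 4 samples of each low resolution pixel at the centers of the corresponding 2x2 full resolution pixels; the row-major sample numbering here is my assumption):

```python
def fullres_pixel(lowres_x, lowres_y, sample_index):
    """Map (low-res pixel, MSAA sample index) to the full-res pixel it
    represents. Samples 0..3 cover the 2x2 quad in row-major order."""
    return (2 * lowres_x + (sample_index & 1),
            2 * lowres_y + (sample_index >> 1))

print([fullres_pixel(3, 5, s) for s in range(4)])
# [(6, 10), (7, 10), (6, 11), (7, 11)]
```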
The MSAA trick works trivially for Intel's technique, as the triangle id is constant over a triangle. The result is pixel perfect: no reconstruction tricks or interpolation need to be performed. The 4xMSAA trick reduces the pixel shader wave invocations roughly by 3x and the color buffer bandwidth roughly by 2x (with GCN's MSAA color compression).
The MSAA trick also supports data interpolated across triangles. RedLynx uses it to interpolate UVs and tangents. However, UVs and barycentrics have a subtle difference: the UV gradient usually continues smoothly across triangles (of the same surface), while barycentrics have a discontinuity on every triangle edge. This is problematic for content with tiny triangles (especially at the edges of high poly objects). Neighbors for smooth UV interpolation are much easier to find, and cases where the neighbor is missing are rare. Reconstruction (to 2x2 higher resolution) would fail much more often for barycentrics.
Result: the MSAA trick is trivial to implement for Intel's technique. Barycentrics (in Tomasz's technique) have too many discontinuities in high poly content, making them unsuitable for the MSAA trick.
After improvements:
Intel's technique G-Buffer write BW is "on average" 16 bits per pixel (MSAA trick, tight tri+instance packing)
Tomasz's technique G-Buffer write BW is 64 bits per pixel (16+16 bit barycentrics, tight tri+instance packing)
The RedLynx technique (no improvements) is "on average" 32 bits per pixel (64 bits + MSAA compression). However, the SIGGRAPH presentation also mentions a 128 bits per pixel format that includes more data (such as motion vectors). This results in an average of 64 bits per pixel (with the MSAA trick). BW is thus similar to the optimized version of Tomasz's technique. However, the RedLynx technique is not directly comparable with the other two, as it doesn't need to fetch any vertex data in the lighting shader (saving bandwidth).
continued...