Modern textureless deferred rendering techniques

Discussion in 'Rendering Technology and APIs' started by sebbbi, Feb 28, 2016.

  1. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Lately several presentations, white papers and blog posts have been written about deferred rendering techniques that do not store any texture data to the g-buffer. These techniques reduce the memory bandwidth usage by storing less data to the g-buffer and by only sampling the texture data for visible pixels. Overdraw cost is minimized, resulting in less fluctuating frame rate.

    All of these techniques provide big savings in texture bandwidth. Instead of storing the uncompressed data to the g-buffer and reading the uncompressed data in the lighting step, these techniques directly sample the BC compressed data in the lighting shader (2x-4x reduction in texturing BW).

    The smallest amount of data you need to store is a triangle id per pixel (32 bits). Triangle id allows you to reconstruct the depth, so you don't need to read the depth buffer in the lighting step either. However, you still need to render the depth buffer, as depth buffering during the rendering step is still a win (no matter how cheap the pixel shader is): hierarchical depth culling saves a considerable amount of pixel shader invocations and bandwidth in complex scenes.

    Contenders

    1. Intel's "The Visibility Buffer"
    http://jcgt.org/published/0002/02/04/

    Stores 32 bit packed triangle + instance id. 64 bit storage is needed if rendering scene with > 65536 instances AND max triangle count per mesh > 65536.

    Runs "vertex shader" per pixel -> transformed vertices. Intersects triangle by screen ray -> barycentrics. With transformed vertices and barycentrics interpolates per pixel values.

    2. Tomasz Stachowiak's "Deferred Material Rendering System"
    https://onedrive.live.com/view.aspx...0!115&app=PowerPoint&authkey=!AP-pDh4IMUug6vs

    Similar to Intel's technique, but stores also barycentrics (2x32 bit) and instance id (32 bit). Total 128 bits. "Unlimited" triangle count. With stored barycentric coordinates, you don't need to fetch the vertex positions from the memory (and transform the positions again). Also there is no need to calculate the triangle intersection with the screen ray.

    3. RedLynx's "Virtual Deferred Texturing"
    http://advances.realtimerendering.c...iggraph2015_combined_final_footer_220dpi.pptx (page 40+)

    I was talking about this technique at SIGGRAPH 2015. Stores UV (16 + 16 bits) + tangent (32 bit encoded) instead of texture data. Fetch the texture data in the lighting pass directly from the virtual texture cache (8k^2 texture atlas containing a grid of 128x128 texture pages - all currently visible surfaces guaranteed to be in the cache). MSAA trick further reduces the G-buffer pixel shader wave count by ~3x and color bandwidth by ~2x, making it comparable to the Intel 32 bit per pixel technique in G-buffer bandwidth.
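    The numbers quoted for technique 3 can be checked with a little arithmetic. The sketch below is plain Python for readability, not shader code. It assumes the stored 16+16 bit UV addresses the 8k^2 cache atlas directly as unorm values, which the post does not state; the function names are made up for illustration.

```python
CACHE_SIZE_TEXELS = 8192   # "8k^2" cache atlas from the post
PAGE_SIZE_TEXELS = 128     # page size from the post
UV_BITS = 16               # bits per stored UV component

PAGES_PER_SIDE = CACHE_SIZE_TEXELS // PAGE_SIZE_TEXELS   # pages along one axis
TOTAL_PAGES = PAGES_PER_SIDE * PAGES_PER_SIDE            # pages in the whole atlas
# Addressable UV steps per texel (assumption: UV spans the whole atlas)
SUBTEXEL_STEPS = (1 << UV_BITS) // CACHE_SIZE_TEXELS


def pack_uv16(u, v):
    """Quantize u, v in [0, 1] to 16-bit unorm and pack them into 32 bits."""
    qu = round(min(max(u, 0.0), 1.0) * 0xFFFF)
    qv = round(min(max(v, 0.0), 1.0) * 0xFFFF)
    return (qv << 16) | qu


def unpack_uv16(packed):
    """Inverse of pack_uv16, up to quantization error."""
    return (packed & 0xFFFF) / 0xFFFF, (packed >> 16) / 0xFFFF


def cache_page_of(u, v):
    """Page coordinates in the atlas that contain the texel at (u, v)."""
    tx = min(int(u * CACHE_SIZE_TEXELS), CACHE_SIZE_TEXELS - 1)
    ty = min(int(v * CACHE_SIZE_TEXELS), CACHE_SIZE_TEXELS - 1)
    return tx // PAGE_SIZE_TEXELS, ty // PAGE_SIZE_TEXELS
```

    Under these assumptions the atlas holds a 64x64 grid of pages, and 16 bits per axis leave 8 addressable steps per texel for filtering.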

    Improvements

    First I will propose some improvements to techniques 1 and 2 to further reduce the G-buffer bandwidth cost of these techniques.

    Compact triangle & instance id storage

    Techniques 1 and 2 need 8 bytes to store instance id + triangle id in complex scenes (Intel is 32 bits in simple scenes).

    All of these techniques are perfect fit for GPU-driven pipelines, so I will assume that the geometry will be drawn by DirectX12 ExecuteIndirect or Vulkan MultiDrawIndirect. I will also assume that the renderer uses compute shaders to perform viewport and occlusion culling by sub-object granularity.

    If the culling is based on constant sized sub-object pieces (clusters), there is a simple way to uniquely number the triangles. If we assume 256 triangle culling granularity, we can pack the triangle id inside the cluster in 8 bits. This leaves 24 bits for the instance id, allowing us to draw 16 million independent pieces of geometry, or 8 instances per pixel (at 1080p).

    Instead of using a global instance id, one should store an index to the culling output buffer (containing only visible clusters). 16 million visible pieces of geometry (=4 billion visible rendered triangles) should be enough for everyone (at least for a while).

    Material id (if needed) can be added (SoA layout) to the visibility culling output. No need to store it either per pixel.

    Result: Intel's technique is always 32 bits per pixel (no matter the scene complexity). Tomasz's technique is down to 96 bits per pixel.
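    A minimal sketch of the compact packing described above, in Python for readability (a real version would be a couple of shader instructions). The 8/24 bit split follows the post; the function names and the material lookup table are hypothetical.

```python
TRIANGLE_BITS = 8    # 256 triangles per cluster (granularity assumed in the post)
CLUSTER_BITS = 24    # index into the list of visible clusters


def pack_visibility(visible_cluster_index, triangle_in_cluster):
    """Pack a visible-cluster index and a triangle id into one 32-bit value."""
    if not 0 <= visible_cluster_index < (1 << CLUSTER_BITS):
        raise ValueError("cluster index does not fit in 24 bits")
    if not 0 <= triangle_in_cluster < (1 << TRIANGLE_BITS):
        raise ValueError("triangle id does not fit in 8 bits")
    return (visible_cluster_index << TRIANGLE_BITS) | triangle_in_cluster


def unpack_visibility(packed):
    """Return (visible_cluster_index, triangle_in_cluster)."""
    return packed >> TRIANGLE_BITS, packed & ((1 << TRIANGLE_BITS) - 1)


def material_of(packed, cluster_material_ids):
    """Look up the material from a per-cluster SoA array, not from the pixel."""
    cluster_index, _ = unpack_visibility(packed)
    return cluster_material_ids[cluster_index]
```

    Because the material id lives next to the culling output, the per-pixel payload stays at 32 bits regardless of scene complexity.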

    16+16 bit barycentric coordinates

    Tomasz's technique is using 2x 32 bit floating point barycentric coordinates. This is wasteful. 24 bit normalized integer would give exactly the same quality (in the [0.5, 1.0] range). But I would want to go as low as 16+16 bits if possible.

    Let's assume that we need 8x8 subpixel precision for barycentric coordinates. This leaves 13 bits for the integer part = 8192 values. If we assume 2048 pixel vertical resolution (~1080p), 13 bits is enough for triangles that are 4x larger than the screen.

    Let's assume 90 degree vertical FOV and 0.25 meter near plane. When a triangle is on the near plane and it is 4x larger than the screen, its size is 0.25 meters * 2 * tan(45 degrees) * 4 = 2 meters. This means that triangle edges longer than 2 meters are a problem. However the problem only occurs close to the near plane, meaning that it only affects the highest LOD meshes. Triangles are never bigger than the screen when LOD 1+ mesh is used.
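    The precision budget above can be worked through in a few lines. This is a sketch of the arithmetic only; it treats the 16 bits as 13 integer bits plus 3 subpixel bits as in the post, and the helper names are invented for illustration.

```python
import math


def quantize_unorm16(value):
    """Store a barycentric coordinate in [0, 1] as a 16-bit normalized integer."""
    return round(min(max(value, 0.0), 1.0) * 65535)


def dequantize_unorm16(stored):
    return stored / 65535


def max_triangle_extent_pixels(total_bits=16, subpixel_bits=3):
    """Largest triangle extent (in pixels) that keeps 8x8 subpixel precision."""
    return 1 << (total_bits - subpixel_bits)


def near_plane_height_m(near_m, vertical_fov_deg):
    """World-space height of the view frustum at the near plane."""
    return 2.0 * near_m * math.tan(math.radians(vertical_fov_deg) / 2.0)


def max_edge_at_near_plane_m(near_m, vertical_fov_deg, screen_height_px):
    """Longest triangle extent at the near plane before precision runs out."""
    screens = max_triangle_extent_pixels() / screen_height_px
    return near_plane_height_m(near_m, vertical_fov_deg) * screens
```

    With a 0.25 m near plane, 90 degree vertical FOV and a 2048 pixel tall target, `max_edge_at_near_plane_m(0.25, 90.0, 2048)` comes out at about 2 meters, matching the figure above.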

    This is not a real problem for most games. You might need to add some extra triangles to your placeholder box / flat floor meshes. Performance hit of these extra triangles will be practically zero.

    Result: Tomasz's technique is down to 64 bits per pixel.

    "MSAA Trick" for triangle id techniques

    The "MSAA trick" technique was described by me in our SIGGRAPH 2015 presentation. The idea is to render the main viewport at lower resolution, but at higher MSAA sample count, and use custom MSAA pattern to match the sampling points to pixel centers. For example use 4xMSAA with custom ordered grid pattern and render the viewport at 2x2 lower resolution. MSAA hardware guarantees that each triangle hitting a subpixel will be shaded separately (separate pixel shader invocation and separate storage). Sample replication (instead of supersampling) will only occur for samples that belong to the same triangle.

    MSAA trick works trivially for Intel's technique, as triangle id is constant over a triangle. The result is pixel perfect. No reconstruction tricks or interpolation need to be performed. 4xMSAA trick reduces the pixel shader wave invocations roughly by 3x and the color buffer bandwidth roughly by 2x (with GCN's MSAA color compression).
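    To make the sample-to-pixel correspondence concrete, here is a small sketch for the 4xMSAA / 2x2 case. The sample ordering (row-major within the low resolution pixel) is an assumption; real hardware lets you choose the order when programming the custom pattern.

```python
# Custom ordered-grid sample positions inside one low-res pixel, in units of
# low-res pixels. Row-major ordering is assumed here.
ORDERED_GRID_4X = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]


def sample_to_full_res_pixel(low_x, low_y, sample_index):
    """Full-resolution pixel that a given MSAA sample represents."""
    return 2 * low_x + (sample_index & 1), 2 * low_y + (sample_index >> 1)


def full_res_center_in_low_res_units(full_x, full_y):
    """Center of a full-resolution pixel, expressed in low-res pixel units."""
    return (full_x + 0.5) / 2.0, (full_y + 0.5) / 2.0


def sample_position(low_x, low_y, sample_index):
    """Position of an MSAA sample in low-res pixel units."""
    offset_x, offset_y = ORDERED_GRID_4X[sample_index]
    return low_x + offset_x, low_y + offset_y
```

    Each sample position lands exactly on a full-resolution pixel center, which is why the triangle id result is pixel perfect without any reconstruction.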

    MSAA trick also supports data interpolated across triangles. RedLynx uses it to interpolate UVs and tangents. However UVs and barycentrics have a subtle difference. UV gradient usually smoothly continues across triangles (of the same surface) while barycentric has a discontinuity on every triangle edge. This is problematic for content that has tiny triangles (especially for edges of high poly objects). Neighbors for smooth UV interpolation are much easier to find. Cases where the neighbor is missing are rare. Reconstruction (to 2x2 higher resolution) would fail much more often for barycentrics.

    Result: Trivial to implement MSAA trick for Intel's technique. Barycentrics (in Tomasz's technique) have too many discontinuities in high poly content, making it unsuitable for MSAA trick.

    After improvements:

    Intel's technique G-Buffer write BW is "on average" 16 bits per pixel (MSAA trick, tight tri+instance packing)
    Tomasz's technique G-Buffer write BW is 64 bits per pixel (16+16 bit barycentrics, tight tri+instance packing)

    RedLynx technique (no improvements) is "on average" 32 bits per pixel (64 bits + MSAA compression). However the SIGGRAPH paper also mentions 128 bits per pixel format to include more data (such as motion vectors). This results in average 64 bits per pixel (with MSAA trick). BW is thus similar to optimized version of Tomasz's technique. However the RedLynx technique is not directly comparable with the other two as it doesn't need to fetch any vertex data in the lighting shader (saving bandwidth).
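    The averages above follow from the stored bits and the rough 2x color bandwidth saving of the MSAA trick. A sketch of that bookkeeping, with the 2x factor taken from the post as an approximation rather than a measured value:

```python
MSAA_TRICK_COLOR_BW_REDUCTION = 2.0   # "roughly 2x" according to the post


def average_bits_per_pixel(stored_bits, bw_reduction=1.0):
    """Average G-buffer write cost per output pixel."""
    return stored_bits / bw_reduction


ESTIMATES = {
    # 32-bit packed id, MSAA trick applies
    "intel": average_bits_per_pixel(32, MSAA_TRICK_COLOR_BW_REDUCTION),
    # 32-bit packed id + 16+16 bit barycentrics, no MSAA trick
    "tomasz": average_bits_per_pixel(32 + 16 + 16),
    # 64-bit UV + tangent layout, MSAA trick applies
    "redlynx_64": average_bits_per_pixel(64, MSAA_TRICK_COLOR_BW_REDUCTION),
    # 128-bit layout with extra data such as motion vectors
    "redlynx_128": average_bits_per_pixel(128, MSAA_TRICK_COLOR_BW_REDUCTION),
}
```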

    continued...
     
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Comparison

    Quick comparison of the techniques.

    Geometry transform

    None of these techniques require submitting all draw calls and geometry twice (examples: light pass in deferred lighting, depth pre-pass in forward+ and LIDR). This is excellent, since geometry rendering is the most fluctuating cost in the scene rendering (for both CPU and GPU) and doubling this fluctuation makes it harder to hit frame rate targets consistently.

    Intel's technique fetches each visible triangle from memory and transforms it again once per visible pixel. It's worth noting however, that it only transforms the vertex positions in the vertex shader. Tangent matrix (or quaternion) transformation is only done for visible triangles. This further reduces the fluctuation of geometry processing, as the normal transformation is only done for visible pixels. This is a fixed cost, as there's a fixed number of pixels on the screen.
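    As an illustration of the per-pixel work, the sketch below recovers perspective-correct barycentrics for a pixel from three transformed (clip space) vertices and uses them to interpolate an attribute. Note that it uses 2D edge functions with a 1/w correction rather than the screen ray intersection described in Intel's paper; the two are expected to give the same weights for a point inside the triangle. All names are illustrative.

```python
def edge(a, b, p):
    """Signed, doubled area of triangle (a, b, p) in 2D."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def perspective_barycentrics(clip_vertices, ndc_point):
    """clip_vertices: three (x, y, w) tuples; ndc_point: (x, y) after divide."""
    screen = [(x / w, y / w) for (x, y, w) in clip_vertices]
    area = edge(screen[0], screen[1], screen[2])
    if area == 0.0:
        raise ValueError("degenerate triangle")
    # Screen-space (affine) barycentrics from edge functions.
    affine = (
        edge(screen[1], screen[2], ndc_point) / area,
        edge(screen[2], screen[0], ndc_point) / area,
        edge(screen[0], screen[1], ndc_point) / area,
    )
    # Perspective correction: weight each term by 1/w and renormalize.
    weighted = [l / v[2] for l, v in zip(affine, clip_vertices)]
    total = sum(weighted)
    return tuple(value / total for value in weighted)


def interpolate(barycentrics, vertex_values):
    """Interpolate one scalar attribute with the given barycentrics."""
    return sum(b * value for b, value in zip(barycentrics, vertex_values))
```

    In the lighting shader the same weights would be applied to UVs and the tangent frame fetched for the three vertices of the stored triangle id.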

    Stored barycentric coordinates in Tomasz's technique remove the need to transform the vertex positions again for each visible pixel. This also means that the position data of the vertices doesn't need to be fetched. Depth buffer has enough information to reconstruct the pixel position (similarly to traditional deferred techniques). However tangent space normal maps still need to be transformed, and this means that vertex animation code is not fully separated from the lighting pass, requiring extra shader permutations and bookkeeping.

    RedLynx's technique stores transformed tangent frame to the G-buffer in addition to the UV coordinates. It does not access the vertex data at all in the lighting shader. Transformation (and skinning, etc) is fully separated from the pixel processing. This greatly reduces the number of shader permutations needed in the lighting step and also keeps the architecture more flexible and easier to maintain as animation and surface shading are 100% separated.

    The downside of RedLynx's technique is that it needs to transform and access both position and tangent frame in the G-buffer step, meaning that overdraw cost is slightly higher compared to the triangle id based techniques. Overdraw cost is still significantly reduced compared to traditional G-buffering techniques, as texturing is deferred to the lighting step.

    G-Buffer pixel shader

    Intel's technique has a super simple G-Buffer pixel shader: output the triangle id from the leading vertex (no interpolation). In addition to this, Tomasz's technique also outputs triangle barycentrics. On consoles barycentrics are already available in the pixel shader with no extra work. Unfortunately on PC DirectX there's no SV_Barycentric semantic for the pixel shader. This functionality is only available in domain shaders (as SV_DomainLocation). There are various ways to get barycentrics to the pixel shader (see Tomasz's slides), but all of them roughly halve the geometry throughput.

    RedLynx's technique needs to encode the quaternion (nlerp input from the vertex interpolation). This adds a few ALU instructions to the G-buffer shader.

    All of these techniques need a second shader (and second G-buffer draw call) for alpha clipped geometry. This shader reads an alpha clip texture to determine whether to discard the pixel (or in case of the MSAA trick, how to fill the SV_Coverage output mask). Alpha clipped G-buffer shader has a little bit higher cost compared to the opaque pixel shader.

    Tessellation

    RedLynx technique supports tessellation and procedural geometry (such as optimized terrain rendering) out of the box. No tricks are needed.

    Tomasz's technique stores barycentrics. Most tessellation techniques interpolate vertex values using the barycentrics and then perform some data lookups (such as displacement map reads). These kinds of tessellation setups can be easily supported. You execute the tessellation & displacement code normally (in the hull / domain shaders). Because the barycentrics are stored to the G-buffer, the interpolation in the lighting shader doesn't need any information about the geometry deformation. Position can also be reconstructed perfectly from the depth buffer.

    Added clarification: G-buffer shader (with tessellation) stores input triangle id ("tripatch id") to the G-buffer. There is no need to store the triangles generated by the tessellator to memory (or refer to them in any way in the lighting step). The lighting shader treats tessellated geometry exactly the same way as non-tessellated geometry. Stored barycentric coordinates and stored depth are enough to reconstruct all the values properly (if the assumption from last paragraph holds).

    Intel's 32 bpp technique cannot support tessellation. Extra information needs to be stored (doubling the pixel footprint). Also the tessellation support would make the technique incompatible with the MSAA trick (further doubling the G-buffer bandwidth of the optimized version).

    Lighting shader permutations

    All of these techniques use simple "degenerate" pixel shaders. There is no need to have multiple permutations of the pixel shader (as there are not that many different ways to output a triangle id, barycentrics or interpolated tangent+UV). This makes these techniques a perfect match for Vulkan MultiDrawIndirect and DX12 ExecuteIndirect.

    Shader permutations are handled in the lighting step. Intel's paper and Tomasz's presentation outline efficient ways to bin pixels to be executed with different lighting compute shader permutations. I would also recommend this paper by Olsson, et al.: http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf and this (older) SIGGRAPH 2010 presentation by Black Rock studio, called "Screen Space Classification for Efficient Deferred Shading". Unfortunately the link to the full presentation (on Disney's site) is down.

    Techniques based on triangle id (Intel and Tomasz) practically need a binning algorithm, as vertex animation must be handled in the lighting shader (to transform the tangent frame properly).
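    The binning step itself is described in the linked papers rather than here, so the following is only a generic illustration of the idea: classify screen tiles by which lighting shader permutations their pixels need, and build one tile list per permutation. The 8x8 tile size, the flat per-pixel permutation array and the function name are all assumptions, not taken from any of the three techniques.

```python
from collections import defaultdict


def bin_tiles_by_permutation(permutation_per_pixel, width, height, tile=8):
    """Return {permutation_id: [(tile_x, tile_y), ...]} for a flat pixel array."""
    bins = defaultdict(list)
    for tile_y in range(0, height, tile):
        for tile_x in range(0, width, tile):
            needed = set()
            for y in range(tile_y, min(tile_y + tile, height)):
                for x in range(tile_x, min(tile_x + tile, width)):
                    needed.add(permutation_per_pixel[y * width + x])
            # A tile is queued once for every permutation it contains.
            for permutation in sorted(needed):
                bins[permutation].append((tile_x // tile, tile_y // tile))
    return dict(bins)
```

    Each resulting list would then feed one compute dispatch (for example as indirect dispatch arguments), so every permutation only touches the tiles that need it.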

    RedLynx's technique fully separates the animation from the lighting meaning that it only needs lighting shader permutations based on pixel processing. RedLynx's technique also precaches material blending to the virtual texture, meaning that only material shading permutations need to be handled in the lighting shader. This is similar to traditional deferred rendering architectures. These differences greatly reduce the number of required lighting shader permutations compared to the triangle id based techniques.

    Executing lighting shader permutations efficiently

    I recommend that everyone reads this part of Tomasz's presentation carefully. Current APIs do not expose a mechanism that allows low level jump tables between shaders that have similar GPR / resource requirements. So this approach is not valid for PC. We need something else.

    Fortunately DirectX 12 and Vulkan introduce resource barriers. Now resource synchronization is fully in developers' hands. In DirectX 11 (and OpenGL), there was no way of running two compute shaders simultaneously that wrote to the same UAV. The API guaranteed that the first shader had finished its work and written data to the memory (or LLC) before the second one started.

    This is exactly our case. We have lots of small compute dispatches that fill small areas of the same lighting result UAV. None of these shaders write to the same areas. The DX11/OpenGL driver however has no way to know this. The result is that the GPU runs these small compute shaders one after another. The majority of the GPU compute units are idling.

    In DirectX 12 and Vulkan, you can submit all these small compute shaders at once. No resource barriers between them. This allows the GPU to execute them simultaneously, and fill all of the compute units. After all of the compute shaders have been submitted you submit a single resource barrier to transition the UAV to shader readable state. This ensures that all of the compute shader dispatches are finished before the post process shaders start reading the lighting output buffer.

    It is worth noting that this kind of parallel compute shader execution (from a single queue) is properly supported by AMD, Nvidia and Intel. This should not be confused with asynchronous compute (multiple queues). All AMD GCN GPUs are fully bindless and can execute multiple different compute shaders simultaneously even on a single CU (64 thread wave granularity). Nvidia and Intel can also run multiple compute shaders simultaneously, but the granularity might be lower (on some product generations). As usual, run benchmarks on multiple GPU generations from all three vendors.

    API feature requirements

    SV_Barycentrics to pixel shader
    This is a requirement for implementing Tomasz's technique efficiently. Not supported by Vulkan or DirectX 12. Already supported by consoles.

    Custom MSAA sampling patterns
    This is a requirement for implementing the RedLynx MSAA trick. This feature is already supported by OpenGL 4.5 extensions and OpenGL ES 3.2 extensions (mobile GPUs). Not supported by DirectX 12 or Vulkan.

    Feel free to comment :)
     
    #2 sebbbi, Feb 28, 2016
    Last edited: Feb 28, 2016
  3. Lefungus

    Regular Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    266
    Likes Received:
    119
    I'm currently reading the papers, very nice !
     
  4. sebbbi

    I recommend reading all of the 3 papers/presentations before reading my forum post. I skipped repeating lots of important information.
     
  5. sebbbi

    Seems that my Twitter link made this post popular. Here are some additional article links to why custom MSAA patterns and SV_Barycentrics should be included in Vulkan and DirectX 12...

    MSAA-based Coarse Shading for Power-Efficient Rendering on High Pixel-Density Displays
    Link: http://www.pmavridis.com/research/coarse_shading/

    They are using 4xMSAA ordered grid and rendering at 2x2 lower resolution on mobile devices. With "retina" resolutions (super high DPI), the half resolution shading inside the triangles is not that apparent since the triangle edges are pixel perfect.

    Hybrid Reconstruction Anti Aliasing
    Link: http://michaldrobot.files.wordpress.com/2014/08/hraa.pptx

    This SIGGRAPH 2014 article discusses several ways to abuse SV_Barycentrics and custom MSAA patterns. Barycentric coordinates inside the pixel shader make analytical anti-aliasing possible, and custom MSAA pattern enables highly efficient flip-quad antialiasing pattern.

    MJP's blog post
    https://mynameismjp.wordpress.com/2015/09/13/programmable-sample-points/

    Discusses various aspects of low resolution rendering and upsampling using custom ordered grid MSAA pattern.
     
    #5 sebbbi, Feb 28, 2016
    Last edited: Feb 28, 2016
  6. sebbbi

    How does MSAA hardware actually work?

    It's important to understand how the MSAA hardware works, and how MSAA achieves the memory bandwidth savings in order to understand the MSAA trick.

    MSAA has been around since 2001 (GeForce 3). It was developed to be an optimized version of supersampling. Traditional supersampling shades every subsample inside every pixel and stores the result. Multisampling adds an optimization. It only shades and stores one sample per pixel for each triangle. Samples sharing a triangle point to the same data. This is how the pixel shader invocations are reduced and how the memory bandwidth is saved.

    The simplest way to store MSAA data is to store an index per sample describing which sample data it uses. 4xMSAA requires 2 bit indices, 8xMSAA requires 3 bit indices and 16xMSAA requires 4 bit indices. This is a small fraction of extra data as render targets are usually 32 bit or 64 bit or even bigger (in case of a MRT G-buffer). In the most common case, where only 2 triangles hit a single pixel, all of the 4, 8 or 16 samples use index 0 or 1 and only two samples are saved to memory. The GPU needs to separate the samples to different memory locations (instead of storing one pixel as a big chunk). One cache line (or preferably one memory page) should only contain samples from the same index. Otherwise the GPU would load lots of unused samples to the cache. Only samples 0 and 1 are frequently accessed, samples 2 and 3 are quite often addressed... and sample 15 is almost never accessed. However full storage has to be reserved for the worst case scenario.
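    The index-per-sample scheme can be modelled in a few lines. This sketch follows the idealized description above, not any specific GPU's compression format; `shade` stands in for the pixel shader and all names are illustrative.

```python
def index_bits(sample_count):
    """Bits needed for a per-sample index (2 for 4x, 3 for 8x, 4 for 16x)."""
    return (sample_count - 1).bit_length()


def compress_pixel(sample_triangle_ids, shade):
    """Shade once per unique triangle and store a small index per sample.

    Returns (stored_fragments, per_sample_indices).
    """
    slot_of_triangle = {}
    stored_fragments = []
    per_sample_indices = []
    for triangle_id in sample_triangle_ids:
        if triangle_id not in slot_of_triangle:
            slot_of_triangle[triangle_id] = len(stored_fragments)
            stored_fragments.append(shade(triangle_id))  # one invocation per triangle
        per_sample_indices.append(slot_of_triangle[triangle_id])
    return stored_fragments, per_sample_indices


def read_sample(stored_fragments, per_sample_indices, sample_index):
    """What a per-sample load returns: the fragment the index points at."""
    return stored_fragments[per_sample_indices[sample_index]]
```

    For a 4x pixel covered by two triangles, only two fragments are stored and the four 2-bit indices pick between them.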

    MSAA render targets can be directly read from a compute shader (Texture2DMS). There is no need to "decompress" the color data before reading it (*). Let's say we are using 16xMSAA and these samples form a 4x4 output image tile (with MSAA trick). The shader reads all samples from 0 to 15. One load instruction is issued per sample. Not all of these 16 samples actually exist in memory. Some of the samples point to the same memory (= same index bits). There is no way of knowing how many actual samples are stored to memory (except with an IHV specific OpenGL extension). The compute shader needs to load them all. You will need the same number of load instructions as a traditional lighting compute shader (with no MSAA trick). However since some of these loads will load the same data as others, the data is often fetched directly from the L1 cache instead of from memory. This gives the compute shader reading the G-buffer data bandwidth savings similar to those the MSAA/ROP hardware provides to the pixel shader writing the G-buffer data.

    Hopefully this made things more clear to people that are not intimately familiar with the MSAA hardware. MSAA has been around for 15 years and my post describes only the ideal implementation. Modern GPUs most likely have more complex implementations with additional IHV specific optimizations.

    (*) Update: This is not true on all PC GPUs. Consoles can read MSAA data directly without needing to "decompress" it.
     
    #6 sebbbi, Feb 28, 2016
    Last edited: Feb 28, 2016
  7. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,153
    Likes Received:
    928
    Location:
    still camping with a mauler
    Do you think SV_Barycentrics to pixel shader and Custom MSAA sampling patterns will be implemented in future iterations of D3D or Vulkan? I actually thought the latter was supported since DX11 (or even before then...) but I must have been thinking about the OGL extensions. I believe the actual hardware (on PC) has been capable of both these things for quite some time so it's kind of sad to see the new APIs not supporting them.
     
  8. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    I suggested something similar except 2 pass forward rendering using sebbbi's idea of a tri-cluster id. Any papers on something like that? or at least benches?

    edit - second pass consisting only of visible triangle clusters.
     
  9. sebbbi

    All PC DX12 hardware should be capable. I don't know about mobile DX12 / Vulkan compatible hardware.

    I have high hopes of getting these features, and most of the other features in my SIGGRAPH presentation wish list, into future DX / Vulkan revisions. It is understandable that the amount of new features in the first versions of these APIs was cut down to a minimum, as the resource management and the state binding were fully rewritten in both APIs. We finally got proper support for multithreading, multiple GPU queues, explicit resource barriers, and much lower CPU overhead. Vulkan also got a completely new shader code format (SPIR-V).

    I will write another post to elaborate my feature wish list soon.
     
  10. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    2,989
    Likes Received:
    2,560
    I'm not on a level where I can add anything to the conversation, but please keep talking, as I can follow at a high level, and I sure find it all very interesting. You don't know how curious I am for your next siggraph/gdc/something-else presentation when it does happen.
     
  11. Infinisearch

    In a new thread or in this one?
     
  12. sebbbi

    Forward rendering is also possible with triangle clusters. First you do occlusion culling based on last frame depth pyramid. This gives you a list of visible clusters (append buffer really). Now you go through the visible clusters and check whether their bounding spheres intersect with each light bounding sphere. Reserve memory with one atomic add and store that index to the cluster (and the light count). Then you forward render the clusters. Shader loops through the cluster lights. Then you refresh the depth pyramid from the depth buffer, and do culling pass 2. If any clusters were detected to be false positives in first pass, you cull lights to these clusters and render them.
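    The light assignment step can be sketched on the CPU as follows. On the GPU the offset reservation would be a single atomic add on a shared counter per cluster; here a Python list length stands in for that counter. The data layout and names are assumptions.

```python
def spheres_intersect(center_a, radius_a, center_b, radius_b):
    """True if two bounding spheres overlap (or touch)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(center_a, center_b))
    reach = radius_a + radius_b
    return dist_sq <= reach * reach


def cull_lights_per_cluster(visible_clusters, lights):
    """visible_clusters and lights are lists of (center, radius) tuples.

    Returns (light_index_buffer, cluster_ranges), where cluster_ranges[i]
    is the (offset, count) of cluster i's lights in light_index_buffer.
    """
    light_index_buffer = []
    cluster_ranges = []
    for cluster_center, cluster_radius in visible_clusters:
        hits = [
            light_index
            for light_index, (light_center, light_radius) in enumerate(lights)
            if spheres_intersect(cluster_center, cluster_radius,
                                 light_center, light_radius)
        ]
        offset = len(light_index_buffer)  # GPU: one atomic add of len(hits)
        light_index_buffer.extend(hits)
        cluster_ranges.append((offset, len(hits)))
    return light_index_buffer, cluster_ranges
```

    The forward shader for a cluster then loops over `count` entries starting at `offset`.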

    With forward rendering, it is even more important to depth sort the clusters before rendering them. This is a big gain, as forward rendering pixel shader is super heavy, as it does all the lighting.

    However, you have a problem... Forward rendering practically requires z-prepass (double drawing everything) as you need z-buffer for SSAO, screen space reflections (very important for cubemap reflection occlusion with PBR) and shadow culling. Shadow culling (render shadows only to visible pixels based on depth buffer) is a huge boost for performance in dynamic environments. You need to perform all these steps before the lighting, so you can't use the lighting pass (single forward pass) depth buffer for these. And unfortunately any modern engine wants all of these.

    As for the wish list post: a new thread, to encourage people to add their own wish lists.
     
  13. Clukos

    Clukos Bloodborne 2 when?
    Veteran Newcomer

    Joined:
    Jun 25, 2014
    Messages:
    4,462
    Likes Received:
    3,793
    Sebbi, if it's not too much to ask, can you explain what the hold up with Dx12 currently is? We've already seen a couple of games promising Dx12 support in the future (Rise of the Tomb Raider, Just Cause 3) but they don't include it day 1, or day 60. Could it be performance related problems? Like how The Talos Principle is actually running worse with Vulkan enabled by up to 20% (although frametimes are more consistent than the Dx11 alternative)?
     
  14. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,986
    Likes Received:
    847
    Location:
    Planet Earth.
    As with all big API changes, the structure of the existing code doesn't match the structure of the API anymore. Retrofitting the new API into the old engine will not provide much benefit, and may even incur a penalty.
    It takes an engine rewrite to take advantage of the new features/structure of the API, that's why existing games rarely benefit from a new API.
    (Ignoring the fact that drivers also take time to mature and improve.)
     
  15. sebbbi

    I am so happy that we are going to have cross lane operations soon in DX12.

    As quad swizzle will be guaranteed on all GPUs, here's an optimization trick for Intel's triangle id technique that uses quad swizzle to save vertex processing work.

    Use 3 quad swizzles (equivalent to 2x ddx_fine + ddy_fine) to get triangle id of the 3 other threads in the same quad. Each triangle needs 3 vertex transforms. So a quad normally transforms 12 vertices (loop of 3 transforms x 4 threads). With quad swizzles you can efficiently share data between 4 threads. This means that you only need N*3/4 (round up) loop iterations, where N is the amount of unique triangles in the quad. After transform, you use quad swizzle to replicate the transformed data to other threads in the quad. No LDS is needed.

    In the best case, this reduces transform work by 66% (all four pixels are from the same triangle). 33% reduction occurs when two triangles are in the same tile.
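    The iteration counts above can be checked with a small calculation. This only models the loop count per quad (N*3/4 rounded up versus 3 per thread), not the actual swizzle instructions; the function names are illustrative.

```python
BASELINE_ITERATIONS = 3   # each thread transforms its own 3 vertices


def unique_triangles_in_quad(quad_triangle_ids):
    """Number of distinct triangles among the 4 pixels of a quad."""
    return len(set(quad_triangle_ids))


def shared_transform_iterations(unique_triangles):
    """Loop iterations when 4 threads share the work: ceil(N * 3 / 4)."""
    return (unique_triangles * 3 + 3) // 4


def transform_work_saved(quad_triangle_ids):
    """Fraction of per-thread transform iterations saved for this quad."""
    iterations = shared_transform_iterations(
        unique_triangles_in_quad(quad_triangle_ids))
    return 1.0 - iterations / BASELINE_ITERATIONS
```

    One triangle per quad needs a single iteration (about 66% saved), two triangles need two (about 33% saved), and with three or four distinct triangles the loop is back to three iterations unless shared vertices are also deduplicated.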

    You can further improve this technique by only transforming shared vertices once (it is highly likely that two adjacent triangles in screen space share a vertex or two).

    An alternative way to reduce vertex transform work is to hash the vertices in LDS (or use an LDS bloom filter). Then repurpose threads to transform one vertex each. In the end each thread fetches the transformed result from the hash. As you likely guessed, the hashing costs roughly the same as transforming the vertex again. So this "work optimal" technique is only a win for heavy animations (skinning, speed tree wind, etc). And it helps most in the best case (big triangles).

    The quad swizzle technique is almost free and gives gains also in dense scenes (triangles just a few pixels each). Soon it is also possible on PC :)
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Dare I say it: DX12 and Vulkan are so crap free that a whole new generation of bright sparks will do wonderful things. It boggles my mind that with almost 10 TFLOPS graphics are so shit in modern games.

    Think about it: on Fiji that's more than 16,000 instructions per pixel per 3840x2160 frame at 60 frames per second. And that excludes all the free texture filtering, raster blending and depth comparison operations. It's completely ridiculous how shit graphics are right now.
     
  17. sebbbi

    It's good to see that compute shaders are finally getting the attention they deserve by Microsoft. Brute force approaches fail to scale. Compute gets lots of good toys for work optimal parallel algorithm writing, such as efficient reductions and efficient prefix sum calculation. It is good to see basic parallel primitives becoming part of the language, making it easier to adopt good algorithms (instead of opting for simple brute force alternatives).

    I just hope that GPU side dispatch (called "dynamic parallelism" in CUDA) will also be introduced to DX12 and Vulkan (OpenCL 2.0 also has it already). We need solutions for the statically allocated GPR issue (pay for code paths you don't take). More fine grained GPU side work submission is needed.
     
  18. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    360
    Likes Received:
    252
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Deferred texturing decals
    MJP's blog post describes one way to handle decals in deferred texturing pipelines. But there are other ways.

    Virtual textured decals
    My SIGGRAPH presentation described a super efficient decaling approach using uniquely mapped virtual texturing. The decals are burned into the virtual texture cache whenever a new texture page (128x128 texels) becomes visible. This way decals add zero additional cost to the frame rendering pass (important for 60 fps rendering). This technology was first used in Trials Evolution (Xbox 360) and it allowed us to reach 60 fps with a huge number of decals. Our terrain was also based on pure decals.

    Modify UV + late blending
    If you don't use uniquely mapped virtual texturing, you need other ways for decaling. Turns out that you can still render decals directly to the thin g-buffer.

    All your decal textures need to be accessible at once (just like all your base textures). Bindless, virtual texturing (custom indirection) and tiled resources make this possible.


    The idea is to override the UV (and the material id) of the pixels that are covered by the decal. The tangent is also transformed by the decal's orientation. UV of the decaled area now points to the decal. Lighting shader will fetch decal's texture data for this region and transform it by the tangent quaternion. This achieves screen door (on/off) decals.
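
    A minimal sketch of the override step described above. The g-buffer layout (`ThinGBufferPixel`), the function names and the multiplication order of the quaternions are assumptions made for illustration, not a description of any shipped implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ThinGBufferPixel:
    material_id: int
    uv: tuple        # (u, v) into the material's texture
    tangent: tuple   # tangent frame as a unit quaternion (w, x, y, z)


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def apply_decal(pixel, decal_material_id, decal_uv, decal_rotation):
    """Overwrite a covered pixel so the lighting shader samples the decal.

    Assumption: the decal's orientation is applied by pre-multiplying the
    surface tangent frame with the decal rotation quaternion.
    """
    return replace(
        pixel,
        material_id=decal_material_id,
        uv=decal_uv,
        tangent=quat_mul(decal_rotation, pixel.tangent),
    )
```

    The lighting shader needs no special decal path in this scheme: it reads material id, UV and tangent as usual and ends up sampling the decal's textures for the covered pixels.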

    Alpha blending is possible with either MSAA or with temporal reprojection. The important thing to notice is that the decal's per pixel alpha is available in the lighting shader. It is stored in the decal's texture and can be fetched from there along with the other texture channels (thanks to deferred texturing).

    Temporal reprojection is the most interesting option as it is cheap. You render the decal using an alternating checkerboard pattern. The decal is known to have continuous UV, so you can easily detect the decal's inner pixels by reconstructing neighbors with a screen space cross pattern. You can similarly interpolate the underlying surface UV/tangent (or just simply lean on temporal blending). As the decal's per pixel alpha texture is available, this technique can support any number of transparency levels (=256). It is not limited to the number of AA samples (which is usually 4 or 8). This technique sidesteps the commonly known stochastic transparency issues of temporal transparency, as the decal has identical depth and motion vector as the underlying surface.
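
    A rough sketch of the checkerboard idea, under simplifying assumptions: the decal is an axis aligned screen rectangle with linear UV, the g-buffer stores only the decal UV (or `None` where the surface was kept), and all function names are made up for this example.

```python
def decal_uv(x, y, rect):
    """Linear decal UV at the center of pixel (x, y); rect = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    return ((x + 0.5 - x0) / (x1 - x0), (y + 0.5 - y0) / (y1 - y0))


def render_decal_checkerboard(width, height, rect, frame):
    """Write the decal to every other covered pixel; None keeps the surface.

    The pattern alternates with the frame index, so each pixel holds the
    decal on one frame and the underlying surface on the next.
    """
    x0, y0, x1, y1 = rect
    return [
        [
            decal_uv(x, y, rect)
            if x0 <= x < x1 and y0 <= y < y1 and (x + y + frame) % 2 == 0
            else None
            for x in range(width)
        ]
        for y in range(height)
    ]


def reconstruct_decal_uv(gbuffer, x, y):
    """Decal UV at (x, y), or None if the pixel is not detected as inside a decal.

    A skipped pixel counts as an inner decal pixel only when all four cross
    neighbors hold the decal; continuous UV makes their average a good estimate.
    """
    if gbuffer[y][x] is not None:
        return gbuffer[y][x]
    neighbors = []
    for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
        if not (0 <= ny < len(gbuffer) and 0 <= nx < len(gbuffer[0])):
            return None
        if gbuffer[ny][nx] is None:
            return None
        neighbors.append(gbuffer[ny][nx])
    return (sum(n[0] for n in neighbors) / 4, sum(n[1] for n in neighbors) / 4)


def blend(surface_color, decal_color, alpha):
    """Blend in the lighting shader using the alpha fetched from the decal texture."""
    return tuple(s * (1 - alpha) + d * alpha for s, d in zip(surface_color, decal_color))
```

    In this sketch, skipped pixels on the decal's border are not detected as inner pixels, so they would have to rely on temporal blending alone.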

    The biggest limitation of this technique is that it doesn't support multiple overlapping decals on the same surface (you can, however, extend it to support 2-3 with various quality / perf tradeoffs). However, traditional g-buffer decals tend to be so expensive that you don't want a huge amount of overlap. This technique on the other hand is dirt cheap and could enable games to have a much higher total count of decals.
     
    #19 sebbbi, Mar 26, 2016
    Last edited: Mar 26, 2016
    homerdog likes this.
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Virtual textured decals cost

    As said in the earlier post, VT decaling has zero cost in the frame rendering. 30 fps, 60 fps, 90 fps = same cost. Two eyes for VR = almost same cost.

    Camera movement speed and especially how much new surface area gets revealed (per second, not per frame) determine the cost / second. The cost is amortized over multiple frames.

    Virtual texturing pages are always the same size in texels (for example 128x128) independent of the mip level. This means that a page one level coarser covers 4 (2x2) pages of the next finer level. This makes it simple to throttle the update rate. In Trials Evolution we had a limit of 12 pages per frame. The generator sorted requests so that coarse pages covering 4 requested fine pages had priority. This practically hid all popping related to camera movement, as the human eye cannot focus that quickly on new details. The page refresh limit causes problems on camera jumps. However, the eye cannot see a single frame drop during a camera jump, since a jump is not smooth movement. Just ensure that you keep showing the frame before the jump until the textures have been refreshed. People will not notice anything strange going on (unless a frame rate monitor is attached).
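
    One possible way to express the throttling, as a sketch only. The page tuple `(mip, x, y)`, the `schedule` function and the exact sort key are assumptions; the only figures taken from the post are the 12 pages per frame limit and the priority for coarse pages that cover requested fine pages.

```python
PAGES_PER_FRAME = 12  # per-frame limit quoted for Trials Evolution


def parent(page):
    """Page one mip level coarser that covers this page (2x2 children per parent)."""
    mip, x, y = page
    return (mip + 1, x // 2, y // 2)


def schedule(requests, budget=PAGES_PER_FRAME):
    """Split page requests into (generate this frame, defer to later frames).

    Pages are (mip, x, y) tuples, a higher mip meaning a coarser page.
    """
    requested = set(requests)

    # Count how many requested pages each potential parent covers.
    covered = {}
    for page in requested:
        covered[parent(page)] = covered.get(parent(page), 0) + 1

    # Pages covering the most requested finer pages go first, then coarser
    # mips, then a fixed x/y order so the result is deterministic.
    ordered = sorted(
        requested,
        key=lambda p: (-covered.get(p, 0), -p[0], p[1], p[2]),
    )
    return ordered[:budget], ordered[budget:]
```

    Deferred pages are simply requested again on later frames, which is how the cost ends up amortized over several frames instead of spiking on one.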

    12 refreshed pages * 128x128 texels/page = 21% of 720p. On average the cost is much smaller.
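
    The worst case figure spelled out, with a 1080p target added purely as an extra comparison point that is not in the original estimate:

```python
pages_per_frame = 12
texels_per_page = 128 * 128
refreshed_texels = pages_per_frame * texels_per_page   # 196,608 texels

fraction_of_720p = refreshed_texels / (1280 * 720)     # about 0.21
fraction_of_1080p = refreshed_texels / (1920 * 1080)   # about 0.09
```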
     
    milk, homerdog, Razor1 and 2 others like this.