Direct3D Mesh Shaders

Discussion in 'Rendering Technology and APIs' started by DmitryKo, Jul 1, 2019.

  1. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    690
    Likes Received:
    563
    Location:
    55°38′33″ N, 37°28′37″ E
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,778
    Likes Received:
    2,566
    So MS is officially supporting the last new Turing feature (mesh shaders)?
     
  3. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,502
    Likes Received:
    8,707
    Location:
    Cleveland
    It's good to see Mesh Shaders added, as they seemed to generate the most positive developer excitement out of all the new features of the RTX cards.
     
    jlippo, pharma and DavidGraham like this.
  4. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,778
    Likes Received:
    2,566
    Are there any links other than this one that elaborate on this subject in more detail?
     
  5. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    690
    Likes Received:
    563
    Location:
    55°38′33″ N, 37°28′37″ E
    Yes, it looks so, but SDK support is very preliminary (even the feature query is not implemented yet).

    NVidia developer resources imply it's actually a new geometry pipeline which supersedes the traditional vertex, tessellation (hull/domain), and geometry shader stages - that would require a lot of work on the supporting API.

    https://devblogs.nvidia.com/introduction-turing-mesh-shaders/
    https://on-demand.gputechconf.com/gtc-eu/2018/pdf/e8515-mesh-shaders-in-turing.pdf
    http://www.reedbeta.com/blog/mesh-shader-possibilities/
    https://developer.nvidia.com/vulkan-turing
    etc.
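    To give a concrete flavour of the programming model those resources describe, here is a minimal mesh shader skeleton in HLSL-style syntax. Treat it as an illustrative sketch only: the buffer names and layouts are invented for the example, and the 64-vertex / 126-triangle budget simply follows NVidia's recommendations.
    Code:
    // Illustrative mesh shader: each thread group expands one meshlet into a
    // small batch of vertices and triangles and hands them to the rasterizer.
    struct VertexOut
    {
        float4 position : SV_Position;
        float3 normal   : NORMAL;
    };

    StructuredBuffer<float3> positions        : register(t0);
    StructuredBuffer<float3> normals          : register(t1);
    StructuredBuffer<uint3>  meshletTriangles : register(t2); // hypothetical per-meshlet triangle lists
    cbuffer Camera : register(b0) { float4x4 viewProj; };

    [numthreads(128, 1, 1)]
    [outputtopology("triangle")]
    void MeshMain(
        uint gtid : SV_GroupThreadID,
        uint gid  : SV_GroupID,
        out vertices VertexOut verts[64],
        out indices  uint3     tris[126])
    {
        // Declare how much of the fixed-size output this meshlet actually uses.
        SetMeshOutputCounts(64, 126);

        // The first 64 threads do per-vertex work - what a vertex shader used to do.
        if (gtid < 64)
        {
            uint v = gid * 64 + gtid;
            verts[gtid].position = mul(float4(positions[v], 1.0f), viewProj);
            verts[gtid].normal   = normals[v];
        }

        // The first 126 threads copy the meshlet's triangle list (local 0..63 indices).
        if (gtid < 126)
            tris[gtid] = meshletTriangles[gid * 126 + gtid];
    }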

     
    #5 DmitryKo, Jul 3, 2019
    Last edited: Sep 11, 2019
    DavidGraham and pharma like this.
  6. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    I like the idea, hopefully this will completely replace that crap called "geometry shader" for good without creating a new shader stage bottleneck.
     
    chris1515 likes this.
  7. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,778
    Likes Received:
    2,566
    Thanks, but I still didn't get that last part: is this a mesh-shader-specific pipeline, or a new pipeline that is merely similar to it?
     
  8. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    690
    Likes Received:
    563
    Location:
    55°38′33″ N, 37°28′37″ E
    This is not an addition to existing vertex/tessellation/geometry stages - it's a new geometry processing pipeline, entirely based on programmable task shader / mesh shader stages which operate on 'meshlets', small chunks of primitives ideal for parallel processing which add up to form complex meshes. It does not include existing shader stages or fixed function blocks.



    This mesh-based pipeline would require new Direct3D APIs to operate, just like *_NV_mesh_shader extensions for Vulkan/OpenGL.

    That said, it can still be called alongside the traditional vertex/tessellation/geometry shader pipeline and no changes to pixel shaders are required.
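    As a rough sketch of how the task stage in front of that could work, here is an HLSL-style example that culls meshlets against the view frustum and launches one mesh shader group per survivor. The payload layout, buffer names and the 64-meshlets-per-group size are assumptions made for the illustration, not anything mandated by the pipeline.
    Code:
    // Illustrative task/amplification stage: one thread per candidate meshlet,
    // compacting the visible ones into a payload for the mesh stage.
    struct Payload { uint meshletIndices[64]; };

    StructuredBuffer<float4> meshletBounds : register(t0); // xyz = sphere center, w = radius
    cbuffer Camera : register(b0) { float4 frustumPlanes[6]; uint meshletCount; };

    groupshared Payload s_payload;
    groupshared uint    s_visibleCount;

    bool SphereVisible(float4 s)
    {
        for (uint i = 0; i < 6; ++i)
            if (dot(frustumPlanes[i].xyz, s.xyz) + frustumPlanes[i].w < -s.w)
                return false;
        return true;
    }

    [numthreads(64, 1, 1)]
    void TaskMain(uint gtid : SV_GroupThreadID, uint gid : SV_GroupID)
    {
        if (gtid == 0)
            s_visibleCount = 0;
        GroupMemoryBarrierWithGroupSync();

        uint meshletIndex = gid * 64 + gtid;
        if (meshletIndex < meshletCount && SphereVisible(meshletBounds[meshletIndex]))
        {
            uint slot;
            InterlockedAdd(s_visibleCount, 1, slot);
            s_payload.meshletIndices[slot] = meshletIndex;
        }
        GroupMemoryBarrierWithGroupSync();

        // Launch one mesh shader group per surviving meshlet; the payload travels with them.
        DispatchMesh(s_visibleCount, 1, 1, s_payload);
    }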
     
    #8 DmitryKo, Jul 3, 2019
    Last edited: Jul 3, 2019
  9. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    All this reminds me a little bit of the never really born AMD primitive shaders... However, here we have a fixed stage in the middle...
     
  10. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,986
    Likes Received:
    847
    Location:
    Planet Earth.
    It would be nice if that could be made into a standard feature.
     
  11. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    690
    Likes Received:
    563
    Location:
    55°38′33″ N, 37°28′37″ E
    Looking again through the Vega 10 presentations and whitepapers, there is actually a tessellator block in AMD's 'next-generation geometry' (NGG) path - and overall it has a lot of similarities with NVidia's task/mesh shaders, from compute-like work scheduling to the unification of vertex, hull/domain and geometry shaders into a 'surface shader' for controlling tessellation and geometry LOD and 'primitive shaders' for discarding and transforming primitives.

    AMD did promise a new API to fully exploit the possibilities of the NGG path, though apparently that never came to fruition; they provided an 'automatic' path in the Vega drivers which was later disabled - but now it's finally enabled in the Navi 10 drivers.

    So maybe the discussion did result in a new API spec, and both 'mesh shaders' and 'primitive shaders' would be implemented in a new Direct3D geometry pipeline, probably with two feature tiers - a lower-level one supporting vertices/triangles/patches, and a higher-level one with support for 'meshlets'?
     
    #11 DmitryKo, Jul 6, 2019
    Last edited: Jul 6, 2019
    no-X and Per Lindstrom like this.
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Surface shaders were described as an optional type that was generated if tessellation was active. That shader encapsulates the shader stages upstream from the tessellation stage, while the primitive shader would accept the output from that stage and take up the domain/geometry/pass-through stages.
    Prior to the merged shader stages with GFX9, this likely would have been a pair of shader stages upstream from the tessellator.

    From a post I made in the navi thread:

    https://gitlab.freedesktop.org/mesa...iffs#5b30d5f9f14cd7cb6cf8bc05f5869b422ec93c63
    (from si_shader.h)
    Code:
    * API shaders           VS | TCS | TES | GS |pass| PS
    * are compiled as:         |     |     |    |thru|
    *                          |     |     |    |    |
    * Only VS & PS:         VS |     |     |    |    | PS
    * GFX6     - with GS:   ES |     |     | GS | VS | PS
    *          - with tess: LS | HS  | VS  |    |    | PS
    *          - with both: LS | HS  | ES  | GS | VS | PS
    * GFX9     - with GS:   -> |     |     | GS | VS | PS
    *          - with tess: -> | HS  | VS  |    |    | PS
    *          - with both: -> | HS  | ->  | GS | VS | PS
    *                          |     |     |    |    |
    * NGG      - VS & PS:   GS |     |     |    |    | PS
    * (GFX10+) - with GS:   -> |     |     | GS |    | PS
    *          - with tess: -> | HS  | GS  |    |    | PS
    *          - with both: -> | HS  | ->  | GS |    | PS
    *
    * -> = merged with the next stage
    
    GFX10 goes one more step and makes the GS stage the general stand-in for VS. Versus Vega, the number of full stages doesn't change much, since the last VS stage is a pass-through one.
    The surface shader's function is to perform the VS and TCS stages, which represent the standard vertex and hull steps that feed into the tessellator. Primitive shaders would come into play after tessellation.
    At least with Vega's version, the functionality and primitives wouldn't change, since this slots into the existing pipeline stages. Primitive shaders take the unmodified formats, threading model, and dynamic output of those shaders, and attempt to handle them more efficiently after expansion. Precomputation of static geometry is not offered, though this means the more dynamic elements and tessellation that Nvidia doesn't focus on can be made more efficient.
    Task and mesh shaders don't try to fit the existing model.

    This shows that there are similarities between the two schemes due to where they are in the pipeline, but their goals are divergent. Nvidia's method has a focus on optimizing static geometry and for more explicit control over what the geometry pipeline chooses to generate and how it is generated/formatted. AMD's tries to optimize the existing path and hardware, which means it tries to fit its efficiency changes at the tail end of the generation process, uses existing formats/topologies, and doesn't have the custom or upstream decision-making.
    What GFX10 does to change this is unclear. If the GS stage is similar to what it has been before, Nvidia's presentation on Mesh shaders shows their objections to it in terms of poorly mapping to hardware.

    May depend on the definition of higher-level. From an API standpoint, it's possible the primitive shaders and their standard vertices/triangles/patches approach would be higher-level due to their catering to the existing API versus task and mesh shaders that allow for control over threading, model selection, and topology customization.
     
  13. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    690
    Likes Received:
    563
    Location:
    55°38′33″ N, 37°28′37″ E
    si_shader.h does have some interesting comments on implementation details which are left unexplained in architecture documents.

    The Evergreen architecture was the first to have unified compute units emulate some fixed-function blocks and shader stage states with precompiled 'prolog'/'epilog' shader code, removing some limitations defined by the hardware specs. GCN then implemented a general-purpose compute architecture in place of dedicated shader processing blocks, and GCN5/NGG takes it further by replacing additional fixed-function blocks - which are single-threaded by design - with more shader code, and by improving the work scheduler to aggressively manage these smaller geometry workloads for better execution parallelism.

    Yes, I've read that in the Navi thread. Sure, their current mode of operation is to make the 'primitive shader' concept work in the existing geometry pipeline - but the big question is why AMD never released the NGG path as vendor-specified Vulkan/OpenGL extensions... they are very tight-lipped about it, so it's hard to make an informed guess about implementation details.

    Well, I'm not really sure if mesh-based pipeline could be modified to accept traditional primitives, and likewise whether primitive shader pipeline could accept 'meshlet' data, so they could be tiered on top of each other.

    That said, a 'meshlet' is just a collection/list of (potentially unconnected) triangles/vertices, so it's not much different at the low level. The principal differences are its small maximum size of 126 primitives, which allows massively parallel processing, and its potential compatibility with 'higher level' game assets typically implemented with hierarchical LOD meshes, which could be directly consumed and manipulated by the graphics hardware - while in the current paradigm, LOD meshes have to be converted to standard lists of primitives using CPU cycles.
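    For reference, a typical meshlet layout along those lines might look something like this - a sketch where the field names and the 64-vertex / 126-triangle budget follow NVidia's recommendations, while the exact packing is up to the engine. Offline tooling would split each LOD mesh into such meshlets once, instead of the CPU re-expanding index lists every frame.
    Code:
    // Illustrative meshlet layout: a small, self-contained chunk of a larger mesh.
    struct Meshlet
    {
        uint vertexCount;      // <= 64 unique vertices referenced by this meshlet
        uint vertexOffset;     // offset into 'meshletVertexIndices'
        uint primitiveCount;   // <= 126 triangles
        uint primitiveOffset;  // offset into 'meshletLocalTriangles'
    };

    StructuredBuffer<Meshlet> meshlets              : register(t0);
    StructuredBuffer<uint>    meshletVertexIndices  : register(t1); // indices into the shared vertex buffer
    StructuredBuffer<uint>    meshletLocalTriangles : register(t2); // 3 packed 8-bit local indices per triangle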

    IMHO, there are better chances that AMD would implement 'meshlet' support in Navi with their general-purpose compute units, and this would make 'meshlet' structures a higher tier. Even Vega's NGG work scheduler should have offered improved efficiency for smaller workloads, if you look at the AMD Vega 10 whitepaper cited above (p. 6) - and this should have further improved in Navi/RDNA.

     
    #13 DmitryKo, Jul 7, 2019
    Last edited: Jul 7, 2019
    iroboto, Alessio1989 and BRiT like this.
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    If it suffered from hardware issues or a lack of performance benefits, I could see the project losing out to other priorities. It's not clear at this point whether Navi introduced something significantly different to the implementation that might have conflicted with whatever would have come out for Vega.
    As for the primitive shader concept, there would be patents that described being able to take vertex shaders and use register dependence analysis to determine which operations used and depended on position data, hoisting those operations ahead of the rest of the shader, and adding code to cull primitives before involving other portions of the shading process.
    One of the AMD PR statements about cancelling primitive shaders was that they wanted to focus on more generally applicable and similar techniques already in use--games using triangle-sieve compute shaders that come from analyzing the vertex shaders, taking the position calculations out, and running them in a compute shader that generates the triangle stream for the rest of the vertex processing.

    Task and mesh shaders are described as having programmer-defined input and output, and that they can be made to take the same inputs to provide the same end result. Nvidia has a similar selling point to AMD for mesh shaders where it's low-effort to get from a standard vertex shader to a mesh shader variant. For certain specific input and parameter combinations, Nvidia admits that there can be an efficiency gain from the tessellation stage even if an equivalent task/mesh shader formulation can generate the same vertex stream. Outside the specific patterns that align very well with the fixed-function block, it's supposedly more frequently beneficial to go with the more flexible path.
    To a certain extent, the mesh shader can have a similar addition of late-stage culling logic, although it may not be a win for Nvidia unless the geometry is generating a substantial amount of attribute work per primitive.

    For AMD's primitive shaders, what was discussed with any detail was the version of primitive shaders that took existing API shaders and generated versions with hoisted position calculations and culling. Perhaps if a mesh shader became an API type a primitive shader variant could be generated--although as noted above Nvidia doesn't consider that outside the definition of a mesh shader already.
    A task shader is less likely to fall in the primitive shader's domain. It's too far upstream from where AMD's primitive shader is defined.

    There's a potential difference in meaning from "higher-tier" versus "higher-level". The definition given for a meshlet as a collection of vertices of arbitrary relationship to each other would require the shader to deal directly with lower-level details about the primitives and topology. A shader with a set of API-defined types and topologies would allow software to assign a type and assume the driver/hardware will handle the details below that level of abstraction.

    As for what the Vega whitepaper stated about NGG and primitive shaders, merged shader stages were put into the drivers when primitive shaders were not. Many of the features listed for Vega seemed to be able to function without primitive shaders.
    Some things like the DSBR might have suffered if the limited bin context size could be overwhelmed by a large amount of easily-culled geometry.
    The IWD's handling of context rolls is also something primitive shaders may not have had much to do with, since that covers re-ordering shader launches to avoid making certain state changes to the graphics pipeline in places where only a few contexts may be permitted concurrently. That would be orthogonal to the number of primitives using a specific context.
    The instance handling deals with geometry instancing, rather than how it's possible that a primitive shader can significantly reduce the amount of geometry. An instance is an object known further upstream and at a higher level of abstraction than the vertices in the stream emitted by a primitive shader.
     
  15. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    690
    Likes Received:
    563
    Location:
    55°38′33″ N, 37°28′37″ E
    Stage merging/reordering was made possible by 'unified shaders' back in the R600/R700 era, but NGG also claimed better multithreading, where the 4 geometry units would process 'more than 17' (?!!) primitives per cycle instead of only 4. And that probably did not work as expected in real-world scenarios.

    Factors such as cache bandwidth between fixed function triangle processing blocks and general purpose processing units could have been the limit, even if inefficient scheduling of small batches was not. Navi should have improved upon this with larger caches and narrower SIMD wavefronts with better locality, which could have finally enabled the benefits of the 'automatic' path.


    'Meshlet' may be either a 'higher' tier or a 'lower tier' from the data model point of view, but hierarchical LOD meshes are certainly a 'higher' level to me, just like the whole 'scene graph' concept that was never really possible to implement on hardware until very recently (as in BVH trees for DirectX Raytracing).


    As for the native path, it probably was not offering any sizeable improvement over 'automatic' shader management for their current use.

    I understand this is based on a PS4 'triangle sieve' technique where the vertex shader is compiled and executed twice - first through the compute shader 'stage', only to compute the position attributes and then discard invisible primitives (back-facing triangles) from the draw call, and then as the 'real' full vertex shader 'stage' on the remaining primitives. That explains the references to 'general-purpose' and 'compute-like' execution and the statistics of '17 or more' primitives from 4 geometry processing blocks.

    If AMD is able to analyze code dependencies to discard instructions that do not contribute to the final coordinate output, they don't even need the programmer's input to make this 'automatic' shader very efficient to run.
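    A minimal sketch of that kind of position-only sieve as an HLSL compute shader is below - the buffer names, the clip-space determinant test and the append-buffer output are my own assumptions for illustration, not AMD's or Sony's actual implementation.
    Code:
    // Position-only pre-pass: run just the position math, drop back-facing
    // triangles, and emit the survivors for the full vertex shader pass.
    StructuredBuffer<float3>      positions     : register(t0);
    StructuredBuffer<uint3>       inputTris     : register(t1);
    AppendStructuredBuffer<uint3> survivingTris : register(u0);

    cbuffer CullConstants : register(b0)
    {
        float4x4 viewProj;
        uint     triangleCount;
    };

    [numthreads(64, 1, 1)]
    void SieveMain(uint tid : SV_DispatchThreadID)
    {
        if (tid >= triangleCount)
            return;

        uint3  tri = inputTris[tid];
        float4 p0 = mul(float4(positions[tri.x], 1.0f), viewProj);
        float4 p1 = mul(float4(positions[tri.y], 1.0f), viewProj);
        float4 p2 = mul(float4(positions[tri.z], 1.0f), viewProj);

        // With the winding convention assumed here, back-facing (or zero-area)
        // triangles have a non-positive determinant in homogeneous clip space.
        float det = determinant(float3x3(p0.xyw, p1.xyw, p2.xyw));
        if (det <= 0.0f)
            return;

        survivingTris.Append(tri);
    }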


    However, AMD also said in the Vega 10 whitepaper that there could be other usage scenarios beyond primitive culling - though they have not pursued these possibilities so far.
    Hopefully they're working to implement a 'mesh' / 'primitive' shader path as a standard graphics API for Direct3D/Vulkan - if not for Navi 10, then for the 'RDNA2' / 'Next-gen Navi' parts.
     
    #15 DmitryKo, Jul 8, 2019
    Last edited: Jul 8, 2019
    milk, BRiT and CaptainGinger like this.
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The whitepaper gives the following statement:
    "The “Vega” 10 GPU includes four geometry engines which would normally be limited to a maximum throughput of four primitives per clock, but this limit increases to more than 17 primitives per clock when primitive shaders are employed."
    The key phrase is likely "when primitive shaders are employed", and in this context probably means the total number of primitives processed (submitted + culled).
    The compute shaders used to cull triangles in Frostbite (per a GDC16 presentation) cull one triangle per-thread in a wavefront, which is 16-wide per clock. There would be more than one clock per triangle, but this peak may be assuming certain shortcuts for known formats, trivial culling cases, and/or more than one primitive shader instantiation culling triangles.
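    To make the per-wavefront behaviour concrete, a culling shader of that shape typically compacts its survivors with wave intrinsics so the whole wave pays for only one global atomic - a rough sketch along those lines, where the buffer names and the single global counter are just for illustration:
    Code:
    // Wave-level compaction: each lane tests one triangle, the wave reserves one
    // contiguous block in the output buffer with a single atomic add.
    RWStructuredBuffer<uint3> survivingTris : register(u0);
    RWStructuredBuffer<uint>  survivorCount : register(u1);

    void EmitIfVisible(uint3 tri, bool keep)
    {
        uint slotInWave = WavePrefixCountBits(keep);  // survivors before this lane
        uint waveTotal  = WaveActiveCountBits(keep);  // survivors in the whole wave

        uint waveBase = 0;
        if (WaveIsFirstLane())
            InterlockedAdd(survivorCount[0], waveTotal, waveBase);
        waveBase = WaveReadLaneFirst(waveBase);

        if (keep)
            survivingTris[waveBase + slotInWave] = tri;
    }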

    Going by how culling shaders work elsewhere, there's also the probability that a given wavefront doesn't find 100% of its triangles culled, the load bandwidth of the CUs running the primitive shaders is finite, and there's likely a serial execution component that increases with the more complex shader.
    The merged shader stages don't eliminate the stages so much as they create a single shader with two segments that use LDS to transfer intermediate results.
    Navi does do more to increase single-threaded execution, doubles per-CU L1 bandwidth, and the shared LDS may allow for a workgroup to split the merged shader work across two CUs.


    It does have significant similarities conceptually with compute shader culling and primitive shaders. Versus compute shaders, the PS4's method has microcoded hooks that allow this shader to pass triangles to the main vertex shader via a buffer that can fit the L2, which a separate compute shader pass cannot manage. That said, Mark Cerny even stated that this was optional since it would require profiling to determine if it was a net win.
    AMD's primitive shader would have done more by combining the culling into a single shader invocation that had tighter links to the fixed-function and on-die paths, versus the long-latency and less predictable L2. The reduced overhead and tighter latency would have in theory made culling more effective since the front end tends to be significantly less latency-tolerant. However, it's redundant work if fewer triangles need culling, and the straightline performance of the CU and increased occupancy of the shader might have injected marginally more latency in a place where it is more problematic than in later pixel stages.
    It's also possible that there were hazards or bugs in whatever hooks the primitive shader would have had to interact with the shader engine's hardware. There are mostly-scrubbed hints at certain message types and a few instructions that mention primitive shaders or removing primitives from the shader engine's FIFOs, which might have been part of the scheme. However, fiddling with those in other situations has run into unforgiving latency tolerances or possible synchronization problems.

    That was AMD's marketing for Vega, right up until they suddenly couldn't. Existing culling methods show that it frequently isn't difficult to do (PS4, Frostbite, etc.), but for reasons not given it wasn't for Vega.
    Navi supposedly has the option available, but what GFX10 does hasn't been well-documented.

    It's fine in theory, and some of the descriptions of the overheads avoided with deferred attributes came up a few times with AMD and with Nvidia's mesh shaders.
    I've categorized all those as some kind of primitive shader 2.0+ variant, and haven't given them as much thought, given that AMD hadn't gotten to 1.0 (I'm still not sure what version Navi's would be).
     
    pharma and DmitryKo like this.
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Is there a technical explanation for why mesh shaders require new hardware? They seem to be essentially compute shaders that take anything as input and spit out triangles as output.

    Are there new hardware buffers or data structures required to make it all work?
     
  18. jlippo

    Veteran Regular

    Joined:
    Oct 7, 2004
    Messages:
    1,341
    Likes Received:
    437
    Location:
    Finland
    Humus did a small mesh shader demo.
    http://www.humus.name/index.php?page=3D&ID=93

    In one presentation they mentioned that mesh shaders are basically compute with a path to rasterization.

    Could be that there aren't many changes to the hardware except for that.
     
  19. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    3,502
    Likes Received:
    2,123
    Location:
    Barcelona Spain
    This is the same thing as the PS2 vector units...
     
    trinibwoy likes this.
  20. jlippo

    Veteran Regular

    Joined:
    Oct 7, 2004
    Messages:
    1,341
    Likes Received:
    437
    Location:
    Finland
    Pretty much.
    Although now the rasterizer has decent frustum clipping capabilities. ;)
     
    #20 jlippo, Aug 28, 2019
    Last edited: Aug 28, 2019