Direct3D Mesh Shaders

So MS is officially supporting the last new Turing feature (mesh shaders)?

Yes, it looks that way, but SDK support is very preliminary (even the feature query is not implemented yet).

NVidia developer resources imply it's actually a new geometry pipeline which supersedes the traditional vertex, tessellation (hull/domain), and geometry shader stages - that would require a lot of work on the supporting API.

Are there any links other than this one that elaborate on this subject in more detail?

https://devblogs.nvidia.com/introduction-turing-mesh-shaders/
https://on-demand.gputechconf.com/gtc-eu/2018/pdf/e8515-mesh-shaders-in-turing.pdf
http://www.reedbeta.com/blog/mesh-shader-possibilities/
https://developer.nvidia.com/vulkan-turing
etc.

 
I like the idea, hopefully this will completely replace that crap called "geometry shader" for good without creating a new shader stage bottleneck.
 
NVidia developer resources imply it's actually a new geometry pipeline which supersedes the traditional vertex, tessellation (hull/domain), and geometry shader stages - that would require a lot of work on the supporting API.
Thanks, though I still didn't get that last part: is this a mesh-shader-specific pipeline, or a new pipeline that is similar to it?
 
is this a mesh-shader-specific pipeline, or a new pipeline that is similar to it?

This is not an addition to the existing vertex/tessellation/geometry stages - it's a new geometry processing pipeline, entirely based on programmable task shader / mesh shader stages which operate on 'meshlets': small chunks of primitives, ideal for parallel processing, that add up to form complex meshes. It does not include the existing shader stages or fixed-function blocks.

This mesh-based pipeline would require new Direct3D APIs to operate, just like *_NV_mesh_shader extensions for Vulkan/OpenGL.

That said, it can still be called alongside the traditional vertex/tessellation/geometry shader pipeline and no changes to pixel shaders are required.
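
To make that concrete, here's roughly what the mesh stage looks like through NVidia's GLSL extension - a minimal sketch of mine based on the GL_NV_mesh_shader spec, with a made-up meshlet buffer layout (real engines pack this data much more tightly):

Code:
#version 450
#extension GL_NV_mesh_shader : require

// One workgroup cooperatively expands one meshlet.
layout(local_size_x = 32) in;
layout(triangles, max_vertices = 64, max_primitives = 126) out;

// Hypothetical meshlet layout - the application is free to choose its own.
struct Meshlet {
    uint vertexCount;
    uint primitiveCount;
    uint vertices[64];      // indices into the shared vertex buffer
    uint indices[126 * 3];  // meshlet-local indices forming triangles
};
layout(std430, binding = 0) readonly buffer Meshlets  { Meshlet meshlets[]; };
layout(std430, binding = 1) readonly buffer Positions { vec4 positions[]; };
layout(binding = 2) uniform Camera { mat4 viewProj; };

void main()
{
    uint mi = gl_WorkGroupID.x;

    // Each thread transforms a slice of the meshlet's vertices...
    for (uint v = gl_LocalInvocationID.x; v < meshlets[mi].vertexCount; v += 32)
        gl_MeshVerticesNV[v].gl_Position =
            viewProj * positions[meshlets[mi].vertices[v]];

    // ...and copies a slice of its triangle indices.
    for (uint i = gl_LocalInvocationID.x; i < meshlets[mi].primitiveCount * 3; i += 32)
        gl_PrimitiveIndicesNV[i] = meshlets[mi].indices[i];

    if (gl_LocalInvocationID.x == 0)
        gl_PrimitiveCountNV = meshlets[mi].primitiveCount;
}

On the host side this is launched with glDrawMeshTasksNV(first, count) / vkCmdDrawMeshTasksNV(cmd, taskCount, firstTask) rather than a regular draw call - note there is no input assembler or vertex attribute fetch anywhere; the shader pulls everything itself.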
 
Last edited:
All this reminds me a little bit of the never-really-born AMD primitive shaders... However, here we have a fixed stage in the middle...
 
All this reminds me a little bit of the never-really-born AMD primitive shaders... However, here we have a fixed stage in the middle...
Looking again through Vega10 presentations and whitepapers, there is actually a tessellator block in AMD's 'next-generation geometry' (NGG) path - and overall it has a lot of similarities with NVidia's task/mesh shaders, from compute-like work scheduling to unification of vertex, hull/domain and geometry shaders into a 'surface shader' for controlling tessellation and geometry LOD and 'primitive shaders' for discarding and transforming primitives.

AMD did promise a new API to fully exploit the possibilities of the NGG path, though apparently that never came to fruition; they provided an 'automatic' path in Vega drivers which was later disabled - but now it's finally enabled in Navi10 drivers.

So maybe the discussion did result in a new API spec, and both 'mesh shaders' and 'primitive shaders' would be implemented in a new Direct3D geometry pipeline, probably with two feature tiers - a lower-level one supporting vertices/triangles/patches, and a higher-level one with support for 'meshlets'?
 
Looking again through Vega10 presentations and whitepapers, there is actually a tessellator block in AMD's 'next-generation geometry' (NGG) path - and overall it has a lot of similarities with NVidia's task/mesh shaders, from compute-like work scheduling to unification of vertex, hull/domain and geometry shaders into a 'surface shader' for controlling tessellation and geometry LOD and 'primitive shaders' for discarding and transforming primitives.
Surface shaders were described as an optional type that was generated if tessellation was active. That shader encapsulates the shader stages upstream from the tessellation stage, while the primitive shader would accept the output from that stage and take up the domain/geometry/pass-through stages.
Prior to the merged shader stages with GFX9, this likely would have been a pair of shader stages upstream from the tessellator.

From a post I made in the navi thread:

https://gitlab.freedesktop.org/mesa...iffs#5b30d5f9f14cd7cb6cf8bc05f5869b422ec93c63
(from si_shader.h)
Code:
* API shaders           VS | TCS | TES | GS |pass| PS
* are compiled as:         |     |     |    |thru|
*                          |     |     |    |    |
* Only VS & PS:         VS |     |     |    |    | PS
* GFX6     - with GS:   ES |     |     | GS | VS | PS
*          - with tess: LS | HS  | VS  |    |    | PS
*          - with both: LS | HS  | ES  | GS | VS | PS
* GFX9     - with GS:   -> |     |     | GS | VS | PS
*          - with tess: -> | HS  | VS  |    |    | PS
*          - with both: -> | HS  | ->  | GS | VS | PS
*                          |     |     |    |    |
* NGG      - VS & PS:   GS |     |     |    |    | PS
* (GFX10+) - with GS:   -> |     |     | GS |    | PS
*          - with tess: -> | HS  | GS  |    |    | PS
*          - with both: -> | HS  | ->  | GS |    | PS
*
* -> = merged with the next stage

GFX10 goes one more step and makes the GS stage the general stand-in for VS. Versus Vega, the number of full stages doesn't change much, since the last VS stage is a pass-through one.
The surface shader's function is to perform the VS and TCS stages, which represent the standard vertex and hull steps that feed into the tessellator. Primitive shaders would come into play after tessellation.
At least with Vega's version, the functionality and primitives wouldn't change, since this slots into the existing pipeline stages. Primitive shaders take the unmodified formats, threading model, and dynamic output of those shaders, and attempt to handle them more efficiently after expansion. Precomputation of static geometry is not offered, although this also means the more dynamic elements and tessellation that Nvidia doesn't focus on can be made more efficient.
Task and mesh shaders don't try to fit the existing model.

This shows that there are similarities between the two schemes due to where they are in the pipeline, but their goals are divergent. Nvidia's method has a focus on optimizing static geometry and for more explicit control over what the geometry pipeline chooses to generate and how it is generated/formatted. AMD's tries to optimize the existing path and hardware, which means it tries to fit its efficiency changes at the tail end of the generation process, uses existing formats/topologies, and doesn't have the custom or upstream decision-making.
What GFX10 does to change this is unclear. If the GS stage is similar to what it has been before, Nvidia's presentation on mesh shaders shows their objections to how poorly it maps to hardware.

So maybe the discussion did result in a new API spec, and both 'mesh shaders' and 'primitive shaders' would be implemented in a new Direct3D geometry pipeline, probably with two feature tiers - a lower-level one supporting vertices/triangles/patches, and a higher-level with support for 'meshlets'?
May depend on the definition of higher-level. From an API standpoint, it's possible the primitive shaders and their standard vertices/triangles/patches approach would be higher-level due to their catering to the existing API versus task and mesh shaders that allow for control over threading, model selection, and topology customization.
 
si_shader.h does have some interesting comments on implementation details which are left unexplained in architecture documents.

Evergreen was the first architecture to have unified compute units emulate some fixed-function blocks and shader stage states with precompiled 'prolog'/'epilog' shader code, thus removing some limitations defined by hardware specs. GCN then implemented a general-purpose compute architecture in place of dedicated shader processing blocks, and GCN5/NGG takes it further by replacing additional fixed-function blocks - which are single-threaded by design - with more shader code, and by improving the work scheduler to aggressively manage these smaller geometry workloads for better execution parallelism.

At least with Vega's version, the functionality and primitives wouldn't change, since this slots into the existing pipeline stages. Primitive shaders take the unmodified formats, threading model, and dynamic output of those shaders, and attempt to handle them more efficiently after expansion.
AMD's tries to optimize the existing path and hardware, which means it tries to fit its efficiency changes at the tail end of the generation process, uses existing formats/topologies, and doesn't have the custom or upstream decision-making.
What GFX10 does to change this is unclear.
Yes, I've read that in the Navi thread. Sure, their current mode of operation is to make the 'primitive shader' concept work in the existing geometry pipeline - but the big question is why AMD never released the NGG path as vendor-specific Vulkan/OpenGL extensions... they are very tight-lipped about it, so it's hard to make an informed guess about implementation details.

From an API standpoint, it's possible the primitive shaders and their standard vertices/triangles/patches approach would be higher-level due to their catering to the existing API versus task and mesh shaders that allow for control over threading, model selection, and topology customization.
Well, I'm not really sure if the mesh-based pipeline could be modified to accept traditional primitives, and likewise whether the primitive shader pipeline could accept 'meshlet' data, so that they could be tiered on top of each other.

That said, 'meshlet' is just a collection/list of (potentially unconnected) triangles/vertices, so it's not much different at the low level. The principal differences are their maximum size of 126 primitives, which allows massively parallel processing, and potential compatibility with 'higher level' game assets typically implemented with hierarchical LOD meshes, which could be directly consumed and manipulated by graphics hardware - while in the current paradigm, LOD meshes have to be converted to standard lists of primitives using CPU cycles.
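
To put sizes on that (my arithmetic; the packing helper is from the GL_NV_mesh_shader spec): a meshlet at NVidia's recommended limits (64 vertices, 126 triangles) needs 126 × 3 = 378 meshlet-local indices, and since each one only has to address up to 64 vertices, a byte each is enough - under 400 bytes of index data per meshlet, which is why the extension even provides writePackedPrimitiveIndices4x8NV() to emit four 8-bit indices per write.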

IMHO, there are better chances that AMD would implement 'meshlet' support in Navi with their general-purpose compute units, and this would make 'meshlet' structures a higher tier. Even Vega's NGG work scheduler should have offered improved efficiency for smaller workloads, if you look at the AMD Vega 10 whitepaper cited above (p.6) - which should have improved further in Navi/RDNA:

… Previous hardware mapped quite closely to the standard Direct3D rendering pipeline, with several stages including input assembly, vertex shading, hull shading, tessellation, domain shading, and geometry shading. Given the wide variety of rendering technologies now being implemented by developers, however, including all of these stages isn’t always the most efficient way of doing things. Each stage has various restrictions on inputs and outputs that may have been necessary for earlier GPU designs, but such restrictions aren’t always needed on today’s more flexible hardware.

“Vega’s” new primitive shader support allows some parts of the geometry processing pipeline to be combined and replaced with a new, highly efficient shader type. These flexible, general-purpose shaders can be launched very quickly, enabling more than four times the peak primitive cull rate per clock cycle.

… Primitive shaders can operate on a variety of different geometric primitives, including individual vertices, polygons, and patch surfaces. When tessellation is enabled, a surface shader is generated to process patches and control points before the surface is tessellated, and the resulting polygons are sent to the primitive shader. In this case, the surface shader combines the vertex shading and hull shading stages of the Direct3D graphics pipeline, while the primitive shader replaces the domain shading and geometry shading stages.
… Another innovation of “Vega’s” NGG is improved load balancing across multiple geometry engines. An intelligent workload distributor (IWD) continually adjusts pipeline settings based on the characteristics of the draw calls it receives in order to maximize utilization.

One factor that can cause geometry engines to idle is context switching. Context switches occur whenever the engine changes from one render state to another, such as when changing from a draw call for one object to that of a different object with different material properties. The amount of data associated with render states can be quite large, and GPU processing can stall if it runs out of available context storage. The IWD seeks to avoid this performance overhead by avoiding context switches whenever possible.

Some draw calls also include many small instances (i.e., they render many similar versions of a simple object). If an instance does not include enough primitives to fill a wavefront of 64 threads, then it cannot take full advantage of the GPU’s parallel processing capability, and some proportion of the GPU's capacity goes unused. The IWD can mitigate this effect by packing multiple small instances into a single wavefront, providing a substantial boost to utilization.
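
To put a number on that last point (my arithmetic, not the whitepaper's): an instance of, say, 20 primitives would leave a 64-thread wavefront at 20/64 ≈ 31% occupancy, while packing three such instances into one wavefront brings it to 60/64 ≈ 94%.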
 
Yes, I've read that in the Navi thread. Sure, their current mode of operation is to make the 'primitive shader' concept work in the existing geometry pipeline - but the big question is why AMD never released the NGG path as vendor-specific Vulkan/OpenGL extensions... they are very tight-lipped about it, so it's hard to make an informed guess about implementation details.
If it suffered from hardware issues or a lack of performance benefits, I could see the project losing out to other priorities. It's not clear at this point whether Navi introduced something significantly different to the implementation that might have conflicted with whatever would have come out for Vega.
As for the primitive shader concept, there are patents that describe taking vertex shaders and using register dependence analysis to determine which operations use and depend on position data, hoisting those operations ahead of the rest of the shader, and adding code to cull primitives before involving other portions of the shading process.
One of the AMD PR statements about cancelling primitive shaders was that they wanted to focus on more generally applicable and similar techniques already in use--games using triangle-sieve compute shaders that come from analyzing the vertex shaders, taking the position calculations out, and running them in a compute shader that generates the triangle stream for the rest of the vertex processing.

Well, I'm not really sure if the mesh-based pipeline could be modified to accept traditional primitives, and likewise whether the primitive shader pipeline could accept 'meshlet' data, so that they could be tiered on top of each other.
Task and mesh shaders are described as having programmer-defined input and output, and that they can be made to take the same inputs to provide the same end result. Nvidia has a similar selling point to AMD for mesh shaders where it's low-effort to get from a standard vertex shader to a mesh shader variant. For certain specific input and parameter combinations, Nvidia admits that there can be an efficiency gain from the tessellation stage even if an equivalent task/mesh shader formulation can generate the same vertex stream. Outside the specific patterns that align very well with the fixed-function block, it's supposedly more frequently beneficial to go with the more flexible path.
To a certain extent, the mesh shader can have a similar addition of late-stage culling logic, although it may not be a win for Nvidia unless the geometry is generating a substantial amount of attribute work per primitive.

For AMD's primitive shaders, what was discussed with any detail was the version of primitive shaders that took existing API shaders and generated versions with hoisted position calculations and culling. Perhaps if a mesh shader became an API type a primitive shader variant could be generated--although as noted above Nvidia doesn't consider that outside the definition of a mesh shader already.
A task shader is less likely to fall in the primitive shader's domain. It's too far upstream from where AMD's primitive shader is defined.
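
For reference, the upstream stage in NVidia's GLSL extension looks roughly like this - my sketch (buffer layout and culling test are made up and simplified) of a task shader that tests per-meshlet bounding spheres and only launches mesh workgroups for the survivors:

Code:
#version 450
#extension GL_NV_mesh_shader : require
layout(local_size_x = 32) in;

// Hypothetical per-meshlet bounds: xyz = center, w = radius.
layout(std430, binding = 0) readonly buffer Bounds { vec4 sphere[]; };
layout(binding = 1) uniform Cull { vec4 frustumPlane[6]; };

// Payload handed to the mesh workgroups launched by this task.
out taskNV TaskData { uint meshletIds[32]; } td;

shared uint survivorCount;

void main()
{
    if (gl_LocalInvocationID.x == 0) survivorCount = 0;
    barrier();

    // One thread tests one meshlet against the frustum.
    uint id = gl_GlobalInvocationID.x;
    vec4 s = sphere[id];
    bool visible = true;
    for (int p = 0; p < 6; ++p)
        visible = visible && (dot(frustumPlane[p].xyz, s.xyz) + frustumPlane[p].w > -s.w);

    // Compact the surviving meshlet IDs into the payload.
    if (visible)
        td.meshletIds[atomicAdd(survivorCount, 1)] = id;
    barrier();

    // Launch one mesh workgroup per surviving meshlet; each reads its
    // meshlet ID back from td (declared 'in taskNV' on the mesh side).
    if (gl_LocalInvocationID.x == 0)
        gl_TaskCountNV = survivorCount;
}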

That said, 'meshlet' is just a collection/list of (potentially unconnected) triangles/vertices, so it's not much different at the low level. The principal differences are their maximum size of 126 primitives, which allows massively parallel processing, and potential compatibility with 'higher level' game assets typically implemented with hierarchical LOD meshes, which could be directly consumed and manipulated by graphics hardware - while in the current paradigm, LOD meshes have to be converted to standard lists of primitives using CPU cycles.

IMHO, there are better chances that AMD would implement 'meshlet' support in Navi with their general-purpose compute units, and this would make 'meshlet' structures a higher tier. Even Vega's NGG work scheduler should have offered improved efficiency for smaller workloads, if you look at the AMD Vega 10 whitepaper cited above (p.6) - which should have improved further in Navi/RDNA:
There's a potential difference in meaning from "higher-tier" versus "higher-level". The definition given for a meshlet as a collection of vertices of arbitrary relationship to each other would require the shader to deal directly with lower-level details about the primitives and topology. A shader with a set of API-defined types and topologies would allow software to assign a type and assume the driver/hardware will handle the details below that level of abstraction.

As for what the Vega whitepaper stated about NGG and primitive shaders, merged shader stages were put into the drivers when primitive shaders were not. Many of the features listed for Vega seemed to be able to function without primitive shaders.
Some things like the DSBR might have suffered if the limited bin context size could be overwhelmed by a large amount of easily-culled geometry.
The IWD's handling of context rolls is also something that may not have had much to do with primitive shaders, since that covers re-ordering shader launches to avoid making certain state changes to the graphics pipeline in places where only a few contexts may be permitted concurrently. That would be orthogonal to the number of primitives using a specific context.
The instance handling deals with geometry instancing, rather than how it's possible that a primitive shader can significantly reduce the amount of geometry. An instance is an object known further upstream and at a higher level of abstraction than the vertices in the stream emitted by a primitive shader.
 
Stage merging/reordering was made possible by 'unified shaders' back in the R600/R700 era, but NGG also claimed better multithreading, where the 4 geometry units would process 'more than 17' (?!!) primitives per cycle instead of only 4. And that probably did not work as expected in real-world scenarios.

Factors such as cache bandwidth between fixed function triangle processing blocks and general purpose processing units could have been the limit, even if inefficient scheduling of small batches was not. Navi should have improved upon this with larger caches and narrower SIMD wavefronts with better locality, which could have finally enabled the benefits of the 'automatic' path.


'Meshlet' may be either a 'higher' tier or a 'lower tier' from the data model point of view, but hierarchical LOD meshes are certainly a 'higher' level to me, just like the whole 'scene graph' concept that was never really possible to implement on hardware until very recently (as in BVH trees for DirectX Raytracing).


As for the native path, it probably was not offering any sizeable improvement over 'automatic' shader management for their current use.

I understand this is based on a PS4 'triangle sieve' technique where the vertex shader is compiled and executed twice - first through the compute shader 'stage', only to compute the position attributes and then discard invisible primitives (back-facing triangles) from the draw call, and then as the 'real' full vertex shader 'stage' on the remaining primitives. That explains the references to 'general-purpose' and 'compute-like' execution and the figure of '17 or more' primitives from 4 geometry processing blocks.
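
A minimal sketch of that kind of sieve as a GLSL compute shader (the buffer layout and the clip-space back-face test are illustrative only; a production version like Frostbite's also does small-triangle and frustum tests, and the sign convention depends on winding/projection):

Code:
#version 450
layout(local_size_x = 64) in;  // one triangle per thread; one GCN wavefront per workgroup

layout(std430, binding = 0) readonly  buffer InIndices  { uint indexIn[]; };
layout(std430, binding = 1) readonly  buffer Positions  { vec4 position[]; };
layout(std430, binding = 2) writeonly buffer OutIndices { uint indexOut[]; };
layout(std430, binding = 3) buffer Counter { uint survivingTriangles; };
layout(binding = 4) uniform Params { mat4 viewProj; uint triangleCount; };

void main()
{
    uint tri = gl_GlobalInvocationID.x;
    if (tri >= triangleCount) return;

    uint i0 = indexIn[tri * 3 + 0];
    uint i1 = indexIn[tri * 3 + 1];
    uint i2 = indexIn[tri * 3 + 2];

    // Run only the position-producing part of the vertex shader.
    vec4 p0 = viewProj * position[i0];
    vec4 p1 = viewProj * position[i1];
    vec4 p2 = viewProj * position[i2];

    // Back-face test in homogeneous clip space: the determinant's sign
    // gives the triangle's winding after projection (valid for w > 0).
    if (determinant(mat3(p0.xyw, p1.xyw, p2.xyw)) <= 0.0)
        return;  // culled - never reaches the full vertex shader

    // Append survivors; this compacted index buffer feeds the real draw.
    uint slot = atomicAdd(survivingTriangles, 1);
    indexOut[slot * 3 + 0] = i0;
    indexOut[slot * 3 + 1] = i1;
    indexOut[slot * 3 + 2] = i2;
}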

If AMD is able to analyze code dependencies to discard instructions that do not contribute to the final coordinate output, they don't even need the programmer's input to make this 'automatic' shader very efficient to run.


However, AMD also said in the Vega 10 whitepaper that there could be other usage scenarios beyond primitive culling - though they have not pursued these possibilities so far:
Primitive shaders have many potential uses beyond high-performance geometry culling. Shadow-map rendering is another ubiquitous process in modern engines that could benefit greatly from the reduced processing overhead of primitive shaders. We can envision even more uses for this technology in the future, including deferred vertex attribute computation, multi-view/multi-resolution rendering, depth pre-passes, particle systems, and full-scene graph processing and traversal on the GPU.

Hopefully they're working to implement a 'mesh' / 'primitive' shader path as a standard graphics API for Direct3D/Vulkan - if not for Navi 10, then for 'RDNA2' / 'Next-gen Navi' part.
 
Stage merging/reordering was made possible by 'unified shaders' back in the R600/R700 era, but NGG also claimed better multithreading, where the 4 geometry units would process 'more than 17' (?!!) primitives per cycle instead of only 4. And that probably did not work as expected in real-world scenarios.
The whitepaper gives the following statement:
"The “Vega” 10 GPU includes four geometry engines which would normally be limited to a maximum throughput of four primitives per clock, but this limit increases to more than 17 primitives per clock when primitive shaders are employed."
The key phrase is likely "when primitive shaders are employed", which in this context probably means the total number of primitives processed (submitted + culled).
The compute shaders used to cull triangles in Frostbite (per a GDC16 presentation) cull one triangle per thread in a wavefront, which executes 16-wide per clock. There would be more than one clock per triangle, but this peak may assume certain shortcuts for known formats, trivial culling cases, and/or more than one primitive shader instantiation culling triangles.
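
As a rough sanity check on that figure (my arithmetic, not AMD's): the fixed-function path already delivers 4 primitives per clock, and a SIMD stepping through culling wavefronts at 16 lanes per clock can, at peak, retire up to 16 culled triangles per clock (given enough waves in flight to hide the shader's other instructions) - so 4 + 16 ≈ 20 already exceeds 17 with a single SIMD dedicated to culling.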

Factors such as cache bandwidth between fixed function triangle processing blocks and general purpose processing units could have been the limit, even if inefficient scheduling of small batches was not. Navi should have improved upon this with larger caches and narrower SIMD wavefronts with better locality, which could have finally enabled the benefits of the 'automatic' path.
Going by how culling shaders work elsewhere, there's also the probability that a given wavefront doesn't find 100% of its triangles culled, the load bandwidth of the CUs running the primitive shaders is finite, and there's likely a serial execution component that increases with the more complex shader.
The merged shader stages don't eliminate the stages so much as they create a single shader with two segments that use LDS to transfer intermediate results.
Navi does do more to increase single-threaded execution, doubles per-CU L1 bandwidth, and the shared LDS may allow for a workgroup to split the merged shader work across two CUs.


I understand this is based on a PS4 'triangle sieve' technique where the vertex shader is compiled and executed twice - first through the compute shader 'stage', only to compute the position attributes and then discard invisible primitives (back-facing triangles) from the draw call, and then as the 'real' full vertex shader 'stage' on the remaining primitives. That explains the references to 'general-purpose' and 'compute-like' execution and the figure of '17 or more' primitives from 4 geometry processing blocks.
It does have significant conceptual similarities with compute shader culling and primitive shaders. Versus compute shaders, the PS4's method has microcoded hooks that allow this shader to pass triangles to the main vertex shader via a buffer that can fit in the L2, which a separate compute shader pass cannot manage. That said, Mark Cerny even stated that this was optional, since it would require profiling to determine if it was a net win.
AMD's primitive shader would have done more by combining the culling into a single shader invocation that had tighter links to the fixed-function and on-die paths, versus the long-latency and less predictable L2. The reduced overhead and tighter latency would in theory have made culling more effective, since the front end tends to be significantly less latency-tolerant. However, it's redundant work if fewer triangles need culling, and the straight-line performance of the CU and increased occupancy of the shader might have injected marginally more latency in a place where it is more problematic than in later pixel stages.
It's also possible that there were hazards or bugs in whatever hooks the primitive shader would have had to interact with the shader engine's hardware. There are mostly-scrubbed hints at certain message types and a few instructions that mention primitive shaders or removing primitives from the shader engine's FIFOs, which might have been part of the scheme. However, fiddling with those in other situations has run into unforgiving latency tolerances or possible synchronization problems.

If AMD is able to analyze code dependencies to discard instructions that do not contribute to final coordinate output, they don't even need programmer's input in making this 'automatic' shader very efficient to run.
That was AMD's marketing for Vega, right up until they suddenly couldn't deliver. Existing culling methods show that it frequently isn't difficult to do (PS4, Frostbite, etc.), but for reasons not given it wasn't done for Vega.
Navi supposedly has the option available, but what GFX10 does hasn't been well-documented.

However, AMD also said in the Vega 10 whitepaper there could be other usage scenarios beyond primitive culling - though they did not pursue these possibilities so far:
It's fine in theory, and some of the descriptions of the overheads avoided with deferred attributes came up a few times with AMD and with Nvidia's mesh shaders.
I've categorized all those as some kind of primitive shader 2.0+ variant, and haven't given them as much thought given AMD hadn't gotten to 1.0 (still not sure what version Navi's would be).
 
Is there a technical explanation for why mesh shaders require new hardware? They seem to be essentially compute shaders that take anything as input and spit out triangles as output.

Are there new hardware buffers or data structures required to make it all work?
 
Humus did a small mesh shader demo.
http://www.humus.name/index.php?page=3D&ID=93

Is there a technical explanation for why mesh shaders require new hardware? They seem to be essentially compute shaders that take anything as input and spit out triangles as output.

Are there new hardware buffers or data structures required to make it all work?
In one presentation they mentioned that mesh shaders are basically compute with a path to rasterization.

Could be that there aren't many changes to the hardware beyond that.
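
That matches how the GLSL extension is laid out. A sketch for illustration (two separate shader files shown in one block): the threading model is identical to compute, and the only structural difference is that the mesh shader's output goes to rasterizer built-ins instead of a memory buffer:

Code:
// ---- minimal.comp : plain compute shader -------------------------
#version 450
layout(local_size_x = 32) in;
layout(std430, binding = 0) writeonly buffer Out { vec4 results[]; };

void main() { results[gl_GlobalInvocationID.x] = vec4(0.0); }

// ---- minimal.mesh : GL_NV_mesh_shader ----------------------------
#version 450
#extension GL_NV_mesh_shader : require
layout(local_size_x = 32) in;                 // same workgroup model
layout(triangles, max_vertices = 3, max_primitives = 1) out;

void main()
{
    // Output lands in rasterizer-bound built-ins rather than an SSBO.
    if (gl_LocalInvocationID.x == 0) {
        gl_MeshVerticesNV[0].gl_Position = vec4( 0.0,  0.5, 0.0, 1.0);
        gl_MeshVerticesNV[1].gl_Position = vec4(-0.5, -0.5, 0.0, 1.0);
        gl_MeshVerticesNV[2].gl_Position = vec4( 0.5, -0.5, 0.0, 1.0);
        gl_PrimitiveIndicesNV[0] = 0u;
        gl_PrimitiveIndicesNV[1] = 1u;
        gl_PrimitiveIndicesNV[2] = 2u;
        gl_PrimitiveCountNV = 1u;
    }
}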
 