- Mesh shaders can do much of the object-level culling and LOD selection work currently done on the CPU
My reading of the method is that decisions about cluster and LOD handling would occur in the task shader, which would then control how many mesh shader invocations are created and what they work on.
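A minimal sketch of that decision point, in plain C++ rather than an actual task shader (the names, struct layout, and LOD heuristic here are illustrative assumptions, not Nvidia's API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int   kMaxLods  = 4;        // assumed LOD chain length
constexpr float kFarPlane = 1000.0f;  // assumed cull distance

struct Cluster {
    float    center[3];                // bounding-sphere center
    float    radius;                   // bounding-sphere radius
    uint32_t meshletCount[kMaxLods];   // meshlets stored per LOD level
};

// Returns how many mesh-shader workgroups to launch for this cluster
// (0 means the whole cluster is culled before any mesh work starts).
uint32_t taskShaderDecision(const Cluster& c, const float camPos[3],
                            float lodScale) {
    float dx = c.center[0] - camPos[0];
    float dy = c.center[1] - camPos[1];
    float dz = c.center[2] - camPos[2];
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);

    if (dist - c.radius > kFarPlane)
        return 0;  // cluster entirely beyond the far plane: cull it

    // Coarser LOD as distance grows relative to the cluster's size.
    int lod = std::min(kMaxLods - 1,
                       static_cast<int>(dist / (c.radius * lodScale)));
    return c.meshletCount[lod];  // one mesh workgroup per meshlet
}
```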
Per-primitive culling is something generally recommended to be left to the fixed-function hardware, which has also been improved since Pascal. The video showed a few scenarios where the conventional and mesh pipelines were rather close in performance, highlighting the balancing act of culling: if not enough work is saved later in the process, the obligatory up-front overhead isn't justified. While the new path is still a significant improvement overall, the presentation had an interesting, almost awkward, tone about how the continued improvement of the fixed-function hardware made the new shaders less impressive in certain places.
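For a sense of what per-primitive culling entails when done in a shader, here is the kind of back-face test involved, sketched in C++ (the rasterizer normally does this for free; the function and types are illustrative):

```cpp
struct Vec4 { float x, y, z, w; };  // clip-space position

// Back-face test via the signed area of the projected triangle;
// this per-triangle math is the work being weighed against simply
// letting the fixed-function hardware reject the triangle later.
bool isBackFacing(const Vec4& a, const Vec4& b, const Vec4& c) {
    // Perspective divide to NDC (assumes w > 0, i.e. in front of camera).
    float ax = a.x / a.w, ay = a.y / a.w;
    float bx = b.x / b.w, by = b.y / b.w;
    float cx = c.x / c.w, cy = c.y / c.w;
    // Twice the signed area; the sign gives the winding.
    float area2 = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
    return area2 <= 0.0f;  // assumes counter-clockwise front faces
}
```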
- This stuff could all be done using compute shaders, but a lot of memory allocation and caching optimizations are taken care of by Nvidia if you use the mesh pipeline
At least for the compute methods discussed in the presentation, running them at the front of the overall pipeline means the data can stay on-chip, avoiding the overhead of launching large compute batches. I presume there's cache thrashing, and pipeline events that must write back to memory, if the generating shader runs in a prior compute pass.
What I don’t get is: how does this help with having lots more objects?
The base pipeline always has a primitive distributor that runs sequentially through a single index buffer, creates a de-duplicated in-pipeline vertex batch, and builds a list of in-pipeline primitives that refer back to the batch entries they use.
When much of the workload does not change from frame to frame, this one serial unit redoes the same work every frame, bound by its straight-line processing speed.
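As a rough illustration of what that serial walk amounts to, here is a C++ sketch (the types, names, and batch limits are illustrative assumptions, not the hardware's actual encoding):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Batch {
    std::vector<uint32_t> uniqueVertices;  // de-duplicated vertex indices
    std::vector<uint8_t>  localTris;       // 3 batch-local refs per triangle
};

// One sequential pass over the index buffer, repeated every frame for
// every draw, even when the geometry hasn't changed.
// (uint8_t slots cap the batch at 256 unique vertices in this sketch.)
Batch buildBatch(const uint32_t* indices, size_t triCount, size_t maxVerts) {
    Batch b;
    std::unordered_map<uint32_t, uint8_t> slot;  // global index -> local slot
    for (size_t t = 0; t < triCount; ++t) {
        const uint32_t tri[3] = {indices[3 * t], indices[3 * t + 1],
                                 indices[3 * t + 2]};
        size_t newSlots = 0;  // slots this triangle would add
        for (uint32_t idx : tri)
            if (!slot.count(idx)) ++newSlots;
        if (b.uniqueVertices.size() + newSlots > maxVerts)
            break;  // batch full; the distributor would start a new one
        for (uint32_t idx : tri) {
            auto it = slot.find(idx);
            if (it == slot.end()) {
                it = slot.emplace(idx, static_cast<uint8_t>(
                                           b.uniqueVertices.size())).first;
                b.uniqueVertices.push_back(idx);
            }
            b.localTris.push_back(it->second);
        }
    }
    return b;
}
```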
The meshlets effectively take those in-pipeline vertex and primitive lists and package them for reuse. Individual mesh shaders can then read from multiple mesh contexts, increasing throughput. On top of that, they are permitted to use topologies that reuse vertices more aggressively than the traditional triangle strip, and can be made to reduce the amount of attribute data passed around, further compressing the bandwidth needs of the process. The overall path also has a threshold relative to multi-draw calls: if an instance is small enough to fit in the shader context, instances can be processed in parallel, rather than iterated over serially at the top of the pipeline as a multi-draw indirect command does.
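In struct form, a meshlet is essentially that batch made persistent. The 64-vertex/126-triangle limits below match the sizes Nvidia's Turing material describes; the exact layout is an assumption for illustration:

```cpp
#include <cstdint>

// Built once offline (or at load time) and reused every frame thereafter,
// instead of being rebuilt by the serial distributor on each draw.
struct Meshlet {
    uint32_t vertices[64];        // indices into the mesh's vertex buffer
    uint8_t  triangles[126 * 3];  // meshlet-local vertex refs per triangle
    uint8_t  vertexCount;         // how many of the 64 slots are used
    uint8_t  triangleCount;       // how many of the 126 triangles are used
};
```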
In combination with the task shader, the overall pipeline can distribute work more concurrently, align primitive processing more effectively with the SIMD architecture, leverage the existing on-chip management, and leave open programmable methods for compression and culling.
What specifically happens for the more dynamic parts of the workload, which do not benefit as much from reuse, wasn't covered in the video, though it was noted that these methods are considered an adjunct to the existing, still-improving traditional pipeline. More variable work may not "compress", and the existing tessellation path is still more efficient for the specific patterns it suits.
There are some parallels with what AMD has described for its geometry handling. Both methods have merged shader stages, and there's a similar split into two parts at a juncture where there is the potential for data amplification, roughly where the tessellation block is, or where a task shader begins to farm out work to child mesh shaders.
However, the two differ in which parts of the traditional pipeline are kept as-is and which have functionality replaced.
Culling at a cluster or mesh level happens earlier in the Nvidia pipeline, whereas the mesh and primitive shaders have a clearer affinity with the vertex shaders.
One approach emphasizes per-primitive culling and dynamic context management for batching, injected into a more traditional hardware flow, while the other makes a more significant break exposed to the software, with more of the management incorporated into the code. The various batch sizes and primitive counts exist in one form or another for both, but the motivations and the points at which they are exposed differ, along with whether they are transparently handled suggestions or structure definitions given to the shaders.
The more explicit break with task and mesh shaders does seem to offer a wide range of options, though it would be less transparent to developers (or would be, if primitive shaders had done what was initially promised).