What's the difference versus Nvidia's Turing? Are primitive shaders now always on? And what's the difference versus Nvidia's mesh shaders?
The API shaders at the top of the table are mapped to internal shader stages executed by the hardware. I haven't seen a similar listing for what is done internally by Nvidia.
The shaders that are compiled as primitive shaders are flagged as such, so the option to compile them normally still exists. The automatic primitive shader concept AMD first discussed focused on culling: the automatic path used dependence analysis to extract the position-related operations from a vertex or other shader and place them in an early culling phase ahead of the rest of the shader. If for some reason the compiler could not separate the position calculations from the rest of the shader, the shader wouldn't be compiled as a primitive shader. Shader code that mixes usage of position and attribute data, or a mode like transparency that prevents much of the culling from working, could be reasons for the compiler to skip the primitive shader path rather than do redundant work.
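To make the idea concrete, here's a minimal sketch (my own illustration, not AMD's compiler output) of the split: run only the position math and a cheap cull test first, and perform the attribute work only for triangles that survive.

```python
# Hypothetical sketch of the "early culling phase" idea: run only the
# position-related subset of a vertex shader, cull triangles, and skip
# attribute work for anything culled. Names and math are illustrative.

def position_phase(vertex, mvp):
    # Only the part of the shader that feeds the clip-space position
    # (a 4x4 model-view-projection multiply here).
    x, y, z = vertex
    return tuple(
        mvp[r][0] * x + mvp[r][1] * y + mvp[r][2] * z + mvp[r][3]
        for r in range(4)
    )

def backface_culled(p0, p1, p2):
    # Signed area of the projected triangle; <= 0 means back-facing
    # under counter-clockwise winding.
    ax, ay = p0[0] / p0[3], p0[1] / p0[3]
    bx, by = p1[0] / p1[3], p1[1] / p1[3]
    cx, cy = p2[0] / p2[3], p2[1] / p2[3]
    area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    return area <= 0.0

def shade_triangle(tri, mvp, attribute_phase):
    # Full attribute work (interpolants, lighting inputs) runs only if
    # the triangle survives the early cull.
    clip = [position_phase(v, mvp) for v in tri]
    if backface_culled(*clip):
        return None
    return [attribute_phase(v) for v in tri]
```

The compiler's dependence analysis is what decides whether `position_phase` can be cleanly separated from `attribute_phase`; if the two are entangled, the split above isn't possible and the shader compiles the normal way.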
It's not clear if this new iteration of NGG has added features versus the concepts introduced in Vega.
If it's similar, then there are some differences from Nvidia's task and mesh shaders.
Nvidia's path is explicitly separate from the standard geometry pipeline with tessellation and other shaders, with the general argument that outside of certain cases they are more effective. Mesh shading is heavily focused on getting good throughput and efficiency by optimizing the representation and reuse of static geometry. Task shaders can perform a level of decision making and advance culling by being able to vary things like what LOD model the mesh shaders will use, or how many mesh shaders will be launched. There's a more arbitrary set of primitive types that can be fed into that pipeline, and the process exposes a more direct way to control what threads are launched to provide the necessary processing.
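A rough sketch of the kind of decision a task shader makes (this is conceptual, not Nvidia's API; the LOD thresholds and meshlet counts are made up): pick an LOD per object and decide how many mesh-shader workgroups to launch, where launching zero culls the object before any per-vertex work happens.

```python
# Illustrative task-shader-style decision logic. One mesh workgroup is
# assumed to process one meshlet; all numbers are hypothetical.

LOD_MESHLET_COUNTS = [256, 64, 16]  # meshlets per LOD level (made up)

def select_lod(distance):
    # Coarser model the farther the object is from the camera.
    if distance < 10.0:
        return 0
    if distance < 50.0:
        return 1
    return 2

def task_shader(object_distance, visible):
    # Returns the number of mesh workgroups to launch; zero culls the
    # whole object up front.
    if not visible:
        return 0
    return LOD_MESHLET_COUNTS[select_lod(object_distance)]
```

This is the "more direct control over what threads are launched" point: the launch count itself is a program output, rather than something implied by a fixed-function stage.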
Primitive shaders exist within the standard geometry pipeline, which includes vertex, tessellation, and geometry shaders, so the programmer isn't required to balance between two pipeline types. There's no mention of the sort of reuse or optimization of static geometry, which points to more work being done every frame despite much of it not changing.
The decision-making of the shaders is more limited, since they are different ways of expressing the standard shader types. They can do the same things more efficiently or with more culling, rather than do different things like change which model is used or explicitly control the amount of thread expansion. The primitive types used appear to be the standard formats rather than a more generalized set of inputs.
That doesn't rule out some overlap or changes going forward. Presentations on task and mesh shading mention the possibility of adding culling elements to mesh shaders similar in concept to what AMD proposes, and AMD has alluded to possible future uses of primitive shaders that might allow for more complex behavior. Possibly, the more generalized GS stage hints at things becoming more flexible in terms of what kind of data is passed through the pipeline and how it is mapped to threads in the shader array.
TEXTURE PROCESSOR BASED RAY TRACING ACCELERATION METHOD AND SYSTEM
United States Patent Application 20190197761
http://www.freepatentsonline.com/20190197761.pdf
This seems consistent with the BVH texture instructions mentioned in an earlier LLVM commit. I've speculated about which facets of pre-Navi architectures best mapped to the BVH traversal process, and it seemed like the process would benefit from hardware with its own independent scheduling and data paths that don't work in lock-step with the SIMD execution path.
At the time I wondered if either the shared scalar path or texturing path could be evolved to handle this, and each had certain features that might be useful depending on the level of programmability or raw performance.
The texturing path already does a lot of automatic generation of memory accesses and internal scheduling for more complex filtering types, and already handles a memory-dominated workload.
The scalar path was shared hardware in past GPUs, is associated with sequencing decisions for a thread, and has its own data path. However, it was more heavily used and needed more execution capability at the time, and with Navi it has become less separate.
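For reference, the inner step the hardware would be accelerating is simple but memory-dominated: fetch a BVH node and test the ray against its bounding box. A minimal sketch of that test (the standard ray/AABB slab test; this is just the math, not the patent's hardware scheme):

```python
# One step of BVH traversal: a ray/AABB "slab" test against a node's
# bounding box. In a real traversal this runs after fetching the node
# from memory, which is why address generation and scheduling dominate.

def ray_aabb_hit(origin, inv_dir, box_min, box_max):
    # Intersect the ray with each axis-aligned slab and check that the
    # parameter intervals overlap. inv_dir holds 1/direction per axis.
    tmin, tmax = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0 = (lo - o) * inv
        t1 = (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        tmin = max(tmin, t0)
        tmax = min(tmax, t1)
    return tmin <= tmax
```

Per ray and per node the arithmetic is tiny; the cost is the dependent, divergent memory accesses to node data, which is the workload shape the texture path already deals with for filtering.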