Nvidia Turing Architecture [2018]

Put some ARM cores on the GPU and get rid of driver and PCIe overhead :)
I dread what's gonna happen once 7nm GPUs are released. We already have a CPU limitation even @1440p. If the 3080Ti is released at the end of 2019, that CPU bottleneck will creep into 4K too! With the stagnation of CPUs expected to continue for the foreseeable future, extensive multi-core support seems the only way forward at this point.
 
What we're seeing happening now is that the scheduling is getting moved off the CPU and onto the GPU. It's been possible for a couple of generations now for full scheduling and resource management to happen on the GPU in the CUDA world, and we already have ExecuteIndirect in DirectX allowing the GPU to issue its own draw calls from shader code. Now we have task/mesh shaders, which combine to form another way for the GPU to feed itself.
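To give a flavour of what "the GPU feeding itself" already looks like on the CUDA side, here's a minimal dynamic-parallelism sketch; the kernel names and work sizes are made up, and it's only meant to illustrate a kernel launching its own child work without a CPU round trip:

```cuda
// Minimal dynamic-parallelism sketch (needs sm_35+ and -rdc=true): a kernel
// decides on the GPU how much child work to launch, with no CPU round trip.
// Kernel names and work sizes are invented for illustration.
#include <cstdio>

__global__ void drawBatch(int batchId)
{
    // Stand-in for the per-batch work a "draw" would do.
    printf("batch %d, thread %d\n", batchId, (int)threadIdx.x);
}

__global__ void scheduler(const int* workCounts, int numBatches)
{
    // One thread per batch decides whether, and how wide, to launch child work.
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < numBatches && workCounts[b] > 0)
        drawBatch<<<1, workCounts[b]>>>(b);
}

int main()
{
    int h_counts[4] = {32, 0, 64, 16};
    int* d_counts = nullptr;
    cudaMalloc(&d_counts, sizeof(h_counts));
    cudaMemcpy(d_counts, h_counts, sizeof(h_counts), cudaMemcpyHostToDevice);

    scheduler<<<1, 4>>>(d_counts, 4);
    cudaDeviceSynchronize();
    cudaFree(d_counts);
    return 0;
}
```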

We just need device-side malloc/free and memcpy to come to the graphics APIs. As I mentioned, this already exists in CUDA, and I suspect the holdup in getting it across to the graphics APIs has more to do with trying to shoehorn it into a very complicated system that expects all memory to be managed host-side than with hardware limitations. I'm not sure what the status is on AMD/HIP (which doesn't work on Windows in any case), though I don't believe there is any such capability on Intel/OpenCL.
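For reference, this is roughly what the CUDA capability looks like today; a minimal sketch, with arbitrary sizes and heap limit:

```cuda
// Minimal device-heap sketch: each thread mallocs, fills, memcpys and frees its
// own scratch buffer entirely on the device, with no host involvement.
#include <cstdio>

__global__ void deviceHeapDemo(int n)
{
    int* a = (int*)malloc(n * sizeof(int));
    int* b = (int*)malloc(n * sizeof(int));
    if (a && b) {
        for (int i = 0; i < n; ++i)
            a[i] = (int)threadIdx.x + i;
        memcpy(b, a, n * sizeof(int));        // device-side memcpy
        if (threadIdx.x == 0)
            printf("b[n-1] = %d\n", b[n - 1]);
    }
    if (a) free(a);
    if (b) free(b);
}

int main()
{
    // The device heap is small by default (8 MB); enlarge it before the first launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32u * 1024 * 1024);
    deviceHeapDemo<<<1, 64>>>(16);
    cudaDeviceSynchronize();
    return 0;
}
```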
 
What is going to be the functionality of this scheduling moving forward though? Are we talking about continued modified DX11 support or DX12/Vulkan w/async compute?
 
We're talking about abandoning DX11.

An easier programming interface would probably be appreciated by non-high-performance programmers. But with raytracing, and Vulkan being merged with OpenCL, that seems a lot easier even than DX11, at least at whatever point it becomes common enough to use for the mass market.

Regardless, if AMD needs to get anything from Turing (other than non-compute-shader-fallback DXR support), it's Mesh and Task shaders. I'll still complain that games should move to a different, non-polygonal model altogether for the art pipeline: subdivision surfaces, or voxels, or... well, something that gets rid of stuff like UV mapping and all the other junk making art slow and expensive today. There's got to be some clever model out there that could be fast to render, low memory, and very easy to manipulate, etc. But being realistic, that's going to happen too slowly for Mesh and Task Shaders not to be useful for quite a while, and depending on which "new" model gets adopted, they could well be useful afterwards too.
 
An easier programming interface would probably be appreciated by non-high-performance programmers.

That's not a problem for the driver and low-level API makers, but for language developers, who need no affiliation with the hardware companies. Lack of inertia, capitalization, stability, independence?
This is such a complicated topic ... GPUs should have been programmable co-processors for a while now; they are not because ... man, this is a really difficult topic. Maybe it's because GPU makers are more worried and concerned with quality of service (runnable hardware, drivers, etc.) and competition at any moment in time than with programmers' wrath. ;)

Regardless, if AMD needs to get anything from Turing (other than non-compute-shader-fallback DXR support), it's Mesh and Task shaders.

Why? Has it passed the test of time already? Maybe it's too complicated to program for. Maybe primitive shaders are the better generalization.

I'll still complain that games should move to a different, non-polygonal model altogether for the art pipeline: subdivision surfaces, or voxels, or... well, something that gets rid of stuff like UV mapping and all the other junk making art slow and expensive today. There's got to be some clever model out there that could be fast to render, low memory, and very easy to manipulate, etc. But being realistic, that's going to happen too slowly for Mesh and Task Shaders not to be useful for quite a while, and depending on which "new" model gets adopted, they could well be useful afterwards too.

Content creation and display have very little to do with each other. Improve the translation from one domain to the other and your problems go away. It hopefully will, in time.
 
Why? Has it passed the test of time already? Maybe it's too complicated to program for. Maybe primitive shaders are the better generalization.

Well, one of the nice features about mesh/task shaders is that they're already exposed in Vulkan and OpenGL extensions, so you can use them today. No such luck with AMD's primitive shaders.

Anyway, having read everything I can find about both technologies, I still only have a vague idea of what primitive shaders actually do and where they live in the pipeline, other than that they somehow replace the vertex and geometry shaders.

Mesh and task shaders are well documented here: https://devblogs.nvidia.com/introduction-turing-mesh-shaders/

They appear to be nothing less than direct control of the geometry pipeline through something similar to compute shaders. This could be very useful for generating geometry with algorithms that don't map well to the traditional pipeline: things like REYES, volumetrics, subdivision surfaces, and even optimizations like various forms of culling and level of detail. The important point is that the mesh shader can directly feed triangle setup without a round trip to memory, so there's a sizable bandwidth saving over just generating triangles in a compute shader. Also, the fact that the task shader can recursively generate tasks is a major win in certain cases, such as level of detail when you don't want the overhead of instanced rendering. Give the task shader raw object data and let it select levels of detail and dispatch to mesh shaders, and perhaps even do camera-dependent culling.
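Not actual task-shader code, but in compute terms the "let the task stage select LOD and cull" idea amounts to something like this (the structs, thresholds, and three-LOD layout are all invented for the sketch):

```cuda
// Compute-style sketch of the task-shader idea above: per-object cull and LOD
// selection on the GPU, emitting a compact list of meshlet ranges for
// mesh-shader style work to consume.
#include <math.h>

struct Object {
    float3   center;            // bounding-sphere center
    float    radius;            // bounding-sphere radius
    unsigned meshletOffset[3];  // meshlet range per LOD
    unsigned meshletCount[3];
};

struct MeshletRange { unsigned first, count; };

__global__ void taskStage(const Object* objs, int numObjs, float3 camPos,
                          MeshletRange* outRanges, unsigned* outCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numObjs) return;

    float dx = objs[i].center.x - camPos.x;
    float dy = objs[i].center.y - camPos.y;
    float dz = objs[i].center.z - camPos.z;
    float dist = sqrtf(dx * dx + dy * dy + dz * dz) - objs[i].radius;

    if (dist > 500.0f) return;                             // cull: emit nothing

    int lod = dist < 50.0f ? 0 : (dist < 200.0f ? 1 : 2);  // crude distance LOD

    unsigned slot = atomicAdd(outCount, 1u);               // compact surviving work
    MeshletRange r = { objs[i].meshletOffset[lod], objs[i].meshletCount[lod] };
    outRanges[slot] = r;
}
```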

The biggest disadvantage I see to mesh/task shaders is that they don't seem to have access to the tessellation hardware. Otherwise, they're a close-to-the-metal replacement for the entire geometry pipeline. The old geometry pipeline is a rather poor abstraction of modern hardware anyway.

As for how mesh shaders compare to primitive shaders, it's anyone's guess. They might operate almost identically. Of course until AMD releases documentation and API support, the question is academic.
 
I think that, relative to the traditional pipeline and something like the NGG pipeline with primitive shaders, the task/mesh path in some parts encompasses a wider scope while excluding others.
It is a more programmable and exposed method that can reach further upstream in terms of using objects as inputs rather than an index buffer, but it is also considered an adjunct to the existing path for things like tessellation. Also, some of the biggest improvements stem from pre-computation of meshlets and their benefit for static geometry. What the calculus is for the non-static component may be different, and the full set of features a developer might want to use in DX12 may involve running these paths concurrently.
Primitive shaders do add programmability, but there seems to be a stronger preference for insinuating them into the existing pipeline, which may constrain them somewhat relative to the broader swath of inputs and options available to a pipeline that doesn't try to remain compatible with the existing primitive setup path and tessellation behaviors. This does mean the difference from the prior form of geometry setup is less, and there's still one path for everything. Potentially, the originally offered form of them would have enabled higher performance with very little developer involvement.

Both schemes hew more to using generalized compute as their model, so it seems possible that one could be programmed to arrive at similar results to the other. Off-hand, there's a brief reference to perhaps using a culling phase in the mesh shader description, and the Vega whitepaper posits some future uses of primitive shaders that can lead to similar deferred attribute calculations and may allow for precomputing some of the workload.
There is a difference in terms of emphasis, and I am less clear on how fully primitive shaders change the mapping of the workload to threads relative to how it is documented for mesh shaders. There is some merging of shader stages, though the management may differ in terms of what the hardware pipeline manages versus the shader compiler, and how much lives in the local data share versus caches, etc. At least initially, primitive shaders are more concerned with culling, and they do this culling more in the period where mesh shaders would run. Task shader cluster culling is the more discussed path for Nvidia, and there is more discussion of what to do when, even despite culling, there are a lot of triangles.
The difference is marked enough that Nvidia has more faith in the culling or rasterization capabilities of its hardware in the post-cluster culling set of mesh shaders, whereas primitive shaders seem to have the cycle penalty of even one extra triangle reaching the rasterizer much closer to the forefront of their marketing.

The NGG pipeline also introduces another level of batching complexity at the end of the pipeline which task/mesh shaders do not. That may not have directly interfered with the design decisions of the earlier stages, but it may have introduced considerations for later raster and depth limitations into the optimal behaviors of the primitive shaders, which task/mesh may not be as strongly bound by.
 
I suspect that simple backface culling might be a whole lot more efficient at the meshlet level. Instead of testing every single triangle, for each meshlet store a bounding box for the geometry as well as a "bounding box" for the triangle normals. Then you can compute "all back-facing" vs. "maybe some front-facing" for an entire meshlet at once. If we assume that the content authors have some halfway decent software to cluster triangles, I can easily imagine 30% of meshlets culled as back-facing. Moreover, this should be more efficient than trying to do it automagically in drivers and hardware, since the developers will have a clear picture of how their culling system works, and thus how to optimally order and cluster triangles, rather than guessing at a black box.
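A sketch of that per-meshlet test, using a normal cone (axis plus cos of the half-angle) instead of a literal bounding box of normals; the struct layout is invented, and the test is the usual conservative cluster-cone formulation:

```cuda
// Per-meshlet backface test: each meshlet stores a bounding sphere plus a cone
// bounding its triangle normals. One comparison rejects the whole cluster.
#include <math.h>

struct MeshletBounds {
    float3 center;      // bounding sphere of the meshlet's vertices
    float  radius;
    float3 coneAxis;    // unit axis bounding the triangle normals
    float  coneCutoff;  // cos of the cone half-angle
};

__device__ bool meshletBackfacing(const MeshletBounds& m, float3 camPos)
{
    // Vector from camera to meshlet; if even the most camera-facing normal in
    // the cone still points away from the camera, every triangle is back-facing.
    float3 v = make_float3(m.center.x - camPos.x,
                           m.center.y - camPos.y,
                           m.center.z - camPos.z);
    float len = sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    float d   = v.x * m.coneAxis.x + v.y * m.coneAxis.y + v.z * m.coneAxis.z;
    // The sphere radius widens the test so borderline clusters survive.
    return d >= m.coneCutoff * len + m.radius;
}

__global__ void cullMeshlets(const MeshletBounds* meshlets, int count,
                             float3 camPos, unsigned char* visible)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        visible[i] = meshletBackfacing(meshlets[i], camPos) ? 0 : 1;
}
```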

Another important thing is that the developer now has full visibility into and control of warp and block scheduling, rather than the hardware just assigning vertices however it likes in black-box fashion. This is VERY useful for things like subdivision surfaces, where you need a lot of adjacency information.
 
Thinking about it some more, how does the hardware go from rasterizer to pixel shader scheduling? Triangles are so small now that it would be grossly inefficient to not set up pixel shader blocks on many triangles at a time. Care with meshlet shape could therefore provide the pixel shaders with better locality at block assignment time, leading to better utilization and coherency, especially considering that pixels are shaded in 2x2 quads, making them all the more vulnerable to fragmentation.
 
Well, one of the nice features about mesh/task shaders is that they're already exposed in Vulkan and OpenGL extensions, so you can use them today. No such luck with AMD's primitive shaders.

I still wouldn't say something is better in the absence of comparability, or that something has to learn from something else in the absence of visibility/knowledge.
The meshlet definition breaks compatibility with current assets, and in the least intrusive implementation you have to transcode the index buffers on the CPU. When you're already CPU-bound, that leaves a bit of a bad taste.
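For what it's worth, the least intrusive transcode is roughly this kind of CPU-side pass over an existing index buffer, using the 64-vertex / 126-triangle limits mentioned below; the Meshlet layout is illustrative, not any API's actual format:

```cuda
// Greedy CPU-side transcode of a classic index buffer into meshlets of at most
// 64 unique vertices and 126 triangles. No spatial clustering, just a scan.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Meshlet {
    std::vector<uint32_t> vertices;  // indices into the original vertex buffer
    std::vector<uint8_t>  indices;   // local triangle corners into `vertices`
};

std::vector<Meshlet> buildMeshlets(const std::vector<uint32_t>& indexBuffer,
                                   size_t maxVerts = 64, size_t maxTris = 126)
{
    std::vector<Meshlet> meshlets(1);
    std::unordered_map<uint32_t, uint8_t> remap;  // global index -> local slot

    for (size_t i = 0; i + 2 < indexBuffer.size(); i += 3) {
        // How many vertices of this triangle are new to the current meshlet?
        size_t newVerts = 0;
        for (int k = 0; k < 3; ++k)
            newVerts += (remap.count(indexBuffer[i + k]) == 0);

        // Start a new meshlet if either limit would be exceeded.
        if (meshlets.back().vertices.size() + newVerts > maxVerts ||
            meshlets.back().indices.size() / 3 + 1 > maxTris) {
            meshlets.emplace_back();
            remap.clear();
        }

        Meshlet& cur = meshlets.back();
        for (int k = 0; k < 3; ++k) {
            uint32_t v = indexBuffer[i + k];
            auto it = remap.find(v);
            if (it == remap.end()) {
                it = remap.emplace(v, (uint8_t)cur.vertices.size()).first;
                cur.vertices.push_back(v);
            }
            cur.indices.push_back(it->second);
        }
    }
    return meshlets;
}
```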

AMD has had a compute front-end for vertex shaders for a while now (ES), and it has had no IA, so vertex buffer reads have been software fetches for a while as well. The tessellator has had a spill path since forever, so mesh amplification to memory (and its performance results) is old hat too. Doesn't sound like they are too far away from what's being proposed. And I would believe you can do the Task/Mesh shader pipeline at an API level easily enough with GCN; maybe they just need to bump the caches to be able to pass the magic 64 vertices and 126 triangles ;) under all circumstances from stage to stage. GCN is a compute design in the first place, so being able to put wave/compute functionality into vertex shaders is a given (UAVs in vertex shaders, etc.).

So, let's say, optimistically, that Task/Mesh shaders are trivial for AMD, or rather just an API choice. Now, are Primitive Shaders just an API choice? Is there something more to them? I don't know.

Personally, I would hope we get some post-rasterizer scheduling capability; maybe we could sort, eliminate, and amplify pixels in there.
 
Another use for task/mesh shaders might be for deferred shading. Cluster pixels by material, then feed them to mesh shaders which render directly to a UAV rather than actually doing mesh processing. This could lead to lower divergence. Might be *really* tricky to figure out an algorithm for the clustering, though.
 
They have: AMD's Tootle
Look into section 4: Fast linear clustering. You can pass maximum triangle limits for cluster generation in the API.

Seems ironic.

That's exactly the sort of software I was talking about ;-)

Note that being able to put this through a mesh shader lets you guarantee that the clusters remain coherent. Imagine what would happen if for some reason you had a stride misalignment between your clusters and your vertex shader block assignment!
 
Mesh Shader Possibilities
September 29, 2018
The appealing thing about this model is how data-driven and freeform it is. The mesh shader pipeline has very relaxed expectations about the shape of your data and the kinds of things you're going to do. Everything's up to the programmer: you can pull the vertex and index data from buffers, generate them algorithmically, or any combination.

At the same time, the mesh shader model sidesteps the issues that hampered geometry shaders, by explicitly embracing SIMD execution (in the form of the compute “work group” abstraction). Instead of each shader thread generating geometry on its own—which leads to divergence, and large input/output data sizes—we have the whole work group outputting a meshlet cooperatively. This means we can use compute-style tricks, like: first do some work on the vertices in parallel, then have a barrier, then work on the triangles in parallel. It also means the input/output bandwidth needs are a lot more reasonable. And, because meshlets are indexed triangle lists, they don’t break vertex reuse, as geometry shaders often did.
....
It’s great that mesh shaders can subsume our current geometry tasks, and in some cases make them more efficient. But mesh shaders also open up possibilities for new kinds of geometry processing that wouldn’t have been feasible on the GPU before, or would have required expensive compute pre-passes storing data out to memory and then reading it back in through the traditional geometry pipeline.

With our meshes already in meshlet form, we can do finer-grained culling at the meshlet level, and even at the triangle level within each meshlet. With task shaders, we can potentially do mesh LOD selection on the GPU, and if we want to get fancy we could even try dynamically packing together very small draws (from coarse LODs) to get better meshlet utilization.

In place of tile-based forward lighting, or as an extension to it, it might be useful to cull lights (and projected decals, etc.) per meshlet, assuming there’s a good way to pass the variable-size light list from a mesh shader down to the fragment shader. (This suggestion from Seb Aaltonen.)
http://www.reedbeta.com/blog/mesh-shader-possibilities/
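The "vertices in parallel, barrier, triangles in parallel" pattern described there maps directly onto ordinary compute. A rough CUDA analogue, with an invented buffer layout, a stand-in degenerate-triangle test, and assuming at most 64 vertices per meshlet:

```cuda
// One block builds one "meshlet": vertices first, a barrier, then triangles,
// so triangle threads can reuse vertices produced by other threads instead of
// each thread duplicating geometry work.
#include <math.h>

__global__ void cooperativeMeshlet(const float3* positions, const uint3* tris,
                                   int numVerts, int numTris,
                                   uint3* outTris, unsigned* outCount)
{
    __shared__ float3   verts[64];  // cooperative per-meshlet vertex scratch
    __shared__ unsigned kept;

    if (threadIdx.x == 0) kept = 0;

    // Phase 1: all threads process vertices in parallel (stand-in transform).
    for (int i = threadIdx.x; i < numVerts; i += blockDim.x)
        verts[i] = positions[i];

    __syncthreads();  // triangles below read vertices written by other threads

    // Phase 2: all threads process triangles in parallel, dropping zero-area ones.
    for (int t = threadIdx.x; t < numTris; t += blockDim.x) {
        uint3 tri = tris[t];
        float3 a = verts[tri.x], b = verts[tri.y], c = verts[tri.z];
        float area2 = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
        if (fabsf(area2) > 1e-6f) {
            unsigned slot = atomicAdd(&kept, 1u);
            outTris[slot] = tri;
        }
    }

    __syncthreads();
    if (threadIdx.x == 0) *outCount = kept;
}
```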
 
There are also some edge cases that pretty much mean the outputs of vertex shaders always have to be stored back to memory. Consider what happens if you're happily going along through vertices and come across an index pointing to a vertex that you shaded so long ago that the output no longer lives in the cache. In the old days, you could just say "to hell with it" and put the vertex through the shader a second time, but now vertex shaders are allowed to have side effects, meaning that they *must* be run exactly once. Thus everything must be saved for the duration of the draw call. If you assume a mesh with a couple hundred thousand triangles, you're talking about pushing quite a few vertex outputs off chip.
 