Nvidia Turing Architecture [2018]

Interesting thing: the primitive shader (PS) patent states that when tessellation is off, there is also no surface shader, and the PS acts directly after the input assembler.

https://patents.google.com/patent/US20180082399A1/en
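A minimal sketch of that reading, in plain C++ (stage names are illustrative, not taken from the patent text):

#include <cstdio>
#include <vector>

// Hedged reading of the patent: with tessellation enabled, a surface shader
// runs before the tessellator; with it disabled, there is no surface shader
// and the primitive shader follows the input assembler directly.
enum class Stage { InputAssembler, SurfaceShader, Tessellator, PrimitiveShader };

std::vector<Stage> buildGeometryPipe(bool tessellationOn) {
    if (tessellationOn)
        return {Stage::InputAssembler, Stage::SurfaceShader,
                Stage::Tessellator, Stage::PrimitiveShader};
    return {Stage::InputAssembler, Stage::PrimitiveShader};
}

int main() {
    printf("stages with tessellation on: %zu, off: %zu\n",
           buildGeometryPipe(true).size(), buildGeometryPipe(false).size());
    return 0;
}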

Also, Mike Mantor stated that AMD can do a lot of culling before vertices form a polygon.

“In a chipset we’ve been building for a while now, we have a vertex process that runs and it could either be a domain shader or a vertex shader, could be a vertex process on the output of a geometry shader that’s doing amplification or decimation. And at that point when you finally have the final position of the vertices of the triangle, is one point where we can always find out whether or not the triangle is inside of the frustum, back-faced, or too small to hit. From frustum testing, there’s a mathematical way to figure out whether or not a vertex is inside of the view frustum. If any one of the vertices are inside of the view frustum, then we’ll know that the triangle can potentially create pixels. To do a back-faced culling perspective, you can find two edges or with three vertices you can find one edge and a second edge, and then you can take a cross-product of that and determine the facedness of the triangle. You can then product that with the eye-ray, and if it’s a positive result, it’s facing the direction of the view, and if it’s negative it’s a back-faced triangle and you don’t need to do it. […] State data goes in to whether or not you can opportunistically throw a triangle away. You can be rendering something where you can actually fly inside of an object, see the interior of it, and then when you come outside, you can see outside-in – in those instances, you can’t do back-faced culling.”

https://www.gamersnexus.net/guides/3010-primitive-discarding-in-vega-with-mike-mantor
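For what it's worth, the back-face part of that quote maps to a few lines of vector math. A minimal sketch in plain C++ (the sign convention for the eye ray is an assumption, and as Mantor notes, render state can forbid the cull entirely):

#include <cstdio>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y * b.z - a.z * b.y,
                                             a.z * b.x - a.x * b.z,
                                             a.x * b.y - a.y * b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Two edges from the three vertices, a cross product for the "facedness",
// then a dot product with the eye ray: positive faces the view, negative
// is back-faced and can be dropped (unless state says back sides are visible).
bool isBackFacing(Vec3 v0, Vec3 v1, Vec3 v2, Vec3 eyeRay) {
    Vec3 normal = cross(sub(v1, v0), sub(v2, v0));
    return dot(normal, eyeRay) < 0.0f;
}

int main() {
    Vec3 eye = {0, 0, -1}; // viewer looking down -z; convention assumed
    printf("back-facing: %d\n", isBackFacing({0,0,0}, {1,0,0}, {0,1,0}, eye));
    return 0;
}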
 
AMD's Vega whitepaper summarized the surface shader as a shader type generated when tessellation was enabled. It encapsulates the initial vertex shader and hull shader stages, while the primitive shader is adjusted to encapsulate the domain and geometry shader stages. The surface shader is associated with the fixed-function tessellation stage, and it is this optional stage that governs whether the pipeline is split at a point of work amplification.

Task shaders are optional, albeit not necessarily tied to tessellation-like behavior. They can spawn extra child mesh shader workgroups, but this isn't required. The subdivision point for the pipeline then becomes a less straightforward one than the tessellation hardware presents, as it may not be a standard expansion pattern, since the usual tessellation unit is not there.
The task shader is indicated as being able to make object- and inter-mesh-level decisions in terms of selecting meshlets, LOD, and meshlet culling. Surface shaders are described as taking the same sorts of inputs as the vertex+hull shader stage, and presenting the usual output. It falls upon the later primitive shader stage to cull as effectively as it can from the output of the tessellation unit. The broader set of inputs/outputs and upstream decision making of the task shader were not mentioned.

Mesh shaders are indicated as being more closely aligned with vertex shaders, requiring some preprocessing instructions. This is similar to primitive shaders' being generated from vertex shader code. Mesh shaders may have some level of culling, but the general tenor is that if the task shader has performed cluster culling sufficiently, it is more effective to let the mesh run through to the rasterizer and beyond rather than pile up more up-front work in the shader path (one subset involving heavy attribute usage mooted a PS-like prospect). Primitive shaders have limited detail as to all the transformations they perform, but a conservative triangle sieve is the most straightforward description of the initial offering. How exactly this process is performed or how it would look in the shader code was not described.
The mesh shader has an element of precomputation for meshlets in static geometry, which is something the descriptions of primitive shaders do not mention.

At this point, I think I am repeating my suspicion that these are methods with some evolutionary similarities derived from facing similar conditions in the same location in the pipeline, but with different decisions as to the scope and integration with their individual standard pipelines. Saying one element is the same as the other may gloss over the areas where their priorities and judgement calls differ.
 
AMD's Vega whitepaper summarized the surface shader as a shader type generated when tessellation was enabled. It encapsulates the initial vertex shader and hull shader stages, while the primitive shader is adjusted to encapsulate the domain and geometry shader stages. The surface shader is associated with the fixed-function tessellation stage, and it is this optional stage that governs whether the pipeline is split at a point of work amplification.

Ok, some thoughts. AMD has no "vertex shader", it's a compute shader with "interpolator outputs" (or not, just use UAVs instead), which can easily mean spill-to-LDS. The TaskShader for AMD is the wavefront kicker (probably the ACE/HWS), it takes the number of instances to create from the draw-command and instances them, that's it. The next [programmable] stage gets the lane number and then does everything manually and drops it somewhere inside the chip/CU.
The magic is vertex connectivity, but from the software perspective that's just some cleverly coded, likely implicit (from ordering) information. So, the hull-shader is only really producing new instances programmatically albeit constrained, and allows you to specify the instance-seed-values yourself, also possibly through LDS. If that really is fixed function in GCN ... who knows. So, tessellator is also only TaskShader.
Now comes the real difference. It appears that in the MeshShader you have to write the connectivity yourself, while under the old API model the connectivity was implicit (the recursive subdivision rule). That means you now have to store that information for sure. But then, the DomainShader is just the suffix-code of a MeshShader, positioning the connected vertices.
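To make the implicit/explicit point concrete, here is a toy C++ sketch of connectivity for a uniformly subdivided triangle: under the old model these indices follow from the subdivision rule and never need to be written anywhere, while a MeshShader would have to emit exactly such a list into its output (the layout below is one plausible ordering, not any particular hardware's):

#include <cstdio>
#include <vector>

struct Tri { int a, b, c; };

// Vertices of an n-level barycentric grid laid out row by row; the
// connectivity falls out of the ordering alone, i.e. it is implicit.
std::vector<Tri> implicitConnectivity(int n) {
    auto idx = [](int row, int col) { return row * (row + 1) / 2 + col; };
    std::vector<Tri> tris;
    for (int row = 0; row < n; ++row)
        for (int col = 0; col <= row; ++col) {
            tris.push_back({idx(row, col), idx(row + 1, col), idx(row + 1, col + 1)});
            if (col < row) // interior, flipped triangle
                tris.push_back({idx(row, col), idx(row + 1, col + 1), idx(row, col + 1)});
        }
    return tris;
}

int main() {
    for (const Tri& t : implicitConnectivity(2)) // 4 triangles at level 2
        printf("%d %d %d\n", t.a, t.b, t.c);
    return 0;
}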

I don't see how all of this wasn't just minor different conventions to obscure the scheduling part, having it implicit instead of explicit for the sake of ... not sure. Maybe it was Nvidia itself when DX11 was specified, because the DX9 tessellator extension from AMD was completely free of this stuff: the subdivision level was simply a multiplier for the instance generator, and you got context-free triangles + barycentrics to do whatever you wanted in the vertex shader (kind of like a triangle shader, not unlike the DomainShader).
The only real difference I see is that now you control amplification without implicit data generation, and with explicit connectivity. And it seems to allow recursive amplification? Well, the tessellator was also specified as recursive amplification. That's just a straightforward generalization of the heterogeneous multiple amplification points in the DX11 API.

Task shaders are optional, albeit not necessarily tied to tessellation-like behavior. They can spawn extra child mesh shader workgroups, but this isn't required. The subdivision point for the pipeline then becomes a less straightforward one than the tessellation hardware presents, as it may not be a standard expansion pattern, since the usual tessellation unit is not there.

Which tessellation hardware? What would you reasonably expect to be dedicated hardware? Try to imagine the most basic generalized building blocks needed to implement it. You don't want overly specialized silicon, and I've not seen a tessellation block on a die shot (there might be one, but it's not big and complicated and ugly). The principal problem is data routing, and that's probably the only FF part of it: put data from somewhere to somewhere else, 1:N style.
Remember AMD suggests using the DomainShader to cull triangles in the 1:1 amplification case, basically using the stage as a decimator instead. How is that a sensible suggestion if that stage is fixed-function?

The task shader is indicated as being able to make object- and inter-mesh-level decisions in terms of selecting meshlets, LOD, and meshlet culling. Surface shaders are described as taking the same sorts of inputs as the vertex+hull shader stage, and presenting the usual output. It falls upon the later primitive shader stage to cull as effectively as it can from the output of the tessellation unit. The broader set of inputs/outputs and upstream decision making of the task shader were not mentioned.

Yeah, but that's all just creative imagination. "You could possibly." If something is free of implicitness I can do all kinds of creative things; that's not a checkbox for MeshShader, it's a checkbox for ThinkOutOfTheMuchLargerBox.

Mesh shaders are indicated as being more closely aligned with vertex shaders, requiring some preprocessing instructions. This is similar to primitive shaders' being generated from vertex shader code. Mesh shaders may have some level of culling, but the general tenor is that if the task shader has performed cluster culling sufficiently, it is more effective to let the mesh run through to the rasterizer and beyond rather than pile up more up-front work in the shader path (one subset involving heavy attribute usage mooted a PS-like prospect). Primitive shaders have limited detail as to all the transformations they perform, but a conservative triangle sieve is the most straightforward description of the initial offering. How exactly this process is performed or how it would look in the shader code was not described.
The mesh shader has an element of precomputation for meshlets in static geometry, which is something the descriptions of primitive shaders do not mention.

That's just the same limitation as the factor of 64 for tessellation. Seems like coincidence? Don't think so. I assume <= 64 uses the previously employed internal storage and above that it spills to memory, just like geometry shader stream-out. They don't say you can't do it, they say it performs predictably up to 64/126.

At this point, I think I am repeating my suspicion that these are methods with some evolutionary similarities derived from facing similar conditions in the same location in the pipeline, but with different decisions as to the scope and integration with their individual standard pipelines. Saying one element is the same as the other may gloss over the areas where their priorities and judgement calls differ.

It's one thing to say one is different from the other because of hardware differences (and _not_ performance differences), and another to say they're different from each other because of ecosystemic points of view.
Nvidia dropped this thing as an extension to DirectX into NVAPI. It's the way Nvidia is going forward; they like, and are able, to spam extensions. The last time AMD wanted to communicate its vision, it led to an entire alternative data-management paradigm (and API) instead of shoehorning it into DX. That shoehorning came from MS, and it's ... huh ... a bit ugly honestly.
 
Ok, some thoughts. AMD has no "vertex shader", it's a compute shader with "interpolator outputs" (or not, just use UAVs instead), which can easily mean spill-to-LDS.
There is an internal stage in the geometry pipeline that performs the function. Whether that stage maps directly to an API shader didn't seem to have a bearing on the comparison between the different geometry pipelines, just which external shader type seems most closely aligned with it. For primitive shaders, their description and the patents on their automatic generation point to their reliance on vertex shaders as a source, which has some commonality with mesh shaders.
Surface shaders have more limited disclosure, and were described solely in terms of the VS+HS stage they are analogous to, per the Vega whitepaper. The task shader as described does not bear much similarity to either.

The TaskShader for AMD is the wavefront kicker (probably the ACE/HWS), it takes the number of instances to create from the draw-command and instances them, that's it.
I have lost track of whether graphics wavefront launch can use the ACE dispatch controllers, but even so, from this description the "task shader" in this scenario is missing several functions of the Nvidia-described task shader. There's been no description of a way to get the ACE or dispatch controller to have the sort of access or user-level programmability to evaluate geometry or LOD, or to run arbitrary shader code itself.
This leads to the question of how this variation of a task shader can redirect the next stage in the pipeline to different index/primitive buffers. The Vega whitepaper has this process after input assembly or the tessellation stage, which is further downstream and more local to an object than a list of objects submitted by the graphics command stream.

So, the hull-shader is only really producing new instances programmatically albeit constrained, and allows you to specify the instance-seed-values yourself, also possibly through LDS. If that really is fixed function in GCN ... who knows. So, tessellator is also only TaskShader.
I haven't seen a description of AMD doing fully away with the tessellation unit. Nvidia's task shader is described as being able to perform a more arbitrary level of amplification, although it is described as not being as efficient at matching the standardized patterns of the tessellation path. As another difference, Nvidia suggested that it could be optimal to run task and mesh shaders alongside tessellation shaders, which as a matter of definition is not a distinction for surface+primitive shaders.

I don't see how all of this wasn't just minor different conventions to obscure the scheduling part, having it implicit instead of explicit for the sake of ... not sure.
I would think the choice of implicit vs. explicit is not so easily glossed over. Explicit creates fewer bounds on what each stage can do or what data it can communicate. Implicit behaviors need to maintain a smaller and more consistent set of behaviors and inputs/outputs. At least for the initial primitive shader offering, it was not going too far from the index and primitive formats already defined.

Which tessellation hardware? What would you reasonably expect to be dedicated hardware?
This is a comparison between the task/mesh path and the alternatives--either the standard pipeline or surface/primitive shaders. The task and mesh shaders would not have dedicated hardware, although the juncture that would serve as a discontinuity in the number of work items is situated at roughly the location where the tessellation stage would have been. Task shaders could be enabled with or without activity similar to subdivision or amplification, and have other actions they can perform. The surface/primitive shader path would have surface shaders enabled if the tessellation stage was enabled, and would not if it wasn't.

Yeah, but that's all just creative imagination. "You could possibly." If something is free of implicitness I can do all kinds of creative things; that's not a checkbox for MeshShader, it's a checkbox for ThinkOutOfTheMuchLargerBox.
I'm afraid I'm not sure which item this is in reference to. The descriptions of the task shader's decision-making and object processing were disclosed in Nvidia's announcement. The surface shader's encapsulation of the vertex and hull shader stages is from the Vega whitepaper. Neither of the items I discussed mentioned the mesh shader stage.
 
There are also some edge cases that would pretty much mean that the outputs of vertex shaders always have to be stored back to memory. Consider what happens if you're happily going along through vertices and come across an index pointing to a vertex that you shaded so long ago that the output no longer lives in the cache. In the old days, you could just go "hell with it" and put the vertex through the shader a second time, but now vertex shaders are allowed to have side effects, meaning that they *must* be run exactly once. Thus everything must be saved for the duration of the draw call. If you assume a mesh with a couple hundred thousand triangles, you're talking about pushing quite a few vertex outputs off chip.
There's nothing in any API that says vertices can only be shaded once. No IHV has perfect reuse across a thousand triangles, let alone a couple hundred thousand. There was an HPG 2018 paper that showed Intel had the largest vertex reuse batching at 128.
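For intuition, a toy post-transform cache model shows why reuse is bounded (plain C++; the FIFO policy and size are assumptions for illustration, real hardware batches differently):

#include <cstdio>
#include <deque>
#include <vector>

// Counts how many vertices actually get shaded given a finite reuse window.
// A vertex indexed again after falling out of the window is simply re-shaded.
int countShadedVertices(const std::vector<int>& indices, size_t cacheSize) {
    std::deque<int> cache; // FIFO of recently shaded vertex ids
    int shaded = 0;
    for (int v : indices) {
        bool hit = false;
        for (int c : cache) if (c == v) { hit = true; break; }
        if (!hit) {
            ++shaded;
            cache.push_back(v);
            if (cache.size() > cacheSize) cache.pop_front();
        }
    }
    return shaded;
}

int main() {
    std::vector<int> indices = {0, 1, 2, 2, 1, 3}; // two triangles sharing an edge
    printf("shaded %d of %zu indices\n", countShadedVertices(indices, 32), indices.size());
    return 0;
}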
 
There's nothing in any API that says vertices can only be shaded once. No IHV has perfect reuse across a thousand triangles, let alone a couple hundred thousand. There was an HPG 2018 paper that showed Intel had the largest vertex reuse batching at 128.

How do you deal with UAV access then? It strikes me as a very big problem if you can't tell how many times a given piece of code that modifies global state will be run...
 
How do you deal with UAV access then? It strikes me as a very big problem if you can't tell how many times a given piece of code that modifies global state will be run...
I'm not sure what's done in that case, though vertex reuse could get turned off. I know Nvidia changes their behavior when using UAVs. For example, at one point I saw they stop binning when UAVs are used. That might have been UAVs in the PS.
 
How do you deal with UAV access then? It strikes me as a very big problem if you can't tell how many times a given piece of code that modifies global state will be run...
The app just has to deal with it that this isn't known.
From ARB_shader_image_load_store:
Shader Memory Access Ordering

The order in which texture or buffer object memory is read or written by
shaders is largely undefined. For some shader types (vertex, tessellation
evaluation, and in some cases, fragment), even the number of shader
invocations that might perform loads and stores is undefined.
In particular, the following rules apply:

* While a vertex or tessellation evaluation shader will be executed at
least once for each unique vertex specified by the application (vertex
shaders) or generated by the tessellation primitive generator
(tessellation evaluation shaders), it may be executed more than once
for implementation-dependent reasons. Additionally, if the same
vertex is specified multiple times in a collection of primitives
(e.g., repeating an index in DrawElements), the vertex shader might be
run only once.
...
And further:
(12) Should image loads and stores be allowed for all shader types?

RESOLVED: Yes, it seems useful.

Note that some shader types pose specific implementation complexities
(e.g., reuse of vertices in vertex shaders, number of fragment shader
invocations in multisample modes, relative order of execution within and
between shader groups). We have explicitly specified several cases where
the invocation count and execution order are undefined. While these
cases may be a problem for some algorithms, we expect that many
algorithms will not be adversely impacted.
...
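In practice that pushes the app toward stores that are safe to repeat. A toy C++ analogy (names invented for illustration): appending through an atomic counter double-counts when an invocation is re-run, while a store keyed by the vertex id is idempotent and survives re-execution.

#include <atomic>
#include <cstdio>
#include <vector>

std::atomic<int> counter{0};
std::vector<int> appendBuf(8, -1), indexedBuf(8, -1);

// Stand-in for a vertex shader with UAV-style side effects.
void shadeVertex(int vertexId) {
    appendBuf[counter++] = vertexId;   // append pattern: re-runs duplicate
    indexedBuf[vertexId] = vertexId;   // keyed pattern: re-runs are harmless
}

int main() {
    // Suppose the implementation shades vertex 1 twice, as the spec allows.
    for (int v : {0, 1, 2, 1}) shadeVertex(v);
    printf("append entries: %d (only 3 unique vertices)\n", counter.load());
    return 0;
}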
 
Agreeing about the differences @3dilettante explained.

That's the same limitation as the factor of 64 for tessellation. Seems like coincidence? Don't think so. I assume <= 64 uses the previously employed internal storage and above that it spills to memory, just like geometry shader stream-out. They don't say you can't do it, they say it performs predictably up to 64/126.

The 64 has nothing to do with tessellation factors nor spilling. We mentioned that the sum of outputs should be <= 16 KB for each mesh workgroup (actually, we enforce that in the spec).

Tessellation evaluation also operates on 32 vertices; you can use gl_WarpID and gl_SMID to color triangles and see how the hardware distributes work (like I showed in the life-of-a-triangle blog post).

Why 64/126? Well, I played with various CAD datasets and those numbers worked well. Actually, 64/84 is more realistic for vertex reuse in many datasets.
Like I mentioned in the talk, it's a trade-off between higher on-chip storage (less occupancy) and potentially higher vertex reuse. You would not want to blow through 16 KB unless you have a very good reason...
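As a back-of-envelope on that budget (the 16 KB and 64/126 figures are from above; the per-vertex layout is a made-up example):

#include <cstdio>

int main() {
    const int budgetBytes   = 16 * 1024; // stated per-workgroup output limit
    const int maxVertices   = 64;
    const int maxPrimitives = 126;

    // Hypothetical layout: vec4 position + vec3 normal + vec2 uv, 4 B each.
    const int vertexBytes = (4 + 3 + 2) * 4;   // 36 bytes per vertex
    const int indexBytes  = maxPrimitives * 3; // local triangle indices
    const int total       = maxVertices * vertexBytes + indexBytes;

    printf("%d B of %d B used; budget per vertex: %d B\n",
           total, budgetBytes, budgetBytes / maxVertices);
    return 0;
}

At 16 KB over 64 vertices you have 256 bytes of output per vertex before counting indices, which is why heavy attribute usage eats into the occupancy trade-off quickly.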

I'd also think 64 is probably what people from consoles are used to.
 
I'm not sure what's done in that case, though vertex reuse could get turned off. I know Nvidia changes their behavior when using UAVs. For example, at one point I saw they stop binning when UAVs are used. That might have been UAVs in the PS.
Wanted to correct myself. Nvidia did not stop binning when using UAVs. My memory was faulty.
 
Discussion on value based on MSRP is irrelevant, and I really hope you don't actually expect that price.

What did I miss? Aren't the Turing cards being sold at MSRP at the moment?
 
GeForce Experience 3.15 Release adds DLSS support. Now all we need are the games.
October 15, 2018
  • We’ve added support for GeForce RTX graphics cards so you can optimize your gaming rig with Game Ready Drivers and Optimal Playable Settings as well as capture content using Ansel, Freestyle, and Highlights. Additionally, GeForce Experience also now supports RTX technologies such as Deep Learning Super-Sampling (DLSS).

https://www.geforce.com/geforce-experience/download
 
Not sure if anybody noticed, but tensor TFLOPS are reduced by half on GeForce RTX versus Quadro RTX.
That is for mixed-precision FP16/FP32, which is the mode that matters for neural-network training. (The two GeForce numbers below are at reference and Founders Edition boost clocks.)

Peak FP16 Tensor TFLOPS with FP32 Accumulate
RTX 2080 Ti | 53.8 / 56.9
Quadro RTX 6000 | 130.5

Only FP16 with FP16 accumulate runs at full speed, but that is useless for NN training AFAIK.
Peak FP16 Tensor TFLOPS with FP16 Accumulate
RTX 2080 Ti | 107.6 / 113.8
Quadro RTX 6000 | 130.5
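For what it's worth, those figures line up with a simple throughput model. A hedged sketch (tensor-core counts and boost clocks are my assumptions from public specs: 544 cores at 1545/1635 MHz reference/FE for the 2080 Ti, 576 cores at 1770 MHz for the Quadro RTX 6000, 64 FP16 FMAs per core per clock):

#include <cstdio>

// accRatio: 1.0 for full-rate accumulate, 0.5 where FP32 accumulate runs
// at half rate (the GeForce parts, per the table above).
double tensorTflops(int tensorCores, double clockGHz, double accRatio) {
    const double flopsPerClock = tensorCores * 128.0; // 64 FMAs = 128 flops
    return flopsPerClock * accRatio * clockGHz / 1000.0;
}

int main() {
    printf("2080 Ti FP16 acc: %.1f / %.1f\n",
           tensorTflops(544, 1.545, 1.0), tensorTflops(544, 1.635, 1.0));
    printf("2080 Ti FP32 acc: %.1f / %.1f\n",
           tensorTflops(544, 1.545, 0.5), tensorTflops(544, 1.635, 0.5));
    printf("Quadro RTX 6000:  %.1f\n", tensorTflops(576, 1.770, 1.0));
    return 0;
}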
 
Not sure if anybody noticed, but tensor TFLOPS are reduced by half on GeForce RTX versus Quadro RTX.
That is for mixed-precision FP16/FP32, which is the mode that matters for neural-network training. (The two GeForce numbers below are at reference and Founders Edition boost clocks.)

Peak FP16 Tensor TFLOPS with FP32 Accumulate
RTX 2080 Ti | 53.8 / 56.9
Quadro RTX 6000 | 130.5

Only FP16 with FP16 accumulate runs at full speed, but that is useless for NN training AFAIK.
Peak FP16 Tensor TFLOPS with FP16 Accumulate
RTX 2080 Ti | 107.6 / 113.8
Quadro RTX 6000 | 130.5

The Quadro RTX 6000 sells for $6,300, whereas the RTX 2080 Ti sells for only $1,200. So cutting down Tensor-core performance on the RTX 2080 Ti is to be expected, really.
 
Interestingly, the CUDA Turing Compatibility Guide mentions something along these lines, but does not discriminate between Quadro and GeForce -- or rather, does not give away the Quadro's hidden feature.
"Most applications compiled for Volta should run efficiently on Turing, except if the application uses heavily the Tensor Cores, or if recompiling would allow use of new Turing-specific instructions. Volta's Tensor Core instructions can only reach half of the peak performance on Turing. Recompiling explicitly for Turing is thus recommended." [my bold]
 
Not sure if those are connected. The quoted peak performance was with the FP16 requirement. It should not auto-demote to INT8, since that is a manual optimization for a reason: it should only be applied once you have ascertained that reducing precision does not invalidate your algorithm.
 