Nvidia Turing Architecture [2018]

Hopefully someone does a good article on Primitive/Mesh/Task shaders with real-world examples. It's still not clear to me what's so terrible about the existing pipeline and what's so great about the combined stages. It's all the same code in the end, isn't it?
 
You could end up with a lot fewer draw calls using NV's or AMD's revised pipelines. But the feature has to a) work and b) be implemented by developers, AFAIU.
 
Honestly, for AMD I think it's dead for Vega at this point. Not even available for devs. So I guess we'll see for Navi, and how it compares to the nVidia solution. Pretty interesting.
 
Most of the articles on the architecture are BS, not surprisingly. "Variable rate shading"? You mean the F*ing software paper on that from recently, which has shit-all to do with hardware?
The way I read the Variable Rate Shading section of the whitepaper, Turing has hardware support for something that, in the past, was done in software. If you can specify, in hardware, for each 16x16 rectangle on the screen that the shading rate can be lowered to 1/2 or 1/4 of the pixels, then you are reducing the load on the shaders and texture units by a similar amount. (I assume that the ROPs will still be operating at the same rate?)
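As a rough back-of-the-envelope for the kind of saving involved (the 16x16 tile size and the coarse rates are from the whitepaper; the example rate map below is made up):

Code:
// Back-of-the-envelope for per-tile variable rate shading (illustrative only).
#include <cstdio>

int main() {
    const int tilesX = 3840 / 16, tilesY = 2160 / 16; // 16x16-pixel tiles at 4K
    const int pixelsPerTile = 16 * 16;

    // Hypothetical rate map: half the screen at full rate, half at 1 shade per 2x2 pixels.
    double fullRateTiles = 0.5 * tilesX * tilesY;
    double coarseTiles   = 0.5 * tilesX * tilesY;

    double invocations = fullRateTiles * pixelsPerTile        // 1 shade per pixel
                       + coarseTiles   * pixelsPerTile / 4.0; // 1 shade per 2x2 block
    double fullShading = double(tilesX) * tilesY * pixelsPerTile;

    printf("Pixel shader invocations: %.1f%% of full-rate shading\n",
           100.0 * invocations / fullShading); // ~62.5%
    return 0;
}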

Yeah, thanks for including that in a hardware paper. Why is the "white paper" filled with PR bullshit? I'm just trying to read about your computer architecture, guys; keep it out of the hands of the goddamned PR people.
I'm trying to read about computer architecture too, and my reading is clearly very different from yours. I must be biased.

But I'm sensing a pattern in your line of argumentation here:
iff <could be done in software> and <Turing does it in hardware> then <it's a bullshit feature>

First ray tracing (news flash: I ran ray tracing on my 80286), now this. Why don't you start by considering the performance aspects as well before passing judgement?

From that paper that's all I can really conclude.
Here's another conclusion from the whitepaper: the RTX cores may simply be 'an extra instruction', as you wrote earlier, but it seems pretty clear from the whitepaper that this extra instruction kicks off a LOT of hardware. I can't wait to see how Vega will pull that off with a software driver.

Fortunately Anandtech has done a proper job, and shows that CUDA cores are responsible for BVH construction.
I wonder how you come to this conclusion. Here's all I can find in the Anandtech article about BVH construction:
AnandTech said:
The other major computational cost here is that BVHs themselves aren’t free. One needs to be created for a scene from the polygons in it, so there is an additional step before ray casting can even begin. This is more a developer concern – when can they modify and reuse a BVH versus building a new one – but it’s another step in the process. Furthermore it’s an example of why developer training and efficient engine implementations are so crucial to the process, as a poor implementation can make ray tracing much too slow to be viable.
Maybe I missed something, but the whitepaper doesn't talk about BVH construction at all either.

Right now, considering talented programmers can get the same level of ray tracing performance out of a 1080 Ti as Nvidia claims for their new RTX cards, well, consider me unimpressed.
Oh boy, here we go again.
 
The article does make the claim but I don’t see anything in the white paper. Maybe AT got some info offline.
Thanks for pointing that out. I finally found the quote that I was looking for:
AnandTech said:
But otherwise, everything else is at a high level governed by the API (i.e. DXR) and the application; construction and update of the BVH is done on CUDA cores, governed by the particular IHV – in this case, NVIDIA – in their DXR implementation.
 
For me, "variable rate shading" screams "foveated rendering" for VR. Yes, you can do it in software on older hardware, but for various reasons it turns out to be pretty slow. SIMD utilization is the big one: the rasterizer still thinks it should render all the pixels, so you get divergence in all your pixel shaders, since your SIMDs get populated with the pixels you're dropping. The paper I read was reporting something like a 10-15% performance increase, but that was with them subsampling down to only 50% or less of the original pixel count! Having the fixed-function bits and pieces around the pixel shader actually designed with a variable sampling rate in mind should basically make pixel shading costs scale 1:1 with your sampling rate.
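To put a toy number on the divergence argument (this is a model, not measured data; the warp size and the 50% keep rate are just the usual assumptions): if pixels are dropped inside the shader, the rasterizer still packs every pixel into warps, so in the worst case each warp still pays the full shader cost, whereas a hardware coarse rate only launches the shading samples that are actually needed.

Code:
// Toy model: why dropping pixels in the shader doesn't scale costs 1:1 (illustrative).
#include <cstdio>

int main() {
    const int warpSize = 32;
    const long long totalPixels = 3840LL * 2160LL;
    const double keepFraction = 0.5; // software subsampling keeps 50% of the pixels

    // Software path: all pixels still occupy warp lanes; dropped lanes idle, but
    // (worst case, with kept/dropped pixels interleaved) every warp runs the full shader.
    double softwareWarps = double(totalPixels) / warpSize;

    // Hardware variable rate shading: only the coarse shading samples are launched.
    double hardwareWarps = double(totalPixels) * keepFraction / warpSize;

    printf("Software subsampling: ~%.0f%% of full shading cost\n",
           100.0 * softwareWarps * warpSize / totalPixels);  // ~100%
    printf("Hardware VRS:         ~%.0f%% of full shading cost\n",
           100.0 * hardwareWarps * warpSize / totalPixels);  // ~50%
    return 0;
}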
 
Thanks for sharing.

So far what I’ve gathered is:

- Mesh shaders ultimately spit out triangles to be rasterized into pixels to be shaded
- Mesh shaders can do much of the object level culling and LOD selection work currently done on the CPU
- This stuff could all be done using compute shaders but a lot of memory allocation and caching optimizations are taken care of by nvidia if you use the mesh pipeline
- Because of these optimizations, and offloading work from the CPU, we can now handle much more complex objects made up of many triangles by breaking them into smaller chunks of work (meshlets, sketched below)
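For concreteness, here is roughly the kind of meshlet layout Nvidia's mesh shader material describes, with the commonly cited limits of about 64 vertices and 126 triangles per meshlet; the field names and the naive builder below are only an illustrative sketch, not Nvidia's actual API.

Code:
// Illustrative meshlet layout and a naive CPU-side builder (a sketch, not Nvidia's API).
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Meshlet {
    std::vector<uint32_t> uniqueVertices; // indices into the original vertex buffer (<= 64)
    std::vector<uint8_t>  localTriangles; // 3 local indices per triangle (<= 126 triangles)
};

// Chop a plain triangle index buffer into meshlets, de-duplicating vertices per chunk
// ahead of time instead of relying on the serial primitive distributor every frame.
std::vector<Meshlet> BuildMeshlets(const std::vector<uint32_t>& indices,
                                   size_t maxVerts = 64, size_t maxTris = 126) {
    std::vector<Meshlet> meshlets(1);
    std::unordered_map<uint32_t, uint8_t> remap; // global vertex index -> local slot

    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        // Count how many new unique vertices this triangle would add to the current meshlet.
        size_t newVerts = 0;
        for (int k = 0; k < 3; ++k)
            if (!remap.count(indices[i + k])) ++newVerts;

        // Start a new meshlet if this triangle would exceed either limit.
        Meshlet& m = meshlets.back();
        if (m.uniqueVertices.size() + newVerts > maxVerts ||
            m.localTriangles.size() / 3 + 1 > maxTris) {
            meshlets.emplace_back();
            remap.clear();
        }

        Meshlet& cur = meshlets.back();
        for (int k = 0; k < 3; ++k) {
            uint32_t v = indices[i + k];
            auto it = remap.find(v);
            if (it == remap.end()) {
                it = remap.emplace(v, uint8_t(cur.uniqueVertices.size())).first;
                cur.uniqueVertices.push_back(v);
            }
            cur.localTriangles.push_back(it->second);
        }
    }
    return meshlets;
}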

What I don't get is how this helps with having lots more objects.
 
What I don't get is how this helps with having lots more objects.
I think part of it is that you completely bypass a bunch of fixed function geometry hardware that does not scale as well as the number of SMs.

If that hardware was a bottleneck before, you have now eliminated that.

Another one is that it potentially allows for more reuse of data, which reduces redundant calculations.
 
- Mesh shaders can do much of the object level culling and LOD selection work currently done on the CPU
My reading of the method is that decisions about cluster and LOD handling would occur in the task shader, which would then control how many mesh shader invocations would be created, and what they would be working on.
Per-primitive culling is something generally recommended to be left to the fixed-function hardware--which has also been improved since Pascal. The video showed a few scenarios where the conventional and mesh pipelines were rather close in terms of performance, highlighting the balancing act of managing culling in cases where not enough work is saved later in the process to justify the obligatory up-front overhead. While still a generally significant improvement, the video presentation had an interesting, almost awkward, tone about how the fixed-function hardware had kept improving and made the new shaders less impressive in certain places.
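As a rough CPU-side analogue of that split (the distance thresholds, LOD scheme, and names below are made up; on the GPU this decision would sit in the task shader, which then launches the child mesh shader workgroups):

Code:
// CPU-side analogue of a task shader's cull/LOD decision (illustrative sketch only).
#include <cmath>
#include <cstdint>

struct ObjectInfo {
    float    boundsCenter[3];
    float    boundsRadius;
    uint32_t meshletCountPerLod[3]; // meshlet counts for LOD0..LOD2 (made-up scheme)
};

// Returns how many mesh-shader workgroups (meshlets) to launch; 0 means the object is culled.
uint32_t TaskShaderDecision(const ObjectInfo& obj, const float cameraPos[3]) {
    float dx = obj.boundsCenter[0] - cameraPos[0];
    float dy = obj.boundsCenter[1] - cameraPos[1];
    float dz = obj.boundsCenter[2] - cameraPos[2];
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);

    if (dist > 1000.0f + obj.boundsRadius)                 // trivial distance cull
        return 0;                                          // (frustum/occlusion tests omitted)

    int lod = dist < 50.0f ? 0 : (dist < 200.0f ? 1 : 2);  // coarser LOD farther away
    return obj.meshletCountPerLod[lod];                    // one workgroup per meshlet
}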

- This stuff could all be done using compute shaders but a lot of memory allocation and caching optimizations are taken care of by nvidia if you use the mesh pipeline
At least for the compute methods discussed by the presentation, using them at the front of the overall pipeline means they can stay on chip and reduces the overhead introduced by the launch of large compute batches. I presume there's cache thrashing and pipeline events that must write back to memory if the generating shader is in a prior compute pass.

What I don't get is how this helps with having lots more objects.
The base pipeline always has a primitive distributor that runs sequentially through a single index buffer, creates a de-duplicated in-pipeline vertex batch, and builds a list of in-pipeline primitives that refer back to the entries from the vertex batch they use.
When much of the workload does not change from frame to frame, this one serial unit repeats a lot of the same work every frame, bound by straight-line serial processing speed.

The meshlets effectively take the in-pipeline vertex and primitive lists and package them for reuse. Individual mesh shaders can then read from multiple mesh contexts, increasing throughput. On top of that, they are permitted to use a topology that reuses vertices more than the traditional triangle strip, and can be made to reduce the amount of attribute data that is passed around. This further compresses the bandwidth needs of the process. The overall path also has a threshold versus multi-draw calls: if an instance is small enough to fit in the shader context, it allows more parallel processing of instances, versus the serial iteration a multi-draw indirect command performs at the top of the pipeline.
In combination with the task shader, the overall pipeline can distribute work more concurrently, align primitive processing more effectively with the SIMD architecture, leverage the existing on-chip management, and it leaves open programmable methods for compression and culling.
What specifically happens for the more dynamic parts of the workload that do not benefit as much from reuse wasn't in the video, though it was noted these methods were considered an adjunct to the existing, and still-improved, traditional pipeline. More variable work may not "compress", and the existing tessellation path is still more efficient for the specific patterns it works for.


There are some parallels with what AMD has described for its geometry handling. Both methods have merged shader stages, and there's a similar split into two parts at a juncture where there is the potential for data amplification, roughly where the tessellation block is, or where a task shader begins to farm out work to child mesh shaders.
However, which parts of the traditional pipeline are kept as-is or have parts of their functionality replaced differs.
Culling at a cluster or mesh level happens earlier in the Nvidia pipeline, whereas AMD's primitive shaders have a clearer affinity with the vertex shaders.
One emphasizes per-primitive culling and dynamic context management for batching, injected into a more traditional hardware flow, while the other makes a more significant break exposed to the software, with more of the management incorporated into the code. The various batch sizes and primitive counts exist in one form or another for both, but the motivations and the points at which they are exposed differ, along with whether they are transparently handled suggestions or structure definitions given to the shaders.
The more explicit break with task and mesh shaders does seem to offer a wide range of options, though it would be less transparent to developers (or it would be, had primitive shaders done what was initially promised).
 
A series of tweets from Sebi … about a slide on page 33 of the Nvidia Turing Architecture Whitepaper, discussing the Gigarays/s performance number.

This is an important slide. Scenes behind the 10 Gigarays/s number. Each scene has a single high poly mesh (no background). Primary rays = fully coherent.

4K at 60 fps = 0.5 Gigarays/s, assuming one ray/pixel. Thus RTX 2080 Ti allows you to cast 20 rays per pixel in scenes like this at 60 fps (4K). Assuming of course that you use 100% of your frame time for TraceRays calls, or overlap TraceRays calls (async compute) with other work

These are pretty good numbers. Especially if you async compute overlap TraceRays with other stuff, hiding the potential bottleneck of BVH and ray-triangle tests in more complex scenes. I will definitely do some async tests when I get my hands dirty with RTX 2080.
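For reference, a quick sanity check of that arithmetic (assuming a 3840x2160 frame and the quoted 10 Gigarays/s figure):

Code:
// Ray-budget arithmetic behind the quoted figures (sanity check only).
#include <cstdio>

int main() {
    const double width = 3840.0, height = 2160.0, fps = 60.0;
    const double gigaraysPerSec = 10.0;                 // Nvidia's quoted RTX 2080 Ti number

    double raysPerSecOneSpp = width * height * fps;     // one ray per pixel at 60 fps
    printf("1 ray/pixel @ 4K60: %.2f Gigarays/s\n", raysPerSecOneSpp / 1e9);  // ~0.50

    double raysPerPixel = gigaraysPerSec * 1e9 / raysPerSecOneSpp;
    printf("Ray budget: ~%.0f rays per pixel\n", raysPerPixel);               // ~20
    return 0;
}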


 
@3dilettante well written.
One thing though: the compute overhead is for small dispatches, not large ones (those would be hidden). As you said, there are also extra waits for completing previous tasks. Now it may be doable (on console, I guess) to tune producer/consumer batch sizes to architecture-specific L2 sizes to keep things on chip, but that's very much not portable.

Also want to stress that the meshlet data structures shown just serve as a basic example.

When this was originally designed it was not certain whether the fixed-function blocks would see further improvement, hence my "awkward" comment ;) However, as scaling can be demonstrated, and depending on the feature's success, it opens the door to not having to improve fixed function at some point. (The market decides.)

Not sure what you mean by less transparent about the task/mesh split, given that is all up to the developer?
 
@3dilettante well written.
One thing though: the compute overhead is for small dispatches, not large ones (those would be hidden). As you said, there are also extra waits for completing previous tasks. Now it may be doable (on console, I guess) to tune producer/consumer batch sizes to architecture-specific L2 sizes to keep things on chip, but that's very much not portable.
The lead designer for the PS4 indicated there was a triangle sieve option for compiling code from a vertex shader, which would produce a shader containing only position-related calculations for culling. This would then be linked to the geometry front end running a normally compiled version of the vertex shader by a firmware-managed ring buffer hosted in the L2 cache, which was something Sony considered a tweak for their architecture.
It was an optional feature that needed some evaluation to determine whether it would lead to an improvement, although I have not seen references since then on how often that tweak was used.

When this was originally designed it was not certain whether the fixed-function blocks would see further improvement, hence my "awkward" comment ;) However, as scaling can be demonstrated, and depending on the feature's success, it opens the door to not having to improve fixed function at some point. (The market decides.)
It's a frequently retold story for designs facing the programmable/dedicated wheel of reincarnation; sometimes it's a question of where on the cycle a design finds itself. Perhaps a future design will be able to heavily leverage the more programmable path and leave the traditional hardware path alone, and then someone will start to think about what hardware could accelerate the new one.

Not sure what you mean by less transparent about the task/mesh split, given that is all up to the developer?
Transparent in this case would be whether the shaders would need to expose elements like the number of primitives per mesh or limits on the amount of context being passed from one stage to the next. Some alternate methods for refactoring the geometry pipeline had thresholds for batch or mesh context, but they could be hidden from the developer to varying degrees, either by the driver or in some cases by hardware that could start and end batches based on conditions during execution. Granted, being totally unaware of those limits could lead to less-than-peak performance, and some of those hardware-linked behaviors add dependence on other parts of the hardware, which the explicitly separate task and mesh shaders discard.
 
So much nicer without the AO. AO always seemed like a weird, awkward, and inaccurate kludge. Some people liked it, but I always felt it made things look worse, as it made the result look even less natural than without AO.

Regards,
SB
 
So much nicer without the AO. AO always seemed like a weird, awkward, and inaccurate kludge. Some people liked it, but I always felt it made things look worse, as it made the result look even less natural than without AO.

Regards,
SB

Without which AO - SSAO or RTAO? :)

Some kind of AO is absolutely necessary in scenes that don’t use static prebaked lighting. Otherwise everything looks very flat and floaty.

Aside from the obvious improvement in dynamic lighting, it's hard to tell whether overall IQ is actually improved under RTX in the Exodus demo. It seems that with RTX on, the ambient lighting term is greatly reduced, but other surfaces look overly bright.
 