Nvidia Turing Architecture [2018]

What I'm asking myself is: how many mesh shaders do TU102 and TU104 have?

So TU102 has 6 rasterizers and TU104 also has 6 rasterizers. That means both have the same capabilities for rasterization. But do they also have the same capabilities for mesh shaders? Where do the mesh shaders sit in the architecture?
I thought we had solved that on the other forum ;) For several generations now there have been no dedicated "shaders" in hardware anymore; all vendors use unified approaches. The hardware (SM/CU etc.) executes generic shader code from any shader stage.
 
There is nothing you can really do about that. Unless you happen to come up with a contribution on the order of the theory of relativity, there is always the possibility that it will become part of the archive. If your peers deem it worthy enough, it will hold up against the flow of time, and this is true for any profession. But we digress ...
Indeed. IMO the feature was very well received among developers; technical material and conference presentations exist, and we actively engage with developers. On that front I am not worried.
Marketing it to consumers is a question of having applications/titles that use the tech. That will take a while (unlike DXR, there is no standard yet) before it shows up in actual products. I am used to different timelines, where it takes many years for new technologies to move from research to mass adoption, and some never make it. That is normal and just how it works.
 
What I'm asking myself is: how many mesh shaders do TU102 and TU104 have?

So TU102 has 6 rasterizers and TU104 also has 6 rasterizers. That means both have the same capabilities for rasterization. But do they also have the same capabilities for mesh shaders? Where do the mesh shaders sit in the architecture?
TU102 has 50% more ROPs than TU104.

The "Mesh Shader" like other shaders is a "program", a piece of code that runs on the shader processors. The "dedicated hardware" for the mesh shader to be possible is mostly giving those shader processors access to the caches in the front end, as well as support for new, dedicated instructions.
This is just a layman explanation to layman knowledge, though. Others here are probably better suited to explain in more detail.
 
Yes, @ToTTenTranz, that's a pretty good explanation. Just a clarification about the "cache in the front end": there are new instructions to read/write more of the output/input data. On NV architectures this is typically stored in L1/SMEM, as shown in the presentations linked earlier in the thread.
There is no need for "units" as such, just some hardware state to disable the traditional index/vertex fetch (since developers fetch everything themselves) and some new plumbing of the data flow in the hardware for spawning task/mesh warps etc.
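To make the "no units, just plumbing" point concrete, here is a minimal host-side sketch (my own illustration, not vendor code) of what the switch looks like under Vulkan's VK_NV_mesh_shader extension: the traditional indexed draw goes through the fixed-function index/vertex fetch, while the mesh path just launches task/mesh warps and lets the shader fetch everything itself. vkCmdDrawMeshTasksNV is an extension entry point that must be loaded via vkGetDeviceProcAddr.

```cpp
// Illustrative only: traditional indexed draw vs. mesh-shader dispatch under
// VK_NV_mesh_shader. Pipelines and buffers are assumed to be bound already.
#include <vulkan/vulkan.h>

// Traditional path: the fixed-function input assembler fetches
// indices/vertices on the shader's behalf.
void drawTraditional(VkCommandBuffer cmd, uint32_t indexCount) {
    vkCmdDrawIndexed(cmd, indexCount, 1, 0, 0, 0);
}

// Mesh path: no index/vertex fetch state at all. Each task/mesh warp reads
// its own meshlet data from ordinary storage buffers and emits primitives
// straight to the rasterizer front end.
void drawMeshlets(VkCommandBuffer cmd,
                  PFN_vkCmdDrawMeshTasksNV pfnDrawMeshTasks, // loaded via vkGetDeviceProcAddr
                  uint32_t meshletCount) {
    pfnDrawMeshTasks(cmd, meshletCount /*taskCount*/, 0 /*firstTask*/);
}
```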
 
Ray Tracing Gems - Available Mid March
February 6, 2019
To help developers navigate this new technology, a wide-ranging book on the topic is being published early this year: Ray Tracing Gems.

We have some great news: Readers of NVIDIA’s Dev News Center get early access to the text, at no cost! NVIDIA will be distributing the book for free in its entirety, as a series of seven PDFs. Every few days, a new section of the book will be made available.

Ray Tracing Gems is the work of more than 60 contributors, all experts in the field of ray tracing. Their articles cover techniques that are not often discussed in general texts but are important for high-quality results.

You can find the first installment and future segments on the NVIDIA Developer Zone.


Ray Tracing Gems is not meant as a survey of the field of ray tracing. There are already fine books that provide a general education, many of them free; see this resource list, for example. Rather, this volume is more in the spirit of other gems books, such as GPU Gems, containing articles covering techniques that are often not discussed in general texts but that are important for high-quality results. The book also includes tutorials on newer technologies, along with guides that pull together best practices for solving specific problems. The second half of the book includes studies of larger systems focused on a variety of effects.
https://news.developer.nvidia.com/the-authoritative-book-on-real-time-ray-tracing-has-arrived/
 
I'm having a hard time wrapping my head around the TU116 FP16 units.
Are they really just that, FP16 units, or could they be (software-limited?) tensor cores instead? It sounds a bit strange that NVIDIA would develop new FP16 CUDA cores when they've had 2xFP16/1xFP32 units before (Tegra X1 and GP100, at least).
 
I'm having a hard time wrapping my head around the TU116 FP16 units.
Are they really just that, FP16 units, or could they be (software-limited?) tensor cores instead? It sounds a bit strange that NVIDIA would develop new FP16 CUDA cores when they've had 2xFP16/1xFP32 units before (Tegra X1 and GP100, at least).
I was examining the die shots of TU106 and TU116, and it seems to me that TU116's SM structure is no different from TU106's, so I think TU116 simply disables RTX and either does the same with the tensor cores or just throttles them to ordinary FP16 ops.
 
I was examining the die shots of TU106 and TU116, and it seems to me that TU116's SM structure is no different from TU106's, so I think TU116 simply disables RTX and either does the same with the tensor cores or just throttles them to ordinary FP16 ops.
Wait, there are actual die shots of the Turings somewhere? Artist impressions don't count.
 
I'm having a hard time wrapping my head around the TU116 FP16 units.
Are they really just that, FP16 units, or could they be (software-limited?) tensor cores instead? It sounds a bit strange that NVIDIA would develop new FP16 CUDA cores when they've had 2xFP16/1xFP32 units before (Tegra X1 and GP100, at least).
For what it's worth, NVIDIA says they're new units, and not tensor cores. But truthfully, I fully expect they're leaving out some details in order to obfuscate parts of the architecture and maintain their technological advantage.
 
In general, you'd expect separate FP16 ALUs to cost more area than merged FP16/FP32 ALUs *but* less power. In theory you'd pay a very small power-efficiency cost with merged units not just on FP16 calculations but also on FP32... so it might make sense for NVIDIA to have separate FP16 ALUs; in fact, I don't see any clear evidence that they weren't already physically separate on Tegra X1 and GP100, even if they couldn't be used simultaneously due to register file/scheduling limitations.

If their main reason was power, at first glance it doesn't feel like it'd make a lot of sense for TU116, since it's a lower-end chip that is more price-sensitive... however, it's probably also aggressively targeting the laptop market, so it might make sense because of that. The alternative would have been to remove FP16 support completely, as not many current games benefit from it, but those kinds of design decisions are made long in advance, and at that point it might have seemed risky to give AMD a potential advantage in FP16 throughput.

Anyhow, in the grand scheme of things it's a minor implementation detail; I very much doubt those FP16 units take up much area... It is interesting that TU106 and TU116 aren't that different despite the lack of RTX and tensor cores, though... I don't have the time to do a proper analysis of those die shots, but I'd be very curious if anyone does!
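As an aside, the kind of math these FP16 paths exist to accelerate is the packed two-wide variety. A minimal CUDA sketch (my illustration, using standard cuda_fp16.h intrinsics) of where the 2xFP16 rate everyone keeps quoting comes from:

```cpp
// Minimal sketch: packed FP16 math via cuda_fp16.h. One __hfma2 performs a
// fused multiply-add on two FP16 values at once, which is where the
// "2x FP16 per FP32 lane" throughput figure comes from.
#include <cuda_fp16.h>

__global__ void saxpy_half2(int n, __half2 a, const __half2* x, __half2* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = __hfma2(a, x[i], y[i]); // two FP16 FMAs per thread per issue
    }
}
```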
 
Integrating Ray Tracing Into an Existing Engine
March 12, 2019

https://news.developer.nvidia.com/i...xisting-engine-three-things-you-need-to-know/

GDC 2019:
Title: A DEVTECH’S ESSENTIAL GUIDE TO RAY TRACING
Location: Room 205, South Hall
Date: Thursday, March 21
Time: 11:30am – 12:30pm
Pass Type: All Access, GDC Conference + Summits, GDC Conference, GDC Summits, Expo Plus, Audio Conference + Tutorial, Indie Games Summit
Topic: Programming
Format: Sponsored Session

https://schedule.gdconf.com/session...ide-to-ray-tracing-presented-by-nvidia/865242
 
Tips and Tricks: Ray Tracing Best Practices
March 20, 2019
This post presents best practices for implementing ray tracing in games and other real-time graphics applications. We present these as briefly as possible to help you quickly find key ideas. This is based on a presentation made at the 2019 GDC by NVIDIA engineers.

Main Points
Optimize your acceleration structure (BLAS/TLAS) builds/updates to take at most 2 ms, via pruning and selective updates (a build/refit sketch follows this list)
Denoising RT effects is essential. We've packaged up best-in-class denoisers in the NVIDIA RTX Denoiser SDK
Overlap acceleration structure (BLAS/TLAS) builds/updates and denoising with other work (G-buffer, shadow buffers, physics simulation) using asynchronous compute queues
Leverage HW acceleration for traversal whenever possible
The minimum number of rays cast that is billed as "RT On" should deliver noticeably better image quality than rasterization, and increasing quality levels should trade image quality against performance cost at a fair rate
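For reference, here is a hedged DXR sketch of the build-vs-update distinction the first point is about. Buffer sizing (via GetRaytracingAccelerationStructurePrebuildInfo) and UAV barriers are assumed to be handled elsewhere, and all names below are placeholders:

```cpp
#include <d3d12.h>

// Sketch only: full build vs. refit of a BLAS. A refit (PERFORM_UPDATE) is
// roughly an order of magnitude cheaper (see the throughput FAQ below) but
// is only valid while topology is unchanged: vertices moved, nothing
// added or removed.
void buildOrRefitBlas(ID3D12GraphicsCommandList4* cmd,
                      const D3D12_RAYTRACING_GEOMETRY_DESC* geometry,
                      UINT geometryCount,
                      ID3D12Resource* scratch, ID3D12Resource* blas,
                      bool refit) {
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC desc = {};
    desc.Inputs.Type = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    desc.Inputs.DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY;
    desc.Inputs.NumDescs = geometryCount;
    desc.Inputs.pGeometryDescs = geometry;
    desc.Inputs.Flags =
        D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_UPDATE;
    desc.ScratchAccelerationStructureData = scratch->GetGPUVirtualAddress();
    desc.DestAccelerationStructureData = blas->GetGPUVirtualAddress();
    if (refit) {
        desc.Inputs.Flags |=
            D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PERFORM_UPDATE;
        desc.SourceAccelerationStructureData = blas->GetGPUVirtualAddress(); // update in place
    }
    cmd->BuildRaytracingAccelerationStructure(&desc, 0, nullptr);
}
```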

FAQ
Q. What’s the relationship between number of primitives and cost (time) of acceleration structure build/updates?
A. It’s mostly a linear relationship. Well, it starts getting linear beyond a certain primitive count, before that it’s bound by constant overhead. The exact numbers here are in flux and wouldn’t be reliable.

Q. Assuming maximum occupancy, what’s the GPU throughput SOL for acceleration structure build/updates?

A. An order-of-magnitude guideline is O(100 million) primitives/sec for full builds and O(1 billion) primitives/sec for updates. For example, at that rate a 200k-primitive full build lands right around the 2 ms budget mentioned above.

Q. What’s the relationship between number of unique shaders and compilation cost (time) for RT PSOs?

A. It is roughly linear.

Q. What’s the typical cost of RT PSO compilation in games today?

A. Anywhere from 20 ms to 300 ms per pipeline.
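If you want to verify those numbers in your own engine, timing the creation call is straightforward. A sketch (the desc is assumed to be filled in elsewhere):

```cpp
#include <chrono>
#include <d3d12.h>

// Sketch: measure RTPSO compile time. 'desc' is a fully populated
// D3D12_STATE_OBJECT_DESC (shader libraries, hit groups, config subobjects).
double rtpsoCompileMs(ID3D12Device5* device,
                      const D3D12_STATE_OBJECT_DESC& desc,
                      ID3D12StateObject** outPso) {
    auto t0 = std::chrono::steady_clock::now();
    device->CreateStateObject(&desc, IID_PPV_ARGS(outPso));
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```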

Q. Is there guidance for how much alpha/transparency should be used? What’s the cost of anyhit vs closest hit?

A. Any-hit is expensive and should be used minimally. Preferably mark geometry (or instances) as OPAQUE, which allows ray traversal to happen in fixed-function hardware. When any-hit is needed (e.g. to evaluate transparency), keep it as simple as possible. Don't evaluate huge shading networks just to execute what amounts to an alpha texture lookup and an if-statement.
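In DXR terms, that means setting the opaque flag on the geometry desc, e.g. (sketch; fields other than the flags are filled in as usual):

```cpp
// Sketch: OPAQUE geometry never invokes an any-hit shader, so traversal can
// stay on the fixed-function hardware path.
D3D12_RAYTRACING_GEOMETRY_DESC geom = {};
geom.Type  = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
geom.Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE;
// geom.Triangles.* (vertex/index buffer GPU addresses, formats, counts)
// filled in as usual before the BLAS build.
```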

Q. How should the developer manage shading divergence?

A. Start by shading in closest-hit shaders, in a straightforward implementation. Then analyze perf and decide how much of a problem divergence is and how it can be addressed. The solution may or may not include “manual scheduling”.

Q. How can the developer query the stack memory allocation?

A. The API has functionality to query per-thread stack requirements on pipelines/shaders. This is useful for tracking and analysis, and an app should always strive to use as little shader stack as possible (one recommendation is to dump stack-size histograms and flag outliers during development). Stack requirements are most directly influenced by live state across trace calls, which should be minimized (see Best Practices).
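Concretely, the D3D12 side of that query looks roughly like this (export names are hypothetical; the "hitgroup::shader" syntax addresses individual shaders inside a hit group):

```cpp
#include <d3d12.h>

// Sketch: query per-shader stack sizes from a compiled RTPSO and set an
// explicit pipeline stack bound instead of the conservative driver default.
void tuneStackSize(ID3D12StateObject* rtpso) {
    ID3D12StateObjectProperties* props = nullptr;
    rtpso->QueryInterface(IID_PPV_ARGS(&props));

    UINT64 raygen = props->GetShaderStackSize(L"MyRaygen");              // hypothetical export
    UINT64 chit   = props->GetShaderStackSize(L"MyHitGroup::closesthit"); // hypothetical hit group

    // Bound for a known worst case (e.g. one recursive bounce here); dump a
    // histogram of these values during development and flag outliers.
    props->SetPipelineStackSize(raygen + 2 * chit);
    props->Release();
}
```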

Q. How much extra VRAM does a typical ray-tracing implementation consume?

A. Today, games implementing ray tracing typically use around 1 to 2 GB of extra memory. The main contributing factors are acceleration structure resources, ray-tracing-specific screen-sized buffers (extended G-buffer data), and driver-internal allocations (mainly the shader stack).
https://devblogs.nvidia.com/rtx-best-practices/
 
If that's all it takes up, why make anything without them?
 