Nvidia Turing Architecture [2018]

pixeljetstream · Dec 27, 2018

Digidi said:
What I'm asking is how many mesh shaders TU102 and TU104 have.

So TU102 has 6 rasterizers and TU104 also has 6 rasterizers. That means both have the same capabilities for rasterisation. But do they also have the same capabilities for mesh shaders? Where do the mesh shaders sit in the architecture?

I thought we had resolved that on the other forum.

For several generations there have been no dedicated "shaders" in hw; all vendors use unified approaches. The hw (SM/CU etc.) executes generic shader code from any shader stage.

pixeljetstream · Dec 27, 2018

pharma said:
There is nothing you can really do about that. Unless you happen to come up with a contribution similar to the "theory of relativity" or something as magnanimous then there is always the possibility that it will become part of the archive. If your peers deem it worthy enough then it will hold up against the flow of time, and this is true for any profession. But we digress ...

Indeed. Imo the feature was very well received among developers, technical material/conference presentations exist, and we actively engage with developers. From that point of view I am not worried.
Marketing it to consumers is a question of having applications/titles that use the tech. It will take a while until it shows up in actual products (unlike DXR, there is no standard yet). I am used to different timelines, where it takes many years from research until new technologies get adopted for the masses, and some never are. That is normal and just how it works.

Deleted member 13524 · Dec 27, 2018

Digidi said:
What I'm asking is how many mesh shaders TU102 and TU104 have.

So TU102 has 6 rasterizers and TU104 also has 6 rasterizers. That means both have the same capabilities for rasterisation. But do they also have the same capabilities for mesh shaders? Where do the mesh shaders sit in the architecture?

TU102 has 50% more ROPs than TU104.

The "Mesh Shader", like other shaders, is a "program": a piece of code that runs on the shader processors. The "dedicated hardware" that makes the mesh shader possible mostly consists of giving those shader processors access to the caches in the front end, as well as support for new, dedicated instructions.
This is just a layman's explanation based on a layman's knowledge, though. Others here are probably better suited to explain in more detail.

pixeljetstream · Dec 27, 2018

Yes, @ToTTenTranz, that's a pretty good explanation. Just a clarification about "caches in the front end": there are new instructions to read/write more of the output/input data, which on NV architectures is typically stored in L1/SMEM, as shown in the presentations linked here:

https://twitter.com/x/status/723279009665167363

There is no need for "units" as such, just some hw state to disable the traditional index/vertex fetch (given that developers fetch everything themselves) and some new plumbing of data flow in the hw for spawning task/mesh warps etc.

Deleted member 2197 · Feb 7, 2019

Ray Tracing Gems - Available Mid March
February 6, 2019
To help developers navigate this new technology, a wide-ranging book on the topic is being published early this year: Ray Tracing Gems.

We have some great news: Readers of NVIDIA’s Dev News Center get early access to the text, at no cost! NVIDIA will be distributing the book for free in its entirety, as a series of seven PDFs. Every few days, a new section of the book will be made available.

Ray Tracing Gems is the work of more than 60 contributors, all experts in the field of ray tracing. Their articles cover techniques that are not often discussed in general texts, but are important for high quality results.

You can find the first installment and future segments on the NVIDIA Developer Zone.

Ray Tracing Gems is not meant as a survey of the field of ray tracing. There are already fine books that provide a general education, many of them free; see this resource list, for example. Rather, this volume is more in the spirit of other gems books, such as GPU Gems, containing articles covering techniques that are often not discussed in general texts but that are important for high-quality results. The book also includes tutorials on newer technologies, along with guides that pull together best practices for solving specific problems. The second half of the book includes studies of larger systems focused on a variety of effects.
https://news.developer.nvidia.com/the-authoritative-book-on-real-time-ray-tracing-has-arrived/

Kaotik · Mar 5, 2019

I'm having a hard time wrapping my head around the TU116 FP16 units.
Are they really just that, FP16 units, or could they be (software-limited?) tensor cores instead? It sounds a bit strange that NVIDIA would develop new FP16 CUDA cores when they've had 2xFP16/1xFP32 units before (Tegra X1 and GP100 at least).

fellix · Mar 5, 2019

Kaotik said:
I'm having a hard time wrapping my head around the TU116 FP16 units.
Are they really just that, FP16 units, or could they be (software-limited?) tensor cores instead? It sounds a bit strange that NVIDIA would develop new FP16 CUDA cores when they've had 2xFP16/1xFP32 units before (Tegra X1 and GP100 at least).

I was examining the die shots of TU106 and TU116, and it seems to me that TU116's SM structure is no different from the former's, so I think TU116 simply disables RTX and either does the same with the tensor cores or just throttles them to ordinary FP16 ops.

Kaotik · Mar 5, 2019

fellix said:
I was examining the die shots of TU106 and TU116, and it seems to me that TU116's SM structure is no different from the former's, so I think TU116 simply disables RTX and either does the same with the tensor cores or just throttles them to ordinary FP16 ops.

Wait, there are actual die shots of the Turings somewhere? Artist impressions don't count.

fellix · Mar 5, 2019

Kaotik said:
Wait, there are actual die shots of the Turings somewhere? Artist impressions don't count.

Yep, in the usual place: https://www.flickr.com/photos/130561288@N04

They're not classical micrographs, though, but non-destructive IR scans. The downside is the laser-print obstruction, if present.

Ryan Smith · Mar 6, 2019

Kaotik said:
I'm having a hard time wrapping my head around the TU116 FP16 units.
Are they really just that, FP16 units, or could they be (software-limited?) tensor cores instead? It sounds a bit strange that NVIDIA would develop new FP16 CUDA cores when they've had 2xFP16/1xFP32 units before (Tegra X1 and GP100 at least).

For what it's worth, NVIDIA says they're new units, and not tensor cores. But truthfully, I fully expect they're leaving out some details in order to obfuscate parts of the architecture and maintain their technological advantage.

Arun · Mar 7, 2019

In general, you'd expect separate FP16 ALUs to be higher area than merged FP16/FP32 ALUs *but* lower power. In theory you'd expect to pay a very small power efficiency cost not just on FP16 calculations but also FP32... so it might make sense for NVIDIA to have separate FP16 ALUs; in fact, I don't see any clear evidence that they weren't already separate physically on Tegra X1 and GP100, even if they couldn't be used simultaneously due to register file/scheduling limitations?

If their main reason was power, at first glance it doesn't feel like it'd make a lot of sense for TU116 since it's a lower-end chip which is more price sensitive... however, it's probably also aggressively targeting the laptop market, so it might make sense because of that. The alternative would have been to remove FP16 support completely as not many current games benefit from it, but those kinds of design decisions are done long in advance, and at that point it might have seemed risky to give AMD a potential advantage in FP16 throughput.

Anyhow in the grand scheme of things it's a minor implementation detail; I very much doubt those FP16 units are taking very much area... It is interesting that TU106 vs TU116 aren't that different despite the lack of RTX and tensor cores though... I don't have the time to do a proper analysis of those die shots but I'd be very curious if anyone does!

SlmDnk · Mar 8, 2019

ustream.tv/gpu-technology-conference

Deleted member 2197 · Mar 13, 2019

Integrating Ray Tracing Into an Existing Engine
March 12, 2019

https://news.developer.nvidia.com/i...xisting-engine-three-things-you-need-to-know/

GDC 2019:
Title: A DEVTECH’S ESSENTIAL GUIDE TO RAY TRACING
Location: Room 205, South Hall
Date: Thursday, March 21
Time: 11:30am – 12:30pm
Pass Type: All Access, GDC Conference + Summits, GDC Conference, GDC Summits, Expo Plus, Audio Conference + Tutorial, Indie Games Summit
Topic: Programming
Format: Sponsored Session
https://schedule.gdconf.com/session...ide-to-ray-tracing-presented-by-nvidia/865242

DavidGraham · Mar 14, 2019

Volta V100 is still faster than Titan RTX in FP32 and FP16 TensorFlow performance.

https://lambdalabs.com/blog/titan-v-deep-learning-benchmarks/

DavidGraham · Mar 18, 2019

Microsoft is officially supporting Variable Rate Shading in DX12, with broad engine and developer support. The feature works on NVIDIA's Turing and the upcoming Intel Gen11 iGPUs.

Preliminary results on an RTX 2060 at 4K show a 14% improvement in Civilization 6.

https://devblogs.microsoft.com/directx/variable-rate-shading-a-scalpel-in-a-world-of-sledgehammers/

Deleted member 2197 · Mar 20, 2019

Tips and Tricks: Ray Tracing Best Practices
March 20, 2019

This post presents best practices for implementing ray tracing in games and other real-time graphics applications. We present these as briefly as possible to help you quickly find key ideas. This is based on a presentation made at the 2019 GDC by NVIDIA engineers.

Main Points
Optimize your acceleration structure (BLAS/TLAS) build/update to take at most 2ms via pruning and selective updates
Denoising RT effects is essential. We’ve packaged up best-in-class denoisers with the NVIDIA RTX Denoiser SDK
Overlap the acceleration structure (BLAS/TLAS) build/update and denoising with other regimes (G-Buffer, shadow buffer, physical simulation) using asynchronous compute queues
Leverage HW acceleration for traversal whenever possible
Minimum number of rays cast should be billed as “RT On” and should deliver noticeably better image quality than rasterization. Increasing quality levels should increase image quality and perf at a fair rate. See table below:

FAQ
Q. What’s the relationship between number of primitives and cost (time) of acceleration structure build/updates?
A. It’s mostly a linear relationship. Well, it starts getting linear beyond a certain primitive count, before that it’s bound by constant overhead. The exact numbers here are in flux and wouldn’t be reliable.

Q. Assuming maximum occupancy, what’s the GPU throughput SOL for acceleration structure build/updates?
A. An order-of-magnitude guideline is O(100 million) primitives/sec for full builds and O(1 billion) primitives/sec for updates.
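As a rough back-of-the-envelope check, the order-of-magnitude rates quoted above can be combined with the 2 ms build/update budget from the main points. The sketch below is illustrative arithmetic only: the helper name is hypothetical, and it assumes a purely linear cost, ignoring the constant overhead that the earlier answer says dominates at low primitive counts.

```python
# Order-of-magnitude rates quoted in the FAQ (primitives per second).
FULL_BUILD_RATE = 100e6  # ~O(100 million) primitives/sec for full builds
UPDATE_RATE = 1e9        # ~O(1 billion) primitives/sec for updates


def primitives_within_budget(rate_per_sec: float, budget_ms: float) -> int:
    """Primitives processed in budget_ms at rate_per_sec (linear model).

    Hypothetical helper: it ignores constant per-build overhead, so the
    result is an optimistic upper bound, not a measured figure.
    """
    return round(rate_per_sec * budget_ms / 1000.0)


if __name__ == "__main__":
    budget_ms = 2.0  # the "at most 2ms" target from the main points
    print("full build:", primitives_within_budget(FULL_BUILD_RATE, budget_ms))
    print("update:    ", primitives_within_budget(UPDATE_RATE, budget_ms))
```

Under these assumptions the budget covers on the order of 200 thousand primitives for full builds or 2 million for updates per frame, which is consistent with the advice to prune and update selectively.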

Q. What’s the relationship between number of unique shaders and compilation cost (time) for RT PSOs?
A. It is roughly linear.

Q. What’s the typical cost of RT PSO compilation in games today?
A. Anywhere from 20 ms to 300 ms per pipeline.

Q. Is there guidance for how much alpha/transparency should be used? What’s the cost of anyhit vs closest hit?
A. Any-hit is expensive and should be used minimally. Preferably mark geometry (or instances) as OPAQUE, which will allow ray traversal to happen in fixed-function hardware. When AH is needed (e.g. to evaluate transparency etc), keep it as simple as possible. Don’t evaluate huge shading networks just to execute what amounts to an alpha tex lookup and an if-statement.

Q. How should the developer manage shading divergence?
A. Start by shading in closest-hit shaders, in a straightforward implementation. Then analyze perf and decide how much of a problem divergence is and how it can be addressed. The solution may or may not include “manual scheduling”.

Q. How can the developer query the stack memory allocation?
A. The API has functionality to query per-thread stack requirements on pipelines/shaders. This is useful for tracking and analysis purposes, and an app should always strive to use as little shader stack as possible (one recommendation is to dump stack size histograms and flag outliers during development). Stack requirements are most directly influenced by live state across trace calls, which should be minimized (see Best Practices).
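The "dump stack size histograms and flag outliers" recommendation could look roughly like the sketch below. It assumes the app has already queried per-shader stack sizes through the API (that query is API-specific and not shown); the shader names, the byte values, the 64-byte bucket width and the "twice the median" outlier rule are all made up for illustration.

```python
from collections import Counter
from statistics import median


def stack_histogram(stack_bytes, bucket=64):
    """Count shaders per stack-size bucket, keyed by bucket lower bound."""
    return Counter((size // bucket) * bucket for size in stack_bytes.values())


def flag_outliers(stack_bytes, factor=2.0):
    """Return names of shaders whose stack size exceeds factor * median."""
    limit = factor * median(stack_bytes.values())
    return sorted(name for name, size in stack_bytes.items() if size > limit)


if __name__ == "__main__":
    # Hypothetical per-shader stack sizes in bytes, as a development build
    # might log them; names and numbers are placeholders.
    sizes = {"shadow_rgen": 32, "reflect_rgen": 96,
             "gi_rgen": 112, "debug_rgen": 480}
    bucket = 64
    for lower, count in sorted(stack_histogram(sizes, bucket).items()):
        print(f"{lower:4d}-{lower + bucket - 1:4d} B: {'#' * count}")
    print("outliers:", flag_outliers(sizes))
```

A median-based threshold is just one simple choice; a fixed per-project byte limit would work equally well for flagging shaders that keep too much live state across trace calls.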

Q. How much extra VRAM does a typical ray-tracing implementation consume?
A. Today, games implementing ray-tracing are typically using around 1 to 2 GB extra memory. The main contributing factors are acceleration structure resources, ray tracing specific screen-sized buffers (extended g-buffer data), and driver-internal allocations (mainly the shader stack).
https://devblogs.nvidia.com/rtx-best-practices/

DavidGraham · Apr 7, 2019

A rough estimate of the area taken by the RT cores and Tensor cores (about 8~10% of the die):

https://www.reddit.com/r/hardware/comments/baajes

BRiT · Apr 7, 2019

If that's all it takes up, why make anything without them?

DavidGraham · Apr 7, 2019

BRiT said:
If that's all it takes up, why make anything without them?

Ray tracing requires a minimum level of raster performance as well, so it would not make sense to offer it below a certain performance tier.

DavidGraham · May 26, 2019

OctaneBench 2019, RTX off vs. RTX on: RTX on generally provides 3X the score.

Quadro:

https://www.pugetsystems.com/labs/a...-RTX-Performance-Boost-1387/#BenchmarkResults

Geforce:

https://www.pugetsystems.com/labs/a...-RTX-Performance-Boost-1384/#BenchmarkResults
