Nvidia Turing Architecture [2018]

Video Series: Practical Real-Time Ray Tracing With RTX
October 23, 2018
In this video series, NVIDIA engineers Martin-Karl Lefrancois and Pascal Gautron help you get started with real-time ray tracing. You'll learn how data and rendering are managed, how acceleration structures and shaders work, and what new components are needed for your pipeline. We'll also include key slides from the presentation on which this video series is based.
https://devblogs.nvidia.com/practical-real-time-ray-tracing-rtx/
 
Turing Texture Space Shading
October 25, 2018
Turing GPUs introduce a new shading capability called Texture Space Shading (TSS), where shading values are dynamically computed and stored as texels in a texture space. Later, pixels are texture mapped: pixels in screen space are mapped into texture space, and the corresponding texels are sampled and filtered using a standard texture lookup operation. With this technology we can sample visibility and appearance at completely independent rates, and in separate (decoupled) coordinate systems. Using TSS, a developer can simultaneously improve quality and performance by (re)using shading computations done in a decoupled shading space.
Developers can use TSS to exploit both spatial and temporal rendering redundancy. By decoupling shading from the screen-space pixel grid, TSS can achieve a high level of frame-to-frame stability, because shading locations do not move between one frame and the next. This temporal stability is important to applications like VR that require greatly improved image quality, free of aliasing artifacts and temporal shimmer.

As mentioned, with TSS, per-pixel shading rate can be dynamically and continuously controlled by adjusting texture LOD. By varying LOD we can select different texture MIP levels as needed to reduce the number of texels shaded. Note that this means that the sampling approach of TSS can also be used to implement many of the same shading rate reduction techniques that are supported by Turing’s Variable Rate Shading feature (VRS). (We’ll have more details on VRS in a later post). Which method is best for the developer depends on their objectives. VRS is a lighter weight change to the rendering pipeline, while TSS has more flexibility and supports additional use cases.
https://devblogs.nvidia.com/texture-space-shading/
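To make the decoupling in the excerpt above concrete, here is a minimal CPU-side sketch of the two phases it describes, with entirely hypothetical names: shading is evaluated per texel in texture space at a rate chosen by a mip/LOD level, and the screen pass only performs a texture lookup. A real implementation runs both phases on the GPU and relies on hardware mip selection and filtering; this is only an illustration of the idea, not NVIDIA's API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Color { float r, g, b; };

// Phase 1: shade texels of the mip level selected by 'lod'. Coarser mips shade
// fewer texels, mirroring the "adjust texture LOD" shading-rate control.
std::vector<Color> ShadeTextureSpace(int baseSize, int lod) {
    int size = baseSize >> lod;                      // mip resolution at this LOD
    std::vector<Color> texels(size * size);
    for (int y = 0; y < size; ++y)
        for (int x = 0; x < size; ++x)
            // Stand-in for expensive shading work: a simple procedural pattern.
            texels[y * size + x] = { std::sin(x * 0.1f) * 0.5f + 0.5f,
                                     std::cos(y * 0.1f) * 0.5f + 0.5f,
                                     0.5f };
    return texels;
}

// Phase 2: the screen-space pass maps a pixel's UV into texture space and fetches
// a texel (real hardware would filter between texels and mip levels).
Color SampleTexture(const std::vector<Color>& texels, int size, float u, float v) {
    int x = std::min(int(u * size), size - 1);
    int y = std::min(int(v * size), size - 1);
    return texels[y * size + x];
}

int main() {
    const int baseSize = 256;
    const int lod = 2;                               // shade only 1/16 of the base texel count
    auto texels = ShadeTextureSpace(baseSize, lod);
    Color c = SampleTexture(texels, baseSize >> lod, 0.5f, 0.5f);
    std::printf("sampled %.2f %.2f %.2f at lod %d\n", c.r, c.g, c.b, lod);
    return 0;
}
```

Raising `lod` by one shades a quarter as many texels while the screen pass is unchanged, which is the "control shading rate by adjusting texture LOD" knob the post describes.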
 
Effectively Integrating RTX Ray Tracing into a Real-Time Rendering Engine
October 29, 2018
RTX is NVIDIA’s new platform for hybrid rendering, allowing the combination of rasterization and compute-based techniques with hardware-accelerated ray tracing and deep learning. It has already been adopted in a number of games and engines. Based on those experiences, this blog aims to give the reader an insight into how RTX ray tracing is best integrated into real-time applications today. This blog assumes that the reader is familiar with the Microsoft DXR API at a basic level.

https://devblogs.nvidia.com/effectively-integrating-rtx-ray-tracing-real-time-rendering-engine/
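As a rough orientation for how such a hybrid integration tends to be structured (this is my sketch, not the blog's code, and every function name is a hypothetical placeholder rather than a DXR or engine API): rasterize what rasterization is good at, trace rays only for the effects that need them, denoise, then composite. In a real engine each stage would be recorded into D3D12/DXR or Vulkan command lists against shared scene data.

```cpp
#include <cstdio>

// Hypothetical stage placeholders for one frame of a hybrid renderer.
void RasterizeGBuffer()        { std::puts("raster: depth/normals/albedo"); }
void TraceShadowRays()         { std::puts("rt: shadow rays launched from the G-buffer"); }
void TraceReflectionRays()     { std::puts("rt: reflection rays where the surface is smooth enough"); }
void DenoiseRaytracedResults() { std::puts("compute: spatiotemporal denoise of the noisy RT output"); }
void CompositeAndPostProcess() { std::puts("raster/compute: combine RT terms with rasterized lighting"); }

int main() {
    // Rasterization and ray tracing share the same scene; ray generation is
    // driven by the rasterized G-buffer rather than by primary rays.
    RasterizeGBuffer();
    TraceShadowRays();
    TraceReflectionRays();
    DenoiseRaytracedResults();
    CompositeAndPostProcess();
    return 0;
}
```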
 
There is some conceptual overlap between the two (avoiding divergence).

It's interesting how we got low-level tools for one (Turing just adds a few key shader instructions that help pull it off), while ray tracing got a very big black box that, in theory, could have been a few shader instructions as well.
It shows how one (rasterization) is much further along and more stable in exposing low-level tools, while the other is just at the beginning.
 
Texture Space Shading is a feature that intrigues me much more than RT, actually. I am personally much more curious about the potential of that.

This is mostly software-dependent: texture space shading has already shipped in a game (Ashes of the Singularity), and visibility buffers share some ideas with it. The real thing Turing adds is a neat anisotropic sampling of a bitmask to reduce duplicate work; everything else is up to you and compatible with any other sufficiently advanced GPU.
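A conceptual, CPU-side sketch of that "bitmask to reduce duplicate work" point, with made-up numbers and a fake pixel-to-texel mapping: a bit is set the first time any pixel requests a texel, and only first-time requests get shaded. Turing's contribution is hardware help for producing such footprints (including the anisotropic case), not this loop itself.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int texSize = 1024;
    std::vector<bool> requested(texSize * texSize, false);   // one bit per texel
    int shaded = 0, duplicates = 0;

    // Pretend many screen pixels map into overlapping texture-space footprints.
    for (int pixel = 0; pixel < 4 * texSize * texSize; ++pixel) {
        int texel = (pixel * 2654435761u) % (texSize * texSize);  // fake pixel->texel mapping
        if (!requested[texel]) {
            requested[texel] = true;
            ++shaded;            // shade this texel exactly once
        } else {
            ++duplicates;        // skipped: someone already shaded it
        }
    }
    std::printf("shaded %d texels, skipped %d duplicate requests\n", shaded, duplicates);
    return 0;
}
```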
 
hum
 
https://www.digitaltrends.com/computing/nvidia-rtx-2080-ti-graphics-cards-dying/

Concerns are mounting over the failure rate of Nvidia's RTX 2080 Ti graphics card, with increasing numbers of reports of dead and dying cards from early adopters. Some cards display issues involving artifacting and instability immediately after being installed, whilst others begin to show signs of degradation after a few days, despite a lack of manual overclocking or voltage manipulation.
 
Mesh shading is in Vulkan already:

It's an extension. It needs to get into the core API before games will use it. It might be dead anyway for the next 6-7 years if AMD powers all the consoles and NVIDIA's mesh shaders and AMD's primitive shaders aren't compatible to some degree.
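For reference, assuming the extension being discussed is VK_NV_mesh_shader: a minimal sketch of how an application has to probe for it per device and opt in at device creation, which is exactly the "it's an extension, not core" situation described above. It assumes a Vulkan loader and driver are installed.

```cpp
#include <cstdio>
#include <cstring>
#include <vector>
#include <vulkan/vulkan.h>

int main() {
    VkApplicationInfo app{VK_STRUCTURE_TYPE_APPLICATION_INFO};
    app.apiVersion = VK_API_VERSION_1_1;
    VkInstanceCreateInfo ici{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    ici.pApplicationInfo = &app;
    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t deviceCount = 0;
    vkEnumeratePhysicalDevices(instance, &deviceCount, nullptr);
    std::vector<VkPhysicalDevice> devices(deviceCount);
    vkEnumeratePhysicalDevices(instance, &deviceCount, devices.data());

    for (VkPhysicalDevice dev : devices) {
        uint32_t extCount = 0;
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &extCount, nullptr);
        std::vector<VkExtensionProperties> exts(extCount);
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &extCount, exts.data());

        bool hasMeshShader = false;
        for (const auto& e : exts)
            if (std::strcmp(e.extensionName, "VK_NV_mesh_shader") == 0)  // VK_NV_MESH_SHADER_EXTENSION_NAME
                hasMeshShader = true;

        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("%s: mesh shading %s\n", props.deviceName,
                    hasMeshShader ? "available (extension)" : "not exposed");
        // To actually use it, the extension name must also be passed in
        // VkDeviceCreateInfo::ppEnabledExtensionNames at device creation.
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```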
 

Huumm... nvidia does tend to erm.. inspire many PC devs at gunpoint to implement their proprietary optimizations, so I think that scenario is a bit unlikely to be honest.
In fact, we've actually seen the opposite very regularly: AMD-only optimizations that are enabled in consoles for a game but don't get passed on to its Gameworks-powered PC port.
 
Are primitive shaders working, or are they a thoroughly *broken* feature with no support?
 

The thicker abstraction around ray tracing reminds me of the more arcane elements of texture and resource management, particularly in the older days when formats and architectural choices were as wide-ranging as the larger number of vendors and their more limited attempts at consistency/compatibility. Silicon can sort of be characterized as a ~2-dimensional space, with stepwise execution often being a ~1-dimensional affair. The units, paths, caches, memory subsystem, and DRAM tend to offer at most a 2-dimensional scheme, often with a very strong preference for movement along one axis (SIMD divergence, pixel quads, cache lines, DRAM pages, virtual memory tables, streaming/prefetch loads, etc.).
Keeping a soup of 3-dimensional geometry up front, mapping it early on to a screen space, and using well-researched methods for getting a better mapping of 2D elements onto more linear cache and DRAM structures has some nice effects in setting down direct relationships between items, resources, and execution. The rasterizer-directed, heavily threaded, SIMD hardware maps rather well to the problem of utilizing DRAM arrays and caches as we know them, and the direct relationship between elements in the pipeline in effect serves as compression in terms of data or hardware usage.
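One concrete example of those "well-researched methods" for mapping 2D data onto linear memory (my illustration, not something the post names): Morton/Z-order swizzling of texture addresses, which interleaves the x and y bits so that 2D-local texels land near each other in linear memory, matching cache lines and DRAM pages better than row-major order.

```cpp
#include <cstdint>
#include <cstdio>

// Spread the low 16 bits of v so there is a zero bit between each original bit.
static uint32_t Part1By1(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Morton (Z-order) index: interleaved x and y bits.
static uint32_t MortonEncode(uint32_t x, uint32_t y) {
    return Part1By1(x) | (Part1By1(y) << 1);
}

int main() {
    // The four texels of a 2x2 quad map to four consecutive linear addresses,
    // which is exactly the locality pixel quads and cache lines want.
    for (uint32_t y = 0; y < 2; ++y)
        for (uint32_t x = 0; x < 2; ++x)
            std::printf("(%u,%u) -> %u\n", x, y, MortonEncode(x, y));
    return 0;
}
```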

Texture-space rendering at least still keeps a 2D space for rendering, albeit no longer the same global 2D screen space as before. The mapping is still somewhat natural, though some of the assumptions that could be made when working in screen space no longer hold automatically, due to the indirection added by the extra pass and the variability of its properties versus the global screen. This exposes an extra bit of the process that the silicon has to do a bit more work to map to its capabilities.
RT functionality, and the functionality handled by the RT core, is a problem space with more dimensions than can be readily reduced, and as in the old days the players in the field have no consensus on which judgement calls should be baked into their methods or acceleration structures.
The RT core at least attempts to protect the vast majority of the SM compute hardware from floundering on a workload that behaves poorly with the granularity of the hardware, or the linearity built into DRAM.
The memory behavior seems to be a big reason why the fixed-function element is paired with the memory pipeline, much like how texturing is generally adjacent and still has internal operations specific to its handling of data with properties that can defy linear breakdowns.
In other ways, the BVH and RT hardware have a few parallels with TLB hardware, which is another case of handling spaces with more movement along other dimensions than the linear hardware would like. Granted, the adoption of a tree (albeit much flatter than many page table formats) and the indirection from traversal (not a directed walk down a hierarchy like page tables) can create a high-level impression of such almost by default. Perhaps some of the elements learned over the years for managing a tree of virtual memory metadata will inform what happens with RT hardware, however.

edit--late correction: I blanked on the characterization of the tree depth facing the RT core. The externally visible flat tree would be separate from the particulars of the BVH traversal, whose specifics regarding depth, subdivision, duplication, and other vendor-specific tweaks lead to a more variable amount of depth and set of operations at each juncture.
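For readers who haven't seen one, a generic software illustration of the stack-based BVH traversal being discussed (the RT core's actual traversal scheme is a black box; this only shows the data-dependent, pointer-chasing access pattern that fits SIMD lanes and linear DRAM poorly):

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct AABB { float min[3], max[3]; };
struct Node {                         // flat array of nodes; leaves would reference triangles elsewhere
    AABB bounds;
    int  left = -1, right = -1;       // child indices, -1 for none
    bool isLeaf = false;
};
struct Ray { float origin[3], invDir[3]; };

// Slab test: does the ray hit the box anywhere in [0, +inf)?
static bool HitAABB(const Ray& r, const AABB& b) {
    float tmin = 0.0f, tmax = 1e30f;
    for (int a = 0; a < 3; ++a) {
        float t0 = (b.min[a] - r.origin[a]) * r.invDir[a];
        float t1 = (b.max[a] - r.origin[a]) * r.invDir[a];
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
    }
    return tmin <= tmax;
}

// Iterative traversal with an explicit stack: which node is visited next depends on
// the ray, so the memory access pattern is irregular and divergent across rays.
static int CountLeafHits(const std::vector<Node>& nodes, const Ray& ray) {
    int hits = 0;
    std::vector<int> stack = {0};           // start at the root
    while (!stack.empty()) {
        int idx = stack.back(); stack.pop_back();
        const Node& n = nodes[idx];
        if (!HitAABB(ray, n.bounds)) continue;
        if (n.isLeaf) { ++hits; continue; } // real code would intersect triangles here
        if (n.left  >= 0) stack.push_back(n.left);
        if (n.right >= 0) stack.push_back(n.right);
    }
    return hits;
}

int main() {
    // Tiny tree: a root box with two leaf children side by side.
    std::vector<Node> nodes(3);
    nodes[0].bounds = {{0, 0, 0}, {2, 1, 1}}; nodes[0].left = 1; nodes[0].right = 2;
    nodes[1].bounds = {{0, 0, 0}, {1, 1, 1}}; nodes[1].isLeaf = true;
    nodes[2].bounds = {{1, 0, 0}, {2, 1, 1}}; nodes[2].isLeaf = true;
    Ray ray{{0.5f, 0.5f, -1.0f}, {5.0f, 10.0f, 1.0f}};   // invDir = 1/dir for dir ~ (0.2, 0.1, 1)
    std::printf("leaf hits: %d\n", CountLeafHits(nodes, ray));
    return 0;
}
```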


Perhaps releasing implementations now can establish a foothold in the market if there is going to be a competition over which methods go into the next generation, and then perhaps which methods will emerge from the black box.


This could have been an interesting data point if we did have primitive shaders to run on that workload. The number is in the same region as the Vega white paper's peak NGG discard rate, although the distance from the peak figure, the PR blurbs it came from, and a different architecture running someone else's unique workload make it risky to infer much.
 

If memory doesn't fail me, AMD did show it working on Deus Ex: Mankind Divided, supposedly in a live demo during CES 2017.
It could be that the thing was working on a prototype / engineering sample, but somehow the production units came up with that portion defective.
Though the current understanding seems to be that devs simply prefer to use GPU compute culling and get similar results.
 
Devs don't "prefer" anything; they can't use primitive shaders. There is no way to use them, no API, nothing (that's what Sebastian Aaltonen said on Twitter a few months ago when I asked him whether PS was faster than his GPU compute culling code).
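For context on what "GPU compute culling" means here: a compute pass that tests each cluster of a mesh against the view frustum (and usually occlusion and backface cones too) and emits only the survivors for drawing. Below is a minimal CPU-side sketch of the frustum part, with hypothetical data; in practice this runs in a compute shader that appends surviving clusters to an indirect draw buffer.

```cpp
#include <cstdio>
#include <vector>

struct Plane  { float nx, ny, nz, d; };        // inside when nx*x + ny*y + nz*z + d >= 0
struct Sphere { float x, y, z, radius; };      // bounding sphere of one cluster

static bool SphereInsideFrustum(const Sphere& s, const Plane (&frustum)[6]) {
    for (const Plane& p : frustum) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) return false;    // completely behind one plane -> culled
    }
    return true;
}

int main() {
    // Hypothetical frustum: an axis-aligned box from -10..10 on each axis, written as 6 planes.
    Plane frustum[6] = {
        { 1, 0, 0, 10}, {-1, 0, 0, 10},
        { 0, 1, 0, 10}, { 0,-1, 0, 10},
        { 0, 0, 1, 10}, { 0, 0,-1, 10},
    };
    std::vector<Sphere> clusters = { {0, 0, 0, 1}, {50, 0, 0, 1}, {9, 9, 9, 2} };

    std::vector<int> visible;                   // compacted list of surviving cluster indices
    for (int i = 0; i < (int)clusters.size(); ++i)
        if (SphereInsideFrustum(clusters[i], frustum)) visible.push_back(i);

    std::printf("%zu of %zu clusters survive culling\n", visible.size(), clusters.size());
    return 0;
}
```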
 