Nvidia Turing Architecture [2018]

Variable rate shading, mesh shading and texture space shading could become quite a nice combination if they turn out to be easy to use.
Is Variable Rate Shading just Rapid Packed Math on steroids?

NVIDIA showed Wolfenstein II running with a technique called Content Adaptive Shading, which uses Variable Rate Shading to dynamically identify portions of the screen that have low detail or large swathes of similar colors, and shades those at a lower rate, and more aggressively when you're in motion. The resulting fps gain reached about 20 fps just from switching that technique on compared to leaving it off.

https://www.pcworld.com/article/330...cs/nvidia-turing-gpu-geforce-rtx-2080-ti.html
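Conceptually, the per-tile decision could look something like the sketch below. This is my own illustration, not NVIDIA's or the game's actual code; the tile size, thresholds, and function names are all made up.

```python
# Hypothetical sketch of content-adaptive shading-rate selection per screen tile.
# Thresholds, tile size, and names are invented for illustration only.

def pick_shading_rate(tile_luma, motion_px, low_detail_thresh=0.02, fast_motion_px=8.0):
    """Return a shading rate as (x, y) pixels per shaded sample for one 16x16 tile."""
    # Estimate local detail as the mean absolute luma difference between
    # horizontally adjacent pixels in the tile (a crude contrast measure).
    diffs = [abs(a - b) for row in tile_luma for a, b in zip(row, row[1:])]
    contrast = sum(diffs) / len(diffs)

    if contrast < low_detail_thresh and motion_px > fast_motion_px:
        return (4, 4)   # flat AND fast-moving: one sample per 4x4 pixels
    if contrast < low_detail_thresh or motion_px > fast_motion_px:
        return (2, 2)   # flat OR fast-moving: one sample per 2x2 pixels
    return (1, 1)       # detailed, mostly static content: full-rate shading

# A uniform sky tile that is moving quickly gets shaded at a quarter rate per axis.
flat_tile = [[0.5] * 16 for _ in range(16)]
print(pick_shading_rate(flat_tile, motion_px=12.0))  # -> (4, 4)
```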
 
I see, I thought it worked through the double-rate FP16 that Turing now supports in the consumer line.
The concept in the slides at least seems to point to shading per a given grouping of pixels (I think).

Dynamically switching between fp16/fp32 is probably not what's going on there.
 
It offloads some of the load from the CPU to the GPU to increase the number of drawn objects on screen. It also has a LOD management system that works through automatic adaptive tessellation, and it can modify and manipulate geometry on the fly, as shown in the spherical cutaway example in the white paper, where the mesh shader culls and modifies geometry based on its position relative to the sphere.
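As a rough mental model of that cutaway case (my own CPU-side sketch in Python, not the demo's actual task/mesh shader code), the coarse per-meshlet test could be as simple as:

```python
# Rough sketch of a per-meshlet cull for a spherical cutaway; the meshlet layout
# and names are hypothetical, and the real work happens in a task shader on the GPU.
import math

def meshlet_survives_cutaway(meshlet_center, meshlet_radius, cut_center, cut_radius):
    """Keep a meshlet unless its bounding sphere lies entirely inside the cut sphere."""
    d = math.dist(meshlet_center, cut_center)
    return d + meshlet_radius > cut_radius

# The task shader would launch mesh-shader workgroups only for survivors;
# meshlets straddling the cut surface still get refined triangle-by-triangle
# in the mesh shader itself.
meshlets = [((0.0, 0.0, 0.0), 0.5), ((3.0, 0.0, 0.0), 0.5)]
survivors = [m for m in meshlets
             if meshlet_survives_cutaway(m[0], m[1], (0.0, 0.0, 0.0), 1.0)]
print(len(survivors))  # -> 1: the meshlet fully inside the sphere is never drawn
```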

So while Vega's primitive shaders are focused more on accelerating existing geometry processing as a means to improve AMD's shortcomings in that area, Turing's mesh shaders build on NVIDIA's lead in geometry processing to enable more on screen, and are aimed more at enhancing its quality and flexibility.

But if you compare the diagrams, they now look the same.

And if you look at the Turing asteroids tech demo, they also talk about the huge number of polygons. That was one major aspect of Vega.

For Vega, AMD claimed that they can cull at a very early stage. Does Turing also have this feature?
Furthermore, traditional geometry pipelines discard primitives after vertex processing is completed, which can waste computing resources and create bottlenecks when storing a large batch of unnecessary attributes. Primitive shaders enable early culling to save those resources.
 
But if you compare the diagrams, they now look the same.
There is a similarity in how the pipelines go from *fixed* *programmable* *programmable* *fixed* *programmable* *programmable* *fixed* to *fixed* *programmable* *fixed* *programmable* *fixed*.
What happens at the input and output of each programmable block also changes in several parts of Nvidia's diagram. AMD's discussion of its new geometry pipeline kept the input assembler and tessellation stages as they were, which may constrain the changes somewhat by not giving it the same kinds of flexible inputs and outputs Nvidia's method can choose. It may be that it has moved portions of those intervening non-programmable functions into the combined programmable stages.
The biggest emphasis from AMD was enhanced culling, which Nvidia's section discussing its mesh and task shaders didn't really emphasize. Granted, the apparent flexibility could mean that the minimum promised by primitive shaders could be handled as well.

And if you look at the Turing asteroids tech demo, they also talk about the huge number of polygons.
The asteroid demo's change was having the front-end shader select a different variant of the model based on how much detail was actually necessary, not reading in and then culling out non-contributing triangles with extra shader code on top of the existing shaders. The primitive shader is taking orders from a standard draw call where the decision making was done earlier, not selecting different models on the fly.
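Something like the following is all that "decision making" needs to be: pick a pre-built LOD variant from the projected size of the asteroid's bounding sphere. This is a back-of-envelope sketch with my own numbers and names, not the demo's code.

```python
# Hypothetical LOD pick from projected screen size; constants are illustrative only.
import math

def pick_lod(bound_radius, distance, fov_y_rad, screen_height_px, lod_count=5):
    """Return an LOD index: 0 = full-detail mesh, lod_count-1 = coarsest variant."""
    # Approximate on-screen height of the bounding sphere, in pixels.
    projected_px = (bound_radius / (distance * math.tan(fov_y_rad * 0.5))) * screen_height_px
    # Drop roughly one LOD level each time the on-screen size halves.
    lod = int(math.log2(max(screen_height_px / max(projected_px, 1e-3), 1.0)))
    return min(lod, lod_count - 1)

fov = math.radians(60)
print(pick_lod(1.0, distance=2.0,   fov_y_rad=fov, screen_height_px=1080))  # near    -> 0
print(pick_lod(1.0, distance=200.0, fov_y_rad=fov, screen_height_px=1080))  # distant -> 4
```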


Other changes indicate some tweaks to the other stages like the hand-off from rasterizer to pixel shader stage, where coverage and shading can be varied on what appears to be a granularity close to a rasterization tile rather than whole swaths of screen space.
The texture-space shading, in combination with the other changes, reminds me of some thoughts I had about VR rendering a while back, where the quick and dirty way of scaling performance was to have a GPU per eye. A modified rendering scheme that borrowed some of the reconstruction techniques leveraging temporal coherence might be used to skip duplicated shading work on objects too far from the viewer to show a significant difference in angle or contribution between the eyes. Further, with variable rate and foveated rendering, more work could be skipped given the limited region in which the human visual system can infer depth or perceive significant detail.
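A toy illustration of that reuse idea (nothing to do with NVIDIA's actual API, just the caching concept): shade an object's visible texels once into a texture-space cache, then let the second eye, or the next frame while the result is still temporally valid, sample the cache instead of reshading.

```python
# Toy texture-space shading cache; the object/texel keys and the shading function
# are placeholders for what would really be a GPU texture and a material shader.

shading_cache = {}  # (object_id, texel) -> shaded colour

def shade_texel(object_id, texel):
    return (0.5, 0.5, 0.5)  # stand-in for an expensive material evaluation

def render_view(object_id, visible_texels):
    out = {}
    for texel in visible_texels:
        key = (object_id, texel)
        if key not in shading_cache:          # shade only on a cache miss
            shading_cache[key] = shade_texel(object_id, texel)
        out[texel] = shading_cache[key]
    return out

left  = render_view("statue", {(10, 12), (10, 13)})   # shades two texels
right = render_view("statue", {(10, 12), (10, 13)})   # pure cache hits for eye #2
print(len(shading_cache))  # -> 2 shading evaluations instead of 4
```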

Some of these shortcuts don't work as well if the system cannot account for the viewer's eyes moving, however.


Perhaps I missed it, but is there some detail in the whereabouts of the polymorph blocks? (edit: never mind, found them after squinting at the block diagram)
 
Regarding the part about mesh shaders in that Hexus article: "Running on older hardware, though possible, would require a multi-pass compute shader to be used, negating the benefits entirely."

isn't that bs? aren't there games that do exactly that already?
 
Most of the articles on the architecture are BS, not surprisingly. "Variable rate shading"? You mean the recent F*ing software paper on that, which has shit all to do with hardware? Yeah, thanks for including that in a hardware paper. Why is the "white paper" filled with PR bullshit? I'm just trying to read about your computer architecture, guys; keep it out of the hands of the god damned PR people.

That being said, there's some cleverness here. The restructured low-level cache seems like a good idea and a straight-up win. Depending on the separate INT cores' actual silicon area, though, it may not justify their stated max throughput improvement of 36%; what size they are just isn't known. I'd also question the placement of their tensor cores in the same SM as the FP/INT compute. A huge amount of the energy used by inferencing comes from memory shuttling, which is why inferencing-specific chips have huge local caches, far bigger than those usually needed for other sorts of GPU tasks. It still feels like a compromise between the older training purpose of tensor cores and what games would use them for in the near future, which is solely inferencing. If they're really diverging gaming and high-performance compute, then shoving the cores into their own block with their own cache structure could be a bigger win.

From that paper, that's all I can really conclude. There's a lot of shit, and not a lot of info. Fortunately Anandtech has done a proper job, and shows that the CUDA cores are responsible for BVH construction. How this is done isn't shown, and I wonder what the performance is; too many moving parts could quickly bottleneck any game. Regardless, one of the biggest things is how big things are. For reference, a GTX 1080 (GP104) is a mere 341 mm^2, while the 2080 is a massive 545 mm^2, and that's on the smaller 12 nm process. Assuming the linked leaks are real, performance per mm^2 has, uhmm, gone down since Pascal. To even equal Pascal, going from a 1080 to a 2080 would need roughly a 60% performance increase; the actual increase is more along the lines of 40-45%. Keep in mind that's not taking into account that there are more transistors per mm^2 with the improved process. Right now, considering talented programmers can get the same level of raytracing performance out of a 1080 Ti as Nvidia claims can come out of their new RTX cards, well, consider me unimpressed.
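Spelling out that arithmetic (die sizes as quoted above; the uplift figures are the leaked numbers under discussion, not measurements):

```python
gp104_mm2, tu104_mm2 = 341.0, 545.0
area_growth = tu104_mm2 / gp104_mm2              # ~1.60x the area -> need ~60% more perf
for uplift in (1.40, 1.45, 1.60):
    print(f"+{uplift - 1:.0%} perf -> perf/mm^2 vs GP104: {uplift / area_growth:.2f}")
# +40% -> 0.88, +45% -> 0.91, +60% -> 1.00 (break-even)
```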
 
Even if the new RTX hardware were only twice as fast as the fastest possible GPU raytracer, that would still be good.
Published results for GPU raytracers' primary rays are in the range of 1-3 Grays/s on a Pascal GPU with scenes of ~1 million polygons. Turing is doing 12 Grays/s on the Stanford Buddha (100K polygons, if I'm not mistaken) on page 33 of the white paper. On the other hand, the bounding box of the Buddha model covers only about a third of the screen, whereas the scenes in those published results cover the whole screen. Taking that into account, that would be ~2 Grays/s for Pascal compared to ~4 Grays/s for Turing.
The question is whether primary rays are even relevant, as they can be handled much faster by rasterizing. Secondary rays can be much more incoherent, thrashing caches, but those also improved a lot on Turing.
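The normalization I'm doing, spelled out (the one-third screen coverage is my own eyeball estimate, not a measured figure):

```python
turing_buddha_grays = 12.0     # white paper figure for the Buddha scene
buddha_screen_coverage = 1 / 3 # bounding box covers roughly a third of the screen
pascal_fullscreen_grays = 2.0  # middle of the published 1-3 Grays/s range

turing_fullscreen_est = turing_buddha_grays * buddha_screen_coverage
print(turing_fullscreen_est)                            # ~4 Grays/s full-screen equivalent
print(turing_fullscreen_est / pascal_fullscreen_grays)  # roughly 2x Pascal
```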
 
That being said, there's some cleverness here. The restructured low-level cache seems like a good idea and a straight-up win. Depending on the separate INT cores' actual silicon area, though, it may not justify their stated max throughput improvement of 36%; what size they are just isn't known.
[…]
I'd also question the placement of their tensor cores in the same SM as the FP/INT compute. A huge amount of the energy used by inferencing comes from memory shuttling, which is why inferencing-specific chips have huge local caches, far bigger than those usually needed for other sorts of GPU tasks.
[…]
Assuming the linked leaks are real, performance per mm^2 has, uhmm, gone down since Pascal.
I was entertaining the thought that dedicated INT32 cores might consume less energy doing their INT32 work than having to shove it through the FP32 pipe. Welcome to the world of energy over space. I might be wrong, but further idle thoughts led me to consider the idea that Turing is a contingency plan for 7 nm not being ready in time, not being available in large enough volume, or not living up to expectations energy-wise. That would be supported both by the separation of cores being not solely for a performance increase but also for energy efficiency, and by the immense amount of chip area invested in consumer products. It would also explain why no one outside of Nvidia had ever heard of Turing until a couple of months ago. Maybe it was intended to be Ampere at 7 nm with ~40% less die space.

I find it amusing that after the 2006/2007 rally of unifying everything on the chip, we are now back to dedicated units for a lot of special compute cases again. Not counting rasterizers, TMUs etc., which have been there all along, we now have:
• SMs with shader ALUs ("CUDA cores"), including:
    • FP32 groups
    • INT32 groups
    • FP64 groups
    • Tensor groups
• RT "cores" (BVH traversal / triangle intersection checks)
• L/S units + TMUs
 
I was entertaining the thought that dedicated INT32 cores might consume less energy doing their INT32 work than having to shove it through the FP32 pipe. Welcome to the world of energy over space. I might be wrong, but further idle thoughts led me to consider the idea,

I think that’s a very reasonable idea. It all depends on the complexity of the pipeline. It could also be to shorten timing paths in the pipeline stages to allow for higher clock speeds.
 
In addition to the DLSS capability described above, which is the standard DLSS mode, we provide a second mode, called DLSS 2X. In this case, DLSS input is rendered at the final target resolution and then combined by a larger DLSS network to produce an output image that approaches the level of the 64x super sample rendering

Why isn’t DLSS 2x the standard mode? Seems a little bizarre to push a lower resolution render that just matches TAA instead of the mode that actually improves IQ.

Hopefully the larger network requirement doesn’t limit the availability of the 2x mode.
 
Why isn’t DLSS 2x the standard mode? Seems a little bizarre to push a lower resolution render that just matches TAA instead of the mode that actually improves IQ.
DLSS 2X doesn't give big increases in fps. It comes at the cost of some performance.
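A rough pixel-count view of the difference (the 1440p input for a 4K target is my assumption about what "lower resolution" means here, not a confirmed figure):

```python
target_px   = 3840 * 2160   # DLSS 2X shades every pixel of the final resolution
standard_px = 2560 * 1440   # standard DLSS shades at a lower resolution, then upscales
print(standard_px / target_px)  # ~0.44 -> standard mode shades ~56% fewer pixels per frame
```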
I guess he is talking about distance field RT, or some kind of voxel cone tracing, or sphere tracing. Obviously he doesn't realise there is a difference once we're talking about tracing polygon soup, which is what current games are.
Exactly.

Here is what Sebbi had to say about RTX (hardware BVH) and his cone tracing implementation:


 
Here is what Sebbi had to say about RTX (hardware BVH) and his cone tracing implementation:



If you want to compare with real ray tracing for voxel based rendering, there are some videos here.
The bottom Colon video renders at 4K, with 16 views covering the whole screen, at 2 Grays/s on a Pascal GP102 GPU (only 100 Mrays/s on an 8-core CPU, with most of the time spent on trilinear voxel interpolation, obviously). No BVH is used, as that would be far from optimal and also not flexible enough to handle transfer function changes. I can create some extra videos.
 
Regardless, one of the biggest things is how big things are. For reference, a GTX 1080 (GP104) is a mere 341 mm^2, while the 2080 is a massive 545 mm^2, and that's on the smaller 12 nm process. Assuming the linked leaks are real, performance per mm^2 has, uhmm, gone down since Pascal. To even equal Pascal, going from a 1080 to a 2080 would need roughly a 60% performance increase.
I don't think it's really a valid comparison with all the extra die space dedicated to Tensor and RT cores, since comparing to Pascal you're referring to perf/mm2 in standard rasterized games. If you discount the die space for the additional hardware, the uplift is probably the expected amount.
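To make the "discount the extra hardware" point concrete (the 20% of die area for tensor + RT hardware below is purely hypothetical; nobody outside NVIDIA has published the real split):

```python
tu104_mm2, gp104_mm2 = 545.0, 341.0
hypothetical_extra_fraction = 0.20                                # made-up tensor + RT share
raster_only_mm2 = tu104_mm2 * (1 - hypothetical_extra_fraction)   # ~436 mm^2
for uplift in (1.40, 1.45):
    print(f"+{uplift - 1:.0%}: perf/mm^2 vs GP104 = {uplift / (raster_only_mm2 / gp104_mm2):.2f}")
# With that assumed discount, +40-45% comes out slightly above parity per mm^2.
```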

Alternatively, take a game with RT and DLSS support, using TAA-level image quality for comparison, and the perf/mm2 is probably huge for Turing.

Yes it starts getting harder now to simply compare perf/mm2 since there's a lot of hardware in there sitting idle currently. I just don't think it's directly comparable anymore, at least until next gen Turing.
 