Nvidia Turing Architecture [2018]

Deleted member 2197 · Sep 13, 2018

New Tesla T4 GPU and New TensorRT Software Enable Intelligent Voice, Video, Image and Recommendation Services
September 12, 2018

NVIDIA today launched an AI data center platform that delivers the industry’s most advanced inference acceleration for voice, video, image and recommendation services.

Delivering the fastest performance with lower latency for end-to-end applications, the platform enables hyperscale data centers to offer new services, such as enhanced natural language interactions and direct answers to search queries rather than a list of possible results.
...
To optimize the data center for maximum throughput and server utilization, the NVIDIA TensorRT Hyperscale Platform includes both real-time inference software and Tesla T4 GPUs, which process queries up to 40x faster than CPUs alone.
NVIDIA estimates that the AI inference industry is poised to grow in the next five years into a $20 billion market.

The NVIDIA TensorRT Hyperscale Inference Platform features NVIDIA Tesla T4 GPUs based on the company’s breakthrough NVIDIA Turing™ architecture and a comprehensive set of new inference software.
Key elements include:

NVIDIA Tesla T4 GPU – Featuring 320 Turing Tensor Cores and 2,560 CUDA® cores, this new GPU provides breakthrough performance with flexible, multi-precision capabilities, from FP32 to FP16 to INT8, as well as INT4. Packaged in an energy-efficient, 75-watt, small PCIe form factor that easily fits into most servers, it offers 65 teraflops of peak performance for FP16, 130 teraflops for INT8 and 260 teraflops for INT4.
NVIDIA TensorRT 5 – An inference optimizer and runtime engine, NVIDIA TensorRT 5 supports Turing Tensor Cores and expands the set of neural network optimizations for multi-precision workloads.
NVIDIA TensorRT inference server – This containerized microservice software enables applications to use AI models in data center production. Freely available from the NVIDIA GPU Cloud container registry, it maximizes data center throughput and GPU utilization, supports all popular AI models and frameworks, and integrates with Kubernetes and Docker.

https://nvidianews.nvidia.com/news/...form-to-fuel-next-wave-of-ai-powered-services

Pressure · Sep 13, 2018

8.1 TFlops (FP32) at 75 Watt, not bad.
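
For anyone who wants to sanity-check that figure against the press release numbers, here is a rough back-of-the-envelope sketch. It assumes each CUDA core retires one fused multiply-add (counted as 2 FLOPs) per clock; that factor and the resulting clock are my own inference, not numbers from the announcement.

```python
# Back-of-the-envelope check of the Tesla T4 figures quoted in this thread.
CUDA_CORES = 2560        # from the press release
FP32_TFLOPS = 8.1        # figure quoted in the post above
BOARD_POWER_W = 75       # from the press release
FLOPS_PER_CORE_PER_CLOCK = 2  # ASSUMPTION: one FMA per core per clock = 2 FLOPs

# Tensor-core peaks from the press release (the release labels all three "teraflops").
TENSOR_PEAKS = {"FP16": 65, "INT8": 130, "INT4": 260}


def implied_clock_ghz(tflops, cores, flops_per_clock=FLOPS_PER_CORE_PER_CLOCK):
    """Clock speed that would be needed to reach `tflops` under the FMA assumption."""
    return tflops * 1e12 / (cores * flops_per_clock) / 1e9


def gflops_per_watt(tflops, watts):
    """Peak throughput per watt of board power."""
    return tflops * 1000 / watts


clock = implied_clock_ghz(FP32_TFLOPS, CUDA_CORES)
efficiency = gflops_per_watt(FP32_TFLOPS, BOARD_POWER_W)
print(f"implied clock: {clock:.2f} GHz")
print(f"FP32 efficiency: {efficiency:.0f} GFLOPS/W")

# Each halving of precision doubles the quoted tensor peak.
names = list(TENSOR_PEAKS)
for prev, cur in zip(names, names[1:]):
    print(f"{cur} / {prev} = {TENSOR_PEAKS[cur] / TENSOR_PEAKS[prev]:.1f}x")
```

Under that assumption the quoted 8.1 TFLOPS works out to a clock of roughly 1.58 GHz and about 108 GFLOPS per watt at FP32.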

jlippo · Sep 14, 2018

Whitepaper is out.
https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf

Rootax · Sep 14, 2018

Already a video up at Gamers Nexus

Bludd · Sep 14, 2018

I see Ryan Smith is piling the pressure on Nate over at Anandtech

https://twitter.com/x/status/1040587085567029248


Digidi · Sep 14, 2018

If I read the chapter on mesh shading in the whitepaper, it sounds the same as AMD's primitive shaders and NGG...

https://patents.google.com/patent/EP3300027A1
https://patents.google.com/patent/US20180082399A1/en

Voxilla · Sep 14, 2018

jlippo said:
Whitepaper is out.
https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf

There is some more clarity on the Gigarays/s; at least it's not for a single on-screen triangle.

DavidGraham · Sep 14, 2018

Digidi said:
If I read the chapter on mesh shading in the whitepaper, it sounds the same as AMD's primitive shaders and NGG...

Actually it's a bit more than that. But they are similar in that it needs developer support to work.

Voxilla said:
There is some more clarity on the Gigarays/s; at least it's not for a single on-screen triangle.

Yup. We have 6 benchmarks now to compare Pascal to Turing.

Pinstripe · Sep 14, 2018

This is what the whitepaper says about DX12 Tier levels (page 54):

RESOURCE MANAGEMENT AND BINDING MODEL
DX12 introduced the ability to allow resource views to be directly accessed by shader programs
without requiring an explicit resource binding step. Turing extends our resource support to
include bindless Constant Buffer Views and Unordered Access Views, as defined in Tier 3 of
DX12’s Resource Binding Specification.
Turing’s more flexible memory model also allows for multiple different resource types (such as
textures and vertex buffers) to be co-located within the same heap, simplifying aspects of
memory management for the app. Turing supports Tier 2 of resource heaps.

So I suppose this makes it now equal to Vega?

Digidi · Sep 14, 2018

DavidGraham said:
Actually it's a bit more than that. But they are similar in that it needs developer support to work.

Yup. We have 6 benchmarks now to compare Pascal to Turing.

Where do you see that it is more than Vega?

DavidGraham · Sep 14, 2018

Digidi said:
Where do you see that it is more than Vega?

It offloads some of the load from the CPU to the GPU to increase the number of objects drawn on screen. It also has a LOD management system that works through automatic adaptive tessellation. It can also modify and manipulate geometry on the fly, as shown in the Spherical Cutaway example in the whitepaper, where the mesh shader culls and modifies geometry based on its position relative to the sphere.

So while Vega's primitive shaders are focused more on accelerating current geometry processing as a means to address AMD's shortcomings in that area, Turing's mesh shaders build on NVIDIA's lead in geometry processing to enable more stuff on screen and are aimed more at enhancing its quality and flexibility.

Kaotik · Sep 14, 2018

Pinstripe said:
This is what the whitepaper says about DX12 Tier levels (page 54):
So I suppose this makes it now equal to Vega?

If they also added Stencil Reference Value from Pixel Shader then yes, otherwise no.

Deleted member 2197 · Sep 14, 2018

The all-new Mesh Shader takes a lot of the load off the CPU by moving the LOD calculation - just how much detail each object must have based on its distance from the viewer - and object culling over to a new intermediate step called the Task Shader. It effectively replaces the vertex and hull shaders of the traditional pipeline that are tasked with generating triangles/work.

In simple terms, the Task Shader generates the triangles and the Mesh Shader shades them. Nothing overly new there, as various shaders already exist to do that job, but the key is that the Task Shader can handle multiple objects rather than just one per traditional draw call per CPU. Helpful for games that run older versions of DirectX that are poor at pushing out draw calls? Most likely.

The point is that these two new shaders offer more flexibility than the standard pipeline. It makes most sense for pre-DX12 titles, you would think, but it will be interesting to see it working in practise.

Don't expect to see them on pre-Turing GPUs, either, and the reason for this is the way in which these two shaders interface with the pipeline, as seen above. Running on older hardware, though possible, would require a multi-pass compute shader to be used, negating the benefits entirely.

https://hexus.net/tech/reviews/grap...g-architecture-examined-and-explained/?page=6
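
To make the division of labour discussed above a bit more concrete, here is a minimal CPU-side sketch in Python. It is not shader code and not NVIDIA's implementation: the `Meshlet` class, the `lod_step` distance rule, and the function names are all hypothetical. It only illustrates the idea from the whitepaper discussion that an early stage picks a LOD per chunk of geometry and culls chunks (here, against a cutaway sphere) before any triangles are produced.

```python
import math
from dataclasses import dataclass


@dataclass
class Meshlet:
    """Hypothetical chunk of geometry with a bounding sphere and per-LOD triangle counts."""
    center: tuple          # bounding-sphere centre (x, y, z)
    radius: float          # bounding-sphere radius
    lod_triangles: tuple   # triangle count per LOD, finest first


def select_lod(meshlet, camera, lod_step=10.0):
    """Pick a coarser LOD as distance grows: one level per `lod_step` units (assumed rule)."""
    level = int(math.dist(meshlet.center, camera) // lod_step)
    return min(level, len(meshlet.lod_triangles) - 1)


def inside_cutaway(meshlet, cut_center, cut_radius):
    """True when the meshlet's bounding sphere lies entirely inside the cutaway sphere."""
    return math.dist(meshlet.center, cut_center) + meshlet.radius <= cut_radius


def task_stage(meshlets, camera, cut_center, cut_radius):
    """Cull meshlets and choose LODs; returns (meshlet index, lod) work items."""
    work = []
    for index, meshlet in enumerate(meshlets):
        if inside_cutaway(meshlet, cut_center, cut_radius):
            continue  # fully inside the cutaway: nothing to draw
        work.append((index, select_lod(meshlet, camera)))
    return work


def mesh_stage(meshlets, work):
    """Stand-in for triangle emission: total triangles the surviving work would produce."""
    return sum(meshlets[index].lod_triangles[lod] for index, lod in work)


scene = [
    Meshlet((0, 0, 5), 1.0, (64, 16, 4)),    # inside the cutaway sphere
    Meshlet((0, 0, 12), 1.0, (64, 16, 4)),   # mid distance
    Meshlet((0, 0, 100), 1.0, (64, 16, 4)),  # far away
]
work = task_stage(scene, camera=(0, 0, 0), cut_center=(0, 0, 5), cut_radius=3.0)
triangles = mesh_stage(scene, work)
print(work, triangles)
```

In this toy scene the first meshlet is culled, the second is drawn at the middle LOD and the third at the coarsest, so only 20 of the 192 finest-LOD triangles would be emitted. The point is just that both decisions happen per chunk on the GPU side rather than per draw call on the CPU.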

Bludd · Sep 14, 2018

https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive

Ryan Smith · Sep 14, 2018

Bludd said:
https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive

And it's not going to have everything everyone wanted. But Nate is juggling this and benchmarking, so it's a whole lot of plates to spin at once.

Bludd · Sep 14, 2018

Ryan Smith said:
And it's not going to have everything everyone wanted. But Nate is juggling this and benchmarking, so it's a whole lot of plates to spin at once.

The plates look good tho

Malo · Sep 14, 2018

Ryan Smith said:
And it's not going to have everything everyone wanted. But Nate is juggling this and benchmarking, so it's a whole lot of plates to spin at once.

Thanks Ryan. I'm really interested in DLSS and exactly how it all works from start to finish. The whole concept is baffling to me.

The idea of sampling a particular scene at 64x supersampling to determine the best subsample position for that particular frame makes sense. So for final real-time rendering you know ahead of time which subpixel position is best when sampling pixels for geometry that requires it. And it's a concept that works fine with their Infiltrator demo, as it's on rails: you're using fixed camera positions for every frame. Where the concept breaks down is for a game where the camera position, scene geometry, shading, post-effects etc. are all unknown at any point in time.

Do they have monkeys playing a game for days on end at 64x supersampling to create those ground truth reference images? Are game developers required to create a hook for their DNN to then "play" the game at all possible scenes and camera positions? Even if that concept is plausible (the data processing and image requirements must be enormous for a single game), what then becomes of all that information? What is then created for use by the tensor cores in the game in real time, and how is that stored in the driver? I just don't know enough about AI training and inferencing to begin to understand how this works for AA in real-time games.

silent_guy · Sep 14, 2018

@Malo
I think you’re overthinking this a little bit.

Here’s an example about deep learning that I thought very interesting, and that might apply for DLSS.

I was looking at a hobby project: license plate recognition. Some other guy had done the same thing. He had gathered a bunch of pictures with license plates around town and trained his network to detect them.

It worked great.

And then he tried it on pictures on the web, and it didn’t work very well in some cases.

Turned out that the network had become very good at recognizing license plates with a particular font for his country, but not ones where the characters were curved a bit differently. It wasn’t that it didn’t work at all, but the results weren’t as good as they could be.

With a generic DLSS network, you’d expect similar behavior: it’d do an overall decent job, but it wouldn’t be tuned to the particular visual/artistic mood of the game. Or the camera perspective, etc.

So when developers submit in-game screenshots, you can improve the network to behave better.

That doesn’t mean that you need screenshots everywhere in the game, just like Google doesn’t need pictures of all the traffic signs in the world to recognize them. Neural nets are excellent at coming up with good results as long as they are similar enough to what they have been trained for.

It’s just an incremental improvement to get that last extra bit of quality.

Kaotik · Sep 14, 2018

Malo said:
Thanks Ryan. I'm really interested in DLSS and exactly how it all works from start to finish. The whole concept is baffling to me.

The idea of sampling a particular scene at 64x supersampling to determine the best subsample position for that particular frame makes sense. So for final real-time rendering you know ahead of time which subpixel position is best when sampling pixels for geometry that requires it. And it's a concept that works fine with their Infiltrator demo, as it's on rails: you're using fixed camera positions for every frame. Where the concept breaks down is for a game where the camera position, scene geometry, shading, post-effects etc. are all unknown at any point in time.

Do they have monkeys playing a game for days on end at 64x supersampling to create those ground truth reference images? Are game developers required to create a hook for their DNN to then "play" the game at all possible scenes and camera positions? Even if that concept is plausible (the data processing and image requirements must be enormous for a single game), what then becomes of all that information? What is then created for use by the tensor cores in the game in real time, and how is that stored in the driver? I just don't know enough about AI training and inferencing to begin to understand how this works for AA in real-time games.

Here's Tech Report's take on it:

Many game developers are hopping on the bandwagon for Nvidia's Deep Learning Super Sampling, or DLSS, technology. Nvidia describes DLSS as a replacement for temporal anti-aliasing, a technique that combines multiple frames by determining motion vectors and using that data to sample portions of the previous frame. Nvidia notes that despite the common use of temporal AA, it remains a difficult technique for developers to effectively employ. For my part, I've never enjoyed the apparent blur that TAA seems to add to the edges of objects in motion.

To attack some of the limitations of TAA, Nvidia took its extensive experience using deep learning to recognize and process images and applied it to games. DLSS depends on a trained neural network that's exposed to a large number of "ground truths," perfect or near-perfect representations of what in-game scenes should look like via 64x supersampling. Once the model is sufficiently trained on those images, Turing cards can use it to render scenes "at a lower input sample count," according to Nvidia, and then infer what the final scene should look like at its target resolution. Nvidia says DLSS offers similar image quality to TAA with half the shading work.

https://techreport.com/review/34095/popping-the-hood-on-nvidia-turing-architecture/2

The way I read the whitepaper, they actually render at a lower resolution than the one set, then use a few of the rendered frames and the "ground truths" to estimate anti-aliased frames at the target resolution, at about TAA level of quality (contrary to the whitepaper, people who saw the Infiltrator demo live said DLSS doesn't match TAA, not in that demo anyway).
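
As a rough illustration of what a "64x supersampled ground truth" means in the discussion above, here is a toy Python sketch. The `shade` function is a made-up scene (a single hard diagonal edge) and the box filter is my own simplification; none of this reflects how NVIDIA actually generates its training images. It only shows the difference between one sample per pixel and an 8 x 8 grid of 64 samples averaged per pixel.

```python
def shade(x, y):
    """Hypothetical scene: white above the line y = 0.5 * x, black below it."""
    return 1.0 if y > 0.5 * x else 0.0


def render(width, height, grid=1):
    """Render with grid * grid evenly spaced samples per pixel, averaged (box filter)."""
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            total = 0.0
            for sy in range(grid):
                for sx in range(grid):
                    # Sample positions are centred within each sub-cell of the pixel.
                    x = px + (sx + 0.5) / grid
                    y = py + (sy + 0.5) / grid
                    total += shade(x, y)
            row.append(total / (grid * grid))
        image.append(row)
    return image


aliased = render(4, 4, grid=1)     # 1 sample per pixel: every pixel is pure black or white
reference = render(4, 4, grid=8)   # 8 x 8 = 64 samples per pixel: edge pixels get partial coverage

for row in reference:
    print(" ".join(f"{value:.2f}" for value in row))
```

The single-sample image only ever contains 0.0 or 1.0, which is the stair-stepping you see as aliasing, while the 64-sample image has intermediate values along the edge. A network trained on pairs like these would be learning to predict the second kind of image from cheaper inputs, which is the part the posts above are speculating about.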

jlippo · Sep 14, 2018

NVIDIA Turing Architecture In-Depth
https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/

Variable rate shading, mesh shading and texture space shading may become quite a nice combination if they are easy to use.
