Nvidia Ampere Discussion [2020-05-14]

I'm merely suggesting that if you re-formulate the algorithm using matrices (or matrix-vector math), you might make progress.
I'm afraid it's not that simple; solving nonlinear partial differential equations is quite different from linear algebra.
 
One has to wonder if fluid/smoke/physics could end up partially or fully accelerated by DNNs. Old stuff from 2016; there has to be newer research available. One more potential use for tensor cores?

Old stuff yes, and I'm not sure it will catch on. Houdini Pyro doesn't use anything like that AFAIK.
Even Nvidia's own GPU fluid simulator does not do this...
 
Solving PDEs with neural networks is still in its infancy but it's making rapid progress.

For instance this was a recent breakthrough by a Caltech and Purdue University collaboration (Anima Anandkumar, who led this project, is a Director of Research at NVIDIA):

So yes, in a not so distant future these workloads might progressively move to tensor core-like HW.
 
Getting Immediate Speedups with NVIDIA A100 TF32
November 13, 2020
TF32 is a great precision to use for deep learning training, as it combines the range of FP32 with the precision of FP16 to deliver up to 5x speedups compared to FP32 precision in the previous generation. In this post, I briefly step through the inner workings of TF32 and discuss performance data that shows its impact across an array of usages and networks.
[Chart: time-to-solution speedups with A100 TF32 vs. previous-generation FP32]

Using TF32 precision, the A100 delivers significant speedups for computer vision, speech, and language, as well as recommender system networks. The biggest speedup seen was on BERT natural language processing (NLP) networks, where TF32 brought a 5x TTS speedup.

You might notice that NVIDIA included a network called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), which is a novel pretraining method for language representations. Electra outperforms existing techniques, given the same compute budget on a wide array of NLP tasks. For computer vision networks, the TTS speedup was ~2.5x. For DLRM, a recommender system network created by Facebook, there was a ~3x TTS speedup.

In addition to the networks shown in the chart, we evaluated data across 23 different networks from the Deep Learning Examples on GitHub. All told, we saw an average TTS speedup of 2.6x across these networks. All without any code changes. For more information about performance data, see NVIDIA Data Center Deep Learning Product
...
https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/tfloat32.h contains all the details of the tf32 data type, including storage, rounding, conversion, arithmetic operations, etc.
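As a rough sketch of what that rounding amounts to (an illustration of the idea only, not the CUTLASS implementation, and ignoring NaN/Inf special cases; the helper name round_to_tf32 is made up): TF32 keeps FP32's 8-bit exponent but only 10 explicit mantissa bits, so converting an FP32 value can be pictured as rounding away the low 13 mantissa bits with round-to-nearest-even.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Illustrative only: round an FP32 value to TF32 precision (8-bit exponent,
// 10 explicit mantissa bits) by discarding the low 13 mantissa bits with
// round-to-nearest-even. Real implementations also handle NaN/Inf specially.
float round_to_tf32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));     // safe type-punning

    uint32_t round_bit = (bits >> 13) & 1u;   // LSB of the part that is kept
    bits += 0x0FFFu + round_bit;              // add half-ulp (+ tie-to-even bias)
    bits &= ~0x1FFFu;                         // clear the 13 discarded bits

    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

int main() {
    float a = 1.0009765625f;   // 1 + 2^-10: representable in TF32, unchanged
    float b = 1.00048828125f;  // 1 + 2^-11: rounded (here down to 1.0)
    std::printf("%.11f -> %.11f\n", a, round_to_tf32(a));
    std::printf("%.11f -> %.11f\n", b, round_to_tf32(b));
}
```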

https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32/
 
Solving PDEs with neural networks is still in its infancy but it's making rapid progress.


For instance this was a recent breakthrough by a Caltech and Purdue University collaboration (Anima Anandkumar, who led this project, is a Director of Research at NVIDIA):

So yes, in a not so distant future these workloads might progressively move to tensor core-like HW.

The poor PhD student felt so embarrassed by his professor that he already apologized:
"Thanks! Sorry for the high-pitched expression on Twitter.. In the paper, we are much careful about the wording. "

I read part of the paper regarding Navier-Stokes, and they try to simulate a fluid in 2D on a 64x64 grid.
One iteration step seems to take 0.005 s, as quoted.
My GPU simulator simulates a 304x304x304 grid (and also renders it at the same time in 4K) at 160 FPS on a 3090.
So 0.00625 s per iteration including rendering; without the rendering, the simulation step alone would also be around 0.005 s.
The difference: we both take about 0.005 s per iteration, but this NN method does it for 64x64 while my method does it for 304x304x304.

In other words, my method is 304^3 / 64^2 ≈ 7000x faster compared to this NN method.
Or ~7,000,000x faster than what the NN method compares itself against.
Now I expect to get 7000 times more likes than the post before :)
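Spelling out the back-of-the-envelope math above (only restating the figures quoted in this post, nothing independently measured):

```cpp
#include <cstdio>

int main() {
    // Figures exactly as quoted in the post above; nothing measured here.
    const double nn_cells  = 64.0 * 64.0;            // NN method: 2D 64x64 grid
    const double sim_cells = 304.0 * 304.0 * 304.0;  // GPU solver: 3D 304^3 grid
    const double step_s    = 0.005;                  // both take ~0.005 s per step

    // Cells advanced per second by each method.
    const double nn_rate  = nn_cells / step_s;       // ~819,200 cells/s
    const double sim_rate = sim_cells / step_s;      // ~5.6e9 cells/s

    std::printf("NN:     %.0f cells/s\n", nn_rate);
    std::printf("Solver: %.0f cells/s\n", sim_rate);
    std::printf("Ratio:  ~%.0fx\n", sim_rate / nn_rate);  // 304^3 / 64^2 = 6859, i.e. ~7000x

    // The post's 7,000,000x figure additionally multiplies by the ~1000x speedup
    // the paper claims over the solver it compares itself against (7000 x 1000).
    return 0;
}
```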
 
I believe you're missing the forest for the trees. These are early but tremendously encouraging developments. No one has claimed you should throw away your code.
Also it's a research paper that presents a new idea, not the most optimized way to perform a specific task on some given HW.
 

Tremendously encouraging developments, why? If you are a data scientist, maybe.
Some data scientists seem to think: we can solve any problem, and the cool thing is, we don't need to know anything about the problem.
Give us the billion expected outputs for a billion inputs, and we can solve the problem.
At least we can create a solution that will approximately produce the billion expected outputs for those billion inputs, and we hope that for the infinitely many other inputs it will approximately produce the right outputs too.
For some problems that can be a good approach, especially those for which there is no known exact solution, or where getting it wrong for some inputs is not a big problem.
For other problems, like fluid simulation, there are existing fast solutions that can be computed up to the exact solution.
And there it gets a bit murky, because there the idea is to replace those solutions, and also the experts who have a lot of knowledge of how to solve the problem or how to find better solutions in the future. The solutions produced by the data scientists can get it right to a certain degree, but not better than that, because of lack of data or because the network gets too big to compute. That said, I'm not against NNs: if they produce solutions better than existing ones and are fast to compute, they should be considered.
 
Tremendously encouraging developments, why?

Many things are not pursued widely because they are seen as impossible/impractical. Once someone shows something is possible, the problem changes and rapid progress can be made because the foundation is there. Often it's a different team/person who is good at doing foundational research versus taking a known approach/paper and optimizing the hell out of it for a real-world application. The 4-minute mile was impossible for a long time, but once it was broken things changed dramatically.

The viewpoint really differs if one is looking at what kind of games ship in 2022 versus looking at research and trying to guess what could be happening a few more years down the line.

One thing that keeps DNNs out of games is hardware. There is no way in hell to go deep into DNNs in games until tensor-core-like solutions are mainstream. AMD's new instructions are a nice step forward but still very slow compared to tensor cores.

One solution I can imagine in the near(ish) future is a really good material/mesh upscaling DNN. For example, store textures/meshes in lower resolution or as metadata and resolve the extra details at runtime. Something like: give me a brick/brick-wall texture and mesh here, use something low-res or metadata, and let a DNN resolve the lower-resolution texture and mesh into a much higher-resolution one at runtime. In essence, wildly trade disk space and artist time in favor of a descriptive solution where details are resolved programmatically. I suspect that if this takes off, the first indication will be some kind of build-time baking solution, and once HW allows it, it will be moved to realtime while making game install sizes much smaller.
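A minimal sketch of the data flow being described, everything here hypothetical: the Image struct and upscale2x are stand-ins, and the "upscaler" is trivial nearest-neighbour upsampling, just to mark where a DNN inference call would slot in between loading a low-res asset and handing it to the renderer.

```cpp
#include <cstdint>
#include <vector>

// Grayscale image placeholder; a real asset would be an RGBA texture or mesh data.
struct Image {
    int width = 0, height = 0;
    std::vector<uint8_t> pixels;  // width * height values
};

// Hypothetical stand-in for a DNN super-resolution pass. Kept as plain
// nearest-neighbour upsampling so the sketch stays self-contained; the point
// is only where such a call would sit in the asset pipeline.
Image upscale2x(const Image& src) {
    Image dst;
    dst.width  = src.width * 2;
    dst.height = src.height * 2;
    dst.pixels.resize(static_cast<size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            int sx = x / 2, sy = y / 2;  // sample the low-res source
            dst.pixels[static_cast<size_t>(y) * dst.width + x] =
                src.pixels[static_cast<size_t>(sy) * src.width + sx];
        }
    }
    return dst;
}

int main() {
    // Ship the asset at low resolution...
    Image lowres;
    lowres.width = 64; lowres.height = 64;
    lowres.pixels.assign(64 * 64, 128);

    // ...and resolve the detail at load/run time (DNN inference would go here).
    Image hires = upscale2x(lowres);
    // hires would then be uploaded to the GPU instead of a large on-disk texture.
}
```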
 
Tomshardware is the only review site which used Crysis Remastered in the 6800 reviews:
[Chart: Crysis Remastered benchmark results]

https://www.tomshardware.com/news/the-amd-radeon-rx-6800-xt-and-rx-6800-review

There was the question of what >30 TFLOPs of compute performance can do. I think this is a good example: Ampere has access to all these flops, but it just doesn't make sense for developers to use them...

That is a ridiculous statement, in case you are not aware.
Rendering on those graphs is slow for all cards as raytracing is on.
Look at the graphs without raytracing and see the difference; only under these conditions can the shader TFLOPs be maximised.
 
I believe you're missing the forest for the trees. These are early but tremendously encouraging developments. No one has claimed you should throw away your code.
Also it's a research paper that presents a new idea, not the most optimized way to perform a specific task on some given HW.

I really don't find them "tremendously encouraging" at all. ML is good for what it is: nested probability estimation, and generating those estimates from a dataset so people don't have to.

This "ML all the stuff!" approach though just doesn't fundamentally make any sense. If you need entirely known, perfectly predictable results, you wouldn't use an essentially statistics-based approach to begin with. It's the same reason "neural rendering" is nigh abandoned at this point already. You don't need to guess the rendering equation; you have it and need to do it as fast as possible.

Now, at the point when you have enough results, you can use statistics to get a close guess for the rest of the correlated data. Thus denoising. And probably you "denoise" a bunch of other things as well: add a bunch of "close enough" samples to a Navier-Stokes simulation after you reach a good threshold. But it's not fundamentally a good starting point for many things. That's like asking a self-driving car to go somewhere without cameras.
 
This "ML all the stuff!" approach though just doesn't fundamentally make any sense.
Well, I guess we'll have to shut down research centers, university departments and corporate research labs. No doubt there is a lot of poor work around, but there is also a flux of new results that were simply unthinkable only 5 years ago.
If you need entirely known, perfectly predictable results, you wouldn't use an essentially statistics-based approach to begin with.
No Monte Carlo rendering for you.
It's the same reason "neural rendering" is nigh abandoned at this point already. You don't need to guess the rendering equation; you have it and need to do it as fast as possible.
Neural rendering is a hot topic making progress at breakneck speed. I am not sure how you can even dream of saying it is being abandoned.

If you think people use ML/DL for rendering to guess the rendering equation you are in for a surprise. Sure, there is the odd paper here and there that pretends to know nothing about gfx but it's hardly representative of the best work.
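To make the Monte Carlo quip above concrete: path tracing estimates the rendering equation by averaging random samples, so a statistics-based method and a predictable result are not at odds, since the estimate converges as samples accumulate. A minimal, renderer-agnostic sketch of the same averaging principle (estimating a plain 1D integral, nothing tied to any engine):

```cpp
#include <cstdio>
#include <random>

// Minimal Monte Carlo estimate of the integral of x^2 over [0, 1]
// (exact value 1/3). The same averaging principle underlies path tracing:
// the estimator is random, but it converges to a well-defined answer.
int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    double sum = 0.0;
    const int n = 1000000;
    for (int i = 0; i < n; ++i) {
        double x = u(rng);
        sum += x * x;  // f(x) sampled at a uniform random point
    }
    std::printf("estimate: %f (exact: %f)\n", sum / n, 1.0 / 3.0);
}
```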
 
It does use RT cores.
Yes:

You've just mentioned 'hardware-assisted ray tracing being available for Crysis Remastered at launch'. Can you go into a bit more detail? Is this using Turing and Ampere's RT Cores, and what kind of improvement have you seen by enabling them?

[SH] We are using the Vulkan extensions to enable hardware ray tracing on NVIDIA RTX cards. This gives us a significant performance boost in the game. The differences you will see in the game are the reflections of animated objects, besides the main character, and performance.

Hardware support gives us a 5-9 ms rendering-time performance boost with ray tracing enabled. In areas where ray tracing is not 100% present, like on a wooden floor, you won't see many differences in performance between software and hardware ray tracing, but for 95% of the game, you will feel the performance benefits.

Why did you opt to go with NVIDIA's Vulkan 'VKRay' extension for ray tracing instead of using Microsoft's DXR API?

[SH] We developed the game with DX11 and our own CRYENGINE API in place. The Vulkan extension was a great fit for us to build everything on top of our current solution to improve performance.
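For reference, the 'VKRay' extension mentioned in the question is NVIDIA's VK_NV_ray_tracing. A generic sketch (not CRYENGINE code) of how an engine might check whether a device exposes it before enabling the hardware path:

```cpp
#include <cstring>
#include <vector>
#include <vulkan/vulkan.h>

// Returns true if the physical device advertises NVIDIA's VKRay extension.
// (Later engines would typically look for VK_KHR_ray_tracing_pipeline instead.)
bool supports_nv_ray_tracing(VkPhysicalDevice device) {
    uint32_t count = 0;
    vkEnumerateDeviceExtensionProperties(device, nullptr, &count, nullptr);

    std::vector<VkExtensionProperties> extensions(count);
    vkEnumerateDeviceExtensionProperties(device, nullptr, &count, extensions.data());

    for (const auto& ext : extensions) {
        if (std::strcmp(ext.extensionName, VK_NV_RAY_TRACING_EXTENSION_NAME) == 0) {
            return true;  // enable the extension at vkCreateDevice time
        }
    }
    return false;  // fall back to the software/compute ray tracing path
}
```

The device handle would come from the usual vkEnumeratePhysicalDevices enumeration during engine startup.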
 
Well, I guess we'll have to shut down research centers, university departments and corporate research labs. No doubt there is a lot of poor work around, but there is also a flux of new results that were simply unthinkable only 5 years ago.

No Monte Carlo rendering for you.

Neural rendering is a hot topic making progress at breakneck speed. I am not sure how you can even dream of saying it is being abandoned.

If you think people use ML/DL for rendering to guess the rendering equation you are in for a surprise. Sure, there is the odd paper here and there that pretends to know nothing about gfx but it's hardly representative of the best work.

The point is not to use a hammer where you need a screwdriver. Think of it this way: the way ML could solve rendering, or accurate physics simulation, very quickly would be by learning to program and then writing an efficient program to do those things, not by using a bunch of branching statistical clumps.

"This hammer is new and awesome and I bet I could use it for everything!" is an incredibly common notion for younger programmers to pick up. And they will pick that notion up, just like most all humans they'll make much the same mistakes over and over and over again because that's how new humans work. But eventually you realize the world is full of tradeoffs and there's no one tool for every job.
 
"This hammer is new and awesome and I bet I could use it for everything!" is an incredibly common notion for younger programmers to pick up. And they will pick that notion up, just like most all humans they'll make much the same mistakes over and over and over again because that's how new humans work. But eventually you realize the world is full of tradeoffs and there's no one tool for every job.

You succinctly described Javascript and Angular web development to a T.

I'm still waiting for that last part to happen, where they learn.
 
Crysis Remastered doesn't use hardware raytracing. It is a compute-only solution. And it is the only game in which Ampere has a huge advantage even over Turing:
[Chart: Crysis Remastered benchmark results]

https://www.tomshardware.com/news/nvidia-geforce-rtx-3070-founders-edition-review

A 3090 is 1.8x faster than the 2080 Ti. Normalized to die size, Ampere delivers twice the performance.

Am I missing something here? The 2080Ti and the 3070 trade blows in most games in all of the reviews - this doesn't appear to be a particular outlier performance-wise. In this case the 2080Ti is still faster than the 3070 with RT off, and a bit slower with RT on, but still in the same performance class.

It doesn't really look like an Ampere vs Turing advantage so much as it's taking advantage of the huge memory bandwidth available on the 3080/3090 with RT on. It's allowing GA102 to stretch its legs more vs GA104 than in most titles, which does make sense given that RT enabled at 4k is about as bandwidth-heavy of a scenario as you're going to find.
 