Probably doesn't have the faster int4 and int8 computation capabilities that XSX can use for ML.
It may. int4 and int8 are supported on RDNA1, but it's possible that explicit customizations to the CUs are required to support RPM (Rapid Packed Math) for int4 and int8. It supports FP16 RPM by default.
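Quick sketch of what "packed" means in RPM, assuming nothing about the actual instruction encoding: two FP16 values occupy one 32-bit register lane, so a single packed instruction operates on both halves at once.

```python
import numpy as np

# Software illustration of packed FP16 math only, not the GPU instruction.
# Values are arbitrary examples.
a = np.array([1.5, 2.25], dtype=np.float16)
b = np.array([0.5, 4.0], dtype=np.float16)

packed_a = a.view(np.uint32)[0]  # two FP16 halves packed into one 32-bit word
packed_b = b.view(np.uint32)[0]
print(hex(packed_a), hex(packed_b))

# The hardware would add both halves in one packed op; NumPy just does it
# element-wise, which is the same math at half precision.
print(a + b)  # [2.0, 6.25], still FP16
```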
What it likely doesn't have is the ML add-on to support mixed precision/weight operations. So if you're running a lot of algorithms that mix 8-, 4-, and 16-bit precision, you're not going to see the speed improvements you'd expect. This is also supported on RDNA1, but the CUs need to be explicitly customized for it, according to AMD's white paper:
"Some variants of the dual compute unit expose additional mixed-precision dot-product modes in the ALUs, primarily for accelerating machine learning inference. A mixed-precision FMA dot2 will compute two half-precision multiplications and then add the results to a single-precision accumulator. For even greater throughput, some ALUs will support 8-bit integer dot4 operations and 4-bit dot8 operations, all of which use 32-bit accumulators to avoid any overflows."
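To make the dot4 idea concrete, here's a small NumPy sketch of the semantics (not the hardware instruction itself): four int8 products summed into a 32-bit accumulator, so one instruction stands in for four separate multiply-adds.

```python
import numpy as np

def dot4(a, b, acc):
    """Emulate an int8 dot4: four 8-bit products summed into a 32-bit accumulator.
    Software illustration of the semantics only, not the hardware op."""
    a = np.asarray(a, dtype=np.int8)
    b = np.asarray(b, dtype=np.int8)
    # Widen to int32 before multiplying so the products and the sum can't overflow.
    return np.int32(acc) + np.dot(a.astype(np.int32), b.astype(np.int32))

# One dot4 does the work of four multiply-accumulates:
print(dot4([127, -128, 3, 4], [2, 2, 2, 2], acc=10))  # -> 22
```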
edit: To put it more succinctly, higher precision produces better results from a quality perspective, but it has drawbacks in computation cost and bandwidth. Lower precision gives faster computation and lower bandwidth requirements.
After you create your model at high precision, you may run a separate tool or, if the ML library you're using supports it, have it convert your model to (or automatically run it as) a mixed-precision model. What it does is look for the specific weights/layers that do not benefit from high precision and lower them, while keeping the weights that do need it at high precision. So having mixed-precision dot products is critical to supporting networks optimized in this fashion.
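As a hedged illustration of that conversion step (PyTorch here, with a toy model I made up): post-training dynamic quantization drops the Linear weights down to int8 while leaving layers that wouldn't benefit untouched.

```python
import torch
import torch.nn as nn

# A toy FP32 model standing in for whatever you trained at high precision.
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Post-training dynamic quantization: the Linear weights are stored and
# computed in int8, while layers that don't benefit (ReLU here) stay as-is.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)  # same output shape, lower-precision weights inside
```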
This is my understanding, but I've never tried it (or rather, never tested it, since I only have access to Pascal cards and not Volta). But this has been a core feature of Tensor Cores since their introduction. I think Pascal may support mixed precision as low as int8 and int16, but not int4.
I believe Ampere supports int4 mixed precision; I wasn't sure about Volta (it does not). I'm honestly not well versed in this stuff; I'd have to read a lot to see which architectures support what. But I think you get the idea here.
Nvidia's marketing materials compare full precision vs mixed precision: you lose less than ~0.25% accuracy on your network, but performance is about 3x. That's a worthwhile trade-off.
edit 2:
Just going to address one more point with respect to Nvidia's Tensor Cores vs having to do ML on general compute. Tensor Cores are specifically tuned for Deep Learning/Neural Network style machine learning; they are not all that useful for other algorithms. So you still need standard compute for all other forms of ML.
WRT DLSS >> DLSS replaces the MSAA step that would normally be handled by your GPU. During this window it runs DLSS, and the final result is handed back for the rest of the pipeline to continue. When Turing/Ampere offloads this work from the CUDA cores to the Tensor Cores, the pipeline, as I understand from reading, is stalled until DLSS is complete. So you can't really continue rendering normally until DLSS is done. You can (edit: maybe not) do async compute while waiting for DLSS to complete, since your compute is free to do whatever it wants once you've offloaded this process to the Tensor Cores.
With respect to the consoles, the only losses compared to Nvidia's solution are:
a) tensor cores power through neural networks at a much faster rate than compute
b) they can do async compute since the CUDA cores are no longer in use at this time
That's about all I can speculate. Since a neural network has a fixed run time, it's going to take a fixed number of milliseconds out of your frame. So it's fixed rendering @ resolution + a fixed NN cost for the upscale, and you only have so much time to meet 16 ms and 33 ms respectively. It makes sense for the consoles to leverage this at 33 ms, where they can take advantage of the additional time for quality improvements.
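Quick back-of-the-envelope on those budgets; the 2 ms NN cost below is a made-up placeholder, not a measured DLSS/upscaler number.

```python
# Frame budget sketch. nn_cost_ms is a hypothetical upscaler cost; it just
# shows why the 33 ms budget is more forgiving than the 16 ms one.
def budget_breakdown(frame_budget_ms, nn_cost_ms=2.0):
    remaining_ms = frame_budget_ms - nn_cost_ms        # time left to render the frame
    nn_share = 100.0 * nn_cost_ms / frame_budget_ms    # % of the frame eaten by the NN
    return remaining_ms, nn_share

for budget in (16.6, 33.3):                            # ~60 fps and ~30 fps targets
    left, share = budget_breakdown(budget)
    print(f"{budget:.1f} ms frame: {left:.1f} ms left to render, NN takes {share:.0f}%")
```

Same fixed NN cost either way, but it's a much smaller slice of the 33 ms frame, which is the extra headroom I'm talking about.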