Machine Learning: WinML/DirectML, CoreML & all things ML

Sounds like what Xbox was talking about, the thing they said they were working on.
Everyone thought it was image reconstruction.
But we haven't heard anything about it since, so who knows if they're still working on it.
 
December 15, 2023
Clearly, this look at FP16 compute doesn't match our actual performance much at all. That's because optimized Stable Diffusion implementations will opt for the highest throughput possible, which doesn't come from GPU shaders on modern architectures. That brings us to the Tensor, Matrix, and AI cores on the various GPUs.
...
It's interesting to see how the above chart showing theoretical compute lines up with the Stable Diffusion charts. The short summary is that a lot of the Nvidia GPUs land about where you'd expect, as do the AMD 7000-series parts. But the Intel Arc GPUs all seem to get about half the expected performance — note that my numbers use the boost clock of 2.4 GHz rather than the lower 2.0 GHz "Game Clock" (which is a worst-case scenario that rarely comes into play, in my experience).
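For reference, here's a rough sketch of how those theoretical shader-FP16 numbers are usually derived. The Arc A770 ALU count and the 2:1 FP16 rate below are my assumptions for illustration, not figures from the article:

```python
# Rough sketch: theoretical throughput = ALU count x FLOPs per ALU per clock x clock.
# The A770 figures (4096 FP32 ALUs, 2:1 packed FP16 rate) are assumptions for illustration.

def theoretical_tflops(alus: int, flops_per_clock: float, clock_ghz: float) -> float:
    """Peak TFLOPS = ALUs * FLOPs per ALU per clock * clock (GHz) / 1000."""
    return alus * flops_per_clock * clock_ghz / 1000.0

# FP32: one FMA (2 FLOPs) per ALU per clock at the 2.4 GHz boost clock.
fp32 = theoretical_tflops(4096, 2.0, 2.4)   # ~19.7 TFLOPS
# FP16 on the shaders at a 2:1 packed-math rate doubles that.
fp16 = theoretical_tflops(4096, 4.0, 2.4)   # ~39.3 TFLOPS

print(f"Theoretical FP32: {fp32:.1f} TFLOPS, FP16: {fp16:.1f} TFLOPS")
```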
 
So I can haz a question: take Nvidia DLSS (Deep Learning Super Sampling) for example. From what I understand, my PC is using an algorithm created on Nvidia supercomputers using ML,
but when, if ever, will the learning take place on my computer? (e.g. I use some future version of DLSS and the more I use it, the better it gets, because it's learning?)
 
So I can haz a question: take Nvidia DLSS (Deep Learning Super Sampling) for example. From what I understand, my PC is using an algorithm created on Nvidia supercomputers using ML,
but when, if ever, will the learning take place on my computer? (e.g. I use some future version of DLSS and the more I use it, the better it gets, because it's learning?)
Never, not with current technology anyway. It doesn't learn on your computer; it gets taught by NVIDIA and you just run it.
 
You would have to run the software in "learning" mode, which in the case of DLSS presumably means rendering something like 64x SSAA as ground truth. Do you want to do that?
 
So I can haz a question: take Nvidia DLSS (Deep Learning Super Sampling) for example. From what I understand, my PC is using an algorithm created on Nvidia supercomputers using ML,
but when, if ever, will the learning take place on my computer? (e.g. I use some future version of DLSS and the more I use it, the better it gets, because it's learning?)
If your question is only about DLSS: not currently, but no one knows what the direction will be over the next 5-10 years. I wouldn't be surprised to see LLMs running locally on automobiles, making minute changes based on real-time data, so why not the PC?

If you are talking about "training" in general on your PC and have an RTX card, you can already run local LLMs against data residing on your PC today.
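For anyone curious what the local-inference side of that looks like today, here's a minimal sketch using Hugging Face transformers on an NVIDIA GPU; the model name is only an example, and you'd swap in whatever fits your card's VRAM:

```python
# Minimal local LLM inference sketch with Hugging Face transformers on an RTX card.
# The model name is just an example; pick one that fits your VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, ~14 GB in FP16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halve memory vs FP32
    device_map="cuda",          # place the weights on the GPU
)

prompt = "Summarize the difference between training and inference in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```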
 
Games are applications where the user expects predictable results, which isn't something you get from a learning NN, where results may differ because of said learning - and not always in a good way.
 
Currently, the problem with training locally is not just that it's very costly, but also that it can be hard to do quality control on the training results.
Using DLSS as an example: the models NVIDIA trained using their supercomputers have to be verified before shipping. That is, they'll run tests on many games to make sure that the new training results are still good. Otherwise, you could have something that's good with the new training data but somehow worse in other scenarios.
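As a toy illustration of what that kind of "verify before shipping" gate boils down to (the metric choice and threshold here are my assumptions, not how NVIDIA actually does it):

```python
# Toy regression gate for a retrained upscaling model: compare its output frames
# against reference frames for a suite of test scenes and refuse to ship if
# quality drops anywhere. PSNR and the 0.25 dB margin are illustrative choices.
from skimage.metrics import peak_signal_noise_ratio

def passes_regression(new_frames, ref_frames, baseline_psnr, margin_db=0.25):
    """Reject the new model if any test scene scores worse than the
    previously shipped model by more than margin_db decibels."""
    for scene, (new, ref) in enumerate(zip(new_frames, ref_frames)):
        psnr = peak_signal_noise_ratio(ref, new, data_range=255)
        if psnr < baseline_psnr[scene] - margin_db:
            print(f"scene {scene}: {psnr:.2f} dB vs baseline {baseline_psnr[scene]:.2f} dB -> fail")
            return False
    return True
```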
 
Quick question: if something has ML performance of 300 TOPS and 67 TFLOPS of 16-bit floating point (I have no idea what that means), is that impressive?
 
Quick question: if something has ML performance of 300 TOPS and 67 TFLOPS of 16-bit floating point (I have no idea what that means), is that impressive?

It depends on what these numbers mean, though. For example, a 4090 has 82.6 TFLOPS of peak FP16 performance (non-tensor). When using the tensor cores, it's 330 TFLOPS with FP16 accumulate and 165 TFLOPS with FP32 accumulate. The TOPS number likely refers to FP8/INT8 performance, and the 4090 has 660 TOPS with tensor cores (both FP8 with FP16 accumulate and INT8).

So basically these numbers could mean a lot of things, and whether that's impressive really depends on what exactly they are. The amount of memory and memory bandwidth are also important in some applications. For example, when doing LLM inference, a smaller model which can fit inside the 4090's 24GB of memory will run much faster than on an M3 Max, but when running a larger model requiring more than 24GB, an M3 Max with say 64GB or 128GB of memory will be faster.
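A quick back-of-the-envelope on that memory point; the parameter counts and quantization levels below are illustrative, not from any specific product:

```python
# Back-of-the-envelope LLM memory footprint: the weights alone need roughly
# (parameter count) x (bytes per parameter), before KV cache and other overhead.
# Model sizes and precisions are illustrative assumptions.
models = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
precisions = {"FP16": 2.0, "INT8": 1.0, "4-bit": 0.5}  # bytes per parameter

for name, params in models.items():
    sizes = ", ".join(f"{p}: {params * b / 1e9:.0f} GB" for p, b in precisions.items())
    print(f"{name}: {sizes}")

# e.g. a 70B model is ~140 GB in FP16 -- far beyond a 4090's 24 GB -- while a
# 13B model at 4-bit (~7 GB) fits comfortably and runs from fast GDDR6X.
```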
 
Quick question : If something has the ML performance of 300TOPS and 67 TFLOPs of 16-bit floating point (I have no idea what that means) is that impressive ?
In absolute terms it's not a world beater, but given the general constraints and the value of the package as a whole, it's impressive.

However, there are some asterisks here. The FP16/FP32 numbers are doubled with dual issue, and I don't know how widespread or useful that is - maybe it's easy to do and will be common, but I don't know if that's the case right now. Then, as pcchen says, bandwidth is potentially problematic: apparently 576 GB/s for the Pro vs 448 GB/s, which is 1.29x. ~1.3x the bandwidth for 3.26x the flops with dual issue works out to 0.4x bandwidth/flop for the shaders. On top of that, we don't know how nicely the NPU (or whatever they're going to call it) will play with the rest of the system with respect to bandwidth: the bandwidth required to exploit the performance, possible contention reducing available bandwidth for the GPU and CPU, and whether the GPU and NPU can run concurrently or not, although I presume they can?
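Spelling out that arithmetic with the numbers from the post (which are rumoured figures, not confirmed specs):

```python
# Bandwidth-per-FLOP arithmetic using the (rumoured) numbers quoted in the post.
base_bw, pro_bw = 448.0, 576.0          # GB/s
bw_ratio = pro_bw / base_bw             # ~1.29x bandwidth
flops_ratio = 3.26                      # quoted shader-FLOPS ratio with dual issue

bw_per_flop = bw_ratio / flops_ratio    # ~0.4x bandwidth per FLOP vs the base machine
print(f"bandwidth: {bw_ratio:.2f}x, flops: {flops_ratio:.2f}x, "
      f"bandwidth/flop: {bw_per_flop:.2f}x (roughly 0.4x)")
```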

If everything works out well, it could be a very exciting time next gen when both will have the extra number crunching. Devs are good at finding interesting and unexpected ways to get the most out of hardware, so this is exciting even if it's limited.
 
Was thinking... could AI (ML) be trained to optimize asset loading and streaming during gameplay in video games? Surely it can become better at predicting player patterns and inputs... so realistically devs should be able to allow for finer-grained loading and streaming of assets through much better algorithms designed specifically for that, right?
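Purely as a hypothetical sketch of what that could look like (the zone names and API here are invented for illustration): a learned or statistical predictor scores which zones the player is likely to enter next, and the streamer prefetches those assets. Sketched below as a simple frequency model standing in for something fancier:

```python
# Hypothetical sketch of ML-assisted asset streaming: a tiny next-zone predictor
# feeds a prefetcher. Zone names and the prefetch call are made up for illustration;
# a real system would be driven by engine telemetry.
from collections import Counter, defaultdict

class ZonePredictor:
    """First-order model: count which zone players tend to enter after each zone."""
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def observe(self, current_zone: str, next_zone: str) -> None:
        self.transitions[current_zone][next_zone] += 1

    def predict(self, current_zone: str, top_k: int = 2) -> list[str]:
        return [z for z, _ in self.transitions[current_zone].most_common(top_k)]

predictor = ZonePredictor()
for a, b in [("plaza", "sewers"), ("plaza", "sewers"), ("plaza", "rooftops")]:
    predictor.observe(a, b)

# Prefetch assets for the most likely next zones before the player gets there.
for zone in predictor.predict("plaza"):
    print(f"prefetching assets for {zone}")  # stand-in for an async streaming request
```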

What about another scenario... Let's say you're on your PS6/XSX2 dashboard, and you've got your library of installed games and whatever else. What if, as you hovered over an installed game, before even pressing the button it started loading it up in the background, cutting down on the perceived start-up time further yet. I know this doesn't particularly require ML... just saying lol
 
I asked the question of whether ML was ever going to be performed on the PC and was told no, but we are to believe (if rumours are true) that it will be done on the PS5. Why the discrepancy?
 
I asked the question of whether ML was ever going to be performed on the PC and was told no, but we are to believe (if rumours are true) that it will be done on the PS5. Why the discrepancy?
It's already being done on PC, depending on your definition of machine learning (they're doing inference). Could you be more specific?
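Concretely, given the WinML/DirectML angle of this thread, the kind of inference already running on ordinary Windows PCs looks something like this; a minimal sketch using ONNX Runtime's DirectML backend, where the model path and input shape are placeholders:

```python
# Minimal inference sketch with ONNX Runtime's DirectML execution provider
# (requires the onnxruntime-directml package). The model path and input shape
# are placeholders, not a specific real model.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # GPU via DirectML, CPU fallback
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # e.g. one 224x224 RGB image

outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```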
 