Machine Learning: WinML/DirectML, CoreML & all things ML

Ike Turner

Veteran
Since the first big commercial use case of WinML is now publicly available (in Adobe Lightroom CC 2019), I thought it would be better to have a dedicated thread for all things machine learning instead of polluting the Nvidia DLSS thread with semi-OT content.

Anyway, here are the goodies:

Adobe Lightroom CC 2019 Enhance Details, using WinML & CoreML:

https://theblog.adobe.com/enhance-details/

Performance (spoiler: AMD's GCN is fast):

https://www.pugetsystems.com/labs/a...C-2019-Enhanced-Details-GPU-Performance-1366/


In other ML news: Unity has developed its own ML inference engine which is fully cross-platform/hardware compatible. No need for TensorFlow, WinML, CoreML or any other inference engine; "it just works" on anything:

Unity ML-Agents Toolkit:
https://blogs.unity3d.com/2019/03/0...v0-7-a-leap-towards-cross-platform-inference/
 

Unity's done a great job integrating ML into their product so far and it makes sense for them to have a layer that can provide ML functionality without any platform specific frameworks. That said, the DirectX platform does have some unique hardware acceleration support in DirectML. At this year's GDC we announced the public release of the DirectML API and Unity announced support for DirectML, leveraging it where they can for increased performance:

https://devblogs.microsoft.com/directx/gaming-with-windows-ml/
https://devblogs.microsoft.com/directx/directml-at-gdc-2019/
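For anyone who wants to poke at DirectML from a script rather than from C++, one convenient route (an assumption on my part, not something covered in the links above) is ONNX Runtime's DirectML execution provider from the onnxruntime-directml package. A minimal sketch, with the model file name and input shape as placeholders:

```python
# Minimal sketch: run an ONNX model on any DX12-capable GPU via ONNX Runtime's
# DirectML execution provider (pip install onnxruntime-directml).
# "model.onnx" and the 1x3x224x224 input are hypothetical placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # CPU as fallback
)

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example NCHW input
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```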

Thanks,
Max McMullen
Development Manager
Compute, Graphics, & AI
Microsoft
 
Is it too much to ask for some freeware where I can just input a batch of images/textures/etc., select a SuperScale amount, and presto-output-folder? :p

Kind of curious to see if MS can have that as an option when taking screenshots on Xbox or Win10, for example. It'd be like having Ansel (so hot right now) without needing to inject it or have developer intervention (only select games).

I'd love to see an experiment for 90s 2D/sprite games. Is performance there for real-time superscaling where older-school games are in the 320x200-800x600 range :?: I guess they'd have to be on D3D first... and maybe there'd be crazy artefacting in motion around moving objects (e.g. isometric games). :s zoidblerg.
 
Not quite just freeware with some customizations ;)
I suspect someone could make a tool like that. What are you looking to do?
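Something like the sketch below is roughly what such a tool could look like: a minimal batch upscaler built on ONNX Runtime. The super-resolution model file, its scale factor and its NCHW float input/output layout are assumptions for illustration, not any specific shipping model.

```python
# Rough sketch of a "drop a folder in, get an upscaled folder out" tool.
# Assumes a single-image super-resolution ONNX model (e.g. an ESRGAN-style network)
# taking NCHW float32 RGB in [0, 1]; the model file name is a placeholder.
from pathlib import Path

import numpy as np
import onnxruntime as ort
from PIL import Image

MODEL_PATH = "superres_x4.onnx"   # hypothetical super-resolution model
IN_DIR, OUT_DIR = Path("input"), Path("output")
OUT_DIR.mkdir(exist_ok=True)

session = ort.InferenceSession(MODEL_PATH,
                               providers=["DmlExecutionProvider", "CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

for path in sorted(IN_DIR.glob("*.png")):
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    nchw = np.ascontiguousarray(img.transpose(2, 0, 1)[None])   # HWC -> 1xCxHxW
    upscaled = session.run(None, {input_name: nchw})[0]         # model does the scaling
    out = (upscaled[0].transpose(1, 2, 0).clip(0.0, 1.0) * 255).astype(np.uint8)
    Image.fromarray(out).save(OUT_DIR / path.name)
    print(f"{path.name}: {img.shape[1]}x{img.shape[0]} -> {out.shape[1]}x{out.shape[0]}")
```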
 
I'd love to see an experiment for 90s 2D/sprite games. Is performance there for real-time superscaling where older-school games are in the 320x200-800x600 range
Well, considering it took over a minute on a single 256x256 texture on my (admittedly onboard) graphics, I'd say having this running at good framerates is a long way off.
Also, it doesn't handle JPEG artifacts well, it just makes them worse, so probably not the best for photos, but for old CGI, yes, it could be good.
 

I think a more workable way of doing this would be to rip the game's source art, superscale that up, and inject the higher res art into the game. Some emulators allow such fan-made texture-patch injection.
 
Hey, long-time lurker, first-time poster. I had a question on this whole ML/DLSS type stuff. If this isn't the right thread, I apologize and would appreciate it if you could point me to the right one.

Anyway, I was wondering about the DLSS implementation of AI upscaling. AFAIK it starts from the low-res 2D image as the input to the algorithm, but doesn't that seem like an inefficient way to go about upscaling and filling in detail? It mostly works well and gives an amazing performance boost compared to native rendering, but to me as a layman it sounds too inefficient to become the standardized method of the future, since the algorithm doesn't have a lot of information that it "knows": it's just working off a 2D image, seeing a pattern of pixels without a label to say this is an office chair or the main character's ponytail or whatever.

I would think a better implementation would be for the game engine to tell the algorithm which object it is working on, as well as its location, size and orientation on screen, for things like explosion effects, foliage, semi-transparent clouds, particle effects, etc. Those types of visuals are really dynamic and I can imagine them tricking the DLSS algorithm or at least making it less efficient. I dunno, just a question I wanted to ask, since there's a lot of hype about Tensor cores being crucial to graphics in the future, but I'm skeptical of it the same way I was of SSAA and PhysX fifteen years ago. It just feels like there's a better way to go about it.
 
DLSS is two-fold: its first job is to anti-alias, and its second job is to upscale.

The AI is responsible for detecting aliasing in the image; this is how it is trained. Then, using 16K images as training data, it tries to figure out what the anti-aliasing would look like if 16K supersampling had been done on the image and resolved back down to the source resolution.

Per-object would be too slow and too difficult.
Once it's done with the anti-aliasing, it then uses a second neural network to upscale from the render resolution up to 4K.

The speed gains come from rendering at a lower resolution (1080p) versus native. By doing so you're working with 4x fewer pixels, and in general 4x less workload. As we continue to layer on more things like ray tracing, which gets massively more expensive with more pixels (since each pixel shoots rays), the cost keeps going up. Keeping the render resolution locked at 1080p and using AI to extrapolate the rest up to 4K is faster than the raw calculations; if it weren't, there wouldn't be any speed gains.

Hope that helps.
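To put rough numbers on the "4x fewer pixels" point above, the arithmetic is just this (the rays-per-pixel figure is an illustrative assumption, not an NVIDIA number):

```python
# Back-of-the-envelope pixel math behind "render at 1080p, present at 4K".
native_4k = 3840 * 2160        # 8,294,400 pixels shaded per frame at native 4K
internal_1080p = 1920 * 1080   # 2,073,600 pixels when rendering internally at 1080p

print(native_4k / internal_1080p)   # 4.0 -> 4x fewer pixels to shade and to trace from

# If each shaded pixel also launched, say, 2 rays (illustrative assumption),
# the per-frame ray budget shrinks by the same 4x factor:
rays_per_pixel = 2
print(internal_1080p * rays_per_pixel, "rays vs", native_4k * rays_per_pixel, "at native 4K")
```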
 
AI Researchers Create Looping Videos From Still Images (nvidia.com)
Researchers from University of Washington and Facebook used deep learning to convert still images into realistic animated looping videos.

Their approach, which will be presented at the upcoming Conference on Computer Vision and Pattern Recognition (CVPR), imitates continuous fluid motion — such as flowing water, smoke and clouds — to turn still images into short videos that loop seamlessly.

“What’s special about our method is that it doesn’t require any user input or extra information,” said Aleksander Hołyński, University of Washington doctoral student in computer science and engineering and lead author on the project. “All you need is a picture. And it produces as output a high-resolution, seamlessly looping video that quite often looks like a real video.”


To teach their neural network to estimate motion, the team trained the model on more than 1,000 videos of fluid motion such as waterfalls, rivers and oceans. Given only the first frame of the video, the system would predict what should happen in future frames, and compare its prediction with the original video. This comparison helped the model improve its predictions of whether and how each pixel in an image should move.
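Just to make the idea concrete (this is a toy sketch, not the authors' code): the method boils down to predicting a per-pixel motion field for the still image and then warping the image along that field frame after frame. With a hand-made placeholder field in place of the trained network, the warping step looks roughly like this:

```python
# Toy sketch of the core idea: given a still image and a per-pixel motion field,
# repeatedly warp the image to synthesize a short clip. The real method predicts
# the motion field with a trained network and uses symmetric splatting to make
# the loop seamless; here the field is just a hand-made placeholder.
import cv2
import numpy as np

img = cv2.imread("waterfall.jpg").astype(np.float32)   # hypothetical input image
h, w = img.shape[:2]

# Placeholder "motion field": everything drifts slowly downward, like flowing water.
flow = np.zeros((h, w, 2), dtype=np.float32)
flow[..., 1] = 1.5   # pixels per frame along y

ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
frames = []
for t in range(60):                       # 60 synthesized frames
    # Backward warp: sample each output pixel from where it "came from" t steps ago.
    map_x = xs - t * flow[..., 0]
    map_y = ys - t * flow[..., 1]
    frame = cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR,
                      borderMode=cv2.BORDER_REFLECT)
    frames.append(frame.astype(np.uint8))

# frames[] could now be written out with cv2.VideoWriter to inspect the animation.
```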
 
Google and Nvidia Tie in MLPerf; Graphcore and Habana Debut | EE Times
Google and Nvidia tied for first place in the fourth round of MLPerf Training benchmark scores, each winning four of the eight benchmarks in the closed division with their large-scale AI accelerator systems.

First-time contributor Graphcore showed off the capabilities of its 16- and 64-chip pods featuring the second-generation intelligence processing unit (IPU). Habana Labs entered its Gaudi chip for the first time (the company entered its inference chip, Goya, in a previous inference round).

Nvidia said that its own A100 scores improved around 2.1X compared to the previous round, for several reasons. Nvidia’s software stack CUDA has further minimised the required communication with host CPUs. A technique called Sharp has doubled effective bandwidth between nodes by offloading CPU operations to the network, decreasing the data traversing between endpoints. Spatial data parallelism can now split an image across 8 GPUs. And using HBM2e memory has increased A100’s memory bandwidth by nearly 30%.

Graphcore submitted its first set of four MLPerf Training scores, for 16- and 64-chip systems training ResNet and BERT. Running TensorFlow/Poplar software, the 16-IPU system could train ResNet in 37.12 minutes, while the 64-chip system could do it in 14.48 minutes.

For a rough comparison, Dell’s ResNet score for a system with 16x Nvidia A100 accelerators was 20.98 minutes (40GB A100s using MXNet). Nvidia’s own score for ResNet on a 64-chip system (80-GB A100s running MXNet) was 4.91 min.

Graphcore has argued previously that its customers don’t care how many accelerator chips are in a system, and that a comparison normalised on price would mean comparing systems with multiple Graphcore IPUs to a single A100.
...
By entering its third-generation Xeon Scalable CPUs, which have some AI acceleration features, Intel wanted to show that the CPU servers customers most likely already have are perfectly capable of handling AI workloads. Intel also wanted to show that Xeon systems can scale with the size of the workload, submitting scores for systems with 4 to 64 CPUs. The 8-Xeon system trained ResNet in 943.97 minutes, while the 64-Xeon system did it in 213.92 minutes, about 4.4 times faster, though it uses 8x as many CPUs.
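As a quick sanity check on that last scaling claim, the arithmetic from the quoted Xeon numbers works out to roughly 55% scaling efficiency:

```python
# Scaling check on the quoted Intel numbers: 8 Xeons -> 943.97 min, 64 Xeons -> 213.92 min.
t_8cpu, t_64cpu = 943.97, 213.92
speedup = t_8cpu / t_64cpu            # ~4.41x faster
resources = 64 / 8                    # 8x the CPUs
efficiency = speedup / resources      # ~0.55, i.e. ~55% parallel scaling efficiency
print(f"speedup {speedup:.2f}x with {resources:.0f}x CPUs -> {efficiency:.0%} efficiency")
```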
 
Train your machine learning models on any GPU with TensorFlow-DirectML - Windows AI Platform (microsoft.com)
Sept. 9, 2021
Today we’re excited to exit preview and announce our first generally consumable package of TensorFlow-DirectML! We encourage you to use TensorFlow-DirectML whether you’re a student learning or a professional developing machine learning models for production. Read on for more details on how we invested and where we’re headed next.
...
Our team acted on customer feedback throughout the preview, improving the TensorFlow-DirectML experience on Windows and within WSL. This included optimizing specific operators like convolution and batch normalization, and fine-tuning GPU scheduling and memory management so TensorFlow gets the most out of DirectML. We co-engineered with AMD, Intel, and NVIDIA to enable a hardware-accelerated training experience across the breadth of DirectX 12 capable GPUs.

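For anyone who wants to try it, the point of the package is that ordinary TensorFlow code should just run on any DX12-capable GPU once tensorflow-directml is installed. A minimal smoke-test sketch (the toy model below is mine, not from the announcement):

```python
# pip install tensorflow-directml   (TensorFlow fork that targets DirectML-capable GPUs)
# Once installed, plain TensorFlow/Keras code like the toy regression below is expected
# to pick up the DirectML device automatically; nothing DirectML-specific is needed here.
import numpy as np
import tensorflow as tf

# Synthetic data: learn y = 3x + 2 with a single dense unit.
x = np.random.rand(4096, 1).astype(np.float32)
y = 3.0 * x + 2.0

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=20, batch_size=64, verbose=2)

print(model.get_weights())   # should trend toward a weight of ~3.0 and a bias of ~2.0
```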
 
Regarding TensorFlow-DirectML - some bits from AMD

Performance Improvements Across the Board

The performance optimizations have improved both machine learning training and inference performance. Using the AI Benchmark Alpha benchmark, we have tested the first production release of TensorFlow-DirectML, with significant performance gains observed across a number of key categories, such as up to 4.4x faster in the device training score (1). And at AMD, we will continue to seek out opportunities to optimize and performance-tune Radeon GPUs and deliver great results for our customers.


For the AMD-specific ML performance improvements, we’ve updated our driver to deliver substantially better TensorFlow inference performance on AMD Radeon™ RX 6900 XT and RX 6600 XT graphics hardware. When tested with AI Benchmark Alpha and the release version of TensorFlow-DirectML, we saw up to a 3.1x increase in inference performance (2) with this update.


When you combine the work on both ML training and inference performance optimizations that AMD and Microsoft have done for TensorFlow-DirectML since the preview release, the results are astounding, with up to a 3.7x improvement (3) in the overall AI Benchmark Alpha score!


AMD - AMD GPUs Support GPU-Accelerated Machine Learning with Release of TensorFlow-DirectML by Microsoft
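For reference, AI Benchmark Alpha is a pip-installable Python package, so the numbers above can be reproduced locally with something like the following (to the best of my knowledge of the package's API, so treat the call as an assumption):

```python
# pip install ai-benchmark
# Runs the AI Benchmark Alpha suite (a series of TensorFlow training and inference
# tests) and prints the device, training, inference and overall AI scores as it goes.
from ai_benchmark import AIBenchmark

results = AIBenchmark().run()
```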
 