Direct3D feature levels discussion

There's nothing in common between MSAA and these TAA-based upscalers.
The common part is that both generate an upscaled image from a lower rendering resolution. That image is either 1) upsampled using user-defined sample patterns and then filtered down to backbuffer resolution for presentation to the display (MSAA), or 2) upsampled using predefined sample patterns selected by motion estimation derived from motion vectors, and then kept at that higher resolution for presentation (TAA).

The MSAA API could be extended to apply custom filters that generate the upscaled image (working with the front and back buffers, the depth buffer, and new motion vector and coverage/reactivity mask buffers) and send it for presentation; the same motion vectors could also be used for interpolated frame generation at almost no additional cost. This would require changes to DXGI swapchain management to handle the upscaled and/or interpolated frames (which previously required proprietary APIs in the kernel-mode display driver).
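Purely as an illustration of the idea - none of these types exist in D3D12 or DXGI today, the names here are made up - such an extended, MSAA-like upscaler description could look something like this:
Code:
// Purely hypothetical sketch: these types do NOT exist in D3D12/DXGI.
// They only illustrate how an "MSAA-like" description of a temporal
// upscaler could look if the resolve step were made programmable.
enum HYPOTHETICAL_RESOLVE_MODE
{
    RESOLVE_MODE_BOX_FILTER,        // classic MSAA resolve down to backbuffer resolution
    RESOLVE_MODE_TEMPORAL_UPSCALE,  // TAA-style reconstruction kept at output resolution
};

struct HYPOTHETICAL_UPSCALE_DESC
{
    HYPOTHETICAL_RESOLVE_MODE Mode;
    UINT RenderWidth,  RenderHeight;    // lower internal rendering resolution
    UINT OutputWidth,  OutputHeight;    // presentation resolution
    ID3D12Resource*   Color;            // current frame color
    ID3D12Resource*   Depth;            // depth buffer
    ID3D12Resource*   MotionVectors;    // new: per-pixel motion vectors
    ID3D12Resource*   ReactiveMask;     // new: coverage/reactivity mask
    ID3D12Resource*   History;          // previous upscaled output (temporal mode only)
};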

IMHO this should make it easier for developers to modify their code to support Super Resolution, as it would just look like a new MSAA mode to the application - even though from a hardware point of view, MSAA works during rasterization as part of the rendering pipeline, while TAA engages only after rendering is fully complete, as a post-processing step. Microsoft will probably have their own ideas though.

DirectML isn't viewed as a real time API

It is, it's right on the intro page:

... You can develop such machine learning techniques as upscaling, anti-aliasing, and style transfer, to name but a few. Denoising and super-resolution, for example, allow you to achieve impressive raytraced effects with fewer rays per pixel.
... If you're counting milliseconds, and squeezing frame times, then DirectML will meet your machine learning needs.

For reliable real-time, high-performance, low-latency, and/or resource-constrained scenarios... you can integrate DirectML directly into your existing engine or rendering pipeline.

DirectML was positioned for tasks like real-time upscaling/anti-aliasing right from its introduction back in 2018. It's designed to be hardware-accelerated by Direct3D 12 GPUs with the help of metacommands, which expose GPU architecture-specific implementations of common inference operations; these are combined into Direct3D 12 command lists to execute graph-based workflows.
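As a rough sketch of that flow (operator creation/compilation, descriptor heaps and binding-table setup are omitted), recording a compiled DirectML operator into an ordinary Direct3D 12 command list looks roughly like this:
Code:
#include <d3d12.h>
#include <DirectML.h>   // link against DirectML.lib
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Minimal sketch: create a DirectML device on top of an existing D3D12 device
// and record a compiled operator into a regular D3D12 command list, where it
// is lowered to metacommands / compute dispatches by the driver.
void RecordInference(ID3D12Device* d3d12Device,
                     ID3D12GraphicsCommandList* commandList,
                     IDMLCompiledOperator* compiledOp,
                     IDMLBindingTable* bindingTable)
{
    ComPtr<IDMLDevice> dmlDevice;
    DMLCreateDevice(d3d12Device, DML_CREATE_DEVICE_FLAG_NONE,
                    IID_PPV_ARGS(&dmlDevice));

    ComPtr<IDMLCommandRecorder> recorder;
    dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&recorder));

    // The inference work lands on the same command list the renderer already
    // uses, so it is submitted to the GPU alongside the graphics workload.
    recorder->RecordDispatch(commandList, compiledOp, bindingTable);
}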


Accelerating GPU inferencing with DirectML and DirectX 12
 
BTW Microsoft recently introduced DirectML support for Intel NPUs (neural processing units) embedded in the mobile-only Meteor Lake CPUs and the upcoming Arrow Lake (Intel Core Ultra 200) parts:



These NPUs use the new MCDM compute-only device driver model, which only supports feature level 1_0_CORE - both MCDM and feature level 1_0_CORE have been lurking in the Windows SDK for the last 4 years with no explanation, so that's one less mystery now!
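For reference, compute-only MCDM adapters are enumerated through DXCore rather than DXGI, and the device is created with that core feature level - a minimal sketch (error handling omitted, and it simply picks the first core-compute adapter it finds):
Code:
#include <dxcore.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Enumerate compute-only (MCDM) adapters via DXCore and create a
// core-compute D3D12 device on the first one found.
ComPtr<ID3D12Device> CreateCoreComputeDevice()
{
    ComPtr<IDXCoreAdapterFactory> factory;
    DXCoreCreateAdapterFactory(IID_PPV_ARGS(&factory));

    // Filter for adapters that implement the D3D12 core-compute contract
    // (this is how NPUs exposed through MCDM drivers show up).
    const GUID attributes[] = { DXCORE_ADAPTER_ATTRIBUTE_D3D12_CORE_COMPUTE };
    ComPtr<IDXCoreAdapterList> adapters;
    factory->CreateAdapterList(_countof(attributes), attributes,
                               IID_PPV_ARGS(&adapters));

    ComPtr<ID3D12Device> device;
    if (adapters->GetAdapterCount() > 0)
    {
        ComPtr<IDXCoreAdapter> adapter;
        adapters->GetAdapter(0u, IID_PPV_ARGS(&adapter));
        // Feature level 1_0_CORE is the compute-only subset of D3D12.
        D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_1_0_CORE,
                          IID_PPV_ARGS(&device));
    }
    return device;
}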




 
Work Graphs and Shader Model 6.8 are officially released with drivers from AMD and NVIDIA. GPU Upload Heaps (the DirectX take on ReBAR) have also been released.

The new features are only supported on Ampere and Ada GPUs from NVIDIA, and RDNA3 from AMD.
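For reference, all of these are plain CheckFeatureSupport queries against the new options structs in the Agility SDK headers - a minimal sketch:
Code:
#include <d3d12.h>

struct NewCaps
{
    bool workGraphs = false;
    bool executeIndirectTier11 = false;
    bool gpuUploadHeaps = false;
};

// Query the caps added around Agility SDK 1.613: work graphs and
// ExecuteIndirect tiers (OPTIONS21) plus GPU Upload Heaps (OPTIONS16).
NewCaps QueryNewCaps(ID3D12Device* device)
{
    NewCaps caps;

    D3D12_FEATURE_DATA_D3D12_OPTIONS21 options21 = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS21,
                                              &options21, sizeof(options21))))
    {
        caps.workGraphs = options21.WorkGraphsTier >= D3D12_WORK_GRAPHS_TIER_1_0;
        caps.executeIndirectTier11 =
            options21.ExecuteIndirectTier >= D3D12_EXECUTE_INDIRECT_TIER_1_1;
    }

    // GPU Upload Heaps (the "DirectX ReBAR") are reported via OPTIONS16.
    D3D12_FEATURE_DATA_D3D12_OPTIONS16 options16 = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS16,
                                              &options16, sizeof(options16))))
    {
        caps.gpuUploadHeaps = options16.GPUUploadHeapSupported != FALSE;
    }
    return caps;
}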



 
Nice, Nvidia must have heard we were talking shit about them :D

NVIDIA seems less than enthusiastic about the Work Graphs feature though, citing limited performance improvements and several caveats.

On a GeForce RTX 4090 GPU, the work graph lights the scene at 1920 x 1080 within 0.8 ms to 0.95 ms, whereas the uber shader dispatch technique takes 0.98 ms

Work graph execution is not free. There is a cost associated with managing the graph’s records and scheduling work. This cost eats some of the gains the work graphs achieved

These results portray one of the lessons I learned during my adventures with work graphs. That is, the performance gains must outweigh the overhead cost of work graph execution in order to see a net win in performance

Work graphs take a big step toward full GPU-driven frame processing. But it is not yet possible for one work graph to express the work of an entire frame (for example, culling, rasterizing a G-buffer, lighting then postprocessing)

While a single work graph can represent multiple steps of the frame’s rendering, there are still some operations that cannot be done efficiently within a single work graph

Until these questions are resolved, the CPU will continue to play a primary role in frame sequencing. Yet more data-dependent passes can now be moved entirely to the GPU, freeing the CPU from tasks like having to manage scene culling and push commands for every visible entity
we are not yet at a point where the CPU can submit just one call to DispatchGraph to draw the entire frame, but this release of work graphs helps convert more parts of the frame to become GPU-driven, thus reducing the cases where the CPU would be the bottleneck for the application’s performance
 
NVIDIA seems less than enthusiastic about the Work Graphs feature though, citing limited performance improvements and several caveats.
These limitations seem to be inherent to the API itself and thus will likely be present on all h/w. Let's see how quickly Epic makes use of it (Nanite seems to be the prime target for this whole development) and what it will mean for UE5's performance.
 

Not in the PR title, but Shader Model 6.8 is finally going to support a gl_DrawID equivalent (SV_IndirectCommandIndex), which makes GPU-driven rendering more attractive with the vertex shading pipeline ...
Microsoft shadow-dropped an enhancement to the ExecuteIndirect API and introduced new feature tiers for it ...

ExecuteIndirect tier 1.1 adds functionality for incrementing root constants ...
I suppose we never saw that specific system value semantic in the end since Microsoft decided to expose the functionality with a different abstraction via an incrementing root constant in the command signature layout instead ...
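For context, this is the bookkeeping the incrementing root constant removes. A tier 1.0 command signature has to carry the draw index as an explicit root constant in every argument record (sketch below; the root-signature layout with the per-draw index at root parameter 0 is just an assumption for illustration), whereas with tier 1.1 the runtime can generate that value itself:
Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Tier 1.0 style: every indirect argument record stores an explicit 32-bit
// "draw ID" root constant followed by the draw arguments. ExecuteIndirect
// tier 1.1's incrementing root constant makes storing this value per record
// unnecessary.
struct IndirectCommand
{
    UINT                         drawId;  // consumed as a root constant (b0 here)
    D3D12_DRAW_INDEXED_ARGUMENTS draw;
};

ComPtr<ID3D12CommandSignature> CreateDrawIdSignature(ID3D12Device* device,
                                                     ID3D12RootSignature* rootSig)
{
    D3D12_INDIRECT_ARGUMENT_DESC args[2] = {};
    args[0].Type                             = D3D12_INDIRECT_ARGUMENT_TYPE_CONSTANT;
    args[0].Constant.RootParameterIndex      = 0; // assumed per-draw index slot
    args[0].Constant.DestOffsetIn32BitValues = 0;
    args[0].Constant.Num32BitValuesToSet     = 1;
    args[1].Type                             = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(IndirectCommand);
    desc.NumArgumentDescs = _countof(args);
    desc.pArgumentDescs   = args;

    ComPtr<ID3D12CommandSignature> signature;
    device->CreateCommandSignature(&desc, rootSig, IID_PPV_ARGS(&signature));
    return signature;
}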
 
I've updated my tool to support new features in Agility SDK 613 (and Windows Insider Germanium builds 260xx) - this is how the AMD Adrenalin preview driver reports them on my Radeon RX 7800 XT:
Code:
RecreateAtTier : D3D12_RECREATE_AT_TIER_NOT_SUPPORTED (0)
WorkGraphsTier : D3D12_WORK_GRAPHS_TIER_1_0 (10)
ExecuteIndirectTier : D3D12_EXECUTE_INDIRECT_TIER_1_0 (10)
SampleCmpGradientAndBiasSupported : 1
ExtendedCommandInfoSupported : 1
 
Not implemented: D3D12_FEATURE_PREDICATION | HARDWARE_COPY


WaveMMA operations are still reported when the experimental shader model is enabled - even though the relevant WaveMMA APIs and data types were removed from the DirectX Specs and the Agility SDK headers, the drivers still seem to support them:
Code:
HighestShaderModel : D3D12_SHADER_MODEL_6_9 (0x0069)
WaveMMATier : D3D12_WAVE_MMA_TIER_1_0 (10)
WaveMMA operations supported : 3

To check data types and matrix dimensions, run checkformats_agile.cmd, which sets /formats /verbose command-line options:
Code:
Wave Matrix Multiply Accumulate
[M]x[N] TYPE  -> [K] TYPE
16x16 BYTE    -> x16 INT32 (1)
16x16 FLOAT16 -> x16 FLOAT16, FLOAT (6)
16x16 FLOAT   -> x16 NONE (0)
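For anyone trying to reproduce this: the experimental shader model path has to be opted into with D3D12EnableExperimentalFeatures before the device is created (developer mode is required), along these lines:
Code:
#include <d3d12.h>

// Opt into experimental shader models (e.g. the SM 6.9 preview) before
// creating the device; requires Windows developer mode to be enabled.
bool EnableExperimentalShaderModels()
{
    return SUCCEEDED(D3D12EnableExperimentalFeatures(
        1, &D3D12ExperimentalShaderModels, nullptr, nullptr));
}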


I've also updated FormatSupport.XLSX with color-coding to indicate mandatory and optional resource formats on feature levels 12_x, as well as mandatory and optional capability bits for each of these formats as required by Microsoft specs.

Comparing D3D12_FORMAT_SUPPORT1 and D3D12_FORMAT_SUPPORT2 caps from the current WARP12 driver 1.0.9 (build 27566) and the most recent drivers for AMD Radeon 6000/7000 series, NVIDIA RTX 3000/4000 series, and Intel Arc Xe integrated graphics, they all support almost every format and option available in DXGI with only a few minor exceptions - a clear improvement over earlier generations of graphics hardware, where many optional formats and capabilities were never supported...
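The caps being compared there come from per-format CheckFeatureSupport queries, along these lines:
Code:
#include <d3d12.h>

// Query the Support1/Support2 capability bits for a single DXGI format,
// as color-coded in the FormatSupport spreadsheet.
D3D12_FEATURE_DATA_FORMAT_SUPPORT QueryFormatSupport(ID3D12Device* device,
                                                     DXGI_FORMAT format)
{
    D3D12_FEATURE_DATA_FORMAT_SUPPORT support = { format };
    device->CheckFeatureSupport(D3D12_FEATURE_FORMAT_SUPPORT,
                                &support, sizeof(support));
    // support.Support1: D3D12_FORMAT_SUPPORT1_* bits (render target, blendable, etc.)
    // support.Support2: D3D12_FORMAT_SUPPORT2_* bits (UAV atomics, typed UAV load, etc.)
    return support;
}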
 
4090 with 551.76:

Code:
RecreateAtTier : D3D12_RECREATE_AT_TIER_NOT_SUPPORTED (0)
WorkGraphsTier : D3D12_WORK_GRAPHS_TIER_1_0 (10)
ExecuteIndirectTier : D3D12_EXECUTE_INDIRECT_TIER_1_1 (11)
SampleCmpGradientAndBiasSupported : 1
ExtendedCommandInfoSupported : 1
Predication.Supported : 1
HardwareCopy.Supported : 1

Code:
HighestShaderModel : D3D12_SHADER_MODEL_6_8 (0x0068)
WaveMMATier : D3D12_WAVE_MMA_TIER_1_0 (10)
WaveMMA operations supported : 8

Code:
Wave Matrix Multiply Accumulate
[M]x[N] TYPE    -> [K] TYPE
16x16 BYTE      -> x16 INT32 (1)
16x16 FLOAT16   -> x16 FLOAT16, FLOAT (6)
16x64 BYTE      -> x16 INT32 (1)
16x64 FLOAT16   -> x16 FLOAT16, FLOAT (6)
64x16 BYTE      -> x16 INT32 (1)
64x16 FLOAT16   -> x16 FLOAT16, FLOAT (6)
64x64 BYTE      -> x16 INT32 (1)
64x64 FLOAT16   -> x16 FLOAT16, FLOAT (6)

I am getting "GPUUploadHeapSupported : 0" with both DevMode off and on though. Which is weird considering that it was 1 on the previous driver branch.
 
I am getting "GPUUploadHeapSupported : 0" with both DevMode off and on though. Which is weird considering that it was 1 on the previous driver branch.
Same for me, even though the relevant AMD Adrenalin setting remains enabled. I suppose this option will return after a restart though.
 
Same for me, even though the relevant AMD Adrenalin setting remains enabled. I suppose this option will return after a restart though.
They apparently needed a kernel-mode change to make it work reliably, and there is no public version of Windows 11 with that change yet (they listed 26080, current insider preview is 26063)
 
This sample demonstrates how to achieve GPU-driven shader launches using the D3D12 work graphs API. In this sample, a large number of meshes animate on the screen in a typical deferred shading rendering environment. After the g-buffer is populated with only normals and material IDs, the sample applies deferred shading using tiled light culling and material shading. This is entirely achieved using one call to DispatchGraph. The sample has three variations of work graphs: the first uses tiled light culling and broadcasting launch nodes, the second uses per-pixel light culling and coalescing launch nodes, and the third also uses per-pixel light culling but with thread launch nodes. Finally, a standard Dispatch-based implementation is provided for comparison.
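For orientation, here's a minimal sketch of what such a single-call submission looks like on the CPU side, using the work graphs API from Agility SDK 1.613 (state object creation, backing memory allocation and the record layout are assumed to be set up elsewhere):
Code:
#include <d3d12.h>

// Submit CPU-generated input records to a work graph with one DispatchGraph call.
// The program identifier and backing memory come from the work-graph state object.
void SubmitWorkGraph(ID3D12GraphicsCommandList10* commandList,
                     D3D12_PROGRAM_IDENTIFIER programId,
                     D3D12_GPU_VIRTUAL_ADDRESS_RANGE backingMemory,
                     void* records, UINT numRecords, UINT recordStride)
{
    D3D12_SET_PROGRAM_DESC program = {};
    program.Type = D3D12_PROGRAM_TYPE_WORK_GRAPH;
    program.WorkGraph.ProgramIdentifier = programId;
    program.WorkGraph.Flags = D3D12_SET_WORK_GRAPH_FLAG_INITIALIZE;
    program.WorkGraph.BackingMemory = backingMemory;
    commandList->SetProgram(&program);

    D3D12_DISPATCH_GRAPH_DESC dispatch = {};
    dispatch.Mode = D3D12_DISPATCH_MODE_NODE_CPU_INPUT;
    dispatch.NodeCPUInput.EntrypointIndex     = 0;          // entry node of the graph
    dispatch.NodeCPUInput.NumRecords          = numRecords;
    dispatch.NodeCPUInput.pRecords            = records;
    dispatch.NodeCPUInput.RecordStrideInBytes = recordStride;
    commandList->DispatchGraph(&dispatch);
}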
 
A comparison of Work Graphs vs. traditional rendering using a sample scene provided by NVIDIA.

In theory, this should improve performance during specific scenarios. As you’ll see in the video, though, these performance improvements are not universal. In other words, there are multiple scenes in which the tech demo runs exactly the same with and without Work Graphs.

 
Any benches on an RDNA 3 GPU?

AMD released a sample last year showing a compute shader software rasterizer implementation with results at the end of the blog post. They'll go into more detail next week at GDC about how developers can use work graphs beyond nested parallelism in some of these samples, such as implementing persistent producer/consumer work queues for recursive expansion/compaction of complex data structures, as seen in Nanite's per-cluster hierarchical LoD selection phase ...
 
AMD released a sample last year showing a compute shader software rasterizer implementation with results at the end of the blog post.
Correct me if I am wrong, but doesn't their result align with NVIDIA's? At bin sizes 6 to 13, the Work Graph methods are slower than multi-pass ExecuteIndirect; only at 14 bins does the Work Graph carve out a ~20% win, and at 15 bins it goes down to ~5%.
 