Direct3D feature levels discussion

There's nothing in common between MSAA and these TAA-based upscalers.
The common part is that both generate an upscaled image from a lower rendering resolution. That image is either 1) upsampled using user-defined sample patterns and then filtered down to backbuffer resolution for presentation to the display (MSAA), or 2) upsampled using predefined sample patterns selected by motion estimation derived from motion vectors, and then kept at that higher resolution for presentation (TAA).

The MSAA API could be extended to apply custom filters that generate the upscaled image (working with the front and back buffers, the depth buffer, and new motion vector and coverage/reactivity mask buffers) and send it for presentation; the same motion vectors could also be used for interpolated frame generation at almost no additional cost. This would require changes to DXGI swapchain management to handle the upscaled and/or interpolated frames (which previously required proprietary APIs in the kernel-mode display driver).
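Purely as an illustration of the idea - none of these types exist in D3D12 or DXGI today, the names here are made up - such an extended, MSAA-like upscaler description could look something like this:
Code:
// Purely hypothetical sketch: these types do NOT exist in D3D12/DXGI.
// They only illustrate how an "MSAA-like" description of a temporal
// upscaler could look if the resolve step were made programmable.
enum HYPOTHETICAL_RESOLVE_MODE
{
    RESOLVE_MODE_BOX_FILTER,        // classic MSAA resolve down to backbuffer resolution
    RESOLVE_MODE_TEMPORAL_UPSCALE,  // TAA-style reconstruction kept at output resolution
};

struct HYPOTHETICAL_UPSCALE_DESC
{
    HYPOTHETICAL_RESOLVE_MODE Mode;
    UINT RenderWidth,  RenderHeight;    // lower internal rendering resolution
    UINT OutputWidth,  OutputHeight;    // presentation resolution
    ID3D12Resource*   Color;            // current frame color
    ID3D12Resource*   Depth;            // depth buffer
    ID3D12Resource*   MotionVectors;    // new: per-pixel motion vectors
    ID3D12Resource*   ReactiveMask;     // new: coverage/reactivity mask
    ID3D12Resource*   History;          // previous upscaled output (temporal mode only)
};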

IMHO this should make it easier for developers to modify their code to support Super Resolution, as it would just look like a new MSAA mode to the application - even though from a hardware point of view, MSAA works during rasterization as part of the rendering pipeline, while TAA engages only after rendering is fully complete, as a post-processing step. Microsoft will probably have their own ideas though.

DirectML isn't viewed as a real time API

It is, it's right on the intro page:

... You can develop such machine learning techniques as upscaling, anti-aliasing, and style transfer, to name but a few. Denoising and super-resolution, for example, allow you to achieve impressive raytraced effects with fewer rays per pixel.
... If you're counting milliseconds, and squeezing frame times, then DirectML will meet your machine learning needs.

For reliable real-time, high-performance, low-latency, and/or resource-constrained scenarios... you can integrate DirectML directly into your existing engine or rendering pipeline.

DirectML was positioned for tasks like real-time upscaling/anti-aliasing right from its introduction back in 2018. It's designed to be hardware-accelerated by Direct3D 12 GPUs with the help of metacommands, which expose GPU architecture-specific implementations of common inference operations; these are combined into Direct3D 12 command lists to execute graph-based workflows.
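As a rough sketch of that flow (operator creation/compilation, descriptor heaps and binding-table setup are omitted), recording a compiled DirectML operator into an ordinary Direct3D 12 command list looks roughly like this:
Code:
#include <d3d12.h>
#include <DirectML.h>   // link against DirectML.lib
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Minimal sketch: create a DirectML device on top of an existing D3D12 device
// and record a compiled operator into a regular D3D12 command list, where it
// is lowered to metacommands / compute dispatches by the driver.
void RecordInference(ID3D12Device* d3d12Device,
                     ID3D12GraphicsCommandList* commandList,
                     IDMLCompiledOperator* compiledOp,
                     IDMLBindingTable* bindingTable)
{
    ComPtr<IDMLDevice> dmlDevice;
    DMLCreateDevice(d3d12Device, DML_CREATE_DEVICE_FLAG_NONE,
                    IID_PPV_ARGS(&dmlDevice));

    ComPtr<IDMLCommandRecorder> recorder;
    dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&recorder));

    // The inference work lands on the same command list the renderer already
    // uses, so it is submitted to the GPU alongside the graphics workload.
    recorder->RecordDispatch(commandList, compiledOp, bindingTable);
}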


Accelerating GPU inferencing with DirectML and DirectX 12
 
BTW Microsoft recently introduced DirectML support for Intel NPUs (neural processing units) embedded in the mobile-only Meteor Lake CPUs and the upcoming Arrow Lake (Intel Core Ultra 200) parts:



These NPUs use the new MCDM compute-only device driver model, which only supports feature level 1_0_CORE - both MCDM and feature level 1_0_CORE have been lurking in the Windows SDK for the last 4 years with no explanation, so that's one less mystery now!
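For reference, compute-only MCDM adapters are enumerated through DXCore rather than DXGI, and the device is created with that core feature level - a minimal sketch (error handling omitted, and it simply picks the first core-compute adapter it finds):
Code:
#include <dxcore.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Enumerate compute-only (MCDM) adapters via DXCore and create a
// core-compute D3D12 device on the first one found.
ComPtr<ID3D12Device> CreateCoreComputeDevice()
{
    ComPtr<IDXCoreAdapterFactory> factory;
    DXCoreCreateAdapterFactory(IID_PPV_ARGS(&factory));

    // Filter for adapters that implement the D3D12 core-compute contract
    // (this is how NPUs exposed through MCDM drivers show up).
    const GUID attributes[] = { DXCORE_ADAPTER_ATTRIBUTE_D3D12_CORE_COMPUTE };
    ComPtr<IDXCoreAdapterList> adapters;
    factory->CreateAdapterList(_countof(attributes), attributes,
                               IID_PPV_ARGS(&adapters));

    ComPtr<ID3D12Device> device;
    if (adapters->GetAdapterCount() > 0)
    {
        ComPtr<IDXCoreAdapter> adapter;
        adapters->GetAdapter(0u, IID_PPV_ARGS(&adapter));
        // Feature level 1_0_CORE is the compute-only subset of D3D12.
        D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_1_0_CORE,
                          IID_PPV_ARGS(&device));
    }
    return device;
}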




 
Work Graphs and Shader Model 6.8 are officially released with drivers from AMD and NVIDIA. GPU Upload Heaps (the DirectX take on ReBAR) have also been released.

The new features are only supported on Ampere and Ada GPUs from NVIDIA, and RDNA3 from AMD.
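For reference, all of these are plain CheckFeatureSupport queries against the new options structs in the Agility SDK headers - a minimal sketch:
Code:
#include <d3d12.h>

struct NewCaps
{
    bool workGraphs = false;
    bool executeIndirectTier11 = false;
    bool gpuUploadHeaps = false;
};

// Query the caps added around Agility SDK 1.613: work graphs and
// ExecuteIndirect tiers (OPTIONS21) plus GPU Upload Heaps (OPTIONS16).
NewCaps QueryNewCaps(ID3D12Device* device)
{
    NewCaps caps;

    D3D12_FEATURE_DATA_D3D12_OPTIONS21 options21 = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS21,
                                              &options21, sizeof(options21))))
    {
        caps.workGraphs = options21.WorkGraphsTier >= D3D12_WORK_GRAPHS_TIER_1_0;
        caps.executeIndirectTier11 =
            options21.ExecuteIndirectTier >= D3D12_EXECUTE_INDIRECT_TIER_1_1;
    }

    // GPU Upload Heaps (the "DirectX ReBAR") are reported via OPTIONS16.
    D3D12_FEATURE_DATA_D3D12_OPTIONS16 options16 = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS16,
                                              &options16, sizeof(options16))))
    {
        caps.gpuUploadHeaps = options16.GPUUploadHeapSupported != FALSE;
    }
    return caps;
}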



 
Nice, Nvidia must have heard we were talking shit about them :D

NVIDIA seems less than enthusiastic about the Work Graphs feature though, citing limited performance improvements and several caveats.

On a GeForce RTX 4090 GPU, the work graph lights the scene at 1920 x 1080 within 0.8 ms to 0.95 ms, whereas the uber shader dispatch technique takes 0.98 ms

Work graph execution is not free. There is a cost associated with managing the graph’s records and scheduling work. This cost eats some of the gains the work graphs achieved

These results portray one of the lessons I learned during my adventures with work graphs. That is, the performance gains must outweigh the overhead cost of work graph execution in order to see a net win in performance

Work graphs take a big step toward full GPU-driven frame processing. But it is not yet possible for one work graph to express the work of an entire frame (for example, culling, rasterizing a G-buffer, lighting then postprocessing)

While a single work graph can represent multiple steps of the frame’s rendering, there are still some operations that cannot be done efficiently within a single work graph

Until these questions are resolved, the CPU will continue to play a primary role in frame sequencing. Yet more data-dependent passes can now be moved entirely to the GPU, freeing the CPU from tasks like having to manage scene culling and push commands for every visible entity
we are not yet at a point where the CPU can submit just one call to DispatchGraph to draw the entire frame, but this release of work graphs helps convert more parts of the frame to become GPU-driven, thus reducing the cases where the CPU would be the bottleneck for the application’s performance
 
NVIDIA seems less than enthusiastic about the Work Graphs feature though, citing limited performance improvements and several caveats.
These limitations seem to be inherent to the API itself and thus will likely be present on all h/w. Let's see how quickly Epic makes use of it (Nanite seems to be the prime target for this whole development) and what it will mean for UE5's performance.
 

Not in the PR title, but Shader Model 6.8 is finally going to support a gl_DrawID equivalent (SV_IndirectCommandIndex), which makes GPU-driven rendering more attractive with the vertex shading pipeline ...
Microsoft shadow-dropped an enhancement to the ExecuteIndirect API and introduced new feature tiers for it ...

ExecuteIndirect tier 1.1 adds functionality for incrementing root constants ...
I suppose we never saw that specific system value semantic in the end since Microsoft decided to expose the functionality with a different abstraction via an incrementing root constant in the command signature layout instead ...
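For context, this is the bookkeeping the incrementing root constant removes. A tier 1.0 command signature has to carry the draw index as an explicit root constant in every argument record (sketch below; the root-signature layout with the per-draw index at root parameter 0 is just an assumption for illustration), whereas with tier 1.1 the runtime can generate that value itself:
Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Tier 1.0 style: every indirect argument record stores an explicit 32-bit
// "draw ID" root constant followed by the draw arguments. ExecuteIndirect
// tier 1.1's incrementing root constant makes storing this value per record
// unnecessary.
struct IndirectCommand
{
    UINT                         drawId;  // consumed as a root constant (b0 here)
    D3D12_DRAW_INDEXED_ARGUMENTS draw;
};

ComPtr<ID3D12CommandSignature> CreateDrawIdSignature(ID3D12Device* device,
                                                     ID3D12RootSignature* rootSig)
{
    D3D12_INDIRECT_ARGUMENT_DESC args[2] = {};
    args[0].Type                             = D3D12_INDIRECT_ARGUMENT_TYPE_CONSTANT;
    args[0].Constant.RootParameterIndex      = 0; // assumed per-draw index slot
    args[0].Constant.DestOffsetIn32BitValues = 0;
    args[0].Constant.Num32BitValuesToSet     = 1;
    args[1].Type                             = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(IndirectCommand);
    desc.NumArgumentDescs = _countof(args);
    desc.pArgumentDescs   = args;

    ComPtr<ID3D12CommandSignature> signature;
    device->CreateCommandSignature(&desc, rootSig, IID_PPV_ARGS(&signature));
    return signature;
}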
 
I've updated my tool to support new features in Agility SDK 613 (and Windows Insider Germanium builds 260xx) - this is how the AMD Adrenalin preview driver reports them on my Radeon RX 7800 XT:
Code:
RecreateAtTier : D3D12_RECREATE_AT_TIER_NOT_SUPPORTED (0)
WorkGraphsTier : D3D12_WORK_GRAPHS_TIER_1_0 (10)
ExecuteIndirectTier : D3D12_EXECUTE_INDIRECT_TIER_1_0 (10)
SampleCmpGradientAndBiasSupported : 1
ExtendedCommandInfoSupported : 1
 
Not implemented: D3D12_FEATURE_PREDICATION | HARDWARE_COPY


WaveMMA operations are still reported when the experimental shader model is enabled - even though the relevant WaveMMA APIs and data types were removed from the DirectX Specs and the Agility SDK headers, the drivers still seem to support them:
Code:
HighestShaderModel : D3D12_SHADER_MODEL_6_9 (0x0069)
WaveMMATier : D3D12_WAVE_MMA_TIER_1_0 (10)
WaveMMA operations supported : 3

To check data types and matrix dimensions, run checkformats_agile.cmd, which sets /formats /verbose command-line options:
Code:
Wave Matrix Multiply Accumulate
[M]x[N] TYPE  -> [K] TYPE
16x16 BYTE    -> x16 INT32 (1)
16x16 FLOAT16 -> x16 FLOAT16, FLOAT (6)
16x16 FLOAT   -> x16 NONE (0)
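For anyone trying to reproduce this: the experimental shader model path has to be opted into with D3D12EnableExperimentalFeatures before the device is created (developer mode is required), along these lines:
Code:
#include <d3d12.h>

// Opt into experimental shader models (e.g. the SM 6.9 preview) before
// creating the device; requires Windows developer mode to be enabled.
bool EnableExperimentalShaderModels()
{
    return SUCCEEDED(D3D12EnableExperimentalFeatures(
        1, &D3D12ExperimentalShaderModels, nullptr, nullptr));
}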


I've also updated FormatSupport.XLSX with color-coding to indicate mandatory and optional resource formats on feature levels 12_x, as well as mandatory and optional capability bits for each of these formats as required by Microsoft specs.

Comparing D3D12_FORMAT_SUPPORT1 and D3D12_FORMAT_SUPPORT2 caps from the current WARP12 driver 1.0.9 (build 27566) and the most recent drivers for AMD Radeon 6000/7000 series, NVIDIA RTX 3000/4000 series, and Intel Arc Xe integrated graphics, they all support almost every format and option available in DXGI with only a few minor exceptions - a clear improvement over earlier generations of graphics hardware, where many optional formats and capabilities were never supported...
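The caps being compared there come from per-format CheckFeatureSupport queries, along these lines:
Code:
#include <d3d12.h>

// Query the Support1/Support2 capability bits for a single DXGI format,
// as color-coded in the FormatSupport spreadsheet.
D3D12_FEATURE_DATA_FORMAT_SUPPORT QueryFormatSupport(ID3D12Device* device,
                                                     DXGI_FORMAT format)
{
    D3D12_FEATURE_DATA_FORMAT_SUPPORT support = { format };
    device->CheckFeatureSupport(D3D12_FEATURE_FORMAT_SUPPORT,
                                &support, sizeof(support));
    // support.Support1: D3D12_FORMAT_SUPPORT1_* bits (render target, blendable, etc.)
    // support.Support2: D3D12_FORMAT_SUPPORT2_* bits (UAV atomics, typed UAV load, etc.)
    return support;
}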
 
4090 with 551.76:

Code:
RecreateAtTier : D3D12_RECREATE_AT_TIER_NOT_SUPPORTED (0)
WorkGraphsTier : D3D12_WORK_GRAPHS_TIER_1_0 (10)
ExecuteIndirectTier : D3D12_EXECUTE_INDIRECT_TIER_1_1 (11)
SampleCmpGradientAndBiasSupported : 1
ExtendedCommandInfoSupported : 1
Predication.Supported : 1
HardwareCopy.Supported : 1

Code:
HighestShaderModel : D3D12_SHADER_MODEL_6_8 (0x0068)
WaveMMATier : D3D12_WAVE_MMA_TIER_1_0 (10)
WaveMMA operations supported : 8

Code:
Wave Matrix Multiply Accumulate
[M]x[N] TYPE    -> [K] TYPE
16x16 BYTE      -> x16 INT32 (1)
16x16 FLOAT16   -> x16 FLOAT16, FLOAT (6)
16x64 BYTE      -> x16 INT32 (1)
16x64 FLOAT16   -> x16 FLOAT16, FLOAT (6)
64x16 BYTE      -> x16 INT32 (1)
64x16 FLOAT16   -> x16 FLOAT16, FLOAT (6)
64x64 BYTE      -> x16 INT32 (1)
64x64 FLOAT16   -> x16 FLOAT16, FLOAT (6)

I am getting "GPUUploadHeapSupported : 0" with both DevMode off and on though. Which is weird considering that it was 1 on the previous driver branch.
 
I am getting "GPUUploadHeapSupported : 0" with both DevMode off and on though. Which is weird considering that it was 1 on the previous driver branch.
Same for me, even though the relevant AMD Adrenalin setting remains enabled. I suppose this option will return after a restart though.
 
Same for me, even though the relevant AMD Adrenalin setting remains enabled. I suppose this option will return after a restart though.
They apparently needed a kernel-mode change to make it work reliably, and there is no public version of Windows 11 with that change yet (they listed 26080, current insider preview is 26063)
 
This sample demonstrates how to achieve GPU-driven shader launches using the D3D12 work graphs API. In this sample, a large number of meshes animate on the screen in a typical deferred shading rendering environment. After the g-buffer is populated with only normals and material IDs, the sample applies deferred shading using tiled light culling and material shading. This is entirely achieved using one call to DispatchGraph. The sample has three variations of work graphs: the first uses tiled light culling and broadcasting launch nodes, the second uses per-pixel light culling and coalescing launch nodes, and the third also uses per-pixel light culling but with thread launch nodes. Finally, a standard Dispatch-based implementation is provided for comparison.
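For orientation, here's a minimal sketch of what such a single-call submission looks like on the CPU side, using the work graphs API from Agility SDK 1.613 (state object creation, backing memory allocation and the record layout are assumed to be set up elsewhere):
Code:
#include <d3d12.h>

// Submit CPU-generated input records to a work graph with one DispatchGraph call.
// The program identifier and backing memory come from the work-graph state object.
void SubmitWorkGraph(ID3D12GraphicsCommandList10* commandList,
                     D3D12_PROGRAM_IDENTIFIER programId,
                     D3D12_GPU_VIRTUAL_ADDRESS_RANGE backingMemory,
                     void* records, UINT numRecords, UINT recordStride)
{
    D3D12_SET_PROGRAM_DESC program = {};
    program.Type = D3D12_PROGRAM_TYPE_WORK_GRAPH;
    program.WorkGraph.ProgramIdentifier = programId;
    program.WorkGraph.Flags = D3D12_SET_WORK_GRAPH_FLAG_INITIALIZE;
    program.WorkGraph.BackingMemory = backingMemory;
    commandList->SetProgram(&program);

    D3D12_DISPATCH_GRAPH_DESC dispatch = {};
    dispatch.Mode = D3D12_DISPATCH_MODE_NODE_CPU_INPUT;
    dispatch.NodeCPUInput.EntrypointIndex     = 0;          // entry node of the graph
    dispatch.NodeCPUInput.NumRecords          = numRecords;
    dispatch.NodeCPUInput.pRecords            = records;
    dispatch.NodeCPUInput.RecordStrideInBytes = recordStride;
    commandList->DispatchGraph(&dispatch);
}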
 
A comparison of Work Graphs vs. traditional rendering using a sample scene provided by NVIDIA.

In theory, this should improve performance during specific scenarios. As you’ll see in the video, though, these performance improvements are not universal. In other words, there are multiple scenes in which the tech demo runs exactly the same with and without Work Graphs.

 
Any benches on an RDNA 3 GPU?

AMD released a sample last year showing a compute shader software rasterizer implementation with results at the end of the blog post. They'll go into more detail next week at GDC about how developers can use work graphs beyond nested parallelism in some of these samples, such as implementing persistent producer/consumer work queues for recursive expansion/compaction of complex data structures, as seen in Nanite's per-cluster hierarchical LoD selection phase ...
 
AMD released a sample last year showing a compute shader software rasterizer implementation with results at the end of the blog post.
Correct me if I am wrong, but doesn't their result align with NVIDIA's? At bin sizes 6 to 13, the Work Graph methods are slower than multi-pass ExecuteIndirect; only at 14 bins does the Work Graph carve out a ~20% win, and at 15 bins it goes down to ~5%.
 