Direct3D feature levels discussion

Scott_Arm · Jun 23, 2023

https://twitter.com/x/status/1672276563767750657

I haven't seen many responses that are critical of the programming paradigm; most of the criticism is about the lack of debugging tools.

Jay · Jun 23, 2023

Scott_Arm said:
I doubt it, unless AMD exposed something similar on Playstation. The reaction from game devs makes it appear as if it's something new.

I assume this is more of a programming paradigm than something hardware related, so I don't think PS has anything to do with it.
It's possible that the PS API already makes it available, or that it could be added, or that it's not a good fit; who knows. This is a DX thing though.

Sounds like it would help get better utilisation out of the XSX GPU.

DavidGraham · Jun 23, 2023

One more piece of feedback from a developer.

"This is great, but adopting new Agility SDK features is difficult when the DirectX Shader Compiler binaries have multiple conflicting linkage requirements, meaning you can't migrate an app from the Windows SDK version of DXC to the more recent GitHub release..."

https://twitter.com/x/status/1671980517124612126

https://twitter.com/x/status/1671981520783810561

DegustatoR · Jun 23, 2023

Scott_Arm said:
https://twitter.com/x/status/1672276563767750657

I haven't seen many responses that are critical of the programming paradigm; most of the criticism is about the lack of debugging tools.

The paradigm itself isn't new and has been on the roadmaps (so to speak) for quite some time now.

The implementation presented here, though, may not be very suitable for anything besides GPU compute, and rather simple compute at that, as there are no synchronization options right now.

Still this is a preview and it will likely get expanded upon before release - at which point all IHVs will need to support it.

Jay said:
I assume this is more of a programming paradigm than something hardware related, so I don't think PS has anything to do with it.
It's possible that the PS API already makes it available, or that it could be added, or that it's not a good fit; who knows. This is a DX thing though.

Sounds like it would help get better utilisation out of the XSX GPU.

I'd say that it's less interesting for UMA h/w like consoles but it may end up freeing up some CPU cycles there as well.

Lurkmass · Jun 23, 2023

Scott_Arm said:
Seeing some less positive responses now

https://twitter.com/x/status/1672026301140443136

Timothy is complaining about the fact that the solution isn't more explicit and doesn't pivot hard enough to AMD HW despite the reception by his former colleagues over there ...

OlegSH said:
Launching compute kernels from device has been available in CUDA since Kepler.
CUDA Graphs have also been available for a while. Not sure how it compares to the DX Compute Graphs, but the concept seems to be the same.
Differences might lie in the interaction with graphics, but this has not been explored in the DX api yet.

D3D12 Work Graphs are a little bit more powerful than CUDA device graphs. CUDA graphs have restrictions, such as memcpy nodes not being usable with CUDA arrays, which isn't ideal for Nanite-style producer-consumer queue work compaction ...

D3D12 Work Graphs shine over CUDA graphs in allowing implementations to efficiently pass registers from the producer to the consumer, which translates to significant memory bandwidth savings ...

Kaotik · Jun 23, 2023

DavidGraham said:
One more piece of feedback from a developer.

"This is great, but adopting new Agility SDK features is difficult when the DirectX Shader Compiler binaries have multiple conflicting linkage requirements, meaning you can't migrate an app from the Windows SDK version of DXC to the more recent GitHub release..."

https://twitter.com/x/status/1671980517124612126

https://twitter.com/x/status/1671981520783810561

What happened to the DirectX installers that came with every game? Why can't they fulfill this duty?

Alessio1989 · Jun 24, 2023

Yet another feature that will take ages to be supported on a good amount of hardware and to be mastered, and I bet it will become bloated soonish :\
Going to a fully programmable pipeline should mean simpler solutions :V
But hell, I hope I am wrong :v

OlegSH · Jun 24, 2023

Lurkmass said:
D3D12 Work Graphs shine over CUDA graphs in allowing implementations to efficiently pass registers from the producer to the consumer, which translates to significant memory bandwidth savings ...

Do you mean producer-consumer queues? They have been supported in CUDA since A100, so you can allocate and pin a part of the L2 cache for these queues to save bandwidth.

Ethatron · Jun 24, 2023

OlegSH said:
Do you mean producer-consumer queues? They have been supported in CUDA since A100, so you can allocate and pin a part of the L2 cache for these queues to save bandwidth.

In Work Graphs you have a black-box data passing mechanism between nodes. You don't need to (and should not) do it manually via something like UAV constructs. A hardware implementation can pass the data in any way it wants: registers, extra LDS, caches.

Ethatron · Jun 24, 2023

DegustatoR said:
Or it solves a problem which doesn't exist on their h/w.

Hardware isn't self-serving. It appreciably simplifies development of types of pipelines for ISVs. Especially binning, reduction and occupancy related constructs benefit (runtime performance and development performance).

OlegSH · Jun 24, 2023

Ethatron said:
In Work Graphs you have a black-box data passing mechanism between nodes. You don't need to (and should not) do it manually via something like UAV constructs. A hardware implementation can pass the data in any way it wants: registers, extra LDS, caches.

Registers are private per thread, and LDS is too small, leaving caches as the only feasible solution for the task. However, the lack of restrictions and explicitness raises questions about its speed in practice. It remains to be seen whether this approach will, for example, save bandwidth.

Alessio1989 · Jun 24, 2023

They limited recursion to 32 steps, if I understood correctly. But I started getting nausea just from the word recursion xD Anything that isn't tail recursion is bad, and tail recursion is generally trivial to turn into iteration. The fact that they limited it means they don't expect tail recursion at all... so yes, bloating caches?
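
The tail-recursion point can be illustrated with a small sketch that is unrelated to any Work Graphs API. It shows a tail-recursive function with an explicit depth cap (32 here, mirroring the limit mentioned above) next to its loop equivalent. The function names and the cap constant are made up for illustration.

```python
MAX_DEPTH = 32  # mirrors the recursion limit mentioned above; an assumption


def halve_recursive(n, depth=0):
    """Count halvings until n reaches 1, recursing in tail position."""
    if n <= 1:
        return depth
    if depth >= MAX_DEPTH:
        raise RecursionError("depth cap reached")
    # Tail call: nothing is left to do in this frame after it returns.
    return halve_recursive(n // 2, depth + 1)


def halve_iterative(n):
    """Same computation as a loop: the tail call becomes a state update."""
    depth = 0
    while n > 1:
        n //= 2
        depth += 1
    return depth


print(halve_recursive(1024), halve_iterative(1024))
```

The loop version needs no per-step state beyond the two variables, so it has no reason to hit a depth cap, whereas the recursive version fails once the input needs more than 32 steps.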

DmitryKo · Jun 26, 2023

Krteq said:
Seems like @DmitryKo 's feature checker is refusing to work with the latest Agility SDK lib (it's working fine with 1.710).

This is by design: the developer has to use exported symbols to specify the exact version of the Agility SDK to be loaded at runtime. Since the latest version of the tool was built with D3D12SDKVersion=710 embedded in the executable file, the OS will only load a matching D3D12Core.dll version 1.710, and will throw an error if you replace it with any other version. (BTW, they used to allow higher versions of the DLL, but this was changed with the transition to SDK 6xx/7xx.)

Either way, simply replacing the D3D12Core.dll with a new version will not gain you anything, because the source code needs to be updated to use the new structures defined in the latest Agility SDK header files. Even though I made some changes to support Agility SDK 711, WaveMMA reporting only works on Radeon RX 7000 (RDNA3) and my current card is RX 5700 XT, so I cannot test it unless AMD implements it on RDNA1 cards (or I get a very good deal on a Radeon RX 7600).

EDIT: I've updated my feature checker tool to report new experimental features in the Agility SDK 1.711 preview.

DmitryKo · Jun 27, 2023

DmitryKo said:
This is how the new features are reported by the current redistributable WARP library 1.0.5 (DLL version 10.0.25321.1003).

Code:

NonNormalizedCoordinateSamplersSupported : 1
ManualWriteTrackingResourceSupported : 0
RenderPassesValid : 1
MismatchingOutputDimensionsSupported : 1
SupportedSampleCountsWithNoOutputs : 31
PointSamplingAddressesNeverRoundUp : 1
RasterizerDesc2Supported : 1
NarrowQuadrilateralLinesSupported : 1
AnisoFilterWithPointMipSupported : 1
MaxSamplerDescriptorHeapSize : 2097152
MaxSamplerDescriptorHeapSizeWithStaticSamplers : 2097152
MaxViewDescriptorHeapSize : 2097152
ComputeOnlyCustomHeapSupported : 0

And this is a report by the Adrenalin driver 23.3.1 (build 31.0.14037.1007) on my Radeon RX 5700 XT (AMD has not released an Agility SDK 1.710.0-specific driver yet).

BTW the beta AMD Adrenalin driver 23.10.01.14 (build 31.0.21001.14018) now supports a few new features in the Agility SDK 1.710 and 1.706/1.606 even on the Radeon RX 5700 XT:

Code:

EnhancedBarriersSupported : 1
RelaxedFormatCastingSupported : 1
DynamicIndexBufferStripCutSupported : 1
DynamicDepthBiasSupported : 1
GPUUploadHeapSupported : 1
NonNormalizedCoordinateSamplersSupported : 1
MismatchingOutputDimensionsSupported : 1
SupportedSampleCountsWithNoOutputs : 29
PointSamplingAddressesNeverRoundUp : 1
RasterizerDesc2Supported : 1
NarrowQuadrilateralLinesSupported : 1
AnisoFilterWithPointMipSupported : 1
MaxSamplerDescriptorHeapSize : 67108864
MaxSamplerDescriptorHeapSizeWithStaticSamplers : 67108864
MaxViewDescriptorHeapSize : 33554432
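
For anyone who wants to compare two such listings without eyeballing them, here is a minimal sketch. It assumes the plain `Name : value` layout shown above; `parse_report` and `diff_reports` are hypothetical helper names, not part of the feature checker tool, and the sample values are a subset copied from the WARP and RX 5700 XT listings in this post.

```python
def parse_report(text):
    """Parse 'Name : value' lines from a feature checker report into a dict."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:  # skip blank lines and lines without the separator
            features[name.strip()] = value.strip()
    return features


def diff_reports(a, b):
    """Return {name: (value_a, value_b)} for options whose values differ.

    An option missing from one report shows up as None on that side.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Subset of the values quoted in this post.
warp = parse_report("""
NonNormalizedCoordinateSamplersSupported : 1
SupportedSampleCountsWithNoOutputs : 31
MaxSamplerDescriptorHeapSize : 2097152
MaxViewDescriptorHeapSize : 2097152
""")
rx5700xt = parse_report("""
NonNormalizedCoordinateSamplersSupported : 1
SupportedSampleCountsWithNoOutputs : 29
MaxSamplerDescriptorHeapSize : 67108864
MaxViewDescriptorHeapSize : 33554432
""")

for name, (w, r) in diff_reports(warp, rx5700xt).items():
    print(f"{name}: WARP={w}, RX 5700 XT={r}")
```

Values are kept as strings on purpose, since many options in a full report are printed as an enum name followed by a number in parentheses.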

DmitryKo · Jul 2, 2023

DegustatoR said:
Has anyone run @DmitryKo 's utility on an RDNA3 card btw?
It is likely a copy of RDNA2 feature wise but just to be sure.

I've just got myself a Radeon RX 7600 (RDNA3) card. Compared to RDNA2, there are no major new feature options except for the experimental WaveMMA and D3D12_WORK_GRAPHS_TIER_0_1, judging by the most recent report for the RX 6800 posted by CarstenS back in November 2020; his comparison with the Nvidia RTX series remains valid as well.

Here are the differences between the RX 7600 and the RX 5700 XT, as reported by the experimental WorkGraphs/WaveMMA driver 23.10.01.14 from the post above with the Agility SDK 1.711 preview:

Code:

Maximum feature level : D3D_FEATURE_LEVEL_12_2 (0xc200)
BarycentricsSupported : 1
RaytracingTier : D3D12_RAYTRACING_TIER_1_1 (11)
PerPrimitiveShadingRateSupportedWithViewportIndexing : 1
VariableShadingRateTier : D3D12_VARIABLE_SHADING_RATE_TIER_2 (2)
ShadingRateImageTileSize : 8
MeshShaderTier : D3D12_MESH_SHADER_TIER_1 (10)
SamplerFeedbackTier : D3D12_SAMPLER_FEEDBACK_TIER_1_0 (100)
MeshShaderPipelineStatsSupported : 1
WaveMMATier : D3D12_WAVE_MMA_TIER_1_0 (10)
VariableRateShadingSumCombinerSupported : 1

Here are the differences compared to your RTX 4090 report posted this March:

Code:

GraphicsPreemptionGranularity : DXGI_GRAPHICS_PREEMPTION_PRIMITIVE_BOUNDARY (1)
ComputePreemptionGranularity : DXGI_COMPUTE_PREEMPTION_DMA_BUFFER_BOUNDARY (0)
PSSpecifiedStencilRefSupported : 1
MaxGPUVirtualAddressBitsPerResource : 44
MaxGPUVirtualAddressBitsPerProcess : 44
ViewInstancingTier : D3D12_VIEW_INSTANCING_TIER_1 (1)
AdditionalShadingRatesSupported : 0
ShadingRateImageTileSize : 8
BackgroundProcessingSupported : 0
SamplerFeedbackTier : D3D12_SAMPLER_FEEDBACK_TIER_1_0 (100)
WaveMMATier : D3D12_WAVE_MMA_TIER_1_0 (10)
MeshShaderPerPrimitiveShadingRateSupported : 0
MSPrimitivesPipelineStatisticIncludesCulledPrimitives : 0

EDIT: I've updated my feature checker tool to report new experimental features in the Agility SDK 1.711 preview.

Cyan · Jul 8, 2023

Here is the result for my Intel Arc A770, from the D3D12FeatureOptionsAgile.txt file.

Code:

Direct3D 12 feature checker (July 2023) by DmitryKo (x64) (Agility SDK v711)

Windows 10X version 22H2 (build 22621.1928 ni_release) x64

ADAPTER 0
"Intel(R) Arc(TM) A770 Graphics"
VEN_8086, DEV_56A0, SUBSYS_10208086, REV_08
Dedicated video memory : 16256.0 MB (17045651456 bytes)
Total video memory : 24412.4 MB (25598205952 bytes)
BIOS string : Intel Video BIOS
Video driver version : 31.0.101.4314
WDDM version : KMT_DRIVERVERSION_WDDM_3_1 (3100)
Virtual memory model : GPUMMU
Hardware-accelerated scheduler : Disabled, DXGK_FEATURE_SUPPORT_ALWAYS_OFF (0)
GraphicsPreemptionGranularity : DXGI_GRAPHICS_PREEMPTION_TRIANGLE_BOUNDARY (2)
ComputePreemptionGranularity : DXGI_COMPUTE_PREEMPTION_THREAD_GROUP_BOUNDARY (2)
Maximum feature level : D3D_FEATURE_LEVEL_12_2 (0xc200)
DoublePrecisionFloatShaderOps : 0
OutputMergerLogicOp : 1
MinPrecisionSupport : D3D12_SHADER_MIN_PRECISION_SUPPORT_16_BIT (2) (0b0000'0010)
TiledResourcesTier : D3D12_TILED_RESOURCES_TIER_3 (3)
ResourceBindingTier : D3D12_RESOURCE_BINDING_TIER_3 (3)
PSSpecifiedStencilRefSupported : 1
TypedUAVLoadAdditionalFormats : 1
ROVsSupported : 1
ConservativeRasterizationTier : D3D12_CONSERVATIVE_RASTERIZATION_TIER_3 (3)
StandardSwizzle64KBSupported : 0
CrossNodeSharingTier : D3D12_CROSS_NODE_SHARING_TIER_NOT_SUPPORTED (0)
CrossAdapterRowMajorTextureSupported : 1
VPAndRTArrayIndexFromAnyShaderFeedingRasterizerSupportedWithoutGSEmulation : 1
ResourceHeapTier : D3D12_RESOURCE_HEAP_TIER_1 (1)
MaxGPUVirtualAddressBitsPerResource : 44
MaxGPUVirtualAddressBitsPerProcess : 48
Adapter Node 0:     TileBasedRenderer: 0, UMA: 0, CacheCoherentUMA: 0, IsolatedMMU: 1, HeapSerializationTier: 0, ProtectedResourceSession.Support: 1, ProtectedResourceSessionTypeCount: 1 D3D12_PROTECTED_RESOURCES_SESSION_HARDWARE_PROTECTED
HighestShaderModel : D3D12_SHADER_MODEL_6_7 (0x0067)
WaveOps : 1
WaveLaneCountMin : 8
WaveLaneCountMax : 32
TotalLaneCount : 16384
ExpandedComputeResourceStates : 1
Int64ShaderOps : 1
RootSignature.HighestVersion : D3D_ROOT_SIGNATURE_VERSION_1_2 (3)
DepthBoundsTestSupported : 1
ProgrammableSamplePositionsTier : D3D12_PROGRAMMABLE_SAMPLE_POSITIONS_TIER_1 (1)
ShaderCache.SupportFlags : D3D12_SHADER_CACHE_SUPPORT_SINGLE_PSO | LIBRARY | AUTOMATIC_INPROC_CACHE | AUTOMATIC_DISK_CACHE | SHADER_CONTROL_CLEAR | SHADER_SESSION_DELETE (111) (0b0110'1111)
CopyQueueTimestampQueriesSupported : 1
CastingFullyTypedFormatSupported : 1
WriteBufferImmediateSupportFlags : D3D12_COMMAND_LIST_SUPPORT_FLAG_DIRECT | BUNDLE | COMPUTE | COPY (15) (0b0000'1111)
ViewInstancingTier : D3D12_VIEW_INSTANCING_TIER_2 (2)
BarycentricsSupported : 0
ExistingHeaps.Supported : 1
MSAA64KBAlignedTextureSupported : 1
SharedResourceCompatibilityTier : D3D12_SHARED_RESOURCE_COMPATIBILITY_TIER_2 (2)
Native16BitShaderOpsSupported : 1
AtomicShaderInstructions : 0
SRVOnlyTiledResourceTier3 : 1
RenderPassesTier : D3D12_RENDER_PASS_TIER_0 (0)
RaytracingTier : D3D12_RAYTRACING_TIER_1_1 (11)
AdditionalShadingRatesSupported : 1
PerPrimitiveShadingRateSupportedWithViewportIndexing : 1
VariableShadingRateTier : D3D12_VARIABLE_SHADING_RATE_TIER_2 (2)
ShadingRateImageTileSize : 8
BackgroundProcessingSupported : 1
MeshShaderTier : D3D12_MESH_SHADER_TIER_1 (10)
SamplerFeedbackTier : D3D12_SAMPLER_FEEDBACK_TIER_0_9 (90)
UnalignedBlockTexturesSupported : 1
MeshShaderPipelineStatsSupported : 1
MeshShaderSupportsFullRangeRenderTargetArrayIndex : 1
AtomicInt64OnTypedResourceSupported : 0
AtomicInt64OnGroupSharedSupported : 0
DerivativesInMeshAndAmplificationShadersSupported : 0
WaveMMATier : D3D12_WAVE_MMA_TIER_NOT_SUPPORTED (0)
VariableRateShadingSumCombinerSupported : 1
MeshShaderPerPrimitiveShadingRateSupported : 1
AtomicInt64OnDescriptorHeapResourceSupported : 1
DisplayableTexture : 0
DisplayableTexture.SharedResourceCompatibilityTier : D3D12_SHARED_RESOURCE_COMPATIBILITY_TIER_0 (0)
MSPrimitivesPipelineStatisticIncludesCulledPrimitives : 1
EnhancedBarriersSupported : 1
RelaxedFormatCastingSupported : 1
UnrestrictedBufferTextureCopyPitchSupported : 1
UnrestrictedVertexElementAlignmentSupported : 1
InvertedViewportHeightFlipsYSupported : 1
InvertedViewportDepthFlipsZSupported : 1
TextureCopyBetweenDimensionsSupported : 1
AlphaBlendFactorSupported : 1
AdvancedTextureOpsSupported : 0
WriteableMSAATexturesSupported : 0
IndependentFrontAndBackStencilRefMaskSupported : 1
TriangleFanSupported : 1
DynamicIndexBufferStripCutSupported : 1
DynamicDepthBiasSupported : 1
GPUUploadHeapSupported : 1
NonNormalizedCoordinateSamplersSupported : 1
ManualWriteTrackingResourceSupported : 0
RenderPassesValid : 1
MismatchingOutputDimensionsSupported : 0
SupportedSampleCountsWithNoOutputs : 1
PointSamplingAddressesNeverRoundUp : 0
RasterizerDesc2Supported : 1
NarrowQuadrilateralLinesSupported : 0
AnisoFilterWithPointMipSupported : 0
MaxSamplerDescriptorHeapSize : 2048
MaxSamplerDescriptorHeapSizeWithStaticSamplers : 2048
MaxViewDescriptorHeapSize : 1000000
ComputeOnlyCustomHeapSupported : 0
ComputeOnlyWriteWatchSupported : 1
Experimental.WorkGraphsTier : D3D12_WORK_GRAPHS_TIER_NOT_SUPPORTED (0)
Metacommands enumerated : 11
Metacommands [parameters per stage]: Conv (Convolution) [84][1][6], GEMM (General matrix multiply) [67][1][6], Pooling [44][1][4], Conv (Convolution) [108][5][6], GEMM (General matrix multiply) [91][5][6], MVN (Mean Variance Normalization) [91][5][6], Pooling [56][3][4], LSTM (Long Short-Term Memory) [252][10][13], DStorageCustom Metacommand [4][0][11],  [1][0][9],  [4][0][11]
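
Several values in the report are printed in encoded form, such as the feature level (0xc200), the shader model (0x0067) and bit flags like ShaderCache.SupportFlags (111). The sketch below shows one way to decode them. The flag values follow D3D12_SHADER_CACHE_SUPPORT_FLAGS from d3d12.h as far as I know them, and the function names are hypothetical.

```python
from enum import IntFlag


class ShaderCacheSupport(IntFlag):
    # Bit values assumed to match D3D12_SHADER_CACHE_SUPPORT_FLAGS in d3d12.h.
    SINGLE_PSO = 0x1
    LIBRARY = 0x2
    AUTOMATIC_INPROC_CACHE = 0x4
    AUTOMATIC_DISK_CACHE = 0x8
    DRIVER_MANAGED_CACHE = 0x10
    SHADER_CONTROL_CLEAR = 0x20
    SHADER_SESSION_DELETE = 0x40


def decode_feature_level(value):
    """Split a D3D_FEATURE_LEVEL value: major in bits 12-15, minor in bits 8-11."""
    return (value >> 12) & 0xF, (value >> 8) & 0xF


def decode_shader_model(value):
    """Split a D3D_SHADER_MODEL value: major in the high nibble, minor in the low."""
    return (value >> 4) & 0xF, value & 0xF


def flag_names(flags):
    """List the names of the individual bits set in an IntFlag value."""
    return [member.name for member in type(flags) if member in flags]


print(decode_feature_level(0xC200))       # feature level from the report
print(decode_shader_model(0x0067))        # HighestShaderModel from the report
print(flag_names(ShaderCacheSupport(111)))
```

Decoded this way, the 111 in the A770 report has every listed bit set except DRIVER_MANAGED_CACHE, which matches the names the tool prints next to it.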

oscarbg · Jul 20, 2023

Sadly, the NVIDIA 545.37 driver still doesn't enable or support Work Graphs or WaveMMA.

Jay · Jul 20, 2023

Could actually do with an easily digestible table with all the results from the different cards and what they mean. Or by the time that's done the actual detail becomes worthless?
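
A rough sketch of how such a table could be generated from already-parsed reports. `support_table` is a hypothetical helper, and the sample values are copied from the RX 7600 and Arc A770 listings earlier in the thread (reading the RX 7600 vs RX 5700 XT difference listing as the RX 7600's values, which is an assumption).

```python
def support_table(reports, options):
    """Format a plain-text matrix: one row per option, one column per card."""
    cards = list(reports)
    rows = [["Option"] + cards]
    for option in options:
        rows.append([option] + [reports[c].get(option, "n/a") for c in cards])
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


reports = {
    "RX 7600": {
        "BarycentricsSupported": "1",
        "SamplerFeedbackTier": "D3D12_SAMPLER_FEEDBACK_TIER_1_0 (100)",
        "WaveMMATier": "D3D12_WAVE_MMA_TIER_1_0 (10)",
    },
    "Arc A770": {
        "BarycentricsSupported": "0",
        "SamplerFeedbackTier": "D3D12_SAMPLER_FEEDBACK_TIER_0_9 (90)",
        "WaveMMATier": "D3D12_WAVE_MMA_TIER_NOT_SUPPORTED (0)",
    },
}
options = ["BarycentricsSupported", "SamplerFeedbackTier", "WaveMMATier"]
print(support_table(reports, options))
```

This only lays out the raw values side by side; explaining what each option means would still need the documentation.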

DmitryKo · Jul 24, 2023

FYI, the Wikipedia article "Feature levels in Direct3D" does include a support matrix table for a few important feature options.

There are references to Microsoft Learn (formerly MSDN/Docs) online documentation for D3D12_FEATURE and currently supported feature options, the Direct3D 12 Programming Guide on major feature tiers, and DirectX-Specs (Engineering Specs for DirectX Features) for low-level features currently in development.

Unfortunately, the online documentation is poorly cross-referenced and does not cover Insider Preview SDK and Agility SDK releases, and DirectX-Specs only covers major developments...

DegustatoR · Sep 2, 2023

Advanced API Performance: Shaders | NVIDIA Technical Blog

This post covers best practices when working with shaders on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Shaders play a critical…

developer.nvidia.com
