Direct3D feature levels discussion

If you read the DirectStorage 1.1 and GDeflate announcement blog post, it looks like Intel UHD/Arc drivers are currently the only ones to support custom GDeflate decompression.

NVidia maintains the open-source nvcomp CUDA library which supports LZ4, Deflate (PKZIP), GDeflate, zStandard etc., but they will likely have a custom WDDM driver path as well.

AMD Radeon RX would probably rely on the DirectCompute fallback, judging by the careful language we all know too well.

GDeflate provides a new GPU decompression format that all hardware vendors can support and optimize for. Microsoft is working with key partners like AMD, Intel, and NVIDIA to provide drivers tailored for this format. “Intel is excited to release drivers co-engineered with Microsoft to work seamlessly with the DirectStorage Runtime to bring optimized GPU decompression capabilities to game developers!” said Murali Ramadoss, Intel Fellow and GM of GPU Software Architecture. <...> Hardware vendors may begin releasing drivers for DirectStorage 1.1 in the weeks before our release. <...> “DirectStorage 1.1 with GPU decompression will enable developers to unleash their creativity, delivering more detailed and visually stunning worlds,” said Scott Herkelman, senior vice president and general manager, Graphics Business Unit at AMD. “We have worked closely with Microsoft to ensure the best possible experience on AMD devices and platforms.” If these drivers are not present, DirectStorage will fall back to an optimized DirectCompute implementation. Developers who plan to experiment with the next version of DirectStorage should keep an eye out for new and improved drivers.
 
NVidia maintains the open-source nvcomp CUDA library which supports Deflate (PKZIP), GDeflate, zStandard etc.,
And just quoting NVidia:
  • Cascaded, GDeflate, zStandard, Deflate and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream.
That makes three different GDeflate implementations, one from each IHV (or MS), and even now the three are most likely not 100% bitstream compatible or standard compliant. And this is just one algorithm in isolation so far, not even a full codec.

These general-purpose compression algorithms are one of the cases where you want to make sure everyone uses the exact same implementation. As a developer, you don't care at all whether that one implementation is conformant (chances are quite low!), as long as it is deterministic and at least one compatible compressor/decompressor pair exists.

Long term, there is also the major risk that fixing just a single bug or case of UB is going to break decoding of a large number of assets.
 
These are different compression algorithms and bitstream formats, not different implementations of the same algorithm and bytestream. Even though all of them use the dictionary-coding LZ77 algorithm as the baseline, each specific implementation adds alternate passes by algorithms from the entropy coding domain, so their bytestreams are not directly interchangeable.

Specifically, Deflate uses LZ77 + Huffman coding, GDeflate keeps the LZ77 + Huffman scheme but rearranges the bitstream so that it can be decoded in parallel on a GPU, and zStandard uses LZ77 + Huffman + asymmetric numeral systems (ANS).

zStandard is also said to use larger look-up tables and deeper search windows in order to find additional repeating patterns, while the LZ4 algorithm reduces computational complexity and resource usage to maximize decoding speed and processing bandwidth.
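To make the "not interchangeable" point concrete, here is a minimal sketch using only Python's standard library. GDeflate and zStandard are not assumed to be available there, so zlib (Deflate) and XZ (LZMA2) stand in as two formats that both build on LZ77-style matching but use different entropy coders and containers. The payload and the helper names are made up for illustration.

```python
import lzma
import zlib

# Repetitive data, so the LZ77-style match finders have something to work with.
payload = b"texture tile " * 64

# zlib-wrapped Deflate stream: LZ77 + Huffman coding.
deflate_stream = zlib.compress(payload)
# XZ container with LZMA2: LZ77-style matching + range coding.
xz_stream = lzma.compress(payload, format=lzma.FORMAT_XZ)


def decodes_with_zlib(stream):
    """Return True if the zlib decoder accepts the stream, False if it rejects it."""
    try:
        zlib.decompress(stream)
        return True
    except zlib.error:
        return False


def decodes_with_xz(stream):
    """Return True if the XZ decoder accepts the stream, False if it rejects it."""
    try:
        lzma.decompress(stream, format=lzma.FORMAT_XZ)
        return True
    except lzma.LZMAError:
        return False
```

Each decoder accepts only the stream produced by its own compressor: `decodes_with_zlib(xz_stream)` and `decodes_with_xz(deflate_stream)` both come back False, even though both formats share the same dictionary-coding baseline.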


I don't know what they meant by "other decompressors", but nvcomp's High Level API tries to detect the stream format and compression parameters on the fly, which cannot work reliably with bytestreams produced by external implementations of the same algorithm; the Low Level API expects the programmer to specify algorithms and parameters.
 
I don't know what they meant by "other decompressors", but their High Level API tries to detect the stream format and compression parameters on the fly, which cannot work reliably with bytestreams produced by external implementations of the same algorithm; the Low Level API expects the programmer to specify algorithms and parameters.
In the end, because graphics drivers are "per game", any problems with "automatic bytestream format detection" are solved by the driver being informed of how the game works. Even if the game has multiple bytestream formats, the driver should be adaptable through per-game configuration.
 
graphics drivers are "per game", any problems with "automatic bytestream format detection" are solved by the driver being informed of how the game works
This is not a problem with the DirectStorage runtime or WDDM drivers; they only support GDeflate, so there is no need to detect additional bytestreams.

The known issue cited above refers to the nvcomp CUDA library by NVidia; it implements the GDeflate algorithm but also supports other common formats like LZ4 and Deflate (PKZIP), so the issue in question seems to relate to 3rd-party implementations of those formats. GDeflate has only been implemented by NVidia; there are no 3rd-party implementations. The original edit of the readme file doesn't even mention GDeflate.
 
I have made an API table for AMD / NVIDIA / Intel.

I would like to get the DirectX section right; I didn't add 6.8 because I am not sure if it's fully released yet for the latest non-preview version of Windows.

(max support for Windows OS and the latest Windows drivers from the companies)

Also, the Intel part is not complete yet.

(Modedit - full size inserts)

[Image: Fj-BtGXXkAE5gYU.png]

[Image: Fj-C0KSXoAILm7O.png]

[Image: Untitled.png]

(Modedit 2: redundant old table removed, updated one in next post)
 
Lovelace is a carbon copy of Ampere in what it reports on supported features through D3D if anyone is wondering.
Sampler Feedback (SF) is still at Tier 0.9 as well.
 
Some new Agility SDK stuff. Haven't read through it.


GPU Upload Heaps

Introduction


Historically a GPU’s VRAM was inaccessible to the CPU, forcing programs to have to copy large amounts of data to the GPU via the PCI bus. Most modern GPUs have introduced VRAM resizable base address register (BAR) enabling Windows to manage the GPU VRAM in WDDM 2.0 or later.

With the VRAM being managed by Windows, D3D now exposes the heap memory access directly to the CPU! This allows both the CPU and GPU to directly access the memory simultaneously, removing the need to copy data from the CPU to the GPU increasing performance in certain scenarios.

 
Some new Agility SDK stuff. Haven't read through it.

The real question is whether it's mutable: can the CPU modify values in VRAM, or pull results back from VRAM for its own use?
 
I've updated my Direct3D12 feature checker tool to support features in the Agility SDK 1710.0 Preview and Windows Insider Preview Zinc (build 25330), specifically D3D12_FEATURE_D3D12_OPTIONS17, D3D12_FEATURE_D3D12_OPTIONS18, D3D12_FEATURE_D3D12_OPTIONS19 and root signature 1_2. There are improvements to text layout for resource formats and metacommand parameters / types as well.

From now on, I will only provide the Agility SDK version of the tool; you need to download Direct3D 12 Agility SDK 1710.0 Preview NuGet package and extract build\native\bin\x64\D3D12Core.dll to the D3D12\ subfolder using either online NuGet Package Explorer or applications like NuGetPackageExplorer, 7-Zip, WinRAR, or File Explorer. You also need to enable the Developer mode in Windows Settings - Update & Security - For developers.
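For anyone who prefers to script the extraction step: a .nupkg file is a regular ZIP archive, so the DLL can be pulled out with a few lines of Python. This is only a sketch; the function name is made up, the member path is the one given above (written with forward slashes, as ZIP archives store it), and the package file name in the usage note is a placeholder for whatever file you actually downloaded.

```python
import zipfile
from pathlib import Path

# Path of the DLL inside the package, as described in the post above.
DLL_MEMBER = "build/native/bin/x64/D3D12Core.dll"


def extract_d3d12core(nupkg, tool_dir):
    """Copy D3D12Core.dll from a .nupkg (path or file object) into <tool_dir>/D3D12/."""
    target_dir = Path(tool_dir) / "D3D12"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "D3D12Core.dll"
    with zipfile.ZipFile(nupkg) as package:
        # read() returns the raw bytes of one archive member.
        target.write_bytes(package.read(DLL_MEMBER))
    return target
```

Usage would look like `extract_d3d12core("agility-sdk-preview.nupkg", ".")`, run from the folder that holds the tool. Developer mode still has to be enabled separately.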

You can optionally download the redistributable WARP library and extract d3d10warp.dll from bin\x64 to the tool folder; 'WARP version' will be included in the report. The updated DLL only affects features reported by the 'Microsoft Basic Render Adapter', but these could be useful for comparisons with NVidia, AMD, and Intel driver implementations.

This is how the new features are reported by the current redistributable WARP library 1.0.5 (DLL version 10.0.25321.1003).

Code:
NonNormalizedCoordinateSamplersSupported : 1
ManualWriteTrackingResourceSupported : 0
RenderPassesValid : 1
MismatchingOutputDimensionsSupported : 1
SupportedSampleCountsWithNoOutputs : 31
PointSamplingAddressesNeverRoundUp : 1
RasterizerDesc2Supported : 1
NarrowQuadrilateralLinesSupported : 1
AnisoFilterWithPointMipSupported : 1
MaxSamplerDescriptorHeapSize : 2097152
MaxSamplerDescriptorHeapSizeWithStaticSamplers : 2097152
MaxViewDescriptorHeapSize : 2097152
ComputeOnlyCustomHeapSupported : 0

And this is a report by the Adrenalin driver 23.3.1 (build 31.0.14037.1007) on my Radeon 5700 XT (AMD has not released an Agility SDK 1710.0-specific driver yet):

Code:
NonNormalizedCoordinateSamplersSupported : 0
ManualWriteTrackingResourceSupported : 0
RenderPassesValid : 1
MismatchingOutputDimensionsSupported : 0
SupportedSampleCountsWithNoOutputs : 1
PointSamplingAddressesNeverRoundUp : 0
RasterizerDesc2Supported : 1
NarrowQuadrilateralLinesSupported : 0
AnisoFilterWithPointMipSupported : 0
MaxSamplerDescriptorHeapSize : 2048
MaxSamplerDescriptorHeapSizeWithStaticSamplers : 2048
MaxViewDescriptorHeapSize : 1000000
ComputeOnlyCustomHeapSupported : 0

PS. NVidia RTX 4090 with driver 531.41 has NonNormalizedCoordinateSamplersSupported enabled; everything else is currently the same as on the 5700 XT above.
 
The real question is whether it's mutable: can the CPU modify values in VRAM, or pull results back from VRAM for its own use?
A note in a blogpost under the CPU Access section:
Both UPLOAD and BAR memory segments are uncached and write-combined. That means that CPU reads from this memory will be slow. Due to the increased distance from the CPU, reading from the BAR introduces a large amount of additional latency.
So CPU reads are slow and CPU write performance is optimized since the memory is write-combined ...
 
Code:
Direct3D 12 feature checker (April 2023) by DmitryKo (x64) (Agility SDK v710)
https://forum.beyond3d.com/posts/1840641/

Windows 10X version 22H2 (build 22621.1485 ni_release) x64

ADAPTER 0
"NVIDIA GeForce RTX 4090"
VEN_10DE, DEV_2684, SUBSYS_F2961569, REV_A1
Dedicated video memory : 24156.0 MB (25329401856 bytes)
Total video memory : 56879.5 MB (59642525696 bytes)
BIOS string : Version95.2.18.80.53
Video driver version : 31.0.15.3141
WDDM version : KMT_DRIVERVERSION_WDDM_3_1 (3100)
Virtual memory model : GPUMMU
Hardware-accelerated scheduler : Enabled, DXGK_FEATURE_SUPPORT_STABLE (2)
GraphicsPreemptionGranularity : DXGI_GRAPHICS_PREEMPTION_PIXEL_BOUNDARY (3)
ComputePreemptionGranularity : DXGI_COMPUTE_PREEMPTION_DISPATCH_BOUNDARY (1)
Maximum feature level : D3D_FEATURE_LEVEL_12_2 (0xc200)
DoublePrecisionFloatShaderOps : 1
OutputMergerLogicOp : 1
MinPrecisionSupport : D3D12_SHADER_MIN_PRECISION_SUPPORT_16_BIT (2) (0b0000'0010)
TiledResourcesTier : D3D12_TILED_RESOURCES_TIER_3 (3)
ResourceBindingTier : D3D12_RESOURCE_BINDING_TIER_3 (3)
PSSpecifiedStencilRefSupported : 0
TypedUAVLoadAdditionalFormats : 1
ROVsSupported : 1
ConservativeRasterizationTier : D3D12_CONSERVATIVE_RASTERIZATION_TIER_3 (3)
StandardSwizzle64KBSupported : 0
CrossNodeSharingTier : D3D12_CROSS_NODE_SHARING_TIER_NOT_SUPPORTED (0)
CrossAdapterRowMajorTextureSupported : 0
VPAndRTArrayIndexFromAnyShaderFeedingRasterizerSupportedWithoutGSEmulation : 1
ResourceHeapTier : D3D12_RESOURCE_HEAP_TIER_2 (2)
MaxGPUVirtualAddressBitsPerResource : 40
MaxGPUVirtualAddressBitsPerProcess : 40
Adapter Node 0:     TileBasedRenderer: 0, UMA: 0, CacheCoherentUMA: 0, IsolatedMMU: 1, HeapSerializationTier: 0, ProtectedResourceSession.Support: 1, ProtectedResourceSessionTypeCount: 1 D3D12_PROTECTED_RESOURCES_SESSION_HARDWARE_PROTECTED
HighestShaderModel : D3D12_SHADER_MODEL_6_7 (0x0067)
WaveOps : 1
WaveLaneCountMin : 32
WaveLaneCountMax : 32
TotalLaneCount : 16384
ExpandedComputeResourceStates : 1
Int64ShaderOps : 1
RootSignature.HighestVersion : D3D_ROOT_SIGNATURE_VERSION_1_2 (3)
DepthBoundsTestSupported : 1
ProgrammableSamplePositionsTier : D3D12_PROGRAMMABLE_SAMPLE_POSITIONS_TIER_2 (2)
ShaderCache.SupportFlags : D3D12_SHADER_CACHE_SUPPORT_SINGLE_PSO | LIBRARY | DRIVER_MANAGED_CACHE | SHADER_CONTROL_CLEAR | SHADER_SESSION_DELETE (115) (0b0111'0011)
CopyQueueTimestampQueriesSupported : 1
CastingFullyTypedFormatSupported : 1
WriteBufferImmediateSupportFlags : D3D12_COMMAND_LIST_SUPPORT_FLAG_DIRECT | BUNDLE | COMPUTE | COPY | VIDEO_DECODE | VIDEO_PROCESS | VIDEO_ENCODE (127) (0b0111'1111)
ViewInstancingTier : D3D12_VIEW_INSTANCING_TIER_3 (3)
BarycentricsSupported : 1
ExistingHeaps.Supported : 1
MSAA64KBAlignedTextureSupported : 1
SharedResourceCompatibilityTier : D3D12_SHARED_RESOURCE_COMPATIBILITY_TIER_2 (2)
Native16BitShaderOpsSupported : 1
AtomicShaderInstructions : 0
SRVOnlyTiledResourceTier3 : 1
RenderPassesTier : D3D12_RENDER_PASS_TIER_0 (0)
RaytracingTier : D3D12_RAYTRACING_TIER_1_1 (11)
AdditionalShadingRatesSupported : 1
PerPrimitiveShadingRateSupportedWithViewportIndexing : 1
VariableShadingRateTier : D3D12_VARIABLE_SHADING_RATE_TIER_2 (2)
ShadingRateImageTileSize : 16
BackgroundProcessingSupported : 1
MeshShaderTier : D3D12_MESH_SHADER_TIER_1 (10)
SamplerFeedbackTier : D3D12_SAMPLER_FEEDBACK_TIER_0_9 (90)
UnalignedBlockTexturesSupported : 1
MeshShaderPipelineStatsSupported : 1
MeshShaderSupportsFullRangeRenderTargetArrayIndex : 1
AtomicInt64OnTypedResourceSupported : 1
AtomicInt64OnGroupSharedSupported : 1
DerivativesInMeshAndAmplificationShadersSupported : 0
WaveMMATier : D3D12_WAVE_MMA_TIER_NOT_SUPPORTED (0)
VariableRateShadingSumCombinerSupported : 1
MeshShaderPerPrimitiveShadingRateSupported : 1
AtomicInt64OnDescriptorHeapResourceSupported : 1
DisplayableTexture : 0
DisplayableTexture.SharedResourceCompatibilityTier : D3D12_SHARED_RESOURCE_COMPATIBILITY_TIER_0 (0)
MSPrimitivesPipelineStatisticIncludesCulledPrimitives : 1
EnhancedBarriersSupported : 1
RelaxedFormatCastingSupported : 1
UnrestrictedBufferTextureCopyPitchSupported : 1
UnrestrictedVertexElementAlignmentSupported : 1
InvertedViewportHeightFlipsYSupported : 1
InvertedViewportDepthFlipsZSupported : 1
TextureCopyBetweenDimensionsSupported : 1
AlphaBlendFactorSupported : 1
AdvancedTextureOpsSupported : 1
WriteableMSAATexturesSupported : 1
IndependentFrontAndBackStencilRefMaskSupported : 1
TriangleFanSupported : 1
DynamicIndexBufferStripCutSupported : 1
DynamicDepthBiasSupported : 1
GPUUploadHeapSupported : 1
NonNormalizedCoordinateSamplersSupported : 1
ManualWriteTrackingResourceSupported : 0
RenderPassesValid : 1
MismatchingOutputDimensionsSupported : 0
SupportedSampleCountsWithNoOutputs : 1
PointSamplingAddressesNeverRoundUp : 0
RasterizerDesc2Supported : 1
NarrowQuadrilateralLinesSupported : 0
AnisoFilterWithPointMipSupported : 0
MaxSamplerDescriptorHeapSize : 2048
MaxSamplerDescriptorHeapSizeWithStaticSamplers : 2048
MaxViewDescriptorHeapSize : 1000000
ComputeOnlyCustomHeapSupported : 0
Metacommands enumerated : 10
Metacommands [parameters per stage]: Conv (Convolution) [84][1][6], CopyTensor [3][1][31], 2x2 Nearest neighbour Upsample [15][1][2], MVN (Mean Variance Normalization) [67][1][6], GEMM (General matrix multiply) [67][1][6], Conv (Convolution) [108][5][6], GEMM (General matrix multiply) [91][5][6], MVN (Mean Variance Normalization) [91][5][6], Pooling [56][3][4], Direct Storage [4][0][11]
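For readers wondering how the packed numbers in a report like this map to names, here is a small sketch that decodes three of them: the ShaderCache.SupportFlags bitmask (115), the feature level (0xc200) and the shader model (0x0067). The flag values are believed to follow the D3D12_SHADER_CACHE_SUPPORT_FLAGS enumeration in d3d12.h; the Python class and function names are made up for illustration and are not part of the feature checker tool.

```python
from enum import IntFlag


class ShaderCacheSupport(IntFlag):
    # Values as in D3D12_SHADER_CACHE_SUPPORT_FLAGS (assumed from d3d12.h).
    SINGLE_PSO = 0x1
    LIBRARY = 0x2
    AUTOMATIC_INPROC_CACHE = 0x4
    AUTOMATIC_DISK_CACHE = 0x8
    DRIVER_MANAGED_CACHE = 0x10
    SHADER_CONTROL_CLEAR = 0x20
    SHADER_SESSION_DELETE = 0x40


def flag_names(value):
    """List the names of all flags whose bit is set in value."""
    return [flag.name for flag in ShaderCacheSupport if flag & value]


def feature_level(value):
    """Split a D3D_FEATURE_LEVEL value such as 0xc200 into (major, minor)."""
    return value >> 12, (value >> 8) & 0xF


def shader_model(value):
    """Split a D3D_SHADER_MODEL value such as 0x67 into (major, minor)."""
    return value >> 4, value & 0xF
```

With the values from the report above, `flag_names(115)` lists SINGLE_PSO, LIBRARY, DRIVER_MANAGED_CACHE, SHADER_CONTROL_CLEAR and SHADER_SESSION_DELETE, `feature_level(0xC200)` gives (12, 2) and `shader_model(0x67)` gives (6, 7), which matches what the tool prints.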
 
Code:
Direct3D 12 feature checker (April 2023) by DmitryKo (x64) (Agility SDK v710)
https://forum.beyond3d.com/posts/1840641/

Windows 10X version 22H2 (build 22621.1485 ni_release) x64
Checking for experimental features SM6 TR4 META

ADAPTER 0
"NVIDIA GeForce RTX 4090"
VEN_10DE, DEV_2684, SUBSYS_F2961569, REV_A1
Dedicated video memory : 24156.0 MB (25329401856 bytes)
Total video memory : 56879.5 MB (59642525696 bytes)
BIOS string : Version95.2.18.80.53
Video driver version : 31.0.15.3141
WDDM version : KMT_DRIVERVERSION_WDDM_3_1 (3100)
Virtual memory model : GPUMMU
Hardware-accelerated scheduler : Enabled, DXGK_FEATURE_SUPPORT_STABLE (2)
GraphicsPreemptionGranularity : DXGI_GRAPHICS_PREEMPTION_PIXEL_BOUNDARY (3)
ComputePreemptionGranularity : DXGI_COMPUTE_PREEMPTION_DISPATCH_BOUNDARY (1)
Maximum feature level : D3D_FEATURE_LEVEL_12_2 (0xc200)
DoublePrecisionFloatShaderOps : 1
OutputMergerLogicOp : 1
MinPrecisionSupport : D3D12_SHADER_MIN_PRECISION_SUPPORT_16_BIT (2) (0b0000'0010)
TiledResourcesTier : D3D12_TILED_RESOURCES_TIER_4 (4)
ResourceBindingTier : D3D12_RESOURCE_BINDING_TIER_3 (3)
PSSpecifiedStencilRefSupported : 0
TypedUAVLoadAdditionalFormats : 1
ROVsSupported : 1
ConservativeRasterizationTier : D3D12_CONSERVATIVE_RASTERIZATION_TIER_3 (3)
StandardSwizzle64KBSupported : 0
CrossNodeSharingTier : D3D12_CROSS_NODE_SHARING_TIER_NOT_SUPPORTED (0)
CrossAdapterRowMajorTextureSupported : 0
VPAndRTArrayIndexFromAnyShaderFeedingRasterizerSupportedWithoutGSEmulation : 1
ResourceHeapTier : D3D12_RESOURCE_HEAP_TIER_2 (2)
MaxGPUVirtualAddressBitsPerResource : 40
MaxGPUVirtualAddressBitsPerProcess : 40
Adapter Node 0:     TileBasedRenderer: 0, UMA: 0, CacheCoherentUMA: 0, IsolatedMMU: 1, HeapSerializationTier: 0, ProtectedResourceSession.Support: 1, ProtectedResourceSessionTypeCount: 1 D3D12_PROTECTED_RESOURCES_SESSION_HARDWARE_PROTECTED
HighestShaderModel : D3D12_SHADER_MODEL_6_8 (0x0068)
WaveOps : 1
WaveLaneCountMin : 32
WaveLaneCountMax : 32
TotalLaneCount : 16384
ExpandedComputeResourceStates : 1
Int64ShaderOps : 1
RootSignature.HighestVersion : D3D_ROOT_SIGNATURE_VERSION_1_2 (3)
DepthBoundsTestSupported : 1
ProgrammableSamplePositionsTier : D3D12_PROGRAMMABLE_SAMPLE_POSITIONS_TIER_2 (2)
ShaderCache.SupportFlags : D3D12_SHADER_CACHE_SUPPORT_SINGLE_PSO | LIBRARY | DRIVER_MANAGED_CACHE | SHADER_CONTROL_CLEAR | SHADER_SESSION_DELETE (115) (0b0111'0011)
CopyQueueTimestampQueriesSupported : 1
CastingFullyTypedFormatSupported : 1
WriteBufferImmediateSupportFlags : D3D12_COMMAND_LIST_SUPPORT_FLAG_DIRECT | BUNDLE | COMPUTE | COPY | VIDEO_DECODE | VIDEO_PROCESS | VIDEO_ENCODE (127) (0b0111'1111)
ViewInstancingTier : D3D12_VIEW_INSTANCING_TIER_3 (3)
BarycentricsSupported : 1
ExistingHeaps.Supported : 1
MSAA64KBAlignedTextureSupported : 1
SharedResourceCompatibilityTier : D3D12_SHARED_RESOURCE_COMPATIBILITY_TIER_2 (2)
Native16BitShaderOpsSupported : 1
AtomicShaderInstructions : 0
SRVOnlyTiledResourceTier3 : 1
RenderPassesTier : D3D12_RENDER_PASS_TIER_0 (0)
RaytracingTier : D3D12_RAYTRACING_TIER_1_1 (11)
AdditionalShadingRatesSupported : 1
PerPrimitiveShadingRateSupportedWithViewportIndexing : 1
VariableShadingRateTier : D3D12_VARIABLE_SHADING_RATE_TIER_2 (2)
ShadingRateImageTileSize : 16
BackgroundProcessingSupported : 1
MeshShaderTier : D3D12_MESH_SHADER_TIER_1 (10)
SamplerFeedbackTier : D3D12_SAMPLER_FEEDBACK_TIER_0_9 (90)
UnalignedBlockTexturesSupported : 1
MeshShaderPipelineStatsSupported : 1
MeshShaderSupportsFullRangeRenderTargetArrayIndex : 1
AtomicInt64OnTypedResourceSupported : 1
AtomicInt64OnGroupSharedSupported : 1
DerivativesInMeshAndAmplificationShadersSupported : 0
WaveMMATier : D3D12_WAVE_MMA_TIER_NOT_SUPPORTED (0)
VariableRateShadingSumCombinerSupported : 1
MeshShaderPerPrimitiveShadingRateSupported : 1
AtomicInt64OnDescriptorHeapResourceSupported : 1
DisplayableTexture : 0
DisplayableTexture.SharedResourceCompatibilityTier : D3D12_SHARED_RESOURCE_COMPATIBILITY_TIER_0 (0)
MSPrimitivesPipelineStatisticIncludesCulledPrimitives : 1
EnhancedBarriersSupported : 1
RelaxedFormatCastingSupported : 1
UnrestrictedBufferTextureCopyPitchSupported : 1
UnrestrictedVertexElementAlignmentSupported : 1
InvertedViewportHeightFlipsYSupported : 1
InvertedViewportDepthFlipsZSupported : 1
TextureCopyBetweenDimensionsSupported : 1
AlphaBlendFactorSupported : 1
AdvancedTextureOpsSupported : 1
WriteableMSAATexturesSupported : 1
IndependentFrontAndBackStencilRefMaskSupported : 1
TriangleFanSupported : 1
DynamicIndexBufferStripCutSupported : 1
DynamicDepthBiasSupported : 1
GPUUploadHeapSupported : 1
NonNormalizedCoordinateSamplersSupported : 1
ManualWriteTrackingResourceSupported : 0
RenderPassesValid : 1
MismatchingOutputDimensionsSupported : 0
SupportedSampleCountsWithNoOutputs : 1
PointSamplingAddressesNeverRoundUp : 0
RasterizerDesc2Supported : 1
NarrowQuadrilateralLinesSupported : 0
AnisoFilterWithPointMipSupported : 0
MaxSamplerDescriptorHeapSize : 2048
MaxSamplerDescriptorHeapSizeWithStaticSamplers : 2048
MaxViewDescriptorHeapSize : 1000000
ComputeOnlyCustomHeapSupported : 0
Metacommands enumerated : 10
Metacommands [parameters per stage]: Conv (Convolution) [84][1][6], CopyTensor [3][1][31], 2x2 Nearest neighbour Upsample [15][1][2], MVN (Mean Variance Normalization) [67][1][6], GEMM (General matrix multiply) [67][1][6], Conv (Convolution) [108][5][6], GEMM (General matrix multiply) [91][5][6], MVN (Mean Variance Normalization) [91][5][6], Pooling [56][3][4], Direct Storage [4][0][11]

Differences: TiledResourcesTier : D3D12_TILED_RESOURCES_TIER_4 (4) and HighestShaderModel : D3D12_SHADER_MODEL_6_8 (0x0068)
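Spotting those differences by eye in two reports of this length is tedious, so here is a rough sketch of how they could be diffed automatically. It assumes nothing beyond the "Name : value" layout visible in the reports; the function names are invented, and the two sample strings are just a few lines copied from the reports above.

```python
def parse_report(text):
    """Turn 'Name : value' lines into a dict; lines without ' : ' are skipped."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            features[name.strip()] = value.strip()
    return features


def diff_reports(first, second):
    """Return {name: (first_value, second_value)} for entries that differ."""
    a, b = parse_report(first), parse_report(second)
    return {name: (a[name], b[name]) for name in a if name in b and a[name] != b[name]}


# A few lines taken from the two RTX 4090 reports above.
STABLE_REPORT = """\
TiledResourcesTier : D3D12_TILED_RESOURCES_TIER_3 (3)
HighestShaderModel : D3D12_SHADER_MODEL_6_7 (0x0067)
WaveOps : 1
"""
EXPERIMENTAL_REPORT = """\
TiledResourcesTier : D3D12_TILED_RESOURCES_TIER_4 (4)
HighestShaderModel : D3D12_SHADER_MODEL_6_8 (0x0068)
WaveOps : 1
"""

differences = diff_reports(STABLE_REPORT, EXPERIMENTAL_REPORT)
```

For the sample lines, `differences` contains only TiledResourcesTier and HighestShaderModel. Lines that do not follow the "Name : value" pattern, such as the adapter node summary, are simply ignored by this sketch.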
 
FYI here are the parameters of the various stages of the "DStorageCustom Metacommand" in the latest drivers for Intel UHD630 / UHD730

BTW, the custom DirectStorage metacommand gains 3 additional parameters at the creation stage, and 2 more at the execution stage (you can get metacommand parameters and types with checkformats.cmd):

Code:
DSTORAGE {1bddd090-c47e-459c-8f81-42c9f97a5308}

CREATION       [4]:
_In_ UINT64 Version,
_In_ UINT64 Format,
_In_ UINT64 MaxStreams,
_In_ UINT64 Flags

INITIALIZATION [0]

EXECUTION      [11]:
_In_ GPU_VIRTUAL_ADDRESS InputBuffer,
_In_ UINT64 InputBufferSize,
_Out_ GPU_VIRTUAL_ADDRESS OutputBuffer,
_In_ UINT64 OutputBufferSize,
_In_ GPU_VIRTUAL_ADDRESS ControlBuffer,
_In_ UINT64 ControlBufferSize,
_In_ GPU_VIRTUAL_ADDRESS ScratchBuffer,
_In_ UINT64 ScratchBufferSize,
_In_ UINT64 StreamCount,
_Inout_ GPU_VIRTUAL_ADDRESS StatusBuffer,
_In_ UINT64 StatusBufferSize
 
What I expected. Not sure why I was thinking it would be more than that.
It's better than regular upload heap memory in a one-sided CPU -> GPU data transfer since it's faster for the GPU to access video memory. On a regular upload heap as opposed to a GPU upload heap, GPU access to host/system memory is much slower since it has to happen over PCIe ...

Bonus described in this blogpost under "Mapping Memory" section:

Vulkan provides two options when mapping memory to get a CPU-visible pointer:

  • Do this before CPU needs to write data to the allocation, and unmap once the write is complete
  • Do this right after the host-visible memory is allocated, and never unmap memory
The second option is otherwise known as persistent mapping and is generally a better tradeoff – it minimizes the time it takes to obtain a writeable pointer (vkMapMemory is not particularly cheap on some drivers), removes the need to handle the case where multiple resources from the same memory object need to be written to simultaneously (calling vkMapMemory on an allocation that’s already been mapped and not unmapped is not valid) and simplifies the code in general.

The only downside is that this technique makes the 256 MB chunk of VRAM that is host visible and device local on AMD GPU that was described in “Memory heap selection” less useful – on systems with Windows 7 and AMD GPU, using persistent mapping on this memory may force WDDM to migrate the allocations to system memory. If this combination is a critical performance target for your users, then mapping and unmapping memory when needed might be more appropriate.

GPU upload heaps were the last piece of the "persistent mapping" puzzle. In the past, we could only persistently map 256 MB of video memory before allocations were silently migrated to system memory, which wasn't ideal. With GPU upload heap memory, we can persistently map *all* of the video memory, which is good for both performance and simplicity ...

More notes:

Resources on CPU-accessible heaps can be persistently mapped, meaning Map can be called once, immediately after resource creation. Unmap never needs to be called, but the address returned from Map must no longer be used after the last reference to the resource is released. When using persistent map, the application must ensure the CPU finishes writing data into memory before the GPU executes a command list that reads or writes the memory. In common scenarios, the application merely must write to memory before calling ExecuteCommandLists; but using a fence to delay command list execution works as well.

All CPU-accessible memory types support persistent mapping usage, where the resource is mapped but then never unmapped, provided the application does not access the pointer after the resource has been disposed.

We call the ID3D12Resource::Map method or the vkMapMemory function once and never unmap the memory, hence the magical "persistent mapping" ...
 