Direct3D feature levels discussion

Word of warning: these docs probably got posted by accident and are certainly not finalized. Ordered atomics (they are really more like fences), as in GCN, in particular are something I doubt we can support efficiently, and I certainly haven't heard that NVIDIA can either, although I have no firm info on that one.
I do agree that the SM 6.0 feature list looks highly GCN-centric (WavePrefixSum is also a single instruction on GCN, and WaveBallot returns a 64-bit mask, which maps efficiently to a single 64-bit scalar register on GCN). Console developers (us included) have given plenty of feedback to get these features included in PC DirectX. These features make it much easier to port console games to PC.

Ordered atomics are similar to rasterizer ordered views, but for compute shaders. If multiple threads (waves) access the same atomic counter, they are executed in submission order. GPUs tend to execute waves (and workgroups) roughly in submission order, minimizing the potential stalls. When stalls occur, the latency can be hidden in a similar way to memory latency (cache misses).

The best thing about ordered atomics is that this feature allows you to execute algorithms with prefix sums in a single pass. Without ordered atomics you often need to: store intermediate results (from registers & LDS) to memory -> wait for GPU idle -> execute a separate global prefix sum kernel for the output data -> wait for GPU idle -> load the prefix sum data + all other data back to registers & LDS. It's clear that even if ordered atomics cause some stalls, it is a significant efficiency boost (performance & power) to algorithms containing global prefix sums.
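To make the difference concrete, here is a rough sketch of what the single-pass version could look like in a compute shader. OrderedInterlockedAdd and LoadInput are made-up names (HLSL has no ordered atomic intrinsic; GCN exposes the equivalent through its ordered-count hardware), and the wave intrinsic names follow the preliminary SM 6.0 docs:
Code:
// Sketch only: single-pass global prefix sum built on a hypothetical ordered
// atomic. "OrderedInterlockedAdd" and "LoadInput" do not exist in HLSL.
RWStructuredBuffer<uint> Output;
globallycoherent RWStructuredBuffer<uint> RunningTotal; // single counter at slot 0

[numthreads(64, 1, 1)]
void CSMain(uint tid : SV_DispatchThreadID)
{
    uint value       = LoadInput(tid);            // hypothetical input fetch
    uint localPrefix = WavePrefixSum(value);      // exclusive prefix within the wave
    uint waveTotal   = WaveReadLaneAt(localPrefix + value, WaveGetLaneCount() - 1);

    // One lane adds the wave total to the global counter. Because the atomic is
    // ordered by submission, the returned previous value is exactly the sum of
    // all earlier waves -> the global exclusive prefix for this wave.
    uint waveBase = 0;
    if (WaveGetLaneIndex() == 0)
        OrderedInterlockedAdd(RunningTotal[0], waveTotal, waveBase); // hypothetical
    waveBase = WaveReadFirstLane(waveBase);

    Output[tid] = waveBase + localPrefix;
}
Without the ordered atomic, the waveBase value would have to come from a separate global prefix sum dispatch over the per-wave totals, which is exactly the multi-pass chain described above.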
 
I do agree that the SM 6.0 feature list looks highly GCN-centric (WavePrefixSum is also a single instruction on GCN, and WaveBallot returns a 64-bit mask, which maps efficiently to a single 64-bit scalar register on GCN). Console developers (us included) have given plenty of feedback to get these features included in PC DirectX. These features make it much easier to port console games to PC.
I don't think any of the ISA features are really a problem for us (Intel) or NVIDIA to implement. The biggest gotcha for us is the baked-in assumption of a globally static SIMD size, which does not fit our architecture well. It's unfortunately also an assumption that is completely unnecessary for the vast majority of use cases being discussed (including stuff like pack operations and similar).

It's clear that even if ordered atomics cause some stalls, it is a significant efficiency boost (performance & power) to algorithms containing global prefix sums.
Yes, I agree it's a useful feature; it's just a separate feature from the wave math stuff, and not one that I have any confidence anyone other than AMD can support. It also requires standardizing "dispatch order", which is straightforward (but not currently standardized) for 1D compute shaders, but might also be a place that has implementation divergence, particularly for 2D/3D dispatches.
 
The new code for shader model version and supported wave operations compiles and runs, but the driver/runtime does not report anything meaningful as of now... on my R9 280X:
Code:
HighestShaderModel : D3D12_SHADER_MODEL_??? (0000)
WaveOps : 0
WaveLaneCountMin : 4
WaveLaneCountMax : 4
TotalLaneCount : 4
ExpandedComputeResourceStates : 1
HighestVersion : D3D_ROOT_SIGNATURE_VERSION_??? (0000)
 
The biggest gotcha for us is the baked-in assumption of a globally static SIMD size, which does not fit our architecture well. It's unfortunately also an assumption that is completely unnecessary for the vast majority of use cases being discussed (including stuff like pack operations and similar).
Yes, in the majority of cases you don't need to know the wave width. Built-in reductions + WaveOnce() can abstract it away nicely.

I thought that the wave width was only exposed to the shader. This allows the shader compiler to determine the best value for each shader and burn it in. But it seems that there's a CPU API as well, and it gives you a single number for the GPU wave width. This obviously doesn't work well with Intel.

WaveReadFirstLane is also good with WaveOnce, as neither needs to know the wave width. But the WaveOnce documentation (https://msdn.microsoft.com/en-us/library/windows/desktop/mt733262(v=vs.85).aspx) doesn't guarantee that writes to registers are stored to the first lane ("driver implementations typically use the first lane in the wave"). It might be that the documentation is lacking. There should be a guarantee that register writes from a WaveOnce() block are either: A) always replicated to all lanes of the register, or B) always in the first lane (and thus readable by WaveReadFirstLane). The current example expects either A or B to happen (otherwise it is broken).

Personally I would prefer that variables had a "wave_coherent" keyword ("wave_coherent int myCounter") instead of manually using WaveReadFirstLane/WaveReadLaneAt. With this keyword the compiler would store a single value per wave (not a unique value per lane). Reads/writes to this variable would automatically generate code that reads/writes a single lane (and broadcasts the result). The compiler could pack multiple per-wave variables into a single SIMD register. Of course in AMD's case, a scalar register could be used instead. This proposal would result in significantly more portable code compared to manually packing wave-coherent data into SIMD lanes with WaveReadLaneAt.
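For illustration, roughly what the difference would look like (wave_coherent is of course hypothetical, and CountBuffer, LightListBuffer, base and groupId are made-up names; wave intrinsic names follow the preliminary docs):
Code:
// Manual version: every lane carries a copy of the per-wave value, and the
// author must re-broadcast after each load to keep it wave-uniform.
uint lightCount = WaveReadFirstLane(CountBuffer[groupId]);
for (uint i = 0; i < lightCount; ++i)
{
    uint lightIndex = WaveReadFirstLane(LightListBuffer[base + i]);
    // ... per-wave (scalar) light data loads, per-lane shading math ...
}

// Proposed version: the compiler knows the variable holds one value per wave,
// so it can place it in a scalar register (AMD) or pack several such variables
// into one SIMD register, and insert the broadcasts itself.
wave_coherent uint lightCount2 = CountBuffer[groupId];
for (wave_coherent uint i = 0; i < lightCount2; ++i)
{
    wave_coherent uint lightIndex = LightListBuffer[base + i];
    // ...
}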
 
The biggest gotcha for us is the baked-in assumption of a globally static SIMD size

Where is that assumption? I was about to ask about the opposite...
I went through the API but can't see how I can specify a wave size (fair enough, it's supposed to be the real hardware size), or even query it from the CPU side - only some minimum value (which could probably be zero). For instance, if I want to set up the thread group layout (to make a wave a nice square block I know the radius of), buffers etc. to match this size, I would like to know beforehand. As far as I can see it can only be queried at runtime in the shader, and I don't even think there is a mention that you can expect all waves to be the same size.
 
Where is that assumption? I was about to ask about the opposite...
I went through the API but can't see how I can specify a wave size (fair enough, it's supposed to be the real hardware size), or even query it from the CPU side - only some minimum value (which could probably be zero). For instance, if I want to set up the thread group layout (to make a wave a nice square block I know the radius of), buffers etc. to match this size, I would like to know beforehand. As far as I can see it can only be queried at runtime in the shader, and I don't even think there is a mention that you can expect all waves to be the same size.
According to the documentation, the WaveLaneCountMin field (in D3D12_FEATURE_DATA_D3D12_OPTIONS1) gives you the "baseline" wave size. WaveLaneCountMax doesn't seem to be supported yet.
WaveLaneCountMin
Specifies the baseline number of lanes in the SIMD wave that this implementation can support. This term is sometimes known as "wavefront size" or "warp width". Currently apps should rely only on this minimum value for sizing workloads.

WaveLaneCountMax
Specifies the maximum number of lanes in the SIMD wave that this implementation can support. This capability is reserved for future expansion, and is not expected to be used by current applications.

On Intel, WaveLaneCountMin should be 8 (SIMD8), unless SIMD4x2 mode is exposed. According to the documentation, all SM 6.0 wave operations are only supported in pixel shaders and compute shaders. If I understood correctly, Intel never uses SIMD4x2 mode in pixel shaders or compute shaders (only in VS, GS, DS and HS).

If you want your thread groups to be screen local, I suggest using Morton order. For all 4^X thread group sizes, Morton order gives you a square tile (4 = 2x2, 16 = 4x4, 64 = 8x8). "Odd" powers of two give you tiles that are twice as wide (8 = 4x2, 32 = 8x4). Do a Morton remap of your thread ID at the beginning of the shader. This works properly on all wave widths. You get screen-coherent (tiled) branching and increased data locality for screen-local lookups (such as textures, decals and shadow maps).
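A minimal sketch of such a remap (an 8x8 tile for a 64-thread group; the function and variable names here are just illustrative):
Code:
// Remap a flat group-local thread index to 2D Morton (Z-order) coordinates,
// so consecutive lanes form a compact screen tile regardless of wave width.
uint CompactBits(uint v)            // keep the even bits of v, packed together
{
    v &= 0x55555555;
    v = (v ^ (v >> 1)) & 0x33333333;
    v = (v ^ (v >> 2)) & 0x0F0F0F0F;
    v = (v ^ (v >> 4)) & 0x00FF00FF;
    v = (v ^ (v >> 8)) & 0x0000FFFF;
    return v;
}

uint2 MortonDecode(uint index)
{
    return uint2(CompactBits(index), CompactBits(index >> 1));
}

[numthreads(64, 1, 1)]
void CSMain(uint3 groupId : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    // 64-thread group -> 8x8 tile. A 64-wide wave covers the whole tile,
    // a 32-wide wave an 8x4 sub-tile, a 16-wide wave a 4x4 sub-tile, etc.
    uint2 localCoord = MortonDecode(groupIndex);
    uint2 pixelCoord = groupId.xy * 8 + localCoord;
    // ... screen-local work using pixelCoord ...
}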
 
But couldn't Intel just return 0 for WaveLaneCountMin? At least it doesn't dictate a global fixed value, only a "baseline"

Say I want to shoot rays for bounding object testing for each thread and then do a ballot (or OR) to collect which objects to process for each pixel, maybe loop the whole thing objectcount/wavesize times. Then I would need to know the spatial extent of the wave, and with an odd wave size (no guarantee against a non-power-of-two size, even though it's unlikely) it becomes ugly...
 
But couldn't Intel just return 0 for WaveLaneCountMin? At least it doesn't dictate a global fixed value, only a "baseline"
Why would they return a smaller wave size than they support? That would only cause lower performance if the developer chooses a special-case shader based on wave size. Also, a wave size of 0 makes no sense. A size of 4 is the bare minimum wave size, since SM 6.0 compatible GPUs need to support QuadSwap and QuadReadLaneAt.
no guarantee against a non-power-of-two size, even though it's unlikely
A power-of-two wave size should be guaranteed. Otherwise things become ugly. This is preliminary documentation; I hope they add something like this: "Wave size is a power of two. Wave size is greater than or equal to 4".
 
Say I want to shoot rays for bounding object testing for each thread and then do a ballot (or OR) to collect which objects to process for each pixel, maybe loop the whole thing objectcount/wavesize times. Then I would need to know the spatial extent of the wave
You could go through the ballot result bitmask with bitscan (https://msdn.microsoft.com/en-us/library/windows/desktop/ff471401(v=vs.85).aspx). Continue while bits are left (clear the found bit + do the payload). This way you don't need to know how many bits were returned. This of course doesn't give you the wave's spatial extent, but often you can write the code in a way that doesn't require it.
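A rough sketch of that loop, assuming each lane has already tested its own object (the test/payload functions and myObjectIndex are made up; the preliminary 64-bit WaveBallot mask is shown here as two 32-bit halves since 64-bit integer support is optional):
Code:
bool hit   = TestObjectAgainstRay(myObjectIndex, ray);  // hypothetical per-lane test
uint2 mask = WaveBallot(hit);                           // lanes that hit anything

// Process each hitting lane exactly once, in lane order, with no knowledge
// of the wave width.
[loop]
while (any(mask != 0))
{
    uint lane = (mask.x != 0) ? firstbitlow(mask.x) : 32 + firstbitlow(mask.y);
    uint objectToProcess = WaveReadLaneAt(myObjectIndex, lane); // broadcast from that lane
    ProcessObject(objectToProcess);                             // hypothetical payload, all lanes cooperate

    // Clear the bit we just handled.
    if (lane < 32) mask.x &= ~(1u << lane);
    else           mask.y &= ~(1u << (lane - 32));
}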
 
WaveReadFirstLane is also good with WaveOnce, as neither needs to know the wave width. But the WaveOnce documentation (https://msdn.microsoft.com/en-us/library/windows/desktop/mt733262(v=vs.85).aspx) doesn't guarantee that writes to registers are stored to the first lane ("driver implementations typically use the first lane in the wave"). It might be that the documentation is lacking. There should be a guarantee that register writes from a WaveOnce() block are either: A) always replicated to all lanes of the register, or B) always in the first lane (and thus readable by WaveReadFirstLane). The current example expects either A or B to happen (otherwise it is broken).
The text of WaveReadFirstLane:

https://msdn.microsoft.com/en-us/library/windows/desktop/mt733265(v=vs.85).aspx

says:

Returns the value of the expression for the active lane of the current wave with the smallest index.

That seems to imply that a lane other than the first lane of the wave can be the lane that's read. The overview:

https://msdn.microsoft.com/en-us/library/windows/desktop/mt733232(v=vs.85).aspx

It distinguishes between Inactive and Active lanes:

Inactive Lane - A lane which is not being executed, for example due to the flow of control, or insufficient work to fill the minimum size of the wave.
Active Lane - A lane for which execution is being performed. In pixel shaders, it may include any helper pixel lanes.

So that would imply that your case B only works if there is no predication.
 
If you want to find out the current wave size, it would be most logical to call WaveGetLaneCount. But there is no specified guarantee that it is constant across all waves, and I would expect it to return less (including odd numbers) if there are not enough threads to fill a full wave.
 
TotalLaneCount : 4
Radeon driver 16.9.1 (Direct3D driver 21.19.134.1) brings WDDM 2.1 support and now reports WaveLaneCountMin: 64, WaveLaneCountMax: 64, TotalLaneCount: 2816, D3D12_SHADER_MODEL_5_1 and D3D_ROOT_SIGNATURE_VERSION_1_1.
 
Yes, but it is probably not yet supported by the driver. WDDM 2.1 is required, but it does not guarantee support of the Windows 10 version 1607 Direct3D 12 revision.
 
I have updated my command-line tool to report the new features in build 10.0.14393 (Anniversary Update, version 1607). Also fixed command-line arguments ("=" is not accepted in batch files for some reason).

Alessio1989, there's nothing wrong with the driver/runtime - the API just expects you to specify the maximum shader model and root signature version to check for - these parameters serve as both input and output, and any incorrect value results in an error. My bad, I should have checked the SDK documentation.
 
I updated my command-line tool to report the new features in Windows 10 Creators Update version 1703 and the D3D driver version from D3DADAPTER_IDENTIFIER9.

Built with VS2017.1 and Windows SDK build 10.0.15063 - though the API surface has been frozen since build 15021, the MSDN documentation is not live yet; for now, there is a blog post detailing the changes in the header files:

https://naughter.wordpress.com/2017/02/28/changes-in-the-windows-v10-0-15021-sdk-compared-to-windows-v10-0-14393-sdk-part-two/
  • dxgi.h/idl: New DXGI_SWAP_CHAIN_FLAG_RESTRICTED_TO_ALL_HOLOGRAPHIC_DISPLAYS defines.
  • dxgi1_5.h/idl: New DXGI_OUTDUPL_FLAG enum.
  • dxgi1_6.h/idl: New header files for DXGI 1.6. Includes new IDXGIAdapter4 & IDXGIOutput6 interfaces.
  • d3d11.h/idl: New D3D11_FEATURE_SHADER_CACHE enum value. New D3D11_SHADER_CACHE_SUPPORT_FLAGS enum. New D3D11_FEATURE_DATA_SHADER_CACHE struct.
  • d3d11_3.h/idl: New ID3D11Fence & ID3D11DeviceContext4 interfaces. New D3D11_FENCE_FLAG enum.
  • d3d11_4.h/idl: New ID3D11Device5 interface.
  • d3d11sdklayers.h/idl: Various new D3D11_MESSAGE_ID_* enum values.
  • d3d12.h/idl: New ID3D12GraphicsCommandList1, ID3D12PipelineLibrary1, ID3D12Device2 & ID3D12Tools interfaces. New D3D12_COMMAND_QUEUE_PRIORITY_GLOBAL_REALTIME, D3D12_FEATURE_* & D3D12_HEAP_FLAG_ALLOW_WRITE_WATCH enum values. New D3D12_DEPTH_STENCIL_DESC1, D3D12_RT_FORMAT_ARRAY, D3D12_PIPELINE_STATE_STREAM_DESC, D3D12_FEATURE_DATA_D3D12_OPTIONS2, D3D12_FEATURE_DATA_ARCHITECTURE1, D3D12_FEATURE_DATA_SHADER_CACHE, D3D12_FEATURE_DATA_COMMAND_QUEUE_PRIORITY, D3D12_RANGE_UINT64, D3D12_SUBRESOURCE_RANGE_UINT64 & D3D12_SAMPLE_POSITION structs. New D3D12_PIPELINE_STATE_SUBOBJECT_TYPE, D3D12_PROGRAMMABLE_SAMPLE_POSITIONS_TIER, D3D12_SHADER_CACHE_SUPPORT_FLAGS & D3D12_RESOLVE_MODE enums. The D3D12_RESIDENCY_PRIORITY_HIGH enum value has changed.
  • d3d12sdklayers.h/idl: New ID3D12Debug2 interface. New D3D12_GPU_BASED_VALIDATION_FLAGS enum. New D3D12_DEBUG_DEVICE_PARAMETER_GPU_SLOWDOWN_PERFORMANCE_FACTOR enum value. New D3D12_DEBUG_DEVICE_GPU_SLOWDOWN_PERFORMANCE_FACTOR struct. Various D3D12_MESSAGE_ID_* defines have been updated and added.
 