AMD Execution Thread [2024]

The worst part by far about many of these "neural processing units" is that absolutely none of the shipping hardware vendors are interested in providing explicit programming support for them. There are no public compilers for the usual compiled high-level languages like C++ or domain-specific languages like HLSL, and they won't even let you write custom programs in assembly. I've never seen a class of hardware designs before that was so immediately and outright hostile towards software developers!

For all of its faults, at least Sony let developers program the Cell processor's SPUs directly, even though they featured much the same 'dataflow' architecture. From a programming perspective, all of these 'NPUs' are beneath even things like Arduino microcontrollers ...
 
So how do you target them?
 
AFAIK, NPUs are not really supposed to be directly programmable by the application developer, but rather by each vendor's video driver programmers, who adapt them to the specific needs of middleware libraries and runtimes.

For example, DirectML and WinML are designed to use Direct3D metacommands - proprietary, opaque implementations of standard reduced-precision inferencing algorithms in the user-mode video driver - which are intended to be consumed programmatically at runtime through the Direct3D metacommand APIs: ID3D12Device5::EnumerateMetaCommands() and ::EnumerateMetaCommandParameters() return the number and names/GUIDs of the available metacommands and the names and types of the input/output parameters for each one; ::CreateMetaCommand() creates an instance, which is then initialized and executed with ID3D12GraphicsCommandList4::InitializeMetaCommand() and ::ExecuteMetaCommand().
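
To make that first step concrete, here's a minimal C++ sketch (assuming you already have a valid ID3D12Device5 pointer; the function name is mine) that lists the metacommands the user-mode driver exposes:

Code:
#include <d3d12.h>
#include <vector>
#include <cstdio>

// List the metacommands exposed by the user-mode driver for this device.
void ListMetaCommands(ID3D12Device5* device)
{
    UINT count = 0;
    // First call with a null array just returns the number of metacommands.
    if (FAILED(device->EnumerateMetaCommands(&count, nullptr)) || count == 0)
        return;

    std::vector<D3D12_META_COMMAND_DESC> descs(count);
    if (FAILED(device->EnumerateMetaCommands(&count, descs.data())))
        return;

    for (const auto& d : descs)
        wprintf(L"%s\n", d.Name);   // e.g. "Conv", "GEMM", "MVN", ...
}

From there, ::CreateMetaCommand() takes the GUID from the returned D3D12_META_COMMAND_DESC plus a blob of creation parameters, and the command list records the initialization and execution calls.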

Recent generations of GPUs from Nvidia, AMD, and Intel have settled on a common set of Direct3D metacommands, with some minor variations (the bracketed numbers below are parameter counts for the three stages - creation, initialization and execution):

Nvidia GeForce GTX / RTX
Code:
Metacommands [parameters per stage]:
Conv (Convolution) [84][1][6],
Conv (Convolution) [108][5][6],
GEMM (General matrix multiply) [67][1][6],
GEMM (General matrix multiply) [91][5][6],
GEMM (General matrix multiply) [91][5][6],
MVN (Mean Variance Normalization) [91][5][6],
MVN (Mean Variance Normalization) [67][1][6],
Pooling [56][3][4],
MHA (Multi-Head Attention) [299][13][16],
CopyTensor [3][1][31]

AMD RDNA2 / RDNA3
Code:
Metacommands [parameters per stage]:
Conv (Convolution) [84][1][6],
Conv (Convolution) [108][5][6],
GEMM (General matrix multiply) [67][1][6],
GEMM (General matrix multiply) [91][5][6],
GEMM (General matrix multiply) [91][5][6],
MVN (Mean Variance Normalization) [91][5][6],
MVN (Mean Variance Normalization) [67][1][6],
MHA (Multi-Head Attention) [299][13][16]

Intel Arc / Xe
Code:
Metacommands [parameters per stage]:
Conv (Convolution) [84][1][6],
Conv (Convolution) [108][5][6],
GEMM (General matrix multiply) [67][1][6],
GEMM (General matrix multiply) [91][5][6],
MVN (Mean Variance Normalization) [91][5][6],
Pooling [56][3][4],
Pooling [44][1][4],
LSTM (Long Short-Term Memory) [252][10][13],
MHA (Multi-Head Attention) [299][13][16],

Current NPUs are built for the same DirectML / OpenVINO / ONNX workflow, and their Windows driver model is a compute-only subset of the WDDM KMD / Direct3D UMD stack, exposing only a limited 'core compute' capability.
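
If you want to see which adapters on a system expose that 'core compute' capability, one way might be to enumerate adapters through DXCore and filter on the D3D12 core-compute attribute. A rough sketch (assuming Windows 10 2004+ with the DXCore headers and dxcore.lib; the function and variable names are mine):

Code:
#include <dxcore.h>
#include <wrl/client.h>
#include <cstdio>

using Microsoft::WRL::ComPtr;

// Enumerate adapters that expose at least D3D12 "core compute"
// (this matches full GPUs as well as compute-only devices).
void ListCoreComputeAdapters()
{
    ComPtr<IDXCoreAdapterFactory> factory;
    if (FAILED(DXCoreCreateAdapterFactory(IID_PPV_ARGS(&factory))))
        return;

    const GUID attributes[1] = { DXCORE_ADAPTER_ATTRIBUTE_D3D12_CORE_COMPUTE };
    ComPtr<IDXCoreAdapterList> list;
    if (FAILED(factory->CreateAdapterList(1, attributes, IID_PPV_ARGS(&list))))
        return;

    for (uint32_t i = 0; i < list->GetAdapterCount(); ++i)
    {
        ComPtr<IDXCoreAdapter> adapter;
        if (FAILED(list->GetAdapter(i, IID_PPV_ARGS(&adapter))))
            continue;

        // DriverDescription is a null-terminated char string.
        char description[256] = {};
        adapter->GetProperty(DXCoreAdapterProperty::DriverDescription,
                             sizeof(description), description);
        printf("%u: %s\n", i, description);
    }
}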

PS. If you want to check the names and types of the parameters for each metacommand supported by your GPU, you can use my command-line feature reporting tool; its verbose mode will print a dozen screens of C-style definitions for all twelve hundred of these parameters.
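
For reference, the per-metacommand query looks roughly like this (a sketch, not the tool's actual code; it assumes a valid ID3D12Device5 and a metacommand GUID obtained from the enumeration above, and only dumps the execution stage):

Code:
#include <d3d12.h>
#include <vector>
#include <cstdio>

// Dump the execution-stage parameters of one metacommand; the same query
// can be repeated for the creation and initialization stages.
void DumpExecutionParameters(ID3D12Device5* device, REFGUID commandId)
{
    UINT structSize = 0, paramCount = 0;
    // First call with a null array returns the parameter count
    // and the total size of the packed parameter structure.
    if (FAILED(device->EnumerateMetaCommandParameters(
            commandId, D3D12_META_COMMAND_PARAMETER_STAGE_EXECUTION,
            &structSize, &paramCount, nullptr)))
        return;

    std::vector<D3D12_META_COMMAND_PARAMETER_DESC> params(paramCount);
    if (FAILED(device->EnumerateMetaCommandParameters(
            commandId, D3D12_META_COMMAND_PARAMETER_STAGE_EXECUTION,
            &structSize, &paramCount, params.data())))
        return;

    for (const auto& p : params)
        wprintf(L"%s (type %d, offset %u)\n", p.Name, p.Type, p.StructureOffset);
}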
 