Task shaders on AMD HW
What I discuss here is based on information that is already publicly available in open source drivers. If you are already familiar with how AMD’s own PAL-based drivers work, you won’t find any surprises here.
First things first. Under the hood, task shaders are compiled to plain old compute shaders. The task payload is located in VRAM. The shader code that stores the mesh dispatch size and the payload is compiled to memory writes which store these in VRAM ring buffers. Even though they are compute shaders as far as the AMD HW is concerned, task shaders do not work like a compute pre-pass. Instead, task shaders are dispatched on an async compute queue while, at the same time, the mesh shader work is executed on the graphics queue in parallel.
The task+mesh dispatch packets are different from a regular compute dispatch. The compute and graphics queue firmwares work together in parallel:
- The compute queue launches as many task workgroups as there is space for in the ring buffer.
- The graphics queue waits until a task workgroup is finished and can then launch its mesh shader workgroups immediately. Execution of mesh dispatches from a finished task workgroup can therefore overlap with other task workgroups.
- When a mesh dispatch from a task workgroup is finished, its slot in the ring buffer can be reused and a new task workgroup can be launched.
- When the ring buffer is full, the compute queue waits until a mesh dispatch is finished, before launching the next task workgroup.
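To make the flow control above a bit more tangible, here is a minimal toy simulation. It is only a sketch under simplifying assumptions and not how the firmware is actually implemented: every name (`simulate`, `ring_entries`, `mesh_cycles`) is made up for illustration, each task workgroup is assumed to finish in one cycle, each queue is assumed to launch at most one thing per cycle, and each task workgroup is assumed to produce exactly one mesh dispatch.

```python
from collections import deque


def simulate(num_task_workgroups, ring_entries, mesh_cycles=3):
    """Toy model of task+mesh flow control (hypothetical names and timings)."""
    if ring_entries < 1 or mesh_cycles < 1:
        raise ValueError("ring_entries and mesh_cycles must be at least 1")

    free_slots = deque(range(ring_entries))  # ring buffer entries not in use
    ready = deque()      # slots written by finished task workgroups
    in_flight = []       # (slot, remaining_cycles) for running mesh dispatches
    launched = retired = stalls = max_used = cycle = 0

    while retired < num_task_workgroups:
        cycle += 1

        # A finished mesh dispatch frees its ring buffer slot for reuse.
        still_running = []
        for slot, remaining in in_flight:
            if remaining == 1:
                free_slots.append(slot)
                retired += 1
            else:
                still_running.append((slot, remaining - 1))
        in_flight = still_running

        # Graphics queue: launch the mesh dispatch of a finished task workgroup.
        if ready:
            in_flight.append((ready.popleft(), mesh_cycles))

        # Compute queue: launch a task workgroup only if a slot is available,
        # otherwise wait for a mesh dispatch to finish.
        if launched < num_task_workgroups:
            if free_slots:
                ready.append(free_slots.popleft())
                launched += 1
            else:
                stalls += 1

        max_used = max(max_used, ring_entries - len(free_slots))

    return {"cycles": cycle, "max_slots_in_use": max_used, "compute_stalls": stalls}


if __name__ == "__main__":
    print(simulate(num_task_workgroups=8, ring_entries=2))
    print(simulate(num_task_workgroups=8, ring_entries=16))
```

With a tiny ring buffer the compute queue spends cycles stalled waiting for mesh dispatches to retire, while a ring buffer larger than the workload never stalls. The model also shows where a deadlock could come from: if a slot were never returned to `free_slots`, the compute queue would wait forever.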
You can find the exact details in the PAL source code or in the RADV merge requests.
Side note: getting some of the implementation details wrong can easily cause a deadlock on the GPU. These are great fun to debug.
The relevant details here are that most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it), that task shaders are executed on an async compute queue, and that the driver now has to submit compute and graphics work in parallel.
Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a mismatch between the API programming model and what the HW actually does.
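To illustrate the mismatch, here is a purely conceptual sketch of what "one API queue, two HW queues" means for the driver. This is not RADV or PAL code: `Submission`, `lower_commands` and the command names are all hypothetical, and a real driver deals with packets and synchronization that are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """What the driver hands to the HW for one API-level graphics submission."""
    compute: list = field(default_factory=list)   # work for the async compute queue
    graphics: list = field(default_factory=list)  # work for the graphics queue


def lower_commands(api_commands):
    """Split commands recorded for a single API graphics queue.

    api_commands is a list of (name, has_task_shader) tuples; the names are
    invented for this example.
    """
    submission = Submission()
    for name, has_task_shader in api_commands:
        if name == "draw_mesh_tasks" and has_task_shader:
            # The task shader half runs as a compute dispatch...
            submission.compute.append("task_dispatch")
            # ...while the mesh half runs on the graphics queue and consumes
            # what the task shader wrote to the ring buffer.
            submission.graphics.append("mesh_dispatch_from_ring")
        else:
            # Everything else stays on the graphics queue as usual.
            submission.graphics.append(name)
    return submission


if __name__ == "__main__":
    result = lower_commands([
        ("draw", False),
        ("draw_mesh_tasks", True),
        ("draw_mesh_tasks", False),
    ])
    print(result.compute)
    print(result.graphics)
```

The application only ever sees the single command list on the left-hand side; the split into two streams that have to be submitted together is entirely the driver's problem.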