The relevant details here are that most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it),
that task shaders are executed on an async compute queue, and that the driver now has to submit compute and graphics work in parallel.
Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a
mismatch between the API programming model and what the HW actually does.
Squeezing a hidden compute pipeline in your graphics
In order to use this beautiful scheme provided by the firmware, the driver needs to do two things:
- Create a compute pipeline from the task shader.
- Submit the task shader work on the async compute queue while at the same time submitting the mesh and pixel shader work on the graphics queue.
We already had good support for compute pipelines in RADV (as much as the API needs), but internally in the driver
we’ve never had this kind of close cooperation between graphics and compute.
When you use a draw call in a command buffer with a pipeline that has a task shader, RADV must create a hidden, internal compute command buffer. This internal compute command buffer contains the task shader dispatch packet, while the graphics command buffer contains the packet that dispatches the mesh shaders.
We must also ensure correct synchronization between these two command buffers according to application barriers. Because of the API mismatch, it must work as if the internal compute cmdbuf was part of the graphics cmdbuf. We also need to emit the same descriptors, push constants, etc. to both. When the application submits to the graphics queue, this new, internal compute command buffer is then submitted to the async compute queue.
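To make the idea of a hidden command buffer more concrete, here is a minimal toy model in C. It is only a sketch and not RADV code: every name in it (`toy_cmdbuf`, `toy_draw_mesh_tasks`, `toy_barrier` and so on) is made up for illustration, the "packets" are plain enum values rather than real hardware packets, and mirroring barriers only once the internal stream exists is a simplification I'm assuming here, not a description of what the driver really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet kinds; real hardware packets look nothing like this. */
enum toy_packet {
    TOY_DISPATCH_TASK, /* task shader dispatch, recorded in the compute stream */
    TOY_DISPATCH_MESH, /* mesh shader dispatch, recorded in the graphics stream */
    TOY_DRAW,          /* ordinary draw, graphics stream only */
    TOY_BARRIER,       /* application barrier */
};

#define TOY_MAX_PACKETS 64

struct toy_stream {
    enum toy_packet packets[TOY_MAX_PACKETS];
    size_t count;
};

/* One API-visible command buffer that secretly owns a second stream. */
struct toy_cmdbuf {
    struct toy_stream gfx;      /* what the application thinks it records into */
    struct toy_stream internal; /* hidden compute stream for task shaders */
    bool has_internal;          /* set on the first draw that uses a task shader */
};

static void toy_emit(struct toy_stream *s, enum toy_packet p)
{
    assert(s->count < TOY_MAX_PACKETS);
    s->packets[s->count++] = p;
}

/* A regular draw never touches the hidden stream. */
static void toy_draw(struct toy_cmdbuf *cmd)
{
    toy_emit(&cmd->gfx, TOY_DRAW);
}

/* A mesh draw is split: task dispatch goes to compute, mesh dispatch to graphics. */
static void toy_draw_mesh_tasks(struct toy_cmdbuf *cmd, bool pipeline_has_task)
{
    if (pipeline_has_task) {
        cmd->has_internal = true;
        toy_emit(&cmd->internal, TOY_DISPATCH_TASK);
    }
    toy_emit(&cmd->gfx, TOY_DISPATCH_MESH);
}

/* Barriers are mirrored so the hidden stream behaves as if it were
 * part of the graphics command buffer. */
static void toy_barrier(struct toy_cmdbuf *cmd)
{
    toy_emit(&cmd->gfx, TOY_BARRIER);
    if (cmd->has_internal)
        toy_emit(&cmd->internal, TOY_BARRIER);
}
```

Recording a barrier, a mesh draw with a task shader, another barrier and a plain draw into a zero-initialized `toy_cmdbuf` leaves four entries in the graphics stream and two (the task dispatch and the second barrier) in the hidden one. The application only ever sees the first stream.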
Thus far, this sounds pretty logical and easy.
The actual hard work is to make it possible for the driver to submit work to different queues at the same time. RADV’s queue code was written assuming that there is a 1:1 mapping between radv_queue objects and HW queues. To make task shaders work we must now break this assumption.
So, of course I had to do some crazy refactoring to enable this. At the time of writing, the AMDGPU Linux kernel driver doesn’t support “gang submit” yet, so I use scheduled dependencies instead. This has the drawback of submitting to the two queues sequentially rather than doing everything in the same submit.
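The difference between the two submission strategies can be sketched with another toy model. Again, none of this is the real RADV or kernel interface: `toy_kernel_submit`, `toy_queue_submit` and the bitmask constants are hypothetical, the order in which the two queues are submitted is an assumption made for the example, and the `depends_on` field only records that a dependency exists without modelling what the kernel scheduler does with it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical HW queue bits; one submission may target one or both. */
#define TOY_HW_GFX     (1u << 0)
#define TOY_HW_COMPUTE (1u << 1)

struct toy_submission {
    unsigned hw_mask; /* which HW queue(s) this submission targets */
    int depends_on;   /* index of an earlier submission, or -1 for none */
};

#define TOY_MAX_SUBMISSIONS 8

/* Stand-in for the kernel: it just logs what it was asked to run. */
struct toy_kernel_log {
    struct toy_submission subs[TOY_MAX_SUBMISSIONS];
    size_t count;
};

/* Records one submission and returns its index in the log. */
static int toy_kernel_submit(struct toy_kernel_log *log, unsigned hw_mask,
                             int depends_on)
{
    assert(log->count < TOY_MAX_SUBMISSIONS);
    log->subs[log->count].hw_mask = hw_mask;
    log->subs[log->count].depends_on = depends_on;
    return (int)log->count++;
}

/* One API-level queue submit; returns how many kernel submissions it took.
 * This is where the 1:1 queue-to-HW-queue assumption breaks. */
static size_t toy_queue_submit(struct toy_kernel_log *log, int has_task_work,
                               int gang_submit_supported)
{
    if (!has_task_work) {
        toy_kernel_submit(log, TOY_HW_GFX, -1);
        return 1;
    }

    if (gang_submit_supported) {
        /* Both queues in a single submission. */
        toy_kernel_submit(log, TOY_HW_GFX | TOY_HW_COMPUTE, -1);
        return 1;
    }

    /* Fallback: two sequential submissions linked by a dependency.
     * Submitting compute first is an assumption of this sketch. */
    int first = toy_kernel_submit(log, TOY_HW_COMPUTE, -1);
    toy_kernel_submit(log, TOY_HW_GFX, first);
    return 2;
}
```

With `gang_submit_supported` set to 0, a submit that contains task shader work turns into two log entries where the second refers to the first. With it set to 1, the same work is a single entry whose mask covers both queues, which is the behaviour the fallback can't provide.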