You mean just streaming dispatch?
So, effectively indirect dispatch, but generalized to be streaming, based on a shader-writable counting semaphore instead (plus a second semaphore for the exit condition)?
Should be possible to map onto all existing hardware out there (it can be serialized to indirect dispatch), and if the batch size is constant, it should also be quite efficient on hardware with full support...
It does get tricky when you require self-dispatch; at that point emulating it with indirect dispatch requires looping, which involves either a CPU round trip or mandatory hardware/firmware support.
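In CUDA terms, the consumer side of that might look roughly like this (a minimal sketch with made-up names; a real implementation would need proper memory-ordering care around the spin loop):

```cuda
#include <cstdint>

struct WorkItem { uint32_t payload; };

__device__ uint32_t nextItem;   // consumer cursor: next slot to claim
__device__ uint32_t workCount;  // the shader-writable counting semaphore
__device__ uint32_t done;       // the second semaphore: exit condition

__device__ void process(WorkItem item, WorkItem* queue)
{
    // ... real work here; producers (or this kernel itself) append items
    // by writing queue[slot] and then bumping workCount atomically.
}

extern "C" __global__ void streamingConsumer(WorkItem* queue)
{
    __shared__ uint32_t slot;
    __shared__ uint32_t exitNow;
    for (;;)
    {
        if (threadIdx.x == 0)
        {
            exitNow = 0;
            slot = atomicAdd(&nextItem, 1u);           // claim the next item
            // wait until the item exists or the producer signals exit
            while (slot >= atomicAdd(&workCount, 0u))  // atomic read
            {
                if (atomicAdd(&done, 0u)) { exitNow = 1; break; }
            }
        }
        __syncthreads();
        if (exitNow) return;       // uniform exit across the whole block
        process(queue[slot], queue);
        __syncthreads();
    }
}
```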
Problem is, you still have a ping-pong-like control flow in there (alternating between BVH traversal, filtering of potential candidates, and hit shading), which doesn't map to a simple, strictly forward streaming dispatch, but implies a feedback loop.
If you have this sort of ping-pong pattern, what you would actually need to do is run it all in a single kernel that toggles between the different possible operations per thread block, and then manage your own sub-command-queues in software.
Using different thread blocks of a single dispatch for different, divergent code paths feels odd, I know. But cooperative execution of divergent control flows within a single kernel is a surprisingly effective method.
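To continue the CUDA-flavored sketch, roughly like this (illustrative names only; the per-block task list would be fed by the software sub-queues):

```cuda
enum Op : unsigned { OP_TRAVERSE_BVH = 0, OP_FILTER = 1, OP_SHADE_HIT = 2 };

__device__ void traverseBvh(unsigned arg)      { /* ... traversal ...  */ }
__device__ void filterCandidates(unsigned arg) { /* ... filtering ...  */ }
__device__ void shadeHit(unsigned arg)         { /* ... hit shading ... */ }

// One big dispatch; each block reads which operation it is responsible
// for and branches. Divergence happens per block, not per thread, so
// every wavefront still executes a single coherent code path.
extern "C" __global__ void uberKernel(const unsigned* blockOp,
                                      const unsigned* blockArg)
{
    unsigned op  = blockOp[blockIdx.x];
    unsigned arg = blockArg[blockIdx.x];
    switch (op)
    {
        case OP_TRAVERSE_BVH: traverseBvh(arg);      break;
        case OP_FILTER:       filterCandidates(arg); break;
        case OP_SHADE_HIT:    shadeHit(arg);         break;
    }
}
```

Each block pays only for the path it takes, which is why per-block (rather than per-thread) divergence stays cheap.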
That's a bit over my head... not sure about the terminology. I know a bit about the OpenCL 2.0 possibilities, but I have never used them, so... I have to admit I do not know exactly what I want, and I also don't know what exactly the hardware could do.
In any case it depends on the latter; I'll just use what's possible. But the problem I face the most is this:
I have a tree with 16-20 levels, and many of my shaders process one level, issue a memory barrier, then process the next level depending on the results (like building mip maps).
Usually only 3 levels of the tree have any work to do at all.
But I still have to record all indirect dispatches to the command buffer, including all the useless barriers. This causes many bubbles. (I use a static command buffer, which I upload only once at startup but execute each frame.)
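In CUDA-style code the pattern is roughly this (made-up names; in my Vulkan version every launch is a pre-recorded vkCmdDispatchIndirect plus pipeline barrier, so the early-out below is exactly what a static command buffer cannot do):

```cuda
#include <cstdint>

__global__ void processLevel(int level /*, tree buffers ... */)
{
    // ... process all nodes queued for this tree level
}

void processTree(const uint32_t* groupsPerLevel, int numLevels /* 16-20 */)
{
    for (int level = 0; level < numLevels; ++level)
    {
        // A CPU-driven loop can simply skip empty levels; a static,
        // pre-recorded command buffer executes all of them regardless,
        // barriers included, and that is where the bubbles come from.
        if (groupsPerLevel[level] == 0)
            continue;
        processLevel<<<groupsPerLevel[level], 64>>>(level);
        // stream ordering acts as the per-level memory barrier here
    }
}
```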
So I would be happy with one of the following options:
Build command buffers on the GPU from compute. (I discussed this with @pixeljetstream here, who has implemented NV's Vulkan extension for GPU-generated command buffers, but it lacks the option to insert barriers.)
Or have the option to skip over recorded commands from the GPU.
The former would be better, of course.
Right now I can fight the bubbles only with async compute, but at the moment I have no independent task for it yet. After some tests it seems this works pretty well, but the limitation should finally be addressed in game APIs.
Pre-recorded command buffers with indirect dispatches already gave me a 2x speedup when moving to Vulkan; I expect another big win here.
To saturate the GPU, it needs to be able to operate independently from the CPU, not just as a client behind a slow connection to a server. At the moment I have no chance of saturating a big GPU; it's mostly bored. (This will change in practice, but still.)
The other option, launching compute shaders directly from compute shaders, is of course super interesting, but I can't say offhand how I could utilize it.
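For the per-level tree work above, the mapping would at least be obvious. CUDA already has this in the form of dynamic parallelism; a rough sketch with illustrative names (needs -rdc=true, and I'm assuming device-side launches into the same stream execute in order, which provides the per-level barrier):

```cuda
#include <cstdint>

__global__ void processLevel(int level, const uint32_t* nodeCount)
{
    // ... process this level's nodes, write nodeCount[level + 1] ...
}

// A one-thread "driver" kernel replacing the CPU loop: it decides on
// the GPU how much follow-up work each level needs and skips empty
// levels entirely -- no pre-recorded zero-sized dispatches.
// Launched from the host as treeDriver<<<1, 1>>>(nodeCount, numLevels).
__global__ void treeDriver(const uint32_t* nodeCount, int numLevels)
{
    for (int level = 0; level < numLevels; ++level)
    {
        uint32_t groups = (nodeCount[level] + 63u) / 64u;
        if (groups == 0)
            continue;
        // Device-side launches into the same stream run in order,
        // so each level sees the previous level's results.
        processLevel<<<groups, 64>>>(level, nodeCount);
    }
}
```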
I can't say whether it is worth supporting functions with call and return, or even recursion as seen in DXR.
In the long run we surely want this, but I'm not one of those requesting OOP or such things just for comfort. It depends on what the hardware can do efficiently.
Andrew Lauritzen tried to bundle up requests like this that arise from other applications, such as rendering:
https://www.ea.com/seed/news/seed-siggraph2017-compute-for-graphics
I'm no rendering expert and do not understand everything there, but he also lists the problem I described above.
Ping-pong control flow is something I'm just used to. And I'm still talking about larger workloads: much smaller than something like an SSAO pass, but still a few thousand wavefronts of work. I use a work-control shader that fills in all the indirect dispatch data; the problem is that it mostly fills it with zeroes.
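In CUDA-style code, that work-control pass is essentially this (illustrative names; the struct mirrors VkDispatchIndirectCommand):

```cuda
// Mirrors VkDispatchIndirectCommand: group counts for one dispatch.
struct DispatchArgs { unsigned x, y, z; };

extern "C" __global__ void workControl(const unsigned* nodeCountPerLevel,
                                       DispatchArgs* args, int numLevels)
{
    int level = blockIdx.x * blockDim.x + threadIdx.x;
    if (level >= numLevels)
        return;
    unsigned groups = (nodeCountPerLevel[level] + 63u) / 64u;
    // Most levels end up as {0,1,1}: a dispatch that does nothing,
    // yet its slot in the static command buffer (and the barrier
    // behind it) is executed every frame anyway.
    args[level] = DispatchArgs{ groups, 1u, 1u };
}
```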
So I'm not asking for totally fine-grained flexibility like on a CPU.
But it would be wonderful if a GPU-driven command buffer could also address async compute, including the synchronization across queues (assuming the queue concept has hardware backing at all).
Actually it's hard to utilize async compute, because you need to divide your static command buffer into multiple fragments to distribute over multiple queues. This alone kills performance for small workloads; add some synchronization and the benefit is lost.
It only works if you have large independent workloads, which is not guaranteed.
So what I want is command buffer recording and queue submission on the GPU itself. There might be better options I'm unaware of.