From the DXR API you get the impression that tracing a ray is an atomic operation and that execution waits on the result. That would hint there's no batching, reordering, or whatever we want to call it. But the API alone is no proof. It remains a mystery.
The API is very high level - shaders dispatch some rays, and at some point a hit/miss/whatever kernel is executed with the results. There's a lot of resemblance to Nvidia's dynamic parallelism in CUDA, honestly.
Rays are already inherently batched to some extent at the wavefront level. They're often even coherent at that granularity!
Scheduling is complicated, to say the least. The last thing you want is SMs sitting idle for thousands of clock cycles waiting for rays! And the latency
will be thousands of cycles, since traversing an acceleration structure is pointer chasing, with a lot of it uncached.
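To make the pointer-chasing point concrete, here's a toy 1-D sketch of acceleration-structure traversal: each step loads a node, and the address of the *next* load depends on the result of the current intersection test, so the memory latencies chain up. The node layout and names here are invented for illustration, not DXR's actual BVH format.

```python
class Node:
    """Illustrative BVH node: bounds are a (lo, hi) interval on one axis."""
    def __init__(self, bounds, left=None, right=None, prim=None):
        self.bounds = bounds
        self.left = left
        self.right = right
        self.prim = prim  # leaf payload, if any

def hits(bounds, origin, direction):
    # 1-D slab test: does the ray (origin + t*direction, t >= 0) cross [lo, hi]?
    lo, hi = bounds
    if direction == 0:
        return lo <= origin <= hi
    t0, t1 = (lo - origin) / direction, (hi - origin) / direction
    return max(min(t0, t1), 0.0) <= max(t0, t1)

def traverse(root, origin, direction):
    # Each iteration's load (node.left / node.right) depends on the previous
    # intersection result -- the dependent-load chain the text describes.
    stack, hit = [root], None
    while stack:
        node = stack.pop()
        if node is None or not hits(node.bounds, origin, direction):
            continue
        if node.prim is not None:
            hit = node.prim
        else:
            stack.append(node.right)
            stack.append(node.left)
    return hit

root = Node((0.0, 10.0),
            left=Node((0.0, 3.0), prim="triangle_7"),
            right=Node((4.0, 5.0), prim="triangle_42"))
```

On a GPU every `node.left`/`node.right` dereference here is a potentially uncached memory access that nothing else in the chain can overlap, which is where the thousands of cycles come from.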
So what do you do with the kernel while the ray dispatch is in flight? Evicting a block from an SM is an expensive operation, since it has a huge chunk of registers, and possibly shared memory too, that needs to be persisted. We're talking kilobytes here. This is of course assuming the shader isn't simply terminated at that point - one possible optimization would indeed be to put the dispatch at the end of a kernel, allowing it to terminate without waiting for results, and have the hit/miss kernels responsible for writing results to a UAV or something.
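The "dispatch at the end" idea can be sketched in continuation style. This is purely speculative pseudostructure in Python: `trace_async`, the `pending` queue, and the `output` dict are all invented stand-ins (for the hardware ray scheduler and a UAV, respectively), not real DXR API.

```python
output = {}   # stand-in for a UAV that the hit/miss shaders write into
pending = []  # stand-in for rays queued to a hypothetical hardware scheduler

def trace_async(ray_id, hits_something, on_hit, on_miss):
    # The calling shader enqueues the ray and returns immediately:
    # no registers or shared memory kept alive across the stall.
    pending.append((ray_id, hits_something, on_hit, on_miss))

def raygen(ray_id, hits_something):
    # ... any shading work that doesn't need the trace result ...
    trace_async(ray_id, hits_something,
                on_hit=lambda rid: output.__setitem__(rid, "shaded_hit"),
                on_miss=lambda rid: output.__setitem__(rid, "sky_color"))
    # raygen ends here; nothing waits on the result.

def resolve_pending():
    # Stand-in for traversal finishing and the hit/miss kernels launching.
    while pending:
        ray_id, hit, on_hit, on_miss = pending.pop()
        (on_hit if hit else on_miss)(ray_id)

raygen(0, hits_something=True)
raygen(1, hits_something=False)
resolve_pending()
```

The appeal of this shape is exactly what the paragraph above says: the launching kernel's state dies with it, so there's nothing to evict or keep resident while the ray is in flight.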
Asynchronous compute could be very useful for covering stalls from ray dispatch, though the presence of the stalled kernels still limits occupancy.
A driver/compiler-level optimization is to let the kernel keep running past the ray dispatch point as far as possible. This is a common trick in many compilers - by putting as much independent code as possible between a high-latency operation and the point where its result is consumed, you can cover part or all of the stall. Static reordering!
We can see that, depending on where the ray results are used, tracing can be either latency-sensitive or not. In the latency-sensitive case, large-scale batching and reordering actually hurt performance; in the other, they might be a win. There's a lot of room for hardware, driver, and compiler optimization here.
Anyway, https://devblogs.nvidia.com/rtx-best-practices/ has a lot of indirect information about how RTX works under the hood.