At least some groundwork for reordering seems done. The block compression paper divides the BVH into branches, which reminds me of Nvidia's 'treelets' paper. (Can't find it anymore, but guys like Laine / Karras were involved iirc.)
My own plan for software reordering was to build such branches that fit into LDS, bin sets of rays to them, and then brute-force intersect all of this without any VRAM access in the inner loops. That would solve the divergent memory access to the BVH. And as I understood it, the treelets paper was about the same idea. (Building beams to bound sets of rays for quick rejection would probably be a nice optimization here too.)
However, while this solves the BVH memory issue, the rays themselves become the big issue instead. For optimal reordering we would need to rebin the rays in every traversal iteration. So they move from registers into VRAM, which requires a huge prefix sum for the binning, and then optionally a reorder pass in memory to get nice packets for the next step. Super heavy, and no win for a software implementation I guess. For RT cores it would mean rays have to move from one core to another, and likely one core alone does not have enough rays in flight to make reordering a win either.
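To make the binning / reorder step above concrete, here is a minimal host-side sketch (plain serial Python, purely illustrative, not any vendor's actual implementation): assume each ray carries the id of the treelet (BVH branch) it must visit next; a histogram plus an exclusive prefix sum gives each bin its output offset, and a scatter pass then writes rays of the same bin contiguously. On a GPU all three passes would run in parallel over VRAM, which is exactly the heavy part.

```python
def reorder_rays_by_treelet(rays, treelet_ids, num_treelets):
    """Group rays by treelet id (a counting sort via prefix sum)."""
    # 1. Histogram: how many rays want each treelet.
    counts = [0] * num_treelets
    for t in treelet_ids:
        counts[t] += 1
    # 2. Exclusive prefix sum: start offset of each bin in the output.
    offsets = [0] * num_treelets
    running = 0
    for i, c in enumerate(counts):
        offsets[i] = running
        running += c
    # 3. Scatter: write each ray to its bin's next free slot.
    out = [None] * len(rays)
    for ray, t in zip(rays, treelet_ids):
        out[offsets[t]] = ray
        offsets[t] += 1
    return out

# e.g. rays "a".."d" wanting treelets [2, 0, 2, 1] come out grouped as
# ["b", "d", "a", "c"]
```

Doing this every traversal iteration is what makes the VRAM round trip so expensive; the compromise below reduces how often it runs.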
A reasonable compromise might be to reorder only every Nth iteration, and to use sets of rays small enough that all work stays on chip, so no VRAM round trip for binning / reordering is needed.
No matter what - with proper reordering the traceRay function is no longer a small atomic thing, but becomes a global workload like a big compute dispatch.
DXR 1.0 seems not really designed for this: inline tracing would outright break it, and potential traversal shaders would move even further out of reach. So again I conclude there is no reordering yet. It also just doesn't seem worth it yet.
Interesting for my requests: I would need to handle their block compression on my side. So that's a vendor specialization, but no unexpected problem. It's even likely other vendors would do the same.
VALU budget
Per CU per cycle the GPU can process 1 BVH node (i.e. one lane's worth of BVH work) and 64 lanes of VALU instructions.
This answers my question about pipelining. I did not know whether the instruction is parallel or serial, but considering the unit seems tiny, it's likely serial.
So when a CU issues the instruction, the wavefront goes idle and another wavefront takes the SIMD. After the RT unit is done with all 32/64 intersections, the wavefront can continue.
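A back-of-envelope check of that ratio (the numbers are the throughput figure quoted above plus assumed wave sizes, not vendor-confirmed): if the RT unit tests 1 node per cycle and a wave needs one node test per lane, a wave64 stalls for ~64 cycles per traversal step, during which the SIMDs can issue roughly 64 wave-wide VALU instructions. That is how much ALU work other wavefronts need in flight to hide the latency.

```python
def valu_budget_per_traversal_step(wave_size, nodes_per_cycle, valu_lanes_per_cycle):
    """Cycles the RT unit is busy per wave, and how many wave-wide VALU
    instructions the CU could issue in that window (illustrative model)."""
    rt_cycles = wave_size / nodes_per_cycle
    valu_instructions = rt_cycles * valu_lanes_per_cycle / wave_size
    return rt_cycles, valu_instructions
```

With the quoted 1 node and 64 VALU lanes per cycle, both wave32 and wave64 end up needing on the order of 64 wave-wide VALU instructions of other work per traversal step to keep the SIMDs fed.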
Now, could AMD make RT faster simply by making the unit wider in future GPUs? Traversal and stack handling would not be affected. This gives me some hope they stick with this flexible solution.