The alternative is terrible: execute ray generation on a CUDA core, submit to the RT core and wait. The RT core traverses the BVH for each ray incoherently (inefficient even with fixed-function hardware, I guess). Then wake up the sleeping CUDA core, shade each hit point by material (it's already impossible to have a unique shader per thread), return to the ray shader, and so on. Horribly inefficient.
AFAIK, BVH traversal is always incoherent for the lower layers, but luckily there is also pretty good cache locality for the upper layers of the tree. And the cost of swapping out the stack mid-traversal to sort for a better-fitting candidate would easily exceed the gains, if there are any at all: even if you were to bin ray segments by area, that only gives you 1-2 layers of coherency in BVH access, and beyond that access becomes incoherent again unless the rays are almost identical.
Anyway, BVH traversal isn't where the cost explodes yet; that only happens on an actual hit. The fixed-function hardware is also doing a good job there, I suppose.
Sorting for coherency happens after a hit, before shading begins — in hardware for PowerVR, and for DXR currently hacked in on the application side.
If anything, you could think about binning already on a near hit (prior to testing the ray against the triangle soup), in the expectation that you might gain some locality there in case the BVH depth is actually insufficient.
Though it's really a failure on NVIDIA's part that they didn't already do that in the driver.
I suppose it's because that would incur a static overhead on every ray cast, which would ruin their nice marketing statement about peak performance. It's probably also futile once the scene complexity or number of rays exceeds certain limits, as at some point you would have to enqueue literally millions of hits to achieve any form of locality, at which point sorting is hardly worth it any more. The same goes for under-sampling too much, so there isn't any chance of actually utilizing a full cache line either.
So 'recursion' is not really recursion in the sense of what we think of on a CPU?
It is ordinary recursion, but the BVH has a fixed depth by design in the current implementation. So you do have recursion, just within a fixed stack size. Shuffling a ray to another core would also mean shuffling the stack, easily a few hundred bytes (coordinates + a pointer to the BVH-tree node for each stack frame). You can only discard the stack once you have a hit.