I don't know if it's a useful comparison to make, but mesh shading is a similar case of "specialised software" replacing "generic hardware".
It's hard to beat the hardware if you're only "replicating" what the hardware does already. The reason to use mesh shaders is that you want to do something that the hardware can't do directly.
Mesh shading is also, at least in part, trying to assimilate a fair number of other specialized software methods in the form of the vertex, geometry, and tessellation shader stages. Getting around the serialized read of the index buffer would be one fixed-function path replaced, although I'm not sure that on its own is inherent to the hardware concept rather than an accident of history. Taking Turing as an example, there were specific cases where the traditional pipeline could still win, such as some tessellation scenarios. The continued evolution of the fixed-function pipeline also meant that mesh shaders could frequently skip things like the full suite of per-primitive culling tests when the hardware could cull those primitives cheaply anyway. Part of the challenge is weighing a fully featured replacement shader against the additional cost each programmable option incurs on the generic execution loop.
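To make that cost concrete, here is a rough sketch of the kind of per-primitive test a mesh shader has to run itself if it doesn't lean on the fixed-function culling behind it. It's written as CUDA-style device code purely for illustration (real mesh shaders would be HLSL/GLSL, and the `ClipVert` layout, the single frustum plane, and the `cullTriangle` name are all made up for the example); the point is that every test the shader takes on is extra per-primitive work on the generic SIMD path rather than in dedicated hardware.

```cuda
// Hypothetical clip-space position; real code would pull this from the
// meshlet's transformed vertices.
struct ClipVert { float x, y, z, w; };

// Returns true if the triangle can be dropped before being emitted.
__device__ bool cullTriangle(const ClipVert& a, const ClipVert& b, const ClipVert& c)
{
    // Frustum test: reject only if all three vertices are outside the same
    // plane. Only the +x plane is shown; a full test repeats this six times.
    if (a.x > a.w && b.x > b.w && c.x > c.w)
        return true;

    // Backface test via the signed area of the projected triangle.
    float ax = a.x / a.w, ay = a.y / a.w;
    float bx = b.x / b.w, by = b.y / b.w;
    float cx = c.x / c.w, cy = c.y / c.w;
    float area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
    return area <= 0.0f;   // winding convention is arbitrary here
}
```

Each option like this is a handful of ALU and some register pressure per primitive, which is exactly the trade-off being weighed above.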
Yuri O'Donnell appears to be complaining that an "opaque" BVH, which the developer cannot manipulate directly, is a serious problem all on its own. That data format is the core of the ray accelerator hardware, so something like "custom inline traversal" is doomed: the hardware that accelerates walking the opaque BVH is the most serious bottleneck, and you can't help that hardware with its data because the BVH is inaccessible.
The BVH structure would likely be a black box until there's more standardization. There's not a firm consensus on what the BVH should look like at a lower level, and it's often tuned to the packing and access requirements of specific architectures. I wouldn't relish the prospect of having to write a new BVH node format or heuristic for every cache subsystem that exists or could exist.
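As a purely hypothetical illustration of why the format ends up architecture-specific, here is one way a compressed 4-wide node might be packed so that a single 64-byte cache line feeds a 4-wide box test. None of the field choices below correspond to any real vendor format; change the cache-line size, load granularity, or intersection width and you would want a different layout, which is exactly the maintenance burden I'd rather avoid.

```cuda
#include <cstdint>

// Hypothetical compressed BVH4 node padded to one 64-byte cache line.
// Child bounds are quantized against a shared origin/scale so all four
// boxes plus their child links fit in a single aligned fetch.
struct Bvh4Node
{
    float    origin[3];    // shared origin for the quantized child boxes (12 B)
    int8_t   exp[3];       // per-axis power-of-two scale                 (3 B)
    uint8_t  childCount;   // number of valid children                    (1 B)
    uint8_t  lo[4][3];     // quantized child box minima                  (12 B)
    uint8_t  hi[4][3];     // quantized child box maxima                  (12 B)
    uint32_t child[4];     // child node or leaf references               (16 B)
    uint32_t pad[2];       // pad so one node == one cache line           (8 B)
};
static_assert(sizeof(Bvh4Node) == 64, "layout tuned to a 64-byte line");
```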
Part of winnowing down the possibilities would come from hard data gathered on deployed RT hardware, but without the acceleration we already have, that data would not be forthcoming.
However, is it certain that the most serious bottleneck for ray tracing is the acceleration hardware for the BVH? The most performant solution so far is the one with more dedicated hardware.
Is it a clear win to have a generic software solution running on shader hardware? The AMD execution loop is partially programmable, and as such it pays for a round trip through the texturing domain, 32-wide execution granularity, LDS capacity and bandwidth contention, LDS latency, and a pointer-chasing workload sitting on a long-latency memory critical path.
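To show where those costs sit, here is a sketch of a software traversal loop in CUDA, with the per-ray stack placed in shared memory as a stand-in for LDS and a plain binary node to keep it short. This is not AMD's actual code path (the driver-generated loop isn't public, and on RDNA the box/triangle tests are issued to the texture units); the names `Node`, `hitAabb`, and `traverse` are just for this example. It only makes visible the dependent node loads, the per-lane stack footprint, and the fact that the loop runs at wave granularity whether the rays agree on a path or not.

```cuda
#include <cstdint>

#define WAVE_SIZE   32
#define STACK_DEPTH 32

struct Ray  { float ox, oy, oz, dx, dy, dz, tmax; };

// Hypothetical binary node: bounds, two children, and an inline leaf range.
struct Node { float bmin[3], bmax[3]; uint32_t left, right, triFirst, triCount; };

// Standard slab test; in practice the reciprocal directions would be precomputed.
__device__ bool hitAabb(const float bmin[3], const float bmax[3], const Ray& r)
{
    float t0 = 0.0f, t1 = r.tmax;
    const float o[3] = { r.ox, r.oy, r.oz };
    const float d[3] = { r.dx, r.dy, r.dz };
    for (int i = 0; i < 3; ++i) {
        float inv = 1.0f / d[i];
        float tn = (bmin[i] - o[i]) * inv;
        float tf = (bmax[i] - o[i]) * inv;
        if (tn > tf) { float t = tn; tn = tf; tf = t; }
        t0 = tn > t0 ? tn : t0;
        t1 = tf < t1 ? tf : t1;
    }
    return t0 <= t1;
}

__device__ void traverse(const Node* nodes, const Ray& ray)
{
    // One stack slice per lane: this is the LDS capacity and bandwidth the loop consumes.
    __shared__ uint32_t stack[WAVE_SIZE][STACK_DEPTH];
    uint32_t* s = stack[threadIdx.x % WAVE_SIZE];
    int top = 0;
    s[top++] = 0;                                   // start at the root

    while (top > 0)                                 // runs until every lane's stack drains
    {
        const Node& n = nodes[s[--top]];            // dependent, pointer-chasing load
        if (!hitAabb(n.bmin, n.bmax, ray))          // box test (the texture-unit round trip on RDNA)
            continue;
        if (n.triCount > 0) {
            // Leaf: triangle tests against triFirst..triFirst+triCount would go here.
        } else if (top + 2 <= STACK_DEPTH) {
            s[top++] = n.left;                      // interior: push children to pop
            s[top++] = n.right;                     // (and fetch) on later iterations
        }
    }
}
```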
Making more of the process programmable means adding more dependence on things like the SIMD hardware, LDS contention, and more trips through the memory pipeline.
It isn't guaranteed that a more clever solution can get out from under those greater fixed costs.