With relatively unpredictable and incoherent memory access, RT seems likely to become heavily latency bound fairly easily, whatever you do. Fixed function units that increase compute throughput per area would then seem of limited benefit versus a programmable pipeline that is a bit bigger in die area, when it's the latency/cache structure you have to worry about fairly quickly.
(Up front do note that the discussion so far has centered on primary rays, so some of my comments have been specifically around that.)
Agreed compute throughput isn't really the issue here for the incoherent rays (although of course with a real RT pipeline it's all over the place depending on the actual data and rays, which is part of the optimization complexity). That said, while it's clear that some stuff about the current black box is suboptimal, it's also not clear to me that some sort of tightly coupled software traversal is really much better.
Consider that we already have several proof points and counterpoints:
- RT pipelines exhibit the expected uber-shader issues around occupancy and payload sizes
- Hit shaders run relatively poorly on all hardware even with ray sorting
- AMD's current software traversal is not very good, and while there are certainly some advantages to the flexibility, most of them come from not having opaque TLAS/BLAS formats (which allows finer grained precomputation and streaming algorithms than on PC) rather than from any particular traversal cleverness that I've seen.
- "Inline" RT/tail recursion/etc. seems to work pretty well in practice, giving the flexibility to do rescheduling or data-aware programmable stuff between large batches of rays, but not trying to inject shader code into performance sensitive inner traversal loops. Of course IMG/PowerVR would have told us this a decade ago...
I think a reasonable analog here would be anisotropic filtering, which despite several attempts over the years to replace it has still survived in hardware form. It is also not particularly compute intensive by modern standards and has very coherent access patterns. The key though is that it is a highly data-dependent, variable latency operation (sounds a lot like tracing a ray?) and thus funneling requests through a (relatively) long queue to shield the caller from the divergence of execution is still a pretty significant win. This of course has a cost in terms of latency hiding, which manifests as pressure on one of the most resource constrained parts of GPUs: the register file. For RT, which has even longer latencies, it seems like most folks are trending towards the conclusion that the cost of keeping shader-levels of live state around is not worth the flexibility benefits, but I don't think that question is entirely settled yet.
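To put rough numbers on the register file point, here is a back-of-the-envelope sketch in Python. It assumes a simple Little's law model (requests in flight = latency divided by issue interval), and every figure in it (register file size, bytes of live state, latencies) is made up for illustration rather than taken from any real GPU. The function names are hypothetical as well.

```python
def max_resident_threads(register_file_bytes, live_bytes_per_thread):
    """Threads whose live state fits in the register file at once."""
    return register_file_bytes // live_bytes_per_thread


def requests_in_flight(latency_cycles, issue_interval_cycles):
    """Little's law: requests that must be in flight to keep the unit busy."""
    return -(-latency_cycles // issue_interval_cycles)  # ceiling division


# All numbers are hypothetical, chosen only to show the shape of the tradeoff.
REGISTER_FILE_BYTES = 256 * 1024

cases = [
    # (label, live bytes per thread, request latency in cycles)
    ("texture-like request", 128, 400),
    ("ray with shader-level state", 512, 2000),
]

for label, live_bytes, latency in cases:
    resident = max_resident_threads(REGISTER_FILE_BYTES, live_bytes)
    needed = requests_in_flight(latency, issue_interval_cycles=1)
    print(f"{label}: {resident} resident, {needed} needed, "
          f"covered={resident >= needed}")
```

With these invented inputs the short, small-state request is comfortably covered while the long, fat-state one is not, which is the squeeze described above: longer latency and more live state per request both eat into the same fixed register budget.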
And a programmable RT pipeline offers fixes that concentrating on hardware box BVHs wouldn't. Just go from boxes to spheres, and now all you have to do is move sphere centers around: much faster refits and rebuilds! Or what if you can move to splats instead of triangles? Now your geometry testing is faster, and you can have simpler acceleration structures altogether because you can just brute force more of the geometry testing.
It's certainly not something to rule out, but there are also just not that many different types of primitives that are commonly used in these acceleration structures. I do expect there to remain some level of tradeoff between structure update costs and ray traversal costs, but it's not clear to me that it's something that needs to be entirely in user space, especially as we move into a world where "general compute" scaling is rapidly slowing.
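For reference, the update half of the sphere suggestion is easy to sketch. The Python below is a minimal illustration, assuming a rigid motion (a rotation about the z axis plus a translation) applied to a small point set; the function names and the data are hypothetical, and this is not how any particular driver or BVH builder does it. Under a rigid motion a bounding sphere only needs its center moved, while an axis-aligned box has to be refit, here by recomputing it from the moved points.

```python
import math


def transform(p, angle, offset):
    """Rotate point p about the z axis by angle, then translate by offset."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y + offset[0], s * x + c * y + offset[1], z + offset[2])


def refit_sphere(center, radius, angle, offset):
    """Rigid motion preserves distances, so only the center has to move."""
    return transform(center, angle, offset), radius


def refit_aabb(points, angle, offset):
    """The box is axis-aligned, so it is rebuilt from every moved point."""
    moved = [transform(p, angle, offset) for p in points]
    lo = tuple(min(p[i] for p in moved) for i in range(3))
    hi = tuple(max(p[i] for p in moved) for i in range(3))
    return lo, hi


# Hypothetical data: four points on the unit sphere around the origin.
POINTS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
ANGLE, OFFSET = 0.7, (2.0, 3.0, 4.0)

new_center, new_radius = refit_sphere((0.0, 0.0, 0.0), 1.0, ANGLE, OFFSET)
lo, hi = refit_aabb(POINTS, ANGLE, OFFSET)
print("sphere:", new_center, new_radius)
print("aabb:  ", lo, hi)
```

The sphere update is constant work per node while the box refit touches every point, which is the appeal. The other side of the tradeoff is that spheres tend to bound typical geometry more loosely than boxes, so the cheaper update is likely paid for with extra traversal and intersection work.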
So it's definitely an interesting discussion and it could go in a few directions, but I do think the critical questions are really all around acceleration structure update, not tracing. If we need to make some sacrifices on the tracing front to make the acceleration structure stuff faster to build and maintain, that's the only place where I think that discussion really matters a lot. The tracing part seems to be mostly understood already, at least for rigid, opaque geometry.