BVH4 is just broken in my opinion. 64 bytes per BVH4 node with 128-byte cachelines doesn't make sense: you're still fetching the second 64 bytes, which you often won't need. The intersection HW isn't as expensive as memory hierarchy bandwidth, so BVH8 is basically a "free"(-ish) improvement, and it makes complete sense to do it in a single TMU of a single CU.
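To put rough numbers on the cacheline argument, here's a quick back-of-the-envelope sketch. The 64-byte BVH4 node and 128-byte cacheline are the figures above; the 128-byte BVH8 node is my assumption (simply double the BVH4 node), and `cacheline_utilisation` is a made-up helper name. It also takes the pessimistic view that the other half of the line is never useful, whereas in practice it sometimes will be.

```python
import math


def cacheline_utilisation(node_bytes: int, line_bytes: int = 128) -> float:
    """Fraction of fetched bytes that belong to the node.

    Assumes the node starts on a cacheline boundary and that memory is
    always fetched in whole cachelines.
    """
    lines_fetched = math.ceil(node_bytes / line_bytes)
    return node_bytes / (lines_fetched * line_bytes)


bvh4 = cacheline_utilisation(64)    # 64-byte BVH4 node (figure from the post)
bvh8 = cacheline_utilisation(128)   # ASSUMED 128-byte BVH8 node

print(f"BVH4: {bvh4:.0%} of fetched bytes used")  # BVH4: 50% of fetched bytes used
print(f"BVH8: {bvh8:.0%} of fetched bytes used")  # BVH8: 100% of fetched bytes used
```

Under those assumptions, a BVH8 node gets twice the children tested for the same single cacheline fetch, which is why the extra intersection hardware looks cheap next to the bandwidth.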
It feels like an easy improvement and doesn't say much about AMD's long-term plans for raytracing. It'll be interesting to see whether they do stick to adding as little specialised hardware as possible, as you say. I guess it depends how much better they can make it with this kind of incremental improvement.