Nvidia’s RT patents talk about a local BVH cache inside the RT unit
That's probably something different. If you have a cached path all the way to the most recently used triangle soup (or a non-exposed sub-group inside one), and you end up finding a hit there for an (at least somewhat) coherent ray, it's an instant, "free" hit without having to repeat any part of the pointer chase for the traversal.
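To make that concrete, here is a minimal sketch of the idea, not anything taken from the patent. `Node`, `hits_box` and `LeafCachedTracer` are made-up names, a leaf's bounding box stands in for its triangles, and `node_fetches` just counts how many nodes the pointer chase had to touch. Note the cached hit is only "a" hit, not necessarily the closest one, so read this as an any-hit style query.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec3
    hi: Vec3
    name: str = ""  # only leaves carry a name in this sketch
    children: List["Node"] = field(default_factory=list)


def hits_box(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: does the ray (t >= 0) intersect the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:
                return False  # parallel to this slab and outside of it
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


class LeafCachedTracer:
    """Hypothetical traversal that remembers the most recently hit leaf."""

    def __init__(self, root: Node) -> None:
        self.root = root
        self.last_leaf: Optional[Node] = None
        self.node_fetches = 0  # nodes touched by the pointer chase

    def trace(self, origin: Vec3, direction: Vec3) -> Optional[str]:
        # Fast path: test the cached leaf first, no fetches needed.
        leaf = self.last_leaf
        if leaf is not None and hits_box(origin, direction, leaf.lo, leaf.hi):
            return leaf.name
        # Slow path: full depth-first traversal from the root.
        stack = [self.root]
        while stack:
            node = stack.pop()
            self.node_fetches += 1
            if not hits_box(origin, direction, node.lo, node.hi):
                continue
            if not node.children:
                self.last_leaf = node
                return node.name
            stack.extend(reversed(node.children))
        return None
```

With four unit-cube leaves in a row under a two-level tree, the first ray into the third leaf costs four node fetches, and a second, nearly parallel ray into the same leaf costs none.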
Guess why their guide tells you not to have overlapping bottom-level acceleration structures. Overlap results in a reduced cache hit rate or even spilling of the RT unit's internal cache, even when a sibling or nested ray is coherent, because too many structures alias spatially.
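If you want to see how much overlap a scene actually has, a brute-force check over the world-space bounding boxes of the BLAS instances is enough for a first look. This is just a sketch: `boxes_overlap` and `overlapping_pairs` are hypothetical helpers, the boxes are assumed to be already transformed to world space, and the O(n²) pair loop is only reasonable for small instance counts.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two boxes share volume; merely touching faces do not count."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[i] < bhi[i] and blo[i] < ahi[i] for i in range(3))


def overlapping_pairs(boxes: Dict[str, Box]) -> List[Tuple[str, str]]:
    """Names of all instance pairs whose bounding boxes overlap."""
    return [
        (name_a, name_b)
        for (name_a, a), (name_b, b) in combinations(boxes.items(), 2)
        if boxes_overlap(a, b)
    ]
```

Every pair this reports is a region where a ray has to consider more than one bottom-level structure, which is exactly the situation the guide warns about.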
Also what’s stopping the traversal shader from prefetching wide nodes into LDS?
There's not really a point in prefetching whole wide nodes. Too much of it is stuff you are never going to need or hit. Most of it is better streamed and then discarded right away. Either you get a coherent hit on the exact same path (or a prefix of it!), or you are better off re-filtering, starting from the best cached approximation.
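The "same path or a prefix of it" part can be sketched like this. To keep it short, everything is collapsed to one dimension: nodes are intervals and a "ray" is just a coordinate `x`. `Node`, `descend` and `PathCache` are made-up names, and the fetch counter only counts child nodes that were streamed and filtered, on the assumption that the cached path itself costs nothing to re-test.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    lo: float
    hi: float
    name: str = ""
    children: List["Node"] = field(default_factory=list)

    def contains(self, x: float) -> bool:
        return self.lo <= x <= self.hi


def descend(start: Node, x: float, path: List[Node]) -> int:
    """Walk down from `start`, appending to `path`; return nodes streamed."""
    fetched = 0
    node = start
    while node.children:
        fetched += len(node.children)  # streamed, filtered, then discarded
        nxt = next((c for c in node.children if c.contains(x)), None)
        if nxt is None:
            break
        path.append(nxt)
        node = nxt
    return fetched


class PathCache:
    """Hypothetical cache holding only the most recently used root-to-leaf path."""

    def __init__(self, root: Node) -> None:
        self.path: List[Node] = [root]

    def lookup(self, x: float) -> Tuple[Optional[Node], int]:
        # Keep the longest cached prefix that still contains x.
        keep = 0
        for node in self.path:
            if not node.contains(x):
                break
            keep += 1
        if keep == 0:
            return None, 0  # x misses even the root
        self.path = self.path[:keep]
        # Re-filter from the best cached approximation, not from the root.
        fetched = descend(self.path[-1], x, self.path)
        leaf = self.path[-1]
        return (leaf if not leaf.children else None), fetched
```

On a two-level binary tree over [0, 8], a lookup right next to the previous one costs zero fetches, a lookup in the sibling leaf costs two, and only a lookup on the far side of the tree pays the full four.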
That said, you do want to "keep" the parts of the tree in cache that represent siblings which are also already known to hit.
E.g. when filtering for matching BLASes, say you have already found a second match: you write that one straight to the cache as well, as an additional entry point for further traversal, so you get it "instantly" if the traversal has to resume. Filtering for more than one potential hit while streaming the parent structure was "free" after all, as you already had the memory fetch pipelined...
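Roughly like this, as a sketch under the same one-dimensional simplification (instances are intervals, the ray is a coordinate `x`). `Instance` and `EntryPointCache` are hypothetical names, and the `overflowed` flag is my own assumption about what a bounded cache would need: once it is set, a resumed traversal would have to re-stream the parent instead of trusting the cached entries.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Optional


@dataclass(frozen=True)
class Instance:
    name: str
    lo: float
    hi: float


class EntryPointCache:
    """Hypothetical bounded store of extra matches found during one streaming pass."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.pending: Deque[Instance] = deque()
        self.overflowed = False  # True if a match had to be dropped

    def first_match(self, streamed: Iterable[Instance], x: float) -> Optional[Instance]:
        """One pass over the streamed parent: return the first match, stash the rest."""
        first: Optional[Instance] = None
        for inst in streamed:
            if not (inst.lo <= x <= inst.hi):
                continue
            if first is None:
                first = inst
            elif len(self.pending) < self.capacity:
                self.pending.append(inst)  # extra entry point, found "for free"
            else:
                self.overflowed = True
        return first

    def resume(self) -> Optional[Instance]:
        """Next cached entry point, without streaming the parent again."""
        return self.pending.popleft() if self.pending else None
```

The parent is read exactly once. If the first match turns out not to contain the final hit, `resume()` hands back the next candidate without another trip through memory.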
Dang it, such a cache is actually a pretty smart construct, as you get spatially coherent hits first "by design" (as cached subtrees are explored first), which further provides a massive boost to the efficiency of the actual traversal and hit shaders...
PS: No, I did not read the patent. You just said "there is a cache", and the rest is just an extrapolation based on some extremely basic understanding of cache architectures and the implications a loss of coherency would have for traversal performance...