OK, it’s probably true that the shader doesn’t need to inspect the contents of the node in order to schedule it. But that doesn’t seem to be a notable benefit of shader-based scheduling, given it’s also the case for Nvidia’s fixed-function approach.
The original comparison was with a hypothetical RT block that only returned intersection results without performing traversal, which would leave the SIMD having to issue explicit vector memory reads, to data the RT unit had already fetched and parsed, just to determine the next node addresses. AMD's method is at least less redundant than that.
AMD’s patent calls for storing traversal state in registers and the texture cache. It would seem the shader is responsible for managing the traversal stack for each ray, and that stack presumably lives in L0. I don’t see how you would avoid thrashing the cache if you try to do anything else alongside RT, unless of course you have an “infinite” amount of cache.
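To make that concrete, here's a rough sketch of what such a hybrid loop might look like, with plain C++ standing in for shader code. Everything in it (intersect_node, the node result layout, the 32-entry stack) is assumed for illustration, not taken from the patent: the fixed-function step evaluates one node and hands back child pointers, while the shader owns the per-ray stack that has to live somewhere in registers, LDS, or the cache.

```cpp
#include <cstdint>

// Hypothetical result of the fixed-function "intersect ray against node" step.
struct NodeResult {
    uint32_t child_ptrs[4];   // children the shader should push (box node)
    uint32_t num_children;    // 0 on a miss or at a leaf
};

// Stand-in for the RT unit: it fetches and parses the node itself, so the
// shader never has to inspect the node's contents to schedule the next step.
NodeResult intersect_node(uint32_t node_ptr) {
    (void)node_ptr;
    return NodeResult{{0, 0, 0, 0}, 0};   // placeholder: always reports a miss
}

void traverse_ray(uint32_t root_ptr) {
    // Per-ray traversal stack: this is the state that would sit in VGPRs/LDS
    // (or spill toward the cache) and compete with whatever else the CU runs.
    uint32_t stack[32];                   // depth chosen arbitrarily
    int sp = 0;
    stack[sp++] = root_ptr;

    while (sp > 0) {
        uint32_t node = stack[--sp];
        NodeResult r = intersect_node(node);        // fixed-function evaluation
        for (uint32_t i = 0; i < r.num_children; ++i)
            stack[sp++] = r.child_ptrs[i];          // shader-side scheduling
    }
}

int main() { traverse_ray(0); }
```

The point of the sketch is just where the state sits: the loop and the stack belong to the shader, so every ray in flight drags its own stack footprint along with it.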
AMD's patent doesn't clearly outline where the intermediate work between node evaluations resides. It highlights that the SIMD and CU have substantial storage available at no additional cost, versus the likely hardware footprint of providing sufficient storage on an independent unit.
AMD's comparison is between its hybrid method and a dedicated unit that could traverse a BVH to arbitrary depth without ever redoing traversal, i.e. one that never loses the full context of what has already been traversed.
Nvidia's scheme appears to have a traversal stack of finite depth that can lead to redundant node traversal, which makes it less expensive than what AMD was using as its baseline.
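As a toy illustration of where that redundancy comes from (generic short-stack behaviour, not Nvidia's actual scheme, and the 4-entry depth is made up): a bounded stack has to drop its oldest pending entry on overflow, and recovering that work later means restarting from the root and re-intersecting interior nodes that were already visited once.

```cpp
#include <cstdint>
#include <cstdio>

constexpr int kStackDepth = 4;   // assumed finite hardware depth

// Bounded traversal stack that silently drops its oldest entry when full.
struct ShortStack {
    uint32_t entries[kStackDepth];
    int      top = 0;
    bool     dropped = false;

    void push(uint32_t node) {
        if (top == kStackDepth) {                   // full: shift everything down,
            for (int i = 1; i < kStackDepth; ++i)
                entries[i - 1] = entries[i];
            --top;
            dropped = true;                         // ...losing the bottom entry
        }
        entries[top++] = node;
    }

    bool pop(uint32_t* node) {
        if (top == 0) return false;
        *node = entries[--top];
        return true;
    }
};

int main() {
    ShortStack st;
    for (uint32_t n = 0; n < 6; ++n)                // push deeper than the stack
        st.push(n);
    // Once 'dropped' is set, the pending subtree can only be recovered by
    // restarting from the root, which re-traverses already-visited nodes.
    printf("dropped pending work: %s\n", st.dropped ? "yes" : "no");
    return 0;
}
```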
Whether AMD's method uses registers, LDS, or possibly spills to memory isn't spelled out. Even if there were spills to memory, writing pointers and metadata from completed RT node evaluations out to something like a stack seems like it could be less disruptive than the SIMD re-gathering node data on its own.
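A back-of-the-envelope way to see why: a spilled stack entry is just a pointer plus a little metadata, while re-gathering means vector loads of the full node the RT unit had already fetched and parsed. The layouts below are invented for illustration; neither the patent nor any actual node format specifies them.

```cpp
#include <cstdint>

// Hypothetical record the shader might spill: the pointer and metadata handed
// back by a completed RT node evaluation.
struct SpilledStackEntry {
    uint32_t node_ptr;
    uint32_t flags;                // e.g. which children are still pending
};

// Hypothetical 4-wide box node the SIMD would have to re-read on its own if it
// re-gathered instead of spilling.
struct Bvh4Node {
    float    child_bounds[4][6];   // four child AABBs (min/max per axis)
    uint32_t child_ptrs[4];
    uint32_t misc[4];              // flags, leaf counts, padding, etc.
};

// Spilling ~8 bytes per deferred node vs. re-fetching a ~128-byte node is well
// over an order of magnitude less traffic.
static_assert(sizeof(SpilledStackEntry) == 8, "compact spill record");
static_assert(sizeof(Bvh4Node) > 8 * sizeof(SpilledStackEntry), "full node is much larger");

int main() { return 0; }
```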
Curious, aren't all ROPs typically tied to caches in past and current gens?
IIRC the difference with RDNA is that compute is now tied in with the L2 cache, whereas with GCN it went directly to the memory controller. But I think ROPs are unchanged.
ROPs were linked to memory channels until Vega, which made them clients of the L2.
RDNA makes them a client of the new intermediate L1 for at least some of their traffic.
GCN had a read-write L2, and whether compute used the cache depended more on what settings were used for the memory accesses. The choice would be based on the level of coherence needed for the data.
See, e.g., this older post by sebbbi:
https://forum.beyond3d.com/posts/1934106/
With respect to RDNA, it does look like they changed how the RBs access data, however.
Render back-ends have had relatively small per-RBE caches throughout the generations. There's evidence that the RBEs still have caches with RDNA, though I haven't seen specific capacities given.