Regarding the ongoing thread about the power usage though, I don't see anything contentious... As the patent describes, the intersection engine is basically an alternative path to texture filtering, operating on packed BVH node data. Ray-box and ray-tri testing seem quite straightforward logic, so likely no "power drainage" to be expected... At worst the CU can issue a bunch of intersections, issue a vmcnt wait, and eventually clock-gate the ALU datapaths if no other kernels are running in parallel.
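To give a sense of how lightweight the ray-box test is, here's a minimal slab-test sketch in C++. The struct layouts and the helper are purely illustrative, not AMD's packed node format; the real hardware evaluates several boxes per node in one go.

```cpp
#include <algorithm>
#include <cstdio>

struct Ray { float ox, oy, oz; float idx, idy, idz; };  // origin + reciprocal direction
struct Box { float minx, miny, minz, maxx, maxy, maxz; };

// Classic slab test: per-axis interval overlap, just a handful of
// multiplies and min/max ops -- cheap, fixed-latency logic.
bool rayBoxHit(const Ray& r, const Box& b, float tMax) {
    float t0x = (b.minx - r.ox) * r.idx, t1x = (b.maxx - r.ox) * r.idx;
    float t0y = (b.miny - r.oy) * r.idy, t1y = (b.maxy - r.oy) * r.idy;
    float t0z = (b.minz - r.oz) * r.idz, t1z = (b.maxz - r.oz) * r.idz;
    float tNear = std::max({std::min(t0x, t1x), std::min(t0y, t1y), std::min(t0z, t1z), 0.0f});
    float tFar  = std::min({std::max(t0x, t1x), std::max(t0y, t1y), std::max(t0z, t1z), tMax});
    return tNear <= tFar;
}

int main() {
    Ray r{0, 0, 0, 1, 1e30f, 1e30f};               // ray along +x (reciprocal dir)
    Box b{1, -1, -1, 2, 1, 1};
    printf("hit: %d\n", rayBoxHit(r, b, 100.0f));  // expect 1
}
```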
A vmcnt wait would apply to that specific wavefront, but it would have no influence on any other wavefronts on the CU. Those other wavefronts could be from other kernels, or part of a multi-wavefront workgroup. The wavefront itself could also be generating additional memory traffic as a result of processing the data returned by BVH instructions, if the process is meant to be more programmable between the node/box evaluation stages of AMD's algorithm. Blocking all those possible non-RT uses would be a very large restriction on the concurrency and latency hiding of the CU. There's also another SIMD in the CU, whose interference cannot be isolated unless it were left vacant.
Implementing such a monopoly would require the driver/GPU to be aware that RT was occurring and to vacate a CU in order to give it exactly one wavefront, or to set up a clause in which a given wavefront blocks other wavefronts from using the vector memory path. The variable number of misses to memory for a complex operation like RT could leave the CU with swathes of idle time if that happened, and RT may not be a good fit for clauses anyway, since the method assumes the SIMD is performing various non-memory operations to calculate the payload for the next BVH instruction, which would break a clause immediately.
Nvidia presumably does the whole traversal process in fixed-function hardware, so one might argue that they could have an edge in power usage by potentially keeping the CU/SM off. But it is uncertain whether that matters with the prevalent use of async compute to fill gaps, and whether the actual saving makes a dent in overall power consumption.
The emphasis would likely be on maintaining parallel execution, and the time horizon for an RT instruction is too short for power gating to pay off. Perhaps clock gating could occur, but the likelihood is that something else is happening somewhere in the SM to keep things active anyway.
It matters if each traversal step needs to be taken on the CUs, i.e. the pointer chasing happens in shader code while the intersection HW/texture unit simply tells you whether a ray intersected one or more nodes of the BVH. If this is the way it works, then it requires a constant back and forth between the CUs and the texture/intersection units.
The patent AMD seems to be following indicates the RT block should pass back intersection results or the pointer to the next node for each ray submitted to the unit by the SIMD. There's still significant back and forth, but at least the SIMD isn't tasked with a redundant lookup of pointer values after the RT hardware looked at the same node data and skipped over the pointer information in it.
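To make that division of labor concrete, here's a rough C++ sketch of the loop as I read the patent. `intersectNode` is a hypothetical stand-in for the hardware instruction, and the node/result layout is invented for illustration; the patent only says the unit returns hit results or next-node pointers, not this exact shape.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical result record handed back by the RT unit per step.
struct NodeResult {
    uint32_t children[4];  // pointers to child nodes the ray intersected
    int      numChildren;
    bool     isLeafHit;    // triangle hit found at a leaf
    float    tHit;
};

// Stand-in for the fixed-function box/triangle test. In hardware this
// would be a memory instruction issued down the texture/intersection path.
NodeResult intersectNode(uint32_t nodePtr) {
    return {{0, 0, 0, 0}, 0, true, 42.0f};  // stub so the sketch compiles
}

// Shader-side traversal: the SIMD owns the stack and the pointer chasing,
// but never has to re-parse node data the RT unit already looked at.
float traverse(uint32_t rootPtr) {
    std::vector<uint32_t> stack{rootPtr};
    float closest = 1e30f;
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        NodeResult r = intersectNode(node);       // hardware does this step
        if (r.isLeafHit) {
            closest = std::min(closest, r.tHit);  // shader keeps the payload
            continue;
        }
        for (int i = 0; i < r.numChildren; ++i)   // shader chases the pointers
            stack.push_back(r.children[i]);
    }
    return closest;
}

int main() {
    printf("closest t = %f\n", traverse(0));  // stubbed: prints 42.0
}
```

Every `intersectNode` call is one round trip between the SIMD and the intersection unit, which is exactly the back and forth being described above.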
In general CU scheduling and cache-friendly operations (such as L0s being able to snoop each other passively) appear to be part of RDNA. How much of that is new for RDNA 2, I can't tell.
At least with RDNA, snooping at the L0 is known not to happen. They are write-through, so the L2 still serves as an eventual place for visibility. LLVM changes specifically point out that the L0s in a WGP are not coherent when running in WGP mode, so kernels running across dual-CUs would need to explicitly flush the L0 at various times to keep some level of consistency if there's a chance of discrepancies between them.
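As a concrete illustration of why that matters, here's a minimal HIP (C++) sketch of a producer/consumer handoff between two wave32 wavefronts of one workgroup, which in WGP mode can land on different CUs with separate L0s. The fence placement is just the standard release/acquire pattern; the comment about where the L0 invalidate gets emitted reflects my reading of the LLVM notes, not verified disassembly.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// One workgroup, two wave32 wavefronts. In WGP mode these can execute on
// different CUs of the WGP, each with its own non-coherent L0.
__global__ void handoff(int* data, int* flag, int* out) {
    if (threadIdx.x == 0) {                 // wave 0: producer
        data[0] = 42;                       // write-through store (L0 -> L2)
        __threadfence();                    // release: order the store before the flag
        atomicExch(flag, 1);                // publish
    } else if (threadIdx.x == 32) {         // wave 1: consumer
        while (atomicAdd(flag, 0) == 0) {}  // spin until published
        __threadfence();                    // acquire: in WGP mode the compiler has
                                            // to invalidate L0 here, since the other
                                            // wave's L0 is never snooped
        *out = data[0];
    }
}

int main() {
    int *d, *f, *o;
    hipMalloc(&d, sizeof(int));
    hipMalloc(&f, sizeof(int));
    hipMalloc(&o, sizeof(int));
    hipMemset(f, 0, sizeof(int));
    handoff<<<1, 64>>>(d, f, o);
    int result = 0;
    hipMemcpy(&result, o, sizeof(int), hipMemcpyDeviceToHost);
    printf("out = %d\n", result);  // expect 42
    hipFree(d); hipFree(f); hipFree(o);
}
```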
The prospect of AMD using a cut-down ~536mm² Navi 21 to fight 3070 (~394mm²) is sad isn't it?
It seems less than ideal to pit a chip of that size against a smaller one, since even with perfect yields there's a reduction in candidate dies available from each wafer. The maturity and cost of each node aren't clear for the comparison, however.
If we assume a large cache of some kind, the yields could be better than raw area would suggest, since SRAM arrays are typically built with redundant rows/columns that allow many defects in the cache portion to be repaired rather than killing the die.
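Rough numbers on what die size alone costs, using the usual first-order dies-per-wafer approximation and a Poisson yield model. The defect density is a placeholder I made up, since real N7 numbers aren't public, and the model overstates the penalty if a big chunk of the die is repairable cache.

```cpp
#include <cmath>
#include <cstdio>

const double kPi = 3.14159265358979;

// First-order gross-dies-per-wafer approximation for a 300mm wafer:
// wafer area over die area, minus an edge-loss term.
int grossDies(double dieAreaMM2, double waferDiaMM = 300.0) {
    double r = waferDiaMM / 2.0;
    return (int)(kPi * r * r / dieAreaMM2 -
                 kPi * waferDiaMM / std::sqrt(2.0 * dieAreaMM2));
}

// Poisson yield model: probability a die has zero defects.
double poissonYield(double dieAreaMM2, double defectsPerMM2) {
    return std::exp(-dieAreaMM2 * defectsPerMM2);
}

int main() {
    const double d0 = 0.001;  // placeholder: 0.1 defects/cm^2, purely illustrative
    const double areas[] = {394.0, 536.0};
    for (double area : areas) {
        int gross = grossDies(area);
        double y = poissonYield(area, d0);
        printf("%4.0f mm^2: %3d gross dies, %.0f%% yield -> ~%d good dies\n",
               area, gross, 100.0 * y, (int)(gross * y));
    }
}
```

With these made-up inputs the bigger die ends up with roughly a third fewer good dies per wafer before any binning, which is the gap a salvage SKU would be absorbing.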
Another possibility that occurred to me, which could take up some area besides a cache, would be some kind of on-die voltage regulation; that might be desirable for a mobile or possibly HPC solution. However, that's not something AMD has talked about much as a product direction.