From AMD's standpoint, a more programmable pipeline looks like a large potential strategic advantage: they have the contracts for both upcoming "new" generation consoles that will have raytracing, and are thus able to de facto set whatever standard they can get through both Microsoft and Sony as a least common denominator.
It's programmable at the cost of relying on hardware that has some undesirable properties for the workload.
RDNA does improve things a bit over GCN by having 32 instead of 64 lanes subject to divergence, and I'd assume RT hardware can be significantly narrower.
RT cores may have their own caches and buffers. I'm not sure if they interface with the L1 or L2 for Nvidia, but any local storage would probably be tuned for lower latency.
AMD's L1 takes ~100 cycles for a hit; that's the cache a TMU intersection engine would be interfacing with, and one the programmable pipeline will need to be careful to use as little as possible.
L1 instruction cache behavior is another pain point, since the misses there are long-latency and worsen under load. In that regard, single-purpose functionality avoids thrashing in the L1, making it a better neighbor to other workloads and less affected by bad neighbors.
It's actually an area where I wonder if an instruction RDNA introduced could have significance. There's a prefetch instruction that changes the fetch behavior so the prefetcher doesn't discard the previous three cache lines when fetching the next one. The documentation describes this as affecting an L1 of 4 64-byte lines, but I think it makes more sense as a description of a wavefront instruction buffer. A loop that fits in that space could be more self-contained, and if a shader using BVH instructions can be subdivided into phases with outer loops that fit in 256 bytes, with the intersection engine helping condense the inner node/intersection evaluation, then maybe AMD's method could more closely approach the instruction cache footprint Nvidia touts as an advantage of its method.
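To make the 256-byte idea concrete, here's a rough plain-C++ sketch of what I mean by a compact outer loop, with a made-up bvh_intersect_node() standing in for the hardware test; nothing here reflects AMD's actual ISA or API, it's just the loop shape:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a hardware BVH node test of the kind AMD's TMU approach
// describes: hand it a node, get back a short list of child nodes worth visiting.
// This is NOT a real instruction or API, just a placeholder so the loop shape is visible.
struct NodeResult {
    uint32_t children[4];
    uint32_t count;        // number of valid entries in children[]
    bool     leaf_hit;     // placeholder: a leaf primitive was hit
};
NodeResult bvh_intersect_node(uint32_t /*node_addr*/) { return {}; }  // stub

// The point of the sketch: keep the outer loop tiny. It only pops a node, fires the
// hardware test, and pushes the children that come back; all the box/triangle math is
// behind bvh_intersect_node(). A body this small could plausibly stay inside a
// 256-byte (4 x 64-byte line) instruction window.
uint32_t traverse(uint32_t root) {
    std::vector<uint32_t> stack{root};
    uint32_t leaf_hits = 0;
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        NodeResult r = bvh_intersect_node(node);
        for (uint32_t i = 0; i < r.count; ++i)
            stack.push_back(r.children[i]);
        if (r.leaf_hit)
            ++leaf_hits;
    }
    return leaf_hits;
}
```

The only thing that matters in the sketch is that the per-iteration code stays small enough to live in those four lines while the heavy math sits behind the hardware call.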
Going back to what I discussed earlier about bugs getting in the way of the hardware: the prefetch instruction is currently not documented outside of LLVM bug flags because it apparently will freeze the shader. There's also another RDNA branching bug that seems to occur with branch offsets of 63 bytes, and the workaround (long streams of NOPs) seems apt to blow out said instruction buffer.
Other bugs related to workgroup processing mode or Wave32 versus Wave64 indicate there are possible failure cases in and around the TMU, which by happenstance is also where a BVH unit would sit.
Maybe if there is some internal evaluation hardware for this, it's vastly less useful because of how buggy the hardware is.
Programmable intersection tests could give developers advantages Nvidia's hardware can't deliver, with non-box testing already showing benefits of its own.
Nvidia also gives the option of custom intersection tests, at some indeterminate performance cost. My question would be whether this means involving SM hardware in the same way AMD's method would. That could create a scenario where it's either fixed-function and faster, or programmable and the same speed.
AMD's TMU method has intersection hardware of its own, so I think intersection testing is an area where the two approaches are similar.
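For what "custom intersection test" means in practice, here's an illustrative plain-C++ sphere test of the kind a programmable intersection stage (or a DXR intersection shader) can run but a box/triangle-only fixed-function unit can't; it's a generic textbook test, not anyone's actual shader code:

```cpp
#include <cmath>

struct Ray    { float ox, oy, oz, dx, dy, dz; };   // origin + (normalized) direction
struct Sphere { float cx, cy, cz, r; };

// A custom primitive test: ray/sphere hit returned as a distance along the ray,
// or a negative value on miss. Fixed-function box/triangle testers can't evaluate
// this directly; a programmable stage (or an intersection shader on the SMs) can.
float intersect_sphere(const Ray& ray, const Sphere& s) {
    float ocx = ray.ox - s.cx, ocy = ray.oy - s.cy, ocz = ray.oz - s.cz;
    float b = ocx * ray.dx + ocy * ray.dy + ocz * ray.dz;       // oc . d
    float c = ocx * ocx + ocy * ocy + ocz * ocz - s.r * s.r;    // |oc|^2 - r^2
    float disc = b * b - c;                 // quadratic discriminant (direction normalized)
    if (disc < 0.0f) return -1.0f;          // miss
    float t = -b - std::sqrt(disc);         // nearest intersection distance
    return (t >= 0.0f) ? t : -1.0f;
}
```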
On top of that, a programmable traversal stage, already proposed by Intel, has many potential advantages as well, with things like stochastic tracing and easily selectable LODs coming into play. As far as is known, neither is available on current Nvidia raytracing hardware. Outdating their larger competitor's lineup would be a major victory for AMD, though certainly bad for any Nvidia customers expecting their gaming hardware to last longer.
AMD's method is more programmable, but the fixed-function pipeline still generates the list of nodes to traverse, which may constrain what traversal methods the programmable portion can employ, since the number of nodes and the method of finding them sit in the intersection engine rather than the shader. AMD's method does offer the ability to skip the hardware, but at that point it's straight compute that doesn't differentiate itself from other compute methods.
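To show what that "straight compute" fallback looks like: skipping the BVH hardware leaves the shader doing ordinary box math like the slab test below, the same ALU work any existing compute-based raytracer already pays for. This is a generic sketch under that assumption, not AMD's code:

```cpp
#include <algorithm>

struct Aabb { float min[3], max[3]; };

// Software slab test: does the ray overlap the box within [0, t_max]?
// inv_dir holds 1/direction per component, precomputed by the caller.
// Nothing here is raytracing-specific hardware; it's plain compute.
bool slab_test(const Aabb& box, const float origin[3], const float inv_dir[3],
               float t_max) {
    float t_near = 0.0f, t_far = t_max;
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (box.min[axis] - origin[axis]) * inv_dir[axis];
        float t1 = (box.max[axis] - origin[axis]) * inv_dir[axis];
        if (t0 > t1) std::swap(t0, t1);
        t_near = std::max(t_near, t0);
        t_far  = std::min(t_far, t1);
    }
    return t_near <= t_far;
}
```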
Potential yield advantages? What is that comment based on? Sounds like it was pulled from someone's nether regions.
Each mask used adds another chance for alignment errors or defects during the exposure process; every step in lithography has a small but non-zero defect rate.
The large number of masks for the quad or octal patterning steps also creates concerns about the variability of the resulting patterns, even if no singular defects manifest.
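A back-of-the-envelope way to see the compounding, with a per-exposure defect probability that is purely invented to show the shape of the math, not a real process number:

```cpp
#include <cmath>
#include <cstdio>

// If each exposure/alignment step independently comes out clean with probability (1 - p),
// a layer that needs n exposures comes out clean with roughly (1 - p)^n.
int main() {
    const double p = 0.002;               // assumed per-exposure defect chance (illustrative only)
    const int exposures[] = {1, 4, 8};    // single, quad, and octal patterning
    for (int n : exposures)
        std::printf("exposures=%d  clean-layer probability ~ %.3f\n", n, std::pow(1.0 - p, n));
    return 0;
}
```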
The downside for EUV is that there is much less maturity along a number of other components of the process, and much higher sensitivity to things like mask defects, so I don't think there's an unambiguous winner at present.
EUV tends to struggle with exposure power relative to standard lithography, but on the other hand if the standard process needs 4-8 times the number of exposures it might still be worthwhile.
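A crude way to frame that trade-off, with tool rates that are entirely made up for illustration rather than real specs:

```cpp
#include <cstdio>

// Even if an EUV tool exposes wafers more slowly, needing one pass instead of 4-8
// can still come out ahead per finished layer.
int main() {
    const double duv_rate = 250.0;        // assumed DUV wafer exposures per hour (illustrative)
    const double euv_rate = 120.0;        // assumed EUV wafer exposures per hour (illustrative)
    const int passes[] = {4, 8};
    for (int n : passes)
        std::printf("%d-pass DUV layer: %.0f/hr  vs  single-pass EUV layer: %.0f/hr\n",
                    n, duv_rate / n, euv_rate);
    return 0;
}
```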
Turnaround time is one area where EUV is expected to be better or at least not as bad as standard lithography is expected to become.