If it's used for BVH data, 128 MB is easily enough to guarantee that only the last few levels of the traversal need to hit actual RAM. It makes a huge difference whether you are doing 16 sequential waits on DRAM per ray or 2-3 per ray.
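To make that concrete, here's a back-of-the-envelope sketch; the scene size, node size, and branching factor are all illustrative assumptions rather than figures for any real GPU or game:

```python
# Rough sketch: how much of a BVH fits in a 128 MB cache.
# All numbers (scene size, node size, branching factor) are assumptions.
import math

primitives = 4_000_000    # triangles in the scene (assumed)
branching  = 4            # 4-wide BVH box nodes (RDNA2-style)
node_bytes = 128          # assumed size of one 4-wide box node

depth = math.ceil(math.log(primitives, branching))
print(f"traversal depth: ~{depth} levels")          # ~11 for these numbers

# Cumulative footprint of the top N levels of the tree:
for top in range(1, depth + 1):
    nodes = (branching ** top - 1) // (branching - 1)
    size_mb = nodes * node_bytes / 2**20
    marker = "fits in 128 MB" if size_mb <= 128 else "spills to DRAM"
    print(f"top {top:2d} levels: {size_mb:9.2f} MB  ({marker})")
```

With numbers like these, everything but the bottom level or two (plus the leaf/triangle data) stays resident, which is roughly where the 2-3 DRAM waits per ray would come from, versus paying DRAM latency at nearly every step without the cache.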
I am actually very surprised that AMD is not well in the lead in RT on the strength of the cache alone. Maybe the latency out of the 128MB cache is much higher than I'd think?
We'd need to wait on further disclosures to know how much the RDNA2 cache architecture was altered beyond the addition of another cache layer. The GCN comparison is increasingly dated, but AMD's internal GPU latencies were in the range of ~190 cycles before a miss left the GPU's main cache hierarchy.
RDNA may have tweaked some latencies, but there's some evidence it wasn't substantially lower-latency. There was a claim of lower effective latency, but that was on the back of the reduced miss rate made possible by the new L1, which may be less effective if RT doesn't see significant reuse until the later cache levels. In that case RDNA has an extra cache layer of undetermined latency in the path.
Zen's L3 is around 35-40 cycles, although its use case and structure are very different from how such a cache would be structured for RDNA2. That can cut both ways, unfortunately. The CPU L3 is tightly integrated and positioned to be central to its local cores, while the Infinity Cache sits on the other side of some fabric links and is spread across the outer reaches of the die. Going to the fabric in AMD CPUs is tantamount to accepting latency on the order of main memory; I'm not sure about the GPU.
The CPU pipeline has to worry about 4.5 GHz+ clocks, which lengthens the pipeline. On the other hand, the GPU needs to emphasize density, RDNA2 isn't as modest clock-wise as its predecessors, and there is some sign of a clock domain crossing.
Per a footnote on AMD's RDNA2 web page:
"Measurement calculated by AMD engineering, on a Radeon RX 6000 series card with 128 MB AMD Infinity Cache and 256-bit GDDR6. Measuring 4k gaming average AMD Infinity Cache hit rates of 58% across top gaming titles, multiplied by theoretical peak bandwidth from the 16 64B AMD Infinity Fabric channels connecting the Cache to the Graphics Engine at boost frequency of up to 1.94 GHz."
A cache at 2/3 or more of the latency of main memory would probably not shift the fortunes of the RT block if it's sensitive to that metric; or it does help, but part of the gain goes to compensating for AMD's longer RT critical loop. AMD's method has the RT hardware's node evaluation and traversal steps straddling the typically long-latency interface between the SIMD and the TEX block.
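As a toy illustration of why that could be the case, here's a crude per-ray latency model; every cycle count in it, including the cost of the SIMD/TEX round trip per traversal step, is an assumption made up for the sake of argument:

```python
# Toy latency model for a shader-driven traversal loop where each node test
# round-trips a SIMD <-> TEX style interface before waiting on a memory level.
# All cycle counts are assumptions for illustration, not disclosed figures.

def per_ray_cycles(levels, issue_roundtrip, mem_latency):
    """Serial latency for one ray: each level pays the issue round trip
    plus however long the node data takes to arrive."""
    return levels * (issue_roundtrip + mem_latency)

LEVELS = 16      # traversal steps per ray (assumed)
ISSUE  = 100     # assumed SIMD->TEX->SIMD round-trip cost per step
DRAM   = 350     # assumed DRAM latency in GPU cycles
CACHE  = 200     # assumed Infinity Cache latency (~2/3 of DRAM)

print("all levels from DRAM:        ", per_ray_cycles(LEVELS, ISSUE, DRAM), "cycles")
print("all levels from 128 MB cache:", per_ray_cycles(LEVELS, ISSUE, CACHE), "cycles")
# 16*(100+350) = 7200 vs 16*(100+200) = 4800 -- about a 33% saving, not 2-3x.
```

With a fixed per-step cost sitting inside the loop, even a much faster memory level only shaves off part of the per-ray time.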
It's also possible (likely?) that Sony had access to the full RDNA2 roadmap (barring perhaps VRS, if AMD are implementing MS patents), but chose to fork from it at a point where some elements were not ready, leaving those elements closer to RDNA1. You could still reasonably claim this is based on RDNA2, because it is indeed based on, and implements much of, the RDNA2 roadmap.
Two good reasons to take this approach could be:
- Your date for having working silicon demands it.
- The time required for customisation (which itself requires a fork from the roadmap) means you can't wait.
Both are entirely reasonable. And it certainly appears that Sony had properly functioning silicon widely available to developers a while before MS did. Even Craig wasn't running on XSX hardware as late as this summer, while Sony was showing off Spiderman on at least partially complete PS5 hardware 18 months ago. Guess who's going to have the more impressive exclusive launch title to show off this Christmas?
DX12U is core to MS's gaming strategy, and it'll be a big driver of AMD's RDNA2 roadmap (perhaps the biggest). It seems natural that MS would fork from "full RDNA2". On the downside, it meant that their customisations on top of it, such as the Velocity Architecture, weren't working properly as late as this spring. Ouch.
There's also the possibility of some of Microsoft's customizations becoming part of a discrete GPU design. Cerny pointed to the collaboration Sony had with AMD and how a few minor PS4 tweaks showed up in one GPU, and Sony likely isn't the only party this can happen for. Of course, if one console's feature shows up in a discrete product, it could be argued that the discrete product plus the vendor-specific tweak is the "full" architecture.
That possibility aside, a few differences mentioned in the past could be other signs of a timing gap or of a choice to skip a feature. The inference-focused instructions could be something Microsoft waited for, although in fairness support for specific formats like that tends to be inconsistent among AMD GPUs anyway.
Sony's geometry front end has some explicitly different naming, with primitive shaders cited for Sony rather than DX12 mesh shaders. That could be a result of timing: mesh shaders weren't an AMD initiative, and adopting them could have led to AMD's primitive shader functionality being replaced or modified. Sony may have committed prior to that transition, or decided the transition was uncompelling versus what it already had.
The other side of the semi-custom relationship, which Cerny mentioned, is that it's possible for a console's tweaks to be uncompelling to AMD. Sony's architectures have their fair share of customizations AMD didn't want to use, and the PC-side collaboration with Microsoft could mean that what Microsoft wanted would tend to be something AMD would care about.
From this, it was confirmed that 2022 was included.
AMD's projections tend to be conservative, so it could be earlier or, barring a delay more serious than they expect, as late as the end of 2022.