By having both HBM and GDDR phy/io, could AMD potentially increase their wafer yields?
In the sense that having more of any kind of unit means there's more redundancy, I suppose it might be possible, depending on the channel counts for both bus types.
If there were multiple HBM stacks and a significant number of GDDR6 channels, maybe.
If the idea is that there are significantly fewer channels overall, perhaps it's worse. The failure granularity for HBM seems to be at the stack level, while GDDR6 might be at 64-bit or 32-bit granularity.
If it's 2xHBM and 4x64-bit GDDR6 (32-bit per chip, but AMD seems to like pairing channels) versus a potentially impractical 512-bit GDDR6 bus, it's worse.
It's a wash with a 384-bit bus, unless AMD went with more granular GDDR6 controllers.
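To make that redundancy comparison concrete, here's a toy salvage-yield model. Everything in it is an assumption on my part: the per-domain defect probabilities are invented placeholders, domains are treated as independent, and "salvageable" just means at most one interface domain is bad (i.e. a cut-down SKU could absorb it).

```python
# Toy salvage-yield model for the memory-interface configurations discussed above.
# All defect probabilities are made-up placeholders, purely to illustrate the
# "more independent failure domains = more salvage opportunities" argument.
from itertools import product

def salvage_yield(domain_defect_probs, spares_allowed=1):
    """P(number of defective domains <= spares_allowed), assuming independence."""
    total = 0.0
    for outcome in product([0, 1], repeat=len(domain_defect_probs)):
        p = 1.0
        for bad, q in zip(outcome, domain_defect_probs):
            p *= q if bad else (1.0 - q)
        if sum(outcome) <= spares_allowed:
            total += p
    return total

P_HBM_STACK = 0.03   # assumed defect rate per HBM stack interface (placeholder)
P_G6_PAIR   = 0.01   # assumed defect rate per paired 64-bit GDDR6 channel (placeholder)

configs = {
    "2xHBM + 4x64b GDDR6": [P_HBM_STACK] * 2 + [P_G6_PAIR] * 4,   # 6 domains
    "384-bit GDDR6":       [P_G6_PAIR] * 6,                        # 6 domains
    "512-bit GDDR6":       [P_G6_PAIR] * 8,                        # 8 domains
}

for name, domains in configs.items():
    print(f"{name:22s} perfect={salvage_yield(domains, 0):.3f} "
          f"salvageable={salvage_yield(domains, 1):.3f}")
```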
Even at a roughly equivalent level of redundancy, I would have some concern that the HBM side makes yields incrementally worse overall. Interposer integration has a higher failure rate than PCB mounting, and the yields multiply: the package has to survive interposer integration and then still take whatever small yield loss the GDDR6 bus adds.
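A minimal sketch of that multiplicative effect, with invented placeholder yields rather than estimates of any real process:

```python
# Sketch of the multiplicative-yield concern: the hybrid package has to survive
# both interposer assembly *and* the usual GDDR6 interface yield loss.
# All numbers are invented placeholders, not estimates of real process yields.

Y_INTERPOSER_MOUNT = 0.95   # assumed yield of interposer/HBM stacking (placeholder)
Y_GDDR6_SIDE       = 0.99   # assumed yield of the GDDR6 PHY/board side (placeholder)
Y_PCB_MOUNT        = 0.995  # assumed yield of conventional PCB mounting (placeholder)

hybrid_package_yield = Y_INTERPOSER_MOUNT * Y_GDDR6_SIDE
gddr6_only_yield     = Y_PCB_MOUNT

print(f"hybrid HBM+GDDR6 package yield: {hybrid_package_yield:.3f}")
print(f"GDDR6-only package yield:       {gddr6_only_yield:.3f}")
# Any redundancy gain in the memory controllers would have to claw back this gap.
```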
Perhaps it would help from a product-mix standpoint: if there were GDDR6-only products and HBM-only products, the same production could more readily satisfy multiple categories, versus AMD's historically poor ability to juggle multiple product lines/types/channels. However, given the very specific needs of one memory type over the other, and questions like die thinning or whatever other preparatory work an interposer requires, it may not be that flexible. Even for yield-recovery purposes, there's also the question of when failures can be detected; some issues may not be fully detectable until after a point of no return, such as after interposer mounting.
If the chip were to always have an interposer, even when HBM isn't in use (setting aside what would be needed to route GDDR6 through an interposer), it would be a cost adder that would probably be more significant than whatever yield gain came from redundancy in the controllers.
Though I'm having trouble seeing where this is applicable. It appears to be focused primarily on reducing the load on the (shared) L2 cache?
Is the L2 cache on a GPU actually still that much slower than the L1, so that even 3 trips over the crossbar can beat an L2 hit?
The L2 is globally shared and tends to have significantly longer latency. As the LLC, it is also in high demand, with contention making things worse.
The exact threshold is something that isn't necessarily clear.
For Nvidia, the L2 has been benchmarked at nearly 7x the L1 latency for Volta and Kepler, whereas the Maxwell/Pascal generations were in the roughly 3x range. This is more down to the much lower L1 latency on Volta and Kepler than to a massive shift in L2 latency.
https://arxiv.org/pdf/1804.06826.pdf
GCN's L2 is only about 1.7x the latency of the L1, mainly because the L1 itself is estimated at ~114 cycles per a GDC 2018 presentation. The L2 is apparently a little faster in cycle terms than Nvidia's, although it may be closer in wall-clock terms given Nvidia's typically higher clocks in the past.
Ampere's a question mark, though if it's similar to Volta the L1 latency should be on the lower end of the GPU scale.
I haven't seen numbers for RDNA1/2 officially, although if we believe the github leak for the PS5, the L1 might be roughly in the same area as GCN or a little lower latency.
The overall picture wouldn't be that different at maybe 90-100 cycles.
The improvement or degradation in latency may depend on how quickly remote L1 accesses can be resolved. If every leg in the transaction were on the order of 90 cycles, I'd say it wouldn't be a good latency bargain. The significant constraints on L2 bandwidth and contention might still provide an upside, and there seem to be at least some reasonable ways to skip parts of the L1 pipeline in a remote-hit scenario, given how straightforward it should be to detect that an access won't be locally cached.
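As a rough sanity check on that, here's the back-of-the-envelope comparison I have in mind. The L1/L2 hit latencies are the approximate figures cited above (L2 taken as ~7x a 28-cycle L1 for Volta, ~1.7x a 114-cycle L1 for GCN); the crossbar-leg cost and the "skip most of the local L1 pipeline" bypass are pure guesses on my part.

```python
# Back-of-the-envelope: is a remote L1 hit (request leg + remote lookup + return leg)
# actually cheaper than just going to the L2?  Latencies in cycles; the crossbar
# leg cost and the local-bypass cost are guesses, not measured values.

def remote_l1_hit(l1_lookup, crossbar_leg, local_bypass):
    # local_bypass: cycles spent in the local L1 before the access is identified
    # as remote (assumed small if remoteness is easy to detect from the address)
    return local_bypass + crossbar_leg + l1_lookup + crossbar_leg

cases = {
    # name:                       (L1 hit, L2 hit, crossbar leg, local bypass) -- assumed
    "Volta-like (28c L1)":        (28,  193, 15, 5),
    "GCN-like, cheap legs":       (114, 194, 30, 10),
    "GCN-like, ~90-cycle legs":   (114, 194, 90, 10),
}

for name, (l1, l2, leg, bypass) in cases.items():
    remote = remote_l1_hit(l1, leg, bypass)
    verdict = "wins" if remote < l2 else "loses"
    print(f"{name:26s} remote L1 hit ~{remote:3d} cyc vs L2 hit ~{l2} cyc -> remote {verdict}")
```

With cheap legs the remote hit comes out marginally ahead even at GCN-like latencies, but at ~90 cycles per leg it clearly loses, which is why the per-leg cost is the crux.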
I ran across a link from Anandtech's forum that points to a paper by the individuals involved in the patent concerning this sort of scheme:
https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf
I haven't yet had the time to really read the paper. One thing I did notice is that the model GPU architecture isn't quite a fit for GCN or RDNA, with a 48KB local scratchpad, a 16KB L1, and a separate texture cache. There's perhaps some irony that an AMD-linked paper/patent has its analysis profiled on an exemplar that reminds me more of Kepler, although I think I may have seen something like that happen before.
Whether the math necessarily holds at 90-114 cycles of L1 latency versus 28 is an unanswered question.
edit: Actually, I just ran across a mention of the coherence mechanism assumed by the analysis, and it's Nvidia's L1 flush at kernel or synchronization boundaries. That could significantly deviate from the expected behavior of a GCN/RDNA cache.
By the looks of it, yes. (3.90TB/s L1 bandwidth, 1.95TB/s L2 bandwidth on the Radeon RX 5700XT.)
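Those two figures are consistent with a simple width × clock estimate. The aggregate per-clock widths and the clock below are my own back-solved assumptions, not official specs; the point is just that the quoted numbers line up with a 2:1 ratio of L1 to L2 read width.

```python
# Quick consistency check on the 5700 XT figures quoted above, assuming the
# aggregate per-clock widths below (my back-solved assumptions, not official
# specs) and a ~1.905 GHz boost clock.
BOOST_CLOCK_HZ = 1.905e9

l1_bytes_per_clk = 4 * 512     # four graphics L1s, assumed 512 B/clk each
l2_bytes_per_clk = 16 * 64     # sixteen L2 slices, assumed 64 B/clk each

print(f"L1 aggregate: {l1_bytes_per_clk * BOOST_CLOCK_HZ / 1e12:.2f} TB/s")  # ~3.90
print(f"L2 aggregate: {l2_bytes_per_clk * BOOST_CLOCK_HZ / 1e12:.2f} TB/s")  # ~1.95
```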
Depends on the context of L1 vs L0. The terminology is being handled loosely, and the portions of the description that indicate L1 capacity changes with CU count may really mean the per-CU L0 (or L1, depending on the GPU).
For what it's worth, the author also often spoke only about a small number of CUs in each cluster, and about locality on the chip. So possibly this isn't even aiming at a global crossbar, but only at 4-8 CUs maximum in a single cluster?
I suppose 8 CUs with 128kB of L1 each still yield a 1MB memory pool. And you have to keep in mind the L0 cache still sits above that, so at least some additional latency isn't going to ruin performance.
The CUs have 16KB L0/L1 caches, and one of the main benefits of this scheme is that its area cost should be lower than increasing the capacity of AMD's minuscule caches. I'm not sure how many tenths of a mm2 the L0 takes up in a current non-Kepler GPU, however.
ROPs are clients of the L1 in Navi. ROP writes are write-through with respect to the L1, though, i.e. the L1 doesn't support writes per se, so the L2 is updated directly by ROP writes.
It's an odd arrangement with the ROPs versus the L1. The RDNA whitepaper considers the ROPs clients of the L1 and touts how it reduces memory traffic. However, if their output is considered write-through, what savings would making them an L1 client bring?
Given screen-space tiling, a given tile will not be loaded by any RBE but one (no sharing), and unless that RBE loads a tile and then proceeds to not write anything to it, the L1 at best holds ROP data that is read once and must be evicted once it leaves the RBE caches (no reuse).