I'm very skeptical that multiple dies on the same process are cheaper *if* there's a lot of redundancy in the monolithic die. What percentage of an AD102 die can be defective and still result in a viable SKU? It looks to me like less than 15% absolutely has to work (probably less than 10% even, e.g. PCI-Express, video decode/encode, central control logic for processing & distributing commands, various multiplexers/buses, etc...) - that means there's potentially less non-redundant logic than on many smartphone chips, where practically everything has to work! Now say you moved all of that to a smaller die that you can test separately: you've decreased your packaging yields and increased packaging costs while adding a few % of area for the cross-die PHYs/TSVs... you really haven't gained much, and it's a bit academic whether it ends up 1% better or 1% worse...
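To put rough numbers on that - purely illustrative, the defect density, critical-area fraction and packaging yield below are all guesses on my part, not real foundry data:

```python
# Back-of-envelope check on the harvesting argument above (all numbers assumed).
import math

D0 = 0.10             # assumed defect density in defects/cm^2 (guess)
AREA = 6.09           # AD102 is ~609 mm^2 = 6.09 cm^2
CRITICAL_FRAC = 0.12  # guess: fraction of the die that absolutely must work

def defect_free(area_cm2: float, d0: float = D0) -> float:
    """Simple Poisson model: probability a region of silicon has zero defects."""
    return math.exp(-d0 * area_cm2)

# Monolithic + harvesting: the die is sellable as *some* SKU as long as the
# non-redundant ~12% is clean; defects in SMs/memory channels just bin it down.
monolithic = defect_free(CRITICAL_FRAC * AREA)

# Split design: the critical logic moves to a tiny, separately tested die, so
# both dies are known-good before assembly -- but you now pay a packaging
# yield hit plus a few % of extra silicon for the cross-die PHYs.
PACKAGING_YIELD = 0.96  # assumed
PHY_OVERHEAD = 1.03     # assumed area multiplier for die-to-die links
split = PACKAGING_YIELD / PHY_OVERHEAD  # good assemblies per unit of silicon spent

print(f"monolithic, harvested:  {monolithic:.3f}")  # ~0.93
print(f"split, known-good dies: {split:.3f}")       # ~0.93 -- basically a wash
```

With any plausible inputs the two come out within a couple of percent of each other, which is exactly the "academic" difference I mean.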
But here's a dirty little secret: a *LOT* of GPU dies are probably much more functional than the SKUs they end up in. This depends on process generation and maturity - sometimes yields really are fairly low and you might struggle to build enough of the top bins - but the rest of the time, a lot of the distinction is artificial (at least in terms of the chips themselves; lower TDPs still reduce costs through simpler board designs etc...). And you can't reliably predict TSMC's yields 12-18 months ahead of mass production, so you can't really optimise around them much in the design phase either.
RDNA3's approach is really elegant there: they can remove an MCD completely for lower-end SKUs, which saves them a lot more money than disabling a fully functional memory controller would. The PHY on the GCD isn't free, but it's a lot smaller than the area/cost of an MCD. This is similar (but in the other direction) to how AMD uses a single I/O die across multiple SKUs with different numbers of CPU chiplets. There are also verification & tape-out cost benefits if you reuse the same chiplet across products or even generations, e.g. for AMD's CPUs it's less verification effort to tape out one I/O die and one CPU chiplet than two monolithic CPU+IO dies.
But the biggest benefit of chiplets for GPUs/CPUs is the ability to use different processes for different chips - e.g. N5 for the GCD and N6 for the MCDs on RDNA3. I'm convinced AMD is right that RDNA3's chiplets are significantly more cost effective than a hypothetical monolithic Navi31, given the different processes and the ability to reduce the number of MCDs (which is orthogonal to whether RDNA3 is more or less cost effective than Ada).
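Here's a back-of-envelope version of that argument using the public die sizes (GCD ~305mm² on N5, MCD ~37.5mm² on N6); the wafer prices are rumoured ballpark figures I'm assuming, and yield/packaging costs are ignored to isolate the wafer-cost effect:

```python
# Crude per-die wafer cost: ignores edge loss, yield and packaging,
# which is fine here because we only care about the relative comparison.
WAFER_MM2 = 70685                        # usable area of a 300 mm wafer
WAFER_COST = {"N5": 16000, "N6": 10000}  # assumed USD per wafer (rumoured ballpark)

def die_cost(process: str, mm2: float) -> float:
    return WAFER_COST[process] / (WAFER_MM2 // mm2)

gcd = die_cost("N5", 305)    # public Navi31 GCD size
mcd = die_cost("N6", 37.5)   # public Navi31 MCD size

full = gcd + 6 * mcd                    # 7900 XTX-style: GCD + 6 MCDs
cut = gcd + 5 * mcd                     # 7900 XT-style: drop an MCD entirely
mono = die_cost("N5", 305 + 6 * 37.5)   # hypothetical monolithic Navi31, all on N5

print(f"chiplet, 6 MCDs: ${full:.0f}")
print(f"chiplet, 5 MCDs: ${cut:.0f}   (one MCD removed, not just fused off)")
print(f"monolithic N5:   ${mono:.0f}   (and the cut SKU still pays for 530 mm^2)")
```

The absolute dollar figures are meaningless, but the shape holds: the chiplet version is cheaper at the top, and the gap widens for cut-down SKUs because the monolithic die pays N5 prices for silicon it then disables.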
My new favourite example of this is Intel's Clearwater Forest which they detailed a bit more yesterday at their foundry event:
https://spectrum.ieee.org/intel-18a
In Clearwater Forest, billions of transistors are divided among three different types of silicon ICs, called dies or chiplets, interconnected and packaged together. The heart of the system is as many as 12 processor-core chiplets built using the Intel 18A process. These chiplets are 3D-stacked atop three “base dies” built using Intel 3, the process that makes compute cores for the Sierra Forest CPU, due out this year. Housed on the base die will be the CPU’s main cache memory, voltage regulators, and internal network. “The stacking improves the latency between compute and memory by shortening the hops, while at the same time enabling a larger cache,” says senior principal engineer Pushkar Ranade.
Finally, the CPU’s I/O system will be on two dies built using Intel 7, which in 2025 will be trailing the company’s most advanced process by a full four generations. In fact, the chiplets are basically the same as those going into the Sierra Forest and Granite Rapids CPUs, lessening the development expense.
So it's Intel 18A for the CPU cores (and presumably the L1+L2 caches), Intel 3 for the base dies that include the L3 cache (there's no scaling/power benefit to doing the L3 on 18A since SRAM has effectively stopped scaling, but you still want L1 and probably L2 on the same die as the cores to minimise latency), and Intel 7 for the I/O dies (since I/O has been scaling even worse than SRAM, for longer) - a total of 12 CPU + 3 base + 2 I/O = 17 chiplets. Assuming their packaging yields are good, I expect this to be a lot cheaper than a single monolithic 18A die.
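A toy calculation of why you'd park the big SRAM on the older node - the cell sizes and wafer prices below are placeholders I made up to show the trend (cells barely shrink, wafers get pricier), not Intel's actual numbers:

```python
SRAM_CELL_UM2 = {"Intel 3": 0.024, "18A": 0.021}  # assumed: only a ~12% shrink
WAFER_COST = {"Intel 3": 12000, "18A": 20000}     # assumed USD per wafer
WAFER_MM2 = 70685                                 # 300 mm wafer

def dollars_per_mb(node: str) -> float:
    # 1 MB = 8 Mbit of cells; ignores array overhead (tags, sense amps, etc.)
    mm2_per_mb = 8e6 * SRAM_CELL_UM2[node] / 1e6  # um^2 -> mm^2
    return WAFER_COST[node] / WAFER_MM2 * mm2_per_mb

for node in ("Intel 3", "18A"):
    print(f"{node}: ${dollars_per_mb(node):.3f} per MB of raw SRAM")
# With these placeholder numbers the newer node is ~45% *more* expensive per
# MB: a 12% cell shrink doesn't come close to paying for a 67% wafer price jump.
```

The same logic applies even more strongly to the I/O dies, which is why they can stay on Intel 7 and be shared with Sierra Forest and Granite Rapids.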
Similarly, I don't think NVIDIA has much reason to go for "multiple GCD-like chiplets" à la MI300X short-term (especially given how profitable their ex-Mellanox datacenter networking business is), but they'd be missing a trick if they didn't go for heterogeneous chiplets given how bad SRAM/I/O scaling is on TSMC N3E. At their current scale they also have to consider the risk of supply constraints for advanced packaging, so they might decide to be more conservative for that reason.