That said, AMD has aimed low with just four cores per shared L3. On top of that, Ryzen seems to have a weak inter-CCX fabric, which boggles the mind given how cheap on-die bandwidth is. This makes it much more sensitive to pathological scheduling cases.
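To make the scheduling sensitivity concrete, here's a minimal ping-pong sketch that bounces a cache line between two pinned threads. The core IDs are assumptions (whether cpu0 and cpu4 land on different CCXes depends on topology and SMT enumeration; check with lstopo or similar), but placing the pair across CCXes rather than within one should show the round trip ballooning:

```c
/* Minimal sketch: cache-line ping-pong between two pinned threads.
 * Core IDs below are assumptions; verify your topology first.
 * Linux-only; build with: gcc -O2 -pthread pingpong.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000
static _Atomic int flag;              /* the line being ping-ponged */

static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ponger(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load(&flag) != 1) ;   /* wait for ping */
        atomic_store(&flag, 0);             /* pong */
    }
    return NULL;
}

int main(void) {
    int cpu_a = 0, cpu_b = 4;   /* ASSUMPTION: different CCXes */
    pthread_t t;
    pthread_create(&t, NULL, ponger, &cpu_b);
    pin(cpu_a);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        atomic_store(&flag, 1);             /* ping */
        while (atomic_load(&flag) != 0) ;   /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.1f ns\n", ns / ITERS);
    return 0;
}
```

A scheduler that keeps migrating communicating threads across CCXes effectively pays that inflated round trip on every shared-line access.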
Even if the fabric were strong enough, the protocol and architecture may not be designed to leverage it. Intel's MESIF came about as part of its creation of the ring bus and LLC, and those two features provide a fair amount of bandwidth amplification and absorb snoop traffic. The scheme starts to falter at higher core counts, but it has some nice behaviors until then.
MOESI, at least as we know it, is older and doesn't derive as much read amplification from on-die storage. It should save writebacks to DRAM when sharing dirty lines relative to MESIF, but Ryzen's lack of an LLC could force the supposed savings in that scenario to incur coherence broadcasts and writebacks to the home node, as if this were a multi-socket system.
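As a textbook illustration of where that saving comes from (this is the generic protocol, not necessarily AMD's implementation), the difference shows up in what the owner of a Modified line does when another core reads it:

```c
/* Textbook sketch, not AMD's actual implementation: what the Owned
 * state buys when another core reads a dirty line. In plain MESI the
 * owner must write back before sharing; in MOESI it can stay the
 * source of truth and just forward a copy. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } state_t;

/* Returns the owner's next state on a remote read, and whether a
 * memory writeback is required before the data can be shared. */
static state_t on_remote_read(state_t s, bool moesi, bool *writeback) {
    *writeback = false;
    switch (s) {
    case MODIFIED:
        if (moesi)
            return OWNED;        /* keep dirty data, forward a copy */
        *writeback = true;       /* MESI: flush to memory first */
        return SHARED;
    case EXCLUSIVE:
        return SHARED;           /* clean line: no writeback either way */
    default:
        return s;
    }
}

int main(void) {
    bool wb;
    state_t s = on_remote_read(MODIFIED, true, &wb);
    printf("MOESI: next=%d writeback=%s\n", s, wb ? "yes" : "no");
    s = on_remote_read(MODIFIED, false, &wb);
    printf("MESI:  next=%d writeback=%s\n", s, wb ? "yes" : "no");
    return 0;
}
```

In the MESI/MESIF family the dirty data has to land somewhere safe before sharing, and Intel's inclusive LLC conveniently serves as that somewhere on-die. MOESI's Owned state lets a cache keep serving the dirty line instead, which is exactly the property that stops paying off when there's no shared LLC to localize the traffic.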
What this might offer, however, is a simpler scale-up to larger chip counts. The system doesn't try to track as many lines capable of sharing clean data, and dirty lines write back to the home node more quickly in the absence of single-die optimizations. If there's something like HT Assist, the interchange of snoops and data would be simplified at the cost of chaining the fabric to the memory controller's throughput--which, per descriptions of Vega's GMI fabric and elsewhere, appears to be the case.
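For reference, HT Assist on the Opterons was a probe filter: a directory carved out of L3 at the home node that tracks which node holds a line, so a miss can be resolved with one directed probe (or none) instead of a broadcast. A very hand-wavy sketch of the idea, with entry format, eviction, and states all glossed over:

```c
/* Hand-wavy sketch of an HT Assist-style probe filter; everything
 * here is simplified for illustration. The point: a directory hit at
 * the home node replaces a broadcast snoop with one directed probe,
 * tying coherence traffic to the memory controller's pipeline. */
#include <stdbool.h>
#include <stdio.h>

#define DIR_ENTRIES 1024

typedef struct {
    unsigned long tag;
    int owner;              /* node holding the line, -1 if none */
    bool valid;
} dir_entry_t;

static dir_entry_t dir[DIR_ENTRIES];  /* directory at the home node */

/* Resolve a read miss that reached the home memory controller. */
static void home_read(unsigned long addr, int requester) {
    dir_entry_t *e = &dir[(addr / 64) % DIR_ENTRIES];
    if (e->valid && e->tag == addr / 64 && e->owner >= 0) {
        /* Hit: probe only the owning node, no broadcast. */
        printf("directed probe to node %d\n", e->owner);
    } else {
        /* Miss: line is uncached, serve straight from local DRAM
         * with no snoops on the fabric at all. */
        printf("fetch from DRAM, no snoops\n");
    }
    e->tag = addr / 64;
    e->owner = requester;
    e->valid = true;
}

int main(void) {
    home_read(0x1000, 0);   /* first touch: DRAM, no snoops */
    home_read(0x1000, 1);   /* second node: one directed probe */
    return 0;
}
```

The simplification is real, but note that every step funnels through the home memory controller, which is the "chained to memory throughput" property mentioned above.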
The price of scalability is, as usual, that performance per granule and at lower core counts is the poorer for it.
In Vega's case it's described as a positive that the interconnect bandwidth is equal to memory bandwidth, although interestingly Naples is confirmed to be able to connect to Radeon Instinct cards over GMI (presumably the MI-25). It would have been interesting if Ryzen were able to do something similar as a value-add.
Another area where scalability may come at the cost of effectiveness is AMD's turbo and XFR functionality. These perhaps benefit some kind of data-center load with a VM per CCX, or a sustained multi-socket load. As it stands, there are either rather coarse turbo bins for a limited subset of cores under spiky loads, or a relatively fixed all-core turbo a significant way below them. The case with a few highly active cores and the rest mostly but not fully parked is not handled all that well compared to what Intel can do, and that case may match games more than other applications.
A question then for these multi-die solutions is whether the power delivery and clocking scale up: does Ryzen's significant loss of clock speed with more than two cores active translate into eight boosted cores on a package equivalent to four Ryzen dies, or is it still just two boosted cores per package? A rough way to probe this is sketched below.
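A crude sketch of that probe, assuming a Linux box with the cpufreq sysfs interface (how faithfully scaling_cur_freq tracks the real clock varies by kernel and driver): load N cores and see which frequencies get reported.

```c
/* Rough sketch for probing turbo-bin behavior: spin on N cores and
 * read each core's reported frequency from Linux cpufreq sysfs.
 * On a single Ryzen die you'd expect reported clocks to drop once
 * more than two cores are loaded; on a multi-die package the open
 * question is whether that limit applies per die or per package. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile int run = 1;

static void *spin(void *arg) {
    (void)arg;
    while (run) ;                 /* busy loop to trigger boost */
    return NULL;
}

int main(int argc, char **argv) {
    int n = argc > 1 ? atoi(argv[1]) : 2;   /* cores to load */
    pthread_t t[64];
    for (int i = 0; i < n && i < 64; i++)
        pthread_create(&t[i], NULL, spin, NULL);
    sleep(1);                     /* let the boost algorithm settle */
    for (int cpu = 0; ; cpu++) {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;            /* no more CPUs */
        long khz;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%d: %ld MHz\n", cpu, khz / 1000);
        fclose(f);
    }
    run = 0;
    for (int i = 0; i < n && i < 64; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Running it with 1, 2, 4, 8... loaded cores would show whether the clock drop kicks in at two active cores per die or two per package.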
The millisecond polling in AMD's power management is in some ways a bit slow; possibly that's to allow the MCM to coordinate?
This seems to be a given. Intel changed their fabric topology and properties multiple times when they started to scale beyond 4 cores. The current one is the result of many iterations. AMD is clearly behind in this area.
AMD's prior interconnect started with the first dual-core A64s, and lasted in some form until Zen--theoretically. Jim Keller's presence at AMD sort of book-ends the lifespan of a really creaky interconnect.