Trying hard to understand why anyone would want this on?
Per the article, it's part of the memory standard at the higher speeds, for stability purposes. Across a wider range of module qualities, the standard is more concerned with getting broad compatibility and stability than with performance.
hm... How does it compare to IBM's implementation/configuration for L3? (Power 7/8/9)
IBM's eDRAM cache is large but is subdivided into local slices. Whether a slice is private to a core or shared between two depends on the generation. Relative to a CCX, there's at least twice as much L3 per core as AMD provides.
IBM's bandwidth between cache levels ranges from 2-4 times that of AMD. Unlike AMD, the L3 has a more complex relationship with the L2 in that it will eventually copy hot L2 lines into itself, and the chip overall has a complex coherence protocol with migration policies for cloning shared data between partitions.
Latency-wise, the L1 and L2 in recent Power chips are on the order of Intel's caches, and the local L3 is something like ~27 cycles versus AMD's ~30-40. Power 8 can seemingly muster this at significantly more than 4 GHz; Power 9 is less documented, but seems to have 4 GHz as a starting point. The L3 is an eDRAM cache, so it may have non-best-case latencies that differ from the SRAM-based L3 of Zen; DRAM can be more finicky in terms of its access patterns and when it is occupied with internal array maintenance like refresh.
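For a rough sense of scale, here's a back-of-envelope conversion of those cycle counts to wall-clock time; the 4.0 GHz figure comes from the Power clocks above, while the ~3.5 GHz Zen clock is just an assumed illustrative value, not a measured one:

# Back-of-envelope: L3 hit latency in nanoseconds (clock values are illustrative assumptions).
def cycles_to_ns(cycles, ghz):
    return cycles / ghz  # one cycle at f GHz lasts 1/f ns

print(f"Power local L3: ~{cycles_to_ns(27, 4.0):.1f} ns at 4.0 GHz")              # ~6.8 ns
print(f"Zen L3 (best):  ~{cycles_to_ns(30, 3.5):.1f} ns at an assumed 3.5 GHz")   # ~8.6 ns
print(f"Zen L3 (worst): ~{cycles_to_ns(40, 3.5):.1f} ns at an assumed 3.5 GHz")   # ~11.4 ns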
The cache hierarchies are dissimilar: Power has a write-through L1 and an L2-L3 relationship that is more inclusive. Unlike AMD's, the Power L3 can cache L2 data and can participate in memory prefetch.
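To make the write-through distinction concrete, here's a minimal Python sketch of the store path difference; it's purely illustrative and not a model of either chip's actual cache controller:

# Illustrative only: how a store propagates under write-back vs write-through.
class WriteBackL1:                      # Zen-style L1
    def __init__(self):
        self.lines = {}                 # addr -> (data, dirty)
    def store(self, addr, data, l2):
        self.lines[addr] = (data, True) # L2 sees nothing until eviction
    def evict(self, addr, l2):
        data, dirty = self.lines.pop(addr)
        if dirty:
            l2[addr] = data             # dirty data written back late

class WriteThroughL1:                   # Power-style L1
    def __init__(self):
        self.lines = {}                 # addr -> data, never dirty
    def store(self, addr, data, l2):
        self.lines[addr] = data
        l2[addr] = data                 # store pushed to L2 immediately

One consequence is that a write-through L1 never holds the only copy of a line, which simplifies what the lower levels have to track.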
Remote L3 access is pretty long with IBM, but Zen is equally poor. In terms of the bandwidth for those remote hits, AMD's fabric is missing a zero in the figure (roughly an order of magnitude lower), even on-die.
IBM's die size, power, and price for all that are typically not in the same realm.
Some areas that do add up more favorably for EPYC are the DRAM and IO links per socket.
@3dilettante
AMD confirmed single-ended, not LVDS, for the GMI links (around 5:20), so that makes a lot more sense given the bandwidth number, clock rates, and number of pins.
If the differential PHY is running at 10.6 Gbps for xGMI, The Stilt's observation that GMI is twice as wide and half as fast gives each in-package link 32 signals in each direction, at a more sedate 5.3 Gbps.
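As a quick sanity check on those figures (the 16-lane xGMI width is an assumption, implied by "twice as wide" coming out to 32 signals), the raw per-direction bandwidth works out the same either way:

# Raw per-direction link bandwidth from the figures above.
# The 16-lane xGMI width is an assumption implied by "twice as wide" -> 32 signals.
xgmi_gbps = 16 * 10.6    # 169.6 Gb/s per direction
gmi_gbps  = 32 * 5.3     # 169.6 Gb/s per direction
print(xgmi_gbps / 8, "GB/s per direction")   # ~21.2 GB/s
print(gmi_gbps / 8, "GB/s per direction")    # ~21.2 GB/s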
AMD's diagram in its Processor Programming Reference document has 4 GMI controllers, though the MCM only uses 3 per die.