Interesting, but more or less a fat PCIe connection. IMHO it won't really change the architecture, but it will accelerate certain tasks with a lot of traffic going off chip. The exception would be if it were enabling some sort of MCM interconnect like Epyc's.
There is only one reason. The hardware is broken.
The patents are out, so AMD should have no qualms about talking about it, and if it's a software issue AMD could use the open source community to help them get the driver done.
Not necessarily broken, just not worth the effort given a more versatile solution. With compute-based culling the implicit paths may perform worse and the explicit paths probably better, but the latter require Vega-specific programming that, in the current mining environment, I'm not sure makes sense. No devs, or even AMD for that matter, would see a point in CURRENTLY developing it for the existing Vega market share. That's in addition to, or replacing, a compute-based culling implementation that should run on most if not all current hardware.
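For context, the compute-based culling idea is roughly: a compute kernel walks the index buffer ahead of the draw, rejects back-facing (and optionally off-screen or degenerate) triangles, and compacts the survivors into a new index buffer. A minimal sketch, using CUDA as a stand-in for a compute shader; all names are made up and this is not AMD's or any engine's actual implementation:

```cuda
#include <cuda_runtime.h>

struct Vec3 { float x, y, z; };

__device__ Vec3 sub(Vec3 a, Vec3 b)    { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
__device__ Vec3 cross3(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y,
                                                 a.z*b.x - a.x*b.z,
                                                 a.x*b.y - a.y*b.x}; }
__device__ float dot3(Vec3 a, Vec3 b)  { return a.x*b.x + a.y*b.y + a.z*b.z; }

// One thread per triangle: drop back-facing triangles, compact the rest.
__global__ void cullTriangles(const Vec3* pos, const unsigned* indices,
                              unsigned triCount, Vec3 eye,
                              unsigned* outIndices, unsigned* outCount)
{
    unsigned tri = blockIdx.x * blockDim.x + threadIdx.x;
    if (tri >= triCount) return;

    unsigned i0 = indices[3*tri+0], i1 = indices[3*tri+1], i2 = indices[3*tri+2];
    Vec3 v0 = pos[i0], v1 = pos[i1], v2 = pos[i2];

    // Back-face test: face normal against the vector towards the eye.
    Vec3 n = cross3(sub(v1, v0), sub(v2, v0));
    if (dot3(n, sub(eye, v0)) <= 0.0f) return;       // culled

    // Survivor: append its indices to the compacted index buffer.
    unsigned slot = atomicAdd(outCount, 1u);
    outIndices[3*slot+0] = i0;
    outIndices[3*slot+1] = i1;
    outIndices[3*slot+2] = i2;
}
```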
RPM doesn't help with scientific FP32 or FP64, but yeah, it's nice for mixed precision. HBCC (beyond the unified memory between HBM and system RAM) still needs to be seen operating with a real-world HPC application on scaled-up/out nodes; until then it's just what one expects from a modern accelerator with unified memory.
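To make the packed-math point concrete, here's a minimal sketch using CUDA's half2 intrinsics as an analogy for RPM (on Vega this would go through GCN packed FP16 instructions instead; the kernel and names are purely illustrative):

```cuda
#include <cuda_fp16.h>

// Packed FP16 AXPY: each __half2 holds two FP16 lanes, and __hfma2 issues one
// fused multiply-add over both lanes at once. That doubled per-instruction
// throughput is the benefit, and also why it does nothing for FP32/FP64 work.
__global__ void axpy_packed(const __half2* x, __half2* y, __half2 a, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);
}
```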
RPM and Tensor Cores are somewhat analogous, so it's more than just mixed precision. HBCC is probably working on some HPC workloads, specifically the oil-and-gas or astronomy guys with extremely parallel workloads on large datasets. The oil guys were the ones requesting the feature in the first place, as I recall. They were also likely using a SAN for storage, and we haven't seen any relevant cards publicly.
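As an analogy for the kind of HPC case where HBCC should matter (a dataset bigger than local HBM, paged in on demand), here's what the equivalent looks like with CUDA managed memory; the mechanism is different, but it's the same "one address space, oversubscribe and page" idea. Sizes and names are invented for illustration:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // pages migrate to the GPU as they are touched
}

int main()
{
    const size_t n = 1ull << 31;                    // ~8 GB of floats, may exceed local HBM
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // single address space, demand paged
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // initialized from the host side

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("%f\n", data[0]);
    cudaFree(data);
    return 0;
}
```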
For future generations in the HPC/compute segment, AMD will probably have to implement one or more of those patents you and others have found in the past, the ones that focus either on scaling CUs or on changes at the vector unit/ALU level.
I'd imagine there will be some changes to the CUs in the future. Personally I still favor cascaded DSPs to avoid heavy VGPR traffic and essentially create deeper pipelines where practical, forming a systolic network for macro blocks. That design would be a step beyond what Nvidia is doing with register file caching.
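A loose software analogy for the dataflow idea (the actual proposal is a hardware change, so take this only as an illustration): chain dependent FMAs so the running value stays in one register the whole way, instead of each intermediate bouncing through the register file. A cascaded DSP / systolic arrangement would do the equivalent in hardware by handing the partial result straight to the next ALU stage:

```cuda
// Hypothetical kernel: a "deep pipeline" of dependent FMAs per output element.
// The accumulator never leaves a register between stages, which is the
// software-visible cousin of avoiding VGPR round-trips between cascaded ALUs.
__global__ void fma_chain(const float* a, const float* b, float* out,
                          int n, int depth)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int s = 0; s < depth; ++s)
        acc = fmaf(a[i * depth + s], b[i * depth + s], acc);
    out[i] = acc;
}
```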
That being said, I'm not sure AMD needs to worry about scaling past 64 CUs as process tech isn't getting much smaller and their goal was multiple chips working together. The 64 CU limit is a nice square number from a parallel hardware perspective. The solution would be to make extremely efficient, mobile-like cores with an infinitely fast interconnect between them. Some of the upcoming differential or modulated signaling implementations may allow that. It's always possible Vega 12 has xGMI and is doing something similar. The current roadmap is mobile or 7nm compute parts so it would seem plausible. Leave Navi for scaling beyond two(?) chips efficiently.
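For what "multiple chips with a fast link between them" looks like from the software side today, multi-GPU peer access is the closest public analogy; xGMI/Infinity Fabric between dies would play a similar role under the hood. A minimal CUDA sketch assuming two devices (nothing here is specific to AMD or to any unannounced part):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1?
    if (!canAccess) { printf("no peer link between devices 0 and 1\n"); return 0; }

    const size_t bytes = 64 << 20;
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // map device 1's memory on device 0
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Direct device-to-device copy over whatever link exists (NVLink/xGMI-class
    // fabric if present, plain PCIe otherwise) without staging through the host.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```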
There are two Linux drivers out there. One is from AMD and one is from the community.
AMD actively maintains both, although the community one does have additional devs working on it. The problem lies in code that can't be freely distributed or that doesn't readily meet kernel standards.