The point is, larger caches and/or faster links with improved protocols may allow further unification, as happened in Vega10 where the pixel and geometry engines were connected to the L2 cache.
They were connected to the L2, but the descriptions seem more consistent with them using it as a means to either spill or stream out in a unidirectional manner. There's no participation in a protocol, and there are instances where there are heavy synchronization and flush events that would not be needed if they were full clients.
As far as the L2 is concerned, there is currently no protocol they need to participate in, because there is only one destination and minimal interference from other clients. The examples where interference can happen tend to involve flushes and command processor stalls.
The experts are on the x86 side; GCN's architecture reflects little to none of that expertise. It has a memory-side L2 cache where data can only exist in one L2 slice, and the L1s are kept weakly coherent by writing back any changes within a handful of cycles. Coherence between the GPU and CPU space is one-sided. The CPU side can invalidate and respond with copies held in the CPU caches, which are the ones designed by the experts. The GPU side, with its GCN-level expertise, cannot be snooped and cannot use its cache hierarchy to interact with the CPUs. At most, GPU code can be directed to write back to DRAM and avoid the caches, after which it may be safe to read that data once some kind of synchronization command packet is satisfied.
But how would this be different from the current architecture?
Coherence in GCN's L2 is handled by there being only one physical place a given cached copy can reside: the L2 slice assigned to its subset of the memory channels.
This is a trivial sort of coherence. A copy cannot be incoherent with another copy because it cannot be in more than one place.
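A minimal sketch of what that address-sliced arrangement implies; the interleave granularity and slice count below are made-up illustrative values, not Vega's actual parameters:

```python
# Toy model: each physical address maps to exactly one L2 slice, so there is
# never a second location that could hold a competing copy of the same line.
INTERLEAVE_BYTES = 256   # illustrative interleave granularity, not Vega's real value
NUM_SLICES = 16          # illustrative slice count, not Vega's real value

def l2_slice_for(address: int) -> int:
    """Return the single L2 slice that owns this address."""
    return (address // INTERLEAVE_BYTES) % NUM_SLICES

# Any client asking about a given address is routed to the same slice, which is
# the extent of GCN's intra-GPU "coherence": one address, one possible location.
for addr in (0x0000, 0x0100, 0x1000):
    print(hex(addr), "-> slice", l2_slice_for(addr))
```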
The Vega whitepaper and Vega ISA document imply that the L2 is split between separate memory controllers/channels and interconnected through the IF/crossbar.
The L2 has always been split between memory controllers. Vega's mesh could in theory permit flexibility in how the slices map physically, but the current products seem to have the same L2 to memory channel relationship for the same chip. The fabric would not make the L2 function correctly if it somehow allowed two slices to share the same address ranges.
What happens if another L2 slice has a copy is undefined. What "coherent" means as far as GCN goes is that a client writes through to the L2, which only the CUs do.
For global coherence, GCN works by skipping the L1 and L2 caches entirely and writing back to DRAM, since neither can participate coherently with system memory.
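As a toy model of that one-way visibility flow; the threading event here is just a stand-in for the command-processor synchronization packet, and none of the names correspond to a real driver or ISA interface:

```python
import threading

# Toy model: the GPU side writes past its caches to memory, a synchronization
# point is satisfied, and only then does the CPU side read the data.
dram = {}                      # stand-in for system memory
sync_done = threading.Event()  # stand-in for a command-processor sync packet

def gpu_producer():
    # Writes are directed around the L1/L2 hierarchy (modeled here as going
    # straight to "dram"), since the GPU caches cannot be snooped by the CPU.
    dram["result"] = 42
    # Signal completion only after the writes have reached memory.
    sync_done.set()

def cpu_consumer():
    # The CPU must wait on the synchronization event before the data is safe
    # to read; it cannot pull the data out of the GPU caches itself.
    sync_done.wait()
    print(dram["result"])

t = threading.Thread(target=gpu_producer)
t.start()
cpu_consumer()
t.join()
```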
What if we scale down the package to 2 HBM and 2 GPU dies and a 7nm respin of the IO die?
Respinning the IO die would have limited benefit, which is why AMD is able to use 14nm IO dies. The PHY area and the perimeter length allocated to IO see little benefit from a newer process node. Shrinking the IO die might add some challenges depending on how extreme its length-to-width ratio gets in order to maintain sufficient perimeter.
Cutting the number of HBM stacks in half means evaluating a system with 2.5-5x the link bandwidth of Rome.
It's not clear how much of the overall DRAM bandwidth Rome can supply to a single chiplet. If it is between 0.5 and 1.0 of the overall bandwidth, something like 5x the link capability may stretch the limits of the IO die and the perimeter of a Rome-style chiplet.
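Rough numbers behind that range; the HBM and DDR4 figures below are my own assumptions for illustration (two HBM2 stacks at ~256 GB/s each, Rome's eight DDR4-3200 channels at ~204.8 GB/s total), not anything stated above:

```python
# Back-of-envelope arithmetic for the 2.5-5x figure; all inputs are assumed.
hbm_stacks = 2
hbm_bw_per_stack = 256.0                          # GB/s, HBM2 at ~2 Gbps/pin
gpu_chiplet_bw = hbm_stacks * hbm_bw_per_stack    # 512 GB/s to feed one GPU chiplet

rome_dram_bw = 8 * 25.6                           # GB/s, eight DDR4-3200 channels
for fraction in (1.0, 0.5):   # share of DRAM bandwidth one chiplet link can draw
    link_bw = rome_dram_bw * fraction
    print(f"{fraction:.1f} of Rome DRAM bandwidth -> {gpu_chiplet_bw / link_bw:.1f}x the link")
# Prints roughly 2.5x and 5.0x, matching the range quoted above.
```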
Since we do not have a clear die shot of the involved silicon in Rome, it's hard to say how much area on the IO die 2.5-5x the links would take up, or how much perimeter it would need. Supplying one GPU chiplet with the bandwidth of 2 HBM stacks would take all or most of one side of the IO die, before considering the other GPU chiplet. (edit: at least if going by the block diagram AMD has drawn)
At least for current products and this incoming generation of MCMs, it doesn't seem like the technology has advanced enough to make this as practical as it is for CPUs like Rome.
Could lower voltage logic maintain signal integrity?
The on-package links already do operate at lower voltages, or at least as low as the engineers at a given vendor have managed while still being able to get usable signals from them. The reduced wire lengths and reduced error handling requirements bring the power cost per bit much lower than that of the inter-socket connections.
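To put a rough number on that; the pJ/bit values below are ballpark figures I'm assuming for on-package versus socket-to-socket links, purely for illustration:

```python
# Illustrative energy-per-bit comparison; the pJ/bit values are assumptions.
GBIT = 1e9
link_bw_gbytes = 100.0                 # hypothetical link bandwidth in GB/s
bits_per_sec = link_bw_gbytes * 8 * GBIT

on_package_pj_per_bit = 2.0            # short on-package traces, assumed
inter_socket_pj_per_bit = 10.0         # long board traces + error handling, assumed

for name, pj in (("on-package", on_package_pj_per_bit),
                 ("inter-socket", inter_socket_pj_per_bit)):
    watts = bits_per_sec * pj * 1e-12
    print(f"{name}: ~{watts:.1f} W at {link_bw_gbytes:.0f} GB/s")
# Under these assumptions the on-package link costs ~1.6 W versus ~8 W for the
# longer link at the same bandwidth, which is the gap being described above.
```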
There may be future technologies that reduce these downsides, although AMD's projections for GPU MCMs have included interposers rather than assuming package interconnects will scale sufficiently.