I'm not sure it's actually going to be a 2.5D design. Besides, there is already GMI for chip-to-chip interconnects, bridging between on-die Coherent Fabric networks, if I haven't misunderstood that.
I misinterpreted the statement about abandoning internal crossbars as a reversion to something physically external to the die.
As long as the unified fabric features proper routing and efficient switching, I wouldn't expect too many issues or much overhead with such an approach; the gains from keeping most of the traffic "local" can be preserved.
The logical level of the interconnect shouldn't necessarily decide the physical topology. AMD's coherent processors have crossbars internally, for example.
The current method for GCN is for the CUs and other hardware units to rely on the last-level cache, and on the ROP cache hierarchy for export, so the crossbars in question sit between memory clients and the first point of global coherence.
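To make concrete why that first hop is so cheap, here's a toy C sketch of a client-to-L2 crossbar picking its target from nothing but a few address bits. All the constants (line size, slice count, interleave granularity) are invented for illustration; real GCN parts interleave on different bits per SKU.

```c
/* Hypothetical sketch: a crossbar between memory clients and L2
 * slices can route on address bits alone.  Constants are made up. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64u   /* assumed cache line size               */
#define L2_SLICES  16u   /* assumed one slice per memory channel  */

/* The crossbar's entire "routing decision": a handful of address
 * bits pick the slice; nothing resembling a packet header exists. */
static unsigned l2_slice_for(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / LINE_BYTES) % L2_SLICES);
}

int main(void)
{
    uint64_t a = 0x12345680ull;
    printf("address 0x%llx -> L2 slice %u\n",
           (unsigned long long)a, l2_slice_for(a));
    return 0;
}
```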
Putting the fabric where those crossbars are has to add overhead, since the current overhead is either non-existent or the traffic is physically incapable of being incoherent.
A much more limited set of values is needed to address the appropriate target, and the coherent fabric would be inserting a higher-level set of considerations (flit routing, global addressing, packetization, coherence broadcasts) into what should be a simple inclusive L1-L2 cache hierarchy, or into specific producer-consumer data flows where the endpoints are fixed and coherence would imply unwanted outside interference.
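Roughly what that overhead looks like, sketched as C structs. Every field name and width here is my own invention, not from any AMD spec; the point is only the contrast: the crossbar hop carries little beyond the address, while a fabric flit needs routing, ordering and coherence fields before a single byte of payload moves.

```c
/* Hypothetical comparison of per-request metadata.  All fields and
 * widths are invented for illustration. */
#include <stdint.h>
#include <stdio.h>

/* Crossbar hop: the target is implied by address bits, so the
 * request needs almost nothing besides the address itself. */
struct xbar_req {
    uint64_t addr;       /* physical address, slice decoded from it */
    uint8_t  client_id;  /* return path for the reply               */
    uint8_t  is_write;
};

/* Fabric flit: routing, ordering and coherence each demand their
 * own fields. */
struct fabric_flit {
    uint64_t addr;       /* now a *global* address                  */
    uint16_t src_node;   /* packetized source                       */
    uint16_t dst_node;   /* packetized destination                  */
    uint8_t  vc;         /* virtual channel for routing/deadlock    */
    uint8_t  msg_class;  /* request / snoop / response / ack        */
    uint16_t txn_id;     /* matches out-of-order responses          */
    uint8_t  coh_state;  /* MOESI-style coherence bookkeeping       */
};

int main(void)
{
    printf("xbar request metadata: %zu bytes, fabric flit header: %zu bytes\n",
           sizeof(struct xbar_req), sizeof(struct fabric_flit));
    return 0;
}
```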
But let's step this up with an example: what if the geometry processors were able to stream geometry via the interconnect to rasterizers not even physically placed on the same die? An arbitrary number of units directly addressable if needed? No more DMA/XDMA, just a single virtual fabric with a global, unified address space?
That could be after a large data amplification step, and things on-die are very cheap relative to anything that goes off of it.
The traffic could be "compressed", but at that point the compressor/decompressor would effectively be handing the command stream to a local geometry processor and rasterizer.
And if you want to extend your virtual GPU, you just plug in additional resources, which become addressable by the existing command processors? Yes, I don't expect it to be EFFICIENT if you start streaming geometry like this; that's just an extreme example. But at a higher level, plugging a virtual GPU together should work quite well. The comparisons with NVLink might not be too far off, especially regarding the capability to schedule transparently.
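Something like this, as a toy C sketch. Everything here (names, window granularity, the flat lookup) is hypothetical: the idea is just that each plugged-in unit claims an aperture in the unified address space, and a command processor resolves local and remote units with the same lookup.

```c
/* Hypothetical sketch of "plugging in" a resource: each unit claims
 * a window in one global, unified address space.  All names and the
 * window granularity are invented for illustration. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MAX_UNITS 64

struct unit_window {
    uint64_t base, size;  /* aperture in the global address space     */
    uint16_t node;        /* die/package the unit physically sits on  */
    char     kind[16];    /* "rasterizer", "geom", "memory", ...      */
};

static struct unit_window map[MAX_UNITS];
static int unit_count;

/* Hot-plugging a resource is just appending a window. */
static void plug_in(uint64_t base, uint64_t size, uint16_t node,
                    const char *kind)
{
    if (unit_count >= MAX_UNITS) return;
    struct unit_window *w = &map[unit_count++];
    w->base = base; w->size = size; w->node = node;
    strncpy(w->kind, kind, sizeof w->kind - 1);
    w->kind[sizeof w->kind - 1] = '\0';
}

/* A command processor resolves any global address to a unit,
 * local or remote, with the same lookup. */
static const struct unit_window *route(uint64_t addr)
{
    for (int i = 0; i < unit_count; i++)
        if (addr >= map[i].base && addr - map[i].base < map[i].size)
            return &map[i];
    return 0;  /* unmapped */
}

int main(void)
{
    plug_in(0x0000000000ull, 1ull << 32, 0, "local-raster");
    plug_in(0x1000000000ull, 1ull << 32, 1, "remote-raster"); /* other die */
    const struct unit_window *w = route(0x1000000040ull);
    printf("0x1000000040 -> node %u (%s)\n",
           w ? w->node : 0, w ? w->kind : "unmapped");
    return 0;
}
```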
There are already queues and memory paths for the front ends to fetch work from, and various methods of stream out. It's generally helpful to take things as far as you can with the internal networks, and then take the hit of going off-die or worrying about arbitrary access and routing.
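For reference, the sort of queue I mean, as a toy single-producer/single-consumer ring in C. The layout is invented and the volatile pointers stand in for proper memory ordering; it's the general pattern behind GPU command rings, not any specific hardware's format.

```c
/* Hypothetical sketch of an in-memory work queue a front end fetches
 * from: a ring with free-running read/write pointers.  Sizes and the
 * packet format are invented for illustration. */
#include <stdint.h>
#include <stdio.h>

#define RING_ENTRIES 256u  /* power of two so wrap is cheap */

struct ring {
    uint32_t packets[RING_ENTRIES]; /* opaque command packets */
    volatile uint32_t wptr;         /* producer-owned          */
    volatile uint32_t rptr;         /* consumer-owned          */
};

/* Producer (e.g. a driver or another engine) appends work. */
static int ring_push(struct ring *r, uint32_t pkt)
{
    if (r->wptr - r->rptr == RING_ENTRIES)
        return 0;                          /* full */
    r->packets[r->wptr % RING_ENTRIES] = pkt;
    r->wptr++;                             /* publish */
    return 1;
}

/* Consumer (e.g. a command front end) fetches the next packet. */
static int ring_pop(struct ring *r, uint32_t *pkt)
{
    if (r->rptr == r->wptr)
        return 0;                          /* empty */
    *pkt = r->packets[r->rptr % RING_ENTRIES];
    r->rptr++;
    return 1;
}

int main(void)
{
    static struct ring r;  /* zero-initialized */
    uint32_t pkt;
    ring_push(&r, 0xC0DE);
    if (ring_pop(&r, &pkt))
        printf("fetched packet 0x%x\n", (unsigned)pkt);
    return 0;
}
```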