Aside from that, there are decisions to be made about which direction to devote sufficient resources to, unless the argument is that it's merely a matter of devoting sufficient resources to every direction they could possibly go in.
Success or market adoption is not guaranteed either, and a lack of it can leave a workable direction abandoned even when sufficient resources were devoted to it.
AMD thus far seems to be open to a direction that is conceptually easier to implement and lower-risk, and hasn't fleshed out future prospects as fully as some competing proposals. Those competing proposals lay out more architectural changes than AMD has described, and yet more would be needed to go as far as AMD says is necessary for MCM graphics in gaming.
That's one of many barriers to adoption. Being able to address memory locations consistently doesn't resolve the need for sufficient bandwidth to and from them, nor the complexities of the parts of the architecture that do not view memory in the same manner as the more straightforward vector memory path.
Infinity Fabric is an extension of coherent HyperTransport, and the way the data fabric functions includes probes and responses meant to support the kinds of shared multiprocessor cache operations x86 requires. As of right now, GCN caches are far more primitive and seem to use only part of that capability. There are a number of possibilities that fall short of full participation in the memory space.
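To illustrate what a probe/response exchange means in general terms (this is a generic MOESI-style sketch of my own, not AMD's actual fabric protocol or anything from an IF spec), the idea is that a coherent fabric forwards probes to peer caches and collects their responses:

```cpp
// Generic, hypothetical sketch of a probe/response coherence exchange.
// Not Infinity Fabric's protocol; just the textbook MOESI-style idea that
// a coherent fabric probes peer caches and collects what they report back.
#include <cstdio>

enum class LineState { Invalid, Shared, Exclusive, Owned, Modified };

struct ProbeResponse {
    bool hit;            // did this cache hold the line?
    bool dirty_data;     // must it forward modified data to the requester?
    LineState new_state; // state after honoring the probe
};

// A cache receiving a read probe downgrades Exclusive/Modified and reports
// whether it has to supply the data itself.
ProbeResponse handle_read_probe(LineState current) {
    switch (current) {
        case LineState::Modified:  return {true,  true,  LineState::Owned};
        case LineState::Owned:     return {true,  true,  LineState::Owned};
        case LineState::Exclusive: return {true,  false, LineState::Shared};
        case LineState::Shared:    return {true,  false, LineState::Shared};
        default:                   return {false, false, LineState::Invalid};
    }
}

int main() {
    ProbeResponse r = handle_read_probe(LineState::Modified);
    std::printf("hit=%d dirty=%d\n", r.hit, r.dirty_data);
}
```

As I understand it, GCN's write-through vector L1s and per-channel L2 slices rarely need most of these exchanges, which is one sense in which the GPU uses only part of the fabric's capability.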
The other parts of the GPU system that do not play well with the primary memory path are not helped by IF. It is a coherent fabric meant to connect coherent clients; it cannot make non-coherent clients coherent. The fabric also doesn't make the load-balancing and control/data hazards of multi-GPU go away.
The HBCC, with its translation and migration hardware, is a coarser way of interacting with memory. One thing I do recall is that, at least early on, while it was more automatic in its management of resources, the driver-level management of NVLink page migration had significantly higher bandwidth. I have not seen a more recent comparison.
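For reference on what the driver/runtime-managed side of that comparison looks like, here is a minimal sketch using the public CUDA runtime API for managed memory, where pages migrate on demand and hints steer residency. This is only an illustration of software-managed migration, not a claim about how HBCC or the NVLink path is implemented internally:

```cpp
// Minimal sketch of runtime-managed page migration via the CUDA runtime API,
// the rough software analogue of what HBCC automates in hardware.
// Assumes a CUDA-capable system; error handling omitted for brevity.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB managed allocation
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);    // pages migrate on demand between CPU and GPU

    // Optional hints: prefer GPU residency and prefetch ahead of use,
    // instead of relying purely on fault-driven migration.
    int device = 0;
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, bytes, device, /*stream=*/0);

    cudaDeviceSynchronize();
    cudaFree(data);
    std::printf("done\n");
    return 0;
}
```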
HBCC may be necessary for the kind of coarse workloads GPUs tend to be given, yet it isn't cited as being necessary on the CPU architectures that have had scalable coherent fabrics for a decade or more. The CPU domain is apparently considered robust enough to interface more directly, whereas the GPU domain is trying to automate a fair amount of special handling without changing the architecture that needs the hand-holding.
There may be changes in the xGMI-equipped Vega that have not been disclosed. Absent those changes, and going by what we know of the GCN architecture, significant portions of its memory hierarchy do not work well with non-unified memory. The GPU caches depend on an L2 for managing visibility, an arrangement that architecturally does not work with memory channels the L2 is not directly linked to, and whose model breaks if more than one cache can hold the same data.
Elements of the graphics domain are also uncertain. DCC is incoherent within its own memory hierarchy; the Vega whitepaper discusses the geometry engine being able to stream out data through the L2, but in concert with a traditional, incoherent parameter cache. The ROPs are L2 clients in Vega, but they treat the L2 as a spill path that only inconsistently receives updates from them.
Thank you.
But still, there is the actual data we do have in the slides. Also, during the presentation he references Infinity Fabric 2.0 and tells us about the low latencies achieved, then later on quietly lets it drop that they also have unified memory. That hints that later in 2020 we will be seeing more of this fabric's abilities (xGMI, or layers of fabric on chiplet designs...?). Additionally, I am sure there have been further advancements/refinements to the HBCC touching the fabric.
AMD's goal is heterogeneous computing; unified memory (of any sort) is part of that goal.
Back to gaming.
The problem with SLI/Crossfire in gaming was not the PCI Express bus & two video cards. It was that each GPU had its own memory pool. Thus games were stuck with alternate-frame rendering (AFR) and its issues/problems, because there was no unified memory between the GPUs.
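As a toy illustration of why separate pools hurt AFR (a deliberately simplified sketch, not how any real driver works): every resource has to exist once per GPU, and anything one frame produces for the next has to cross the link before the other GPU can start:

```cpp
// Toy sketch (not any real driver's implementation) of why per-GPU memory
// pools hurt alternate-frame rendering: every resource exists once per GPU,
// and frame-to-frame dependencies must be copied across the link.
#include <vector>
#include <cstdio>

struct Resource { int id; };  // stand-in for a texture/buffer

struct GpuPool {
    std::vector<Resource> copies;  // this GPU's private copy of every resource
};

int main() {
    const int num_gpus = 2;
    const int num_resources = 4;

    std::vector<GpuPool> pools(num_gpus);
    for (auto& pool : pools)                       // duplicate everything per GPU
        for (int r = 0; r < num_resources; ++r)
            pool.copies.push_back({r});

    for (int frame = 0; frame < 4; ++frame) {
        int gpu = frame % num_gpus;                // AFR: GPUs alternate frames
        std::printf("frame %d rendered on GPU %d\n", frame, gpu);
        // Any reused render target or buffer from frame N now has to be
        // transferred into the other GPU's pool before frame N+1 can begin.
    }
}
```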
Now, what if all the GPUs in a chiplet design (not MCM) use the same memory pool (rather than each having its own separate pool) at normal memory bandwidth & speeds, and there is a dedicated fabric/substrate (perhaps even a unified L2 cache?)? Then how is IF's bandwidth a problem? Yes, latencies matter, but they can also be tailored to the application.
Theory: if you sandwiched two GPUs back to back with IF in the middle, arbitrating...