One quote from that article that I wish had been elaborated on further was the following:
“We’re going down that path on the CPU side, and I think on the GPU we’re always looking at new ideas. But the GPU has unique constraints with this type of NUMA [non-uniform memory access] architecture, and how you combine features… The multithreaded CPU is a bit easier to scale the workload. The NUMA is part of the OS support so it’s much easier to handle this multi-die thing relative to the graphics type of workload.”
In particular: which features, and what constrains them? There are architectural elements in the graphics context and fixed-function pipeline that are not relevant to compute workloads, whose contexts are stripped down and keep as much state as possible accessible via memory pointers and explicitly addressed locations. The modes and output paths of the various graphics engines, and the metadata they generate, are not consistently accessible or addressed that way, or they have modes whose results do not flow back to memory in a consistent fashion. Because their function presumes being unique or residing in a single memory pool, coherence and consistency need more explicit management, with higher overheads (e.g. flushes, device stalls). There are also a limited number of meta-level functions on the CPU side, such as TLB or translation-cache updates, that can be dangerous if mismanaged and can involve wide-ranging stalls; the OS has significant infrastructure, and in some cases sole authority, to manage these.
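As a loose illustration of that "explicit management" point, here's a minimal CUDA sketch (vendor-neutral in spirit, even though the thread is about AMD; the kernel and buffer names are hypothetical). Compute state is reachable through plain pointers into a single managed pool, but making the results visible to another agent still requires an explicit device-wide stall rather than falling out of the memory hierarchy automatically:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical producer kernel: writes results through an ordinary
// pointer into a single shared/managed allocation.
__global__ void produce(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i * 2;
}

int main() {
    const int n = 1024;
    int *buf = nullptr;
    cudaMallocManaged(&buf, n * sizeof(int));  // one virtual memory pool

    produce<<<(n + 255) / 256, 256>>>(buf, n);

    // Without this device-wide stall, the CPU (or another device) has
    // no guarantee of observing the kernel's writes -- the kind of
    // flush/stall overhead the paragraph above refers to.
    cudaDeviceSynchronize();

    printf("buf[3] = %d\n", buf[3]);  // expect 6
    cudaFree(buf);
    return 0;
}

The fixed-function state the paragraph mentions has no equivalent of this pointer-plus-fence model at all, which is the harder half of the problem.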
In that regard, it's not clear that the problem is NUMA per se, so much as that the architectures we currently know of have undefined behavior once the memory hierarchy is no longer unified.
The front end doesn't appear to communicate with the CUs over IF, or if some data does go that route, the IF's presence is a coincidental link in a round trip to memory. Vega is a single-chip GPU, and the IF is an intermediary interconnect between the core GPU area and the memory controllers, replacing whatever bespoke links between the GPU units and memory controllers existed before.

Something along the lines of Rome might actually make sense, though. That's more or less what Vega already does, just within a single chip: using IF internally to connect the GPU to the memory controllers. The concern would be getting the singular front end running as fast as possible, and communicating with the CUs over IF efficiently, while using an older node to keep costs down.