AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion


One quote from that article that I wish there had been more elaboration on was the following:
“We’re going down that path on the CPU side, and I think on the GPU we’re always looking at new ideas. But the GPU has unique constraints with this type of NUMA [non-uniform memory access] architecture, and how you combine features… The multithreaded CPU is a bit easier to scale the workload. The NUMA is part of the OS support so it’s much easier to handle this multi-die thing relative to the graphics type of workload.”

In particular, which features, and what constrains them? There are architectural elements that are part of the graphics context or fixed-function pipeline and are not relevant to compute workloads, whose contexts are stripped down and which try to keep as much as possible accessible via memory pointers and explicitly addressed locations. The modes and output paths of various engines, or the metadata they generate, are not consistently accessible or addressed in that manner, or they have modes that do not flow back to memory in a consistent fashion. Since their function presumes being unique or living in a single memory pool, coherence and consistency need more explicit management with higher overheads (e.g. flushes, device stalls). There are a limited number of meta-level functions that CPUs have, such as TLB or translation cache updates, which can be dangerous if mismanaged and can involve wide-ranging stalls, and which the OS has significant infrastructure, or sole authority, to manage.

In that regard, it's not clear that the problem is NUMA so much as that the architectures we know of currently have undefined behavior if the memory hierarchy is no longer unified.


It might actually make sense along the lines of Rome. That's more or less what Vega already does, just within a single chip, using IF internally to connect the GPU to the memory controllers. The concern is getting the singular front end running as fast as possible and communicating with the CUs over IF efficiently, while using an older node for cost.
The front end doesn't appear to communicate with the CUs over IF, or if some data does go that route the IF's presence is a coincidental link in a round-trip to memory. Vega is a single-chip GPU, and the IF is an intermediary interconnect between the core GPU area and memory controllers instead of whatever bespoke links between the GPU units and memory controllers existed prior.
 
We've discussed multi-die GPUs quite a lot, mostly in neighboring threads:

The AMD Execution Thread [2018]#92
AMD: Navi Speculation, Rumours and Discussion #519
AMD: Navi Speculation, Rumours and Discussion #533
etc.


In short: Nvidia has supercomputers capable of running a nearly-realtime simulation of a VHDL model, which they used to test an experimental multi-GPU design. The results were published in June 2017 as an Nvidia research paper:

MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability
http://research.nvidia.com/publication/2017-06_MCM-GPU:-Multi-Chip-Module-GPUs

The research studies an optimized multi-chip GPU, featuring L1.5 caches, modified virtual page mapping algorithms, and a global thread scheduler, all to improve data locality and minimize far memory accesses. They conclude that this optimized multi-chip GPU design:
1) can achieve 90-95% performance of a comparable monolithic GPU design when using 768 GByte/s data links;
2) would be fully transparent to the programmer, working like a single monolithic GPU.
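
As I read the paper, one of the locality mechanisms is a first-touch style page placement: a virtual page gets mapped into the memory of whichever GPU module first touches it, so later accesses from that module stay local. The toy host-side sketch below is just my own illustration of that idea (the page size, module count, and fake access trace are all made up, not taken from the paper's simulator):

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Toy model: assume a 4-module MCM GPU and 64 KiB pages (illustrative values only).
constexpr int      kModules  = 4;
constexpr uint64_t kPageSize = 64 * 1024;

struct FirstTouchMapper {
    std::unordered_map<uint64_t, int> pageToModule;  // virtual page -> owning module
    uint64_t localAccesses = 0, remoteAccesses = 0;

    // Record an access from 'module' to virtual address 'addr'.
    void access(int module, uint64_t addr) {
        uint64_t page = addr / kPageSize;
        auto it = pageToModule.find(page);
        if (it == pageToModule.end()) {
            // First touch: place the page in the memory local to this module.
            pageToModule.emplace(page, module);
            ++localAccesses;
        } else if (it->second == module) {
            ++localAccesses;          // page already lives next to this module
        } else {
            ++remoteAccesses;         // page owned by another module -> inter-die traffic
        }
    }
};

int main() {
    FirstTouchMapper mapper;
    // Fake trace: each module mostly streams through its own 16 MiB slice,
    // with an occasional access into a neighbour's slice.
    for (int m = 0; m < kModules; ++m) {
        uint64_t base = uint64_t(m) * 16 * 1024 * 1024;
        for (uint64_t off = 0; off < 16 * 1024 * 1024; off += kPageSize) {
            mapper.access(m, base + off);
            if (off % (1024 * 1024) == 0)                 // rare "remote" touch
                mapper.access((m + 1) % kModules, base + off);
        }
    }
    std::cout << "local: " << mapper.localAccesses
              << "  remote: " << mapper.remoteAccesses << "\n";
}
```

With a mostly-private trace like that, the vast majority of accesses end up local, which is the whole point of the scheme.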


Another Nvidia research paper from October 2017 studies requirements for a NUMA-aware GPU:

Beyond the Socket: NUMA-Aware GPUs
http://research.nvidia.com/publication/2017-10_Beyond-the-socket

They conclude that multi-slot (or multi-socket) memory access can be improved with:
1) bi-directional inter-die links with dynamic reconfiguration between read and write lanes, and
2) improved cache policies with L2 cache coherency protocols and dynamic partitioning between local and remote data.
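
Point 1 is interesting because a fixed split of link lanes into read and write directions wastes capacity whenever traffic is lopsided. Here's a very rough sketch of the idea, with a made-up lane count and a simple proportional policy rather than whatever mechanism the paper actually proposes:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Toy model: a 16-lane inter-die link where lanes can be re-pointed in either
// direction at an epoch boundary (lane count and policy are illustrative).
constexpr int kTotalLanes = 16;

struct LinkConfig {
    int readLanes;   // lanes carrying remote->local (read return) traffic
    int writeLanes;  // lanes carrying local->remote (write) traffic
};

// Re-split the lanes in proportion to the bytes observed last epoch,
// keeping at least one lane per direction so neither side starves.
LinkConfig reconfigure(uint64_t readBytes, uint64_t writeBytes) {
    uint64_t total = readBytes + writeBytes;
    if (total == 0) return {kTotalLanes / 2, kTotalLanes / 2};
    int r = int((readBytes * kTotalLanes) / total);
    r = std::clamp(r, 1, kTotalLanes - 1);
    return {r, kTotalLanes - r};
}

int main() {
    // Read-heavy epoch (e.g. mostly fetching remote data):
    LinkConfig a = reconfigure(12'000'000, 1'000'000);
    // Balanced epoch:
    LinkConfig b = reconfigure(5'000'000, 5'000'000);
    std::cout << "read-heavy: " << a.readLanes << "R/" << a.writeLanes << "W\n"
              << "balanced:   " << b.readLanes << "R/" << b.writeLanes << "W\n";
}
```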


So, everything is possible to implement given sufficient engineering resources. I wouldn't take these interviews too seriously - one day some AMD guy shares a well-informed personal opinion that an explicit multi-chip GPU would be too complex for game developers to support, blah-blah-blah; the next day AMD introduces their own multi-chip GPU and the same guy narrates how they worked hard to implement the exact same features he previously said were not really feasible...
 
Past discussions centered on research into MCM GPUs running compute workloads, which is consistent with the RTG interview mentioning that compute can more readily benefit. Which graphics workloads were evaluated?
There are additional architectural changes made to make parts of the hierarchy aware of the non-unified memory subsystem, and some leverage behaviors unique to Nvidia's execution model, such as the synchronization method for the remote L1.5 caches, which synchronize at kernel execution boundaries that also synchronize the local L1. GCN's write-through behavior is very different and would make Nvidia's choice significantly less effective.
 
We've discussed multi-die GPUs quite a lot, mostly in neighboring threads:

The AMD Execution Thread [2018]#92
AMD: Navi Speculation, Rumours and Discussion #519
AMD: Navi Speculation, Rumours and Discussion #533
etc.

In short: Nvidia has supercomputers capable of running a nearly-realtime simulation of a VHDL model, which they used to test an experimental multi-GPU design. The results were published in June 2017 as an Nvidia research paper:

From "stripped" conclusion of these two studies it seems that making mcm gpu is not problem at all. But what type of workloads they did test ? HPC type and compute workloads which is in line with Wang's own words about MCM GPU.

But games and game API's is another story....
I didn't notice, that any of these studies tested such MCM approach in games.

cheers
 
The moment game developers target tiled mobile architectures, pulling out an MCM GPU shouldn't really be a problem. Driving Vulkan for such a current tiled chip is so excessively explicit, it would almost work for automatic multi-GPU as well (apart from losing the unified memory pool).
 
The moment game developers target tiled mobile architectures, pulling out an MCM GPU shouldn't really be a problem. Driving Vulkan for such a current tiled chip is so excessively explicit, it would almost work for automatic multi-GPU as well (apart from losing the unified memory pool).

Good to know. Everything is so clear and easy to do, then, so I guess nothing prevents AMD/RTG from stealing Nvidia's crown and regaining market share next year ;-)
 
The only thing preventing SLI/Crossfire, or even dual-GPU cards, from scaling and running efficiently... was the lack of unified memory.

Everyone here should know this, and the reason multi-GPU never really panned out for gaming (even for people with deep pockets) is that it wasn't worth the constant headache. It wasn't native.

Now, go all the way back to when AMD bought ATi and look at AMD's heterogeneous goals... and what they are going to announce at CES: their APU. Study that for a long, hard while and put the pieces together. What Dr. Su and P-Master said is that the chiplet design all hangs on Infinity Fabric, and that IF 2.0 is much more robust.

So much so that AMD has achieved unified memory with anything attached to the fabric, namely through the HBCC. Think on that for just a moment...


I think we may see a multi-gpu chip (for gaming) from AMD in the next year. There is nothing technically preventing this, given what we now know.
 
Well, that's just a no. Unified memory is already here, and I must have missed the news about the huge efficiency of sticking, say, 2 Vegas in CF.
I'm inclined to say "that's just a no" too, but for a different reason. To my understanding, only NVLink currently offers "unified memory" on the discrete side of things, and even then it's still far too slow to access anything in another card's memory. Infinity Fabric will offer the same for AMD in Vega 20, but Vega 10, being left without external IF links, is out.
 
I'm inclined to say "that's just a no" too, but for a different reason. To my understanding, only NVLink currently offers "unified memory" on the discrete side of things, and even then it's still far too slow to access anything in another card's memory. Infinity Fabric will offer the same for AMD in Vega 20, but Vega 10, being left without external IF links, is out.
It will be interesting to see whether the quick revisions of PCIe allow enough bandwidth to pull it off, with PCIe 4 and possibly 5 further raising the limits.
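
For a rough sense of scale (approximate peak numbers, per direction, from the public specs):

```cpp
#include <cstdio>

// Back-of-the-envelope per-direction bandwidth of a PCIe x16 link.
// Gen3 runs 8 GT/s per lane with 128b/130b encoding; gen4 and gen5 double
// the signalling rate each generation.  Figures are approximate peaks.
static double pcie_x16_GBps(double gtPerSec) {
    const double lanes = 16.0;
    const double encoding = 128.0 / 130.0;          // 128b/130b line code
    return gtPerSec * encoding * lanes / 8.0;       // GT/s -> GB/s
}

int main() {
    std::printf("PCIe 3.0 x16: %5.1f GB/s per direction\n", pcie_x16_GBps(8.0));
    std::printf("PCIe 4.0 x16: %5.1f GB/s per direction\n", pcie_x16_GBps(16.0));
    std::printf("PCIe 5.0 x16: %5.1f GB/s per direction\n", pcie_x16_GBps(32.0));

    // Vega 10 local memory for comparison: 2 HBM2 stacks, 2048-bit bus,
    // ~1.89 Gb/s per pin -> roughly 484 GB/s.
    const double hbm2 = 2048.0 * 1.89 / 8.0;
    std::printf("Vega 10 HBM2: %5.1f GB/s\n", hbm2);
}
```

So even a PCIe 5.0 x16 link is roughly an order of magnitude below Vega 10's local HBM2 bandwidth, which is why the inter-die links in the MCM papers are sized in the hundreds of GB/s.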
 
Didn't dual-GPU cards have the same problem as CF/SLI setups, even though all the memory is on one board?

Yes, but that was not unified memory. Each GPU had access to its own pool of VRAM.

I don't believe IF will solve multi-GPU for gaming like magic; a lot of work has to be done on the driver side IMO, and I don't see RTG having the manpower/focus to do that.
 
Wouldn't multiple GPUs with pooled memory/resources have to be presented to the OS as a separate "virtual" device to work properly? That would probably require a significant revision of the WDDM foundation.
 
Yes, but that was not unified memory. Each GPU had access to its own pool of VRAM.

I don't believe IF will solve multi-GPU for gaming like magic; a lot of work has to be done on the driver side IMO, and I don't see RTG having the manpower/focus to do that.

This "Vega 2 as a possible dual-GPU" argument is going in circles.
Unified memory means that both GPUs use and share the same memory pool. Until now, no uarch was able to maintain that claim. (NVLink is not the same thing as Infinity Fabric.) Each GPU doesn't have its own memory; they share the same memory.

With true unified memory, there is no need for "gaming aware" drivers, etc. It is native.
 
This "Vega 2 as a possible dual-GPU" argument is going in circles.
Unified memory means that both GPUs use and share the same memory pool. Until now, no uarch was able to maintain that claim. (NVLink is not the same thing as Infinity Fabric.) Each GPU doesn't have its own memory; they share the same memory.

With true unified memory, there is no need for "gaming aware" drivers, etc. It is native.
Other than speed (which inter-chip IF won't fix either, I think), how exactly can NVLink not maintain that claim? It offers a cache-coherent shared memory pool between two or more GPUs, and even a CPU if you want.
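
For reference, this is roughly how that shared pool looks from the CUDA side: enable peer access between the devices (over NVLink where it exists, otherwise PCIe) and an allocation made on one GPU becomes addressable from the other. A minimal sketch, assuming two visible devices and with most error handling omitted:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) {
        std::printf("Need at least two GPUs for this sketch.\n");
        return 0;
    }

    // Check whether device 0 can directly address device 1's memory and vice versa.
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    std::printf("peer access 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);
    if (!canAccess01 || !canAccess10) return 0;

    // Map each device's allocations into the other's address space.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Allocate on device 1, then write into it while device 0 is current;
    // with peer access enabled the same pointer could also be read or written
    // directly by kernels running on device 0.
    cudaSetDevice(1);
    float* remoteBuf = nullptr;
    cudaMalloc((void**)&remoteBuf, 1024 * sizeof(float));

    cudaSetDevice(0);
    float host[1024] = {};
    cudaMemcpy(remoteBuf, host, sizeof(host), cudaMemcpyDefault);
    cudaDeviceSynchronize();
    std::printf("wrote to device 1's memory from device 0's context\n");

    cudaSetDevice(1);
    cudaFree(remoteBuf);
    return 0;
}
```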
 
"given sufficient engineering resources"

That hasn't been RTG's motto for years now, it seems...
Aside from that, there are decisions to be made about which direction to devote sufficient resources to, unless the argument is that it's merely a matter of devoting sufficient resources to every direction they could possibly go in.
Success or market adoption is not guaranteed, and a workable direction can still end up abandoned even with sufficient resources.

AMD thus far seems to be open to a direction that is conceptually easier to implement and lower-risk, and hasn't fleshed out future prospects as fully as some competing proposals. Those competing proposals have laid out more architectural changes than AMD has, and yet more would be needed to go as far as AMD says is necessary for MCM graphics in gaming.

The only thing preventing SLI/Crossfire, or even dual-GPU cards, from scaling and running efficiently... was the lack of unified memory.
That's one of many barriers to adoption. Being able to address memory locations consistently doesn't resolve the need for sufficient bandwidth to/from them, or the complexities of parts of the architecture that do not view memory in the same manner as the more straightforward vector memory path.

Now, go all the way back to when AMD bought ATi and look at AMD's heterogeneous goals... and what they are going to announce at CES: their APU. Study that for a long, hard while and put the pieces together. What Dr. Su and P-Master said is that the chiplet design all hangs on Infinity Fabric, and that IF 2.0 is much more robust.
Infinity Fabric is an extension of coherent Hypertransport, and the way the data fabric functions includes probes and responses meant to support the kinds of shared multiprocessor cache operations of x86. As of right now, GCN caches are far more primitive and seem to only use part of the capability. There are a number of possibilities that fall short of full participation in the memory space.
The other parts of the GPU system that do not play well with the primary memory path are not helped by IF. It is a coherent fabric meant to connect coherent clients; it cannot make non-coherent clients coherent. The fabric also doesn't make the load balancing and control/data hazards of multi-GPU go away.

So much so that AMD has achieved unified memory with anything attached to the fabric, namely through the HBCC. Think on that for just a moment...
The HBCC with its translation and migration hardware is a coarser way of interacting with memory. One thing I do recall, at least early on, is that while it was more automatic in its management of resources, the driver-level management of NVLink page migration had significantly higher bandwidth. I have not seen a comparison in more modern times.
HBCC may be necessary for the kind of coarse workloads GPUs tend to be given, yet it's something not cited as being necessary on the CPU architectures that have had scalable coherent fabrics for a decade or more. The CPU domain is apparently considered robust enough to interface more directly, whereas the other domain is trying to automate a fair amount of special-handling without changing the architecture that needs hand-holding.
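
The sort of software-directed migration I'm thinking of is what CUDA exposes on top of managed memory: placement advice plus explicit prefetches, as opposed to HBCC-style transparent demand paging. A rough sketch, with illustrative sizes and no error handling:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u * 1024 * 1024;   // 256 MiB working set (illustrative)
    int device = 0;
    cudaSetDevice(device);

    // Managed (unified) allocation: accessible from CPU and GPU, migrated in pages.
    float* data = nullptr;
    cudaMallocManaged((void**)&data, bytes);

    // Touch it on the host first, so the pages initially live in system memory.
    size_t count = bytes / sizeof(float);
    for (size_t i = 0; i < count; ++i) data[i] = 1.0f;

    // Software-directed placement: tell the driver where the data should live
    // and migrate it up front, instead of taking demand faults on first GPU access.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, bytes, device, 0);
    cudaDeviceSynchronize();

    std::printf("prefetched %zu MiB to device %d\n", bytes >> 20, device);
    cudaFree(data);
    return 0;
}
```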

I think we may see a multi-gpu chip (for gaming) from AMD in the next year. There is nothing technically preventing this, given what we now know.
There may be changes in the xGMI-equipped Vega that have not been disclosed. Absent those changes, going by what we know of the GCN architecture, significant portions of its memory hierarchy do not work well with non-unified memory. The GPU caches depend on an L2 for managing visibility, which architecturally does not work with memory channels it is not directly linked to, and whose model breaks down if more than one cache can hold the same data.
Elements of the graphics domain are also uncertain. DCC is incoherent within its own memory hierarchy; the Vega whitepaper discusses the geometry engine being able to stream out data using the L2, but in concert with a traditional and incoherent parameter cache. The ROPs are L2 clients in Vega, but they treat the L2 as a spill path that inconsistently receives updates from them.
 
This "Vega 2 as a possible dual-GPU" argument is going in circles.
Unified memory means that both GPUs use and share the same memory pool. Until now, no uarch was able to maintain that claim. (NVLink is not the same thing as Infinity Fabric.) Each GPU doesn't have its own memory; they share the same memory.

With true unified memory, there is no need for "gaming aware" drivers, etc. It is native.


But is unified memory all it takes to make multiple GPUs look like one to the devs? Or will it be like CPUs, where you have to, in a way, optimize your engine to use more threads & co.?

Maybe it deserves a separate topic.
 
AMD thus far seems to be open to a direction that is conceptually easier to implement and lower-risk, and hasn't fleshed out future prospects as fully as some competing proposals.
With HBM and being first to 7nm (above 100 mm²), they seem to be anything but that.
 
With HBM and being first to 7nm (above 100 mm²), they seem to be anything but that.
With regard to HBM, do you mean Fiji being the first product to use the early version of HBM in 2015?
The revised HBM2 showed up in Nvidia's high-end products at roughly the same pace as AMD's, or slightly faster. AMD has not yet released a product using GDDR6.
7nm Vega is an earlier node transition.

Ideas like more memory bandwidth or a better node are conceptually straightforward choices that would work for virtually any architecture. Committing to a direction for architectural choices was my thesis, and that is where RTG has publicly been in a period of retrenchment. Vega 20 with xGMI brings in ISA and platform features meant to bring it up to where Pascal, Volta, or Turing products have been, with math and interconnect features those parts have had for 1-2 years. A significant portion of the architecture has been left unchanged or de-emphasized, such as the graphics architecture that is the focus of the MCM discussion.

The interview with RTG's technical lead is a leading indicator that AMD is considering more firmly bifurcated gaming and professional products, which in this case is a decision to follow the leader. As for multi-GPU or MCM GPUs in particular, RTG is still playing catch-up in many ways with Vega 20, and the next intermediate steps towards that end have not been fleshed out. Perhaps 2019 will bring more disclosures of AMD's goals, and perhaps a sign that it has been rigorously pursuing a direction for some time rather than making a call recently.
 