AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,008
    Likes Received:
    2,518
    Location:
    Well within 3d
    One quote from that article that I wish there had been more elaboration on was the following:
    “We’re going down that path on the CPU side, and I think on the GPU we’re always looking at new ideas. But the GPU has unique constraints with this type of NUMA [non-uniform memory access] architecture, and how you combine features… The multithreaded CPU is a bit easier to scale the workload. The NUMA is part of the OS support so it’s much easier to handle this multi-die thing relative to the graphics type of workload.”

    In particular, which features are involved and what constrains them. There are architectural elements that are part of the graphics context or fixed-function pipeline that are not relevant to compute workloads, whose contexts are stripped down and which try to keep as much as possible accessible via memory pointers and explicitly addressed locations. The modes and output paths of various engines, or the metadata they generate, are not consistently accessible or addressed in that manner, or they have modes that do not flow back to memory in a consistent fashion. Since their function presumes being unique or residing in a single memory pool, coherence and consistency need more explicit management with higher overheads (i.e. flushes, device stalls). There are a limited number of meta-level functions that CPUs have, such as TLB or translation cache updates, which can be dangerous if mismanaged and can involve wide-ranging stalls, and which the OS has significant infrastructure or sole authority to manage.

    In that regard, it's not clear if the problem is NUMA so much as that the architectures we currently know of have undefined behavior once the memory hierarchy is no longer unified.
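    The cost of the explicit coherence management mentioned above can be put in rough numbers. A back-of-envelope sketch (all figures are illustrative assumptions of mine, not measurements): a device-wide flush stalls everything, so its cost is only amortized over large batches of work.

    ```python
    # Toy model: fraction of time spent on useful work when each batch of
    # work must end in a device-wide flush/stall. The flush cost here (20 us)
    # is an illustrative assumption, not a measured figure.

    def effective_utilization(work_us_per_batch, flush_us):
        """Useful-work fraction when every batch ends in a full flush."""
        return work_us_per_batch / (work_us_per_batch + flush_us)

    # Fine-grained sharing (small batches) collapses utilization, which is
    # why state that presumes a single memory pool is costly to keep coherent.
    for batch_us in (10, 100, 1000):
        print(batch_us, round(effective_utilization(batch_us, flush_us=20), 2))
    ```

    The point of the model is only that flush frequency, not flush cost alone, dominates: the same 20 us stall is negligible at coarse granularity and crippling at fine granularity.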


    The front end doesn't appear to communicate with the CUs over IF, or if some data does go that route the IF's presence is a coincidental link in a round-trip to memory. Vega is a single-chip GPU, and the IF is an intermediary interconnect between the core GPU area and memory controllers instead of whatever bespoke links between the GPU units and memory controllers existed prior.
     
    yuri, pharma and Alexko like this.
  2. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    607
    Likes Received:
    411
    Location:
    55°38′33″ N, 37°28′37″ E
    We've discussed multi-die GPUs quite a lot, mostly in neighboring threads:

    The AMD Execution Thread [2018]#92
    AMD: Navi Speculation, Rumours and Discussion #519
    AMD: Navi Speculation, Rumours and Discussion #533
    etc.


    In short: Nvidia has supercomputers capable of running a nearly-realtime simulation of a VHDL model, which they used to test an experimental multi-GPU design. The results were published in June 2017 as a Nvidia research paper:

    MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability
    http://research.nvidia.com/publication/2017-06_MCM-GPU:-Multi-Chip-Module-GPUs

    The research studies an optimized multi-chip GPU, featuring L1.5 caches, modified virtual page mapping algorithms, and a global thread scheduler, all to improve data locality and minimize far memory accesses. They conclude that this optimized multi-chip GPU design:
    1) can achieve 90-95% of the performance of a comparable monolithic GPU design when using 768 GByte/s data links;
    2) would be fully transparent to the programmer, working like a single monolithic GPU.
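    The 90-95% figure is easy to sanity-check with a toy bandwidth model (my own construction, not the paper's methodology): if a fraction of a module's memory traffic is served remotely over the inter-die links, effective bandwidth is a harmonic mix of the local and link paths.

    ```python
    # Toy NUMA bandwidth model. The ~1 TB/s local HBM figure is an
    # illustrative assumption; 768 GB/s is the link bandwidth from the paper.

    def mcm_relative_perf(remote_fraction, local_bw_gbs, link_bw_gbs):
        """Bandwidth-limited performance of one module vs. all-local traffic.

        remote_fraction: share of traffic served by a remote module (0..1)
        local_bw_gbs:    local DRAM bandwidth per module (GB/s)
        link_bw_gbs:     inter-die link bandwidth (GB/s)
        """
        effective = 1.0 / ((1.0 - remote_fraction) / local_bw_gbs
                           + remote_fraction / link_bw_gbs)
        return effective / local_bw_gbs

    # With 768 GB/s links against ~1 TB/s local HBM, even 20% remote traffic
    # keeps a bandwidth-bound kernel above 90% of monolithic throughput.
    print(round(mcm_relative_perf(0.2, 1000, 768), 3))  # → 0.943
    ```

    This is why the paper's locality optimizations (L1.5 caches, page mapping, scheduling) matter: they exist to push `remote_fraction` down far enough that 768 GB/s links suffice.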


    Another Nvidia research paper from October 2017 studies requirements for a NUMA-aware GPU:

    Beyond the Socket: NUMA-Aware GPUs
    http://research.nvidia.com/publication/2017-10_Beyond-the-socket

    They conclude that multi-slot (or multi-socket) memory access can be improved with:
    1) bi-directional inter-die links with dynamic reconfiguration between read and write lanes, and
    2) improved cache policies with L2 cache coherency protocols and dynamic partitioning between local and remote data.
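    Conclusion (1) above can be illustrated with a small sketch (hypothetical, not the paper's algorithm): reassign the lanes of a bidirectional link between read and write duty in proportion to recently observed traffic.

    ```python
    # Hypothetical dynamic read/write lane partitioning for a bidirectional
    # inter-die link. Lane counts and the policy are my own illustration.

    def partition_lanes(total_lanes, read_bytes, write_bytes, min_lanes=1):
        """Split lanes proportionally to recent read/write traffic,
        always keeping at least min_lanes in each direction."""
        total = read_bytes + write_bytes
        if total == 0:
            half = total_lanes // 2
            return half, total_lanes - half
        read_lanes = round(total_lanes * read_bytes / total)
        read_lanes = max(min_lanes, min(total_lanes - min_lanes, read_lanes))
        return read_lanes, total_lanes - read_lanes

    # A read-heavy phase shifts most lanes to the read direction.
    print(partition_lanes(16, read_bytes=900, write_bytes=100))  # → (14, 2)
    ```

    The win over a static half/half split comes from GPU phases being strongly read- or write-biased, so idle lanes in one direction can serve the other.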


    So, everything is possible to implement given sufficient engineering resources. I wouldn't take these interviews too seriously: one day some AMD guy shares a well-informed personal opinion that an explicit multi-chip GPU would be too complex for game developers to support, blah-blah-blah, and the next day AMD introduces their own multi-chip GPU and the same guy narrates how hard they worked to implement the exact same features he previously said were not really feasible...
     
    #5782 DmitryKo, Dec 26, 2018
    Last edited: Jan 5, 2019
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,008
    Likes Received:
    2,518
    Location:
    Well within 3d
    Past discussions centered on research into MCM GPUs running compute workloads, which is consistent with the RTG interview's point that compute can more readily benefit. Which graphics workloads were evaluated?
    There are additional architectural changes made to make parts of the hierarchy aware of the non-unified memory subsystem, and some leverage behaviors unique to Nvidia's execution model, such as the synchronization method for the remote L1.5 caches, which synchronize at the same kernel execution boundaries that synchronize the local L1. GCN's write-through behavior is very different and would make Nvidia's choice significantly less effective.
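    The write-through vs. write-back distinction can be made concrete by counting inter-die write traffic (a toy model of my own, not either vendor's actual protocol): a write-through L1 sends every store across the link, while a write-back cache flushed at kernel boundaries sends each dirty line once.

    ```python
    # Toy traffic counter for remote writes under the two disciplines
    # contrasted above. Addresses are abstract cache-line IDs.

    def remote_writes(trace, policy):
        """Count inter-die write transactions for a store trace.

        'write_through': every store crosses the link immediately (GCN-like).
        'write_back':    each dirty line crosses once, at the kernel boundary
                         (the synchronization point the Nvidia paper exploits).
        """
        if policy == 'write_through':
            return len(trace)
        if policy == 'write_back':
            return len(set(trace))
        raise ValueError(policy)

    trace = [0, 1, 0, 2, 1, 0, 3, 0]   # repeated stores to a few hot lines
    print(remote_writes(trace, 'write_through'))  # → 8
    print(remote_writes(trace, 'write_back'))     # → 4
    ```

    With hot lines rewritten many times per kernel, write-through multiplies link traffic, which is why boundary-synchronized write-back caches are much friendlier to narrow inter-die links.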
     
    iMacmatician likes this.
  4. Rootax

    Regular Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    982
    Likes Received:
    452
    Location:
    France
    "given sufficient engineering resources"

    That's not RTG motto for years now, it seems...
     
    yuri likes this.
  5. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    63
    Likes Received:
    40
    From the condensed conclusions of these two studies, it seems that making an MCM GPU is no problem at all. But what type of workloads did they test? HPC and compute workloads, which is in line with Wang's own words about MCM GPUs.

    But games and game APIs are another story....
    I didn't notice that either of these studies tested such an MCM approach in games.

    cheers
     
  6. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    843
    Likes Received:
    240
    The moment game developers target tiled mobile architectures, pulling out an MCM GPU shouldn't really be a problem. Driving Vulkan for such a current tiled chip is so excessively explicit that it would almost work for automatic multi-GPU as well (apart from losing the unified memory pool).
     
  7. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    63
    Likes Received:
    40
    Good to know, everything is so clear and easy to do, then; I guess nothing prevents AMD/RTG from stealing Nvidia's crown and regaining market share next year ;-)
     
  8. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    105
    Likes Received:
    23
    The only thing preventing SLI/Crossfire, or even dual-GPU cards, from scaling well and efficiently was the lack of unified memory.

    Everyone here should know this; the reason multi-GPU never really panned out for gaming (even for people with deep pockets) is that it wasn't worth the constant headache. It wasn't native.

    Now go all the way back to when AMD bought ATI, look at AMD's heterogeneous goals... and at what they are going to announce at CES: their APU. Study that for a long hard while and put the pieces together on what Dr. Su and P-Master said: the chiplet design all hangs on Infinity Fabric, and IF 2.0 is much more robust.

    So much so that AMD has achieved unified memory with anything attached to the fabric, namely the HBCC. Think on that for just a moment...


    I think we may see a multi-GPU chip (for gaming) from AMD in the next year. There is nothing technically preventing this, given what we now know.
     
  9. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    688
    Likes Received:
    417
    Location:
    Slovenia
    Well, that's just a no. Unified memory is already here, and I seem to have missed the news about the huge efficiency of sticking, say, two Vegas in CF.
     
    del42sa likes this.
  10. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    7,941
    Likes Received:
    1,652
    Location:
    Finland
    I'm inclined to say "that's just a no" too, but for a different reason. To my understanding, only NVLink currently offers "unified memory" on the discrete side of things, and even then it's still far too slow to access anything in another card's memory. Infinity Fabric will offer the same for AMD in Vega 20, but Vega 10, having been left without external IF links, is out.
     
    Silent_Buddha likes this.
  11. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,438
    Likes Received:
    359
    It will be interesting to see if the quick revisions of PCIe allow enough bandwidth to pull it off, with PCIe 4 and possibly 5 further raising the limits.
     
  12. SpaceBeer

    Newcomer

    Joined:
    Apr 15, 2017
    Messages:
    34
    Likes Received:
    14
    Location:
    The Balkans
    Didn't dual-GPU cards have the same problems as CF/SLI setups, even though all the memory is on one board?
     
  13. Rootax

    Regular Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    982
    Likes Received:
    452
    Location:
    France
    Yes, but that was not unified memory; each GPU had access to its own pool of VRAM.

    I don't believe IF will solve multi-GPU for gaming like magic. A lot of work has to be done on the driver side, imo, and I don't see RTG having the manpower/focus to do that.
     
    del42sa likes this.
  14. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,448
    Likes Received:
    334
    Location:
    Varna, Bulgaria
    Wouldn't multiple GPUs with pooled memory/resources have to be presented as a separate "virtual" device to the OS to work properly? That would probably require a significant revision of the WDDM foundation.
     
  15. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    105
    Likes Received:
    23
    This Vega2-as-a-possible-dual-GPU argument is going in circles.
    Unified memory means that both GPUs use and share the same memory pool. Until now, no uarch has been able to maintain that claim. (NVLink is not the same thing as Infinity Fabric.) Each GPU doesn't have its own memory; they share the same memory.

    With true unified memory, there is no need for "gaming-aware" drivers, etc. It is native.
     
    DmitryKo likes this.
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    7,941
    Likes Received:
    1,652
    Location:
    Finland
    Other than speed (which inter-chip IF won't fix either, I think), how exactly can NVLink not maintain that claim? It offers a cache-coherent shared memory pool between two or more GPUs, and even a CPU if you want.
     
    DmitryKo likes this.
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,008
    Likes Received:
    2,518
    Location:
    Well within 3d
    Aside from that, there are decisions to be made about which direction to devote sufficient resources to, unless the argument is that it's merely a matter of devoting sufficient resources to every direction they could possibly go in.
    Success or market adoption is not guaranteed, and even a workable, sufficiently resourced direction can end up abandoned.

    AMD thus far seems open to a direction that is conceptually easier to implement and lower-risk, and hasn't fleshed out future prospects as fully as some competing proposals. Those competing proposals have laid out more architectural changes than AMD has, and yet more would be needed to go as far as AMD says is necessary for MCM graphics in gaming.

    That's one of many barriers to adoption. Being able to address memory locations consistently doesn't resolve the need for sufficient bandwidth to/from them, or the complexities of parts of the architecture that do not view memory in the same manner as the more straightforward vector memory path.

    Infinity Fabric is an extension of coherent Hypertransport, and the way the data fabric functions includes probes and responses meant to support the kinds of shared multiprocessor cache operations of x86. As of right now, GCN caches are far more primitive and seem to only use part of the capability. There are a number of possibilities that fall short of full participation in the memory space.
    The other parts of the GPU system that do not play well with the primary memory path are not helped by IF. It is a coherent fabric meant to connect coherent clients; it cannot make non-coherent clients coherent. The fabric also doesn't make the load balancing and control/data hazards of multi-GPU go away.

    The HBCC with its translation and migration hardware is a coarser way of interacting with memory. One thing I do recall is that, at least early on, while it was more automatic in its management of resources, driver-level management of NVLink page migration had significantly higher bandwidth. I have not seen a more recent comparison.
    HBCC may be necessary for the kind of coarse workloads GPUs tend to be given, yet it's something not cited as being necessary on the CPU architectures that have had scalable coherent fabrics for a decade or more. The CPU domain is apparently considered robust enough to interface more directly, whereas the other domain is trying to automate a fair amount of special-handling without changing the architecture that needs hand-holding.
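    The kind of coarse, automated management described above can be sketched as a simple hot-page migration policy. This is entirely my own illustration of the concept, not AMD's actual HBCC mechanism; the class name, threshold, and capacity are hypothetical.

    ```python
    # Hypothetical sketch of coarse-grained page migration: pages touched
    # often enough are migrated from far (host/remote) memory into local HBM.

    class PageMigrator:
        def __init__(self, local_capacity, hot_threshold=4):
            self.local = set()       # pages resident in local memory
            self.touches = {}        # access counts per page
            self.capacity = local_capacity
            self.threshold = hot_threshold

        def access(self, page):
            """Record an access; migrate the page once it has become hot.
            Returns True if the access was served from local memory."""
            self.touches[page] = self.touches.get(page, 0) + 1
            if (page not in self.local
                    and self.touches[page] >= self.threshold
                    and len(self.local) < self.capacity):
                self.local.add(page)     # stand-in for a DMA page migration
            return page in self.local

    m = PageMigrator(local_capacity=2)
    hits = [m.access(7) for _ in range(6)]
    print(hits)  # → [False, False, False, True, True, True]
    ```

    The coarseness is the point: the hardware reacts to whole-page access patterns after the fact, rather than requiring the clients themselves to participate in a fine-grained coherence protocol.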

    There may be changes in the xGMI-equipped Vega that have not been disclosed. Absent those changes, going by what we know of the GCN architecture significant portions of its memory hierarchy do not work well with non-unified memory. The GPU caches depend on an L2 for managing visibility that architecturally does not work with memory channels it is not directly linked to, and whose model doesn't work if more than one cache can hold the same data.
    Elements of the graphics domain are also uncertain. DCC is incoherent within its own memory hierarchy; the Vega whitepaper discusses the geometry engine being able to stream out data through the L2, but in concert with a traditional, incoherent parameter cache. The ROPs are L2 clients in Vega, but they treat the L2 as a spill path that receives updates from them inconsistently.
     
  18. Rootax

    Regular Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    982
    Likes Received:
    452
    Location:
    France

    But is unified memory all it takes to make multiple GPUs look like one to the devs? Or will it be like CPUs, where you have to optimize your engine to use more threads, etc.?

    Maybe it deserves a separate topic.
     
  19. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,407
    Likes Received:
    4,057
    With HBM and being first on 7nm (above 100mm²), they seem to be everything but that.
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,008
    Likes Received:
    2,518
    Location:
    Well within 3d
    With regard to HBM, do you mean Fiji being the first product to use the early version of HBM in 2015?
    The revised HBM2 appeared in Nvidia's high-end products at roughly the same pace as AMD's, or slightly faster. AMD has not yet released a product using GDDR6.
    7nm Vega is an earlier node transition.

    Ideas like more memory bandwidth or a better node are conceptually straightforward choices that would work for virtually any architecture. Committing to a direction in architectural choices was my thesis, and that is where RTG has publicly been in a period of retrenchment. Vega 20 with xGMI brings in ISA and platform features meant to bring it up to where Pascal, Volta, or Turing products have been, with math or interconnect features that have already been present for one to two years. A significant portion of the architecture has been left unchanged or de-emphasized, such as the graphics architecture that is the focus of the MCM discussion.

    The interview with RTG's technical lead is a leading indicator of AMD considering more firmly bifurcated gaming and professional products, which in this case is a decision to follow the leader. In other respects, for multi-GPU or MCM GPUs in particular, RTG is still playing catch-up in many ways with Vega 20, and the next intermediate steps towards that end have not been fleshed out. Perhaps 2019 will bring more disclosures on AMD's goals, and perhaps a sign that it has been rigorously pursuing a direction for some time rather than making the call recently.
     
    Tkumpathenurpahl, Lightman and yuri like this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.