AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,807
    Likes Received:
    2,072
    Location:
    Germany
    Considering Nvidia is competing with AMD's current 4k-ALU offerings using their own 2.5k-ALU parts, and would be able to leverage the same level of ALU improvement, I guess the above Navi proposal by el etro would not be good enough for the 2018/9 timeframe.
     
    Ext3h likes this.
  2. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    What makes you go for four 36CU chiplets as opposed to, say, two 64CU chiplets?

    I'm concerned with the power consumption from all of the IF lanes necessary to pull that off. It feels simpler to just have two chiplets and a ton of IF lanes back & forth between those two (assuming you can get enough bandwidth in the first place).

    Back when Nvidia wrote that paper on MCM-style graphics cards, I believe the conclusion was that the MCM-style only made sense over a traditional monolithic GPU if you were using that MCM technique to make something so big that it wasn't physically possible to pull it off monolithically.

    So that generally means you're getting the most bang for buck if you take two of your biggest die and duct tape them together.

    Just as a friendly note, the "NexGen" thing was a typo that got corrected in subsequent presentations.

    I remember googling "NexGen Memory" when I first saw it, so I can empathize, lol.

    Do you have a source for the existence of an SoC variant of GloFo's 7LP?

    I thought that it was just going to be the high perf 7LP "Leading Performance" (definitely not "Low Power", lol...) version initially.

    https://www.anandtech.com/show/1155...nm-plans-three-generations-700-mm-hvm-in-2018

    Also, note that Anandtech reported that GloFo was bragging about their die size limit going up. It'd be weird to do that if you didn't expect your marquee customer to use that extra die size headroom (i.e. no tiny 36CU chips):

    "GlobalFoundries also expects to increase the maximum die size of 7LP chips to approximately 700 mm², up from the roughly 650 mm² limit for ICs the company is producing today. In fact, when it comes to the maximum die sizes of chips, there are certain tools-related limitations."
     
    DavidGraham and Lightman like this.
  3. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    One GDDR6 channel per module. This does unfortunately mean that the fabric speed needs to be built for the largest possible configuration, but everything down to the L2 size can be scaled down to a single module.

    Epyc has proven that the fabric is fast enough to provide shared L3 across modules, so it should also be fast enough to provide a distributed L2, ROP and memory controller on Navi. With the added benefit of also getting independent memory channels.
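
    As a rough sketch of what a statically distributed L2/memory controller could look like (the interleaving granularity, module count and mapping here are my own assumptions, purely illustrative, not anything AMD has disclosed):

        # Illustrative only: static address interleaving across four GPU modules,
        # with each module owning one memory channel and the L2 slices in front of it.
        STRIPE_BYTES = 256   # assumed interleave granularity
        NUM_MODULES = 4      # assumed four-module configuration

        def home_module(phys_addr: int) -> int:
            """Return which module's L2 slice / memory channel services this address."""
            return (phys_addr // STRIPE_BYTES) % NUM_MODULES

        # Consecutive 256 B stripes rotate across modules, so a large linear buffer
        # spreads its bandwidth demand over all four channels:
        for addr in range(0, 8 * STRIPE_BYTES, STRIPE_BYTES):
            print(hex(addr), "-> module", home_module(addr))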

    Sent from my ONEPLUS A3003 using Tapatalk
     
  4. el etro

    Newcomer

    Joined:
    Mar 9, 2014
    Messages:
    95
    Likes Received:
    12
    I think it can be an MCM mounted on an interposer, with the interposer serving to connect HBM3 memory to the system.

    Because a 64CU chip would be over 100W of GPU power consumption, making it hard to "tame" with the SoC version of the process. It would be Polaris and Vega all over again: uncompetitive on a performance-per-watt basis. It will be easier to shrink the 232mm² 14LPP Polaris 10 and take a four-die Epyc approach.

    Sub-75W Nvidia GPUs (GP107 and GP108) are fabbed at Samsung 14LPP, so why aren't GP106/104/102 fabbed there too?

    The marketing material linked in the post says it: an SoC version for AMD and an SHP/HPC version for IBM.

     
  5. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,258
    Likes Received:
    1,948
    Location:
    Finland
    Of course not; using an interposer automatically means it's an MCM (but being an MCM doesn't necessarily mean you're using an interposer).
     
  6. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    That's an interesting point. I forgot about that.

    Did we ever get any substantial analysis confirming that was a functional improvement to GP107 & GP108 as opposed to, perhaps, just a measure to relieve TSMC's supply constraints?

    Yes, I see the image you originally linked, but I'm not familiar with it. Not knowing more, that image might as well be a random graph with the names "14nm", "7nm SoC" and "7nm HPC" next to some lines.

    I'm sorry to be suspicious, but do we have a source that officially ties that graph to GloFo's roadmap?

    I don't mean to be confrontational, I'm just not familiar with this image and I'd love the opportunity to add it to my collection of knowledge (we all come here to learn, right? :p).

    You think Navi's "next gen memory" meant GDDR6? Whew, that'd be brutal on AMD's public image (but I can't deny that it's a plausible possibility).

    By "single channel", I'm assuming you mean a 128-bit config that's "half" as wide as Polaris 10's 256-bit config (I might be misinterpreting you).

    At 128-bit, you'd need 16 Gbps GDDR6 to equal Polaris 10's 256 GB/s of bandwidth. 16 Gbps is the long term goal for GDDR6; we won't get it for a long time.

    And even assuming all of that works, you're still left with 16 GDDR6 chips on a single graphics card, equivalent to a 512-bit config. No way AMD wants a Hawaii-esque repeat.
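
    For reference, the back-of-the-envelope math behind those numbers (data rates are per-pin; the module widths are just the ones discussed above):

        # Peak GDDR bandwidth: bus width (bits) * per-pin rate (Gbps) / 8 bits per byte.
        def bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
            return bus_width_bits * data_rate_gbps / 8

        print(bandwidth_gbs(256, 8.0))    # Polaris 10: 256-bit GDDR5 @ 8 Gbps = 256 GB/s
        print(bandwidth_gbs(128, 16.0))   # one module: 128-bit GDDR6 @ 16 Gbps = 256 GB/s

        # Four 128-bit modules aggregate to 512 bits; GDDR6 packages are x32,
        # so that's 512 / 32 = 16 memory chips on the card.
        print(512 // 32)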

    So I'm not sure that GDDR6 would be the right choice for a large MCM-style GPU with 400-500+ mm² of total die size. That's HBM's arena.
     
  7. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    I was actually thinking along the lines of 64 or 128 bit per module, depending on the memory interface used. At 128 bit, that would be a full 512-bit interface in a full 4-module configuration.

    "Channels", because this no longer behaves like a single, monolithic 512-bit interface.

    Not sure if a split memory controller is actually beneficial for graphics workloads, or if it would cause issues.

    The "problem" is what to do if one channel is stalled serving a small buffer that doesn't span multiple controllers. On the positive side, such an access would actually have been wasteful on a wider interface, due to command overhead versus burst transfer. But being stalled on part of a larger buffer, which would have been more or less a single burst on a wider interface, sounds like it would complicate scheduling. For better or for worse.

    With a 64-bit channel width and GDDR6 (burst length 16), we would still have a burst size of 1 kbit (128 bytes), or 2 kbit (256 bytes) at 128 bit.
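
    The burst-size arithmetic, for reference (BL16 is from the GDDR6 spec; the channel widths are the per-module ones proposed above):

        # Burst size = channel width * burst length; GDDR6 uses a burst length of 16.
        BURST_LENGTH = 16

        def burst_size_bits(channel_width_bits: int) -> int:
            return channel_width_bits * BURST_LENGTH

        for width in (64, 128):
            bits = burst_size_bits(width)
            print(f"{width}-bit channel: {bits} bits = {bits // 8} bytes per burst")
        # -> 64-bit: 1024 bits (128 B); 128-bit: 2048 bits (256 B)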
     
  8. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Are you sure? I remember reading somewhere that an MCM was a specific packaging type. Personally I informally call interposer-based modules MCMs as well, but when trying to be accurate I stick to MCM for "actual MCM" modules. I could be wrong though.
     
  9. bazooka_penguin

    Joined:
    Oct 21, 2017
    Messages:
    3
    Likes Received:
    1
    Wouldn't it be better to use the smallest possible shader configurations? The RX460 has the best performance per teraflop of the current generation. MCM seems like it would be best used to overcome the efficiency lost to feeding an excessive number of shaders, i.e. to maximize shader utilization, rather than just to build up massive GPUs, though I guess you can do both.
     
  10. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    MCM is a TLA for "multi-chip module"; it doesn't say anything about how the multi-chipping is accomplished... whether it's an interposer, or sticking several chips onto a PCB substrate, as has usually been the norm. Not sure if chip stacking using wirebonding qualifies under the MCM moniker, but logically it should. :p
     
    Cat Merc likes this.
  11. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,258
    Likes Received:
    1,948
    Location:
    Finland
    MCM is just a generic term for "multi-chip module", aka many chips on one substrate.
     
  12. el etro

    Newcomer

    Joined:
    Mar 9, 2014
    Messages:
    95
    Likes Received:
    12

    But it would make the interposer assembly yield worse, due to having many GPUs to attach, as opposed to fewer, bigger GPUs.

    MCMs and interposers are the way AMD found to overcome the trend of each new process node being more and more mobile-focused.

    http://btbmarketing.com/iedm/docs/ (go to IEDM 29-5, Fig. 2)
    I have to check whether the power/performance curves given for the HPC and SoC flavors of GF 7LP were made up by the marketing firm or are really GloFo's work.
     
  13. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,205
    Likes Received:
    600
    Location:
    France
    What they need is an "easy for the driver team" chip... They can't have another Vega in that respect...
     
    silent_guy likes this.
  14. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Likely more problematic than beneficial for a workload. From a hardware and production standpoint it would be beneficial: basically Epyc in regards to cost and scaling. The workload would essentially need to be NUMA-aware, which is more or less what HBCC currently solves for memory management. Given, say, four chips/channels, each could start rasterization at a different corner as opposed to one tile; the 4SE arrangement roughly handles that distribution already. That should reasonably eliminate most of the communication between chips, with the exception of any work heavy on global synchronization or atomics. It still works, just not ideally.
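
    A toy sketch of that kind of screen-space distribution (the tile size and assignment pattern are my own illustration, not AMD's actual rasterizer behaviour):

        # Toy example: distribute rasterization across four chips by screen-space
        # tile, a 2x2 checkerboard loosely analogous to a 4-SE arrangement.
        TILE = 32  # assumed tile size in pixels

        def owning_chip(x: int, y: int) -> int:
            """Map a pixel to one of four chips via a 2x2 tile checkerboard."""
            return (x // TILE) % 2 + 2 * ((y // TILE) % 2)

        # Adjacent tiles land on different chips, so a full-screen draw spreads
        # work across all four with mostly chip-local traffic:
        print(owning_chip(0, 0), owning_chip(32, 0), owning_chip(0, 32), owning_chip(32, 32))
        # -> 0 1 2 3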

    As mentioned previously, MCM just means multiple chips. Personally I've been drawing a line between on-package and interposer based on the level of interconnection. Multiple interposers could exist on one package: pair a chip with a stack of HBM on an interposer, and link those like Epyc. The interposer is the faster solution, but density may be a concern. An Epyc-sized interposer would be an impressive piece of silicon.
     
  15. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    Navi is supposedly being fabbed by TSMC, not GloFo.

    http://www.digitimes.com/news/a20171023PD201.html

    This is a surprise to me. We all knew Navi was officially on 7nm, but not any particular 7nm.

    But GloFo seemed like an obvious choice since they already did Polaris & Vega, and GloFo's first gen 7nm seems strangely well equipped to support a big GPU:
    To me, that spelled "I am preparing to make some big GPUs," but maybe that was off the mark.

    Do we have any other info on Navi's fab?

    EDIT: Just re-read that DigiTimes article and it doesn't explicitly say that Navi will use 2.5D packaging, but it does mention that technique extensively, particularly in connection with TSMC (though they could be talking about GV100).

    Can someone remind me, are advanced 2.5D packaging techniques necessary to support memories like HBM with their fancy TSVs in GPUs?

    If so, that might suggest that Navi uses something like HBM, as opposed to GDDR6. It's a stretch, I know.
     
    #255 ImSpartacus, Oct 23, 2017
    Last edited: Oct 23, 2017
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,258
    Likes Received:
    1,948
    Location:
    Finland
    DigiTimes has its sources, but they've been wrong more than once or twice. I'd take it with a pinch of salt 'til confirmed by other sources.
     
  17. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    240
    No.
    Besides, AMD going TSMC for GPUs sounds silly considering GloFo is increasing Fab8 capacity by 20% next year.
    Feels like IBM-speak.
    Wait, it is IBM-speak!
    Well, it's either interposers or EMIB. HBM interfaces are too wide for an organic substrate.
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    Isn't this presuming some new variation of the fabric's chip-to-chip interconnect, given the bandwidth?
    EPYC's package bandwidth is ~170 GB/s, and there are interface and architectural limits to how much further EPYC can push its intra-package links. Per AMD, its MCM approach incurs 10% overhead due to duplicated logic and the controllers/PHY for the links, and the proposed MCM GPU potentially needs 4-8x more link bandwidth than EPYC.
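
    For a sense of scale (the 4-8x multiplier is from the estimate above; the rest is simple arithmetic):

        # Rough scale of the interconnect problem vs. EPYC's intra-package links.
        EPYC_PACKAGE_BW_GBS = 170  # approximate EPYC package-level link bandwidth

        for multiplier in (4, 8):
            print(f"{multiplier}x EPYC links: ~{EPYC_PACKAGE_BW_GBS * multiplier} GB/s")
        # -> roughly 680 to 1360 GB/s of chip-to-chip bandwidth needed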

    GPU memory controllers and graphics resources already do, generally.
    There's a lot of explicit resource tiling, or internal striping of addresses or buffer formats.
    At the hardware level, the L2 and L2 crossbar would tend to be where there's an attempt to bridge that gap; as the LLC for the chip, that section needs to. However, GPU L2s manage this with static partitioning of slices with known physical assignment to a channel, which is among the lowest-cost methods of achieving this when on-die. That low-overhead choice would need to go away, on top of the ~10%-overhead EPYC-style links needing to be 4-8x more capable.

    In this context, the vendors discussing it treat MCM as the conventional implementation of multiple silicon chips on a plane interfacing with an organic or ceramic substrate that handles signal/power distribution and pinout.
    The vendors called interposer-based integration 2.5D to distinguish the extra benefits and complexity.
    In some ways, from a package's perspective it might appear as if a 2.5D solution is a single-chip module, since there's still a substrate below and the silicon interposer (technically a chip itself) hides the details of the stack.

    The mobile or embedded formats that physically place wire-bonded chips above usually get marketed as something like PoP, to highlight the differences in the Z dimension and in the properties of the chips and their connection to the substrate. It's also useful in allowing meaningful discussion about how it differs from standard methods. MCM may be generic from the dictionary meaning of its individual words, but it gets used as shorthand for what was here first and has built up a body of usage and technique already.
     
    iMacmatician, Grall and Ext3h like this.
  19. el etro

    Newcomer

    Joined:
    Mar 9, 2014
    Messages:
    95
    Likes Received:
    12
    For me, this is a goodbye to the "scalability" thing, and a return to monolithic GPUs. TSMC may have the better jack-of-all-trades process with its 7nm, fabbing everything from phone chips to HPC chips. AMD may have prototyped both versions of Navi, multi-die MCM and monolithic, at GloFo, and come to the conclusion that they would go nowhere using any flavor of GloFo's process. And TSMC 7nm+ is ready for production in 2H 2018 according to the latest roadmaps, so that's a decision that makes everyone but GloFo happy. But Ryzen can still be fabbed at GloFo 7SoC; according to the graph it would still reach huge clockspeeds at lower power consumption (3.7 GHz * 1.3 = 4.8 GHz all-core boost and 4.0 GHz * 1.3 = 5.2 GHz single-core turbo for a theoretical Ryzen 1 8-core fabbed on that process)...
     
  20. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    I think we finally have some ballpark numbers on the cost savings and die size overhead related to a "chiplet" design that could find its way into Navi.

    https://fuse.wikichip.org/news/523/iedm-2017-amds-grand-vision-for-the-future-of-hpc/4/

    It's for Epyc, but I think it's a reasonable benchmark for GPUs as well.
    • A monolithic design would've saved about 9% in total die size compared to a 4-die chiplet (777mm² vs 852mm²).
      • Presumably, this is from all of the overhead needed to connect the chips.
    • A monolithic design would've cost almost 70% more (1/0.59 - 1 = 0.69).
      • From Nvidia's old paper, we know that a monolithic design will handily beat an "equivalent" chiplet design, but with these kinds of savings you can afford to underprice the monolithic design by a wide margin (see the yield sketch below for intuition).
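
    For intuition on where a ~0.6 cost ratio can come from, here's a minimal yield sketch (a simple Poisson yield model with an assumed defect density; AMD's actual 0.59 figure is not derived this way and folds in more than raw yield):

        import math

        # Minimal sketch: why four small dies can cost less than one big one.
        DEFECT_DENSITY = 0.001  # assumed defects per mm^2, chosen for illustration

        def die_yield(area_mm2: float) -> float:
            """Poisson yield model: Y = exp(-D * A)."""
            return math.exp(-DEFECT_DENSITY * area_mm2)

        # Cost per good die is proportional to area / yield.
        monolithic = 777 / die_yield(777)      # one 777 mm^2 die
        chiplet = 4 * 213 / die_yield(213)     # four 213 mm^2 dies (852 mm^2 total)

        print(f"chiplet / monolithic cost ratio: {chiplet / monolithic:.2f}")
        # -> about 0.62 with this assumed defect density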

    Short of achieving Nvidia-tier efficiency, AMD may be able to brute-force their way to parity (or near parity) by going wider and slower without bloating their total die costs.

    Then again, according to an old Anandtech article, "the single largest consumer of the additional 3.9B transistors [of Vega 10] was spent on designing the chip to clock much higher than Fiji." Why make all of those architectural changes to increase clocks if you're going to underclock just one generation later?

    Also, just for proper credit, I found the above-linked IEDM 2017 article on r/hardware.

    EDIT: Thanks to iMacmatician, I noticed that AMD might not really have much of a choice in pursuing a chiplet design. Initial 7nm will increase the cost of a 250mm² die tremendously.

    EUV can't come soon enough, eh?
     
    #260 ImSpartacus, Dec 10, 2017
    Last edited: Dec 10, 2017