AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

Discussion in 'PC Industry' started by ToTTenTranz, Oct 8, 2018.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,440
    Likes Received:
    321
    Location:
    Varna, Bulgaria
    The worst case scenario for the entire Zen 2 lineup is that AMD must maintain four layouts, up from two in Zen 1:

    * I/O Hub;
    * 8-core Chiplet;
    * 8-core SoC;
    * 4-core APU;

    ... if the consumer SKUs remain monolithic.
     
    Lightman likes this.
  2. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    510
    Likes Received:
    124
    So, one thing I do after a major disclosure like this is to go back and look at the old leaks, find the ones that contain nothing contradicted by what was released, and see what else they talked about.

    Many of the earliest sources on the 8+1 design, from back when I and most others completely disregarded them, agree on a feature that AMD chose not to talk about now: according to the leaks, Rome has an eDRAM cache, and it is 512MB.

    This agrees with something I saw on anandtech and twitter: The newest normal process from GloFo is the 12nm (which is pretty much a marketing name...), and AMD already sells products made with it. Why would they call the IO die 14nm? Because GloFo also has the 14nm HP, which can use eDRAM. According to the published density of the 14nm eDRAM cell, a 512MB cache would fit on a 440mm^2 die, if only barely.
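    As a sanity check on that density claim, a quick back-of-envelope sketch (the cell size is IBM's published 14nm eDRAM figure, and the periphery overhead factor is my guess; neither number is from the post):

    ```python
    # Back-of-envelope: array area for a 512MB eDRAM cache.
    # The ~0.0174 um^2/bit cell size is IBM's published 14nm eDRAM
    # figure; the 2-3x periphery/overhead factor is an assumption.
    CELL_UM2 = 0.0174
    bits = 512 * 2**20 * 8              # 512MB in bits
    raw_mm2 = bits * CELL_UM2 / 1e6     # um^2 -> mm^2
    print(f"raw array: {raw_mm2:.0f} mm^2")
    print(f"with overhead: {2 * raw_mm2:.0f}-{3 * raw_mm2:.0f} mm^2")
    ```

    Even at 3x overhead the array stays under ~225 mm^2, so 512MB squeezing onto a ~440 mm^2 die alongside the PHYs is at least arithmetically plausible.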

    (edit: woops typo)
     
    #82 tunafish, Nov 7, 2018
    Last edited: Nov 7, 2018
    Kej, Alexko, Lightman and 3 others like this.
  3. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    573
    Likes Received:
    265
    With how AMD isn't discussing anything about the IO die, I think some secret sauce is likely to be in it, and eDRAM could be a possibility.

    AMD specifically said that for 32-core Naples, 4 memory channels were not enough, and we saw that Threadripper 2 kind of hits a wall due to memory in many workloads. They have chosen to keep 8 memory channels for Rome, and outside of memory clock increases they still need to feed 64 cores with around the same memory bandwidth as the 32-core parts. An L4 cache of some sort is very likely imo. Even if AMD chooses not to use eDRAM, I think they would still need some sort of SRAM cache.
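    A quick illustration of that squeeze (the DDR4-3200 speed for Rome is my assumption for illustration; Naples launched at DDR4-2666):

    ```python
    # Aggregate DDR4 bandwidth across 8 channels, then per-core share.
    # Rome at DDR4-3200 is an assumed, not confirmed, speed.
    channels, bytes_per_beat = 8, 8          # 8 channels x 64-bit
    def agg_gbps(mts):
        return channels * bytes_per_beat * mts / 1000
    naples = agg_gbps(2666) / 32             # GB/s per core, 32-core Naples
    rome = agg_gbps(3200) / 64               # GB/s per core, 64-core Rome
    print(f"Naples: {naples:.1f} GB/s/core, Rome: {rome:.1f} GB/s/core")
    ```

    Even with faster DIMMs, per-core bandwidth drops by roughly 40%, which is the argument for a large last-level cache somewhere.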
     
    Kej, DavidGraham and Lightman like this.
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,919
    Likes Received:
    2,298
    Location:
    Well within 3d
    Is this GF's ST-RAM product? The last I saw, its marketed write endurance was on the order of 10^8 cycles, which is ten or so orders of magnitude below what some may argue is "unlimited".

    This sounds like IBM's 14nm process, which GF acquired. I'm not sure that's been offered to anyone else, and IBM's process is a complex many-layer SOI node.
    How much area goes into eDRAM in this scenario?

    A ~213mm2 Zeppelin die has ~88mm2 bound up in CCX area, with the rest being "other" including DDR, PCIe, fabric links, and other IO.
    DDR and PCIe alone may be 30-40mm2 per die, without ancillary logic or paths needed to connect them. 4 processors' worth of them could be 1/3 of the die, and then there's the 8 IF blocks and some unknown fraction of the uncore that the IO die would need to control and connect.
    I guess some of the question is how much of the remaining chip area can be de-duplicated or simplified instead of put on the IO die. The data transport layer is currently implemented as a crossbar, which may change on the IO die because of the concentration of switch clients in one place. Naively plunking down the non-CCX area of 4 Zeppelin die would cover most of the IO die.
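    Tallying those figures (all areas are the estimates from this post, not measured values):

    ```python
    # Non-CCX ("uncore") area of a Zeppelin die, naively replicated 4x
    # as an upper bound on what an IO die would need without sharing.
    zeppelin_mm2 = 213
    ccx_mm2 = 88                      # two CCXes
    other_mm2 = zeppelin_mm2 - ccx_mm2
    four_dies = 4 * other_mm2
    print(other_mm2, four_dies)       # per-die uncore, then 4x copies
    ```

    500 mm^2 of copied uncore would exceed the rumored IO die outright, which is why de-duplicating the fabric and controllers matters.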
     
    Kej, CarstenS and Lightman like this.
  5. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,223
    Likes Received:
    318
    Location:
    Australia
    I doubt this very much; I/O doesn't scale, and new processes are significantly more expensive per mm^2.

    Out of that I/O, how much is critical to performance? SATA, USB, Ethernet, NVMe, PCIe: none of those looks important to me.
     
  6. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,440
    Likes Received:
    321
    Location:
    Varna, Bulgaria
    I did a quick sketch-up of all the PHY interfaces presumed for the I/O hub, sized from the 14nm Zeppelin layout. Factoring in all the interfacing logic and pad stacking, not much space is left for a large eDRAM array.
     
    Kej and Lightman like this.
  7. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,706
    Likes Received:
    297
    I'm pretty concerned about the idea of so many different dies being needed.
    The key, audacious success of Zen1 was it used a single relatively small die to do everything from low-end consumer all the way up to 32 core monster server sockets.

    To now make multiple dies on different processes with a bunch of them incompatible with other market segments is throwing away that incredible efficiency.

    On the other hand, yes, you get a smaller 7nm chiplet, and being made up mostly of copy/paste blocks, I guess it's probably relatively cheap to develop several sizes of IO chip.

    It just occurred to me: could the big one be designed with room to be chopped into halves/quarters to create the smaller versions?
    That way one tapeout could produce 3 sizes of IO chip, even off each wafer; that'd satisfy my desire for efficiency/simplicity.

    Nice.
    By rough eyeball, 2 chiplets & 1/4 of the IO die isn't actually much bigger than 1 Zen1 chip, which ain't bad for twice the cores & double-width FP units.
     
    iMacmatician and Lightman like this.
  8. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,440
    Likes Received:
    321
    Location:
    Varna, Bulgaria
    A scaled down I/O hub for consumer SKUs is possible, but the question remains how well that MCM approach will fit into the razor-thin margins of the Ryzen series, compared to a monolithic design.
    If the chiplet yields are good and EPYC/TR sales are consistent enough to capture a significant market share, the production volume could make it profitable to reuse the same architecture for the consumer segment.
     
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,426
    Likes Received:
    357
    Would make sense, as ideally there is an L4/LLC used with NVDRAM to lower power draw and increase performance. I still think a single stack of HBM may have been a superior solution for capacity with MCM, but AMD did seem fond of their 14nm custom SRAM cell design for Zen. This may explain why.
     
  10. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,426
    Likes Received:
    357
    They still have APUs that could bypass the IO chiplet design. However, they really need more bandwidth in that segment anyway. With the heterogeneous memory support AMD has been working on, there is likely another memory controller in the mix somewhere. Even a single stack of HBM connected like a CCX chiplet over IF could be huge.

    Did AMD state the IF bandwidth on Rome anywhere?
     
    Lightman likes this.
  11. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,370
    Likes Received:
    787
    Because not using chiplets would mean using a different die, which would require designing it, and AMD usually tries to avoid that.
     
    hoom likes this.
  12. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,126
    Likes Received:
    3,762
    There's a rumor pointing to the IO die having 256MB of cache:



    Though I doubt it's L3.


    And why do you assume it doesn't make sense?
    Economy of scale is still a thing..
    Keep in mind that each 8-core chiplet carries 16 PCIe 4.0 lanes, which an I/O chip could easily convert into different combinations of up to 32 PCIe 3.0 lanes. I'd say if there were absolutely no plans of using the chiplets in anything other than EPYC, why would the chiplets have PCIe in there at all? It would always be cheaper to save die area on the chips being made on the more expensive process.
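    Bandwidth-wise the conversion is clean (link rates per the PCIe specifications; this ignores protocol overhead beyond line encoding):

    ```python
    # Per-lane bandwidth in GB/s: transfer rate x 128b/130b encoding / 8 bits.
    # Both PCIe 3.0 and 4.0 use 128b/130b, so the ratio is exactly 2:1.
    def lane_gbps(gts):
        return gts * 128 / 130 / 8

    gen4_x16 = 16 * lane_gbps(16.0)   # PCIe 4.0 x16
    gen3_x32 = 32 * lane_gbps(8.0)    # PCIe 3.0 x32
    print(f"{gen4_x16:.1f} vs {gen3_x32:.1f} GB/s")
    ```

    Sixteen Gen4 lanes and thirty-two Gen3 lanes carry the same ~31.5 GB/s aggregate, so an IO hub fanning the chiplet links out to Gen3 devices loses nothing in raw throughput.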

    Plus, the GF agreement still exists, so keeping them making 2 or 3 I/O dies plus some older GPUs (or IOs with GPUs?) may be the best possible outcome for every partner.

    That sounds a bit unfair to everything they presented to improve IPC. AVX2 performance is supposedly doubled.


    It's a rather safe bet to assume they don't have to repeat that same logic 4 times over, though..


    So far there's one small die being made on an expensive high-end process and one large die on a cheap low-end process.
    Only the cheap-ish die is incompatible with other market segments, and for those AMD can use an even cheaper die.
    Perhaps 7nm+ EUV will change this, but for now this seems like a great choice to me.
     
  13. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    510
    Likes Received:
    124
    32MB of L3 per die * 8 = 256MB of L3.
     
    CarstenS likes this.
  14. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,448
    Likes Received:
    702
    The 256MB, if true, is likely to be an aggregate cache memory number. I could imagine each core having a 1MB L2 cache and a 1MB L3 victim-cache slice, for a total of 16MB of cache on each chiplet and 128MB of cache in the IO chip, likely sliced as 16MB per DDR4 PHY.
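    The split does sum to the rumored figure (all allocations here are Gubbi's speculation, not AMD data):

    ```python
    # Checking that the speculated per-core and IO-die caches sum to 256MB.
    cores, chiplets = 8, 8
    l2_mb, l3_slice_mb = 1, 1                  # per core, speculative
    per_chiplet = cores * (l2_mb + l3_slice_mb)
    io_die = 8 * 16                            # 16MB per DDR4 PHY x 8 PHYs
    total = chiplets * per_chiplet + io_die
    print(per_chiplet, io_die, total)
    ```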

    Cheers
     
    #94 Gubbi, Nov 8, 2018
    Last edited: Nov 8, 2018
  15. Arzachel

    Newcomer

    Joined:
    Jul 23, 2013
    Messages:
    26
    Likes Received:
    21
    Doesn't that work out to less cache per chiplet than what's already on 2x Zen CCXes?
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,919
    Likes Received:
    2,298
    Location:
    Well within 3d
    There's 4 IF links on the die, which AMD cites as allowing them to fully connect all dies in a package with a minimal number of package layers, despite the chips' orientations changing to keep the DDR PHY facing the outside of the package.
    If the second-gen fabric's protocols aren't too different, the home agents for managing coherence would still be tied to the memory controllers. That link is likely a good part of why the coherent fabric is bound to the memory controller's clock, though whether something like power consumption, removing a few clock domain crossings in an already longer-latency subsystem, or some kind of constraint imposed by AMD's generalized and topology-agnostic fabric is why that choice was made is unclear to me.
    I haven't seen AMD describe Zen's coherence protocols in detail, though I presumed that if it's something like prior multi-core methods, the inter-die links on the current EPYC MCM would allow direct data transfers from a cache holding a dirty copy to the requesting core without going through additional hops in a scenario where the home node is on a third chip--which is all the time given our understanding of Rome.
    It's not required that there be other links, although it could be beneficial.

    Was this stated by AMD? I saw some people drawing an inference that this was so, but the center of the IO die's right and left sides is taken up by IO blocks whose length is somewhat shorter than the DDR interfaces on the top and bottom. Perhaps that diagram is not representative, but if it is close to reality I'm not sure what other IO needs that much perimeter.
    Since this appears to be using the same socket connectivity as Naples, xGMI over a subset of the PCIe lanes remains the method for two-socket communication, which seems contrary to the goal of centralizing the coherent traffic in an IO chip only to add complexity back by mapping inter-socket traffic back to the chiplets--and there are twice as many of them as there are xGMI links.

    That depends on what could be considered optional while maintaining or improving upon the capabilities offered by Naples. There appear to be server/workstation products that have more SATA or USB links coming out of the socket than one Zen chip can provide, but potentially can be handled by the hardware equivalent of two southbridges.
    Around half of the non-CCX area of 4 Zen chips is likely not negotiable if the PHY and requisite controllers are in the IO die, and a significant fraction of the remainder could be needed to provide the same features before considering any enhancements.
    That could still leave area for storage in the IO chip, but I think it tends towards moderating expectations about what a conventional cache array could provide, or posits something more exotic to get such large amounts of caching in a more modest space/compression of the uncore's area consumption.
     
  17. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    510
    Likes Received:
    124
    8 Zen1 cores already have a combined 16MB of L3. The shrink from GF 14nm to TSMC 7nm shrinks SRAM arrays much more than it shrinks logic. I would be extremely surprised if the chiplet dies do not have 32MB of L3 per chiplet. Put together, those are 256MB. Of course, a single core can only make use of 32MB max.

    On top of that, I expect the IO die to have another cache level, a L4 with 512MB. (Based on the earliest 8+1 leaks mentioning that.)

    Can you show this? Because I did a quick calculation and expect that there is >200mm^2 free.

    The IF links connecting the chiplets are likely 16-bit PCIe, running at much higher frequency than standard (since they only need to cover a distance of 2cm, with no sockets or connectors of any kind to add capacitance). AMD did the same for the IF links of Zeppelin, but now they have PCIe 4, with double the throughput. They will likely get >50GB/s per direction from that, and that's enough. The rest of the PCIe, which provides the off-socket IF and connection to devices, doesn't come from the IO die but from the chiplets. That is, each chiplet has a total of 32 lanes of PCIe. This nicely means that adding or removing chiplets doesn't change the amount of PCIe lanes available in the system: if you only fit 4 chiplets, you get a total of 64 lanes from the chiplets and 64 free from the IO die.

    So for memory and for connecting chiplets, the die loses 8x7.5mm^2 = 60mm^2 for DDR4 and 8*8.7mm^2 = 70mm^2 for the IF/PCIE. Assuming 440mm^2, 310mm^2 still remain. All the other IO and random stuff on the Zen package, excluding the IFOPs (since those will go on the chiplets) is <80mm^2, leaving more than a full Zeppelin die worth of space for cache.
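    The same budget, written out (all areas are the estimates from this post, not measured figures):

    ```python
    # IO-die area budget per the estimates above; the 440 mm^2 die size
    # and per-PHY areas are this post's assumptions.
    total = 440                  # assumed die size, mm^2
    ddr4 = 8 * 7.5               # eight DDR4 PHYs
    if_pcie = 8 * 8.7            # eight IF/PCIe PHY blocks
    misc = 80                    # upper bound for remaining IO/uncore
    left = total - ddr4 - if_pcie - misc
    print(f"{left:.1f} mm^2 left for cache")
    ```

    The ~230 mm^2 remainder is indeed larger than a full 213 mm^2 Zeppelin die.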
     
    #97 tunafish, Nov 8, 2018
    Last edited: Nov 8, 2018
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,919
    Likes Received:
    2,298
    Location:
    Well within 3d
    My impression from sources like https://fuse.wikichip.org/news/1064/isscc-2018-amds-zeppelin-multi-chip-routing-and-packaging/4/ is that the on-package links for Zeppelin are 32-wide and running at a lower speed than PCIe. IFOP is described as using single-ended links drawing less than a quarter of a PCIe-based IFIS link's power.
    The bandwidth figures for the on-package links also match more cleanly with power-of-two widths without the PCIe protocol's CRC overhead.

    Is there a source on the 32 lanes of PCIe per chiplet?
     
    Lightman likes this.
  19. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,982
    Likes Received:
    96
    I'm not quite convinced of this. In particular, assuming there's an L4, the benefits of having such a large L3 may not be all that much, and I'd expect it not to increase (per core). 16MB of extra L3 might not need THAT much area on 7nm, but it's still probably ~10mm^2.
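    That ~10mm^2 estimate lines up with published SRAM density (the ~0.027 um^2 bit cell is TSMC's reported 7nm HD figure; the overhead factor for tags, periphery, and routing is my guess):

    ```python
    # Raw SRAM array area for 16MB of L3 at an assumed 7nm bit-cell size.
    CELL_UM2 = 0.027                 # assumed TSMC 7nm HD SRAM bit cell
    bits = 16 * 2**20 * 8            # 16MB in bits
    raw_mm2 = bits * CELL_UM2 / 1e6  # um^2 -> mm^2
    print(f"raw: {raw_mm2:.1f} mm^2")
    ```

    Roughly 2-3x that for tags and periphery lands right around the ~10mm^2 ballpark.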
    Seems rather unlikely to me. You'd need different routing on the package depending on how many chiplets are connected. It seems much simpler to me if you handle all the PCIe from the IO die as well, saving precious area on the 7nm chiplets.
     
    Anarchist4000 likes this.
  20. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,780
    Likes Received:
    443
    Location:
    Torquay, UK
    Zen 2's IF is 64 bits wide, double that of Zen 1.
    IF 2, at least for GPU-to-GPU, runs at 100GB/s.

    There is a possibility that on-package IF runs faster, as was the case in Zen 1 for on-chip IF vs inter-chip coherent IF.
     

