AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

Discussion in 'PC Industry' started by ToTTenTranz, Oct 8, 2018.

  1. rcf

    rcf
    Regular Newcomer

    Joined:
    Nov 6, 2013
    Messages:
    317
    Likes Received:
    235
    But requiring the presence of an I/O chip makes no sense in mobile/desktop SKUs.
    It would make manufacturing more expensive (2 dies per CPU instead of 1) and require the existence of multiple I/O chip variants since the one on Epyc is overkill.
    The "dumb chiplets + I/O hub" scheme is worth it only on Epyc and Threadripper CPUs with lots of cores.
     
  2. Malo

    Malo YakTribe.games
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,502
    Likes Received:
    2,539
    Location:
    Pennsylvania
Why does having 2 dies make it more expensive? For the package, yes, I guess it would be a bit more, but why would it be significant? The I/O chip being made on a separate, much cheaper process lowers cost and spreads the manufacturing between separate nodes, where there are likely severe capacity constraints on 7nm.
AMD are pushing for lots of cores in the desktop. They're redefining what "lots of cores" means, in contrast to Intel's stagnation in that department.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    Is the IO die in question an IO die used by all client products, or a variant for the G products? The cost is either two similar chips being engineered, or one incrementally larger one adding cost for all of them.
One departure here from other AMD proposals is that they've always had the GPU drive some kind of attached memory bus, which makes sense since the GPU silicon's needs tend to be closer to those of the IO domain than the CPU regions. Architecturally, GPUs are better at utilizing DRAM on a sustained basis, and that same capability usually costs more if it's not on-die. Perhaps the strangled bandwidth of the socket makes that less important in this instance, but some of the video's rumors indicate the hardware is already overspecced for that bandwidth, which leaves the question of where the inflection point lies between GPU silicon wastage and link capability on one side and extra 7nm dies and chip variants on the other.

    AMD's modern APUs can gate most silicon besides the controller and memory. The power domains for Raven Ridge are set up to allow this.
    The efficacy of power gating of a GPU on-die would be compared to the power gating of a chiplet.
    There are elements neither can turn off completely, and so I'm not certain if the chiplet adds much in the idling scenario besides the link controllers and off-die interconnect that cannot be gated.
    I have some question about whether there's a control complex or series of dependences between the command processor on the chiplet and the ancillary hardware now moved to the IO die, but there could be methods to handle it.

    The rumored CU counts seem of dubious value with the DDR bandwidth available, and I don't know what to make of the video using the same name for the Ryzen G products and a discrete product allegedly capable of hitting Vega 56 performance. Chiplets offer flexibility to an extent, but spanning the range from ~40 GB/s of socket and link bandwidth up to matching a product with 512 GB/s seems to stretch what the same silicon and link can achieve without leaving significant gaps that a supposedly compact chiplet cannot reconcile.
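    To put numbers on that stretch: a rough sketch of the spread between the two ends of the rumored range, using the post's figures (~40 GB/s for dual-channel DDR4 on a Ryzen G-class part, 512 GB/s as the round figure for a Vega 56-class discrete part).

    ```python
    # Bandwidth spread the same chiplet/link would have to span,
    # using the approximate figures quoted in the post above.
    apu_gb_s = 40        # ~dual-channel DDR4 in a Ryzen G socket
    discrete_gb_s = 512  # post's round figure for a Vega 56-class part

    spread = discrete_gb_s / apu_gb_s
    print(spread)  # 12.8, i.e. a >12x range for one chiplet design to cover
    ```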

    Having more than one die has happened before, such as with quad-core Conroe products. There is a yield and assembly cost to this, and three chips can add to it. It might depend on where AMD's projections are for volume and yield for an unknown set of chips. If there were a definite high volume of high-yield silicon for a given combination of features, this might lose out. However, if AMD's being pessimistic about the volume or manufacturability of a given graphics or processor SKU, this might be a sensible but still less than ideal decision.

    Performance or power-wise, I question the latency for the CPU and bandwidth for the GPU. An MCM will have a higher floor in terms of power consumption due to the links, and whether that is countered by the presence of 7nm on some of the chips is unclear. Cost-wise, I am curious how appealing this is for the cheapest and highest-volume SKUs. The supposedly debunked rumor of a 28nm bargain-basement single-chip product might make sense in this light, particularly if it was a contingency plan in case Globalfoundries was somehow not capable of servicing that niche, or wanted to hold that range hostage in WSA negotiations at the nodes Zen was on.
     
  4. rcf

    rcf
    Regular Newcomer

    Joined:
    Nov 6, 2013
    Messages:
    317
    Likes Received:
    235
    The area of one die with all relevant controllers built in would probably be smaller than one chiplet plus one I/O hub combined. It's also more expensive to integrate 2 dies in the package: if something goes wrong with one of them, you may lose the other.
    Also, lower-end mobile/desktop CPUs sell in much larger quantities than server/workstation CPUs while having much lower profit margins, so every cost reduction in manufacturing should matter.
    Two dies would also result in higher memory latency even for low-end mobile/desktop CPUs with fewer cores, which I guess could affect gaming performance, for example.

    Sure, but only a tiny fraction of people really need 12, 16 or more cores in their desktops.
    And Amdahl's law will get us all sooner or later.
     
  5. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    511
    Likes Received:
    125
    One reason I'd really like to see a chiplet APU is that then that same GPU chiplet could be used in AIBs, and maybe as a base of an entire product line in different configurations.

    The big question I'd like to see answered is: if you start disaggregating the GPU the same way AMD seems to have disaggregated the server CPU with EPYC2 (that is, not just fancy Crossfire but transparent to software), where specifically should you make the cut? Would it make sense to locate the ROPs with the memory controllers to reduce data movement between the chips? AFAICT, there is more traffic between the ROPs and RAM than there is between them and the CUs.
     
    Lightman likes this.
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    I think there are some trends that bring the amount of export bus traffic up. For example, DCC could compress a wavefront's pixel exports significantly before they materialize as DRAM accesses, and some of the best practices for DCC include having a shader write out on all channels even if redundant to make the compressor more efficient.
    https://gpuopen.com/dcc-overview/

    Late depth checks or culling that feed off the ROP depth caches happen, in the ideal case, for multiple culled primitives or pixels without going to memory. This may become more important in the future, as there are patents about leveraging the depth caches, or the hierarchical caches based on them, for even earlier culling of primitives in the front end. Overdraw scenarios may become more expensive, since the overwrites in the ROP caches become chip-link transactions.
    Whatever method the CUs use for arbitrating for the export bus and RBEs would show up externally if the bus became externally visible, for whatever increment of bandwidth the reservation and release process imposes.

    Moving the ROP caches to the die with the memory controllers would allow them to still serve their role of bandwidth amplification from DRAM. However, if they operate just as efficiently in that regard, their DRAM bandwidth consumption remains constant while the export bus (64 (128?) bytes per shader engine) is now visible on the link between the chips. That amplification factor may also be a problem in the compressed-memory case, if the link is sized to match the DRAM bus.
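    The export-bus figure above can be turned into a rough traffic estimate. The 64 B/clk per shader engine is from the post; the 1.5 GHz clock and 4 shader engines are purely illustrative assumptions to show the order of magnitude that would land on the inter-die link if the ROPs moved off-chiplet.

    ```python
    # Rough sizing of export-bus traffic that would become inter-die
    # traffic if the ROPs sat on the IO die with the memory controllers.
    # 64 B/clk per shader engine is the post's figure; the clock and
    # shader-engine count are illustrative assumptions, not known specs.
    bytes_per_clk_per_se = 64
    shader_engines = 4      # assumption
    clock_ghz = 1.5         # assumption

    link_gb_s = bytes_per_clk_per_se * shader_engines * clock_ghz
    print(link_gb_s)  # 384.0 GB/s of export traffic now crossing the package link
    ```

    Even at a modest clock, that is far beyond what a DRAM-bus-matched link would carry, which is the amplification-factor problem the post describes.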

    I think that Vega's choice to bring the ROPs within the L2 may point to a decision to further encapsulate them. Besides helping avoid cache flushes when dealing with read after write hazards, I wonder if putting the incoherent ROP caches inside the hierarchy kept the Infinity Fabric simpler by removing a demanding memory client that did not play by the coherent fabric's rules.
     
    SpeedyGonzales, Lightman and Gubbi like this.
  7. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,457
    Likes Received:
    717
    The display engine is 5% of the 2200G/2400G die, or ~10mm². I would expect it to be on all Ryzen IO dies. Having two mask sets and stocking two different IO dies to save 10mm² seems unlikely.

    Wrt. inter-die links, do we know anything about the bandwidth? TSMC demoed a low-latency, 2 Gbit/s-per-lane, 256-lane InFO link two years ago on their 16nm process, using just 0.4 mW per Gbit/s (0.4 pJ/bit). That's 64GB/s using just 0.4W. TSMC's capabilities have improved since then and their InFO packaging technology has matured.
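    Checking the arithmetic on those figures: 256 lanes at 2 Gbit/s gives 512 Gbit/s = 64 GB/s, and at 0.4 pJ/bit that works out to ~0.2 W per direction, so the quoted 0.4 W plausibly covers both directions of the link.

    ```python
    # Back-of-the-envelope check of the quoted TSMC InFO link figures
    # (256 lanes x 2 Gbit/s, 0.4 pJ/bit); numbers taken from the post.
    lanes = 256
    rate_gbit_per_lane = 2
    energy_pj_per_bit = 0.4

    total_gbit_s = lanes * rate_gbit_per_lane            # 512 Gbit/s raw
    bandwidth_gb_s = total_gbit_s / 8                    # 64.0 GB/s
    power_w = total_gbit_s * 1e9 * energy_pj_per_bit * 1e-12  # ~0.205 W per direction

    print(bandwidth_gb_s)       # 64.0
    print(round(power_w, 3))    # 0.205
    ```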

    Cheers
     
    Anarchist4000 and Lightman like this.
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,457
    Likes Received:
    717
    I think they will structure it similar to Raven Ridge, where the GPU has a fair amount of L2 (1MB?). I could imagine a slice of cache on the IO die at each memory controller, acting as a memory side cache (like we saw in Intel's Crystal Well).

    A lot of assumptions are dependent on the bandwidth of the inter-die links. I hope (and expect) we get a significant increase.

    Cheers
     
  9. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,714
    Likes Received:
    302
    If the IO chip has a codename like 'balcony' I'm gonna throw stuff :runaway:
     
    AlBran likes this.
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    The most I've seen is that AMD stated it's on generation 2 of the fabric. I would expect the link bandwidth to double at a minimum, or rather it has to, since an IFOP link currently only matches one DDR4 channel in terms of bandwidth. Rome's chiplet strategy would immediately strangle the architecture if the single link to the chiplet did not double.
    I think to some extent there would be a desire to do more than that, since EPYC marketing touted its overprovisioned link bandwidth, which implies there is some benefit in having a surplus of bandwidth available beyond the limits of the local channel pair. A GPU chiplet might want more, since it gets more fabric stops than a CCX, although the ceiling is the DDR channels in a Ryzen G.
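    The doubling argument in numbers, as a sketch: the one-IFOP-link-per-channel equivalence is from the post above, while the DDR4-3200 speed and the local channel pair per Naples die are my assumptions for illustration.

    ```python
    # Why Rome's single chiplet link must at least double, assuming
    # DDR4-3200 (8-byte channel); the "one IFOP link ~ one DDR4 channel"
    # equivalence is the post's claim, the memory speed is an assumption.
    channel_gb_s = 3200 * 8 / 1000       # 25.6 GB/s per DDR4-3200 channel
    ifop_gen1_gb_s = channel_gb_s        # gen-1 link matches one channel
    naples_local_gb_s = 2 * channel_gb_s # each Naples die had a local channel pair

    # In Rome, all of a chiplet's memory traffic crosses its single link,
    # so keeping Naples-level local bandwidth requires at least:
    print(naples_local_gb_s / ifop_gen1_gb_s)  # 2.0
    ```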

    I haven't seen a recent reference to a TSMC interconnect, and I may have missed a paper or article on it. Many interconnect demonstrations can go years before they reach market, so it may still be on the way. I have not seen the specifics for this interconnect, such as the expected connection distance or package layout. Rome's package layout and single link per die to the IO chip makes it look like it's following paths similar to some of the longer ones in Naples, and the pictures so far make it seem like the substrate is similar. The chiplet strategy would make it likely the client products would be using the same connectivity.
     
    Lightman likes this.
  11. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,428
    Likes Received:
    357
    That would make sense. The concern is that the IO die prefers an older process which would be less than ideal for ROP and cache logic.

    If Navi looked like Rome, could we see three processing chiplets plus a front-end/command-processor chiplet? Or a smaller IO die and fewer chiplet options? Use the same Navi chiplet for all Navi products, then bin the chiplets for clock speeds, or possibly remove/substitute them for compute-only products? That could work around the serialization issues of higher clocks on a risky process. Apply Epyc's design to GPUs: higher cost for consumer boards, but reusing the same small silicon chiplets may pay off in higher-margin markets, similar to how Threadripper uses the better chips.
     
  12. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,457
    Likes Received:
    717
    I was referring to this Hotchips 2016 presentation

    Clearly targeted at mobile applications with its low power consumption.

    Cheers
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    That does look like it targets very low power. I'm curious how that package type may react to the higher thermals of the AMD chiplets.
    That aside, slide 20 covers my question about trace length and proximity. The demonstration device had 0.55mm spacing between die, and the concept seems to target compact SOP products. I'm not sure about the trace matching, and whether this is more stringent than AMD's current package links. I don't recall AMD going into that kind of detail.
     
  14. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,457
    Likes Received:
    717
    Yes, they would need more power for longer traces.

    What I found most interesting is the sheer density of traces possible. On page 24, a graphic of the SOC and MEM packages is overlaid on a photo of the ball grid array (7x7mm package). The SOC and MEM dies are less than 7mm² each and allow 256 single-ended comm lanes as well as 16 reference lanes (one for each 16-lane sub-channel). Add ground and power as well as the fan-out to the ball grid and we're looking at roughly 500 connections in 7mm², or ~75/mm².
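    Reproducing that density estimate: 256 + 16 signal lanes, padded to roughly 500 total connections with power, ground, and fan-out, over a die area just under 7mm² (the ~75/mm² in the post is this same arithmetic, rounded up).

    ```python
    # Connection-density estimate from the post: signal lanes plus
    # power/ground/fan-out padding (the ~500 total is the post's estimate),
    # over a die area of just under 7 mm².
    signal_lanes = 256 + 16     # single-ended comm lanes + reference lanes
    total_connections = 500     # post's padded estimate
    die_area_mm2 = 7

    density = total_connections / die_area_mm2
    print(round(density))  # 71, i.e. ~70-75 connections per mm²
    ```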

    This was state of the art three years ago; even if AMD's packaging partner is trailing it, they must be able to pack a hell of a lot of connections on organic substrates today.

    Cheers
     
    Lightman and AlBran like this.
  15. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    3,131
    Likes Received:
    1,685
    Lightman likes this.
  16. Theeoo

    Newcomer

    Joined:
    Nov 13, 2017
    Messages:
    114
    Likes Received:
    53
    AMD obviously believes we're heading for a glorious multithreaded future, although I think I'm going to wait for Intel's show or no-show of Ice Lake in late 2019, at least for comparison purposes. If Intel fails to deliver, I think they'll be in real trouble.
     
  17. AlBran

    AlBran Just Monika
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    19,657
    Likes Received:
    4,548
    Location:
    ಠ_ಠ
    Future, the cloud is.
     
  18. Magnum_Force

    Newcomer

    Joined:
    Mar 12, 2008
    Messages:
    93
    Likes Received:
    59
    I'm guessing that a GPU chiplet / IO die would be set up similarly to the Xbox 360's Xenos and its eDRAM daughter die, perhaps?
     
    SpeedyGonzales likes this.
  19. jlippo

    Veteran Regular

    Joined:
    Oct 7, 2004
    Messages:
    1,191
    Likes Received:
    250
    Location:
    Finland
    Or X360 GPU and CPU.
    In which case memory would be connected directly to IO/GPU chip and through there to CPU chiplet.
     
    #139 jlippo, Dec 10, 2018
    Last edited: Dec 17, 2018 at 12:44 PM
  20. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,597
    Likes Received:
    1,338
    New Benchmark Leak Reveals Ryzen 3 3000U Radeon Vega Mobile series
    https://www.guru3d.com/news-story/n...-ryzen-3-3000u-radeon-vega-mobile-series.html
     
