AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

Discussion in 'PC Industry' started by ToTTenTranz, Oct 8, 2018.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,919
    Likes Received:
    2,298
    Location:
    Well within 3d
    The GPU xGMI links are apparently running at 25 Gbps, going by the following video (after 21:00), found via Reddit:


    Possibly, with the introduction of PCIe 4.0, there is something like the Synopsys multi-mode PHY that Zen used for its PCIe/xGMI purposes.
    The following, for example, might be able to support both PCIe 4.0 and a 25 Gbps connection:
    https://www.synopsys.com/dw/ipdir.php?ds=dwc_multi_protocol_25g_phy
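
    Rough math on what that rate could mean per link; the 16-lane width and zero encoding overhead are my assumptions for the sketch, not anything stated in the video:

    lanes = 16                  # assumed lanes per xGMI link (not confirmed)
    lane_rate_gbps = 25.0       # per-lane rate quoted in the video
    link_gbps = lanes * lane_rate_gbps
    link_gbs = link_gbps / 8    # Gbit/s -> GB/s, ignoring line-code overhead
    print(f"~{link_gbs:.0f} GB/s per direction, ~{2 * link_gbs:.0f} GB/s bidirectional per link")
    # -> ~50 GB/s per direction, ~100 GB/s bidirectional per link

    If that lane-count guess holds, two such links per MI60 would land in the same ballpark as the ~200 GB/s peer-to-peer figure AMD has been quoting.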

    Another point further in the video is that the GPUs can be connected in a ring with comparatively low latency for a GPU, and that it's all encrypted. I'm not sure if that is carrying over from the MI60's implementing memory encryption, or if the links themselves can be encrypted.

    The latter point is something I'm curious about relative to EPYC and its encryption. The memory controllers decrypt memory as it is read in, and the isolation of unencrypted data in the caches is enforced by ASID--a requester not from the same address space misses in the cache and goes to memory for decryption (or loads bad data?).
    However, that understandable way of handling things makes me curious about when the data potentially moves off-chip, like in an MCM or over the xGMI links in a 2-socket system. Is the presumption that it's too difficult to intercept? Is there a workaround like allocating a VM to only one chip so no transfer off-chip is possible? Does it become more of a question in Rome, where if data decrypts at the controller there is no avoiding an off-chip transfer?
    Could this mean there's been a change concerning what paths are encrypted/decrypted?
     
    AlBran and Lightman like this.
  2. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,426
    Likes Received:
    357
    Hypothetically, what if more L3 were in the chiplets (dense, less power, less latency, less of an IF bandwidth hit) and the supposed L4 were tags and memory management à la HBCC? Still seems a bit much, but we don't know everything that's included there. A security processor, memory management, and prediction seem likely, along with IO controllers for USB, network, and probably storage. There's also the possibility of large memory capacity with NVRAM to consider.
     
    Lightman likes this.
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,448
    Likes Received:
    702
    In Zeppelin, the IF links run at 4x the DRAM command rate, 5.3 Gbit/s for 2.66 GHz DDR4. At that speed, power consumption is just 2 picojoules/bit, which is 10-20% of what PCIe 3 uses.

    I expect IF links to be wider and decoupled from DRAM speed in EPYC, something like 64 lanes in each direction running at 2 GHz (with 4x data rate), which would yield 64 GB/s in each direction.
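
    Working those numbers through as a quick sketch (note the 2 pJ/bit figure was quoted for the lower, DRAM-tied rate, so applying it at 8 Gbit/s is an extrapolation on my part):

    # Per-lane IF rate when tied to DRAM: 4x the command clock (half the transfer rate).
    ddr4_gts = 2.666                            # DDR4-2666 transfer rate, GT/s
    lane_rate_gbps = 4 * (ddr4_gts / 2)         # ~5.33 Gbit/s per lane

    # Speculative decoupled link: 64 lanes each way at 2 GHz with 4x data rate.
    lanes = 64
    decoupled_lane_gbps = 4 * 2.0               # 8 Gbit/s per lane
    link_gbs = lanes * decoupled_lane_gbps / 8  # 64 GB/s per direction

    # Link power in one direction at full utilisation, if 2 pJ/bit still roughly holds.
    watts = lanes * decoupled_lane_gbps * 1e9 * 2e-12

    print(f"tied lane rate ~{lane_rate_gbps:.2f} Gbit/s, "
          f"decoupled link ~{link_gbs:.0f} GB/s per direction, ~{watts:.1f} W per direction")
    # -> tied lane rate ~5.33 Gbit/s, decoupled link ~64 GB/s per direction, ~1.0 W per direction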

    Cheers
     
    Lightman likes this.
  4. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,448
    Likes Received:
    702
    The problem with stuffing a large victim cache on a CCX is that the rest of the system doesn't benefit from the extra L3. While your single-core performance might suffer from moving cache from the CCX to the IO chip, overall system performance should improve because your bigger, shared cache reduces memory traffic.

    In GF 14nm, you use around 2 mm² per MB of SRAM, so unless we're talking eDRAM, it is a lot less.
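
    For scale, at that density a shared cache on the IO die would cost roughly the following (the cache sizes are just illustrative, not a claim about the actual part):

    mm2_per_mb = 2.0                          # rough SRAM density in GF 14nm
    for cache_mb in (16, 32, 64, 128):        # hypothetical shared-cache capacities
        print(f"{cache_mb:>4} MB -> ~{cache_mb * mm2_per_mb:.0f} mm^2 of SRAM")
    # 16 MB -> ~32 mm^2 ... 128 MB -> ~256 mm^2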

    Cheers
     
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,426
    Likes Received:
    357
    That may depend on IF link bandwidth and power efficiency. 7nm L3 over 14nm, adding capacity with less power, might overcome any shortfalls from hammering a faster/wider IF link. Assuming IF is a mesh, that may be a practical option with a cluster-on-die approach. It would need a wide CCX-to-IO connection with quarter(?)-width IF links within the IO die. Not sure I like the idea, but it could be a possibility.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,919
    Likes Received:
    2,298
    Location:
    Well within 3d
    From https://semiaccurate.com/2018/11/09/amds-rome-is-indeed-a-monster/, Rome is described as having one IF link per chiplet to the IO chip, with all memory, PCIe, and xGMI connectivity hosted on the central chip.

    64GB/s in each direction could generally align with the bandwidth demands of a single Zen CCX, given the L3's data fabric port is 32B in each direction and capped by the memory-determined fabric speed. The lower expectations for real-life usage likely factored into why there were two CCXs per die, although in Naples' case each of the three IFOP links could bring in as much bandwidth as the local DDR4 controllers, and then to a somewhat lesser extent the xGMI link. That's probably not something often done on a sustained basis, though generous provisioning for bandwidth was a marketing point for Naples.
    What that would mean for a chiplet with 8 cores, 256-bit AVX, and 256-bit data paths may need more disclosures. If AMD's data paths widened from the L1 through to the CCX data port, the demand could increase even though 64 GB/s in each direction falls short of what Naples could provide.
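
    Putting rough numbers on the CCX port (a sketch assuming the fabric clock tracks the DDR4 command clock, as in current Zen, and the port stays 32B per cycle):

    bytes_per_cycle = 32                      # L3 data fabric port, each direction
    fclk_ghz = 2.666 / 2                      # fabric clock at DDR4-2666, ~1.33 GHz
    ccx_port_gbs = bytes_per_cycle * fclk_ghz

    link_gbs = 64                             # one speculated IF link per chiplet
    print(f"One CCX port: ~{ccx_port_gbs:.0f} GB/s per direction")        # ~43 GB/s
    print(f"Two Naples-style CCXs: ~{2 * ccx_port_gbs:.0f} GB/s demand "
          f"vs a {link_gbs} GB/s chiplet link")                           # ~85 vs 64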

    This change in design could explain a shift to one 8-core CCX per chiplet, to consolidate memory clients and reduce misses travelling over the link. AMD's protocols have historically not done much sharing of clean lines, which creates additional memory traffic unless there is a shared cache. A caveat is that AMD hasn't discussed some of the wrinkles to its MOESI protocols in later designs, or Zen's MDOEFSI protocol--although changes like the D state appear to be more concerned with repeated sharing of dirty lines.
    Sharing one L3 could also mean more cache slices to provide internal bandwidth and reduce the bank conflicts that were likely more common with Zen's simple method of striping data across just four banks. The CCX could suffer in vector loads if its port remains 32B, although it wouldn't be unprecedented for an AVX-256 device not to widen the full path through the cache hierarchy.
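
    A toy illustration of the banking point; the low-order-interleave mapping here is a simplification for the example, not a description of Zen's actual hashing:

    def worst_bank_load(num_banks, stride_bytes, accesses=64, line_bytes=64):
        """Count the worst pile-up when strided accesses interleave across banks."""
        hits = [0] * num_banks
        for i in range(accesses):
            line = (i * stride_bytes) // line_bytes
            hits[line % num_banks] += 1
        return max(hits)

    for banks in (4, 8, 16):
        # A 256B stride lands every access on the same bank when there are only 4 banks.
        print(f"{banks:>2} banks, 256B stride: worst bank takes "
              f"{worst_bank_load(banks, 256)} of 64 accesses")
    # 4 banks -> 64, 8 banks -> 32, 16 banks -> 16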

    One link at 64 GB/s in both directions, versus ~85 GB/s in each direction for Naples, could be debated in terms of what it offers in throughput versus latency and utilization.
    Depending on the IO chip's internal interconnect, bisection bandwidth for the package could be much higher, though one chiplet would hit a harder ceiling in trying to use it.
    Congestion-wise, the chiplets have less flexibility. The IO chip might afford something more flexible on-die, however.
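
    For the bisection point, assuming the eight compute chiplets in the Rome reports each get one such link (per the SemiAccurate description above):

    chiplets = 8                              # Rome reportedly carries 8 compute chiplets
    link_gbs = 64                             # speculated per-link bandwidth, each direction
    print(f"Aggregate into the IO die: {chiplets * link_gbs} GB/s per direction")  # 512 GB/s
    print(f"Ceiling for any single chiplet: {link_gbs} GB/s per direction")
    # The aggregate only matters if the IO die's internal fabric can actually carry it.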

    As far as decoupling the fabric from the memory controller goes, AMD's latency seems to have been rather poor even with the monolithic solution, much less one with more asynchronous paths. Rome likely portends something somewhat worse, unless AMD's improved fabric manages to bring its baseline from "as bad as being separate" up to "as fast as being integrated", with un-integrating them then dropping it back to "it could be worse".
    The fabric's ability to transport requests depends on how quickly the home agents can process them and broadcast snoops. Depending on how well AMD's fabric handles congestion at the points of arbitration, there may be excess power for limited gains, or possibly pathological cases in the fabric if that portion of the network can be overloaded. AMD did promise second-gen Infinity Fabric, whatever that means.
     
  7. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    510
    Likes Received:
    124
    (oops nevermind)
     
    #107 tunafish, Nov 13, 2018
    Last edited: Nov 13, 2018
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,448
    Likes Received:
    702
    By tying IF speed to the memory controllers, you save some latency on memory requests, but you still cross clock domains (twice) going from one CCX to another. Also, you have less control of your IF bandwidth: add a second DIMM on each channel and speed will degrade from 2.66 GHz to 2.4 or 2.133 GHz (in the unregistered, non-server world). Add crazy XMP DIMMs, and your IF now uses a lot more power than intended. All to save a handful of sub-ns cycles for the asynchronous FIFOs.
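
    A quick illustration of the coupling, assuming a 32B-per-fabric-clock link width (that width is my assumption for the example):

    bytes_per_fclk = 32                       # assumed link width, each direction
    for mts in (2666, 2400, 2133):            # populate more DIMMs, drop speed bins
        fclk_ghz = mts / 2000                 # command clock in GHz
        print(f"DDR4-{mts}: fabric at {fclk_ghz:.2f} GHz -> "
              f"{bytes_per_fclk * fclk_ghz:.0f} GB/s per link direction")
    # DDR4-2666 -> ~43 GB/s, DDR4-2400 -> ~38 GB/s, DDR4-2133 -> ~34 GB/s (about -20%)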

    There isn't a release date yet, so I hope we get some details at ISSCC in February.

    Cheers
     
    AlBran likes this.
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,919
    Likes Received:
    2,298
    Location:
    Well within 3d
    Coherence probes and requests go to the coherence controller in the memory controller portion regardless, which means it would inject two clock boundary crossings as well.
    It would be introducing some amount of interface penalty with part of its penalty proportional to the lower-clocked DRAM domain.
    Single-chip, in the uncached memory access scenario, the home agent would account for half of the total number of boundary crossings. In the case of a cache hit to another CCX on the same chip, the agent contributes about a third of the boundary crossings making up the critical path of a coherent memory transaction.

    Having DRAM be slower than the fabric would mean the coherence controller's rate of processing probes and requests would lag the fabric's ability to transport requests to it, so the faster-running fabric would spend more cycles and power waiting on requests to be processed by the controller, the snoop filter checked, and probes generated.
    Having the DRAM clock higher than the fabric means the home node spends more time and power waiting on requests and responses to get to it, and the DRAM's higher bandwidth and power consumption are wasted.
    If the power consumption and clock domain crossings worsen in a system that bottlenecks on the slower of the two sides of the clock boundary, opting to save boundary crossings and power doesn't seem as bad.
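
    A sketch of how I'm counting the crossings above; the exact hop list is a guess at the critical path, not anything AMD has documented:

    # Clock-domain boundaries crossed on the critical path (guessed): the CCX runs on its
    # own clock, the fabric on its clock, and the home agent sits with the memory controller.
    uncached_read = [
        ("ccx", "fabric"), ("fabric", "home_agent"),      # request out to the home agent
        ("home_agent", "fabric"), ("fabric", "ccx"),      # data returns
    ]
    cross_ccx_hit = [
        ("ccx0", "fabric"), ("fabric", "home_agent"),     # request and snoop-filter lookup
        ("home_agent", "fabric"), ("fabric", "ccx1"),     # probe out to the owning CCX
        ("ccx1", "fabric"), ("fabric", "ccx0"),           # cache-to-cache data transfer
    ]

    for name, path in (("uncached read", uncached_read), ("cross-CCX hit", cross_ccx_hit)):
        at_agent = sum(1 for a, b in path if "home_agent" in (a, b))
        print(f"{name}: {at_agent} of {len(path)} crossings involve the home agent")
    # uncached read: 2 of 4 (half); cross-CCX hit: 2 of 6 (about a third)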

    AMD's choice in where it placed its home agents is a holdover from prior generations, and I think there is some sign of complexity within the uncore given that performance shows a preference for DRAM speeds at specific ranges that might be related to internal divisors or places where the IP is not fully-baked. Not knowing the full details of the fabric, the errors reported and instability related to fiddling with the fabric or some of these ratios may point to the fabric not being fully validated or underspecified for scenarios outside of the straightforward one currently in production. The abstraction and agnostic nature of the fabric seems to correlate with the higher than expected latency and power consumption, and that's when all fabric components are trying to stay within the same clock range.

    Some of the patents and unused firmware settings may also mean that AMD is using or is intending to use fewer synchronous fabric elements, where an immature implementation might be salvaged by slaving regions to a common clock, or a 2.0 version may relax the limits in place.
     
