AMD RyZen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

  1. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    Agreed. I think the space, time, and money should go toward optimizing, refining, and improving the existing design in the background, so they not only get an increase in performance and eliminate all (or most) of the weak points, but also have a very robust base to build on for Zen 3's increase in cores and features.
     
  2. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,966
    Likes Received:
    512
    Trying hard to understand why anyone would want this on? o_O
     
  3. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Infinity Fabric seems tied largely to PCIe signaling tech. So PCIe 4.0 should up the bandwidth a bit, but it primarily sets the stage for multi-level signaling with PCIe 5.0 in the near future. That should yield some large bandwidth increases; even optical interconnects were being explored, with some presentations on that at the last PCI-SIG conference.

    Memory systems, IMHO, will be the next focus: using HBM or large caches close to the die to facilitate the nonvolatile memory technology that is starting to show up. Optane is very dense and energy efficient so long as you don't write to it. Caches could alleviate that, while HBM, for example, could provide enormous bandwidth. Just imagine a single Epyc with 4 HBM stacks at 8 GB each, if not more, providing 1 TB/s of bandwidth for a database. That likely reduces strain on Infinity as well if it is used as a cache, in addition to NVDIMMs as main memory.
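    For a rough sense of where that ~1 TB/s figure comes from, here is a quick back-of-the-envelope sketch. The per-stack numbers are assumptions (HBM2-class stacks with a 1024-bit interface at ~2 Gbps per pin, 8-Hi 8 GB stacks), not anything AMD has announced:

    ```python
    # Back-of-the-envelope bandwidth for a hypothetical 4-stack Epyc package.
    # Assumptions (not confirmed specs): HBM2-class stacks, 1024-bit bus per
    # stack, ~2.0 Gbps per pin -> ~256 GB/s per stack, 8 GB (8-Hi) capacity.

    BITS_PER_STACK = 1024        # HBM stack interface width in bits
    PIN_RATE_GBPS = 2.0          # assumed per-pin data rate (HBM2-class)
    STACKS = 4                   # stacks in the hypothetical package
    GB_PER_STACK = 8             # assumed 8-Hi stack capacity in GB

    per_stack_gbs = BITS_PER_STACK * PIN_RATE_GBPS / 8   # GB/s per stack
    total_gbs = per_stack_gbs * STACKS
    total_capacity = GB_PER_STACK * STACKS

    print(f"{per_stack_gbs:.0f} GB/s per stack, {total_gbs:.0f} GB/s total, "
          f"{total_capacity} GB of HBM")
    # -> 256 GB/s per stack, 1024 GB/s (~1 TB/s) total, 32 GB of HBM
    ```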
     
  4. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    You mean an Epyc SKU with 2 or 3 dies and 1 or 2 HBM dies as a last-level cache? IDK if we're gonna see something like that, but it would be really interesting to see for sure.

    And while we're on that, could future APUs use this approach? APU + HBM on the same package?
     
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Something along those lines. Not sure it will be HBM, but from a design standpoint I laid out some solid reasoning for doing so. Space would be interesting, but they could put two Ryzens on an interposer with an HBM stack shared between them. Then scale from there with long strips.

    Really hoping it happens, as it would have a ton of potential. It could be a different Epyc variant for competing with AVX-512, or for cases where that cache is useful.
     
  6. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,814
    Likes Received:
    5,915
    Location:
    ಠ_ಠ
    Spend more die area on L3 for 7nm?
     
  7. Cyan

    Cyan orange
    Legend Veteran

    Joined:
    Apr 24, 2007
    Messages:
    8,572
    Likes Received:
    2,293
    Lightman and Malo like this.
  8. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Certainly a possibility; however, currently they are using one chip for everything. The HBM or external-memory solution would be more flexible in terms of design. I'm not suggesting that 7nm parts won't have more L3 in addition to other changes. I could see favoring a larger L2 and replacing L3 with tightly integrated HBM, as the bandwidth isn't that different. Another possibility is replacing an entire CCX with stacked RAM: build HBM on top of Infinity as "next-gen" memory. Certainly possible with active interposers, which AMD has been exploring.

    What I'd be really curious to see is whether Nvidia makes a DGX-1 with Epyc as opposed to Xeons with two PLX switches. It should be faster and cheaper, not that price is much of a concern. Turn the market into a Mexican standoff.
     
  9. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    The problem is not necessarily the area; their L3 would need to be fully connected between the dies and work as a single L3 (which is not the case yet). In fact, I'm curious to see what they will implement to correct the L3 latency with Zen 2, because I know there are many, many engineers at AMD working on this problem. Time will tell what solution they find.
     
    #2309 lanek, Jul 16, 2017
    Last edited: Jul 16, 2017
  10. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,814
    Likes Received:
    5,915
    Location:
    ಠ_ಠ
    hm... How does it compare to IBM's implementation/configuration for L3? (Power 7/8/9)
     
  11. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    774
    Likes Received:
    202
    Earlier this year, there was speculation mentioning three possibilities that reach 48 cores:
    1. 6 cores per CCX, 2 CCXs per die, 4 dies per CPU.
    2. 4 cores per CCX, 3 CCXs per die, 4 dies per CPU.
    3. 4 cores per CCX, 2 CCXs per die, 6 dies per CPU.
    What are the advantages and disadvantages of each possibility from gaming and server perspectives?
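    For a quick sanity check on those three layouts, here is a tiny sketch. It is purely illustrative; the groupings are the speculated ones above, not confirmed configurations, and the per-die core count is shown because it bears on die size and on how much traffic must cross the inter-die fabric:

    ```python
    # Verify each speculated layout reaches 48 cores and show cores per die.
    # Purely illustrative; none of these configurations are confirmed.

    layouts = [
        ("6 cores/CCX, 2 CCXs/die, 4 dies", 6, 2, 4),
        ("4 cores/CCX, 3 CCXs/die, 4 dies", 4, 3, 4),
        ("4 cores/CCX, 2 CCXs/die, 6 dies", 4, 2, 6),
    ]

    for name, cores_per_ccx, ccx_per_die, dies in layouts:
        cores_per_die = cores_per_ccx * ccx_per_die
        total = cores_per_die * dies
        print(f"{name}: {cores_per_die} cores/die, {total} cores total")
    ```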
     
  12. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    The first solution seems the most "safe" and straightforward implementation from an infrastructure point of view. The rest would require significant investment in interface and wiring overhead.

    By the way, factoring in the memory-interfacing limitations of the SP3 socket, the six-die option is out of consideration anyway.
     
  13. doob

    Regular

    Joined:
    May 21, 2005
    Messages:
    392
    Likes Received:
    4
    Does AMD still plan on using HBM only on APUs?

    Wouldn't there also be a huge benefit to making a server-targeted product with, say, 6 cores per CCX, 2 CCXs per die, and 2 dies per CPU, with the remaining space reserved for HBM modules? Wouldn't 16 or 32 GB of HBM acting as an L4 cache provide more benefit than another 2 dies?
     
  14. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    More wires, while difficult to achieve, might be worthwhile in a mesh. That should reduce the bandwidth burden per link on Infinity. I'd lean more towards an even number of CCXs per die, to possibly interface HBM (or similar tech) with 2 channels. That should lower latency and link constraints while allowing them to vary cache sizes.

    I'm not sure we know they are using HBM even for APUs. The old Zeppelin APUs seem more along the lines of CPU + discrete GPU, with PCIe being an extension of Infinity. That said, integrating HBM onto a traditional APU or server CPU would solve a lot of bandwidth and socket-limitation issues.

    It should still be situational, but yeah, for many memory-intensive applications and databases that bandwidth could be a game changer. DDR4-3200 (25.6 GB/s per channel) with 4 channels would be roughly half the bandwidth of a single stack of HBM. DDR5 would double that and be roughly equivalent, but that's not considering faster HBM. For a low-end system, a single stack of HBM could allow for completely removing traditional memory. An 8 GB stack of HBM should be sufficient for most simple desktop operations and have its cost offset by savings on memory, motherboard complexity, size, and probably even power.
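    Putting rough numbers on that comparison; the per-stack HBM figures below are assumptions (roughly 128 GB/s for a first-generation stack, roughly 256 GB/s for an HBM2-class stack):

    ```python
    # Rough bandwidth comparison: quad-channel DDR4-3200 vs. one HBM stack.
    # Per-stack HBM figures are assumptions, not vendor specs.

    ddr4_3200_per_channel = 25.6                    # GB/s per 64-bit channel
    channels = 4
    ddr4_total = ddr4_3200_per_channel * channels   # 102.4 GB/s

    hbm_per_stack = {"HBM1": 128.0, "HBM2": 256.0}  # GB/s, assumed figures

    for gen, bw in hbm_per_stack.items():
        print(f"4ch DDR4-3200 = {ddr4_total:.1f} GB/s, "
              f"one {gen} stack = {bw:.0f} GB/s "
              f"(DDR4 is {ddr4_total / bw:.2f}x of the stack)")
    ```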

    On that note, there has been a fair amount of database acceleration using GPUs. So in that case, half the CPU cores with a GPU and HBM attached would likely be a game changer. Say two CCXs and 1-2 Vega 11(?) with a single stack of HBM. Having seen 8-Hi stacks, I'd think 8 GB of L4/LLC roughly tripling (L4 + DDR4) your effective memory bandwidth would be a solid step. Provided sufficient power, it should fit in the same socket as well. Ultimately it would seem a question of how much they want to differentiate their product stack: have 40 or so chips like Intel, or reuse one chip for the entire stack like Ryzen. An APU could make that two chips while handling AVX-512 and bandwidth-intensive applications, which tend to be parallel workloads anyway.
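    A similarly rough look at that "tripling" figure, taking the best case where the HBM L4 and the DDR4 channels serve traffic concurrently; the HBM per-stack bandwidth is again an assumption:

    ```python
    # Best-case combined bandwidth with an HBM stack acting as L4 in front of
    # quad-channel DDR4, assuming the L4 and DRAM can stream concurrently.
    # The 256 GB/s per-stack figure is an assumption (HBM2-class).

    ddr4_bw = 4 * 25.6      # GB/s, quad-channel DDR4-3200
    hbm_bw = 256.0          # GB/s, assumed HBM2-class stack
    combined = ddr4_bw + hbm_bw

    print(f"DDR4 alone: {ddr4_bw:.1f} GB/s")
    print(f"HBM L4 + DDR4 concurrently: {combined:.1f} GB/s "
          f"({combined / ddr4_bw:.1f}x DDR4 alone)")
    # -> ~358 GB/s, about 3.5x, i.e. in the "roughly tripling" ballpark
    ```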

    Another possibility might be sacrificing PCIe lanes for memory channels: have 8 or more channels per socket for interfacing a LOT of NVDIMMs. I haven't looked into how well NVDIMMs work with ECC/RAID for redundancy on what is likely a storage server.
     
  15. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    398
    I don't think AMD will go for more than 4 cores per CCX, since the design needs to scale from notebooks to servers. Maybe 3 CCXs per die.
     
  16. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,301
    Likes Received:
    397
    Location:
    Australia
    @3dilettante
    AMD confirms single-ended, not LVDS, for the GMI links (around 5:20), so that makes a lot more sense given the bandwidth numbers, clock rates, and number of pins.
     
  17. Cyan

    Cyan orange
    Legend Veteran

    Joined:
    Apr 24, 2007
    Messages:
    8,572
    Likes Received:
    2,293
  18. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    12,796
    Likes Received:
    9,139
    Location:
    Cleveland
    @Cyan and everyone else...

    Please don't post pics of words when actual words are more suitable. The images don't always load or are removed.
     
    tinokun, hoom, Lightman and 3 others like this.
  19. Pressure

    Veteran Regular

    Joined:
    Mar 30, 2004
    Messages:
    1,355
    Likes Received:
    283
    That and videos. Besides, images consisting only of text are a waste of bandwidth and annoying to read on HiDPI monitors.
     
    BRiT likes this.
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    Per the article, it's in the memory standard for the higher speeds, for stability purposes. To cover a wider range of module qualities, the standard is more concerned with getting as broad a level of compatibility and stability as possible rather than maximizing performance.

    IBM's eDRAM cache is large but is subdivided into local slices. Whether a slice is private to a core or shared between two depends on the generation. Relative to a CCX, there's at least twice as much L3 per core as AMD.
    IBM's bandwidth between levels ranges from 2-4 times that of AMD, and unlike AMD's, the L3 has a more complex relationship in that it will eventually copy hot L2 lines into itself; the chip overall has a complex coherence protocol and migration policies for cloning shared data between partitions.
    Latency-wise, the L1 and L2 in recent POWER chips are on the order of Intel's caches, and the local L3 is something like ~27 cycles versus AMD's ~30-40. POWER8 can seemingly muster this at significantly more than 4 GHz; POWER9 is less documented but seems to have 4 GHz as a starting point. The L3 is an eDRAM cache, so it may have non-best-case latencies that differ from the SRAM-based L3 of Zen. DRAM can be more finicky in terms of its access patterns and when it is occupied with internal array maintenance like refresh.
    The cache hierarchy is dissimilar, with a write-through L1 and an L2-L3 relationship that is more inclusive. Unlike AMD's, the POWER L3 can cache L2 data and can participate in memory prefetch.
    Remote L3 access is pretty long with IBM, but Zen is equally poor. In terms of the bandwidth for those remote hits, AMD's fabric is missing a zero in the figure even on-die.
    IBM's die area, power, and price for all that are typically not in the same realm.
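    For context, converting those local-L3 cycle counts to wall-clock time. The cycle counts are the ones quoted above; the clock frequencies are assumptions for illustration only:

    ```python
    # Convert local-L3 hit latency from cycles to nanoseconds.
    # Cycle counts are from the post above; the clocks are assumed values,
    # and 35 cycles is just the midpoint of the quoted ~30-40 range for Zen.

    cases = [
        ("POWER8 local L3", 27, 4.2),   # ~27 cycles at an assumed ~4.2 GHz
        ("Zen local L3",    35, 3.2),   # ~30-40 cycles at an assumed ~3.2 GHz
    ]

    for name, cycles, ghz in cases:
        ns = cycles / ghz
        print(f"{name}: {cycles} cycles at {ghz} GHz = {ns:.1f} ns")
    ```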

    Some areas that do add up more favorably for EPYC are the DRAM and IO links per socket.

    If the differential PHY is running at 10.6 Gbps for xGMI, The Stilt's finding that GMI is twice as wide and half as fast gives each package link 32 signals in each direction, at a more sedate 5.3 Gbps.
    AMD's diagram in its Processor Programming Reference document has 4 GMI controllers, though the MCM only uses 3 per die.
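    Running the numbers on that description (32 single-ended signals per direction at half the 10.6 Gbps xGMI rate is the working assumption from the post above, not a published AMD figure):

    ```python
    # GMI package-link bandwidth from width x per-pin rate, per the assumption
    # above: 32 signals per direction at 5.3 Gbps (half the 10.6 Gbps xGMI PHY).

    signals_per_direction = 32
    gbps_per_signal = 10.6 / 2                                   # 5.3 Gbps

    per_direction_gbs = signals_per_direction * gbps_per_signal / 8   # GB/s
    bidirectional_gbs = per_direction_gbs * 2

    print(f"{per_direction_gbs:.1f} GB/s per direction, "
          f"{bidirectional_gbs:.1f} GB/s bidirectional per GMI link")
    # -> ~21.2 GB/s per direction, ~42.4 GB/s bidirectional
    ```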
     
    #2320 3dilettante, Jul 17, 2017
    Last edited: Jul 17, 2017
    TheAlSpark likes this.