AMD Ryzen CPU Architecture for 2017

Discussion in 'PC Industry' started by fellix, Oct 20, 2014.

Tags:
  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,239
    Likes Received:
    3,184
    Location:
    Finland
Infinity Fabric runs at the same clock as the memory controllers, i.e. half the DDR transfer rate.
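As a quick sanity check of that relationship, here is a one-line sketch (my own illustration; the function name and example speeds are just for demonstration):

```python
# DDR memory is double data rate: two transfers per memory-controller clock
# (MEMCLK), so the fabric clock claimed above is half the DDR transfer rate.
def fabric_clock_mhz(ddr_transfer_rate_mt_s: float) -> float:
    """Return the MEMCLK/fabric frequency for a given DDR transfer rate."""
    return ddr_transfer_rate_mt_s / 2

print(fabric_clock_mhz(2666))  # DDR4-2666 -> 1333 MHz fabric clock
print(fabric_clock_mhz(3200))  # DDR4-3200 -> 1600 MHz fabric clock
```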
     
  2. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
The links operate at that speed, but I'm guessing the controller itself is faster to accommodate multiple connections.
     
  3. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    314
    Likes Received:
    225
Synthetic memory bandwidth benchmarks show that the DRAM bandwidth can be sustained, though. I would rather guess that the memory controller is pipelined around the clean/NACK cases, and that forwarding of dirty lines blocks the memory controller for extra cycles, resulting in the below-peak bandwidth in certain "inter-CCX" circumstances.

It doesn't have to be a full crossbar between all participants either. Say, a 3-to-2 crossbar (CCXs/IO to MCTs), so that dirty lines must first go through the memory controller to make their way to the destination.
     
    #1563 pTmdfx, Mar 20, 2017
    Last edited: Mar 20, 2017
  4. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    948
    Likes Received:
    409
    Cyan, Lightman, hoom and 4 others like this.
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,402
    Likes Received:
    4,111
    Location:
    Well within 3d
    I wouldn't know about blocking the controller in a non-pipelined fashion. It would only get worse with Naples if the controller needs to service multiple snoops and each one takes it out of commission with a dead cycle.

AMD's MOESI implementation has, at least so far, relied on the memory controller as the arbiter, since it uses home snooping. However, the controller should have the raw bandwidth to handle a full 32 B per cycle, given the raw throughput of dual 8-byte DDR channels.
    The tests trying to isolate cache transfer bandwidth should at least in theory try to avoid stressing the DRAM channel at the same time--although it could be they are written in a way that prompts a write to DRAM more often than not.

    One possible test would be to write a two-way inter-CCX test to see if the aggregate bandwidth rises.
    Another unknown, if the controller is the home node, is if the memory channels are running in ganged versus unganged mode.

    A single channel would own a cache line in unganged mode, and might get independent scheduling and home-node arbitration responsibilities from the other. That could potentially also split its bandwidth if the other channel is statically given half the memory controller block's resources.
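The 32 B/cycle figure mentioned above follows from simple arithmetic, sketched here under the stated assumptions (two 64-bit channels, double data rate):

```python
# Back-of-the-envelope check, not vendor data: two DDR4 channels, each
# 8 bytes (64 bits) wide, transferring twice per memory-controller clock.
CHANNELS = 2
BYTES_PER_TRANSFER = 8    # 64-bit channel width
TRANSFERS_PER_CLOCK = 2   # double data rate

bytes_per_memclk = CHANNELS * BYTES_PER_TRANSFER * TRANSFERS_PER_CLOCK
print(bytes_per_memclk)   # 32 bytes per memory-controller clock
```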
     
  6. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    314
    Likes Received:
    225
In Summit Ridge, it is possible to have rather constant timing for snooping, since the memory controller only needs to snoop locally, probably with a single target (the foreign CCX). So ideally they might have built a memory controller that is pipelined deeply/parallelized just enough to hide both DRAM and snoop latencies, based on the assumption that MOESI always has clean lines delivered from DRAM. Dirty lines, however, might result in rather high variation, since the delivery might cause evictions that are not necessarily clean or silent.

If Naples deploys directories, e.g. like the earlier Opteron chips, it might have more predictable timing under similar circumstances.
     
    Heinrich4 likes this.
  7. ProspectorPete

    Regular Newcomer

    Joined:
    Feb 1, 2017
    Messages:
    414
    Likes Received:
    137


This looks good if they are correct.
     
    Lightman likes this.
  8. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    948
    Likes Received:
    409
The erratic behavior of the games, where some are better with 6 cores, some with 4, and some with 8, is pretty interesting.

Could it be technically possible to schedule cores on demand? I mean, if you have a program that is known to only use 4 cores, only schedule 4 cores to that app and boost the frequency. That could give you the best of both worlds.
     
  9. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,515
    Likes Received:
    936
    As a user, you can disable cores, so yes, you could do that.

    As a developer, you can use thread affinity to the same effect, but without having to disable cores. It's usually better to let the OS's scheduler do its thing, though.
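The developer-side option can be sketched with the standard-library affinity API (Linux-only; the helper name and core count here are my own choices, not a recommended practice):

```python
# Minimal sketch: restrict the current process to a subset of its allowed
# CPUs via thread/process affinity, instead of disabling cores in firmware.
import os

def pin_to_first_n(n: int) -> set:
    """Restrict this process to the first n currently-allowed CPUs
    and return the resulting affinity mask."""
    allowed = sorted(os.sched_getaffinity(0))  # 0 = the calling process
    os.sched_setaffinity(0, set(allowed[:n]))
    return os.sched_getaffinity(0)

print(pin_to_first_n(4))  # e.g. a game known to scale to only 4 cores
```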
     
  10. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,110
    Likes Received:
    2,579
    Location:
    Germany
    Did anyone do a decent die plot analysis yet?

Here is one based on AMD's published die shot, but there are a lot of structures which are... strange, to say the least:
[IMG]

    [1] I would connect to the memory controllers
    [2] Could be external inter-CCX-links (IF)
    [3] Might be the chipset connection (GMI)
    [4] I forgot..
[5] PCI Express, probably. But the compartmentalization seems strange. (There's one single block at the bottom as well.)

We have to account for the various SoC elements on the die as well.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,402
    Likes Received:
    4,111
    Location:
    Well within 3d
    A possible wrinkle based on earlier Opterons is that the L3 can service a remote request if it has a clean line in the Exclusive state:
    https://people.freebsd.org/~lstewart/articles/cache-performance-x86-2009.pdf (tables 2,3,4).
    That may still apply with Zen, since this is an on-die but still remote snoop.

    It would take a test that specifically profiled different cache states and caches to tease this out for Zen. If this still happens in Zen, this would appear to make the DRAM access or a dirty eviction unnecessary--making it a question of what the controller is capable of handling.

    Hypertransport Assist was a probe filter, however, and also did not have full coverage of a multi-socket system. I'm curious if Zen expands this to handle Naples in one socket, and what happens at two sockets. One difference now is that the L3 is no longer associated with the northbridge and that may have implications when it comes to putting cores to sleep if the filter/directory still uses the L3 for holding its tables.
     
  12. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    [1] are basically SRAM blocks (each about [strike]1[/strike] 2 MB in size? edit: looks to be higher density than L3 to me and there are exactly 512 SRAM banks in there, so probably 2MB if they didn't use larger SRAM cells than for L3 [unlikely]) and could also be part of some memory structure for filtering coherence traffic (like the old HTAssist but using a separate memory structure instead of using a part of the L3).
[2] I would agree, although there is a fourth structure identical to the ones you labeled "2" in the very top left corner (just split into two rows, but otherwise identical to the other three instances). And all consist of two identical halves (which may indicate the interfaces can be split into half-interfaces, as was possible with the cHT interfaces of the old Opterons).
    [edit]
This could actually make sense if each CCX has in principle three interfaces for creating a mesh between the CCXs. The on-die connection between the CCXs would be the third interface in addition to the two external ones (off die but on package) for each CCX. Naples with 4 dies could look like a cube in that case. Each corner is a CCX, the 4 vertical edges are the on-die connections, and the top and bottom faces are connected by the two external interfaces from each CCX (one face can omit two edges and connect the diagonals instead to reduce the average number of hops for communication). If those are actually 4 interfaces per CCX (each visible half of the 4 structures labeled "2" can establish an independent link), one could almost fully connect the cube (5 of the other 7 CCXs can be reached in a single hop).
    [/edit]
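The hop counts in such a cube topology are easy to check with a small BFS sketch (my own toy model of the plain 3-link cube, ignoring the diagonal-rewiring and 4-interface variants discussed above):

```python
# Toy model: 8 CCXs at the corners of a cube, 3 links each. Corners are
# 3-bit labels; two corners are linked iff the labels differ in one bit.
from collections import deque

def neighbors(node: int):
    return [node ^ (1 << bit) for bit in range(3)]  # flip one coordinate

def hops_from(src: int) -> dict:
    """Breadth-first search: hop count from src to every corner."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in neighbors(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist

d = hops_from(0)
print(max(d.values()))      # worst case: 3 hops (the opposite corner)
print(sum(d.values()) / 7)  # average hops to the 7 other CCXs
```

This is what motivates rewiring one face's diagonals or adding a fourth link: both trims the 3-hop worst case and lowers the average.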
    [3] As the chipset uses basically a 4x PCIe interface, I wouldn't expect a separate PHY for it, especially as the 4 lanes normally used to connect to the chipset are usable as normal PCIe lanes in case of the X300 "chipset" (which uses an LPC interface to connect to the CPU as there are no bandwidth hungry parts in it; X300 basically just provides the connections to the BIOS/UEFI, AC97 codec, timers, legacy interfaces, the optional TPM module, and is a hardware dongle to enable OC).
    My guess for "3" would be the 4 USB 3.1 Gen1 PHYs capable of 5GT/s (and therefore a bit smaller than the 8GT/s PCIe PHYs).

    [5] Yes. The split in two blocks with 16 lanes each could be a layout optimization for the Naples MCM.
One idea for the additional small block at the bottom (which would indicate the Zeppelin die actually features 34 PCIe lanes) could be that the PCIe PHYs are switchable between SATA and PCIe mode, and AMD wanted to provide a minimum of SATA ports even when Naples uses all available PCIe lanes for something else (and it was easier to duplicate a single type of PHY than to integrate specialized SATA PHYs in addition). And it would fit with the two SATA ports Ryzen provides. But who knows?
     
    #1572 Gipsel, Mar 21, 2017
    Last edited: Mar 21, 2017
    Heinrich4, BRiT and Lightman like this.
  13. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,110
    Likes Received:
    2,579
    Location:
    Germany
[1] From what I gathered, AMD said they concurrently snoop the other CCX's L3 cache and start a memory request, the latter of which is cancelled if the other L3 holds the necessary data. Maybe those SRAM cells are indeed just buffers, in order not to completely thrash the memory controllers with potentially useless lookups.

[2] Nice catch with the fourth IF "ring-stop", if that's what it is.

    [3] Could be that it is USB. But..

[5] Interestingly, Summit Ridge is said to only sport 24 PCIe Gen3 lanes, but if I count correctly, we have 34 of them, with a 2-lane block being very separate. From the layout, it looks like we have 6× 4-lane ports and 5× 2-lane ports. Might also be for use in data center environments, where you might want to connect a lot of M.2 SSDs.

    - Then we have 3 copies of two blocks between the CCXs (next to the left CCX)
    - 2 copies of 1 block/structure right above the lowermost IF-stop.
     
    Cyan likes this.
  14. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    948
    Likes Received:
    409
First leak of the Zen APU:

    http://www.eteknix.com/amd-zen-4-core-8-thread-raven-ridge-apu-benchmark-leaked/

Looks like the CPU is on par with the 6600, at least in this particular test.




Very interesting behavior of the CPU when using a lower-performance VGA (1070/1076). Also, this test shows how little performance is gained by running more cores. But with the 1500 giving you all you need for less than a used and burned 4770K, the number of 4-core/8-thread chips will increase, and devs will have a real reason to develop for them.
     
    #1574 xEx, Mar 22, 2017
    Last edited: Mar 22, 2017
  15. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    3,075
    Likes Received:
    596
With Naples having 128 lanes, why wouldn't Summit Ridge have 32?
Wasn't there a similar situation of a lone PCIe/HT block winding up being for factory test on an older AMD CPU?
     
  16. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,110
    Likes Received:
    2,579
    Location:
    Germany
    Indeed.
     
  17. Heinrich4

    Regular

    Joined:
    Aug 11, 2005
    Messages:
    596
    Likes Received:
    9
    Location:
    Rio de Janeiro,Brazil
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,402
    Likes Received:
    4,111
    Location:
    Well within 3d
    The apparent size of those arrays points to a capacity on the order of the probe filter in Hypertransport Assist, although the filter's role isn't to spare the memory controller. In the previous chips with MOESI, the memory controller is the home agent that does the snooping and arbitrates conflicts.
    I think it shouldn't take that capacity of SRAM to spare the memory controller itself, since the caches that can hit memory should be relatively small in the number of misses they can have in-flight.

However, if a filter is what those arrays are for, the question of whether a memory access or a CCX forward is needed would be answered when the controller checks the filter. Skipping the serial lookup in the (memory-clocked?) table and just sending a probe to the other CCX while adding a request to the memory channel's queue could be a single-chip latency optimization.
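As an illustration of what such a filter buys the home controller, here is a toy directory sketch (the class, states, and single-remote-CCX assumption are mine, not AMD's implementation):

```python
# Toy probe filter: a directory keyed by line address that tells the home
# memory controller whether the remote CCX may hold the line, so it can
# choose between a cross-CCX probe and a DRAM fetch on a miss.
NOT_CACHED, CACHED_REMOTE = "not_cached", "cached_remote"

class ProbeFilter:
    def __init__(self):
        self.table = {}  # line address -> state

    def record_fill(self, addr: int) -> None:
        """The remote CCX pulled this line into its caches."""
        self.table[addr] = CACHED_REMOTE

    def record_evict(self, addr: int) -> None:
        """The remote CCX dropped the line (clean or written back)."""
        self.table.pop(addr, None)

    def lookup(self, addr: int) -> str:
        """'probe_ccx' if the remote CCX may hold the line, else 'fetch_dram'."""
        if self.table.get(addr) == CACHED_REMOTE:
            return "probe_ccx"
        return "fetch_dram"

pf = ProbeFilter()
pf.record_fill(0x1000)
print(pf.lookup(0x1000))  # probe_ccx
print(pf.lookup(0x2000))  # fetch_dram
```

A real filter would be set-associative SRAM with limited capacity (hence the sizing discussion above), and evicting a filter entry would itself force a back-probe, which this sketch ignores.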
     
  19. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    948
    Likes Received:
    409
  20. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    948
    Likes Received:
    409
Benchmarks with 3600 RAM (I have to admit that even I am surprised; I hope it's not fake):

    http://i.imgur.com/tvtkbtb.jpg

The image was too big to be posted directly, so I left the link instead.

    source vid:

     
    ImSpartacus and BRiT like this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.