AMD RyZen CPU Architecture for 2017

Infinity Fabric is a mesh, so that was my assumption as well. Aggregate link bandwidth approaches or exceeds L3 bandwidth as nodes and links are added. AMD did mention routing around congested links, though. These results would imply links can't be bonded together, since the node count should scale to 4-8 CCXs or more for APUs. So what limits the links: the protocol?
Infinity Fabric runs at the same clock as the memory controllers, i.e. half the DDR transfer rate.
 
It could also be a question of whether or not something blows up. Something that crosses so many functional blocks and clock regions could have timing issues that prompted them to back off. One item to note is that the 1:1 ratio is between the fabric and the DDR clock, not the IMC. Going to 1:1 actually creates a mismatch with the controller itself. It appears to be a straightforward ratio to maintain, but still tightens margin for error or potentially adds a ceiling to the DRAM clock if the uncore cannot match it.
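For concreteness, here is a quick back-of-the-envelope of what a 1:1 fabric/MEMCLK ratio would mean for bandwidth, assuming the 32 B per fabric clock link width that comes up below. Numbers are illustrative only:

[code]
# Rough numbers only: assumes the fabric moves 32 B per fabric clock per direction
# and that the fabric clock equals MEMCLK (half the DDR transfer rate).
def fabric_bw_gbs(ddr_mt_s, bytes_per_clk=32):
    memclk_hz = ddr_mt_s / 2 * 1e6          # e.g. DDR4-2666 -> 1333 MHz MEMCLK
    return memclk_hz * bytes_per_clk / 1e9

for ddr in (2133, 2666, 3200):
    dram_peak = ddr * 8 * 2 / 1000          # dual 8-byte DDR channels, GB/s
    print(f"DDR4-{ddr}: fabric ~{fabric_bw_gbs(ddr):.1f} GB/s per direction, "
          f"dual-channel DRAM peak ~{dram_peak:.1f} GB/s")
[/code]

The two figures track each other exactly, which is presumably the point of tying the clocks together.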

I saw a link on Ars Technica concerning the fabric width:
https://www.reddit.com/r/Amd/comments/5zr8lv/i_asked_amd_a_followup_question_about_infinity/

This is a 256-bit bidirectional crossbar. The inter-CCX bandwidth numbers given earlier don't act like it's 256 bits of bandwidth per clock in each direction, however.
Synthetic memory bandwidth benchmarks show that the DRAM bandwidth can be sustained though. I would rather guess that the memory controller is pipelined around clean/NACK cases, and forwarding of dirty lines would block the memory controller for extra cycles, resulting in the off-the-peak bandwidth in certain "inter-CCX" circumstances.

It doesn't have to be a full crossbar between all participants either. Say, perhaps, a 3-to-2 crossbar (CCXs/IO to MCTs), so dirty lines must first go through the memory controller to make their way to the destination.
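As a rough illustration of that blocking argument, here is a toy occupancy model. Every parameter in it (extra clocks per dirty forward, the dirty fraction) is made up; it just shows how a modest per-line stall pulls the sustained figure off the peak:

[code]
# Toy occupancy model, all parameters hypothetical: a 64 B line takes 2 fabric
# clocks at 32 B/clk when fully pipelined; a dirty-line forward is assumed to
# hold the controller for a few extra clocks instead of overlapping.
def effective_bw_gbs(memclk_mhz, dirty_fraction, extra_clks, clean_clks=2, line=64):
    avg_clks = clean_clks + dirty_fraction * extra_clks
    return line * memclk_mhz * 1e6 / avg_clks / 1e9

for frac in (0.0, 0.25, 0.5, 1.0):
    print(f"dirty fraction {frac:.2f}: ~{effective_bw_gbs(1333, frac, extra_clks=6):.1f} GB/s")
[/code]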
 
Synthetic memory bandwidth benchmarks show that the DRAM bandwidth can be sustained though. I would rather guess that the memory controller is pipelined around clean/NACK cases, and forwarding of dirty lines would block the memory controller for extra cycles, resulting in the off-the-peak bandwidth in certain "inter-CCX" circumstances.
I wouldn't know about blocking the controller in a non-pipelined fashion. It would only get worse with Naples if the controller needs to service multiple snoops and each one takes it out of commission with a dead cycle.

AMD's MOESI at least so far has relied on the memory controller as the arbiter, since it uses home snooping. However, the controller should have the raw bandwidth to handle a full 32B per cycle, given the raw throughput of dual 8-byte DDR channels.
The tests trying to isolate cache transfer bandwidth should at least in theory try to avoid stressing the DRAM channel at the same time--although it could be they are written in a way that prompts a write to DRAM more often than not.

One possible test would be to write a two-way inter-CCX test to see if the aggregate bandwidth rises (a sketch of such a test follows at the end of this post).
Another unknown, if the controller is the home node, is whether the memory channels are running in ganged or unganged mode.

In unganged mode a single channel would own a given cache line, and might get scheduling and home-node arbitration responsibilities independent of the other channel. That could potentially also split its bandwidth if each channel is statically given half the memory controller block's resources.
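Here is roughly the shape the two-way test mentioned above could take. The core numbers (0 on one CCX, 4 on the other) are an assumption about the logical CPU enumeration, Python is used only to show the structure, and the absolute numbers it prints will be dominated by interpreter and copy overheads; a real measurement would want a compiled copy kernel.

[code]
import os, time
from multiprocessing import Process, Barrier
from multiprocessing.shared_memory import SharedMemory

SIZE = 8 << 20          # 8 MB per buffer: bigger than one CCX's L3 slice
ITERS = 100

def worker(core, read_name, write_name, barrier):
    os.sched_setaffinity(0, {core})              # pin to one core (Linux only)
    src, dst = SharedMemory(name=read_name), SharedMemory(name=write_name)
    barrier.wait()                               # start both directions together
    t0 = time.perf_counter()
    for _ in range(ITERS):
        dst.buf[:SIZE] = src.buf[:SIZE]          # pull lines last written by the peer
    dt = time.perf_counter() - t0
    print(f"core {core}: ~{ITERS * SIZE / dt / 1e9:.2f} GB/s one way")
    src.close(); dst.close()

if __name__ == "__main__":
    a, b = SharedMemory(create=True, size=SIZE), SharedMemory(create=True, size=SIZE)
    barrier = Barrier(2)
    # Hypothetical topology: logical CPU 0 on CCX0, logical CPU 4 on CCX1.
    procs = [Process(target=worker, args=(0, a.name, b.name, barrier)),
             Process(target=worker, args=(4, b.name, a.name, barrier))]
    for p in procs: p.start()
    for p in procs: p.join()
    for shm in (a, b): shm.close(); shm.unlink()
[/code]

If the aggregate of the two directions roughly doubles the one-way figure, the shared link/controller isn't the bottleneck; if it stays flat, it is.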
 
I wouldn't know about blocking the controller in a non-pipelined fashion. It would only get worse with Naples if the controller needs to service multiple snoops and each one takes it out of commission with a dead cycle.
In Summit Ridge it is possible to have fairly constant snoop timing, since the memory controller only needs to snoop locally, probably with just one target (the foreign CCX). So ideally they might have built a memory controller that is pipelined deeply and parallelized just enough to hide both DRAM and snoop latencies, on the assumption that MOESI always delivers clean lines from DRAM. Dirty lines, however, might result in rather high variation, since the delivery might cause evictions that are not necessarily clean or silent.
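To put a rough number on "pipelined/parallelized just enough": by Little's law, sustaining the full fabric rate needs roughly bandwidth times latency worth of lines in flight. The clock and latency figures below are assumptions, not measured values:

[code]
# Little's law: in-flight lines = target bandwidth * exposed latency / line size.
# MEMCLK of 1333 MHz and 32 B/clk are assumed; latencies are illustrative.
def lines_in_flight(latency_ns, memclk_mhz=1333, bytes_per_clk=32, line=64):
    target_bw = bytes_per_clk * memclk_mhz * 1e6        # bytes per second
    return target_bw * latency_ns * 1e-9 / line

for lat in (40, 80, 120):   # e.g. snoop hit vs. DRAM vs. DRAM plus a dirty eviction
    print(f"{lat:3d} ns exposed latency -> ~{lines_in_flight(lat):.0f} 64 B lines in flight")
[/code]

So any case that stretches the exposed latency without the controller tracking more requests shows up directly as lost bandwidth, which fits the dirty-line variation argument.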

If Naples is going to deploy directories, e.g. like earlier Opteron chips, it might have predictable timing under similar circumstances.
 
The erratic behavior of the games, where some are better with 6 cores, some with 4, and some with 8, is pretty interesting.

Could it be technically possible to schedule cores on demand? I mean, if you have a program that is known to only use 4 cores, only schedule 4 cores to that app and boost the frequency. That could give you the best of both worlds.
 
As a user, you can disable cores, so yes, you could do that.

As a developer, you can use thread affinity to the same effect, but without having to disable cores. It's usually better to let the OS's scheduler do its thing, though.
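On Linux the call for this is sched_setaffinity; a minimal sketch in Python below. Which logical CPUs map to one CCX, and whether 0-3 are SMT siblings, depends on the platform's enumeration, so check lstopo or /proc/cpuinfo first.

[code]
import os

# Keep this process, and any threads it spawns, on logical CPUs 0-3 (Linux only).
os.sched_setaffinity(0, {0, 1, 2, 3})      # pid 0 means the calling process
print("now restricted to logical CPUs:", sorted(os.sched_getaffinity(0)))
[/code]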
 
Did anyone do a decent die plot analysis yet?

Here is one based on AMD's published die shot, but there are a lot of structures which are... strange, to say the least:
[annotated die shot: W2bQbq0.jpg]


[1] These I would connect to the memory controllers.
[2] Could be external inter-CCX links (IF).
[3] Might be the chipset connection (GMI).
[4] I forgot...
[5] Probably PCI Express, but the compartmentalization seems strange. (There's one single block at the bottom as well.)

We have to account for the various SOC elements on the die as well.
 
In Summit Ridge it is possible to have fairly constant snoop timing, since the memory controller only needs to snoop locally, probably with just one target (the foreign CCX). So ideally they might have built a memory controller that is pipelined deeply and parallelized just enough to hide both DRAM and snoop latencies, on the assumption that MOESI always delivers clean lines from DRAM. Dirty lines, however, might result in rather high variation, since the delivery might cause evictions that are not necessarily clean or silent.
A possible wrinkle based on earlier Opterons is that the L3 can service a remote request if it has a clean line in the Exclusive state:
https://people.freebsd.org/~lstewart/articles/cache-performance-x86-2009.pdf (tables 2,3,4).
That may still apply with Zen, since this is an on-die but still remote snoop.

It would take a test that specifically profiled different cache states and caches to tease this out for Zen. If this still happens in Zen, this would appear to make the DRAM access or a dirty eviction unnecessary--making it a question of what the controller is capable of handling.

If Naples is going to deploy directories, e.g. like earlier Opteron chips, it might have predictable timing under similar circumstances.
Hypertransport Assist was a probe filter, however, and also did not have full coverage of a multi-socket system. I'm curious if Zen expands this to handle Naples in one socket, and what happens at two sockets. One difference now is that the L3 is no longer associated with the northbridge and that may have implications when it comes to putting cores to sleep if the filter/directory still uses the L3 for holding its tables.
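As a sketch of what a filter at the home node buys you (the structure below is invented for illustration and ignores coherence states, capacity, and evictions): the controller can answer "nobody else may have it" without probing at all.

[code]
# Minimal probe-filter model: the home memory controller remembers which CCX,
# if any, may hold a line, so an untracked line skips the cross-CCX probe.
filter_table = {}                      # line address -> set of CCX ids that may hold it

def handle_read(line_addr, requester_ccx):
    sharers = filter_table.get(line_addr, set()) - {requester_ccx}
    action = f"probe CCX {sharers.pop()}" if sharers else "fetch from DRAM, no probe"
    filter_table.setdefault(line_addr, set()).add(requester_ccx)
    return action

print(handle_read(0x1000, requester_ccx=0))   # untracked line: DRAM only
print(handle_read(0x1000, requester_ccx=1))   # CCX0 may hold it: probe CCX0
[/code]

The interesting part for power management is where that table lives; if it sits in the L3 as with HT Assist, the L3 (and its CCX) has to stay reachable.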
 
Did anyone do a decent die plot analysis yet?

Here is one based on AMD's published die shot, but there are a lot of structures which are... strange, to say the least:
[annotated die shot: W2bQbq0.jpg]


[1] These I would connect to the memory controllers.
[2] Could be external inter-CCX links (IF).
[3] Might be the chipset connection (GMI).
[4] I forgot...
[5] Probably PCI Express, but the compartmentalization seems strange. (There's one single block at the bottom as well.)

We have to account for the various SOC elements on the die as well.
[1] These are basically SRAM blocks (each about [strike]1[/strike] 2 MB in size? Edit: it looks higher density than the L3 to me, and there are exactly 512 SRAM banks in there, so probably 2 MB if they didn't use larger SRAM cells than for the L3 [unlikely]). They could also be part of some memory structure for filtering coherence traffic (like the old HT Assist, but using a separate memory structure instead of a part of the L3).
[2] I would agree, although there is a fourth structure identical to the ones you labeled as "2" in the very top left corner (just split into two rows, but otherwise identical to the other three instances). And all consist of two identical halves (which may indicate the interfaces can be split into half-width interfaces, as was possible with the cHT interfaces of the old Opterons).
[edit]
This could actually make sense if each CCX has in principle three interfaces for building a mesh between the CCXs. The on-die connection between the CCXs would be the third interface, in addition to the two external ones (off die but on package) for each CCX. Naples with 4 dies could look like a cube in that case: each corner is a CCX, the 4 vertical edges are the on-die connections, and the top and bottom faces are connected by the two external interfaces from each CCX (one face could omit two edges and connect the diagonals instead, to reduce the average number of hops; a quick hop-count check for the plain cube is sketched after this list). If those are actually 4 interfaces per CCX (each visible half of the 4 structures labeled "2" being able to establish an independent link), one could almost fully connect the cube (5 of the other 7 CCXs could be reached in a single hop).
[/edit]
[3] As the chipset uses basically a 4x PCIe interface, I wouldn't expect a separate PHY for it, especially as the 4 lanes normally used to connect to the chipset are usable as normal PCIe lanes in case of the X300 "chipset" (which uses an LPC interface to connect to the CPU as there are no bandwidth hungry parts in it; X300 basically just provides the connections to the BIOS/UEFI, AC97 codec, timers, legacy interfaces, the optional TPM module, and is a hardware dongle to enable OC).
My guess for "3" would be the 4 USB 3.1 Gen1 PHYs capable of 5GT/s (and therefore a bit smaller than the 8GT/s PCIe PHYs).

[5] Yes. The split into two blocks of 16 lanes each could be a layout optimization for the Naples MCM.
One idea for the additional small block at the bottom (which would mean the Zeppelin die actually features 34 PCIe lanes) could be that the PCIe PHYs are switchable between SATA and PCIe mode, and AMD wanted to provide a minimum of SATA ports even when Naples uses all available PCIe lanes for something else (it was easier to duplicate a single type of PHY than to integrate specialized SATA PHYs in addition). It would also fit with the two SATA ports Ryzen provides. But who knows?
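Following up on the cube idea in the [edit] under [2]: with three links per CCX and no diagonal trick, the hop counts for the plain cube come out as below. This is purely a topology exercise, nothing Zen-specific:

[code]
from collections import deque
from itertools import product

# 8 CCXs at the corners of a cube, one edge per fabric link (3 links per CCX).
nodes = list(product((0, 1), repeat=3))
adj = {n: [m for m in nodes if sum(a != b for a, b in zip(n, m)) == 1] for n in nodes}

def distances(src):
    dist, q = {src: 0}, deque([src])
    while q:
        n = q.popleft()
        for m in adj[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                q.append(m)
    return dist

d0 = distances(nodes[0])
hops = [d0[n] for n in nodes[1:]]
print(f"to the other 7 CCXs: max {max(hops)} hops, average {sum(hops)/len(hops):.2f} hops")
[/code]

Rewiring one face's diagonals, or adding a fourth link per CCX, would be what pulls the 3-hop worst case down.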
 
[1] From what I gathered, AMD said they concurrently snoop the other CCX's L3 cache and start a memory request, the latter of which is cancelled if the other L3 holds the necessary data. Maybe those SRAM cells are indeed just buffers, so as not to completely trash the memory controllers with potentially useless lookups.

[2] Nice catch with the fourth IF "ring stop", if that's what it is.

[3] Could be that it is USB. But..

[5] Interestingly, Summit Ridge is said to sport only 24 PCIe Gen3 lanes, but if I count correctly, we have 34 of them, with a 2-lane block being very separate. From the layout, it looks like we have 6× 4-lane ports and 5× 2-lane ports (a quick tally follows at the end of this post). The extras might also be for use in data center environments where you might want to connect a lot of M.2 SSDs.

- Then we have 3 copies of two blocks between the CCXs (next to the left CCX)
- 2 copies of 1 block/structure right above the lowermost IF-stop.
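A quick sanity check on the counting in [5] above. The 6×4 + 5×2 split is read off the annotated die shot, so treat it as a guess; the 24 usable lanes are the x16 graphics + x4 NVMe + x4 chipset link of the AM4 platform:

[code]
# Lanes' worth of PHY visible in the die shot vs. lanes exposed on AM4.
phy_lanes  = 6 * 4 + 5 * 2        # guessed from the annotated die shot
am4_usable = 16 + 4 + 4           # x16 graphics + x4 NVMe + x4 chipset link
print(f"{phy_lanes} lanes of PHY on die, {am4_usable} exposed on AM4, "
      f"{phy_lanes - am4_usable} left over for SATA/test/Naples duty")
[/code]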
 
First leak of a Zen APU

http://www.eteknix.com/amd-zen-4-core-8-thread-raven-ridge-apu-benchmark-leaked/

Looks like the CPU is on par with the 6600, at least in this particular test.



Very interesting behavior of the CPU when using a lower-performance VGA (1070/1076). Also, this test shows how little performance is gained by running more cores. But with the 1500 giving you everything you need for less than a used and burned-out 4770K, the number of 4-core/8-thread chips will increase and developers will have a real reason to develop for them.
 
we have 34 of them, with a 2-lane block being very separate
With Naples having 128 lanes, why wouldn't Summit Ridge have 32?
Wasn't there a similar situation where a lone PCIe/HT block wound up being for factory test on an older AMD CPU?
 
[1] From what I gathered, AMD said they concurrently snoop the other CCX's L3 cache and start a memory request, the latter of which is cancelled if the other L3 holds the necessary data. Maybe those SRAM cells are indeed just buffers, so as not to completely trash the memory controllers with potentially useless lookups.
The apparent size of those arrays points to a capacity on the order of the probe filter in Hypertransport Assist, although the filter's role isn't to spare the memory controller. In the previous chips with MOESI, the memory controller is the home agent that does the snooping and arbitrates conflicts.
I think it shouldn't take that much SRAM to spare the memory controller itself, since the caches that can hit memory can only have a relatively small number of misses in flight.
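For a sense of scale (the 2 MB figure comes from the die-shot guess earlier in the thread, and the bytes-per-entry value is purely an assumption):

[code]
# Coverage check for a hypothetical probe filter held in those SRAM arrays.
sram_bytes      = 2 * 1024**2             # guessed array size from the die shot
bytes_per_entry = 4                       # assumed tag + state per tracked 64 B line
tracked_bytes   = (sram_bytes // bytes_per_entry) * 64

cached_bytes = 2 * 8 * 1024**2 + 8 * 512 * 1024   # 2x 8 MB L3 + 8x 512 KB L2 per die
print(f"could track ~{tracked_bytes >> 20} MB of lines vs ~{cached_bytes >> 20} MB of on-die cache")
[/code]

That much SRAM is in the right ballpark for tracking everything the die's own caches can hold, and well beyond what a handful of in-flight misses would need as mere buffering.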

However, if a filter is what those arrays are for, whether a memory access or a CCX forward is needed would be known when the controller checks the filter. That concurrent approach could be a single-chip latency optimization: avoid a serial lookup in the (memory-clocked?) table and just send a probe to the other CCX while adding a request to the memory channel's queue.
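The flow being described, a probe and a speculative memory read launched together with the read cancelled on a probe hit, would look roughly like this. The structure is invented for illustration, not AMD's implementation:

[code]
# Toy model of a concurrent cross-CCX probe plus speculative DRAM read.
class DramQueue:
    def __init__(self):
        self.pending = set()
    def enqueue(self, addr):
        self.pending.add(addr)            # speculative read enters the channel queue
        return addr
    def cancel(self, ticket):
        self.pending.discard(ticket)      # probe hit elsewhere: drop the read
    def wait(self, ticket):
        self.pending.discard(ticket)
        return f"line {ticket:#x} from DRAM"

def service_miss(addr, other_ccx_l3, dram):
    ticket = dram.enqueue(addr)           # memory request starts immediately
    if addr in other_ccx_l3:              # cross-CCX probe, logically in parallel
        dram.cancel(ticket)
        return f"line {addr:#x} forwarded from the other CCX"
    return dram.wait(ticket)

dram = DramQueue()
print(service_miss(0x1000, other_ccx_l3={0x1000}, dram=dram))
print(service_miss(0x2000, other_ccx_l3={0x1000}, dram=dram))
[/code]

A filter check folded into the same step is what would let the controller skip the speculative read when it already knows the line is dirty in the other CCX.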
 