AMD Ryzen CPU Architecture for 2017

That sounds a bit dubious. The inter-die (intra-package) GMI links are allegedly synced to the memory frequency (like the on-die Infinity Fabric links) and provide 256 bits per clock (1.33 GHz at 2.66 GT/s memory) per direction. That is totally expected. Having the exact same bandwidth as provided by the ports to the intra-die Infinity Fabric makes a lot of sense.
The external xGMI links (between sockets) are supposed to be narrower but higher clocked, so they provide the same bandwidth. That is something I would shoot for, too. But we know Epyc uses the PCIe lanes for xGMI: 64 in total, probably as 4x16, i.e., each die in one package uses 16 lanes to connect to exactly one die in the other package (basically setting up a cube as topology, where the "faces" in each package may be fully connected), with a lesser probability of 8x8, meaning each die in one package connects to two dies in the other package. Those lanes are basically capable of pushing 8 GT/s over PCIe slot connectors, equalling 16 GB/s for an x16 link. One may expect they can hit higher speeds when just connecting the two sockets, but each xGMI link (assuming there are 4 in total between the dies) would not get anywhere close to 42.6 GB/s. (There was recently a demo of a combined PCIe/CCIX PHY pushing 25 GT/s over a normal PCIe slot, but that was under lab conditions; I wouldn't expect Epyc to run its PCIe PHYs at up to 21.3 GT/s.) The most one may allow for is a doubling of the signaling rate between the sockets (which I would deem pretty aggressive already), giving 32 GB/s per x16 link. If AMD could do north of 20 GT/s, they would probably have said so, or at least I would expect them to tell us.

Or the topology between the dies in a 2S system looks a bit different. But bundling all 32 lanes from each of two dies to get two x32 links between the sockets appears weird (especially as the two x16 blocks per die are physically separated), and I don't see a clear advantage, albeit it's not impossible (and the 10.6 GT/s required to get 42.6 GB/s over an x32 link appears totally doable).
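To make those numbers easy to check, here is a quick back-of-envelope sketch (my arithmetic only; the lane counts and transfer rates are the assumptions from the paragraph above, not confirmed figures):

Code:
/* Rough per-direction link bandwidth: rate (GT/s) x lanes / 8 bits per byte.
 * Encoding overhead is ignored, matching the raw figures quoted above. */
#include <stdio.h>

static double gb_per_s(double gt_per_s, int lanes) {
    return gt_per_s * lanes / 8.0; /* GB/s per direction */
}

int main(void) {
    printf("x16 @  8.00 GT/s (PCIe 3.0 rate)  : %5.1f GB/s\n", gb_per_s(8.0, 16));
    printf("x16 @ 16.00 GT/s (doubled rate)   : %5.1f GB/s\n", gb_per_s(16.0, 16));
    printf("x32 @ 10.66 GT/s (bundled option) : %5.1f GB/s\n", gb_per_s(10.66, 32));
    return 0; /* prints 16.0, 32.0 and 42.6 GB/s */
}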

And only 42.6 GB/s per direction over all 64 PCIe lanes together appears too low to me; it would probably limit scaling.

So I just want to point out that the SerDes interface AMD is using is rated for 12.5 Gb/s per lane, and they don't have to run just PCIe over it; who knows how high it can be "pushed" in a controlled environment. The other thing to consider is that an x16 PCIe block on Zeppelin is 3x4+2x2 interfaces, so there would be quite a few extra pairs that might potentially be usable, but I think this is unlikely. I also took The Stilt as meaning per Zeppelin, not for the entire socket, in terms of xGMI, but I have a hard time reconciling how it gets to 42.6 GB/s for xGMI.

But The Stilt knows his stuff and has ties to AMD, so until we know otherwise I'll take him at his word.

Just thinking about it, the biggest block the 12G PHY supports is quad channel, and that quad channel can run 100GBASE-CR10, which is 10 lanes @ 10.3125 Gb/s (I'm assuming this can be done because it doesn't use differential pairs). So if they can vary the clock rate to a multiple of the memory clock, 42.6 GB/s would be possible, with a max of around 50 GB/s if pushing all the way to 12.5 Gb/s.

*Supports PCI Express 3.1, SATA 6G, Ethernet 40GBASE-KR4, 10GBASE-KR, 10GBASE-KX4, 1000BASE-KX, 40GBASE-CR4, 100GBASE-CR10, XFI, SFI (SFF-8431), QSGMII, and SGMII
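As a sanity check of the arithmetic above (a sketch under my assumption that the 42.6 GB/s figure corresponds to 32 lanes at 10.66 Gb/s each):

Code:
#include <stdio.h>

int main(void) {
    /* 100GBASE-CR10 line rate: 10 lanes at 10.3125 Gb/s each */
    printf("CR10 raw    : %6.2f Gb/s\n", 10 * 10.3125);         /* 103.13 */

    /* Scaling the per-lane rate from 10.66 Gb/s up to the PHY's
     * 12.5 Gb/s ceiling gives the maximum aggregate: */
    printf("max at 12.5 : %5.1f GB/s\n", 42.6 * 12.5 / 10.66);  /* ~50 */
    return 0;
}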
 
So I just want to point out that the SerDes interface AMD is using is rated for 12.5 Gb/s per lane, and they don't have to run just PCIe over it; who knows how high it can be "pushed" in a controlled environment.
For quick reference, I will link what I think is the source for the PHY description: https://support.amd.com/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf
This sounds like a subset of the Synopsys DesignWare Enterprise 12G product: https://www.synopsys.com/dw/ipdir.php?ds=dwc_ether_enterprise12g

That might not be receptive to being pushed beyond 12.5 Gb/s, although if the bandwidth figure given by The Stilt is the aggregate unidirectional xGMI bandwidth for a Zeppelin die, it wouldn't need to be pushed out of spec.
Math-wise, I think the description of the Ryzen data fabric as a 256-bit bidirectional crossbar running at 1/2 the DRAM rating can fit with the 42.6 GB/s bandwidth number.
2666 MT/s DDR -> 1333 MHz DF clock x 256 bits = 42.66 GB/s
32 lanes of PHY (1/8 the DF width) running at 8x the DF clock gives 10.66 Gbps per lane, and that is 42.6 GB/s out of one die.
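The same arithmetic spelled out as a small sketch (the 256-bit width and the DF clock at half the DRAM transfer rate are taken from the post above):

Code:
#include <stdio.h>

int main(void) {
    double df_mhz = 2666.0 / 2.0;                 /* data fabric clock, MHz */
    double xbar   = df_mhz * 1e6 * 256 / 8 / 1e9; /* 256-bit crossbar, GB/s */
    double lane   = df_mhz * 8 / 1000;            /* 8x DF clock, Gb/s/lane */
    double phy32  = lane * 32 / 8;                /* 32 lanes, GB/s         */

    printf("crossbar : %.2f GB/s\n", xbar);  /* 42.66 */
    printf("per lane : %.2f Gb/s\n", lane);  /* 10.66 */
    printf("32 lanes : %.2f GB/s\n", phy32); /* 42.66 */
    return 0;
}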

It's quite possible that with Naples no single chip can dedicate all of its PHYs to xGMI, so half the interface from each of the 4 chips would give 85.3 GB/s.
One additional possibility is that this split is built into how Zeppelin's data fabric works, which would explain why an inter-CCX cache transfer benchmark with a stream of transactions sharing the same start and end points doesn't yield the link bandwidth that a 256-bit crossbar would suggest.

Possible reasons why the aggregate unidirectional bandwidth for the MCM might be twice the 42.6 number:
If the server diagram is accurate, there are 4 xGMI links between the sockets: https://www.servethehome.com/second...ctures-platform-block-diagram-112-pcie-lanes/
If the leaked HPC APU slide is accurate, two Zeppelin chips connect to a Greenland GPU with 4 GMI links (DDR4-3200 yields 25.6 GB/s per link, 100 GB/s aggregate; see the sketch after this list).
The Stilt indicates that GMI and xGMI are iso-bandwidth, with differences in width*clock and link integrity measures.
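The APU per-link number falls out if a GMI link is assumed to move 128 bits per DF clock; that width is my assumption, though it matches The Stilt's "twice as wide, half as fast" description:

Code:
#include <stdio.h>

int main(void) {
    double df_mhz   = 3200.0 / 2.0;                 /* DDR4-3200 -> 1600 MHz DF */
    double per_link = df_mhz * 1e6 * 128 / 8 / 1e9; /* assumed 128-bit GMI link */
    printf("per link : %5.1f GB/s\n", per_link);     /* 25.6                     */
    printf("4 links  : %5.1f GB/s\n", 4 * per_link); /* 102.4, "100" on the slide */
    return 0;
}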

One issue in this overall scenario is that if the PHY for the xGMI links cannot be pushed, DDR4-3200 doesn't quite fit under the 12.5 Gb/s ceiling without running something out of spec or at a different clock.
The APU's math still barely works out with the 12.5 Gb/s max, but it is also stated that the links are GMI rather than xGMI and that the APU has a full 64 lanes of external PCIe. There's no external product requirement that GMI match the 12G PHY's specs, and it can get under the ceiling again, since The Stilt said GMI is twice as wide but clocked half as fast as xGMI.
The Naples server MCM may not officially be able to scale to DDR4-3200 without some adjustment, however.
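The ceiling problem in one line (assuming the xGMI lane rate stays at 8x the DF clock, as in the 2666 math above):

Code:
#include <stdio.h>

int main(void) {
    double lane = (3200.0 / 2.0) * 8 / 1000; /* DDR4-3200 -> 12.8 Gb/s per lane */
    printf("needs %.1f Gb/s vs. the 12.5 Gb/s the PHY is rated for\n", lane);
    return 0;
}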

Perhaps Vega's unused perimeter has some extra IO?
 
Asus released an image of their X399 board



https://rog.asus.com/articles/maxim...ming-for-amds-monster-ryzen-threadripper-cpu/

I'm guessing the release of these CPUs in late summer will probably mean more mature BIOSes from the board partners. Hopefully there won't be many problems running 8 sticks at launch :p
 
This is fine for most enterprise applications, but Ryzen's performance clearly shows that consumer applications (including games) benefit from a big shared LLC much more than from slightly lower LLC latency. Future programmers need to think more about memory locality. A big shared LLC is starting to become too expensive (and it gets slower as you add more cores and more capacity).
http://man7.org/linux/man-pages/man3/numa.3.html

https://msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
Most of those issues should be resolved by a NUMA-aware OS. Allocating the stack per node for lowest latency and the heap interleaved (SUMA for higher node counts) for maximum bandwidth would be a simple strategy. It might be worth investigating the bandwidth and latency of each pool; that may be OS dependent, and a programmer may need to optimize some tasks for it. Given the libraries I linked above, the programmer would have control.
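A minimal sketch of that strategy on Linux, using the libnuma API from the first link (sizes are hypothetical, error handling is trimmed; build with -lnuma). The second link covers the Windows side of the same controls.

Code:
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t sz = 64UL << 20; /* 64 MiB, arbitrary example size */

    /* Latency-sensitive data (e.g. per-thread stacks): keep it on the
     * node the calling thread runs on. */
    void *local = numa_alloc_local(sz);

    /* Bandwidth-hungry shared heap: interleave pages across all nodes,
     * the SUMA-style layout mentioned above. */
    void *shared = numa_alloc_interleaved(sz);

    /* ... touch the memory from worker threads pinned per node ... */

    numa_free(shared, sz);
    numa_free(local, sz);
    return 0;
}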

With a mesh like Infinity Fabric, a large shared LLC should be doable with a sufficiently large network, where the aggregate link bandwidth to each node overtakes that of the LLC. Possibly on Threadripper or Naples; maybe on Ryzen with high memory and low core clocks.

The mesh vs. ring difference will change some of the observations in the blog you linked, specifically the bandwidth figures, as the link bandwidth scales with the mesh.
 
So I just want to point out that the SerDes interface AMD is using is rated for 12.5 Gb/s per lane, and they don't have to run just PCIe over it; who knows how high it can be "pushed" in a controlled environment. The other thing to consider is that an x16 PCIe block on Zeppelin is 3x4+2x2 interfaces, so there would be quite a few extra pairs that might potentially be usable, but I think this is unlikely. I also took The Stilt as meaning per Zeppelin, not for the entire socket, in terms of xGMI, but I have a hard time reconciling how it gets to 42.6 GB/s for xGMI.
*Supports PCI Express 3.1, SATA 6G, Ethernet 40GBASE-KR4, 10GBASE-KR, 10GBASE-KX4, 1000BASE-KX, 40GBASE-CR4, 100GBASE-CR10, XFI, SFI (SFF-8431), QSGMII, and SGMII
For quick reference, I will link what I think is the source for the PHY description: https://support.amd.com/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf
This sounds like a subset of the Synopsys DesignWare Enterprise 12G product: https://www.synopsys.com/dw/ipdir.php?ds=dwc_ether_enterprise12g
Nice find!
Just thinking about it, the biggest block the 12G PHY supports is quad channel, and that quad channel can run 100GBASE-CR10, which is 10 lanes @ 10.3125 Gb/s (I'm assuming this can be done because it doesn't use differential pairs)
Actually, 100GBASE-CR10 uses differential signaling on 10 pairs (per direction) at slightly above 10 GT/s. I doubt this can be done with just 4 lanes; probably one has to bundle more (exactly ten, I suppose) for that.
One issue in this overall scenario is that if the PHY for the xGMI links cannot be pushed, DDR4-3200 doesn't quite fit under the 12.5 Gb/s ceiling without running something out of spec or at a different clock.
The APU's math still barely works out with the 12.5 Gb/s max, but it is also stated that the links are GMI rather than xGMI and that the APU has a full 64 lanes of external PCIe. There's no external product requirement that GMI match the 12G PHY's specs, and it can get under the ceiling again, since The Stilt said GMI is twice as wide but clocked half as fast as xGMI.
The Naples server MCM may not officially be able to scale to DDR4-3200 without some adjustment, however.
As Naples is specced for a maximum of DDR4-2667, and server systems use overclocked RAM very rarely, I guess they are okay.
And the GMI PHYs look vastly different from the 34 lanes of the 12G PHYs. For sure they have different limits.
 
That'd mean the higher-end 16-core would cost around $999. I'm thinking AMD cut the prices on the 1800X/1700X not only because they can (good yields?) and to be competitive against Intel, but also to make room for Threadripper.
 
There is plenty of room in the $500+ segment. They just need to reposition their chips to be more favorable against Intel's stack at the "lower" end.
 
There is plenty of room in the $500+ segment. They just need to reposition their chips to be more favorable against Intel's stack at the "lower" end.

But obviously you can't be selling the 1800X and the lower-end Threadripper at the same price. I'm thinking $500 will be the starting point for TR, going all the way up to $999.
 
It's a pretty good processor :yep2:

I've been running all cores at 3.9 GHz with 1.34 V load voltage, and temps don't go over 72°C at full load (stress testing with Prime95, which is an unrealistic workload scenario to begin with) with the stock cooler (although I did re-paste using Arctic MX-4). Try to find a BIOS with AGESA 1.0.0.6 for your motherboard; it's very useful when trying to run memory at rated speeds.
 
It's a pretty good processor :yep2:

I've been running all cores at 3.9 GHz with 1.34 V load voltage, and temps don't go over 72°C at full load with the stock cooler (although I did re-paste using Arctic MX-4). Try to find a BIOS with AGESA 1.0.0.6 for your motherboard; it's very useful when trying to run memory at rated speeds.
Checking daily here, 'cos I still have a 3200 MHz memory module running at 2666 MHz, but I guess the next BIOS update for the MSI B350M Gaming Pro should be out next week or so.
 
But obviously you can't be selling the 1800X and the lower-end Threadripper at the same price. I'm thinking $500 will be the starting point for TR, going all the way up to $999.

Maybe they could? The X399 motherboards (and thus the total package) will be a lot more expensive.
 
But obviously you can't be selling the 1800X and the lower-end Threadripper at the same price. I'm thinking $500 will be the starting point for TR, going all the way up to $999.
Due to the sockets they're locked to different platforms, so they could overlap prices if they really wanted to. Motherboard, RAM, cooling, etc. will make the Threadripper build more expensive.

Makes sense if they are dumping dies to harvest more good ones for Naples, assuming Naples is the same chip. Not to mention the sales volume at those prices.
 
I went with an Asus Prime X370-Pro and G.Skill RipJaws V F4-3200C14D 2x8GB (Samsung B-die) memory to make sure I can hit 3200 MHz. Cooling will be handled by a be quiet! Silent Loop 240mm AIO, and the rest of the components will come from my current i7-4790K rig.
BTW, I already found a beta 0801 BIOS with AGESA 1.0.0.6 for my board. I received my board today but have no time to build it yet, so the fun starts tomorrow! :D:runaway::yes:
 