AMD Ryzen CPU Architecture for 2017

Not only that, the dice likely consist of the top-binned parts AMD has produced. Unlike a large monolithic die, which is held back by its weakest link, they may actually exceed any part we've seen so far.

Unlike the single-die variants, Threadripper will be using at least some of the interfaces that sit unused on the die. Coherence optimizations would also be more pressing here, if any were left out for latency reasons in the single-die case.
Some of the TDP is going to be taken up by the uncores being more heavily utilized, and AMD's NUMA-like penalties look to be compounded with actual NUMA for the memory controllers and IO. The on-package GMI links would also add some amount of power and probably some latency versus a monolithic die.

In terms of getting good yields for the individual chips, AMD's method should help. Even if AMD has simply been stockpiling chips that get past the rather consistent ceiling overclocking tends to hit, that potentially mitigates the chance of a weakest link turning up on one chip. What it does not address is an architectural dependence Ryzen cannot remove: its sensitivity to memory latency, data-fabric tweaks, and CCX management.
Those sensitive points now get doubled and thrown over external links.

Coupled with Intel's more flexible turbo, which acknowledges core-to-core variability with its preferred cores, the extra binning flexibility may not provide much differentiation in terms of achievable performance.
There's a chance that an Intel monolithic chip might have all good cores, can mitigate some slower cores by preferring the better ones, or could deactivate some weak cores. There's no chance Ryzen dies can be manufactured without an uncore and off-die interconnect.
 
I'm pretty sure you have already seen it in other places, but well.

Asus Zenith Extreme..

If you're asking what the additional DIMM-style slot on the right is, it is for housing two M.2 SSD add-on cards.

http://www.guru3d.com/news-story/computex-2017-asus-shows-x399-amd-zenith-extreme-motherboard.html
 
The multi-threading advantage over Intel SKUs remains with Ryzen vs Skylake-X:

7800X (Skylake-X, 6 cores / 12 threads) at 5.7GHz
[CB15 MT screenshot]


Ryzen 1600X at 5.15GHz
[CB15 MT screenshot]


10% increase in clock speed yields "just" 5% increase in CB15 MT.
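
For what it's worth, a quick back-of-envelope check of that claim; a sketch using only the clocks and the ~5% delta quoted above, not the absolute scores from the screenshots:

```python
# Rough per-clock comparison using only the figures quoted above.
# The absolute CB15 MT scores are in the screenshots and not reproduced here;
# only the ~5% score delta quoted in the post is used.

clock_7800x = 5.70   # GHz, Skylake-X 6C/12T
clock_1600x = 5.15   # GHz, Ryzen 6C/12T

clock_ratio = clock_7800x / clock_1600x   # ~1.107, i.e. ~10.7% higher clock
score_ratio = 1.05                        # ~5% higher CB15 MT score, as quoted

# CB15 MT throughput per GHz, 7800X relative to 1600X
per_clock_ratio = score_ratio / clock_ratio   # ~0.95

print(f"clock ratio:     {clock_ratio:.3f}")
print(f"score ratio:     {score_ratio:.3f}")
print(f"per-clock ratio: {per_clock_ratio:.3f} "
      f"(~{(1 - per_clock_ratio) * 100:.0f}% less CB15 MT per GHz for the 7800X here)")
```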
 
We don't know for sure that the Ryzen mobile versions are identical to the desktop ones, even if the name is.

It's probably lower clocked with lower voltage but it says that it's an 8 core processor, so I assume it must be similar to the desktop part, other than clocks and voltage of course.
 
It's probably lower clocked with lower voltage but it says that it's an 8 core processor, so I assume it must be similar to the desktop part, other than clocks and voltage of course.
True enough, especially considering how well Ryzen scales down with voltage.
 
64 PCIe lanes? That's new. The phases on the right are hidden, though; the ones on the top are definitely 8 (the C6H has 6 on the top).
 
In terms of getting good yields for the individual chips, AMD's method should help. Even if AMD has simply been stockpiling chips that get past the rather consistent ceiling overclocking tends to hit, that potentially mitigates the chance of a weakest link turning up on one chip. What it does not address is an architectural dependence Ryzen cannot remove: its sensitivity to memory latency, data-fabric tweaks, and CCX management.
Those sensitive points now get doubled and thrown over external links.
Still too early to tell on some of those points. As I recall there were some initial issues with Ryzen that I believe AMD stated would be resolved with later steppings, tweaks that should apply to Threadripper or Naples as well. Otherwise, yeah, we're in agreement here. Depending on how they configure it, something like Intel's Cluster on Die should improve latency and bandwidth for some workloads. Ryzen, being a different topology, could benefit more from that configuration.

Coupled with Intel's more flexible turbo, which acknowledges core-to-core variability with its preferred cores, the extra binning flexibility may not provide much differentiation in terms of achievable performance.
There's a chance that an Intel monolithic chip might have all good cores, can mitigate some slower cores by preferring the better ones, or could deactivate some weak cores. There's no chance Ryzen dies can be manufactured without an uncore and off-die interconnect.
http://wccftech.com/asus-teases-5ghz-overclocks-amds-ryzen-threadripper-cpus/
Take that with a grain of salt as they may just be showing their display, but ASUS would seem to think a 5GHz Threadripper is possible. That's likely an overclock, but Threadripper is technically an enthusiast part so someone will overclock it.
 
A 130W TDP Threadripper is just two R7 1700s on a package. I expect a 3GHz base clock and an 1800X-esque 4GHz boost clock (the higher TDP allows it).
I would actually expect higher base clocks as well as a higher TDP. What is the problem with a 160W or even 180W TDP for the 12- and 16-core versions? Intel's Skylake-X variants with more than 12 cores likely feature a 160W TDP or something in that range as well (some Broadwell-EP versions did that already) to mitigate the power constraints on the base clocks. And one can fit a pretty big cooler on that huge LGA4094 socket, so cooling won't be a real issue.

AMD should have competitive or even higher base clocks than the larger Skylake-X models (12 cores and above). After all, the 7820X (8 cores) has a 3.6GHz base clock and the 7900X (10 cores) a 3.3GHz base clock in a 140W TDP. The larger models probably won't clock higher (data is still TBA; Intel is probably busy figuring out how far they can push it at what TDP), but AMD could actually achieve a 3.3GHz base even with the 16-core model (if they opt for the 180W TDP).
It will be interesting to see if AMD chooses to enable the on-die LDO regulators to increase efficiency (and therefore create some headroom for boosting) in medium-threaded load situations. In lightly threaded ones the TDP wouldn't be an issue. And XFR is probably good for up to 4 cores (1 per CCX, as in Ryzen).
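
To put the base-clock speculation above into rough numbers, here's a crude back-of-envelope sketch. It assumes per-core power scales roughly with f·V² and that voltage rises roughly linearly with frequency in this range (so power ~ f³), and it uses the 1800X (8 cores, 3.6GHz base, 95W) as the reference point while ignoring uncore and GMI power; none of that comes from AMD.

```python
# Crude back-of-envelope: what base clock could 12 or 16 Zen cores hold at a
# given package TDP?  Assumes per-core power ~ f * V^2 with V roughly
# proportional to f in this range (so per-core power ~ f^3), and treats the
# whole TDP as core power (ignoring uncore/GMI, which makes this optimistic).
# Reference point: 1800X, 8 cores at 3.6 GHz base in a 95 W TDP.
# Speculation for discussion, not AMD data.

REF_CORES = 8
REF_CLOCK_GHZ = 3.6
REF_TDP_W = 95.0

def base_clock_estimate(cores: int, tdp_w: float) -> float:
    """Estimate a sustainable base clock (GHz) for `cores` cores at `tdp_w` watts."""
    per_core_budget = tdp_w / cores
    ref_per_core = REF_TDP_W / REF_CORES
    # power ~ f^3  =>  f ~ power^(1/3)
    return REF_CLOCK_GHZ * (per_core_budget / ref_per_core) ** (1.0 / 3.0)

for cores, tdp in [(12, 160.0), (16, 160.0), (16, 180.0)]:
    print(f"{cores} cores @ {tdp:.0f} W -> ~{base_clock_estimate(cores, tdp):.2f} GHz base")
```

Even with those optimistic assumptions it lands in the ~3.4-3.5GHz range for 16 cores, so a 3.3GHz base at 160-180W with uncore overhead subtracted doesn't look unreasonable.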
 
http://wccftech.com/asus-teases-5ghz-overclocks-amds-ryzen-threadripper-cpus/
Take that with a grain of salt as they may just be showing their display, but ASUS would seem to think a 5GHz Threadripper is possible. That's likely an overclock, but Threadripper is technically an enthusiast part so someone will overclock it.

It's a PR tease, which is probably meant more to generate buzz than to promise anything substantive.
Clocks in that range for Intel and for the 1800X have involved LN2.

Setting aside questions as to whether something else would bottleneck the cores at that speed, LN2 seems to get Intel's monolithic chips to higher clocks. The DRAM timing and speed tweaks for existing Ryzen chips show a measurable benefit from the lower latency at sub-4GHz CPU clocks. The memory bottleneck would be more noticeable at higher clocks, and I suspect it's less likely that the fabric and memory latencies can be pushed as low with the MCM.
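
As a toy illustration of that memory-bottleneck point (all the constants below, the base CPI, miss rate, and DRAM latency, are invented for illustration and are not measured Ryzen figures): a fixed latency in nanoseconds costs more core cycles per miss as the clock rises, so the payoff from extra GHz shrinks.

```python
# Toy model of why a fixed DRAM latency bites harder at higher core clocks:
# the same round-trip in ns costs more core cycles as the clock rises.
# All constants are hypothetical, chosen only to illustrate the shape.

BASE_CPI = 1.0            # cycles per instruction if memory were free (hypothetical)
MISSES_PER_INSTR = 0.005  # last-level misses per instruction (hypothetical)
MEM_LATENCY_NS = 90.0     # effective DRAM latency in ns (hypothetical)

def effective_ipc(clock_ghz: float) -> float:
    miss_penalty_cycles = MEM_LATENCY_NS * clock_ghz  # same ns = more cycles at higher clock
    return 1.0 / (BASE_CPI + MISSES_PER_INSTR * miss_penalty_cycles)

def throughput(clock_ghz: float) -> float:
    """Instructions per nanosecond = IPC * clock."""
    return effective_ipc(clock_ghz) * clock_ghz

for f in (3.0, 4.0, 5.0):
    print(f"{f:.1f} GHz: IPC {effective_ipc(f):.2f}, "
          f"{throughput(f) / throughput(3.0):.2f}x throughput vs 3.0 GHz")
```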
 
Also on Ryzen:
https://forums.anandtech.com/thread...and-discussion.2499879/page-252#post-38915785
GMI and xGMI are basically the same link, but with different type of configuration (width), speed and error tolerance.

GMI is the internal form (inter-die) and has bandwidth of 42.6GB/s per direction at 2666MHz MEMCLK.
xGMI is the external form (inter-node, i.e. EPYC only), has the same bandwidth at the same MEMCLK but different speed, width and error tolerance.
SP3r2, which is used by TR, doesn't support 2P and therefore has no xGMI. Meanwhile SP3 Epyc parts do support 2P, which means xGMI is present as well.
It appears that both SP3 and SP3r2 use the same physical socket, which means that a huge number of pins will be unused on SP3r2 (four memory channels, 64x PCIe links, xGMI, etc).

xGMI is half the width of GMI, but operates at twice the speed.
That sounds a bit dubious. The inter-die (intra-package) GMI links are allegedly synced to the memory frequency (like the on-die Infinity Fabric links) and provide 256 bits per clock (1.33GHz at 2.66GT/s memory) per direction. That is totally expected. Having exactly the same bandwidth as the ports to the intra-die Infinity Fabric makes a lot of sense.
The external (between sockets) xGMI links are supposed to be higher in frequency and lower in width to also provide the same bandwidth. That is something I would shoot for too. But we know Epyc uses the PCIe lanes for xGMI (64 in total, probably 4x16, i.e., each die in one package uses 16 lanes to connect to exactly one die in the other package [basically setting up a cube as the topology, where the "faces" within each package may be fully connected], with a lesser probability of 8x8, meaning each die in one package connects to two dies in the other package). Those lanes are basically capable of pushing 8 GT/s over PCIe slot connectors, equalling 16 GB/s for an x16 link.

One may expect they can hit higher speeds when just connecting the two sockets. But each xGMI link (assuming there are 4 in total between the dies) would not be able to provide anywhere close to 42.6GB/s (there was recently a demo of a combined PCIe/CCIX PHY pushing 25GT/s over a normal PCIe slot, but that was under lab conditions; I wouldn't expect Epyc to run its PCIe PHYs at up to 21.3 GT/s). The maximum one may allow for is a doubling of the signaling rate between the sockets (I would deem that pretty aggressive already), giving 32GB/s per x16 link. If AMD could do north of 20GT/s, they would probably have said so, or at least I would expect them to tell us.

Or the topology between the dies in a 2S system looks a bit different. But bundling all 32 lanes from each of two dies to get two x32 links between the sockets appears weird (especially as the two x16 blocks per die are physically separated), and I don't see a clear advantage, albeit it's not impossible (and the required 10.6 GT/s to get 42.6GB/s on an x32 link appears totally doable).

And having only 42 GB/s per direction over all 64 PCIe lanes together appears to be too low to me and would probably limit scaling.
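
The arithmetic behind the figures above, as a quick sketch (the 256-bit-per-MEMCLK GMI width and the per-lane rates are the assumptions discussed in this thread, not confirmed specs, and 128b/130b framing overhead is ignored for round numbers):

```python
# Sanity check of the link-bandwidth figures discussed above.
# Assumptions (from this discussion, not confirmed AMD specs):
#   - GMI moves 256 bits per MEMCLK per direction; MEMCLK is half the DDR4 rate.
#   - xGMI reuses PCIe-style SerDes lanes; GB/s ~= GT/s * lanes / 8
#     (128b/130b framing overhead ignored for round numbers).

def gmi_bandwidth_gbs(ddr_mt_s: float, bits_per_clock: int = 256) -> float:
    memclk_ghz = ddr_mt_s / 2 / 1000.0       # e.g. 2666 MT/s -> 1.333 GHz
    return bits_per_clock / 8 * memclk_ghz   # bytes per clock * GHz = GB/s

def serdes_bandwidth_gbs(gt_s: float, lanes: int) -> float:
    return gt_s * lanes / 8.0                # GB/s per direction

print(f"GMI @ DDR4-2666:       {gmi_bandwidth_gbs(2666):.1f} GB/s per direction")  # ~42.7, the ~42.6 figure above
print(f"x16 @ 8 GT/s (PCIe 3): {serdes_bandwidth_gbs(8.0, 16):.1f} GB/s")          # 16.0
print(f"x16 @ 16 GT/s:         {serdes_bandwidth_gbs(16.0, 16):.1f} GB/s")         # 32.0
print(f"x32 @ 10.6 GT/s:       {serdes_bandwidth_gbs(10.6, 32):.1f} GB/s")         # ~42.4

# Per-lane rate an x16 xGMI link would need to match GMI at DDR4-2666:
target = gmi_bandwidth_gbs(2666)
print(f"x16 would need ~{target * 8 / 16:.1f} GT/s per lane to match GMI")         # ~21.3
```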
 
The external (between sockets) xGMI links are supposed to be higher in frequency and lower in width to also provide the same bandwidth.
Are we 100% sure xGMI is only between sockets and not clusters? That might make more sense with the mesh topology.

My thinking was a 5-node (4 cores plus link) cluster with the link bonding xGMI channels. Only GMI would have dedicated links, with xGMI being a bit more flexible. PCIe being the obvious bottleneck, sharing lanes should sustain higher bandwidth. It stands to reason that not all links will be active simultaneously. That also checks with the routing around bottlenecks mentioned in an EETimes article.

As for the speed, what about a multi-level modulated signal over the PCIe lanes?
https://www.google.com/patents/US9338040
Broadcom patent, but AMD could have done something similar. The idea originated around 2012. FinFETs, with their higher beta, could start making use of that type of signaling.
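
To tie that back to the bandwidth numbers earlier in the thread: a multi-level scheme like PAM4 carries 2 bits per symbol instead of 1, so the same baud rate moves twice the data. The sketch below is purely illustrative; nothing confirms AMD uses anything like this on the xGMI lanes.

```python
# What multi-level (PAM4-style) signaling could buy on the same lanes:
# 2 bits per symbol instead of 1 doubles bandwidth at a given baud rate.
# Purely hypothetical -- nothing confirms AMD does this for xGMI.

def link_gbs(baud_gbd: float, lanes: int, bits_per_symbol: int) -> float:
    return baud_gbd * bits_per_symbol * lanes / 8.0   # GB/s per direction

print(link_gbs(8.0, 16, 1))   # PCIe 3.0-style NRZ, x16: 16.0 GB/s
print(link_gbs(8.0, 16, 2))   # same 8 GBd with PAM4, x16: 32.0 GB/s

# Baud rate an x16 PAM4 link would need to match the ~42.6 GB/s GMI figure
target_gbs = 42.6
print(target_gbs * 8 / (2 * 16))   # ~10.7 GBd per lane
```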
 