AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

The GPU xGMI links are apparently running at 25 Gbps, going by the following video (after 21:00).
Found via reddit:

Possibly, given the introduction of PCIe 4.0, there is something like a Synopsys multi-mode PHY used by Zen for its PCIe/xGMI purposes.
The following, for example, might be able to support PCIe 4.0 as well as a 25 Gbps connection.
https://www.synopsys.com/dw/ipdir.php?ds=dwc_multi_protocol_25g_phy
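
A quick back-of-envelope on what that 25 Gbps figure could mean (the 16-lane link width and zero encoding overhead are my assumptions, not from the video):

```python
# Rough xGMI link bandwidth check (assumptions: 16 lanes per link,
# raw signalling rate with no encoding overhead subtracted).
lanes = 16                  # assumed lane count per xGMI link
rate_gbps = 25.0            # per-lane rate quoted in the video
per_direction_gbytes = lanes * rate_gbps / 8
print(f"~{per_direction_gbytes:.0f} GB/s per direction per link")  # ~50 GB/s
```

That would land in the neighborhood of the ~100 GB/s bidirectional per link AMD quotes for the MI60, so the assumption seems plausible.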

Another point further in the video is that the GPUs can be connected in a ring with comparatively low latency for a GPU, and that it's all encrypted. I'm not sure if that carries over from the MI60 implementing memory encryption, or if the links themselves can be encrypted.

The latter point is something I'm curious about relative to EPYC and its encryption. The memory controllers decrypt memory as it is read in, and the isolation of unencrypted data in the caches is enforced by ASID--a requester not from the same space misses in the cache and goes to memory for decryption (or loads bad data?).
However, that understandable way of handling things makes me curious about when the data potentially moves off-chip, like in an MCM or over the xGMI links in a 2-socket system. Is the presumption that it's too difficult to intercept? Is a workaround possible, like allocating a VM to only one chip so there's no transfer out? Does it become more of a question in Rome, where if data decrypts at the controller there is no avoiding an off-chip transfer?
Could this mean there's been a change concerning what paths are encrypted/decrypted?
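
Out of curiosity, a toy model of the controller-side scheme as I understand it (the per-ASID dictionary keys and the hash-based XOR keystream are illustrative stand-ins; real SEV uses AES with a physical-address-based tweak in the memory controller):

```python
# Toy model of SEV-style controller-side memory encryption.
import hashlib

KEYS = {1: b"vm1-key", 2: b"vm2-key"}   # hypothetical per-ASID keys

def pad(asid: int, addr: int) -> bytes:
    # Keystream depends on the VM's key and the physical address.
    return hashlib.sha256(KEYS[asid] + addr.to_bytes(8, "little")).digest()

def write_line(dram: dict, asid: int, addr: int, line: bytes) -> None:
    # Ciphertext is what lives in DRAM (and on anything probing the bus).
    dram[addr] = bytes(a ^ b for a, b in zip(line, pad(asid, addr)))

def read_line(dram: dict, asid: int, addr: int) -> bytes:
    # Decryption happens at the controller; caches see plaintext tagged
    # by ASID. A reader with the wrong ASID gets garbage, not a fault.
    return bytes(a ^ b for a, b in zip(dram[addr], pad(asid, addr)))

dram = {}
write_line(dram, 1, 0x1000, b"secret data".ljust(32, b"\0"))
print(read_line(dram, 1, 0x1000)[:11])   # b'secret data'
print(read_line(dram, 2, 0x1000)[:11])   # garbage: wrong key
```

The off-chip question above is then whether the "DRAM side" of this picture also covers MCM and socket-to-socket links, or whether plaintext crosses them.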
 
Hypothetically, what if more L3 were in the chiplets (dense, less power, less latency, less IF bandwidth hit) and the supposed L4 were tags and memory management a la HBCC? Still seems a bit much, but we don't know what all is included there. Security processor, memory management, and prediction seem likely. Along with IO controllers for USB, network, and probably storage. Possibility of large memory capacity with NVRAM to consider as well.
 
My impression from sources like https://fuse.wikichip.org/news/1064/isscc-2018-amds-zeppelin-multi-chip-routing-and-packaging/4/ is that the on-package links for Zeppelin are 32-wide and running at a lower speed than PCIe.

In Zeppelin, the IF links run at 4x the DRAM command rate, 5.3Gbit/s for 2.66GHz DDR4. At that speed, power consumption is just 2 picojoules/bit, which is 10-20% of what PCIe3 uses.

I expect IF links to be wider and decoupled from DRAM speed in EPYC, something like 64 lanes in each direction running at 2GHz (with 4x data rate); that would yield 64GB/s in each direction.
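
A quick sanity check of those numbers (just arithmetic on the figures above, nothing AMD has confirmed):

```python
# Check the Zeppelin lane rate, the speculated link, and the power.
memclk_ghz = 2.666 / 2            # DDR4-2666 command clock ~= 1.333 GHz
print(f"IFOP lane rate: {4 * memclk_ghz:.2f} Gbps")            # ~5.33 Gbps

lanes, clk_ghz, datarate = 64, 2.0, 4   # the speculated EPYC link above
bw_gbytes = lanes * clk_ghz * datarate / 8
print(f"Speculated link: {bw_gbytes:.0f} GB/s per direction")  # 64 GB/s

pj_per_bit = 2.0                  # the quoted ~2 pJ/bit
watts = pj_per_bit * 1e-12 * bw_gbytes * 8e9
print(f"Power at full tilt: ~{watts:.1f} W per direction")     # ~1 W
```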

Cheers
 
8 Zen1 cores already have a combined 16MB of L3. The shrink from GF 14nm -> TSMC 7nm shrinks SRAM arrays much more than it shrinks logic. I would be extremely surprised if the chiplet dies do not have 32MB of L3 per chiplet. Eight of those put together are 256MB. Of course, a single core can only make use of 32MB max.
The problem with stuffing a large victim cache on a CCX is that the rest of the system doesn't benefit from the extra L3. While your single core performance might suffer from moving cache from the CCX to the IO chip, overall system performance should improve because your bigger, shared cache reduces memory traffic.

On top of that, I expect the IO die to have another cache level, an L4 with 512MB. (Based on the earliest 8+1 leaks mentioning that.)
In GF 14nm, you use around 2mm² per MB of SRAM, so unless we're talking eDRAM (which needs a lot less area), that's a huge amount of silicon.
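
Putting numbers on that (the ~2mm²/MB SRAM figure is from above; the ~3x eDRAM density advantage is my own rough assumption):

```python
# Rough area check for a 512MB L4 on the IO die.
l4_mb, mm2_per_mb = 512, 2.0
print(f"As 14nm SRAM: {l4_mb * mm2_per_mb:.0f} mm^2")               # 1024 mm^2
print(f"As eDRAM (~3x denser, assumed): {l4_mb * mm2_per_mb / 3:.0f} mm^2")  # ~341 mm^2
```

Over 1000mm² of SRAM is clearly off the table, which is why the leak only makes sense with eDRAM or some other dense memory.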

Cheers
 
The problem with stuffing a large victim cache on a CCX is that the rest of the system doesn't benefit from the extra L3. While your single core performance might suffer from moving cache from the CCX to the IO chip, overall system performance should improve because your bigger, shared cache reduces memory traffic.
That may depend on IF link bandwidth and power efficiency. A 7nm L3 adding capacity at lower power than 14nm might overcome any shortfalls from hammering a faster/wider IF link. Assuming IF is a mesh, that may be a practical option with a cluster-on-die approach. It would need a wide CCX-to-IO connection with quarter(?) width IF links within the IO die. Not sure I like the idea, but it could be a possibility.
 
In Zeppelin, the IF links run at 4x the DRAM command rate, 5.3Gbit/s for 2.66GHz DDR4. At that speed, power consumption is just 2 picojoules/bit, which is 10-20% of what PCIe3 uses.

I expect IF links to be wider and decoupled from DRAM speed in EPYC, something like 64 lanes in each direction running at 2GHz (with 4x data rate); that would yield 64GB/s in each direction.

Cheers
From https://semiaccurate.com/2018/11/09/amds-rome-is-indeed-a-monster/, Rome is described as having one IF link per chiplet to the IO chip, with all memory, PCIe, and xGMI connectivity hosted on the central chip.

64GB/s in each direction could generally align with the bandwidth demands of a single Zen CCX, given that the L3's data fabric port is 32B in each direction and capped by the memory-determined fabric speed. The lower expectations for real-life usage likely factored into why there were two CCXs per die, although in Naples' case each of the three IFOP links could bring in as much bandwidth as the local DDR4 controllers, and to a somewhat lesser extent the xGMI link. That's probably not something often done on a sustained basis, though generous provisioning for bandwidth was a marketing point for Naples.
What that would mean for a chiplet with 8 cores, 256-bit AVX and 256-bit data paths may need more disclosures. If AMD's data paths widened from the L1 through to the CCX data port, the demand could increase, even though 64GB/s in each direction falls short of what Naples could provide.
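
For reference, the CCX port figure works out as follows (assuming a 32B/cycle port with the fabric clock tied to DDR4-2666's 1333 MHz MEMCLK, as in current Zen):

```python
# Per-CCX fabric port bandwidth under current Zen assumptions.
port_bytes, fclk_ghz = 32, 1.333
ccx_bw = port_bytes * fclk_ghz
print(f"CCX port: ~{ccx_bw:.1f} GB/s per direction")   # ~42.7 GB/s
# Roughly a match for two channels of DDR4-2666 (~42.7 GB/s combined),
# and comfortably inside the speculated 64 GB/s chiplet link.
```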

This change in design could explain a shift to one 8-core CCX per chiplet, to consolidate memory clients and reduce misses travelling over the link. AMD's protocols have historically not done much sharing of clean lines, which creates additional memory traffic unless there is a shared cache. A caveat is that AMD hasn't discussed some of the wrinkles to its MOESI protocols in later designs, or Zen's MDOEFSI protocol--although changes like the D state appear to be more concerned with repeated sharing of dirty lines.
Sharing one L3 could also mean more cache slices, to provide internal bandwidth and reduce the bank conflicts that were likely more common with Zen's simple method of striping data across just four banks. The CCX could suffer in vector loads if its port remains 32B, although it wouldn't be unprecedented for a 256-bit AVX design to not widen the full path through the cache hierarchy.
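
A toy model of why more slices help (hypothetical striping where consecutive lines rotate across banks, in the spirit of Zen's four-bank scheme):

```python
# With N banks, two concurrent random line accesses collide with
# probability ~1/N, so doubling the slice count roughly halves conflicts.
import random

def conflict_rate(num_banks: int, trials: int = 100_000) -> float:
    hits = 0
    for _ in range(trials):
        a, b = random.randrange(1 << 20), random.randrange(1 << 20)
        hits += (a % num_banks) == (b % num_banks)  # same bank = conflict
    return hits / trials

for n in (4, 8, 16):
    print(f"{n} banks/slices: ~{conflict_rate(n):.3f} conflict rate")
```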

One link at 64GB/s in both directions, versus ~85GB/s in each direction for Naples, could be debated in terms of what it offers in throughput versus latency and utilization.
Depending on the IO chip's internal interconnect, bisection bandwidth for the package could be much higher, though one chiplet would hit a harder ceiling in trying to use it.
Congestion-wise, there is less flexibility for the chiplets. The IO chip might afford something more flexible on-die, however.

As far as decoupling the fabric from the memory controller, it seems like AMD's latency has been rather poor even with the monolithic solution, much less one with more asynchronous paths. Rome will likely be somewhat worse, unless AMD's improved fabric manages to bring its baseline of "as bad as being separate" up to "as fast as being integrated", so that un-integrating them lands back at "it could be worse".
The fabric's ability to transport is dependent on how quickly the home agents can process requests and broadcast snoops. Depending on how well AMD's fabric can handle congestion at the points of arbitration, there may be excess power for limited gains, or possibly pathological cases in the fabric if that portion of the network can be overloaded. AMD did promise second-gen Infinity Fabric, whatever that means.
 
As far as decoupling the fabric from the memory controller, it seems like AMD's latency has been rather poor even with the monolithic solution, much less one with more asynchronous paths.

By tying IF speed to the memory controllers, you save some latency on memory requests, but you still cross clock domains (twice) going from one CCX to another. Also, you have less control of your IF bandwidth: add a second DIMM on each channel and speed will degrade from 2.66GHz to 2.4 or 2.133GHz (in the unregistered, non-server world). Add crazy XMP DIMMs, and your IF now uses a lot more power than intended. All to save a handful of sub-ns cycles in the asynchronous FIFOs.
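
To put a number on what's being traded away (the ~2-3 receiving-clock cycles per async-FIFO synchronizer is a typical rule of thumb, not an AMD-disclosed figure):

```python
# Rough latency cost of the two clock-domain crossings per CCX hop.
fclk_ghz = 1.333              # fabric clock at DDR4-2666
cycles_per_crossing = 2.5     # assumed synchronizer cost
crossings = 2                 # out of one clock domain, into the other
print(f"~{crossings * cycles_per_crossing / fclk_ghz:.1f} ns per CCX-to-CCX hop")
# ~3.8 ns -- a handful of sub-ns cycles, as stated above.
```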

There isn't a release date yet, so I hope we get some details at ISSCC in February.

Cheers
 
By tying IF speed to the memory controllers, you save some latency on memory requests, but you still cross clock domains (twice) going from one CCX to another.
Coherence probes and requests go to the coherence controller in the memory controller portion regardless, which means it would inject two clock-boundary crossings as well.
It would be introducing some amount of interface penalty, with part of that penalty proportional to the lower-clocked DRAM domain.
Single-chip, in the uncached memory access scenario the home agent would account for half of the total boundary crossings. In the case of a cache hit in another CCX on the same chip, the agent contributes about a third of the boundary crossings on the critical path of a coherent memory transaction.

Also, you have less control of your IF bandwidth: add a second DIMM on each channel and speed will degrade from 2.66GHz to 2.4 or 2.133GHz (in the unregistered, non-server world). Add crazy XMP DIMMs, and your IF now uses a lot more power than intended. All to save a handful of sub-ns cycles in the asynchronous FIFOs.
Having DRAM be slower than the fabric would mean the coherence controller's rate of processing probes and requests lags the fabric's ability to transport requests to it, so the faster-running fabric spends more cycles and power waiting on requests to be processed by the controller, the snoop filter to be checked, and probes to be generated.
Having the DRAM clocked higher than the fabric means the home node spends more time and power waiting on requests and responses to reach it, and the DRAM's higher bandwidth and power consumption is wasted.
If the power consumption and clock-domain crossings worsen in a system that bottlenecks on the slower of the two sides of the boundary, opting to save crossings and power doesn't seem as bad.

AMD's choice of where it placed its home agents is a holdover from prior generations, and I think there is some sign of complexity within the uncore, given that performance shows a preference for DRAM speeds in specific ranges that might be related to internal divisors or places where the IP is not fully baked. Not knowing the full details of the fabric, the errors and instability reported when fiddling with the fabric or some of these ratios may point to it not being fully validated, or being underspecified for scenarios outside the straightforward one currently in production. The abstraction and agnostic nature of the fabric seems to correlate with the higher-than-expected latency and power consumption, and that's when all fabric components are trying to stay within the same clock range.

Some of the patents and unused firmware settings may also mean that AMD is using, or intends to use, fewer synchronous fabric elements, where an immature implementation might be salvaged by slaving regions to a common clock, or a 2.0 version may relax the limits in place.
 
The current L3 hosts shadow tags for all the local L2s, which allows it to be responsible for filtering probes to them. The complexity of the checks would go down if the L3 dispensed with the mirrored structures, although inclusion would put pressure on the relationship between the number of ways in the L3 versus the number of ways in all the caches hanging from it. Whether the L2s experience some changes may come down to how the L1 data and instruction caches change. The L1 instruction path is apparently going to be enhanced, part of which may include upping the associativity of the L1I, which may have impacts further out.
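
To put numbers on that inclusion pressure, a sketch with assumed Zen-like geometry (eight 8-way 512KB L2s under a hypothetical 32MB 16-way L3 with 64B lines; the L3 parameters are guesses):

```python
# Worst-case demand on one inclusive L3 set from the L2s beneath it.
cores, l2_ways = 8, 8
l3_bytes, l3_ways, line = 32 * 1024 * 1024, 16, 64
l3_sets = l3_bytes // (line * l3_ways)       # 32768 sets
worst_case_lines = cores * l2_ways           # L2 lines that can map to one L3 set
print(f"L3 ways per set: {l3_ways}, worst-case L2 demand: {worst_case_lines}")
# 64 > 16: strict inclusion could force L2 evictions under pathological
# access patterns, part of why Zen's L3 is a victim cache instead.
```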

If the split L3 listed in the screenshot is accurate, it leaves some question about the effects of snooping between them. If using the Zen hierarchy, inter-CCX probes would need to leave the CPU die, go to the IO die, go to the relevant controller, then follow the path back to the other CCX. Fingers would be crossed that the responding CCX can direct its data to the requester without going through those hops again.
Having some amount of local checking on possibly faster silicon would seem beneficial, but would be a change in the memory subsystem's coherence control.
 
Is it possible that those 8C chiplets have integrated DDR4 controllers (that happen to be disabled on Epyc)?
In other words, could one of those 8C chiplets be used alone, without the presence of the I/O hub, to make desktop Ryzen CPUs with 8 cores or less?
 
Is it possible that those 8C chiplets have integrated DDR4 controllers (that happen to be disabled on Epyc)?
In other words, could one of those 8C chiplets be used alone, without the presence of the I/O hub, to make desktop Ryzen CPUs with 8 cores or less?
Perhaps some complicated arrangement could do this, at a significant area cost: 15-20mm² of a chiplet would be unused, which, if the estimates from the pictures are accurate, could translate into a fifth or more of each die. Without the southbridge and PCIe complex on the IO die, the chiplet would also be missing all of the SOC features of the socket it resides on.
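
Quick proportion check (the ~70-80mm² chiplet size is an assumption drawn from the die-shot estimates mentioned above):

```python
# Fraction of the chiplet wasted if the IO functions go unused.
for die_mm2 in (70, 80):
    for unused_mm2 in (15, 20):
        print(f"{unused_mm2}mm^2 of {die_mm2}mm^2 -> {unused_mm2 / die_mm2:.0%}")
# 19-29%, i.e. "a fifth or more of each die" as stated.
```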

video snipped
Perhaps there's a table of contents somewhere of the meaty portions of the video. I guess the most relevant are the various MCMs listed from 3:00-10:00.
One item I suppose we'll have to wait and see about is the TDP of the higher bins versus their core count and highest turbo range. Those are operating in a range that feels somewhat optimistic to hit. A rather broad range of 2x or more variation in chip quality might not be out of the ordinary now, but whether the ~5.0GHz range can be hit without something like further microarchitecture changes, or further stretching of TDP versus the boost/turbo power bands, is unclear.

I'm torn about the idea of having the clients dis-integrated from their IO and memory controllers. The possible latency penalties are more tolerable in a server, where the overall consistency and scaling for 64 cores can give better loaded latencies; that wouldn't apply in a 6-8 core product. The 8-core products that split across chiplets may see latency differences more significant than any other chips in their range.
Another quibble with the drawings is that I would expect an IO die with a single Zen's loadout not to be that outsized versus a CPU chiplet, though as drawings I wouldn't expect accurate scale.

The 16-core rumors seem to situate them in a dual-channel socket, which I'd expect to feel the constraint keenly. Simultaneously, the best clock speeds are only on the most crowded multicore parts.

The GPU chiplet seems interesting and also problematic. What bandwidth would it get versus a CPU, given that Raven Ridge gave its GPU two fabric ports for 11 CUs? What power constraints are there for some of the power-gating measures that could idle all but memory and ancillary display logic? That's more of a mobile optimization, but one made difficult if there's a wide MCM link and two dies kept active to flip display buffer contents.
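
For a rough sense of scale, assuming Raven Ridge's two GPU ports are 32B/cycle at the DDR4-2666 fabric clock (my assumption, mirroring the CPU-side ports):

```python
# Per-CU share of Raven Ridge's GPU fabric bandwidth under these assumptions.
ports, port_bytes, fclk_ghz, cus = 2, 32, 1.333, 11
gpu_bw = ports * port_bytes * fclk_ghz
print(f"GPU fabric bandwidth: ~{gpu_bw:.0f} GB/s (~{gpu_bw / cus:.1f} GB/s per CU)")
# ~85 GB/s total, ~7.8 GB/s per CU -- a bar a chiplet link would have to meet.
```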
The whole GPU/IO die allocation of silicon becomes a question. If there is an IO die with graphics output and display controllers, can it be hosted separately from the GPU--which the video seems to give the same name as the discrete (GDDR6) Navi product discussed near the end of the video?
 
Perhaps there's a table of contents somewhere of the meaty portions of the video. I guess the most relevant are the various MCMs listed from 3:00-10:00.

Yeah I think that's it

[attached image: leaked Ryzen 3000-series SKU table with clocks and prices]


This looks reasonable other than the prices (which seem a bit low). I'm thinking of getting the 3850X, but it's probably a bit too much :D
 
The turbo speeds for the G variants seem significantly lower than would be expected if extrapolating from Raven Ridge. AMD's modern APUs had their max clocks set based on the scenario where either the CPU or GPU could get most of the power budget, as a consequence of their power management being responsive enough to scale clock and voltage independently in narrow time frames.

Raven Ridge has turbo clocks the same or within .1 or .2 GHz of most Zen CPUs of equivalent TDP.

Another question mark about the GPU chiplet, among the others, is what the IO, CPU, and GPU chiplet mixes would be. If the base IO chiplet does not have graphics display IO and its related controllers, does that mean there will be two IO chip variants? Having two negates some of the savings of the dis-integration scheme, since there's another die variation anyway. What's odd, given the use of the same name for a discrete GPU (it seems to me he's being sloppy in using it), is that a discrete GPU would have a significant amount of the IO and host logic anyway--so couldn't the CUs be added to a graphics-enabled IO chip, saving the effort of a GPU chiplet?
 
The turbo speeds for the G variants seem significantly lower than would be expected if extrapolating from Raven Ridge.

Especially considering the 2400G boosts to 3.9GHz with a 65W TDP, while the 3600G only boosts to 3.9GHz with a 95W TDP.

Either the final boost clocks haven't been decided upon yet and the above are placeholder numbers, or the entire thing is a fake.

The GPU chiplet thing does make sense to me though. The IO chip should have the graphics display engine, so you can power the GPU chiplet down completely while still refreshing the screen; this has been an Achilles' heel of AMD's mobile solutions vs Intel since... forever.

Cheers
 