AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

The worst-case scenario for the entire Zen 2 lineup is that AMD must maintain four layouts, up from two in Zen 1:

* I/O Hub;
* 8-core Chiplet;
* 8-core SoC;
* 4-core APU;

... if the consumer SKUs remain monolithic.
 
So, one thing I do after a major disclosure like this is to go back and look at the old leaks, find which ones had nothing that disagrees with what was released, and see what else they talked about.

Many of the earliest sources on the 8+1 design, from back when I and most others completely disregarded them, agree on a feature that AMD chose not to talk about now: according to the leaks, Rome has an eDRAM cache, and it is 512MB.

This agrees with something I saw on anandtech and twitter: The newest normal process from GloFo is the 12nm (which is pretty much a marketing name...), and AMD already sells products made with it. Why would they call the IO die 14nm? Because GloFo also has the 14nm HP, which can use eDRAM. According to the published density of the 14nm eDRAM cell, a 512MB cache would fit on a 440mm^2 die, if only barely.
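For what it's worth, here's that density argument as a quick back-of-envelope Python sketch. The cell size is the figure published for IBM's 14nm eDRAM, and the 2x overhead factor is my own guess, so treat the result as rough:

```python
# Back-of-envelope area check for a 512MB eDRAM L4 (my numbers, not AMD's).
CELL_UM2 = 0.0174          # assumed bit cell size, um^2 (published IBM 14nm eDRAM figure)
OVERHEAD = 2.0             # guessed overhead for sense amps, decoders, redundancy

bits = 512 * 2**20 * 8     # 512 MB expressed in bits
raw_mm2 = bits * CELL_UM2 / 1e6          # um^2 -> mm^2
print(f"raw array:     ~{raw_mm2:.0f} mm^2")             # ~75 mm^2
print(f"with overhead: ~{raw_mm2 * OVERHEAD:.0f} mm^2")  # ~150 mm^2 of a ~440 mm^2 die
```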

This agrees with something I saw on anandtech and twitter: The newest normal process from GloFo is the 12nm (which is pretty much a marketing name...), and AMD already sells products made with it. Why would they call the IO die 14nm? Because GloFo also has the 14nm HP, which can use eDRAM. According to the published density of the 14nm eDRAM cell, a 512MB cache would fit on a 440mm^2 die, if only barely.
Given how little AMD is saying about the IO die, I think some secret sauce is likely to be in it, and eDRAM could be a possibility.

AMD specifically said that for 32-core Naples, 4 memory channels were not enough, and we saw that Threadripper 2 kind of hits a wall due to memory in many workloads. They have chosen to keep 8 memory channels for Rome, and outside of memory clock increases they still need to feed 64 cores with around the same memory bandwidth as the 32-core parts. An L4 cache of some sort is very likely imo. Even if AMD chooses not to use eDRAM, I think they would still need some sort of SRAM cache.
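To put rough numbers on the bandwidth squeeze (my own assumed memory speeds, not AMD figures):

```python
# Per-core DRAM bandwidth with 8 channels of DDR4 (assumed speeds, not AMD specs).
def dram_bw_gbs(channels, mt_per_s):
    return channels * mt_per_s * 8 / 1000   # 8 bytes per transfer per channel

naples_per_core = dram_bw_gbs(8, 2666) / 32   # ~5.3 GB/s per core (assumed DDR4-2666)
rome_per_core   = dram_bw_gbs(8, 3200) / 64   # ~3.2 GB/s per core (assumed DDR4-3200)
print(f"Naples: {naples_per_core:.1f} GB/s/core, Rome: {rome_per_core:.1f} GB/s/core")
```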
 
Really well. They actually do much better at IO than finFETs; their deficits are on the logic side. I commented elsewhere that since GloFo is dabbling in SRAM-like MRAM, which would have a density ~4-5 times better than SRAM on the same process, an IO die using that for cache would be a really neat fit for the technology. It would, of course, also depend on multiple pieces of unproven tech, so it's not likely something you'd risk your main new product introduction on.
Is this GF's ST-RAM product? The last I saw, its marketed write endurance was on the order of 10^8 cycles, which is ten or so orders of magnitude below what some may argue is "unlimited".

This agrees with something I saw on anandtech and twitter: The newest normal process from GloFo is the 12nm (which is pretty much a marketing name...), and AMD already sells products made with it. Why would they call the IO die 14nm? Because GloFo also has the 14nm HP, which can use eDRAM. According to the published density of the 14nm eDRAM cell, a 512MB cache would fit on a 440mm^2 die, if only barely.
This sounds like IBM's 14nm process, which GF acquired. I'm not sure that's been offered to anyone else, and IBM's process is a complex many-layer SOI node.
How much area goes into eDRAM in this scenario?

Then if there's no SRAM, why is that thing so huge?
It's not like it's using 45nm or even 28nm. It's using 14nm.
A ~213mm2 Zeppelin die has ~88mm2 bound up in CCX area, with the rest being "other" including DDR, PCIe, fabric links, and other IO.
DDR and PCIe alone may be 30-40mm2 per die, without ancillary logic or paths needed to connect them. 4 processors' worth of them could be 1/3 the die, and then there's the 8 IF blocks and some unknown fraction of the uncore the IO would need to control and connect.
I guess some of the question is how much of the remaining chip area can be de-duplicated or simplified instead of put on the IO die. The data transport layer is currently implemented as a crossbar, which may change on the IO die because of the concentration of switch clients in one place. Naively plunking down the non-CCX area of 4 Zeppelin die would cover most of the IO die.
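Putting rough numbers on that naive tally (all inputs are the estimates above, so this is only a bound):

```python
# Naive tally: "4 Zeppelins' worth" of uncore area, assuming nothing is de-duplicated.
ZEPPELIN_MM2 = 213
CCX_MM2      = 88                     # two CCXs per Zeppelin
DDR_PCIE_PER_DIE = (30, 40)           # per-die guess for DDR + PCIe PHY

non_ccx = ZEPPELIN_MM2 - CCX_MM2      # ~125 mm^2 of "other" per Zeppelin
print("4x non-CCX, no de-dup:", 4 * non_ccx, "mm^2")        # ~500 mm^2
print("4x DDR+PCIe PHY only:", 4 * DDR_PCIE_PER_DIE[0],
      "-", 4 * DDR_PCIE_PER_DIE[1], "mm^2")                  # 120-160 mm^2
```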
 
There was a reason why all of what is now part of the IO die got integrated into the CPU over the past 30 years; going back on that is just going to make things cost more and perform less.
I doubt this very much; I/O doesn't scale, and new processes are significantly more expensive per mm^2.

Out of that I/O, how much is critical to performance? SATA, USB, Ethernet, NVMe, PCIe: none of those look important to me.
 
I did a quick sketch-up of all PHY interfaces presumed for the I/O hub, sized from the 14nm Zeppelin layout and factoring in all the interfacing logic and pad stacking; not much space is left for a large eDRAM array.
 
I'm pretty concerned about the idea of so many different dies being needed.
The key, audacious success of Zen 1 was that it used a single, relatively small die to do everything from low-end consumer all the way up to 32-core monster server sockets.

To now make multiple dies on different processes with a bunch of them incompatible with other market segments is throwing away that incredible efficiency.

On the other hand, yes, you get a smaller 7nm chiplet, and being made up mostly of copy/paste blocks, I guess it's probably relatively cheap to develop several sizes of IO chip.

Just occurs to me: could the big one be designed so it can be chopped into halves/quarters to create the smaller versions?
That way one tapeout could produce 3 sizes of IO chip, even off each wafer; that'd satisfy my desire for efficiency/simplicity.

EPYC Naples and EPYC Rome side-by-side:
Nice.
By rough eyeball 2 chiplets & 1/4 of the IO die isn't actually much bigger than 1 Zen1 chip, which ain't bad for twice the cores & double width FP units.
 
On the other hand, yes, you get a smaller 7nm chiplet, and being made up mostly of copy/paste blocks, I guess it's probably relatively cheap to develop several sizes of IO chip.

Just occurs to me: could the big one be designed so it can be chopped into halves/quarters to create the smaller versions?
That way one tapeout could produce 3 sizes of IO chip, even off each wafer; that'd satisfy my desire for efficiency/simplicity.
A scaled down I/O hub for consumer SKUs is possible, but the question remains how well that MCM approach will fit into the razor-thin margins of the Ryzen series, compared to a monolithic design.
If the chiplet yields are good and EPYC/TR sales are consistent enough to capture a significant market share, the production volume could make it profitable to reuse the same architecture for the consumer segment.
 
Given how little AMD is saying about the IO die, I think some secret sauce is likely to be in it, and eDRAM could be a possibility.
Would make sense, as ideally there is an L4/LLC used with NVDRAM to lower power draw and increase performance. I still think a single stack of HBM may have been a superior solution for capacity with MCM, but AMD did seem fond of their 14nm custom SRAM cell design for Zen. This may explain why.
 
A scaled down I/O hub for consumer SKUs is possible, but the question remains how well that MCM approach will fit into the razor-thin margins of the Ryzen series, compared to a monolithic design.
If the chiplet yields are good and EPYC/TR sales are consistent enough to capture a significant market share, the production volume could make it profitable to reuse the same architecture for the consumer segment.
They still have APUs that could bypass the IO chiplet design. However, they really need more bandwidth in that segment anyway. With the heterogeneous memory support AMD has been working on, there is likely another memory controller in the mix somewhere. Even a single stack of HBM connected like a CCX chiplet over IF could be huge.

Did AMD state the IF bandwidth on Rome anywhere?
 
I don't know why people expect that consumer Ryzen will use chiplets.

There is no reason to believe that the consumer Ryzen 3000 series will look anything like this. If I had to bet, I would expect it to be a traditional CPU, because when you aren't dealing with this many cores, you lose all the advantage of splitting the die into these chiplets.

Because not using chiplets would mean using a different die, which would require designing it, and AMD usually tries to avoid that.
 
There's a rumor pointing to the IO die having 256MB as cache:


Though I doubt it's L3.


That worked for zen 1, doesn't mean it makes sense for zen 2.
And why do you assume it doesn't make sense?
Economy of scale is still a thing..
Keep in mind that each 8-core chiplet carries 16 PCIe 4.0 lanes, which an I/O chip could easily convert into different combinations of up to 32 PCIe 3.0 lanes. I'd say that if there were absolutely no plans to use the chiplets in anything other than EPYC, why would the chiplets have PCIe in there at all? It would always be cheaper to save die area on the chips being made on the more expensive process.
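For anyone checking the lane-conversion math, here's the aggregate-bandwidth equivalence (my own arithmetic, using the per-lane spec rates after 128b/130b encoding):

```python
# 16 lanes of PCIe 4.0 vs 32 lanes of PCIe 3.0, per direction, ignoring protocol overhead.
gen3_gbs = 8  * 128 / 130 / 8   # ~0.985 GB/s per lane
gen4_gbs = 16 * 128 / 130 / 8   # ~1.969 GB/s per lane
print(f"16x gen4: {16 * gen4_gbs:.1f} GB/s")   # ~31.5 GB/s
print(f"32x gen3: {32 * gen3_gbs:.1f} GB/s")   # ~31.5 GB/s
```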

Plus, the GF agreement still exists, so keeping them making 2 or 3 I/O dies plus some older GPUs (or IOs with GPUs?) may be the best possible outcome for every partner.

What zen 2 really does well is the scale from 32 to 64 cores.
That sounds a bit unfair to everything they presented to improve IPC. AVX2 performance is supposedly doubled.


Naively plunking down the non-CCX area of 4 Zeppelin die would cover most of the IO die.
It's a rather safe bet to assume they don't have to repeat that same logic 4 times over, though..


To now make multiple dies on different processes with a bunch of them incompatible with other market segments is throwing away that incredible efficiency.
So far there's one small die being made on an expensive high-end process and one large die on a cheap low-end process.
Only the cheap-ish die is incompatible with other market segments, and for those AMD can use an even cheaper die.
Perhaps 7nm+ EUV will change this, but for now this seems like a great choice to me.
 
The 256MB, if true, is likely to be an aggregate cache memory number. I could imagine each core having a 1MB L2 cache and a 1MB L3 victim cache slice, for a total of 16MB of cache on each chiplet and 128MB of cache in the IO chip, likely sliced as 16MB per DDR4 PHY.
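Spelled out, the arithmetic behind that reading (all numbers are the guesses above):

```python
# 8 chiplets x 8 cores x (1MB L2 + 1MB L3 slice), plus 16MB per DDR4 PHY on the IO die.
chiplet_mb = 8 * 8 * (1 + 1)      # 128 MB across the chiplets
io_mb      = 8 * 16               # 128 MB on the IO die
print(chiplet_mb, "+", io_mb, "=", chiplet_mb + io_mb, "MB")   # 256 MB aggregate
```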

Cheers
 
The 256MB, if true, is likely to be an aggregate cache memory number. I could imagine each core having a 1MB L2 cache and a 1MB L3 victim cache slice, for a total of 16MB of cache on each chiplet and 128MB of cache in the IO chip, likely sliced as 16MB per DDR4 PHY.

Cheers

Doesn't that work out to less cache per chiplet than what's already on 2x Zen CCXs?
 
Are there any specifics on Infinity Fabric link bandwidth?

Ryzen dies had 3 IF links, each 2x32 bits wide (running at 4x DRAM command speed). Given the topology of EPYC 2, each chiplet only needs one IF link, so I'd expect it to be at least twice as wide as a single link. I'd also expect the operating frequency to be decoupled from the DRAM command rate (because that was never really a good idea). I.e., a 2x64-lane link running at 2GHz (8GT/s rate) would have 64GB/s bandwidth in each direction.
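The arithmetic for that guessed configuration, for anyone who wants to check it:

```python
# 2x64 lanes (one 64-lane bundle per direction) at 8 GT/s, no protocol overhead assumed.
lanes_per_direction = 64
rate_gt_per_s = 8
print(lanes_per_direction * rate_gt_per_s / 8, "GB/s per direction")   # 64.0
```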

Cheers
There's 4 IF links on the die, which AMD cites as allowing them to fully connect all dies in a package with a minimal number of package layers, despite the chips' orientations changing to keep the DDR PHY facing the outside of the package.
If the second-gen fabric's protocols aren't too different, the home agents for managing coherence would still be tied to the memory controllers. That link is likely a good part of why the coherent fabric is bound to the memory controller's clock, though it's unclear to me whether that choice was made for power consumption, to remove a few clock domain crossings in an already longer-latency subsystem, or because of some constraint imposed by AMD's generalized and topology-agnostic fabric.
I haven't seen AMD describe Zen's coherence protocols in detail, though I presumed that if it's something like prior multi-core methods, the inter-die links on the current EPYC MCM would allow direct data transfers from a cache holding a dirty copy to the requesting core without going through additional hops in a scenario where the home node is on a third chip--which is all the time given our understanding of Rome.
It's not required that there be other links, although it could be beneficial.

Keep in mind that each 8-core chiplet carries 16 PCIe 4.0 lanes, which an I/O chip could easily convert into different combinations of up to 32 PCIe 3.0 lanes. I'd say that if there were absolutely no plans to use the chiplets in anything other than EPYC, why would the chiplets have PCIe in there at all? It would always be cheaper to save die area on the chips being made on the more expensive process.
Was this stated by AMD? I saw some people drawing an inference that this was so, but the center of the IO die's right and left sides is taken up by IO blocks whose length is somewhat shorter than the DDR interfaces on the top and bottom. Perhaps that diagram is not representative, but if it is close to reality I'm not sure what other IO needs that much perimeter.
Since this appears to be using the same socket connectivity as Naples, xGMI over a subset of the PCIe lanes remains the method for two-socket communication, which seems contrary to the goal of centralizing the coherent traffic in an IO chip only to add complexity back by mapping inter-socket traffic back to the chiplets--and there are twice as many of them as there are xGMI links.

It's a rather safe bet to assume they don't have to repeat that same logic 4 times over, though..
That depends on what could be considered optional while maintaining or improving upon the capabilities offered by Naples. There appear to be server/workstation products that have more SATA or USB links coming out of the socket than one Zen chip can provide, but that could potentially be handled by the hardware equivalent of two southbridges.
Around half of the non-CCX area of 4 Zen chips is likely not negotiable if the PHY and requisite controllers are in the IO die, and a significant fraction of the remainder could be needed to provide the same features before considering any enhancements.
That could still leave area for storage in the IO chip, but I think it tends towards moderating expectations about what a conventional cache array could provide, or it posits something more exotic to get such a large amount of caching into a more modest space, or a compression of the uncore's area consumption.
 
The 256MB, if true, is likely to be an aggregate cache memory number. I could imagine each core having a 1MB L2 cache and a 1MB L3 victim cache slice, for a total of 16MB of cache on each chiplet and 128MB of cache in the IO chip, likely sliced as 16MB per DDR4 PHY.

8 Zen1 cores already have combined 16MB of L3. The shrink from GF 14nm -> TSMC 7nm shrinks SRAM arrays much more than it shrinks logic. I would be extremely surprised if the chiplet dies do not have 32MB of L3 per chiplet. Those put together are 256MB. Of course, a single core can only make use of 32MB max.

On top of that, I expect the IO die to have another cache level, an L4 with 512MB. (Based on the earliest 8+1 leaks mentioning that.)

I did a quick sketch-up of all PHY interfaces presumed for the I/O hub, sized from the 14nm Zeppelin layout and factoring in all the interfacing logic and pad stacking; not much space is left for a large eDRAM array.

Can you show this? Because I did a quick calculation and expect that there is >200mm^2 free.

The IF links connecting the chiplets are likely x16 PCI-E, running at a much higher frequency than standard (since they only need to cover a distance of 2cm, with no sockets or connectors of any kind to add capacitance). AMD did the same for the IF links of Zeppelin, but now they have PCI-E 4, with double the throughput. They will likely get >50GB/s per direction from that, and that's enough. The rest of the PCI-E, which provides the off-socket IF and connection to devices, doesn't come from the IO die but from the chiplets. That is, each chiplet has a total of 32 lanes of PCI-E. This nicely means that adding or removing chiplets doesn't change the number of PCI-E lanes available in the system -- if you only fit 4 chiplets, you get a total of 64 lanes from the chiplets and 64 free from the IO die.

So for memory and for connecting chiplets, the die loses 8x7.5mm^2 = 60mm^2 for DDR4 and 8x8.7mm^2 = ~70mm^2 for the IF/PCIe. Assuming 440mm^2, 310mm^2 still remain. All the other IO and random stuff on the Zen package, excluding the IFOPs (since those will go on the chiplets), is <80mm^2, leaving more than a full Zeppelin die's worth of space for cache.
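The same budget as a sketch, so the numbers are easy to poke at (all inputs are my estimates above):

```python
# IO die area budget: 440 mm^2 minus PHYs and misc IO, remainder available for cache.
IO_DIE_MM2  = 440
DDR4_PHY    = 8 * 7.5        # 60 mm^2
IF_PCIE_PHY = 8 * 8.7        # ~70 mm^2
MISC_IO     = 80             # "other IO and random stuff", upper bound

left_for_cache = IO_DIE_MM2 - DDR4_PHY - IF_PCIE_PHY - MISC_IO
print(f"left for cache: ~{left_for_cache:.0f} mm^2")   # ~230 mm^2, > one Zeppelin (~213 mm^2)
```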
 
My impression from sources like https://fuse.wikichip.org/news/1064/isscc-2018-amds-zeppelin-multi-chip-routing-and-packaging/4/ is that the on-package links for Zeppelin are 32-wide and running at a lower speed than PCIe. IFOP is described as using single-ended links drawing less than a quarter of a PCIe-based IFIS link's power.
The bandwidth figures for the on-package links also match more cleanly with power-of-two widths without the PCIe protocol's CRC overhead.
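As an example of how the published figures line up with a 32-wide link coupled to MEMCLK (assuming DDR4-2666 and the 4x-MEMCLK transfer rate described in that article; no encoding overhead):

```python
# IFOP-style link: 32 bits wide, 4 transfers per MEMCLK, DDR4-2666 assumed.
memclk_ghz = 1.333
width_bits = 32
rate_gt_per_s = 4 * memclk_ghz                  # ~5.33 GT/s, below PCIe gen3's 8 GT/s
per_dir_gbs = width_bits * rate_gt_per_s / 8    # ~21.3 GB/s per direction (~42.6 GB/s both ways)
print(f"{rate_gt_per_s:.2f} GT/s -> {per_dir_gbs:.1f} GB/s per direction")
```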

Is there a source on the 32 lanes of PCIe per chiplet?
 
8 Zen1 cores already have combined 16MB of L3. The shrink from GF 14nm -> TSMC 7nm shrinks SRAM arrays much more than it shrinks logic. I would be extremely surprised if the chiplet dies do not have 32MB of L3 per chiplet. Those put together are 256MB. Of course, a single core can only make use of 32MB max.
I'm not quite convinced of this. In particular, assuming there's an L4, the benefits of having such a large L3 may not be all that much, and I'd expect it not to increase (per core). 16MB of extra L3 might not need THAT much area on 7nm, but it's still probably ~10mm^2 (a rough check of that figure is sketched after this post).
The rest of the PCI-E, which provides the off-socket IF and connection to devices, doesn't come from the IO die but from the chiplets. That is, each chiplet has a total of 32 lanes of PCI-E. This nicely means that adding or removing chiplets doesn't change the number of PCI-E lanes available in the system -- if you only fit 4 chiplets, you get a total of 64 lanes from the chiplets and 64 free from the IO die.
Seems rather unlikely to me. You'd need different routing on the package depending on how many chiplets are connected. Seems much simpler to me if you handle all the PCIe from the IO die as well, saving precious area on the 7nm chiplets.
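The rough check mentioned above, using TSMC's published 7nm high-density SRAM bit cell and a guessed overhead factor for tags, control logic and wiring:

```python
# ~16MB of extra L3 on TSMC 7nm: published HD bit cell ~0.027 um^2, 2x overhead guessed.
CELL_UM2 = 0.027
OVERHEAD = 2.0
bits = 16 * 2**20 * 8
print(f"~{bits * CELL_UM2 / 1e6 * OVERHEAD:.1f} mm^2")   # ~7 mm^2, same ballpark as ~10 mm^2
```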
 
Zen 2 IF is 64b wide, or double that of Zen 1.
IF 2, at least for GPU to GPU, is running at 100GB/s.

There is a possibility that on-package IF runs faster, as was the case in Zen 1 for on-chip IF vs. inter-chip coherent IF.
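As a rough consistency check of those two numbers together (my arithmetic, not an AMD breakdown): if that 100GB/s were a single direction over a 64-bit-wide link, the required per-pin rate would be:

```python
# Per-pin rate needed for 100 GB/s in one direction over a 64-bit-wide link.
target_gbs = 100
width_bits = 64
print(target_gbs * 8 / width_bits, "GT/s")   # 12.5 GT/s
```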
 