AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

Perhaps some complicated arrangement could do this, at a significant area cost. 15-20mm² of a chiplet would be unused, which, if the estimates from the pictures are accurate, could translate into a fifth or more of each die. Without the southbridge and PCIe complex on the IO die, the chiplet would also be missing all of the SOC features of the socket it resides on.

But requiring the presence of an I/O chip makes no sense in mobile/desktop SKUs.
It would make manufacturing more expensive (2 dies per CPU instead of 1) and require the existence of multiple I/O chip variants since the one on Epyc is overkill.
The "dumb chiplets + I/O hub" scheme is worth it only on Epyc and Threadripper CPUs with lots of cores.
 
It would make manufacturing more expensive (2 dies per CPU instead of 1) and require the existence of multiple I/O chip variants since the one on Epyc is overkill.
Why do two dies make it more expensive? For the package, yes, I guess it would cost a bit more, but why would it be significant? Making the I/O chip on a separate, much cheaper process lowers cost and spreads manufacturing across nodes at a time when 7nm capacity is likely severely constrained.
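To put rough numbers on the yield side of that argument, here's a back-of-envelope sketch using a simple exponential defect model; the areas and defect densities are invented for illustration, not real AMD or foundry figures:

```python
# Back-of-envelope: one big 7nm die vs. a small 7nm chiplet + mature-node IO die.
# Simple exponential (Poisson) yield model: yield = exp(-defect_density * area).
# All numbers are assumptions for illustration only.
import math

def die_yield(area_mm2, defects_per_mm2):
    return math.exp(-defects_per_mm2 * area_mm2)

d_7nm, d_mature = 0.0015, 0.0005  # assumed defects/mm^2; new nodes run worse

monolithic = die_yield(200, d_7nm)                       # everything on 7nm
split = die_yield(80, d_7nm) * die_yield(125, d_mature)  # chiplet + IO die

print(f"monolithic 200mm^2 @ 7nm:     {monolithic:.1%}")  # ~74%
print(f"80mm^2 chiplet + 125mm^2 IO:  {split:.1%}")       # ~83%
```

On these made-up numbers the split parts yield better combined and consume far less scarce 7nm wafer area; packaging and test costs pull the other way, which is the trade being argued here.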
The "dumb chiplets + I/O hub" scheme is worth it only on Epyc and Threadripper CPUs with lots of cores.
AMD are pushing for lots of cores in the desktop. They're redefining what "lots of cores" means compared to Intel's stagnation in that department.
 
The GPU chiplet thing does make sense to me, though. The IO chip should have the graphics display engine, so you can power the GPU chiplet down completely while still refreshing the screen; this has been an Achilles' heel of AMD's mobile solutions vs Intel since... forever.
Is the IO die in question an IO die used by all client products, or a variant for the G products? The cost is either two similar chips being engineered, or one incrementally larger one adding cost for all of them.
One departure in this from other AMD proposals is that they've always had the GPU drive some kind of attached memory bus, which makes sense since GPU silicon needs tend to be closer to those of the IO domain than the CPU regions. Architecturally, GPUs are better at utilizing DRAM on a sustained basis, and that same capability usually costs more if it's not on-die. Perhaps the strangled bandwidth of the socket makes that less important in this instance, but some of the video's rumors indicate the hardware is overspecced for that bandwidth already, leaving open the question of where the inflection point lies between GPU silicon wastage and link capability versus 7nm dies and chip variants.

AMD's modern APUs can gate most silicon besides the controller and memory. The power domains for Raven Ridge are set up to allow this.
The efficacy of power-gating a GPU on-die would have to be compared against power-gating a whole chiplet.
There are elements neither can turn off completely, and so I'm not certain if the chiplet adds much in the idling scenario besides the link controllers and off-die interconnect that cannot be gated.
I have some questions about whether there's a control complex or series of dependencies between the command processor on the chiplet and the ancillary hardware now moved to the IO die, but there could be methods to handle it.

The rumored CU counts seem of dubious value with the DDR bandwidth available, and I don't know what to make of the video using the same name for the Ryzen G products and a discrete product allegedly capable of hitting Vega 56 performance. There's flexibility with chiplets to an extent, but the range between 40GB/s of DRAM bandwidth plus link bandwidth at one end and matching a product with 512 GB/s at the other seems to stretch what the silicon and link can achieve without some significant gaps in what can be reconciled on a supposedly compact chiplet.
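For scale, here's the arithmetic on that bandwidth gap; the DDR4 figures are standard, the comparison point is my own:

```python
# Dual-channel DDR4 bandwidth available to an AM4 APU, vs. big-Vega HBM2.
channels, bytes_per_transfer = 2, 8   # two 64-bit DDR4 channels
for mts in (2666, 3200):
    gbs = channels * bytes_per_transfer * mts / 1000
    print(f"DDR4-{mts} dual channel: {gbs:.1f} GB/s")
# 42.7 and 51.2 GB/s respectively, an order of magnitude below the
# ~410-484 GB/s the Vega 56/64 discrete cards get from HBM2.
```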

But requiring the presence of an I/O chip makes no sense in mobile/desktop SKUs.
It would make manufacturing more expensive (2 dies per CPU instead of 1) and require the existence of multiple I/O chip variants since the one on Epyc is overkill.
The "dumb chiplets + I/O hub" scheme is worth it only on Epyc and Threadripper CPUs with lots of cores.

Having more than one die in a package has happened before, such as with Intel's quad-core Core 2 products that paired two Conroe-generation dies. There is a yield and assembly cost to this, and three chips can add to it. It might depend on where AMD's projections are for volume and yield for an unknown set of chips. If there were a definite high volume of high-yield silicon for a given combination of features, this might lose out. However, if AMD is being pessimistic about the volume or manufacturability of a given graphics or processor SKU, this might be a sensible but still less-than-ideal decision.

Performance- or power-wise, I question the latency for the CPU and the bandwidth for the GPU. An MCM will have a higher floor in terms of power consumption due to the links, and whether that is countered by the presence of 7nm on some of the chips is unclear. Cost-wise, I am curious how appealing this is for the cheapest, highest-volume SKUs. The supposedly debunked rumor of a 28nm bargain-basement single-chip product might make sense in this light, particularly if it was a contingency plan in case GlobalFoundries was somehow not capable of servicing that niche or wanted to hold that range hostage in WSA negotiations at the nodes Zen was on.
 
Why do two dies make it more expensive? For the package, yes, I guess it would cost a bit more, but why would it be significant? Making the I/O chip on a separate, much cheaper process lowers cost and spreads manufacturing across nodes at a time when 7nm capacity is likely severely constrained.

The area of one die with all relevant controllers built in would probably be smaller than one chiplet + one I/O hub combined. It's more expensive to integrate 2 dies in the package, and if something goes wrong with one of them you may lose the other.
Also, lower-end mobile/desktop CPUs sell in much larger quantities than server/workstation CPUs while having much lower profit margins, so every cost reduction in manufacturing should matter.
Two dies would also result in higher memory latency even for low-end mobile/desktop CPUs with fewer cores, which I guess could affect gaming performance, for example.

AMD are pushing for lots of cores in the desktop. They're redefining what "lots of cores" means compared to Intel's stagnation in that department.
Sure, but only a tiny fraction of people really need 12, 16 or more cores in their desktops.
And Amdahl's law will get us all sooner or later.
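To put a number on that, here's the standard Amdahl formula with an assumed parallel fraction:

```python
# Amdahl's law: speedup = 1 / ((1 - p) + p / n), parallel fraction p, n cores.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

p = 0.90  # assume 90% of the workload parallelizes
for n in (4, 8, 12, 16, 32):
    print(f"{n:2d} cores: {speedup(p, n):.2f}x")
# 3.08x, 4.71x, 5.71x, 6.40x, 7.80x: even at 90% parallel, going from
# 8 to 16 cores buys less than going from 4 to 8 did.
```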
 
One reason I'd really like to see a chiplet APU is that the same GPU chiplet could then be used in AIBs, and maybe as the base of an entire product line in different configurations.

The big question I'd like to see answered is: if you start disaggregating the GPU the same way AMD seems to have disaggregated the server CPU with EPYC2 (that is, not just fancy Crossfire but transparent to software), where specifically should you make the cut? Would it make sense to locate the ROPs with the memory controllers to reduce data movement between the chips? AFAICT, there is more traffic between the ROPs and RAM than there is between them and the CUs.
 
I think there are some trends that bring the amount of export bus traffic up. For example, DCC could compress a wavefront's pixel exports significantly before they materialize as DRAM accesses, and some of the best practices for DCC include having a shader write out on all channels even if redundant to make the compressor more efficient.
https://gpuopen.com/dcc-overview/

Late depth checks or culling that feeds off the ROP depth caches are in the ideal case happening for multiple culled primitives or pixels without going to memory. This may become more important in the future, as there are patents about leveraging the depth caches or the hierarchical caches based on them for even earlier culling of primitives in the front end. Overdraw scenarios may become more expensive since the overwrites in the ROP caches become chip link transactions.
Whatever method the CUs use to arbitrate for the export bus and RBEs would also become externally visible if the bus did, adding whatever increment of bandwidth the reservation and release process imposes.

Moving the ROP caches to the die with the memory controllers would allow those caches to still serve in their role of bandwidth amplification for DRAM. However, if they operate just as efficiently in that regard, their DRAM bandwidth consumption remains constant while the export bus (64 (128?) bytes per shader engine) is now visible on the link between the chips. That amplification factor may also be a problem in the compressed-memory case, if the link is sized to match the DRAM bus.
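Rough arithmetic on what exposing the export bus on a chip link would mean, taking the 64 bytes/clock per shader engine figure above and an assumed GPU clock:

```python
# Peak CU->ROP export traffic if the export bus crosses the inter-die link.
# 64 B/clk per shader engine is the figure above; 1.5 GHz is an assumption.
bytes_per_clk_per_se, ghz = 64, 1.5
for num_se in (1, 2, 4):
    print(f"{num_se} SE: {bytes_per_clk_per_se * num_se * ghz:.0f} GB/s peak")
# 96, 192, 384 GB/s: even one shader engine's peak export traffic is about
# double a full dual-channel DDR4 interface (~51 GB/s at DDR4-3200), which
# is the amplification problem described above.
```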

I think that Vega's choice to bring the ROPs within the L2 may point to a decision to further encapsulate them. Besides helping avoid cache flushes when dealing with read after write hazards, I wonder if putting the incoherent ROP caches inside the hierarchy kept the Infinity Fabric simpler by removing a demanding memory client that did not play by the coherent fabric's rules.
 
Is the IO die in question an IO die used by all client products, or a variant for the G products? The cost is either two similar chips being engineered, or one incrementally larger one adding cost for all of them.

The display engine is 5% of the 2200/2400G die, or ~10mm². I would expect it to be on all Ryzen IO dies. Having two mask sets and stocking two different IO dies to save 10mm² seems unlikely.
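As a quick sanity check on that 5% figure (the ~210mm² Raven Ridge die size is a published number):

```python
# Display engine area estimate from the Raven Ridge die size.
raven_ridge_mm2 = 210  # published 2200G/2400G die size, roughly
print(f"~{0.05 * raven_ridge_mm2:.1f} mm^2")  # ~10.5 mm^2, matching above
```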

Wrt. inter-die links, do we know anything about the bandwidth? TSMC demoed a low-latency 2 Gbit/s-per-lane, 256-lane InFO link two years ago on their 16nm process using just 0.4 mW per Gbit/s (0.4 pJ/bit). That's 64 GB/s using just 0.4 W. TSMC's capabilities have improved since then, and their InFO packaging technology has matured.
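The arithmetic behind those figures, for reference:

```python
# TSMC InFO demo as quoted: 256 lanes at 2 Gbit/s each, 0.4 pJ/bit.
lanes, gbit_per_lane, pj_per_bit = 256, 2, 0.4
total_gbit = lanes * gbit_per_lane               # 512 Gbit/s
print(f"{total_gbit / 8:.0f} GB/s")              # 64 GB/s
print(f"{total_gbit * pj_per_bit / 1000:.2f} W per endpoint")  # ~0.20 W
# 0.4 pJ/bit works out to ~0.2 W at 512 Gbit/s, so the ~0.4 W quoted
# presumably counts both ends of the link.
```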

Cheers
 
Moving the ROP caches to the die with the memory controllers would allow those caches to still serve in their role of bandwidth amplification for DRAM. However, if they operate just as efficiently in that regard, their DRAM bandwidth consumption remains constant while the export bus (64 (128?) bytes per shader engine) is now visible on the link between the chips. That amplification factor may also be a problem in the compressed-memory case, if the link is sized to match the DRAM bus.

I think they will structure it similar to Raven Ridge, where the GPU has a fair amount of L2 (1MB?). I could imagine a slice of cache on the IO die at each memory controller, acting as a memory side cache (like we saw in Intel's Crystal Well).

A lot of assumptions are dependent on the bandwidth of the inter-die links. I hope (and expect) we get a significant increase.

Cheers
 
Wrt. inter-die links, do we know anything about the bandwidth?
The most I've seen is that AMD stated it's on generation 2 of the fabric. I would expect at a minimum that the link bandwidth doubles, or has to, since an IFOP link currently only matches one DDR4 channel in terms of bandwidth. Rome's chiplet strategy would immediately strangle the architecture if the single link to the chip did not double.
I think to some extent there would be a desire to do more than that, since EPYC marketing touted its overprovisioned link bandwidth, which implies there is some benefit in having a surplus of bandwidth available beyond the limits of the local channel pair. A GPU chiplet might want more, since it gets more fabric stops than a CCX, although the ceiling is the DDR channels in a Ryzen G.
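For reference, the channel math behind "has to double" (standard DDR4 numbers; the one-channel-per-IFOP-link equivalence is the claim above):

```python
# One DDR4 channel vs. what a lone active chiplet would want to pull.
def ddr4_channel_gbs(mts):
    return 8 * mts / 1000   # 64-bit channel, 8 bytes per transfer

ch = ddr4_channel_gbs(3200)
print(f"one DDR4-3200 channel: {ch:.1f} GB/s")           # 25.6 GB/s
print(f"two channels (client socket): {2 * ch:.1f} GB/s")  # 51.2 GB/s
# If the link only carries ~one channel's worth, a single busy chiplet can
# never see even its local channel pair, let alone Rome's eight channels.
```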

TSMC demoed a low-latency 2 Gbit/s-per-lane, 256-lane InFO link two years ago on their 16nm process using just 0.4 mW per Gbit/s (0.4 pJ/bit). That's 64 GB/s using just 0.4 W. TSMC's capabilities have improved since then, and their InFO packaging technology has matured.
I haven't seen a recent reference to a TSMC interconnect, and I may have missed a paper or article on it. Many interconnect demonstrations can go years before they reach market, so it may still be on the way. I have not seen the specifics for this interconnect, such as the expected connection distance or package layout. Rome's package layout and single link per die to the IO chip make it look like it's following paths similar to some of the longer ones in Naples, and the pictures so far make it seem like the substrate is similar. The chiplet strategy would make it likely the client products would be using the same connectivity.
 
The big question I'd like to see answered is: if you start disaggregating the GPU the same way AMD seems to have disaggregated the server CPU with EPYC2 (that is, not just fancy Crossfire but transparent to software), where specifically should you make the cut? Would it make sense to locate the ROPs with the memory controllers to reduce data movement between the chips? AFAICT, there is more traffic between the ROPs and RAM than there is between them and the CUs.
That would make sense. The concern is that the IO die prefers an older process, which would be less than ideal for ROP and cache logic.

If Navi looked like Rome, could we see three processing chiplets plus a front-end/command-processor chiplet? Or a smaller IO die and fewer chiplet options? Use the same Navi chiplet for all Navi products, then bin the chiplets for clock speeds, or possibly remove/substitute them for compute-only products? That could work around the serialization issues with higher clocks on a risky process. Apply Epyc's design to GPUs: higher cost for consumer boards, but reusing the same small silicon chiplets may pay off in higher-margin markets, similar to how Threadripper uses the better chips.
 
I was referring to this Hot Chips 2016 presentation

Clearly targeted at mobile applications with its low power consumption.

Cheers

That does look like it targets very low power. I'm curious how that package type may react to the higher thermals of the AMD chiplets.
That aside, slide 20 covers my question about trace length and proximity. The demonstration device had 0.55mm spacing between die, and the concept seems to target compact SOP products. I'm not sure about the trace matching, and whether this is more stringent than AMD's current package links. I don't recall AMD going into that kind of detail.
 
Yes, they would need more power for longer traces.

What I found most interesting is the sheer density of traces possible. On page 24, a graphic of the SOC and MEM packages is overlaid on a photo of the ball grid array (7x7mm package). The SOC and MEM dies are less than 7mm² each and allow 256 single-ended com lanes as well as 16 reference lanes (one for each 16-lane sub-channel). Add ground and power as well as the fan-out to the ball grid and we're looking at roughly 500 connections in 7mm², or ~75/mm².
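Checking that density estimate against the quoted figures:

```python
# Connection density from the InFO demo slide, as read above.
signal_lanes = 256 + 16    # single-ended lanes + reference lanes
total_connections = 500    # rough total with power, ground and fan-out added
footprint_mm2 = 7          # per-die footprint from the slide
print(f"{total_connections / footprint_mm2:.0f} connections per mm^2")
# ~71/mm^2, consistent with the ~75/mm^2 estimate above.
```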

This was state of the art three years ago; even if AMD's packaging partner is trailing this, they must be able to pack a hell of a lot of connections onto organic substrates today.

Cheers
 
AMD obviously believes we're heading for a glorious multithreaded future, although I think I'm going to wait till Intel's show or no-show of Ice Lake in late 2019, at least for comparison purposes. If Intel fail to deliver, I think they'll be in real trouble.
 
One reason I'd really like to see a chiplet APU is that the same GPU chiplet could then be used in AIBs, and maybe as the base of an entire product line in different configurations.

The big question I'd like to see answered is: if you start disaggregating the GPU the same way AMD seems to have disaggregated the server CPU with EPYC2 (that is, not just fancy Crossfire but transparent to software), where specifically should you make the cut? Would it make sense to locate the ROPs with the memory controllers to reduce data movement between the chips? AFAICT, there is more traffic between the ROPs and RAM than there is between them and the CUs.

I'm guessing that a GPU chiplet / IO die would be set up similar to the Xbox 360 Xenos and edram daughter die perhaps?
 
New Benchmark Leak Reveals Ryzen 3 3000U Radeon Vega Mobile series
Some interesting information got posted today; it involves the mobile 3000-series processors. That means the procs have Vega cores in them for graphics, much like the Athlon 200GE.

The information is as interesting as it is confusing, as we doubt it is 7nm based. The leak was posted by TUM_APISAK with the following Geekbench scores:

[Geekbench score listings]
What's interesting is that these procs are tagged Raven Ridge, which is an existing 14nm architecture. For example, the Athlon 200GE is based on RR (I am actually currently testing this proc). So in short, that means no 7nm for APUs. You can bet that the Ryzen 3000U series APUs will end up in devices like ultra-portables, notebooks and small devices.
https://www.guru3d.com/news-story/n...-ryzen-3-3000u-radeon-vega-mobile-series.html
 