AMD Ryzen CPU Architecture for 2017

I do believe they should focus on single-threaded performance, NUMA (Infinity Fabric), and higher clock speeds for now. The software just isn't there yet for more cores, except in the server and research space. Adding more cores would also just take up more space, which they could instead use to improve a single CCX.

Agreed. I think the space, time, and money should go toward optimizing, refining, and improving the existing design in the background, so they not only get an increase in performance and eliminate all (or most) of the cons/weak points, but also have a very robust base to build on for Zen 3's increase in cores/features.
 
Whether or not AMD chooses to expand the CCX layout (or cram more quad-core clusters on die), they have to bump up the Infinity Fabric, probably make it asynchronous, so that at least the inter-CCX link can work at a higher clock rate, not dependent on the memory-controller speed.
Infinity seems tied largely to PCIe signaling tech. So PCIe 4 should up the bandwidth a bit, but it primarily set the stage for multi-level signaling with PCIe 5 in the near future. That should yield some large bandwidth increases, as even optical interconnects were being explored; there were some presentations on that at the last PCI-SIG conference.
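
For context on that memory-controller coupling: on Zen 1 the fabric clock runs at the memory clock (half the DDR transfer rate), so inter-CCX bandwidth moves in lockstep with the installed DIMMs. A minimal sketch of that scaling, assuming the commonly cited 32 B/cycle inter-CCX link width:

```python
# Rough model: Zen 1 inter-CCX bandwidth scales with the memory clock,
# since the fabric clock (FCLK) equals MEMCLK (half the DDR transfer rate).
BYTES_PER_FCLK = 32  # commonly cited inter-CCX link width (assumption)

def inter_ccx_gbs(ddr_rate_mts: int) -> float:
    """GB/s across the inter-CCX link at a given DDR4 transfer rate."""
    fclk_mhz = ddr_rate_mts / 2          # FCLK = MEMCLK on Zen 1
    return fclk_mhz * 1e6 * BYTES_PER_FCLK / 1e9

for rate in (2133, 2666, 3200):
    print(f"DDR4-{rate}: {inter_ccx_gbs(rate):.1f} GB/s between CCXs")
```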

I do believe they should focus on single-threaded performance, NUMA (Infinity Fabric), and higher clock speeds for now. The software just isn't there yet for more cores, except in the server and research space. Adding more cores would also just take up more space, which they could instead use to improve a single CCX.
Memory systems IMHO will be the next focus: using HBM or large caches close to the die to facilitate the nonvolatile memory technology that is starting to show up. Optane is very dense and energy efficient so long as you don't write to it. Caches could absorb those writes, while HBM, for example, could provide enormous bandwidth. Just imagine a single Epyc with 4 HBM stacks at 8GB each, if not more, providing 1TB/s of bandwidth for a database. That likely reduces strain on Infinity as well if used as a cache in addition to NVDIMMs as main memory.
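
As a rough sanity check on that 1TB/s figure, assuming HBM2-class stacks (2 Gbps per pin over a 1024-bit interface per stack):

```python
# Back-of-the-envelope: aggregate bandwidth of 4 HBM2 stacks.
PIN_RATE_GBPS = 2.0     # assumed HBM2-class per-pin data rate
INTERFACE_BITS = 1024   # interface width per stack
STACKS = 4

per_stack = PIN_RATE_GBPS * INTERFACE_BITS / 8   # 256 GB/s per stack
total = per_stack * STACKS                       # 1024 GB/s, i.e. ~1 TB/s
print(f"{per_stack:.0f} GB/s per stack, {total:.0f} GB/s across {STACKS} stacks")
```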
 
Infinity seems tied largely to PCIe signaling tech. So PCIe 4 should up the bandwidth a bit, but it primarily set the stage for multi-level signaling with PCIe 5 in the near future. That should yield some large bandwidth increases, as even optical interconnects were being explored; there were some presentations on that at the last PCI-SIG conference.


Memory systems IMHO will be the next focus: using HBM or large caches close to the die to facilitate the nonvolatile memory technology that is starting to show up. Optane is very dense and energy efficient so long as you don't write to it. Caches could absorb those writes, while HBM, for example, could provide enormous bandwidth. Just imagine a single Epyc with 4 HBM stacks at 8GB each, if not more, providing 1TB/s of bandwidth for a database. That likely reduces strain on Infinity as well if used as a cache in addition to NVDIMMs as main memory.

You mean an Epyc SKU with 2 or 3 CPU dies and 1 or 2 HBM dies as a last-level cache? I don't know if we're going to see something like that, but it would be really interesting to see for sure.

And while we are on that, could future APUs use this approach? APU + HBM on the same package?
 
You mean an Epyc SKU with 2 or 3 CPU dies and 1 or 2 HBM dies as a last-level cache? I don't know if we're going to see something like that, but it would be really interesting to see for sure.
Something along those lines. Not sure it will be HBM, but from a design standpoint I laid out some solid reasoning for doing so. Space would be interesting, but they could put two Ryzens on an interposer with an HBM stack shared between them. Then scale from there with long strips.

And while we are on that, could future APUs use this approach? APU + HBM on the same package?
Really hoping it does, as it would have a ton of potential. It could be a different Epyc variant for competing with AVX-512, or for cases where that cache is useful.
 
Nvidia recommends Ryzen CPUs for their Battlebox PCs.

http://www.pcworld.com/article/3198...e-gtx-battlebox-pcs-finally-embraces-amd.html

I like it when companies play nice.
 
Spend more die area on L3 for 7nm?
Certainly a possibility; however, they are currently using one chip for everything. The HBM or external memory solution would be more flexible in regard to design. I'm not suggesting that 7nm parts won't have more L3 in addition to other changes. I could see favoring a larger L2 and replacing L3 with tightly integrated HBM, as the bandwidth isn't that different. Another possibility is replacing an entire CCX with stacked RAM: build HBM on top of Infinity as "next-gen" memory. Certainly possible with active interposers, which AMD has been exploring.

I like it when companies play nice.
What I'd be really curious to see is whether Nvidia makes a DGX-1 with Epyc as opposed to Xeons with two PLX switches. It should be faster and cheaper, not that price is much of a concern. Turn the market into a Mexican standoff.
 
Spend more die area on L3 for 7nm?

The problem is not necessarily the area; their L3 needs to be fully connected between the dies and work as a single L3 (which is not the case yet). In fact, I'm curious to see what they will implement to correct the L3 latency with Zen 2, because I know there are many engineers at AMD working on this problem. Time will tell what solution they find.
 
If AMD keeps the same design for the 7nm shrink, the most logical thing to do is to move to 6 cores per CCX. This is how their lineup would look assuming 6 cores per CCX:
Earlier this year, there was speculation mentioning three possibilities that reach 48 cores:
  1. 6 cores per CCX, 2 CCXs per die, 4 dies per CPU.
  2. 4 cores per CCX, 3 CCXs per die, 4 dies per CPU.
  3. 4 cores per CCX, 2 CCXs per die, 6 dies per CPU.
What are the advantages and disadvantages of each possibility from gaming and server perspectives?
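
All three layouts multiply out to the same total; a quick enumeration of cores per CPU:

```python
# The three speculated 48-core layouts: (cores/CCX, CCXs/die, dies/CPU).
layouts = [(6, 2, 4), (4, 3, 4), (4, 2, 6)]
for cores, ccxs, dies in layouts:
    total = cores * ccxs * dies
    print(f"{cores} cores/CCX x {ccxs} CCX/die x {dies} dies = {total} cores")
```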
 
The first solution seems the most "safe" and straightforward implementation from an infrastructure point of view. The rest would require significant investment in interface and wiring overhead.

By the way, factoring in the memory-interfacing limitations of the SP3 socket, the six-die option is out of consideration anyway.
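
One way to quantify that wiring overhead: if the dies stay fully connected, as on the current 4-die MCM, the die-to-die link count grows quadratically, so a six-die package would need 5 GMI ports per die instead of 3. A quick sketch:

```python
# Die-to-die links needed to keep an MCM fully connected: n*(n-1)/2.
def full_mesh_links(dies: int) -> int:
    return dies * (dies - 1) // 2

for n in (4, 6):
    print(f"{n} dies: {full_mesh_links(n)} links total, {n - 1} GMI ports per die")
```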
 
Does AMD still plan on using HBM only on APUs?

Wouldn't there also be a huge benefit to making a server-targeted product with, say, 6 cores per CCX, 2 CCXs per die, and 2 dies per CPU, with the remaining space reserved for HBM modules? Wouldn't 16 or 32GB of HBM acting as an L4 cache provide more benefit than another 2 dies?
 
The first solution seems the most "safe" and straightforward implementation from an infrastructure point of view. The rest would require significant investment in interface and wiring overhead.
More wires, while difficult to achieve, might be worthwhile in a mesh. That should reduce the bandwidth burden per link on Infinity. I'd lean more towards an even number of CCXs per die, to possibly interface HBM (or similar tech) with 2 channels. That should lower latency and link constraints while allowing them to vary cache sizes.

Does AMD still plan on using HBM only on APUs?
I'm not sure we know they are using HBM even for APUs. The old Zeppelin APUs seem more along the lines of CPU + discrete GPU, with PCIe being an extension of Infinity. That said, integrating HBM onto a traditional APU or server CPU would solve a lot of bandwidth and socket-limitation issues.

Wouldn't there also be a huge benefit to making a server-targeted product with, say, 6 cores per CCX, 2 CCXs per die, and 2 dies per CPU, with the remaining space reserved for HBM modules? Wouldn't 16 or 32GB of HBM acting as an L4 cache provide more benefit than another 2 dies?
It should still be situational, but yeah, for many memory-intensive applications and databases that bandwidth could be a game changer. DDR4-3200 (25.6GB/s) with 4 channels would deliver roughly 40% of the bandwidth of a single stack of HBM2. DDR5 would double that and be roughly equivalent, but that's not considering faster HBM. For a low-end system, a single stack of HBM could allow for completely removing traditional memory. An 8GB stack of HBM should be sufficient for most simple desktop operations and have its cost offset by the lack of memory, motherboard complexity, size, and probably even power.
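
Running those channel numbers, assuming an HBM2-class stack at ~256 GB/s:

```python
# Quad-channel DDR4-3200 vs one assumed HBM2 stack (~256 GB/s).
DDR4_3200_PER_CHANNEL = 25.6   # GB/s (3200 MT/s * 8 bytes/transfer)
HBM2_PER_STACK = 256.0         # GB/s, assumed HBM2-class stack

ddr4_quad = DDR4_3200_PER_CHANNEL * 4   # 102.4 GB/s
ddr5_quad = ddr4_quad * 2               # ~204.8 GB/s if DDR5 doubles per-channel rates

print(f"DDR4 x4: {ddr4_quad:.1f} GB/s ({ddr4_quad / HBM2_PER_STACK:.0%} of one stack)")
print(f"DDR5 x4: {ddr5_quad:.1f} GB/s ({ddr5_quad / HBM2_PER_STACK:.0%} of one stack)")
```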

That being the case, there has been a fair amount of database acceleration using GPUs. So in that case, half the CPU cores with a GPU and HBM attached would likely be a game changer: say two CCXs and 1-2 Vega 11(?) with a single stack of HBM. Having seen 8-Hi stacks, I'd think 8GB of L4/LLC tripling (L4 + DDR4) your effective memory bandwidth would be a solid step. Provided sufficient power, it should fit in the same socket as well. Ultimately it would seem a question of how much they want to differentiate their product stack: have 40 or so chips like Intel, or reuse one chip for the entire stack like Ryzen. An APU could make that two chips while handling AVX-512 and bandwidth-intensive applications, which tend to be parallel workloads anyway.

Another possibility might be sacrificing PCIe lanes for memory channels: have 8 or more channels per socket for interfacing a LOT of NVDIMMs. I haven't looked into how well NVDIMMs work with ECC/RAID for redundancy on what is likely a storage server.
 
I don't think AMD will go for more than 4 cores per CCX, since the design needs to go from notebooks to servers. Maybe 3 CCXs per die.
 
Trying hard to understand why anyone would want this on? o_O
Per the article, it's in the memory standard for the higher speeds, for stability purposes. Across a wider range of module qualities, the standard is more concerned with achieving broad compatibility and stability than with performance.

hm... How does it compare to IBM's implementation/configuration for L3? (Power 7/8/9)
IBM's eDRAM cache is large but is subdivided into local slices. Whether a slice is private to a core or shared between two depends on the generation. Relative to a CCX, there's at least twice as much L3 per core as AMD has.
IBM's bandwidth between cache levels ranges from 2-4 times that of AMD, and unlike AMD the L3 has a more complex relationship with the L2 in that it will eventually copy hot L2 lines into itself; the chip overall has a complex coherence protocol and migration policies for cloning shared data between partitions.
Latency-wise, the L1 and L2 in recent Power chips are on the order of Intel's caches, and the local L3 is something like ~27 cycles versus AMD's ~30-40. Power 8 can seemingly muster this at significantly more than 4 GHz; Power 9 is less documented but seems to have 4 GHz as a starting point. The L3 is an eDRAM cache, so it may have non-best-case latencies that differ from the SRAM-based L3 of Zen: DRAM can be more finicky in terms of its access patterns and when it is occupied with internal array maintenance like refresh.
The cache hierarchy is dissimilar, with a write-through L1 and an L2-L3 relationship that is more inclusive. Unlike AMD's, the Power L3 can cache L2 data and can participate in memory prefetch.
Remote L3 access is pretty long with IBM, but Zen is equally poor. In terms of the bandwidth for those remote hits, AMD's fabric is missing a zero in the figure even on-die.
IBM's die size, power, and price for all that are typically not in the same realm.

Some areas that do add up more favorably for Epyc are the DRAM and IO links per socket.

@3dilettante
AMD confirmed single-ended, not LVDS, for the GMI links (around 5:20), so that makes a lot more sense given the bandwidth numbers, clock rates, and number of pins.

If the differential PHY is running at 10.6 Gbps for xGMI, The Stilt's finding that GMI is twice as wide and half as fast gives each package link 32 signals in each direction, at a more sedate 5.3 Gbps.
AMD's diagram in its Processor Programming Reference document has 4 GMI controllers, though the MCM only uses 3 per die.
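
A quick check that "twice as wide, half as fast" keeps the per-link bandwidth constant, assuming a 16-lane xGMI PHY:

```python
# Per-direction bandwidth of a link: signals * rate / 8 bits-per-byte.
def link_gbs(signals: int, gbps_per_signal: float) -> float:
    return signals * gbps_per_signal / 8

xgmi = link_gbs(16, 10.6)   # off-package xGMI: 16 lanes at 10.6 Gbps (assumed width)
gmi = link_gbs(32, 5.3)     # on-package GMI: 32 signals at 5.3 Gbps
print(f"xGMI: {xgmi:.1f} GB/s per direction, GMI: {gmi:.1f} GB/s per direction")
```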
 