AMD RyZen CPU Architecture for 2017

Wasn't 90 Hz settled upon initially because of latency and persistence, back before there was certainty that low-latency warp support would be available on a large chunk of the graphics hardware?
I thought it was simply the maximum bandwidth of the display connector given available technology.

The nature of the L3 and the communications between the CCXs is still unclear to me.
I've been envisioning the L3 as a read/write cache for a single local (CCX) memory controller, with that controller servicing requests over the interconnect and providing a uniform interface for whatever memory is attached. The L3 is unified, but with multiple pools having different latencies corresponding to an address range. The current Ryzens consist of just two nodes and a single (oops?) 32B/clock link.

That seems rather inefficient until nodes are added, if that link size is consistent. A four-CCX (16-core) Naples would have 3x the apparent link bandwidth, assuming even distribution of data.

Two nodes would be the worst configuration for a mesh interconnect: per-node links scale as N-1, and a single CCX wouldn't have the interconnect at all.
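To put rough numbers on that scaling argument, a small sketch; the 32B/clock link width comes from the discussion above, while the fabric clock (1333 MHz, i.e. DDR4-2666 MEMCLK) and the fully connected mesh are assumptions, so treat the figures as illustrative:

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    const double bytes_per_clock = 32.0;   // assumed link width (see above)
    const double fabric_mhz      = 1333.0; // hypothetical clock (DDR4-2666 MEMCLK)
    const double per_link_gbs    = bytes_per_clock * fabric_mhz / 1000.0;

    // In a fully connected mesh each node talks to every other node directly,
    // so per-node link count (and apparent bandwidth) scales as N-1.
    for (int nodes : {2, 4, 8}) {
        int links = nodes - 1;
        std::printf("%d nodes: %d links/node, %.1f GB/s per link, %.1f GB/s per node\n",
                    nodes, links, per_link_gbs, per_link_gbs * links);
    }
    return 0;
}
```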

Is it quite likely that AMD will introduce a "workstation" socket, e.g. a quad-memory-channel, cut-down server socket supporting 16 cores in its first iteration? Would that socket be more attractive to "balanced gamer-and-productivity" type users, which appears to be a substantial portion of the users that AMD is targeting with Ryzen?
Naples has a 4-channel socket coming later this year. What seems likely is that they add HBM2 on AM4-destined chips/APUs, bypassing socket pin restrictions entirely. Two channels with an 8/16GB L4 (HBCC? 512GB/s plus two channels) should be sufficient for AM4 needs.
 
[...]The L3's ability to conserve or magnify bandwidth for commonly shared lines is limited.[...]

The M/O/E states would seem to matter the most for the inter-CCX scenario.
Exclusive is funny in that by its name it would seemingly be unlikely to be found in the L3, for most of its lifetime--being exclusive to a requesting core. The instant another core tries to read it, it goes to shared and would then seemingly drop from the L3 and from snooping for the reasons noted above (and to avoid any confusing cases of an E state in more than one cache).[...]
Ideally, contended lines would be evicted to the L3, at which point the mostly exclusive properties would be useful. Moreover, E in the L3 seems possible in the sense of uncontended clean lines in E state being evicted from the cores.

As an extension to MOESI, lines in E state might possibly respond with data and ownership transfer too (say if requests snoop other L2s inside CCX before going to the system), instead of transitioning to S state (which would require an invalidation broadcast when further transitioning to M/E) or not responding at all. But we can never know precisely until AMD updates their developer manuals, which are still not public...
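To make that bookkeeping concrete, here's a minimal sketch of a read probe under one textbook reading of MOESI next to the speculated E-forwarding variant; the exact states chosen for the variant (holder invalidates, requester takes E) are my own assumption, not anything AMD has documented:

```cpp
#include <cstdio>

enum class State { M, O, E, S, I };

struct ProbeReadResult {
    State holder;              // new state in the cache that held the line
    State requester;           // state installed in the requesting cache
    bool  cache_supplied_data; // line forwarded from a cache rather than DRAM
};

// One textbook reading of MOESI handling a remote read probe.
ProbeReadResult probe_read_plain(State held) {
    switch (held) {
        case State::M: return {State::O, State::S, true};   // dirty: forward data, keep ownership
        case State::O: return {State::O, State::S, true};   // owner keeps forwarding
        case State::E: return {State::S, State::S, false};  // clean: silently drop to S, memory supplies
        case State::S: return {State::S, State::S, false};
        case State::I: return {State::I, State::E, false};  // no other copy: requester gets E
    }
    return {State::I, State::I, false};
}

// The speculated variant: an E hit forwards data and hands the line over,
// avoiding two silent S copies and the later invalidation broadcast.
ProbeReadResult probe_read_e_forwards(State held) {
    if (held == State::E) return {State::I, State::E, true};  // assumed transfer semantics
    return probe_read_plain(held);
}

int main() {
    ProbeReadResult r = probe_read_e_forwards(State::E);
    std::printf("holder=%d requester=%d from_cache=%d\n",
                static_cast<int>(r.holder), static_cast<int>(r.requester),
                r.cache_supplied_data ? 1 : 0);
    return 0;
}
```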
 
I do trickle down upgrades to all the computers in my house, so I try and keep as much cross compatibility as possible.
Pretty much why I'm an early Ryzen adopter. Bought a cheap Dell i5 PC for my kids for xmas and threw a 1050 in it, but now I'm giving it to my in-laws and giving my current 3570k setup to my kids' gaming PC. The in-laws have been nagging me (via my wife and step-daughter) to do something about their POS really old Dell thing, so Ryzen it is, with my 970 till Vega comes out.
 
I've got an i7 4770 now, but replacing that with the latest i7 equivalent would offer little real-world extra performance, so usually by the time I want to upgrade there are so many advances in other areas that not going with a completely new system doesn't make sense to me.
I think your 4770 could last you another couple of years.
 
I've been envisioning the L3 as a read/write cache for a single local (CCX) memory controller, with that controller servicing requests over the interconnect and providing a uniform interface for whatever memory is attached. The L3 is unified, but with multiple pools having different latencies corresponding to an address range.
The descriptions elsewhere seem to make it out to be localized to the L3, not being an LLC and not part of the northbridge. I'm not sure what behaviors or tests would indicate that the L3s are that closely linked to the memory controller.

Ideally, contended lines would be evicted to the L3, at which point the mostly exclusive properties would be useful. Moreover, E in the L3 seems possible in the sense of uncontended clean lines in E state being evicted from the cores.
Contended lines would usually mean there are live copies in one L2 or the other. I noted in passing that an exclusive line upon reading would convert to shared status, making it drop from the L3 and/or from snooping in general. In the case of an E line being evicted and then read back in by another CPU, the L3 would face the choice of invalidating its copy upon servicing the request or changing it to shared--which in normal MOESI renders a line silent.

As an extension to MOESI, lines in E state might possibly respond with data and ownership transfer too (say if requests snoop other L2s inside CCX before going to the system), instead of transitioning to S state (which would require an invalidation broadcast when further transitioning to M/E) or not responding at all.
Exclusive is defined as the line being unique, which would normally preclude the L3 keeping the line in that status after responding to a read. Perhaps the math generally works out that a line in E status is either going to be written soon, or satisfies enough demand if it serves one additional read request before dropping to shared. Since the CCX probes both the L2 shadow tags and the L3, there's no good way to trust an E line in one cache if there can be an S copy in the other.

I can see a more in-line path for Modified and Owned lines evicting to the local L3. A modified line could drop to the L3 in the same status, and transition to O as needed. This sort of transition would allow for a mostly exclusive L3 without interfering as much with MOESI. This at least makes some sense for the prior CPUs that had MOESI but dropped the L3, although now that the L3 is supposedly not optional perhaps it is different.
An Owned line might be able to drop back to the L3 as well due to an L2 eviction and run as normal.

What might make this problematic is when it comes time to transition dirty lines in preparation for a request from another CCX. Depending on how the code works, it might explain why memory write bandwidth aligns with the inter-CCX bandwidth, if a transition for dirty lines throws them back to the controller/DRAM.
Plain MOESI may highlight the discrepancy in bandwidth within a CCX and the external fabric. MOESI seems to have limited amplification for sharing clean lines, and there are certain paths for sharing dirty lines that leave no amplification opportunities on the far side of an inter-CCX transfer (read places source line to O, remote cache line to S). I don't know if AMD's L3 arrangement may allow an Owned line to shift to Modified after a write probe without forcing a write to memory, but that might be problematic outside of the more tightly integrated local CCX L3 and nearby shadow tags.

Perhaps some of the problematic workloads have a lot of easily forwarded clean lines, for which MOESI doesn't utilize the CCX data paths as well.
Alternatively, they could write to producer-consumer buffers sized to some fraction of a larger LLC rather than a set of smaller L3s or L2s, leading to more writebacks+invalidates and not utilizing internal CCX bandwidth well.

I may have missed the benchmarking of cache ping-pong latency or synchronization latency, which might be another factor for the workloads that show unusually low performance.
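For reference, a minimal sketch of such a ping-pong test: two pinned threads bounce one cache line, and the average round trip approximates the coherence latency between their cores. The logical CPU numbers (0 and 8, intended to land on different CCXs of an 8C/16T part) are assumptions and would need to be checked against the reported topology:

```cpp
#include <windows.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> token{0};
constexpr int kIters = 1 << 20;

// Each thread pins itself, waits for its value, then hands the line back.
void bounce(DWORD_PTR affinity, int expect, int next) {
    SetThreadAffinityMask(GetCurrentThread(), affinity);
    for (int i = 0; i < kIters; ++i) {
        while (token.load(std::memory_order_acquire) != expect) { }
        token.store(next, std::memory_order_release);
    }
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // Assumed mapping: logical CPU 0 on CCX0, logical CPU 8 on CCX1.
    std::thread a(bounce, DWORD_PTR(1) << 0, 0, 1);
    std::thread b(bounce, DWORD_PTR(1) << 8, 1, 0);
    a.join();
    b.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("avg round trip: %.1f ns\n", double(ns) / kIters);
    return 0;
}
```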
 
The descriptions elsewhere seem to make it out to be localized to the L3, not being an LLC and not part of the northbridge. I'm not sure what behaviors or tests would indicate that the L3s are that closely linked to the memory controller.
The controller location was an assumption and they could be their own nodes. Link bandwidth does scale with the memory, after all. However, channels/bandwidth should scale with CCXs, and a minimum number of nodes would be ideal in a mesh. CCXs appear to behave like independent CPUs, so each would have its own memory controller. To test, dropping to a single channel and measuring latency from each CCX should work. The difference won't be much, but it should be measurable and provide some clarification on the NoC topology.
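Something like the following could do that measurement: a dependent pointer chase through a buffer far beyond the L3, run from a thread pinned to each CCX in turn. The affinity masks (CPUs 0-7 = CCX0, 8-15 = CCX1) are assumptions for an 8C/16T part:

```cpp
#include <windows.h>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Average load-to-use latency of a dependent pointer chase through ~64 MB.
double chase_ns(DWORD_PTR affinity) {
    SetThreadAffinityMask(GetCurrentThread(), affinity);

    const size_t n = 8 * 1024 * 1024;            // 8M entries * 8 B = 64 MB
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sattolo's algorithm builds one big cycle, so the chase can't settle
    // into a short cache-resident loop.
    std::mt19937_64 rng{42};
    for (size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    const size_t steps = 20'000'000;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s) idx = next[idx];
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();

    volatile size_t sink = idx;                  // keep the chase live
    (void)sink;
    return double(ns) / double(steps);
}

int main() {
    // Assumed masks: logical CPUs 0-7 = CCX0, 8-15 = CCX1.
    std::printf("CCX0: %.1f ns/load\n", chase_ns(0x00FF));
    std::printf("CCX1: %.1f ns/load\n", chase_ns(0xFF00));
    return 0;
}
```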

Either way could work, but mapping individual L3s to controllers would simplify the design at scale, avoiding searching an increasing number of nodes for cache lines. I have no evidence to back this up beyond it following a traditional memory-mapped cluster design. The current model they are presenting might differ from the ideal (Naples) hardware implementation. It seems likely they're partitioning the cache in different ways to improve results, the theory being that as the chip becomes larger, bandwidth would scale along with the interconnect. A single large L3 would be more energy efficient (avoiding DRAM) and performant than multiple smaller caches duplicating data for some workloads: independent L3s for independent/single threads, a shared L3 for many related threads. Compiling or encoding benchmarks might show this. In the case of the current chips, the interconnect-to-cache bandwidth ratio seems woefully inadequate, something an engineer wouldn't have overlooked on the architecture. Scaling with lots of threads seems to have been the design goal here, and what I suggested was the most optimal way of implementing that I can think of.
 
Btw, can we say CPU bottleneck if the CPU is not operating at its full potential and it's the software that can't extract more performance out of it?

Sent from my HTC One using Tapatalk
 
Btw, can we say CPU bottleneck if the CPU is not operating at its full potential and it's the software that can't extract more performance out of it?
We also say "GPU bottleneck" when the geometry unit is the bottleneck and the compute units are partly idling. Fury X for example would have much better gaming performance if the compute units weren't idling most of the time. Now it actually shows pretty decent numbers at high resolutions, but the 4 GB memory is a problem at 4K in many new games.

Some web site (can't remember which) recorded CPU core utilization while gaming. Intel quads reached very good utilization numbers. Ryzen and the 6900K had roughly 50% utilization. There's certainly room for scaling, but multi-core programming isn't easy and game engines have huge code bases. Change takes time. At least now there's an incentive, as there are cheaper 8-core CPUs available for consumers.

Ryzen also has additional difficulty compared to the 6900K regarding games, as it has split L3 caches. Games commonly use task graph schedulers with work stealing. This technique inherently shuffles jobs between cores. A shared L3 cache is perfect for this. On Ryzen, all jobs that read or modify the same data should be running on the same 4-core compute cluster. The AMD Jaguar console CPU has a similar arrangement (two clusters of 4 cores without shared LLC). Console developers are already used to this. They just need to bring similar scheduling code to the PC build. However, on PC you can't hard-code thread mappings, as there are so many CPU models (including future ones). Fortunately Windows has APIs to query this info.
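As a toy illustration of the idea, here's a sketch with one worker pool per CCX, each pinned to that cluster's affinity mask, so related jobs stay within one L3. The masks are placeholder assumptions for an 8C/16T part (logical CPUs 0-7 = CCX0, 8-15 = CCX1); as noted, a real engine would query them at runtime rather than hard-coding:

```cpp
#include <windows.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy per-CCX job pool: every worker is pinned to one cluster's affinity
// mask, so jobs queued to the same pool keep their shared data in one L3.
class CcxPool {
public:
    CcxPool(DWORD_PTR mask, int workers) {
        for (int i = 0; i < workers; ++i)
            threads_.emplace_back([this, mask] {
                SetThreadAffinityMask(GetCurrentThread(), mask);
                for (;;) {
                    std::unique_lock<std::mutex> lock(m_);
                    cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                    if (jobs_.empty()) return;        // stopping and drained
                    auto job = std::move(jobs_.front());
                    jobs_.pop();
                    lock.unlock();
                    job();
                }
            });
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
    ~CcxPool() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};

int main() {
    // Placeholder masks: logical CPUs 0-7 = CCX0, 8-15 = CCX1.
    CcxPool ccx0(0x00FF, 4), ccx1(0xFF00, 4);
    // Jobs that touch the same data go to the same pool, so the lines never
    // have to cross the inter-CCX fabric.
    ccx0.submit([] { /* task graph island A */ });
    ccx1.submit([] { /* task graph island B */ });
    return 0;
}
```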

This function can be used to query CPU info, including shared CPUs (HT/SMT) and shared caches. I would be interested to see what results you get with Ryzen:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms683194(v=vs.85).aspx
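That's GetLogicalProcessorInformation; a minimal query loop might look like the following. On an 8-core Ryzen I'd expect it to report two separate L3 entries, each shared by 8 logical processors, versus a single L3 entry covering everything on a 6900K (I haven't verified the exact output myself):

```cpp
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    // First call with a null buffer just reports the required size.
    DWORD bytes = 0;
    GetLogicalProcessorInformation(nullptr, &bytes);

    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &bytes))
        return 1;

    for (const auto& e : info) {
        if (e.Relationship == RelationCache) {
            // ProcessorMask tells us which logical CPUs share this cache.
            std::printf("L%u %s cache, %lu KB, shared by mask 0x%llx\n",
                        (unsigned)e.Cache.Level,
                        e.Cache.Type == CacheUnified ? "unified" : "instr/data",
                        (unsigned long)(e.Cache.Size / 1024),
                        (unsigned long long)e.ProcessorMask);
        }
    }
    return 0;
}
```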
 
I like the idea, but having built my own PCs since '97 I've never yet done a CPU upgrade without buying a new mobo, so frankly a new socket doesn't bother me (and, heresy as it may be, if it gave a price/performance advantage I actually wouldn't mind a soldered-in CPU).

I'm in the same position; this was mainly because each time I have needed to upgrade the motherboard due to socket incompatibility. Even with LGA2011 (which will be replaced by the 2066 in Q 2017), you have 3 versions: 2011v1, 2011v2, 2011v3. In the consumer market you get a new one every generation: 1151, 1155, 1156. So unless you stay within the same generation, you will need to change the motherboard. Of course, the DDR3 to DDR4 change on the CPU is a normal reason to need a motherboard upgrade.

Now, it's different if the motherboard brings something to the table (features such as a new USB version, new SATA, NVMe, etc.). And personally, as I was OCing my systems a lot, I was mostly changing motherboards for better OC stability anyway.

But let's be honest, I'm pretty sure motherboard makers are all happy to sell new motherboards this way too, instead of waiting for people to feel the need for new features.
 
That L3 cache behavior is reminiscent of NUMA architectures and not something I'm enthusiastic about.

The cost of a big shared LLC, in latency and coherency bandwidth, increases unacceptably with core count. Intel offers Cluster-On-Die for their large core count Xeons, precisely because of this.

You're going to have to deal with NUMA anyway in MCMs and multi socket machines. Not going NUMA for high core counts is ignoring reality/physics.

That said, AMD has aimed low with just four cores per shared L3; on top of that, Ryzen seems to have a weak inter-CCX fabric, which boggles the mind given how cheap on-die bandwidth is. This makes it much more sensitive to pathological scheduling cases.

Sort of reminds me of when I upgraded from an Athlon 64 to an Athlon X2 (both 2.2GHz) and lost 20% performance in games, because all games were single threaded and the Windows scheduler bounced the one active process from one core to the other every single time slice.

Cheers
 
That L3 cache behavior is reminiscent of NUMA architectures and not something I'm enthusiastic about.
Intel has spent lots of resources to improve their LLC + uncore since Nehalem. These 8-core Ryzen uncores seem pretty simple compared to modern Intel designs. It is pretty hard to design an LLC + uncore that supports 20+ CPU cores without choking. It can't scale indefinitely. Xeon Phi for example only has 512 KB of L2 cache per core, but no shared LLC on chip. You can configure the HBM (MCDRAM) as a cache for the slower DDR4 memory. But this is more about saving DDR4 bandwidth than minimizing latency.

This is a good presentation with lots of good info, if someone is interested in Intel Xeon uncore + LLC:
http://frankdenneman.nl/2016/07/08/numa-deep-dive-part-2-system-architecture/
 
https://www.starwindsoftware.com/blog/numa-and-cluster-on-die

Cluster-on-Die (COD) is ideal for highly NUMA optimized workloads.

Basically, with MCC and HCC CPUs Intel started to run into the same issue of “too many CPUs on the same shared bus”. So it went further to split sockets into logical domains.

Therefore, MCC and HCC have two memory controllers per CPU socket, whereas LCC CPUs have only one.

COD can be enabled on MCC and HCC due to the doubled number of memory controllers.

With COD enabled, the NUMA node is split into two NUMA domains, each owning half of the cores, memory channels and last-level cache. Each domain, which includes a memory controller and cores, is called a cluster.

With Cluster-on-Die, each memory controller now serves only half of the memory access requests, thus increasing memory bandwidth and reducing memory access latency. And obviously two memory controllers can serve twice as many memory operations.
AMD 8-core Ryzen LLC config seems highly similar to Intel's when Intel Cluster-on-Die (COD) is enabled. COD reduces latency and offers best performance for workloads that don't share data across clusters (or cores). But it hurts performance for non-NUMA optimized workloads (clusters share data frequently). 8-core Ryzen is basically running in COD mode all the time.
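As a side note on detection: with COD enabled the clusters show up as separate NUMA nodes, so the split is visible through the ordinary NUMA API, whereas a single-die Ryzen reports one node and only exposes its split via the cache topology (as in the earlier query). A minimal sketch, without error handling beyond the basics:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) return 1;
    std::printf("NUMA nodes: %lu\n", highest + 1);

    for (USHORT node = 0; node <= highest; ++node) {
        GROUP_AFFINITY ga = {};
        // Which logical processors belong to this node (COD cluster).
        if (GetNumaNodeProcessorMaskEx(node, &ga))
            std::printf("  node %u: group %u mask 0x%llx\n",
                        (unsigned)node, (unsigned)ga.Group,
                        (unsigned long long)ga.Mask);
    }
    return 0;
}
```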
 
AMD 8-core Ryzen LLC config seems highly similar to Intel's when Intel Cluster-on-Die (COD) is enabled. COD reduces latency and offers best performance for workloads that don't share data across clusters (or cores). But it hurts performance for non-NUMA optimized workloads (clusters share data frequently). 8-core Ryzen is basically running in COD mode all the time.

It should be said that it is Intel's Xeons, which have dual-ring L3 systems, that offer COD, and up to 12 cores per cluster. They also offer much higher bandwidth on the LLC ring bus.

IMO, this will hurt AMD in workloads with varying amounts of active threads. The OS scheduler will have an impossible task. If so, I'd expect Ryzen 2 to have a much stronger inter-CCX fabric.

Cheers
 
It should be said that it is Intel's Xeons, which have dual-ring L3 systems, that offer COD, and up to 12 cores per cluster. They also offer much higher bandwidth on the LLC ring bus.
Narrow bus isn't a big problem for the current 8-core (two CCX) Ryzen. Naples (32 cores, 64 threads) however is an entirely different beast. We can only hope that it's got a much more sophisticated inter-CCX fabric.
I'd expect Ryzen 2 to have a much stronger inter-CCX fabric.
This seems to be a given. Intel changed their fabric topology and properties multiple times when they started to scale beyond 4-cores. Current one is the result of many iterations. AMD clearly is behind in this area.
 
I got the 1700, it's very nice. Running it on the stock cooler atm while waiting for a mounting bracket in the mail; it's pretty snazzy looking though.

[image: IQnsXjCl.jpg]


looks like the future
 
AAA game developers will certainly ensure that their game takes advantage of Intel's latest and greatest. 4 core will not be the most important enthusiast CPU in the near future. Both Intel and AMD are pushing higher core counts.
Analysis of history doesn't agree with this conclusion. The first dual-core CPUs were released in 2005 and the first quad cores in 2008; it took nearly 13 years to get games running somewhat effectively on 4 cores, despite multiple console generations and the affordability of 3- and 4-core CPUs. Nowadays, most games run OK on a Core i3 (2 cores). They run better on a Core i3 with HT of course, but they still do OK without it.

Judging from history, running games on more than 4 cores is going to be a long, painful process; my i7 3770 (8 threads) is almost never bogged down at 100% during any game. It's a rare sight to see it reach even 70%~80%. To this day we have big game releases that run mainly on a single CPU thread, and are still limited by it. No amount of market share or cores or console state or even policy has changed this outcome.

Ryzen is currently faster in Battlefield 1 (multiplayer) and Mafia 3
BF multiplayer is tested on empty maps, so that's hardly an indication of anything. In fact, SP testing is much more credible than MP on empty maps.

From that very link, Ryzen is not equal in Mirror's Edge or For Honor; it's 8% and 2% behind, respectively. And in the case of For Honor, even a Core i3 is almost equal to the 7700K; clearly this game should have been tested at 720p to relieve the GPU of its bottleneck. Ryzen is also massively behind in UE4 titles, which is never a good indication.

AMD 8-cores are sold out already.
So what? Pre-orders are not an indication of adoption rate at all; I myself know 4 people who cancelled their pre-orders due to gaming performance. They will now wait until things are much clearer.

For VR gaming I would buy a 7700K at the moment. VR has so many question marks in the future. We don't even know what kind of hardware and software solutions the next gen VR headsets are going to have.
That sounds like the 7700K is an all around safer solution right now and in the future as well.

And Ryzen is not more expensive than the 7700K, it's a little bit cheaper; the 1700 is $330.
The 1700 is much slower than the 7700K at stock clocks, despite a difference of $30 in price.
Either way, if people don't like Ryzen, then why did the 6900K win so many awards when it came out, having the same performance for more than 3 times the price?
The 6900K didn't win any gaming awards; those were given to the 7700K.
 