AMD RyZen CPU Architecture for 2017

There are multiple reasons that might explain worse than expected performance in some games.

Generic hyperthreading / SMT related issues:
- Intel had similar issues with Nehalem/Sandy/Ivy in games and applications. Reviewers suggested disabling hyperthreading. Intel made some HW changes to improve the situation, and Windows scheduling was improved. But there are still cases where Intel chips show reduced performance when HT is active.
- Some reviews show a big performance boost from tweaking power saving options. This affects how Windows schedules threads (it could affect whether the CPU fills both SMT threads of a single core first, or fills each core with one thread first). AMD seems to take a slightly bigger hit from SMT than Intel.

AMD SMT cores are mapped differently than Intel:
- Some websites claim that Intel's logical core mapping is: CPUs 1,2,3...8 are thread 1 of each physical core, and CPUs 9,10,11...16 are thread 2 of each physical core.
- AMD Ryzen logical cores are apparently mapped sequentially (one core at a time): CPU1 = 1,2, CPU2 = 3,4... CPU8 = 15,16.
- This causes problems in game engines that core lock their 6-8 worker threads (assuming console port). A game engine like this would only use 3 or 4 cores (both SMT threads on each) on AMD 8-core Ryzen. This would explain huge gains seen by some reviewers when disabling hyperthreading.
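To make that concrete, here's a minimal sketch (not anyone's actual engine code) of the kind of affinity patch an engine could apply, assuming the sequential sibling mapping described above (logical processors 2n and 2n+1 are the two SMT threads of physical core n). The worker handle array is hypothetical, and real code should query the topology instead of hard-coding the mapping.

```cpp
#include <windows.h>
#include <vector>

// Pin each engine worker thread to its own physical core, assuming SMT
// siblings are numbered sequentially (logical CPUs 2n and 2n+1 share
// physical core n), as reported for Ryzen above. 'workers' is a
// hypothetical array of thread handles owned by the engine's job system.
void PinWorkersOnePerCore(const std::vector<HANDLE>& workers)
{
    for (size_t i = 0; i < workers.size(); ++i)
    {
        // Use logical processors 0, 2, 4, ... so no two workers share a core.
        DWORD_PTR mask = DWORD_PTR(1) << (i * 2);
        SetThreadAffinityMask(workers[i], mask);
    }
}
```

With 6-8 workers pinned this way, the engine would touch 6-8 physical cores instead of piling onto the SMT threads of only 3-4 cores.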

AMD has separate L3 caches for each 4-core cluster:
- Apparently Windows doesn't know about this and migrates threads repeatedly between clusters. This is practically equivalent to an L3 flush. Intel has a shared L3 cache between all cores and is not affected.
- CPU driver and/or Windows scheduler patch could reduce this problem.
- But many game engines are simply designed to do lots of parallel for loops, where the workload is split to all cores, and then the results are returned to a single core. There is nothing the OS can do to help this scenario. It can't analyze the memory access pattern of each core.
- AMD Jaguar has a similar LLC cache design. Both 4-core clusters have their own L2 cache. It is best to keep communication between these two clusters as limited as possible. Many console game engines have already been designed to work around this limitation. But on consoles, all threads are locked to a core. Core locking threads on PC is a double-edged sword (it has potential problems). In this case, core locking would be preferable, but you need different thread mappings for Ryzen than for Jaguar (as you have 16 logical cores vs 8). Hopefully AMD releases a best practices guide for Ryzen caches and logical core mappings. It should be relatively easy to patch a game engine thread scheduler to support Ryzen (a sketch of such a topology query follows below).
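For reference, a hedged sketch of how a thread scheduler could discover those L3 clusters at runtime via the Win32 topology API instead of hard-coding Ryzen vs Jaguar mappings. On an 8-core Ryzen it should report two groups of logical processors (one per 4-core cluster), if the cache layout is as described above.

```cpp
#include <windows.h>
#include <cstdio>
#include <vector>

// Enumerate L3 caches and report which logical processors share each one.
// A job system could use these masks to keep tightly communicating workers
// inside a single cluster.
void PrintL3Groups()
{
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationCache, nullptr, &len);
    std::vector<char> buffer(len);
    if (!GetLogicalProcessorInformationEx(
            RelationCache,
            reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data()),
            &len))
        return;

    for (char* ptr = buffer.data(); ptr < buffer.data() + len; )
    {
        auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(ptr);
        if (info->Relationship == RelationCache && info->Cache.Level == 3)
        {
            printf("L3 cache: group %u, logical processor mask 0x%llx\n",
                   (unsigned)info->Cache.GroupMask.Group,
                   (unsigned long long)info->Cache.GroupMask.Mask);
        }
        ptr += info->Size;
    }
}
```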
 
Apparently Windows doesn't know about this and migrates threads repeatedly between clusters. This is practically equivalent to an L3 flush. Intel has a shared L3 cache between all cores and is not affected.
- CPU driver and/or Windows scheduler patch could reduce this problem.
And it turns out this is the case. It's simply the worst thing AMD could have done: not working with MS before the release of Ryzen. This design decision would have been finalized year(s) back.
 
Same happened with Bulldozer. IIRC Bulldozer got 5%+ perf boost from the new scheduler/driver. This isn't something new :)
That's even worse, then :rolleyes:

All speculation at this point I should add....
 
- Some reviews show a big performance boost from tweaking power saving options. This affects how Windows schedules threads (it could affect whether the CPU fills both SMT threads of a single core first, or fills each core with one thread first). AMD seems to take a slightly bigger hit from SMT than Intel.
OS core parking might behave differently depending on the order and which cores go quiet. AMD's turbo also appears to be pretty dependent on how many cores are active. AMD's own boost slide shows a drop to 3.7 GHz from the nominal 4.0 GHz if more than two cores are active, not including the additional 100 MHz lost from XFR above that turbo range. Dynamic adjustments based on core draw for the 1800X appear to not have much authority until the 3.7-3.6 GHz range. The precision of AMD's methods, coupled with how little range they are being given, suggests this design is at the upper part of its range.

- AMD Jaguar has a similar LLC cache design. Both 4-core clusters have their own L2 cache. It is best to keep communication between these two clusters as limited as possible.
AMD's prior architectures also had a very outdated interconnect. One would hope Zen has found a way to do better than Jaguar in this regard.
 
That's even worse, then :rolleyes:

All speculation at this point I should add....

Not speculation; hardware.fr explains the situation here if you can read French (or Google Translate it):

http://www.hardware.fr/articles/956-22/retour-sous-systeme-memoire.html

An 8-core Ryzen is composed of 2 CCXs. But the communication between the CCXs is very slow (22 GB/s). If Windows moves a thread from one core to another and they're not on the same CCX, performance tanks because this slow link is used (my English is too limited to translate the full thing, sorry). Plus the L3 cache is slow & weird...
 
If the two CCXs are only tied together by a 22GB/s link, that's just appalling.
AMD's prior architectures also had a very outdated interconnect. One would hope Zen has found a way to do better than Jaguar in this regard.
I'm having a hard time rationalizing this with what we've heard of Infinity Fabric. Each link being only 22GB/s scaled by nodes on the mesh? Naples presumably 66GB/s with 4 CCXs? Vega and APUs something else entirely? Surely a design with only two nodes could widen that link a bit considering the possibilities and reports we've had. This looks like they're connecting all the nodes with x16 PCIE links on the same die.

So how is the 6-core R5 going to work? What method is used to disable 2 cores of a CCX?
3+3 most likely to balance the power/TDP as much as reasonably possible. Cache will be interesting because in theory a CCX and Vega CU cluster* are roughly equivalent nodes capable of interacting.

Hopefully AMD releases a best practices guide for Ryzen caches and logical core mappings.
http://schedule.gdconf.com/session/optimizing-for-amd-ryzen-cpu-presented-by-amd
I don't suppose you know anyone that attended this event, or whether this topic was covered? Beyond lots of threads, there are only so many ways to optimize for a CPU that come to mind; affinity is one of them. Googling doesn't return any slides or presentations.
 
And it turns out this is the case. It's simply the worst thing AMD could have done: not working with MS before the release of Ryzen. This design decision would have been finalized year(s) back.

Yes, but AMD had no previous experience; it's not like they have done anything like this before. I would assume it would have been easier for AMD if they had experience in implementing new architectures and knew beforehand that Windows doesn't know how to use them.

If only AMD had implemented a kind of CMT design in the past, so they knew what they had to do with MS before this launch. Such a tragedy :LOL::LOL:
 
Same happened with Bulldozer. IIRC Bulldozer got 5%+ perf boost from the new scheduler/driver. This isn't something new :)
One might also be reminded of AMD Phenom's core parking and thread scheduling problems. In XP you have to just disable CnQ, or performance can drop off a cliff when a thread goes to an 800 MHz core. AFAIK in Vista and later the cores don't park separately anymore. Phenom II has the same problems. What OS were they developing for, exactly?!?!

Or the Athlon 64 X2's problems with the timestamp counter (the AMD Dual Core Optimizer eventually arrived).

K7 processors don't work with most Voodoo 2 drivers. ;) I think there's only one set that works. The horrors.

Windows 95 also needs a patch to work with K6 processors above a certain speed. ;) :) ;)
 
Not speculation; hardware.fr explains the situation here if you can read French (or Google Translate it):

http://www.hardware.fr/articles/956-22/retour-sous-systeme-memoire.html

An 8-core Ryzen is composed of 2 CCXs. But the communication between the CCXs is very slow (22 GB/s). If Windows moves a thread from one core to another and they're not on the same CCX, performance tanks because this slow link is used (my English is too limited to translate the full thing, sorry). Plus the L3 cache is slow & weird...

A single CCX presents about 10MB of data that can be shared with the other CCX, worst case. That's if every single L2 and L3 line is dirty and capable of responding to snoops. Shared and invalid lines cannot traverse this link if the protocol is MOESI, which was stated elsewhere.
Let's assume that a whole CCX has all its threads context switch out to the other CCX, which is doing the same thing.
A Windows thread time slice was indicated to be about 20 milliseconds.
If every 20 milliseconds the CCXs had to switch their cache contents completely, that's about 4.4% of the capacity of that link.
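(Spelling the arithmetic out, assuming the ~22 GB/s figure covers the traffic in both directions: 2 x 10 MB every 20 ms is about 1 GB/s, and 1 GB/s out of ~22 GB/s is roughly 4.5%, in line with the figure above.)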

The link appears to be sufficient for that, and that requires everything in the cache be used in a way that requires a transfer.

Some other items:
The L3's effectiveness for saving bandwidth is worse for being split (miss rate roughly proportional to the inverse square root of capacity, I think)--unless it's something like two very independent workloads, such as separate VMs.
It's possible that sharing could involve a memory writeback, depending on how this is handled.
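(As a rough illustration of that rule of thumb, assuming the usual inverse-square-root scaling: a workload that could have shared one unified 16 MB L3 instead sees an 8 MB L3 per cluster, so its miss rate rises by roughly sqrt(2) ≈ 1.4x, and the extra misses become fabric or memory traffic.)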
 
I don't know if it's been posted, but this is the best review by far IMO:
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/

So I'm holding off a month to see how things go.
If we can get scheduler improvements + the precision timer issue fixed + continued IMC/memory optimization, and if they can add an SMT disable option into Ryzen Master, then I think gaming performance in particular will look better.

People are already claiming the newest BIOSes help a lot with performance.
https://twitter.com/BitsAndChipsEng/status/837775386377347072

I see a lot of reviews ran their memory @ 2400 CAS 15/16 and didn't list what the Intel system ran at. From my own limited DDR3 testing, it matters a lot to games.

All in all, it seems like AMD released a week or two too early.
 
Let's assume that a whole CCX has all its threads context switch out to the other CCX, which is doing the same thing.
A group of four cores and a shared L3 cache form a CPU complex. The L3 cache for a CPU complex is a mostly exclusive victim cache for the private L2 caches and is implemented as four slices. This 8MB cache is 16-way set associative, and data is spread across it using address interleaving on low-order bits rather than using a hash function. This approach should improve locality at the cost of creating address conflicts. The L3 cache can send 32 bytes per clock to each L2 cache in the CPU cluster. Larger chips will have multiple clusters, which will communicate through a coherent fabric that’s 32 bytes wide. AMD withheld additional details about the L3 cache and overall SoC architecture.
http://www.linleygroup.com/mpr/article.php?id=11666
Is the interconnect link above or below the L3? I know it's a victim cache, but what if it's closely associated with a memory controller? The L3s would never switch contents, as they'd only hold data in their own address range, feeding private L2s over the link. Link bandwidth is kind of crappy (still representative of memory bandwidth) until it starts serving multiple nodes.
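As an aside on the "address interleaving on low-order bits" part of the quoted description, here is a purely hypothetical illustration of what that could mean (the actual slice-selection bits are not documented in the article): with 64-byte lines and four slices, consecutive cache lines would simply rotate across the slices.

```cpp
#include <cstdint>

// Hypothetical low-order-bit interleaving across four L3 slices. The real
// bit selection is not public; this just illustrates the idea that
// consecutive 64-byte lines map to consecutive slices (address bits [7:6]).
constexpr uint64_t kLineBytes = 64;
constexpr uint64_t kSlices    = 4;

unsigned L3SliceForAddress(uint64_t physAddr)
{
    return static_cast<unsigned>((physAddr / kLineBytes) % kSlices);
}
```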
 
These are exactly what I've been looking for: a Ryzen 1700 overclocked to 3.9 GHz and tested at 1080p and 720p against a 7700K at 5 GHz.

1080p benchmarks

720p benchmarks

1700 performing really well IMO.
 
My personal take-home message is that Ryzen is already worth it today, and it will be even more valuable as time passes. I don't care about Windows lagging behind; let's be real, it's Windows. AMD did a really good job here.
 
Is the interconnect link above or below the L3? I know it's a victim cache, but what if it's closely associated with a memory controller? The L3s would never switch contents, as they'd only hold data in their own address range, feeding private L2s over the link. Link bandwidth is kind of crappy (still representative of memory bandwidth) until it starts serving multiple nodes.
The interconnect is outside of the CCX, and that description is limited to what is happening inside a single L3. The inter-CCX link's bandwidth is particularly poor relative to the CCX: within the CCX, the L3 services requests at a rate of 4x32 bytes per clock, while any striping across L3s would mean half of all accesses would be going through something like 1x6 bytes.
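(For a rough sense of scale, assuming the L3 runs near the 1800X's 3.6 GHz base clock: 4 x 32 B x 3.6 GHz is about 460 GB/s of aggregate L3 bandwidth inside a CCX, versus the ~22 GB/s measured across CCXs, i.e. around a 20x gap.)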
 
These are exactly what I've been looking for: a Ryzen 1700 overclocked to 3.9 GHz and tested at 1080p and 720p against a 7700K at 5 GHz.
1700 performing really well IMO.
This is about as serious in benchmarking as... I don't know how to describe it; it's useless!
Most runs in the video are not identical, the camera positions are not the same at all, the settings are unknown, I don't think he repeated the runs to take an average, I don't think he did any frame analysis, and his OC'ed GPU is bogged down at 99% (and changing clocks repeatedly on the fly). The results are really worthless.

In one other video he mentioned his GPU is at 99% because he disabled V-Sync!! He says that if Half-Life 2 ran on a Titan X, the GPU usage would still be at 99%!!! Really, there isn't much I can say about YouTubers doing hardware analysis; it could barely work in a GPU scenario, but with a CPU, HELL NO!
 
Regardless of the Ryzen benchmarks, one thing I can't understand is why people compare the 1800X with the i7 in games and conclude the latter is better and more worth the money... Sure, it pushes more FPS, but the 1800X competes with (tries to appeal to the same users as) the 6900K, which won multiple awards, and every single thing that is true for the 6900K is also true for the 1800X, yet in this case the AMD is bad.

Can someone explain this to me?
 
Regardless of the Ryzen benchmarks, one thing I can't understand is why people compare the 1800X with the i7 in games and conclude the latter is better and more worth the money... Sure, it pushes more FPS, but the 1800X competes with (tries to appeal to the same users as) the 6900K, which won multiple awards, and every single thing that is true for the 6900K is also true for the 1800X, yet in this case the AMD is bad.

Can someone explain this to me?
Because for a pure gamer, the 7700K is $150 less than the 1800X and trounces it in gaming. So in that respect, yes, the 1800X doesn't appear to be a very good investment for an enthusiast gamer looking for the highest fps in current games.
 
This is about as serious in benchmarking as... I don't know how to describe it; it's useless!
Most runs in the video are not identical, the camera positions are not the same at all, the settings are unknown, I don't think he repeated the runs to take an average, I don't think he did any frame analysis, and his OC'ed GPU is bogged down at 99% (and changing clocks repeatedly on the fly). The results are really worthless.

In one other video he mentioned his GPU is at 99% because he disabled V-Sync!! He says that if Half-Life 2 ran on a Titan X, the GPU usage would still be at 99%!!! Really, there isn't much I can say about YouTubers doing hardware analysis; it could barely work in a GPU scenario, but with a CPU, HELL NO!
Yep, he's hardly a professional reviewer, merely providing comparisons with his 2 rigs at 1080p and 720p. What's mainly interesting to me is seeing the usage of the CPUs as he's doing the runs and how close the overclocked 1700 gets at 720p (the true CPU test here) compared to the 5 GHz 7700K. I haven't seen anything for the 1700 yet, so I thought I'd post these for any interested people.
 
My point wasn't about how good or bad it is compared to the 7700K; my point is that it's exactly the same for the 6900K, which is more than 2 times the cost, and no one is concerned about that.
 