AMD Ryzen CPU Architecture for 2017

I wonder how they manage to avoid partial flushes when they have two SMT threads running on the same core. Let's say both SMT threads are doing heavy non-temporal writes (to their own locations, obviously). One of the threads stalls because of a cache miss (very common) but hasn't completed a whole line of NT writes. The other thread continues without stalls. I would assume that this soon leads to the partially written NT line being evicted. When this happens, the evicted line needs to be combined with the existing memory contents (as not all bits of the line were overwritten). When the stalled thread later recovers, it will start writing from halfway through the NT line, and again the data needs to be combined with the existing memory contents, as the whole NT line wasn't written. This would be awfully slow. This is why I would assume that each SMT thread has its own ("L0") write-combine buffers. I am just trying to find out whether there's a difference between Intel and AMD in this regard. This is AMD's first SMT design, after all. It might be that they have overlooked something.
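
To make the scenario concrete, here is a minimal sketch (hypothetical code using SSE intrinsics; the array, index and function names are made up) of a per-thread non-temporal fill where a possibly cache-missing load sits between the NT stores of a 64-byte line:

```cpp
// Hypothetical per-thread NT fill: each 64-byte line is written with four
// 16-byte streaming stores, but the gathered load between them can miss the
// cache and stall the thread with the line only partially written.
// src, dst and indices are assumed 16-byte aligned and large enough.
#include <xmmintrin.h>  // _mm_load_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>
#include <cstdint>

void nt_fill_lines(float* dst, const float* src, const uint32_t* indices, size_t lines)
{
    for (size_t line = 0; line < lines; ++line)
    {
        for (int i = 0; i < 4; ++i)  // 4 x 16 bytes = one cache line
        {
            // If this load misses, the open write-combining buffer for the
            // current line sits half-filled until the thread resumes (or the
            // buffer is evicted).
            __m128 v = _mm_load_ps(&src[indices[line * 4 + i] * 4]);
            _mm_stream_ps(&dst[line * 16 + i * 4], v);
        }
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```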

Intel's optimization document recommends paying attention to the number of independent buffers being written to and trying to segregate different types of write traffic. The software is expected to do what it can to increase the chances that a full run of non-temporal writes will occur before eviction or some other condition for writeback is met. That means trying to avoid regular loads in the middle of filling a buffer, and capping the number of independent buffers being written to in an inner loop. For prior generations, Intel only guaranteed 4 such arrays for simultaneous use. (3.6.10)
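
As a rough illustration of that guideline (not code from the manual; the array names and counts are mine), an inner loop that feeds eight destination arrays can be split into passes of four so that no more than four write-combining streams are open at a time:

```cpp
// Illustrative only: write 8 output streams in two passes of 4, so the loop
// never exceeds the ~4 simultaneously usable write-combining buffers that
// older Intel cores guarantee. src and every dst[] are assumed 16-byte
// aligned, and count a multiple of 4 floats.
#include <xmmintrin.h>
#include <cstddef>

void scatter_to_streams(float* const dst[8], const float* src, size_t count)
{
    for (int group = 0; group < 8; group += 4)      // 4 destination arrays per pass
    {
        for (size_t i = 0; i < count; i += 4)       // one 16-byte store per destination
        {
            __m128 v = _mm_load_ps(&src[i]);
            for (int s = group; s < group + 4; ++s)
                _mm_stream_ps(&dst[s][i], v);
        }
        _mm_sfence();                               // close out this group's streams
    }
}
```

Re-reading src in each pass is the cost of keeping the stream count down; whether that trade is worth it depends on the workload.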

If the code is multithreaded, section 8.5.5 includes a recommendation of using a small buffer in writeback memory to coalesce write-combine partial writes into a single final copy to WC memory rather than engaging in a stream of partial writes that might be interrupted. Section 7.7.1 advocates a similar use of small buffers and final full-line WC copies.

The details on write-combining have become sparser over time, but the overall tone is that software is responsible for avoiding problems with partial writes or thrashing.

As to how AMD is different, it's unclear. AMD's older documentation gave a more detailed list of the conditions that might flush the buffers, whereas Intel's does not. That lack of transparency is part of why some of the earlier references encourage avoiding write-combining or non-temporal instructions, since the exact behaviors are so variable and not always communicated.
I did see a rather oblique tweet to the effect that maybe some of the uncached operations under discussion for Zen's performance in certain games might not be as uncached as some are assuming, but I did not see additional detail.
 
Havok FX was originally going to be a GPU solver, but we know what happened. Nowadays Havok FX is a CPU solver, and a very fast one for visual physics; it scales well to large core counts.

I don't know what has happened since Microsoft bought Havok. The DirectPhysics trademark could mean that Microsoft is using Havok know-how to introduce a public GPU physics API. It would likely support both CPU and GPU physics simulation. This would be good news for GPU physics in the long run, but also potentially good news for multi-core CPU physics (8+ cores with AVX2+ is plenty fast).

DirectPhysics:
http://wccftech.com/microsoft-trademarks-direct-physics-dx12/
For visual physics, isn't that the point of FleX/Flow, which is also more hardware independent?
Generally, on the CPU side, PhysX 3.4 seems to be 50% to 100% better than 3.3, maybe making it more competitive?
I appreciate FleX 1.1 and PhysX 3.4 are not directly comparable, as one is better suited to gameplay-affecting physics and the other to particle physics simulation effects; I assume callbacks are a weakness of FleX, and that it has notable differences and limitations compared to the PhysX rigid body solver?
But then FleX can be integrated with PhysX (something UE4 has done).
Cheers
 
If the code is multithreaded, section 8.5.5 includes a recommendation of using a small buffer in writeback memory to coalesce write-combine partial writes into a single final copy to WC memory rather than engaging in a stream of partial writes that might be interrupted. Section 7.7.1 advocates a similar use of small buffers and final full-line WC copies.
This is pretty much what I have always preferred. Generate chunk of data in cached memory (temp buffer in stack = L1 cache) and then copy that to write-combined buffer using a special memcpy (full wide aligned vector writes). I don't personally use non-temporal writes to CPU readable buffers. Write-combining is only used when moving data from CPU to GPU. I have burned my fingers a few times with write-combining memory in the past. Last gen in-order cores required sequential filling of write-combine buffers, but it was sometimes very hard to convince the compiler to do what you wanted. Fortunately modern OoO cores are much better in this regard, but I am still a bit worried about SMT and hyperthreading, as these techniques practically double the number of unfinished ("open") write-combine lines.
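
Something along these lines (a minimal sketch with illustrative names, not actual engine code): generate the data in a small cached stack buffer, then copy it to the write-combined destination with full-width, aligned streaming stores so every WC line is filled completely and in order:

```cpp
// Sketch of the staging-buffer pattern described above. Assumes the
// destination is write-combined (e.g. a mapped GPU buffer), both pointers
// are 16-byte aligned, and size is a multiple of 64 bytes, so each
// write-combining line is filled completely and sequentially.
#include <xmmintrin.h>
#include <cstddef>

static void memcpy_to_wc(void* wc_dst, const void* cached_src, size_t size)
{
    float*       dst = static_cast<float*>(wc_dst);
    const float* src = static_cast<const float*>(cached_src);
    for (size_t i = 0; i < size / sizeof(float); i += 16)   // 64 bytes per iteration
    {
        _mm_stream_ps(dst + i +  0, _mm_load_ps(src + i +  0));
        _mm_stream_ps(dst + i +  4, _mm_load_ps(src + i +  4));
        _mm_stream_ps(dst + i +  8, _mm_load_ps(src + i +  8));
        _mm_stream_ps(dst + i + 12, _mm_load_ps(src + i + 12));
    }
    _mm_sfence();   // order the streaming stores before the buffer is handed to the GPU
}

void update_constants(void* wc_mapped_buffer /* hypothetical mapped GPU buffer */)
{
    alignas(16) float staging[256];                 // stack scratch = L1-resident
    for (int i = 0; i < 256; ++i)                   // generate data in cached memory
        staging[i] = static_cast<float>(i) * 0.5f;
    memcpy_to_wc(wc_mapped_buffer, staging, sizeof(staging));
}
```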

However Ryzen game performance problems aren't caused by SMT + write-combining. DirectX 11 games seem to suffer the most, and most DirectX 11 games have a single render thread that does all the write-combined buffer updates. So there's no SMT involved in write-combining.

I am still trying to understand whether write-combining is one of the main causes of Ryzen's relatively slow gaming performance (compared to other applications) or do most modern games simply have too little parallelism. In the case of DirectX 11, the render thread is clearly a big suspect. I have ported multiple games from console to PC, and it is common that all the other threads finish way sooner than render thread, as PC CPUs are so much faster. But the render thread doesn't get any faster as DirectX 11 + driver overhead on PC is significantly higher compared to direct console APIs. Thus the battle is all about single core CPU performance. And Kaby Lake obviously wins this battle (higher IPC & higher clock rate).
 
If it helps at all, here are the 3 game tests I've done that show CPU and thread utilization (2 are DX11 and one uses the Vulkan API).

An update on what I've been running: I've found that dropping from 3.8 to 3.7 GHz needs significantly less core voltage. It's been running stable at 3.7 GHz with just 1.225 V, compared to 1.3 V for 3.8 GHz. On the memory side, I've found I can get up to 3500 MT/s at CL14 with the SoC voltage at 1.2 V.

120 W maximum core+SoC combined


And thermals are looking very good (this is with the stock heatsink that comes with the 1700).
 
This is pretty much what I have always preferred. Generate chunk of data in cached memory (temp buffer in stack = L1 cache) and then copy that to write-combined buffer using a special memcpy (full wide aligned vector writes). I don't personally use non-temporal writes to CPU readable buffers. Write-combining is only used when moving data from CPU to GPU. I have burned my fingers a few times with write-combining memory in the past. Last gen in-order cores required sequential filling of write-combine buffers, but it was sometimes very hard to convince the compiler to do what you wanted. Fortunately modern OoO cores are much better in this regard, but I am still a bit worried about SMT and hyperthreading, as these techniques practically double the number of unfinished ("open") write-combine lines.

However Ryzen game performance problems aren't caused by SMT + write-combining. DirectX 11 games seem to suffer the most, and most DirectX 11 games have a single render thread that does all the write-combined buffer updates. So there's no SMT involved in write-combining.

I am still trying to understand whether write-combining is one of the main causes of Ryzen's relatively slow gaming performance (compared to other applications) or do most modern games simply have too little parallelism. In the case of DirectX 11, the render thread is clearly a big suspect. I have ported multiple games from console to PC, and it is common that all the other threads finish way sooner than render thread, as PC CPUs are so much faster. But the render thread doesn't get any faster as DirectX 11 + driver overhead on PC is significantly higher compared to direct console APIs. Thus the battle is all about single core CPU performance. And Kaby Lake obviously wins this battle (higher IPC & higher clock rate).
Sebbi,
when you port to PC, do all the devs there also heavily manage the affinity of threads/processes to cores?
For Intel it's not much of an issue and can be left to defaults, but with Ryzen it seems this needs to be managed carefully due to the CCX split, which also means each CCX's 8 MB L3 cache is independent (so there are both core and cache considerations for latency), and it seems not all studios are doing this (not directed at you, just at those games that have suffered much more notably since Ryzen's launch). A minimal affinity sketch follows below.
We can also see the L3 cache is not completely a victim cache, or another mechanism is impacting the available L3 capacity or behaving as another bifurcation; the latency jumps after 4 MB with the 2x8 MB L3 SKUs and also jumps after 2 MB for the 2x4 MB L3 SKU when tested by hardware.fr.
Until further tested it is difficult to know what the additional mechanism is, but we know one SKU behaves correctly with the L3, and that was the 1600X: when hardware.fr tested it, the only jump was at 8 MB (it is a 2x8 MB L3 SKU) and then to system memory. So it is something that can or does change for whatever reason, but to date only one SKU tested by hardware.fr behaves as expected in a way comparable to Intel's L3 (in performance trend behaviour rather than architecture).
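By affinity management I mean something like this minimal Windows sketch, which assumes the common enumeration where logical processors 0-7 are the first CCX on an 8-core part (real code should verify the topology, e.g. with GetLogicalProcessorInformationEx):

```cpp
// Hypothetical helper: pin the calling thread to one CCX of an 8-core Ryzen
// so its working set stays within that CCX's 8 MB L3.
#include <windows.h>

void pin_current_thread_to_ccx(int ccx_index)  // 0 or 1 on an 8-core part
{
    // 8 logical processors (4 cores x 2 SMT threads) per CCX:
    // bits 0-7 select CCX0, bits 8-15 select CCX1.
    DWORD_PTR mask = static_cast<DWORD_PTR>(0xFF) << (ccx_index * 8);
    SetThreadAffinityMask(GetCurrentThread(), mask);
}
```

That keeps a job's threads from bouncing between the two L3 slices and paying the cross-CCX latency.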

With regards to Ashes, which had issues in both DX11 and DX12 on Ryzen, I thought it was a combination of default thread/process management/affinity and also non-temporal instructions and the cache/write behaviour (I had a snippet somewhere on this that someone at Oxide touched upon, but I cannot find it now and it was not broadly covered by any news sites, typical).

Cheers
 
New CPU-Z Upgrade Lowers Ryzen performance
The developer of CPU-Z claims that up to Ryzen the benchmark methodology introduced in 2015 was valid; however, with the introduction of Ryzen a thing or two have changed.
...
After a deep investigation, we found out that the code of the benchmark fell into a special case on the Ryzen microarchitecture because of an unexpected sequence of integer instructions. These operations added a noticeable but similar delay in all existing microarchitectures at the time the previous benchmark was developed. When Ryzen was released, we found out that its ALUs executed this unexpected sequence in a much more efficient way, leading to results that mismatch the average performance of that new architecture. We reviewed many software and synthetic benchmarks without being able to find a single case where such a performance boost occurs.
http://www.guru3d.com/news-story/new-cpu-z-upgrade-lowers-ryzen-performance.html
 
In that context, it seems to be two 10-core/20-thread CPUs, more exactly (and the "v2" means Ivy Bridge).

This makes the table a bit hard to read if you want to distinguish between single and dual CPU machines. E.g. the first one with the E5-2620 v4 is a pair of low-end Broadwell Xeons, while the one with the E5-2697 v3 is a single high-end Haswell Xeon (which happened to be used on its own, either with one socket left empty or on a single-socket motherboard).
E5-2695 v3 : single CPU
E5-2660 v4 : single CPU with lots of cores and a lower base clock, but with a rather high turbo clock!
http://ark.intel.com/products/91772/Intel-Xeon-Processor-E5-2660-v4-35M-Cache-2_00-GHz
E5-2670 v2 : dual CPU

We can also simplify by saying all the computers above perform identically (well, a 6.7% difference between the first and last ones). Ivy Bridge seems to have a hard time, perhaps from lacking AVX2?
 
Fortunately modern OoO cores are much better in this regard, but I am still a bit worried about SMT and hyperthreading, as these techniques practically double the number of unfinished ("open") write-combine lines.
That seems like something the vendors are assuming the software would be adjusted for.

However Ryzen game performance problems aren't caused by SMT + write-combining. DirectX 11 games seem to suffer the most, and most DirectX 11 games have a single render thread that does all the write-combined buffer updates. So there's no SMT involved in write-combining.

I am still trying to understand whether write-combining is one of the main causes of Ryzen's relatively slow gaming performance (compared to other applications) or do most modern games simply have too little parallelism.
I can only recall references to write-combining being a problem for Ashes of the Singularity, and it's not clear what specifically hit Ryzen.
https://twitter.com/FioraAeterna/status/847472586581712897

There is a store queue stall and a potentially different issue with streaming instructions apparently leading to chip-wide flushes. Why that would be, and what specifically is problematic about the implementation of streaming intrinsics is unclear.
Possibly, Ryzen's implementation hits more buffer flush events, or its events are painful enough that it cannot compensate in ways that other implementations can.
Ryzen's not unique in having very long non-temporal instruction latencies in Agner's instruction tables. Intel cores Nehalem onward have high latencies, although Ryzen is frequently twice as bad (or just "very high" for MOVNTI).


We can also see the L3 cache is not completely a victim cache, or another mechanism is impacting the available L3 capacity or behaving as another bifurcation; the latency jumps after 4 MB with the 2x8 MB L3 SKUs and also jumps after 2 MB for the 2x4 MB L3 SKU when tested by hardware.fr.
Until further tested it is difficult to know what the additional mechanism is, but we know one SKU behaves correctly with the L3, and that was the 1600X: when hardware.fr tested it, the only jump was at 8 MB (it is a 2x8 MB L3 SKU) and then to system memory. So it is something that can or does change for whatever reason, but to date only one SKU tested by hardware.fr behaves as expected in a way comparable to Intel's L3 (in performance trend behaviour rather than architecture).
Hardware.fr also tested the 1800X's cache latencies with various combinations of cores being masked off. The 2-core+2-CCX setup in their testing also had an anomalous drop in latencies, similar to what the 1600X experienced.
It makes me curious if there's a relationship between what core is running the latency benchmark and what cores are active.
Perhaps the L3's capacity to index cache lines is somehow shared with the L2 shadow tags in some non-intuitive way, and it is possible that the benchmarking core and its location relative to inactive or fused-off cores can change what lines it can check.


I think it would be interesting to see what the code was.
It's vaguely reminiscent of a now-ancient transition for the x86 LOOP instruction (circa 486), which used to be a fast instruction that became microcoded on more advanced processors. I think AMD's implementations were among the last to degrade the instruction, and there were situations where old software assumed that the "wait" states built around the slower implementations would be consistent.
 
Jumped on the Ryzen bandwagon as well.
I got a Ryzen 1500X paired with 3200 MHz RAM (the max of my mobo) and an RX 570. There is a review of the 1500X here, and along with the RX 570 it makes for a fine bang-for-the-buck PC.

http://www.pcworld.com/article/3189...compelling-8-thread-gaming-to-the-masses.html

http://www.pcworld.com/article/3190...-can-buy-under-200-barely-changed.html?page=9

Yup, this should be a nice PC build; both the 570 and the 1500X are the best bang for the buck at their respective price points :smile:

What CPU cooler are you planning on using? Ryzen is very efficient even when overclocked and doesn't get too hot. I'm able to hold a perfectly stable 3.750 GHz at just 1.25 V, and the CPU sits in the high-40s to low-50s °C range with a maximum of around 60 °C, and that is with the stock heatsink that comes with the damn thing. I'm impressed!
 
Ah?
I haven't done the maths; I assumed the Ryzen 1700 & RX 580 were the most interesting perf/price-wise.
 
Havok and Havok FX both run on the CPU. Havok has a very good multithreading implementation and distributes work very nicely across multiple cores. PhysX multithreading isn't as good as Havok's.

PhysX has both CPU and GPU solvers. So far the GPU solver support in games has been limited, since the GPU solver is CUDA only and runs only on Nvidia GPUs (no support for AMD and Intel GPUs). Nvidia has lately been porting some of their physics tech to DirectCompute: https://developer.nvidia.com/gameworks-dx12-released-gdc. Flex now supports DirectCompute (it was previously CUDA-only). PhysX is still CUDA only.

Now that Microsoft bought Havok from Intel, we might see a cross-platform GPU-physics engine. This might also force Nvidia to extend PhysX GPU support to competing GPU hardware. I am confident that physics will eventually move to GPU, but we need to get good cross platform solutions first. Right now GPU physics is only used for additional visual candy in select hardware. But the emergence of cost efficient 8-core (16 thread) CPUs would also make properly multi-threaded CPU-based physics more interesting. Ryzen can simulate physics pretty well.

I had somehow missed that Microsoft had bought Havok, but that sounds like very good news to me. A DirectCompute implementation would seem to be a given now, and Scorpio being heavily biased toward GPU power only makes it likelier. It might even become part of DirectX somehow, which would be really great for developers.
 
That seems like something the vendors are assuming the software would be adjusted for.

However Ryzen game performance problems aren't caused by SMT + write-combining. DirectX 11 games seem to suffer the most, and most DirectX 11 games have a single render thread that does all the write-combined buffer updates. So there's no SMT involved in write-combining.


I can only recall references to write-combining being a problem for Ashes of the Singularity, and it's not clear what specifically hit Ryzen.
https://twitter.com/FioraAeterna/status/847472586581712897

There is a store queue stall and a potentially different issue with streaming instructions apparently leading to chip-wide flushes. Why that would be, and what specifically is problematic about the implementation of streaming intrinsics is unclear.
Possibly, Ryzen's implementation hits more buffer flush events, or its events are painful enough that it cannot compensate in ways that other implementations can.
Ryzen's not unique in having very long non-temporal instruction latencies in Agner's instruction tables. Intel cores Nehalem onward have high latencies, although Ryzen is frequently twice as bad (or just "very high" for MOVNTI).



Hardware.fr also tested the 1800X's cache latencies with various combinations of cores being masked off. The 2-core+2-CCX setup in their testing also had an anomalous drop in latencies, similar to what the 1600X experienced.
It makes me curious if there's a relationship between what core is running the latency benchmark and what cores are active.
Perhaps the L3's capacity to index cache lines is somehow shared with the L2 shadow tags in some non-intuitive way, and it is possible that the benchmarking core and its location relative to inactive or fused-off cores can change what lines it can check.
I must admit I am a bit leery of any of the artificial 2+2/3+3/etc configurations done with the 1800X by the various review sites, but yeah, that is a very good point: the artificial 2+2 behaves as one would expect, the same as the 1600X with regards to the L3 cache.
And yeah, there is definitely something unusual in the background regarding the L3 cache that suggests it can be resolved, but how easily? I feel one cannot ignore that line in the presentation where AMD said "mostly victim".
The problem is that the hardware.fr test is the closest one gets to real-world behaviour and to what possibly impacts some applications/games. What I understand hardware.fr did was use affinity to control and test each core, which I would say overcomes the issue of the benchmark and fusing.
Also remember that artificially changing cores to 2+2/3+3/etc does not stop the full L3 cache of the SKU from being used and active, which is why I prefer the actual real-world test and the follow-up review I referenced with the other SKUs over the artificial configs, as those may introduce unexpected behaviour (that said, by doing so I missed the correct behaviour of the artificial 2+2 in the 1800X review :) ).
Bear in mind they also state that even with the artificial 2+2/3+3/etc, the L3 cache was only accessed on the same CCX before going to system memory; it does not query the other CCX.
So in essence you are looking at a 2/3/4-core setup on a specific CCX and whatever Ryzen does with the L2/L3 on that CCX as tested with the hardware.fr tool, with the L3 being "mostly victim" as AMD state in one of their presentations.

Oxide also mention they worked on thread affinity management with regards to AOTS, and that it was not just the non-temporal instructions (this latter point was only picked up by a few places).
Cheers
 
In that context, it seems to be two 10-core/20-thread CPUs, more exactly (and the "v2" means Ivy Bridge).

You're right. I didn't even remember that, and the 2670 v2 is actually the CPU I own, so I should know, lol.
 