AMD Ryzen CPU Architecture for 2017

Btw I always thought physics were serial because of how objects interact with each other.
Actually no. Think of yourself moving through a large crowd: everyone reacts to everyone else, but everyone is totally independent. Nobody stops and waits for someone to bump into them so they can react to it.
 
Btw I always thought physics were serial because of how objects interact with each other.
Physical bodies (or particles forming a body) affect each other. Obviously you can't be writing and reading the same object simultaneously from multiple threads; that would cause problems. But you can split the problem in a way that each object (or particle) is processed by a single thread at each time step.

Graph coloring is a commonly used algorithm to split a dependent network into multiple independent parts ("colors"):
https://en.wikipedia.org/wiki/Graph_coloring

This is a highly important algorithm to understand when you are parallelizing workloads with dependencies.

For example, my original GPU solver used graph coloring to split particle joints into multiple passes, where in each pass a single particle was accessed by one joint only. This results in up to N/2 joints solved per pass (as each joint refers to two particles), where N is the particle count. The same is true for object-based solvers. Then you apply an iterative solver (Gauss-Seidel, Jacobi, etc.) that brings the joints closer and closer to equilibrium. Collisions are handled as one-directional joints. This solver was fully running on a GPU. It scales to N/2 threads, where N is the particle count (or object count, for an object-based solver). My current solver is based on shape matching. It scales a bit worse, because there's a parallel reduction step, and you need to solve an SVD at object level (https://en.wikipedia.org/wiki/Singular_value_decomposition). But as long as you have a complex enough scene, it fills any modern GPU nicely. And compute overlap helps with the SVD step; the rest of the GPU can be running something else.
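
To make the coloring step concrete, here is a minimal greedy joint-coloring sketch in C++ (illustrative data structures and names only, not taken from any actual solver): joints sharing a particle get different colors, and all joints of one color can then be solved in parallel.

// Greedy graph coloring of joints: no two joints in the same color touch the
// same particle, so each color forms one parallel solver pass.
// Illustrative sketch only; limited to 64 colors by the bitmask.
#include <cstdint>
#include <vector>

struct Joint { uint32_t a, b; };   // indices of the two particles it connects

std::vector<uint32_t> colorJoints(const std::vector<Joint>& joints,
                                  uint32_t particleCount)
{
    std::vector<uint32_t> color(joints.size());
    std::vector<uint64_t> used(particleCount, 0);   // per-particle color mask

    for (size_t j = 0; j < joints.size(); ++j)
    {
        uint64_t taken = used[joints[j].a] | used[joints[j].b];
        uint32_t c = 0;
        while (taken & (1ull << c)) ++c;            // first free color
        color[j] = c;
        used[joints[j].a] |= 1ull << c;
        used[joints[j].b] |= 1ull << c;
    }
    return color;
}

// Usage: group joints by color, then run one parallel pass (CPU threads or a
// GPU dispatch) per color; within a pass no particle is written by two joints.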
 
I'm really amazed by those optimizations, just because I (badly) use only brute force for my simulations and renders... (Blender mostly)
 
Information concerning Ryzen has been added to Agner's optimization guide.
http://www.agner.org/optimize/

My skim of the optimization document had a few tidbits about Ryzen:
The perceptron-based predictor handles certain patterns well that the prior generation's perceptron did not handle efficiently, such as nested loops. There are some corner-case quirks concerning whether Ryzen will mispredict once after a loop that repeats more than 12 times.
The misprediction penalty was measured at roughly 18 cycles.
The fetch bandwidth for the front end is stated as 32 bytes by AMD, but was not measured to reach much more than 16.
FMA does partially occupy the issue capability of FADD, with a mix of FMA and FADD getting 2 cycles' worth of issue out of 3 cycles.
Store forwarding is significantly more robust than in previous generations.

A brief check of the instruction tables does show Ryzen's gather instruction support is rather perfunctory, being poorer than Haswell's first attempt.
 
Physical bodies (or particles forming a body) affect each other. Obviously you can't be writing and reading the same object simultaneously from multiple threads; that would cause problems. But you can split the problem in a way that each object (or particle) is processed by a single thread at each time step.

Graph coloring is a commonly used algorithm to split a dependent network into multiple independent parts ("colors"):
https://en.wikipedia.org/wiki/Graph_coloring

This is a highly important algorithm to understand when you are parallelizing workloads with dependencies.

For example, my original GPU solver used graph coloring to split particle joints into multiple passes, where in each pass a single particle was accessed by one joint only. This results in up to N/2 joints solved per pass (as each joint refers to two particles), where N is the particle count. The same is true for object-based solvers. Then you apply an iterative solver (Gauss-Seidel, Jacobi, etc.) that brings the joints closer and closer to equilibrium. Collisions are handled as one-directional joints. This solver was fully running on a GPU. It scales to N/2 threads, where N is the particle count (or object count, for an object-based solver). My current solver is based on shape matching. It scales a bit worse, because there's a parallel reduction step, and you need to solve an SVD at object level (https://en.wikipedia.org/wiki/Singular_value_decomposition). But as long as you have a complex enough scene, it fills any modern GPU nicely. And compute overlap helps with the SVD step; the rest of the GPU can be running something else.

Just to add a different approach in implementation.
It is known that, while PhysX SDK 2.8 has rather limited multi-threading capabilities (mostly working on a per-scene or per-compartment basis), PhysX SDK 3.x can distribute various tasks across worker threads much more effectively, and thus offers better support for multi-core CPUs.

But how well does multi-threading actually work in PhysX 3 (we'll take the latest 3.3 version)? Using the same PEEL (Physics Engine Evaluation Lab) tool to record the performance metrics, we will try to shed light on this question.

A tool was made to test physics multithreading for Nvidia's solution, and it does benefit from more than one thread, but the benefits diminish pretty rapidly.
Albeit this is not in-game physics but a more traditional simulation-solver demo that is benchmarked.
It seems Nvidia PhysX really benefits from 2 threads and a bit more with 3 threads (diminishing returns kicking in) with their latest SDK.

Multithreaded performance scaling in PhysX SDK: http://physxinfo.com/news/11327/multithreaded-performance-scaling-in-physx-sdk/

[Scaling charts from the article: multi1.png, multi3.png, multi5-1.png]
Tool (PEEL) used back then is on GitHub.
Cheers

Edit:
But this has probably changed again with FleX/Flow and how GameWorks with these libraries has been updated to DX11/DX12.
 
One additional tidbit from the optimization guide is that Ryzen apparently supports FMA4, despite not reporting it. Perhaps this is a legacy of what was borrowed from the prior cores, or something to do with the overlapping development of K12.

It's not clear how exactly the FMA pipes occupy the FADD pipes.
If it were something like the FMA stealing an operand from the other pipe, then perhaps the scheduler could get more throughput if the mix included instructions that had the same register twice, possibly saving a register rename or register port read.

If the adder was actually being used by the FMA pipe, then saving on operands wouldn't change the throughput. The fact that the adder can still be used at all rather than all additions being shut out by the FMAs may point to the adder being made available somehow for portions of the same cycle despite the FMA being in-flight at the same time.

Not sure about the testing mix for this, or if the CPU could take advantage of that scenario if the testing mix did check.
 
A tool was made to test physics multithreading for Nvidia's solution, and it does benefit from more than one thread, but the benefits diminish pretty rapidly.
Albeit this is not in-game physics but a more traditional simulation-solver demo that is benchmarked.
It seems Nvidia PhysX really benefits from 2 threads and a bit more with 3 threads (diminishing returns kicking in) with their latest SDK.
Havok (newest versions) scale much better than that. Havok has very good multithreaded CPU scaling. Havok was owned by Intel, so excellent MT CPU scaling has obviously been a high priority for them. PhysX on the other hand has both CPU and GPU solvers. GPU solver scales to very high parallelism.

Nvidia's PhysX GPU solver example (high parallelism):

Nvidia also has FleX, a fully GPU-based physics engine. It uses techniques similar to those I mentioned above.
 
Information concerning Ryzen has been added to Agner's optimization guide.
http://www.agner.org/optimize/

My skim of the optimization document had a few tidbits about Ryzen:
The perceptron-based predictor handles certain patterns well that the prior generation's perceptron did not handle efficiently, such as nested loops. There are some corner-case quirks concerning whether Ryzen will mispredict once after a loop that repeats more than 12 times.
The misprediction penalty was measured at roughly 18 cycles.
The fetch bandwidth for the front end is stated as 32 bytes by AMD, but was not measured to reach much more than 16.
FMA does partially occupy the issue capability of FADD, with a mix of FMA and FADD getting 2 cycles' worth of issue out of 3 cycles.
Store forwarding is significantly more robust than in previous generations.

A brief check of the instruction tables does show Ryzen's gather instruction support is rather perfunctory, being poorer than Haswell's first attempt.
Yes, it can apparently do 4xFMA + 4xFADD in 3 cycles = 4 flop/cycle. Still, this results in identical flop throughput to doing 2xFMA per cycle = 4 flop/cycle. But more flexibility is of course better (FADD can be fully independent and allows a 2:1 ratio of adds to muls).
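
For illustration, a loop body with that kind of mix might look like this (a hedged C++/AVX sketch with made-up array names, not a benchmark from the guide): one FMA plus one independent FADD per iteration, i.e. a 1:1 FMA:FADD instruction mix, or 2:1 adds:muls in flop terms.

// Sketch of a 1:1 FMA:FADD instruction mix (2:1 adds:muls in flop terms).
// The FMA and the plain add are independent, so the add can go to the FADD
// pipes while the FMA pipes stay busy. Compile with AVX2/FMA enabled;
// assumes n is a multiple of 8.
#include <immintrin.h>

void mixed_fma_fadd(float* out1, float* out2, const float* a, const float* b,
                    const float* c, const float* d, const float* e, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        __m256 vd = _mm256_loadu_ps(d + i);
        __m256 ve = _mm256_loadu_ps(e + i);

        __m256 f = _mm256_fmadd_ps(va, vb, vc);   // mul + add (FMA pipes)
        __m256 s = _mm256_add_ps(vd, ve);         // independent add (FADD pipes)

        _mm256_storeu_ps(out1 + i, f);
        _mm256_storeu_ps(out2 + i, s);
    }
}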

Fastpath double instructions now decode in a single cycle. This is a big improvement in some cases. This also means that 256-bit AVX/AVX2 instructions decode in a single cycle (it was 2 cycles in previous AMD designs). Now 256-bit AVX/AVX2 is a benefit more often.

Still no insight about the bottlenecks seen in games. I would love to see some write combining / streaming write benchmarks. Does anybody know whether Intel has separate write combine buffers for both hyperthreads, and what about Ryzen? I tried to search for this info in the Ryzen technical docs, but failed. Store forwarding seems nice and robust, but what about write combining... Bad game performance seems to point in this direction.
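
For reference, a minimal streaming-write test of the kind meant here could look roughly like this (a sketch only; buffer size, alignment and timing method are my own assumptions, not an established benchmark). It compares regular cached stores against non-temporal stores, which go through the write-combining buffers.

// Compares regular stores vs. non-temporal (streaming) stores on one buffer.
// Compile with AVX enabled. Rough sketch, not a rigorous methodology.
#include <chrono>
#include <cstdio>
#include <immintrin.h>

static double fill(float* dst, size_t count, bool nonTemporal)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    __m256 v = _mm256_set1_ps(1.0f);
    for (size_t i = 0; i < count; i += 8)
    {
        if (nonTemporal) _mm256_stream_ps(dst + i, v); // NT store, write-combined
        else             _mm256_storeu_ps(dst + i, v); // regular cached store
    }
    _mm_sfence(); // drain WC buffers before stopping the clock
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const size_t count = 64 * 1024 * 1024;  // 256 MiB of floats
    float* buf = static_cast<float*>(_mm_malloc(count * sizeof(float), 32));

    double gb = count * sizeof(float) / 1e9;
    std::printf("regular stores:   %.2f GB/s\n", gb / fill(buf, count, false));
    std::printf("streaming stores: %.2f GB/s\n", gb / fill(buf, count, true));

    _mm_free(buf);
    return 0;
}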
 
Havok (newest versions) scale much better than that. Havok has very good multithreaded CPU scaling. Havok was owned by Intel, so excellent MT CPU scaling has obviously been a high priority for them. PhysX on the other hand has both CPU and GPU solvers. GPU solver scales to very high parallelism.
Are you saying that Havok doesn't also run on the GPU, or just isn't as optimized as PhysX?
There's this PS4 demo, but it doesn't really tell us much. Is Havok that far behind in optimizing for the GPU? That can't be good moving forward.
 
Havok (newest versions) scale much better than that. Havok has very good multithreaded CPU scaling. Havok was owned by Intel, so excellent MT CPU scaling has obviously been a high priority for them. PhysX on the other hand has both CPU and GPU solvers. GPU solver scales to very high parallelism.

Nvidia's PhysX GPU solver example (high parallelism):

Nvidia also has FleX, a fully GPU-based physics engine. It uses techniques similar to those I mentioned above.
Well, one would need to use the same test tool with both PhysX and Havok to have a clear conclusion on which was more optimal on the CPU.

Yeah, that's why I mentioned FleX/Flow in the edit, and also the changes most recently introduced with the DX11/DX12 update and the generally improved performance of the new suite.
Not sure why that did not show up for you, as the edit was yesterday.
The point was just to provide an alternative perspective to the GPU implementation and the question raised asking if more cores will help; they may with a CPU implementation, but maybe I misread someone's post. I guess we will have to wait and see how good the latest Havok is under ownership/development of Microsoft, and whether it is adopted by devs on the PC platform with its high diversity.
Although TBH I would like to see Havok tested with the same benchmark tool, or one where multiple solvers can be compared.
The last test-demo with comparative results I saw was related to the Unity Engine, and the engineer compared it to Nvidia's implementation, but they used a previous version of FleX as the event was late 2016.
Cheers
 
Are you saying that Havok doesn't also run on the GPU, or just isn't as optimized as PhysX?
There's this PS4 demo, but it doesn't really tell us much. Is Havok that far behind in optimizing for the GPU? That can't be good moving forward.

Well, the only GPU version of Havok, as far as I know, dates from 2009 (or even earlier), developed to run on ATI (PhysX was still owned by Ageia at that time). Intel bought Havok, and you can imagine that any GPU development was halted.

Now Havok is owned by Microsoft. I have no idea if there's a project to make a GPGPU version of it.
 
Are you saying that Havok doesn't also run on the GPU, or just isn't as optimized as PhysX?
There's this PS4 demo, but it doesn't really tell us much. Is Havok that far behind in optimizing for the GPU? That can't be good moving forward.
Havok and Havok FX both run on the CPU. Havok has a very good multithreading implementation and distributes work very nicely to multiple cores. PhysX multithreading isn't as good as Havok's.

PhysX has both CPU and GPU solvers. So far the GPU solver support in games has been limited since the GPU solver is CUDA only. Runs only on Nvidia GPUs (no support for AMD and Intel GPUs). Nvidia has been lately porting some of their physics tech to DirectCompute: https://developer.nvidia.com/gameworks-dx12-released-gdc. Flex now supports DirectCompute (was previously CUDA-only). PhysX is still CUDA only.

Now that Microsoft bought Havok from Intel, we might see a cross-platform GPU-physics engine. This might also force Nvidia to extend PhysX GPU support to competing GPU hardware. I am confident that physics will eventually move to GPU, but we need to get good cross platform solutions first. Right now GPU physics is only used for additional visual candy in select hardware. But the emergence of cost efficient 8-core (16 thread) CPUs would also make properly multi-threaded CPU-based physics more interesting. Ryzen can simulate physics pretty well.
 
PhysX has both CPU and GPU solvers. So far the GPU solver support in games has been limited since the GPU solver is CUDA only. Runs only on Nvidia GPUs (no support for AMD and Intel GPUs). Nvidia has been lately porting some of their physics tech to DirectCompute: https://developer.nvidia.com/gameworks-dx12-released-gdc. Flex now supports DirectCompute (was previously CUDA-only). PhysX is still CUDA only.
Isn't the (PhysX) CUDA compiler open-source ... and does it provide support for AMD & Intel GPUs?

Edit: This is based on what someone said in the comments section of the link below.
http://physxinfo.com/news/12755/pre-release-version-of-physx-sdk-3-4-is-now-available-on-github/
 
Havok and Havok FX both run on the CPU. Havok has a very good multithreading implementation and distributes work very nicely to multiple cores. PhysX multithreading isn't as good as Havok's.

PhysX has both CPU and GPU solvers. So far the GPU solver support in games has been limited since the GPU solver is CUDA only. Runs only on Nvidia GPUs (no support for AMD and Intel GPUs). Nvidia has been lately porting some of their physics tech to DirectCompute: https://developer.nvidia.com/gameworks-dx12-released-gdc. Flex now supports DirectCompute (was previously CUDA-only). PhysX is still CUDA only.

Now that Microsoft bought Havok from Intel, we might see a cross-platform GPU-physics engine. This might also force Nvidia to extend PhysX GPU support to competing GPU hardware. I am confident that physics will eventually move to GPU, but we need to get good cross platform solutions first. Right now GPU physics is only used for additional visual candy in select hardware. But the emergence of cost efficient 8-core (16 thread) CPUs would also make properly multi-threaded CPU-based physics more interesting. Ryzen can simulate physics pretty well.
That's interesting, so that PS4 GPU-enabled Havok demo I linked was never brought into the mainline branch of the code.

If the talk of Havok being rebranded to DirectX Physics is true (based on the MS copyright name turning up, I believe), then it may limit platform support moving forward, or there may be a light version for multi-platform development or something?
It would still be many more platforms currently than PhysX though. Edit: Guess not true, as PhysX can run on older versions of Windows.

Does running on an APU have any tangible benefits?
Ryzen may be good for physics, but if talking beyond eye candy, then wouldn't it be more limited to the lowest common denominator?
 
Havok and Havok FX both run on the CPU. Havok has a very good multithreading implementation and distributes work very nicely to multiple cores. PhysX multithreading isn't as good as Havok's.

PhysX has both CPU and GPU solvers. So far the GPU solver support in games has been limited since the GPU solver is CUDA only. Runs only on Nvidia GPUs (no support for AMD and Intel GPUs). Nvidia has been lately porting some of their physics tech to DirectCompute: https://developer.nvidia.com/gameworks-dx12-released-gdc. Flex now supports DirectCompute (was previously CUDA-only). PhysX is still CUDA only.

Now that Microsoft bought Havok from Intel, we might see a cross-platform GPU-physics engine. This might also force Nvidia to extend PhysX GPU support to competing GPU hardware. I am confident that physics will eventually move to GPU, but we need to get good cross platform solutions first. Right now GPU physics is only used for additional visual candy in select hardware. But the emergence of cost efficient 8-core (16 thread) CPUs would also make properly multi-threaded CPU-based physics more interesting. Ryzen can simulate physics pretty well.

Would it be more likely that Nvidia would push FleX if looking for wider adoption as they continue to evolve it (more cost-effective with the latest update for particles), or are they going to end up needing both due to FleX limitations?
Or are we more likely to see a hybrid structure, merging PhysX (3.4, or more likely later) on the CPU with FleX (1.1 and higher) on the GPU?
That would mean Nvidia could still extend support that way, and there is already an interaction between PhysX and FleX (it can be coupled to PhysX).
Cheers
 
Still no insight about the bottlenecks seen in games. I would love to see some write combining / streaming write benchmarks. Does anybody know whether Intel has separate write combine buffers for both hyperthreads, and what about Ryzen? I tried to search for this info in the Ryzen technical docs, but failed. Store forwarding seems nice and robust, but what about write combining... Bad game performance seems to point in this direction.

Details are more sparse from my searching, but current Intel cores have 10 line fill buffers which service misses and non-temporal writes between the L1 and L2 for Sandy Bridge. (2.4)
Prior to that, it was mentioned that there were 6-8 write combining buffers. (3.79 appears to apply to older gens)
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

There are sparse references to the line fill arrangement being maintained for the later architectures.
Intel's method seems to provide for more combining beyond just WC, and tries to delay transactions at various times for ordering purposes.
I found no indication that anything that goes that far out into the hierarchy is SMT-aware.

https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf (p 235)
AMD's 15h family at least separates out streaming writes from the miss handling, giving 4 write-combining buffers.
The write-combining for regular stores would be generally covered by the WCC.

AMD's write-combining more clearly lists out a significant range of flush-worthy events and conflicts, coupled with documented severe regressions with Bulldozer's line versus prior cores--with some improvements in later revisions. (p107-109)
For example, wouldn't games like Ashes of the Singularity that removed non-temporal instructions for Zen have initially been using assumptions for AMD's prior cores with their write-combining issues--if anything?

It's curious that Zen may be having trouble here, given what it is replacing and how AMD typically sets GPU-shared locations to be WC.
Perhaps there's a bug like with BD, some kind of firmware setting that is pessimistically flushing the buffers (if they exist?), or AMD has other plans for how it intends to handle communication with other clients.
 
Well, the only GPU version of Havok, as far as I know, dates from 2009 (or even earlier), developed to run on ATI (PhysX was still owned by Ageia at that time). Intel bought Havok, and you can imagine that any GPU development was halted.

Now Havok is owned by Microsoft. I have no idea if there's a project to make a GPGPU version of it.
Havok FX in 2005, then another demo set in 2009.
Havok FX was apparently custom-tailored for both NVIDIA & ATI; the 2009 stuff was OpenCL.
 
Isn't the (PhysX) CUDA compiler open-source ... and does it provide support for AMD & Intel GPUs?

Edit: This is based on what someone said in the comments section of the link below.
http://physxinfo.com/news/12755/pre-release-version-of-physx-sdk-3-4-is-now-available-on-github/
The CUDA compiler being open source doesn't help much. Someone still needs to do backends for other GPUs. So far there's no CUDA backend for AMD or Intel GPUs. AMD has made a CUDA-to-HIP translator, but translating languages to each other isn't trivial, and often results in pretty bad code quality. Also, do you really think Nvidia would port PhysX code to AMD's proprietary platform? PhysX code itself isn't open source. Even if PhysX were public domain (no license), and even if AMD had access to the PhysX source code, and even if AMD's CUDA translator worked 100% fine, they would still need to convince people to use their version of PhysX. The original obviously links to Nvidia's libraries. Maybe they could replace DLLs or such, but I don't see an easy solution regarding maintenance. A new version would break things unless Nvidia, Intel and AMD cooperated. Basically, the PhysX GPU backend would have to be an open standard.
Havok FX in 2005, then another demo set in 2009.
Havok FX was apparently custom-tailored for both NVIDIA & ATI; the 2009 stuff was OpenCL.
Havok FX was originally going to be a GPU solver, but we know what happened. Nowadays Havok FX is a CPU solver, albeit a very fast one for visual physics that scales well to large core counts.
That's interesting, so that PS4 GPU-enabled Havok demo I linked was never brought into the mainline branch of the code.

If the talk of Havok being rebranded to DirectX Physics is true (based on the MS copyright name turning up, I believe), then it may limit platform support moving forward, or there may be a light version for multi-platform development or something?
It would still be many more platforms currently than PhysX though. Edit: Guess not true, as PhysX can run on older versions of Windows.

Does running on an APU have any tangible benefits?
Ryzen may be good for physics, but if talking beyond eye candy, then wouldn't it be more limited to the lowest common denominator?
I don't know what has happened since Microsoft bought Havok. The DirectPhysics trademark could mean that Microsoft is using Havok know-how to introduce a public physics API. It will likely support both CPU and GPU physics simulation. This would be good news for GPU physics in the long run, but also potentially good news for multi-core CPU physics (8+ cores with AVX2+ is plenty fast).

DirectPhysics:
http://wccftech.com/microsoft-trademarks-direct-physics-dx12/
 
I found no indication that anything that goes that far out into the hierarchy is SMT-aware.
I wonder how they manage to avoid partial flushes when they have two SMT threads running on the same core. Let's say both SMT threads are doing heavy non-temporal writes (to their own locations, obviously). One of the threads stalls because of a cache miss (very common), but hasn't completed a whole line of NT writes. The other thread continues without stalls. I would assume that this soon leads to the partially written NT line being evicted. When this happens, the evicted line needs to be combined with the existing contents (as not all bits of the line were overwritten). When the thread later recovers, it will start writing from halfway through an NT line, and again the data needs to be combined with the existing memory contents, as the whole NT line wasn't written. This would be awfully slow. This is why I would assume that each SMT thread has its own ("L0") write combine buffers. I am just trying to find out whether there's a difference between Intel and AMD in this regard. This is AMD's first SMT design after all. It might be that they have overlooked something.
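
Just to make the scenario concrete, the access pattern in question would look something like this (a rough C++ sketch with made-up sizes; the mid-line pointer chase stands in for the cache-miss stall, and pinning both threads to the same physical core plus the timing is omitted as OS-specific):

// Two SMT siblings each stream NT stores to their own buffer; one of them
// takes a dependent, cache-missing load partway through a 64-byte line,
// mimicking the stall described above. Compile with SSE enabled.
#include <cstdint>
#include <thread>
#include <vector>
#include <immintrin.h>

static void streamWriter(float* dst, size_t count,
                         const uint32_t* chase, size_t chaseLen)
{
    __m128 v = _mm_set1_ps(1.0f);
    uint32_t idx = 0;
    for (size_t i = 0; i < count; i += 4)        // 4 floats = 16 bytes per NT store
    {
        _mm_stream_ps(dst + i, v);
        if (chase && (i & 15) == 4)              // partway through a 64-byte line
            idx = chase[idx % chaseLen];         // dependent miss -> likely stall
    }
    _mm_sfence();
}

int main()
{
    const size_t count = 16 * 1024 * 1024;
    float* a = static_cast<float*>(_mm_malloc(count * sizeof(float), 64));
    float* b = static_cast<float*>(_mm_malloc(count * sizeof(float), 64));
    std::vector<uint32_t> chase(1 << 24);
    for (size_t i = 0; i < chase.size(); ++i)    // crude scatter pattern
        chase[i] = static_cast<uint32_t>((i * 2654435761u) % chase.size());

    // Pin both threads to the same physical core and time them (both omitted
    // here), then compare bandwidth with and without the mid-line stalls.
    std::thread t1(streamWriter, a, count, chase.data(), chase.size());
    std::thread t2(streamWriter, b, count, nullptr, size_t(0));
    t1.join(); t2.join();

    _mm_free(a); _mm_free(b);
    return 0;
}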
 