AMD RyZen CPU Architecture for 2017

Luxmark C++ 1 core + 1 SMT
luxmarktkznz.png


Luxmark C++ 8 cores + 8 SMT
luxmarkfullmdlhw.png
 
The SMT impact was discussed here before. I remember having questions about its (unexpected) efficiency myself and receiving quite adequate answers here.

I believe the point of these 1-core, 2-logical-thread comparisons is to show that Ryzen gets a bigger increase when SMT is enabled, relative to its single-threaded performance (due to being wider and thus easier to saturate with 2 threads). Two logical Ryzen threads shouldn't generally be faster than two logical threads on Intel's latest, should they?
 
Looks like AMD made Zen with such a wide pipeline mostly for SMT gains, not so much for single-thread performance.

With Haswell, Intel re-organized some ALUs along the new issue ports, to target more optimized ST performance and to streamline AVX code execution.
 
Looks like AMD made Zen with such a wide pipeline mostly for SMT gains, not so much for single-thread performance.
Wider pipes can provide better single thread performance with ILP. I'm personally still waiting to see how Zen works with APUs that may take on part of the work. The Zen design choices might still lack context.
 
Of course they can. It just seems that they will do so only to a lesser extent relative to Intel's architecture.

But sure, APUs will provide more information. I'm not quite sure how the context is that different, though. Anything you expect we could be looking for beyond changes in caches & Infinity fabric?
 
Haswell vs Zen clock for clock in Cinebench:

123kqkas.png

cb4sa5r.png

Using 2 logical threads in the same core.

+27% performance for Haswell, +44% for Zen with SMT/HT.
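A quick sketch of what those two uplift figures imply, since the raw scores are only in the screenshots. It uses just the +27% and +44% numbers from this post; the function names and the idea of a "break-even" single-thread ratio are mine, added for illustration.

```python
def two_thread_ratio(st_ratio, uplift_a, uplift_b):
    """Score of core A over core B when both SMT threads are busy.

    st_ratio: A's single-thread score divided by B's (hypothetical input).
    uplift_a, uplift_b: fractional SMT gains, e.g. 0.44 for +44%.
    """
    return st_ratio * (1 + uplift_a) / (1 + uplift_b)


def break_even_st_ratio(uplift_a, uplift_b):
    """Single-thread ratio at which A and B tie with two threads per core."""
    return (1 + uplift_b) / (1 + uplift_a)


ZEN_UPLIFT = 0.44      # from the post above
HASWELL_UPLIFT = 0.27  # from the post above

# How far behind Zen could be single-threaded and still tie at 1 core + SMT.
print(f"break-even ST ratio: {break_even_st_ratio(ZEN_UPLIFT, HASWELL_UPLIFT):.3f}")
# Two-thread ratio if single-thread performance were equal clock for clock.
print(f"2T ratio at ST parity: {two_thread_ratio(1.0, ZEN_UPLIFT, HASWELL_UPLIFT):.3f}")
```

In other words, with these uplifts Zen could trail by roughly 12% single-threaded and still match Haswell once both logical threads are loaded, at least in Cinebench.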
 
But sure, APUs will provide more information. I'm not quite sure how the context is that different, though. Anything you expect we could be looking for beyond changes in caches & Infinity fabric?
Software changes for HSA and ROCm. Any indication of tightly integrated CPU and GPU. A connection via IF with higher bandwidth and less latency/distance would be a good start. Adding a stack of HBM2 would be nice, but not absolutely necessary. Any indication AMD is attempting to accelerate AVX512 and similar instructions with a GPU.
 
Single core + HT: 693, on my Broadwell.

At this point, I blame the CCX domain cross penalty, since LuxRender, as any PT renderer, is rather sensitive to thread sync latencies.

At 1 core + 1 thread utilization it should not pass the other CCX, though. Your result is absurdly high: 693 x 8 would mean around 5500k score for 8 cores at 4.0 GHz. AVX2 instructions could be the "issue" here, since Broadwell should generally perform better when 256-bit AVX is involved. Can you check the all-core result?
 
4091 for all the 12 threads. Slightly lower than perfect scaling, the L3 is only at 3400MHz, probably the main reason.
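For reference, the "slightly lower than perfect scaling" remark can be put into a number. This is a minimal sketch that assumes the 693 score (one core + HT) and the 4091 score (six cores, 12 threads) were taken at the same core clock, which the thread does not confirm.

```python
def scaling_efficiency(one_core_score, cores, all_core_score):
    """Fraction of ideal linear scaling achieved by the all-core run."""
    ideal = one_core_score * cores
    return all_core_score / ideal


# Figures quoted in the thread: 693 for 1 core + HT, 4091 for 6 cores + HT.
efficiency = scaling_efficiency(693, 6, 4091)
print(f"ideal: {693 * 6}, measured: 4091, efficiency: {efficiency:.1%}")
```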

Dunno about AVX2, not sure if the LuxRender core supports it, but Ryzen can do two FADD ops for all SIMD sets, while Broadwell can do only one. That's a potential advantage for all existing code.

it should not pass the other CCX though,
The data set could still spill over the local CCX.
 
The data set could still spill over the local CCX.

That's a really good score for a 6 core :)

But how can the data spill over to the other CCX if I specifically pin it to logical CPUs 0/1, for example? It doesn't seem to utilize anything but the first core + SMT that I'm pinning it to through Process Lasso.
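To make clear what pinning actually controls: an affinity setting is just a bitmask of logical CPUs the scheduler may use, and it says nothing about where cache lines live. A minimal sketch follows; the helper names are hypothetical, it assumes logical CPUs 0 and 1 are the two SMT threads of the first core (OS numbering can differ), and it is not a description of how Process Lasso works internally. `os.sched_setaffinity` only exists on Linux, so the pinning helper is guarded.

```python
import os


def affinity_mask(logical_cpus):
    """Build the bitmask form of an affinity set (bit n = logical CPU n)."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("logical CPU index must be non-negative")
        mask |= 1 << cpu
    return mask


def pin_current_process(logical_cpus):
    """Restrict this process to the given logical CPUs where the OS allows it.

    Returns True if the affinity call was available and made, False otherwise.
    """
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(0, set(logical_cpus))
        return True
    return False


# Assumed layout: logical CPUs 0 and 1 are core 0 and its SMT sibling.
print(hex(affinity_mask([0, 1])))
```

The mask restricts where threads run; it does not restrict which L3 slices the data they touch ends up in.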
 
The two L3 clusters are virtually addressed as a single uniform space, regardless of thread allocation.

Also, the non-inclusive relation to the L2 cache makes the L3 a bit slower, due to the extra write cycle when a cache line is evicted.
 
Looking at @Arnold Beckenvauer's post:

Ryzen 5 2500U:

- 4 Zen cores, 8 threads
- 2GHz
- probably 15W, if it's going after Intel's "U" chips
- Using Carrizo's Southbridge


This is how it compares to the latest 4-core Kaby Lake Core i5 8250U:
https://browser.geekbench.com/v4/cpu/3977268

And this is a Kaby Lake Core i5 7200U (still present in most 15W notebooks AFAIK):
https://browser.geekbench.com/v4/cpu/3880603

Ryzen 5 2500U
Single Score: 3561
Multi Score: 9421
OpenCL: 27092

Core i5 7200U
Single Score: 3765
Multi Score: 7512
OpenCL: 19254

Core i5 8250U
Single Score: 4164
Multi Score: 13679
OpenCL: 20498
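To put the three result sets side by side, here is a small sketch that divides the Ryzen 5 2500U scores by each Intel chip's scores. The numbers are the ones listed above; the dictionary layout and function name are just for illustration.

```python
# Geekbench 4 scores as listed in this post.
SCORES = {
    "Ryzen 5 2500U": {"single": 3561, "multi": 9421, "opencl": 27092},
    "Core i5 7200U": {"single": 3765, "multi": 7512, "opencl": 19254},
    "Core i5 8250U": {"single": 4164, "multi": 13679, "opencl": 20498},
}


def ratio_table(subject, scores=SCORES):
    """Return subject's score divided by every other chip's score, per test."""
    subj = scores[subject]
    return {
        other: {test: subj[test] / vals[test] for test in subj}
        for other, vals in scores.items()
        if other != subject
    }


for other, ratios in ratio_table("Ryzen 5 2500U").items():
    line = ", ".join(f"{test} {r:.2f}x" for test, r in ratios.items())
    print(f"2500U vs {other}: {line}")
```

A ratio above 1.00x means the 2500U result is higher in that test.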

Memory bandwidth in this Ryzen 2500U test is below 15GB/s, whereas those Intel results are getting 20-25GB/s and a Ryzen 3 with DDR4 2400MHz gets close to 30GB/s in multi-core.
This test might have been done using a single SO-DIMM.
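A rough check on the single SO-DIMM guess: theoretical peak bandwidth is transfer rate times bus width times channel count. The sketch below assumes DDR4-2400 with a standard 64-bit channel; the memory speed in the 2500U laptop is not known from the Geekbench entry, so treat it as an assumption.

```python
def peak_bandwidth_gbs(transfers_mt_s, channels, bus_bits=64):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


# Assumed memory speed: DDR4-2400 (2400 MT/s).
for channels in (1, 2):
    print(f"{channels} channel(s): {peak_bandwidth_gbs(2400, channels):.1f} GB/s")
```

A measured figure below 15GB/s fits under the single-channel ceiling, while the Ryzen 3 result of close to 30GB/s is only possible with two channels.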
 
The two L3 clusters are virtually addressed as a single uniform space, regardless of thread allocation.

Also, the non-inclusive relation to the L2 cache makes the L3 a bit slower, due to the extra write cycle when a cache line is evicted.

Then locking the core to a 4+0 setup should fix any CCX jumping latency, correct? I might have to try that tomorrow and see if there's any difference.

This test might have been done using a single SO-DIMM.

Geekbench is ridiculously memory reliant so if they are running a single channel config then it's definitely affecting the overall score.
 
Dell Inspiron 7570 https://browser.geekbench.com/v4/cpu/3999148

The differences between single channel and dual channel aren't very large. But on the other hand, there are no results yet for RR APUs with dual channel.
Single channel is good enough for a fast dual core or a low clocked quad core CPU. But it certainly isn't good enough for a fast iGPU.

The 11 CU Raven Ridge iGPU is roughly equivalent to Xbox One in GPU performance. Xbox One has a quad channel memory controller with 68 GB/s main memory bandwidth and 32 MB ESRAM to reduce the main memory bandwidth bottleneck. It is still often memory bandwidth bound. I would guess that an 11 CU Raven Ridge with single channel memory is crippled by lack of memory bandwidth in games. Even dual channel should still be memory bandwidth bound. You would need at least quad channel memory (like Xbox One) to avoid the memory bandwidth bottleneck. This is of course assuming that the 11 CU Raven Ridge has high enough TDP to run the GPU at ~900 MHz in games.
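The comparison above can be sketched with peak numbers. This assumes GCN-style CUs with 64 lanes each and 2 FLOPs per lane per clock (FMA), the Xbox One GPU at 12 CUs and 853 MHz (figures not from this thread), and DDR4-2400 for the Raven Ridge system (also an assumption). ESRAM is ignored, so the Xbox One bandwidth side is understated.

```python
def peak_gflops(compute_units, clock_ghz, lanes_per_cu=64, flops_per_lane=2):
    """Peak FP32 throughput in GFLOPS for a GCN-style GPU."""
    return compute_units * lanes_per_cu * flops_per_lane * clock_ghz


def bytes_per_flop(bandwidth_gbs, gflops):
    """Main memory bandwidth available per peak FLOP."""
    return bandwidth_gbs / gflops


raven_ridge = peak_gflops(11, 0.900)  # 11 CUs at ~900 MHz, as in the post
xbox_one = peak_gflops(12, 0.853)     # assumed: 12 CUs at 853 MHz

print(f"Raven Ridge: {raven_ridge:.0f} GFLOPS, Xbox One: {xbox_one:.0f} GFLOPS")
# 68 GB/s is from the post; 19.2 and 38.4 GB/s assume DDR4-2400.
print(f"Xbox One DDR3:      {bytes_per_flop(68.0, xbox_one):.3f} B/FLOP")
print(f"RR single channel:  {bytes_per_flop(19.2, raven_ridge):.3f} B/FLOP")
print(f"RR dual channel:    {bytes_per_flop(38.4, raven_ridge):.3f} B/FLOP")
```

Under these assumptions the two GPUs land within a few percent of each other in peak compute, while Raven Ridge has noticeably less main memory bandwidth per FLOP even with two channels.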
 
Single channel is good enough for a fast dual core or a low clocked quad core CPU. But it certainly isn't good enough for a fast iGPU.

It was Geekbench related.
The 11 CU Raven Ridge iGPU is roughly equivalent to Xbox One in GPU performance. Xbox One has a quad channel memory controller with 68 GB/s main memory bandwidth and 32 MB ESRAM to reduce the main memory bandwidth bottleneck. It is still often memory bandwidth bound. I would guess that an 11 CU Raven Ridge with single channel memory is crippled by lack of memory bandwidth in games. Even dual channel should still be memory bandwidth bound. You would need at least quad channel memory (like Xbox One) to avoid the memory bandwidth bottleneck. This is of course assuming that the 11 CU Raven Ridge has high enough TDP to run the GPU at ~900 MHz in games.

Some people are still thinking there will be RR APUs with HBM or quad channel memory controller. Yes: RR iGPUs - even with 8 CUs - will be bandwidth limited. The reality is:
I tried to play Crysis 3 on my HP 15 with an A10-9600P (15W TDP and 6 CUs). The CPU clocks were ~1.1 GHz and GPU clocks ~500 MHz. Unplayable. But what to expect from the RR APUs with 14nm, a much better CPU architecture and up to 11 CUs? Much better performance and playable games.

But if you want better gaming performance, then buy a laptop with a dedicated GPU.

I'm afraid that a lot of manufacturers like HP or Lenovo will offer RR based laptops and 2-in-1s with single channel memory: one DRAM slot, or 2 DRAM slots with only one SO-DIMM inside.
 