Been slowly digesting the AnandTech article; there's an awful lot of doubling of resources there that should at least help out with a bunch of corner cases.
Bunch of other improvements also.
There are a few places where it's not clear whether they simplified the arrows or whether something should be read into the diagram for the integer execution engine. The Load/Store block in particular has arrows going to the retire queue, the forwarding mux, and the register file.
Comparing the Zen and Zen 2 diagrams shows a change from paired arrows going into the register file and forwarding mux to a single arrow. The arrows from the integer units and load/store blocks to the retire queue now show one arrow from the integer block sharing an entry point with the originally independent load/store path.
It could just be to reduce visual clutter, or possibly a streamlining choice: with a broader set of hardware paths, the routing congestion of keeping every path independent may not be worth it if they're unlikely to all be used at once.
The number of uops that can be dispatched to the renamer hasn't changed, so the wider later stages may make that a clearer bottleneck than before.
The TAGE branch predictor could be a big win. From what I've been reading it's pretty bleeding-edge tech, much better than the perceptron predictor they've been using previously (the vid above says it was intended for Zen 3 but they brought it forward), though I did find a suggestion that Intel has already been using this.
The TAGE predictor is a level-two predictor, meaning it is accessed after the initial prediction by the perceptron. Perhaps Zen 3 has a similar arrangement, or its later addition in Zen 2 meant it was easier to fit the larger TAGE predictor one level further out from the inner prediction loop, since power was the supposed reason for keeping the perceptron as the initial predictor.
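For anyone unfamiliar with TAGE, here's a toy sketch of the idea (not AMD's or any real implementation, and all sizes/history lengths are made up): several tagged tables are indexed with geometrically increasing branch-history lengths, the longest-history table with a tag match provides the prediction, and an untagged base table acts as the fallback.

```python
# Toy TAGE-style predictor: tagged tables with geometric history lengths.
# Illustrative only; real designs add useful bits, alternate predictions,
# and smarter allocation policies.

HIST_LENGTHS = [0, 4, 8, 16, 32]   # geometric-ish series (made up)
TABLE_BITS = 10                    # 1024 entries per table

def fold(history, length, bits):
    """Fold the most recent `length` outcome bits into a `bits`-bit index."""
    h = 0
    recent = history[-length:] if length else []
    for i, b in enumerate(recent):
        h ^= b << (i % bits)
    return h

class TagePredictor:
    def __init__(self):
        self.tables = [dict() for _ in HIST_LENGTHS]  # idx -> [tag, 2-bit ctr]
        self.history = []                             # global outcome history

    def _index(self, pc, table):
        return (pc ^ fold(self.history, HIST_LENGTHS[table], TABLE_BITS)) % (1 << TABLE_BITS)

    def predict(self, pc):
        tag = (pc >> TABLE_BITS) & 0xFF
        # search from longest history down; table 0 always "hits" (no tag check)
        for t in range(len(HIST_LENGTHS) - 1, -1, -1):
            entry = self.tables[t].get(self._index(pc, t))
            if t == 0 or (entry is not None and entry[0] == tag):
                ctr = entry[1] if entry is not None else 1
                return ctr >= 2, t          # (taken?, providing table)

    def update(self, pc, taken):
        pred, t = self.predict(pc)
        tag = (pc >> TABLE_BITS) & 0xFF
        entry = self.tables[t].setdefault(self._index(pc, t), [tag, 1])
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)
        # on a misprediction, allocate in the next-longer-history table
        if pred != taken and t < len(HIST_LENGTHS) - 1:
            self.tables[t + 1][self._index(pc, t + 1)] = [tag, 2 if taken else 1]
        self.history.append(1 if taken else 0)
```

The longer history lengths are what let TAGE catch correlated branches that a fixed-history perceptron struggles with, which is presumably why it's worth paying the extra latency of a second-level lookup.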
Regarding the 256-bit AVX: will they do double-rate 128-bit?
The number of ports and dispatch width hasn't changed with the FPU, so I don't think it does.
It occurs to me that this chiplet architecture is arguably a return to the separate CPU/northbridge/southbridge arrangement.
I think this is the case, or at least I've not seen a strong enough distinction in terms of features or design behavior to make this appear any different from other cycles of integration and separation that happen over time.
According to the AnandTech article, the scheduler apparently tries to split them up to manage thermals -> no dedicated clock reduction like Intel has, but that doesn't rule out thermal throttling via the normal mechanisms.
I think the cited mechanism is that the DVFS system uses activity monitors and built-in estimates for the power cost of instructions to determine what voltage and clock steps should be used, rather than a coarse change in clocking regime based on what category of instruction the decoder encounters.
This may help in certain cases where instructions that might be considered wide by the front end have internally lower costs for whatever reason. One possible area is using very wide AVX instructions to boost the performance of memory copies and clears, where a naive throttling of the core that makes sense for heavy ALU work hurts the memory optimization. However, I think more recent Intel cores have gotten better at subdividing AVX categories so that fewer optimizations are treated like very wide ALU ops.
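A crude sketch of what such an estimate-driven DVFS loop might look like (every number, instruction class, and P-state here is made up for illustration; this is not AMD's actual mechanism): the controller weights monitored per-class activity by assumed energy costs and picks the fastest clock/voltage step that fits a power budget, rather than switching regimes on instruction category alone.

```python
# Hypothetical estimate-driven DVFS. Dynamic power scales roughly as
# activity * f * V^2 (classic CMOS scaling). All constants are invented.

ENERGY_PJ = {"alu": 5, "load": 12, "avx128": 20, "avx256": 38}  # pJ/op, assumed
P_STATES = [(4.4, 1.30), (4.0, 1.20), (3.6, 1.10), (3.0, 1.00)]  # (GHz, V)
POWER_BUDGET_W = 0.4  # execution-unit dynamic power budget, made up

def estimated_power_w(op_mix_per_cycle, freq_ghz, volt):
    """Estimate dynamic power from per-cycle op rates and energy weights."""
    energy_per_cycle_pj = sum(ENERGY_PJ[op] * rate
                              for op, rate in op_mix_per_cycle.items())
    # pJ/cycle * Gcycles/s = mW; * 1e-3 -> W (V^2 relative to a 1.0 V base)
    return energy_per_cycle_pj * freq_ghz * volt ** 2 * 1e-3

def pick_p_state(op_mix_per_cycle):
    """Choose the fastest P-state whose estimated power fits the budget."""
    for freq, volt in P_STATES:
        if estimated_power_w(op_mix_per_cycle, freq, volt) <= POWER_BUDGET_W:
            return freq, volt
    return P_STATES[-1]  # clamp to the lowest state
```

Under this kind of scheme a stream of wide AVX ops that is actually cheap (say, a memory clear) would report low monitored activity and keep its clocks, which is exactly the case where a coarse category-based offset hurts.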