AMD: Zen 2 (Ryzen/Threadripper 3000?, Epyc 8000?) Speculation, Rumours and Discussion

Been slowly digesting the Anandtech article, there's an awful lot of doubling of stuff there that should at least help out with a bunch of corner cases.
Bunch of other improvements also.
There's a few places where it's not clear if they simplified the arrows, or there's something to be read into the diagram for the integer execution engine. The Load/Store block in particular has arrows that go to the retire queue, the forwarding mux, and register file.
Comparing the Zen and Zen 2 diagrams shows a change from paired arrows going into the register file and forwarding mux to a single arrow. The arrows from the integer units and load store blocks to the retire queue now show that one arrow from the integer block is sharing an entry point from the originally independent load/store path.
It could be to reduce visual clutter, or possibly a streamlining choice: with a broader set of hardware paths, the wiring congestion of keeping every path fully independent may not be justified by how often all of them would actually be used at once.
The number of uops that can be dispatched to the renamer hasn't changed, so the wider later stages may make it a clearer bottleneck than before.

TAGE branch predictor could be a big win; from what I've been reading it's apparently pretty bleeding-edge tech, much better than the perceptrons they've been using previously (the vid above says it was intended for Zen 3 but they brought it forward :oops:), though I did find a suggestion that Intel has already been using this.
The TAGE predictor is a level-two predictor, meaning it is accessed after the initial prediction by the perceptron. Perhaps Zen3 has a similar arrangement, or the later addition with Zen2 meant it was easier to fit the larger TAGE one level further out from the inner prediction loop, since power was the supposed reason for keeping the perceptron as the initial predictor.
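To make the "tagged tables with geometric history lengths" idea concrete, here's a minimal toy sketch of a TAGE-style predictor. Everything here (table sizes, hash functions, allocation policy) is a simplified illustration of the general technique, not Zen 2's actual implementation:

```python
# Toy TAGE-style predictor: a base bimodal table plus tagged tables
# indexed by hashes of the PC and geometrically longer history slices.
# All parameters are illustrative assumptions, not Zen 2's real design.

class TageSketch:
    def __init__(self, n_tables=4, table_bits=10, min_hist=4):
        self.table_bits = table_bits
        self.size = 1 << table_bits
        # Geometrically increasing history lengths: 4, 8, 16, 32 bits.
        self.hist_lens = [min_hist * (2 ** i) for i in range(n_tables)]
        self.base = [1] * self.size          # 2-bit counters, weakly not-taken
        self.tables = [dict() for _ in range(n_tables)]  # idx -> (tag, ctr)
        self.history = 0                      # global history as a bit-vector

    def _index(self, pc, hist_len):
        folded = self.history & ((1 << hist_len) - 1)
        return (pc ^ folded ^ (folded >> self.table_bits)) % self.size

    def _tag(self, pc, hist_len):
        folded = self.history & ((1 << hist_len) - 1)
        return (pc ^ (folded * 2654435761)) & 0xFF

    def _provider(self, pc):
        # Longest-history table with a matching tag provides the prediction.
        for t in reversed(range(len(self.tables))):
            idx = self._index(pc, self.hist_lens[t])
            entry = self.tables[t].get(idx)
            if entry is not None and entry[0] == self._tag(pc, self.hist_lens[t]):
                return t, idx, entry
        return None

    def predict(self, pc):
        hit = self._provider(pc)
        if hit is not None:
            return hit[2][1] >= 2
        return self.base[pc % self.size] >= 2  # fall back to the base table

    def update(self, pc, taken):
        predicted = self.predict(pc)
        hit = self._provider(pc)
        if hit is not None:
            t, idx, (tag, ctr) = hit
            ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
            self.tables[t][idx] = (tag, ctr)
        else:
            i = pc % self.size
            self.base[i] = min(3, self.base[i] + 1) if taken else max(0, self.base[i] - 1)
        if predicted != taken:
            # On a mispredict, allocate an entry in the next longer table.
            nxt = hit[0] + 1 if hit is not None else 0
            if nxt < len(self.tables):
                idx = self._index(pc, self.hist_lens[nxt])
                self.tables[nxt][idx] = (self._tag(pc, self.hist_lens[nxt]),
                                         2 if taken else 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << 64) - 1)
```

The key TAGE property is visible in `_provider`: longer-history tables override shorter ones only when their tag matches, so the predictor spends its larger tables only on branches that actually need long history.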

Regarding AVX-256: will they do it as double-rate 128-bit?
The number of ports and dispatch width hasn't changed with the FPU, so I don't think it does.

Occurs to me this chiplet architecture is arguably a return to separate CPU-Northbridge-Southbridge
I think this is the case, or at least I've not seen a strong enough distinction in terms of features or design behavior to make this appear any different from other cycles of integration and separation that happen over time.

According to the Anandtech article, the scheduler apparently tries to split them up to manage thermals -> no dedicated clock reduction like Intel has, but not ruling out thermal throttling via the normal mechanisms.
I think the cited mechanism is that the DVFS system uses activity monitors and built-in estimates for the power cost of instructions to determine what voltage and clock steps should be used, rather than a coarse change in clocking regime based on what category of instruction the decoder encounters.
This may help in certain cases where instructions that might be considered wide by the front end have internally lower costs for whatever reason. One possible area is using very wide AVX instructions to boost the performance of memory copies and clears, where a naive throttling of the core that makes sense for heavy ALU work hurts the memory optimization. However, I think more recent Intel cores have gotten better at subdividing AVX categories so that fewer optimizations are treated like very wide ALU ops.
 
There's a few places where it's not clear if they simplified the arrows, or there's something to be read into the diagram for the integer execution engine...
So it could be they simplified/downgraded some bits that weren't bottlenecks, to help make space for the extra bits?

The number of ports and dispatch width hasn't changed with the FPU, so I don't think it does.
Yeah, that and it's probably the sort of thing they'd mention explicitly if it was double-rate.
 
This may help in certain cases where instructions that might be considered wide by the front end have internally lower costs for whatever reason. One possible area is using very wide AVX instructions to boost the performance of memory copies and clears, where a naive throttling of the core that makes sense for heavy ALU work hurts the memory optimization. However, I think more recent Intel cores have gotten better at subdividing AVX categories so that fewer optimizations are treated like very wide ALU ops.

Intel has two clocking penalties, one for AVX-256 ALU instructions and one for AVX-512 ALU instructions. AVX-256 memory moves run at full speed (no transition to a slower clock), but AVX-512 memory moves incur the AVX-256 frequency penalty.

If you try to optimize memory functions (memcpy/strcpy) with AVX-512 moves, you might very well end up with lower performance overall: your memory moves will be faster, but everything else runs 10-15% slower.
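A quick back-of-the-envelope shows why this can backfire. The memcpy time share, speedup factor, and clock penalty below are made-up numbers for illustration, not measurements:

```python
# Back-of-the-envelope for the AVX-512 memcpy trade-off described above.
# All inputs are illustrative assumptions, not measured values.

def net_runtime(memcpy_share, memcpy_speedup, freq_penalty):
    """Relative runtime after switching memcpy to AVX-512.

    memcpy_share:   fraction of baseline runtime spent in memory moves
    memcpy_speedup: factor by which AVX-512 accelerates those moves
    freq_penalty:   fractional clock drop applied to everything else
    """
    moves = memcpy_share / memcpy_speedup
    rest = (1 - memcpy_share) / (1 - freq_penalty)
    return moves + rest

# 5% of time in memcpy, moves 2x faster, everything else 12% slower:
after = net_runtime(0.05, 2.0, 0.12)
# after is ~1.10, i.e. the program got ~10% slower despite faster copies.
```

Only a workload that is genuinely dominated by memory moves comes out ahead; with, say, half the time in memcpy the same numbers give a net win.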

Intel has hysteresis built into the frequency transitions. Before powering up the full width of the AVX-256/512 ALUs, the instructions are run through the narrower execution units (microcode!) until several thousand of them have been executed within a set interval; only then do the full-width ALUs power up and the frequency drop. There is also a delay after use before the frequency returns to normal.

AMD's mode of operation seems less heavy handed.

Cheers
 
ASRock X570 Taichi detailed overview and BIOS screens

 
Was it always possible to select the IF clock speed?
I don't believe so. It was never an option on my X370, but I don't have a high-end OC board. I don't remember it ever being mentioned in OC threads.
 
IF in Zen1 was fixed at 1/2 DRAM transfer rate.
So, looks like only 500-series mobos will be able to set an arbitrary IF divider. Legacy boards probably lack the dedicated clock generator for that. Dunno.
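For reference, the Zen 1 coupling is a simple halving of the DRAM transfer rate, i.e. the fabric runs at the DRAM's actual clock (DDR transfers twice per cycle). The memory speeds below are just common DDR4 examples:

```python
# Zen 1 ties the Infinity Fabric clock (FCLK) to half the DRAM transfer
# rate -- equivalently, to the DRAM's real clock, since DDR transfers
# twice per cycle.

def zen1_fclk_mhz(dram_mt_s):
    """FCLK in MHz for a given DDR4 transfer rate in MT/s."""
    return dram_mt_s / 2

assert zen1_fclk_mhz(3200) == 1600.0  # DDR4-3200 -> 1600 MHz fabric
assert zen1_fclk_mhz(2400) == 1200.0  # DDR4-2400 -> 1200 MHz fabric
```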
 
IF in Zen1 was fixed at 1/2 DRAM transfer rate.
So, looks like only 500-series mobos will be able to set an arbitrary IF divider. Legacy boards probably lack the dedicated clock generator for that. Dunno.

That opens up some really, really interesting benchmarking options. Inter-CCX latency was said to be responsible for some performance pain points, so I'm eager to see how performance scales with IF clocks on those applications.
 
Async clock domains always incur latency penalty during transition. Overclockers will probably try to keep synced IF and DRAM clocks as far as possible, for latency sensitive benchmarks.
Kind of reminds me of the good old i875P chipset for P4, that had special "short path" mode when FSB and DRAM were operating at the same clocks.
 
Async clock domains always incur latency penalty during transition. Overclockers will probably try to keep synced IF and DRAM clocks as far as possible, for latency sensitive benchmarks.
Kind of reminds me of the good old i875P chipset for P4, that had special "short path" mode when FSB and DRAM were operating at the same clocks.
Pentium 4! :LOL::LOL::LOL:

Ah thanks, I needed that. Hadn't thought of that CPU in a while, it was the reason I switched to AMD. :)
 
Memory latency and write speed are quite terrible. Hopefully it will perform much better on X570 with final BIOS.
Well we don't know the latency cost of having the separate I/O die.
 
If this is legit (like, not due to a bug, early BIOS, or something like that), it seems the IPC gains are partly "wasted" on the memory performance? I'll wait for more reviews, of course. But it's a bad first impression...
 
Yes, the memory performance was surprising, but it may be a BIOS bug; they say they cannot OC it, which suggests the BIOS is not 100% working yet. Although the gaming tests (capped, it seems) are not bad for that price. It's a preview, and we'll need to see more.
 
Is anyone really surprised that moving the memory controllers off-die makes for a big bump in memory latency? :neutral:

What I'm seeing is a 6-core chip, 100 MHz slower (both base & turbo), at the bottom of the new line-up, hanging with, and in a bunch of tests healthily beating, the previous 8-core top model in both single-thread and multi-thread.

I think if this is legit then Intel is in a lot of trouble from the higher end :cool:

It'd be interesting to see what this core could do with an onboard memory controller though.
 
It'd be interesting to see what this core could do with an onboard memory controller though.

I suspect we'll see that once the APUs arrive next year. If they are monolithic, I think they might beat the high-end ryzens in most gaming loads.
 