Been slowly digesting the AnandTech article; there's an awful lot of doubling of resources there that should at least help out with a bunch of corner cases.
Bunch of other improvements also.
There are a few places where it's not clear whether they simplified the arrows or whether something should be read into the diagram for the integer execution engine. The Load/Store block in particular has arrows going to the retire queue, the forwarding mux, and the register file.
Comparing the Zen and Zen 2 diagrams shows a change from paired arrows going into the register file and forwarding mux to a single arrow. The arrows from the integer units and load/store blocks to the retire queue now show one arrow from the integer block sharing an entry point with the originally independent load/store path.
It could just be to reduce visual clutter, or possibly a streamlining choice: with a broader set of hardware paths, the routing congestion of keeping every path independent may not be worth it if they're unlikely to all be used at once.
The number of uops that can be dispatched to the renamer hasn't changed, so the wider later stages may make that a clearer bottleneck than before.
The TAGE branch predictor could be a big win. From what I've been reading it's pretty bleeding-edge tech, much better than the perceptron predictor they've been using previously (the vid above says it was intended for Zen 3 but they brought it forward), though I did find a suggestion that Intel has already been using this.
The TAGE predictor is a level-two predictor, meaning it is accessed after the initial prediction by the perceptron. Perhaps Zen 3 has a similar arrangement, or its later addition in Zen 2 meant it was easier to fit the larger TAGE predictor one level further out from the inner prediction loop, since power was the supposed reason for keeping the perceptron as the initial predictor.
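For anyone unfamiliar with TAGE, here's a toy sketch of the idea (not AMD's or any real implementation, and all sizes/history lengths are made up): several tagged tables are indexed with geometrically increasing branch-history lengths, the longest-history table with a tag match provides the prediction, and an untagged base table acts as the fallback.

```python
# Toy TAGE-style predictor: tagged tables with geometric history lengths.
# Illustrative only; real designs add useful bits, alternate predictions,
# and smarter allocation policies.

HIST_LENGTHS = [0, 4, 8, 16, 32]   # geometric-ish series (made up)
TABLE_BITS = 10                    # 1024 entries per table

def fold(history, length, bits):
    """Fold the most recent `length` outcome bits into a `bits`-bit index."""
    h = 0
    recent = history[-length:] if length else []
    for i, b in enumerate(recent):
        h ^= b << (i % bits)
    return h

class TagePredictor:
    def __init__(self):
        self.tables = [dict() for _ in HIST_LENGTHS]  # idx -> [tag, 2-bit ctr]
        self.history = []                             # global outcome history

    def _index(self, pc, table):
        return (pc ^ fold(self.history, HIST_LENGTHS[table], TABLE_BITS)) % (1 << TABLE_BITS)

    def predict(self, pc):
        tag = (pc >> TABLE_BITS) & 0xFF
        # search from longest history down; table 0 always "hits" (no tag check)
        for t in range(len(HIST_LENGTHS) - 1, -1, -1):
            entry = self.tables[t].get(self._index(pc, t))
            if t == 0 or (entry is not None and entry[0] == tag):
                ctr = entry[1] if entry is not None else 1
                return ctr >= 2, t          # (taken?, providing table)

    def update(self, pc, taken):
        pred, t = self.predict(pc)
        tag = (pc >> TABLE_BITS) & 0xFF
        entry = self.tables[t].setdefault(self._index(pc, t), [tag, 1])
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)
        # on a misprediction, allocate in the next-longer-history table
        if pred != taken and t < len(HIST_LENGTHS) - 1:
            self.tables[t + 1][self._index(pc, t + 1)] = [tag, 2 if taken else 1]
        self.history.append(1 if taken else 0)
```

The longer history lengths are what let TAGE catch correlated branches that a fixed-history perceptron struggles with, which is presumably why it's worth paying the extra latency of a second-level lookup.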
Regarding the 256-bit AVX: will they do double-rate 128-bit?
The number of ports and dispatch width hasn't changed with the FPU, so I don't think it does.
It occurs to me that this chiplet architecture is arguably a return to the separate CPU/northbridge/southbridge arrangement.
I think this is the case, or at least I've not seen a strong enough distinction in terms of features or design behavior to make this appear any different from other cycles of integration and separation that happen over time.
According to the AnandTech article, the scheduler apparently tries to split them up to manage thermals -> no dedicated clock reduction like Intel has, but that doesn't rule out thermal throttling via the normal mechanisms.
I think the cited mechanism is that the DVFS system uses activity monitors and built-in estimates for the power cost of instructions to determine what voltage and clock steps should be used, rather than a coarse change in clocking regime based on what category of instruction the decoder encounters.
This may help in certain cases where instructions that might be considered wide by the front end have internally lower costs for whatever reason. One possible area is using very wide AVX instructions to boost the performance of memory copies and clears, where a naive throttling of the core that makes sense for heavy ALU work hurts the memory optimization. However, I think more recent Intel cores have gotten better at subdividing AVX categories so that fewer optimizations are treated like very wide ALU ops.
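A crude sketch of what such an estimate-driven DVFS loop might look like (every number, instruction class, and P-state here is made up for illustration; this is not AMD's actual mechanism): the controller weights monitored per-class activity by assumed energy costs and picks the fastest clock/voltage step that fits a power budget, rather than switching regimes on instruction category alone.

```python
# Hypothetical estimate-driven DVFS. Dynamic power scales roughly as
# activity * f * V^2 (classic CMOS scaling). All constants are invented.

ENERGY_PJ = {"alu": 5, "load": 12, "avx128": 20, "avx256": 38}  # pJ/op, assumed
P_STATES = [(4.4, 1.30), (4.0, 1.20), (3.6, 1.10), (3.0, 1.00)]  # (GHz, V)
POWER_BUDGET_W = 0.4  # execution-unit dynamic power budget, made up

def estimated_power_w(op_mix_per_cycle, freq_ghz, volt):
    """Estimate dynamic power from per-cycle op rates and energy weights."""
    energy_per_cycle_pj = sum(ENERGY_PJ[op] * rate
                              for op, rate in op_mix_per_cycle.items())
    # pJ/cycle * Gcycles/s = mW; * 1e-3 -> W (V^2 relative to a 1.0 V base)
    return energy_per_cycle_pj * freq_ghz * volt ** 2 * 1e-3

def pick_p_state(op_mix_per_cycle):
    """Choose the fastest P-state whose estimated power fits the budget."""
    for freq, volt in P_STATES:
        if estimated_power_w(op_mix_per_cycle, freq, volt) <= POWER_BUDGET_W:
            return freq, volt
    return P_STATES[-1]  # clamp to the lowest state
```

Under this kind of scheme a stream of wide AVX ops that is actually cheap (say, a memory clear) would report low monitored activity and keep its clocks, which is exactly the case where a coarse category-based offset hurts.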