That doesn't sound like a good compromise... The number of memory instructions needing synchronization in a common code base is several orders of magnitude smaller than the number of basic loads and stores.
This may depend on what workloads are part of the set. Zen and K12 range from consumer to HPC and high(er) end servers.
The mix of synchronization and the scale in terms of software complexity and core and socket counts can vary significantly.
That K12 was meant to be a sibling to Zen also means that there is a desire to transfer applications or knowledge built on a strongly-ordered architecture to a weaker one, which brings in some thorny issues.
The other direction is easier since it might just mean the core ignores something.
x86 is dominant in much of the server space. Power is weaker, although even it has taken on stronger semantics in recent iterations, and the big-iron Z-arch is strongly ordered as well.
That's more of a target market concern for K12, although at this point we may not see it where Zen is going.
The other question, which might be addressed in the Sutter video once I am able to watch it, is what the hardware actually does when it encounters a barrier or an acquire/release operation. In the past, such events could severely constrain parallelism by forcing pipeline drains or more global stalls, and even if sections that need synchronization are 1-2 orders of magnitude fewer in static count, the dynamic count can shift. Also, aggressively OoO cores like Zen and K12 can host over 100 memory operations in their buffers, some of which can be speculated past a barrier. How the L/S pipeline reacts can drastically change the cost of the operations despite their rarity, particularly if a design's speculative capability means the probability of having at least one barrier in flight goes to 1.
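For anyone following along who hasn't worked with acquire/release directly, here is a minimal C++11 sketch of the usual message-passing pattern. The names (payload, ready, run_message_passing) are made up for illustration and aren't from the video or any AMD/ARM material. The release store keeps the earlier write to payload from being reordered after the flag, and the acquire load keeps the later read of payload from being hoisted above it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: hand one plain int from a producer thread to a
// consumer thread using a release store paired with an acquire load.
int run_message_passing() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // synchronization flag
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain store
        ready.store(true, std::memory_order_release);  // publish: earlier writes ordered before this
    });

    std::thread consumer([&] {
        // Spin until the flag is seen; the acquire load orders the
        // following read of payload after it.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;   // must see the value written before the release store
        assert(observed == 42);
    });

    producer.join();
    consumer.join();
    return observed;
}
```

On x86 the two atomic accesses here normally compile to ordinary loads and stores, because the architecture already provides that ordering. On ARMv8, compilers typically emit the dedicated load-acquire and store-release instructions (LDAR/STLR) instead. What the core does internally when it sees those instructions is the part the sketch can't show.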
ARMv8's method might constrain the cost, although it seems like it wants to scale from some relatively conservative cores to the high end like K12. An in-order little core with limited speculation has a lower peak to drop from in terms of parallelism and speculation, and a small L/S pipeline tied to a simple hierarchy won't find a drain as prohibitive as a 72+44 OoO pipeline would.
Another synchronization difference, as noted, is that Zen (AMD64 in general?) keeps its Icache coherent, whereas ARM does not.
I am not a hardware engineer, but my educated guess is that a looser memory model (reordering loads and stores) would affect many places in the OoO machinery. Keller saying that K12 has better performance than Zen would suggest that AMD planned to customize the chip to take advantage of the looser ARMv8 memory model.
The memory pipeline is one area where I considered that the "engine" could be bigger. The decoder could be wider, up to the uop issue width. The actual growth in processor width may become gated by the renamer rather than the decoder.
One point about the OoO machinery is that the effect also runs in the other direction. The looser memory model makes changes in the OoO engine's behavior more apparent to software, if the core doesn't try to hide them.
I would have liked to have seen Zen and K12 if the memory pipeline was less restricted. There would have been no better way to compare the benefits and limitations of the two schools of thought than a design with the same implementation competence, overall philosophy, and design targets.
Btw, wouldn't it make sense to try to make square wafers? The amount of area lost with circular ones is huge.
It's a result of how the wafer is created. The wafer must be one large silicon crystal, which is created by starting from a seed in a vat of molten silicon and slowly growing it outwards.
The natural tendency would be to grow out equally in all directions since there's no particular way that the atoms care about corners.
However, a sphere doesn't give consistently-sized 2d surfaces, so instead the crystal is slowly lifted so that the sphere stretches into a cylinder.
It's worth losing the area to get an affordable mass-produced ultrapure giant crystal of silicon, versus the expense and likelihood of failure trying to make silicon do something it does not want to do.
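To put a rough number on the lost area, here is a small C++ sketch using a common first-order estimate of gross dies per wafer: wafer area divided by die area, minus an edge-loss term. The function names and the 300 mm / 200 mm² figures are assumptions picked for illustration, not numbers for any actual Zen or K12 die, and the estimate ignores scribe lines, edge exclusion, and yield.

```cpp
#include <cassert>
#include <cmath>

// First-order estimate: (wafer area / die area) - (edge loss).
// The edge term approximates partial dies cut off by the round rim.
double gross_dies_per_wafer(double wafer_diameter_mm, double die_area_mm2) {
    assert(wafer_diameter_mm > 0.0 && die_area_mm2 > 0.0);
    const double pi = std::acos(-1.0);
    const double radius = wafer_diameter_mm / 2.0;
    const double area_term = pi * radius * radius / die_area_mm2;
    const double edge_term = pi * wafer_diameter_mm / std::sqrt(2.0 * die_area_mm2);
    return area_term - edge_term;
}

// Fraction of the ideal (area-only) die count lost to the round edge.
double edge_loss_fraction(double wafer_diameter_mm, double die_area_mm2) {
    assert(wafer_diameter_mm > 0.0 && die_area_mm2 > 0.0);
    const double pi = std::acos(-1.0);
    const double radius = wafer_diameter_mm / 2.0;
    const double area_term = pi * radius * radius / die_area_mm2;
    const double edge_term = pi * wafer_diameter_mm / std::sqrt(2.0 * die_area_mm2);
    return edge_term / area_term;
}
```

With those assumed inputs, gross_dies_per_wafer(300.0, 200.0) comes out to roughly 306 dies and edge_loss_fraction(300.0, 200.0) to about 0.13, so under this approximation the round edge costs on the order of an eighth of the candidates for a die that size. The fraction grows with die size and shrinks with wafer diameter.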