I've been envisioning the L3 as a read/write cache for a single local (CCX) memory controller, with that controller servicing requests over the interconnect and providing a uniform interface for whatever memory is attached. The L3 would be unified, but with multiple pools having different latencies, each corresponding to an address range.
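To make that concrete, here's a toy lookup of the kind I'm imagining, where the controller maps a physical address to one of several pools with different latencies. The pool boundaries and latency numbers are made up purely for illustration, not anything measured.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical: a unified L3/controller view where each physical address
     * range maps to a pool with its own characteristic latency. All values
     * below are placeholders. */
    struct mem_pool {
        uint64_t base;        /* start of the address range */
        uint64_t limit;       /* end of the address range (exclusive) */
        unsigned latency_ns;  /* characteristic access latency for this pool */
    };

    static const struct mem_pool pools[] = {
        { 0x0000000000ull, 0x0800000000ull, 80 },   /* local DRAM channel */
        { 0x0800000000ull, 0x1000000000ull, 130 },  /* remote/slower memory */
    };

    static const struct mem_pool *lookup_pool(uint64_t paddr)
    {
        for (size_t i = 0; i < sizeof(pools) / sizeof(pools[0]); i++)
            if (paddr >= pools[i].base && paddr < pools[i].limit)
                return &pools[i];
        return NULL;  /* address not backed by any pool */
    }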
The descriptions elsewhere seem to make it out as something localized to the L3, not an LLC and not part of the northbridge. I'm not sure what behaviors or tests would indicate that the L3s are that closely linked to the memory controller.
Ideally, contended lines would be evicted to the L3, at which point the mostly exclusive properties would be useful. Moreover, E in the L3 seems possible in the sense of uncontended clean lines in E state being evicted from the cores.
Contended lines would usually mean there are live copies in one L2 or the other. I noted in passing that an exclusive line would convert to shared status upon a read, making it drop out of the L3 and/or out of snooping in general. In the case of an E line being evicted and then read back in by another CPU, the L3 would face the choice of invalidating its copy upon servicing the request or changing it to shared--which in normal MOESI renders a line silent.
As an extension to MOESI, lines in E state might also respond with data and an ownership transfer (say, if requests snoop the other L2s inside the CCX before going to the system), instead of transitioning to S state (which would require an invalidation broadcast when further transitioning to M/E) or not responding at all.
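Laying the options out as a toy decision function--purely a sketch of the possibilities above, not a claim about AMD's actual implementation--a read probe hitting an E line in the L3 could forward data and keep ownership, fall to shared, or stay silent:

    /* Toy model of the choices for an L3 copy in E state when another
     * core's read probe hits it. Speculative sketch only. */
    enum moesi { M, O, E, S, I };

    enum snoop_action {
        FORWARD_AND_KEEP_OWNERSHIP,  /* respond with data, keep the line as owner */
        DROP_TO_SHARED,              /* both copies end up S; a later M/E upgrade
                                        needs an invalidation broadcast */
        STAY_SILENT,                 /* don't respond; requester fetches from memory */
    };

    struct snoop_result {
        enum moesi new_local_state;
        enum snoop_action action;
    };

    static struct snoop_result l3_read_probe_hits_E(int allow_e_forwarding)
    {
        struct snoop_result r;
        if (allow_e_forwarding) {
            /* The speculative extension: E answers like M/O would, taking
             * ownership so the requester's S copy isn't a silent sharer. */
            r.new_local_state = O;
            r.action = FORWARD_AND_KEEP_OWNERSHIP;
        } else {
            /* Plain MOESI-style behavior: fall to S and let memory (or a
             * later invalidation broadcast) sort out who may write. */
            r.new_local_state = S;
            r.action = DROP_TO_SHARED;
        }
        return r;
    }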
Exclusive is defined as the line being unique, which would normally preclude the L3 keeping the line in that status after responding to a read. Perhaps the math generally works out that a line in E status is either going to be written soon, or that it satisfies enough demand by serving one additional read request before dropping to shared. Since the CCX probes both the L2 shadow tags and L3, there's no good way to trust an E line in one cache if there can be an S copy in the other.
I can see a more in-line path for Modified and Owned lines evicting to the local L3. A modified line could drop to the L3 in the same status, and transition to O as needed. This sort of transition would allow for a mostly exclusive L3 without interfering as much with MOESI. This at least makes some sense for the prior CPUs that had MOESI but dropped the L3, although now that the L3 is supposedly not optional perhaps it is different.
An Owned line might be able to drop back to the L3 as well on an L2 eviction and carry on as normal.
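A minimal sketch of that eviction path, assuming the L3 simply inherits the victim's state and only demotes M to O when a later probe forces sharing:

    /* Sketch of dirty L2 victims dropping into the L3 without breaking MOESI.
     * Assumed behavior, not a documented transition table. */
    enum moesi { M_STATE, O_STATE, E_STATE, S_STATE, I_STATE };

    /* L2 evicts a line: the L3 keeps the same state the L2 held. */
    static enum moesi l3_install_victim(enum moesi l2_state)
    {
        return l2_state;  /* M stays M, O stays O, E/S handled as clean victims */
    }

    /* Later read probe from another core hits the L3 copy. */
    static enum moesi l3_read_probe(enum moesi l3_state)
    {
        switch (l3_state) {
        case M_STATE: return O_STATE;  /* supply data, keep dirty ownership */
        case O_STATE: return O_STATE;  /* already the owner, just forward */
        case E_STATE: return S_STATE;  /* clean line falls to shared */
        default:      return l3_state;
        }
    }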
What might make this problematic is when it comes time to transition dirty lines in preparation for a request from another CCX. Depending on how that is handled, it might explain why memory write bandwidth aligns with the inter-CCX bandwidth, if transitioning dirty lines throws them back to the controller/DRAM.
Plain MOESI may highlight the discrepancy between the bandwidth within a CCX and that of the external fabric. MOESI seems to have limited amplification for sharing clean lines, and there are certain paths for sharing dirty lines that leave no amplification opportunities on the far side of an inter-CCX transfer (a read places the source line in O and the remote cache's line in S). I don't know whether AMD's L3 arrangement allows an Owned line to shift to Modified after a write probe without forcing a write to memory, but that might be problematic outside of the more tightly integrated local CCX L3 and nearby shadow tags.
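To spell out the amplification point, this is the transition I have in mind for a read arriving from another CCX under plain MOESI. Only M/O/E holders forward data; once every copy is in S, no cache answers and further readers go back to memory. Again just a sketch, not measured or documented behavior:

    /* Plain-MOESI view of a read request arriving from another CCX.
     * Illustrative only; real fabric behavior may differ. */
    enum moesi { MM, OO, EE, SS, II };

    struct xccx_read {
        enum moesi source_after;     /* state left in the responding cache */
        enum moesi requester_after;  /* state installed at the requester */
        int data_from_cache;         /* 1 = cache-to-cache forward, 0 = memory */
    };

    static struct xccx_read cross_ccx_read(enum moesi source_before)
    {
        struct xccx_read r;
        switch (source_before) {
        case MM:  /* dirty owner forwards, drops to O */
        case OO:  /* owner keeps answering every remote read itself */
            r.source_after = OO; r.requester_after = SS; r.data_from_cache = 1;
            break;
        case EE:  /* clean exclusive: forward and fall to S */
            r.source_after = SS; r.requester_after = SS; r.data_from_cache = 1;
            break;
        default:  /* S and I are silent: memory has to supply the data */
            r.source_after = source_before; r.requester_after = SS;
            r.data_from_cache = 0;
            break;
        }
        return r;
    }

The S/I branch is where the clean-line amplification runs out: nothing designates a responder, so every additional reader of a widely shared clean line is another trip to memory rather than a cache-to-cache forward.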
Perhaps some of the problematic workloads have a lot of easily forwarded clean lines, for which MOESI doesn't utilize the CCX data paths as well.
Alternatively, they could be writing to producer-consumer buffers sized to some fraction of a larger LLC rather than to a set of smaller L3s or L2s, leading to more writebacks+invalidates and not utilizing internal CCX bandwidth well.
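As a toy illustration of that sizing mismatch (the cache sizes below are placeholders, not any particular part's figures):

    #include <stdio.h>

    /* Placeholder sizes, purely illustrative. */
    #define ASSUMED_LLC_BYTES   (32u * 1024 * 1024)  /* software's mental model */
    #define CCX_L3_BYTES        (8u  * 1024 * 1024)  /* what one CCX actually has */

    int main(void)
    {
        /* Producer/consumer buffer sized to "half the LLC". */
        unsigned buffer = ASSUMED_LLC_BYTES / 2;

        if (buffer > CCX_L3_BYTES)
            printf("buffer (%u MiB) spills a single CCX L3 (%u MiB): "
                   "expect writebacks + invalidates instead of on-CCX forwarding\n",
                   buffer >> 20, CCX_L3_BYTES >> 20);
        else
            printf("buffer fits within one CCX L3\n");
        return 0;
    }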
I may have missed the benchmarking of cache ping-pong latency or synchronization latency, which might be another factor for the workloads that show unusually low performance.
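For reference, this is the kind of ping-pong test I mean: two threads bounce a single cache line via an atomic flag, and the round-trip time approximates the coherence latency between wherever they're pinned (intra-CCX vs. cross-CCX). Thread pinning and warm-up are left out to keep the sketch short.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000

    /* One contended cache line: the two threads take turns flipping 'turn'. */
    static _Atomic int turn = 0;

    static void *pong(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load_explicit(&turn, memory_order_acquire) != 1)
                ;  /* wait for ping */
            atomic_store_explicit(&turn, 0, memory_order_release);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct timespec a, b;

        pthread_create(&t, NULL, pong, NULL);
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < ITERS; i++) {
            atomic_store_explicit(&turn, 1, memory_order_release);
            while (atomic_load_explicit(&turn, memory_order_acquire) != 0)
                ;  /* wait for pong */
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        pthread_join(t, NULL);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("round trip: %.1f ns (pin the threads to compare "
               "intra-CCX vs cross-CCX)\n", ns / ITERS);
        return 0;
    }

Run with the two threads pinned to cores in the same CCX and then in different CCXs; a large gap between the two figures would point at the synchronization-latency factor mentioned above.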