https://github.com/llvm/llvm-project/commit/47bac63d3f6b9e64fdf997aff1f145bc948f02d9
It seems like cache coherence in GFX940 (CDNA 3?) is achieved by:
* Memory local to an L2: reads & writes are cacheable.
* Remote memory (other L2s or CPU): reads are uncached; writes are write-through and send an invalidation to the home L2.
* Each L2 keeps a probe filter for CPU cached lines from its local memory, and forces CPU invalidation or writeback as appropriate.
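To make the local/remote split above concrete, here is a minimal sketch of the policy as I read it from the commit. This is a toy model, not the hardware's or LLVM's actual logic: the names `L2Policy` and `l2_policy`, and the convention that `home_l2=None` means CPU memory, are all made up for illustration. The probe filter for CPU-cached lines is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class L2Policy:
    """Hypothetical summary of how one L2 treats an access (illustrative only)."""
    cache_reads: bool       # may read data be kept in the accessing L2?
    cache_writes: bool      # may dirty data be held in the accessing L2?
    write_through: bool     # are writes forwarded to the home right away?
    invalidate_home: bool   # is an invalidation sent to the home L2?


def l2_policy(accessing_l2: int, home_l2: Optional[int]) -> L2Policy:
    """Return the assumed policy for an access issued through `accessing_l2`.

    `home_l2` is the index of the L2 that owns the memory, or None for
    CPU memory (an assumption of this sketch, not a hardware concept).
    """
    if home_l2 is not None and home_l2 == accessing_l2:
        # Local memory: reads and writes are cacheable.
        return L2Policy(cache_reads=True, cache_writes=True,
                        write_through=False, invalidate_home=False)
    # Remote memory (another L2 or the CPU): reads are uncached, writes are
    # written through with an invalidation sent to the home. The note lumps
    # both remote cases together, so this sketch does too.
    return L2Policy(cache_reads=False, cache_writes=False,
                    write_through=True, invalidate_home=True)
```

For example, `l2_policy(0, 0)` describes a fully cacheable local access, while `l2_policy(0, 1)` and `l2_policy(0, None)` both describe the uncached-read, write-through remote case.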
It appears that GFX940 no longer has a unified L2 cache shared by all CUs. The configuration ranges:
1. from many smaller agents/virtual devices, each having its own private L2 cache;
2. to one single agent having multiple L2 caches.
Each L2 now owns a disjoint(?) region of the device memory, while each still appears to have internal interleaved “channel” partitions.
This makes perfect sense in multi-agent mode, where each small agent gets a fixed contiguous region that can be owned outright by a single L2. But I am not sure how more “monolithic” configurations, where one agent sees multiple standalone L2s, would work effectively. Page-level interleaving, eh?
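To illustrate the two ownership schemes being contrasted, here is a small sketch of how an address could map to its home L2 in each case. Everything here is hypothetical: the function names, the 4 KiB page size, and the mapping formulas are my assumptions for illustration, and nothing in the commit confirms that the hardware interleaves at page granularity.

```python
PAGE_SIZE = 4096  # assumed page size, purely for illustration


def owner_contiguous(addr: int, mem_size: int, num_l2: int) -> int:
    """Multi-agent style: each L2 owns one fixed contiguous region."""
    region = mem_size // num_l2
    # Clamp so any remainder bytes at the top fall to the last L2.
    return min(addr // region, num_l2 - 1)


def owner_page_interleaved(addr: int, num_l2: int,
                           page_size: int = PAGE_SIZE) -> int:
    """Speculative "monolithic" style: consecutive pages rotate across L2s."""
    return (addr // page_size) % num_l2
```

With contiguous regions, an agent confined to its own region only ever touches its local L2. With page interleaving, a single agent spanning four L2s would find roughly three out of four pages homed remotely from any given L2, which is why I wonder how effective that mode can be.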