I just realized that optimizing for LLC locality via access pattern seems like a long stretch for MI300A, which has one-fourth of its LLC and memory channels not bound to any GPU XCD. It really seems like the partitioned modes are the only way to make full use of the LLC's peak bandwidth. Doing it by access pattern sounds like it should be possible, but how exactly would you do it in practice? You'd need full knowledge of the entire hashing algorithm *and* you'd need the virtual=>physical memory mapping to remain consistent across way more than 4KiB.
Either way, given that the LLC is memory-side and that everything is a power of two, it could be as simple as a power-of-two scheme where PA bits 14:13 select a home IOD and PA bits 12:8 select a channel within that IOD. There also seems to be a plausible argument for the alternative, where striping across IODs first gives better power/performance probabilistically (and, to a lesser extent, better local effective LLC bandwidth).
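To make the guess concrete, here is a minimal sketch of that decode. The bit layout (PA[14:13] for the IOD, PA[12:8] for the channel) is purely the speculation above, not a documented AMD mapping, and the names `field`, `decode` and `bits_fixed_by_page` are made up for illustration. The second helper shows which selector bits fall inside the page offset, i.e. which ones you could control from a virtual address alone.

```python
# Hypothetical bit layout from the post above; NOT a documented AMD mapping.
IOD_LO, IOD_HI = 13, 14      # PA[14:13] -> home IOD (assumed)
CHAN_LO, CHAN_HI = 8, 12     # PA[12:8]  -> channel within the IOD (assumed)


def field(pa: int, lo: int, hi: int) -> int:
    """Extract the inclusive bit field pa[hi:lo]."""
    return (pa >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(pa: int) -> tuple[int, int]:
    """Return (iod, channel) for a physical address under the assumed layout."""
    return field(pa, IOD_LO, IOD_HI), field(pa, CHAN_LO, CHAN_HI)


def bits_fixed_by_page(lo: int, hi: int, page_shift: int) -> list[int]:
    """Selector bits that lie inside the page offset, so VA and PA agree on them."""
    return [b for b in range(lo, hi + 1) if b < page_shift]


if __name__ == "__main__":
    for pa in (0x0000, 0x0100, 0x1F00, 0x2000, 0x7FFF, 0x8000):
        print(hex(pa), decode(pa))
    for shift in (12, 21):   # 4 KiB and 2 MiB pages
        print(shift,
              bits_fixed_by_page(IOD_LO, IOD_HI, shift),
              bits_fixed_by_page(CHAN_LO, CHAN_HI, shift))
```

Under this assumed layout, a 4KiB page only pins channel bits 11:8; bit 12 and both IOD bits come from the page frame number, which is the "consistent across way more than 4KiB" problem. With 2MiB pages all seven selector bits would sit inside the page offset.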