Seems a rather roundabout way of doing things: making remote access to the producer's L2 subset faster ... even though the moment you store the data you already know where the consumer is going to be. The consumer already has a portion of cache to which it has fast access: its own L1 cache and L2 subset. Storing the data there alone would save on coherency traffic to and from the data caches. The next step would be a dedicated chunk of logic to handle these loads and stores, i.e. a subset of the L2 with much faster access times to lower communication latencies.
PS. Of course, even ignoring that, there is already a subset of the local cache with much faster access times ... even though it's not a subset of L2. Just keep the data there, then; yet another cache doesn't make a whole lotta sense to me.