AMD said pretty much exactly that when asked about the inter-CCX communication. If the data is found in the other CCX's L3, then the memory request is cancelled. That only makes sense if you have a buffer that's deep enough to accommodate the latency of getting this information (not necessarily the data itself, though), doesn't it?
The L3 isn't inclusive, so checking it is insufficient. Perhaps the statement covers the shadow tag check as well?
There are a number of questions I have about the sequence of events, since this isn't from a statement written directly by AMD.
The buffer for absorbing unneeded RAM requests shouldn't need to be the size of those external arrays. I'm not sure where exactly the L2 miss handling is being done, and the CCX interface may play some role in it with the shadow tag functions. Wherever the handling is done, the number of misses that can be handled before the pipeline stalls is limited. Some papers on one of AMD's ideas for handling coherence between the CPU and GPU peg it at perhaps several dozen (problematic for GPU loads that throw out multiple thousands, http://research.cs.wisc.edu/multifacet/papers/micro13_hsc.pdf).
If we go by the visually estimated 2MBx2 capacity of those external arrays, that is room for 64K cache line requests (4MB at 64 bytes per line). A whole Naples 2-socket system is going to stall before noticeably filling one array.
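To put rough numbers on that, here is a quick sanity check of the arithmetic. The 2MBx2 figure is the visual estimate from above, and the 48 misses in flight is just an illustrative pick for "several dozen", not a number from AMD or the paper.

```python
# Back-of-the-envelope check; every size here is an estimate from the discussion.
ARRAY_BYTES = 2 * 1024 * 1024   # visually estimated capacity of one external array
NUM_ARRAYS = 2
LINE_BYTES = 64                 # x86 cache line size

# Assumption: if the arrays were a request buffer, each entry holds one full line.
entries = NUM_ARRAYS * ARRAY_BYTES // LINE_BYTES

# Hypothetical value standing in for "several dozen" misses per miss handler.
ASSUMED_MISSES_IN_FLIGHT = 48

# How many such miss handlers would have to be saturated to fill the arrays.
handlers_needed = entries // ASSUMED_MISSES_IN_FLIGHT

print(entries)          # 65536
print(handlers_needed)  # 1365
```

Even if the real limit is a few times higher than the assumed one, the gap is large enough that the arrays look oversized for a pure request buffer.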
Could AMD have massively changed things this time around? Even if we assume the paper's baseline is conservative and predates Zen, I'm not sure Zen's coherent pipeline has shown a revamp so extreme.
A directory or filter like HT Assist seems to fit the storage capacity.
The old version of HT Assist from Istanbul could cover 16MB of cache with 1MB of capacity stolen from the L3. (edit: https://www.cs.columbia.edu/~junfeng/12sp-w4118/lectures/amd.pdf)
One of those possibly 2MB arrays in Zen could cover 32MB--which would be enough to cover the L3s of a 4-chip Naples, although insufficient to cover the L2s. Having two, however, would cover one Naples MCM entirely if there's some amount of balance in cache utilization. The other effect of having the array external to the CCX is that it would allow the CCX to downclock or power-gate without affecting memory traffic or coherence, since the L3 is no longer part of the northbridge.
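The scaling above is just the Istanbul ratio applied to the estimated array size. A small sketch, assuming Zen would track lines at the same 16:1 ratio as the old HT Assist (that is a guess, not something AMD has stated):

```python
MIB = 1024 * 1024
LINE_BYTES = 64

# Istanbul HT Assist figures quoted above: 1MB of directory covers 16MB of cache.
HT_ASSIST_DIRECTORY = 1 * MIB
HT_ASSIST_COVERED = 16 * MIB
coverage_ratio = HT_ASSIST_COVERED // HT_ASSIST_DIRECTORY

# Directory state per tracked line implied by that ratio.
bytes_per_entry = LINE_BYTES // coverage_ratio


def covered_mib(array_bytes, num_arrays=1, ratio=coverage_ratio):
    """Cache capacity (in MiB) trackable by num_arrays arrays at the given ratio."""
    return array_bytes * num_arrays * ratio // MIB


ZEN_ARRAY_BYTES = 2 * MIB  # visual estimate, not a confirmed size

print(coverage_ratio)                   # 16
print(bytes_per_entry)                  # 4
print(covered_mib(ZEN_ARRAY_BYTES))     # 32
print(covered_mib(ZEN_ARRAY_BYTES, 2))  # 64
```

If the per-entry state in Zen is larger or smaller than the implied 4 bytes, the coverage scales inversely.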
Another unknown is what AMD's version of MOESI is for Zen. The classical MOESI would generally not need as much buffering because a lot of traffic is going straight to memory no matter what (L3 victim cache, Shared state does not forward). Bulldozer's MOESI seems to have shifted that preference, although some of its changes may interact with activating the directory. In the old scheme, the access is happening anyway for a lot of misses, meaning no buffering is going to save the controller's time.
If Zen is more like Bulldozer, this might be less frequent. The question becomes how quickly the probe can be resolved before the entry comes up in the memory controller's queue. It would be relatively cheap to buffer the access and overlap the delay if the probe takes a little longer. It wastes bandwidth and power if the access is cancelled, but it shaves off potentially tens of nanoseconds in the single-chip case, given the gulf in memory latency versus the inter-CCX ping latency.
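A toy model of that tradeoff, comparing a controller that waits for the probe result against one that issues the DRAM access alongside the probe and cancels it on a hit. The function names and the latency values are made up for illustration and are not measurements of any Zen part.

```python
def serial_latency(probe_ns, mem_ns, hit_in_other_ccx):
    """Wait for the probe to resolve before starting the memory access."""
    return probe_ns if hit_in_other_ccx else probe_ns + mem_ns


def speculative_latency(probe_ns, mem_ns, hit_in_other_ccx):
    """Start the memory access alongside the probe; cancel it on a hit."""
    return probe_ns if hit_in_other_ccx else max(probe_ns, mem_ns)


# Hypothetical latencies, chosen only to show the shape of the tradeoff.
PROBE_NS = 40
MEM_NS = 90

# On a probe miss the overlap hides the shorter of the two latencies.
saved_on_miss = (serial_latency(PROBE_NS, MEM_NS, False)
                 - speculative_latency(PROBE_NS, MEM_NS, False))

# On a probe hit nothing is saved; the cancelled access is the wasted work.
saved_on_hit = (serial_latency(PROBE_NS, MEM_NS, True)
                - speculative_latency(PROBE_NS, MEM_NS, True))

print(saved_on_miss)  # 40
print(saved_on_hit)   # 0
```

The model ignores queueing in the memory controller, which is exactly where the question of whether the probe resolves before the entry comes up would be decided.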