Unless you want to design in a cache coherency protocol, which would be ridiculously complex and totally inconsistent with AMD's design philosophy.
Texture caches are read-only, so the coherency is relatively easy, isn't it? With tiled formatting of textures in memory (i.e. deterministic texel addresses), I'm guessing that there's only one cache to query for the presence of any quad of texels, rather than having to broadcast a request. Obviously all the TUs will be fetching multiple quads of texels in parallel, so the rate of queries is quite high.
Writes to render targets are tiled, i.e. any caching of those targets is localised, obviating coherency.
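To make the "only one cache to query" guess concrete, here is a minimal sketch of how a deterministic texel address could pick a single owning cache. Everything in it is an assumption for illustration: the 8x8 tile size, the count of four caches, the simple modulo interleave and the function names are hypothetical, not a description of any real GPU.

```python
# Hypothetical parameters, chosen only for illustration.
TILE_W = 8        # texels per tile, horizontally (assumed)
TILE_H = 8        # texels per tile, vertically (assumed)
NUM_CACHES = 4    # number of texture caches (assumed)


def tile_of(x, y):
    """Return the (column, row) of the tile containing texel (x, y)."""
    return x // TILE_W, y // TILE_H


def owning_cache(x, y, tiles_per_row):
    """Map a texel to the single cache that may hold it.

    The tile index is a pure function of the texel coordinates, so any
    texture unit can compute the owner locally without a broadcast.
    """
    tx, ty = tile_of(x, y)
    tile_index = ty * tiles_per_row + tx
    return tile_index % NUM_CACHES  # simple interleave (assumed policy)


def quad_owner(x, y, tiles_per_row):
    """Owner of the aligned 2x2 quad containing texel (x, y).

    Snapping to even coordinates keeps the whole quad inside one tile,
    because the assumed tile dimensions are even.
    """
    qx, qy = x & ~1, y & ~1
    return owning_cache(qx, qy, tiles_per_row)
```

With a texture 16 tiles wide, `quad_owner(9, 9, 16)` and `quad_owner(8, 8, 16)` name the same cache, so a request for that quad goes to one place only. The same kind of mapping would apply to render-target tiles, where each tile has exactly one writer.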
It's hard to work out what's needed to make each of the graphics-pipeline inter-stage buffers work. I'm still quite fuzzy on how VS/GS work is distributed across the clusters of a single GPU, so I don't know what implications there are for multi-chip processing of geometry.
And I'm not clear if it's meaningful to have TS fixed-function units, on each of multiple GPUs, running in parallel, due to the locality/neighbourhood properties of patch processing (similar question mark for GS since locality of primitives is a key concept).
On the other hand, since post-transform caches in GPUs are very small, and vertices that have been evicted by LRU or some other policy are implicitly re-shaded, absolute throughput for VS/GS seems a lower priority. Maybe this will change with D3D11 due to the intensity of tessellation-based pipelines? I don't know how often typical games are VS or geometry bound in some way, generally.
---
Under GPGPU, video memory is effectively read-write. This type of access is uncached in current GPUs, so it requires plenty of threads to hide the resulting latency. It's hard to know if there are any plans to implement caching for read-write accesses.
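As a rough back-of-envelope for "plenty of threads", the sketch below estimates how many threads must be in flight so that the ALUs stay busy while one thread waits on an uncached access. The function name and the cycle counts in the usage note are hypothetical, picked only to show the shape of the calculation, not measured from any hardware.

```python
import math


def threads_to_hide_latency(mem_latency_cycles, alu_cycles_per_fetch):
    """Estimate threads needed to cover one memory access.

    While one thread waits mem_latency_cycles, the other threads must
    supply that many cycles of ALU work between them, each contributing
    alu_cycles_per_fetch before it stalls on its own fetch.
    """
    others = math.ceil(mem_latency_cycles / alu_cycles_per_fetch)
    return others + 1  # plus the thread that is waiting
```

For example, an assumed 400-cycle latency with 8 ALU cycles of work per fetch gives `threads_to_hide_latency(400, 8)` = 51 threads, while 40 cycles of work per fetch cuts that to 11. The less arithmetic there is per access, the more threads are needed, which is why uncached read-write access leans so heavily on thread count.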
---
Why do you say PCI-Express arbitration is required for multi-chip? The buffers are assigned by the host, which requires commands sent over PCI-Express, but after that what kind of arbitration are you referring to? Why would two chips be "horribly" slower than one chip?
Jawed