It's not at all clear to me that the same can be done for GPUs without a massive BW interface between GPUs, and what the cost of that would be.
I believe it could be possible, assuming of course that there's shared memory (like in multi-socket CPU configs)...
Let's talk about the traditional vertex shader + pixel shader pipeline first. In this case your inputs (buffers and textures) are commonly read-only. The GPU can simply cache them separately; no coherence is needed. Output goes from the pixel shader to the ROP, which does the blending. There's no programmable way to read the render target while you are rendering to it. A tiled rasterizer splits triangles into tiles and renders the tiles separately. You need to have more tiles in flight to saturate a wider GPU. This should also work seamlessly for two GPUs with shared memory: if they are processing different sets of tiles, there are no hazards. Tile buffers obviously need to be flushed to memory after finishing them, but I would assume that this is the common case in a single GPU implementation as well (if the same tile is rendered twice in a row, why was it split in the first place?).
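Just to illustrate why there are no hazards (this is the hardware's work distribution, not something you'd write in a shader, so take it as a hypothetical HLSL-style sketch with made-up names): a fixed tile-to-GPU mapping means the two chips never touch the same tile buffer region.

// Hypothetical sketch: how a tile scheduler could statically partition
// screen tiles between two GPUs sharing memory. Each tile is owned by
// exactly one GPU, so the two GPUs never write the same tile buffer region.
uint OwnerGpu(uint2 tileCoord)
{
    // Checkerboard split: adjacent tiles go to different GPUs,
    // which also balances the load reasonably well.
    return (tileCoord.x + tileCoord.y) & 1;
}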
Now let's move on to UAVs. This is obviously more difficult. However, it is worth noting that DirectX (and other APIs) by default only mandates that writes by a thread group are visible to that same thread group. This is what allows a GCN CU's L1 cache to be incoherent with the other CUs' L1 caches. You need to combine the writes at some point if a cache line was only partially written, but you can simply use a dirty bitmask for that; there's no need for a complex coherency protocol. It's undefined behavior if two thread groups (potentially executing on different CUs) write to the same memory location (group execution order isn't guaranteed and memory order isn't guaranteed = race condition). If we forget for a moment that atomics and the globallycoherent UAV attribute exist, we simply need to combine the partially dirty cache line with the existing data (using the bitmask) when it is written back to memory.
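As a rough HLSL sketch of what that default guarantee means in practice (resource and function names are made up): each group below writes only its own slice of a plain UAV and reads it back after a barrier, which is well defined, while reading another group's slice would be a race.

// Plain UAV: without globallycoherent, cross-group visibility of these
// writes is not guaranteed by the API, so the CU L1 doesn't need to be coherent.
RWStructuredBuffer<uint> scratch; // hypothetical, sized to 64 uints per group

[numthreads(64, 1, 1)]
void GroupLocalScratch(uint3 groupId : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    uint base = groupId.x * 64;

    // Each thread writes one element of its own group's slice.
    scratch[base + groupIndex] = groupIndex * 2;

    // Makes this group's UAV writes visible to the rest of the same group.
    DeviceMemoryBarrierWithGroupSync();

    // Well defined: reading data written by this group.
    uint neighbor = scratch[base + ((groupIndex + 1) & 63)];
    scratch[base + groupIndex] = neighbor;

    // Reading another group's slice here would be a race: group execution
    // order and cross-group memory visibility aren't guaranteed.
}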
The globallycoherent attribute for UAVs is the difficult case. It means that UAV writes must be visible to other CUs after a DeviceMemoryBarrierWithGroupSync. Groups can use it in combination with atomics to ensure data visibility between groups. However, this isn't a common use case in current rendering code; for example, the Unreal Engine code base shows zero hits for "globallycoherent". Atomics, however, are used quite commonly in modern rendering code (without being combined with globallycoherent UAVs). DirectX mandates that atomics are visible to other groups (even without a barrier). The most common use case is a single global counter (atomic add), but you could do random access writes with atomics to a buffer or even a texture (both 2d and 3d texture atomics exist). But I would argue that the bandwidth used for atomics and globallycoherent UAVs is tiny compared to other memory accesses, meaning that we don't need a full width bus between the GPUs for transferring the cache lines touched by these operations that require coherency.
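A hedged HLSL sketch of both cases (resource names are mine): a global counter bumped with an atomic on a regular UAV, and a globallycoherent UAV whose ordinary writes are made visible across groups with a device memory barrier.

// Atomics on a regular UAV are visible to other groups per the DirectX spec.
RWStructuredBuffer<uint> counter; // one global counter

// globallycoherent: ordinary (non-atomic) writes can be made visible to other
// groups as well, which is why this attribute is the expensive case.
globallycoherent RWStructuredBuffer<uint> published;

[numthreads(64, 1, 1)]
void CountAndPublish(uint3 dtid : SV_DispatchThreadID)
{
    // Common case: one atomic counter, i.e. one cache line bouncing around.
    uint writeIndex;
    InterlockedAdd(counter[0], 1, writeIndex);

    // Rare case: an ordinary write that other groups are allowed to observe
    // after the device memory barrier below.
    published[writeIndex] = dtid.x;
    DeviceMemoryBarrierWithGroupSync();
}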
But these operations still exist and must be supported relatively efficiently. So it definitely isn't a trivial thing to scale to a 2P GPU system with memory coherence and automatic load balancing (automatically splitting a single dispatch or draw call across both GPUs).
However, if we compare CPUs and GPUs, I would argue that the GPU seems much simpler to scale up. A CPU is constantly accessing memory. Stack = memory. Compilers write registers to memory very frequently: to pass them to function calls, to index them, and to spill them. There's a potential coherency implication on every read and write. GPU code, on the other hand, is designed to do far fewer memory operations. Most operations are done in registers and in groupshared memory. Writing a result to memory and immediately reading it back afterwards is not a common case. Most memory regions (resources) that are randomly accessed are marked as read only. Most resources that are written are marked as only needing group coherency (a group's threads all execute on the same CU). Resources needing full real time coherency between CUs and between multiple GPUs are rare, and most of these accesses are simple atomic counters (one cache line bouncing between GPUs). This is a much simpler system to optimize than a CPU.
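For example, a typical reduction-style compute shader (a hypothetical sketch, but representative of the pattern) does its work in registers and groupshared memory and touches globally visible memory only through read-only loads plus one atomic per group:

Buffer<float> inputValues;                // read-only: freely cacheable per GPU
RWStructuredBuffer<uint> overflowCounter; // the only cross-CU/cross-GPU coherent traffic

groupshared float s_partial[64];          // on-chip storage, no coherency traffic at all

[numthreads(64, 1, 1)]
void ReduceAndCount(uint3 dtid : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
    // The bulk of the work happens in registers.
    float value = inputValues[dtid.x] * 2.0f + 1.0f;

    // Partial sums stay in groupshared memory (group coherency only).
    s_partial[groupIndex] = value;
    GroupMemoryBarrierWithGroupSync();

    for (uint stride = 32; stride > 0; stride >>= 1)
    {
        if (groupIndex < stride)
            s_partial[groupIndex] += s_partial[groupIndex + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    // The single globally coherent access: one atomic per group.
    if (groupIndex == 0 && s_partial[0] > 100.0f)
        InterlockedAdd(overflowCounter[0], 1);
}

In a shared-memory two-GPU setup, only the overflowCounter cache line would ever need to move between the chips in this kind of shader.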