I think the implementation is more like this: the GPU sends a write over the Onion bus, and the interface logic, either by itself or in conjunction with the request queue, inserts it into the order of requests in the coherent hierarchy.
The invalidation broadcast would happen once the write reaches that point, and any later coherent traffic goes onto the queue as well. Every other attempt to use that cache line hits the queue until the write reaches memory, at which point main memory becomes the final arbiter.
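
To make the ordering I'm guessing at a bit more concrete, here is a toy Python model. It is only a sketch of the idea, not how the hardware actually works: the class `ToyCoherentQueue`, its methods, and the single shared queue are all made up for illustration, and the real interface logic is almost certainly more involved.

```python
from collections import deque


class ToyCoherentQueue:
    """Hypothetical model: one ordered request queue in front of memory."""

    def __init__(self, caches):
        self.caches = caches      # list of dicts: line address -> cached value
        self.memory = {}          # main memory: line address -> value
        self.pending = deque()    # requests in the order they were accepted

    def gpu_write(self, addr, value):
        # The write takes its place in the coherent request order...
        self.pending.append(("write", "gpu", addr, value))
        # ...and invalidations are broadcast, so no cache keeps a stale copy.
        for cache in self.caches:
            cache.pop(addr, None)

    def read(self, source, addr):
        # Later coherent traffic for the line queues up behind the write.
        self.pending.append(("read", source, addr, None))

    def drain(self):
        # Service requests strictly in queue order; once the write lands,
        # memory is the final arbiter for every read that follows it.
        results = []
        while self.pending:
            kind, source, addr, value = self.pending.popleft()
            if kind == "write":
                self.memory[addr] = value
            else:
                results.append((source, addr, self.memory.get(addr)))
        return results
```

In this model a CPU read issued after the GPU write can never see the old cached value: its copy was invalidated when the write was queued, and the read is serviced only after the write has reached memory.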