Couldn't there be an intermediary able to flag chunks of address space as "currently in work", in which case a synchronization process is initiated, or "not currently in work", in which case any device can read/write that space in the fast manner described above, ignoring coherency issues with other devices accessing the RAM? What address-space granularity would be needed (not a flag for each bit, obviously, but what size chunks?), what would the transistor budget be for that sort of table, and is that feasible? Or is that in any way how HSA/hUMA/UMA is intended to function?
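On the sizing question, here's a rough back-of-envelope sketch of what a per-chunk "in work" flag table could cost in storage and transistors. All the numbers are illustrative assumptions on my part (a 16 GiB address space, 4 KiB chunks, 6T SRAM cells, 50% overhead for decode/lookup logic), not anything HSA actually specifies:

```cpp
#include <cstdint>
#include <cstdio>

// Rough sizing of a hypothetical per-chunk "currently in work" flag table.
// Every figure here is an assumption for illustration, not an HSA detail.
int main() {
    constexpr uint64_t addr_space_bytes = 16ull << 30;  // assumed: 16 GiB of shared address space
    constexpr uint64_t chunk_bytes      = 4ull << 10;   // assumed: 4 KiB (page-sized) chunks
    constexpr uint64_t flags            = addr_space_bytes / chunk_bytes;  // one bit per chunk
    constexpr double   xtors_per_bit    = 6.0;          // assumed: 6T SRAM cell per flag bit
    constexpr double   logic_overhead   = 1.5;          // assumed: decoders, ports, compare logic

    const double storage_bits = static_cast<double>(flags);
    const double transistors  = storage_bits * xtors_per_bit * logic_overhead;

    std::printf("chunks/flags: %llu (%.1f Mbit of state)\n",
                (unsigned long long)flags, storage_bits / (1u << 20));
    std::printf("rough transistor budget: ~%.0f million\n", transistors / 1e6);
    return 0;
}
```

Under those assumptions you get about 4 Mbit of flag state and a transistor count in the tens of millions, so the raw storage isn't the hard part; this sketch says nothing about the cost of every client having to consult and update that table on each access.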
HSA promises a unified and coherent memory space, with a very weak consistency model.
A memory value, once updated, should eventually become visible to the rest of the system, with emphasis on eventually.
The memory pipelines are not tightly coupled in the way that the x86 multicore fabric is.
Everything up to that final point in time, unless you are in that specific work-item on the GPU, is borderline a free-for-all.
The values you see at a specific location, or the values visible at other locations written by the same program or by other HSA coprocessors, may not be the same as what another client is observing.
So memory is coherent, but while in the thick of things, don't hope for updates to necessarily make sense.
Fences and load-acquire and store-release operations basically checkpoint the points in the process where things absolutely must start making sense. The CPUs do still require synchronization operations in a more limited subset of cases.
This basically is as you say: do a bunch of work, then signal when the mess has settled.
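As a loose analogy for that pattern, here's a minimal sketch using C++11 atomics rather than HSA runtime or HSAIL code; the `producer`/`consumer` names and the data are just made up for illustration, but the load-acquire/store-release checkpointing is the same idea:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// "Do a bunch of work, then signal when the mess has settled",
// expressed with plain CPU-side C++11 atomics as an analogy.
std::vector<int> results(1024);
std::atomic<bool> done{false};

void producer() {
    for (int i = 0; i < 1024; ++i)
        results[i] = i * i;                        // the "mess": plain, unordered writes
    done.store(true, std::memory_order_release);   // store-release: publish everything above
}

void consumer() {
    while (!done.load(std::memory_order_acquire))  // load-acquire: wait for the signal
        std::this_thread::yield();
    // Only after the acquire succeeds is the producer's work guaranteed visible here.
    std::printf("results[1023] = %d\n", results[1023]);
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
    return 0;
}
```

Between the plain writes and the release store, another observer may legally see any subset of the updates in any order, which is the free-for-all window above; on a device whose native ordering is already strong, the release and acquire cost little or nothing extra, which is the "skipping some of the extra steps" mentioned next.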
The base HSA model doesn't forbid devices with stronger models, which can just skip some of the extra steps in places where they are strong. However, a wide umbrella is necessary for a platform that includes devices with very primitive memory subsystems or different architectures.
I haven't found a description of GCN's coherence protocol, and it is doubly complicated because GCN is internally already weakly ordered and treats coherence as optional. The cache setup amounts to de facto coherence: there's a common L2 that everyone writes to, which is the next tiny step up from not being coherent at all.
I can't speak to Kaveri, but other upcoming GPUs like the one in Orbis have already indicated a pretty onerous flush requirement for coherent traffic (which Mark Cerny said was optimized with an option to just bypass the GPU cache hierarchy). That's indicative of a memory hierarchy on the GPU side that is only just barely capable of being considered coherent with the CPUs.
Even if Kaveri improves on this a little, the GPU subsystem has much longer latencies and a massive amount of in-flight traffic, so sharing would need to be done carefully. Care is already needed when contending for data between CPUs, but the interface and GPU pipeline latencies are such that the costs here will be massively higher.