This is my biggest concern as well. I was assuming Kaveri would have some shared LLC similar to Haswell, given all of the talk of heterogeneous computation and so on, but someone told me that was not the case. Thus, if you need to spill your whole working set to memory every time you want to swap between the CPU and the GPU, it's not going to be a lot more fine-grained than it is today.
Given the very low bar Kaveri sets for coherent memory, the expectation is that data is spilling to memory in everything outside of an initial handoff from CPU to GPU, and even that handoff may not stay on-chip every time.
Onion+ is what AMD is declaring "coherent", and it is a bypass of the unsnooped GPU cache hierarchy. GPU reads can hit in a CPU cache, but traffic in the other direction goes to memory: unless a read happens to catch the data in that narrow window while it sits in the memory queue, it's coming from DRAM.
The GPU caches are likely too primitive, too numerous, and too slow to be plugged into the same coherent broadcast path.
GPU latencies are such that I suspect the longevity of any cached data is not going to be great. Commands should be handled by queues and hardware, and wavefront initialization requires no outside intervention.
That's much lower latency than what has come before, but the baseline was horrible.
The GPU itself is still a very long-latency entity relative to the CPU, so the odds are good that any data you'd want to share is going to be evicted.
Because of the GPU's coarseness, it would favor heavy streamout that would swamp any cache anyway.
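To make that concrete, here is a minimal sketch of what the asymmetry would mean for a simple CPU-to-GPU-and-back handoff. The gpu_dispatch_scale()/gpu_wait() calls are made-up stand-ins (defined as CPU stubs so the sketch compiles), not a real driver or HSA API; the comments mark which accesses could plausibly be serviced on-chip and which end up going through DRAM.

```c
/* Hypothetical sketch of a CPU->GPU->CPU handoff on a Kaveri-like APU.
 * gpu_dispatch_scale()/gpu_wait() are invented stand-ins, defined here as
 * CPU stubs so the sketch compiles; they are not a real API. */
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for enqueueing a kernel that computes out[i] = 2 * in[i]. */
static void gpu_dispatch_scale(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* stub: real work would run on the GPU */
        out[i] = 2.0f * in[i];
}

static void gpu_wait(void) { /* stub: wait for kernel completion */ }

float handoff_example(size_t n)
{
    /* One shared virtual address space is assumed; no explicit copies. */
    float *in  = malloc(n * sizeof *in);
    float *out = malloc(n * sizeof *out);

    /* CPU produces the input: these lines likely sit in the CPU caches. */
    for (size_t i = 0; i < n; i++)
        in[i] = (float)i;

    /* GPU reads of 'in' travel the coherent path and can snoop the CPU
     * caches, so the initial CPU->GPU handoff may avoid DRAM. */
    gpu_dispatch_scale(in, out, n);
    gpu_wait();

    /* GPU writes to 'out' are not pushed into the CPU caches; by the time
     * the CPU reads them back they are almost certainly in DRAM, so the
     * return trip is a memory round trip rather than an on-chip transfer. */
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += out[i];

    free(in);
    free(out);
    return sum;
}
```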
Being able to use pointer-based data structures is nice and all (although hardly required), but it's far more important to be able to share data on-chip. Hopefully either the person who told me this is wrong, or this will get addressed in the next chip.
I suspect it's not wrong.
Not much has been disclosed that would point to it changing.
The architectures as they stand don't look like they can take advantage of such sharing.
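As a rough illustration of the distinction in the quoted post: a shared virtual address space lets the GPU chase CPU-built pointers directly, with no marshalling into flat buffers, but it says nothing about keeping the data on-chip. The gpu_traverse() call below is a made-up stand-in (a CPU stub so the sketch compiles), not any actual API.

```c
/* Hypothetical sketch: with a shared virtual address space, a CPU-built
 * linked list can be handed to the GPU by pointer, with no marshalling
 * into a flat buffer.  Nothing here keeps the nodes on-chip, though; by
 * the time the GPU chases the pointers, each hop is likely a DRAM access. */
#include <stdlib.h>

struct node {
    int          value;
    struct node *next;   /* the same virtual address is valid on CPU and GPU */
};

static void gpu_traverse(struct node *head)   /* stub for a kernel launch */
{
    for (struct node *n = head; n != NULL; n = n->next)
        n->value *= 2;   /* stand-in for GPU-side pointer chasing */
}

int main(void)
{
    struct node *head = NULL;
    for (int i = 0; i < 1024; i++) {          /* CPU builds the structure */
        struct node *nd = malloc(sizeof *nd);
        nd->value = i;
        nd->next  = head;                     /* plain pointers, no translation step */
        head = nd;
    }

    gpu_traverse(head);   /* hand over the head pointer as-is */
    return 0;
}
```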
I'm also not totally clear on how the GPU-sends-work-to-the-CPU side of things is supposed to happen. Is the HSA runtime going to own threads that it spins up and down to do the work, firing callbacks into user code?
I'm not sure about the CPU side; I think the HSA runtime is part of it. However, AMD stated that that kind of thread management could be accessed at a lower level.
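For what it's worth, if the runtime does own a pool of CPU threads servicing a queue that the GPU can enqueue into, the user-visible shape might look something like the sketch below. Every name here (the queue, enqueue_cpu_work(), the callback signature) is invented for illustration; it is not AMD's or HSA's actual interface.

```c
/* Hypothetical sketch of a runtime-owned CPU worker servicing a queue and
 * firing callbacks into user code.  All names are made up; this only shows
 * the shape of the mechanism, not an actual runtime interface. */
#include <pthread.h>
#include <stdio.h>

#define QUEUE_DEPTH 64

typedef void (*work_callback)(void *arg);

struct work_item {
    work_callback fn;
    void         *arg;
};

static struct work_item queue[QUEUE_DEPTH];
static int head, tail;
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

/* On real hardware the GPU (or its scheduler) would place packets like this
 * into a user-mode queue; here a plain CPU-side function stands in for that. */
void enqueue_cpu_work(work_callback fn, void *arg)
{
    pthread_mutex_lock(&lock);
    queue[tail % QUEUE_DEPTH] = (struct work_item){ fn, arg };
    tail++;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
}

/* Runtime-owned CPU thread: pops work items and calls back into user code. */
static void *runtime_worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail)
            pthread_cond_wait(&ready, &lock);
        struct work_item item = queue[head % QUEUE_DEPTH];
        head++;
        pthread_mutex_unlock(&lock);
        if (item.fn == NULL)          /* sentinel: shut the worker down */
            return NULL;
        item.fn(item.arg);            /* callback into user code */
    }
}

/* Example user callback the "GPU" asks the CPU to run. */
static void user_callback(void *arg)
{
    printf("CPU servicing GPU request: %s\n", (const char *)arg);
}

int main(void)
{
    pthread_t worker;
    pthread_create(&worker, NULL, runtime_worker, NULL);

    /* Stand-in for a packet the GPU would enqueue when it wants CPU help. */
    enqueue_cpu_work(user_callback, "some host-side task");
    enqueue_cpu_work(NULL, NULL);     /* sentinel so the sketch terminates */

    pthread_join(worker, NULL);
    return 0;
}
```

On real hardware the enqueue would presumably be a write into a user-mode queue that the GPU can perform itself, with a doorbell or signal standing in for the condition variable used here.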