If dealing with standard API stuff, there are a number of places where they'd have to make such a trip, such as handling occlusion queries (edit: and then routing the results back), which requires CPU intervention through commands put through the runtime and driver. Because the latencies are massive and unpredictable, the results simply don't get used on the current frame, and possibly not for multiple frames.
More integrated architectures or lower-level APIs would have lower latency, or remove the outside intervention altogether.
In relative terms, though, it wouldn't be considered heavy, at least in terms of frequency.
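To make the occlusion-query case concrete, here's a minimal sketch of the usual workaround: issue the query now, and only read the result back a couple of frames later so the CPU never stalls waiting on the GPU. This is a plain Python simulation, not real graphics API code. `DeferredQueryRing`, `latency_frames`, the object name, and the sample counts are all made up for illustration.

```python
from collections import deque


class DeferredQueryRing:
    """Hypothetical helper: hands results back only `latency_frames` after issue."""

    def __init__(self, latency_frames=2):
        self.latency_frames = latency_frames
        self.pending = deque()  # entries are (frame_issued, object_id, samples)

    def issue(self, frame, object_id, visible_samples):
        # In a real renderer the GPU fills the result in later. This simulation
        # stores it right away but refuses to hand it back early.
        self.pending.append((frame, object_id, visible_samples))

    def collect(self, current_frame):
        """Return {object_id: samples} for queries old enough to read safely."""
        ready = {}
        while (self.pending
               and current_frame - self.pending[0][0] >= self.latency_frames):
            _, object_id, samples = self.pending.popleft()
            ready[object_id] = samples
        return ready


# Assumed per-frame visibility of one object (illustrative numbers only).
samples_per_frame = [100, 50, 0, 0]

ring = DeferredQueryRing(latency_frames=2)
visibility = {}
for frame, samples in enumerate(samples_per_frame):
    visibility.update(ring.collect(frame))  # results from two frames back
    ring.issue(frame, "crate", samples)     # this frame's query, read later
```

On the last frame the CPU is still acting on the count from two frames earlier, which is the trade-off: no stall, but stale data.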
For CPU-to-CPU communication, the worst case for getting at the data would be main memory latency, so over a hundred cycles, with in-cache access taking a handful of cycles. Even so, it's used judiciously.
Doing the same thing with the latest APUs, by sending a command to a GPU buffer to make the results of compute available without using Onion+, would according to Vgleaks have a worst case of tens of thousands of GPU cycles.
The predictability of the GPU's queueing is not that great at present, though. That could still make the case for buying 33 ms or so by working on the previous frame's data.
With Onion+, a bandwidth-restricted amount of data could be sent from the GPU to main memory and then back to a requesting CPU after several hundred cycles. It's a minority of the data being processed.
How debilitating it is would be commensurate with the amount of synchronization needed between the two sides.
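For a rough sense of scale, the figures above can be converted into wall-clock time. This is back-of-the-envelope only. The clock speeds (1.6 GHz CPU, 800 MHz GPU) are assumptions, the cycle counts are picks from within the ranges quoted above rather than measurements, and counting the Onion+ trip in CPU cycles is also an assumption.

```python
def cycles_to_us(cycles, clock_hz):
    """Convert a cycle count at a given clock rate into microseconds."""
    return cycles / clock_hz * 1e6


CPU_HZ = 1.6e9  # assumed CPU clock
GPU_HZ = 800e6  # assumed GPU clock

costs_us = {
    # "a handful of cycles" for an in-cache access
    "cpu_cache_hit": cycles_to_us(5, CPU_HZ),
    # "over a hundred cycles" for main memory
    "cpu_main_memory": cycles_to_us(150, CPU_HZ),
    # "several hundred cycles" via Onion+ (CPU clock assumed here)
    "onion_plus_round_trip": cycles_to_us(500, CPU_HZ),
    # "tens of thousands of GPU cycles" without Onion+
    "gpu_buffer_worst_case": cycles_to_us(20_000, GPU_HZ),
}

# One frame at 30 fps, which is where the "33 ms or so" comes from.
frame_us = 1e6 / 30

for name, cost in costs_us.items():
    print(f"{name:>24}: {cost:10.4f} us  ({cost / frame_us:.6%} of a frame)")
```

Under these assumptions even the worst-case trip is a small fraction of a frame, so the raw cycle cost alone wouldn't justify pushing work a whole frame back. The unpredictable queueing is what makes that case.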
The best GPGPU methods are painful at present, and are used sparingly. They just aren't horrific anymore.
If long-running compute that mostly handles itself on the GPU, with occasional trips through Onion+, can be done, it might lead to a somewhat freer interplay with the CPU, because it should remove much of the multi-frame queueing latency that can accumulate if the GPU is under load. Presentations from Sucker Punch on the PS4 indicate this is still troublesome. For PC drivers, it might be an application killer, since such a kernel isn't one that would conclude in time for a driver's timeout/freakout limit.