One doesn't need to add much. Usual GPUs have that on board already, and so do Orbis and Durango. A transfer of a cacheline (64 bytes) takes two bursts, i.e. four command-clock/eight data-clock cycles, on one 32-bit channel.
The read-write turnaround for the bus is a multiple of that time period. At 5.5 Gbps, the best-case dead time during which the bus transfers nothing is 13 command clocks, worst case 20:
[CLmrs + (BL/4) + 2 - WLmrs] * tCK
with CLmrs in {16...20}, BL = 8, WLmrs in {4, 6, 7}:
best case 16 + 2 + 2 - 7 = 13, worst case 20 + 2 + 2 - 4 = 20 command clocks
http://www.hynix.com/datasheet/pdf/graphics/H5GQ1H24AFR(Rev1.0).pdf
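For concreteness, the dead time can be evaluated over the latency combinations in the datasheet (a sketch; treating every CL/WL pairing as programmable independently is an assumption on my part):

```python
# Read-to-write turnaround dead time on a GDDR5 channel:
# [CL + BL/4 + 2 - WL] command clocks (tCK factored out).
BL = 8  # burst length

def turnaround_ck(cl, wl):
    """Command clocks during which the bus carries no data."""
    return cl + BL // 4 + 2 - wl

# CL 16..20 and WL in {4, 6, 7}, per the mode-register ranges above.
latencies = [turnaround_ck(cl, wl)
             for cl in range(16, 21) for wl in (4, 6, 7)]
print(min(latencies), max(latencies))  # prints "13 20"
```

The best case (13) comes from the lowest CAS latency paired with the highest write latency; the worst case (20) from the opposite corner.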
That would be six to ten bursts (three to five cacheline transfers) left unutilized, with the length of time before the next transition depending on what utilization level is considered good enough, balanced against the latency requirements of the CPU.
If it were the GPU alone running a traditional graphics workload, that looks well-handled.
The unknown in my eyes is the octa-core Jaguar part of the APU and Mark Cerny's desire to leverage asynchronous compute heavily.
This falls heavily on the CPU memory controller, onion bus, and the customizations in the L2, since this traffic does not rely on the ROPs.
I'm hoping for disclosure on this as developers have time to work on it. Existing desktop APUs aren't benchmarked for bandwidth utilization with the CPU and GPU sides both under load. Latency for the CPU side is usually mediocre at best, but it's difficult to determine how well the GPU side is catered to, since current APUs tend to be bandwidth-strangled.
Orbis would be a real test as to how well AMD's controller tech really handles the disparate needs of the two sides, and the compute customizations show a strong desire to keep a handle on the access patterns of the compute jobs.
Different memory controllers can of course have completely separate memory operations in flight. It works basically like the banking of the eSRAM (which the eSRAM quite probably uses) and already helps to avoid conflicts.
There are definitely optimizations that can be done here that can make the job easier on the memory interface. Making jobs more aware of what phase in the frame-generation time the GPU is in would help, as would making target buffer allocation more aware of what controllers would be used.
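As an illustration of what controller-aware allocation means (a sketch; the four-channel layout and 256-byte interleave granularity are assumptions for the example, not disclosed values):

```python
# Sketch: map physical addresses to memory channels under simple
# address interleaving, and check which channels a buffer would touch.
NUM_CHANNELS = 4   # assumed channel count
INTERLEAVE = 256   # assumed interleave granularity in bytes

def channel_of(addr):
    """Channel servicing a given physical address."""
    return (addr // INTERLEAVE) % NUM_CHANNELS

def channels_touched(base, size):
    """Set of channels a buffer of `size` bytes at `base` spans."""
    return {channel_of(a) for a in range(base, base + size, INTERLEAVE)}

# A small aligned buffer stays on one controller; a larger one stripes
# across all of them. Placement and alignment decide which controllers
# a render target's traffic lands on.
print(channels_touched(0, 256))    # {0}
print(channels_touched(0, 1024))   # {0, 1, 2, 3}
```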
The ROPs on the other side already include specialized caches precisely for this coalescing/write combining and to increase the access granularity seen by the memory.
They are also such a major bandwidth consumer that they are placed right next to the memory controllers, so we know their miss rates are very high (edit: misses creating demand bandwidth versus other client types).
The CPU and compute sides would actively contend for the bus, and the ROP caches cannot do more than make the color and Z traffic as well-behaved as they can in the face of an unknown.
Btw., as the eSRAM sits outside/behind all the GPU caches, the same granularities will apply.
There shouldn't be a significant turnaround penalty, if any, even in the likely case that it's a bidirectional interface. Reads can switch to writes and back to reads all day with no bandwidth loss, and much less queuing would be needed.
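The bandwidth cost of turnaround on the external bus can be modeled roughly (a sketch: the four-command-clock cacheline transfer and the 13/20-clock penalties come from the earlier arithmetic; the run lengths between switches are made-up parameters):

```python
# Rough model of effective bus utilization under read/write switching.
# A 64-byte cacheline transfer occupies 4 command clocks on a 32-bit
# GDDR5 channel; each read->write or write->read switch wastes
# `penalty_ck` command clocks. A no-turnaround interface (eSRAM-style)
# corresponds to penalty_ck = 0 and stays at 100% regardless of mixing.
TRANSFER_CK = 4

def utilization(transfers_per_switch, penalty_ck):
    """Fraction of command clocks spent actually moving data."""
    busy = transfers_per_switch * TRANSFER_CK
    return busy / (busy + penalty_ck)

for penalty in (13, 20):          # best/worst turnaround from above
    for run in (4, 16, 64):       # cachelines moved between switches
        print(penalty, run, round(utilization(run, penalty), 3))
```

The point of the model: short read/write runs are brutal on GDDR5 (well under 60% utilization), which is why the controller wants deep queues to batch same-direction traffic, while a penalty-free interface needs none of that.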