If a CPU access goes all the way to memory, it passes through the individual core-to-L1 and L1-to-L2 connections, then the L2 interface, the system request queue, and finally the memory controller.
Jaguar is a low-power architecture, so there are already things like half-width buses between the caches, and each cluster's L2 interface can only transfer a limited amount of data per clock.
The magic of caches is that (usually, hopefully) you don't need to lean on these connections more than 10% (or some other small number) of the time per level. So ideally only ~10% of accesses leave the L1, and of those, only ~10% miss the L2 and go out to the memory interface.
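To put rough numbers on that filtering effect, here's a quick sketch; the 10% miss rates are the illustrative figures from above, not measured Jaguar hit rates:

```python
# Illustrative model of how per-level miss rates filter traffic
# down the cache hierarchy. Miss rates here are placeholders.

def traffic_per_level(accesses, miss_rates):
    """Return how many accesses reach each successive level."""
    levels = [accesses]
    for miss_rate in miss_rates:
        accesses *= miss_rate  # only misses continue outward
        levels.append(accesses)
    return levels

# 1000 CPU accesses, 10% miss rate at L1 and at L2
print(traffic_per_level(1000, [0.1, 0.1]))
# [1000, 100.0, 10.0] -> only ~1% of accesses reach memory
```

So each level of filtering multiplies, which is why the narrow half-width buses further out are usually tolerable.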
AMD's architectures past this point have a memory crossbar with fixed-width connections, which has itself been a bottleneck in the past.
There is also a request queue that coherent accesses (most CPU accesses) must pass through so they can be kept in order, snoop the other cluster's L2, and so the Onion bus can do its job.
This is all in the uncore and would be part of the northbridge.
This ordered and coherent domain has to manage the traffic between the clients and broadcast coherence information (a lot of back and forth) between them.
AMD probably wants to do this all without too much hardware, power, latency, or engineering effort.
Contrast this with Garlic, which is not coherent: it serves a GPU whose memory model is very weakly ordered, not CPU-coherent, and whose memory subsystem tolerates extremely long latency. It is a comparatively simple connection from the GPU memory subsystem to the memory controllers.
Thanks. I'm still wondering why AMD would design more throughput for a DDR3-based APU versus a GDDR5-based APU.
But I don't want to push the discussion off topic.