I know this question may sound very newbie.
How can the main RAM be having all these contention problems from the CPU and GPU?
At a fundamental level, any device has physical constraints on its behavior, and the DRAM bus and DRAM arrays come with very significant restrictions and latencies.
DRAM favors highly linear accesses and long stretches of pure reads or pure writes. There are historical reasons for this, like cutting costs by making read and write traffic share the same wires, and prioritizing array density to the point that the DRAM arrays themselves have not gotten faster in years.
Access a bank that hasn't had enough time to get ready, or force the bus and the DRAM to switch from reading to writing, and you start getting long stretches with no memory traffic at all.
A single switch from read mode to write mode on GDDR5, for example, costs roughly as much time as 28 or so data transfers.
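To put that in bytes, here is a minimal sketch, assuming the whole 256-bit GDDR5 bus turns around together so that one transfer slot moves 32 bytes (the 32-byte width is my assumption for illustration; the 28-transfer figure is from above). The per-turnaround cost works out to on the order of 900 bytes of traffic that never happens:

```c
/* Back-of-the-envelope cost of one read->write turnaround on GDDR5. */
#include <stdio.h>

int main(void) {
    const int lost_transfers     = 28;  /* transfer slots idled by the turnaround (from the text) */
    const int bytes_per_transfer = 32;  /* 256-bit bus moves 32 bytes per transfer (assumption)   */

    printf("~%d bytes of bus time lost per read->write switch\n",
           lost_transfers * bytes_per_transfer);
    return 0;
}
```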
The oft-quoted figures of memory operations taking hundreds of cycles are best-case numbers, valid when the DRAM is actually ready to respond to the CPU's request. Get the access pattern wrong, and the DRAM wastes more time getting everything ready than actually sending data.
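Here is a minimal sketch of that effect (assuming a POSIX system for clock_gettime; not a rigorous benchmark). It reads the same 256 MiB buffer once sequentially and once with a 16 KiB stride; the strided walk wastes most of every cache-line fill and keeps forcing new DRAM rows open, so it typically lands at a fraction of the sequential speed even though it touches exactly the same data:

```c
/* Sequential vs. strided reads over a buffer much larger than the caches. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)  /* 64M ints = 256 MiB, far larger than any cache */

static double walk(const int *a, size_t stride) {
    struct timespec t0, t1;
    volatile long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < stride; s++)        /* cover every element exactly once */
        for (size_t i = s; i < N; i += stride)
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;  /* touch the pages up front */

    double seq = walk(a, 1);     /* linear: friendly to prefetchers and DRAM rows */
    double str = walk(a, 4096);  /* 16 KiB stride: hops across pages and banks    */
    printf("sequential: %.3f s, strided: %.3f s (%.1fx slower)\n",
           seq, str, str / seq);
    free(a);
    return 0;
}
```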
It is difficult to get this right; even the best CPU memory controllers, in fairly favorable benchmarks, fall tens of percent short of peak bandwidth. In practice, CPUs don't commonly consume that much bandwidth anyway, thanks to their caches and to code that doesn't need a memory access every cycle. They tend to care more about getting smaller amounts of data quickly.
GPUs are generally better at using large amounts of bandwidth, and they do it by accepting long latencies and keeping a massive number of accesses in flight. Throw enough accesses at the memory system over time, and a favorable pattern tends to fall out of it.
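A rough way to see why so many accesses need to be in flight is Little's law: sustained bandwidth is roughly the number of bytes outstanding divided by the latency of each access. A sketch with illustrative numbers (the 176 GB/s peak corresponds to a 256-bit GDDR5 bus at 5.5 Gbps; the 300 ns latency and 64-byte access size are assumptions for the arithmetic):

```c
/* Little's law sketch: to sustain B bytes/s at latency L seconds,
 * roughly B * L bytes must be outstanding at all times. */
#include <stdio.h>

int main(void) {
    const double bandwidth_Bps = 176e9;   /* 256-bit GDDR5 at 5.5 Gbps (assumed peak) */
    const double latency_s     = 300e-9;  /* assumed end-to-end memory latency        */
    const double request_bytes = 64.0;    /* one cache-line-sized access              */

    double in_flight_bytes    = bandwidth_Bps * latency_s;
    double in_flight_requests = in_flight_bytes / request_bytes;
    printf("~%.0f KB (~%.0f requests) must be in flight to sustain peak bandwidth\n",
           in_flight_bytes / 1e3, in_flight_requests);
    return 0;
}
```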
If both clients are going full-bore, the CPU takes a latency hit because the GPU is occupying the bus some of the time, and the GPU loses bandwidth because the CPU's insistence on low latency forces its long stretches of traffic to be cut off prematurely.
And because it's generally a bad idea for the GPU and CPU to touch the exact same portions of memory, the DRAM loses even more time hopping between distant regions of the address space, which means more dead cycles.
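As a purely hypothetical illustration of that interleaving cost: if every switch between the GPU's long bursts and other traffic costs a fixed turnaround penalty, bus efficiency falls off quickly as the bursts get chopped shorter. This is a toy model, not how any real memory controller is specified:

```c
/* Toy model of bus efficiency when long GPU bursts are interrupted.
 * All numbers are illustrative. */
#include <stdio.h>

int main(void) {
    const double switch_cost = 28.0;                  /* transfer slots lost per switch (from the text) */
    const double bursts[]    = { 256.0, 64.0, 16.0 }; /* hypothetical GPU burst lengths, in transfers   */

    for (int i = 0; i < 3; i++) {
        double efficiency = bursts[i] / (bursts[i] + switch_cost);
        printf("burst of %3.0f transfers -> ~%2.0f%% of peak bandwidth\n",
               bursts[i], efficiency * 100.0);
    }
    return 0;
}
```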
How can they possibly be using so much memory bandwidth on the CPU side?
A Jaguar core can read 16 bytes per cycle for its SSE/AVX operations. With 8 of them at 1.75 GHz, that's 224 GB/s before counting any writes. It is possible to write code that spams memory traffic like that, but it is not a compelling scenario for a low-power CPU to target.
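The arithmetic behind that figure, using only the numbers above:

```c
/* Peak read demand of 8 Jaguar cores, each issuing one 16-byte load per cycle. */
#include <stdio.h>

int main(void) {
    const double bytes_per_cycle = 16.0;    /* one 128-bit SSE/AVX load per core per cycle */
    const double cores           = 8.0;
    const double clock_hz        = 1.75e9;

    printf("peak read demand: %.0f GB/s\n", bytes_per_cycle * cores * clock_hz / 1e9);
    return 0;
}
```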
The assumption is that the caches reduce the need for external bandwidth, and the actual northbridge bus that connects the cores to memory is significantly narrower and likely clocked at a lower multiplier than the cores, most likely for power and complexity reasons. That is what leads to the roughly 30 GB/s limit on coherent bandwidth.
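A quick consequence of those two numbers: for 224 GB/s of potential demand to fit through a roughly 30 GB/s coherent path, the caches have to absorb something like 85-90% of the traffic. A sketch of that ratio, using only the figures above:

```c
/* How much traffic the caches must absorb for core demand to fit the coherent path. */
#include <stdio.h>

int main(void) {
    const double core_demand_GBps = 224.0;  /* peak read demand from all 8 cores (above) */
    const double coherent_GBps    = 30.0;   /* coherent bandwidth limit (above)          */

    double external_fraction = coherent_GBps / core_demand_GBps;
    printf("at most ~%.0f%% of that demand can go off-chip;\n"
           "the caches have to satisfy the other ~%.0f%%\n",
           external_fraction * 100.0, (1.0 - external_fraction) * 100.0);
    return 0;
}
```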