That's certainly how I would expect things to work, although I think there's a tendency on the CPU vendor side to try to come up with reasons to couple them more tightly (and, unsurprisingly, to bias importance towards the CPU).
ARMv7 makes recommendations about cache implementation, and many of its cache-maintenance operations are built to distinguish between L2 as an external cache and the levels above it as tightly coupled. You can get around this, of course, but it's easier just to add an L0 as the nearest-level cache and use L1 as you would L2.
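For what it's worth, you can see that split baked into the ARMv7-A maintenance operations themselves: cleaning a line to the Point of Unification (roughly, the inner, tightly coupled levels) is a distinct instruction from cleaning to the Point of Coherency (which reaches past an external L2). A minimal sketch, assuming privileged execution and GCC-style inline assembly; the wrapper names are mine, the encodings are from the ARMv7-A architecture manual:

```c
/* ARMv7-A distinguishes cache levels in its maintenance ops: PoU
 * (Point of Unification) covers the tightly coupled inner levels,
 * while PoC (Point of Coherency) reaches past an external L2. */
static inline void clean_dcache_line_to_pou(void *mva) {
    /* DCCMVAU: clean data cache line by MVA to Point of Unification */
    __asm__ volatile("mcr p15, 0, %0, c7, c11, 1" :: "r"(mva) : "memory");
}

static inline void clean_dcache_line_to_poc(void *mva) {
    /* DCCMVAC: clean data cache line by MVA to Point of Coherency */
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(mva) : "memory");
}
```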
In either case, most CPU designs for such an SoC are intended for the system as a whole, and the CPU design is unlikely to focus solely on CPU performance.
I disagree. The CPU vendor is designing a cache to meet the needs of the CPU, which are subtly different from those of the GPU; you really need one bulk level of caching between the CPU and any shared infrastructure to avoid the CPU being mangled by latency.
Yes and no. Again, you're thinking in terms of a desktop system, where the bus is incredibly high-latency and low-bandwidth compared to the last level of CPU cache. Typical ARM SoCs have high-speed, low-latency on-chip buses that can usually saturate the L2 array and provide acceptable (2-3 CPU cycle) latency for a read access.
And since we're talking ARM here, the application is always (for now) going to be an SoC. ARM is a system design company as much as a CPU design company. Their IP will take into account system needs as well as the needs of the CPU. Other ARM architectural licensees do the same.
Not the same thing: a true shared cache does not require snooping protocols; traffic just flows through the cache and is inherently coherent.
I'm not sure how you imagine a shared cache works, but snooping is definitely required at all times. Before a write-back to the shared cache occurs, a snoop barrier must be sent to all other possible write sources lest there be a hazard. Additionally, snoop kills and invalidates must be supported for the case where you write to a cached address that is resident in another core's L1 or L0.
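To make that concrete, here's a toy two-core model of exactly that traffic (nothing vendor-specific; names like snoop_for_hazard are made up for the demo): a write first snoops the other cores for a dirty copy before anything reaches the shared cache, then kills any stale copies in their private caches:

```c
/* Toy model of the snoop traffic described above: before a line is
 * written back to a shared L2, every other potential writer is snooped,
 * and a write that hits a line cached in another core's L1/L0 forces an
 * invalidate there. Purely illustrative, not any real protocol. */
#include <stdio.h>
#include <stdbool.h>

#define NCORES 2
#define NLINES 4

typedef enum { INVALID, SHARED, MODIFIED } state_t;

static state_t l1[NCORES][NLINES];   /* per-core private cache state */

/* Snoop barrier: ask every other core whether it holds 'line' dirty. */
static bool snoop_for_hazard(int writer, int line) {
    for (int c = 0; c < NCORES; c++)
        if (c != writer && l1[c][line] == MODIFIED)
            return true;             /* write-back hazard detected */
    return false;
}

/* Snoop kill: invalidate 'line' in every other core's private cache. */
static void snoop_invalidate(int writer, int line) {
    for (int c = 0; c < NCORES; c++)
        if (c != writer && l1[c][line] != INVALID) {
            printf("snoop kill: core %d, line %d\n", c, line);
            l1[c][line] = INVALID;
        }
}

static void write_line(int core, int line) {
    if (snoop_for_hazard(core, line))  /* real HW would also pull the dirty data */
        printf("core %d stalls: line %d dirty elsewhere\n", core, line);
    snoop_invalidate(core, line);      /* others must not keep stale copies */
    l1[core][line] = MODIFIED;
}

int main(void) {
    l1[1][2] = MODIFIED;               /* core 1 owns line 2 */
    write_line(0, 2);                  /* core 0's write triggers the snoops */
    return 0;
}
```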
Yes, external memory may be much slower than internal buses, but that doesn't change the fact that pushing GPU traffic through the CPU cache would just thrash the cache to the extreme detriment of the CPU. Further, when you actually look at the bandwidth typically provided by coherency buses, it is often poor in comparison to even the external memory bandwidth. Obviously the latter could be viewed as an implementation issue, but when you look at the problems associated with feeding a non-latency-tolerant CPU (none of them are latency tolerant), I doubt this is going to change significantly.
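As a toy illustration of the thrashing point, take a direct-mapped stand-in for a shared L2 (all sizes invented for the demo): a small CPU working set that would otherwise stay resident gets wiped out every pass by a GPU-style stream that never reuses a line:

```c
/* Toy direct-mapped "shared L2" showing why streaming GPU traffic
 * through a CPU cache is destructive: the stream evicts the CPU's
 * working set even though the GPU never reuses a line. */
#include <stdio.h>

#define SETS 256
static long tags[SETS];

static int access(long addr) {          /* returns 1 on hit */
    int set = (int)(addr % SETS);
    if (tags[set] == addr) return 1;
    tags[set] = addr;                    /* miss: evict the old occupant */
    return 0;
}

static double cpu_hit_rate(int share_with_gpu) {
    int hits = 0, refs = 0;
    for (int i = 0; i < SETS; i++) tags[i] = -1;
    for (int round = 0; round < 100; round++) {
        for (long a = 0; a < 64; a++) {         /* small CPU working set */
            hits += access(a);
            refs++;
        }
        if (share_with_gpu)
            for (long a = 0; a < 4096; a++)     /* GPU streams, zero reuse */
                access(1000000 + round * 4096 + a);
    }
    return (double)hits / refs;
}

int main(void) {
    printf("CPU alone : %.0f%% hits\n", 100 * cpu_hit_rate(0));  /* ~99% */
    printf("CPU + GPU : %.0f%% hits\n", 100 * cpu_hit_rate(1));  /* ~0%  */
    return 0;
}
```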
You'd be surprised. We're talking ~233MHz LP-DDR1/2 here, perhaps on a 64-bit bus, but in most cases a 32-bit one. Not what you'd see in a desktop. We're also talking DRAM that requires wake-from-low-power-state, which adds even more latency.
In contrast, the on-chip bus is generally ~800MHz 64-bit. Some high-end chips (I'd venture Marvell's) may expand this to dual buses with a 128-bit data path.
For ~1GHz CPUs, that's really not much latency to go over the bus.
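Back-of-envelope peak figures for those numbers (assuming the LP-DDR interface moves data on both clock edges and the on-chip bus once per cycle; sustained rates will of course be lower):

```c
/* Peak-bandwidth comparison for the bus/DRAM figures quoted above. */
#include <stdio.h>

static double gbytes_per_s(double mhz, int bits, int xfers_per_clk) {
    return mhz * 1e6 * (bits / 8.0) * xfers_per_clk / 1e9;
}

int main(void) {
    printf("233MHz 32-bit LP-DDR  : %.1f GB/s peak\n",
           gbytes_per_s(233, 32, 2));    /* ~1.9 GB/s */
    printf("233MHz 64-bit LP-DDR  : %.1f GB/s peak\n",
           gbytes_per_s(233, 64, 2));    /* ~3.7 GB/s */
    printf("800MHz 64-bit on-chip : %.1f GB/s peak\n",
           gbytes_per_s(800, 64, 1));    /* ~6.4 GB/s */
    printf("800MHz dual 128-bit   : %.1f GB/s peak\n",
           gbytes_per_s(800, 256, 1));   /* ~25.6 GB/s, both buses combined */
    return 0;
}
```

Even in the worst case here, the on-chip fabric has several times the peak bandwidth of the external DRAM, which is the point.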
Note that I think once we get to systems with tens of MBs available for caches, the dynamics of this are likely to change; unfortunately we're not quite at that point.
GPUs in this class of SoC aren't going to eat up a ton of memory, and they aren't going to eat it up in a haphazard, nondeterministic way like CPU code would. Generally speaking, while sharing the L2 does affect CPU performance by some percentage, system performance is more important. This isn't Intel. ARM vendors don't sell CPUs; they sell SoCs.