Coherency updates could be communicated directly from cache to cache; they don't necessarily have to go through main memory. That's what Intel provides with the F (Forward) state in the MESIF protocol.
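To make the cache-to-cache idea concrete, here is a toy sketch of a read miss being serviced by a peer cache instead of main memory. It is a heavily simplified model of MESIF-style forwarding, not a description of any real chip: the `Cache` class, the `read` function, and the dict standing in for memory are all made up for illustration, and writes, evictions, and invalidations are left out.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"
    F = "Forward"


class Cache:
    """Hypothetical cache: maps address -> (State, data)."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def lookup(self, addr):
        return self.lines.get(addr, (State.I, None))


def read(requester, peers, memory, addr):
    """Read addr; return (data, source), where source is a cache name or 'memory'."""
    state, data = requester.lookup(addr)
    if state is not State.I:
        return data, requester.name  # local hit, no coherency traffic

    # Snoop the peers: a cache holding the line in M, E, or F answers directly.
    for peer in peers:
        peer_state, peer_data = peer.lookup(addr)
        if peer_state in (State.M, State.E, State.F):
            if peer_state is State.M:
                memory[addr] = peer_data  # simplification: write dirty data back
            peer.lines[addr] = (State.S, peer_data)       # old holder drops to Shared
            requester.lines[addr] = (State.F, peer_data)  # newest reader becomes forwarder
            return peer_data, peer.name

    # No peer can forward, so fall back to main memory.
    data = memory[addr]
    shared_elsewhere = any(p.lookup(addr)[0] is State.S for p in peers)
    requester.lines[addr] = (State.F if shared_elsewhere else State.E, data)
    return data, "memory"


memory = {0x40: 123}
a, b, c = Cache("A"), Cache("B"), Cache("C")
first = read(a, [b, c], memory, 0x40)   # nobody has it: comes from memory
second = read(b, [a, c], memory, 0x40)  # A holds it Exclusive: A forwards
third = read(c, [a, b], memory, 0x40)   # B is now the forwarder: B forwards
print(first, second, third)
```

Only the first read touches memory; the second and third are satisfied cache to cache, with the F state hopping to whichever cache read the line most recently.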
Is there any definitive source on the alleged L2 arrangement? The only one I'm aware of is a pretty old rumor.
The old rumor matches what I was told by another developer.
I'd guess some metric somewhere led them to the 2MB cache size for the "primary" processor, and die size probably dictated the size of the other caches.
The caches being separate is probably a consequence of taking the easiest way to add snoop logic to an existing processor.
It's not a horrible arrangement if you look at how games use the PS3 and Xbox 360 processors, and if the snoop logic moves data directly from L2 to L2.