I would expect that most traffic on "leading mobile devices", using ImgTec GPUs, is vertex traffic, due to the interpolator-buffer that is streamed out and in again.
But I wouldn't expect that to be the usual case for other GPUs. In my software renderer, most of the traffic comes from the zbuffer (I don't have any hierarchical Z, though), followed by the framebuffer. After that, if you only consider passes that contribute to the final image, it's texture DMA; but if you take other passes into account, e.g. shadows, it's vertex data, and then texture data.
That's because textures usually compress well. One vertex is easily 32 bytes, while texels are 4 or 8 bits. If you consider that ~half of the vertices are processed because of backfaces, you already have a 16:1 ratio of vertex vs. texture.
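A rough sketch of that arithmetic, in case it helps. This is only one reading that reproduces the 16:1 figure: it assumes 8-bit texels and that the "half" from backfaces scales the vertex side. The function name and parameters are made up for illustration.

```python
def vertex_to_texel_ratio(vertex_bytes, texel_bits, vertex_fraction):
    """Per-element size ratio of vertex data to texel data.

    vertex_fraction is a hypothetical scale factor for the vertex side
    (here: the ~half mentioned in connection with backfaces).
    """
    vertex_bits = vertex_bytes * 8
    return vertex_bits * vertex_fraction / texel_bits


# 32-byte vertex, 8-bit texel, half of the vertices counted
ratio_8bit = vertex_to_texel_ratio(32, 8, 0.5)

# Same vertex size with 4-bit texels
ratio_4bit = vertex_to_texel_ratio(32, 4, 0.5)

print(ratio_8bit, ratio_4bit)
```

With these assumptions the 8-bit case gives 16 and the 4-bit case gives 32; this is a per-element comparison only and says nothing about how many vertices or texels are actually fetched per frame.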
I use some caches, not to lower latency but to lower the bandwidth requirement, freeing some resources for other parts of the system.
I haven't read any paper on Mali, but I think someone said it's deferred, though not tiled like ImgTec, but rather binning on a per-drawcall basis? In that case I wouldn't expect the framebuffer to benefit from the cache. (Sorry if that was already said here; I haven't read the whole thread, just the last 10 posts or so.)