16 MB is enough for 720p forward rendering with no MSAA: a 720p RGBA16f color buffer + 32-bit Z buffer = 11 MB. But the same cache is also used for texture sampling and geometry (vertex and index buffer loads), so a 16 MB cache will thrash for sure. A 32 MB cache would be much better. Intel chose 64 MB and 128 MB for their Crystalwell GT4e cache sizes. Intel integrated chips are roughly in the same performance ballpark and are mostly used to play games at 720p, so the comparison is relevant.
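Quick sanity check on that 11 MB figure (assuming one RGBA16f color target and one 32-bit depth target at 1280x720, no MSAA):

```cpp
#include <cstdio>

int main() {
    // Assumed 720p forward-rendering targets:
    // RGBA16f color = 8 bytes/pixel, 32-bit depth = 4 bytes/pixel.
    const long long width = 1280, height = 720;
    const long long bytesPerPixel = 8 + 4;
    const long long total = width * height * bytesPerPixel;  // 11,059,200 bytes
    printf("720p color + depth = %.1f MB\n", total / 1e6);   // ~11.1 MB
    return 0;
}
```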
A 16 MB cache would take less die space than Xbox One's 32 MB ESRAM scratchpad, but not significantly less, as cache logic and tags take plenty of space. It would also have somewhat higher latency than current < 2 MB GPU L2 caches, which would likely hurt performance a bit (at least global atomics). It would still be a huge perf boost assuming only 25.6 GB/s of main memory bandwidth.
I hadn't thought about Intel's IGPs as a point of comparison - thanks for bringing it up! However, I feel like the thrashing issue could be addressed without expanding capacity to match GT4e (or moving to a scratchpad), e.g. by allowing cache ways to be allocated based on usage - basically Broadwell's L3 cache partitioning (CAT), but for GPUs (http://danluu.com/intel-cat/ has details, and so does 17.17 in http://www.intel.com/content/dam/ww...s-software-developer-vol-3b-part-2-manual.pdf). A rough sketch of what that could look like is below.
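Purely hypothetical sketch - no current GPU exposes anything like this, and the driver call, the traffic classes, and the 16 MB / 16-way cache geometry are all assumptions made up for illustration. The point is just that CAT-style way masks could keep texture and geometry streaming from evicting the frame's render targets:

```cpp
#include <cstdint>
#include <cstdio>

// Traffic classes a driver might distinguish (assumed for this sketch).
enum class CacheClient { RenderTargets, Textures, Geometry };

struct WayMask { uint16_t bits; };  // one bit per way in an assumed 16-way cache

// Build a mask covering `count` ways starting at way `first`.
constexpr WayMask Ways(int first, int count) {
    return { static_cast<uint16_t>(((1u << count) - 1u) << first) };
}

// Stub standing in for a made-up driver entry point; real hardware has no such call.
void SetCacheWayMask(CacheClient client, WayMask mask) {
    printf("client %d -> way mask 0x%04x\n", static_cast<int>(client), mask.bits);
}

int main() {
    // Pin 12 of 16 ways (~12 MB of a 16 MB cache) for color + depth, leaving the
    // remaining ways for texture sampling and vertex/index fetches.
    SetCacheWayMask(CacheClient::RenderTargets, Ways(0, 12));
    SetCacheWayMask(CacheClient::Textures,      Ways(12, 3));
    SetCacheWayMask(CacheClient::Geometry,      Ways(15, 1));
    return 0;
}
```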
I'm curious though - under the assumption that Switch is successful and becomes a specific optimization target for game devs, would you prefer a large cache (or scratchpad), or would you rather have more compute throughput and a smaller cache that requires tailored software to really shine?