On Xbox 360, the EDRAM helps a lot with backbuffer bandwidth. For example, in our last Xbox 360 game we had a 2 MRT g-buffer (deferred rendering, depth + 2x8888 buffers, same bit depth as in CryEngine 3). The g-buffer writes require 12 bytes of bandwidth per pixel (4 bytes of depth + 2x4 bytes of color), and all of that bandwidth is provided by the EDRAM. For each rendered pixel we sample three textures. The textures are block compressed (2xDXT5 + 1xDXN), so they take a total of 3 bytes per sampled texel. Assuming a coherent access pattern and trilinear filtering, we multiply that cost by 1.25 (25% extra memory touched by trilinear) and get a texture bandwidth requirement of 3.75 bytes per rendered pixel. Without EDRAM the external memory bandwidth requirement is 12 + 3.75 = 15.75 bytes per pixel. With EDRAM it is only 3.75 bytes. That is a 76% saving (without EDRAM the external memory bandwidth cost would be over 4x higher). Deferred rendering is a widely used technique in high-end AAA games. It is often criticized for being bandwidth inefficient, but developers still love to use it because it has lots of benefits. On Xbox 360, the EDRAM makes deferred rendering efficient.
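The per-pixel arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function and constant names are made up for illustration, the 4-byte depth size is inferred from "depth + 2x8888 = 12 bytes", and the 1.25 trilinear factor is the same rough assumption used in the text.

```python
# Back-of-the-envelope check of the per-pixel numbers quoted in the post.
# All names here are illustrative; nothing is taken from a real engine.

DEPTH_BYTES = 4             # inferred: 12 bytes total minus 2 x 4 bytes of color
RGBA8888_BYTES = 4          # one 8888 render target
NUM_COLOR_TARGETS = 2       # 2 MRT g-buffer

DXT5_BYTES_PER_TEXEL = 1.0  # 16 bytes per 4x4 block
DXN_BYTES_PER_TEXEL = 1.0   # 16 bytes per 4x4 block
TRILINEAR_FACTOR = 1.25     # assumption from the text: 25% extra memory touched


def gbuffer_write_bytes() -> float:
    """Bytes written to the g-buffer per rendered pixel."""
    return DEPTH_BYTES + NUM_COLOR_TARGETS * RGBA8888_BYTES


def texture_read_bytes() -> float:
    """Bytes of texture data fetched per rendered pixel (2xDXT5 + 1xDXN)."""
    per_texel = 2 * DXT5_BYTES_PER_TEXEL + DXN_BYTES_PER_TEXEL
    return per_texel * TRILINEAR_FACTOR


def external_bytes_per_pixel(has_edram: bool) -> float:
    """External memory traffic per pixel; EDRAM absorbs the g-buffer writes."""
    external = texture_read_bytes()
    if not has_edram:
        external += gbuffer_write_bytes()
    return external


if __name__ == "__main__":
    without = external_bytes_per_pixel(has_edram=False)
    with_edram = external_bytes_per_pixel(has_edram=True)
    print(f"without EDRAM: {without:.2f} bytes/pixel")
    print(f"with EDRAM:    {with_edram:.2f} bytes/pixel")
    print(f"saving:        {1.0 - with_edram / without:.0%}")
    print(f"ratio:         {without / with_edram:.1f}x")
```

The sketch gives 15.75 and 3.75 bytes per pixel, which works out to roughly a 76% saving, or a 4.2x difference in external memory traffic.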
A fast read/write on-chip memory scratchpad (or a big cache) would also help a lot with image post processing. Most image post-process algorithms need no (or just a little) extra memory in addition to the processed backbuffer. With a large enough on-chip memory (or cache), most post-processing algorithms become completely free of external memory bandwidth. Examples: HDR bloom, lens flares/streaks, bokeh/DOF, motion blur (per-pixel motion vectors), SSAO/SSDO, post AA filters, color correction, etc. The screen space local reflection (SSLR) algorithm (in Killzone Shadow Fall) would benefit the most from fast on-chip local memory, since tracing those secondary rays through the min/max quadtree acceleration structure has quite an incoherent memory access pattern. Incoherent accesses are latency sensitive (lots of cache misses), and on-chip memories tend to have lower latencies. Of course this is implementation specific, but it is usually true, since the memory is closer to the execution units; for example, Haswell's 128 MB L4 should have lower latency than the external memory. I would expect to see a lot more post-process effects in the future, as developers are targeting cinematic rendering with their new engines. A fast on-chip memory scratchpad (or a big cache) would reduce the bandwidth requirement a lot.
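To put a rough number on the post-processing argument, here is a deliberately crude model. Everything in it is an assumption for illustration: the function name, the idea that each pass reads and writes the full buffer exactly once, the two-buffer (ping-pong) working set, and the example resolution, pass count and on-chip sizes. It also ignores the initial load and final store of the image, and real hardware with a cache would degrade gradually instead of flipping between "all" and "nothing".

```python
# Crude model of external memory traffic for a chain of full-screen
# post-process passes. All parameters and names are hypothetical.

def post_chain_external_bytes(width: int, height: int, bytes_per_pixel: int,
                              num_passes: int, on_chip_bytes: int) -> int:
    """Bytes of external memory traffic per frame for the whole pass chain."""
    buffer_bytes = width * height * bytes_per_pixel
    # Assumption: passes ping-pong between two full-size buffers.
    working_set = 2 * buffer_bytes
    if working_set <= on_chip_bytes:
        # Everything stays on chip (initial load / final store ignored).
        return 0
    # Otherwise every pass reads the source and writes the destination
    # through external memory.
    return num_passes * 2 * buffer_bytes


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical example: 1280x720, 4 bytes per pixel, 6 passes.
    for on_chip in (0, 32 * MIB):
        traffic = post_chain_external_bytes(1280, 720, 4, 6, on_chip)
        print(f"on-chip {on_chip // MIB:>2} MiB -> {traffic / MIB:.1f} MiB external traffic per frame")
```

Even in this simplified form, the model shows why the size of the on-chip memory matters: the external traffic of the whole chain scales with the number of passes until the working set fits on chip, at which point it drops out of the external bandwidth budget.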