My point is, if GT3e only uses the eDRAM for the frame buffer, that'd be pretty boring. However, just using the eDRAM as a "texture cache" (not the same as the traditional on-chip texture cache) isn't ideal either, because the relatively small size of the eDRAM would cause a lot of cache thrashing. A better approach is to treat the eDRAM and main memory as a sort of "unified" memory in which some regions are simply faster. The system would then have to determine which textures (or which parts of some textures) are used frequently enough to be stored in the eDRAM, while the rest stay in main memory.
Example: a 1920x1080 render target with 4x MSAA and 16 bytes of G-buffer data per fragment is 128 MB, but since we only need the 12 non-Z bytes of a fragment's G-buffer when we're rendering edges, the actually touched memory footprint is much lower. Assuming 40% extra fragments for 4x MSAA we get 1920x1080 pixels x 4 Z samples/pixel x 4 bytes/Z sample + 1920x1080 pixels x 1.4 fragments/pixel x 12 bytes/fragment = 65 MB, leaving 63 MB for textures, which should be plenty; that's more than 22 bytes per fragment.
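The arithmetic above is easy to sanity-check with a few lines (the assumptions are the post's own: 4 Z samples per pixel and ~1.4 fragments per pixel from MSAA edge coverage):

```python
# Back-of-the-envelope check of the 4x MSAA G-buffer footprint.
MB = 1024 * 1024

pixels = 1920 * 1080
z_bytes = pixels * 4 * 4                 # 4 Z samples/pixel, 4 bytes each
gbuf_bytes = int(pixels * 1.4) * 12      # ~1.4 fragments/pixel, 12 non-Z bytes each
total = z_bytes + gbuf_bytes

print(f"Z: {z_bytes / MB:.1f} MB, G-buffer: {gbuf_bytes / MB:.1f} MB, "
      f"total: {total / MB:.1f} MB, left of 128 MB eDRAM: {128 - total / MB:.1f} MB")
```

This lands at roughly 65 MB touched, leaving around 63 MB of a 128 MB eDRAM for textures.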
I wouldn't expect the eDRAM to be a fully functional L4 cache for both CPU and GPU (with 64-byte cache lines, coherency, and other goodies). The cache logic plus the memory for tags etc. would just take too much die space (a 128 MB cache is huge compared to traditional CPU cache designs).
It could, however, work in a similar way to AMD's PRT (partially resident textures) and 3DLabs' hardware virtual texturing (http://www.graphicshardware.org/previous/www_1999/presentations/v-textures.pdf).
The GPU could be able to sample textures from both DDR memory and from the eDRAM, and whenever a page is accessed that is not in eDRAM, that page starts being moved to eDRAM (while the accessed cache lines obviously go simultaneously to the GPU L1/L2 caches to serve the memory request with minimal latency). The texture page would then be in the eDRAM for the next access (during the same frame, or the next). Assuming 64 KB pages (like Tahiti's PRT), cache management would be much simpler than with 64-byte cache lines, as there are 1024x fewer items to track. That wouldn't require much extra logic / on-die memory at all.
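A toy model of that page-granular management, assuming a simple LRU replacement policy (my own hypothetical sketch, not Intel's actual design; `PageCache` and its interface are invented for illustration):

```python
from collections import OrderedDict

PAGE = 64 * 1024
EDRAM_PAGES = (128 * 1024 * 1024) // PAGE   # 2048 page slots in 128 MB

class PageCache:
    """Toy model of page-granular eDRAM management (hypothetical sketch)."""
    def __init__(self, capacity=EDRAM_PAGES):
        self.capacity = capacity
        self.resident = OrderedDict()        # page id -> True, kept in LRU order

    def access(self, addr):
        """Serve a texture fetch. Returns True on eDRAM hit, False if the page
        had to come from DDR. On a miss the page is migrated into eDRAM
        (modelled as instantaneous here), evicting the LRU page when full."""
        page = addr // PAGE
        if page in self.resident:
            self.resident.move_to_end(page)  # mark as most recently used
            return True
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)   # evict least recently used page
        self.resident[page] = True
        return False
```

With only 2048 entries to track (versus ~2 million 64-byte lines for a full cache), the tag/bookkeeping storage is tiny, which is the point of the 64 KB granularity.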
I have been programming software-based virtual texturing algorithms for a few years now, and I have analyzed the memory requirements for different resolutions (for various types of source content). Basically, in the worst case (lots of discontinuities, 64 KB pages), you need to have around 16 million texture samples in memory to render a 720p scene. That's the equivalent of a single 4096x4096 texture (if you have only a color map). In our game we sample 3 DXT5 textures per pixel, so the worst-case memory requirement for textures at 720p is 3 bytes * 16 million pixels = 48 megabytes. And this allows a unique, 1:1 pixel-sharp texel for each screen pixel (source texture resolution doesn't matter at all).
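The same figure can be reached by counting 64 KB pages (assuming, as I do here, that DXT5 stores 1 byte per texel, so a 64 KB page holds 65536 texels):

```python
# Worst-case virtual texture working set at 720p, counted in 64 KB pages.
texels_needed = 16 * 1000 * 1000          # ~16M unique samples for a 720p scene
texels_per_page = 64 * 1024               # 64 KB DXT5 page at 1 byte/texel
layers = 3                                # 3 DXT5 textures sampled per pixel

pages = -(-texels_needed // texels_per_page) * layers   # ceiling division
total_mb = pages * 64 * 1024 / 1e6
print(f"{pages} pages = ~{total_mb:.0f} MB")
```

That's about 48 MB, matching the 3 bytes * 16 million samples estimate above.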
An optimized g-buffer layout at 720p (similar to CryEngine 3) uses 720p * (8888*2 + D24S8) = ~10.5 MB of memory. Add the mapped texture pages, and you get ~60 MB required to render a frame (for opaque geometry). As the data set changes slowly from frame to frame (animation must be smooth to look good), there aren't many texture pages you have to move in/out of the eDRAM every frame. 16 x 64 KB pages per frame would be enough most of the time (except for camera teleports), and since we should assume the GPU can also sample textures directly from DDR memory, the eDRAM<->DDR transfer bandwidth would never be a bottleneck (you could amortize the transfers over several frames).
With an optimized texture layout (3 bytes per pixel) + an optimized g-buffer layout (12 bytes per pixel), 1080p would need (worst case) ~60 MB * 2.25 = ~135 MB. Real frames could often come in below 128 MB, allowing us to fully texture our frame from eDRAM. My calculations do not, however, take into account how much data the shadow maps require, or how much data the transparent passes require (particles, windows, etc.). That data could be sampled directly from DDR memory though, assuming the memory management can set priorities correctly (and tries to keep the most often reused subset of 64 KB texture pages in the eDRAM).
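Putting the 720p pieces together and scaling by pixel count reproduces the ballpark (my arithmetic, using the post's worst-case figures; shadow maps and transparents excluded as noted):

```python
# Worst-case opaque frame budget, scaled from 720p to 1080p.
texture_mb_720 = 48.0                       # worst-case unique texel data (3 x DXT5)
gbuffer_mb_720 = 1280 * 720 * 12 / 2**20    # 2x RGBA8 + D24S8, ~10.5 MB
frame_mb_720 = texture_mb_720 + gbuffer_mb_720

scale = (1920 * 1080) / (1280 * 720)        # 2.25x the pixels at 1080p
frame_mb_1080 = frame_mb_720 * scale
print(f"720p: ~{frame_mb_720:.0f} MB, 1080p worst case: ~{frame_mb_1080:.0f} MB")
```

The worst case hovers right around the 128 MB mark, which is why the ability to fall back to sampling directly from DDR matters for the leftovers.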
The interesting thing here is that Haswell is the first Intel chip to have this kind of large eDRAM die. If we instead had 256 MB of eDRAM, the GPU wouldn't need to access DDR at all (except for camera teleports). Everything a single frame needs would fit in the eDRAM. The system would just need to gradually move 64 KB pages in/out of the eDRAM based on GPU access patterns. If the system were well designed, there wouldn't be a need to move more than a few megabytes of data per frame between DDR and eDRAM (assuming access patterns similar to our games using software-based virtual texturing).