I do not see Crystalwell (or any other CPU embedded memory technology) as a threat to DIMM modules (large RAM on the motherboard) in the near future. SSDs were warmly welcomed because there was a gap in the memory hierarchy: the speed difference between RAM<->HDD was far too large compared to the differences between L1<->L2, L2<->L3 and L3<->RAM. Modern SSD caching solutions are pretty good, and offer an extra step in the memory hierarchy. However, if you remove RAM from the hierarchy and basically replace it with an L4 cache (even 1 GB of embedded memory is not enough to replace RAM), there will be an even larger gap to fill, even if we assume that all systems have SSD caches. L4<->SSD would simply be too large a gap to handle efficiently.
A huge L4 cache would be a great thing to have, especially if it is larger than the memory access footprint of a single frame in a game. A game running on Ivy Bridge or a Trinity APU can theoretically never access more than 366 MB of memory per frame (22 GB/s shared bandwidth for GPU/CPU, 60 fps). And this figure assumes that no memory is touched multiple times and all cache lines are fully utilized (and that there is no overdraw in rendering either). In practice the upper limit of accessible memory is under 200 MB per frame (more analysis here: http://forum.beyond3d.com/showthread.php?t=62108).
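The 366 MB figure is just the shared bandwidth divided by the frame rate. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and 1 GB is taken as 1000 MB, which is what the quoted number implies.

```python
def max_mb_per_frame(bandwidth_gb_per_s: float, fps: float) -> float:
    """Theoretical upper bound on memory touched per frame, in MB.

    Assumes every byte of bandwidth is used, each byte is touched only
    once, and 1 GB = 1000 MB.
    """
    return bandwidth_gb_per_s * 1000.0 / fps


# 22 GB/s shared CPU/GPU bandwidth at 60 fps (figures from the post)
limit_mb = max_mb_per_frame(22.0, 60.0)
print(f"Upper bound: {limit_mb:.1f} MB per frame")
```

At 30 fps the same bandwidth would give twice the per-frame budget, so the bound depends as much on the target frame rate as on the hardware.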
If we had a 512 MB L4 cache, we could assume that:
- All data we access multiple times during a frame costs no extra memory BW, and is low latency (coming from L4)
-> Overdraw (objects behind each other, particles, alpha layers, etc) costs no memory BW, depth buffering costs no memory BW. Other temporary buffers cost no memory BW, shadowmap sampling costs no memory BW (assuming the map was rendered earlier this frame or used last frame), shadowmap rendering (updating) costs no BW either (if the same map was already used last frame). Latency in accessing this data is also reduced (so we have fewer texture cache stalls, etc).
- The current working set from last frame is always in the L4 cache, and never needs to be reloaded from memory.
-> We only need to access RAM for data that is new.
--> At 60 fps only a few percent of data changes from frame to frame (this is a requirement for smooth animation). Old data that is updated (object is rotated/moved or shadowmap contents refreshed, etc) is not considered new data, as it lies in the same memory region (and thus is already in L4 cache). The same is true for all temporary work buffers that are used every frame.
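To put a rough number on the points above, here is a back-of-the-envelope sketch of how much RAM traffic would remain if the whole working set stayed in L4. The 200 MB working set is the practical limit quoted earlier; the 3% figure for new data per frame is purely an assumption standing in for "a few percent", and the function name is made up for illustration.

```python
def ram_traffic_with_l4(working_set_mb: float, new_fraction: float, fps: float) -> tuple:
    """Return (MB read from RAM per frame, GB/s of RAM bandwidth needed).

    Model: everything already in L4 is free; only data that is new this
    frame has to come from RAM. 1 GB = 1000 MB.
    """
    per_frame_mb = working_set_mb * new_fraction
    per_second_gb = per_frame_mb * fps / 1000.0
    return per_frame_mb, per_second_gb


WORKING_SET_MB = 200.0  # practical per-frame limit quoted in the post
NEW_FRACTION = 0.03     # ASSUMPTION: "a few percent" taken as 3%
FPS = 60.0

per_frame, per_second = ram_traffic_with_l4(WORKING_SET_MB, NEW_FRACTION, FPS)
print(f"RAM reads per frame: {per_frame:.1f} MB")
print(f"RAM bandwidth needed: {per_second:.2f} GB/s (22 GB/s available)")
```

Under these assumptions RAM would only have to deliver a small fraction of the bandwidth it provides today, which is why the L4 would take nearly all of the pressure off main memory without replacing it.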
Need I say that I would absolutely love a system like this? Unfortunately Haswell didn't bring it yet, and I don't believe we will see this anytime soon. As the memory bandwidth of CPUs/GPUs keeps rising all the time, the required L4 cache size to hold a full frame keeps rising as well. If technology keeps advancing as usual, in a few years we would likely need at least 2 GB of L4 to keep a full frame in cache at once. A 2 GB L4 wouldn't be overkill for systems that have 32+ GB of main memory. But the L4 alone wouldn't be enough (as all cache misses would have to be served from SSD/HDD... that would be really bad indeed).
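One way to read the 2 GB estimate is to run the same bandwidth / frame rate bound backwards and ask what bandwidth would let a game touch 2 GB in a single frame. This is only a sketch under that upper-bound model (1 GB = 1000 MB, every byte touched once); the post does not say the estimate was derived this way, and the function name is hypothetical.

```python
def bandwidth_for_frame_footprint(footprint_gb: float, fps: float) -> float:
    """Bandwidth in GB/s needed to touch `footprint_gb` once per frame."""
    return footprint_gb * fps


CURRENT_BW_GB_S = 22.0  # Ivy Bridge / Trinity figure from the post

needed = bandwidth_for_frame_footprint(2.0, 60.0)
print(f"2 GB per frame at 60 fps needs {needed:.0f} GB/s")
print(f"That is {needed / CURRENT_BW_GB_S:.1f}x the 22 GB/s quoted above")
```

In this model the required cache size scales linearly with bandwidth, so any L4 sized for today's bandwidth would stop covering a full frame once bandwidth grows.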