Do we have any data on what percentage of the CPU time is spent getting data from outside the 2 MB cache for a typical modern game workload?
Are there certain engines or game types that would work better within that limited 2 MB of cache space and avoid cache misses?
I'm guessing this whole latency issue can be avoided for the most part by prefetching, but this would seem to require a lot of "hand-holding" by devs to get it working optimally.
You had to manually prefetch data on the older in-order PPC-based consoles (Xbox 360 and PS3), and it required lots of manual fine-tuning to work well.
All modern x86 CPUs have hardware data prefetchers. Whenever the CPU notices several memory accesses that form a regular access pattern (a constant stride), it starts to automatically prefetch data according to that stride. This means that as long as you process things linearly, incrementing or decrementing the memory address, your code will basically never cache miss... unless you run out of bandwidth (or get hit by cache aliasing).
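To make that concrete, here's a minimal C++ sketch of the kind of loop the hardware prefetcher handles well (the function name is just illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Summing a large array with a unit-stride access pattern. After a few
// iterations the hardware prefetcher detects the stride and starts
// pulling cache lines in ahead of the loads, so this loop rarely
// stalls on memory (bandwidth permitting).
uint64_t SumLinear(const uint32_t* data, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; ++i)
        sum += data[i]; // address = base + i * 4: a constant stride
    return sum;
}
```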
It's hard to program anything complex using only linear data accesses (only arrays, and only batch processing of them). Pointer accesses are often needed, and a pointer can point to any address in memory. It's impossible for the hardware to predict where to prefetch from until the pointer value itself has been loaded from memory. Pointer chains (a linked list, for example) are the worst offenders. Fortunately, out-of-order execution helps in these cases.
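The same operation over a linked list shows the problem (a minimal sketch; the Node type is made up for illustration):

```cpp
#include <cstdint>

// Hypothetical list node; in practice nodes end up scattered across the heap.
struct Node
{
    Node*    next;
    uint32_t value;
};

// Each node's address is only known once the previous node's cache line
// has arrived, so the hardware prefetcher has nothing to predict: every
// iteration can be a full cache miss.
uint64_t SumList(const Node* node)
{
    uint64_t sum = 0;
    while (node)
    {
        sum += node->value;
        node = node->next; // dependent load: serializes the misses
    }
    return sum;
}
```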
A simple example: let's assume a full cache miss takes ~150 cycles (typical for modern Intel x86 CPUs), and we are iterating through a pointer list (the nodes are at random memory addresses and thus cannot be prefetched automatically). Iterating through this list takes at least 150 cycles * N (where N is the number of nodes in the list). If your operation takes only 10 cycles per node (*), you are wasting 140 cycles per node waiting on memory latency: you are running 15x slower because of it. However, if your operation is more complex and takes, say, 200 cycles to process, the CPU can fetch the next node while processing the current one (thanks to out-of-order execution), as long as the next node's address calculation doesn't depend on the results of the current node's operations. You could of course also issue a manual cache line prefetch in this case (SSE has instructions for it, exposed as the _mm_prefetch intrinsic). If you do it that way, you want to start prefetching the next node before you start processing the current one (as soon as you can get the next node's pointer into a register).
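A sketch of that manual-prefetch pattern (same hypothetical Node type as above; ProcessNode is a stand-in for the ~200 cycle operation, not a real API):

```cpp
#include <cstdint>
#include <xmmintrin.h> // _mm_prefetch (SSE)

struct Node
{
    Node*    next;
    uint32_t value;
};

// Stand-in: imagine ~200 cycles of real per-node work here.
uint64_t ProcessNode(const Node* node) { return node->value; }

uint64_t SumListPrefetched(const Node* node)
{
    uint64_t sum = 0;
    while (node)
    {
        const Node* next = node->next; // grab the next pointer ASAP
        if (next)
            _mm_prefetch(reinterpret_cast<const char*>(next), _MM_HINT_T0);
        sum += ProcessNode(node); // memory latency hidden behind this work
        node = next;
    }
    return sum;
}
```

The prefetch only pays off if the per-node work takes long enough to cover the ~150 cycle miss; for a 10-cycle operation you are still memory bound.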
For best performance, you should always optimize your memory access patterns according to the caches available in your target hardware. All new data that enters the caches should be prefetched in one way or another (to prevent stalls). Old data already in the caches can be accessed in any random pattern without a performance loss. If you know (with relatively high confidence) what data already lies in the CPU caches, you know you can access it again efficiently. For best results you should think about both your (spatial) memory layout and your (temporal) access pattern (where and when).
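For example, splitting hot and cold fields into separate arrays shapes the spatial layout around the temporal access pattern; a minimal structure-of-arrays sketch with made-up names:

```cpp
#include <cstddef>
#include <vector>

// Hot data touched every frame is packed densely into its own arrays;
// cold data doesn't waste cache line space during the hot pass.
struct Particles
{
    std::vector<float> posX, posY, posZ; // hot: updated every frame
    std::vector<float> velX, velY, velZ; // hot: read every frame
    std::vector<float> mass;             // cold: touched rarely
};

// Linear walk over exactly the fields this pass needs: the stride
// prefetcher handles it, and every byte fetched is actually used.
void IntegratePositions(Particles& p, float dt)
{
    for (size_t i = 0; i < p.posX.size(); ++i)
    {
        p.posX[i] += p.velX[i] * dt;
        p.posY[i] += p.velY[i] * dt;
        p.posZ[i] += p.velZ[i] * dt;
    }
}
```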
(*) A modern CPU core can sustain at least 2 uops per cycle. Most AMD cores process 2 uops per cycle, while most Intel cores process 4 uops per cycle. In practice, however, the actual IPC hovers around 1.0-2.0 instructions per cycle; dependencies and memory stalls prevent cores from reaching their theoretical maximum throughput. Thus a 150 cycle memory stall could (in the worst case, at 4 uops per cycle) cause a CPU core to skip executing up to 600 instructions (doing nothing).