Thanks for the feedback SMM. To generalize: it has been frequently noted that Xenon (and the Cell PPU to a degree) has significantly less cache than its PC counterparts and that this could pose serious performance issues - especially Xenon, which has less than 512KB per core (compared to something like a C2D with roughly 2MB per core). IMO this is an oversimplified argument that ignores the differences in workload and market, but the amount of cache, specifically the lack thereof, has been a point raised on the forums quite often over the last couple of years.
As important as the amount of level 2 cache is its associativity, especially for a multi-core device - and even more so when each core is multi-threaded. Each CPU context generally has three segments: code, stack and heap.
To get any kind of reasonable performance you need a good hit rate in your instruction cache, since it feeds the front of the pipeline and you have little room to dodge dependencies here (other than switching context). So I-fetch has to operate mostly from the I$ or you're screwed anyway.
The stack segment usually sees exceptionally good hit rates because of its high spatial and temporal locality.
This means that the level 2 cache primarily has to deal with heap data. Usually you'd want at least one way per CPU context to avoid the most obvious pathological case, where each context thrashes the L2 contents of the other contexts.
The Xenon cache hierarchy seems fairly well designed. Each core supports two contexts, for a total of six across the entire device. The L2 cache is eight-way set associative and therefore should serve all six contexts well, as long as they are fairly well behaved.
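The "one way per context" point is easy to see with a toy model. Below is a sketch of a single cache set under LRU replacement (an assumption - Xenon's actual replacement policy isn't something I'm asserting here), with six contexts taking round-robin turns touching one line each that all map to the same set. With eight ways everything fits; with fewer ways than contexts, LRU degenerates into the classic pathological case where every single access misses:

```python
from collections import OrderedDict

def run(ways, contexts, rounds=1000):
    # Model one cache set with LRU replacement. Each context "owns"
    # one distinct line, and all of these lines map to this same set.
    set_lru = OrderedDict()  # line tag -> None, ordered oldest-first
    hits = misses = 0
    for _ in range(rounds):
        for ctx in range(contexts):  # contexts take turns, round-robin
            line = f"ctx{ctx}"
            if line in set_lru:
                hits += 1
                set_lru.move_to_end(line)  # mark most recently used
            else:
                misses += 1
                if len(set_lru) >= ways:
                    set_lru.popitem(last=False)  # evict LRU victim
                set_lru[line] = None
    return hits / (hits + misses)

print(run(ways=8, contexts=6))  # near 1.0: only cold misses
print(run(ways=4, contexts=6))  # 0.0: each line is evicted just before reuse
```

The second case is the thrashing scenario: six contexts cycling through a four-way set under LRU means each context's line is always the eviction victim right before it comes around again.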
Likewise, each core has a 4-way set-associative L1 D$, enough to support a stack and a heap segment for each of its two contexts.
As for the amount of L2 cache: yes, a C2D has more cache, but each core is also a lot faster (don't be fooled by the mega bollocks per second numbers) - and with main memory latency that is higher relative to core speed, each miss is relatively much more expensive, hence it makes sense to add cache to lower miss rates.
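The trade-off above is just the standard average-memory-access-time arithmetic. The numbers below are purely illustrative (not measured Xenon or C2D figures): the point is that when the miss penalty dwarfs the hit time, spending transistors on halving the miss rate pays off far more on the faster, higher-latency-ratio chip:

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    # Average memory access time, in cycles:
    # every access pays the hit time; misses additionally pay the penalty.
    return hit_cycles + miss_rate * miss_penalty_cycles

# Illustrative numbers only. With a ~500-cycle miss penalty,
# halving the miss rate from 5% to 2.5% cuts the average cost sharply:
print(amat(hit_cycles=15, miss_rate=0.05,  miss_penalty_cycles=500))  # 40.0
print(amat(hit_cycles=15, miss_rate=0.025, miss_penalty_cycles=500))  # 27.5
```

Run the same arithmetic with a smaller penalty and the benefit of extra cache shrinks accordingly - which is why "less cache than a C2D" isn't automatically damning for a console part with different latency ratios and workloads.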
On top of that, game developers are already focused on a whole bunch of other performance-related problems like instruction scheduling and vector unit intrinsics, and they damn well better keep an eye on cache miss rates and memory footprint/behaviour in general (i.e. they are more likely to use non-temporal loads and stores to avoid polluting the caches with poor-locality data).
Cheers