Jaws said:
Not to mention that it's an in-order processor too. Cache is there to hide latency, and one could argue an in-order processor needs to hide latency better than an OoOE processor, so it would need more cache. I'm also fully aware that more cache doesn't necessarily mean better, but, heck, you have HT P4s with 2MB. That's 1MB per thread...
As I always like to point out at this juncture, those HT P4s are often beaten by single-threaded A64s with 512KB of cache.
And with 6 threads on Xenon, each thread is proceeding at an effective 1.6GHz. This obviously paints quite a different picture in terms of cycles of latency with L1 or L2 misses. A 41-cycle L1 miss in Xenon becomes a ~21-cycle miss for a single thread if a core is running two threads symmetrically. And if a pre-fetch is set up correctly, this miss affects that thread only, with the other thread carrying on regardless.
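To put numbers on that (a back-of-envelope sketch; only the 3.2GHz clock and two hardware threads per core are Xenon facts, the rest is just the arithmetic from the paragraph above):

```c
#include <stdio.h>

int main(void) {
    const double core_clock_ghz = 3.2;  /* Xenon core clock */
    const int threads_per_core  = 2;    /* hardware threads per core */
    const int l1_miss_cycles    = 41;   /* the miss penalty quoted above */

    /* Two threads sharing a core symmetrically each see half the clock... */
    double effective_ghz = core_clock_ghz / threads_per_core;

    /* ...and a core-cycle stall is only charged against a given thread on
       every other cycle, so round the halved penalty up to whole cycles. */
    int per_thread_miss =
        (l1_miss_cycles + threads_per_core - 1) / threads_per_core;

    printf("Effective per-thread clock: %.1f GHz\n", effective_ghz);
    printf("Perceived L1-miss penalty:  %d thread cycles\n", per_thread_miss);
    return 0;
}
```

That prints 1.6 GHz and 21 thread cycles, matching the figures above.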
It's up to the devs to tweak their code using pre-fetching, hardware-thread priorities, cache-line data alignment, data-tiling, blah blah in order to maximise performance and minimise miss-induced stalls or flushes.
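For a flavour of what that tweaking looks like, here's a minimal sketch in GCC-flavoured C. The Particle struct, the loop, and the prefetch distance of 4 are all invented for illustration; the 128-byte line size is Xenon's, and __builtin_prefetch maps to the PowerPC dcbt/dcbtst cache-touch instructions on that kind of target:

```c
#include <stddef.h>

#define CACHE_LINE 128  /* Xenon/PPC cache-line size in bytes */

/* Hypothetical particle padded to fill one cache line exactly, so a
   single prefetch pulls in a whole element and a touched line never
   drags in a neighbour's data. */
typedef struct {
    float pos[4];
    float vel[4];
    float col[4];
    float pad[20];                /* pad to 128 bytes total */
} __attribute__((aligned(CACHE_LINE))) Particle;

void integrate(Particle *p, size_t n, float dt) {
    enum { PREFETCH_AHEAD = 4 };  /* illustrative prefetch distance */
    for (size_t i = 0; i < n; ++i) {
        /* Touch a line a few iterations ahead so it is (hopefully)
           resident in cache by the time the loop reaches it. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&p[i + PREFETCH_AHEAD], 1, 0);

        for (int k = 0; k < 4; ++k)
            p[i].pos[k] += p[i].vel[k] * dt;
    }
}
```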
It's pretty pointless saying the caches are too small without quantifying it - and we're utterly in the dark on that.
A doubling of cache size generally brings about 10% more performance - but again, that's a rule of thumb that doesn't necessarily account for code being rewritten specifically to match the cache sizes under consideration.
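To see where a figure like that can come from, here's a toy average-cost model (every parameter below is invented for illustration) combined with the empirical square-root rule, which says miss rate scales roughly with 1/sqrt(cache size):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Toy parameters -- invented for illustration, not Xenon figures. */
    double base_miss_rate = 0.01;   /* misses per access at the base cache size */
    double miss_penalty   = 40.0;   /* cycles per miss */
    double hit_cost       = 1.0;    /* cycles per hit */

    /* Square-root rule: doubling the cache divides miss rate by sqrt(2). */
    double doubled_miss_rate = base_miss_rate / sqrt(2.0);

    double base_cost    = hit_cost + base_miss_rate * miss_penalty;
    double doubled_cost = hit_cost + doubled_miss_rate * miss_penalty;

    printf("Speed-up from doubling the cache: ~%.0f%%\n",
           (base_cost / doubled_cost - 1.0) * 100.0);
    return 0;
}
```

That prints ~9% with these parameters - and nudging the miss rate or penalty moves the answer around a lot, which is exactly why it's only a rule of thumb.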
Considering how big the dies are, modern GPUs really do have piddly amounts of cache. That's partly because they use vast register files (just a kind of memory accessed with a different set of patterns), but also because the useful lifetime of texels is pretty limited over the duration of a frame render.
Jawed