What are the cases where cache misses are common?
A lot of compute jobs.
Any shader with unpredictable access patterns (see the sketch after this list).
Any shader with enough inputs that the cache can't hold them.
Any unswizzled input.
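To make the unpredictable-access case concrete, here's a minimal CUDA sketch (the names gather_kernel, idx, src and dst are made up for illustration): the read address depends on data only known at runtime, so the hardware can't prefetch and most loads miss.

```cuda
// Unpredictable access: each thread's load address comes from another
// array, so neighbouring threads may touch addresses scattered across
// hundreds of MB. Each such load is likely a cache miss, and the warp
// stalls until DRAM answers.
__global__ void gather_kernel(const int* __restrict__ idx,
                              const float* __restrict__ src,
                              float* __restrict__ dst,
                              int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // data-dependent address = cache-hostile
}
```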
My guess would be that it's designed the way it is for a reason: MS does a lot of measurement of 360 titles and GPU utilization, and I'd guess they found that a lot of the compute resources were underutilized because of data stalls.
There isn't much you can do on a PC to fix this; you need API support for something like a fast memory pool. All you can really do on a PC is increase the size of the register pools to keep more threads in flight and increase the caches, and that might be more expensive for a given performance gain than throwing more CUs at the problem.
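To illustrate the "threads in flight" point: on PC hardware you can at least see how register pressure caps residency. A minimal CUDA sketch, assuming a made-up toy_kernel; the fewer blocks the runtime reports, the fewer threads the scheduler has available to hide memory latency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for "a shader"; the more live registers the compiler assigns
// it, the fewer copies of it the hardware can keep resident at once.
__global__ void toy_kernel(float* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    // Ask the runtime how many 256-thread blocks fit on one SM given the
    // kernel's register and shared-memory footprint.
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, toy_kernel, 256, 0);
    printf("resident blocks per SM: %d (%d threads in flight)\n",
           blocks, blocks * 256);
    return 0;
}
```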
As I said, I wouldn't like to posit how underutilized CUs are in the average modern renderer because of data stalls. If I were guessing, I would GUESS it's a significant amount.
I do know it's stupidly easy to write a compute job that you think will run 100x faster than your trivial CPU solution, benchmark it, and discover it's actually slower, because the ALUs are all sitting there waiting for data. It's one of the reasons I've been saying that FLOPS are not a useful performance metric.
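A hedged sketch of why that happens: this SAXPY-style CUDA kernel does 2 FLOPs per 12 bytes of memory traffic, so it saturates bandwidth long before the ALUs, and once you add the cost of copying the data to and from the device, a trivial CPU loop can come out ahead.

```cuda
// ~0.17 FLOP/byte: two FLOPs against 12 bytes of traffic per element.
// Any modern GPU runs this at a tiny fraction of its quoted peak FLOPS,
// because the ALUs spend almost all their time waiting on memory.
__global__ void axpy(const float* __restrict__ x,
                     float* __restrict__ y,
                     float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // 2 FLOPs; 8 bytes read, 4 written
}
```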
And as I said before, the danger with the solution MS has come up with is that you need to schedule moves of source data into the fast RAM; that eats bandwidth, and if you can't get the data there fast enough, the entire rendering pipeline stalls.
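As a sketch of what that scheduling looks like, here's a hypothetical double-buffered staging loop in CUDA, with shared memory standing in for the small fast pool (TILE, process and staged_kernel are invented names). The next tile is fetched while the current one is processed; if the fetch can't keep up, the barrier stalls every thread in the block, which is exactly the pipeline-stall risk described above.

```cuda
#define TILE 256

__device__ float process(float v) { return v * v; }  // stand-in for real work

// Assumes a single block of TILE threads, for brevity.
__global__ void staged_kernel(const float* __restrict__ src,
                              float* __restrict__ dst,
                              int num_tiles)
{
    __shared__ float buf[2][TILE];  // the "small fast pool"

    buf[0][threadIdx.x] = src[threadIdx.x];  // preload tile 0
    __syncthreads();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1;
        // Schedule the next move while the ALUs work on the current tile.
        if (t + 1 < num_tiles)
            buf[cur ^ 1][threadIdx.x] = src[(t + 1) * TILE + threadIdx.x];

        dst[t * TILE + threadIdx.x] = process(buf[cur][threadIdx.x]);

        // If the next tile hasn't arrived yet, everyone waits here.
        __syncthreads();
    }
}
```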
I also have questions about how deferred renderers are best handled with such a small fast memory pool. It wouldn't be unusual for a deferred renderer to write 28 bytes/pixel, and that won't fit in 32MB at 1080p (the arithmetic is below). Can you split the MRTs across different pools? Does MS provide guidance for devs trying to do this?
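The arithmetic, using a hypothetical MRT layout that adds up to the 28 bytes/pixel above (three 8-byte RGBA16F targets plus one 4-byte RGBA8):

```cuda
#include <cstdio>

int main()
{
    const long long pixels = 1920LL * 1080;       // 2,073,600 pixels at 1080p
    const int bytes_per_pixel = 8 + 8 + 8 + 4;    // hypothetical 4-MRT layout
    const double mib = pixels * bytes_per_pixel / (1024.0 * 1024.0);
    printf("G-buffer: %.1f MiB vs a 32 MiB pool\n", mib);  // ~55.4 MiB
    return 0;
}
```

Nearly twice the pool, so at best a subset of the MRTs could live in fast memory at any one time.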
To me it's an interesting approach. How effective it is will depend on how ALU-bound versus data-bound shaders are in modern games; I just haven't looked at enough data, or spoken to anyone who has, to have a good idea.