Yes, when I say CPU I mean the software.
Doing a virtual cache mapping is problematic because the associativity, line size and what have you actually have a practical impact on how the software works, just as the size does. They have to emulate the actual behavior, and the only way to really do that in a cache is to have the same cache arrangement.
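To make the associativity/line-size point concrete, here is a rough sketch of how a set-associative cache picks a set for an address. The two geometries are made-up illustrations of the same total size (I'm not claiming either is what any Tegra 3 core uses), and `set_index` is just a helper name for this post.

```python
def set_index(addr, cache_bytes, ways, line_bytes):
    """Return the set an address maps to in a simple set-associative cache.

    Assumes plain modulo indexing on the line number, with no hashing.
    """
    num_sets = cache_bytes // (ways * line_bytes)
    return (addr // line_bytes) % num_sets


CACHE_BYTES = 32 * 1024  # same capacity for both hypothetical geometries

# Hypothetical geometry A: 4-way, 32-byte lines -> 256 sets
# Hypothetical geometry B: 8-way, 64-byte lines -> 64 sets
GEOM_A = (CACHE_BYTES, 4, 32)
GEOM_B = (CACHE_BYTES, 8, 64)

addr_x, addr_y = 0x0000, 0x1000

for name, geom in (("A", GEOM_A), ("B", GEOM_B)):
    sx = set_index(addr_x, *geom)
    sy = set_index(addr_y, *geom)
    same = "same set" if sx == sy else "different sets"
    print(f"geometry {name}: x -> set {sx}, y -> set {sy} ({same})")
```

Same capacity, but the two addresses land in different sets under A and in the same set under B, so conflict behavior (and anything keyed on set/way) differs even though the size matches.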
Again, depending on how you remap the cache-ops, the only thing that needs to be guaranteed is that the data is more pessimistically coherent. Performance on the LP core need not be very important.
Of course, it's not like nVidia is using a custom cache controller to begin with.
For Tegra 3 at least.
Full swapping is a standard usage model for big.LITTLE, and ARM provides firmware code for it - and it's the swapping latency that ARM is describing (what else would they be?). Sure, the software triggers the swapping in big.LITTLE, but why would that increase the latency requirements? If anything, you would want something that transparently swaps you to be as low latency as possible.
I thought swapping was considered a stop-gap solution until the OS is heterogeneity-aware.