I must admit I was thinking in the context of the cluster migration policy of the Exynos 5 series SoCs. Every time you bounce your run queues from one cluster (A) to the other (B) you're powering two L2s for a while. You either:
1. Power both L2s indefinitely, letting cluster B demand-load data from cluster A on misses.
2. Flush cluster A's L2 cache to DRAM and let cluster B pull it back in when it misses its own L2.
3. A hybrid of the above: run the migrating processes for a few timeslices so they can pull the most frequently used data from the on-die cache, then flush the rest of the old L2.
They all have a performance impact and they all have a power consumption impact.
But the impact is strictly dependent on how often the migrations happen. You can't just say it's there and therefore it's bad; you have to look at its actual contribution in real-world scenarios.
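To make that frequency dependence concrete, here's a rough, illustrative-only cost model in C for option 2 (flush on every migration). Every number in it (L2 size, DRAM bandwidth, migration rate) is a made-up placeholder rather than measured Exynos 5 data; the point is only that the amortized cost scales linearly with how often you migrate.

```c
/* Back-of-the-envelope model of the amortized cost of option 2 (flush the
 * old L2 to DRAM on every cluster migration). All constants are invented
 * placeholders, not measured Exynos 5 figures. */
#include <stdio.h>

#define L2_BYTES          (1024.0 * 1024.0)  /* 1 MiB L2, assumed */
#define DRAM_BW_BYTES_S   (6.4e9)            /* 6.4 GB/s DRAM bandwidth, assumed */
#define MIGRATIONS_PER_S  (2.0)              /* cluster switches per second, assumed */

int main(void)
{
    /* Worst-case time to write back a fully dirty L2 on one migration. */
    double flush_s = L2_BYTES / DRAM_BW_BYTES_S;

    /* Fraction of wall-clock time spent flushing, given the migration rate.
     * This is the term that makes the overhead "strictly dependent on how
     * often the migrations happen". */
    double overhead = flush_s * MIGRATIONS_PER_S;

    printf("per-migration flush: %.3f ms\n", flush_s * 1e3);
    printf("amortized overhead : %.4f%% of wall-clock time\n", overhead * 100.0);
    return 0;
}
```

With placeholder numbers like these the flush traffic is a rounding error at a couple of migrations per second; it only starts to matter if the scheduler is ping-ponging run queues many times a second, which is exactly the behaviour a sane policy should avoid.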
Normally a thread shouldn't be constantly bouncing between heavy and low utilization at time scales large enough that migrating it makes more sense than just adjusting the clock of one of the clusters for it.
And even if you are constantly shifting CPU load, that doesn't mean you're also bringing all your data along with you. Once enough time has passed for the scheduler to realize you've substantially cooled down or heated up, chances are pretty good that your working set will have changed a lot, meaning it's not necessarily true that a huge chunk of L2 cache will need to move around.
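For illustration, here's a minimal C sketch of the kind of hysteresis such a migration policy needs. This is not the actual big.LITTLE switcher code; the 0..1023 load scale, the thresholds and the residency window are all invented placeholders. The idea is simply that a task only migrates after its tracked load has stayed high (or low) for a while, so short bursts get absorbed by DVFS on the current cluster instead of triggering a cluster switch.

```c
/* Minimal sketch of a hysteresis-based migration decision, assuming a
 * 0..1023 load scale. Thresholds and window lengths are made-up values,
 * not anything from the real big.LITTLE switcher. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define UP_THRESHOLD     768  /* load above which the big cluster looks worthwhile (assumed) */
#define DOWN_THRESHOLD   256  /* load below which the LITTLE cluster is enough (assumed) */
#define RESIDENCY_TICKS   20  /* ticks the condition must persist before migrating (assumed) */

struct task_load {
    uint32_t avg_load;     /* decayed average load, 0..1023 */
    uint32_t ticks_above;  /* consecutive ticks above UP_THRESHOLD */
    uint32_t ticks_below;  /* consecutive ticks below DOWN_THRESHOLD */
    bool     on_big;       /* currently running on the big cluster? */
};

/* Called once per scheduler tick; returns true when a migration is due. */
bool should_migrate(struct task_load *t)
{
    if (t->avg_load >= UP_THRESHOLD) {
        t->ticks_above++;
        t->ticks_below = 0;
    } else if (t->avg_load <= DOWN_THRESHOLD) {
        t->ticks_below++;
        t->ticks_above = 0;
    } else {
        /* Dead band: neither counter advances, DVFS on the current cluster handles it. */
        t->ticks_above = t->ticks_below = 0;
    }

    if (!t->on_big && t->ticks_above >= RESIDENCY_TICKS)
        return true;  /* heated up long enough: move up to big */
    if (t->on_big && t->ticks_below >= RESIDENCY_TICKS)
        return true;  /* cooled down long enough: move down to LITTLE */
    return false;
}

int main(void)
{
    struct task_load t = { .avg_load = 900, .on_big = false };
    int tick;

    /* A sustained heavy load only trips the migration after the residency window. */
    for (tick = 1; !should_migrate(&t); tick++)
        ;
    printf("migrated to big after %d ticks\n", tick);
    return 0;
}
```

The residency window is what keeps a briefly-hot task from dragging its (by then mostly stale) working set across clusters.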
Also consider that some tasks are pretty transient: there will be new processes showing up on a core in one of the clusters that last a few seconds and then quit. These won't be migrated, nor will they leave a big cache-flush footprint.
You get the same problem to a lesser degree with a core-migration policy.
I really see both cluster migration and core migration as stepping stones to the full HMP solution. The fact that the software is lagging on this front doesn't mean the hardware design is fundamentally broken and needs to change.
The whole point of big.LITTLE is to shut down your big cluster whenever you can to save power. Every time you do that you have the above issue.
I don't agree with you that the whole point of big.LITTLE is simply to shut down the big cluster. The point is to run tasks with a high average CPU load on the big cluster and tasks with a lower average CPU load on the little cluster. You shut either cluster down when it has had no tasks for long enough.
But if the whole point is to keep the big cluster off as much as possible, then the cost of migration is negligible, since you only migrate some of the times you power the big cluster on or off.
Currently ARM's SCU/L2 cache supports four user agents; I don't understand why they don't pair a big and a LITTLE core for L2 arbitration (i.e. 8 cores with 4 L2 "ports").
There are a lot of disadvantages to this:
1) It locks you into core migration, when normally big.LITTLE is more flexible than that
2) The big core will often benefit from more cache than the little core
3) Having 4 caches instead of 2 increases coherency requirements
4) Smaller separate caches are less flexible, especially in quad-core situations where often only one or two cores will be active
5) The big and little cores run at different clock speeds and voltages (they have to), and sharing a cache across those domains has overhead