It's a cheap alternative (from a design-effort POV) to individual DVFS for each core. And considering the size of an A9, it might even be better for area. It also has the advantage of being better for idle power.
Remember that T3 cannot use both the 1xLP and the 4xGP cores at the same time. The LP core doesn't help at all for mid-throttle workloads, but it helps a lot for very-low-throttle ones. Per-core DVFS, on the other hand, helps from mid-throttle up to near-maximum throttle, but not at very low throttle. So the two are actually orthogonal and complementary.
If you could run the LP and GP cores at the same time, the problem is that the LP core would need to either run at a different voltage (in which case why not just do per-core DVFS on the GP cores, at least in a 3+1 configuration?) or run at a ridiculously low frequency (since the GP cores would run at a fairly low voltage at mid-throttle, and even at max throttle LP's nominal voltage is higher than G's). You could still achieve slightly higher performance, but the gain is negligible and not worth the effort.
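To put rough shape on the shared-rail problem, here is a toy sketch. It assumes maximum frequency rises linearly with voltage above a floor, and that the LP transistors have a higher floor and a shallower slope than the G ones. Every constant and name below (`Process`, `v_floor`, `mhz_per_volt`, the clock figures) is made up for illustration and is not a T3 number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Hypothetical linear voltage/frequency model for one transistor type."""
    name: str
    v_floor: float       # volts at which f_max drops to zero (assumed)
    mhz_per_volt: float  # slope of f_max against voltage (assumed)

    def f_max(self, volts: float) -> float:
        """Highest clock (MHz) sustainable at the given rail voltage."""
        return max(0.0, (volts - self.v_floor) * self.mhz_per_volt)

    def v_needed(self, mhz: float) -> float:
        """Lowest rail voltage that sustains the given clock."""
        return self.v_floor + mhz / self.mhz_per_volt


# Invented figures: LP needs more voltage for any given clock than G.
G = Process("G", v_floor=0.60, mhz_per_volt=3000.0)
LP = Process("LP", v_floor=0.85, mhz_per_volt=2000.0)
G_MAX_MHZ = 1300       # assumed top clock of the main cores
LP_NOMINAL_MHZ = 500   # assumed clock of the LP core on its own rail


def lp_clock_on_shared_rail(g_mhz: float) -> float:
    """LP clock available when the rail voltage is dictated by the G cores."""
    return LP.f_max(G.v_needed(g_mhz))


for g_mhz in (600, 900, G_MAX_MHZ):
    rail = G.v_needed(g_mhz)
    print(f"G at {g_mhz} MHz -> rail {rail:.2f} V -> "
          f"LP limited to {lp_clock_on_shared_rail(g_mhz):.0f} MHz")
```

Under these assumptions the LP core gets little or nothing at mid-throttle rail voltages and stays below its own nominal clock even when the G cores are flat out, which is the "ridiculously low frequency" case described above.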
So it seems to me that the optimal implementation (taking cost into consideration) is 1xLP sharing a voltage domain with 1xG (never active at the same time, for complexity reasons) plus 3xG sharing another voltage domain (but with asynchronous clocks). The 1xG is the first one to reach maximum clock speed, and the 3xG cannot reach the maximum clock speed simultaneously (à la Intel Turbo Boost). The whole thing is monitored à la AMD PowerTune to stay within the TDP and minimise heat hotspots.
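A rough sketch of that layout as a state check, in case it helps to see the constraints spelled out. The core names (`LP`, `G0` to `G3`), the domain labels and `MAX_MHZ` are hypothetical. The Turbo Boost rule is encoded under one reading of the paragraph (the three-core cluster may not have every core at top clock at once), and the PowerTune-style TDP monitoring is not modelled at all.

```python
from dataclasses import dataclass
from typing import List

MAX_MHZ = 1300  # hypothetical top clock


@dataclass(frozen=True)
class Core:
    name: str
    domain: str  # cores in the same domain share one voltage rail
    mhz: int     # current clock; 0 means the core is power-gated


def violations(cores: List[Core]) -> List[str]:
    """Return the layout rules that the given state breaks."""
    by_name = {c.name: c for c in cores}
    problems = []
    # Rule 1: LP and its G partner share a rail and are never active together.
    if by_name["LP"].mhz > 0 and by_name["G0"].mhz > 0:
        problems.append("LP and G0 active at the same time")
    # Rule 2 (assumed reading): the 3xG cluster cannot be at top clock together.
    cluster = [c for c in cores if c.domain == "B"]
    if cluster and all(c.mhz >= MAX_MHZ for c in cluster):
        problems.append("all cluster cores at maximum clock")
    return problems


def rail_clock(cores: List[Core], domain: str) -> int:
    """Fastest clock in a domain; this is what sets that rail's voltage."""
    return max((c.mhz for c in cores if c.domain == domain), default=0)


# Clocks inside domain B differ (asynchronous clocks) but share one voltage.
state = [
    Core("LP", "A", 0),
    Core("G0", "A", 1300),
    Core("G1", "B", 1000),
    Core("G2", "B", 700),
    Core("G3", "B", 0),
]
print(violations(state))
print(rail_clock(state, "B"))
```

The point of the two-domain split is that only two rails are needed, while the lone G core can still sprint on single-threaded work without dragging the other three up to its voltage.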
With the A15, that extra 5th core is quite expensive and it's also overkill for background tasks at idle, so maybe an A9-class core would be better (or a higher-clocked A5-class core; even with higher-leakage transistors it should be better than an A15 since it's so small). It will be interesting to see what exactly ST-Ericsson is doing on the A9600. There's a bigger argument for simultaneous operation of all the cores there, but I don't think they'll do it because of the massive software complexity involved (including at the OS level).
A small summary:
- Dual-core: Higher perf, higher TDP, higher perf/W for mid-throttle workloads with 2+ fairly symmetric threads.
- Quad-core: Higher perf, higher TDP, higher perf/W for mid-throttle workloads with 4+ fairly symmetric threads.
- Per-core Clock: Higher perf/W for mid-throttle up to near-maximum workloads with non-symmetric threads (i.e. most of them to some extent). There's some indication that T30 might support it as well BTW.
- Per-core DVFS: Same as above, but to a *much* greater extent. Also a bigger incremental cost increase but still not that big (mostly harder to implement and A9 doesn't support it iirc). This is unique to Qualcomm for now.
- Same LP Core: Higher perf/W at very-low-throttle. Indirectly allows you to get away with slightly higher leakage transistors on the other cores to increase performance and/or reduce dynamic power at the same frequency. First introduced by Marvell in the Armada 628.
- Different Lower Power Core: Same benefits but with lower cost and/or higher idle power savings, especially when the main cores are as big as the Cortex-A15. Might not use quite such low-performance/low-leakage transistors, so that the performance gap between the cores isn't too huge.
- Simultaneous Different Cores: Here be dragons!
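
Why per-core DVFS beats per-core clock "to a *much* greater extent" follows from dynamic power scaling roughly as C·V²·f: on a shared rail every core pays the V² of the fastest core, while with per-core DVFS each core pays only its own. A minimal sketch, assuming a linear voltage/frequency relation with invented constants (`V_FLOOR`, `MHZ_PER_VOLT`) and ignoring leakage:

```python
V_FLOOR = 0.60         # volts, hypothetical
MHZ_PER_VOLT = 3000.0  # hypothetical slope of clock against voltage


def volts_for(mhz: float) -> float:
    """Lowest voltage that sustains the given clock (assumed linear model)."""
    return V_FLOOR + mhz / MHZ_PER_VOLT


def dynamic_power(mhz: float, volts: float) -> float:
    """Dynamic power in arbitrary units: P ~ V^2 * f, with capacitance set to 1."""
    return volts ** 2 * mhz


def per_core_clock_power(freqs) -> float:
    """Each core has its own clock, but all share the fastest core's voltage."""
    rail = volts_for(max(freqs))
    return sum(dynamic_power(f, rail) for f in freqs)


def per_core_dvfs_power(freqs) -> float:
    """Each core has its own clock and its own matching voltage."""
    return sum(dynamic_power(f, volts_for(f)) for f in freqs)


asymmetric = [1300, 600, 400, 300]  # MHz per core, a non-symmetric workload
symmetric = [900, 900, 900, 900]    # MHz per core, a symmetric workload

for label, freqs in (("asymmetric", asymmetric), ("symmetric", symmetric)):
    shared = per_core_clock_power(freqs)
    split = per_core_dvfs_power(freqs)
    print(f"{label}: per-core DVFS uses {split / shared:.0%} of the shared-rail power")
```

With symmetric threads the two schemes come out identical in this model, so the saving comes entirely from how uneven the threads are.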