The whole point of big.LITTLE is to extend the curve and get optimal power efficiency at lower performance levels (although since the LITTLE core is an inherently more power-efficient architecture, the break-even point may sit above the A15's minimum-voltage operating point). The "companion core" approach from NVIDIA helps in a similar way, although only when you don't need more than one thread - still, that's the bulk of workloads today, so it should already help a lot compared to a standard A15 implementation.
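As a back-of-the-envelope for the "extend the curve" point: dynamic power scales roughly as C·V²·f, and once the big core is already at its minimum voltage the only lever left is frequency, which is where a smaller core with lower switched capacitance can win. The sketch below is only illustrative - the capacitance/voltage/frequency numbers are made up, not measured A15/A7 data.

```c
#include <stdio.h>

/* First-order dynamic power model: P ~ C * V^2 * f.
 * All numbers below are made-up illustrative values, NOT measured A15/A7
 * figures; they only show why a smaller core can undercut a big core once
 * the big core has hit its minimum operating voltage. */
typedef struct { const char *name; double c_nf; double v; double f_mhz; } op_point;

static double dyn_power_mw(op_point p) {
    /* P[mW] = C[nF] * V^2 * f[MHz] (units cancel to milliwatts) */
    return p.c_nf * p.v * p.v * p.f_mhz;
}

int main(void) {
    op_point pts[] = {
        { "big @ high perf", 1.0, 1.10, 1700 },
        { "big @ Vmin",      1.0, 0.90,  800 },  /* voltage can't drop further */
        { "LITTLE @ same f", 0.3, 0.95,  800 },  /* smaller switched C wins    */
    };
    for (size_t i = 0; i < sizeof pts / sizeof pts[0]; i++)
        printf("%-16s ~%.0f mW\n", pts[i].name, dyn_power_mw(pts[i]));
    return 0;
}
```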
I understand the appeal of big/little, but isn't that a theoretical option at the moment? Is big/little available as a solution now or is actual silicon still a year out?
It'd be interesting to see how much power you could gain (*if* it's on the same process) just by synthesizing for a lower target clock. Tegra 3 can't provide any insight into that.
Ideally what you'd want is big.LITTLE with both sets of cores active and visible to the OS (with the right kernel logic to make it work), combined with a single A7/LITTLE companion core that could be active at the same time for lower leakage/performance (i.e. high-Vt cells with longer gate channel lengths and power-optimised synthesis). That way the main cores could be implemented on a higher-performance/higher-leakage process (for lower active power through undervolting). So I can imagine something like 4xA57+5xA53 being very interesting on 20nm...
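As a concrete illustration of "both sets active and visible to the OS": with all eight cores exposed, even user space can steer work between clusters with plain CPU affinity. The CPU numbering below (0-3 = big, 4-7 = LITTLE) is a hypothetical topology, not taken from any real SoC.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one cluster of a heterogeneous (big.LITTLE MP)
 * system. The CPU numbering is hypothetical: CPUs 0-3 = big, 4-7 = LITTLE. */
static int pin_to_cluster(int first_cpu, int last_cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first_cpu; cpu <= last_cpu; cpu++)
        CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}

int main(void) {
    /* Latency-critical work: keep it on the big cluster. */
    if (pin_to_cluster(0, 3) != 0)
        perror("pin to big cluster");
    /* ... heavy work here ... */

    /* Background work: move it to the LITTLE cluster. */
    if (pin_to_cluster(4, 7) != 0)
        perror("pin to LITTLE cluster");
    /* ... light work here ... */
    return 0;
}
```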
What do you mean by 'with both sets active'? Do you mean that the big and the LITTLE cores of the same combo are each running their own independent code? So you're basically running 8 CPUs? That's crazy.
Is this what ARM is currently promoting?
What's the performance difference between big and little anyway?
But you should care about image quality. Tegra 3 didn't support framebuffer compression, so to save bandwidth they only supported (or at least exposed?) a pitiful 16-bit depth buffer. That leads to quite a lot of depth precision issues...
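To put rough numbers on those precision issues, here's a small sketch of the eye-space depth step a 16-bit vs 24-bit buffer can resolve under a standard perspective projection. The near/far planes and the test distance are arbitrary illustrative values, not taken from any particular game.

```c
#include <math.h>
#include <stdio.h>

/* For a standard perspective projection with near plane n and far plane f,
 * the depth-buffer value d in [0,1] maps to eye-space depth z as
 *   d = f/(f-n) * (1 - n/z)
 * so the smallest resolvable eye-space step at distance z is roughly
 *   dz ~= (f-n) * z^2 / (f*n) * 1/(2^bits - 1).
 * The values of n, f and z below are arbitrary examples. */
static double depth_step(double n, double f, double z, int bits) {
    double lsb = 1.0 / (pow(2.0, bits) - 1.0);
    return (f - n) * z * z / (f * n) * lsb;
}

int main(void) {
    double n = 0.1, f = 1000.0, z = 100.0;   /* metres, say */
    printf("16-bit depth step at z=%.0f: ~%.3f units\n", z, depth_step(n, f, z, 16));
    printf("24-bit depth step at z=%.0f: ~%.5f units\n", z, depth_step(n, f, z, 24));
    return 0;
}
```

With those (made-up) settings, surfaces about 1.5 units apart at a distance of 100 already collapse into the same 16-bit depth value, which is exactly where the Z-fighting comes from; 24 bits shrinks that step to a few millimetres.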
I think this is a developer's headache more than anything else, not something most people will actively experience as a reduction in image quality.
But more importantly: again as a user, I don't think it matters for the vast majority of games that are currently out there. Cut the Rope, Angry Birds, board games etc. are all at the top of the sales charts. They have cute graphics and they'd better be smooth these days, but 16-bit Z-fighting is not going to be a concern.
So I completely understand why they'd want to focus on CPU performance: that's where 99% of users can still see a difference - the time to load an app, the time to load a web page. I haven't seen many people rave about the 2x faster GPU in the iPad 4. It's just not very noticeable.
In my mind this leads to a fundamental flaw of the architecture: it's an IMR, but it's not fast enough at doing a Z-only prepass (because of bandwidth, depth rate, and geometry performance), so you need (unrealistically?) good front-to-back ordering to get good performance on complex workloads. And even then I suspect they waste more time than they should rejecting pixels in perfectly front-to-back ordered scenes with high overdraw, as they have no Hier-Z of any kind...
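For reference, the Z-only prepass being referred to is usually structured like the sketch below. This is generic GL ES 2.0-style code, not anything Tegra-specific, and draw_scene_geometry()/draw_scene_shaded() are hypothetical placeholders for the application's own draw routines.

```c
#include <GLES2/gl2.h>

/* Hypothetical placeholders for the application's own draw routines:
 * draw_scene_geometry() would bind a trivial position-only shader,
 * draw_scene_shaded() the full material shaders. Empty stubs here. */
static void draw_scene_geometry(void) { /* app draw calls */ }
static void draw_scene_shaded(void)   { /* app draw calls */ }

void render_frame_with_z_prepass(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glEnable(GL_DEPTH_TEST);

    /* Pass 1: lay down depth only. Colour writes off, cheap shaders. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    draw_scene_geometry();

    /* Pass 2: shade only the visible surfaces. Depth writes off,
     * the LEQUAL test rejects everything hidden during pass 1. */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_LEQUAL);
    draw_scene_shaded();
}
```

The point of the complaint above is that on this architecture the geometry and depth throughput of pass 1 is itself too expensive, so the prepass doesn't pay for itself the way it would on a faster IMR.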
I understand the first-order bandwidth implications of an IMR, but it strikes me that the effects are less pronounced in the real world than what you'd expect them to be. Take Tegra 3: it has the same 32-bit wide memory controller as Tegra 2 and needs to feed twice the CPU cores and a more demanding GPU, yet the results are really much better than you'd expect, with a pretty small die size to boot. Or take the A5X (http://www.anandtech.com/show/5688/apple-ipad-2012-review/15, bottom of the page): it's a ridiculous 160mm2 vs 80mm2, and the GPU-only area ratio should be even more out of whack, yet for equivalent resolutions it's only 2.5x faster. Not only does the A5X have way more external bandwidth, the disparity gets even bigger once you take the on-chip RAMs into account. I'm really not impressed by this 2.5x and I don't understand why it's so little.
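A quick back-of-the-envelope on those numbers (die sizes and the 2.5x figure are as cited above; using whole-die area as a proxy for GPU area is a simplification that if anything understates the gap, since the GPU-only ratio is presumably larger):

```c
#include <stdio.h>

int main(void) {
    /* Figures as cited in the thread; whole-die area stands in for GPU area. */
    double a5x_area_mm2    = 160.0;
    double tegra3_area_mm2 =  80.0;
    double speedup         =   2.5;  /* A5X vs Tegra 3 GPU at equivalent resolution */

    double area_ratio         = a5x_area_mm2 / tegra3_area_mm2;  /* 2.0x  */
    double perf_per_area_gain = speedup / area_ratio;            /* 1.25x */

    printf("Area ratio:              %.2fx\n", area_ratio);
    printf("Perf gain per unit area: %.2fx\n", perf_per_area_gain);
    return 0;
}
```

So per mm2 the A5X only comes out ~1.25x ahead on these cited numbers, before even crediting Tegra 3 for its much narrower memory interface - which is roughly the point being made.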