The implementation on Tegra 4 is totally different from the implementation on Tegra 3, since Tegra 4 has separate L2 caches.
In Tegra 4, when switching to the battery saver core, the 2MB L2 cache used for the main cores is flushed and power gated, while the battery saver core uses its own 512KB L2 cache to save power. In Tegra 3, when switching over to the battery saver core, the 1MB L2 cache used for the main cores is not flushed and is used for the battery saver core too, is that right? That may not be a trivial difference, but there should still be many similarities between the hardware and software implementation of the battery saver core on Tegra 4 and Tegra 3.
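Just to spell out the difference as I understand it, here's a rough sketch of the switch sequence on each chip. This is not NVIDIA code; the function name, cluster details, and step ordering are my own illustration based on the descriptions above:

def switch_to_companion(chip):
    # Returns the rough sequence of steps (as I understand them) when the
    # OS hands everything over to the low-power companion/battery saver core.
    if chip == "tegra4":
        return ["flush main cluster's shared 2MB L2 back to DRAM",
                "power gate the main cluster (cores plus the 2MB L2)",
                "power on the companion core and its private 512KB L2",
                "migrate threads/state to the companion core"]
    elif chip == "tegra3":
        return ["power on the companion core (it shares the 1MB L2, so no flush)",
                "migrate threads/state to the companion core",
                "power gate the four main cores (the 1MB L2 stays powered)"]
    return []

for chip in ("tegra3", "tegra4"):
    print(chip, "->", "; ".join(switch_to_companion(chip)))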
Since nVidia doesn't design or validate the CPUs, it hardly matters.
That may be the case today, but as far as I can tell, it is NVIDIA's intention to design their own CPUs at some point in the future. So 4+1 may be a design feature that NVIDIA wants to include in their own custom CPUs too, but who knows for sure.
That's only one example. If you think that 4 A7s can't be loaded, then you're effectively saying either that 4 cores will only ever be loaded heavily or that 4 cores can't be loaded at all.
The examples I listed above are some of the most common usage scenarios for mobile devices today: watching a video or movie, listening to audio with headphones on while walking or jogging, reading an ebook, checking email, talking on the phone, etc. For these basic tasks, it is very unlikely for the user to be using their handheld device for anything else other than the task at hand, so multi-core CPUs would not be very beneficial in these scenarios in my opinion.
Both viewpoints run counter to what nVidia has been marketing. According to them, having four cores improves power consumption because there are workloads that will complete faster with more cores at a lower clock.
What NVIDIA is saying only applies to multi-threaded scenarios. In a scenario where multiple cores will actually be used, four cores should have lower power consumption than dual cores (all else equal), and dual cores should have lower power consumption than a single core (all else equal). In scenarios that are not heavily multi-threaded (such as some of the usage models I outlined above), there would be little to no benefit in power consumption (all else equal).
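To put rough numbers on that, here's a back-of-the-envelope using the usual dynamic-power approximation (power roughly proportional to V^2 * f per core). The clocks, voltages, and workload size are made up for illustration, not measured figures for any real chip:

def joules(cores, ghz, volts, gcycles):
    seconds = gcycles / (cores * ghz)     # wall-clock time for perfectly parallel work
    watts = cores * (volts ** 2) * ghz    # P ~ C * V^2 * f per core, constant C folded into the units
    return watts * seconds

work = 8.0                                # "giga-cycles" of perfectly parallel work
print(joules(1, 2.0, 1.10, work))         # one core at a high clock and voltage
print(joules(4, 0.5, 0.85, work))         # four cores at a low clock and voltage

With those made-up numbers, the four slow cores finish the parallel work in the same wall-clock time while burning noticeably less energy, which is basically the argument NVIDIA is making; for a single-threaded task the comparison obviously doesn't apply.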
nVidia is saying that their power-optimized A15 uses ~70% of the power of a Tegra 3 A9 at 1.6GHz (reading "40% less" as 1 / 1.4), or maybe 60% of the power (1 - 0.4) if I'm reading it the other way. A 1.2GHz Tegra 3 Cortex-A9 core can probably use a similar amount or less than that, so there's zero doubt in my mind that a 1.2GHz Cortex-A7 on 28nm would use substantially less power.
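Quick sanity check on the two possible readings of "40% less", just the arithmetic:

baseline = 1.0                  # Tegra 3 A9 @ 1.6GHz, normalized to 1
print(baseline / 1.4)           # ~0.71: reading "40% less" as "1.4x more efficient"
print(baseline * (1 - 0.40))    # 0.60: reading "40% less" literally

Either way the A15 figure lands well below the A9 baseline, which is the point.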
I do realize that the Cortex-A7 is a very power-efficient design, but how useful is it to have quad A7 cores when heavily multi-threaded tasks are consistently performed on the quad A15 cores?
Tegra 4, by virtue of not being able to run both clusters simultaneously for some reason, almost certainly has a worse penalty in switching from one to the other than anything you'd get running in a proper big.LITTLE setup using HMP.
According to NVIDIA, with respect to the battery saver core in Tegra 3, the total switching time is less than 2 milliseconds (i.e. an imperceptible delay to a human), but I haven't seen anything specific for Tegra 4. The reason the four main CPU cores are not enabled at the same time as the battery saver core is to avoid the penalties involved in synchronizing caches between cores running at different frequencies.
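Even if the governor switched clusters several times a second (switch rates I'm guessing at purely for illustration), a sub-2ms switch would be a tiny fraction of wall-clock time:

switch_ms = 2.0                              # NVIDIA's quoted worst-case switch time for Tegra 3
for switches_per_sec in (1, 5, 10):          # hypothetical switch rates
    overhead = switches_per_sec * switch_ms / 1000.0
    print(f"{switches_per_sec} switches/s -> {overhead:.1%} of wall-clock time spent switching")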
Not sure why you bring up performance; the point of HMP in these processors is to enable more efficient power consumption when you have threads that don't need full performance. Yes, it's more complex, but it's not like nVidia has to invest solely in this technology; it's going to happen on Android regardless. If anything, it'd be smarter for them to leverage something that benefits from active research and development outside of their company.
Sure, one of the goals of big.LITTLE MP is to use the right processor for the right task, all at the same time where possible, but this is easier said than done, and most big semiconductor companies cannot even agree on what the "right" processor design is in the first place! In theory it is a nice idea, but in practice it will likely be challenging to implement. Also, there is no way around the fact that the highest-performance CPU cores will be relatively power hungry and will still be needed for the most CPU-intensive tasks.
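Conceptually the scheduling side is easy to sketch; something like this toy load-based placement (not ARM's actual HMP scheduler, and the threshold, task names, and load numbers are made up):

BIG_THRESHOLD = 0.6   # hypothetical cutoff, as a fraction of a LITTLE core's capacity

def place(tasks):
    # tasks: list of (name, recent_load) pairs, with load in [0.0, 1.0].
    # Light threads stay on LITTLE cores; only heavy threads wake a big core.
    return {name: ("big" if load > BIG_THRESHOLD else "LITTLE")
            for name, load in tasks}

print(place([("ui_render", 0.9), ("audio_decode", 0.15), ("background_sync", 0.05)]))

The hard part is everything this toy version ignores: migration costs, thermal limits, and deciding what actually counts as a "light" thread, which is exactly why it's easier said than done.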