Samsung Exynos 5250 - production starting in Q2 2012

This is why the big power-comparison articles motivated by Intel annoy me. They've been looking at peak CPU vs peak CPU for two different CPUs that draw very different power at peak. I want to see curves at several different perf levels, like what Nebuchadnezzar gave for the Exynos 4210. A CPU may use twice as much power to give only 20% more speed, but if it uses the same amount of power to give the same amount of speed, how is it any worse, especially if you rarely need more perf than that for sustained periods of time?
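
To make that concrete, here's a toy sketch (entirely invented numbers, not measurements of any real core) of what an iso-performance comparison looks like next to a peak-vs-peak one:

```python
# Toy iso-performance comparison of two hypothetical CPU cores.
# All numbers are invented for illustration; nothing here is measured data.

# (performance, core power in watts) points along each core's DVFS curve
cpu_a = [(1.0, 0.3), (2.0, 0.8), (3.0, 1.8), (3.6, 3.5)]  # big core, expensive peak
cpu_b = [(1.0, 0.3), (2.0, 0.7), (2.8, 1.5)]              # smaller core, lower peak

def power_at(curve, perf):
    """Linearly interpolate power along a DVFS curve at a given perf target."""
    for (p0, w0), (p1, w1) in zip(curve, curve[1:]):
        if p0 <= perf <= p1:
            return w0 + (w1 - w0) * (perf - p0) / (p1 - p0)
    return None  # perf target is outside this core's range

# At matched perf levels the two cores look nearly identical, even though
# their peak power differs by more than 2x.
for target in (1.0, 2.0, 2.8):
    print(f"perf {target}: A {power_at(cpu_a, target):.2f} W, B {power_at(cpu_b, target):.2f} W")
```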

I bet pretty much all games out now can maintain steady state just fine on Exynos 5 Octa's Cortex-A7s, with only brief periods of firing up the A15s (notice in the video it happens during scene transitions). These periods won't heavily impact battery life or make the unit too hot.
 
I didn't call them idiots, so please don't put words in my mouth.
I also never said that octa was marketing or that it wouldn't work. Again...

I only questioned the point of having 4 A15s alongside the A7s... aside from the recommendation to use equal numbers of both core types because the migration software isn't mature enough, I still think it's overkill in a smartphone... Tegra even more so, because it has no A7s and no asynchronous MP.

I would have preferred the setup in the ARM demo (running on Android ICS)... it loaded web pages instantly with only 2 A15s and no GPU. I fail to see what noticeable advantage another 2 Eagle cores would bring, other than sucking power.

Snapdragon is different because Kraits consume less power than A15s, seem to be more optimised for power efficiency (L0 cache? Fewer execution units, if I recall), and use asynchronous MP.

When any of the A15s are not in use they will be power gated, so as mentioned before, it costs in terms of die area, but not in power consumption.
 
When any of the A15s are not in use they will be power gated, so as mentioned before, it costs in terms of die area, but not in power consumption.

Yes, I know this... it should be obvious... but if they are on the die they will get used, no? Ironically, I argued your point last year when certain people more experienced than I am were adamant that 4 cores were useless on a smartphone and a power drain. So I'm well aware of very simple features like power gating.

4 Eagle cores powering up to load web pages or run games has the potential (although not set in stone, due to load balancing) to consume more power than 2 A15s would. Do you agree?

Anyway, we have 4 A15s to play with, which is certainly nice, and 4 A7s to offset consumption worries... from a consumer perspective I'm pleased with that.

Enough said.
 
Yes, I know this... it should be obvious... but if they are on the die they will get used, no? Ironically, I argued your point last year when certain people more experienced than I am were adamant that 4 cores were useless on a smartphone and a power drain. So I'm well aware of very simple features like power gating.

4 Eagle cores powering up to load web pages or run games has the potential (although not set in stone, due to load balancing) to consume more power than 2 A15s would. Do you agree?

Anyway, we have 4 A15s to play with, which is certainly nice, and 4 A7s to offset consumption worries... from a consumer perspective I'm pleased with that.

Enough said.

Never knew that Android / iOS were that well multi-threaded! JavaScript alone is single-threaded until HTML5 becomes the default. But assuming a workload that scales to 4 cores, running 4 A15s would complete the task in roughly half the time vs 2 A15s, so the total energy used is comparable, and the user is much happier with the faster completion time.
 
Four cores that can accomplish the same task in the same amount of time as two cores will use less power. Four cores that accomplish the same task in less time may use more power doing so, but the whole system may use less if it means the user spends less time with the screen powered on. Ultimately the peak clocks and therefore peak perf and peak power consumption will be something that the OS and user can control.
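
As a back-of-the-envelope illustration of that screen-on effect (all figures hypothetical, not measurements of any real device):

```python
# Back-of-the-envelope energy comparison: 2 vs 4 identical cores on a
# perfectly parallel task, with a fixed platform cost (screen, DRAM, radios)
# that runs for as long as the user is waiting. All figures are hypothetical.
task_work      = 10.0   # seconds of single-core work
core_power     = 0.8    # W per active core at the chosen operating point
platform_power = 1.5    # W for screen + rest of system while waiting

for cores in (2, 4):
    runtime      = task_work / cores               # ideal scaling assumed
    cpu_energy   = cores * core_power * runtime    # identical in both cases
    total_energy = cpu_energy + platform_power * runtime
    print(f"{cores} cores: {runtime:.1f} s, CPU {cpu_energy:.1f} J, system {total_energy:.1f} J")
```

The CPU energy is the same either way, but the system as a whole comes out ahead with the shorter runtime because the screen and the rest of the platform stay on for less time.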
 
One reason I see for nVidia to go with the ninja core instead of big.LITTLE is because licensing for Cortex A15 could be a lot cheaper than licensing for Cortex A15 + Cortex A7 + big.LITTLE's "glue" (assuming there is such a thing).
nVidia is definitely pursuing cost-effectiveness with Tegra and this could be a factor.
 
nVidia's 4+1 decision with Tegra 4 makes no real sense.

I am sure that there are logical reasons for NVIDIA to stick with 4+1 for Tegra 4.

For one, NVIDIA already has hardware and software experience using 4+1.

Secondly, rather than needing to rely on two radically different CPU designs like the A15 and A7 on one SoC die, they are able to use just one latest-and-greatest CPU design with the fifth battery saver core synthesized to run at lower frequencies (and hence lower voltages and lower power).

Third, since the battery saver core is used for tasks that are not very CPU-intensive (such as video/movie/audio playback, web page and ebook reading, email syncing, social networking, idling, etc.), then it makes little sense to use a multi-core CPU solution. In fact, if you look at the big.LITTLE video demonstrations on youtube showing CPU load on the quad-core A7's and quad-core A15's, you will notice that the quad-core A7's are never even close to being fully utilized.

Now, the quad-core A7 "little" cores could be more fully utilized during "MP" processing, but that mode of operation is far more complicated and requires more software and OS support to work right (not to mention the fact that the A15 cores are already far faster than the A7 cores, so adding lower clocked A7 cores in addition to higher clocked A15 cores will only give an incremental improvement in peak performance while possibly incurring some penalties from synchronizing caches between A15 and A7 cores/clusters running at different operating frequencies; and if the A15 cores/clusters are clocked lower to match the clock operating frequency of the A7 cores/clusters, then the peak performance may not even improve at all).
 
For one, NVIDIA already has hardware and software experience using 4+1.

The implementation on Tegra 4 is totally different from the implementation on Tegra 3, since it has separate L2 caches. It's probably a standard multi-cluster implementation from ARM, not that much different from one for big.LITTLE except both clusters are Cortex-A15.

Whatever software experience they had would pretty easily translate to big.LITTLE if only using it in the same 4+1 mode. They could start with that if they feel they need time to mature the software to be more capable. Since 4 A7s use about as much area as 1 A15 they wouldn't be losing anything there.

Secondly, rather than needing to rely on two radically different CPU designs like the A15 and A7 on one SoC die, they are able to use just one latest-and-greatest CPU design with the fifth battery saver core synthesized to run at lower frequencies (and hence lower voltages and lower power).

Since nVidia doesn't design or validate the CPUs it hardly matters. Externally (with regards to interfacing with the rest of their SoC), the two different types of clusters (A7 vs A15) are liable to look the same.

Third, since the battery saver core is used for tasks that are not very CPU-intensive (such as video/movie/audio playback, web page and ebook reading, email syncing, social networking, idling, etc.), then it makes little sense to use a multi-core CPU solution. In fact, if you look at the big.LITTLE video demonstrations on youtube showing CPU load on the quad-core A7's and quad-core A15's, you will notice that the quad-core A7's are never even close to being fully utilized.

That's only one example. If you think that 4 A7s can't be loaded then you either think that 4 cores will only be loaded heavily or 4 cores can't be loaded at all.

Both viewpoints are against what nVidia has been marketing. According to them, having four cores improves power consumption because there are workloads that will complete faster with more cores at a lower clock. This would immediately translate better to an A7 cluster after a certain point. Never mind of course that real threaded workloads tend to be fairly asynchronous and would benefit further from asynchronous cores.

And of course nVidia believes in four cores in general or it wouldn't have four A15s.

nVidia is saying that their power optimized A15 uses ~70% the power of a Tegra 3 A9 at 1.6GHz (40% less, 1 / 1.4), or maybe 60% the power (1 - 0.4) if I'm not reading that correctly. A 1.2GHz Tegra 3 Cortex-A9 core can probably use a similar amount or less than that, so there's zero doubt in my mind that a 1.2GHz Cortex-A7 on 28nm would use substantially less power.
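
Just to spell out the two possible readings of that marketing figure (plain arithmetic, nothing measured):

```python
# The two readings of "40% lower power than Tegra 3's A9 at 1.6GHz",
# normalised to the A9's power.
a9_power  = 1.0
reading_a = a9_power * (1 - 0.40)   # "40% less power"       -> 0.60x
reading_b = a9_power / 1.4          # "1.4x more efficient"  -> ~0.71x
print(reading_a, round(reading_b, 2))
```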

It wouldn't offer nearly the same performance in SPEC2k as an 825MHz Cortex-A15 (and thus would fail nVidia's marketing) but in other apps it probably wouldn't trail it by all that much. But the real question isn't whether or not you could live with it but just how much power it actually saves you to run that A15 at 825MHz instead of running the other cluster with one A15 at 825MHz.

Now, the quad-core A7 "little" cores could be more fully utilized during "MP" processing, but that mode of operation is far more complicated and requires more software and OS support to work right (not to mention the fact that the A15 cores are already far faster than the A7 cores, so adding lower clocked A7 cores in addition to higher clocked A15 cores will only give an incremental improvement in peak performance while possibly incurring some penalties from synchronizing caches between A15 and A7 cores/clusters running at different operating frequencies; and if the A15 cores/clusters are clocked lower to match the clock operating frequency of the A7 cores/clusters, then the peak performance may not even improve at all).

Tegra 4, by virtue of not being able to run both clusters simultaneously for some reason, almost certainly has a worse penalty in switching from one to the other than anything you'd get running in a proper big.LITTLE setup using HMP.

Not sure why you bring up performance; the point of HMP in these processors is to enable more efficient power consumption when you have threads that don't need full performance. Yes, it's more complex, but it's not like nVidia has to invest in this technology alone; it's going to happen on Android regardless. If anything it'd be smarter for them to leverage something that benefits from active research and development outside of their company.
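
For what it's worth, the basic placement decision an HMP scheduler makes is conceptually simple; here's a toy sketch with invented thresholds (not real kernel code, just the shape of the idea):

```python
# Toy load-based big.LITTLE placement policy. The thresholds, hysteresis and
# Task type are invented for illustration; real HMP schedulers track per-task
# load inside the kernel and are far more involved.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    load: float  # recent average utilisation, 0.0 .. 1.0

UP_THRESHOLD   = 0.75  # migrate up to a big core above this load
DOWN_THRESHOLD = 0.30  # migrate back to a LITTLE core below this load

def place(task, on_big):
    if not on_big and task.load > UP_THRESHOLD:
        return "big"
    if on_big and task.load < DOWN_THRESHOLD:
        return "LITTLE"
    return "big" if on_big else "LITTLE"  # hysteresis: otherwise stay put

print(place(Task("page_render", 0.90), on_big=False))  # -> big
print(place(Task("email_sync", 0.05), on_big=True))    # -> LITTLE
```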

Ultimately I think true big.LITTLE requires separate power and clock domains for each cluster, and this may be something nVidia didn't want to have on their own SoC (particularly the former). With their design they can probably mux the same regulator and PLL outputs to either cluster.
 
The implementation on Tegra 4 is totally different from the implementation on Tegra 3, since it has separate L2 caches.

In Tegra 4, when switching to the battery saver core, the 2MB L2 cache used for the main cores is flushed and power gated, while the battery saver core uses its own 512KB L2 cache to save power. In Tegra 3, when switching over to the battery saver core, the 1MB L2 cache used for the main cores is not flushed and is used by the battery saver core too, is that right? That may not be a trivial difference, but there should still be many similarities between the hardware and software implementations of the battery saver core on Tegra 4 and Tegra 3.

Since nVidia doesn't design or validate the CPUs it hardly matters.

That may be the case today, but as far as I can tell, it is NVIDIA's intention to design their own CPUs at some point in the future. So 4+1 may be a design feature that NVIDIA wants to include in their own custom CPUs in the future too, but who knows for sure.

That's only one example. If you think that 4 A7s can't be loaded then you either think that 4 cores will only be loaded heavily or 4 cores can't be loaded at all.

The examples I listed above are some of the most common usage scenarios for mobile devices today. When watching a video or movie, when listening to audio with headphones on while walking or jogging, when reading an ebook, when checking email, when talking on the phone, etc... for these basic tasks, it is very unlikely that the user is using their handheld device for anything other than the task at hand, so multi-core CPUs would not be very beneficial in these scenarios, in my opinion.

Both viewpoints are against what nVidia has been marketing. According to them, having four cores improves power consumption because there are workloads that will complete faster with more cores at a lower clock.

What NVIDIA is saying only applies to multi-threaded scenarios. In a scenario where multiple cores will actually be used, then four cores should have lower power consumption than dual cores (all else equal), and dual cores should have lower power consumption than a single core (all else equal). In scenarios that are not heavily multi-threaded (such as some of the usage models I outlined above), there would be little to no benefit in power consumption (all else equal).

nVidia is saying that their power optimized A15 uses ~70% the power of a Tegra 3 A9 at 1.6GHz (40% less, 1 / 1.4), or maybe 60% the power (1 - 0.4) if I'm not reading that correctly. A 1.2GHz Tegra 3 Cortex-A9 core can probably use a similar amount or less than that, so there's zero doubt in my mind that a 1.2GHz Cortex-A7 on 28nm would use substantially less power.

I do realize that Cortex A7 is a very power efficient design, but how useful is it to have quad A7 cores when the heavily multi-threaded tasks are consistently performed on the quad A15 cores?

Tegra 4, by virtue of not being able to run both clusters simultaneously for some reason, almost certainly has a worse penalty in switching from one to the other than anything you'd get running in a proper big.LITTLE setup using HMP.

According to NVIDIA, with respect to the battery saver core in Tegra 3, the total switching time is less than 2 milliseconds (i.e. an imperceptible delay to a human), but I haven't seen anything specific for Tegra 4. The reason that the four main CPU cores are not enabled at the same time as the battery saver core is to avoid the penalties involved in synchronizing caches between cores running at different frequencies.

Not sure why you bring up performance; the point of HMP in these processors is to enable more efficient power consumption when you have threads that don't need full performance. Yes, it's more complex, but it's not like nVidia has to invest in this technology alone; it's going to happen on Android regardless. If anything it'd be smarter for them to leverage something that benefits from active research and development outside of their company.

Sure, one of the goals of big.LITTLE MP is to use the right processor for the right task all at the same time if possible, but this is easier said than done, and most big semiconductor companies cannot even agree on what is the "right" processor design in the first place! In theory it is actually a nice idea, but in practice it will likely be challenging to implement. Also, there is no way to get around the fact that the highest performance CPU cores will be relatively power hungry and will still be needed for the most CPU-intensive tasks.
 
In Tegra 4, when switching to the battery saver core, the 2MB L2 cache used for the main cores is flushed and power gated, while the battery saver core uses its own 512KB L2 cache to save power. In Tegra 3, when switching over to the battery saver core, the 1MB L2 cache used for the main cores is not flushed and is used by the battery saver core too, is that right? That may not be a trivial difference, but there should still be many similarities between the hardware and software implementations of the battery saver core on Tegra 4 and Tegra 3.

That is correct. And the reason is that the Cortex-A9 didn't come with an integrated L2 cache; it had to be connected externally by the SoC vendor, so muxing a core's interface to the L2 was feasible. On the A15 it's back to a tightly coupled L2.

I don't see any reason to think that the hardware implementation shares anything in common with Tegra 3 at all. Designing two separate A15 clusters with separate L2 caches sharing a coherency interconnect is standard procedure with ARM's IP.

That may be the case today, but as far as I can tell, it is NVIDIA's intention to design their own CPUs at some point in the future. So 4+1 may be a design feature that NVIDIA wants to include in their own custom CPUs in the future too, but who knows for sure.

While that could very well apply to future Denver-based SoCs, it shouldn't have anything to do with a Cortex-A15 based SoC today.

nVidia is probably paying less in licensing for this solution and that's well and good but I'm really talking about technical advantages and disadvantages. nVidia would really like you to think this design has technical merit over the alternative.

The examples I listed above are some of the most common usage scenarios for mobile devices today. When watching a video or movie, when listening to audio with headphones on while walking or jogging, when reading an ebook, when checking email, when talking on the phone, etc... for these basic tasks, it is very unlikely that the user is using their handheld device for anything other than the task at hand, so multi-core CPUs would not be very beneficial in these scenarios, in my opinion.

One web-browsing example does not capture all web usage scenarios. Page loading is often when you need peak perf the most. And even when the user is only "doing" one thing, there will still be other tasks occasionally popping up in the background.

What NVIDIA is saying only applies to multi-threaded scenarios. In a scenario where multiple cores will actually be used, then four cores should have lower power consumption than dual cores (all else equal), and dual cores should have lower power consumption than a single core (all else equal). In scenarios that are not heavily multi-threaded (such as some of the usage models I outlined above), there would be little to no benefit in power consumption (all else equal).

Four cores will only use less power than two cores if they're clocked at a (perhaps substantially) lower clock speed. Therefore nVidia believes that there is merit in running more cores at lower perf. Which is exactly what a 4+4 big.LITTLE arrangement offers.

Of course 4 A7 cores won't benefit in scenarios that aren't heavily multithreaded, that goes without saying. But nVidia believes in scenarios that benefit from heavy threading at well below peak clocks...
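
To put rough numbers on that, using the standard dynamic-power approximation (all clocks and voltages invented for illustration, not vendor data):

```python
# Rough dynamic-power comparison using the usual P ~ C * V^2 * f approximation:
# 2 cores at a high clock vs 4 cores at half the clock for the same aggregate
# throughput (ideal scaling assumed). Voltages are invented but follow the
# usual pattern of dropping with frequency.
def dyn_power(cores, freq_ghz, volts, cap=1.0):
    return cores * cap * volts**2 * freq_ghz

two_fast  = dyn_power(cores=2, freq_ghz=1.6, volts=1.10)
four_slow = dyn_power(cores=4, freq_ghz=0.8, volts=0.90)
print(round(two_fast, 2), round(four_slow, 2))  # the lower-voltage option wins
```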

According to NVIDIA, with respect to the battery saver core in Tegra 3, the total switching time is less than 2 milliseconds (i.e. an imperceptible delay to a human), but I haven't seen anything specific for Tegra 4. The reason that the four main CPU cores are not enabled at the same time as the battery saver core is to avoid the penalties involved in synchronizing caches between cores running at different frequencies.

You got that totally backwards. Not being able to run them at the same time makes the transition take longer, because the big 2MB L2 cache has to be flushed to RAM, then loaded into the smaller 512KB L2 cache. If the two ran concurrently for a while they could transfer lines directly and on demand. It doesn't matter if the faster cluster has to clock down/stall for the slower core to keep up, because it's not active anymore (and could have entered a state where primarily only the L2 is powered).
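
A crude worst-case estimate of the data movement involved, using the cache sizes mentioned above and an assumed DRAM bandwidth (the bandwidth figure is a guess for illustration only):

```python
# Crude estimate of the data movement in a Tegra-4-style cluster switch: the
# 2MB main-cluster L2 is flushed to DRAM and the companion core's 512KB L2 is
# refilled from DRAM. Ignores everything else the switch involves.
l2_big   = 2 * 1024 * 1024   # bytes
l2_small = 512 * 1024        # bytes
dram_bw  = 6.0e9             # assumed effective bytes/s to DRAM

flush_time  = l2_big / dram_bw     # worst case: write the whole L2 out
refill_time = l2_small / dram_bw   # warm the small L2 back up
print(f"~{(flush_time + refill_time) * 1e3:.2f} ms of pure data movement")
```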

Sure, one of the goals of big.LITTLE MP is to use the right processor for the right task all at the same time if possible, but this is easier said than done, and most big semiconductor companies cannot even agree on what is the "right" processor design in the first place! In theory it is actually a nice idea, but in practice it will likely be challenging to implement. Also, there is no way to get around the fact that the highest performance CPU cores will be relatively power hungry and will still be needed for the most CPU-intensive tasks.

What does choosing the right scheduling have to do with choosing the "right" processor design? You act like no work has been done on this and they need to start from zero. Go read the papers Nebuchadnezzar has linked. It's already closer to "good enough" than "nothing".

Yes, there's no way to get around the fact that the higher-performance CPUs will be needed, obviously. I'm supporting big.LITTLE, not an A7-only SoC. The fact that the higher-performance cores need a lot more power is exactly why this system is worth anything.
 