NVIDIA Tegra Architecture

What is HMP? Running both A57 and A53 cores simultaneously? If so, I doubt ARM's claims on any kind of benefit to HMP. If not, enlighten me. =)
Yes. Benefit compared to what? A non-heterogeneous CPU arch? I don't know about that; I can't really comment on Apple's cores, and Krait is too outdated to be taken as a serious argument anymore.

HMP in the context of running the big and little cores together, versus just running one cluster at a time, is a no-brainer though. The big cores are so much more power-hungry than the little cores that any situation where a load is high enough to require the big cores sees a huge power benefit under HMP, because other threads are not forced to reside on that cluster too. I didn't actually do measurements on real-world workloads locked to the big cluster because it seemed pointless to me, but I did measure general efficiency and leakage. Gaming is a huge beneficiary here, as we might have only a single constant high-load thread with several smaller ones; the power difference can be >1W. The Linux scheduler is at the moment too stupid (especially if Nvidia doesn't use a modified one) to keep things from spilling over to other cores. Nvidia has some other software tricks for this in post-Tegra 4 chips (the CPUQuiet framework), but they're not optimal and things are generally moving away from those hacks. HMP/GTS has its fair share of problems, but they're mostly software-related and there are solutions for them.
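As a concrete illustration of the scheduler problem above, here is a minimal sketch (Python on Linux, not anything Nvidia actually ships) of what software ends up doing by hand when the scheduler isn't HMP-aware: pinning light threads to the little cluster so they can't spill onto the big cores. The 0-3 little / 4-7 big core numbering is an assumption; real cluster layouts vary per SoC.

```python
import os
import threading

LITTLE_CORES = {0, 1, 2, 3}  # assumed A53 cluster numbering
BIG_CORES = {4, 5, 6, 7}     # assumed A57 cluster numbering

def pinned_worker(name, cores, iterations):
    # On Linux, pid 0 means "the calling thread" for sched_setaffinity,
    # so each worker restricts itself to one cluster.
    os.sched_setaffinity(0, cores)
    print(f"{name}: running on CPUs {sorted(os.sched_getaffinity(0))}")
    sum(i * i for i in range(iterations))  # stand-in for real work

heavy = threading.Thread(target=pinned_worker, args=("render", BIG_CORES, 10**7))
light = threading.Thread(target=pinned_worker, args=("audio", LITTLE_CORES, 10**5))
heavy.start(); light.start()
heavy.join(); light.join()
```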
 
So nVidia says that Tegra X1 is using Cortex-A57 + A53 instead of Denver because this was faster and simpler to implement in 20nm. But Anandtech says that they have a completely custom physical implementation. In that case, how would "hardening" A57 and A53 - two CPUs they've never used before, especially the latter - be faster or simpler than using Denver, which they already have a custom implementation of and which would only need a more straightforward shrink?

There are already CPU cores with very broad voltage ranges to support sleep modes with low power retention. Denver itself had such a mode.

I'm wandering far afield on this, but what if the later data point has something to do with the answer to the original question?
Denver's dynamic range and power states seem to be more ambitious. Even if the vanilla cores and Denver were using custom physical design, one architecture is promising wackier things than the other, and getting the physical characterization right is an important part of this.
A risk-management measure like having a side project for an optimized version of ARM's more conservative designs (necessitated by the goal of being more broadly used) might be an incremental increase in development expenditure that could be handled somewhat in parallel.

The physical design/verification delay might be an echo of how Jaguar couldn't manage its promised turbo functions until Puma, although that also involved some unspecified amount of disruption to the development effort and juggling foundries.
 
HMP in the context of running the big and little cores together, versus just running one cluster at a time, is a no-brainer though. The big cores are so much more power-hungry than the little cores that any situation where a load is high enough to require the big cores sees a huge power benefit under HMP, because other threads are not forced to reside on that cluster too. I didn't actually do measurements on real-world workloads locked to the big cluster because it seemed pointless to me, but I did measure general efficiency and leakage. Gaming is a huge beneficiary here, as we might have only a single constant high-load thread with several smaller ones; the power difference can be >1W. The Linux scheduler is at the moment too stupid (especially if Nvidia doesn't use a modified one) to keep things from spilling over to other cores. Nvidia has some other software tricks for this in post-Tegra 4 chips (the CPUQuiet framework), but they're not optimal and things are generally moving away from those hacks. HMP/GTS has its fair share of problems, but they're mostly software-related and there are solutions for them.

What kinds of general efficiency and leakage measurements did you perform?

I'm skeptical that using 4 A57 and 4 A53 cores simultaneously provides benefit over using just the A57 or just the A53 cores. If the big cluster is turned on, slipping some light workloads onto the big cores won't change much. The performance impact would be small, and you've already paid the power cost of turning the big cores on.

I'd love to hear how you measured the benefit of running the big and little cores simultaneously.
 
Let me twist that one: if there aren't any benefits, why not stick with a 4+1 config in the first place?
Two reasons off the top of my head:
1. ARM did the work to make the A57 and A53, so NVIDIA doesn't have to spend any time or money making a leakage-optimized A57.
2. Marketing droids prefer octacore.
 
The performance impact would be small, and you've already paid the power cost of turning the big cores on.

There's a big power difference for every individual big core that is enabled, especially at higher clock speeds. Each core can use well over 1W on its own.
 
User space means it's an ordinary program, as opposed to needing kernel access. Most (but not all) drivers and file systems work in kernel space or are part of the kernel itself.
 
What kinds of general efficiency and leakage measurements did you perform?

I'm skeptical that using 4 A57 and 4 A53 cores simultaneously provides benefit over using just the A57 or just the A53 cores. If the big cluster is turned on, slipping some light workloads onto the big cores won't change much. The performance impact would be small, and you've already paid the power cost of turning the big cores on.

I'd love to hear how you measured the benefit of running the big and little cores simultaneously.
I did power measurements on all core and frequency combinations and some use-cases for big/little/big.little scenarios. Anyway, I think you have a bad understanding of the power management here: you say you pay the cost of turning the big cores on, but you forget that they have fine-grained power-gating. "Turning on" means having the CPU visible to the scheduler, not having it physically powered. The OS will always try to spread load horizontally over the CPUs of a cluster, and spreading medium-to-low tasks over the big cores is a stupid thing to do. As I said, the power penalty for anything running on the big cores that could otherwise have been scheduled on the little cores can be 5-10 times higher.
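A sweep like the one described is usually scripted against the standard Linux cpufreq sysfs interface. A rough sketch follows, assuming root access; the sysfs paths are the stock cpufreq interface, while the read_power_mw() hook and the cpu0/cpu4 cluster split are assumptions for illustration:

```python
import time

def read_power_mw():
    # Hypothetical hook: replace with whatever external power monitor you have.
    # Returning 0.0 keeps the sketch runnable without measurement hardware.
    return 0.0

def set_freq(cpu, khz):
    base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"
    # The userspace governor must be active before scaling_setspeed is writable.
    with open(f"{base}/scaling_governor", "w") as f:
        f.write("userspace")
    with open(f"{base}/scaling_setspeed", "w") as f:
        f.write(str(khz))

def available_freqs(cpu):
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_available_frequencies"
    with open(path) as f:
        return [int(k) for k in f.read().split()]

# One representative CPU per cluster; the 0/4 split is an assumed layout.
for cpu in (0, 4):
    for khz in available_freqs(cpu):
        set_freq(cpu, khz)
        time.sleep(1.0)  # allow the voltage/frequency transition to settle
        print(f"cpu{cpu} @ {khz} kHz: {read_power_mw()} mW")
```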
What exactly does this mean?
System coherency for all 8 cores: the kernel can schedule processes on all of them, processes being any application that runs on top of the OS.
 
I did power measurements on all core and frequency combinations and some use-cases for big/little/big.little scenarios.

Sorry, but this is super vague. What workloads were you running for these big/little/big.little scenarios? As I mentioned earlier, the big savings comes from choosing to use either the big cores or the little cores, not from using them both together.

Prove me wrong: I'd love to hear about a real world scenario where running the big cores simultaneously with the little cores mattered! It would make a great article to provide details on how ARM's HMP technology makes a difference in some important scenario - maybe you should consider writing one.

Until I see such evidence, the discussion contrasting cluster scheduling (either A57 or A53) versus global scheduling (A57 and A53) is merely marketing.
 
There's a big power difference for every individual big core that is enabled, especially at higher clock speeds. Each core can use well over 1W on its own.
Sure. So if you have a workload with one heavy thread and one light thread, how do you save {power, time} by turning on a little core? You've already burned the power to turn on the big core, so the little core is roundoff error in terms of power, right? Turning on the little core is also roundoff error in terms of performance: you could just keep the big core on and multiplex the heavy thread and the light thread on it without any performance penalty (otherwise, we have two heavy threads, right?)

Doesn't make much sense to me.
 
Sorry, but this is super vague. What workloads were you running for these big/little/big.little scenarios? As I mentioned earlier, the big savings comes from choosing to use either the big cores or the little cores, not from using them both together.

Prove me wrong: I'd love to hear about a real world scenario where running the big cores simultaneously with the little cores mattered! It would make a great article to provide details on how ARM's HMP technology makes a difference in some important scenario - maybe you should consider writing one.

Until I see such evidence, the discussion contrasting cluster scheduling (either A57 or A53) versus global scheduling (A57 and A53) is merely marketing.

I don't see why you have a hard time imagining a scenario where, for example, 1 A57 + 3 A53 cores on uses less power than 4 A57 cores on. Yes, you pay an extra cost for having the cluster powered at all (for the L2 cache and what have you), but that's still not enough to make the dual-cluster scenario useless.
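To put rough numbers on that, here is a hedged back-of-the-envelope; the 0.3 W cluster overhead and per-core figures are made-up assumptions for illustration, not measurements:

```python
CLUSTER_W = 0.3            # assumed fixed cost per powered cluster (L2, etc.)
A57_W, A53_W = 1.0, 1 / 6  # assumed per-core power draws

four_big = CLUSTER_W + 4 * A57_W               # one cluster, four big cores
mixed = 2 * CLUSTER_W + 1 * A57_W + 3 * A53_W  # both clusters powered

print(f"4x A57:          {four_big:.2f} W")    # 4.30 W
print(f"1x A57 + 3x A53: {mixed:.2f} W")       # 2.10 W, despite the 2nd cluster
```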
 
Sure. So if you have a workload with one heavy thread and one light thread, how do you save {power, time} by turning on a little core? You've already burned the power to turn on the big core, so the little core is roundoff error in terms of power, right? Turning on the little core is also roundoff error in terms of performance: you could just keep the big core on and multiplex the heavy thread and the light thread on it without any performance penalty (otherwise, we have two heavy threads, right?)

Doesn't make much sense to me.

Okay, now it's getting clearer. The big problem you have is that you're overestimating the difference in power between the big and little cores. There are plenty of scenarios where 1 big + 1 little core offers far more overall compute than just shoving everything on the 1 big core - the difference is far from "roundoff error". And if you look at some CPU load distributions settled on by real workloads you'll see that you very often do end up using some cores at a significant fraction of the peak speed.

And if you really wanted to make that performance tradeoff, you could still end up more efficient with the big + little scenario where the big core is clocked lower, because you pay a roughly cubic cost in power as frequency increases. Worse yet, voltage scaling on nVidia's SoCs isn't very fine-grained, so you just pay a lot for the top range of MHz, period.
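A quick worked example of the cubic claim: classic CMOS dynamic power goes as C * V^2 * f, and since voltage has to rise roughly linearly with frequency near the top of the range, power ends up scaling roughly with f^3. The voltage/frequency points below are made up for scale:

```python
def dyn_power(f_ghz, v_volts, c=1.0):
    # Classic CMOS dynamic power approximation: P = C * V^2 * f.
    return c * v_volts**2 * f_ghz

# Made-up operating points for one big core.
high = dyn_power(2.0, 1.2)  # top bin
low = dyn_power(1.2, 0.9)   # clocked and volted down

print(f"2.0 GHz / 1.2 V: {high:.2f} units")
print(f"1.2 GHz / 0.9 V: {low:.2f} units ({high / low:.1f}x less power)")
```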
 
I don't see why you have a hard time imagining a scenario where, for example, 1 A57 + 3 A53 cores on uses less power than 4 A57 cores on. Yes, you pay an extra cost for having the cluster powered at all (for the L2 cache and what have you), but that's still not enough to make the dual-cluster scenario useless.

How much performance do you get by adding those 3 A53 cores? Not much. Probably better to just keep the 1 A57 core on.

Again: please give me a concrete, measurable scenario.
 
How much performance do you get by adding those 3 A53 cores? Not much. Probably better to just keep the 1 A57 core on.

Again: please give me a concrete, measurable scenario.

You can't find some benchmarks of A53 vs A57 out there already? For some tasks those three A53s will be about the same speed as an entire separate A57. Are you okay just keeping the A57 on and letting performance halve?
 
My opinion is that things are already complicated enough as is. I have no idea if HMP is only for benchmark whoring or if it can be useful. Then again, 4+4 where only one cluster or the other runs at a time is established. Perhaps it's good for OSes and application developers not to have to care about yet more scheduling schemes.
 
You can't find some benchmarks of A53 vs A57 out there already? For some tasks those three A53s will be about the same speed as an entire separate A57. Are you okay just keeping the A57 on and letting performance halve?

First, I think there are very few important tasks with just these characteristics (1 critical heavy thread, 3 critical light threads). The fact that no one has yet managed to name a compelling use case for this feature supports my argument. It's sad, really, to be discussing this so hypothetically.

Second, even if there were an important task with these peculiar characteristics, what would be the savings versus processing this task with 2 A57 cores? Let's say an A57 burns 1 W for 3x the performance of an A53 at 1/6 W (making these numbers up for the sake of argument). So the HMP configuration (1 A57 + 3 A53) uses 1.5 W, while getting the same total throughput from 2 A57 cores burns 2 W, a savings of 25%. Not super earth-shattering. As I said, once you turn on the big core, the little ones are basically roundoff error. It's rather like Amdahl's law.
Now apply the same reasoning over time (what fraction of the time you use a mobile device has this strange task running), and the benefit diminishes further.
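For what it's worth, the arithmetic above spelled out with the same made-up numbers (1 W per A57 at 3x the throughput of a 1/6 W A53):

```python
A57_W, A53_W = 1.0, 1 / 6      # made-up per-core power
A57_PERF, A53_PERF = 3.0, 1.0  # made-up throughput in A53 units

hmp_w = 1 * A57_W + 3 * A53_W  # 1.5 W for 3 + 3 = 6 units of throughput
big_only_w = 2 * A57_W         # 2.0 W for the same 6 units

print(f"savings: {(big_only_w - hmp_w) / big_only_w:.0%}")  # 25%
```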
 
Asynchronous loads are not uncommon at all. For example, a game where one thread performs graphics related tasks, another physics, another AI, another audio, etc. These aren't going to all take the same amount of time per frame to complete, but they'll still be roughly independent.

And like Nebuchadnezzar says, scheduling tends to favor horizontal, and to an extent this is a better strategy because power consumption scales super-linearly with frequency, so up to a point you'd prefer to have tasks on another core than clock one higher. And then there are tasks where one thread will simply demand as much CPU power as possible, with the others hanging off of some fraction of it.

If you really want a specific example, I see things like this in my DS emulator, with the main thread doing CPU + 2D + geometry + 1 chunk of 3D, other threads doing other chunks of 3D, another thread updating surface textures, etc. And these examples are ignoring actual multitasking.

If asynchronous DVFS weren't useful, Qualcomm wouldn't have stuck with it from their first dual Scorpion cores through their last quad Kraits. If it's useful there, it's going to be useful with A57 + A53, albeit at a lower capacity. And it's not like this is totally theoretical; this has been in development in Linux for ages now and multiple presentations have been made with hard data. Like this one:

http://www.linuxplumbersconf.org/20...12-lpc-scheduler-task-placement-rasmussen.pdf

And this one:

http://www.slideshare.net/linaroorg/lca14-104-gtsasolutiontoarmsbiglittletechnology

There have been others still. Don't you think they would have realized by now if this were completely pointless and just marketing bluster?
 
As I mentioned earlier, the big savings comes from choosing to use either the big cores or the little cores, not from using them both together.
And what exactly, in your mind, is the difference in power consumption between having all threads switch CPU clusters and having individual threads switch CPUs? The savings on the per-thread level don't vanish anywhere.
First, I think there are very few important tasks with just these characteristics (1 critical heavy thread, 3 critical light threads). The fact that no one has yet managed to name a compelling use case for this feature supports my argument. It's sad, really, to be discussing this so hypothetically.
That's just absurd. Just install a CPU monitor overlay on an Android device and go do anything with it. There you have your use case.
 
And what exactly, in your mind, is the difference in power consumption between having all threads switch CPU clusters and having individual threads switch CPUs? The savings on the per-thread level don't vanish anywhere.
That's just absurd. Just install a CPU monitor overlay on an Android device and go do anything with it. There you have your use case.
The savings are still there, unfortunately just a very small win (at the cost of a lot of hardware and software complexity). I explained this in my other posts: once you turn on the big core, savings from using a small core are roundoff error.

Why would a CPU monitor show anything about the benefits of HMP? All it shows is that there are concurrent tasks, it says nothing about the power or performance savings from various scheduling choices.

My assertion is that HMP at best offers marginal power and performance savings. So far I have not seen data to the contrary.
 
Asynchronous loads are not uncommon at all. For example, a game where one thread performs graphics related tasks, another physics, another AI, another audio, etc. These aren't going to all take the same amount of time per frame to complete, but they'll still be roughly independent.

And like Nebuchadnezzar says, scheduling tends to favor horizontal, and to an extent this is a better strategy because power consumption scales super-linearly with frequency, so up to a point you'd prefer to have tasks on another core than clock one higher. And then there are tasks where one thread will simply demand as much CPU power as possible, with the others hanging off of some fraction of it.

If you really want a specific example, I see things like this in my DS emulator, with the main thread doing CPU + 2D + geometry + 1 chunk of 3D, other threads doing other chunks of 3D, another thread updating surface textures, etc. And these examples are ignoring actual multitasking.

If asynchronous DVFS weren't useful, Qualcomm wouldn't have stuck with it from their first dual Scorpion cores through their last quad Kraits. If it's useful there, it's going to be useful with A57 + A53, albeit at a lower capacity. And it's not like this is totally theoretical; this has been in development in Linux for ages now and multiple presentations have been made with hard data. Like this one:

http://www.linuxplumbersconf.org/20...12-lpc-scheduler-task-placement-rasmussen.pdf

And this one:

http://www.slideshare.net/linaroorg/lca14-104-gtsasolutiontoarmsbiglittletechnology

There have been others still. Don't you think they would have realized by now if this were completely pointless and just marketing bluster?

I've been around the industry long enough to know that marketing bluster goes a surprisingly long way. The first set of slides you posted was pretty unconvincing. The second set gives me a 502 server error for some reason. The scenarios you outline with games seem like they'd run fine on A57 cores.

Qualcomm doesn't have a great reputation for CPU design. Arguments about their DVFS are unconvincing.

Apple CPUs, on the other hand, seem to constantly surprise - and AFAIK they don't use big.LITTLE or asynchronous DVFS.
 