Tegra 3 officially announced; in tablets by August, smartphones by Christmas

Ailuros,
with this information available, what makes you think the A5 has more advanced power and clock gating? I'm not saying it doesn't, but Apple is notoriously secretive about pretty much everything and I didn't come across anything that would confirm this.

How do you get comparable power consumption/battery life between A4 and A5 while performance increases in the latter by N%? I know it's just a theory, and if A5 isn't actually on a comparable level in terms of power/clock gating, fine; that still doesn't mean that most if not all future SoCs won't move in the same direction.
 
It's a cheaper alternative (from a design effort POV) than individual DVFS for each core. And considering the size of an A9, it might even be better for area. It also has the advantage of being better for idle power.
Remember T3 cannot use both the 1xLP and the 4xGP cores at the same time. It doesn't help at all for mid-throttle workloads, but it helps a lot for very-low-throttle ones. On the other hand, per-core DVFS helps for mid-throttle up to near-maximum throttle, but not very-low throttle. So it's actually orthogonal and complementary.

If you could run the LP and GP cores at the same time, the problem is that the LP core would need either to run at a different voltage (in which case, why not just do per-core DVFS on the GP cores, at least in a 3+1 configuration?) or to run at a ridiculously low frequency (since the GP cores would run fairly low in a mid-throttle scenario, and even at max throttle the LP core's nominal voltage is higher than the G cores'). You could still achieve slightly higher performance, but that's really negligible and not worth the effort.

So it seems to me that the optimal implementation (taking cost into consideration) is 1xLP sharing a voltage domain with 1xG (never active at the same time, for complexity reasons) plus 3xG sharing another voltage domain (but with asynchronous clocks). The 1xG is the first one to reach maximum clock speed, and the 3xG cannot reach the maximum clock speed simultaneously (à la Intel Turbo Boost). The whole thing is monitored à la AMD PowerTune to stay within the TDP and minimise hotspots.
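To make that concrete, here's a rough sketch of the decision logic such a 1xLP + 1xG + 3xG arrangement could use. This is entirely hypothetical policy on my part - the thresholds, frequencies and the plan_for_load function are made up for illustration and have nothing to do with NVIDIA's (or anyone's) actual governor:

```python
# Hypothetical governor sketch for the scheme described above:
#  - one LP core and one G core share a voltage domain but are never on together
#  - three more G cores share a second voltage domain (asynchronous clocks)
#  - only the lone G core may "turbo" to the maximum clock
# All thresholds and frequencies are made up for illustration.

from dataclasses import dataclass

LP_MAX_MHZ   = 500
G_TURBO_MHZ  = 1500   # only reachable when a single G core is active
G_SHARED_MHZ = 1000   # ceiling when the 3xG domain is also running

@dataclass
class Plan:
    use_lp: bool
    active_g_cores: int
    g_clock_mhz: int

def plan_for_load(runnable_threads: int, load_pct: float) -> Plan:
    """Pick a core configuration for the current load (toy policy)."""
    # Very-low throttle: hand everything to the LP companion core.
    if runnable_threads <= 1 and load_pct < 15:
        return Plan(use_lp=True, active_g_cores=0, g_clock_mhz=0)
    # Single demanding thread: one G core, allowed to turbo.
    if runnable_threads == 1:
        return Plan(use_lp=False, active_g_cores=1, g_clock_mhz=G_TURBO_MHZ)
    # Multi-threaded: wake as many G cores as there are threads (up to 4),
    # but cap the shared clock since the full cluster can't all turbo at once.
    cores = min(runnable_threads, 4)
    return Plan(use_lp=False, active_g_cores=cores, g_clock_mhz=G_SHARED_MHZ)

if __name__ == "__main__":
    for threads, load in [(1, 5.0), (1, 80.0), (3, 60.0), (6, 95.0)]:
        print(threads, load, plan_for_load(threads, load))
```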

With the A15 that extra 5th core is quite expensive and it's also overkill for background tasks at idle so maybe an A9-class core would be better (or a higher clocked A5-class core - even with higher leakage transistors it should be better than an A15 since it's so small). It will be interesting to see what exactly ST-Ericsson is doing on the A9600. There's a bigger argument for simultaneous operation of all the cores there but I don't think they'll do it because of the massive software complexity involved (including OS-level).

A small summary:
- Dual-core: Higher perf, higher TDP, higher perf/W for mid-throttle workloads with 2+ fairly symmetric threads.
- Quad-core: Higher perf, higher TDP, higher perf/W for mid-throttle workloads with 4+ fairly symmetric threads.
- Per-core Clock: Higher perf/W for mid-throttle up to near-maximum workloads with non-symmetric threads (i.e. most of them to some extent). Some indication T30 might support it as well BTW.
- Per-core DVFS: Same as above, but to a *much* greater extent. Also a bigger incremental cost, but still not that big (mostly harder to implement, and the A9 doesn't support it iirc). This is unique to Qualcomm for now.
- Same LP Core: Higher perf/W at very-low-throttle. Indirectly allows you to get away with slightly higher leakage transistors on the other cores to increase performance and/or reduce dynamic power at the same frequency. First introduced by Marvell in the Armada 628.
- Different Lower Power Core: Same benefits but with lower cost and/or higher idle power savings, especially when the main cores are as big as the Cortex-A15. Might not use as low performance/leakage transistors so that the performance gap between the cores isn't too huge.
- Simultaneous Different Cores: Here be dragons!
 
Ailuros said:
How do you get comparable power consumption/battery life between A4 and A5 while performance increases in the latter by N%?
Because A5 has a bunch of power improvements compared to A4?
If we are to believe Nvidia, T3 will use significantly less power than T2 for the same workloads. Or at least for the cases they presented in their white paper. And that's without the benefit of a process shrink.

Given enough time, you can engineer improvements everywhere. There is no question A5 has advanced power features. But I don't see any data to claim that T3 has fewer of them. Do you?

There's no such thing as advanced clock gating; there's just the extent to which you do it, with diminishing returns as you go. By now, Apple, Nvidia and everybody else should already be scraping the bottom of the barrel on that front.

As for power gating: there are not that many options either. Very high bang-for-the-buck (!/$) for on/off power islands with a separate voltage rail. Less so for local power gating with virtual grounds. But since this kind of stuff has already been publicly discussed for years at conferences, everybody must have been doing all of that too.
 
Because A5 has a bunch of power improvements compared to A4?
If we are to believe Nvidia, T3 will use significantly less power than T2 for the same workloads. Or at least for the cases they presented in their white paper. And that's without the benefit of a process shrink.

Given enough time, you can engineer improvements everywhere. There is no question A5 has advanced power features. But I don't see any data to claim that T3 has fewer of them. Do you?

No, I didn't mean that T3 has fewer; if it came across that way, that's a straight no. I don't need NV to publicly state that T3 will consume less power than T2 at the same workload; even as a layman I can understand as much myself, since on T2 the two CPU cores unfortunately cannot be clocked independently. If one clocks at, say, 600MHz for workload N, the second core is bound to clock at 600MHz too. T3 is in that regard one "minor" step ahead, with far more positive results than the change itself might imply.
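A back-of-the-envelope illustration of that point (purely made-up numbers and operating points, assuming dynamic power scales roughly as V²·f and ignoring leakage):

```python
# Toy model (made-up numbers): dynamic power ~ C * V^2 * f, with a voltage
# assumed per operating point.  On a T2-style shared clock, a nearly idle core
# is dragged up to whatever the busy core needs; with independent clocks it
# could sit at a lower frequency instead.

OPP = {                      # frequency (MHz) -> assumed voltage (V)
    300: 0.85,
    600: 0.95,
    1000: 1.10,
}
CAP = 1.0                    # arbitrary switched-capacitance unit

def dyn_power(freq_mhz: int, activity: float) -> float:
    v = OPP[freq_mhz]
    return CAP * v * v * freq_mhz * activity

busy = dyn_power(600, activity=0.9)             # core 0 doing real work at 600MHz
# Shared clock: the mostly idle core 1 is also clocked at 600MHz.
idle_shared = dyn_power(600, activity=0.1)
# Independent clocks: core 1 could drop to 300MHz for the same light load.
idle_indep = dyn_power(300, activity=0.1)

print(f"busy core:             {busy:7.1f}")
print(f"idle core @600 shared: {idle_shared:7.1f}")
print(f"idle core @300 indep.: {idle_indep:7.1f}")
```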

There's no such thing as advanced clock gating; there's just the extent to which you do it, with diminishing returns as you go. By now, Apple, Nvidia and everybody else should already be scraping the bottom of the barrel on that front.

As for power gating: there are not that many options either. Very high bang-for-the-buck (!/$) for on/off power islands with a separate voltage rail. Less so for local power gating with virtual grounds. But since this kind of stuff has already been publicly discussed for years at conferences, everybody must have been doing all of that too.
It's an ongoing evolution of already employed techniques. From what I can tell, Qualcomm should also employ relevant power/clock gating techniques for its dual-core Snapdragons.

***edit: can't reach your PM box; you need to do some housekeeping there I guess.
 
Because A5 has a bunch of power improvements compared to A4?
If we are to believe Nvidia, T3 will use significantly less power than T2 for the same workloads. Or at least for the cases they presented in their white paper. And that's without the benefit of a process shrink.

Given enough time, you can engineer improvements everywhere. There is no question A5 has advanced power features. But I don't see any data to claim that T3 has fewer of them. Do you?

There's no such thing as advanced clock gating; there's just the extent to which you do it, with diminishing returns as you go. By now, Apple, Nvidia and everybody else should already be scraping the bottom of the barrel on that front.

As for power gating: there are not that many options either. Very high bang-for-the-buck (!/$) for on/off power islands with a separate voltage rail. Less so for local power gating with virtual grounds. But since this kind of stuff has already been publicly discussed for years at conferences, everybody must have been doing all of that too.

I would guess P.A. Semi and Intrinsity have a bit more experience than nVIDIA with regard to ARM designs?
 
Remember T3 cannot use both the 1xLP and the 4xGP cores at the same time. It doesn't help at all for mid-throttle workloads, but it helps a lot for very-low-throttle ones. On the other hand, per-core DVFS helps for mid-throttle up to near-maximum throttle, but not very-low throttle. So it's actually orthogonal and complementary.

I agree with that. But in a world of limited die area and limited engineering time, things that are complementary can be mutually exclusive :)

That being said, a 500MHz A9 is no slouch. I think it's sufficient for most non-active tasks (including audio streaming, etc.). But also keep in mind that even a DVFS core will leak more and likely use more active power when on a G process as compared to a core on an LP process.

There's also the cost in terms of L2 access latency when you run separate cores asynchronously.

If you could run the LP and GP cores at the same time, the problem is that the LP core would need either to run at a different voltage (in which case, why not just do per-core DVFS on the GP cores, at least in a 3+1 configuration?) or to run at a ridiculously low frequency (since the GP cores would run fairly low in a mid-throttle scenario, and even at max throttle the LP core's nominal voltage is higher than the G cores'). You could still achieve slightly higher performance, but that's really negligible and not worth the effort.

If you're swapping between the two sets of cores, you wouldn't need a separate regulator. You just need one voltage supply (well, two if a retention voltage is being used). The regulator -- and the subsequent validation of functionality -- is the costly part of DVFS for separate power islands.

So it seems to me that the optimal implementation (taking cost into consideration) is 1xLP sharing a voltage domain with 1xG (never active at the same time, for complexity reasons) plus 3xG sharing another voltage domain (but with asynchronous clocks). The 1xG is the first one to reach maximum clock speed, and the 3xG cannot reach the maximum clock speed simultaneously (à la Intel Turbo Boost). The whole thing is monitored à la AMD PowerTune to stay within the TDP and minimise hotspots.

Individual cores can be power-gated and clock-gated, so it functions almost the same (albeit without fine-grain control). Coarse-grain clock throttling would also be cheap and would allow quite a bit of power savings.

With the A15 that extra 5th core is quite expensive and it's also overkill for background tasks at idle so maybe an A9-class core would be better (or a higher clocked A5-class core - even with higher leakage transistors it should be better than an A15 since it's so small). It will be interesting to see what exactly ST-Ericsson is doing on the A9600. There's a bigger argument for simultaneous operation of all the cores there but I don't think they'll do it because of the massive software complexity involved (including OS-level).

Well, again, it comes down to design time. Having an identical core means that you can simply duplicate one of your current cores and mux it in. At that point, it looks to both the rest of the system and to software like a core going to sleep and waking up again.

I agree that ideally it'd be best to have an A9-class (or hell, A5-class) core sitting around for background tasks, but that would require more changes than just hardware.

A small summary:
- Dual-core: Higher perf, higher TDP, higher perf/W for mid-throttle workloads with 2+ fairly symmetric threads.
- Quad-core: Higher perf, higher TDP, higher perf/W for mid-throttle workloads with 4+ fairly symmetric threads.
- Per-core Clock: Higher perf/W for mid-throttle up to near-maximum workloads with non-symmetric threads (i.e. most of them to some extent). Some indication T30 might support it as well BTW.
- Per-core DVFS: Same as above, but to a *much* greater extent. Also a bigger incremental cost, but still not that big (mostly harder to implement, and the A9 doesn't support it iirc). This is unique to Qualcomm for now.
- Same LP Core: Higher perf/W at very-low-throttle. Indirectly allows you to get away with slightly higher leakage transistors on the other cores to increase performance and/or reduce dynamic power at the same frequency. First introduced by Marvell in the Armada 628.
- Different Lower Power Core: Same benefits but with lower cost and/or higher idle power savings, especially when the main cores are as big as the Cortex-A15. Might not use as low performance/leakage transistors so that the performance gap between the cores isn't too huge.
- Simultaneous Different Cores: Here be dragons!

Heheh. Alas, the last one is likely the best solution, but no one's working on it to my knowledge :/
 
Pressure said:
I would guess P.A. Semi and Intrinsity have a bit more experience than nVIDIA with regard to ARM designs?
P.A. Semi was founded in 2003 and was exclusively PowerPC. They only became ARM-based long after Nvidia started doing ARM. ;)
Intrinsity's ARM experience dates from around 2006. Roughly the same time frame as Nvidia.

I take issue with the term "experience" by itself. It's a hollow slogan that's pretty much meaningless, especially if you're talking 5+ years. If years of GPU experience were the factor, no other SoC maker would have a chance against Nvidia or Qualcomm.

There are not many mysteries in chip development, everybody looks at each other and copies what works or works around it in a different way.

E.g. the 4+1 core approach, instead of designing a fast LP core, is an acceptable trade-off: you get fast time to market, high clock speeds and low power at a small additional area cost, without the development cost of a custom core like Intrinsity's.
 
I agree with that. But in a world of limited die area and limited engineering time, things that are complementary can be mutually exclusive :)
Absolutely. Per-core DVFS isn't easy, and I suppose another small complication is that most power management chips on the merchant market don't have that many DC/DCs. Qualcomm has an advantage there in that they make their own PMICs. I suppose the others might catch up, or it might become more common to have chip-specific PMICs from third parties (à la Maxim or Dialog). I was struck that Maxim's custom PMIC for Samsung's Exynos has 7 step-down DC/DCs whereas the highest-end merchant ones have only 4.

But also keep in mind that even a DVFS core will leak more and likely use more active power when on a G process as compared to a core on an LP process.
[...]
If you're swapping between the two sets of cores, you wouldn't need a separate regulator.
Right, that's what I meant, not sure it was clear enough :)

There's also the cost in terms of L2 access latency when you run separate cores asynchronously.
Very good point. It seems to me the cache hierarchy on handhelds is a bit archaic though, and sharing the same L2 cache between 4+ cores might not be optimal. If you have a fairly small but very fast L2 per core (e.g. 256KB on Nehalem) and a larger global L3, then that extra latency isn't the end of the world (although it might still not be worth the trouble, and maybe there are other problems with that approach when you're very power-conscious). Either way I agree asynchronous clocks without per-core DVFS seem like an unattractive trade-off on today's architectures.

Also am I just being stupid, or can't you get most of the benefits of per-core clocks without per-core DVFS by doing aggressive clock gating for the full core? I just realised that the dynamic power of a 2GHz core being fully clock gated 50% of the time should be nearly identical to that of a 1GHz core running continuously (if they are both stuck at the exact same voltage because another core needs to run at 2GHz anyway).
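That intuition checks out under the usual first-order model (P_dyn ≈ α·C·V²·f). A quick sanity check with arbitrary constants of my own choosing:

```python
# First-order dynamic power: P = alpha * C * V^2 * f  (leakage ignored).
# With the voltage pinned (another core needs 2GHz anyway), a 2GHz core that is
# fully clock-gated half the time averages the same dynamic power as a 1GHz
# core running continuously.  Constants below are arbitrary.

ALPHA_C = 1e-9       # activity factor * switched capacitance (arbitrary units)
V = 1.1              # shared voltage, dictated by the fastest core

def p_dyn(freq_hz: float, duty: float) -> float:
    return ALPHA_C * V * V * freq_hz * duty

p_2ghz_gated_half = p_dyn(2e9, duty=0.5)
p_1ghz_continuous = p_dyn(1e9, duty=1.0)

print(p_2ghz_gated_half, p_1ghz_continuous)   # identical in this model
```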
 
Very good point. It seems to me the cache hierarchy on handhelds is a bit archaic though, and sharing the same L2 cache between 4+ cores might not be optimal. If you have a fairly small but very fast L2 per core (e.g. 256KB on Nehalem) and a larger global L3, then that extra latency isn't the end of the world (although it might still not be worth the trouble, and maybe there are other problems with that approach when you're very power-conscious). Either way I agree asynchronous clocks without per-core DVFS seem like an unattractive trade-off on today's architectures.

The limiting factor is die area, far more so than for desktop/laptop chips. 256KB of fast L2 per core would be great, but there'd be no room for a global L3. Keep in mind the GPU's GMEM has to be on-die as well, along with a plethora of other things.

That isn't to say, however, that there aren't multiple separate levels of cache to even out the latency issue. A certain asynchronous design has an L0....

Also am I just being stupid, or can't you get most of the benefits of per-core clocks without per-core DVFS by doing aggressive clock gating for the full core? I just realised that the dynamic power of a 2GHz core being fully clock gated 50% of the time should be nearly identical to that of a 1GHz core running continuously (if they are both stuck at the exact same voltage because another core needs to run at 2GHz anyway).

It is. The problem is knowing when to clock gate and also knowing when to wake the core. You can't just stop a CPU the moment there are no more incoming instructions; it has to finish all the ones it already has -- else another CPU or device may stall indefinitely waiting for an event or a write. This is a particular problem for handling semaphores, for instance.

I'm a bigger fan of coarse-grain clock scaling, in which a custom clock-divider or pulse-stealing circuit is used to cut the clock by powers of 2. This way, things are still synchronous but the CPU is effectively running at half speed (or a quarter, and so on). It would still require some type of handshake between the L2 and the CPU, but there's no asynchronous clock crossing, which is typically the bulk of the mismatched-clock latency.
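For anyone unfamiliar with the idea, here's a tiny sketch of what a divide-by-2^n "pulse stealing" enable pattern looks like. It's purely illustrative (not any vendor's actual circuit, and the function name is my own):

```python
# Hypothetical illustration: a divide-by-2^n "pulse stealing" enable pattern.
# The core still sees the same synchronous clock, but only every 2^n-th edge
# carries a clock-enable, so it effectively runs at f/2^n without any
# asynchronous clock-domain crossing.

def clock_enable_pattern(divide_log2: int, cycles: int) -> list[int]:
    """Return 1 where the core's clock-enable is asserted, 0 where the pulse is 'stolen'."""
    step = 1 << divide_log2           # 2^n
    return [1 if (cycle % step == 0) else 0 for cycle in range(cycles)]

if __name__ == "__main__":
    # Full speed, half speed, quarter speed over 8 reference clock cycles.
    for n in range(3):
        print(f"divide by {1 << n}:", clock_enable_pattern(n, 8))
```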
 
The limiting factor is die area, far more so than for desktop/laptop chips. 256KB of fast L2 per core would be great, but there'd be no room for a global L3. Keep in mind the GPU's GMEM has to be on-die as well, along with a plethora of other things.
GMEM? Do you mean the large tiling memory for Adreno? Obviously that's rather unique to Qualcomm - no other modern GPU architecture has anything like that. GPU die sizes are certainly increasing for a variety of reasons though (performance, features, minimising bandwidth, etc.)

Also, 256KB of fast L2 per core (or something along those lines) will be perfectly fine on 20nm. OMAP5 already has 2MB of cache for a dual-core A15, so if you had 256KB of L2 per core on a quad-core and a 2MB L3, you'd still only have 3MB of cache. That doesn't mean it will necessarily happen, but I think die sizes will allow for it sooner rather than later. If it's not Qualcomm, maybe Marvell (who also has ARM server ambitions) or NVIDIA (who makes its own L2 cache controllers) will do it. Or maybe not!

That isn't to say, however, that there aren't multiple separate levels of cache to even out the latency issue. A certain asynchronous design has an L0....
Ohhh, fun.

It is. The problem is knowing when to clock gate and also knowing when to wake the core. You can't just stop a CPU the moment there are no more incoming instructions; it has to finish all the ones it already has -- else another CPU or device may stall indefinitely waiting for an event or a write. This is a particular problem for handling semaphores, for instance.
Right, same problem as for a power gate, only slightly easier (faster wake up) but with smaller benefits - whether it's worth doing obviously depends on just how much faster and how much smaller.

I'm a bigger fan of coarse-grain clock scaling, in which a custom clock-divider or pulse-stealing circuit is used to cut the clock by powers of 2. This way, things are still synchronous but the CPU is effectively running at half speed (or a quarter, and so on). It would still require some type of handshake between the L2 and the CPU, but there's no asynchronous clock crossing, which is typically the bulk of the mismatched-clock latency.
Interesting approach; it still adds some complexity at the OS level, but that will presumably be solved one day. I kinda wish the OS would let the chips handle more of the thread scheduling process - it'd certainly encourage this kind of innovation more.

---

BTW, did anyone notice the text in Figure 8 on page 13 of NV's second vSMP whitepaper? http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-whitepaper-0911b.pdf

I could be reading way too much into this, but it seems to me their quad-core maximum clock rate on the smartphone SKU will be 1GHz. I remember Rayfield implied there would be both a 1GHz and a 1.5GHz SKU back around February. That seemed like a large performance gap to me and 1GHz seemed too low given the competition. But if that's the maximum quad-core clocks for phones and tablets rather than the peak single-core clocks (presumably 1.5GHz+ even on smartphones) then it all makes a lot more sense. Or maybe I'm reading too much into it as I said... :)
 
GMEM? Do you mean the large tiling memory for Adreno? Obviously that's rather unique to Qualcomm - no other modern GPU architecture has anything like that. GPU die sizes are certainly increasing for a variety of reasons though (performance, features, minimising bandwidth, etc.)

Yes. My point being that die area is at far more of a premium than on desktop chips with 4MB+ of L3. And if you had the die area, spending it on extra GPU pipelines may be more helpful....

Also, 256KB of fast L2 per core (or something along those lines) will be perfectly fine on 20nm. OMAP5 already has 2MB of cache for a dual-core A15, so if you had 256KB of L2 per core on a quad-core and a 2MB L3, you'd still only have 3MB of cache. That doesn't mean it will necessarily happen, but I think die sizes will allow for it sooner rather than later. If it's not Qualcomm, maybe Marvell (who also has ARM server ambitions) or NVIDIA (who makes its own L2 cache controllers) will do it. Or maybe not!

Could be. Cache sizes and hierarchy will be whatever the feature size allows, and certainly don't remain constant. I was just commenting that at 45/40/28nm it really didn't make sense to have a large 256KB L2 plus a 2MB L3.

Right, same problem as for a power gate, only slightly easier (faster wake up) but with smaller benefits - whether it's worth doing obviously depends on just how much faster and how much smaller.

Interesting approach; it still adds some complexity at the OS level, but that will presumably be solved one day. I kinda wish the OS would let the chips handle more of the thread scheduling process - it'd certainly encourage this kind of innovation more.

What complexity would be added at the OS level? This could be almost entirely transparent to software. Of course, the OS could architecturally set the processor speed as well, and the hardware divider would simply pick the "nearest" frequency it can do.
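A trivial sketch of that "nearest frequency" idea: the OS asks for an arbitrary frequency and the divide-by-2^n divider snaps the request to the closest frequency it can actually produce. The base clock, divider range and function name below are all made up for illustration, not any real driver interface:

```python
# Hypothetical "nearest divided frequency" selection for a /2^n clock divider.

BASE_HZ = 1_500_000_000            # assumed full-speed core clock
MAX_DIVIDE_LOG2 = 5                # divider supports /1 .. /32

def nearest_divided_freq(requested_hz: float) -> int:
    candidates = [BASE_HZ >> n for n in range(MAX_DIVIDE_LOG2 + 1)]
    return min(candidates, key=lambda f: abs(f - requested_hz))

if __name__ == "__main__":
    for req in (1.5e9, 1.0e9, 600e6, 100e6):
        print(f"requested {req/1e6:7.1f} MHz -> granted {nearest_divided_freq(req)/1e6:7.1f} MHz")
```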

BTW, did anyone notice the text in Figure 8 on page 13 of NV's second vSMP whitepaper? http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-whitepaper-0911b.pdf

I could be reading way too much into this, but it seems to me their quad-core maximum clock rate on the smartphone SKU will be 1GHz. I remember Rayfield implied there would be both a 1GHz and a 1.5GHz SKU back around February. That seemed like a large performance gap to me and 1GHz seemed too low given the competition. But if that's the maximum quad-core clocks for phones and tablets rather than the peak single-core clocks (presumably 1.5GHz+ even on smartphones) then it all makes a lot more sense. Or maybe I'm reading too much into it as I said... :)

I wouldn't read too much into it. Likely they chose the 1GHz point because that's the best datapoint in terms of perf/W they could get without looking ridiculous.
 
It is. The problem is knowing when to clock gate and also knowing when to wake the core. You can't just stop a CPU the moment there are no more incoming instructions; it has to finish all the ones it already has -- else another CPU or device may stall indefinitely waiting for an event or a write. This is a particular problem for handling semaphores, for instance.
You can preempt a thread that's waiting on a semaphore; it is not correct that this pins a thread to a specific CPU core.
 
I'm wondering, why not use a Cortex-A5 for the fifth, low-power core? Isn't that one even more power-efficient?

Would it somehow hinder the transition between the low-power core and the high-power ones?
 
I would think the biggest problem would be the A5 not supporting some of the A15 functionality, so some things couldn't be run on that core.
 
I would think the biggest problem would be the A5 not supporting some of the A15 functionality, so some things couldn't be run on that core.

Did you mean some of the A9 functionality?
Isn't the A5 using the exact same instruction set as the A9? Isn't it equally capable of having NEON and the FPU?
I thought the only "practical" difference was the higher IPC of the A9.
 
I can see at least two reasons to use the same core:
- nVidia are by now probably very familiar with the A9 design and its implementation
- having the exact same core reduces the cost of OS porting (I'm not sure Linux and/or Android transparently support heterogeneous cores).
 
Did you mean some of the A9 functionality?
Isn't the A5 using the exact same instruction set as the A9? Isn't it equally capable of having NEON and the FPU?
I thought the only "practical" difference was the higher IPC of the A9.

Ideally, the companion core should appear to the OS as identical to, say, core0, such that a switch in hardware is transparent. This includes things like how it handles ARM debug mode, its non-architectural registers and its TLB eviction behavior.

Also, the A5 has some added instructions and changes that didn't make it into A9; fused MAC and saturating VMAX/VMIN, for instance.
 