NVIDIA Tegra Architecture

The *huge* question is whether 825MHz is actually on the power saver core or not. Everyone is assuming that it is but I'm not sure that's the case.

Based on Anandtech's claim that the battery saver fifth CPU core operates up to 700-800MHz, I do think that the 825MHz is referring to one of the main CPU cores.

It's a little known fact that Tegra 3's high speed cores are actually slightly *lower power* at 500MHz than the battery saver core. The only reason why the battery saver core can clock that high is so that it can handle short bursts and more importantly not be too slow for the 2ms it takes to wake up the other cores. And it might make sense to have the same minimum frequency on the high speed cores as the maximum frequency on the battery saver core.

That is interesting! Do you know at what frequency the battery saver core in Tegra 3 actually results in lower power than one of the main CPU cores?

There are some differences in the implementation of the battery saver fifth CPU core in Tegra 3 vs. Tegra 4. According to Anandtech: "The fifth/companion core is also a Cortex A15, but synthesized to run at lower frequencies/voltages/power. This isn't the same G in an island of LP process that was Tegra 2/3".

I still wish they moved to a 4xA15+(4+1)xA7 architecture in Tegra 5 but we'll see.

That could certainly give more flexibility and potentially lower power consumption, but there may be some tradeoffs involved too. I guess we'll find out within the next year.
 
Exophase believes (quite rightly I suspect) that SPECInt is basically a best case for the A15 and performance will not increase as much in most benchmarks - therefore the perf/W improvement would also be less impressive.

The *huge* question is whether 825MHz is actually on the power saver core or not. Everyone is assuming that it is but I'm not sure that's the case. It's a little known fact that Tegra 3's high speed cores are actually slightly *lower power* at 500MHz than the battery saver core. The only reason why the battery saver core can clock that high is so that it can handle short bursts and more importantly not be too slow for the 2ms it takes to wake up the other cores. And it might make sense to have the same minimum frequency on the high speed cores as the maximum frequency on the battery saver core.

The high speed A15s on Tegra 4 are very likely lower power at their lowest voltage than the battery saver core at the same frequency (and/or at its highest voltage). That's just the way 4+1 works in practice. So given NVIDIA obviously wanted the lowest possible power for Tegra 4 on that slide, I suspect it's actually a high speed core at its lowest voltage, and Tegra 4's battery saver core can clock significantly lower. Of course, the A15s still need to be at their lowest possible voltage to significantly beat Tegra 3 at its highest voltage, so it's almost certainly less efficient at the process's nominal voltage, but it might work out fine with a bit of luck. I still wish they moved to a 4xA15+(4+1)xA7 architecture in Tegra 5 but we'll see.

The thing to keep in mind is that on Tegra 3 the power saver core used a different more leakage optimized process (requiring higher nominal voltages like you said) than on the other cores. That is NOT the case for Tegra 4. The only difference is that the fifth core has a different layout that diminishes max clock speed in order to reduce power consumption. I have no idea if this would require different switching voltages but it shouldn't, should it? It also doesn't say anything about leakage being lower, but if the area is smaller it should be somewhat.

I'm surprised to hear that Tegra 3's normal cores would use less power at 500MHz, though. I thought the 500MHz transition point was chosen precisely because at that point the lower static power outweighed the higher dynamic power, although your explanation makes sense.

That's fine, but at least with respect to video playback, NVIDIA did demonstrate at MWC 2013 that power consumption hovers close to 920mW on a Tegra 4 reference phone: http://www.youtube.com/watch?v=Ne1nT_g5_vs

Sure, but that's going to be dominated by the video decode engine; under good conditions (that nVidia could be sure to optimize for) the CPU could be almost unused. There's a reason that's what they show instead of something that uses the CPU to a known extent.


This doesn't add up at all. You don't have 40% better power consumption at the same performance and 75% better performance at the same power consumption. The latter number is always lower than the former, not higher. That's because perf/W is super-linear. These statements can't both be true.

According to Anandtech, the battery saver fifth CPU core will run up to 700-800MHz depending on SKU: http://www.anandtech.com/show/6550/...00-5th-core-is-a15-28nm-hpm-ue-category-3-lte . So this implies that the SPECInt data is from one of the main Cortex A15 CPU cores.

Maybe; we don't really know what their test units were clocked at. If their power numbers are not for the power-optimized core then it's a better situation.
 
This doesn't add up at all. You don't have 40% better power consumption at the same performance and 75% better performance at the same power consumption. The latter number is always lower than the former, not higher. That's because perf/W is super-linear. These statements can't both be true.

What NVIDIA is showing in their slide is that, for a given SPECInt2000 score of 520 (i.e. iso-performance between Tegra 4 and Tegra 3), Tegra 4 consumes 0.666W of power (i.e. 520 SPECInt / 780 SPECInt/W) while Tegra 3 consumes 1.155W of power (i.e. 520 SPECInt / 450 SPECInt/W). So Tegra 4 has nearly 75% higher SPECInt performance per watt (i.e. 780 SPECInt/W / 450 SPECInt/W), while at the same time having roughly 40% lower power consumption (i.e. 0.666W / 1.155W) compared to Tegra 3 for the same level of SPECInt performance. An alternate way to say it would have been that Tegra 3 has nearly 75% higher power consumption (i.e. 1.155W / 0.666W) compared to Tegra 4 for the same level of SPECInt performance.
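To make the arithmetic explicit, here's a quick back-of-the-envelope check in Python using the numbers read off the slide (all figures are approximate, and the variable names are just for illustration):

```python
# Approximate numbers read off NVIDIA's slide, as quoted above
specint_score = 520        # iso-performance point (SPECInt2000)
t4_perf_per_watt = 780     # Tegra 4, SPECInt2000 per watt
t3_perf_per_watt = 450     # Tegra 3, SPECInt2000 per watt

t4_power = specint_score / t4_perf_per_watt   # ~0.67 W
t3_power = specint_score / t3_perf_per_watt   # ~1.16 W

perf_per_watt_gain = t4_perf_per_watt / t3_perf_per_watt - 1  # ~0.73 -> "nearly 75% higher perf/W"
power_reduction = 1 - t4_power / t3_power                     # ~0.42 -> "roughly 40% lower power"

print(f"Tegra 4: {t4_power:.3f} W, Tegra 3: {t3_power:.3f} W")
print(f"perf/W gain: {perf_per_watt_gain:.0%}, power reduction: {power_reduction:.0%}")
```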
 
Well that slide is very confusing, but okay. In other words, the 40% and 75% numbers are the same thing.

SPECInt 2k is single-threaded, so assuming they aren't running multiple instances, Tegra 3 consuming 1.155W of power sounds awfully high. I wonder how much of this is outside of the CPU cores.
 
It's a little known fact that Tegra 3's high speed cores are actually slightly *lower power* at 500MHz than the battery saver core. The only reason why the battery saver core can clock that high is so that it can handle short bursts and more importantly not be too slow for the 2ms it takes to wake up the other cores. And it might make sense to have the same minimum frequency on the high speed cores as the maximum frequency on the battery saver core.

By the way, NVIDIA does note in their vSMP 4+1 whitepaper that the main CPU cores in Tegra 3 have better performance per watt at and above 500MHz, while the battery saver core has better performance per watt below 500MHz. With Tegra 4, NVIDIA decided to go with a much more performant battery saver core with its own dedicated L2 cache.
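For what it's worth, the policy described in that whitepaper essentially comes down to a frequency-threshold rule. A minimal sketch of the idea in Python (the 500MHz crossover is from the whitepaper; the function and names are purely illustrative, not NVIDIA's actual switching code):

```python
CROSSOVER_HZ = 500_000_000  # ~500MHz: crossover point described in the vSMP whitepaper

def pick_cluster(required_freq_hz: int) -> str:
    """Illustrative vSMP-style cluster choice: below the crossover the
    leakage-optimized battery saver core wins on perf/W, above it the
    main cores do. (Hypothetical sketch, not NVIDIA's real governor.)"""
    if required_freq_hz < CROSSOVER_HZ:
        return "battery saver core"
    return "main CPU cluster"

print(pick_cluster(300_000_000))    # -> battery saver core
print(pick_cluster(1_200_000_000))  # -> main CPU cluster
```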
 
nVidia had no choice but to give the battery saver core a dedicated L2 cache; that's required by the nature of Cortex-A15's design.
 
The thing to keep in mind is that on Tegra 3 the power saver core used a different more leakage optimized process (requiring higher nominal voltages like you said) than on the other cores. That is NOT the case for Tegra 4. The only difference is that the fifth core has a different layout that diminishes max clock speed in order to reduce power consumption. I have no idea if this would require different switching voltages but it shouldn't, should it? It also doesn't say anything about leakage being lower, but if the area is smaller it should be somewhat.
It's almost certainly not just a different synthesis - they likely used a very different voltage threshold as well. While this doesn't provide as dramatic a difference as different process characteristics, it is still very significant - Slide 12 in this presentation has a nice graph although no hard data: http://www.arm.com/files/pdf/AT_-_L...zed_SoC_Implementations_at_40nm_and_below.pdf

If you're willing to use High Vt cells (disadvantages: much lower performance, higher variability, not able to work at voltages as low as normal Vt) with increased channel length in both directions (much lower performance) then leakage would probably become a non-issue. The obvious trade-off is the massively lower performance at a given voltage, and so worse Perf/W at intermediate frequencies where you need a high voltage while normal transistors need a low voltage. But at extremely low frequencies, the very low leakage and ability to still use a low voltage does make it very efficient.
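To illustrate the shape of that tradeoff, here's a toy power model (every coefficient below is invented purely for illustration; the only point is that the high-Vt/long-channel variant has far lower leakage but needs a higher voltage for the same frequency, so it only wins at very low frequencies):

```python
def core_power_watts(freq_mhz, voltage, c_eff_nf, leak_ma):
    """Toy power model: P = C_eff * V^2 * f (dynamic) + V * I_leak (static).
    Every number used with it below is invented purely to show the shape of the tradeoff."""
    dynamic = c_eff_nf * 1e-9 * voltage ** 2 * freq_mhz * 1e6
    static = voltage * leak_ma * 1e-3
    return dynamic + static

# At a very low frequency, leakage dominates, so the high-Vt/long-channel variant wins
# even though it needs a slightly higher voltage to hit the same (low) clock:
print(core_power_watts(100, 0.80, 1.0, leak_ma=100))  # "normal-Vt" core: ~0.14 W
print(core_power_watts(100, 0.95, 1.0, leak_ma=5))    # "high-Vt" core:   ~0.10 W

# At an intermediate frequency the high-Vt cells need a much higher voltage to keep up,
# so dynamic power swamps the leakage savings and the normal-Vt core is more efficient:
print(core_power_watts(500, 0.90, 1.0, leak_ma=100))  # "normal-Vt" core: ~0.50 W
print(core_power_watts(500, 1.10, 1.0, leak_ma=5))    # "high-Vt" core:   ~0.61 W
```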

As I believe I've said before, my long-term preference is a 4+4+1 hybrid approach where in addition to the 4 A15s/A57s, you have 4 A7s or A53s that are performance-optimised (for highest possible Perf/W at moderately high frequencies and the ability not to wake up the A15s for most workloads) and one leakage-optimised A7/A53 that works as a battery saver core. Also since the A7 doesn't have a tightly integrated L2 cache AFAIK, you could go back to Tegra 3's approach of reusing the same L2, although I can't remember if that's still the case for the A53. Ideally you'd also have a different number of active cores if the kernel can handle it (e.g. 2+4+1 or 4+12+1), as it makes no sense to have more complex OoOE cores than the battery can handle, but it could be beneficial to have higher many-core performance.
 
It's almost certainly not just a different synthesis - they likely used a very different voltage threshold as well. While this doesn't provide as dramatic a difference as different process characteristics, it is still very significant - Slide 12 in this presentation has a nice graph although no hard data: http://www.arm.com/files/pdf/AT_-_L...zed_SoC_Implementations_at_40nm_and_below.pdf

If you're willing to use High Vt cells (disadvantages: much lower performance, higher variability, not able to work at voltages as low as normal Vt) with increased channel length in both directions (much lower performance) then leakage would probably become a non-issue. The obvious trade-off is the massively lower performance at a given voltage, and so worse Perf/W at intermediate frequencies where you need a high voltage while normal transistors need a low voltage. But at extremely low frequencies, the very low leakage and ability to still use a low voltage does make it very efficient.

Thanks for the link. I hadn't seen it before.

I guess in my mind I always considered the different transistor types as naturally dictated by different parts of the design and not something you'd change to target different leakage/performance/power characteristics. Not that I had a logical reason for that, guess I wasn't really thinking it through. I could see how HPL would be a better fit for this than LP since it offers a lot more flexibility in transistor type.

As I believe I've said before, my long-term preference is a 4+4+1 hybrid approach where in addition to the 4 A15s/A57s, you have 4 A7s or A53s that are performance-optimised (for highest possible Perf/W at moderately high frequencies and the ability not to wake up the A15s for most workloads) and one leakage-optimised A7/A53 that works as a battery saver core. Also since the A7 doesn't have a tightly integrated L2 cache AFAIK, you could go back to Tegra 3's approach of reusing the same L2, although I can't remember if that's still the case for the A53. Ideally you'd also have a different number of active cores if the kernel can handle it (e.g. 2+4+1 or 4+12+1), as it makes no sense to have more complex OoOE cores than the battery can handle, but it could be beneficial to have higher many-core performance.

A7 does have tightly integrated L2. That's how they can boast much lower typical L2 latency than Cortex-A9s. Keeping that latency reasonably low is pretty significant for an in-order core.

I think we'd need to see the performance characteristics of the A7 to determine if it's really worth using a different layout for low leakage usage, since its baseline leakage will be so much lower than the A15s' already, particularly if looking at processes that employ body biasing to dynamically adjust Vt and leakage. The area of having that extra A7 itself wouldn't matter, but the logic of switching the cluster is worth avoiding. In nVidia's case they can probably leverage the standard dual cluster A15 configuration to some extent, but with three clusters you could no longer do that.

Maybe a similar alternative possibility is to have a 4x A7/A53 cluster where one core (and hence part of the cluster logic) uses the leakage-optimized transistors, and you are only allowed to power it or the other 3 independently. I don't know what complications there'd be in making sure that the interconnect logic works for both core types.
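Just to spell out that constraint, here's a tiny sketch of the power-gating rule I have in mind (assuming it means the leakage-optimized core and the three normal cores are never powered at the same time; purely illustrative, not any real interconnect spec):

```python
# Hypothetical power-gating rule for the mixed 3+1 A7/A53 cluster sketched above:
# either the leakage-optimized core is powered, or some of the three normal cores
# are, but never both at once.
def power_state_allowed(lp_core_on: bool, normal_cores_on: int) -> bool:
    assert 0 <= normal_cores_on <= 3
    return not (lp_core_on and normal_cores_on > 0)

assert power_state_allowed(True, 0)        # battery saver core alone: OK
assert power_state_allowed(False, 2)       # two of the normal cores: OK
assert not power_state_allowed(True, 1)    # mixing both types: not allowed
```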
 
It's still Tegra related and after all we have a promise that we'll have Kepler graphics next year in mobile devices :p

Not saying it isn't (very remotely) on topic, just that it's interesting enough to warrant getting placed somewhere where it won't be as easily missed.
 
Not saying it isn't (very remotely) on topic, just that it's interesting enough to warrant getting placed somewhere where it won't be as easily missed.

LOL let's see if I can tickle your interest with a different approach: comparing on 40nm/TSMC, a GTX580 should be, by rough estimate, about 140x faster than the ULP GF in T3.

If I now estimate the ULP GF@T4 results in DXB2.7, the difference to the GTX Titan should shrink to about 60x in terms of performance (well, give or take).
 
Is that demo confirmed to be running on a Tegra 5 chip? If so, wouldn't that mean we could see Tegra 5 devices before the end of the year? That would mean an awfully short lifespan for Tegra 4 and almost makes Tegra 4 seem like a waste of resources.

Ah well, I'm probably reading too much into this. He only says that the Battlefield demo that was shown was running on 'mobile Kepler', which doesn't tell us much. For all we know, that demo is being run on a laptop or something and not an early sample tablet chip.
 
It would be pretty silly of him to compare a tablet to a larger device though. I mean, what would be the point? Even an HD 4000 is superior to the A6X GPU.
 
Hexus said:

Isn't it a bit premature of them to conclude that they made a good decision?

The Android Headlines article is putting a bit too much emphasis on having an LTE solution integrated into the mobile chipset though. Yes, that was a major advantage for Qualcomm, but Qualcomm's advantage was that they were pretty much the only one with decent LTE chips. They talk about the next Nexus 7 using a Qualcomm chip because it has integrated LTE, but chances are Google/ASUS will use the Snapdragon 600 and that doesn't have integrated LTE.

Another thing: both articles seem to imply that there will be a Samsung Exynos 5 Octa with integrated LTE. They're not explicitly stating it, but you'd almost read it as such. As far as I'm aware, Samsung has never said anything about integrating any basebands into their SoCs.

As for only 10 million chips: it doesn't seem like much, but I can't really think of any other large volume orders for Tegra 3. All Tegra 3 based devices other than the Nexus 7 are relatively low volume: the HTC One X, Surface, ASUS Transformers and a couple of other Windows RT and Android tablets. Anyone got any numbers on any of those devices? I would have thought that HTC would have sold at least a couple of million One Xs and X+s.
 
Well, I wasn't exactly optimistic about T3 volumes (unlike others here *cough*) but 10M sounds weaker than I would have expected even as the most pessimistic worst case scenario. The next question mark would then be Tegra revenue for the past year.

As for the rest of the Hexus writeup: let me try at least to understand what is going on here... Fudo claimed a while ago that T4 was delayed by a quarter, then from what I recall NV replied through Anand that there had been a misunderstanding about quarters, and now it's being delayed after all? It's a delayed SoC that hasn't been delayed :oops: Maybe I should start drinking a bit to understand those things a bit better.... :rolleyes:
 