NVIDIA Tegra Architecture

The dual-core Sandy Bridge Celeron 847, fabricated on a 32nm process, has a max TDP of ~17W (when running at a 1.1GHz CPU clock). So even with a shrink to a 22nm process and a newer architecture, there may not be much room to increase performance beyond this, since the max TDP of the 22nm ULX Haswell part is much lower at ~10W.

Comparing the dual-core Sandy Bridge Celeron 847 (operating at 800MHz) in the Acer C7 Chromebook to the dual-core Cortex A15 (operating at 1.7GHz) in the Samsung Chromebook XE303, with both CPUs fabricated on 32nm processes at Intel and Samsung fabs respectively: the Celeron 847 has roughly a 0% performance advantage in RIABench Focus Tests, a ~10% advantage in BrowserMark, and a ~30% advantage in SunSpider 0.9.1 and Kraken, but a ~50% disadvantage in the web browsing battery life test: http://www.anandtech.com/show/6476/acer-c7-chromebook-review/5 (and this battery life metric is even more lopsided when you consider that the Cortex A15-equipped Chromebook has ~20% less battery capacity than the Celeron 847-equipped Chromebook!). So even though the dual-core Celeron 847 clearly has higher IPC (instructions per clock), the dual-core Cortex A15 appears to have significantly better performance per watt and significantly lower power consumption, even when the Celeron is operating at only 800MHz.

Just a couple of clarifications: the Anandtech web browsing battery life tests show a 42% advantage for the Exynos 5250, not the ~50% listed above, before factoring in the lower battery capacity, which of course is still significant. Muddying the waters of battery life on the Acer C7 is the additional power draw of the SATA interface / hard drive versus the eMMC on the Samsung.

With a tick and a tock of evolutionary IPC improvements, and Intel's focus on lowering power draw, it will be fascinating to see the level of performance in their sub-10W range. However, the true rival to the A15 / A57 will be Bay Trail, in terms of price / watts.

The snake in the grass, IMO, is the 14nm Broadwell 'SoC' in 2014: will that be the first big-core x86 to power fanless tablet designs? I personally believe that Intel won't be able to kill off ARM in the short term, but they will be able to restrict them to the smartphone sector plus low-end tablets.
 
Actually no, Anandtech's web browsing battery life tests show a 52% advantage for the dual core Cortex A15 [Samsung Chromebook XE303] compared to dual core Sandy Bridge Celeron 847 [Acer C7 Chromebook]: http://images.anandtech.com/graphs/graph6476/52686.png . If you take into account the fact that the dual core Sandy Bridge Celeron 847 Chromebook has 23% more battery capacity, then the dual core Cortex A15 Chromebook has an 88% advantage in the normalized web browsing battery life test.

ARM and ARM licensees are certainly not standing still either. Also, to be "restricted" to smartphones and tablets is not necessarily a bad thing, as these are the highest-growth areas in mobile computing. Anyway, the notion that ARM would be inherently restricted to any specific sector is probably not correct (other than some professionals who rely on x86 processors). ARM processors will be found almost everywhere, including cars, TVs, dishwashers, and so on.
 
I wonder how much impact a hard drive has on power consumption in tests that are (hopefully) not accessing the disk at all.
 
Regarding the 52%: I used a percentage difference calculation, but because of the known directionality a simple percent change is correct. My bad, apologies.
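For anyone who wants to sanity-check the figures, here is a minimal Python sketch of the three calculations in play (percent change, percentage difference, and the battery-capacity normalization). The runtimes below are made-up placeholders chosen only to land near the ~52% ratio under discussion, not Anandtech's published results.

```python
# Illustrative arithmetic only: the runtimes are placeholder values chosen to give a
# ratio close to the one under discussion, not Anandtech's published measurements.
arm_hours   = 6.4    # hypothetical Samsung Chromebook XE303 (Cortex A15) result
intel_hours = 4.2    # hypothetical Acer C7 (Celeron 847) result

# Percent change: how much longer the ARM machine lasts, relative to the Intel one
percent_change = (arm_hours - intel_hours) / intel_hours * 100                              # ~52%

# Percentage difference: the same gap, but relative to the mean of the two values
percentage_difference = (arm_hours - intel_hours) / ((arm_hours + intel_hours) / 2) * 100   # ~42%

# Normalizing for the Acer C7's roughly 23% larger battery (runtime per unit of capacity)
capacity_ratio = 1.23
normalized_advantage = ((1 + percent_change / 100) * capacity_ratio - 1) * 100              # ~87-88%

print(f"{percent_change:.0f}% / {percentage_difference:.0f}% / {normalized_advantage:.0f}%")
```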

As someone with no direct affiliation with the big semiconductor players, I truly hope you're right and that ARM / TSMC etc. continue to pressure Intel and keep the landscape competitive. My feeling is that either ARM underestimated the power / thermal profile of the A15 core, or they were expecting more from TSMC's 28nm process to keep power usage down whilst maintaining higher clock speeds.
 
Knowing how much Chrome loves to cache pages, it is hard to imagine there isn't a fair amount of HD activity. This is what Anandtech has to say about the test:

Going into the iPhone 5 review I knew we needed to change the suite. After testing a number of options (and using about 16.5GB of cellular data in the process) we ended up on an evolution of the battery life test we deployed last year for our new tablet suite. The premise is the same: we regularly load web pages at a fixed interval until the battery dies (all displays are calibrated to 200 nits as always). The differences between this test and our previous one boil down to the amount of network activity and CPU load.

On the network side, we've done a lot more to prevent aggressive browser caching of our web pages. Some caching is important otherwise you end up with a baseband test, but it's clear what we had previously wasn't working. Brian made sure that despite the increased network load, the baseband still had the opportunity to enter its idle state during the course of the benchmark.

We also increased CPU workload along two vectors: we decreased pause time between web page loads and we shifted to full desktop web pages, some of which are very js heavy. The end result is a CPU usage profile that mimics constant, heavy usage beyond just web browsing. Everything you do on your smartphone ends up causing CPU usage peaks - opening applications, navigating around the OS and of course using apps themselves. Our 5th generation web browsing battery life test should map well to more types of smartphone usage, not just idle content consumption of data from web pages.
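As a rough illustration of the control flow such a test implies, and nothing more: this is not Anandtech's harness, the URLs and the Linux battery-capacity path are placeholders, and a real run drives an actual browser at a calibrated 200 nits rather than fetching pages headlessly. A loop of this shape captures the fixed-interval page-load idea:

```python
# Rough sketch (not Anandtech's harness) of a fixed-interval web-page-load battery test:
# cycle through a page list until the battery gives out, then report total runtime.
import time, urllib.request

PAGES = ["https://example.com/", "https://example.org/"]   # placeholder page set
PAUSE_SECONDS = 15                                          # fixed interval between loads

def battery_percent():
    # Linux sysfs battery reading; the path is an assumption and varies by machine.
    with open("/sys/class/power_supply/BAT0/capacity") as f:
        return int(f.read())

start = time.time()
loads = 0
while battery_percent() > 2:            # stop just before the machine dies
    url = PAGES[loads % len(PAGES)]
    try:
        urllib.request.urlopen(url, timeout=30).read()   # fetch the full page body
    except OSError:
        pass                            # network hiccups shouldn't end the run
    loads += 1
    time.sleep(PAUSE_SECONDS)

print(f"ran {(time.time() - start) / 3600:.2f} hours, {loads} page loads")
```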
 
I'm not sure that ARM underestimated the power and thermal profile of the Cortex A15 core per se; rather, ARM most likely does not expect Cortex A15-based designs in tablets and smartphones to regularly reach peak CPU operating frequencies. ARM also most likely expects most Cortex A15-based designs to use low-power companion core(s), as seen in big.LITTLE, 4+1, OMAP5, etc., where the majority of common day-to-day tasks (video and movie playback, social networking, emailing, texting, calling, and so on) are performed on the companion core(s). On the flip side, ARM does seem cognizant of the fact that lower power yet reasonably powerful CPU designs such as Krait and Swift may be more appropriate for the smartphone space, and the R4 Cortex A9 CPU appears to be aimed at that space.
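As a purely illustrative sketch of that division of labor, the idea is essentially a hysteresis rule over recent load: light, bursty work stays on the low-power cluster and only sustained heavy load wakes the big cores. The thresholds, names and policy below are invented; real big.LITTLE / 4+1 switching lives in the kernel and firmware, not in application code.

```python
# Toy illustration of the companion-core idea (not ARM's or NVIDIA's actual policy):
# keep light, bursty work on the low-power cluster; migrate only under sustained load.
LITTLE, BIG = "companion/LITTLE cluster", "Cortex-A15 cluster"

UP_THRESHOLD   = 0.85   # sustained utilization needed to wake the big cluster (made up)
DOWN_THRESHOLD = 0.30   # utilization below which work falls back to the little cluster
WINDOW         = 4      # number of consecutive samples that must agree

def choose_cluster(utilization_samples, current):
    """Pick a cluster from recent per-sample CPU utilization (0.0 to 1.0)."""
    recent = utilization_samples[-WINDOW:]
    if len(recent) == WINDOW and min(recent) >= UP_THRESHOLD:
        return BIG                     # sustained heavy load (games, page-rendering bursts)
    if len(recent) == WINDOW and max(recent) <= DOWN_THRESHOLD:
        return LITTLE                  # video playback, email, idle-ish background work
    return current                     # hysteresis: otherwise stay put

# Example: a mostly idle utilization trace with one heavy burst in the middle
trace = [0.1, 0.2, 0.15, 0.1, 0.9, 0.95, 0.92, 0.9, 0.2, 0.1, 0.05, 0.1]
cluster = LITTLE
for t in range(WINDOW, len(trace) + 1):
    cluster = choose_cluster(trace[:t], cluster)
print(cluster)   # ends back on the companion/LITTLE cluster
```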
 
ams said:
On the flip side, ARM does seem cognizant of the fact that lower power yet reasonably powerful CPU designs such as Krait and Swift may be more appropriate for the smartphone space, and the R4 Cortex A9 CPU appears to be aimed at that space.

Cortex-A9 r4 isn't really aimed at anything Cortex-A9 wasn't already aimed at, it's just a standard revision which probably would have happened regardless of what the rest of the industry did. ARM didn't even make a press release about it (nor do they about any of their other revisions), it's only nVidia pushing it strongly in marketing. While the reported gains in SPECInt are impressive I'm not holding my breath that you'll see them across the board. Bear in mind, most of what it does is increases the configuration options for buffer sizes (micro ITLB, TLB, GHB, BTB - what we don't know is what nVidia picked for any of these, just that the option for bigger was there) and changes the auto prefetcher a bit. Note the operative word here, "changes", not introduces. The main difference is that it tracks all cache accesses and not just misses, and can track far more streams in flight. It's not hard to see how this can make a big difference sometimes, since you can recognize linear patterns long before they cause a cache miss, and just two streams can be limiting in some cases. But I think it's more a case of ARM looking at profiling data and realizing there was potential for tangible improvement from an easy change, or maybe there were bugs preventing it from working this way in the first place.

Of course as someone looking in and guessing my reasoning could be way off, in which case I'll just wait for Laurent to come and smack me :p

ARM has only gradually been learning how to do prefetchers, which is a big advantage Intel has had over them; that gap will probably shrink in the future.
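For those who haven't followed prefetcher designs, here is a toy model of the behavior described above: every access (not just misses) updates a per-stream entry, and once a stride repeats the prefetcher runs ahead of the demand stream. The table size, confidence threshold and prefetch distance are invented for illustration and are not the A9 r4's actual parameters.

```python
# Simplified stride-prefetcher model: observe every access (not only misses), detect a
# stable stride per stream, then prefetch a couple of lines ahead of the demand stream.
class StreamEntry:
    def __init__(self, addr):
        self.last_addr = addr
        self.stride = None
        self.confidence = 0

class StridePrefetcher:
    def __init__(self, max_streams=8, prefetch_distance=2):
        self.streams = {}                  # keyed by "stream id" (here: the access PC)
        self.max_streams = max_streams
        self.distance = prefetch_distance  # how many strides ahead to fetch

    def access(self, pc, addr):
        """Called on every cache access; returns a list of addresses to prefetch."""
        e = self.streams.get(pc)
        if e is None:
            if len(self.streams) < self.max_streams:
                self.streams[pc] = StreamEntry(addr)
            return []
        stride = addr - e.last_addr
        if stride != 0 and stride == e.stride:
            e.confidence += 1
        else:
            e.stride, e.confidence = stride, 0
        e.last_addr = addr
        if e.confidence >= 2:              # linear pattern confirmed before any miss
            return [addr + e.stride * i for i in range(1, self.distance + 1)]
        return []

# Two interleaved linear streams, e.g. copying one array into another
pf = StridePrefetcher()
for i in range(6):
    print(pf.access(pc=0x100, addr=0x8000 + 64 * i),
          pf.access(pc=0x104, addr=0xA000 + 64 * i))
```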

Aren't the screens different as well?

Yes, although power consumption in screens of similar size and brightness can vary somewhat, so it's not that reliable to estimate much from this.

Trying to carefully reason about CPU power consumption from the battery life test is pretty questionable all around; better off waiting for some isolated power consumption tests. I'm sure sites will do some for IB or Haswell tablets sooner or later.
 
Heh, I am predictable :)

All of these changes will show speed increases for larger benchmarks such as browser ones (at least if the Android browser benchmarks are as heavy as they should be). On the other hand, don't expect any speedup in Geekbench or other similar micro-benchmarks except where the improved data prefetcher kicks in.

For the rest, there's not much I can say except that some of the improvements of A9 r4 were done after customer feedback.
 

For those big icache killers using a bunch of library code, I could see the option of a 64-entry micro ITLB instead of a 32-entry one being beneficial. Since that's the only new option for the micro TLBs, I expect nVidia went with it. I know on Cortex-A8 there were cases where the small 32-entry DTLB burned me, but I don't think I hit the ITLB that much.

I guess the other branch prediction buffer size enhancements may also couple somewhat with larger code footprints, so they'd apply as well.

I'd actually expect a lot of those crappy JavaScript benchmarks to benefit more than native code, since they have larger-than-typical hot code footprints before entering system/library code and a ton of extra branches for type guards, so extra capacity in the GHB and BTAC will help. It isn't clear to me how the GHB is indexed (whether the increase in size implies a larger branch history register, or whether it just uses more PC bits independently), but either one is beneficial.
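The two indexing possibilities being speculated about look roughly like this in a generic gshare-style predictor. This is not a statement about how Cortex-A9 actually forms its GHB index, just an illustration of "longer history register" versus "more PC bits"; the table size and bit widths are made up.

```python
# Generic sketch of the two indexing schemes being speculated about, not the A9's
# actual design: a bigger GHB can be indexed either by widening the global history
# register, or by folding in more PC bits while keeping the same history length.
GHB_ENTRIES = 4096                       # hypothetical table size (power of two)
INDEX_BITS  = GHB_ENTRIES.bit_length() - 1

def ghb_index_longer_history(pc, global_history):
    # Option 1: history register widened to INDEX_BITS bits, XORed with PC bits (gshare)
    return ((global_history & (GHB_ENTRIES - 1)) ^ (pc >> 2)) & (GHB_ENTRIES - 1)

def ghb_index_more_pc_bits(pc, global_history, history_bits=8):
    # Option 2: keep a short history and concatenate extra PC bits to fill the index
    hist = global_history & ((1 << history_bits) - 1)
    pc_bits = (pc >> 2) & ((1 << (INDEX_BITS - history_bits)) - 1)
    return (pc_bits << history_bits) | hist

def update_history(global_history, taken):
    # Either way, each taken/not-taken outcome shifts into the global history register
    return ((global_history << 1) | int(taken)) & (GHB_ENTRIES - 1)

# Example: same branch PC and outcome history, two different index formations
pc, hist = 0x0001A3C4, 0b1011_0100_1011
print(ghb_index_longer_history(pc, hist), ghb_index_more_pc_bits(pc, hist))
```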

nVidia is claiming a substantial benefit in SPECInt2k, which isn't as code-heavy; I bet that's mostly down to the prefetcher improvements, which is encouraging. But I also think it might be more data-heavy than a lot of other code.

The point about customer feedback makes sense. I've always thought that one of the strengths of ARM's model is that they can work a little more closely with customers in troubleshooting and refining designs.
 
By customers, I assume you mean SoC makers, right?
 
Nvidia just confirmed at GTC 2013 that Logan has a CUDA- and OpenGL 4.3-capable Kepler GPU.

The next-gen Tegra SoC is codenamed Parker (Spider-Man), with a Project Denver CPU, a Maxwell GPU, and FinFET transistors.
 
Interesting that Project Denver is taking so long to make it into nVidia's SoCs. Many figured Logan would have it. Some even thought Wayne would.

I wonder if Logan will have A15s again, or if it'll have A57s. That kind of raises the more general question of when we'll first see tablet manufacturers strongly embrace 64-bit. I wonder if Google has even mentioned anything about 64-bit migration for Android.
 
I always "thought" that we wouldn't see Denver before TSMC 20nm, and I don't think NV ever left hints of anything earlier, or did they?
 
Just so we're clear, are you saying you expect Logan to be 28nm?

One hint was given here: http://venturebeat.com/2011/03/04/q...his-strategy-for-winning-in-mobile-computing/

nVidia suggested that PD would finish its five year design cycle by the end of 2012. So it should easily be on track for a mid-2014 release, rather than a mid-2015 one. It may still end up hitting compute cards before low power SoCs.
 
What would speak against it?

Tegra 3 first showed up in devices a good 15 months after Tegra 2 did, and that was a good 15 months ago, so the gap will be even longer for Tegra 4. I don't know when the first Tegra 4 devices will hit, but it could be another month or two from now. Therefore I'd expect the first Logan devices mid-2014 at the earliest, but it could easily drag into late 2014. Just how late do you think TSMC 20nm is going to be?
 
You might also want to remember that Tegra 2 and Tegra 3 were both manufactured on TSMC's 40nm; besides, there's no guarantee that NV plans to use a new process as soon as it's available. Theoretically they could have manufactured T3 on 28nm too, so what kept them from it? Could it be a number of reasons why 40nm would still have been a much safer bet?
 
I didn't forget that Tegra 2 and Tegra 3 were both 40nm. That doesn't mean that nVidia is always going to release two SoCs on the same node back to back, especially not if their product cycles are getting longer.

No one else had 28nm products out when Tegra 3 was released, so it's not like they went against the industry. And it's not like the first 28nm products were especially late versus the expectations held for a good year beforehand. Expecting TSMC's 20nm to be ready for a mid-to-late-2014 release doesn't seem that unrealistic.

You probably could make a good argument that nVidia would have been better off waiting for 28nm, but at the time nVidia needed something more suitable for phones out ASAP: Tegra 2 was a misstep in that direction since it couldn't power gate a single core, and the lack of NEON was quickly looking like a glaring mistake that needed prompt correction.

The situation is a little different with Tegra 4. The die isn't nearly as small as Tegra 2's, so they don't have loads of room to grow in. It also looks like the Cortex-A15s are going to use a lot of power. If they're going to substantially grow peak CPU power consumption with the next generation (as one would certainly expect), and if they're going to do it with Cortex-A15s or maybe Cortex-A57s (their only option if it's not using Denver yet), then they really need a lower-power process. Right now they don't have the power budget to grow at all, not in any market they're remotely established in.
 