We all like to claim that Apple is form without substance, and it's easy to troll them about it, but now we're forced to remember that they are a hardware company after all.
Two reasons off the top of my head:
1. ARM did the work to make the A57 and A53, so NVIDIA doesn't have to spend any time or money making a leakage optimized A57.
2. Marketing droids prefer octacore.
Apple CPUs, on the other hand, seem to constantly surprise - and AFAIK they don't use big.LITTLE or asynchronous DVFS.
Sure. So if you have a workload with one heavy thread and one light thread, how do you save {power, time} by turning on a little core? You've already burned the power to turn on the big core, so the little core is roundoff error in terms of power, right? Turning on the little core is also roundoff error in terms of performance: you could just keep the big core on and multiplex the heavy thread and the light thread on it without any performance penalty (otherwise, we have two heavy threads, right?)
Doesn't make much sense to me.
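A minimal sketch with made-up numbers, just to put that "roundoff error" argument in concrete terms; the per-core power and load figures below are assumptions for illustration, not measurements of any real A57/A53 implementation:

```python
# Back-of-the-envelope numbers (invented, not measured) for one heavy thread
# plus one light thread on a big.LITTLE pair.
BIG_CORE_POWER_W = 1.5      # assumed active power of one big (A57-class) core
LITTLE_CORE_POWER_W = 0.15  # assumed active power of one little (A53-class) core
HEAVY_LOAD = 1.00           # heavy thread saturates a big core
LIGHT_LOAD = 0.05           # light thread needs ~5% of a big core

# Option A: multiplex both threads on the big core.
power_a = BIG_CORE_POWER_W
big_util_a = HEAVY_LOAD + LIGHT_LOAD   # 1.05 -> roughly a 5% slowdown

# Option B: wake a little core for the light thread.
power_b = BIG_CORE_POWER_W + LITTLE_CORE_POWER_W
big_util_b = HEAVY_LOAD

print(f"A: {power_a:.2f} W, big-core demand {big_util_a:.2f}")
print(f"B: {power_b:.2f} W, big-core demand {big_util_b:.2f}")
# With these numbers, B spends ~10% more power to avoid a ~5% slowdown,
# which is the "roundoff error" point above.
```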
Saw a press release where the X1 was touted as advanced car tech. How it would recognize signs and other road objects.
But Nvidia isn't saying they're introducing a self-driving car platform, are they?
So nVidia says that Tegra X1 is using Cortex-A57 + A53 instead of Denver because this was faster and simpler to implement in 20nm. But Anandtech says that they have a completely custom physical implementation. In that case, how would "hardening" A57 and A53 - two CPUs they've never used before, especially the latter - be faster or simpler than using Denver, which they already have a custom implementation of and which would require a more straightforward shrink? The only way this makes sense is if A57 was ready long before Denver, but the release timescale of devices using each respective CPU makes this seem unlikely. So I'm skeptical that both of these claims are completely accurate.
Doing it their way instead of ARM's way has benefits:
ARM's way is HMP, not old style cluster migration as on the 5410/5420. I really doubt Nvidia's claims on any kind of benefit of their own CM.
That said, I think that NVIDIA is finally on to something. They seem to have a good implementation of a 4+4 A57/A53 setup (i.e. pretty much the best available IP at the moment) plus a big Maxwell GPU, which we know to be very efficient, plus support for fast FP16. There's no reason for TX1 to have power-efficiency issues anymore.
The problem is that I'm not sure there are many use cases where TX1 would be all that preferable to, say, a Snapdragon 810 with integrated LTE, or a cheaper Mediatek design.
Cache coherency, as per how we questioned and got a response from Nvidia, is handled such that a cluster migration is done without any DRAM intervention. Again, I fail to see how this could be more efficient than just migrating via ARM's CCI, even if it's just limited to cluster migration.

I think that goes without saying.. it had to be without DRAM intervention.. if not, it would be extremely power inefficient, wouldn't it? Btw, do we have any idea if there is a large L3 cache like Apple has?
I really think most of their power efficiency claims just come from the process advantage and probably better physical libraries compared to the 5433 (I have an article on that one with very extensive power measurements... the 5433 is a comparatively bad A57 implementation compared to what Samsung has now achieved with the A15s on the 5430; many would be surprised).
I think their interconnect just can't do HMP and this is PR spinning.
Qualcomm doesn't have a great reputation for CPU design. Arguments about their DVFS are unconvincing.
Apple CPUs, on the other hand, seem to constantly surprise - and AFAIK they don't use big.LITTLE or asynchronous DVFS.
1. They obviously didn't mind several times so far.
2. It's only a true octacore when global task scheduling works even for marketing droids.
3. Since Parker will most likely bounce back to Denver, how exactly will they explain the loss of 6 cores to the marketing droids?
Having an integrated modem/LTE is not the holy grail. I think people overestimate the importance of it. Look at Samsung and the international versions of their Galaxy S and Galaxy Note lines over the years. And comparing it to Mediatek is apples and oranges..they simply do not compete with this class of chip. AFAIK Mediatek do have an A57 design in the works but it is on 28nm. And Mediatek's graphics implementations are woefully underpowered.
When Nvidia announced the Tegra K1, the company pointed out that it was based on the same architecture as the company’s PC-based Kepler GPU. Going forward, all Nvidia’s mobile devices will be based on the same GPU core as those in the higher-performance PC parts.
As a result of the new process and the move to its Maxwell GPU core, Nvidia was able to double the performance from the prior-generation Tegra K1, delivering more than a teraflop while holding the TDP at the K1 level, and enlarging the die size only slightly.
The X1 has two streaming multiprocessor (SM) blocks, giving the chip 256 CUDA cores, 16 texture units, 16 ROP units, and twice the performance per watt of the K1. Since the GPU is based on the Maxwell architecture, it is compatible with all the popular mobile APIs such as DirectX 12, AEP, OpenGL 4.5, and OpenGL ES 3.1.
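For scale, the teraflop figure falls out of the core count above, assuming a roughly 1 GHz GPU clock and packed 2-wide FP16 FMAs; the clock value here is an assumption for the arithmetic, not a quoted spec:

```python
# Rough check of the "more than a teraflop" claim from the core count above.
# The ~1 GHz GPU clock is an assumption for the arithmetic, not a quoted spec.
cuda_cores = 256
gpu_clock_hz = 1.0e9      # assumed
flops_per_fma = 2         # a fused multiply-add counts as two FLOPs
fp16_per_lane = 2         # packed 2-wide FP16 per FP32 lane

fp32_gflops = cuda_cores * flops_per_fma * gpu_clock_hz / 1e9
fp16_gflops = fp32_gflops * fp16_per_lane
print(fp32_gflops, fp16_gflops)   # ~512 GFLOPS FP32, ~1024 GFLOPS FP16
```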
The company also emphasized the chip's ability to handle 4K video capture and display. It can decode up to 500 megapixels per second with its hardware H.265 codec (500 Mpixels/s is 4K at 60 frames/s), and stream 4K out via an HDMI 2.0 interface at 60 frames/s.
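The 500 Mpixel/s figure checks out against the 4K60 claim:

```python
# Sanity check of the 500 Mpixel/s decode figure against 4K at 60 frames/s.
width, height, fps = 3840, 2160, 60
print(width * height * fps / 1e6)   # ~497.7 Mpixel/s, i.e. the quoted ~500
```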
The Maxwell SM is partitioned into four processing blocks, each with its own dedicated resources for scheduling and instruction buffering. This new configuration along with scheduler and data-path changes saves power and delivers more performance per core. Also, the SM shared memory is now dedicated instead of being shared with the L1 cache.
The chip’s memory bus is 64 bits wide and can run LPDDR4 at 3,200 Mtransfers/s. It uses Maxwell’s lossless color compression that runs end-to-end. The company says that all the advanced architectural features available on desktop GeForce GTX 980 will be available on mobile Tegra X1 as well.
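The peak bandwidth that bus works out to is easy to compute from the stated width and transfer rate:

```python
# Peak theoretical bandwidth of a 64-bit LPDDR4 interface at 3200 MT/s.
bus_width_bits = 64
transfers_per_s = 3200e6
print(transfers_per_s * (bus_width_bits / 8) / 1e9)   # 25.6 GB/s
```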
It may not be the Holy Grail but it's Qualcomm's main competitive advantage, and they do dominate the smartphone market. They do pretty well in tablets too. They also have Krait and Adreno, but neither seems particularly better than Cortex or Mali/PowerVR respectively. The market performance of their Cortex-powered S810 should bring us more controlled information about the competitive value of their modems, unless Adreno 430 turns out to be spectacular.
As for Mediatek, I don't know what they're working on exactly for 2015, but I imagine they must have some sort of 4+4 A57/A53 setup with decent graphics. I'm not trying to say that they compete with NVIDIA on graphics performance.
Rather, I'm arguing that there's really not much you can do on a Tegra device that you can't do on a (cheaper) Mediatek one. Whatever it is that you can do on Tegra and not on an MT chip, I doubt it's enough for Tegra to be viable as a tablet product. Since (from what I've read) JHH spent far more time talking about cars than tablets when presenting Erista, he just might agree with me.
If I was NVIDIA, I would do the following:
- 16FF process
- 2 x Denver (2.5GHz+ with revisions)
- 4 x A53 (2GHz+ - optimised for speed)
- 4 x A53 (1.2GHz+ - optimised for power)
- All cores can be used at the same time...
What are the 2GHz A53s doing in there? Why not just 2x Denver + 4x A53?

It's similar to the Snapdragon 615 using A53 @ 1.7GHz + A53 @ 1.0GHz. There is quite a large power/area efficiency difference between a ~2GHz and a ~1GHz core.
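A rough illustration of why that gap exists: dynamic power scales roughly with C * V^2 * f, so a cluster synthesized for ~2GHz pays a voltage and capacitance penalty even when the speed isn't needed. The capacitance and voltage numbers below are invented for the example, not vendor data:

```python
# Illustration only: why a power-optimised ~1GHz A53 cluster can be much
# cheaper than a speed-optimised ~2GHz one. Dynamic power scales roughly
# with C * V^2 * f; these capacitance/voltage values are assumptions.
def dynamic_power(c_eff, voltage_v, freq_ghz):
    return c_eff * voltage_v ** 2 * freq_ghz   # arbitrary consistent units

fast_a53 = dynamic_power(c_eff=1.2, voltage_v=1.10, freq_ghz=2.0)  # speed-optimised
slow_a53 = dynamic_power(c_eff=1.0, voltage_v=0.80, freq_ghz=1.0)  # power-optimised
print(fast_a53 / slow_a53)   # ~4.5x more dynamic power for the fast cluster
```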
The chip in general does not behave like an ARM chip, although it presents the illusion of it to software.

And NVIDIA apparently can't do this even with a standard A57+A53 config; with Denver it becomes slightly harder still because of the different internal ISA (probably not too bad as long as Denver has similar IPC to the A53 on ARM ISA code pre-transcoding).
One possible advantage of simultaneous Denver+A53 is that you could keep more stuff on the A53s and reduce instruction cache pressure on the Denver.

Perhaps instruction footprint would be less of a problem if the ISA wasn't allegedly capable of emulating arbitrary architectures like x86. An ARM-specific emulation ISA with all the extra bells and whistles could probably find economies in areas of functionality that ARM may not need, but which x86 would. Having the expanded uop format in memory seems to imply safeguarding against decode costs that ARM is not noted for, as an example.
Last but not least: did I misread something, or did Jensen really claim that X1 has the performance of the Xbox One console?

Eh, it is not that far off. The CPU is probably as good or better, while the GPU is roughly half the performance (or much closer if using fp16). Bandwidth is the only real killer, but if one is willing to accept 720p instead of 1080p, I imagine Xbox One games could run just fine on it with minimal adjustment. It is certainly much better than the Wii U at any rate.
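To put rough numbers on the 720p-vs-1080p point: the TX1 figure is the 25.6 GB/s worked out above, and the Xbox One figure is the commonly quoted 68 GB/s for its main DDR3 pool (ignoring ESRAM), used here only for scale:

```python
# Rough numbers behind the 720p-vs-1080p argument.
print(1920 * 1080 / (1280 * 720))   # 2.25x fewer pixels at 720p
print(68.0 / 25.6)                  # ~2.7x more main-memory bandwidth on Xbox One
```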
And if you gave the chip the X1's heatsink and power supply?

That, and I think there will be few if any actual devices that will allow all four A57s to be run at peak clock simultaneously, at least for any meaningful length of time. Even the original Shield handheld, which had a fan, had a hard cap of 1.4GHz for the four A15 cores if they were all active.