NVIDIA Tegra Architecture

ARM apparently expects A53 to be used around 2GHz:
Cortex-A53-relative-performance-chart.png

http://www.arm.com/products/processors/cortex-a50/cortex-a53-processor.php

I'm not sure it's any less capable of reaching high clock than A9. Besides, I imagine T4i will often have to throttle down.
The graphic probably doesn't represent all devices on the same process. Also note that ARM provides conservative frequencies.
 
Again, there's the investment in silicon: five A9 cores take a lot more room than four A7s. I don't think it's worth it.

I think you're overstating the die area of Cortex-A9. Look at RK3188, it's 25mm^2 and the quad core A9 cluster takes maybe a quarter of that. An A7 equivalent might save another mm^2 or two but at this low end that's really not worth an awful lot.
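A quick back-of-the-envelope restating the numbers above (the 25mm^2 die size is the post's figure for RK3188; the cluster fraction and the A7 savings are its rough estimates, not measured values):

```python
# Die-area argument, using the post's own numbers.
RK3188_DIE_MM2 = 25
a9_cluster_mm2 = RK3188_DIE_MM2 / 4   # "maybe a quarter" of the die
a7_savings_mm2 = 2                    # upper end of "another mm^2 or two"

print(a9_cluster_mm2)                  # ~6.25 mm^2 for the quad-A9 cluster
print(a9_cluster_mm2 - a7_savings_mm2) # ~4.25 mm^2 for a quad-A7 equivalent
```

So the saving in question is on the order of a couple of mm^2 on a 25mm^2 die, which is why it's hard to call it decisive.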

Power consumption is a bigger deal but if they can make quad A15s work in phones, and to some extent they have, then this shouldn't be a huge barrier.

You're right that nVidia doesn't want to go totally low end, and given the current players with LTE integration that makes sense. It may be inevitable that they'll be overshadowed by Qualcomm but it doesn't help to be overshadowed by MediaTek too...
 
a) The graphic probably doesn't represent all devices on the same process.
b) Also note that ARM provides conservative frequencies.

a) That's my understanding as well, and it explains the 1.6GHz clock for A9 when we know it can go much higher on 28nm.

b) Indeed. ARM mentions 1.7GHz for A7, but MediaTek's MT6592 apparently runs at 2GHz. All in all, I imagine A53 should be able to match T4i's 2.3GHz clock without too much trouble. Even if it were to fall a bit short of that, it should still be a little faster and much more power-efficient.
 
Is it just me, or is there no antenna visible on either die shot?
Another question is the TDP: isn't 5 watts a bit high?

My thought: are those chips intended for tablets and tablets alone (plus cars)?
 
Is it just me, or is there no antenna visible on either die shot?
Another question is the TDP: isn't 5 watts a bit high?

nVidia is well known for painting over die shots. That said, I don't think any SoC has an integrated antenna, or even an RF transceiver. I think you're thinking of an integrated baseband, which is mostly logic with some mixed-signal interface stuff.

And no, 5W isn't high if we're talking full tilt for the whole SoC or even just the CPU cores. Several mobile SoCs today can hit around that point and are allowed to maintain it for at least some duration of time. For tablets that may even be okay. Qualcomm for example has stated that Snapdragon 800 is meant to consume 5W in tablet environments, IIRC.

Let's say we really mean just those two CPU cores at 5W, or 2.5W each. For such a big, bad, wide CPU at 2.5GHz, on let's say TSMC 20nm, I would be surprised to see such a low power consumption. On that process note, Anandtech says that nVidia got silicon back and that this therefore means the Denver K1 must be 28nm. I call bunk; nVidia could have gotten back 20nm silicon for engineering samples by now.

Here's a copy + paste of the post I made on Anandtech's forums, apologies for cross-posting if that bugs anyone :p

"Very surprised to hear Denver is hitting an SoC before Parker. This is a pretty aggressive move for nVidia, at least suggesting a six-month cadence with genuinely different SoCs that both target the high end, with the first one rather laughably following close on the heels of Tegra 4i.

These "K1v2" figures do seem... weird. 7-way superscalar? I can't think of a single even remotely general-purpose CPU that's so wide at the decode level, not even IBM's POWER processors. It might be technically feasible, especially if they can only handle that throughput in AArch64 mode. But the cost of being able to actually rename the operands of 7 instructions is high, the odds of finding enough parallelism to come even close to using that decode bandwidth even a small percentage of the time are slim, and they'd need a backend much wider than 7 ports to facilitate all that execution, which means a lot in terms of register file ports, forwarding network complexity, and so on. I also don't think they'd get terribly far without quite a bit of L1 cache parallelism, which isn't cheap. I could possibly see them sort of reaching this width if it involves SMT, but even then it seems pretty overboard. Maybe not if it's not full SMT and there are limits to what a single thread can utilize.

What I'm suspicious of is that they're counting the width of A15 and Denver at different parts of the pipeline. It makes sense to have 7 execution ports/pipelines; in that case, A15 has 8. I've seen some (scarce, unfortunately) mention that A57 is consolidating the number of ports, which is almost certainly a power consumption optimization. So 7 seems like more than enough even for a pretty high-end, aggressive design - after all, Ivy Bridge only had 6 (while Haswell extended it to 8).

I know Cyclone was purported to be capable of decoding 6 instructions per cycle (and sustaining 6 IPC execution) but until I see the exact methodology of this test I'm skeptical of it as well.

One other consideration is that some of that number may be accounted for by instruction fusion. This could include x86-style branch fusion but possibly other classes of instructions as well, although none immediately spring to mind.

The 128KB L1 (presumably instruction cache) figure is also far out there. The only place I recall seeing such a large L1 instruction cache was Itanium, where the VLIW-ish nature of the instructions led to some relatively low density. A possible consideration here is that some of the frontend, including the L1 icache, is shared between the two cores, Bulldozer-style. Would be interesting, to say the least, although even Steamroller still doesn't hit such a big shared L1 icache. I hope they're not actually storing decoded instructions in some wider format; that seems like it'd be pretty wasteful even as a strategy to support AArch32 + AArch64.

With such big caches and such wide execution at such a (relatively) high clock speed, we could be looking at some long pipeline lengths and long L1 latencies, coupled with some really deep OoOE buffering to try to keep up with it. We could be looking at some relatively gigantic cores, which is more or less what you'd expect with nVidia only offering two of them. Unlike Apple, they have the most invested in quad core, since they offered the first mobile quad-core SoC with Tegra 3 and defended it pretty aggressively. I don't think they'd be going dual core here unless the cost difference was huge; I think they'd go quad core even if it meant they could only run all four at a greatly reduced frequency (which Tegra 4 basically does anyway).

Also, no mention of a companion core for Denver, and I don't think we'll see one. Pairing an A7 cluster with it would be very interesting, and would mean quite a particular design investment on nVidia's part, which I don't see happening. But who knows, we didn't learn about Tegra 3's companion core until pretty late in the game.

So two things nVidia has to eat some crow on.. which I doubt we'll see much actual discourse on, but that'd be pretty fun...

Two final thoughts: I wonder if the Denver part is legitimately meant to replace the A15 part, or if the former is going to target phones while the latter targets tablets or even beyond that. If that's the case then it's possible nVidia will continue to license ARM cores for some parts, and this isn't just a time-to-market feasibility thing. Lastly, I noticed that nVidia had quite good documentation in its anti-Qualcomm propaganda.. er.. technical white papers, where it went into a fair amount of detail on how A15 operated. Now that they're using their own core, I sure hope we get something even more thorough. A great "take that" to Qualcomm, who says nary a thing about their tech. Fingers crossed."
 
Is it just me, or is there no antenna visible on either die shot?
Another question is the TDP: isn't 5 watts a bit high?

My thought: are those chips intended for tablets and tablets alone (plus cars)?
The 5 watts relates to the perf numbers in the table (365 GFLOPS); the GPU would have to run at ~1GHz to achieve those FLOPS. It's definitely not for smartphones or even tablets, but rather for cars and Chromebooks. However, considering last summer's demo with 1 watt of GPU power consumption, it should be scalable down to phones; that's why you can observe a PoP package here - http://www.nvidia.com/object/tegra-k1-processor.html
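A quick sanity check on the ~1GHz inference, using the standard peak-FLOPS formula (the 192-core Kepler SMX count is NVIDIA's published spec for Tegra K1; the 365 GFLOPS figure is from the table being discussed):

```python
# Peak FLOPS = cores * 2 FLOPs/cycle (one FMA) * clock.
# Solve for the clock needed to hit the quoted 365 GFLOPS.
CUDA_CORES = 192        # one Kepler SMX
FLOPS_PER_CORE = 2      # a fused multiply-add counts as 2 FLOPs
TARGET_GFLOPS = 365

clock_ghz = TARGET_GFLOPS / (CUDA_CORES * FLOPS_PER_CORE)
print(f"required GPU clock: {clock_ghz:.2f} GHz")  # ~0.95 GHz
```

So the quoted peak indeed requires the GPU to run at roughly 950MHz, consistent with the "~1GHz" reading above.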
 
[…]
Two final thoughts: I wonder if the Denver part is legitimately meant to replace the A15 part, or if the former is going to target phones while the latter targets tablets or even beyond that. If that's the case then it's possible nVidia will continue to license ARM cores for some parts, and this isn't just a time-to-market feasibility thing. Lastly, I noticed that nVidia had quite good documentation in its anti-Qualcomm propaganda.. er.. technical white papers, where it went into a fair amount of detail on how A15 operated. Now that they're using their own core, I sure hope we get something even more thorough. A great "take that" to Qualcomm, who says nary a thing about their tech. Fingers crossed."

I think you're right about this, at least if Fudo is basing the following assertions on JHH's statements and not just a hunch:
The second variant will be the A15-based 32-bit version. The 32-bit version is based around four A15 cores. The Denver variant turns out to be aimed at the supercomputing market and will ship in second part of 2014.
http://www.fudzilla.com/home/item/33565-project-denver-becomes-real

In any case, it stands to reason that NVIDIA would have trouble targeting supercomputer and high-performance machines (as they've hinted at before) and cellphones with the same design. Then again, without LTE support, you could argue that the A15-based Tegra K1 is more of a tablet design anyway.
 
The Denver version looks interesting.
I'm glad that they didn't go for a ridiculous octo-core chip.
Also this:
I refuse to believe this is a 28nm chip.

That slide is typical NVIDIA spin. The footnote says it's at peak operation; mobile chips hardly ever run at peak, and under typical usage it will probably throttle the clock a lot. They're also comparing a full system to a single chip.
 
Wynix said:
I refuse to believe this is a 28nm chip.

The Tegra K1 variant with quad core Cortex A15 has already been confirmed by NVIDIA to be using a 28nm fabrication process node (fab. process details on the variant with Denver are unknown, but since the GPU is unchanged and there is pin compatibility too, I would suspect 28nm process node for that too).

The variant with Denver is really the antithesis of what NVIDIA has promoted for the last two years! Maybe the battery saver core and quad-core didn't help in practice as much as they thought it would. Considering how much success Apple has had with improving dual relatively large cores, it may be a good decision moving forward. That said, they will surely have quad-core Denver variants in the future (perhaps starting with Parker?)
 
http://www.anandtech.com/show/7622/nvidia-tegra-k1

QBUsabH.png


Tegra K1 ships with a newer revision of the Cortex A15 (r3p3) than what was in Tegra 4 (r2p1). ARM continuously updates its processor IP, with each revision bringing bug fixes and sometimes performance improvements. In the case of Tegra K1’s A15s, the main improvements here have to do with increasing power efficiency. With r3p0 (which r3p3 inherits) ARM added more fine grained clock gating, which should directly impact power efficiency.

The combination of the newer Cortex A15 revision and the move to 28nm HPM give Tegra K1 better performance at the same power consumption or lower power consumption at the same performance level. The reality tends to be that mobile OEMs will pursue max performance and not optimize for a good performance/power balance, but it’s at least possible to do better with Tegra K1 than with Tegra 4.
 
I think you're right about this, at least if Fudo is basing the following assertions on JHH's statements and not just a hunch:

http://www.fudzilla.com/home/item/33565-project-denver-becomes-real

In any case, it stands to reason that NVIDIA would have trouble targeting supercomputer and high-performance machines (as they've hinted at before) and cellphones with the same design. Then again, without LTE support, you could argue that the A15-based Tegra K1 is more of a tablet design anyway.

I think he's wrong. Pin-compatible usually means it can be a drop-in replacement, meaning it has all of the same external peripheral support. The only place Denver makes sense in HPC is as something placed alongside a large GPU. This chip doesn't work for that, a lot of the functionality makes no real sense. Including the embedded Kepler SMX.

Pin-compatible doesn't strictly mean it can't be 20nm, but that does seem like it'd be tricky.
 
I was expecting a ~500MHz GK107, but instead they put half a GK107 at ~1GHz.
It must be the highest-clocked iGPU in an ARM SoC ever.
Regardless, we're looking at around twice the GPU performance of Tegra 4, which may become really hard to attain for any other SoC vendor during 2014 (assuming they're all pushing 28nm to the limit already).


That and Tegra K1 undoubtedly passes the threshold for better performance than last-gen consoles. It marks the beginning of a new era :)



I think he's wrong. Pin-compatible usually means it can be a drop-in replacement, meaning it has all of the same external peripheral support. The only place Denver makes sense in HPC is as something placed alongside a large GPU. This chip doesn't work for that, a lot of the functionality makes no real sense. Including the embedded Kepler SMX.

Pin-compatible doesn't strictly mean it can't be 20nm, but that does seem like it'd be tricky.

How about a server rack with 40 Denver Tegra K1s?
That would be 14.6 SP TFLOPS for 200W (+ ~50W for memory, and the rest?), plus using the 80 Denver cores for full precision.
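The rack arithmetic works out like this, taking the post's own assumptions at face value (365 SP GFLOPS and 5W per Tegra K1 are the figures under debate in this thread, not confirmed numbers):

```python
# Hypothetical 40-chip rack, using the post's per-chip assumptions.
CHIPS = 40
GFLOPS_PER_CHIP = 365   # quoted SP peak per Tegra K1
WATTS_PER_CHIP = 5      # the contested 5W figure

total_tflops = CHIPS * GFLOPS_PER_CHIP / 1000          # 14.6 SP TFLOPS
total_watts = CHIPS * WATTS_PER_CHIP                   # 200 W (SoCs only)
gflops_per_watt = CHIPS * GFLOPS_PER_CHIP / total_watts  # 73 GFLOPS/W
print(total_tflops, total_watts, gflops_per_watt)
```

Note the implied 73 GFLOPS/W is several times what contemporary big discrete GPUs deliver, which is exactly why the 5W assumption draws skepticism in the replies.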
 
Surprised that they managed to pack so much into the SoC on a 28nm process. With uarch improvements and a move to 20nm, could we see over 700 GFLOPS from the 2015 Tegra? Adding TSV-stacked memory and LPDDR4 would dramatically increase memory bandwidth. I'd say by 10nm, or 2017-ish, Tegra could easily match the next-gen consoles GPU-wise, and destroy them in terms of CPU.

Now if the Mantle movement can inspire some form of low-level API for Android, that would be awesome. I wonder if G-Sync will show up on a Shield or Tegra Note in the next couple of years.
 
Parker will be built on a FinFET process according to the NVIDIA Tegra roadmap, which implies TSMC 16nm since 20nm is still planar. Can't wait for a Maxwell GPU with unified virtual memory support inside Parker, too.

MrlzHxf.png
 
How about a server rack with 40 Denver Tegra K1s?
That would be 14.6 SP TFLOPS for 200W (+ ~50W for memory, and the rest?), plus using the 80 Denver cores for full precision.

Sounds like a terrible configuration. This SoC probably doesn't even have an external cache coherent interconnect, and giving each of these cores a separate memory pool is not efficient in several metrics. Not sure where FLOPs and GPU performance even enter the picture in most actual server work, as opposed to HPC. You'd also still be wasting a lot of other on-chip peripherals.

Server offerings have been about maximizing cores per die and per chip and tying it with a big unified memory controller. A credible Denver-based server chip would have at least 8 CPU cores and maybe a fair amount of L3 cache, with only a

Sure, on paper, given the assumption of 5W power consumption, this sounds like a great HPC idea. But it makes no sense. How could this configuration possibly offer so much higher perf/W than a high-end GK104- or GK110-based configuration? You're talking 3x better, by using a sea of smallish chips. That's not how things work, unless nVidia has done something horribly wrong with their big discrete GPUs. I really, really doubt that nVidia seriously means 5W with the GPU running anywhere close to full tilt.
 
K1 looks good. Let's see the independent benches now, and what kind of design wins they get this round...

But I'm more impressed by the K1 VCM dedicated to the car market. And NV is part of the Open Automotive Alliance, with Google, Audi, GM, Honda, and Hyundai:
http://www.openautoalliance.net/#about
htc67.jpg


They seem to be very very serious about this market. They claim 4.5M units already sold, with 20+ brands / 100+ models in the pipe.
 