Consider these as possibilities, not facts. If it is a GP106 (?): 8 * 128 * 2 * 1.22 GHz = 2.499 TFLOPS (FP32 this time). Or is it "just" a GP108 or similar and they're quoting all figures in FP16? That sounds unlikely, since the chips shown would be huge for something as humble as a GP108 with Pascal cores.
It seems unlikely to be a GP106 given the TFLOPS: the GP106 hits 3.8 TFLOPS at base clocks using just 61W, so it sounds like it would need to be something like a GP107/GP108 to match the figures quoted for the Drive PX 2.
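As a quick sanity check on those numbers, theoretical peak FP32 is just CUDA cores * 2 FLOPs per clock (FMA) * clock. A minimal sketch; the candidate configurations below are my own guesses for illustration, not anything Nvidia has confirmed for the Drive PX 2:

```python
def peak_tflops(cuda_cores, clock_ghz, flops_per_core_per_clock=2):
    """Theoretical peak = cores * FLOPs per core per clock (2 for FMA) * clock in GHz."""
    return cuda_cores * flops_per_core_per_clock * clock_ghz / 1000.0

# Candidate configurations (assumptions for illustration only)
print(peak_tflops(8 * 128, 1.22))  # ~2.50 TFLOPS FP32 - the small-chip scenario above
print(peak_tflops(1280, 1.50))     # ~3.84 TFLOPS FP32 - roughly a GP106 at base clock
```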
Now, how they are working out the performance is a dog's mess, made worse by an update computerbase.de received.
The Nvidia slides show 8 TFLOPS and 24 DL TOPS as a summary, but to reach those numbers they must have combined FP16 and FP32 TFLOPS, because 8 TFLOPS of FP32 would give 32 DL TOPS (assuming this is the same form of INT8 DL throughput found in the Pascal Titan X - there it is called TOPS INT8, and 11 TFLOPS FP32 gives 44 TOPS INT8, i.e. 4x).
The Nvidia website shows Dual Tegra X2 providing 2.5 TFLOPS, and the dual discrete GPUs providing 5 TFLOPS.
So initially it seemed to me that, to get to around that 24 DL TOPS figure, the discrete GPUs are 5 TFLOPS FP32 (giving 20 DL TOPS at 4x) and the Tegra 'X2's are 2.5 TFLOPS FP16 (giving 5 DL TOPS at 2x).
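To make that accounting explicit, here is the split worked through; the 4x and 2x INT8 multipliers are my assumption, mirroring how Pascal quotes TOPS INT8 relative to FP32:

```python
# Speculative DL TOPS accounting for Drive PX 2, using Pascal-style INT8 ratios
dgpu_fp32_tflops  = 5.0   # dual discrete GPUs, quoted as 5 TFLOPS (assumed FP32)
tegra_fp16_tflops = 2.5   # dual Tegra 'X2', quoted as 2.5 TFLOPS (assumed FP16)

dgpu_tops  = dgpu_fp32_tflops * 4   # INT8 at 4x the FP32 rate -> 20 DL TOPS
tegra_tops = tegra_fp16_tflops * 2  # INT8 at 2x the FP16 rate -> 5 DL TOPS

print(dgpu_tops + tegra_tops)                 # 25, roughly the "24 DL TOPS" on the slide
print(dgpu_fp32_tflops + tegra_fp16_tflops)   # 7.5, roughly the "8 TFLOPS" summary
```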
However, the update from Computerbase.de says (allowing for a crude Google translation):
https://www.computerbase.de/2016-08/nvidia-tegra-parker-denver-2-arm-pascal-16-nm/
In the Q&A after the presentation, Nvidia remained very tight-lipped about performance, but did offer that graphics performance is expected to increase by at least 50 percent compared to the Tegra TX1 and to sit at around 1.5 TFLOPS (FP16). Furthermore, Nvidia answered the question of further fields of application for the new Tegra chip with regard to VR/AR with a short "Yes".
And that is even more of a head-scratcher now, because the dual X1 already has an FP16 figure of 2.3 TFLOPS, so the 'X2's cannot be at 2.5 TFLOPS FP32, yet Nvidia is saying it should be 50% faster than the previous gen.
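Putting rough numbers on that contradiction (treating the update's 1.5 TFLOPS FP16 as a per-chip figure, which is my reading rather than something Nvidia spelled out):

```python
dual_x1_fp16   = 2.3   # Nvidia's FP16 figure for the dual Maxwell X1 setup
dual_x2_quoted = 2.5   # Nvidia website figure for dual Tegra 'X2' in Drive PX 2
x2_fp16_update = 1.5   # Computerbase update, FP16 (assumed to be per chip)

# Read as FP16, the 2.5 TFLOPS is barely ahead of the dual X1, nowhere near +50%:
print(f"{dual_x2_quoted / dual_x1_fp16:.2f}x over dual X1")   # ~1.09x
# The update's per-chip figure would instead put a dual 'X2' at about 3 TFLOPS FP16:
print(f"{2 * x2_fp16_update:.1f} TFLOPS FP16 for two chips")  # 3.0
```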
None of the figures add up apart from the summary ones, and then only by splitting the performance between FP32 for the discrete GPUs and FP16 for the 'X2's; but that split then goes against the update.
TBH I cannot see how Nvidia can manage a 50% graphics performance increase with the same 256 CUDA cores as the Maxwell X1, and it makes the numbers they provided for the X2 even more of a fudge. With products already in the real world, the Pascal discrete GPUs manage their performance boost over the Maxwell equivalents by having more cores (15% to 25%) and, importantly, a massive clock speed increase (around 50% higher clocks), neither of which appears applicable here.
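For what it's worth, here is how those two levers multiply out on paper; a rough sketch using the ratios mentioned above, ignoring bandwidth and architectural changes:

```python
def throughput_gain(core_ratio, clock_ratio):
    """Theoretical gain from scaling core count and clock speed together."""
    return core_ratio * clock_ratio

# Desktop Pascal vs Maxwell, using roughly 20% more cores and ~50% higher clocks:
print(throughput_gain(1.20, 1.50))  # ~1.8x on paper
# Tegra 'X2' vs X1 with the same 256 CUDA cores: a 50% gain would have to come from clocks alone
print(throughput_gain(1.00, 1.50))  # 1.5x only if the clock rises ~50%, a big ask in a mobile SoC
```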
Cheers