NVIDIA Tegra Architecture

Since the graph is just GFLOPS per watt normalized to Tegra 2, we cannot actually determine the true performance or the true power consumption from it. That said, we can get an estimate of the GFLOPS per watt improvement (in the form of a ratio) of Erista vs. Logan vs. older generations of Tegra.
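A minimal sketch of how such a normalized chart turns into ratios, using purely made-up placeholder values rather than the slide's actual data:

```python
# Purely illustrative placeholder values, NOT the slide's actual data: the chart is
# normalized so Tegra 2 = 1.0, so relative improvements read off directly as ratios.
perf_per_watt_normalized = {"Tegra 2": 1.0, "Tegra K1 (Logan)": 4.0, "Erista": 8.0}

ratio = perf_per_watt_normalized["Erista"] / perf_per_watt_normalized["Tegra K1 (Logan)"]
print(f"Erista vs. K1 perf/W: {ratio:.1f}x")  # ratio only; absolute GFLOPS or watts stay unknown
```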

Exactly; however, it could still be that Erista doesn't have all that many raw GFLOPS after all, yet still trounces all the SoCs it succeeds by varying degrees, exactly as the slide states.
 
Wouldn't two SMMs at lower clocks be more power efficient than one higher clocked SMM? They wouldn't really be die constrained as 20SoC would allow much higher density.

That aside, the performance would not be good enough IMHO... they would be targeting at least a 50% improvement from one generation to the next.
^^ This

If (and it's a big IF) I were Nvidia, I would go with 2 SMMs and drop the GPU frequency a bit (to around 750MHz) to offer a good perf increase (~50%) while keeping TDP lower than TK1's.
2 SMMs at 750MHz would still have no competition in GPU benchmarks throughout 2015 and beyond...
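A back-of-the-envelope sketch of why 750MHz could still land in the region of that ~50%, assuming one Maxwell SMM carries 128 FP32 ALUs, GK20A in TK1 has 192 ALUs at roughly 950MHz peak, and factoring in the ~35% per-ALU efficiency uplift claimed for GM107 (discussed further down); the clock and efficiency figures are assumptions, not confirmed Erista specs:

```python
# Rough FP32 throughput comparison (all figures are assumptions for illustration).
def peak_gflops(alus, mhz):
    return alus * mhz * 2 / 1000.0   # 2 FLOPs per ALU per clock (FMA)

tk1 = peak_gflops(192, 950)          # GK20A in TK1, ~365 GFLOPS
erista = peak_gflops(2 * 128, 750)   # 2 SMMs at 750MHz, ~384 GFLOPS raw
effective = erista * 1.35            # apply the ~35% claimed Maxwell per-ALU efficiency gain

print(f"TK1 ~{tk1:.0f} GFLOPS, 2 SMMs @ 750MHz ~{erista:.0f} GFLOPS raw, "
      f"~{effective:.0f} effective ({effective / tk1 - 1:+.0%} vs. TK1)")
```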
 
Unless Maxwell ULP has "only" 1 quad TMU per SMM: if they go for 2 SMMs, the typical 16 TMUs aren't exactly small either.
Well, GX6650 has 12 TMUs, doesn't it... so 16 TMUs for a 2015 part doesn't sound like overkill. Besides... if they're targeting high-end, high-res tablets, it wouldn't hurt.

And what is the area cost of TMUs these days? GM204 has 128 of them, so it can't be too much, I would think. Compared to the TMU area on the 28nm TK1, the percentage of die size would probably remain the same on 20SoC anyway.
If NV keeps the same strategy as with K1, then yes, they probably used 2 clusters; if, however, they're aiming for a rather conservative overall performance increase but a quite high perf/W difference, 1 cluster could also be a viable scenario.
I disagree. If they are aiming for higher perf/W... a 2-cluster setup makes more sense, as two lower-clocked clusters could consume less power than one higher-clocked one.
FWIW I don't share the view of some here that it will be a >2x perf/W improvement based on some extrapolated desktop numbers, not least because I suspect not all the power savings came from architecture changes but rather from some lower-level tweaks, which GK20A most likely already has anyway (even GK208 was quite a bit more power-efficient already).
But it's going to be difficult to get much of a performance increase at all with just 1 SMM (at least for the larger form factors).

True... and the graph shows an increase of "only" ~1.6x anyway. I think the increased L2 cache on GK208 played a part, and Maxwell has built upon that. Erista would probably go to 2MB of L2 as well.
GM2x0 cores have 40% higher ALU efficiency than Kepler; all the slides for GM107 claimed a 35% improvement. Yes, it's hairsplitting, but I think it's likelier they stay at the first Maxwell generation's level.
Why? It taped out months after the first GM2x0 part... why would they go with first gen?
Hopefully that, yes; and it will obviously benefit the GPU (amongst all other SoC units) regardless of the number of clusters it ends up having. Again, the 2 SMMs are very likely; however, I'm not sure if they have something completely different in mind this time.
Yep... LPDDR4 memory and 2MB of L2 seem likely.
^^ This

If (and it's a big IF) I were Nvidia, I would go with 2 SMMs and drop the GPU frequency a bit (to around 750MHz) to offer a good perf increase (~50%) while keeping TDP lower than TK1's.
2 SMMs at 750MHz would still have no competition in GPU benchmarks throughout 2015 and beyond...

Exactly... seems more plausible than ~1GHz clocks on 20SoC anyway.
 
Well, GX6650 has 12 TMUs, doesn't it... so 16 TMUs for a 2015 part doesn't sound like overkill. Besides... if they're targeting high-end, high-res tablets, it wouldn't hurt.

Good point; but unless Apple truly has an A8X planned for the rumored >12" tablet, I don't see it in consumer products all that soon.

And what is the area cost of TMUs these days? GM204 has 128 of them, so it can't be too much, I would think. Compared to the TMU area on the 28nm TK1, the percentage of die size would probably remain the same on 20SoC anyway.

Look at the Apple A8 die shot; there's one where they've marked the quad TMUs shared between two clusters at a time. It's my understanding that those are 16k*16k TMUs, with the difference that Apple's drivers support only 4k*4k textures at most. In any case, given that the A8 GPU is estimated by Anandtech to be 19.1mm2 on 20SoC, it shouldn't be too hard to extrapolate from the die shot roughly how much 2 quad TMUs cost in hw. They're definitely not small.

I disagree. If they are aiming for higher perf/W... a 2-cluster setup makes more sense, as two lower-clocked clusters could consume less power than one higher-clocked one.

Agreed and good point.
Why? It taped out months after the first GM2x0 part... why would they go with first gen?
Yep... LPDDR4 memory and 2MB of L2 seem likely.

Do they really need >DX11.0 already for the ULP SoC space? I can of course see the possible funky "oh look, it's DX12" marketing trip, but that doesn't come for free in hw either. Through dumb layman's math I was estimating GM204 to be around 370mm2; what possibly threw me off was the added >DX11.0 feature budget.

Exactly... seems more plausible than ~1GHz clocks on 20SoC anyway.

Whereby 750MHz already sounds aggressive for consumer products, but yes, two clusters at a much more reasonable peak frequency than GK20A's makes sense.
 
Do they really need >DX11.0 already for the ULP SoC space? I can of course see the possible funky "oh look, it's DX12" marketing trip, but that doesn't come for free in hw either. Through dumb layman's math I was estimating GM204 to be around 370mm2; what possibly threw me off was the added >DX11.0 feature budget.

Kepler Tegra already supports DX12, whether we care about it or not; of course with a lowish feature level, if that's what you're referring to (and that needs DX12 included in Windows 10 for ARM).

To compare with desktop:
Perhaps low-end desktop GPUs have always had too many features, but they're copy-pasted from the higher-end ones, and the general public appreciated cards compatible with everything even if slow (now Intel GPUs can run everything).

Khronos announced "OpenGL Next", which will replace both OpenGL and OpenGL ES. The existence of a separate API was an anomaly, even if practical. OpenGL Next is also said to be a clean break (getting rid of all the cruft from the '90s) and will be a DX12 competitor, so to speak.
 
Kepler Tegra already supports DX12, whether we care about it or not; of course with a lowish feature level, if that's what you're referring to (and that needs DX12 included in Windows 10 for ARM).

Yep; GM20x goes beyond DX11.0 feature level.

To compare with desktop:
Perhaps low-end desktop GPUs have always had too many features, but they're copy-pasted from the higher-end ones, and the general public appreciated cards compatible with everything even if slow (now Intel GPUs can run everything).

Khronos announced "OpenGL Next", which will replace both OpenGL and OpenGL ES. The existence of a separate API was an anomaly, even if practical. OpenGL Next is also said to be a clean break (getting rid of all the cruft from the '90s) and will be a DX12 competitor, so to speak.

The question is whether GL Next will require as many additional functionalities as DX12 in the end.
 
Good point; but unless Apple truly has an A8X planned for the rumored >12" tablet, I don't see it in consumer products all that soon.

Well, we will find out soon enough... there is an iPad event scheduled for the 16th of October, apparently. Quad-core CPU + GX6650, anyone? :LOL:

You also have to remember that Erista will be around for some time. Depending on how the 16/14nm ramp goes, it may well have to hold down the fort till Q1 or Q2'16, so higher performance wouldn't be a bad thing. And it gives them flexibility to play around with power/clocks.
Look at the Apple A8 die shot; there's one where they've marked the quad TMUs shared between two clusters at a time. It's my understanding that those are 16k*16k TMUs, with the difference that Apple's drivers support only 4k*4k textures at most. In any case, given that the A8 GPU is estimated by Anandtech to be 19.1mm2 on 20SoC, it shouldn't be too hard to extrapolate from the die shot roughly how much 2 quad TMUs cost in hw. They're definitely not small.
I don't have Photoshop, so I did a very rough analysis. At most, each quad TMU seems to be about 1/10th of the size of the GPU, i.e. ~1.9 mm2. So at most, two quad TMUs would be ~3.8 mm2. I agree it's not chump change, but in a ~100 mm2 SoC, it's not too much either, wouldn't you say?
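The same rough arithmetic spelled out; the 1/10th fraction and the ~100mm2 SoC size are eyeballed assumptions, with Anandtech's 19.1mm2 GPU estimate taken from the quote above:

```python
# Back-of-envelope TMU area estimate (all inputs are rough guesses from the discussion).
a8_gpu_mm2 = 19.1            # Anandtech's estimate for the A8 GPU block on 20SoC
quad_tmu_fraction = 1 / 10   # eyeballed share of the GPU block per quad TMU
soc_mm2 = 100                # assumed ballpark SoC size

quad_tmu_mm2 = a8_gpu_mm2 * quad_tmu_fraction   # ~1.9 mm2
two_quad_tmus_mm2 = 2 * quad_tmu_mm2            # ~3.8 mm2
print(f"Two quad TMUs ~{two_quad_tmus_mm2:.1f} mm2, "
      f"~{two_quad_tmus_mm2 / soc_mm2:.1%} of a {soc_mm2} mm2 SoC")
```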
Do they really need >DX11.0 already for the ULP SoC space? I can of course see the possible funky "oh look, it's DX12" marketing trip, but that doesn't come for free in hw either. Through dumb layman's math I was estimating GM204 to be around 370mm2; what possibly threw me off was the added >DX11.0 feature budget.
Well, as I have alluded to earlier, marketing is a big factor. Even if we take ~10% as the extra die area required for DX12 support, and take 30mm2 as the area of the GPU block (both of which are high estimates IMHO), the extra die area required is just ~3 mm2. Secondly, the improved efficiency of 2nd-gen Maxwell alone could have been enough of a selling point. Another point to consider is that if it does indeed have DX12 support, it could be a good launch vehicle for Windows 10-based tablets (or possibly phones?). And as I've already stated above, Erista could be around for a while, and towards the middle or end of its life cycle other IHVs could release a SoC with a DX12-compliant GPU.
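Spelling out that die-area math, with both inputs being the deliberately high assumptions from the paragraph above:

```python
# Worked version of the estimate above; both inputs are assumed upper bounds.
gpu_block_mm2 = 30.0     # assumed GPU block area
dx12_overhead = 0.10     # assumed extra area fraction for >DX11.0 features
extra_mm2 = gpu_block_mm2 * dx12_overhead
print(f"Extra area for DX12-level features: ~{extra_mm2:.0f} mm2 out of a ~100 mm2 SoC")
```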

All this is my own personal speculation, mind you... gotta love being an armchair CEO, right? :LOL:
 
Well, we will find out soon enough... there is an iPad event scheduled for the 16th of October, apparently. Quad-core CPU + GX6650, anyone? :LOL:

I'd rather bet on a dual core at a rounded-up frequency.

You also have to remember that Erista will be around for some time. Depending on how the 16/14nm ramp goes, it may well have to hold down the fort till Q1 or Q2'16, so higher performance wouldn't be a bad thing. And it gives them flexibility to play around with power/clocks.
Unless I'm missing something, it doesn't sound like much more than a year from SoC to SoC either.

I don't have Photoshop, so I did a very rough analysis. At most, each quad TMU seems to be about 1/10th of the size of the GPU, i.e. ~1.9 mm2. So at most, two quad TMUs would be ~3.8 mm2. I agree it's not chump change, but in a ~100 mm2 SoC, it's not too much either, wouldn't you say?
I'm awful with die shots, but luckily the guys at Anandtech marked the TMU blocks, so I estimated roughly 2mm2 for each quad TMU. I don't know how big they're planning Erista to be. Historically, after Tegra 2 they almost religiously stayed at around 80mm2, and K1 broke every one of their SoC area records so far. Back to "base", or does it simply not matter?

Besides, 3.5-4mm2 isn't exactly small for the ULP SoC world; given the A8 die area and its likely transistor density, those 2 extra quad TMUs could be well over 80 million transistors.
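A quick sanity check on that transistor figure, using Apple's stated ~2 billion transistors for the A8 and the commonly reported ~89mm2 die; applying the resulting average density to the TMU blocks is of course a simplification:

```python
# Sanity check: average A8 transistor density applied to the estimated TMU area.
a8_transistors = 2.0e9   # Apple's stated figure for the A8
a8_die_mm2 = 89.0        # commonly reported A8 die size on 20SoC
tmu_area_mm2 = 3.8       # two quad TMUs, from the estimate above

density_per_mm2 = a8_transistors / a8_die_mm2      # ~22.5M transistors/mm2
tmu_transistors = density_per_mm2 * tmu_area_mm2   # ~85M transistors
print(f"~{density_per_mm2 / 1e6:.1f}M transistors/mm2 -> "
      f"~{tmu_transistors / 1e6:.0f}M transistors in two quad TMUs")
```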

Well, as I have alluded to earlier, marketing is a big factor. Even if we take ~10% as the extra die area required for DX12 support, and take 30mm2 as the area of the GPU block (both of which are high estimates IMHO), the extra die area required is just ~3 mm2. Secondly, the improved efficiency of 2nd-gen Maxwell alone could have been enough of a selling point. Another point to consider is that if it does indeed have DX12 support, it could be a good launch vehicle for Windows 10-based tablets (or possibly phones?). And as I've already stated above, Erista could be around for a while, and towards the middle or end of its life cycle other IHVs could release a SoC with a DX12-compliant GPU.
10% here and 10% there, and before you know it you're way over budget in the end. As for DX12 in smartphones, it'll take eons before you see anything even OGL_ES3.x-related, let alone anything close to DX11. Tablets are in a better power realm, but they're still far from those functionalities being anything more than mostly decorative.

I've nothing against progress, au contraire, but under such extreme power constraints I'd rather have transistors invested in higher efficiency. Besides, Microsoft's penetration in the ULP market is still extremely small.

All this is my own personal speculation, mind you... gotta love being an armchair CEO, right? :LOL:

Do you see many around here who aren't speculating about these things? :p
 
It's possible that the A8 will remain, and with the larger form factor, the CPU will be clocked up a little, the GPU clocked up more.

Is there a consensus guesstimate of the A8 GPU clock in the iPhone? If it's around 450MHz, and the A8 has been designed for 600MHz in a larger form factor, that provides a theoretical 33% improvement, assuming there aren't major bandwidth limitations.
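The ratio behind that 33% figure, with both the 450MHz and 600MHz clocks being the guesses from the post rather than confirmed numbers:

```python
# Simple clock-scaling estimate; both clocks are speculative guesses, not confirmed specs.
iphone_a8_gpu_mhz = 450
larger_form_factor_mhz = 600
uplift = larger_form_factor_mhz / iphone_a8_gpu_mhz - 1
print(f"Theoretical GPU uplift from clocks alone: {uplift:.0%}")  # ~33%, bandwidth permitting
```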
 
My hope would be next year.

I don't think there's anyone who doesn't share that hope :smile: If it should be late '15 or later, though, it won't help Tegra marketing much if the Erista GPU exceeds feature level DX11.0.
 
10% here and 10% there, and before you know it you're way over budget in the end. As for DX12 in smartphones, it'll take eons before you see anything even OGL_ES3.x-related, let alone anything close to DX11. Tablets are in a better power realm, but they're still far from those functionalities being anything more than mostly decorative.

That's very true.
The favorable case would be middleware (e.g. Unreal Engine, Unity, etc.) easily allowing you to have a DX12 path, an OpenGL 4.5 path, an OpenGL Next path, etc., even if the assets and engine otherwise target old OpenGL ES.
If that allows reduced CPU usage, that's mildly good and useful.
Ideally the app store sends you a package that only includes the code path you need; can Android or Windows RT/Phone do that? It would be useful for ARMv7 vs ARMv8 vs x86 too.
 
I'd actually love to know how many watts K1 needs for very demanding engines like the latest Unreal one, or for something as complicated as DX11 tessellation.

Measuring power or battery lifetime in comparatively boring stuff like Kishonti's T-Rex, with mostly alpha tests at overload, is fine and dandy, but when you market product N with X capabilities I'd also like to know how things look at full tilt.
 
I'd rather bet on a dual core at a rounded-up frequency.

Probably... I was just joking. Though there are some cases where a quad core would be useful, especially in the rumoured "Pro" version of the iPad.
Unless I'm missing something, it doesn't sound like much more than a year from SoC to SoC either.
Well, if we take Q1'15 and Q2'16 as the probable release dates, it would be 5 quarters, which is a bit more than the gap between the previous Tegra chips AFAIK.

Though another thing is that I see Erista sticking around for quite a while even after its 16nm successor is released. 16nm capacity will be expensive and availability will be limited initially. There is no significant increase in density from 20nm to 16nm, so the configurations would largely be similar. There would be some gains in power and/or performance, but at considerably higher production costs. So Erista may suffice for a lot of customers, as it would be priced significantly lower and would not trail behind in performance all that much, and it could have a reasonably long life cycle. So if it's a part intended to be sold through 2016, or even later... 16 TMUs and DX12 compatibility don't seem to be overkill. (It's a similar situation to how 28nm will remain the dominant process node in 2015 for anything mid-range or lower.)
I'm awful with die shots, but luckily the guys at Anandtech marked the TMU blocks, so I estimated roughly 2mm2 for each quad TMU. I don't know how big they're planning Erista to be. Historically, after Tegra 2 they almost religiously stayed at around 80mm2, and K1 broke every one of their SoC area records so far. Back to "base", or does it simply not matter?

Besides, 3.5-4mm2 isn't exactly small for the ULP SoC world; given the A8 die area and its likely transistor density, those 2 extra quad TMUs could be well over 80 million transistors.
Well, I got 1.9mm2, so we aren't too far off. Also note that Apple has traditionally been more concerned with power and has traded area for lower power consumption. Nvidia should have higher density, so the actual area for them should be even less... And see my post above for reasons why the extra TMUs may be justified.
10% here and 10% there, and before you know it you're way over budget in the end. As for DX12 in smartphones, it'll take eons before you see anything even OGL_ES3.x-related, let alone anything close to DX11. Tablets are in a better power realm, but they're still far from those functionalities being anything more than mostly decorative.

I've nothing against progress, au contraire, but under such extreme power constraints I'd rather have transistors invested in higher efficiency. Besides, Microsoft's penetration in the ULP market is still extremely small.
Agreed... but in this case it would be ~5-6 mm2 in total, so it's not all that high in the context of a ~100 mm2 SoC. And again I'll point you to my post above for the reasons why I think it won't be completely useless. Given the timeframe, the features could very well be utilised towards the later part of its lifecycle.
 
Erinyes said:
Well, if we take Q1'15 and Q2'16 as the probable release dates, it would be 5 quarters, which is a bit more than the gap between the previous Tegra chips AFAIK.

If first silicon for Tegra M1 Erista was available in July 2014, then that would be ~ 1 year after first silicon for Tegra K1 Logan was available. So I would realistically expect to see Tegra M1 Erista in devices by Q2 2015. And if that first Tegra M1 variant has Cortex A57 + A53 cores, then I would expect to see a second Tegra M1 variant with Denver cores in devices by Q4 2015. And then the cycle would repeat in 2016 with Tegra M2 [Maxwell GPU], and in 2017 with Tegra P1 [Pascal GPU].
 