NVIDIA Tegra Architecture

I love how nVidia just finished trying to convince everyone that Tegra 4 being OES2.0 based is a big win because it's more die area and power efficient than an OES3.0 GPU. And now a year later with Logan they're saying the exact opposite, jumping right past OES3.0 to full desktop OGL4.3 support with all its associated cruft, and promoting it as a great design decision for mobile.

Well for the additional features I'd be the first that would want to stand up and scream "finally". Now where's the catch exactly, before I even bother and make a fool out of myself?
 
Well for the additional features I'd be the first that would want to stand up and scream "finally". Now where's the catch exactly, before I even bother and make a fool out of myself?
Well for one thing, nVidia is saying that when Tegra 4 ships in Q2 2013, FP20 provides sufficient image quality for mobile usage and allows them significant savings in die area and power. Their white paper promotes architectural efficiency as focused on using transistors and power for performance rather than raw feature support, and goes on to try to promote Tegra 4 in this regard against competing mobile GPUs. Is it really likely, then, that nine months later with Logan in Q1 2014, the mandatory FP64 in OpenGL 4.x can be made as efficient in power and transistor cost as FP20, such that it's essentially free, or that developers and users are clamoring for it so loudly that they have to offer it?

Certainly I like seeing more features as well, and I'd happily take a reasonable increase in area and power consumption for more functionality. Perhaps things like FP32 and FP64 aren't actually that expensive; I just find it interesting that their marketing is currently portraying that they are, emphasizing a lean GPU design, and that they can switch gears so effectively in a few months. I guess it's all part of selling what you have.
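(Rough back-of-the-envelope on the precision gap, assuming FP20 keeps a mantissa somewhere in the 11-13 bit range; the exact layout isn't given here:

decimal digits ≈ significant bits × log10(2)
FP32: 24 significant bits × 0.301 ≈ 7.2 digits
FP20: ~12-14 significant bits × 0.301 ≈ 3.6-4.2 digits
8-bit-per-channel output: log10(256) ≈ 2.4 digits

So FP20 comfortably covers an 8-bit framebuffer, which is presumably the basis of the "sufficient image quality" claim, but the margin shrinks for long shader math or texture addressing on large textures.)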
 
Ailuros said:
Well, thank you for admitting in the end (even if unwillingly) that the scale is based on SoC-level performance.
I wasn't aware I ever stated otherwise?

As for the "unwilling" part... you are just seeing stuff that isn't there man.
 
I love how nVidia just finished trying to convince everyone that Tegra 4 being OES2.0 based is a big win because it's more die area and power efficient than an OES3.0 GPU. And now a year later with Logan they're saying the exact opposite, jumping right past OES3.0 to full desktop OGL4.3 support with all its associated cruft, and promoting it as a great design decision for mobile.

That is not really what they said. What NVIDIA said is that, for this generation with Tegra 4, it made more sense for them to stick with a non-unified shader architecture and use FP20 pixel shader precision in order to keep power consumption (and die size and cost) as low as possible. But NVIDIA did indicate that Tegra 4 supports many of the key features of OpenGL ES 3.0, and they did indicate at CES 2013 that CUDA support on Tegra was near. With respect to Tegra 5 (Logan), since the GPU is a Kepler derivative, it is only natural that it would support OpenGL 4.3 and CUDA. Some developers must have complained about the lack of full support for anything beyond OpenGL ES 2.0 in Tegra 4, so now NVIDIA is providing them the Kayla platform, which supports the same feature set that will be seen in their next-gen Tegra GPU.
 
In your opinion, can a Tegra 4 max out its real performance today?

Sure. It'll easily be able to fully utilize a single core at 1.9GHz for short periods of time while launching apps, loading web pages, etc., where there's enough work to be done that it's worth spending the power to reduce perceived wait time. Only in some specialized scenarios like bulk compression will it make sense for it to use all four of its cores at whatever peak clock is allowed, but such scenarios could exist. It will probably be able to max the GPU in some games if allowed to render at a really high resolution, particularly if the game has optional features that can increase GPU load, or has them turned on specifically for Tegra 4s.

This of course doesn't cover the pathological case of running the CPUs and the GPU at full capacity simultaneously, but that applies to anything on the market (and it's why I think it's disingenuous when review sites start doing this).

That roadmap isn't new; they just re-adjusted it slightly recently. In past years I kept breaking my head over why T4 is placed so close to T3 while the distance between T3 and T2 is bigger. If you peg the placement in that diagram (yes, I know, marketing yadda yadda...) to fairly simple tasks with a very specific perf/W ratio, then of course power consumption between T4 and T3 for those use cases isn't going to change significantly, but neither will performance, despite T4 carrying a GPU several times faster than T3's.

Not really sure what you're saying: that CPU performance won't change significantly for Tegra 4 vs Tegra 3? Of course that isn't true. Maybe perf/W doesn't improve a lot, but I wanted to get across that you need some significant mixture of improving perf/W and perf. Not necessarily a strong amount of both; it's enough to increase peak perf a lot while not decreasing perf/W by much, if you have the (short-term) power budget to utilize it.

I haven't dug too much into the A5x to be honest, but it was my impression so far that besides 64-bit, one of the important changes was much higher perf/W.

So you think Cortex-A57 substantially increases peak perf and perf/W on the same node while not using so much more area that it becomes impractical to put four of them on an SoC? That doesn't sound realistic.

Of course perf/W isn't an absolute comparison; you can have a different curve shape and therefore be better at some parts and worse at others. But from what we know of A9 vs A15, I'm going to say that by and large A15 has notably worse perf/W than A9 given a similar node and implementation. nVidia says its power-optimized 845MHz core uses 40% less power than the 1.6GHz performance-optimized A9 in Tegra 3 (although that A9 perhaps had the benefit of better dynamic power consumption vs poorer static). They say they achieve the same performance, but I expect that in most cases this won't actually be true. 40% is about what you'd get from a shrink. This huge difference in clock speed represents what would probably be a best case for a perf/W comparison.
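(For context, a rough sketch using only the standard dynamic power relation, ignoring leakage and implementation differences: P_dyn ≈ α · C · V² · f. Scaling a core from 1.6GHz down to 845MHz is roughly a 0.53x frequency drop, which by itself would cut dynamic power to about half even at constant voltage, and lowering the voltage along with the clock saves considerably more. Seen that way, "40% less power at supposedly equal performance" across that kind of clock gap is not a particularly strong showing for A15 efficiency.)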

Point is, ARM clearly traded power efficiency for peak perf, and they had no intention of making A15 supersede A9, but rather offered another design point optimized for different devices/usage scenarios.

A57 will likely only get more complex/aggressive (although I could see it more outright replacing A15). It is possible that A15 had some glaring problems or poor balance issues power-wise and that A57 fixed them; you would expect at least some level of optimization, and ARM is still pretty new at some of these wider/heavier CPU structures. A7 probably achieves substantially better perf/W than A8 (at the same node) while offering competitive peak perf. So improvements do happen, but you can't really make any assumptions about them, and ARM hasn't really said anything about big perf/W breakthroughs for A57.

Cortex-A53 isn't really on the table atm, I doubt nVidia is even interested if they didn't go for A7, but who knows - maybe they were too far into Tegra 4's design by the time it would have even been up for consideration.

I'm fairly sure that from the outside the result might strongly resemble a reduced Kepler cluster; in reality, IMHO, it won't be anything other than an SoC GPU block fine-tuned for SFF markets, with all the lessons they learned with Kepler included.

Perhaps, but I'm mainly referring to DX11 features (or OpenGL 4.3 at least, which they did specify) and unified precision in shaders, which you yourself have often said come at a big area cost.

Trick question then: is it likelier that the first desktop Maxwell chips will arrive on 28nm or on 20HP, and why?

Couldn't guess because I have no idea when those are supposed to be released.

Well, that's one of those typical marketing oxymorons you hit on all slides of that kind, wherever they come from. Someone in another forum tried an even funkier explanation and claimed the scale is for GFLOPs, with the Parker GPU ending up at a glorified 1 TFLOP. The unfortunate thing is that it's nonsense to compare FP20 with FP32 ALUs in the first place, and beyond that the ULP GF in T2 delivered just 5.33 GFLOPs, so no, that scale won't work just according to some folks' convenience. Even though it's a marketing slide, there is a reasoning behind it, however twisted it might be due to its marketing nature.
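(For reference, the 5.33 GFLOPs figure for T2 works out from the commonly cited ULP GeForce configuration, assuming 4 pixel plus 4 vertex ALUs at 333MHz and counting a MAD as 2 FLOPs: 8 ALUs × 2 FLOPs × 0.333GHz ≈ 5.33 GFLOPs. Note that this lumps FP20 pixel ALUs and FP32 vertex ALUs into one number, which is exactly why comparing on raw GFLOPs is already questionable.)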

Frankly I barely pay attention to nVidia's marketing, it's pretty consistently ridiculous :p
 
How should that be possible?
According to heise news, they plan to announce Logan as early as 2013 and have it in phones or tablets in 2014.
Link (German): http://www.heise.de/newsticker/meldung/GTC-2013-Erste-Details-zu-Tegra-5-und-Tegra-6-1826134.html
Oh, and if this is true then Tegra 4 seems to be dead in the water...

[edit] Link: http://blog.gsmarena.com/nvidia-reveals-their-tegra-roadmap/#more-46498

So only 6-8 months between Tegra 4 and Logan. Tegra 4 really seems dead before arrival.

damn...firefox is really bad for writing comments. The format is completely wrong.[edit again]: javascript helps :)

I will both buy and eat my hat if Logan shows up in a device in Q1 2014. Having a blip on a roadmap at 2014 doesn't mean "shortly after 2014 starts." If that's the case Tegra 4 is already looking late.

Now as for announcing it in 2013, sure, they can announce it whenever they want (you could say they already have), it doesn't mean anything about when devices will hit.

And even if nVidia does give their expectations for a shipping product using Logan you can expect them to be hopelessly optimistic, like their claims that there'd be Tegra 2 tablets in spring 2010 and Tegra 3 devices in summer 2011.
 
I will both buy and eat my hat if Logan shows up in a device in Q1 2014.

Yesterday at GTC 2013, NVIDIA said that details on Logan will be announced later this year, and will definitely be in production in early 2014. As for when Logan will show up in a commercial device, who knows, but Q1-Q2 2014 is the likely date. Tegra 4 (Wayne) went into production about three months later than expected, so this general timeframe for Tegra 5 is not out of the question.
 
I will both buy and eat my hat if Logan shows up in a device in Q1 2014
LOL. Careful... While I think you are probably right, it isn't too unlikely that they could have a Shield update or a Chromebook out by the end of March next year.
 
Yesterday at GTC 2013, NVIDIA said that details on Logan will be announced later this year, and will definitely be in production in early 2014. As for when Logan will show up in a commercial device, who knows, but Q1-Q2 2014 is the likely date. Tegra 4 (Wayne) went into production about three months later than expected, so this general timeframe for Tegra 5 is not out of the question.

So let me get this straight: Tegra 4 went into production three months later than expected, but Tegra 5 will definitely enter production exactly when nVidia says it will? Never mind that a Tegra product has never come out as early as nVidia said it would (okay, no idea about Tegra 1, but that showed up in so few devices that I have to wonder).

Of course, you can probably distort "early 2014" to mean a variety of things.

If nVidia is really pushing for a Tegra 5 to show up in devices ~8 months after Tegra 4 and on the same process, then I have to really question their decisions, or wonder if they desperately feel that something in Tegra 4 needs fixing. It could be pressure for OpenGL ES 3 and OpenCL, but I doubt it.

ams said:
Did you notice that Stark has disappeared and is now replaced by a slightly higher performance variant named Parker? I wonder what could have changed in the design of Tegra 6 to warrant a name change.

Maybe nVidia marketing decided Iron Man went on one too many drunken binges and Spiderman is a better role model ;p
 
Not really sure what you're saying: that CPU performance won't change significantly for Tegra 4 vs Tegra 3? Of course that isn't true. Maybe perf/W doesn't improve a lot, but I wanted to get across that you need some significant mixture of improving perf/W and perf. Not necessarily a strong amount of both; it's enough to increase peak perf a lot while not decreasing perf/W by much, if you have the (short-term) power budget to utilize it.

No, that's not what I meant. I'm merely trying to find the real reasoning behind the placement in that slide, that's all.

So you think Cortex-A57 substantially increases peak perf and perf/W on the same node while not using so much more area that it becomes impractical to put four of them on an SoC? That doesn't sound realistic.

From a marketing standpoint it's obviously not realistic.

Of course perf/W isn't an absolute comparison; you can have a different curve shape and therefore be better at some parts and worse at others. But from what we know of A9 vs A15, I'm going to say that by and large A15 has notably worse perf/W than A9 given a similar node and implementation. nVidia says its power-optimized 845MHz core uses 40% less power than the 1.6GHz performance-optimized A9 in Tegra 3 (although that A9 perhaps had the benefit of better dynamic power consumption vs poorer static). They say they achieve the same performance, but I expect that in most cases this won't actually be true. 40% is about what you'd get from a shrink. This huge difference in clock speed represents what would probably be a best case for a perf/W comparison.

I've no clue what the A50s look like. I frankly didn't pay that much attention either, but for some reason I'm left with the impression that they don't improve performance that much compared to A15, but rather power consumption.


Perhaps, but I'm mainly referring to DX11 features (or OpenGL 4.3 at least, which they did specify) and unified precision in shaders, which you yourself have often said come at a big area cost.

Simple example: given that DX11 TMUs are several times more expensive than the DX9L1 TMUs used up through T4, would you keep all 16 TMUs, i.e. 4 quad TMUs, in an SMX you'd be targeting at tablets and smartphones? Once you start making a list of what you can scratch and/or reduce to get to a reasonable die area and power consumption budget, the question remains how much Kepler is actually left after that.

Couldn't guess because I have no idea when those are supposed to be released.

Rumor has it that NV will most likely manufacture the first Maxwell iterations still on 28nm and then move gradually to 20nm. I wouldn't expect AMD to do anything different, since they've claimed that they plan to remain at 28nm for quite some time to come.

Frankly I barely pay attention to nVidia's marketing, it's pretty consistently ridiculous :p

If you remove "NVIDIA's" from that sentence it's still valid you know ;)
 
For the Parker SoC, I wonder if there will be a companion SoC (like Grey) or a reference design (like Phoenix) called Watson (let's see who can guess this reference, but it's really easy anyway).
 
http://www.pcgameshardware.de/GTC-E...bislang-geheimer-Kepler-GPU-D15M2-20-1061349/

2 SMX, 384 CUDA cores. Also note it's Compute Capability 3.5, same as GK110, not 3.0 like GK104/106/107.
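(If anyone wants to sanity-check that on the Kayla board itself, a minimal sketch using the standard CUDA runtime API would be something like the following; device index 0 is just an assumption:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the first CUDA device (assumed here to be the Kayla GPU).
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    // A CC 3.5 part reports major=3, minor=5; GK104/106/107 report 3.0.
    printf("%s: compute capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}

If the slide is accurate, that should report 3.5 with 2 multiprocessors.)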

[Attached slides from the GTC presentation]
 
"Capability approaching Logan SoC"

They are pretty careful with their wording there. They're not saying that performance of Logan will be anywhere close.
 
"Capability approaching Logan SoC"

They are pretty careful with their wording there. They're not saying that performance of Logan will be anywhere close.

They're also pointing out that Logan will be more power efficient, albeit I'd word it as more power conscious in the given case, but that's hair splitting. Logan obviously won't have a cooling fan and a heatsink like Kayla ;)
 
From a marketing standpoint it's obviously not realistic.

Not sure what you mean. I don't think it's that realistic from an engineering standpoint, unless A15 sucked.

I've no clue what the A50s look like. I frankly didn't pay that much attention either, but for some reason I'm left with the impression that they don't improve performance that much compared to A15, but rather power consumption.

ARM hasn't said very much at all about A57 except that it's ARMv8/64-bit and that its pipeline is similar to A15's. From the information here http://www.arm.com/products/processors/cortex-a50/cortex-a57-processor.php it looks like most of it is the same as A15 (same number of instructions in flight, execution slots, fetch and decode rates). What I can gather is that they doubled the size of the L2 TLB, increased the L1 ITLB size by 1.5x, and added large page support to the L1 DTLBs (there are separate load and store ones). I'm not actually sure how large the BTB is on A15, but it looks like they made it larger here too, and removed the 1-cycle taken-branch penalty that probably didn't hurt much in practice anyway. It seems like they also added an L1D prefetcher (Cortex-A15 may only have L2 prefetchers; it's hard for me to tell).

Anyway, I can't find anything ARM has said about power consumption, just performance, so I get the opposite impression to you.

Simple example: given that DX11 TMUs are several times more expensive than the DX9L1 TMUs used up through T4, would you keep all 16 TMUs, i.e. 4 quad TMUs, in an SMX you'd be targeting at tablets and smartphones? Once you start making a list of what you can scratch and/or reduce to get to a reasonable die area and power consumption budget, the question remains how much Kepler is actually left after that.

I'm not going to argue that these are going to be Kepler SMX verbatim, I wouldn't expect that either. But that doesn't really matter, I'm just saying that they're going to need substantially more transistors for the next gen GPU just to maintain the same performance levels, never mind to increase it dramatically (especially without increasing power consumption a lot too). They're going to need at least some more for the CPU as well, and I don't expect nVidia to want to make a chip that's much larger than Tegra 4. So I really think they need the shrink to make this practical.

Rumor has it that NV will most likely manufacture the first Maxwell iterations still on 28nm and then move gradually to 20nm. I wouldn't expect AMD to do anything different, since they've claimed that they plan to remain at 28nm for quite some time to come.

Okay, so what's the expected timeframe for Maxwell? I don't really follow GPU pre-release speculation, I just know that a new generation comes out roughly yearly.. So is Maxwell supposed to be out Q2 2014 or what?

AMD may well have to stick with 28nm for as long as they can because AMD doesn't have a lot of money.

If you remove "NVIDIA's" from that sentence it's still valid you know ;)

True to an extent, but I really do find nVidia's to be worse than average.
 
[...] A57 except that it's ARMv8/64-bit [...] performance, so I get the opposite impression to you.

I just sent you a PM about it. I'll work on it a bit and come back to it tomorrow.

I'm not going to argue that these are going to be Kepler SMX verbatim, I wouldn't expect that either. But that doesn't really matter, I'm just saying that they're going to need substantially more transistors for the next gen GPU just to maintain the same performance levels, never mind to increase it dramatically (especially without increasing power consumption a lot too). They're going to need at least some more for the CPU as well, and I don't expect nVidia to want to make a chip that's much larger than Tegra 4. So I really think they need the shrink to make this practical.

I'd personally prefer ~100 GFLOPs worth of GPU performance from a Kepler-like SFF GPU over what the ULP GF delivers at the time with a comparable theoretical GFLOP peak.
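(Purely illustrative, assuming a single Kepler-style SMX of 192 FP32 ALUs and counting an FMA as 2 FLOPs: 192 × 2 × ~0.26GHz ≈ 100 GFLOPs, i.e. even one SMX at a very modest clock would reach that kind of theoretical peak.)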

Okay, so what's the expected timeframe for Maxwell? I don't really follow GPU pre-release speculation, I just know that a new generation comes out roughly yearly.. So is Maxwell supposed to be out Q2 2014 or what?

It could be Q2 '14, but definitely nothing from the top dog in sight for quite some time after that, for similar reasons as today. I wouldn't be in the least surprised if they kick-start the Maxwell family release with its performance SKU, and even that one being manufactured on 28HP.

AMD may well have to stick with 28nm for as long as they can because AMD doesn't have a lot of money.

We are in times where any IHV has to save and keep every penny it can, and NV isn't an exception here, especially with their Tegra costs soon set to explode.

True to an extent, but I really do find nVidia's to be worse than average.

For Tegra it seems like they're the most ignorant so far.
 
So let me get this straight: Tegra 4 went into production three months later than expected, but Tegra 5 will definitely enter production exactly when nVidia says it will? Never mind that a Tegra product has never come out as early as nVidia said it would (okay, no idea about Tegra 1, but that showed up in so few devices that I have to wonder).

Heh, yeah, that is what they said (and as always, take it with a grain of salt). The way I look at it is that, even if there is a few month delay, there should be a new Tegra device out every ~ 12-15 months. So if the first Tegra 4 devices are available in May 2013 (after already being delayed by about three months), then it is not out of the question that the first Tegra 5 devices will be available in Mar-May 2014. It is true that availability in Q1 2014 may be pushing it, but availability by Q2 2014 is quite conceivable and frankly expected given their yearly cadence with Tegra.

Maybe nVidia marketing decided Iron Man went on one too many drunken binges and Spiderman is a better role model ;p

:LOL:
 