NVIDIA Tegra Architecture

Late to the discussion, but Tegra 4 doesn't have an integrated modem and it doesn't have an SM4 GPU. What the fuck are they playing at?!? What were they doing for the past year?
As much as I like reading and talking about GPU stuff, as a phone/tablet user, I don't give a hoot about the features as long as it's fast enough to render things smoothly. Give me a slide that says 'This GPU is x times faster than that one' and I'm happy. I'm pretty sure this is true for 99% of the population out there and, for now, developers as well.
Advanced SM features (I assume you mean things like tessellation etc.?) are something that's not very useful even for low-end desktop GPUs with way more performance than mobile GPUs, so I don't see why they'd be needed here for some time to come. Are developers really waiting for this stuff in the near term?

CPU performance seems to be more of a big deal (though you may question whether 4 A15s isn't a bit much for a phone...)
 
Will they be able to fit this in a phone? I fear that TDP will be too high to keep cool in a phone form factor.
 
Late to the discussion, but Tegra 4 doesn't have an integrated modem and it doesn't have an SM4 GPU. What the fuck are they playing at?!? What were they doing for the past year? How hard can it be to stuff 80 or so Kepler ALUs into this and integrate the baseband!
Regarding the baseband: a lot harder than you think, and even harder than it is for Intel to integrate Infineon's. i500 is on 28HP while Tegra 4 is on 28HPL. Icera always used structured custom (custom placement, automatic routing) to improve efficiency. That means a lot of the work has to be done to port it to 28HPL... And Icera's arguments regarding the higher area *and* power efficiency of 28HP for their architecture are very persuasive, so you'd end up with a slightly inferior baseband too.

They'll obviously integrate the baseband for Grey (which I assume will be launched at MWC), where there is a clear cost benefit. In the ultra-high-end though, the benefit is not as obvious since it's a much lower percentage of total cost (and completely negligible power for the bus - ignore all the marketing rubbish, how expensive can 20MB/s be when your DRAM is nearly 1000x faster?) and there's a greater market for WiFi-only tablets you need to target. I think the APQ8064 clearly proves even Qualcomm agrees.

silent_guy said:
I imagine that, with very high maximum clock speeds, you can save quite a bit of power by clocking down and lowering VDD as well.
The problem is that all current handheld CPUs already do that. You could reduce your maximum clock speed so that it happens at a lower voltage, but you're still going to hit your minimum voltage in lower operating modes. The problem with the A15 is that if your performance at nominal voltage is 2GHz and, say, your performance at minimum voltage is 800MHz, that's still a ridiculously fast processor, and you don't need anything anywhere near that fast for most tasks. The whole point of big.LITTLE is to extend the curve and get optimal power efficiency at lower performance levels (although since it's an inherently more power efficient architecture, the break-even point may be higher than minimum voltage for the A15). The "companion core" approach from NVIDIA helps in a similar way, although only when you don't need more than one thread - still, that's the bulk of workloads today, so it should already help a lot compared to a standard A15 implementation.
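(As a rough aside on why the voltage floor caps the savings, here's a minimal sketch with entirely made-up numbers - the frequency/voltage pairs and the effective capacitance below are assumptions, not real A15 figures. Once the voltage stops dropping, dynamic power only falls linearly with frequency.)

Code:
# Rough DVFS sketch with illustrative numbers only (not real A15 data).
# Dynamic power scales roughly as C * V^2 * f, so once the minimum
# operating voltage is reached, further downclocking only buys the
# linear-in-f part of the savings.

def dynamic_power(c_eff_nf, v_volts, f_mhz):
    """Very rough dynamic power estimate; nF * V^2 * MHz ~= mW."""
    return c_eff_nf * (v_volts ** 2) * f_mhz

C_EFF = 1.0  # hypothetical effective switched capacitance, nF
operating_points = [  # (MHz, V) pairs, all assumed for illustration
    (2000, 1.10),  # nominal
    (1200, 0.95),
    (800,  0.85),  # minimum voltage reached
    (400,  0.85),  # voltage can't drop any further
]

for f, v in operating_points:
    print(f"{f:>5} MHz @ {v:.2f} V -> ~{dynamic_power(C_EFF, v, f):6.0f} mW (relative)")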

Ideally what you'd want is big.LITTLE with both sets of cores active and visible to the OS (with the right kernel logic to make it work) combined with a single A7/LITTLE companion core that could be active at the same time for lower leakage/performance (i.e. High Vt with longer gate channel lengths and power-optimised synthesis) so that the main cores could be implemented with a higher performance/leakage process (for lower active power by undervolting). So I can imagine something like 4xA57+5xA53 being very interesting on 20nm...

Although it seems to me at that point you'd probably want a shared L3 and per-core L2, and then there's also the question of where SMT fits on ARM's roadmap (it's not clear to me that it's a good idea to run a second thread on a big core if you've got a spare LITTLE core that's still free and more power efficient, although I suppose it depends on the memory hierarchy and the percentage of leakage vs active power since you don't have the time to power gate on a cache miss).

silent_guy said:
As much as I like reading and talking about GPU stuff, as a phone/tablet user, I don't give a hoot about the features as long as it's fast enough to render things smoothly.
But you should care about image quality. Tegra 3 didn't support framebuffer compression, so to save bandwidth they only supported (or at least exposed?) a pitiful 16-bit depth buffer. That leads to quite a lot of depth precision issues... I hope they at least added framebuffer compression and 24-bit depth for this generation, or it'll be a complete joke (and ideally MSAA). In my mind this leads to a fundamental flaw of the architecture: it's an IMR but it's not fast enough at doing a Z-Only prepass (because of bandwidth, depth rate, and geometry performance) so you need (unrealistically?) good front-to-back ordering to get good performance on complex workloads. And even then I suspect they waste more time than they should on rejecting pixels for perfectly front-to-back ordered scenes that have high overdraw as they have no Hier-Z of any kind...
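To put a rough number on the 16-bit depth issue, here's a minimal sketch assuming a standard [0,1] perspective depth mapping with hypothetical near/far planes (0.5 and 1000 units are my assumptions, not anything from Tegra):

Code:
# Minimal sketch: how much eye-space distance one depth-buffer step covers
# at various distances, for 16-bit vs 24-bit depth, assuming a standard
# perspective projection. Near/far planes below are assumptions.

NEAR, FAR = 0.5, 1000.0  # hypothetical scene near/far planes

def window_depth(z_eye, n=NEAR, f=FAR):
    """Eye-space distance -> [0,1] window depth (standard perspective)."""
    return (f / (f - n)) * (1.0 - n / z_eye)

def eye_depth(d, n=NEAR, f=FAR):
    """Inverse mapping: [0,1] window depth -> eye-space distance."""
    return f * n / (f - d * (f - n))

for bits in (16, 24):
    step = 1.0 / (2 ** bits - 1)
    for z in (1.0, 10.0, 100.0):
        d = window_depth(z)
        dz = eye_depth(min(d + step, 1.0)) - z
        print(f"{bits}-bit depth at {z:6.1f} units: one step spans ~{dz:.5f} units")

Two surfaces closer together than one step at a given distance will fight, which is exactly the kind of artifact you'd rather not leave to luck with draw ordering.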

BTW, one remaining question for Tegra 4 (if they've kept basically the exact same architecture) is whether the higher number of "cores" is a result of a higher ALU:[TMU/ROP] ratio since I can't see how they'd have enough bandwidth for that many units otherwise...
 
I always find it surprising that NV have included relatively sub-par GPUs in their Tegra line of chips when you would think this would be the area where they would be keen to excel.

I guess it is a deliberate decision to execute each generation as quickly as possible and get the first chip out with a marketable feature (i.e. first dual-core A9, first quad-core A9, first quad-core A15) without spending too many resources on the GPU side of things.

Of course, the fragmentation in Android devices allows this to be a reasonably effective strategy as no games are ever designed just for the bleeding edge and even relatively old chips such as Tegra 2 still fare well enough on most games.
 
Those cached results from earlier:
RESULTS
GL_VENDOR NVIDIA Corporation
GL_VERSION OpenGL ES 2.0 17.01235
GL_RENDERER NVIDIA Tegra

From the system spec, it runs Android 4.2.1, with a min frequency of 51 MHz and a max of 1836 MHz.

Nvidia DALMORE
GLBenchmark 2.5 Egypt HD C24Z16 - Offscreen (1080p) : 32.6 fps

iPad 4
GLBenchmark 2.5 Egypt HD C24Z16 - Offscreen (1080p): 49.6 fps


GL BENCHMARK - High Level

http://webcache.googleusercontent.c...p?D=Dalmore+Dalmore+&cd=1&hl=en&ct=clnk&gl=uk


GL BENCHMARK - Low Level

http://webcache.googleusercontent.c...e&testgroup=lowlevel&cd=1&hl=en&ct=clnk&gl=uk


GL BENCHMARK - GL CONFIG


http://webcache.googleusercontent.c...Dalmore&testgroup=gl&cd=1&hl=en&ct=clnk&gl=uk

GL BENCHMARK - EGL CONFIG

http://webcache.googleusercontent.c...almore&testgroup=egl&cd=1&hl=en&ct=clnk&gl=uk


GL BENCHMARK - SYSTEM

http://webcache.googleusercontent.c...ore&testgroup=system&cd=1&hl=en&ct=clnk&gl=uk

OFFSCREEN RESULTS

http://webcache.googleusercontent.c...enchmark.com+dalmore&cd=4&hl=en&ct=clnk&gl=uk
 
I always find it surprising that NV have included relatively sub-par GPUs in their Tegra line of chips when you would think this would be the area where they would be keen to excel.

At least PowerVR has been doing graphics nearly as long as nVidia, but focusing on low power for a lot longer. The same may apply to the Adreno team to a lesser extent. Not so sure about the others (Mali pre-ARM acquisition, Vivante..)

The GPUs seem pretty competitive for the die area spent. They can't as easily offer iPad sized SoCs because they don't have nearly the same volume as Apple to tap into for relatively expensive tablets with high end graphics. The same goes for all the other SoC manufacturers.
 
The slowest T30L implementation (LG Optimus 4X) got around 8.5 FPS in offscreen Egypt HD C24Z16, and the fastest T33 (TF700 with 1600MHz DDR3) got 13.2 FPS.

This means Tegra 4's GPU is between 2.5 and 3.8x the performance of Tegra 3's GPU.
For now, it's in the ballpark of the Adreno 320 and the iPhone 5's SGX543MP3.
Except those don't need heatsinks and cooling fans (which probably has a lot more to do with the four Cortex A15 than anything else).
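(For reference, that 2.5-3.8x range is just the ratio of the scores quoted above:)

Code:
# Back-of-the-envelope check of the scaling range quoted above
# (Egypt HD C24Z16 offscreen 1080p scores from earlier posts).
dalmore_t4 = 32.6   # Dalmore (Tegra 4) fps
t33_tf700  = 13.2   # fastest Tegra 3 (TF700, 1600MHz DDR3) fps
t30l_lg4x  = 8.5    # slowest Tegra 3 (LG Optimus 4X) fps

print(f"vs fastest Tegra 3: {dalmore_t4 / t33_tf700:.1f}x")  # ~2.5x
print(f"vs slowest Tegra 3: {dalmore_t4 / t30l_lg4x:.1f}x")  # ~3.8x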
 
BTW, one remaining question for Tegra 4 (if they've kept basically the exact same architecture) is whether the higher number of "cores" is a result of a higher ALU:[TMU/ROP] ratio since I can't see how they'd have enough bandwidth for that many units otherwise...

Probably. Everyone else is transitioning to higher ALU:TMU/ROP ratios too, right?

I wonder if this really is the naive 6x Tegra 3. ie 12 vec4 FP20 pixel shader pipelines and 6 vec4 FP32 vertex shaders, or perhaps they changed the PS:VS ratio again too?

ToTTenTranz said:
For now, it's in the ballpark of the Adreno 320 and the iPhone 5's SGX543MP3.
Except those don't need heatsinks and cooling fans (which probably has a lot more to do with the four Cortex A15 than anything else).

I wonder. It looks like the Mali-T604 in the Exynos 5250 peaks at higher power consumption than the 2 Cortex-A15 cores. Of course Tegra 4 has double the cores and higher peak clocks, at least in the single-threaded scenario, but I wonder what clock it'll really let all those cores run at. For the usage they're looking at, it'd be a completely reasonable design choice not to allow the 4 cores a significantly higher sustained power budget than the 2 cores.

BTW, does Shield really have a fan? It didn't look like it does in the videos, even though we know it can draw at least 8W with most of it being concentrated in the controller part.
 
Compared to Tegra 3, it appears (but has not yet been confirmed) that Tegra 4 has 6x the pixel shader execution units (48 vs. 8) and 6x the vertex shader execution units (24 vs. 4), for a grand total of 72 pixel/vertex shader execution units in Tegra 4 vs. 12 in Tegra 3.
Since they aren't using unified shaders, the choice of pixel shader to vertex shader ratio will be very important to performance for current and future games. Tegra 2 started out with a 1:1 ratio (4PS:4VS) and Tegra 3 moved to 2:1 (8PS:4VS) to support more pixel-shader-heavy games. Before desktop GPUs went to unified shaders, nVidia had a 3:1 ratio in the G71 with 24PS:8VS, while ATI was even more aggressive with a 6:1 ratio at 48PS:8VS in the R580. Since things trend toward being more pixel shader heavy, I don't think nVidia will be sticking to a 2:1 PS:VS ratio for Tegra 4. I think a 5:1 60PS:12VS split (15 vec4 FP20 PS and 3 vec4 FP32 VS) would make a good forward-looking combination, while a 56PS:16VS split (14 vec4 FP20 PS and 4 vec4 FP32 VS) might be a better balance for current workloads.
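To make the option space explicit, here's a tiny sketch enumerating the PS-heavy vec4 splits that add up to 72 lanes - the 6x-Tegra-3 total and the vec4 granularity are the assumptions from this thread, nothing confirmed:

Code:
# Enumerate possible PS:VS splits under the assumptions discussed here:
# 72 shader lanes in total, allocated as vec4 units (18 vec4 units).
# Nothing about Tegra 4's real configuration is confirmed.

TOTAL_LANES = 72
VEC_WIDTH = 4
TOTAL_VEC4 = TOTAL_LANES // VEC_WIDTH  # 18 vec4 units

for vs_units in range(1, TOTAL_VEC4):
    ps_units = TOTAL_VEC4 - vs_units
    ps_lanes, vs_lanes = ps_units * VEC_WIDTH, vs_units * VEC_WIDTH
    ratio = ps_lanes / vs_lanes
    if ratio >= 2:  # only the PS-heavy configurations considered above
        print(f"{ps_lanes}PS:{vs_lanes}VS "
              f"({ps_units} vec4 FP20 PS, {vs_units} vec4 FP32 VS) -> {ratio:.2f}:1")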

Probably. Everyone else is transitioning to higher ALU:TMU/ROP ratios too, right?

I wonder if this really is the naive 6x Tegra 3. ie 12 vec4 FP20 pixel shader pipelines and 6 vec4 FP32 vertex shaders, or perhaps they changed the PS:VS ratio again too?
G71 ended up with 24PS:8VS:24TMU:16ROP while R580 ended up with 48PS:8VS:16TMU:16ROP. I think the R580 ended up being TMU-bound, at least in the then-current games when it launched, so perhaps a more TMU-heavy ratio like the G71's would be good guidance.
 
Hm, well assuming these are still allocated as vec4, which is what I would expect, 2:1 and 5:1 are the only actual options. They'd still have 3x the vec4 vertex shaders that Tegra 3 has so it'd probably be okay.

EDIT: Okay, I don't know what's wrong with me, of course other ratios work >_>
 
Could be; however, when you still aim your design more or less at the "lowest common denominator" you typically wouldn't shoot for too-high geometry rates either. Kishonti upped the ante on geometry a bit in the initial 2.5 release of its GLBenchmark, and follow-up revisions trimmed geometry back down quite a bit, which helped all architectures gain some additional performance. I wouldn't suggest it's pure coincidence.

Even Mali400MP4 with its vec2 VS performs quite decently in it; when it reaches ULP GF/T3 frequencies (T3@520MHz, latest Mali400MP4@533MHz) it's even faster than Tegra 3 by a margin, despite the latter having, on paper, a vertex shader ALU twice as wide.

T604, on the other hand, despite being a USC, doesn't strike me as any geometry powerhouse either going by the GLBenchmark geometry results. Could be they seriously revamped the ALU in T624; otherwise I couldn't imagine how they'd promise 50% higher efficiency compared to T604 with the same number of units.

All in all, considering how relatively low geometry rates still are in mobile games, I wouldn't be in the least surprised if the Wayne GPU contains even "just" 2 Vec4 VS; Anand quoted that the majority of units are FP20, and that alone excludes in my mind any high VS to PS ALU ratio.
 
All in all, considering how relatively low geometry rates still are in mobile games, I wouldn't be in the least surprised if the Wayne GPU contains even "just" 2 Vec4 VS; Anand quoted that the majority of units are FP20, and that alone excludes in my mind any high VS to PS ALU ratio.

A majority technically only has to mean more than half, although I'm sure Anand was going for something more than that. Still, an 8:1 ratio is really extreme.
 
A majority technically only has to mean more than half, although I'm sure Anand was going for something more than that. Still, an 8:1 ratio is really extreme.

Of course it's extreme (whereby extreme is relative; others already mentioned G7x as a desktop paradigm), but it's just as extreme for the Mali400MP4. I'm not saying it will be the case, I just wouldn't be particularly surprised if it is. In the end, "if" it's the case, it would be 4x the ALU lane count of the Mali400MP4.
 
Of course it's extreme (whereby extreme is relative; others already mentioned G7x as a desktop paradigm), but it's just as extreme for the Mali400MP4. I'm not saying it will be the case, I just wouldn't be particularly surprised if it is. In the end, "if" it's the case, it would be 4x the ALU lane count of the Mali400MP4.

I don't really think Mali-400 is a great parallel; it's hard to properly balance a design where only PS count scales with cores and still have it make sense to configure both MP1 and MP4 solutions... 8:1 would be more extreme than any fixed design we've seen, AFAIK.
 
On the issue of FP20 PS and OpenGL ES 3.0 compatibility mandating FP32: could nVidia be using ganging or emulation to achieve FP32? OpenGL 4.0 mandates FP64 support, which I believe is often achieved through ganging FP32 ALUs or through emulation on lower-end GPUs, rather than dedicated FP64 ALUs or all ALUs supporting FP64. The vec4 FP20 PS operating as vec2 FP32 PS would give 1/2-speed OpenGL ES 3.0 support, which isn't glamorous, but is better than nothing. Perhaps nVidia is anticipating that by the time OES3.0 really takes off Tegra 5 will be available, so focusing Tegra 4 on OES2.0, with OES3.0 there more as a compatibility and feature checkbox than for performance, may make sense.
 
The vec4 FP20 PS operating as vec2 FP32 PS would give 1/2-speed OpenGL ES 3.0 support, which isn't glamorous, but is better than nothing.

Does anyone know what the actual format of nVidia's FP20 is? FP32 is 1.8.23, FP16 is 1.5.10, and I've seen FP24 that is 1.8.15... could nVidia be using 1.8.11? There's an extra implicit bit of precision in normalized numbers, and AFAIK OpenGL ES needn't support denorms. In this case it seems like you may indeed be able to do an FP32 FMUL with 4 FP20 FMAs, although I haven't really fully thought through the math.
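For what it's worth, here's a small sketch of the precision implied by those layouts; the 1.8.11 split for FP20 is purely the speculation above, not a documented NVIDIA format:

Code:
# Bit budgets for the formats being compared. The FP20 layout (1.8.11)
# is speculative; the others are the standard layouts mentioned above.

formats = {
    "FP16 (1.5.10)":  (5, 10),
    "FP20 (1.8.11?)": (8, 11),   # hypothesised layout
    "FP24 (1.8.15)":  (8, 15),
    "FP32 (1.8.23)":  (8, 23),
}

for name, (exp_bits, man_bits) in formats.items():
    total_bits = 1 + exp_bits + man_bits  # sign + exponent + mantissa
    # with the implicit leading 1, significand precision is man_bits + 1
    eps = 2.0 ** -man_bits
    print(f"{name}: {total_bits} bits, {man_bits + 1} significand bits, "
          f"machine epsilon ~= {eps:.2e}")

If the 1.8.11 guess is right, two FP20 significands (2x12 bits including the implicit one) cover an FP32 significand's 24 bits, which is what makes the 4x-FMA idea at least plausible on paper.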
 
The whole point of big.LITTLE is to extend the curve and get optimal power efficiency at lower performance levels (although since it's an inherently more power efficient architecture, the break-even point may be higher than minimum voltage for the A15). The "companion core" approach from NVIDIA helps in a similar way, although only when you don't need more than one thread - still, that's the bulk of workloads today, so it should already help a lot compared to a standard A15 implementation.
I understand the appeal of big/little, but isn't that a theoretical option at the moment? Is big/little available as a solution now or is actual silicon still a year out?
It'd be interesting to see how much power you could gain (*if* it's on the same process) by just synthesizing for a lower clock. Tegra 3 can't provide any insight into that.

Ideally what you'd want is big.LITTLE with both sets of cores active and visible to the OS (with the right kernel logic to make it work) combined with a single A7/LITTLE companion core that could be active at the same time for lower leakage/performance (i.e. High Vt with longer gate channel lengths and power-optimised synthesis) so that the main cores could be implemented with a higher performance/leakage process (for lower active power by undervolting). So I can imagine something like 4xA57+5xA53 being very interesting on 20nm...
What do you mean by 'with both sets active'? Do you mean both the big and the little cores of the same combo running their own independent code? So you're basically running 8 CPUs? That's crazy. :) Is this what ARM is currently promoting?
What's the performance difference between big and little anyway?

But you should care about image quality. Tegra 3 didn't support framebuffer compression, so to save bandwidth they only supported (or at least exposed?) a pitiful 16-bit depth buffer. That leads to quite a lot of depth precision issues...
I think this is a developer's headache more than anything else, not something most people will actively experience as a reduction in image quality.

But more importantly: again, as a user, I don't think it matters for the vast majority of games that are currently out there. Cut the Rope, Angry Birds, board games, etc. - they're all at the top of the sales charts. They have cute graphics, and these days they'd better be smooth, but 16-bit Z-fighting is not going to be a concern.

So I completely understand why they'd want to focus on CPU performance: that's where 99% of the users can still see a difference: the time to load an app, the time to load a web page. I haven't seen many people rave about the 2x faster GPU on the iPad 4. It's just not very noticeable.

In my mind this leads to a fundamental flaw of the architecture: it's an IMR but it's not fast enough at doing a Z-Only prepass (because of bandwidth, depth rate, and geometry performance) so you need (unrealistically?) good front-to-back ordering to get good performance on complex workloads. And even then I suspect they waste more time than they should on rejecting pixels for perfectly front-to-back ordered scenes that have high overdraw as they have no Hier-Z of any kind...
I understand the first-order bandwidth implications of an IMR, but it strikes me that the effects are less pronounced in the real world than what you'd expect them to be. Take Tegra 3: it has the same 32-bit wide MC as Tegra 2 and needs to feed double the CPUs and a more demanding GPU, yet the results are really much better than you'd expect them to be, with a pretty small die size to boot. Take the A5X (http://www.anandtech.com/show/5688/apple-ipad-2012-review/15, bottom of the page): it's a ridiculous 160mm2 vs 80mm2. And the GPU-only area ratio should be even more out of whack, yet for equivalent resolutions it's only 2.5x faster. The A5X has not only way more external BW, but the disparity is even higher taking into account the on-chip RAMs. I'm really not impressed by this 2.5x and I don't understand why it's so little.
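(Crude perf-per-area arithmetic with the figures quoted above - whole-die sizes, so only a rough proxy for the GPU comparison:)

Code:
# Rough perf-per-area check using the numbers quoted in this post.
# These are whole-die areas, not GPU-only areas, so treat it as a proxy.
a5x_mm2, tegra3_mm2 = 160.0, 80.0
gpu_speedup = 2.5  # A5X vs Tegra 3 at equivalent resolutions, per the post

area_ratio = a5x_mm2 / tegra3_mm2
print(f"die area ratio: {area_ratio:.1f}x, GPU performance ratio: {gpu_speedup:.1f}x")
print(f"performance per die mm^2: {gpu_speedup / area_ratio:.2f}x that of Tegra 3")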
 
At least PowerVR has been doing graphics nearly as long as nVidia,
If you just mean 3D graphics, then possibly longer. Wikipedia says NV was founded in 1993 but IMG demoed a PowerVR prototype at SIGGRAPH 1993. If you also include 2D graphics/multimedia, then IMG have been going far longer.
 