NVIDIA Tegra Architecture

Power consumption and heat tend to limit mobile graphics performance before die area does. Apple makes that die-area trade-off to come out on top in performance, so you really shouldn't expect the performance advantage of a better design over a lesser one to be proportionate to the die-area difference.

In a lot of cases it'll tend to track the difference in the respective architectures' power efficiency and power management more than anything else.
 
I understand the appeal of big/little, but isn't that a theoretical option at the moment? Is big/little available as a solution now or is actual silicon still a year out?
http://www.eetimes.com/electronics-...g-little--no-Haswell--Project-Denver-at-ISSCC

It'd be interesting to see how much power you could save (*if* it's on the same process) just by synthesizing for a lower clock. Tegra 3 can't provide any insight into that.
It's almost certainly also using a lower Vt rather than only different synthesis. The problem on 40LP is that even Low Vt wasn't that fast and not as power efficient as 40G High Vt, and there was also a big difference in nominal voltage (1.1V vs 0.9V). I think 28HPL has a much higher base performance (such that you'd presumably want a lot more logic on High Vt rather than normal Vt) and the nominal voltage difference isn't as big (1.0V vs 0.9V).
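To put rough numbers on how much the nominal voltage gap alone matters (ignoring leakage, the Vt mix and everything else, so a big simplification), dynamic power scales with CV^2f:

# Very rough sketch: dynamic power ~ alpha * C * V^2 * f. Leakage and
# different Vt mixes are ignored; only the nominal voltages quoted above go in.

def relative_dynamic_power(v, v_ref, f_ratio=1.0):
    """Dynamic power at voltage v relative to v_ref, at the same activity."""
    return (v / v_ref) ** 2 * f_ratio

# 40nm generation: 1.1V vs 0.9V nominal
print(round(relative_dynamic_power(0.9, 1.1), 2))   # ~0.67 -> roughly a third less
# 28nm generation: 1.0V vs 0.9V nominal -- a much smaller gap
print(round(relative_dynamic_power(0.9, 1.0), 2))   # ~0.81 -> under 20% less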

I really expected them to use 28HPM since it felt like the natural replacement for 28HPL, but I suppose they concluded they didn't need the 28HPM Ultra-High-Vt option and the higher performance wasn't worth the cost, or maybe it's just a timeframe issue since 28HPM is lagging behind quite a lot (and already was before it got delayed by TSMC).

What do you mean 'with both sets active'? Is this both the big and the little cores of the same combo running their own independent code? So you're basically running 8 CPUs? That's crazy. :) Is this what ARM is currently promoting?
Yes. In fact they promoted it from the start; see the "big.LITTLE MP Use Model" in this paper: http://www.arm.com/files/downloads/big.LITTLE_Final.pdf

And they're promoting an asymmetric number of cores as a real option for the Cortex-A5x generation: http://images.anandtech.com/doci/6420/Screen Shot 2012-10-30 at 12.22.25 PM.png (although it remains to be seen whether the software will be ready by then, specifically the Linux kernel).

What's the performance difference between big and little anyway?
I'd say ~2x in ILP and ~1.5x in clock speed for A15 vs A7, so ~3x at a given voltage. The power efficiency advantage at a given voltage is obviously massively higher than that...

I think this is a developer's headache more than anything else, not something most people will actively experience as a reduction in image quality.
I disagree. You can minimise the problem by setting your depth range properly (and games/engines designed on other hardware and not properly tested on Tegra might not bother), but even then there are always some scenes where 16-bit isn't enough and you *will* have artifacts. You could make an argument that many people might not notice that, but then they might not notice the higher graphics performance either...
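To put a rough number on it, here's a quick sketch (hypothetical near/far planes, standard perspective depth mapping) of the smallest eye-space depth difference each format can resolve towards the far plane:

# Rough Z-precision sketch for a standard perspective projection: how large an
# eye-space depth difference one depth-buffer step covers at distance z.
# The near/far values below are hypothetical, just to illustrate the point.

def depth_step(z, near, far, bits):
    """Smallest resolvable eye-space depth difference at eye depth z."""
    # Window depth d(z) = far*(z - near) / (z*(far - near)); one buffer step is
    # 1/(2^bits - 1), so invert the slope dd/dz = far*near / (z^2*(far - near)).
    return z * z * (far - near) / (far * near * (2 ** bits - 1))

# Sloppy depth range (near plane too close, far plane too far):
for bits in (16, 24):
    print(bits, "bit:", round(depth_step(900.0, 0.5, 1000.0, bits), 3))
# 16 bit: anything within ~25 units at z=900 will Z-fight; 24 bit: ~0.1 units

# Carefully tuned depth range -- 16 bit becomes usable, but only just:
print("16 bit, tight range:", round(depth_step(250.0, 2.0, 300.0, 16), 3))  # ~0.47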

Furthermore the power efficiency cost of an uncompressed depth buffer is significant due to higher bandwidth, and 24-bit with framebuffer compression would nearly certainly be more power efficient than 16-bit without compression. It's really just a way to save area, and I don't think it makes any sense whatsoever for this generation of hardware. So hopefully they at least bothered with this even if it's basically the same architecture...

But more important: again as a user, I don't think it matters for the vast majority of games that are currently out there. Cut the Rope, Angry Birds, board games etc. They are all at the top of the sales ladder. They have cute graphics and they better be smooth these days, but 16-bit Z fights are not going to be a concern.
Agreed, although you'd be surprised by how many 3D games are in the Top 10 of the iOS sales chart nowadays (Android is obviously still lagging behind there).

Anyway the problem here is that if you want high CPU performance, the SoCs usually have high GPU performance as well, which you clearly don't need outside of 3D games. There's an interesting question of what the "mainstream" of the market will move to a few years from now, given that 3D performance is only a differentiator for part of the market, and realistically a browser doesn't need a quad-core OoOE monster either.

Personally I can see something like 1xA57+4xA53 with a fairly cheap GPU being a very attractive solution in the 20nm timeframe if the industry is willing to take the risk.

I haven't seen many people rave about the 2x faster GPU on the iPad 4. It's just not very noticeable.
I'd argue the benefit to Apple isn't people raving about how fast it is, but rather preventing people/competitors raving about how fast other products are. If you told me 5 years ago that NVIDIA wouldn't even talk about relative GPU performance compared to an in-house Apple SoC because it's clearly lagging behind, I'd have laughed you out of the room.

I understand the first order bandwidth implications of an IMR, but it strikes me that the effects are less pronounced in the real world than what you'd expect them to be. Take Tegra 3: it has the same 32-bit wide MC as Tegra 2, needs to feed double the CPUs and a more demanding GPU, yet the results are really much better than you'd expect them to be, with a pretty small die size to boot.
I don't know if I'd say "same 32-bit wide MC" given that the memory clock speed increased significantly and the highest performance products like the Prime Infinity use DDR3-1600 vs LPDDR2-667 for Tegra 2, but yeah, Tegra 3's performance is still surprisingly good with LPDDR2-800.
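For reference, peak bandwidth is just bus width times transfer rate, so on the same 32-bit interface the jump looks like this:

# Peak theoretical bandwidth = bus width (bytes) x effective transfer rate.
# The memory types are the ones mentioned above.

def peak_bandwidth_gbs(bus_bits, mtransfers_per_s):
    return bus_bits / 8 * mtransfers_per_s / 1000.0   # GB/s

for name, rate in [("LPDDR2-667 (Tegra 2)", 667),
                   ("LPDDR2-800 (typical Tegra 3)", 800),
                   ("DDR3-1600 (Prime Infinity)", 1600)]:
    print(f"{name}: {peak_bandwidth_gbs(32, rate):.2f} GB/s")
# -> roughly 2.7, 3.2 and 6.4 GB/s on the same 32-bit interface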


Take the A5X (http://www.anandtech.com/show/5688/apple-ipad-2012-review/15, bottom of page): it's a ridiculous 160mm2 vs 80mm2. And the GPU-only area ratio should be even more out of whack, yet at equivalent resolutions it's only 2.5x faster. Not only does the A5X have way more external BW, the disparity is even higher once you take the on-chip RAMs into account. I'm really not impressed by this 2.5x and I don't understand why it's so low.
Part of the reason has to be that the clock speed is much lower, very likely both because Tegra's RTL is targeted at surprisingly high clock speeds for a handheld product (remember the GPU is on the LP voltage rail, not the G one) and because Apple focused the synthesis/voltage choice much more on power consumption than NVIDIA. So the question is whether the A5X is indeed more power efficient and I think the anecdotal evidence is that it is, so if you targeted the same power efficiency on the same process (not 45 vs 40) then it might be quite a bit faster.

Even then Tegra might still be more area efficient than SGX. All I'm going to say on the matter is this based strictly on public information: SGX supports a branch granularity of 1 (true MIMD scheduling although with Vec4 ALUs for peak performance) with FP32 ALUs while Tegra doesn't support control flow at all in the pixel shader (predication only) and uses FP24 ALUs. Unfortunately benchmarks (and many workloads) do not benefit from MIMD/FP32 and it should be obvious that you can achieve much better area efficiency in the shader core with NVIDIA's trade-offs...
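If you want to put very rough numbers on the area side of that: only the die sizes and the ~2.5x figure come from the posts above, and the GPU-area fractions below are pure guesses just to show how sensitive the conclusion is to them.

# Crude perf-per-GPU-area sketch. Die sizes (160 vs 80 mm^2) and the ~2.5x
# performance ratio are from the discussion above; the fraction of each die
# spent on GPU + related SRAM is a pure guess.

a5x_die, tegra3_die = 160.0, 80.0
perf_ratio = 2.5                     # A5X vs Tegra 3 at equivalent resolution

for a5x_frac, t3_frac in [(0.5, 0.25), (0.4, 0.25), (0.35, 0.3)]:
    gpu_area_ratio = (a5x_die * a5x_frac) / (tegra3_die * t3_frac)
    print(f"guess {a5x_frac:.2f}/{t3_frac:.2f}: "
          f"A5X has {gpu_area_ratio:.1f}x the GPU area "
          f"for {perf_ratio / gpu_area_ratio:.2f}x the perf/mm^2")
# With guesses like these the A5X lands anywhere from well below to roughly on
# par with Tegra 3 in perf/mm^2, even though it wins comfortably in absolute terms.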
 
Interesting info on Tegra 4 provided by Nvidia marketing (German news page) as a reaction to leaked benchmark results:
http://www.heise.de/newsticker/meldung/3D-Leistung-Tegra-4-unterliegt-der-GPU-im-iPad-4-1780135.html

Translation (some of it is not translated properly!):
http://translate.google.de/translat...a-4-unterliegt-der-GPU-im-iPad-4-1780135.html


Power consumption around 6W under heavy load (this does not seem to be the max. TDP!)
48 PS, 24 VS
Nvidia says they still don't use unified shaders because of complexity / power consumption.
They did not confirm how many TMUs, but they did say 4 pixels output per clock.
They also say caches have been increased.
 
Nvidia says they still don't use unified shaders because of complexity / power consumption.

Isn't that marketing speak for:

a) Our engineers have no clue how to make these power efficient at the architectural level?
b) We'd rather spend our graphics IP engineers on desktop and CUDA GPUs.
c) We've never heard of doing stuff like voltage islands, threshold manipulation, clock scaling, clock gating or selectively placing HVT cells, etc.

If IMG can do it, why can't NV do it?
 
Interesting info on Tegra 4 provided by Nvidia marketing (German news page) as a reaction to leaked benchmark results:
http://www.heise.de/newsticker/meldung/3D-Leistung-Tegra-4-unterliegt-der-GPU-im-iPad-4-1780135.html

Translation (some of it is not translated properly!):
http://translate.google.de/translat...a-4-unterliegt-der-GPU-im-iPad-4-1780135.html


Power consumption around 6W under heavy load (this does not seem to be the max. TDP!)
48 PS, 24 VS
Nvidia says they still don't use unified shaders because of complexity / power consumption.
They did not confirm how many TMUs, but they did say 4 pixels output per clock.
They also say caches have been increased.
So they did just do a straight 6x shader increase with the same 2:1 PS:VS ratio. I wonder if some of those vertex shaders will end up consistently underutilized/wasted?
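Quick sanity check on the 6x, assuming the commonly quoted Tegra 3 configuration of 8 pixel + 4 vertex cores:

# Sanity check of the "straight 6x, same 2:1 ratio" observation. The Tegra 3
# counts (8 PS + 4 VS) are the commonly quoted ULP GeForce configuration.
tegra3_ps, tegra3_vs = 8, 4
tegra4_ps, tegra4_vs = 48, 24        # figures from the heise.de article above

print((tegra4_ps + tegra4_vs) / (tegra3_ps + tegra3_vs))   # 6.0 -> 6x total cores
print(tegra4_ps / tegra4_vs, tegra3_ps / tegra3_vs)        # 2.0 2.0 -> same 2:1 ratio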

Can any native German speaker translate the statement on OpenCL and OpenGL ES 3.0 support? Google Translate seems to fall apart on that important sentence.
 
If IMG can do it, why can't NV do it?

How about Qualcomm Adreno, ARM Mali T6xx and Vivante GPU IP? I wonder if there are any graphics IHVs for the SFF market left that don't have USC ALUs in their GPUs apart from NVIDIA.
 
I'd say ~2x in ILP and ~1.5x in clock speed for A15 vs A7, so ~3x at a given voltage. The power efficiency advantage at a given voltage is obviously massively higher than that...

Although it'll vary a lot depending on load, I don't think Cortex-A15 will deliver such a big increase in IPC over A7 on average. ARM said they expected 50% better IPC in integer code than Cortex-A9, but in practice I'm not seeing numbers anywhere close to this very often, at least not with Exynos 5. And I think the expectation that A7 will be roughly A8 level and often better is credible - although it can't pair as many instruction combinations, it does have a shorter branch mispredict penalty, claims very low latency caches (we'll see) and AFAIK benefits from automatic prefetch and what have you.

So I'd give a number like ~1.8x in the best case but maybe as low as 1.4-1.5x. Very hand-waved, of course. Maybe it will change a little with compilers.
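Plugging both sets of hand-waved numbers into the same speedup formula:

# Overall A15-vs-A7 speedup at a given voltage ~= IPC ratio x clock ratio,
# using the clock ratio and the IPC guesses from the two posts above.

clock_ratio = 1.5

for label, ipc_ratio in [("earlier estimate (~2x IPC)", 2.0),
                         ("best case (~1.8x IPC)", 1.8),
                         ("low end (~1.4x IPC)", 1.4)]:
    print(f"{label}: ~{ipc_ratio * clock_ratio:.1f}x overall")
# -> ~3.0x vs roughly 2.1-2.7x, depending on which IPC figure you believe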

ltcommander.data said:
So they did just do a straight 6x shader increase with the same 2:1 PS:VS ratio. I wonder if some of those vertex shaders will end up consistently underutilized/wasted?

Gee, I wonder what else they didn't change. I wonder if the 2:1 ratio played a larger role in the design organization.

Would be interesting if these shaders were partially unified but heterogeneous. Like, the 24 shaders could run VS or PS, while the 48 shaders can only run PS. That'd mean the former need access to the TMUs though (is this an OpenGL ES 3 requirement?). Of course, if that were the case I'm sure they would have said so instead of just making excuses that unified shaders are less power efficient and harder to do.

At least they revealed it has 32-bit color output capability now (right?) so probably also has 32-bit depth/stencil..
 
Would be interesting if these shaders were partially unified but heterogeneous. Like, the 24 shaders could run VS or PS, while the 48 shaders can only run PS. That'd mean the former need access to the TMUs though (is this an OpenGL ES 3 requirement?). Of course, if that were the case I'm sure they would have said so instead of just making excuses that unified shaders are less power efficient and harder to do.

At least they revealed it has 32-bit color output capability now (right?) so probably also has 32-bit depth/stencil..

I would consider the odds of a wonky partially unified solution to be quite low. That being said, and given the Tegra GPU's heritage (G7x something), I'd expect the VS to be able to interact with the samplers in order to implement Vertex Texture Fetch (do we know if older Tegras didn't support this? it'd go against what G7x can do). IIRC though, there were some strict bounds on just how you could sample - possibly no filtering? It's been a while. Sidenote: amusing how in the handheld / embedded world most of what is old is new again :D
 
Couldn't NVIDIA use their Kepler core in Tegra? How difficult could it really have been? They already have designs at TSMC for both Kepler and Tegra, and Kepler also looks more efficient, so why would they stick with the older design?
 
Couldn't NVIDIA use their Kepler core in Tegra? How difficult could it really have been? They already have designs at TSMC for both Kepler and Tegra, and Kepler also looks more efficient, so why would they stick with the older design?

These chips are on the drawing board more than a year before they get into our tablets or smartphones. Kepler has been on the market for less than a year.

They will probably unify their shader architecture between desktops, servers and handhelds, but I don't think many people expected it to happen with Tegra 4.

Sure, everyone expected them to come out with a unified architecture because that's what's been used pretty much everywhere by every modern GPU designer for the last 5 years (AFAIK the 3DS is the only other exception, but the "new" Nintendo really digs using ultra-old and ultra-weak stuff in their consoles). But that doesn't mean it had to be Kepler shader units per se.
 
Can any native German speaker translate the statement on OpenCL and OpenGL ES 3.0 support? Google Translate seems to fall apart on that important sentence.

That just says that the SoC will not be able to run OpenCL and GLES 3 due to non-unified shaders (or other limitations).
The next sentence is that NVidia is way behind the competition. :rolleyes:

I just saw a bit more about that here:
http://www.brightsideofnews.com/new...cs-disappoint2c-nvidia-in-defensive-mode.aspx

Quote NVidia:
"Today's mobile apps do not take advantage of OCL (OpenCL), CUDA or Advanced OGL (OpenGL), nor are these APIs exposed in any OS. Tegra 4's GPU is very powerful and dedicates its resources toward improving real end user experiences."
To put this answer in perspective, Nvidia - a company almost always known for innovation in the desktop and mobile computing space - does not consider APIs such as OpenCL and its own CUDA important for ultra-efficient computing. This attitude has already resulted in a substantial design win turning sour, as the company was thrown out of the BMW Group a year and a few quarters after it triumphantly pushed Intel out of BMW's structure.

While the company is embedded with the Volkswagen AG group and will probably end up shipping Tegras in each of the VAG brands (Audi, Bentley, Bugatti, Ducati, Seat, Skoda, Volkswagen), it was Freescale that won the new BMW deal for next-gen hardware because of one small thing - Vivante's GPU feature set.
 
I heard from one developer that a single SGX Vec4 SIMD unit issues 4 FP16 MADs per cycle and only 2 FP32 MADs; does anybody know if this is true or not? And does it mean that the 76 theoretical GFLOPS of the SGX554MP4 in the iPad 4 are equal to 38 FP32 GFLOPS?
 
I heard from one developer that a single SGX Vec4 SIMD unit issues 4 FP16 MADs per cycle and only 2 FP32 MADs; does anybody know if this is true or not? And does it mean that the 76 theoretical GFLOPS of the SGX554MP4 in the iPad 4 are equal to 38 FP32 GFLOPS?

SGX uses a conventional sort of SIMD so a single instruction can express 4x8-bit, 3x10-bit, 4x10-bit (over a limited number of 40-bit registers), 2x16-bit, or 1x32-bit operations. So normally this would be correct. However, according to a post by John H (sorry, don't have a link) there are instructions with a shared 32-bit operand that can perform two FMADDs in one cycle. Although he wasn't any more specific about what this meant my guess is that it's some form of (a * b) + (a * c) + d. This instruction would be used in the kernel of matrix * vector multiplication ie linear transformations, which is the fundamental building block for geometry transformation & lighting.

Series 5XT could dual-issue the above instructions so you get 4xFP16 or limited 4xFP32.
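So for the original question, the arithmetic works out roughly like this (taking the quoted 76 GFLOPS figure at face value):

# Back-of-the-envelope answer to the 76 vs 38 GFLOPS question above, taking
# the quoted 4x FP16 / 2x unrestricted FP32 per-cycle split at face value.

peak_fp16_gflops = 76.0                       # quoted SGX554MP4 figure

# Unrestricted FP32 runs at half the FP16 rate:
print("unrestricted FP32:", peak_fp16_gflops / 2, "GFLOPS")          # 38.0
# The shared-operand FMADD path (limited 4x FP32 with 5XT dual-issue, per the
# John H comment above) brings mat*vec style code back up to the full rate:
print("limited FP32 (shared-operand):", peak_fp16_gflops, "GFLOPS")  # 76.0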
 
This instruction would be used in the kernel of matrix * vector multiplication ie linear transformations, which is the fundamental building block for geometry transformation & lighting.

Series 5XT could dual-issue the above instructions so you get 4xFP16 or limited 4xFP32.
And what about GPU skinning? It seems like Tegra's 24 FP32 non-unified vertex FPUs could be equal to the whole 554MP4 in FP32 arithmetic if Tegra is clocked high enough.
 
And what about GPU skinning? It seems like Tegra's 24 FP32 non-unified vertex FPUs could be equal to the whole 554MP4 in FP32 arithmetic if Tegra is clocked high enough.

Skinning is done with linear transformations too, isn't it? Like with matrix palettes?

Per-clock SGX554MP4 can do 128 limited FP32 FMADDs and 64 unrestricted ones. So you'd need 2.5x or 5x the clock and I doubt even the lower end of that (750MHz?) is happening. But are high end mobile games really pushing 1/3rd the GPU load on vertex shading?

If you want to compare the two you should give SGX the advantage from the other side too, where it has double the 8/10-bit ALU throughput per clock. That is normally enough for a lot of color component stuff, although maybe that's already changing in mobile too?
 
If you want to compare the two you should give SGX the advantage from the other side too, where it has double the 8/10-bit ALU throughput per clock. That is normally enough for a lot of color component stuff, although maybe that's already changing in mobile too?
As far as I remember the Tegra GPU has 2x FP10/int10 throughput, but I suppose it's useful mostly for programmable blending rather than shading.
 
As far as I remember the Tegra GPU has 2x FP10/int10 throughput, but I suppose it's useful mostly for programmable blending rather than shading.

I have never read anything about that, and I read that pretty detailed optimization manual nVidia released.

The whole thing might be moot, since according to an earlier post by Arun in this thread (http://forum.beyond3d.com/showpost.php?p=1659557&postcount=271) it looks like the 8/10-bit operations may not dual-issue in Series 5XT.
 