NVIDIA Tegra Architecture

Alexko said:
Which is a bit surprising. Emphasizing perf/mm² at the expense of perf/W would make sense for a chip aimed at cheap tablets like the Nexus 7
I didn't see anything to indicate it was at the expense of perf/watt?

How long will Tegra 4(i)-based phones last while gaming compared to the competition at similar performance levels?
I would wager 4i will do very well indeed, 4 not so much...
 
nVidia is emphasizing how great their perf/mm² is, when it's obvious Apple (to name one) is heavily investing in perf/W at the expense of perf/mm². I mean, A6X could probably have had double the GPU clock and half the cores, just to name the most obvious option, but I'm sure the emphasis goes a lot deeper than that and partially has to do with how IMG designed the cores. nVidia isn't even trying to make a perf/W comparison here, but I expect it's going to be substantially worse.
The perf/mm² comparison also doesn't take into account that the SoCs are on different processes: 32nm for Samsung and Apple, 28nm for nVidia and Qualcomm. Of course, the difference won't be huge, and it's not like transistor counts are widely available, so there's not much they can do there.

Tile-based renderers have on-chip depth and color buffers (whereas IMRs farm those out to VRAM), as well as additional logic to manage the tiling, so perhaps that places TBRs at an area disadvantage?
 
The perf/mm² comparison also doesn't take into account that the SoCs are on different processes: 32nm for Samsung and Apple, 28nm for nVidia and Qualcomm. Of course, the difference won't be huge, and it's not like transistor counts are widely available, so there's not much they can do there.

They said it's "normalized" to 28nm but I'd bet my hat that TSMC has superior density on their 28nm node vs Samsung's.

Tile-based renderers have on-chip depth and color buffers (whereas IMRs farm those out to VRAM), as well as additional logic to manage the tiling, so perhaps that places TBRs at an area disadvantage?

Tegra has a color cache, but I don't know which one is bigger. You do get more usable memory for the same area with a TBDR since you don't need the tags or cache controller. The buffer sizes are probably configurable to an extent and we don't know what Apple's selecting (we do know that the buffers are pretty large for Adreno). You have a good point on the tiling engine, and there are probably some area disadvantages for unified shaders, especially when compared against fragment shaders with cut down precision.
 
Alexko said:
I meant that in the sense that it has a small number of units running at high clocks relative to PowerVR GPUs, especially in Apple SoCs.
But that isn't necessarily bad for perf/watt...
 
But that isn't necessarily bad for perf/watt...

You don't seriously think that nVidia has this massive perf/area advantage at zero perf/W disadvantage, do you? If they did, I'm confident they would have added that to the slides. You can't run this thing at over twice the clock speed and get the same power efficiency; it doesn't work that way.
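To put rough numbers on that: dynamic power scales roughly with C·V²·f, and reaching a higher clock usually means raising the voltage too. A purely illustrative sketch follows; the 20% voltage bump is a made-up figure, not a Tegra 4 number:

```python
# Illustrative only: dynamic power ~ C * V^2 * f.
# Doubling the clock at the same voltage already doubles power; any voltage
# bump needed to reach that clock compounds quadratically on top of it.
f_scale = 2.0          # hypothetical 2x clock
v_scale = 1.2          # hypothetical 20% voltage increase (made-up number)

power_scale = f_scale * v_scale ** 2   # ~2.88x the power
perf_per_watt = f_scale / power_scale  # ~0.69x, i.e. noticeably worse perf/W

print(f"power: {power_scale:.2f}x, perf/W: {perf_per_watt:.2f}x")
```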
 
Which is a bit surprising. Emphasizing perf/mm² at the expense of perf/W would make sense for a chip aimed at cheap tablets like the Nexus 7 (tight budget, big battery) but for a phone SoC?

Surprising for marketing slides from any IHV? :LOL:

How long will Tegra 4(i)-based phones last while gaming compared to the competition at similar performance levels?

Damn there's no perf/mW or perf/W slide there....:oops: ;)

Anyway, if the ULP GF in T4/4i is mostly ALU bound, the T4i might yield around 5200 frames in GLB2.5 under a best-case scenario. Not bad at all.
 
Surprising for marketing slides from any IHV? :LOL:

Damn there's no perf/mW or perf/W slide there....:oops: ;)

Anyway, if the ULP GF in T4/4i is mostly ALU bound, the T4i might yield around 5200 frames in GLB2.5 under a best-case scenario. Not bad at all.

Oh, the slides certainly don't surprise me, but the technical choices do, at least a little.

That said, NVIDIA's architecture isn't unified, and its pixel shaders are only 20-bit vs. 32-bit for the competition. Since multiplier area grows quadratically with precision, to a first approximation NVIDIA enjoys 1 - (20/32)² ≈ 61% silicon savings on said multipliers.

I'm not sure how much this amounts to for a complete GPU, but in any case it should help with power.
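As a quick back-of-the-envelope check on that figure, here is the same "area grows with the square of the width" approximation worked out; the mantissa-only comparison is my own extra, using the s.6.13 fp20 layout from NVIDIA's docs quoted further down:

```python
# First-order check of the multiplier-area argument: assume multiplier area
# grows with the square of the operand width.
def area_savings(narrow_bits, wide_bits):
    return 1 - (narrow_bits / wide_bits) ** 2

# Using the full format widths, as in the post above:
print(f"20-bit vs 32-bit formats: {area_savings(20, 32):.0%} smaller")    # ~61%

# If only the mantissas feed the multiplier (fp20 is s.6.13, fp32 has a
# 23-bit mantissa), the saving on the multiplier itself would be larger still:
print(f"13-bit vs 23-bit mantissas: {area_savings(13, 23):.0%} smaller")  # ~68%
```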
 
Oh, the slides certainly don't surprise me, but the technical choices do, at least a little.

That said, NVIDIA's architecture isn't unified, and its pixel shaders are only 20-bit vs. 32-bit for the competition. Since multiplier area grows quadratically with precision, to a first approximation NVIDIA enjoys 1 - (20/32)² ≈ 61% silicon savings on said multipliers.

I'm not sure how much this amounts to for a complete GPU, but in any case it should help with power.

24-bit precision, isn't it?
 
FP20 for the pixel shaders, FP32 for the vertex shaders.

AnandTech (http://www.anandtech.com/show/6666/the-tegra-4-gpu-nvidia-claims-better-performance-than-ipad-4) said:
Tegra 4 features six Vec4 vertex units (FP32, 24 cores) and four 3-deep Vec4 pixel units (FP20, 48 cores).
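(Those "core" counts are just the lanes multiplied out: 6 vertex units × 4 lanes = 24, and 4 pixel units × 3 deep × 4 lanes = 48, for 72 "cores" total.)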
 
Exophase said:
You don't seriously think that nVidia has this massive perf/area advantage at zero perf/W disadvantage, do you?
I am saying I haven't seen any actual evidence that this is the case. However, Nvidia cut corners in other areas which directly aid perf/area.

And I am well aware of how clock speed, voltage, and power consumption are related. But when it comes to perf/watt across different architectures, matters are significantly more complex. If you have actual evidence that T4i performs significantly worse than its competition in perf/watt, by all means present it...
 
Anand made a mistake? Anyway, Tegra 4 introduces some nice features at least, including multiple render targets.

Which is most likely as decorative as it is on other hardware from a performance perspective. For the record, MRTs weren't absent from prior ULP GeForces either.
 
Tegras 3 and 4 are s16e7 (24-bit) for pixel shading.

I've seen this discrepancy for a while now... nVidia documents a 20-bit format:

"The Tegra fragment unit supports two levels of fragment variable precision: fp20 (an s.6.13 floating-point format) and fx10 (two’s complement s.1.8 format)"

http://docs.nvidia.com/tegra/data/Optimize_OpenGL_ES_2_0_Performance_for_Tegra.html

Given that it goes on to say that it can store twice as many FX10-format temporaries, varyings, and uniforms, maybe FP20 is merely the storage limit of these variables and the output of the interpolator (it does explicitly say that it can interpolate 4 fp20 or 8 fx10 values per cycle). I'm not sure what the distinction is here between temporaries and registers, but maybe the registers, or only some of them, are 24-bit, and the ALU precision is as well. Since we're just looking at more significant bits, it'd be trivial to convert between the two.
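For what it's worth, here's a little sketch of what that s.6.13 layout implies, just truncating an fp32 value down to the 13 explicit mantissa bits fp20 would keep. The 6-bit exponent bias of 31 is my assumption (the doc doesn't spell it out), and denormals/NaN are ignored:

```python
import struct

def fp20_roundtrip(x):
    """Push a float through an s.6.13 layout (1 sign, 6 exponent, 13 mantissa
    bits) and back, to see the precision loss. Exponent bias of 31 is assumed;
    denormals, infinities and NaN are ignored for simplicity."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exp  = (bits >> 23) & 0xFF           # fp32 exponent, bias 127
    man  = (bits >> 10) & 0x1FFF         # keep the 13 most significant mantissa bits
    e20  = exp - 127 + 31                # re-bias for a 6-bit field (assumed bias 31)
    if e20 <= 0:                         # underflow: flush to zero
        return 0.0
    if e20 >= 63:                        # overflow: clamp
        return float('-inf') if sign else float('inf')
    back = (sign << 31) | (exp << 23) | (man << 10)   # low 10 mantissa bits dropped
    return struct.unpack('>f', struct.pack('>I', back))[0]

for v in (1.0, 3.14159265, 1.0 / 3.0):
    print(f"{v!r} -> {fp20_roundtrip(v)!r}")
```

With 13 explicit mantissa bits that's roughly four decimal digits of precision, which is also where the gap to a 24-bit format would show up.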
 
In that case I would love to know how it runs Windows 8 RT, since that mandates ps_2_0, which in turn requires 24-bit precision.
 