NVIDIA Tegra Architecture

Alexko said:
Which is a bit surprising. Emphasizing perf/mm² at the expense of perf/W would make sense for a chip aimed at cheap tablets like the Nexus 7
I didn't see anything to indicate it was at the expense of perf/watt?

How long will Tegra 4(i)-based phones last while gaming compared to the competition at similar performance levels?
I would wager 4i will do very well indeed, 4 not so much...
 
nVidia is emphasizing how great their perf/mm² is, when it's obvious Apple (to name one) is heavily investing in perf/W at the expense of perf/mm². I mean, A6X could probably have had double the GPU clock and half the cores, just to name the most obvious option, but I'm sure the emphasis goes a lot deeper than that and partially has to do with how IMG designed the cores. nVidia isn't even trying to make a perf/W comparison here, but I expect it's going to be substantially worse.
The perf/mm² comparison also doesn't take into account that the SoCs are on different processes: 32nm for Samsung and Apple, 28nm for nVidia and Qualcomm. Of course, the difference won't be huge, and it's not like transistor counts are widely available, so there's not much they can do there.

Tile-based renderers have on-chip depth and color buffers (whereas IMRs farm those out to VRAM), as well as additional logic to manage the tiling, so perhaps that places TBRs at an area disadvantage?
 
The perf/mm² comparison also doesn't take into account that the SoCs are on different processes: 32nm for Samsung and Apple, 28nm for nVidia and Qualcomm. Of course, the difference won't be huge, and it's not like transistor counts are widely available, so there's not much they can do there.

They said it's "normalized" to 28nm but I'd bet my hat that TSMC has superior density on their 28nm node vs Samsung's.

Tile-based renderers have on-chip depth and color buffers (whereas IMRs farm those out to VRAM), as well as additional logic to manage the tiling, so perhaps that places TBRs at an area disadvantage?

Tegra has a color cache, but I don't know which one is bigger. You do get more usable memory for the same area with a TBDR since you don't need the tags or cache controller. The buffer sizes are probably configurable to an extent and we don't know what Apple's selecting (we do know that the buffers are pretty large for Adreno). You have a good point on the tiling engine, and there are probably some area disadvantages for unified shaders, especially when compared against fragment shaders with cut down precision.
 
Alexko said:
I meant that in the sense that it has a small number of units running at high clocks relative to PowerVR GPUs, especially in Apple SoCs.
But that isn't necessarily bad for perf/watt...
 
But that isn't necessarily bad for perf/watt...

You don't seriously think that nVidia has this massive perf/area advantage at zero perf/W disadvantage, do you? If they did, I'm confident they would have added that to the slides. You can't run this thing at over twice the clock speed and get the same power efficiency; it doesn't work that way.
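To put rough numbers on that: dynamic power scales roughly with C·V²·f, and reaching a higher clock usually means raising the voltage too. A purely illustrative sketch follows; the 20% voltage bump is a made-up figure, not a Tegra 4 number:

```python
# Illustrative only: dynamic power ~ C * V^2 * f.
# Doubling the clock at the same voltage already doubles power; any voltage
# bump needed to reach that clock compounds quadratically on top of it.
f_scale = 2.0          # hypothetical 2x clock
v_scale = 1.2          # hypothetical 20% voltage increase (made-up number)

power_scale = f_scale * v_scale ** 2   # ~2.88x the power
perf_per_watt = f_scale / power_scale  # ~0.69x, i.e. noticeably worse perf/W

print(f"power: {power_scale:.2f}x, perf/W: {perf_per_watt:.2f}x")
```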
 
Which is a bit surprising. Emphasizing perf/mm² at the expense of perf/W would make sense for a chip aimed at cheap tablets like the Nexus 7 (tight budget, big battery) but for a phone SoC?

Surprising for marketing slides from any IHV? :LOL:

How long will Tegra 4(i)-based phones last while gaming compared to the competition at similar performance levels?

Damn there's no perf/mW or perf/W slide there....:oops: ;)

Anyway, if the ULP GF in T4/4i is mostly ALU bound, the T4i might yield around 5200 frames in GLB2.5 under a best-case scenario. Not bad at all.
 
Surprising for marketing slides from any IHV? :LOL:

Damn there's no perf/mW or perf/W slide there....:oops: ;)

Anyway, if the ULP GF in T4/4i is mostly ALU bound, the T4i might yield around 5200 frames in GLB2.5 under a best-case scenario. Not bad at all.

Oh, the slides certainly don't surprise me, but the technical choices do, at least a little.

That said, NVIDIA's architecture isn't unified, and its pixel shaders are only 20-bit vs. 32-bit for the competition. Since multiplier area grows quadratically with precision, to a first approximation NVIDIA enjoys 1 - (20/32)² ≈ 61% silicon savings on said multipliers.

I'm not sure how much this amounts to for a complete GPU, but in any case it should help with power.
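As a quick back-of-the-envelope check on that figure, here is the same "area grows with the square of the width" approximation worked out; the mantissa-only comparison is my own extra, using the s.6.13 fp20 layout from NVIDIA's docs quoted further down:

```python
# First-order check of the multiplier-area argument: assume multiplier area
# grows with the square of the operand width.
def area_savings(narrow_bits, wide_bits):
    return 1 - (narrow_bits / wide_bits) ** 2

# Using the full format widths, as in the post above:
print(f"20-bit vs 32-bit formats: {area_savings(20, 32):.0%} smaller")    # ~61%

# If only the mantissas feed the multiplier (fp20 is s.6.13, fp32 has a
# 23-bit mantissa), the saving on the multiplier itself would be larger still:
print(f"13-bit vs 23-bit mantissas: {area_savings(13, 23):.0%} smaller")  # ~68%
```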
 
Oh, the slides certainly don't surprise me, but the technical choices do, at least a little.

That said, NVIDIA's architecture isn't unified, and its pixel shaders are only 20-bit vs. 32-bit for the competition. Since multiplier area grows quadratically with precision, to a first approximation NVIDIA enjoys 1 - (20/32)² ≈ 61% silicon savings on said multipliers.

I'm not sure how much this amounts to for a complete GPU, but in any case it should help with power.

24-bit precision, isn't it?
 
FP20 for the pixel shaders, FP32 for the vertex shaders.

AnandTech (http://www.anandtech.com/show/6666/the-tegra-4-gpu-nvidia-claims-better-performance-than-ipad-4) said:
Tegra 4 features six Vec4 vertex units (FP32, 24 cores) and four 3-deep Vec4 pixel units (FP20, 48 cores).
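(Those "core" counts are just the lanes multiplied out: 6 vertex units × 4 lanes = 24, and 4 pixel units × 3 deep × 4 lanes = 48, for 72 "cores" total.)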
 
Exophase said:
You don't seriously think that nVidia has this massive perf/area advantage at zero perf/W disadvantage, do you?
I am saying I haven't seen any actual evidence that this is the case. However, Nvidia cut corners in other areas which directly aid perf/area.

And I am well aware of how clock speed, voltage, and power consumption are related. But when it comes to perf/watt across different architectures, matters are significantly more complex. If you have actual evidence that T4i performs significantly worse than its competition in perf/watt, by all means present it...
 
Anand made a mistake? Anyway, Tegra 4 introduces some nice features at least, including multiple render targets.

Which is most likely as decorative as it is on other hardware from a performance perspective. For the record, MRTs weren't absent from prior ULP GeForces either.
 
Tegras 3 and 4 are s16e7 (24-bit) for pixel shading.

I've seen this discrepancy for a while now... nVidia documents a 20-bit format:

"The Tegra fragment unit supports two levels of fragment variable precision: fp20 (an s.6.13 floating-point format) and fx10 (two’s complement s.1.8 format)"

http://docs.nvidia.com/tegra/data/Optimize_OpenGL_ES_2_0_Performance_for_Tegra.html

Given that it goes on to say that it can store twice as many FX10-format temporaries, varyings, and uniforms, maybe FP20 is merely the storage limit of these variables and the output of the interpolator (it does explicitly say that it can interpolate 4 fp20 or 8 fx10 values per cycle). I'm not sure what the distinction is here between temporaries and registers, but maybe the registers, or only some of them, are 24-bit, and the ALU precision is as well. Since we're just looking at more significant bits, it'd be trivial to convert between the two.
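For what it's worth, here's a little sketch of what that s.6.13 layout implies, just truncating an fp32 value down to the 13 explicit mantissa bits fp20 would keep. The 6-bit exponent bias of 31 is my assumption (the doc doesn't spell it out), and denormals/NaN are ignored:

```python
import struct

def fp20_roundtrip(x):
    """Push a float through an s.6.13 layout (1 sign, 6 exponent, 13 mantissa
    bits) and back, to see the precision loss. Exponent bias of 31 is assumed;
    denormals, infinities and NaN are ignored for simplicity."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exp  = (bits >> 23) & 0xFF           # fp32 exponent, bias 127
    man  = (bits >> 10) & 0x1FFF         # keep the 13 most significant mantissa bits
    e20  = exp - 127 + 31                # re-bias for a 6-bit field (assumed bias 31)
    if e20 <= 0:                         # underflow: flush to zero
        return 0.0
    if e20 >= 63:                        # overflow: clamp
        return float('-inf') if sign else float('inf')
    back = (sign << 31) | (exp << 23) | (man << 10)   # low 10 mantissa bits dropped
    return struct.unpack('>f', struct.pack('>I', back))[0]

for v in (1.0, 3.14159265, 1.0 / 3.0):
    print(f"{v!r} -> {fp20_roundtrip(v)!r}")
```

With 13 explicit mantissa bits that's roughly four decimal digits of precision, which is also where the gap to a 24-bit format would show up.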
 
In that case I would love to know how it runs Windows 8 RT, since that mandates ps_2_0, which in turn requires 24-bit precision.
 