NVIDIA Tegra Architecture

The problem is that if it's 6x Tegra 3, then it can't be 20x Tegra 2, because Tegra 3 isn't 3.33x faster than Tegra 2.

Why not? Granted, this is not an official announcement, so it should be taken with a grain of salt, but we don't know how NVIDIA is measuring the performance to come up with those results.

Just as an example: Tegra 3 T33 with a 520 MHz GPU is 3.33x faster than Tegra 2 AP20H with a 300 MHz GPU in Egypt 2.1 (40 fps vs. 12.8 fps).

Likewise, their 6x-faster Tegra 4 could be the chip clocked at 2 GHz while the 1.8 GHz version won't be 6x, and their comparison could be against the T30L or something like that.
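
For what it's worth, the two leaked multiples are at least self-consistent under that reading. A quick sanity check, using only the figures quoted above rather than any new data:

```python
# Sanity check: if Tegra 4 is 6x Tegra 3 and Tegra 3 is ~3.33x Tegra 2
# (measured the same way), the 20x-over-Tegra-2 claim follows directly.
tegra3_over_tegra2 = 3.33   # quoted Egypt 2.1 multiple (strictly, 40/12.8 is ~3.1x)
tegra4_over_tegra3 = 6.0    # multiple from the leaked slide
print(tegra4_over_tegra3 * tegra3_over_tegra2)   # ~20x
```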
 
The problem is that if it's 6x Tegra 3, then it can't be 20x Tegra 2, because Tegra 3 isn't 3.33x faster than Tegra 2.

NVIDIA states that 3D performance of Tegra 3 is up to 3x faster than Tegra 2 based on GLBenchmark 2.0 Egypt: http://www.nvidia.com/object/tegra-3-processor.html

NVIDIA also has a whitepaper that shows a few examples of 3D gaming performance on Tegra 3 that is 2.3-2.7x faster than Tegra 2: http://1.androidauthority.com/wp-content/uploads/2011/09/Tegra-3-Benchmarks.jpg

This probably means that Tegra 4 is only some 3-4x faster than Tegra 3 in GPU performance: barely enough to draw level with Snapdragon's Adreno 320 in smartphones, but it will have a hard time dealing with the Mali T604 in the Exynos 5 for tablets.

A 4x improvement in GPU performance across the board (which is easier said than done of course) in Tegra 4 vs. Tegra 3 would be enough for Tegra 4 to handily outperform Adreno 320 and Mali T604 in each and every GLBenchmark test: http://www.anandtech.com/show/6425/google-nexus-4-and-nexus-10-review/2 (let alone any real-world gaming benchmarks, or more graphically intensive synthetic benchmarks such as the upcoming Futuremark 3dmark mobile test suite). Do note that the maximum number of pixel shading execution units is rumored to be 8-9x more in Tegra 4 than Tegra 3. Since the GLBenchmark 2.5 Egypt HD test appears to scale fairly linearly with any increases in ALU performance (see A5X vs. A6X as an example), that means that Tegra 4 should dramatically outperform Tegra 3 on that test.
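
As a rough illustration of where that 8-9x pixel-shading figure plausibly comes from (assuming the rumored 72 unified cores can all be devoted to pixel work, versus Tegra 3's dedicated pixel-shader units):

```python
# Illustrative only: the rumored unified-core count versus Tegra 3's pixel units.
tegra3_pixel_units = 8       # Tegra 3's 12-core ULP GeForce: 8 pixel + 4 vertex units
tegra4_unified_units = 72    # rumored unified-shader count from the leaked slide
print(tegra4_unified_units / tegra3_pixel_units)   # 9.0 -> in line with the "8-9x" rumor
```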

At the end of the day, the dramatic increase in execution units, the move from non-unified to unified shader architecture, the move from a very old architecture to a new architecture with Kepler DNA, the move from a 40nm fabrication process to a 28nm fabrication process, the move from a relatively old software stack to a relatively new software stack, etc. means that Tegra 4 should be a huge improvement in all areas vs. Tegra 3.
 
Using the exact same layout as the Tegra 3 diagram in a way that could easily be photoshopped and saying "6x graphics" with "72 cores" (i.e. 6 times Tegra 3's 12 cores) which doesn't fit Kepler at all makes this very suspicious. It could be right, but I don't buy it so far.

So what exactly would make sense to you for Tegra 4's configuration? What makes you think that 72 CUDA "cores" doesn't fit Kepler at all?

In the Kepler mobile/desktop GPU lineup, the ratio of CUDA cores to TMUs appears to be 12:1 in all cases. A hypothetical Tegra 4 GPU with 72 CUDA cores has a core count that is evenly divisible by 12. In fact, one problem with the previous 64/32-core rumors for Tegra 4 that I did not realize until now is that neither of those numbers is evenly divisible by 12.

Now, clearly NVIDIA cannot use the exact same SMX configuration with Tegra as they did with the Kepler mobile/desktop parts (where each SMX has 192 CUDA cores), due to the much stricter power consumption limits for Tegra-equipped devices. So while Tegra 4 almost certainly has Kepler DNA, surely the design had to be heavily customized for the task.

P.S. I did some digging, and there is a Tegra 2 whitepaper listed on NVIDIA's website that has a diagram that is very similar in layout compared to this leaked Wayne diagram: http://androidandme.com/wp-content/uploads/2011/01/tegra-geforce-gpu.jpg . So there is a chance that this diagram did come straight from NVIDIA (although I don't get why there are 240 green squares shown on the Wayne diagram for the Geforce GPU).
 
On a side note, I am guessing that the GPU in Tegra 4 will most closely resemble a very slimmed-down version of the GeForce GT 640M LE graphics card. The GT 640M LE runs at a 500 MHz core clock (with no GPU Boost feature), uses DDR3 memory, and has 384 CUDA cores with a 20W TDP. Assuming that the Tegra 4 GPU truly does have 72 CUDA cores, then by extrapolation the power consumption should be low enough on a 28nm fabrication process for use in handheld and tablet devices. The GFLOPS throughput would be 72 GFLOPS, which is basically in line with the SGX 554MP4 used in the A6X SoC.
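
For reference, the 72 GFLOPS figure is just the usual peak-throughput arithmetic, assuming a GT 640M LE-like 500 MHz clock and one MADD per core per clock:

```python
# Peak-throughput arithmetic behind the 72 GFLOPS figure above.
cuda_cores = 72
clock_hz = 500e6        # assumed GT 640M LE-like core clock
flops_per_core = 2      # one multiply-add per core per clock
print(cuda_cores * clock_hz * flops_per_core / 1e9)   # 72.0 GFLOPS
```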
 
So what exactly would make sense to you for Tegra 4's configuration? What makes you think that 72 CUDA "cores" doesn't fit Kepler at all?
Kepler has 32-wide ALUs and a branch granularity of 32, and 72 isn't a multiple of 32. I see no reason for them to change that (they haven't changed the branch granularity since G80, possibly because it is directly exposed to CUDA programmers) and even if they did it'd be a very bad idea not to keep it a power of 2.
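
To spell out the objection with the core counts that have been floated so far (72 from the leak, 64 from earlier rumors, 96 as discussed below):

```python
# Kepler's ALUs and branch granularity are 32 wide, so a core count that
# isn't a multiple of 32 leaves lanes stranded.
for n in (72, 64, 96):
    print(n, "->", "multiple of 32" if n % 32 == 0 else f"{n % 32} lanes left over")
```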

It's possible it's really 64 MADDs and they decided to count "cores" differently although that would be surprising. They could count 72 as "64 MADDs + 8 Special Function Units" or "64 MADDs + 8 Texture Units" or "64 FP32 MADDs + 8 FP64 MADDs" but none of that makes much sense. The only thing that seems plausible is 64 GPU cores + 8 CPU cores (4xA15+4xA7) but that's in direct contradiction with this slide...

My bet remains 96 GPU cores, i.e. half a SMX. As hardware.fr correctly pointed out at the GK104's launch, the two halves of a SMX are actually independent except for the 64KB of shared memory (and maybe some of the PolyMorph functionality). It would make sense for NVIDIA to modify the shader core itself as little as necessary (so they can reuse the compiler) and focus more on optimising the rest of the chip. It will likely require pretty big changes to target a much lower level of performance and power consumption - the central parts of a GPU tend to be the hardest to scale down as it's not as simple as just reducing the number of parallel units.

So yeah, 500MHz+ 96 ALUs/8 TMUs/4 ROPs with 64-bit 1600MHz (LP)DDR3 still seems like the most likely configuration to me. This is strictly speculation based on how you'd logically want to scale down Kepler rather than any insider information though... :)
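
Just to spell out what that speculated configuration would give in peak theoretical numbers (every input below is the guess above, not a confirmed Tegra 4 spec):

```python
# Peak theoretical throughput for the speculated configuration; all values are guesses.
clk = 500e6                            # "500MHz+" core clock
alus, tmus, rops = 96, 8, 4
print(alus * 2 * clk / 1e9, "GFLOPS")      # 96.0 (MADD = 2 FLOPs per clock)
print(tmus * clk / 1e9, "Gtexels/s")       # 4.0
print(rops * clk / 1e9, "Gpixels/s")       # 2.0
print(64 / 8 * 1600e6 / 1e9, "GB/s")       # 12.8 for a 64-bit LPDDR3-1600 bus
```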

P.S. I did some digging, and there is a Tegra 2 whitepaper listed on NVIDIA's website that has a diagram that is very similar in layout compared to this leaked Wayne diagram: http://androidandme.com/wp-content/uploads/2011/01/tegra-geforce-gpu.jpg . So there is a chance that this diagram did come straight from NVIDIA (although I don't get why there are 240 green squares shown on the Wayne diagram for the Geforce GPU).
Thanks for the link, that's the diagram I meant when I said it used the "exact same layout as the Tegra 3 diagram in a way that could easily be photoshopped" (not sure if there's a similar one for Tegra 3 or if I'm just getting old and confused the two).

Honestly I think that actually makes it less credible. The weird 240 blocks on the GeForce block also look suspicious. It might be for real, but if so it's pretty lazy marketing. And I'm still expecting something more exciting than those specs, but maybe that's just the former NVIDIA fanboy still hiding in me! ;)
 
So yeah, 500MHz+ 96 ALUs/8 TMUs/4 ROPs with 64-bit 1600MHz (LP)DDR3 still seems like the most likely configuration to me. This is strictly speculation based on how you'd logically want to scale down Kepler rather than any insider information though... :)

While I agree that this would be an elegant slimmed down version of Kepler, I am not convinced that this would be enough to satisfy power consumption and die size requirements for the Tegra 4 SoC. This proposed 96 CUDA core configuration would cut down execution units to 1/4 that of GT 640M LE, but TDP really needs to be cut down closer to 1/6 that of GT 640M LE for use in Tegra-equipped handheld devices. Also note that 96 CUDA cores would be the only CUDA core amount below 100 that is evenly divisible by 32 (for branch granularity of 32) and evenly divisible by 12 (for ALU-to-Tex ratio of 12:1), so any lower power/lower performance variants would need to be rearchitected away from the 12:1 ALU-to-Tex ratio.
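
A one-liner confirms that 96 is indeed the only core count under 100 that satisfies both constraints mentioned above:

```python
# Core counts under 100 that fit both 32-wide ALUs and the 12:1 ALU:TMU ratio.
print([n for n in range(1, 100) if n % 32 == 0 and n % 12 == 0])   # [96]
```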
 
While I agree that this would be an elegant slimmed down version of Kepler, I am not convinced that this would be enough to satisfy power consumption and die size requirements for the Tegra 4 SoC.
Hmm? An entire SMX takes ~17mm2 AFAICT. If you reduce it to less than 10mm2 by cutting it in half, that seems small enough to me. I agree the power consumption might be a bigger issue but they could try optimising the low-level implementation more for power (sacrificing some area) or clock it even lower for smartphones (lower voltage means higher power efficiency). I still think the bigger question is how much they can optimise everything outside the SMX...

so any lower power/lower performance variants would need to be rearchitected away from the 12:1 ALU-to-Tex ratio.
Sure. One redesign that makes sense to me would be to remove the second decoder and the 3rd MADD, essentially getting a single dual-issue decoder feeding 2xVec32 MADDs with no opportunity to co-issue a load/store or texture instruction for free. I'm not convinced that's worth the level of effort required in both the hardware and the compiler though. And while it would reduce absolute power, I'm not convinced it would improve perf/watt.

If I wanted to save area in the SMX, I'd be much more tempted to reduce TMU performance (do we really need full-speed FP16 filtering in this generation of handheld hardware?) and memory latency tolerance (if you target lower clock rates, you can reduce the number of threads and registers) for example. Finally the TDP includes memory in notebooks but not in handhelds, and obviously it doesn't matter as much as average power.

We'll know soon enough! :)
 
If there's one thing I've learned in the past few years following the mobile market, it's that NVIDIA didn't need a crapload of units to stay competitive against its competition, rather the exact opposite. It's the sheer efficiency of their compiler/drivers that does most of the "magic" involved.

How many ALU lanes does an Adreno 225 have compared to the ULP GeForce in Tegra 3, and why isn't there a clear correlation between unit counts and performance levels in the end?

Also, while I'm obviously not an engineer, you'll have a damn hard time convincing me that the Wayne GPU won't be a completely custom design on similar rails to its predecessors, rather than just a simple shrunk-down version of a current high-end desktop GPU architecture with some minor power-consumption tweaks. I'll say latency, and again latency.

Isn't it idiotic to say that hardware configuration N sounds like too little or too much without even having a hint of its real performance/efficiency? Assuming the slide is real, NV claims that the Wayne GPU is 6x more efficient than Tegra 3, and that's roughly in line with what competitors claim for their next-generation GPUs.

Finally - as always - expect too much and prepare to get fairly disappointed, or expect too little and get pleasantly surprised. Pick your poison ;)
 
I had been writing a post a few days ago but lost it. While a branch granularity of 32 has been used since G80, the number of ALUs wasn't necessarily a multiple of 32: it was a multiple of 16 for two and a half generations (plus GT21x and GF119), down to just 8 ALUs on the G98.

This idea of using only one SM cut in half is precisely what was done with the G98, but you still had the big front end etc. and a TDP of 25 watts. So your low-end GPU is "big", but you deal with it: your 8400 GS has more transistors and worse performance than your 6800 GT, but with the shrinks it uses less power and is smaller.

So I can't fully rule out that half-Kepler idea for Tegra 4; it sort of works thanks to a lavish transistor budget and making it as low-power as you can. But why not a new architecture, if you don't want the GTX 680 front end on your Tegra?

72 "cores" would translate to 9 units of 8 ALU, and a 8:1 ALU to TMU ratio (?) (I guess some texture rate doesn't hurt with some devices at very high res).
 
I guess this is the point where sarcasm kicks in and I have to ask whether you folks expect the Wayne GPU to be DX11.1 or almost DX11.1 :devilish:
 
http://www.nvidia.com/object/tegra-4-processor.html

Tegra 4 officially announced.

http://www.anandtech.com/show/6550/...00-5th-core-is-a15-28nm-hpm-ue-category-3-lte

Tegra 4 is built on TSMC's 28nm HPm process (low power 28nm with High-K + Metal Gate)

The fifth/companion core is also a Cortex A15, but synthesized to run at lower frequencies/voltages/power. This isn't the same "G transistors in an island of LP process" approach that was used for Tegra 2/3.

The fifth/companion core isn't visible to the OS, it's not big.LITTLE but it'll work similarly to how Tegra 3 worked. This probably means no companion core in Windows RT.

The four Cortex A15s will run at up to 1.9GHz.

Dual-channel memory interface, LP-DDR3 is supported
 
Hi guys, first post here.
I noticed in the glbenchmark database this one:
http://www.glbenchmark.com/phonedetails.jsp?benchmark=glpro25&D=Dalmore+Dalmore&testgroup=overall

"System" says that it uses a Tegra platform, clocked between 51 and 1836mhz. Scores are far better than Tegra 3, but would be disappointing for a Tegra 4 as both Mali t604 and PowerVr554 mp4 destroy it.
What do you think?
Do you have any mirror for that? The link doesn't show anything for me.

I didn't expect they would suck in terms of GPU. If that's the case, then I'd go as far as to call this another Tegra 3, i.e. a failure.

The most surprising info for me was the 80mm2 die size. It's much less than I expected for a 5xA15 design; it's basically the same size as Tegra 3. It should give us a ballpark figure for what the competition can do on the same process. Suddenly a 4+4 big.LITTLE + T658 Exynos seems pretty darn reasonable.
 
In that case the annotated die shot is completely fabricated, since all 72 GPU cores have the same synthesis. Not entirely unexpected given the Tegra 2 and 3 die shots were also made up.
 
I didn't expect they would suck in terms of GPU
How did you come to such a conclusion?
Maybe you know something we don't?
These 72 FPUs should be clocked at 650-700 MHz, considering the 20x FP performance over Tegra 2 claimed in the leaked slide. That's roughly 100 GFLOPS, which should be enough to keep up with something like the SGX 554MP4 in the iPad 4 in GLB 2.5.
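
For reference, that "roughly 100 GFLOPS" is just the MADD peak at the speculated clocks:

```python
# Peak FP throughput: 72 FPUs x 2 FLOPs (one MADD) per clock at 650-700 MHz.
for mhz in (650, 700):
    print(mhz, "MHz ->", round(72 * 2 * mhz / 1000, 1), "GFLOPS")   # 93.6 / 100.8
```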
 
How did you come to such a conclusion?
Maybe you know something we don't?
These 72 FPUs should be clocked at 650-700 MHz, considering the 20x FP performance over Tegra 2 claimed in the leaked slide. That's roughly 100 GFLOPS, which should be enough to keep up with something like the SGX 554MP4 in the iPad 4 in GLB 2.5.

The logical conclusion is that if it did, they would show GPU performance comparisons against the Adreno 320 and SGX 554MP4.
They didn't, even though they did exactly that with the CPU results. Most probably, Tegra 4's GPU falls short of expectations.


I'm really excited about Project Shield though. I'm hoping they'll leverage the streaming app for Tegra 3 tablets and Geforce 600 GPUs.
 
Most probably we're just rushing to conclusions; CES hasn't even started yet.

NVIDIA's keynote lasted more than an hour and a half, during which they chose to show comparative CPU performance but chose not to show comparative GPU performance.

I'm sure the GPU performance should be quite competent for today. It's probably a bit faster than the Adreno 320 and comparable to Mali T604, which is already substantially faster than a Vita (I always thought that would be a turning point for handheld gaming in Android devices).

Rushed conclusion or not, I'm 95% sure Tegra 4's GPU performance isn't comparatively groundbreaking, or they would brag about it, as you'd expect from a company that has its roots in building 3D graphics processors.
 