Unified architectures require dynamic resource allocation to avoid pipeline stalls on an IMR, whereas allocation could be static on tilers thanks to the decoupled geometry pass, which would mean zero control-logic overhead for tile-based architectures
Agreed in theory, although none of my points had anything to do with the extra control logic
To simplify the reasoning, I assumed that even on an IMR the control logic would be less expensive than adding more units to achieve the same performance (i.e. imagine a design with identical VS and PS units that are statically allocated to one or the other at design/manufacturing time to save that control logic; I would expect that to always be slower than a unified design).
Regarding geometry vs pixel processing logic, the specifics are often overlooked:
- On an IMR, if the triangle setup unit is the bottleneck, then the PS/TMUs will stall. So if you have a very high-polygon-count object that is nearly entirely outside the frustum, or at least fails the depth test, you will pay the geometry processing cost of that object in full.
- On a TBR, only the *average* VS vs PS throughput matters, as you can do the pixel processing of the previous render/frame while doing the geometry processing of the current one. So if the geometry processing for the entire render/frame isn't the overall bottleneck, then that invisible object will basically come for free. This is an often overlooked advantage of TBRs for geometry processing.
- On both IMRs and TBRs, you ideally want to have all the fixed-function units busy as often as possible, so that they don't become a bottleneck at other points in time. For example, on a very naive unified shader TBR that did geometry processing and pixel processing sequentially with no overlap, if the geometry stage is triangle-setup limited, then the shader core will mostly stall for that duration. On the other hand, if geometry and pixel processing happen in parallel, the geometry part may essentially come "for free".
- When doing VS and PS overlap in a unified architecture (either IMR or TBR), it's very common for the VS not to use the TMUs while the PS might be TMU-limited. So ideally you want fine-grained overlap between VS and PS inside every individual shader core/cluster, so that the VS ALU instructions basically come for free when the PS is TMU-limited.
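The TBR decoupling argument above can be sketched with a toy throughput model. All the numbers are invented and the pipelines are idealised (a real IMR overlaps setup with shading of earlier triangles far more than this), but it shows why only the *average* geometry-vs-pixel balance matters on a tiler:

```python
# Toy model: per-draw (geometry_cost, pixel_cost) in arbitrary time units.
draws = [
    (1.0, 10.0),   # normal object
    (8.0, 0.0),    # high-poly object that is culled / depth-rejected
    (1.0, 12.0),   # normal object
]

# IMR (simplified): a setup-heavy draw stalls the PS/TMUs, so the
# geometry cost of the invisible object is paid in full.
imr_frame_time = sum(g + p for g, p in draws)

# TBR: the geometry pass for frame N overlaps the pixel pass for
# frame N-1, so the steady-state frame time is the max of the two
# totals rather than their sum.
tbr_frame_time = max(sum(g for g, _ in draws), sum(p for _, p in draws))

print(f"IMR: {imr_frame_time}")   # 32.0 -> pays for the invisible object
print(f"TBR: {tbr_frame_time}")   # 22.0 -> its geometry hidden behind pixel work
```

As long as total geometry time stays below total pixel time, the culled high-poly draw costs the tiler essentially nothing.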
The last point especially is very easy to get wrong if you try to get away with simple control logic, although it's hard to say how much it matters for typical workloads, as they'd need to be both VS-heavy and TMU-heavy for this to be significant; arguably part of GLBenchmark 2.5 does fit in that category, FWIW.
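That fine-grained VS/PS overlap can be illustrated with a toy issue-slot model (instruction counts are invented and the scheduler is idealised), assuming a core that can dual-issue one ALU op and one TMU op per cycle:

```python
# Toy issue-slot model of fine-grained VS/PS overlap in a unified core.
ps_alu_ops, ps_tmu_ops = 400, 1000   # PS is TMU-limited here
vs_alu_ops = 500                     # VS uses the ALU only, no TMU

# Coarse-grained scheduling: run all VS work, then all PS work.
coarse_cycles = vs_alu_ops + max(ps_alu_ops, ps_tmu_ops)

# Fine-grained interleaving: the TMU stream bounds the PS, and the
# ALU slots the PS leaves idle absorb the VS ALU instructions.
fine_cycles = max(ps_tmu_ops, ps_alu_ops + vs_alu_ops)

print(coarse_cycles)  # 1500
print(fine_cycles)    # 1000 -> the VS ALU work came "for free"
```

Simple control logic that only switches the whole core between VS and PS work gives you the coarse number; hiding the VS ALU instructions under the TMU-limited PS needs per-cycle interleaving.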
One thing I'm still curious about in Tegra 4's case is whether the PS ALUs are really FP20 and, if so, how they're getting DX9_3 conformance... I assume they've got a dedicated normalisation unit (à la NV4x for FP16, but higher precision here), which would help reduce precision issues in typical workloads, as normalisation is a frequent source of them.
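To get a feel for why a higher-precision normalisation unit would help, here's a rough sketch that rounds every intermediate of a normalize() to a given number of mantissa bits. Assuming "FP20" means roughly 13 explicit mantissa bits (an assumption; this toy ignores the exponent field and denormals entirely), compare it against near-FP32 precision, as a dedicated unit could provide:

```python
import math

def q(x, bits):
    # Round x to `bits` explicit mantissa bits. Rough model only:
    # ignores exponent range and denormals, so this is not a real FP20.
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e, 0.5 <= |m| < 1
    s = 2.0 ** bits
    return math.ldexp(round(m * s) / s, e)

def normalize(v, bits):
    # Normalise with every intermediate rounded to `bits`, the way a
    # shader ALU working entirely at that precision would.
    sq_sum = 0.0
    for c in v:
        sq_sum = q(sq_sum + q(q(c, bits) * q(c, bits), bits), bits)
    inv_len = q(1.0 / math.sqrt(sq_sum), bits)
    return [q(q(c, bits) * inv_len, bits) for c in v]

def unit_length_error(v, bits):
    n = normalize(v, bits)
    return abs(math.sqrt(sum(c * c for c in n)) - 1.0)

v = (0.3141592, -0.7182818, 0.5772157)
print(unit_length_error(v, 13))   # FP20-ish ALU: visibly off unit length
print(unit_length_error(v, 23))   # FP32-ish unit: far closer to 1.0
```

The gap between the two errors is the kind of thing a dedicated high-precision normalisation path could hide from an otherwise low-precision pixel pipeline.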