NVIDIA Tegra Architecture

Industry backlash on nVidia because they're not being supportive of GPGPU?

Wow, I never saw that coming.. after some replies here, I thought they had at least enabled OpenCL in the pixel shaders or something.

That brightsideofnews article isn't painting a bright future for Tegra 4.
 
Couldn't Nvidia use their Kepler core in Tegra? How difficult could it really have been? They already have designs at TSMC for both Kepler and Tegra, and Kepler also looks more efficient, so why would they stick with the older design?
I suspect Nvidia won't go with a Kepler based core until they're ready to pay the silicon cost for dx11 functionality. Nvidia's first unified shader was DX10 so they're probably sticking with separate vertex and pixel ALUs until they upgrade the feature set.
 
Exophase said:
Although it'll vary a lot depending on load, I don't think Cortex-A15 will deliver such a big increase in IPC over A7 on average. ARM said they expected 50% better IPC in integer code than Cortex-A9, but in practice I'm not seeing numbers anywhere close to this very often, at least not with Exynos 5. And I think the expectation that A7 will be roughly A8 level and often better is credible - although it can't pair as many instruction combinations, it does have a shorter branch mispredict penalty, claims very low latency caches (we'll see), and AFAIK benefits from automatic prefetch and what have you.

So I'd give a number like ~1.8x in the best case but maybe as low as 1.4-1.5x. Very hand-waved, of course. Maybe it'll change a little with compilers.
Hmm, interesting - I think you're right that 2x is too high for integer workloads (although you'd expect A15 to be much closer to 2x IPC with 128-bit FP, for what little it's worth). I'm honestly not sure what number would be most realistic, but I suppose it really depends on the workload and memory subsystem. Given how complex the A15 is though, I'd be really disappointed if it was less than 1.5x A7 on average, but we'll see...

OlegSH said:
I heard from one developer that a single SGX Vec4 SIMD unit issues 4 FP16 MADs per cycle and only 2 FP32 MADs - does anybody know if this is true or not? And does it mean that the 76 theoretical GFLOPS of the SGX554MP4 in the iPad 4 are equal to 38 FP32 GFLOPS?
IMG DevTech's team has historically claimed that to some developers as a simplification of what you can expect in terms of performance, but the shader core can definitely achieve full-rate Vec4 FP32 MADDs. The problem is there are other restrictions (not ALUs and not operand sharing per se) on Vec4 FP32 instructions, and if you don't meet those then you only get a Vec2 FP32 or Vec4 FP16 operation at full speed. These restrictions are not a problem for typical vertex shader workloads.
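For what it's worth, the two numbers in that question are just a lanes × clock product. A minimal back-of-the-envelope sketch, assuming the commonly quoted SGX554MP4 configuration in the iPad 4 (4 cores, 8 Vec4 USSE2 pipes per core, ~300 MHz) - the pipe count and clock here are my assumptions, not confirmed anywhere in this thread:

```c
/* Back-of-the-envelope check of the figures in the question above.
 * Per-core pipe count and clock are assumed, not confirmed. */
#include <stdio.h>

int main(void)
{
    const int    cores          = 4;    /* SGX554MP4 = 4 cores                 */
    const int    vec4_pipes     = 8;    /* USSE2 pipes per core (assumed)      */
    const int    lanes          = 4;    /* Vec4 = 4 FP32 lanes                 */
    const int    flops_per_madd = 2;    /* a MADD counts as mul + add          */
    const double clock_ghz      = 0.3;  /* ~300 MHz (assumed)                  */

    double fp32_peak = cores * vec4_pipes * lanes * flops_per_madd * clock_ghz;
    printf("Full-rate Vec4 FP32:          %.1f GFLOPS\n", fp32_peak);      /* ~76.8 */
    printf("If Vec4 FP32 were half rate:  %.1f GFLOPS\n", fp32_peak / 2);  /* ~38.4 */
    return 0;
}
```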

Exophase said:
However, according to a post by John H (sorry, don't have a link) there are instructions with a shared 32-bit operand that can perform two FMADDs in one cycle. Although he wasn't any more specific about what this meant, my guess is that it's some form of (a * b) + (a * c) + d. This instruction would be used in the kernel of matrix * vector multiplication, i.e. linear transformations, which is the fundamental building block for geometry transformation & lighting.

Series 5XT could dual-issue the above instructions, so you get 4x FP16 or limited 4x FP32.
Correct for SGX as per JohnH's post, but the SGX-XT instruction set is new, so you shouldn't extrapolate too much from his post; it's not simply dual-issuing original SGX instructions.
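Purely to make the matrix * vector point concrete, here's a plain C version of that kernel - not the actual USSE instruction sequence, just an illustration of where a shared-operand dual-FMADD would slot in, since each vector component gets reused across all four output rows:

```c
/* Illustrative only: a column-major 4x4 transform written as a chain of
 * FMADDs.  For each column, v[col] is reused across all four rows, so
 * two of these FMADDs could in principle share that 32-bit operand. */
void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    for (int row = 0; row < 4; ++row) {
        float acc = 0.0f;
        for (int col = 0; col < 4; ++col)
            /* acc = m[row][col] * v[col] + acc  -> one FMADD per step */
            acc = m[col * 4 + row] * v[col] + acc;
        out[row] = acc;
    }
}
```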

Exophase said:
I have never read anything about that, and I read that pretty detailed optimization manual nVidia released.
Same here, but when I did some low-level testing, I noticed higher performance (but not 2x) with LowP for some extreme corner case shaders I wrote. It might just be register pressure although it wasn't behaving in the way I would have expected it to based on my understanding of NV40's register system, but I never had the time to investigate.

Exophase said:
Whole thing might be moot, since according to an earlier post by Arun in this thread (http://forum.beyond3d.com/showpost.php?p=1659557&postcount=271) it looks like the 8/10-bit operations may not dual-issue in Series5XT..
Correct - a fully 10-bit shader can still be very slightly faster because you wouldn't need to do any FP->INT conversion at the end, but that's about it. In practice it could even be slower (before considering dual-issue) due to both hardware and compiler limitations, although the compiler should detect those cases and revert to FP16.

AlexV said:
I would consider the odds of a wonky partially unified solution to be quite low. That being said, and given the Tegra GPU's heritage (G7x-something), I'd expect the VS to be able to interact with the samplers in order to implement Vertex Texture Fetch (do we know if older Tegras didn't support this? It'd go against what G7x can do). IIRC though, there were some strict bounds on just how you could sample - possibly no filtering? It's been a while. Sidenote: amusing how in the handheld / embedded world most of what is old is new again.
Tegra has much less to do with G7x than most people realise. It's inspired by G7x but fundamentally a different architecture (e.g. VS and PS are definitely 4 MADDs per scheduler, while G7x was 5 MADDs for VS and 8 MADDs for PS). The feature set is also completely different (no MSAA, no framebuffer compression, lower precision - the main new features compared to G7x are programmable blending, a general depth/color cache, and a hacky CSAA implementation).

3dcgi said:
I suspect Nvidia won't go with a Kepler based core until they're ready to pay the silicon cost for dx11 functionality.
Yeah, although the bigger question is the power cost, obviously. It might compensate some of that by being more bandwidth efficient, but likely not enough.
 
With no frequencies listed those numbers are hard to gauge :/

This remark is interesting: "All code that is executed on Cortex-A7 is compiled for Cortex-A15."

I expect normally you'd do it the other way around, unless code just kind of ends up naturally scheduled well for A7. Given its dual issue restrictions I wouldn't expect this.
 
I expect normally you'd do it the other way around, unless code just kind of ends up naturally scheduled well for A7. Given its dual issue restrictions I wouldn't expect this.
Imagine that code compiled for A7 performs very badly on A15, while code compiled for A15 performs only slightly worse on A7 - which option do you choose?

As an example, think about library routines that can be very different on various CPUs.

Please note, I'm not stating this is the case ;)

An alternative way to explain that is that if you are ready to accept running on a core that is, let's say, twice as slow, losing an extra 10% IPC is not a big deal, if on the other hand your power-hungry core finishes its jobs faster.
 
Imagine that code compiled for A7 performs very badly on A15, while code compiled for A15 performs only slightly worse on A7 - which option do you choose?

As an example, think about library routines that can be very different on various CPUs.

Please note, I'm not stating this is the case ;)

My premise is founded on the assumption that that is NOT the case. That's a pretty reasonable assumption, because Cortex-A15 can re-order instructions to work around sub-optimal compiler orderings and Cortex-A7 can't.

Some things will be chaff for the A15, true... for instance I find that specifying alignment for NEON isn't necessary if the array is aligned, so if you'd otherwise need both an unaligned and an aligned version you'd prefer to use one set of code for both. Not sure how much things like this would creep into compiler output.

I hope someone does some tests to better show what's at stake.

An alternative way to explain that is that if you are ready to accept running on a core that is, let's say, twice as slow, losing an extra 10% IPC is not a big deal, if on the other hand your power-hungry core finishes its jobs faster.

Worse IPC would translate to worse (but probably not as much worse) power consumption too. The impact would be bigger when on the big core though.

I wonder if we'll see anybody specifying different compiler options for different parts of their programs. If I'm writing some generic Android code that will always run fine on typical Cortex-A7s I should definitely optimize for that.
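As a concrete (hypothetical) illustration of that kind of per-target tuning with a GCC-style toolchain - the file and function names below are made up, but -mcpu=cortex-a15 and -mcpu=cortex-a7 are real options in reasonably recent GCC releases:

```c
/* hot_loop.c -- hypothetical sketch of building the same source with
 * per-core tuning.  The out-of-order A15 can largely hide a schedule
 * that was tuned for the in-order, dual-issue-restricted A7, but not
 * vice versa, which is the asymmetry discussed above.
 *
 *   gcc -O2 -mcpu=cortex-a15 -c hot_loop.c -o hot_loop_a15.o
 *   gcc -O2 -mcpu=cortex-a7  -c hot_loop.c -o hot_loop_a7.o
 */
#include <stddef.h>

/* A trivial kernel whose instruction schedule the compiler reorders
 * differently depending on the -mcpu/-mtune target. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```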
 
Funny how no one noticed so far: I finally took a closer look at those cached GLBenchmark results (albeit I usually take preliminary results just as an indication) and the ULP GF in T4 is capable of multisampling, with a significant performance drop, but until final results are in I don't want to jump to any conclusions. Fillrates and triangle rates seem quite low, suspiciously close to T3 results, which could mean anything or nothing.
 
Funny how no one noticed so far: I finally took a closer look at those cached GLBenchmark results (albeit I usually take preliminary results just as an indication) and the ULP GF in T4 is capable of multisampling, with a significant performance drop, but until final results are in I don't want to jump to any conclusions. Fillrates and triangle rates seem quite low, suspiciously close to T3 results, which could mean anything or nothing.
Hah, when I wrote my initial posts it had been removed and the cached version wasn't linked yet, and then later I didn't notice. That's good news! And the even better news is that the High mode in GLB2.5 requires 24-bit depth, so it definitely supports that as well :)

It's a pretty heavy performance hit though (especially as GLB doesn't use MSAA on the shadow/reflection passes which are a fairly significant percentage of the total workload) so who knows about framebuffer compression, we'll see...
 
nVidia's keynote lasted for more than an hour and a half, during which they chose to show comparative CPU performance but chose not to show comparative GPU performance.

I'm sure the GPU performance should be quite competent for today. It's probably a bit faster than the Adreno 320 and comparable to Mali T604, which is already substantially faster than a Vita (I always thought that would be a turning point for handheld gaming in Android devices).

Rushed conclusion or not, I'm 95% sure Tegra 4's GPU performance isn't comparatively groundbreaking, or they would brag about it, as you'd expect from a company that has its roots in building 3D graphics processors.

The problem with GPU power is that it bears little relation to developers actually sinking money into making the kind of games you see on the Vita and 3DS. Heck, I have yet to see a mobile game that matches the depth of the original Pokemon Red and Blue.

It's not about graphics. It never has been. Developers are disincentivized from making large-scale games when Angry Birds continues to be the top seller on multiple platforms 3 years after its release.
 
This is probably a pretty gaming oriented audience, so it'd be interesting to know: how many here have played for a decent amount of time (>3h?) a 3D game that stresses the GPU on their phone/tablet?

I have tried Infinity Blade demos and such out of curiosity and even bought a couple of racing games, but never played more than 15mins before giving up and going back to Words with Friends, Trainyard-like puzzle games and some strategy games.
 
This is probably a pretty gaming oriented audience, so it'd be interesting to know: how many here have played for a decent amount of time (>3h?) a 3D game that stresses the GPU on their phone/tablet?

I have tried Infinity Blade demos and such out of curiosity and even bought a couple of racing games, but never played more than 15mins before giving up and going back to Words with Friends, Trainyard-like puzzle games and some strategy games.
The only more device-intensive game that I've spent long play sessions (scale of hours rather than minutes) with is SimCity Deluxe for iPad, which can start to chug a little late game on a fully developed large map. This is on an iPad Mini. I'm thinking the late-game performance drop is probably more due to the CPU than the GPU, since there may be a lot going on on-screen but it isn't a graphically advanced game. Since the game was originally developed and released before the iPad 2 and A5 were launched, it probably isn't well threaded.

I've played the NOVA, Infinity Blade, Real Racing, and GTA series but mainly in occasional short bursts to kill a bit of time like a single mission or just messing around in the case of GTA rather than long, dedicated play sessions.
 
This is probably a pretty gaming oriented audience, so it'd be interesting to know: how many here have played for a decent amount of time (>3h?) a 3D game that stresses the GPU on their phone/tablet?

I have tried Infinity Blade demos and such out of curiosity and even bought a couple of racing games, but never played more than 15mins before giving up and going back to Words with Friends, Trainyard-like puzzle games and some strategy games.

I have spent around 35 hours total on Galaxy on Fire 2 HD (a Wing Commander type game) on my iPad 3, with non-stop sessions in the 4-5 hour range.
I would say this game stresses my iPad to the limit, because there is a noticeable slowdown in scenes with many enemy spaceships.

PS: If anyone has a craving for a Wing Commander/Privateer type game, I can highly recommend this one. The controls are quite functional as well :)
 
http://www.anandtech.com/show/6666/the-tegra-4-gpu-nvidia-claims-better-performance-than-ipad-4

Tegra 4 does offer some additional enhancements over Tegra 3 in the GPU department. Real multisampling AA is finally supported as well as frame buffer compression (color and z). There's now support for 24-bit z and stencil (up from 16 bits per pixel). Max texture resolution is now 4K x 4K, up from 2K x 2K in Tegra 3. Percentage-closer filtering is supported for shadows. Finally, FP16 filter and blend is supported in hardware. ASTC isn't supported.
 