Tegra 2 Announcement

Although now that Rys is at IMG/PowerVR, I might want to try and convince them that resistance is futile ;)

No need to worry; there's not much room left to breathe with Rys in that cupboard ROFL ;)
 
500mW is a marketing number, it doesn't mean anything whatsoever. It's not a TDP per se; there is such a number (all subsystems activated at once), but nobody really cares about it since you can just down-throttle in that case, or just prevent it from happening completely. 1080p decode logic is around 100mW IIRC, which is a very nice improvement (although expected; TSMC 40LP is better than most people seem to realize).
May I ask, what do you think could be a realistic number for running high-end 3D games (e.g. with the UE3) on the announced 1GHz dual-core Tegra2 (i.e. real-world worst case scenario)? The mentioned 500mW?

And how does the Tegra2 compare to the OMAP4 spec-wise? As far as I can see they seem to be in the same ballpark (at least clock for clock).
 
May I ask, what do you think could be a realistic number for running high-end 3D games (e.g. with the UE3) on the announced 1GHz dual-core Tegra2 (i.e. real-world worst case scenario)? The mentioned 500mW?

And how does the Tegra2 compare to the OMAP4 spec-wise? As far as I can see they seem to be in the same ballpark (at least clock for clock).

In such a situation, the ISP, video decode/encode and audio decode cores would shut down as well, bringing the overall number down by a fair amount.
 
May I ask, what do you think could be a realistic number for running high-end 3D games (e.g. with the UE3) on the announced 1GHz dual-core Tegra2 (i.e. real-world worst case scenario)? The mentioned 500mW?
The 500mW doesn't mean anything, it's not a real TDP, it's not an average, and it's not a specific use case. Just forget it. It's about as useless as the "<1W" they once gave for Tegra1.

The two most important points on power for a 3D game would probably be how much it stresses the CPUs and how much memory bandwidth it takes. 2x1GHz at full throttle, on LP transistors at a high voltage, surely can't be incredibly cheap, no matter what ARM claims (remember their 'power-optimized' version runs at 800MHz not on 40LP but on 40G!) - as for bandwidth, it'll obviously take less power with LPDDR2 than with low-voltage or standard DDR2 (which will be used in many tablets/smartbooks to save money).

I'm not going to bother estimating anything because frankly it's just going to be awfully wrong no matter what number I come up with :)

And how does the Tegra2 compare to the OMAP4 spec-wise? As far as I can see they seem to be in the same ballpark (at least clock for clock).
CPU-wise, it's pretty much the same, but presumably without NEON. Video-wise, it's arguably better (higher bitrates, but seemingly less support for less common codecs, and encode is limited to H.264). 3D-wise, it should be faster, but that depends a bit on efficiency (SGX540 is 2 TMUs, Tegra2 is 4 TMUs but at a slightly lower clock rate and without the TBDR efficiency benefits). In the image signal processing department, it's not as good as OMAP4, but I'm not sure how it compares power-wise there (not that ISP power matters much in reality; the sensor itself takes a lot, actually). Bandwidth-wise, OMAP4 supports 64-bit LPDDR2 versus 32-bit for NV, but I very much doubt the latter will ever be much of a bottleneck in practice.

Cost-wise, Tegra2 should be noticeably cheaper than OMAP4, and it has been sampling for longer. And that's for the top chip; NV is even more ahead timeframe-wise when it comes to the single-core derivative as I said earlier, AFAIK. So both very solid solutions, and I'd give a slight technical/practical advantage to NV, but TI definitely has the edge in terms of existing customer relationships for smartphones which is a big deal we shouldn't forget about.
 
AFAIK NV hasn't even licensed it, but I could be wrong. If it does have NEON, it would probably only be on one core (i.e. heterogeneous), and I very much doubt that. As I said in the past, I genuinely believe it's a pretty dumb piece of silicon in the current market environment and NV believes the same AFAIK.
I wouldn't mind encoding x264 at 3~5 times the speed if it had NEON... Seriously, the ARM SoC environment needs more standardization if they want third-party developers to optimize their software for it. If it doesn't get it, then performance-critical applications will forever be slower on ARM than on x86.

Libavcodec and x264 developers, etc., chose NEON in part because they can't code for dozens of different, ill-documented and exotic DSPs. Of course, right now there aren't many applications that use NEON, but the silicon must come first, and then the software follows.
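
To make "SIMD optimizations" concrete, here is a rough sketch (just an illustration, not code from any of those projects) of the kind of kernel codec developers hand-write for NEON: an 8-pixel sum of absolute differences, the sort of loop motion estimation runs millions of times per frame. One NEON instruction handles 8 pixels at once instead of one, which is roughly where speedups like the 3~5x I mentioned come from.

#include <stdint.h>
#include <stddef.h>

/* Plain C fallback: one pixel per iteration. */
static unsigned sad8_scalar(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (size_t i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (a[i] - b[i]) : (b[i] - a[i]);
    return sum;
}

#ifdef __ARM_NEON__
#include <arm_neon.h>
/* NEON version: 8 pixels per instruction. */
static unsigned sad8_neon(const uint8_t *a, const uint8_t *b)
{
    uint8x8_t  va  = vld1_u8(a);                 /* load 8 pixels of each block    */
    uint8x8_t  vb  = vld1_u8(b);
    uint16x4_t s16 = vpaddl_u8(vabd_u8(va, vb)); /* |a-b|, then pairwise widen+add */
    uint32x2_t s32 = vpaddl_u16(s16);
    uint64x1_t s64 = vpaddl_u32(s32);
    return (unsigned)vget_lane_u64(s64, 0);      /* horizontal sum of the 8 diffs  */
}
#endif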
 
Libavcodec and x264 developers, etc., chose NEON in part because they can't code for dozens of different, ill-documented and exotic DSPs. Of course, right now there aren't many applications that use NEON, but the silicon must come first, and then the software follows.
I may be retarded, but who cares about Libavcodec or x264 on these platforms? You can encode at 1080p in real-time at okayish quality - I know you could do much better with a much slower encode, but what's the use case for that in smartphones or tablets?

The point, from NV's point of view, is that adding silicon for use cases that affect basically nobody doesn't make a whole lot of sense. It's much better to invest that money into exposing the acceleration hardware better, if you had to spend it at all. I mostly share that POV myself, although I still think it'd be nice to implement NEON on at least one of the two/four cores in the highest-end chips for compatibility's sake. For all I know, maybe that's what they did already, although I've got the impression it isn't. Of course, you've also got people like Marvell who stubbornly believe Wireless MMX still matters. I'm not sure whether to laugh or cry...
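
For what it's worth, the compatibility question is mostly a software dispatch problem. A rough sketch of how a library could pick a code path at runtime on ARM Linux, by scanning /proc/cpuinfo for the NEON feature flag (illustrative code, not taken from any particular project):

#include <stdio.h>
#include <string.h>

/* Return 1 if the kernel lists NEON among the CPU features, 0 otherwise. */
static int cpu_has_neon(void)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "Features", 8) && strstr(line, "neon")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

The catch is that on a hypothetical chip with NEON on only one of the cores, a one-shot check like this wouldn't be enough - you'd also have to pin every NEON-using thread to that core, which is exactly the kind of mess that makes me doubt anyone would ship it that way.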
 
They were thinking that real-time x264 encoding is required for devices to offer the video/film capabilities that all portable devices are converging towards.
 
They were thinking that real-time x264 encoding is required for devices to offer the video/film capabilities that all portable devices are converging towards.
You're confusing H.264 and x264. Tegra2 can do 1080p30 H.264 encoding in real-time in hardware. Doing that in software on the ARM cores via x264 in real-time is completely insane and absolutely not an option, it's not going to happen for a long long time. Once again, it's very hard to find use cases for NEON that affect more than 1% of the potential userbase of these kinds of devices (and where a 1GHz A9 without NEON couldn't handle it just fine) - I'm not saying there aren't any, I'm sure there are, but it's easy to assume they are more frequent than they really are.

Of course, if ARM is to invade desktops in the long-term, then NEON would be extremely desirable. But this is not for this generation of hardware or even the next if ever, so I wouldn't get ahead of myself.
 
You're confusing H.264 and x264. Tegra2 can do 1080p30 H.264 encoding in real-time in hardware. Doing that in software on the ARM cores via x264 in real-time is completely insane and absolutely not an option, it's not going to happen for a long long time. Once again, it's very hard to find use cases for NEON that affect more than 1% of the potential userbase of these kinds of devices (and where a 1GHz A9 without NEON couldn't handle it just fine) - I'm not saying there aren't any, I'm sure there are, but it's easy to assume they are more frequent than they really are.

Of course, if ARM is to invade desktops in the long-term, then NEON would be extremely desirable. But this is not for this generation of hardware or even the next if ever, so I wouldn't get ahead of myself.

Non-FP bits of NEON might be pointless, but the FP bits of NEON are surely useful. Especially as gaming on handhelds catches on?

NEON-lite to the rescue perhaps? ;)
 
I may be retarded, but who cares about Libavcodec or x264 on these platforms? You can encode at 1080p in real-time at okayish quality - I know you could do much better with a much slower encode, but what's the use case for that in smartphones or tablets?
Libavcodec provides decode support for many more formats than one could dream of putting in fixed-function decoders. For example, in the Linux community many people use Theora to distribute video, as it can be shipped in Linux distros without breaking stupid USA laws, and Firefox will have it integrated as part of its HTML5 support. I have never heard of hardware Theora decoding, not even DSP-accelerated Theora decoding. But we already have some initial NEON optimizations. Without SIMD optimizations, software video decoders will be much slower, and become unusable.

I also don't know about the support for MPEG-4 ASP videos with qpel and GMC, but I have a lot of those old videos that I don't want to re-encode to watch on my smartphone/tablet. The same goes for RMVB, QuickTime, DV, and any new standard that might emerge in the meantime. Also, the post-processing options in software video decoders like ffdshow or mplayer are very nice, and depend on SIMD optimizations to run in real-time. If not for performance, then for lower CPU use and thus lower power consumption.

http://x264dev.multimedia.cx/?p=142

In the above announcement of initial ARM support, Dark Shikari gives a pretty exotic use case for x264 on ARM. But we don't need to go that far. I might simply want to encode something on the go on my Cortex-A9 based smartbook, and prefer to have a file 2~4 times smaller for the same quality than what the hardware encoder produces.

With NEON optimizations, even the slower OMAP4 will probably be faster than the 1.7GHz Celeron that I used to encode with x264 some years ago. So it will be very viable to use, as long as you don't need real-time encoding or mess around with 1080p videos. For Blu-ray backups, for example, I will continue to use my powerful desktop CPU. But even that might be worth moving to a Cortex-A9 if one is not in a hurry, because of power efficiency. High-end quad-core systems consume 100 times more than a Beagle Board while encoding, but are probably "only" around 10 times faster than an OMAP4.
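
To put rough numbers on that last point: energy is power times time, so if the desktop draws ~100 times the power but finishes only ~10 times sooner, it burns roughly 100/10 = 10 times the energy for the same clip - assuming those ballpark figures hold, of course.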

The point, from NV's point of view, is that adding silicon for use cases that affect basically nobody doesn't make a whole lot of sense. It's much better to invest that money into exposing the acceleration hardware better, if you had to spend it at all. I mostly share that POV myself, although I still think it'd be nice to implement NEON on at least one of the two/four cores in the highest-end chips for compatibility's sake.
Of course, if ARM is to invade desktops in the long-term, then NEON would be extremely desirable. But this is not for this generation of hardware or even the next if ever, so I wouldn't get ahead of myself.
Software developers won't code for something that very few people have support for, or that doesn't give any performance improvement for most people (only added for compatibility's sake, but too slow). Open source developers in general only start optimizing for a new instruction set when they buy new hardware with it. So the lag between when you start selling silicon that supports a new instruction set and when a large number of programs start using it is usually more than a year (see the adoption rate of new SSE instructions, new versions of DirectX, OpenCL, etc.). Until then you have an almost useless piece of silicon on your hands, but if you wait for it to be useful before you implement it, you have a chicken-and-egg problem.

Besides video encoding and decoding, other areas where integer SIMD might be useful are image processing (GIMP on tablets, anyone?), and for floating-point SIMD there is audio transcoding, decoding of rare audio formats, games as rpg.314 said, etc.
 
Those are some interesting points Manabu, thanks - there are some premises I don't agree with though:
Without SIMD optimizations, software video decoders will be much slower, and become unusable.
I would be very surprised if there was a single video codec on the planet, except probably H.265, that you couldn't decode a VGA/D1-level stream of on a 1GHz Cortex-A9 *without* NEON. Would it be more power efficient with NEON? Yeah, obviously. But it'd be perfectly usable, and if you cared about power efficiency you'd still gain a lot from transcoding it to something else.

This is the key: a lot of applications would benefit from NEON, but not a lot of them actually need it to be fast enough. That makes people talk past each other quite frequently in my experience ;)

Also, the post-processing options in software video decoders like ffdshow or mplayer are very nice, and depend on SIMD optimizations to run in real-time.
I'm not familiar with those, so I can't really judge - are you sure? If so, I guess that's definitely a viable niche application.

I might simply want to encode something on the go on my Cortex-A9 based smartbook, and prefer to have a file 2~4 times smaller for the same quality than what the hardware encoder produces.
2-4x smaller than a HW encoder? This is most likely not a super-naive encoder we're talking about, so that's probably with very high quality settings on x264. And HW encoder quality will only improve with time. Do you really think it's worth wasting 1+ hour of battery life to encode 20 minutes of video twice as efficiently? I'm skeptical; there certainly are cases where it is, but that's hardly a mainstream application.

High-end quad-core systems consume 100 times more than a Beagle Board while encoding, but are probably "only" around 10 times faster than an OMAP4.
Heh that's an interesting/amusing point! :D I wouldn't do that on a netbook though unless I could make it run entirely off the power plug rather than the battery, because batteries do degrade over time (i.e. decreased battery life).

Until then you have an almost useless piece of silicon on your hands, but if you wait for it to be useful before you implement it, you have a chicken-and-egg problem.
Well, in the case of NEON you don't, since the vast majority of companies are implementing NEON in nearly all their A9-based chips right now. But yes, I certainly agree with you in principle.

other areas where integer SIMD might be useful are image processing (GIMP on tablets, anyone?)
Ah yes, that's a nice application, hadn't thought of it... :)
 
Those are some interesting points Manabu, thanks - there are some premises I don't agree with though:
I would be very surprised if there was a single video codec on the planet, except probably H.265, that you couldn't decode a VGA/D1-level stream of on a 1GHz Cortex-A9 *without* NEON. Would it be more power efficient with NEON? Yeah, obviously. But it'd be perfectly usable, and if you cared about power efficiency you'd still gain a lot from transcoding it to something else.

This is the key: a lot of applications would benefit from NEON, but not a lot of them actually need it to be fast enough. That makes people talk past each other quite frequently in my experience ;)
Hmm, ok, I understand. I do have some ~720p 60fps DivX/Xvid stuff: AMVs, older trailers, video game captures and things like that. Also, Theora is not the most optimized standard ever, so it is actually somewhat slow. People make screen captures with it at 800x600+ resolutions. I think all of this would be difficult to handle without any SIMD or DSP support, but I'm not certain.

But I was actually impressed at what a 624MHz Intel XScale ARM CPU was capable of running. Only at 720x480 did TCPMP start to have problems handling MPEG-4 ASP. Actually, I also had a 640x480 high-bitrate video that didn't run well either. Given that XScale should be much slower than a Cortex-A9 clock for clock, none of this should be a problem.

I'm not familiar with those, so I can't really judge - are you sure? If so, I guess that's definitely a viable niche application.
I'm not sure. But many of those filters are ported from AviSynth. Many, but not all, AviSynth filters use SIMD to gain some speed; I don't know how much exactly. High-quality deinterlacers, sharpeners and denoisers seem to be the most CPU-intensive filters. I couldn't enable much post-processing back in the days of my Celeron CPU, because almost all of it was struggling just to decode SD video. But this processing is more and more being moved to dedicated logic, and I think much of it could be sped up by the GPU if given a chance. SIMD is easier to program for, though, it seems.

2-4x smaller than a HW encoder? This is most likely not a super-naive encoder we're talking about, so that's probably with very high quality settings on x264. And HW encoder quality will only improve with time. Do you really think it's worth wasting 1+ hour of battery life to encode 20 minutes of video twice as efficiently? I'm skeptical; there certainly are cases where it is, but that's hardly a mainstream application.
I don't expect the HW encoder to be more advanced than Badaboom 1.2.1, which IIRC already does Main or High profile. And its quality is still horrible for a given bitrate.

Not an unbiased source, but it is the best-made comparison I could find: http://x264dev.multimedia.cx/wp-content/uploads/2009/08/quality_chart1.png Details on the comparison set-up are here. Theora can do better than this, for example, but it is "buggy" right now.

The x264 veryfast preset beats Badaboom by a factor of almost 3. You could roughly say that Badaboom needs close to 3 times the bitrate to match the quality of the x264 veryfast preset. Better compression algorithms help more with anime than with real-life footage sources, but on the other hand anime doesn't benefit as much from psycho-visual optimizations (like Psy-RDO and AQ) as real-life footage does. And at least he didn't use Touhou like he did in a previous comparison. Note that this last one is from before a patch that improved x264's Touhou compression by up to 70%. :p

And I'm a video compression freak, so my views on these subjects may not (and probably do not) represent how the average customer will use their Cortex-A9 CPUs... but nonetheless the difference is there, if one doesn't mind spending time to compress more. If I needed to upload some video over a 3G network (especially on limited plans), I would like the option to take more time encoding to reach a lower file size.

Heh that's an interesting/amusing point! :D I wouldn't do that on a netbook though unless I could make it run entirely off the power plug rather than the battery, because batteries do degrade over time (i.e. decreased battery life).
I can take out the battery on most notebooks to avoid this. But I don't know how smartbooks, for example, will work... They seem like much more closed platforms... that is one thing I don't like about the direction we are currently heading. If I can't change the OS on my smartbook, or set up a dual-boot, I will think very hard about whether or not to get one, as I can do that on any netbook.
 
None, but if what you care about is a clean die shot, I just edited my first post ;)
Assuming the 49mm² die size is correct (Anand isn't the best source, and it's suspiciously near 7x7, although it makes sense given the specs) then each individual Cortex-A9 core (including L1/FPU, excluding L2 I/F, PTM, NEON, etc.) takes only 1.3mm²! Let me repeat that again: 1.3mm². Including all that stuff (except NEON presumably ofc), the dual-core+L2 takes ~7.25mm² (also keep in mind the L2 is reused as a buffer for video etc.)
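
To put that in perspective, and assuming those numbers hold: the two A9 cores alone are 2 x 1.3mm² = 2.6mm², or barely 5% of a 49mm² die, and even the full dual-core+L2 block at ~7.25mm² is only around 15% of the chip.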

What is your estimate of area of NEON (for a single core, ofc) at 40nm?
 
The iPhone market (not to mention the Open Pandora community for emulators and applications) can make good use of NEON, and its market share causes it to influence mobile development for the whole market.

With an SGX535 GPU clocked at 150 MHz, a VXD, and a 600 MHz Cortex-A8 with NEON, Apple's got the highest performing mobile application processor all to themselves in the 3G S, and they're about the only company that will implement PowerVR at a competitive pace to nVidia's refresh cycle for Tegra.

Unfortunately, they don't take advantage of SGX's tradeoff of performance for extra DirectX 9+ level functionality.
 
I'm more inclined to go with Arun's annotation than NVIDIA's :D
 
I'm more inclined to go with Arun's annotation than NVIDIA's :D
So basically, you'd need to do the equivalent of a brain scan - get an IR camera, set the chip to do different tasks, and see which parts get warm :)
 