NVIDIA Tegra Architecture

The problem is where, in NV's roadmap and that recent public statement, a performance increase would be an "up to" figure and where it would be an "average".

Roadmaps are funny things. You sneeze at them and they change.

The older roadmaps showed a 5x difference (SoC level/theoretical maximum) between T3 and T2 and a 2x difference between T4 and T3. It can hardly be an "up to" figure in the first case and an "average" in the second.

Either NV is just blowing smoke to surprise its competition, or the performance increase for Wayne is far more modest than many of us expected.

Or, marketing just likes to throw numbers around....

From what I recall, the 5x for T3 broke down as:

2*A9@1GHz vs. 4*A9@1.5GHz = 2.5x
8 GPU ALU lanes@333MHz vs. 12 GPU ALU lanes@520MHz = 2+x
and the remainder to reach the 5x was probably credited to the 5th CPU companion core
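Re-running that arithmetic as a quick sketch (with the clocks exactly as quoted the CPU factor actually comes out closer to 3x; the ~2.5x presumably assumes the lower ~1.3GHz clock T3 actually runs at with all four cores active):

```python
# Rough back-of-the-envelope scaling, using the figures quoted above.
# These are theoretical peak ratios only (units x clock); real workloads
# obviously won't scale like this.

def throughput_ratio(units_new, clock_new, units_old, clock_old):
    """Naive peak-throughput ratio: units * clock."""
    return (units_new * clock_new) / (units_old * clock_old)

# CPU: 2x A9 @ 1.0GHz (T2) vs. 4x A9 @ 1.5GHz (T3)
cpu = throughput_ratio(4, 1.5, 2, 1.0)       # = 3.0 (closer to ~2.5x if the quad runs at ~1.3GHz)

# GPU: 8 ALU lanes @ 333MHz (T2) vs. 12 ALU lanes @ 520MHz (T3)
gpu = throughput_ratio(12, 0.520, 8, 0.333)  # ~2.34

print(f"CPU ~{cpu:.1f}x, GPU ~{gpu:.1f}x")
```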

The point isn't how realistic any of the above is; we all know how marketing works in that regard, and those funky "up to 5x or more" figures can be found nearly everywhere, not just at NV. What raises a question mark is the 2x SoC performance claim for T4 vs. T3, if it follows the same reasoning. As I said, either it's a nasty trick to fool everyone and it's actually an average increase, or it's merely a shrunk T3 at higher frequencies. I still consider the latter scenario unlikely, but the question mark remains with that kind of vague marketing parlance.

Given the marketing emphasis on core count, I'd say staying quad-core has a lot to do with how conservative the numbers are this time.
 
Why not? Look at GLBenchmark 2.5: it's a much more realistic bench than 2.1 ever was, and it places Tegra 3 quite well compared to the other SoCs.

Close the gap?
Performance leader or not, Tegra 3 is the current platform of choice for anything 3D on Android. OUYA and that tablet with a gamepad are using Tegra 3, not a Snapdragon S4 or Exynos 4412.

Even the streaming apps have exclusive versions for Tegra 3 with more functionality; it's crazy.
Yes, if for instance Tegra 4 had advanced API support, like OpenGL ES 3.0, then that would give it a massive advantage over every other SoC. Games really would look like they're from a different class.

But I just think Mali-T604, Rogue, Adreno 320 and SGX544 will have the performance advantage, and in some cases maybe a feature advantage as well if it isn't a new uarch.

When I say close the gap, I mean next year, not this year.
 
GLBenchmark 2.5 sounds very ALU intensive. If you look at the results, the 543MP4@250MHz in the iPad 3 scores 2739 frames at the moment with 32 GFLOPs, and the ULP GF in T3 scores 1200 frames with 12.5 GFLOPs. A 2.3x score difference for a 2.5x GFLOPs difference is hardly a coincidence.
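Running the numbers quickly (using only the scores and GFLOPs figures quoted above):

```python
# Quick sanity check of the frames-per-GFLOP comparison above
# (GLBenchmark 2.5 offscreen scores as quoted from the database).
ipad3_frames, ipad3_gflops = 2739, 32.0    # SGX543MP4 @ 250MHz in the iPad 3
t3_frames, t3_gflops = 1200, 12.5          # ULP GeForce in Tegra 3

score_ratio = ipad3_frames / t3_frames     # ~2.28x
gflops_ratio = ipad3_gflops / t3_gflops    # 2.56x

print(f"score ratio ~{score_ratio:.2f}x vs. GFLOPs ratio ~{gflops_ratio:.2f}x")
print(f"frames per GFLOP: iPad 3 ~{ipad3_frames / ipad3_gflops:.0f}, Tegra 3 ~{t3_frames / t3_gflops:.0f}")
```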

I was referring to real world gaming benchmarks, not ALU intensive synthetic benchmarks. The reality is that differences in GFLOP throughput between different GPU architectures do not correlate well with differences in real world gaming performance. This has been shown time and time again.

For example, the HD 6970 had almost 2x the GFLOP throughput of the GTX 580, and yet the GTX 580 was as fast or even significantly faster in most gaming benchmarks.

As another example, the SGX543MP2 in the iPhone 4S has 8x the GFLOP throughput of the SGX535 in the iPhone 4, and yet the difference in Unreal Engine 3 gaming performance is closer to 2x (http://images.anandtech.com/graphs/graph4971/41966.png).


Besides that, keep in mind that Wayne is supposed to range from clamshells down to superphones, with quite a difference in power consumption per device. 64 GFLOPs is anything but impressive in that light.

Let's be realistic here. These are mobile GPUs. These are not high performance computing machines. Comparing mobile GPU performance based primarily on theoretical FLOPS throughput is downright inane given the current usage model of mobile devices. Don't you think it would be far more fruitful to see benchmarks of mobile games and mobile [GPU-accelerated] applications rather than synthetic benchmarks and theoretical FLOPS when looking at real world mobile GPU performance?
 
As another example, the SGX543MP2 in the iPhone 4S has 8x the GFLOP throughput of the SGX535 in the iPhone 4, and yet the difference in Unreal Engine 3 gaming performance is closer to 2x (http://images.anandtech.com/graphs/graph4971/41966.png).
It should be pointed out that 50 fps is sufficiently close to the vsync limit that you'd expect a significant fraction of frames to be affected by it.

It does highlight that running the same content n times faster isn't the point, though. Who wants to run Doom at a million fps? :D
 
TI does, since nothing else has been announced. They do have a Series6 license, but there is no evidence that suggests it will be used with the OMAP 5 series. Given that TI is using a 5-series GPU in late 2012/early 2013, I think it's more likely that OMAP + Rogue is a late-2013 product that will be competing with Tegra 5.

OMAP4 contains Series5 and Series5XT GPU IP. As for actual appearances in final devices, let's see when Wayne appears first, and then we can speculate on give or take another year on top of that for its successor.
 
I was referring to real world gaming benchmarks, not ALU intensive synthetic benchmarks.

Future mobile games are going to increase significantly in shader complexity.

The reality is that differences in GFLOP throughput between different GPU architectures do not correlate well with differences in real world gaming performance. This has been shown time and time again.
Nothing against that; however, the indication from a synthetic benchmark that tries to "foresee" the future to some degree isn't entirely worthless either.

For example, the HD 6970 had almost 2x the GFLOP throughput of the GTX 580, and yet the GTX 580 was as fast or even significantly faster in most gaming benchmarks.
Did it not strike you that you're comparing vector vs. "scalar" ALUs above? In my comparison above, all the GPUs compared do in fact have vector ALUs, irrespective of whether they are USC ALUs or not. The majority, if not all, of the IHVs involved in GPU development for the small form factor market will IMHO have "scalar" ALUs for the coming Halti generation of GPUs. And to get back to the initial point: if the highest variant of Wayne tops out at "only" 64 GFLOPs, while in comparison IMG with its GPU IP can range at the moment from 100 GFLOPs up to 1 TFLOP, there's no efficiency magic wand that can close that gap, and that's exactly why 64 GFLOPs don't sound like all that much. And no, the whole enchilada doesn't come down to just sterile GFLOPs comparisons; it gets even hairier when you start comparing achievable real-time triangle rates, texel/pixel/Z fillrates and what not, especially the latter on a TBDR.

As another example, the SGX543MP2 in the iPhone 4S has 8x the GFLOP throughput of the SGX535 in the iPhone 4, and yet the difference in Unreal Engine 3 gaming performance is closer to 2x (http://images.anandtech.com/graphs/graph4971/41966.png).
Vsync ahoi.

Let's be realistic here. These are mobile GPUs. These are not high performance computing machines. Comparing mobile GPU performance based primarily on theoretical FLOPS throughput is downright inane given the current usage model of mobile devices. Don't you think it would be far more fruitful to see benchmarks of mobile games and mobile [GPU-accelerated] applications rather than synthetic benchmarks and theoretical FLOPS when looking at real world mobile GPU performance?
Yes, of course it would be more fruitful to see real game benchmarks, but I don't think you'd find all that many mobile games out there where the MP4 in the iPad 3 wouldn't trounce any ULP GF in T3 by a significant margin.

GLBenchmark 2.5 happens to be ALU bound as a prediction for future mobile games; if you're willing to bet that they'll instead be, for example, fillrate bound, a SGX543MP4 with its 8 TMUs is still going to win by a significant margin. It's not that 2.5 doesn't tax fillrates, geometry, bandwidth or what not; the primary emphasis just falls on higher shader complexity for a reason.
 
Vsync ahoi.

Ahoi no :D Vsync explains part of the difference, but memory bandwidth explains part of the difference too. You do realize that the improvement in memory bandwidth in the iPhone 4S (with SGX543MP2) vs. the iPhone 4 (with SGX535) is 2x, even though GFLOP throughput increases by 8x?

Future mobile games will obviously have increasingly advanced visual effects, but game developers will still need to target the lowest common denominator. In the article comparing the iPhone 4S to the iPhone 4, AnandTech stated: "Because of the lower hardware target for most iOS games and forced vsync I wouldn't expect to see 2x increases in frame rate for the 4S over the 4 in most games out today or in the near future." So if anything, the Unreal Engine benchmark represented the best case scenario for the iPhone 4S GPU vs. the iPhone 4 GPU with respect to the gaming performance difference at the time.

We know that Rogue will have a minimum of 100 GFLOP throughput (and will eventually scale to 1 TFLOP, although certainly not in 2013 during the timeframe of Wayne!). So if the best NVIDIA can do with ULP Geforce is 67 GFLOPS in 2013 (and that remains to be seen right?), then that may not seem impressive in terms of GFLOP throughput, but it certainly doesn't mean that the ULP Geforce will not be impressive as a mobile gaming device. Vec vs. scalar ALU architecture is not the issue (hell, we don't know anything about the Wayne GPU architecture at this time anyway, right?). The issue is that differences in GFLOP throughput between different GPU architectures simply do not correlate well with differences in actual gaming performance. And for the record, I never stated that synthetic GPU benchmarks serve no purpose. They are certainly of interest in comparing and contrasting different GPU architectures. But there is no substitute for real world gaming benchmarks. At the end of the day, it is simply mind-boggling that we would have a discussion about not being impressed based on a calculated theoretical GFLOP number, but at the same time we would not be impressed that a mobile GPU "core" count increases by more than 5x from one year to another (assuming the rumor is true in the first place).

Yes, of course it would be more fruitful to see real game benchmarks, but I don't think you'd find all that many mobile games out there where the MP4 in the iPad 3 wouldn't trounce any ULP GF in T3 by a significant margin.

Here is a quote from a recent interview with Stam:

Question: How do you rate your chips against Apple's A5X?

Answer: One of the biggest differences is in user gaming experience. The combination of Tegra 3's quad CPU cores, 12 GPU cores, the ability to process higher levels of geometric complexity, and excellent drivers contribute to the Tegra 3's better gaming experience over A5X.

So at the end of the day, while I'm sure even NVIDIA would agree that the A5X GPU can achieve higher raw framerates in most situations, they feel that the overall gaming experience is better with Tegra 3. And Tegra 3 is a much smaller SoC with strict power consumption limits for use in smartphones, whereas A5X is a much bigger SoC with less strict power consumption limits for use in a tablet.
 
OMAP4 contains Series5 and Series5XT GPU IP. As for actual appearances in final devices, let's see when Wayne appears first, and then we can speculate on give or take another year on top of that for its successor.

OMAP 4470 has not made it into a final device yet, and we are nearing Q4. So I don't think I'm being unrealistic if I say that we won't see an OMAP with a Series6 GPU before Q4 2013. Time to market is not exactly TI's best strength.

And obviously we don't know when Wayne is being released. But again, going by past launches I think it's likely we will see it in Q1 2013. So I think it's definitely competitive with the versions of OMAP and Adreno that will be out in that same timeframe. The dark horse is ST-Ericsson. They have a monster; the only question is whether they can get some design wins.
 
Ahoi no :D Vsync explains part of the difference, but memory bandwidth explains part of the difference too.
No it mostly doesn't.

The worst case for a 2 TMU TBDR is when you saturate both the two TMUs and the two ROPs with 32-bit RGBA8888 data. That's 16 bytes per cycle or 3.2GB/s at 200MHz (which I'm using as a reference number only because I like nice round numbers, alright?) which is identical to the bandwidth you'd get for a 64-bit 400MHz LPDDR1 memory bus. Even with the extra bandwidth for geometry loading/binning, framebuffer display, and various small things, you'd be hard pressed to be very bandwidth limited. The same calculation would apply to a lesser extent on a 4 TMU TBDR if you double the memory speed.
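In sketch form, using only the same round numbers as above (nothing beyond that arithmetic):

```python
# Sketch of the worst-case bandwidth estimate above for a 2-TMU/2-ROP TBDR
# at the 200MHz reference clock, with every access being 32-bit RGBA8888 (4 bytes).

def worst_case_gpu_bw(tmus, rops, clock_mhz, bytes_per_access=4):
    """Peak bytes/s if every TMU read and ROP write hits memory every cycle."""
    return (tmus + rops) * bytes_per_access * clock_mhz * 1e6

def bus_bw(bus_bits, transfer_rate_mhz):
    """Peak bytes/s of a memory bus: width in bytes x transfer rate."""
    return (bus_bits / 8) * transfer_rate_mhz * 1e6

gpu = worst_case_gpu_bw(tmus=2, rops=2, clock_mhz=200)  # 3.2 GB/s
mem = bus_bw(bus_bits=64, transfer_rate_mhz=400)        # 3.2 GB/s

print(f"GPU worst case: {gpu / 1e9:.1f} GB/s, 64-bit @ 400MHz bus: {mem / 1e9:.1f} GB/s")
```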

And of course in the real world you're not going to use RGBA8888 textures (except for GUIs) and you're not going to manage to saturate your TMUs and ROPs, and especially not both at the same time. The only case where bandwidth can really matter in such a hypothetical system is when you're doing something bandwidth-heavy on the CPU at the same time, which is certainly the case for some games but not all of them.

The reality is that the iPad 2 is 7.8x faster than the iPad in GLBenchmark 2.1 offscreen and that the iPhone 4S is 6.4x faster than the iPhone 4 in the same benchmark. I certainly wouldn't claim that GLBenchmark 2.1 is an amazing benchmark (GLB2.5 is significantly better in my opinion, but 2.1 is still massively better than most of the so-called 'benchmarks' on the Android market that I had the... pleasure of analysing recently), but in this case I suspect it might be slightly *more* bandwidth limited than most workloads, for reasons I'm not sure I can publicly talk about (sorry!).

We know that Rogue will have a minimum of 100 GFLOP throughput (and will eventually scale to 1 TFLOP
Not all configurations of Rogue are necessarily above 100 GFLOPS. It's a scalable architecture :)

But there is no substitute for real world gaming benchmarks. At the end of the day, it is simply mind-boggling that we would have a discussion about not being impressed based on a calculated theoretical GFLOP number, but at the same time we would not be impressed that a mobile GPU "core" count increases by more than 5x from one year to another (assuming the rumor is true in the first place).
While I obviously agree that real world gaming benchmarks are a very good thing and GFLOPS are only one of many important metrics, it's still an important number. The difference between GFLOPS and 'cores' is that the former does require a certain minimum amount of silicon to implement (even if you need more silicon to be able to use it efficiently, e.g. single issue vs co-issue) while the latter has no real definition and therefore no real cost.

Here is a quote from a recent interview with Stam:
Well, what else is he going to say? I'm also pretty sure that the performance of SGX devices in GLBenchmark 2.5 disproves his claim that Tegra is better at higher levels of geometric complexity... ;)
 
No it mostly doesn't.

If not bandwidth, then there has to be something else in play. Surely you don't expect the iPhone 4S with vsync disabled to achieve anything even remotely close to 176fps using Unreal Engine 3, right?

Not all configurations of Rogue are necessarily above 100 GFLOPS. It's a scalable architecture :)

I know that it is a scalable architecture, but the VP of Marketing indicated on video at CES that Rogue cores start at around 100 GFLOPS and scale up to the TFLOP range.

Well, what else is he going to say?

Sure, it is more or less expected that he would say that. But I think it would be hard for anyone to justify that Tegra 3 tablets do not offer a quality mobile gaming experience, even compared to best-of-breed tablets such as the iPad. Considering that this is an SoC designed first and foremost to fit into the power envelope of a smartphone, not too shabby.
 
If not bandwidth, then there has to be something else in play. Surely you don't expect the iPhone 4S with vsync disabled to achieve anything even remotely close to 176fps using Unreal Engine 3, right?
Depends on the workload, I guess; it's likely that an AAA dev like Epic would optimise very heavily towards the actual platform, so in the case of the SGX535 that might mean as few ALU operations as possible and, very importantly, using LowP (10-bit integer) for most pixel shaders. LowP isn't actually any faster on SGX543 compared to the original SGX, so SGX543MP2 would only have 4x as much ALU performance rather than 8x.
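To put rough numbers on that (the per-precision ratios here are just illustrative assumptions built from the 8x and 4x figures above, not measured specs):

```python
# Rough sketch: if LowP (10-bit integer) ops run at the same rate per core on
# SGX543 as on the original SGX, a LowP-heavy shader only sees ~4x scaling on
# SGX543MP2 instead of the quoted 8x peak for full-precision work.

full_precision_ratio = 8.0   # quoted peak GFLOP ratio, SGX543MP2 vs. SGX535
lowp_ratio = 4.0             # assumed ratio if LowP throughput per core is unchanged

def effective_speedup(lowp_fraction):
    """Time-weighted blend of LowP and full-precision ALU work."""
    old_time = 1.0
    new_time = lowp_fraction / lowp_ratio + (1.0 - lowp_fraction) / full_precision_ratio
    return old_time / new_time

for f in (0.0, 0.5, 0.9):
    print(f"{int(f * 100)}% LowP work -> ~{effective_speedup(f):.1f}x ALU speedup")
```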

BTW I'm not saying bandwidth won't potentially be a serious limiter in the future as GPU performance continues to increase faster than memory technology, but I simply don't think it's that big of an issue on today's platforms, especially with a TBDR. (If you compare different end devices with different DRAM speeds, it's clear that the 32-bit memory bus has a real impact on Tegra 3 GPU performance, although even then it's usually not a huge difference at current performance levels.)

I know that it is a scalable architecture, but the VP of Marketing indicated on video at CES that Rogue cores start at around 100 GFLOPS and scale up to the TFLOP range.
I can't speak for TKS, but he probably only meant that the first variant would exceed 100 GFLOPS.

Sure, it is more or less expected that he would say that. But I think it would be hard for anyone to justify that Tegra 3 tablets do not offer a quality mobile gaming experience, even compared to best-of-breed tablets such as the iPad. Considering that this is an SoC designed first and foremost to fit into the power envelope of a smartphone, not too shabby.
Sure, I'd much rather buy a Tegra 3 tablet than most alternatives in today's market; it's a solid overall platform, and the exclusive games are a very nice bonus if you're personally interested in a specific one.

I'm far from convinced that Tegra 3 was 'designed first and foremost to fit into the power envelope of a smartphone' though. Everything I've ever heard indicates that they had tablets in mind from the very start of the design process. From my discussions with Mike Rayfield and Phil Carmack at MWC11, I suspect they simply underestimated the GPU performance that would be available in the same timeframe. They both seemed honestly surprised when I told them Tegra 3 definitely wouldn't be the fastest GPU of 2011. So I'm assuming they won't make the same mistake again, or at least I hope so; competition is fun :)
 
I'm far from convinced that Tegra 3 was 'designed first and foremost to fit into the power envelope of a smartphone' though.

Tegra 3 was designed specifically to fit into the same power envelope as Tegra 2. So while NVIDIA did intend to use the Tegra 3 SoC in both tablets and smartphones, due to the strict power consumption requirements of the design, there was no way for them to specifically target tablets/clamshell devices. On the other hand, the A5X SoC was designed specifically to fit into the power envelope of a tablet, so power consumption requirements were not as strict. Another factor to consider is that NVIDIA's partners may have been extremely cost conscious too, so having a larger, hotter, more powerful, and more expensive SoC may not have been too palatable to them.
 
It does highlight that running the same content n times faster isn't the point, though. Who wants to run Doom at a million fps? :D

Yes, this is exactly right. So the clear goal with mobile devices is to have richer and more advanced visual effects, rather than simply higher raw fps. And achieving higher performance is certainly important too: the higher the performance of mobile devices, the more visual effects can be added to games.

One nice game to show off advanced visual effects on Tegra 3 is Glowball. Glowball [Part 2] uses dynamic lighting, caustics, fog, etc. to create a richer looking underwater environment. The movement of seaweed, bubbles, etc. is even simulated using all four CPU cores. http://www.youtube.com/watch?v=C30ShWQm5pI&feature=relmfu
 
Tegra 3 was designed specifically to fit into the same power envelope as Tegra 2.
Same average real-world target power, yes, but the power at peak or under heavy load is significantly higher.
So while NVIDIA did intend to use the Tegra 3 SoC in both tablets and smartphones, due to the strict power consumption requirements of the design, there was no way for them to specifically target tablets/clamshell devices. [..] Another factor to consider is that NVIDIA's partners may have been extremely cost conscious too, so having a larger, hotter, more powerful, and more expensive SoC may not have been too palatable to them.
Disagreed on the first point, agreed on the second. They could have undervolted the GPU and achieved even lower power consumption. There is a fundamental trade-off between area and power for parallel processors (see one of my favourite presentations: http://www2.lirmm.fr/arith18/slides/ARITH18_keynote-Knowles.pdf) and I think it's clear that NVIDIA was more cost conscious than some competitors with both Tegra 2 and Tegra 3. That has relatively little to do with the kind of end-device (e.g. $199 Nexus 7 vs $499 HTC One X) and more to do with strategic and financial imperatives IMO. Also remember that a faster GPU would have required a more expensive 64-bit memory bus given that Tegra is a classic IMR without framebuffer compression.
 
My point is that, due to cost/resource allocation/etc., NVIDIA had to design the Tegra 3 SoC to work well with the lowest common denominator (which happens to be smartphones). They did not have the luxury of releasing one higher performance SoC targeted for tablets/clamshells (with less strict avg. power consumption requirements), and one lower performance SoC targeted for smartphones (with more strict avg. power consumption requirements). That is why most Tegra 3 tablets and smartphones perform so similarly to one another.
 
Yes, this is exactly right. So the clear goal with mobile devices is to have richer and more advanced visual effects, rather than simply higher raw fps. And achieving higher performance is certainly important too: the higher the performance of mobile devices, the more visual effects can be added to games.

One nice game to show off advanced visual effects on Tegra 3 is Glowball. Glowball [Part 2] uses dynamic lighting, caustics, fog, etc. to create a richer looking underwater environment. The movement of seaweed, bubbles, etc. is even simulated using all four CPU cores. http://www.youtube.com/watch?v=C30ShWQm5pI&feature=relmfu

And just to add to this, here is Glowball [Part 1]: http://www.youtube.com/watch?v=eBvaDtshLY8&feature=relmfu . Note that due to the CPU-simulated environment, the performance of the game slows down tremendously when two of the four CPU cores are shut down, to the point where the game is no longer very playable on a dual CPU core device.
 
Since when is Glowball a game? On one side you object to popular synthetic benchmarks (which in Kishonti's case are developed with the support/input of the majority of IHVs, if not all of them), and on the other side an IHV-specific synthetic benchmark is supposed to prove what exactly?

If NV were to drive developers to concentrate game resources more on CPUs than on GPUs, just like in Glowball, they'd be in quite an awkward position in the future.
 
OMAP 4470 has not made it into a final device yet, and we are nearing Q4. So I don't think I'm being unrealistic if I say that we won't see an OMAP with a Series6 GPU before Q4 2013. Time to market is not exactly TI's best strength.

There are Archos G10 results in the Kishonti database, so I'm not sure whether it's TI's fault here or their partners'.

And obviously we don't know when Wayne is being released. But again, going by past launches I think it's likely we will see it in Q1 2013. So I think it's definitely competitive with the versions of OMAP and Adreno that will be out in that same timeframe. The dark horse is ST-Ericsson. They have a monster; the only question is whether they can get some design wins.
Do past launches also include the T3-related delays? Neither NVIDIA nor anyone else is immune from possible delays. By the way, does "we will see it" stand for a SoC announcement or final device availability? ;)

As for ST-E, someone shoot me, but I'm more confident that others will be on shelves with a Rogue GPU integrated than that they will. If you want to talk about time to market, have a look at when the A9600 was supposed to sample and when they initially projected mass production for it.
 
Glowball is okay as a tech demo but isn't a great example of resource balancing for a game. At least the GLBenchmark stuff resembles what real games do.
 
There are Archos G10 results in the Kishonti database, so I'm not sure whether it's TI's fault here or their partners'.

Let's concede that that is actually the case. Even so, OMAP 4470 was announced in Q1 2011 and scheduled for release in the first half of 2012. They would have to both announce OMAP 5XXX and release it well before the end of 2013 to be able to compete with T4.

Do past launches also include the T3-related delays? Neither NVIDIA nor anyone else is immune from possible delays. By the way, does "we will see it" stand for a SoC announcement or final device availability? ;)

Were they actually delays by NVIDIA? Depends on whether you believe Charlie or not. It's not impossible that Asus wanted a Christmas launch for their tablet. Either way, NVIDIA beat all their competitors in bringing out a next-gen chip. Rumour is that T4 taped out in December 2011 and sampled in the first half of this year. So I would be bold enough to claim we will see it in products at MWC (and I'm prepared to eat crow if I'm wrong :LOL:).

As for ST-E, someone shoot me, but I'm more confident that others will be on shelves with a Rogue GPU integrated than that they will. If you want to talk about time to market, have a look at when the A9600 was supposed to sample and when they initially projected mass production for it.

Oh, definitely. But I was talking more about competing with NVIDIA; Apple might be first with Rogue, but they play in their own division. As for ST-E's delays, I'm not sure the blame can be put solely on them. Everyone has had issues with 28nm; bringing out both Rogue and Cortex-A15 on 28nm as early as they planned is no small task.
 