Tegra 3 officially announced; in tablets by August, smartphones by Christmas

Interesting. Do you know whether the driver improvements apply across all scenes? What does PRO show?

The iPad 2 seems to have gotten the most recent driver increases for the moment, around 30% for both Egypt and PRO. PRO is more of an OGL_ES1.1 test if I'm not mistaken, and it's abysmally vsync-limited on iPad 2; otherwise it wouldn't go from 59fps@1024*768 onscreen to 148fps@1280*720 offscreen.
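Just to make the vsync point concrete, here's a rough back-of-the-envelope sketch (my own arithmetic using the numbers above; it assumes fill rate scales roughly with resolution, which is obviously a simplification):

[CODE]
/* Rough check of the vsync argument: if the GPU sustains 148 fps at
 * 1280x720 offscreen, then on raw throughput alone the onscreen 1024x768
 * run should land far above 60 fps, so the ~59 fps result is the display
 * refresh cap, not the GPU. */
#include <stdio.h>

int main(void)
{
    double offscreen_px_per_s = 1280.0 * 720.0 * 148.0;  /* ~136 Mpix/s */
    double onscreen_pixels    = 1024.0 * 768.0;
    double implied_fps        = offscreen_px_per_s / onscreen_pixels;

    printf("implied onscreen fps without vsync: ~%.0f\n", implied_fps); /* ~173 */
    return 0;
}
[/CODE]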

The embedded market needs more advanced synthetic benchmarks than PRO, and mobile games finally need timedemos or some other sort of built-in measurement.

Arun,

Yes, after Logan the performance improvements literally explode, but could it be that that road-map has mixed both Tegra and Denver SoCs in the end?
 
Arun,

Yes, after Logan the performance improvements literally explode, but could it be that that road-map has mixed both Tegra and Denver SoCs in the end?
I'm fairly confident no Project Denver-based SoC was on the roadmap - that comes even later (remember the first chip with Denver should be Tesla not Tegra). Also, I have strong reasons (which I won't get into) to believe NEON plays a big part in NVIDIA's 5x calculation for Kal-El, which implies the CPU must also play a big part in the other numbers. Given that I'm expecting Cortex-A15 to be Vec4 FMA+ADD instead of Vec2 MADD, that gives us 3x the flops per clock, and combined with a higher clock speed it's pretty easy to justify the 5x increase for Logan - keeping in mind all these numbers are very rough and usually rounded upwards (and don't necessarily even make sense).
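For what it's worth, here's the flops-per-clock arithmetic spelled out (a sketch of my own assumptions above, not NVIDIA's math; the clock ratio in particular is just an illustrative guess):

[CODE]
/* Per-clock NEON flops sketch. Assumes, as argued above, that A9 NEON does
 * one Vec2 MADD per clock while A15 does a Vec4 FMA plus a Vec4 ADD per
 * clock. The 2.5/1.5 clock ratio is a hypothetical illustration only. */
#include <stdio.h>

int main(void)
{
    double a9_flops_per_clk  = 2 * 2;      /* Vec2 MADD: 2 lanes x (mul+add) = 4  */
    double a15_flops_per_clk = 4 * 2 + 4;  /* Vec4 FMA (8) + Vec4 ADD (4)    = 12 */
    double per_clock_gain    = a15_flops_per_clk / a9_flops_per_clk;     /* 3x   */
    double clock_ratio       = 2.5 / 1.5;  /* hypothetical A15 vs A9 clocks       */

    printf("per clock: %.1fx, with clocks: ~%.1fx\n",
           per_clock_gain, per_clock_gain * clock_ratio);  /* 3.0x and ~5.0x      */
    return 0;
}
[/CODE]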

Also I wonder if they could get LPDDR3 in Wayne - I think the timeframe fits and it should be cheap if they already need to support both LPDDR2 and DDR3, but we'll see.
 
Uhmm, wasn't Logan somewhere in the 75x league, with succeeding parts at even more insane multiples compared to Tegra2?
 
Uhmm, wasn't Logan somewhere in the 75x league, with succeeding parts at even more insane multiples compared to Tegra2?
No, Logan is 50x and Stark is 75x. Mind you, it's rather silly to be taking these numbers so seriously...
 
No, Logan is 50x and Stark is 75x. Mind you, it's rather silly to be taking these numbers so seriously...

Mind you, I never took and never will take NV's marketing numbers seriously; but it made a wee bit more sense to spin with the thought that they've mixed Tegra with Denver in that roadmap. At least for the latter, hypothetical 50x or 75x increases would be realistic.
 
I've been pretty consistent in claiming Wayne is very likely based on a quad-core Cortex-A9 again but (unlike OMAP5 or Krait) on a 28nm High-K process, very likely TSMC 28HPM. And yes, that probably does imply a tape-out around late Q4 2011 (as also implied by the roadmap slide in that German article - if it's more than one year behind Kal-El, it can't be taping out so soon).

The only evidence against 4xA9 come from Charlie ("NVIDIA WILL DIE, NO WAIT, THEY ARE ALREADY DEAD INSIDE!") and Theo Valich ("WAYNE IS OCTO[STRIKE]MOM[/STRIKE] CORE A15 AND CONSUMES NEGATIVE POWER"). Excuse me if I don't take them very seriously compared to all evidence to the contrary ;) (e.g. low performance improvement claims for Wayne and very high improvement claims for Logan).

nVidia is very secretive about these things. I suppose there isn't any indication out there about it.

I'm curious how high a Cortex-A9 could clock on 28HPM within a tablet TDP. Apparently there will be Kal-El SKUs with noticeably higher clock speeds (aka Kal-El+ still on 40nm) so assuming 1.8GHz on 40LPG, then 2.5GHz should be very reasonable on 28HPM. That could be reasonably competitive, especially in terms of marketing, versus a dual-core 2GHz A15.

Not really. There are many improvements in A15 beyond just the Dhrystone DMIPS numbers. The improvement to the FPU/NEON unit alone should dramatically increase performance in many applications. Hell, I dare say that an A15 on 28LPG would be significantly better in terms of perf/W as well as perf than a 2.5GHz 28HPM A9.

Of course, the die area for A15 is much larger, so there's always that.

BTW, talking of process tech, this GlobalFoundries press release implies the ST-Ericsson A9600 is on 28SLP (equivalent of TSMC 28HPL) rather than 28HPP (equivalent of TSMC 28HPM). So the A15 can (theoretically) hit 2GHz on 28LP SiON (OMAP5) and 2.5GHz on 28SLP High-K (A9600). I wonder how high it would clock on 28HP or 28HPM - TSMC mentioned 28HPM can handle 3GHz CPUs in their latest CC, but I'd be surprised if A15 couldn't hit a lot more than that (>3.5GHz?) if you targeted traditional PC TDPs (not that anyone is going to). And that implies it's basically as fast (clockspeed-wise) as Bulldozer and could do even better on Intel's process. The more I think about it, the more I wonder if ARM went too far and should have stuck with a slightly shorter pipeline.

I think ARM's ambitions for A15 go far beyond tablets/smartphones. So no, I don't think they went too far. Of course, thus far, no one has decided to use A15 for such high-performance applications; most of the major vendors targeting that market have spun their own designs.

Oh and while we're talking process - metafor, any idea how different 28HP and 28HPM are in terms of synthesis? I assume you'd basically be forced to redo everything but I don't really know. One reason why I assume NV would wait longer to integrate the baseband is that ICE9040 is on 28HP and it doesn't make sense for NV to use that process for their application processors. The problem obviously is that Icera uses some structured custom and that'd take some time to redo (and would delay the 20nm generation as well). Then again maybe RDR on 28nm means they're doing much less custom than they used to...

I don't know the details of 28HP. However, to respin a design on a new process isn't very difficult. Optimizing something for a process is different, however. Custom cells are another matter and that's usually the long-pole when it comes to shifting processes. However, again, one can do it in a relatively short time if one doesn't worry about having it done optimally.
 
The iPad 2 seems to have gotten the most recent driver increases for the moment, around 30% for both Egypt and PRO. PRO is more of an OGL_ES1.1 test if I'm not mistaken, and it's abysmally vsync-limited on iPad 2; otherwise it wouldn't go from 59fps@1024*768 onscreen to 148fps@1280*720 offscreen.

The embedded market needs more advanced synthetic benchmarks than PRO, and mobile games finally need timedemos or some other sort of built-in measurement.

There's the 3D Mark Mobile suite (Taiji, Hoverjet, etc.) that's actually semi-relevant in its complexity and variation of scenes. But unfortunately, they don't run on all platforms. I wonder how much of a difference the SGX drivers make in those benchmarks.
 
Let's not go wild with confidences :)

LOL :LOL:

There's the 3D Mark Mobile suite (Taiji, Hoverjet, etc.) that's actually semi-relevant in its complexity and variation of scenes. But unfortunately, they don't run on all platforms. I wonder how much of a difference the SGX drivers make in those benchmarks.

http://www.anandtech.com/show/4643/htc-evo-3d-vs-motorola-photon-4g-best-sprint-phone/6

Those should all be with the older drivers, judging from the GLBenchmark 2.0 results.
 
How am I supposed to know, unless Anand or some other website re-tests some of those devices with more recent drivers?

In the results at the link above, in Taiji/Hoverjet a 200MHz SGX540 is roughly on par with the 300MHz ULP GeForce, and the 300MHz SGX540 is a good notch above the latter.
 
How do mobile devices get more recent graphics drivers?

With OS updates? Have there been iOS updates for iPad 2 since release? I thought it was mostly waiting for iOS 5.
 
How do mobile devices get more recent graphics drivers?

With OS updates? Have there been iOS updates for iPad 2 since release? I thought it was mostly waiting for iOS 5.

Graphics driver updates come through OS updates. In this case, GLBenchmark.com's numbers were measured on iOS 5, even though it has not been released yet. You can see that on the 'System Information' tab.
 
Ailuros said:
Else I'd love to hear a halfway reasonable explanation of how you suddenly quadruple graphics performance with 50% more PS ALU lanes and a higher frequency, unless the latter ends up close to or over 1GHz.
It should be 100% more PS ALU lanes. T2 had 8 "cores" (4 VS, 4 PS), while T3 supposedly has 12 (4 VS, 8 PS).
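And to put a number on Ailuros' original question: with the PS lane count doubled rather than +50%, here's the clock a flat 4x pixel-shading claim would still imply. This is a sketch of my own, using the ~300MHz ULP GeForce figure mentioned earlier in the thread as the baseline:

[CODE]
/* Simple scaling sketch: if the PS ALU lane count doubles (4 -> 8), how much
 * clock would a flat 4x pixel-shading improvement still require? The 300MHz
 * baseline is the ULP GeForce clock mentioned earlier in the thread. */
#include <stdio.h>

int main(void)
{
    double lane_gain    = 8.0 / 4.0;               /* T3 vs T2 PS lanes        */
    double target_gain  = 4.0;                     /* the 4x marketing claim   */
    double clock_factor = target_gain / lane_gain; /* what the clock must add  */

    printf("needed clock: ~%.0f MHz (%.1fx of 300 MHz)\n",
           300.0 * clock_factor, clock_factor);    /* ~600 MHz, i.e. 2.0x      */
    return 0;
}
[/CODE]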
 
nVidia is very secretive about these things. I suppose there isn't any indication out there about it.
Actually there is one good piece of evidence that indicates Wayne is mostly a shrink of Kal-El:
http://seekingalpha.com/article/253357-nvidia-s-ceo-discusses-q4-2011-results-earnings-call-transcript?part=qanda said:
Jen-Hsun Huang

But going forward, the way to think about it is the general rhythm that you will see in the industry for companies that come up with new processors every year, you should expect to see two processors in the same node or so. You’ve noticed other companies that have used different ways of explaining the rhythm and some use a one process node change, one is a architecture change, the next one is a process node change. So basically, it's every other year for a new process node. And I think that rhythm is not a bad rhythm. I mean, that's basically how quickly the industry is changing process. So it stands to reason that Tegra 2 and Tegra Next or Kal-El would use the same process, and then the following ones would be a 28-nanometer deal.
I could be reading too much into it (repeating Intel's claims as an explanation doesn't mean they'll do the exact same thing), but it does seem to imply Wayne is still 4xA9/GeForce ULP while Logan is the big change. Then again things can change (they've already changed with the integrated modem 'Grey' being added to the roadmap) and maybe they'll change just the CPU or just the GPU part but not the other. I don't know, but I still think an incremental improvement is slightly more likely. There's also the marketing problem of probably going from a 4xA9 to a 2xA15 unless they make it heterogeneous and add 2xA9 or 2xA5 in there.

Not really. There are many improvements in A15 beyond just the Dhrystone DMIPS numbers. The improvement to the FPU/NEON unit alone should dramatically increase performance in many applications. Hell, I dare say that an A15 on 28LPG would be significantly better in terms of perf/W as well as perf than a 2.5GHz 28HPM A9.
Sure, you're basically talking about ~1.5x on integer and >2x on NEON (per clock), so NEON-heavy applications will look comparatively much better than integer on A15 vs A9.

I didn't mean a 2.5GHz 4xA9 on 28HPM was a better design decision than a 2GHz 2xA15 on 28LPG (i.e. OMAP5). I certainly think it's worse in nearly every way. All I meant is it's not all that bad and it's still pretty good from a marketing perspective so I don't expect NVIDIA to suffer much from it (for better or worse). Honestly I'm more worried about their GPU performance if they stick to the same GPU architecture for yet another generation.

I think ARM's ambitions for A15 go far beyond tablets/smartphones. So no, I don't think they went too far. Of course, thus far, no one has decided to use A15 for such high-performance applications; most of the major vendors targeting that market have spun their own designs.
I think I'd agree completely with you *IF* A15 supported 64-bit. As it is, I'd be surprised if it took these markets by storm, and they'd have been better off focusing slightly more on mobile. Then again, if it means less voltage overdrive by everyone to reach their clock targets (because they are more quickly TDP limited), it could still be as or even more power efficient I suppose (but slightly less area efficient, which hardly matters). Hard for me to say.

I don't know the details of 28HP. However, to respin a design on a new process isn't very difficult. Optimizing something for a process is different, however. Custom cells are another matter and that's usually the long-pole when it comes to shifting processes. However, again, one can do it in a relatively short time if one doesn't worry about having it done optimally.
Given that Icera is all about optimising for the process, that is indeed what I'm worried about. I'm not 100% sure about 40nm/28nm but Icera definitely used Prolific ProGenesis for their cell library on 90nm/65nm. I expect they're still using it and one of ProGenesis' selling points is that it helps to "create usable layout while design rules, SPICE netlists, and the cell architecture itself are still in flux". So assuming 28HP and 28HPM aren't too horribly different I assume the cells themselves wouldn't be a huge problem. They even claim that you can "easily modify cells or entire libraries after characterization or place-and-route testing and optimization". And since structured custom means they synthesise the wiring anyway, that shouldn't be a problem either. Hmm.

Yep obvious brainfart. 50% more ALU lanes for the entire core and 100% more PS ALU lanes.
The Tegra VS is ridiculously powerful compared to the PS (since there are a lot more pixels than vertices), so I think it's fair to say they doubled the performance of the part that actually matters.
 
Actually there is one good piece of evidence that indicates Wayne is mostly a shrink of Kal-El:
I could be reading too much into it (repeating Intel's claims as an explanation doesn't mean they'll do the exact same thing), but it does seem to imply Wayne is still 4xA9/GeForce ULP while Logan is the big change. Then again things can change (they've already changed with the integrated modem 'Grey' being added to the roadmap) and maybe they'll change just the CPU or just the GPU part but not the other. I don't know, but I still think an incremental improvement is slightly more likely. There's also the marketing problem of probably going from a 4xA9 to a 2xA15 unless they make it heterogeneous and add 2xA9 or 2xA5 in there.

I believe the "tock" will be Kal-el+. With the release date of Wayne, there's simply no way it'll be competitive if it's still A9, regardless of the frequency.

Sure, you're basically talking about ~1.5x on integer and >2x on NEON (per clock), so NEON-heavy applications will look comparatively much better than integer on A15 vs A9.

Much more in some cases. It isn't just about absolute throughput; most programs out there are not written well enough to saturate the typical pipeline, especially in Android. The fully out-of-order VFP operations will help dramatically in many cases (since the likes of JavaScript default to DP FP as their data type). The dual dividers will make most matrix operations (those that weren't hand-written; Dalvik seems to love doing divides) orders of magnitude faster.

On the integer side, the improvements may be less dramatic but I expect workloads with heavy load/store chains to be improved significantly as well as certain DSP-style integer (and 8-bit SIMD) operations.
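As a toy illustration of why the divider matters for that kind of generic (i.e. not hand-tuned) code, compare these two ways of scaling a matrix row; the first issues a divide per element, the second hoists a single reciprocal. This is just my generic C sketch, not anything Dalvik actually emits:

[CODE]
/* Toy example of the "loves doing divides" point: generic code often divides
 * per element (scale_row_naive), while hand-tuned code computes one
 * reciprocal and multiplies (scale_row_hoisted). On a core with a slow or
 * non-pipelined FP divider the naive form is dominated by divide latency,
 * which is exactly where a faster divider pays off. */
#include <stddef.h>

void scale_row_naive(double *row, size_t n, double norm)
{
    for (size_t i = 0; i < n; i++)
        row[i] = row[i] / norm;    /* one divide per element */
}

void scale_row_hoisted(double *row, size_t n, double norm)
{
    double inv = 1.0 / norm;       /* single divide */
    for (size_t i = 0; i < n; i++)
        row[i] = row[i] * inv;     /* multiplies only */
}
[/CODE]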

I didn't mean a 2.5GHz 4xA9 on 28HPM was a better design decision than a 2GHz 2xA15 on 28LPG (i.e. OMAP5). I certainly think it's worse in nearly every way. All I meant is it's not all that bad and it's still pretty good from a marketing perspective so I don't expect NVIDIA to suffer much from it (for better or worse). Honestly I'm more worried about their GPU performance if they stick to the same GPU architecture for yet another generation.

Fair enough. But "not that bad" doesn't seem to win it these days. Realistically, all you need is an 800MHz Cortex-A8 and most people would swear by it as the best thing since the last iThing. It's the benchmark geeks that nVidia has to win over, since they're the people buying and recommending Android devices.

I think I'd agree completely with you *IF* A15 supported 64-bit. As it is, I'd be surprised if it took these markets by storm, and they'd have been better off focusing slightly more on mobile. Then again, if it means less voltage overdrive by everyone to reach their clock targets (because they are more quickly TDP limited), it could still be as or even more power efficient I suppose (but slightly less area efficient, which hardly matters). Hard for me to say.

A15 fits many server applications that don't require a flat 64-bit address. From many of its features -- Hypervisor, extended addressing, 3rd level translation, a (relatively) sane secure/non-secure memory management -- it seems clear they had more than tablets/smartphones in mind. Whether or not that market will take off is another question, but it's pretty clear they are aiming for it.

Of course, considering that Windows 8 will run on ARM SoC's, it now doesn't seem so implausible to have ARM-powered desktops. It also makes the A15's microarchitecture relatively conservative in comparison to what's running on desktops today.

Given that Icera is all about optimising for the process, that is indeed what I'm worried about. I'm not 100% sure about 40nm/28nm but Icera definitely used Prolific ProGenesis for their cell library on 90nm/65nm. I expect they're still using it and one of ProGenesis' selling points is that it helps to "create usable layout while design rules, SPICE netlists, and the cell architecture itself are still in flux". So assuming 28HP and 28HPM aren't too horribly different I assume the cells themselves wouldn't be a huge problem. They even claim that you can "easily modify cells or entire libraries after characterization or place-and-route testing and optimization". And since structured custom means they synthesise the wiring anyway, that shouldn't be a problem either. Hmm.

I'm not very familiar with that, but it sounds like they're able to keep the cell footprint the same while modifying its contents. That certainly makes it easier to move to a different node, but I don't know that it would be area-efficient. But that depends on how good your circuit designers are.
 
I believe the "tock" will be Kal-el+. With the release date of Wayne, there's simply no way it'll be competitive if it's still A9, regardless of the frequency.
A 28nm Kal-El+? I was assuming it to be a simple clockbump but maybe. However then Logan would have to be 20nm to stick with two chips (excluding Grey) per generation. That doesn't seem likely at all in the 2H13 timeframe for real products they were targeting.

My understanding of the roadmap was that the "tock" would essentially be a 'mid-range and high-end' play whereas the 'tick' would target the 'high-end and ultra-high-end', and even after the 'tick' they'd keep selling the 'tock' in very large quantities (it's not a direct replacement). But with the addition of Grey to the roadmap, they've got a better mid-range play, so they don't need Wayne to be as cheap anymore. And maybe they didn't expect ST-Ericsson/Qualcomm to be so aggressive in terms of CPU performance before they saw the public roadmaps at MWC. I'm not sure they'd have the time to change from A9 to A15 if so, although with Wayne only taping-out in early 2012 they might. Lots of possibilities as always, which means speculation going all over the place...
Much more in some cases. It isn't just about absolute throughput; most programs out there are not written well enough to saturate the typical pipeline, especially in Android. [...]
Good point about the VFP and hardware division. I'm actually rather curious how heterogeneous architectures like the ST-Ericsson A9600 will handle hardware division since A9/A5 don't have it. I fear they might have to somehow convince the OS they don't support the new instructions in A15 which will hurt performance slightly.

BTW, on NEON: some key NEON workloads that are likely to be benchmarked in the future (e.g. x264) do scale very well with >2 cores, so that might mitigate the impact somewhat.

Fair enough. But "not that bad" doesn't seem to win it these days. Realistically, all you need is an 800MHz Cortex-A8 and most people would swear by it as the best thing since the last iThing. It's the benchmark geeks that nVidia has to win over, since they're the people buying and recommending Android devices.
True enough. I suppose it also depends on how fast Logan is released afterwards. Historically NVIDIA has taken about 18 months between major Tegra tape-outs, but AFAIK that was led entirely by US design centers. I saw an interview mentioning Logan would be the first Tegra chip to be handled primarily by their Indian design centers (at least in terms of physical implementation) so maybe it's being done more in parallel and it'll be out a bit faster. Still not enough to compensate for a potentially subpar Wayne though.

A15 fits many server applications that don't require a flat 64-bit address. From many of its features -- Hypervisor, extended addressing, 3rd level translation, a (relatively) sane secure/non-secure memory management -- it seems clear they had more than tablets/smartphones in mind. Whether or not that market will take off is another question, but it's pretty clear they are aiming for it.
Agreed but my fear is that not supporting 64-bit reduces the TAM enough that it's significantly less likely to generate enough momentum. And anyway I think Intel's Atom-based server roadmap is attractive enough power-wise that there's not much benefit to transitioning to another ISA (that's hardly ARM's fault though).

Of course, considering that Windows 8 will run on ARM SoC's, it now doesn't seem so implausible to have ARM-powered desktops.
As I pointed out in my ARM article, the expectations for desktops in terms of applications are completely different. I can easily believe ARM might eventually be fairly successful in the traditional 13-15" notebook market where they will indeed have to compete with Intel's Haswell and AMD's Bulldozer, but I can't believe they'll ever get any traction whatsoever in the desktop market (excluding the HTPC niche) unless you get a full x86-to-ARM transcoder.

I'm not very familiar with that, but it sounds like they're able to keep the cell footprint the same while modifying its contents. That certainly makes it easier to move to a different node, but I don't know that it would be area-efficient. But that depends on how good your circuit designers are.
Icera definitely doesn't do that for full node transitions, they redid all the structured custom work for 65nm and 40nm (otherwise they wouldn't have been relatively late to both nodes). Things obviously change too much. But I assume 28HP to 28HPM is much much more similar than 65GP to 40G, right? So maybe that's what they're doing.
 
Good point about the VFP and hardware division. I'm actually rather curious how heterogeneous architectures like the ST-Ericsson A9600 will handle hardware division since A9/A5 don't have it.

Seems more like a software question than a hardware one. The question would be whether you'd want to avoid scheduling that thread on the div-less cores or emulate it in software. If the thread gets pegged on a div-less core for a while, a reasonable approach would be to trap the instruction when it's executed and patch the program to call a VFP-based division routine instead, if VFP is available.
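Purely to illustrate the trap-and-emulate half of that idea (my own sketch, not how any shipping SoC actually handles it): on Linux the undefined instruction would surface as SIGILL in user space, and a handler could decode and emulate an A32 SDIV, assuming the standard ARM EABI mcontext layout and ARM (not Thumb) state. A real implementation would live in the kernel, handle UDIV, Thumb and conditional forms, and ideally patch or migrate the thread rather than trap every time.

[CODE]
/* Hypothetical sketch: emulate A32 SDIV on a core without hardware divide by
 * catching SIGILL. Assumes ARM EABI Linux (arm_r0..arm_pc contiguous in the
 * mcontext) and A32 state; UDIV, Thumb encodings and condition codes are
 * deliberately ignored to keep the sketch short. */
#include <signal.h>
#include <stdint.h>
#include <ucontext.h>

static void sigill_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = (ucontext_t *)ctx;
    unsigned long *regs = &uc->uc_mcontext.arm_r0;            /* r0..r15 */
    uint32_t insn = *(uint32_t *)uc->uc_mcontext.arm_pc;
    (void)info;

    /* A32 SDIV Rd, Rn, Rm: cond | 0111 0001 | Rd | 1111 | Rm | 0001 | Rn */
    if ((insn & 0x0FF0F0F0u) == 0x0710F010u) {
        uint32_t rd = (insn >> 16) & 0xF;
        uint32_t rm = (insn >> 8)  & 0xF;
        uint32_t rn =  insn        & 0xF;
        int32_t num = (int32_t)regs[rn];
        int32_t den = (int32_t)regs[rm];
        /* 64-bit divide avoids INT_MIN/-1 overflow; SDIV returns 0 on x/0 */
        regs[rd] = den ? (unsigned long)(uint32_t)((int64_t)num / den) : 0;
        uc->uc_mcontext.arm_pc += 4;                          /* skip the insn */
        return;
    }
    signal(sig, SIG_DFL);          /* anything else: let the default action run */
    raise(sig);
}

int main(void)
{
    struct sigaction sa;
    sa.sa_sigaction = sigill_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGILL, &sa, NULL);

    /* ... run code that may contain SDIV on a div-less core ... */
    return 0;
}
[/CODE]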

BTW, do you have a source that indicates that A5 doesn't have integer division? It seems like a strange omission, everything else considered.
 
A 28nm Kal-El+? I was assuming it to be a simple clockbump but maybe. However then Logan would have to be 20nm to stick with two chips (excluding Grey) per generation. That doesn't seem likely at all in the 2H13 timeframe for real products they were targeting.

I don't think I'd take the Intel model comments too seriously. They are likely far more influenced by 1. what their competition is doing and 2. engineering constraints than by any philosophy on when to release products.

My understanding of the roadmap was that the "tock" would essentially be a 'mid-range and high-end' play whereas the 'tick' would target the 'high-end and ultra-high-end', and even after the 'tick' they'd keep selling the 'tock' in very large quantities (it's not a direct replacement). But with the addition of Grey to the roadmap, they've got a better mid-range play, so they don't need Wayne to be as cheap anymore. And maybe they didn't expect ST-Ericsson/Qualcomm to be so aggressive in terms of CPU performance before they saw the public roadmaps at MWC. I'm not sure they'd have the time to change from A9 to A15 if so, although with Wayne only taping-out in early 2012 they might. Lots of possibilities as always, which means speculation going all over the place...

I can only comment on what would be competitive in 2012/2013. And it's not A9s, unless you're going for cheap chips...

Good point about the VFP and hardware division. I'm actually rather curious how heterogeneous architectures like the ST-Ericsson A9600 will handle hardware division since A9/A5 don't have it. I fear they might have to somehow convince the OS they don't support the new instructions in A15 which will hurt performance slightly.

Sadly, that's the case for many of A15's new instructions, including fused MAC. A15's implementation of FMA was an afterthought (grafted onto the chained implementation), so ARM's compiler doesn't like to -- rather, just doesn't -- spit out FMA instructions.

BTW, on NEON: some key NEON workloads that are likely to be benchmarked in the future (e.g. x264) do scale very well with >2 cores, so that might mitigate the impact somewhat.

Even so, the NEON throughput of A15 is 2x that of an A9 and realistically speaking, achievable throughput is even higher due to the larger re-ordering mechanisms and lower latency in some key instructions. For codecs especially, low-latency (compared to A9) integer SIMD instructions will help tremendously.

But I suspect most devices that run workloads like x264 have dedicated (either fixed-function or DSP) hardware to do that and won't be using the CPU.

True enough. I suppose it also depends on how fast Logan is released afterwards. Historically NVIDIA has taken about 18 months between major Tegra tape-outs, but AFAIK that was led entirely by US design centers. I saw an interview mentioning Logan would be the first Tegra chip to be handled primarily by their Indian design centers (at least in terms of physical implementation) so maybe it's being done more in parallel and it'll be out a bit faster. Still not enough to compensate for a potentially subpar Wayne though.

I'm not sure how that would really speed anything up. Most teams are divided into dedicated physical/synthesis groups and logic/architecture groups. Shifting physical design to India would only cause complications.

Agreed but my fear is that not supporting 64-bit reduces the TAM enough that it's significantly less likely to generate enough momentum. And anyway I think Intel's Atom-based server roadmap is attractive enough power-wise that there's not much benefit to transitioning to another ISA (that's hardly ARM's fault though).

I don't know how much ARM really needs momentum there. They're happy with having the smartphone/tablet market to themselves. Having A15 be usable in a desktop/server application is nice, but not necessary. They simply designed it such that it *could* scale there.

As I pointed out in my ARM article, the expectations for desktops in terms of applications are completely different. I can easily believe ARM might eventually be fairly successful in the traditional 13-15" notebook market where they will indeed have to compete with Intel's Haswell and AMD's Bulldozer, but I can't believe they'll ever get any traction whatsoever in the desktop market (excluding the HTPC niche) unless you get a full x86-to-ARM transcoder.

Looking at the market, 13-15" notebooks are where it's at. Even Intel has shifted their strategy to mainly target processors for that market. And A15 would fit perfectly in that area, more so than Haswell I'd say.

Icera definitely doesn't do that for full node transitions, they redid all the structured custom work for 65nm and 40nm (otherwise they wouldn't have been relatively late to both nodes). Things obviously change too much. But I assume 28HP to 28HPM is much much more similar than 65GP to 40G, right? So maybe that's what they're doing.

I honestly don't know. 28LP to 28HPM was quite a shift. The design rules are wholly different and what FET structures worked optimally changed dramatically as well. Hell, going from 45LP to 45LPG was a pretty big shift; die area ballooned somewhere on the order of 20%.
 