Tegra 3 officially announced; in tablets by August, smartphones by Christmas

Yes, NEON is in-order. And for in-order it doesn't really have enough registers to cover its latency, which can be substantial (and you don't get single-cycle latency for almost everything). It requires aggressive hand scheduling to get good utilization out of it, but for the type of data-parallel and data-linear algorithms that you often use here it's fairly viable to get good utilization without OoOE and with generous prefetching. Just painful.

Scheduling around instruction latencies is trivial in most cases. The problem is scheduling around loads that miss the caches; this is where OoOE excels.

Cheers
 
Scheduling around instruction latencies is trivial in most cases. The problem is scheduling around loads that miss the caches; this is where OoOE excels.

It's not trivial when you run out of registers (hence why register renaming is nice in OoOE), but I agree cache misses are the big point; not that you can really schedule that much in an A9-level window to get around the 200+ cycles an L2 miss seems to cost on typical SoCs. But what I was mainly getting at is that a lot of the algorithms are highly data-linear and can be prefetched. Cortex-A9 improves this a lot by adding an automatic prefetch unit and allowing prefetch into L1.
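To make the hand-scheduling point concrete, here's a minimal sketch in plain C with NEON intrinsics, assuming GCC/Clang on ARMv7; the function, the unroll factor, and the prefetch distance are illustrative choices, not tuned numbers:

```c
#include <arm_neon.h>

/* Dot product over two float arrays; n is assumed to be a multiple of 16.
 * Four independent accumulators keep the in-order NEON pipeline busy while
 * earlier multiply-accumulates are still in flight, and the explicit
 * prefetch (emitted as PLD on ARMv7) pulls data in ahead of use -- the
 * "generous prefetching" for data-linear algorithms described above. */
float dot(const float *a, const float *b, int n)
{
    float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
    float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);

    for (int i = 0; i < n; i += 16) {
        __builtin_prefetch(a + i + 64);   /* 256 bytes (a few lines) ahead */
        __builtin_prefetch(b + i + 64);
        acc0 = vmlaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
        acc1 = vmlaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
        acc2 = vmlaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
        acc3 = vmlaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
    }

    /* Reduce the four vector accumulators to a single scalar. */
    float32x4_t s  = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
    float32x2_t s2 = vadd_f32(vget_low_f32(s), vget_high_f32(s));
    return vget_lane_f32(vpadd_f32(s2, s2), 0);
}
```

With a single accumulator the loop would stall on the vmla result every iteration; with four independent chains, the in-order pipeline rarely has to wait on a result.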

Interesting. Is this true of VFP instructions as well?

Should be exactly the same.
 
If it's a matter of dual A15 at 2GHz vs quad A9s at 2GHz, 99.99999% of the applications out there will run faster on the A15 solution.

How much faster, according to ARM, is each A15 vs. an A9 clock for clock?
 
How much faster, according to ARM, is each A15 vs. an A9 clock for clock?

Dhrystone is 3.5 DMIPS/MHz vs. 2.5 DMIPS/MHz. Realistic workloads may improve even more; certainly anything that involves MRCs and MCRs.
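(For the record, that works out to 3.5 / 2.5 = 1.4, i.e. roughly 40% more Dhrystone throughput clock for clock.)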

NEON should be significantly faster; VFP as well though not as much.

Not that it counts for anything, but things like Quadrant and Smartbench would spike with A15 compared to A9.
 
Dhrystone is 3.5 DMIPS/MHz vs. 2.5 DMIPS/MHz. Realistic workloads may improve even more; certainly anything that involves MRCs and MCRs.

NEON should be significantly faster; VFP as well though not as much.

Not that it counts for anything, but things like Quadrant and Smartbench would spike with A15 compared to A9.

Doesn't sound all too convincing, but that's exactly what I was aiming at. Dhrystone is 40% higher clock for clock, so there's still another 60% to go, which I'm even willing to grant you. It's still 4 cores vs. 2, and while I'd personally prefer the latter for pure efficiency reasons, I still don't see a quad-core A9 at similar frequencies losing by a big margin.

It sounds more like the difference will be negligible, and most of you also seem to forget that NV is consistently ahead in terms of CPU execution (only, *ahem*), which means nothing else but that it'll work out for NV just fine, and that if NV wants something stronger than 2*A9@1.5GHz quite some time before A15 can even be available, a quad A9 is a one-way street.

Quadrant and Smartbench, with all due respect, don't sound like 99.9% of the real-world applications out there.

In any case I'm anything but in favor of NV's strategy of pushing the CPU envelope higher than anything else, especially for a graphics IHV, and at that one of the leading ones in all other markets. Performance-wise the ULP GeForce in Tegra2 is up to snuff, but that's about it. No technological leadership, no forefront, no nothing up to now. Just funky CPU-driven demos.
 
Doesn't sound all too convincing, but that's exactly what I was aiming at. Dhrystone is 40% higher clock for clock, so there's still another 60% to go, which I'm even willing to grant you. It's still 4 cores vs. 2, and while I'd personally prefer the latter for pure efficiency reasons, I still don't see a quad-core A9 at similar frequencies losing by a big margin.

Wait, you're assuming twice the number of cores = twice realistic performance increase?

It sounds more like the difference will be negligible, and most of you also seem to forget that NV is consistently ahead in terms of CPU execution (only, *ahem*), which means nothing else but that it'll work out for NV just fine, and that if NV wants something stronger than 2*A9@1.5GHz quite some time before A15 can even be available, a quad A9 is a one-way street.

Maybe it's just my parsing skills but that paragraph didn't really grok for me....

Quadrant and Smartbench, with all due respect, don't sound like 99.9% of the real-world applications out there.

It isn't, hence my statement "not that it means much".
 
It sounds more like the difference will be negligible, and most of you also seem to forget that NV is consistently ahead in terms of CPU execution (only, *ahem*), which means nothing else but that it'll work out for NV just fine, and that if NV wants something stronger than 2*A9@1.5GHz quite some time before A15 can even be available, a quad A9 is a one-way street.

So far nVidia was ahead once with Tegra 2, and arguably behind with Tegra 1. If Kal-El devices really come out this year they'll be ahead again, but that remains to be seen. nVidia actually said we'd see devices in August and I'm definitely not buying that.
 
Wait, you're assuming twice the number of cores = twice realistic performance increase?

A theoretical increase of X% in the efficiency of a newer architecture doesn't necessarily mean an increase by the same percentage, now does it?

Maybe it's just my parsing skills but that paragraph didn't really grok for me....
If everything goes according to plan, Tegra3 might be shipping sometime this year and in way larger volumes at the beginning of 2012. When are the first dual A15-powered SoCs slated to appear in devices again? Most likely end of 2012 if not slightly later, and that from ARM's A15 lead partners.

Wouldn't that tell you that A15 can't be integrated before ARM's A15 lead partners do so? So what would NV integrate in a Tegra3, since it's not an A15 lead partner yet still wants to stay at the forefront of CPU execution? No other choice than quad A9; considering that in the meantime most others will most likely still be fiddling around with dual A9, they're well positioned on the CPU front at least.

Tomorrow's architectures are more efficient; holy smokes what else is new?

It isn't, hence my statement "not that it means much".
I take that as an acknowledgement of the former exaggeration.

So far nVidia was ahead once with Tegra 2, and arguably behind with Tegra 1. If Kal-El devices really come out this year they'll be ahead again, but that remains to be seen. nVidia actually said we'd see devices in August and I'm definitely not buying that.

Delays aren't exclusive to just one manufacturer, you know. At least in terms of yields, 40nm T3 is on the relatively safer side compared to 28nm. All you've got from ARM's A15 lead partners is that they also "expect" to see devices in X time frame.

It should be enough IMHO for T3 to fare relatively well if it manages a lead time of at least 8 months against the 28nm SoCs from the competition. Of course never to the degree NVIDIA projects in terms of sales, but that's a totally different chapter.
 
A theoretical increase of X% in the efficiency of a newer architecture doesn't necessarily mean an increase by the same percentage, now does it?

But a 2x improvement from doubling cores is a theoretical maximum that is almost never fully attained in the real world and drops off sharply depending on the software used, while the improvement numbers from Cortex-A9 to A15 are supposed to represent real-world averages. If anything, Dhrystone underestimates the improvements, because a lot of them are in the memory subsystem or in more aggressive OoO than Dhrystone needs.
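A back-of-the-envelope illustration of that drop-off, using Amdahl's law; the parallel fractions below are made-up examples rather than measurements, and the 1.4x factor is just the DMIPS/MHz ratio quoted above:

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction
 * of the workload that parallelizes and n is the number of cores. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double fractions[] = { 0.3, 0.5, 0.7, 0.9 };  /* hypothetical */
    for (int i = 0; i < 4; i++) {
        double p = fractions[i];
        printf("p = %.1f: 4 cores -> %.2fx, 2 cores at 1.4x IPC -> %.2fx\n",
               p, amdahl(p, 4), 1.4 * amdahl(p, 2));
    }
    return 0;
}
```

Up to around a 70% parallel fraction the dual core with the higher IPC comes out ahead; only heavily threaded workloads let the quad pull away, which is exactly why a real-world IPC gain tends to beat a theoretical core-count doubling.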

Delays aren't exclusive to just one manufacturer, you know. At least in terms of yields, 40nm T3 is on the relatively safer side compared to 28nm. All you've got from ARM's A15 lead partners is that they also "expect" to see devices in X time frame.

It should be enough IMHO for T3 to fare relatively well if it manages a lead time of at least 8 months against the 28nm SoCs from the competition. Of course never to the degree NVIDIA projects in terms of sales, but that's a totally different chapter.

Q3 2012 seems like the most optimistic timeframe, and I certainly don't expect A15 to hit prior to A9 - but this isn't like Tegra 2 beating out OMAP4 or Exynos to market; comparing Kal-El to A15 is less apples to apples. Instead there'll be more SoCs competing with Kal-El before A15 hits. For instance, i.MX6 will also be available in quad-core A9, and OMAP4470 will hit 1.8GHz instead of 1.5GHz.

I personally think the software ecosystem won't be there to really take advantage of quad-core A9 outside of stuff coming from PSV development. So while nVidia will push the marketing for it hard, in the real world it'll often be better to have a higher-clocked A9.
 
A theoretical increase of X% in the efficiency of a newer architecture doesn't necessarily mean an increase by the same percentage, now does it?

Well no, hence why I put "may be even more for realistic workloads". A15's unified pipe and improved NEON/VFP pipeline alone will mean vast improvements for anything floating-point related.

Furthermore, a theoretical increase of X% in IPC has correlated far more with improved end performance than a theoretical increase in core scaling. Few applications scale anywhere near linearly with core count, and most see no improvement at all going from 2 cores to 4 (or hell, going from 1 core to 2).

This isn't the case for IPC improvements.

Furthermore, DMIPS/MHz as a measure of IPC isn't a purely "theoretical increase" either. It may be overly optimistic, but it's nowhere near as "theoretical" as saying 2xCores = 2xPerformance.

If we were to speak in terms of purely theoretical improvements, A15 would be 50% faster clock-for-clock than A9, assuming it were fetch-limited; or 100% faster clock-for-clock in the case of NEON instructions; or 100% faster clock-for-clock in the dispatch-limited case.

So, in conclusion, A15 will be faster -- likely significantly so -- clock-for-clock than A9. And like I said, in ~99% of the applications out there, 2xA15 will likely outperform 4xA9, and by quite a significant margin.

If everything goes according to plan, Tegra3 might be shipping sometime this year and in way larger volumes at the beginning of 2012. When are the first dual A15-powered SoCs slated to appear in devices again? Most likely end of 2012 if not slightly later, and that from ARM's A15 lead partners.

While true, Krait will be shipping end of this year with devices likely early 2012.

Wouldn't that tell you that A15 can't be integrated before ARM's A15 lead partners do so? So what would NV integrate in a Tegra3, since it's not an A15 lead partner yet still wants to stay at the forefront of CPU execution? No other choice than quad A9; considering that in the meantime most others will most likely still be fiddling around with dual A9, they're well positioned on the CPU front at least.

While true, I was merely pointing out that your statement of 4xA9 being somewhat equal to 2xA15 is absolutely false in the vast, vast majority of circumstances.

I agree that for the time frame nVidia is aiming at, A9s are still the best option. I will disagree that the extra area and power at 40nm of having 4xA9 with NEON is worth the marginal gain in performance from the tiny fraction of applications that may scale to 4 threads. They would've been better off using those resources to beef up the GPU or add a pair of lower-power CPUs in a heterogeneous configuration.
 
Nah uh; I didn't suggest they'd be equal, but rather that the difference won't be "that" groundbreaking. Now if we go over to the GPU side, that's where the rather "ouch" part is, considering what ST Ericsson and TI have already announced, for instance.
 
But a 2x improvement from doubling cores is a theoretical maximum that is almost never fully attained in the real world and drops off sharply depending on the software used, while the improvement numbers from Cortex-A9 to A15 are supposed to represent real-world averages. If anything, Dhrystone underestimates the improvements, because a lot of them are in the memory subsystem or in more aggressive OoO than Dhrystone needs.

True. As it seems, for the CPU side NV didn't have any other choice though.

Q3 2012 seems like the most optimistic timeframe, and I certainly don't expect A15 to hit prior to A9 - but this isn't like Tegra 2 beating out OMAP4 or Exynos to market; comparing Kal-El to A15 is less apples to apples.

Of course Tegra2 beat both OMAP4 and Exynos, or any other dual-core CPU in a SoC, to the market. Yes, there had been delays in terms of integration for T2 too, but when it started shipping in actual devices it was quite a bit before Exynos arrived in the Galaxy S2 (which is only now catching up rapidly in sales), with OMAP4430 making an earlier appearance in the RIM Playbook but still later than T2.

If you'd tell me that OMAP4 will by far outsell T2 in the long run, I wouldn't disagree. All I'm saying is that NV tries to beat its competition in terms of execution, placing most of its bets on the CPU side. It didn't sell a gazillion as probably only NV and some wannabe analysts expected, but their penetration so far is anything but bad. I'd rather expect other SoC manufacturers to try to accelerate their road-maps and attempt similar marketing stunts to the ones we're used to.

Instead there'll be more SoCs competing with Kal-El before A15 hits. For instance, i.MX6 will also be available in quad-core A9, and OMAP4470 will hit 1.8GHz instead of 1.5GHz.

Isn't the 4470 sampling roughly a year after T3? By the way, within an OMAP family TI has so far mostly played with GPU frequencies. The 4470 isn't just a healthy CPU frequency increase but also a very strong GPU increase.

I personally think the software ecosystem won't be there to really take advantage of quad-core A9 outside of stuff coming from PSV development. So while nVidia will push the marketing for it hard, in the real world it'll often be better to have a higher-clocked A9.

Neither the CPU nor the GPU by itself will define a SoC's performance. It's rather the entire bundle of processing units (often even reaching 12 these days) that'll define performance. In the grander scheme of things, yes, it is better to have 2*A9@1.8GHz + "18 unified cores" at >400MHz + other units vs. 4*A9@1.5GHz + "12 split-purpose cores" at =/>400MHz + other units.

From what it looks like, when it comes to OMAP4, T3 seems to have to battle first with the 4460 and later on with the 4470.
 
Tegra 2 did beat OMAP4 to market by a big margin, even though TI projected OMAP4 phones in 2010. It beat Exynos as well, but Exynos seemed pretty much on schedule relative to Samsung announcements; actually, it seems that it was released slightly earlier than expected. Other platforms like U8500 are still nowhere to be seen. What I'm saying is that this is pretty much just one win for nVidia, not a track record.

One thing to note is that even though nVidia had a huge lead in tablets and netbooks, their lead in phones was much less significant. And a lot of the earlier Tegra 2 tablets were marred by poor software. nVidia also benefited from being the Honeycomb reference platform. I don't know if Google chose them due to being available or if Google's choice itself helped push availability; it may have been a combination of both. But it would appear that Google wants to keep giving new companies a chance for reference selection, so no matter why nVidia had that advantage last time they won't have it next time. And unfortunately for nVidia, that last reference win only got them tablets, while the next reference win will get someone tablets and phones.

nVidia was targeting phones and tablets with the same initial SKU, while TI for instance planned to have multiple ones. It looks like nVidia will be following the latter strategy more (which obviously makes more sense in this market). Meanwhile, TI seems to be diversifying their product range and probably intends to release more aggressively as well; I'm sure they're not just ignoring nVidia.

Isn't the 4470 sampling roughly a year after T3?

That's an exaggeration. Sources say that Kal-El started sampling in February, but also only 12 days after tape-out, which I don't really buy at all (or if it did, probably with a lot of issues that are going to delay time to product). 4470 is supposed to start sampling in October or so. So 8 months if we take nVidia at their word. Note that nVidia also said they taped out in December and then said it was really February.
 
That's an exaggeration. Sources say that Kal-El started sampling in February, but also only 12 days after tape-out, which I don't really buy at all (or if it did, probably with a lot of issues that are going to delay time to product).
Heh, it's pretty much a courtesy to lead partners AFAICT. Broadcom also started 'sampling' BCM21551 to lead partners then admitted they knew even before tape-out that it was only a prototype and they'd do a silicon respin (aka an entirely new tape-out) afterwards. More than a year later, the project was cancelled due to a large number of technical problems and customers not being happy with either the integrated 3G RF or the Bluetooth (after the new tape-out I believe).

In NVIDIA's case, I suspect one important consideration is that their lead partners were already working on PCBs *before* sampling. So even if the chip itself was fairly buggy, it would still help to make sure their PCBs didn't have any major problems. In practice this makes the entire process much closer to what Apple can do: PCBs are developed right after tape-out while waiting for the chips to get back from the fab, and NVIDIA provides the OS software themselves (based on Android) as well. Same thing happened with Exynos, but certainly not Tegra 2.

Of course, if they need more than a very simple and straightforward respin, none of that will help.
 
Nah uh; I didn't suggest they'd be equal, but rather that the difference won't be "that" groundbreaking.

What do you consider "groundbreaking"? Based on the u-arch changes, we're talking about quite a significant gain in many applications, particularly floating-point ones.

Also, your quote was:

Wouldn't it come down to the quad A9 frequencies vs. dual A15 frequencies of the time?

If hypothetically both square out at around 2.0GHz, I don't see the quad A9 falling that much short if at all.

This is just patently untrue. There will be as clear a difference (if not a bigger one) between A15 and A9 as there is between A9 and A8.
 
Heh, it's pretty much a courtesy to lead partners AFAICT. Broadcom also started 'sampling' BCM21551 to lead partners then admitted they knew even before tape-out that it was only a prototype and they'd do a silicon respin (aka an entirely new tape-out) afterwards. More than a year later, the project was cancelled due to a large number of technical problems and customers not being happy with either the integrated 3G RF or the Bluetooth (after the new tape-out I believe).

In NVIDIA's case, I suspect one important consideration is that their lead partners were already working on PCBs *before* sampling. So even if the chip itself was fairly buggy, it would still help to make sure their PCBs didn't have any major problems. In practice this makes the entire process much closer to what Apple can do: PCBs are developed right after tape-out while waiting for the chips to get back from the fab, and NVIDIA provides the OS software themselves (based on Android) as well. Same thing happened with Exynos, but certainly not Tegra 2.

Of course, if they need more than a very simple and straightforward respin, none of that will help.

I think there's a trend throughout the industry to begin hardware and OEM design very early on. It didn't use to be this way (even as recently as a year ago, in fact). Overall, I think it's a much more efficient method, albeit potentially wasteful in the case of a major overhaul.
 
This is just patently untrue. There will be as clear a difference (if not a bigger one) between A15 and A9 as there is between A9 and A8.

Well, let's waste some more bandwidth then. Show me a single A9 at the same frequency as a single A8.

To me a SoC isn't just a CPU. To re-phrase the whole damn thing: since there will be a large difference between future dual A15 and quad A9/Tegras, the first spot to look for the biggest weakness won't be the CPU. If ST Ericsson manages to execute with the A9600 as they'd like to, the result would make a killing why exactly?

And yes, I still believe that in mainstream real-world applications the average consumer shouldn't notice any significant differences from the CPU side, even in purely CPU-bound applications.
 
Well, let's waste some more bandwidth then. Show me a single A9 at the same frequency as a single A8.

I don't think a single-core A9 currently exists. However, you can look at any number of single-threaded benchmarks to see the performance difference between an A9 and an A8.

To me a SoC isn't just a CPU. To re-phrase the whole damn thing: since there will be a large difference between future dual A15 and quad A9/Tegras, the first spot to look for the biggest weakness won't be the CPU. If ST Ericsson manages to execute with the A9600 as they'd like to, the result would make a killing why exactly?

There are plenty of things that are still CPU-bound -- including JavaScript parsing and browser speeds.

And yes, I still believe that in mainstream real-world applications the average consumer shouldn't notice any significant differences from the CPU side, even in purely CPU-bound applications.

There are mainstream applications for which the consumer won't notice the difference between a 600MHz single A8 and a dual 1.2GHz A9. What's your point?
 
Well, let's waste some more bandwidth then. Show me a single A9 at the same frequency as a single A8.

Amlogic AML8726 is a single-core Cortex-A9 clocking up to 1GHz. You can find 1+GHz Cortex-A8s pretty easily. But the Amlogic chip has a further handicap in only having 128KB of L2, vs. A8 SoCs that have 256KB or 512KB.

It won't be long before you'll be able to compare tablets using this SoC and similar memory configurations to ones with A8 at the same clock.
 