Tegra 3 officially announced; in tablets by August, smartphones by Christmas

Transitioning in a few short months from sampling to volume production has 0% chance of happening.

Agreed. What's worse is that despite TSMC, as usual, painting pretty pictures about its manufacturing processes, there's never a guarantee that things will go according to plan.

Tegra 4 is slated for the end of 2012 like Tegra 3 is for 2011.

I probably missed seeing anywhere that T4 is slated for late '12, but if it truly is then NV will most likely have a hard time with it overall, since that would put it on roughly the same schedule as most if not all of their competitors' 28nm SoCs.

So far I'd been under the impression that they're aiming to enter mass production in early 2012.
 
Agreed. What's worse is that despite TSMC, as usual, painting pretty pictures about its manufacturing processes, there's never a guarantee that things will go according to plan.



I probably missed seeing anywhere that T4 is slated for late '12, but if it truly is then NV will most likely have a hard time with it overall, since that would put it on roughly the same schedule as most if not all of their competitors' 28nm SoCs.

So far I'd been under the impression that they're aiming to enter mass production in early 2012.

If T4 is on HKMG and not 28LP, I really doubt there will be a mass production part before Q3 2012. The models aren't even complete at this point for HPM.
 
I think it's fair to say I wasn't the only one that was surprised to learn T3 was still on 40nm.

Oh yes, I'm sure a lot of people were surprised (myself included). But after seeing the state of TSMC's and GF's 28nm processes, I think it made sense that it was on 40nm in the end. Like you've said below, maybe they switched the process later on in the design once it became apparent that 28nm wasn't going to be ready in time.

Mike Rayfield told me explicitly at MWC11 that T4 was on 28nm High-K and he downplayed the 28nm SiON (28LP/LPT) process. I don't know if it's 28HPL or 28HPM - given that I don't expect chips to come back before Q4 2011 they are both possible. I expect 28HPM would be best and TSMC insists there's a lot of interest in it, but if it's 28HPL for time-to-market reasons they might still switch to 28HPM for T5. We'll see. Either way, definitely not 28LPT...

Gipsel posted a slide comparing the processes here - http://forum.beyond3d.com/showpost.php?p=1568882&postcount=513

From that (and Charlie's info, if that counts...) it appears that 28HPL is better suited to SoCs than 28HP (is that the same as 28HPM?), and probably for time-to-market reasons as well. Just a guess on my part though; I really have no idea.
 
If T4 is on HKMG and not 28LP, I really doubt there will be a mass production part before Q3 2012. The models aren't even complete at this point for HPM.

28HPL (somewhere in between 28LP and HKMG in a relative sense), which Charlie/SA suggests AMD's SI/GCN might be using, sounds like a far better solution than either/or. The question still being when reasonable mass production can start with it. If AMD is truly using HPL, they have far better chances of getting SI into mass production than NV has with Kepler on HKMG.

Otherwise, unless TSMC has capacity constraints for the specific process, I don't see how a highly complex GPU chip would be possible in late 2011 and not a LOT smaller SoC. It's more a question of what NVIDIA has decided on than anything else.

Erinyes,

Depends on what your targets are. I don't think HPL is particularly well suited for high frequencies; with NV most likely using ALU hotclocking again in Kepler, HP/HKMG sounds like a one-way street.

For something like Tegra 4, assuming they stay with quad A9s but this time hit a 2.0GHz frequency, are you sure HPL would be the better solution?

Gipsel obviously refers to GPUs in his post, and I'd personally be very surprised if AMD's GCN/SI has any significant frequency increases compared to its current GPUs. Even worse, with a high-end GPU you can throw Watts around like there's no tomorrow (an obvious exaggeration), instead of counting every milliwatt to conserve as much battery life as possible on an embedded device.
 
Otherwise, unless TSMC has capacity constraints for the specific process, I don't see how a highly complex GPU chip would be possible in late 2011 and not a LOT smaller SoC. It's more a question of what NVIDIA has decided on than anything else.
You could argue that area doesn't matter, and the SoC is actually the more complex device to manufacture. Just because one IC makes it to market on a process node doesn't mean all others will.
 
28HPL (somewhere in between 28LP and HKMG in a relative sense), which Charlie/SA suggests AMD's SI/GCN might be using, sounds like a far better solution than either/or. The question still being when reasonable mass production can start with it. If AMD is truly using HPL, they have far better chances of getting SI into mass production than NV has with Kepler on HKMG.

Otherwise, unless TSMC has capacity constraints for the specific process, I don't see how a highly complex GPU chip would be possible in late 2011 and not a LOT smaller SoC. It's more a question of what NVIDIA has decided on than anything else.

From what I've read in the other threads, HPL is also HKMG, but it doesn't have the SiGe strain. Another thing is that LP is cheaper and has higher density than HPL, so HPL is not necessarily better than LP for low-power applications. AMD and NVIDIA may be using completely different processes, so it wouldn't be surprising if one gets a chip out earlier than the other.

Another thing is that SoCs have traditionally lagged behind GPU production on the same node. For example, I think the first SoC on 40G was Tegra 2, and it was out roughly a year after 40nm GPUs. This may not be the case for all fabs and all SoCs, but from what I can remember, SoCs have usually never been the first products on a new node at TSMC. This gap looks to be shrinking fast, though, and we may see 28nm SoCs at the same time as, or not far behind, 28nm GPUs. And as Rys said, just because one IC makes it to market on a process node doesn't mean that others will. A prime example being Fermi compared to Cypress.

Erinyes,

Depends on what your targets are. I don't think HPL is particularly well suited for high frequencies; with NV most likely using ALU hotclocking again in Kepler, HP/HKMG sounds like a one-way street.

For something like Tegra 4, assuming they stay with quad A9s but this time hit a 2.0GHz frequency, are you sure HPL would be the better solution?

Gipsel obviously refers to GPUs in his post, and I'd personally be very surprised if AMD's GCN/SI has any significant frequency increases compared to its current GPUs. Even worse, with a high-end GPU you can throw Watts around like there's no tomorrow (an obvious exaggeration), instead of counting every milliwatt to conserve as much battery life as possible on an embedded device.

I've posted this in another thread as well, and I'm going to repeat it here. What is the connection between ALU hotclocking and the process used? It's the scalar design that enables higher clocks; it doesn't matter which process is being used. NV was able to hit 1.5GHz on its shaders on the now-ancient 90nm process. And from what I've read so far, even 28LP is going to be as fast as or faster than 40G.

I really don't know which process they're actually using; it was just a guess on my part. HPL seems like it will be ready earlier than HP, and from the slide Gipsel posted, HPL looks like it has far lower leakage as well. Hence my guess was HPL.
 
Another thing is that SoCs have traditionally lagged behind GPU production on the same node. For example, I think the first SoC on 40G was Tegra 2, and it was out roughly a year after 40nm GPUs. This may not be the case for all fabs and all SoCs, but from what I can remember, SoCs have usually never been the first products on a new node at TSMC. This gap looks to be shrinking fast, though, and we may see 28nm SoCs at the same time as, or not far behind, 28nm GPUs. And as Rys said, just because one IC makes it to market on a process node doesn't mean that others will. A prime example being Fermi compared to Cypress.

Tegra 2 went, from what I recall, into mass production in the same quarter as GF100 (Q1 2010), for reasons obviously only NVIDIA knows. Ironically, according to rumors it had its tape-out in late 2008, so either NV needed a couple of spins, or yields or capacities were still crappy in late 2009, or both. Either way, NVIDIA went into production with a 49mm2 SoC and a ~530mm2 high-end GPU chip in the same quarter.

Texas Instruments, on the other hand, sounds like they'll manufacture OMAP5 on 28HP, but obviously primarily at UMC; according to their own claims things are going well so far, but you never know either.

I've posted this in another thread as well, and I'm going to repeat it here. What is the connection between ALU hotclocking and the process used? It's the scalar design that enables higher clocks; it doesn't matter which process is being used. NV was able to hit 1.5GHz on its shaders on the now-ancient 90nm process. And from what I've read so far, even 28LP is going to be as fast as or faster than 40G.

You know better than the layman here that when frequency goes up, so does leakage. It might just be a marketing table, but it shows what can be done under 40LP vs. 40G at TSMC for GPU IP:

http://www.vivantecorp.com/p_mvr.html#GC1000

Synthesis gate count, synthesis area, silicon area and active power are only slightly affected by going from LP to G, yet the frequency advantage of the latter is huge (all assuming the provided data is accurate and not some marketing guesstimate). In that regard the active power mentioned doesn't tell me all that much, and my gut feeling tells me that potential licensees would prefer LP over G in such a case.

For NV's GPU hotclocked ALUs I wouldn't suggest that it doesn't also come at an area cost; but compared to having only core clocks and more ALUs instead, there should definitely be a gain overall, otherwise they'd be foolish to opt for it.

I really dont know which process they're actually using, it was just a guess on my part. HPL seems like it will be ready earlier than HP, and from the slide Gispel posted, HPL looks like it has far lower leakage as well. Hence my guess was HPL
Well, my own reasoning for HPL being a better candidate is that if NV manages to get a quad A9 to 1.5GHz under 40nm, 2.0GHz for the same CPU config doesn't sound like much of a problem, hypothetically, under HPL. For the GPU, I severely doubt they've changed the base architecture much for Tegra 4, and hence ALU hotclocking doesn't make all that much sense just yet. NV introduced hotclocked ALUs only with G80, when their first USC ALUs appeared, and unless I'm again terribly wrong I don't expect them for Tegra before 20nm.
 
Tegra 2 went, from what I recall, into mass production in the same quarter as GF100 (Q1 2010), for reasons obviously only NVIDIA knows. Ironically, according to rumors it had its tape-out in late 2008, so either NV needed a couple of spins, or yields or capacities were still crappy in late 2009, or both. Either way, NVIDIA went into production with a 49mm2 SoC and a ~530mm2 high-end GPU chip in the same quarter.
NVIDIA definitely had Tegra2 samples back at MWC09 (although they didn't have them at the show) so they definitely taped-out in late 2008. However they only started sampling to tablet manufacturers in July 2009 (and even later for smartphones) so I would expect at least one respin was necessary before they could sample it even to lead partners (unlike Kal-El).

Also you can definitely expect a chip taped-out so early in the process to be using suboptimal Design for Manufacturing rules (they improve over time) so I'd expect yields even today to be slightly lower than for a more recent 40nm chip. I think what delayed them more than anything is Android's maturity for tablets though - which turned out pretty good in the end when they became the exclusive initial provider for Honeycomb.

Texas Instruments, on the other hand, sounds like they'll manufacture OMAP5 on 28HP, but obviously primarily at UMC; according to their own claims things are going well so far, but you never know either.
OMAP5 is 28LP at UMC and very likely dual-sourcing at GlobalFoundries. That's SiON rather than High-K, however remember Cortex-A15 clocks significantly higher than Cortex-A9 on the same process, and it's possible they are using something like 28LPG (ala Tegra's 40LPG) rather than only 28LP.

For NV's GPU hotclocked ALUs I wouldn't suggest that it doesn't also come at an area cost; but compared to having only core clocks and more ALUs instead, there should definitely be a gain overall, otherwise they'd be foolish to opt for it.
Unlike on the desktop where the area cost should be obvious, in the handheld world they could presumably simply use significantly higher leakage (->higher performance) transistors (like they already need for the CPU). So the area cost would be smaller but the static power cost would be higher. I'd expect that to be a good trade-off but it's hard to be certain.

Well, my own reasoning for HPL being a better candidate is that if NV manages to get a quad A9 to 1.5GHz under 40nm, 2.0GHz for the same CPU config doesn't sound like much of a problem, hypothetically, under HPL.
Hmm, I'm a bit skeptical. The lack of SiGe strain on 28HPL is a pretty big deal, and remember the CPU is really using 40G Standard Vt transistors, which are nothing to sneeze at. I'd expect 28HPL Low Vt transistors to be competitive with that, but 33% faster might be pushing it. We'll see.

Also 28HPL isn't ready before 28HP on paper although maybe yields will be good enough faster. And 28HPM is a 'jack of both trades' between 28HP and 28HPL, it's probably the ideal process for something like Tegra, but since metafor said the models aren't even ready yet for it, that would mean that either T4 will easily be one of the first chips on the process (not impossible given T2 was easily one of the first on 40LP) or that it'll be released slightly later than we expect. Or a combination of the two!

unless I'm again terribly wrong I don't expect them for Tegra before 20nm.
I'd be very surprised if Logan (late 28nm generation and first Tegra with Cortex-A15) didn't have the next-generation GPU architecture but who knows.
 
For some reason I had placed Logan at 20nm in my mind. With the rather huge performance increase NV's Tegra roadmap shows for it, 28nm sounds like a pretty tall order to me. On the other hand, ST's A9600 isn't going to be small either.

Oh and thanks for all the useful information & opinions above. It sure clears up a few things.
 
For some reason I had placed Logan at 20nm in my mind. With the rather huge performance increase NV's Tegra roadmap shows for it, 28nm sounds like a pretty tall order to me. On the other hand, ST's A9600 isn't going to be small either.
My expectation as I've said before is that T4 is probably going to be about the same die size as T2 or even slightly smaller, so it might eventually be able to target the mid-range as well. I expect them to stick to 32-bit LPDDR2 for the smartphone market with all that implies in terms of both cost and performance. T5/Logan would presumably switch to 64-bit LPDDR3.

And remember that not only Apple's A5 but also Samsung's Exynos is around 120mm2. Application processor die sizes are certainly going up, and you can do a lot of things with 100mm2 on 28nm. Let's put this in perspective: even if only 75mm2 are dedicated to real processing (the rest is I/O and associated logic), that's the equivalent of ~600mm2 on 90nm. That's more transistors than G80! And 28HPM Standard Vt should definitely be very competitive with 90G.

Of course, it will be a LOT more power optimised and less area optimised, it'll probably have some more features than G80, only a fraction of those transistors will really be dedicated to the GPU, and it will have a LOT less memory bandwidth (heck it won't even have as much memory bandwidth as G86). My point simply is I wouldn't worry about the die size being too big. I'd worry a lot more about the peak power consumption of those four A15 cores really...

Oh and thanks for all the useful information & opinions above. It sure clears up a few things.
No problem, just don't take it as gospel as you well know :) I don't have hard data on 28HPL vs 28HPM vs 40G performance, I'd be quite interested in specifics but I don't expect to ever really know.
 
Even though IP cores are generally designed to be process friendly (no heavy reliance on eDRAM, etc.), mixing a bunch of them and from different suppliers makes a SoC a challenging production/fabrication candidate.
 
Mostly because you will generally have to raise the voltage to get to that higher frequency. If you don't have to increase it, then it would mean you were wasting power at the lower frequency by running at a higher voltage than was actually needed.
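To put a rough number on that point, here's a minimal sketch of the standard switched-capacitance model of dynamic CMOS power (P roughly proportional to C * V^2 * f). All of the values are made up for illustration; the point is just that a frequency bump that also needs a voltage bump costs disproportionately more power.

```python
# Rough illustration: dynamic CMOS power scales roughly as C * V^2 * f,
# so a frequency increase that forces a voltage increase costs much more
# than the frequency increase alone. All numbers are hypothetical.

def dynamic_power(c_eff, voltage, freq_mhz):
    """Switched-capacitance model: P ~ C * V^2 * f (arbitrary units)."""
    return c_eff * voltage**2 * freq_mhz

base = dynamic_power(c_eff=1.0, voltage=1.0, freq_mhz=1500)

# Frequency up 33% at the same voltage: power up 33% too.
same_v = dynamic_power(1.0, 1.0, 2000)

# Frequency up 33% but needing a 10% voltage bump: power up ~61%.
bumped_v = dynamic_power(1.0, 1.1, 2000)

print(same_v / base)    # ~1.33
print(bumped_v / base)  # ~1.61
```

The quadratic voltage term is why the post above says running at a higher voltage than needed at a low frequency is simply wasted power.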
 
It's the scalar design that enables higher clocks; it doesn't matter which process is being used.
Sorry for the OT, but the ALUs are not scalar (they're SIMD/vector ALUs), and that doesn't enable higher clocks either. It's simply the much longer pipeline that enables the hotclocking (Fermi has 18 cycles of latency, G80/GT200 was even slightly higher, Radeons have 8 cycles of arithmetic latency).
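The pipeline-depth argument can be sketched with a toy timing model: splitting the same logic across more stages shortens each stage's critical path, so the clock can run faster (at the cost of more latency in cycles). The delay and overhead numbers below are invented for illustration, not real ALU figures.

```python
# Toy model of why a deeper pipeline enables a higher clock: the total
# logic delay is split across more stages, so each stage's critical path
# (and hence the cycle time) is shorter. All delays are hypothetical.

def max_clock_ghz(total_logic_delay_ns, stages, register_overhead_ns=0.05):
    """Cycle time = per-stage logic delay + flop overhead (ideal split)."""
    cycle_ns = total_logic_delay_ns / stages + register_overhead_ns
    return 1.0 / cycle_ns

# Same ALU logic, 8-stage vs 18-stage pipeline (Radeon-like vs Fermi-like depth):
print(max_clock_ghz(4.0, 8))    # ~1.8 GHz
print(max_clock_ghz(4.0, 18))   # ~3.7 GHz
```

The deeper pipeline clocks roughly twice as fast here, but each operation now takes 18 cycles instead of 8 before its result is available, which matches the latency figures quoted above.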
 
The table on page 9 demonstrates the use of lower-threshold transistors to increase the clock target. Even when using only a few of them (where necessary), it significantly increases the static power consumption: it rises ~20-fold when tuning a dual Cortex-A9 from 1.8GHz with all transistors at regular Vt to 2.46GHz with a mixture of superlow and regular Vt on the SLP process, and it already goes up to 3.5 times the base value with a mixture of low/regular Vt for a clock speed increase to 2.2GHz.
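Those multipliers are plausible because subthreshold leakage grows exponentially as Vt drops. A minimal sketch of that relationship, using an assumed subthreshold slope and assumed Vt deltas (illustrative values only, not foundry data):

```python
# Hedged sketch: subthreshold leakage grows exponentially as threshold
# voltage (Vt) is lowered, which is why swapping regular-Vt cells for
# low-Vt or superlow-Vt cells on critical paths can multiply static
# power many times over. The ~85 mV/decade subthreshold slope and the
# Vt deltas below are assumed, illustrative numbers.

def leakage_ratio(delta_vt_mv, slope_mv_per_decade=85.0):
    """Relative leakage increase when Vt is lowered by delta_vt_mv."""
    return 10 ** (delta_vt_mv / slope_mv_per_decade)

# Lowering Vt by ~100 mV (regular -> low Vt) on the swapped cells:
print(leakage_ratio(100))   # ~15x more leakage per swapped cell

# Lowering Vt by ~150 mV (regular -> superlow Vt):
print(leakage_ratio(150))   # ~58x per cell
```

Since only a fraction of the cells get swapped, the chip-level static power multiplier (3.5x, ~20x in the table) ends up well below the per-cell ratio, but the exponential trend is the same.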
 
The table on page 9 demonstrates the use of lower-threshold transistors to increase the clock target. Even when using only a few of them (where necessary), it significantly increases the static power consumption: it rises ~20-fold when tuning a dual Cortex-A9 from 1.8GHz with all transistors at regular Vt to 2.46GHz with a mixture of superlow and regular Vt on the SLP process, and it already goes up to 3.5 times the base value with a mixture of low/regular Vt for a clock speed increase to 2.2GHz.

Those numbers can be a bit misleading. Keep in mind that these likely come from preliminary auto-place-and-route experiments rather than a final chip implementation, so there is quite a big delta depending on how well the tool can handle an ever-increasing frequency target. I've seen place-and-route tools balloon the cell count through replication (and thereby leakage) by many tens of fold when increasing the target frequency by just 20% or so.

A more interesting comparison would be if the total cell count and area at the end, as well as the actual percentage of LVT vs. RVT cells, were included in the figures.

Also keep in mind that there will be an exponential increase in any power/area figure as the frequency target rises above what the design can realistically be implemented at. For instance, an increase from 1.8GHz to 2.0GHz may have simply swapped a few critical paths over to LVT cells, but increasing the frequency target to 2.4GHz will swap almost all complex paths to LVT cells. And that's not even taking into account the ballooning caused by buffer trees, replication, etc.
 
I think one interesting take-home message is that you can obviously reach 2+ GHz with a dual Cortex-A9 at reasonable power consumption on GF's SLP process, which also lacks SiGe, like TSMC's HPL. ;)
 
I think one interesting take-home message is that you can obviously reach 2+ GHz with a dual Cortex-A9 at reasonable power consumption on GF's SLP process, which also lacks SiGe, like TSMC's HPL. ;)
Good point. Actually, that table is at 1.1v despite the nominal voltage for both 28nm-SLP at GF and 28nm-HPL at TSMC being 1.0v; but then again, 40G's nominal voltage is 0.9v and Tegra 2 needs 1.0v for 1GHz IIRC. And OMAP3630 runs at 1.26v on a 1.1v process (OMAP4 can run even higher).

It would be a bit weird for NVIDIA to go from 1.0v to 1.1v, but then again maybe Kal-El is already higher than 1.0v, and what matters in the end is total power... Talking of total power, is it just me or does that GlobalFoundries table not make any sense? Clock frequency multiplied by dynamic power per MHz, added to static power, is not at all the same number as the total power. I assume they just copy-pasted the wrong numbers somewhere?
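The sanity check described in that last paragraph is easy to spell out: if a table lists dynamic power per MHz, static power, and total power, the three should satisfy total ≈ freq × dyn_per_MHz + static. The numbers below are hypothetical stand-ins, not the actual GlobalFoundries figures.

```python
# Consistency check for a power table row: the listed total power should
# equal frequency * (dynamic power per MHz) + static power, within some
# tolerance. All numbers below are hypothetical, for illustration only.

def total_power_consistent(freq_mhz, dyn_mw_per_mhz, static_mw,
                           listed_total_mw, tolerance=0.05):
    """True if the listed total is within tolerance of the computed one."""
    computed_mw = freq_mhz * dyn_mw_per_mhz + static_mw
    return abs(computed_mw - listed_total_mw) / computed_mw <= tolerance

# A self-consistent row: 1800 * 0.35 + 120 = 750 mW.
print(total_power_consistent(1800, 0.35, 120, 750))   # True

# A row like the one questioned above, where the numbers don't add up:
print(total_power_consistent(2200, 0.40, 420, 900))   # False (computed 1300 mW)
```

If real table rows fail a check like this, the copy-paste explanation suggested above seems the most likely one.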
 