NVIDIA Tegra Architecture

The guys at Anandtech believe that NVIDIA will release two Tegra 5 [Logan] variants, one with 384 CUDA "cores" (i.e. 2 Kepler SMXs) and one with 192 CUDA "cores" (i.e. 1 Kepler SMX), with a GPU clock of ~540MHz (which, according to them, happens to be Kayla's GPU clock). If fabricated on a 28nm process (as Anandtech believes), the SoC die size would have to be relatively large, but the perf/W and perf/mm^2 could still be very good. Kayla is a 2 SMX Kepler part (but likely based on the GT 735M and not the GT 640M LE, which is something I think Anandtech got wrong in their podcast). Since Logan is supposed to be at least as performant as Kayla (with respect to GFLOPS throughput?), it would make sense for there to be a 384 CUDA "core" Logan variant, unless there is some unknown piece of the puzzle we are all missing here. What is unclear at this time is how exactly NVIDIA will be able to keep die size and TDP under control with this many CUDA "cores" on a 28nm process. Is it really out of the question for NVIDIA to release a Tegra 5 [Logan] variant with 192 CUDA "cores" (i.e. 1 Kepler SMX) and a GPU clock close to 1GHz, considering that Tegra 4 [Wayne] already clocks its GPU at up to 672MHz, and that the Kepler GT 740M mobile GPU already clocks at up to 980MHz?
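For reference, a quick sketch of the peak FP32 arithmetic behind these figures (a back-of-envelope estimate assuming 2 FLOPs per CUDA "core" per clock via FMA; the clocks are the speculated values above, not confirmed specs):

```python
def peak_gflops(cuda_cores: int, clock_mhz: float) -> float:
    """Peak FP32 GFLOPS, assuming one FMA (2 FLOPs) per CUDA core per clock."""
    return cuda_cores * 2 * clock_mhz / 1000.0

print(peak_gflops(384, 540))  # ~414.7 GFLOPS (2 SMX @ 540MHz, Kayla-like)
print(peak_gflops(192, 540))  # ~207.4 GFLOPS (1 SMX @ 540MHz)
print(peak_gflops(192, 980))  # ~376.3 GFLOPS (1 SMX near 1GHz, the scenario above)
```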
 
Is it really out of the question for NVIDIA to release a Tegra 5 [Logan] variant with 192 CUDA "cores" (i.e. 1 Kepler SMX) and a GPU clock close to 1GHz, considering that Tegra 4 [Wayne] already clocks its GPU at up to 672MHz, and that the Kepler GT 740M mobile GPU already clocks at up to 980MHz?

Nothing is out of the question if you're willing to ignore power consumption, and I honestly hope no one comes up with any funky ALU hotclock idea; if it's going to be a Kepler grandchild, it's obviously going to inherit the architecture's wide-ALU approach. For a tablet you need a maximum of 5W (give or take) TDP for the entire SoC, and way less for a smartphone SoC (1-2W). What kind of TDP do you think the GT 740M really has in the end, and since it's manufactured at 28nm, how plausible do you think the kind of frequencies you're asking about really are?

Outside of that there are quite a few theoretical scenarios that would work; is it carved in stone anywhere that a stream processor must at any price yield only 1 FMAC per clock? And that's just one example out of many.

Well, 192 SPs @ 500MHz and ~200 GFLOPS would still be an impressive achievement.

Considering that the ULP GF in Tegra 4 (RIP) has 97 GFLOPS in total, are you sure it's really all that much? If Apple ships a quad-cluster Rogue at high enough frequencies within the year, it'll be almost yesteryear's news. They finally have a VERY aggressive featureset, so for 2014 they'd have to at least equal a G6630.
 
I have no idea what approach NVIDIA will take to make Tegra 5 work in thin fanless tablets and smartphones. One thing I do know is that we have yet to see any 192 CUDA "core" Kepler variant, so it may be a logical choice short of totally rearchitecting a Kepler SMX.

Based on the comments from Anandtech and NVIDIA, it appears that the highest performance Tegra 5 variant will have > 400 GFLOPS throughput, and will "easily" be in mass production by early next year. Think about that for a second. Within about one year's time, we may finally have better graphics performance than an Xbox 360 or PlayStation 3 from a low power mobile handheld device. In fact, ironically, Shield v2--with all of its hardware contained inside an Xbox/PS-like game controller--may have better graphics performance than a full Xbox 360 or PS3. More importantly for NVIDIA, they are increasing the GFLOPS throughput of Tegra by more than 30x (!) in approximately a two year time frame.
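As a sanity check on that >30x figure, a minimal sketch (the ~12.5 GFLOPS Tegra 3 baseline is my assumption, i.e. 12 shader "cores" x 2 FLOPs x ~520MHz, not a figure stated in the thread):

```python
tegra3_gflops = 12 * 2 * 520 / 1000.0  # ~12.5 GFLOPS (assumed baseline)
logan_gflops = 400.0                   # the figure discussed above
print(logan_gflops / tegra3_gflops)    # ~32x in roughly two years
```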
 
I have no idea what approach NVIDIA will take to make Tegra 5 work in thin fanless tablets and smartphones. One thing I do know is that we have yet to see any 192 CUDA "core" Kepler variant, so it may be a logical choice short of totally rearchitecting a Kepler SMX.

NV probably hasn't needed a single-cluster Kepler yet, but that's beside the point. Exactly because there are light-years of difference in terms of latency between any sort of high end design and an ultra-tiny SFF mobile design, I find it hard to believe that it won't be a special design from the ground up for the latter.

Based on the comments from Anandtech and NVIDIA, it appears that the highest performance Tegra 5 variant will have > 400 GFLOPS throughput, and will "easily" be in mass production by early next year.

NV stated what, exactly, that indicates any FLOP value for it?

Think about that for a second. Within about one year's time, we may finally have better graphics performance than an Xbox 360 or PlayStation 3 from a low power mobile handheld device.

What's there to think about? Do you actually think it wouldn't have happened even without NVIDIA? Of course now all of a sudden it's the second coming :devilish:
 
NV probably hasn't needed a single-cluster Kepler yet, but that's beside the point. Exactly because there are light-years of difference in terms of latency between any sort of high end design and an ultra-tiny SFF mobile design, I find it hard to believe that it won't be a special design from the ground up for the latter.

I'm sure that Tegra 5 will be a purpose-built low power design, but considering that Tegra 4 will already approach 100 GFLOPS throughput, I can't imagine that NVIDIA would aim for anything less than 200 GFLOPS for any Tegra 5 variant. So one Kepler SMX may be a logical building block, but who knows for sure.

NV stated what, exactly, that indicates any FLOP value for it?

NVIDIA didn't state any GFLOPS throughput for Logan, but they did say that they expect Logan to be faster than Kayla, and Anandtech claimed that Kayla has ~ 414 GFLOPS throughput.

What's there to think about? Do you actually think it wouldn't have happened even without NVIDIA? Of course now all of a sudden it's the second coming :devilish:

Obviously it would have happened with or without NVIDIA. I just find it ironic that a game controller next year may have better graphics performance than a full Xbox 360 or PS3 console this year. Obviously the next gen consoles will widen the gap again, but within ~2-3 years' time, handheld mobile devices will again be nipping at the heels of consoles.
 
NVIDIA didn't state any GFLOPS throughput for Logan, but they did say that they expect Logan to be faster than Kayla, and Anandtech claimed that Kayla has ~ 414 GFLOPS throughput.
Nvidia didn't say Logan had a faster GPU than Kayla. They said Logan was faster in some cases than Kayla, but didn't specify what cases those were.
 
Nvidia didn't say Logan had a faster GPU than Kayla. They said Logan was faster in some cases than Kayla, but didn't specify what cases those were.

NVIDIA demonstrated compute performance on Kayla at GTC 2013, and said that Logan will likely be even higher performance. So the implication is that Logan will match or exceed Kayla in terms of GFLOPS throughput (though not necessarily in graphics performance, depending on what bottlenecks exist in Logan).
 
I'm sure that Tegra 5 will be a purpose-built low power design, but considering that Tegra 4 will already approach 100 GFLOPS throughput, I can't imagine that NVIDIA would aim for anything less than 200 GFLOPS for any Tegra 5 variant. So one Kepler SMX may be a logical building block, but who knows for sure.

See my post #982.

NVIDIA didn't state any GFLOPS throughput for Logan, but they did say that they expect Logan to be faster than Kayla, and Anandtech claimed that Kayla has ~ 414 GFLOPS throughput.

With or without SFUs? And no, it's by far not a joke at all; I wish it were, at least.

Obviously it would have happened with or without NVIDIA. I just find it ironic that a game controller next year may have better graphics performance than a full Xbox 360 or PS3 console this year. Obviously the next gen consoles will widen the gap again, but within ~2-3 years' time, handheld mobile devices will again be nipping at the heels of consoles.

It's been how many years since the Xbox 360/PS3 were released? With the pace the SFF mobile market is moving so far, I'm not surprised one bit. And no, it'll take far longer than just 2-3 years until SFF mobile SoCs reach the upcoming consoles. More like >7 years at least, since process technology won't do them any favors and you can't have a new process every year.
 
Nvidia didn't say Logan had a faster GPU than Kayla. They said Logan was faster in some cases than Kayla, but didn't specify what cases those were.

Isn't Kayla using a 4-core A9 CPU? So the A15s (expected) in Logan should be faster.
 
With or without SFUs? And no, it's by far not a joke at all; I wish it were, at least.

Anandtech always quotes GFLOPS throughput exclusive of SFUs.

It's been how many years since the Xbox 360/PS3 were released? With the pace the SFF mobile market is moving so far, I'm not surprised one bit. And no, it'll take far longer than just 2-3 years until SFF mobile SoCs reach the upcoming consoles. More like >7 years at least, since process technology won't do them any favors and you can't have a new process every year.

I don't expect mobile handheld devices to reach the performance of next gen consoles within 2-3 years, but I do expect mobile handheld devices to reach feature parity with next gen consoles within 2-3 years, while being reasonably close in performance too (i.e. > 1 TFLOPS throughput, for lack of a better performance metric). That said, it will be quite a while before mobile handheld devices actually exceed the performance of next gen consoles, but certainly well within 7 years. If we have 400 GFLOPS throughput within one year, then throughput would have to grow only ~30% each year thereafter for mobile handheld devices to match the throughput of next gen consoles (~1.8 TFLOPS for the PS4) within that 7 year window, which seems way too low for a growth rate. If we assume a 50% growth rate in throughput each year starting one year from now, then the throughput of mobile handheld devices will exceed the throughput of next gen consoles within ~5 years from now. If we assume a 100% growth rate in throughput each year starting one year from now, then the throughput of mobile handheld devices will exceed the throughput of next gen consoles within ~3-4 years from now.
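To make the compounding explicit, here's a minimal sketch (the ~1.84 TFLOPS figure for the PS4 is my assumed "next gen console" bar; the starting point and growth rates are the scenarios above):

```python
import math

def years_to_reach(start_gflops: float, target_gflops: float, annual_growth: float) -> float:
    """Years of compound growth for start_gflops to reach target_gflops."""
    return math.log(target_gflops / start_gflops) / math.log(1.0 + annual_growth)

TARGET = 1840.0  # ~PS4 peak FP32 GFLOPS, used as the console bar
for rate in (0.30, 0.50, 1.00):
    print(f"{rate:.0%}/yr from 400 GFLOPS: ~{years_to_reach(400.0, TARGET, rate):.1f} years")
# ~5.8 years at 30%, ~3.8 years at 50%, ~2.2 years at 100%
# (plus the assumed one year wait until the 400 GFLOPS starting point)
```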
 
So it's back to this discussion I guess... You expect > 1 TFLOPS throughput in 2-3 years, a 13-15x increase in perf over what is just coming out now/soon.

There's absolutely no way perf/W will go up so dramatically in such a short time. I'd be very surprised if it exceeds 4x. That means at least 3 times more peak power consumption. Do you honestly think a device like Shield can tolerate 3x the peak power consumption of the GPU? It's probably already pushing on thermal limits with Tegra 4, much like it's pushing on the limits of form factor for a device that can still be considered mobile handheld.

I've said it several times, but the big mistake people are making is thinking that the power budget for handheld devices can keep growing like it has been.
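The power arithmetic behind this is worth spelling out (a hypothetical illustration; the 13-15x performance and 4x perf/W multiples are the estimates above, not measured figures):

```python
def required_power_multiple(perf_multiple: float, perf_per_watt_multiple: float) -> float:
    """If performance must grow by perf_multiple but efficiency only improves by
    perf_per_watt_multiple, power draw has to grow by the ratio of the two."""
    return perf_multiple / perf_per_watt_multiple

print(required_power_multiple(13.0, 4.0))  # 3.25x the peak power
print(required_power_multiple(15.0, 4.0))  # 3.75x the peak power
```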
 
Anandtech always quotes GFLOPS throughput exclusive of SFUs.

It took some time until it sank in how MADDs and FLOPs work together; thinking about SFUs on top of that would complicate things way too much.

I don't expect mobile handheld devices to reach the performance of next gen consoles within 2-3 years, but I do expect mobile handheld devices to reach feature parity with next gen consoles within 2-3 years, while being reasonably close in performance too (i.e. > 1 TFLOPS throughput, for lack of a better performance metric).
Rogue, the ULP GF in Logan, and Adreno 4xx will all be at least DX11; what will take time for some if not all of them will be things like FP64. Even the TFLOP will take longer, exactly because process technology is the limiting factor. Maybe for safe measure say 3-4 years.

That said, it will be quite a while before mobile handheld devices actually exceed the performance of next gen consoles, but certainly well within 7 years. If we have 400 GFLOPS throughput within one year, then throughput would have to grow only ~30% each year thereafter for mobile handheld devices to match the throughput of next gen consoles (~1.8 TFLOPS for the PS4) within that 7 year window, which seems way too low for a growth rate. If we assume a 50% growth rate in throughput each year starting one year from now, then the throughput of mobile handheld devices will exceed the throughput of next gen consoles within ~5 years from now.
Let's see those mythical 400 GFLOPS first. Let's see things like FP64 making their entry after that, plus quite a few other tidbits that cost transistors and power consumption too. You're making it way too easy for yourself; it'll take a magic wand to get an upcoming, say, >100W SoC into a =/<5W envelope in say 5-6 years' time. A viable scenario would be ~200 GFLOPS for up to 5W TDP tablet scenarios and ~400 GFLOPS for 10-15W TDP tablet scenarios. That's obviously then a totally different chapter.
 
So it's back to this discussion I guess... You expect > 1 TFLOPS throughput in 2-3 years, a 13-15x increase in perf over what is just coming out now/soon.

There's absolutely no way perf/W will go up so dramatically in such a short time. I'd be very surprised if it exceeds 4x. That means at least 3 times more peak power consumption. Do you honestly think a device like Shield can tolerate 3x the peak power consumption of the GPU? It's probably already pushing on thermal limits with Tegra 4, much like it's pushing on the limits of form factor for a device that can still be considered mobile handheld.

I've said it several times, but the big mistake people are making is thinking that the power budget for handheld devices can keep growing like it has been.

As I said above, if you don't restrict yourself to =/<5W tablets it could be a totally different ballgame. The higher the TDP can be, the higher the potential FLOP rate, obviously. Come to think of it, it's anything but a bad idea for NV (and obviously AMD) to want to actively battle ULV Haswell.
 
But he specifically said mobile handheld devices. We can extend it to around 8W for something like Shield. No way is 8W enough to do 1 TFLOPS in 2-3 years. I think my 4x perf/W figure was overly generous. Power consumption improvements from a direct shrink applied twice won't be able to get much higher than 2x. The rest comes from trading off more transistors for power consumption and from design improvements. The problem is that the 16nm TSMC FinFET node won't give them smaller transistors, so they only get more transistors once (unless they want to start making much larger dies for mobile, probably a bad move).

Those higher TDP chips are probably going to need to be separate, larger dies; there's a limit to how high mobile SoCs can scale. Maybe you think going against ULV Haswell in large/higher-end Windows 8 tablets, ultrabooks, laptops, etc. is a good idea. Personally I think it'd be a disaster for nVidia.

The one place where a high TDP nVidia ARM SoC would have made sense is a console, but that ship has sailed (and the big obstacle there would have been 64-bit not being ready in time).
 
So it's back to this discussion I guess... You expect > 1 TFLOPS throughput in 2-3 years, a 13-15x increase in perf over what is just coming out now/soon.

Ok, maybe not > 1 TFLOPS throughput within 2-3 years, but close to 1 TFLOPS throughput within 3-4 years, assuming that we actually see 400 GFLOPS throughput within 1 year. That would mean that the growth rate in throughput would be "only" 50% each year starting one year from now. Of course, it is a complete mystery how and if NVIDIA can achieve 400 GFLOPS throughput in a mobile handheld device within one year's time, especially if a 28nm fabrication process is still used. If they can achieve "only" 200 GFLOPS throughput in a mobile handheld device in one year's time (which is still very good, relatively speaking), and assuming that the growth rate is 50% each year starting one year from now, then it would take about 5 years to achieve close to 1 TFLOPS throughput, and about 6-7 years to exceed the throughput of next gen consoles.
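Running the same compound-growth arithmetic as before with the 200 GFLOPS fallback (same assumptions: ~1.84 TFLOPS console bar, 50% yearly growth starting one year out):

```python
import math

def years_to_reach(start_gflops: float, target_gflops: float, annual_growth: float) -> float:
    return math.log(target_gflops / start_gflops) / math.log(1.0 + annual_growth)

print(years_to_reach(200.0, 1000.0, 0.5))  # ~3.9 years to ~1 TFLOPS
print(years_to_reach(200.0, 1840.0, 0.5))  # ~5.5 years to the assumed console bar
# plus the assumed first year: roughly 5 and 6-7 years out, matching the estimate above
```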
 
Ok, maybe not > 1 TFLOPS throughput within 2-3 years, but close to 1 TFLOPS throughput within 3-4 years, assuming that we actually see 400 GFLOPS throughput within 1 year. That would mean that the growth rate in throughput would be "only" 50% each year starting one year from now. Of course, it is a complete mystery how and if NVIDIA can achieve 400 GFLOPS throughput in a mobile handheld device within one year's time, especially if a 28nm fabrication process is still used.

Complete mystery indeed. It's also a complete mystery how Cortex-A57 can offer 3x the performance at the same power consumption or 5x power efficiency vs Cortex-A15 merely going from TSMC's 28nm to 16nm, but this is apparently what ARM claims. Best explanation: performance estimates on future products are often grossly inaccurate.

However I'll give Ailuros the benefit of the doubt on this one; that 400 GFLOPS number is attainable with a much higher power budget. Did nVidia actually say anything about 400 GFLOPS in a handheld device within one year's time?
 
All NVIDIA said was that Logan will fit inside a dime (which doesn't say much, because the actual area of a dime is quite large in terms of mm^2; a US dime is ~17.9mm across, or roughly 250mm^2), will not require a fan or heatsink, and will likely outperform Kayla. So pretty vague, all things considered. I don't know what TSMC's roadmap looks like, but is there any chance that Logan could be fabricated on a 20nm process if it will reportedly be easily in production by early next year?
 
All NVIDIA said was that Logan will fit inside a dime (which doesn't say much, because the actual area of a dime is quite large in terms of mm^2; a US dime is ~17.9mm across, or roughly 250mm^2), will not require a fan or heatsink, and will likely outperform Kayla. So pretty vague, all things considered.

But will it do all of those things simultaneously? How many times have we seen claims of "will provide X more performance! Will use Y less power!" where it's obvious that the two weren't meant to be true at the same time? Maybe it can be used w/o a fan or heatsink, but only if you limit performance to well below Kayla's. Or have some other heavy cooling they feel comfortable not calling a fan or heatsink.

Plus the other point that the "outperform Kayla" part may be taking into consideration the much better non-GPU half of Logan, since Kayla is just using a weak old Tegra 3 for that.

I don't know what TSMC's roadmap looks like, but is there any chance that Logan could be fabricated on a 20nm process if it will reportedly be easily in production by early next year?

If TSMC is to be believed it's at least possible:

http://focustaiwan.tw/news/atod/201304010042.aspx

TSMC's schedule announcements are always hopelessly optimistic, but maybe they're a little more accurate this close to the date.

I haven't heard Anand's podcast, so I don't know if this 28nm information is based on real details from nVidia or is just their hunch. It'd be much better for nVidia to be a few months later than their projections (which they always are anyway) and on 20nm than to have a real early 2014 release but on 28nm.
 
If TSMC is to be believed it's at least possible:

http://focustaiwan.tw/news/atod/201304010042.aspx

TSMC's schedule announcements are always hopelessly optimistic, but maybe they're a little more accurate this close to the date.

I haven't heard Anand's podcast, so I don't know if this 28nm information is based on real details from nVidia or is just their hunch. It'd be much better for nVidia to be a few months later than their projections (which they always are anyway) and on 20nm than to have a real early 2014 release but on 28nm.

What did yields look like in early 2012 for 28LP, and why should any 20nm variant be any better in early 2014? Has any 20nm capacity been secured already, and if yes, by whom?

Rumors are obviously just rumors; but if the rumors that they're planning to manufacture some early, smaller Maxwell chips on 28nm turn out to be true, while in a comparable timeframe they're using 20nm for an SFF SoC, I'll have a long hard laugh. By the way, did it just fly past me, or did they postpone Project Denver by another process generation?
 
Exophase said:
I haven't heard Anand's podcast, so I don't know if this 28nm information is based on real details from nVidia or is just their hunch. It'd be much better for nVidia to be a few months later than their projections (which they always are anyway) and on 20nm than to have a real early 2014 release but on 28nm.

Anandtech was purely speculating that Tegra 5 would use a 28nm fabrication process. Based on NVIDIA's performance expectations, and based on the link you provided above, I cannot imagine that Tegra 5 will use anything other than a 20nm fabrication process.
 