NVIDIA Tegra Architecture

mczak · Jul 8, 2013

ToTTenTranz said:
AFAIK, there's no GK108.

There's the GK107 with 384 shader units, 32 TMUs and 16 ROPs and now there's the GK208, which appears to be either a respin of the GK107 or a smaller GPU with the same number of shader units but half the TMUs and ROPs.

But these are GPUs with over 1B transistors, which could be too much.
I was actually expecting nVidia to deliver something like half a GK107 with Tegra 5.

GK208 is NOT a respin of GK107. It has the same number of SMX (2) and hence the same number of TMUs, but only half the memory interface and hence half the ROPs. But it actually has a beefed-up L2 cache and, unlike the GK10x chips, supports CUDA compute capability 3.5. Interestingly, despite dropping just some ROPs and memory interface (and doubling the L2 per chip, since it has 4 times the L2 per ROP), it seems to be quite a bit smaller, and perf/power also seems to be improved.
I don't think that fits into the Tegra 5 power budget either. 384 ALUs is a lot: AMD could squeeze just 128 ALUs into Temash, and they have to run at a really low clock to fit into a tablet, so 2 SMX seems rather impractical to me, even on 20nm. Maybe "GK108 derived" really means just that, a theoretical 1 SMX version (though even then I would expect it to be "GK209 derived" instead, even if the differences are small). After all, 192 ALUs would still be a lot but look far more reasonable than 384: ~2.6 times the ALUs of Tegra 4, and much more capable (and beefier, more power-hungry) ones at that. FP32 vs. FP20 certainly won't come for free.
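
As a rough check of the lane-count comparison above, here is a minimal sketch. The 192 lanes per SMX comes from the post; the 72-lane figure for Tegra 4 is an assumption (48 pixel plus 24 vertex shader lanes), and the function name is purely illustrative.

```python
# 192 lanes per Kepler SMX is taken from the post above.
KEPLER_LANES_PER_SMX = 192
# Assumption: Tegra 4 has 72 shader lanes (48 pixel + 24 vertex).
TEGRA4_LANES = 72


def lane_ratio(smx_count, baseline_lanes=TEGRA4_LANES):
    """Return how many times more ALU lanes smx_count SMX have than the baseline."""
    return smx_count * KEPLER_LANES_PER_SMX / baseline_lanes


for smx in (1, 2):
    lanes = smx * KEPLER_LANES_PER_SMX
    print(f"{smx} SMX: {lanes} lanes, {lane_ratio(smx):.2f}x Tegra 4")
```

Under that assumption a single SMX comes out at roughly 2.7 times the Tegra 4 lane count, in line with the "~2.6 times" estimate. The ratio says nothing about per-lane capability (FP32 vs. FP20).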

Ailuros · Jul 8, 2013

1. It might be hair splitting but you're still talking about ALU lanes and not ALUs.
2. ALU lanes and in extension ALUs are relatively cheap in hw compared to many other parts of a GPU.
3. When any IHV claims N relative performance for such a case it's most likely just arithmetic efficiency.
4. At the right (quite realistic) frequency under 20nm, a G6630 Rogue could theoretically exceed the 300 GFLOPs mark in 2014. It doesn't sport more than 96 ALU lanes last time I checked.
5. Where's the barrier between fairy tale and reality and why can't anyone's speculative math lead to a reasonable result while judging from desktop GPU designs?

Erinyes · Jul 8, 2013

ToTTenTranz said:
AFAIK, there's no GK108.

There's the GK107 with 384 shader units, 32 TMUs and 16 ROPs and now there's the GK208, which appears to be either a respin of the GK107 or a smaller GPU with the same number of shader units but half the TMUs and ROPs.

But these are GPUs with over 1B transistors, which could be too much.
I was actually expecting nVidia to deliver something like half a GK107 with Tegra 5.

Yes, I know there is no GK108, which is why I said "GK108 class". I thought it would be understood what I meant, but maybe I should have been clearer. Like mczak said, theoretically it should be over 2x the ALU power of T4. It remains to be seen what real-world performance will be like.

Mczak, agreed: the clocks will have to be fine-tuned for it to fit in a smartphone or tablet form factor. AFAIK it is still on 28nm though. Ailuros, yep, this is a 2014 part, so it had to be designed to compete with the likes of Rogue.

PS: The GK208 you guys are talking about is actually GM208, i.e. Maxwell derived.

ninelven · Jul 8, 2013

Erinyes said:
The GK208 you guys are talking about is actually GM208, i.e. Maxwell derived.

I am fairly certain you are mistaken...

Erinyes · Jul 8, 2013

ninelven said:
I am fairly certain you are mistaken...

Hmmm, well, that's the info I got, and the source has proven to be right in the past (e.g. the T3 tape-out and GK104 die size), but maybe I'm mistaken this time. Guess we'll have to wait and watch. Though let's take it to PM or to the appropriate thread and not be OT here. Cheers

mczak · Jul 8, 2013

Ailuros said:
1. It might be hair splitting but you're still talking about ALU lanes and not ALUs.

Yes, I'm sloppy with the terms.
Anyway, I guess if you want ~300 GFLOPS then yes, that sounds like you'd need 2 SMX for best power efficiency (with 1 SMX that would need ~700MHz, which is probably above the most efficient operating range).
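
The clock estimate above can be sketched as follows, assuming a Kepler-style 2 FLOPs per ALU lane per clock (one fused multiply-add) and 192 lanes per SMX. The function name and parameters are illustrative, not taken from any NVIDIA tool.

```python
def required_clock_mhz(target_gflops, alu_lanes, flops_per_lane_per_clock=2):
    """Clock in MHz needed to reach target_gflops of peak throughput.

    Assumes every lane retires flops_per_lane_per_clock FLOPs each cycle,
    i.e. theoretical peak, not sustained performance.
    """
    flops_per_clock = alu_lanes * flops_per_lane_per_clock
    return target_gflops * 1000.0 / flops_per_clock


one_smx = required_clock_mhz(300, 192)   # a hypothetical 1 SMX part
two_smx = required_clock_mhz(300, 384)   # a 2 SMX part such as GK208
print(f"1 SMX: {one_smx:.0f} MHz, 2 SMX: {two_smx:.0f} MHz")
```

Under these assumptions a single SMX needs roughly 780 MHz for 300 GFLOPS, somewhat above the ~700 MHz ballpark in the post, while two SMX need about 390 MHz. Raising `flops_per_lane_per_clock` lowers the required clock proportionally.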

Erinyes said:
Hmmm, well, that's the info I got, and the source has proven to be right in the past (e.g. the T3 tape-out and GK104 die size), but maybe I'm mistaken this time. Guess we'll have to wait and watch. Though let's take it to PM or to the appropriate thread and not be OT here. Cheers

GK208 is already out and used in some products; I have not seen any specific rumors about Maxwell-based parts anywhere yet. The TDP is definitely way above what would be suitable for a tablet though, even for the mobile versions.

xpea · Jul 8, 2013

Erinyes said:
Been off the grid for a while, but I have an interesting tidbit to share. Apparently Tegra 5 came back from the fab last week and is already up and running at speeds >2 GHz. The CPU is still the same Cortex A15 layout as T4. And it has a Kepler-derived, GK108 class GPU.

Hmm, silicon working in July... if we take into account the typical 6 months from first silicon to production, maybe this time they will be ready for announcements in Barcelona and shipping products in spring. Interesting. Maybe Qualcomm will finally have some competition in the high end...

Arun · Jul 9, 2013

xpea said:
Hmm, silicon working in July... if we take into account the typical 6 months from first silicon to production, maybe this time they will be ready for announcements in Barcelona and shipping products in spring. Interesting. Maybe Qualcomm will finally have some competition in the high end...

FWIW, Tegra 2 had silicon back in June 2009 and Tegra 3 had silicon back in February 2011. No idea about Tegra 4 and Tegra 5.

xpea · Jul 9, 2013

Arun said:
FWIW, Tegra 2 had silicon back in June 2009 and Tegra 3 had silicon back in February 2011. No idea about Tegra 4 and Tegra 5.

Yeah, I know, but those were their first products. This time the team has more experience, the manufacturing process is the same, the CPU blocks should be nearly the same, interfaces and I/O ditto; the only doubt is about the GPU. Bringing full OpenGL 4.3 and the latest CUDA compute capability 3.5 level hardware to the mobile space is a bold move...

Exophase · Jul 9, 2013

xpea said:
Yeah, I know, but those were their first products. This time the team has more experience, the manufacturing process is the same, the CPU blocks should be nearly the same, interfaces and I/O ditto; the only doubt is about the GPU. Bringing full OpenGL 4.3 and the latest CUDA compute capability 3.5 level hardware to the mobile space is a bold move...

nVidia has made lots of chips before Tegra 2 which wasn't even their first Tegra. The other stuff you said would mostly apply to the transition from Tegra 2 to Tegra 3 (same process, same CPU type, GPU was less of a change too) which still needed about 10 months after February 2011 before it showed up in its first product.

Erinyes · Jul 9, 2013

mczak said:
GK208 is already out and used in some products; I have not seen any specific rumors about Maxwell-based parts anywhere yet. The TDP is definitely way above what would be suitable for a tablet though, even for the mobile versions.

OK, brainfart on my part. Like I said, I was off the grid for a while and wasn't following the sector too closely. I think I got that part confused with the new Maxwell part. Forgot they had already launched GK208. Well, then I think that means that GM208(7?) is also back from the fab.

Arun said:
FWIW, Tegra 2 had silicon back in June 2009 and Tegra 3 had silicon back in February 2011. No idea about Tegra 4 and Tegra 5.

Not entirely sure about T4 but I do know that it was delayed by a fair bit. Will try finding out the date.

xpea said:
Hmm, silicon working in July... if we take into account the typical 6 months from first silicon to production, maybe this time they will be ready for announcements in Barcelona and shipping products in spring. Interesting. Maybe Qualcomm will finally have some competition in the high end...

It's not uncommon to see first silicon back so early. See above for what Arun said regarding T2 and T3. As per NV, Tegra 3 was supposed to be available in a consumer tablet ~6 months after first silicon, and we all know what happened. As for the launch, CES 2014 is also a candidate if the T4 launch is anything to go by.

xpea said:
Yeah, I know, but those were their first products. This time the team has more experience, the manufacturing process is the same, the CPU blocks should be nearly the same, interfaces and I/O ditto; the only doubt is about the GPU. Bringing full OpenGL 4.3 and the latest CUDA compute capability 3.5 level hardware to the mobile space is a bold move...

Agreed. You would think that, given that it's a similar process to T4 and a very similar chip, they could bring it to production faster, but we saw what happened with T2 to T3.

Exophase said:
nVidia has made lots of chips before Tegra 2 which wasn't even their first Tegra. The other stuff you said would mostly apply to the transition from Tegra 2 to Tegra 3 (same process, same CPU type, GPU was less of a change too) which still needed about 10 months after February 2011 before it showed up in its first product.

I think 9-10 months was the first tablet; the first smartphone was closer to 12 months, wasn't it?

Exophase · Jul 9, 2013

Erinyes said:
I think 9-10 months was the first tablet; the first smartphone was closer to 12 months, wasn't it?

Yeah but it was never a big hit on phones. Tegra 4 will be even less of one, hence the need for Tegra 4i.. and if Tegra 5 is essentially a respin of Tegra 4 with a Kepler GPU then it'll be even less still. If it's still on 28nm, that is.

But then again I'm not sure getting first silicon back now for a 20nm SoC is that out of the question, depending on exactly what first silicon means.

Ailuros · Jul 9, 2013

mczak said:
Yes, I'm sloppy with the terms.
Anyway, I guess if you want ~300 GFLOPS then yes, that sounds like you'd need 2 SMX for best power efficiency (with 1 SMX that would need ~700MHz, which is probably above the most efficient operating range).

Depends on who wants what and how exactly they want it; there is more than one way to reach an N FLOP target, and there's no rule that tells hw designers that they have to limit themselves to "just" 2 FLOPs/ALU lane. I'm not saying it will be the case, but you folks should keep an open mind with these things, since yes, you will see multiple upcoming mobile GPUs that have >2 FLOPs for those cases.

Ailuros · Jul 9, 2013

Exophase said:
But then again I'm not sure getting first silicon back now for a 20nm SoC is that out of the question, depending on exactly what first silicon means.

Since the newest projections seem to target H2 and not H1 2014 for Logan, chances are far better than before for anything 20nm IMHO.

mczak · Jul 9, 2013

Ailuros said:
Depends on who wants what and how exactly they want it; there is more than one way to reach an N FLOP target, and there's no rule that tells hw designers that they have to limit themselves to "just" 2 FLOPs/ALU lane. I'm not saying it will be the case, but you folks should keep an open mind with these things, since yes, you will see multiple upcoming mobile GPUs that have >2 FLOPs for those cases.

Sure yes but the assumption here was that it's basically Kepler, so it would be 2 FLOPs per ALU lane. But well maybe they aren't going for 300 GFlops for Tegra 5 (though if others are going to reach that I think nvidia should too).

Ailuros · Jul 9, 2013

mczak said:
Sure yes but the assumption here was that it's basically Kepler, so it would be 2 FLOPs per ALU lane. But well maybe they aren't going for 300 GFlops for Tegra 5 (though if others are going to reach that I think nvidia should too).

It won't be basically any Kepler but the usual derivative carefully adjusted for perf/mW and SFF mobile SoCs; just as the ULP GeForces so far aren't NV3x, NV4x or anything else, but a happy go merry amalgam from left and right with PS precision at FP20, which isn't present in any of the desktop GPUs.

NV like any other graphics developer for that market might want to have high arithmetic efficiency because it's relatively cheap and might be needed down the line for GPGPU in their upcoming mobile GPUs yet not necessarily a shitload of rasters, ROPs or TMUs. By the time you start scratching entire TMU quads, reduce this and reduce that, the real question in the end is what's left after that and if you really can call it "Kepler" after all.

I'm actually repeating myself with that stuff in this thread but that's beside the point; another thing, as I've already said, is that we still don't know how many Logan variants there will be and whether NV intends to develop a higher-end variant to go against low-power Haswells. In such a case, if they're not bound to, let's say, a 5W TDP for a tablet but to twice as much, as a simple example, it might even end up being more than 300 GFLOPs.

Now that they're even licensing GPU IP they could eventually just release one Logan variant with something that resembles in arithmetic efficiency a single Kepler cluster and scale from there upwards in GPU IP variants only.

Deleted member 13524 · Jul 9, 2013

The difference is that until now, nVidia has been calling the iGPU in their SoCs "ULP GeForce", whereas in Tegra 5 they're calling it Kepler.

Not giving it a generic name like before seems to suggest it's part of the Kepler family, which would mean using at least one SMX with 192 ALUs and 16 TMUs.

Besides, a GK208 measures only 79mm^2 on 28nm. nVidia could probably fit an entire GK208 into a SoC with 4+1 Cortex A15 cores in something like 120mm^2 on 28nm. Clocking the iGPU to 350MHz would make them competitive with the 300GFLOPS iGPUs from the competition.
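
A quick sanity check of the 350MHz figure, as a sketch only: it assumes 384 ALU lanes for a full GK208 and 2 FLOPs per lane per clock, and the helper name is made up for illustration.

```python
def peak_gflops(alu_lanes, clock_mhz, flops_per_lane_per_clock=2):
    """Theoretical peak GFLOPS for a given lane count and clock in MHz."""
    return alu_lanes * flops_per_lane_per_clock * clock_mhz / 1000.0


gk208_at_350 = peak_gflops(384, 350)
print(f"384 lanes @ 350 MHz: {gk208_at_350:.1f} GFLOPS")
```

On those assumptions the result is a bit under 270 GFLOPS, so "competitive with" 300 GFLOPS rather than equal to it. This is peak arithmetic only and ignores bandwidth, TMU and ROP limits.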

mczak · Jul 9, 2013

Ailuros said:
NV like any other graphics developer for that market might want to have high arithmetic efficiency because it's relatively cheap and might be needed down the line for GPGPU in their upcoming mobile GPUs yet not necessarily a shitload of rasters, ROPs or TMUs. By the time you start scratching entire TMU quads, reduce this and reduce that, the real question in the end is what's left after that and if you really can call it "Kepler" after all.

I certainly wouldn't call that "Kepler". To even say "Kepler-derived", imho it has to keep the basic building blocks essentially the same. Ripping half the TMUs out of an SMX might just be OK, and you could reduce the number of ALU lanes by some factor too and it would still look mostly the same. But once you start redesigning even the ALU lanes, that imho isn't really "Kepler-derived" any longer; at that point you could probably just as well call it "Fermi-derived", because the differences to both are going to be about as large.
Not saying this isn't what nvidia is doing, but since I keep hearing "Kepler-derived" I was assuming it at least looks similar to Kepler.

I'm actually repeating myself with that stuff in this thread but that's beside the point; another thing, as I've already said, is that we still don't know how many Logan variants there will be and whether NV intends to develop a higher-end variant to go against low-power Haswells. In such a case, if they're not bound to, let's say, a 5W TDP for a tablet but to twice as much, as a simple example, it might even end up being more than 300 GFLOPs.

That is true I was assuming it's tablet only, which might not be the case. Agreed that if there's 10W or so versions they could do a lot more.

Ailuros · Jul 9, 2013

mczak said:
I certainly wouldn't call that "Kepler". To even say "Kepler-derived", imho it has to keep the basic building blocks essentially the same. Ripping half the TMUs out of an SMX might just be OK, and you could reduce the number of ALU lanes by some factor too and it would still look mostly the same. But once you start redesigning even the ALU lanes, that imho isn't really "Kepler-derived" any longer; at that point you could probably just as well call it "Fermi-derived", because the differences to both are going to be about as large.
Not saying this isn't what nvidia is doing, but since I keep hearing "Kepler-derived" I was assuming it at least looks similar to Kepler.

I'd be very and I mean VERY surprised if in the ULP GF of Logan ALUs will be completely de-coupled from TMUs. If that's the case it's not even G80 derived even if it has SM35 ALUs.

That is true I was assuming it's tablet only, which might not be the case. Agreed that if there's 10W or so versions they could do a lot more.

I'm just speculating because there seem to be sightings of Anand having heard of north of 400 GFLOPs. The above is so far the only case where I could imagine it would make sense. In any other case it's probably just another false rumor based on a misunderstanding.

DSC · Jul 12, 2013

Tegra 5 GPU is GK208 derived, Kayla GK208 card supports CUDA Compute Capability 3.5 and T5 will also support Compute Capability 3.5.

https://developer.nvidia.com/content/cuda-arm-platforms-now-available

This GPU has Compute Capability 3.5, meaning that it supports most of the same CUDA features of a high-end Tesla K20 GPU.
