NVIDIA Tegra Architecture

Krait is a mystery to me. I'd love to point at some features of the uarch and speculate on those contributing to certain weak benchmark scores, but we know close to nothing about the uarch. Just what Anand says, which frankly I take with a grain of salt to begin with. The only things we really know are that it has pretty small L1 caches (16KB), a smaller so-called L0 data cache that might be adding latency for code that doesn't play nice with it, and an asynchronous decoupled L2 cache that could also suffer from extra latency.

I doubt individual vendors are shipping vastly optimized Javascript engines beyond what comes with standard Android versions. And whatever they do has to be contributed back to mainline. At best they're bundling newer builds of publicly available code.

SunSpider is a pretty terrible benchmark (most of them are, but SunSpider is especially bad). I could see a lot of glass jaws or easy wins on it that aren't really characteristic of what you see elsewhere. I'm more concerned with the SPECint2k scores.

I've never been much of a gadget freak which makes this sound really weird now.. it looks like I currently have at least one Cortex-A8, Scorpion, Cortex-A9, Krait, and Cortex-A15 device >_> So I could do a bunch of low level comparisons but I don't really have that kind of time :/ At the very least, I could do some benchmarking of my own software at some point.
 
It doesn't matter, as it's dead on arrival.

Tegra 4i, however, is much more interesting for an upper-midrange phone.

Let's not get carried away here. Tegra 4 will still arrive to market at least six months before Tegra 4i, and T4 is still a much better choice for high-performance, high-res tablets and clamshell devices than T4i. T4's quad-core Cortex A15 CPU will likely be the fastest low-power mobile CPU for most of this generation, and T4's GPU will likely be one of the fastest low-power mobile GPUs for most of this generation too. FWIW, NVIDIA has recently indicated that Tegra 4 is ahead of Tegra 3 with respect to quality and quantity of design wins in all areas except one instance [probably the Nexus 7 refresh] where T4 was simply not ready in time to win the design.
 
I don't care much about design wins, least of all design wins as reported by NVIDIA. I care about actual product releases with a Tegra chip in them. I'll judge Tegra based on those.
 
Let's not get carried away here. Tegra 4 will still arrive to market at least six months before Tegra 4i, and T4 is still a much better choice for high-performance, high-res tablets and clamshell devices than T4i. T4's quad-core Cortex A15 CPU will likely be the fastest low-power mobile CPU for most of this generation, and T4's GPU will likely be one of the fastest low-power mobile GPUs for most of this generation too.

It'll definitely place in the top 10 of benchmark charts when it arrives, but not necessarily in the top 10 of any perf/mW chart, and that's one chart no website will be able to fill in easily. And even setting aside the fact that the ULP GF in T4 won't be the fastest SFF mobile GPU this year, perf/mW and/or perf/W is many times more important in that market than performance by itself. Heck, perf/W has become THE defining factor for high-end desktop GPUs too, and that's one point NV understood better than ever with Kepler.

FWIW, NVIDIA has recently indicated that Tegra 4 is ahead of Tegra 3 with respect to quality and quantity of design wins in all areas except one instance [probably the Nexus 7 refresh] where T4 was simply not ready in time to win the design.

Since the Nexus 7 deal was worth 6 million units and was by far the highest-volume deal for T3, I'm not so sure why anyone should expect any glorious sales volumes from T4.
 
It'll definitely place in the top 10 of benchmark charts when it arrives, but not necessarily in the top 10 of any perf/mW chart, and that's one chart no website will be able to fill in easily.

Why would the T4 GPU not have competitive perf/W on average compared to other mobile GPUs when it is fabricated at 28nm and has the power consumption benefit of using FP20 pixel shader precision? And why would the T4 CPU not have competitive perf/W on average when it can complete tasks more quickly than slower CPUs and then go to sleep while switching over to the low-power battery-saver core? What reviewers really need is some way to effectively measure both performance and power consumption for individual applications (including games). I'm not sure how easy it would be to measure power consumption short of hooking up a power meter directly to the battery terminals (assuming that one can even gain easy access to the battery terminals in the first place).
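
For what it's worth, some Android devices let you approximate whole-device draw in software, assuming the kernel exposes the standard Linux power_supply sysfs nodes (the exact paths, sign conventions, and units vary by device, so treat this as a rough sketch rather than a proper measurement):

```python
# Rough whole-device power estimate from the battery fuel gauge. Assumes the
# standard Linux power_supply sysfs nodes are exposed; paths, signs, and units
# vary by device (many gauges report microvolts/microamps), so results are
# approximate at best.
import time

VOLTAGE = "/sys/class/power_supply/battery/voltage_now"   # typically microvolts
CURRENT = "/sys/class/power_supply/battery/current_now"   # typically microamps

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def sample_power_mw(duration_s=10.0, interval_s=0.5):
    """Average battery power draw in mW over a sampling window."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        uv = read_int(VOLTAGE)
        ua = read_int(CURRENT)
        samples.append((uv / 1e6) * (ua / 1e6) * 1e3)  # V * A -> W -> mW
        time.sleep(interval_s)
    return sum(samples) / len(samples)

if __name__ == "__main__":
    print("Average draw: %.0f mW" % sample_power_mw())
```

It only gives battery-level totals, not per-block isolation, so it's at best a sanity check next to a properly instrumented setup.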

Since the Nexus 7 deal was worth 6 million units and was by far the highest-volume deal for T3, I'm not so sure why anyone should expect any glorious sales volumes from T4.

The expectation is that any loss in T3 tablet sales will be offset by growth in T4 tablet/clamshell sales and T4i smartphone sales, so overall Tegra revenue in FY2014 will probably be flat compared to FY2013.
 
Why would the T4 GPU not have competitive perf/W on average compared to other mobile GPUs when it is fabricated at 28nm and has the power consumption benefit of using FP20 pixel shader precision?

We don't know but I'm thinking this:

1) nVidia boasts that they have a huge perf/mm^2 advantage even on the same process node; you don't tend to get that without sacrificing perf/W
2) We know they're running at pretty high clock speeds compared to at least some of the competition
3) FP20 pixel shaders aren't an advantage vs. competitors that have FP16 options, if FP16 is what they predominantly use in their pixel shaders
4) Lack of tiling vs everyone else and TBDR vs IMG IMO puts them at a power disadvantage

And why would the T4 CPU not have competitive perf/W on average when it can complete tasks more quickly than slower CPUs and then go to sleep while switching over to the low-power battery-saver core?

Because "hurry up and go to sleep" isn't a good perf/W optimization. If you spend twice as long performing a task at half the clock rate, you will use (often substantially) less than half the energy doing so. And if a processor's extra power draw outpaces its extra speed, it will use more energy on a task. The amount of power drawn while truly sleeping is going to be comparable for all players, battery-saver core or not.

The only time it makes a difference is if running your task means other components have to stay powered and can't sleep until the CPU does, components that draw a fixed amount of power. But that's not generally how things work. You tend to want the display either on well after the task is finished or already off when the task started.
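
To put rough numbers on that, here's a toy model rather than measured data: the capacitance constant, voltages, and idle power below are made up, and active power is approximated as C*f*V^2.

```python
# Toy race-to-idle vs. slow-and-steady energy comparison.
# Assumes dynamic power ~ C * f * V^2 and that a lower clock allows a lower
# voltage; the constants are illustrative, not measurements of any real SoC.

def task_energy_mj(work_cycles, freq_mhz, volt, idle_mw, window_s, c=0.3):
    """Energy (mJ) to finish `work_cycles` at freq/volt, then idle out the window."""
    active_s = work_cycles / (freq_mhz * 1e6)
    active_mw = c * freq_mhz * volt ** 2        # dynamic power model, in mW
    idle_s = max(window_s - active_s, 0.0)
    return active_mw * active_s + idle_mw * idle_s

WORK = 2e10          # cycles of work for the task (arbitrary)
WINDOW = 20.0        # seconds until the next wakeup anyway
IDLE_MW = 30.0       # deep-idle power, roughly comparable for everyone

fast = task_energy_mj(WORK, freq_mhz=2000, volt=1.1, idle_mw=IDLE_MW, window_s=WINDOW)
slow = task_energy_mj(WORK, freq_mhz=1000, volt=0.9, idle_mw=IDLE_MW, window_s=WINDOW)
print("race-to-idle: %.0f mJ, slow-and-steady: %.0f mJ" % (fast, slow))
```

With these made-up numbers the slower run finishes the same work with roughly a third less energy, before even accounting for the higher leakage that usually comes with the high-voltage corner.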

What reviewers really need is some way to effectively measure both performance and power consumption for individual applications (including games). I'm not sure how easy it would be to measure power consumption short of hooking up a power meter directly to the battery terminals (assuming that one can even gain easy access to the battery terminals in the first place).

Reviewers are already gutting tablets and rigging power monitoring circuits to them. Measuring power consumption of different parts in isolation is not a problem.

The problem is that no one is even trying to normalize performance, or to fix performance at several different points. This is especially hard to parse for GPU scores, where you get wildly different performance and power consumption at the same time. GPU performance is hard for the user to limit artificially but easy for the program to, though generally that only applies if both platforms are frame-rate limited all the time.
 
We don't know but I'm thinking this:

1) nVidia boasts that they have a huge perf/mm^2 advantage even on the same process node; you don't tend to get that without sacrificing perf/W
2) We know they're running at pretty high clock speeds compared to at least some of the competition

That may be true, but we don't have any evidence yet to show that the T4 GPU sacrifices perf/W relative to the competition at 28nm. T4's GPU is also an order of magnitude smaller in die area compared to many of the other mobile GPUs. And using a smaller GPU die size at a higher GPU clock operating frequency does not always result in worse perf/W compared to a completely different GPU architecture using a larger GPU die size at a lower GPU clock operating frequency (such as GK104 vs. Tahiti).

3) FP20 pixel shaders isn't an advantage vs the competitors that have FP16 options, if they're predominantly used in their pixel shaders

As far as I know, all of T4's competitors will be using a unified shader architecture with full FP32 precision; is that not correct?

4) Lack of tiling vs everyone else and TBDR vs IMG IMO puts them at a power disadvantage

The T4 GPU does have early-Z processing, so Z, color, and texture work for hidden pixels is discarded; it would be hard to say exactly how that implementation compares to the various tiling architectures with respect to power efficiency.

Because "hurry up and go to sleep" isn't a good perf/W optimization.

Hurry up and go to sleep while switching over to a very low power state can certainly result in competitive perf/W in some cases. For instance, if a very fast CPU "A" can complete a task in 10 seconds while consuming an average of, say, 2000mW of power (i.e. 1500mW consumed by CPU "A" plus 500mW consumed by the screen) and then go to sleep while switching over to a very low power state or core using, say, 500mW for the next 10 seconds, the average power consumed would be essentially identical to a more power-efficient but slower CPU "B" that completes the task in 20 seconds while consuming an average of 1250mW (i.e. 750mW consumed by CPU "B" plus 500mW consumed by the screen), even with the screen powered on in both cases. The more efficient the battery-saver core and the more time spent in sleep mode relative to performance mode, the more the equation tilts in CPU A's favor too.

Reviewers are already gutting tablets and rigging power monitoring circuits to them. Measuring power consumption of different parts in isolation is not a problem.

This is easier said than done though. When Anand was trying to isolate CPU power and GPU power, he wasn't even sure what else was being powered on a given voltage rail.
 
@Exophase: thanks for your insight. Yes, that L0 cache is not something typical for a CPU, is it? Krait seems to be built as a sort of Cortex-A9+, which makes it perfect for smartphones IMO: powerful yet not too power hungry, at the expense of some latency?

We don't know yet, but Cortex-A15s inside a small form factor look to be a poor fit, especially without big.LITTLE.

@AMS: My view is Tegra 4 will be a disaster, just a gut feeling.
If you think Tegra 3 had modest sales and average performance in comparison to the competition...

Tegra 4 has already missed out on the HTC One... maybe the Nexus 7; Asus remains to be seen. It won't get a look in any Samsung device, or Apple (obviously), probably not Motorola (Intel), or future HTC (Qualcomm?), so that leaves Windows RT... or Asus as the only potential high-selling devices.

Power consumption has serious question marks (Tegra 4), so I wouldn't bet on good sales. I would also bet on a poor markup on chips just to get them into devices.

Just my take.
 
The Tegra business as a whole (including tablets, smartphones, automotive, and embedded) had over $700 million in revenue in FY2013. The Tegra business as a whole is expected to have similar revenue for FY2014. If that is considered to be disastrous, then so be it, but there is no denying that T4 will have one of the fastest CPU/GPU combinations for this generation of low-power mobile handheld devices.
 
That may be true, but we don't have any evidence yet to show that the T4 GPU sacrifices perf/W relative to the competition at 28nm. T4's GPU is also an order of magnitude smaller in die area compared to many of the other mobile GPUs. And using a smaller GPU die size at a higher GPU clock operating frequency does not always result in worse perf/W compared to a completely different GPU architecture using a larger GPU die size at a lower GPU clock operating frequency (such as GK104 vs. Tahiti).

Several techniques trade perf/mm^2 for perf/W and vice-versa. Running at a higher clock speed vs more units at a lower clock speed is one of them.

To truly have an order of magnitude better perf/mm^2 and similar perf/W on the same or similar process would suggest a mind-bogglingly superior GPU design, and there's no way I think nVidia attained this with a fairly modest respin of the same old ULP GeForce. So I'm going to tend to think either their area advantage isn't as overwhelming as they say it is, or they're not on the same level with power efficiency.
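
As a toy illustration of that clocks-versus-units trade (completely made-up constants, using the same C*f*V^2 approximation for dynamic power): the same throughput from twice the units at half the clock and a lower voltage costs more area but less power.

```python
# Toy comparison: one shader block at high clock vs. two blocks at half clock.
# Illustrative constants only; assumes dynamic power ~ C * f * V^2 and that the
# lower clock allows a lower voltage.

def gpu_config(units, freq_mhz, volt, area_per_unit_mm2=2.0, c=0.25):
    throughput = units * freq_mhz                 # arbitrary "work per second" units
    power_mw = units * c * freq_mhz * volt ** 2   # crude dynamic power, mW
    area_mm2 = units * area_per_unit_mm2
    return throughput, power_mw, area_mm2

for name, cfg in [("1 unit @ 672MHz", gpu_config(1, 672, 1.1)),
                  ("2 units @ 336MHz", gpu_config(2, 336, 0.9))]:
    thr, p, a = cfg
    print("%-18s throughput=%d  power=%.0fmW  area=%.1fmm^2  perf/W=%.2f  perf/mm^2=%.0f"
          % (name, thr, p, a, thr / p, thr / a))
```

Same throughput either way, but the wide-and-slow configuration wins perf/W while losing perf/mm^2, which is the trade I mean.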

As far as I know, all of T4's competitors will be using a unified shader architecture with full FP32 precision; is that not correct?

At least IMG is also capable of true FP16 SIMD, AFAIK even with Rogue. That is what I'm referring to. Not sure what Mali-T6xx can do (Mali-400 had 16-bit pixel shader ALUs), and am pretty sure Adreno is pure FP32.

The T4 GPU does have early-Z processing, so Z, color, and texture work for hidden pixels is discarded; it would be hard to say exactly how that implementation compares to the various tiling architectures with respect to power efficiency.

I was taking that into consideration and IMO a good tiling implementation still gives you an advantage. Color cache isn't that great at reducing render-target bandwidth requirements for read-modify-writes (alpha blends) of fragments submitted in a typical order.

Hurry up and go to sleep while switching over to a very low power state can certainly result in competitive perf/W in some cases. For instance, if a very fast CPU "A" can complete a task in 10 seconds while consuming an average of, say, 2000mW of power and then go to sleep while switching over to a very low power state or core using, say, 500mW for the next 10 seconds, the average power consumed would be essentially identical to a more power-efficient but slower CPU "B" that completes the task in 20 seconds while consuming an average of 1250mW, even with the screen powered on in both cases. The more efficient the battery-saver core and the more time spent in sleep mode relative to performance mode, the more the equation tilts in CPU A's favor too.

But this doesn't make any sense. Starting with the same CPU at different clock speeds, unless the task is latency sensitive it definitely makes more sense to spend more time completing it, because the power cost of performance is VERY superlinear. Regardless of processor. Power consumption is roughly linear with respect to frequency and roughly quadratic with respect to voltage. You need to increase voltage to increase frequency, very roughly linearly, so power over a wide MHz range will look most like a cubic curve, and perf/W falls off accordingly.
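
Spelled out as a back-of-the-envelope model (idealized scaling relations, not measured numbers):

```latex
P_{\text{dyn}} \approx C f V^2,\quad V \propto f \;\Rightarrow\; P \propto f^3,\qquad
E_{\text{task}} = P \cdot t \propto \frac{f^3}{f} = f^2,\qquad
\frac{\text{perf}}{W} \propto \frac{f}{P} \propto \frac{1}{f^2}.
```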

Then the other factor is perf/MHz of two different CPUs. I'm sorry, but Cortex-A15 is definitely not breaking new ground on perf/W; it's clearly giving some up for peak perf, and that's why ARM is pushing Cortex-A7s to supplement it.

Hurry up and go to sleep is just not going to give you an example like the one you highlighted. I'm not going to bother talking about power efficiency while truly sleeping, because idle power consumption of mobile CPUs is so low that it's far from the primary concern for efficiency anymore.

This is easier said than done though. When Anand was trying to isolate CPU power and GPU power, he wasn't even sure what else was being powered on a given voltage rail.

All the more reason to try to characterize perf/W curves with several input points, so that what is likely a constant factor can be removed. Although from experience I can tell you SoCs tend to have separate rails for peripherals (often coming off LDOs instead of switchers) and not load them onto their big primary CPU, GPU, DSP, etc. rails.
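
As a sketch of what removing that constant factor could look like (the rail measurements below are hypothetical, and the cubic-in-frequency CPU term is just the rough model from above):

```python
# Least-squares fit of hypothetical rail measurements to P_total = P_baseline + P_cpu(f),
# so the unknown constant (everything else hanging off the rail) drops out.
# The numbers are made up for illustration only.
import numpy as np

freq_mhz = np.array([500, 1000, 1500, 2000], dtype=float)      # fixed DVFS points
p_total_mw = np.array([430, 610, 1080, 1990], dtype=float)     # "measured" at each point

# Model CPU power as a + b*(f/1000)^3: a crude cubic-in-frequency dynamic term.
A = np.column_stack([np.ones_like(freq_mhz), (freq_mhz / 1000.0) ** 3])
(baseline_mw, coeff), *_ = np.linalg.lstsq(A, p_total_mw, rcond=None)

cpu_only_mw = p_total_mw - baseline_mw
for f, p in zip(freq_mhz, cpu_only_mw):
    print("%4.0f MHz: ~%4.0f mW attributable to the CPU" % (f, p))
print("estimated constant baseline: ~%.0f mW" % baseline_mw)
```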
 
@Exophase: thanks for your insight. Yes, that L0 cache is not something typical for a CPU, is it? Krait seems to be built as a sort of Cortex-A9+, which makes it perfect for smartphones IMO: powerful yet not too power hungry, at the expense of some latency?

I'm not at all willing to call it a Cortex-A9+; I don't want to encourage the idea that Qualcomm in any way utilizes ARM's CPU designs. Not that I know very much about Scorpion either, but I'm willing to bet you could notice some level of design evolution from it to Krait.

so that leaves Windows RT...

Assuming the rumors that the next Surface RT is also going Qualcomm are false.
 
To truly have an order of magnitude better perf/mm^2 and similar perf/W on the same or similar process would suggest a mind-bogglingly superior GPU design, and there's no way I think nVidia attained this with a fairly modest respin of the same old ULP GeForce. So I'm going to tend to think either their area advantage isn't as overwhelming as they say it is, or they're not on the same level with power efficiency.

I think that the answer is much simpler than it seems. NVIDIA was able to reduce GPU die size relative to the competition in large part not necessarily by sacrificing perf/W, but by sticking with a non-unified shader architecture and FP20 max pixel shader precision, with a lack of full support for more modern APIs. So their GPU die size advantage is largely due to stripping out some features that are not so critical for most mobile applications this generation, rather than cranking up operating frequency to infinity just to save on die size.

At least IMG is also capable of true FP16 SIMD, AFAIK even with Rogue. That is what I'm referring to. Not sure what Mali-T6xx can do (Mali-400 had 16-bit pixel shader ALUs), and am pretty sure Adreno is pure FP32.

That is news to me. In what instances or applications is IMG using partial [FP16] pixel shader precision? Regardless, all Series 5XT and Series 6 IMG GPUs still have hardware support for full [FP32] pixel shader precision, correct?

But this doesn't make any sense. Starting with the same CPU at different clock speeds, unless the task is latency sensitive it definitely makes more sense to spend more time completing it, because the power cost of performance is VERY superlinear.

I wasn't talking about starting with the same CPU and comparing it at different clock speeds. I was comparing one CPU "A" that consumes twice as much power as a different CPU "B" while completing the task in half the time, but has the same overall average power consumption (with the screen on) due to being able to switch to a low power state for half the time. Let me explain again:

Scenario 1 (where CPU "A" completes a task in 10 seconds, and then switches to a low power state for the next 10 seconds, with the screen on):
1-10 seconds time elapsed: Average power consumed = 2000mW (i.e. 1500mW from CPU "A" plus 500mW for the screen).
11-20 seconds time elapsed: Average power consumed = 500mW (for the screen).
1-20 seconds time elapsed: Average power consumed = 1250mW.

Scenario 2 (where CPU "B" completes a task in 20 seconds, with the screen on):
1-10 seconds time elapsed: Average power consumed = 1250mW (i.e. 750mW from CPU "B" plus 500mW for the screen).
11-20 seconds time elapsed: Average power consumed = 1250mW (i.e. 750mW from CPU "B" plus 500mW for the screen).
1-20 seconds time elapsed: Average power consumed = 1250mW.

Performance per watt for CPU "A" in Scenario 1 is identical to performance per watt for CPU "B" in Scenario 2, while performance for CPU "A" in Scenario 1 is twice as high as performance for CPU "B" in Scenario 2. Since both performance and performance per watt matter, Scenario 1 is the preferred scenario in form factors where the higher peak power of CPU "A" is acceptable and manageable.
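
The bookkeeping above can be double-checked mechanically; here's the same arithmetic as a small sketch, using the scenario's own numbers (screen included, as assumed above):

```python
# Quick check of the two scenarios above: average power and total energy over
# the 20-second window, screen included, using the post's own numbers.

SCREEN_MW = 500.0

def scenario(cpu_active_mw, active_s, idle_cpu_mw, idle_s):
    """Return (average power in mW, total energy in mJ) over the whole window."""
    energy = (cpu_active_mw + SCREEN_MW) * active_s + (idle_cpu_mw + SCREEN_MW) * idle_s
    total_s = active_s + idle_s
    return energy / total_s, energy

avg_a, e_a = scenario(cpu_active_mw=1500, active_s=10, idle_cpu_mw=0, idle_s=10)
avg_b, e_b = scenario(cpu_active_mw=750, active_s=20, idle_cpu_mw=0, idle_s=0)
print("CPU A: avg %.0f mW, %.0f mJ" % (avg_a, e_a))   # 1250 mW, 25000 mJ
print("CPU B: avg %.0f mW, %.0f mJ" % (avg_b, e_b))   # 1250 mW, 25000 mJ
```

Average power and total energy come out identical for both, which is the point: given these particular numbers, CPU A buys twice the speed at the same energy.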
 
If Tegra 4 were even remotely competitive, it wouldn't have been overlooked for all the big design wins.

If it's got such an area advantage then it should be cheaper. If Nvidia still can't get good design wins with a smaller, cheaper, faster SoC, then what does that leave except a likely severe power issue that doesn't have a fix in a reasonable timeframe?

The alternative is that Nvidia has a faster, cheaper, smaller and power-competitive SoC that nobody wants...for no good reason?

You can argue that Nvidia has held out for a better price, but ultimately that is suicidal at this stage of their development. They needed to be in the Nexus 7 this time around regardless of price.
 
I think that the answer is much simpler than it seems. NVIDIA was able to reduce GPU die size relative to the competition in large part not necessarily by sacrificing perf/W, but by sticking with a non-unified shader architecture and FP20 max pixel shader precision, with a lack of full support for more modern APIs. So their GPU die size advantage is largely due to stripping out some features that are not so critical for most mobile applications this generation, rather than cranking up operating frequency to infinity just to save on die size.

I can see FP20 fragment shading giving some area benefit, but I can't see it giving anything like an order of magnitude like you said. The other side of this is that non-unified shaders lower perf/mm^2 when the fragment side bottlenecks the vertex side or vice versa. And from where I stand the vertex shading part of Tegra 4 looks grossly over-specified; I doubt it will very often come anywhere close to full utilization.

That is news to me. In what instances or applications is IMG using partial [FP16] pixel shader precision? Regardless, all Series 5XT and Series 6 IMG GPUs still have hardware support for full [FP32] pixel shader precision, correct?

If you use mediump in a GLSL fragment shader it's FP16. In Series 5/5XT lowp means 8-bit (or maybe 10-bit) fixed point. highp is FP32.

I wasn't talking about starting with the same CPU and comparing it at different clock speeds. I was comparing one CPU "A" that consumes twice as much power as a different CPU "B" while completing the task in half the time, but has the same overall average power consumption (with the screen on) due to being able to switch to a low power state for half the time.

Yes, and I told you that Cortex-A15 has worse perf/W than its competitors (at least at higher clocks, but probably throughout much of the curve) when on a similar process, so there's no way it gets the same tasks done faster AND with equal or less energy consumed. Which is the reality-defying logic that your scenario needs in order to work.
 
Why would the T4 GPU not have competitive perf/W on average compared to other mobile GPUs when it is fabricated at 28nm and has the power consumption benefit of using FP20 pixel shader precision? And why would the T4 CPU not have competitive perf/W on average when it can complete tasks more quickly than slower CPUs and then go to sleep while switching over to the low-power battery-saver core? What reviewers really need is some way to effectively measure both performance and power consumption for individual applications (including games). I'm not sure how easy it would be to measure power consumption short of hooking up a power meter directly to the battery terminals (assuming that one can even gain easy access to the battery terminals in the first place).

Because you're assuming it's ahead, and outside of Rogue, that's why.

The expectation is that any loss in T3 tablet sales will be offset by growth in T4 tablet/clamshell sales and T4i smartphone sales, so overall Tegra revenue in FY2014 will probably be flat compared to FY2013.

That doesn't change the assumption one bit that overall T4 sales volumes will be boring rather than worth mentioning, which was more or less the original point. One step further: they'll probably lose more than before, since expenses aren't in any way decreasing.

I think that the answer is much simpler than it seems. NVIDIA was able to reduce GPU die size relative to the competition in large part not necessarily by sacrificing perf/W, but by sticking with a non-unified shader architecture and FP20 max pixel shader precision, with a lack of full support for more modern APIs. So their GPU die size advantage is largely due to stripping out some features that are not so critical for most mobile applications this generation, rather than cranking up operating frequency to infinity just to save on die size.

Shall we take NV's T4 marketing material with that particular slide again and re-analyze exactly how many "mistakes" it contains about pretty much every competing solution, how many lies and misconceptions?

If there were such a "great" advantage after all, NV itself wouldn't bury it 10 feet under the ground when presenting any detailed Logan/Kayla material.

And that stripping out "some features" for this generation is absolute nonsense too; Tegra ULP GeForces have been under-delivering in terms of functionality since day one, and T4 is just a natural extension of T1 with an N-fold increase in unit counts and a few functional tidbits added to the mix. Now from that "unnecessary" bullshit level we move within one year to SM35, leaving everyone tech-savvy scratching their head over what the heck changed in such a short time frame for such a jump to occur. One way to view it is that Logan is miles ahead; the more realistic one is that anything up to T4 was/is just miles behind the competition in terms of functionality, and NV is correcting the latter in one single strike with Logan.

Last but not least that last sentence above is quite comical too in the grander scheme of things:

ULP GF T2 = 333MHz
ULP GF T3 = 520MHz
ULP GF T4 = 672MHz

...if I now compare each of them against competing solutions, the frequencies here are in just about all cases quite a bit higher than any competing solution's. Since NV loves to put it up against the iPad 4, for example, let me refresh your memory that the GPU there runs at only 280MHz.
 
I can see FP20 fragment shading giving some area benefit, but I can't see it giving anything like an order of magnitude like you said.

The smaller die size of the T4 GPU relative to its competition is certainly a combination of both stripped-out forward-looking features and higher GPU clock operating frequency. But the GPU clock operating frequency is not out of this world either. For instance, T4 has a GPU clock operating frequency that is "only" 26% higher than the 5XT GPU used in the SGS4 international variant.

Yes, and I told you that Cortex-A15 has worse perf/W than its competitors (at least at higher clocks, but probably throughout much of the curve) when on a similar process, so there's no way it gets the same tasks done faster AND with equal or less energy consumed. Which is the reality-defying logic that your scenario needs in order to work.

That may be true, but we have yet to see any actual perf/W data on quad-core Krait and quad-core Cortex A15 at 28nm, and there are some applications where the performance of Cortex A15 is way ahead of Krait 300 by a factor of 2x or more: http://www.hardwareluxx.de/images/stories/newsbilder/aschilling/2013/mwc/tegra4-press-3.jpg
 
The smaller die size of the T4 GPU relative to its competition is certainly a combination of both stripped-out forward-looking features and higher GPU clock operating frequency. But the GPU clock operating frequency is not out of this world either. For instance, I think that T4 has a GPU clock operating frequency that is "only" 26% higher than the 5XT GPU used in the SGS4 international variant.

So what do you think iPad 4 performance would look like if its GPU block were clocked at 480MHz, just like the S4 smartphone SoC?

That may be true, but we have yet to see any actual perf/W data on quad-core Krait 300/400 and quad-core Cortex A15 at 28nm, and there are some applications where Cortex A15 is way ahead of Krait by a factor of more than 2x.

When it comes to perf/W, or better yet perf/mW, I'd rather have a Krait vs. big.LITTLE A15/A7 comparison than Krait vs. T4. There's a reason, after all, why T4i contains a quad A9.

***edit: as for the added NV slide, nothing more to say than :LOL:
 
Right, because the scale doesn't start at 0.5 or 0.8 like the AMD marketing graphs?

In Sunspider, Google Octane, Kraken, and Vellamo Metal, Anandtech has already confirmed that quad-core S600 is ~ 30% faster than quad-core S4 Pro. Quad-core Cortex A15 will be way faster than quad-core Krait 300 in a wide variety of different applications.
 