NVIDIA Tegra Architecture

Last but not least: did I misread something, or did Jensen really claim that the X1 has the performance of the Xbox One console? I call bullshit, no matter how much tar and feathers are headed my way :oops:
It has equal fill rate, but only 25.6 GB/s of BW (which will scale up to ~34 GB/s once LPDDR4 matures). That's not even half of the Xbox One's main memory BW, and there's no equivalent to its 32 MB ESRAM. The raw FP32 rate is also under half of the console's (the FP16 rate is closer to the Xbox One's FP32 rate, but not there yet). A two SMX version could be competitive against the consoles if (and only if) they solved the bandwidth issue.

They need to roughly 4x the memory controller width and wait for 2133 MHz LPDDR4; that would get them to 136.5 GB/s. Maxwell is more BW-friendly than GCN (it has delta color compression and bigger caches), so this should be enough. I actually don't even know the cache size of the Tegra Maxwell designs (2 MB of L2 sounds a little high for such a small GPU, but not if you compare it to Intel designs).
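To put rough numbers on that (a back-of-the-envelope sketch; the 256-bit figure is just my reading of "4x the controller width", not anything NVIDIA has announced):

```python
# Back-of-the-envelope LPDDR bandwidth: bus width (bits) x data rate (MT/s).
def peak_bw_gbps(bus_bits, mtps):
    return bus_bits / 8 * mtps / 1000  # bytes per transfer x transfers/s -> GB/s

print(peak_bw_gbps(64, 3200))   # X1 as announced (64-bit LPDDR4-3200): 25.6 GB/s
print(peak_bw_gbps(64, 4266))   # matured LPDDR4 at 2133 MHz:          ~34.1 GB/s
print(peak_bw_gbps(256, 4266))  # hypothetical 4x wider controller:    ~136.5 GB/s
```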
 
You are both missing two vital things: it's not only that the double FP16 rate comes with conditions attached, but most rates, including fill rate, are subject to the frequencies of final devices. The marketing material was again optimistic: it quoted 950 MHz for GK20A/K1, yet peak frequency ended up at 850 MHz in real devices.
Yes, at 1 GHz, with ops that can all use FP16, and with much higher bandwidth, they'd be there; but that's three conditionals already.
 
Oh oh, let's do this and really confuse everyone:

- 2x Denver
- 2x Cortex-A57
- 4x Cortex-A53

+ GPU cores
+ x86 cores
+ x64 cores

:) When I see an architecture like that I immediately think of multi-kernel research OSes :) ... Maybe it's the perfect time for those research OSes to surface.

 
But what does hosting disparate host CPU ISAs actually give you in this instance?
It's combining functionally equivalent cores with deeply similar architectural goals whose primary source of synergy is implementing them in dangerously incompatible ways.
If you want backwards compatibility, you don't want your code wandering onto a CPU that it wasn't developed for.
The apparent cross-node messaging system sounds like it negates a fair portion of the benefits of physically integrating them, on top of the massive validation space for that many architectures.
 
But what does hosting disparate host CPU ISAs actually give you in this instance?
It's combining functionally equivalent cores with deeply similar architectural goals whose primary source of synergy is implementing them in dangerously incompatible ways.
If you want backwards compatibility, you don't want your code wandering onto a CPU that it wasn't developed for.
The apparent cross-node messaging system sounds like it negates a fair portion of the benefits of physically integrating them, on top of the massive validation space for that many architectures.

One argument is that certain apps could run in a low-power mode and instantly switch to a more power-hungry one, e.g. PowerPoint in slideshow mode vs. edit mode, or Word in tablet mode versus keyboard-connected edit mode.

Another argument is as a way to fill the "app gap": some believe that MS will soon launch a solution where Android apps can run on Windows devices (either streamed from a server or virtualized, who knows).

Just as the saying goes, "the right tool for the job", I guess you can extend that thinking to "the right architecture for the job".

P.S. There's also the whole container-based app trend, e.g. Docker, or the Windows Containers MS will soon launch in Windows 10. We already package apps together with the OS pieces and services the app needs; the next logical step is to also include the architecture that the containerized app works best on. It's a virtualized, container-based future.
 
Individually the higher clocked A57s can probably outpace the Jaguars in normal situations, but I don't think that plus the 4 Cortex-A53s is enough to match the 8 cores. That, and I think there will be few if any actual devices that will allow all four A57s to be run at peak clock simultaneously, at least for any meaningful length of time.
I wonder how difficult it would be to switch the 4x A53 cluster to another 4x A57 cluster. According to the marketing material, the two clusters are already fully cache coherent and able to run simultaneously, so it wouldn't be a big stretch to swap in beefier cores. With a bigger TDP ceiling and active cooling (a home console), this setup would be able to compete with 8 Jaguar cores, and it shouldn't take that much extra die space either. Obviously this setup would be completely stupid for mobile devices, but it could be useful for laptops (Chromebooks, etc.) and consoles.
 
If I was NVIDIA, I would do the following:
- 16FF process
- 2 x Denver (2.5GHz+ with revisions)
- 4 x A53 (2GHz+ - optimised for speed)
- 4 x A53 (1.2GHz+ - optimised for power)
- All cores can be used at the same time...

Why not 4 x Denver and 4 x A53 (but using HMP)? Aside from the fact that a three-cluster design may be overly complicated, this seems like a far better solution to both increase peak performance and reduce idle/non-peak power consumption.
Erineyes: Their power/perf claims are based on SPECint measurements. I've avoided using that for power measurement so as not to anger the people who provide us the suite, but I can tell you it's basically a use case with just one core under any load, furthering my suspicion that most of the gains are unrelated to the CM scheme and we're really just talking about process here.

Samsung's 20nm seems about as efficient as 28HPM when strictly talking about voltages.

Ahh, makes a bit more sense now. If it's just one loaded core then yes, that would basically come down to the process and libraries as you said. Still quite a large difference though (assuming we take Nvidia's claim of "40% performance increase at the same power" at face value).

Voltages don't tell us the full story, but yes, from all accounts TSMC does seem to have the better process. We also have to take process maturity into account though: Samsung's SoC was in a shipping product in September/October 2014 and TX1 will not ship until Q2, so that's a gap of at least six months. At this early stage of a process, that can mean a big difference. The difference between their FinFET processes should also be interesting to see. Anyway, I await your article, as that should give us a good comparison between the SoCs.
It may not be the Holy Grail but it's Qualcomm's main competitive advantage, and they do dominate the smartphone market. They do pretty well in tablets too. They also have Krait and Adreno, but neither seems particularly better than Cortex or Mali/PowerVR respectively. The market performance of their Cortex-powered S810 should bring us more controlled information about the competitive value of their modems, unless Adreno 430 turns out to be spectacular.

I'd argue that Qualcomm's success is also down to the quality of their integrated modems, aside from the fact that they make quality SoCs overall. Qualcomm usually has the best modems in the industry, especially when it comes to LTE, and you can see that even Apple switched to Qualcomm modems from (AFAIK) the iPhone 5 onwards. Another factor is that Qualcomm also offers the complete RF/analog solution and bundles it with their SoCs/modems. I've also read, though I cannot verify this, that certification and validation tend to be easier with Qualcomm. Anyway, I have no doubt that the Snapdragon 810 will be as successful as the 800 was.
As for Mediatek, I don't know what they're working on exactly for 2015, but I imagine they must have some sort of 4+4 A57/A53 setup with decent graphics. I'm not trying to say that they compete with NVIDIA on graphics performance.

Maybe they do, but they haven't announced anything apart from octa-core A53 parts so far, one with a Mali-T760 MP2 and one with a PowerVR G6200, so very mediocre graphics.
Rather, I'm arguing that there's really not much you can do on a Tegra device that you can't do on a (cheaper) Mediatek one. Whatever it is that you can do on Tegra and not on an MT chip, I doubt it's enough for Tegra to be viable as a tablet product. Since (from what I've read) JHH spent far more time talking about cars than tablets when presenting Erista, he just might agree with me.

This I agree with. Similar to what we saw in the PC/laptop space, we've reached an era where even low-cost devices (with Cortex-A7 onwards, at least) can do pretty much everything a high-end device can, and they do it well enough. With Cortex-A53, low-end performance gets even better.
Eh, it is not that far off. The CPU is probably as good or better, while the GPU is roughly half the performance (or much closer if using FP16). Bandwidth is the only real killer, but if one is willing to accept 720p instead of 1080p, I imagine Xbox One games could run just fine on it with minimal adjustment. It is certainly much better than the Wii U, at any rate...

I doubt the CPU is even close to as good; I think 8 Jaguar cores would outpace 4x A57 + 4x A53 quite handily. And graphics-wise, the X1 only achieves 512 FP32 GFLOPS at 1 GHz, whereas the Xbox One GPU has ~1.31 TFLOPS. Pixel fill rate is similar, but the Xbox has much higher texture fill rate and memory bandwidth, not even counting the ESRAM. Still, for an SoC designed for the mobile segment, and one that will be out ~1.5 years after the Xbox, the performance is remarkable.
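For reference, those peak figures fall straight out of ALU count and clock, counting an FMA as two ops (a quick sketch; the Xbox One line uses its 768 ALUs at 853 MHz):

```python
# Peak throughput = ALUs x 2 ops (FMA) x clock in GHz.
def peak_gflops(alus, ghz, ops_per_alu=2):
    return alus * ops_per_alu * ghz

print(peak_gflops(256, 1.0))       # Tegra X1: 2 SMM x 128 ALUs at 1 GHz ->  512 GFLOPS FP32
print(peak_gflops(256, 1.0) * 2)   # with the 2x FP16 path               -> 1024 GFLOPS FP16
print(peak_gflops(768, 0.853))     # Xbox One GPU (768 ALUs, 853 MHz)    -> ~1310 GFLOPS FP32
```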
I wonder how difficult it would be to switch the 4x A53 cluster to another 4x A57 cluster. According to the marketing material, the two clusters are already fully cache coherent and able to run simultaneously, so it wouldn't be a big stretch to swap in beefier cores. With a bigger TDP ceiling and active cooling (a home console), this setup would be able to compete with 8 Jaguar cores, and it shouldn't take that much extra die space either. Obviously this setup would be completely stupid for mobile devices, but it could be useful for laptops (Chromebooks, etc.) and consoles.

I don't think it would be too difficult, but what would be the point? You barely have any use cases where more than 2 cores are used, so what's the point of having 8? And the whole selling point of big.LITTLE is lower power consumption; a second A57 cluster, even one optimised for lower clocks, wouldn't be as efficient as an A53 cluster. Regarding die space: AFAIK one A53 takes about 0.8 mm2 and one A57 a little less than 3 mm2, so switching 4 A53s for A57s would mean an increase of about 8 mm2, which is significant.
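The area delta works out roughly as follows (a sketch using the per-core figures quoted above; a real design would also need more L2 and interconnect):

```python
# Rough area cost of swapping the little cluster for a second big one.
a53_mm2, a57_mm2 = 0.8, 2.9      # approximate per-core sizes quoted above (20nm-class)
print(4 * (a57_mm2 - a53_mm2))   # ~8.4 mm^2 extra, before any additional L2/interconnect
```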
 
Voltages don't tell us the full story, but yes, from all accounts TSMC does seem to have the better process.
Yea, I know. I actually devised a method to reverse-engineer capacitance coefficient values for the chips, which I hope I will be able to apply to more upcoming SoCs. That should bring a new perspective onto things.

Anyway, I have no doubt that the Snapdragon 810 will be as successful as the 800 was.
There's been a lot of fanfare going on about that in the last few weeks and it doesn't seem to be unsubstantiated from what I've heard either... The Exynos 7420 is really the chip to watch out for in 1H2015 in my opinion.
http://blogs.barrons.com/asiastocks...lcomms-snapdragon-delay/?mod=google_news_blog
 
I don't think it would be too difficult, but what would be the point? You barely have any use cases where more than 2 cores are used, so what's the point of having 8? And the whole selling point of big.LITTLE is lower power consumption; a second A57 cluster...
There's no point at all on mobile devices. However, an 8-core A57 setup clocked high enough, with a bigger (2-4 SMX) Maxwell GPU, would be a good starting point for an ARM-based gaming console. It would offer comparable performance to current-generation consoles (with a slightly lower power draw). Obviously this would not be enough for true next-generation (gen 9) consoles.
 
4 SMM would be 512 cores doing 1 TFLOP FP32 at 1GHz. It's basically the same as a GTX 750 desktop card.
Since it lags against a Bonaire in the desktop world, I don't think it'd compete with the xbone, much less the PS4.
Unless that 2x FP16 performance trick did wonders and the GPU was clocked closer to 1.5GHz.

Then again, who would sell such a console with that custom chip and more importantly who would write games for it?
 
How can the 2xFP16 do any wonders?

Let's do a theoretical exercise:

Apple A8X GPU @ >=500 MHz
~256 GFLOPs FP32 or ~512 GFLOPs FP16
Manhattan offscreen 33.0 fps

NVIDIA GK20A (Tegra K1) GPU @ ~800 MHz
~307 GFLOPs FP32
Manhattan offscreen 32.0 fps

1. Manhattan is quite ALU bound.
2. A8X GPU has other higher resources too apart from just FP16 ALUs.
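Framing the same numbers as throughput spent per frame makes the point plainer (all values are the estimates above, not confirmed clocks):

```python
# GFLOPS available per rendered frame in Manhattan offscreen, from the estimates above.
estimates = {
    "A8X   (~512 GFLOPS FP16)": (512, 33.0),
    "GK20A (~307 GFLOPS FP32)": (307, 32.0),
}
for name, (gf, fps) in estimates.items():
    print(f"{name}: {gf / fps:.1f} GFLOPS per frame")
# ~15.5 vs ~9.6 GFLOPS/frame: the A8X's extra FP16 throughput clearly isn't
# converting into a proportionally higher Manhattan score.
```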
 
That's not a theoretical exercise, it's a practical one ;)

Regardless, the Manhattan test may be ALU bound in most cases, but you don't know what kind of variables they're using in how many operations.
The Tegra X1 @1GHz gets 65FPS (twice the performance) in Manhattan while having over 3X the theoretical FP16 performance of Tegra K1 (1TFLOPs).

If you remember nVidia's first statements about the first Maxwell GM107, they said that 1 SMM would have around the same practical ALU performance as 90% of 1 SMX at the same clockspeeds.
Therefore, the performance jump from TK1 to TX1 in an ALU-bound test should go like 32 fps (TK1) x 2 SMMs x 0.9 = 57.6 fps, and then the TX1 is clocked 25% higher, so 57.6 x 1.25 = 72 fps.
So even if we were looking at pure FP32 performance, the TX1 should have shown a bigger performance jump over TK1 - if offscreen Manhattan were ALU bound.
Conclusion: for this newest batch of high-end mobile GPUs, it doesn't seem that Manhattan is ALU bound anymore, and the added FP16 performance isn't doing much for GPUs doing over 256GFLOPs.
Perhaps the test is now limited by memory bandwidth and fillrate. The TX1 did in fact double the memory bandwidth compared to TK1.
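The estimate in numbers (the only inputs are TK1's ~32 fps, NVIDIA's 90%-of-SMX claim, and the assumed clocks):

```python
# Expected TX1 Manhattan score if the test were purely ALU bound.
tk1_fps     = 32.0    # Tegra K1 (1 SMX), Manhattan offscreen
smm_vs_smx  = 0.9     # NVIDIA's claim: 1 SMM ~ 90% of 1 SMX per clock
smm_count   = 2       # TX1 has 2 SMMs
clock_scale = 1.25    # ~1 GHz vs ~800 MHz

expected = tk1_fps * smm_count * smm_vs_smx * clock_scale
print(expected)       # 72.0 fps expected vs 65 fps measured -> not purely ALU bound
```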
 
4 SMM would be 512 cores doing 1 TFLOP FP32 at 1GHz. It's basically the same as a GTX 750 desktop card.
Since it lags against a Bonaire in the desktop world, I don't think it'd compete with the xbone, much less the PS4.
Unless that 2x FP16 performance trick did wonders and the GPU was clocked closer to 1.5GHz.
Assuming you had a 50/50 mix of FP16/FP32 code, a 4 SMM part would be quite comparable in raw flops (equivalent to ~1.5 TFLOPS at 1 GHz). A 50/50 split would be a realistic assumption if we are talking about console games tailor-made for the hardware (obviously not for generic PC software). Fill rate is also already comparable in the 2 SMM model.
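Putting a number on that mix (a sketch; it counts each FP16 issue slot as two FLOPs, which is the optimistic way to read a 50/50 split):

```python
# Effective FLOPS of a hypothetical 4-SMM part under a mixed FP16/FP32 workload.
fp32_tflops = 512 * 2 * 1.0 / 1000           # 512 ALUs x FMA x 1 GHz -> ~1.02 TFLOPS FP32
fp16_share  = 0.5                            # fraction of issue cycles running 2-wide FP16
effective   = fp32_tflops * ((1 - fp16_share) + 2 * fp16_share)
print(effective)                             # ~1.54 TFLOPS "equivalent", i.e. the ~1.5 above
```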

But the biggest problem would be the bandwidth. 25.6 GB/s is less than both previous-generation consoles had (PS3 had GDDR + RDRAM and Xbox 360 had EDRAM + GDDR). Obviously new improved tech such as depth compression, delta color compression, early depth rejection, improved HiZ, a big L2 cache, improved cache logic, and loading compressed data into the caches (instead of uncompressing it and wasting 4x the cache space) all bring relatively big effective BW gains for Maxwell. I would expect the current (2 SMM) version to actually slightly beat both last-gen consoles in BW-heavy scenarios. In all other cases the X1 would be quite a bit faster than the last-generation consoles, but the limited BW would likely make it perform generally much closer to last gen than current gen. And I am mainly talking about tablets and laptops (Chromebooks, etc.) here; phones wouldn't have enough TDP to sustain maximum GPU and CPU clocks for a long time.

But this is still good news. Last gen console performance is finally available in your pocket. Too bad we don't have The Last of Us and GTA 5 available on Android :(
 
That's not a theoretical exercise, it's a practical one ;)

Regardless, the Manhattan test may be ALU bound in most cases, but you don't know what kind of variables they're using in how many operations.
The Tegra X1 @1GHz gets 65FPS (twice the performance) in Manhattan while having over 3X the theoretical FP16 performance of Tegra K1 (1TFLOPs).

Nope look above; it's 1 TFLOP FP16 for the X1 GPU vs. 512 GFLOPs FP16 for the A8X GPU; twice the rate and almost twice the performance.

If you remember nVidia's first statements about the first Maxwell GM107, they said that 1 SMM would have around the same practical ALU performance as 90% of 1 SMX at the same clockspeeds.
Therefore, the performance jump from TK1 to TX1 in an ALU-bound test should go like 32 fps (TK1) x 2 SMMs x 0.9 = 57.6 fps, and then the TX1 is clocked 25% higher, so 57.6 x 1.25 = 72 fps.
So even if we were looking at pure FP32 performance, the TX1 should have shown a bigger performance jump over TK1 - if offscreen Manhattan were ALU bound.
Conclusion: for this newest batch of high-end mobile GPUs, it doesn't seem that Manhattan is ALU bound anymore, and the added FP16 performance isn't doing much for GPUs doing over 256GFLOPs.
Perhaps the test is now limited by memory bandwidth and fillrate. The TX1 did in fact double the memory bandwidth compared to TK1.

FP16 isn't doing much in terms of performance on Rogues either. It's a power saving initiative mostly.

* This was an early presentation for the paper launch.
* As time goes by they have time to further fine-tune drivers and squeeze out even more performance. They're at 65 fps now; why on God's green earth should 72 or even more fps with better-optimised drivers be a problem? It's just a 10% difference.
* With the desired performance level reached, they don't need to clock at 1 GHz if that should pose problems with power consumption, or else history repeats itself as with GK20A.
 
I am not sure how much everyone knows about what is happening in the AI world, but I strongly feel that the predominant reason for FP16 is to run convolutional neural networks very, very fast. Once you train a convnet you don't need FP32; in fact, you could probably train a convnet with FP16 too.

ConvNets will be everywhere (and in fact already are), but you will see them being used more and more, and FP16 is a huge leg up in performance.
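A minimal sketch of why that works (plain NumPy, nothing NVIDIA-specific; the layer shape is made up): take weights trained in FP32, cast them down to FP16 for inference, and the outputs barely move.

```python
import numpy as np

# Illustrative only: FP16 inference with FP32-trained weights (a single dense layer).
rng = np.random.default_rng(0)
w = (rng.standard_normal((256, 4096)) * 0.02).astype(np.float32)  # "trained" weights
x = rng.standard_normal(4096).astype(np.float32)                  # input activations

y_ref  = w @ x                                                     # FP32 reference
y_half = (w.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)

# Worst-case deviation relative to the output scale stays small (order 1e-3),
# which is plenty for classification-style outputs at inference time.
print(np.max(np.abs(y_half - y_ref)) / np.max(np.abs(y_ref)))
```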
 
Yea, I know. I actually devised a method to reverse-engineer capacitance coefficient values for the chips, which I hope I will be able to apply to more upcoming SoCs. That should bring a new perspective onto things.

Great! Looking forward to seeing that.
There's been a lot of fanfare going on about that in the last few weeks and it doesn't seem to be unsubstantiated from what I've heard either... The Exynos 7420 is really the chip to watch out for in 1H2015 in my opinion.
http://blogs.barrons.com/asiastocks...lcomms-snapdragon-delay/?mod=google_news_blog

You are right; digging deeper, it seems like I will have to retract my statement that the S810 will do as well. I've also heard from a friend at Samsung that even the S810 dev platform is overheating. And apparently the Galaxy S6 will be Exynos only, with no Snapdragon version at all.

PS: After reading that link of yours, I had a question. The analysts say that TSMC will have a revenue shortfall in Q1 due to lower 20nm utilisation as a result of Qualcomm's issues. But presumably these fab contracts are locked up months if not years in advance, so if Qualcomm does not utilise the booked capacity, wouldn't they be liable to pay a sizeable penalty?
There's no point at all on mobile devices. However, an 8-core A57 setup clocked high enough, with a bigger (2-4 SMX) Maxwell GPU, would be a good starting point for an ARM-based gaming console. It would offer comparable performance to current-generation consoles (with a slightly lower power draw). Obviously this would not be enough for true next-generation (gen 9) consoles.

True, it would have been a decent choice for a console, and the lower cost would certainly have been welcome. Maybe if Microsoft and Sony had waited a year or two we would have seen an ARM-based console. But that ship has sailed, and for someone else to make one now, and especially to get devs to develop for it, would be a challenge.
That was fast: Audi to use Tegra X1.

Two days after NVIDIA Tegra X1’s official launch, Audi confirmed today that it will use the new mobile superchip in developing its future automotive self-piloting capabilities.
source: http://blogs.nvidia.com/blog/2015/01/06/audi-tegra-x1/
I think Audi had confirmed it at the launch of the X1 itself. Also, Audi has been collaborating with NV since the Tegra 2 days, so it's hardly a surprise.
 