NVIDIA Tegra Architecture

See the post above; give me one good reason why I should pay as much for such antiquated material. Give me something I wouldn't consider paying for even if the sum were dramatically higher, and then I'll let you know why gaming has never, until now, picked up on any mobile platform.

I was not criticising you. It was merely an observation following my point that Android is a terrible gaming platform apart from casual titles, since there is not enough money to be made there. No one wants to pay full price for a mobile game, be it a port or not.
 
These Volta cores must be hugely larger than the Pascal cores, as for 7B transistors you get 512 for Volta vs 2560 for Pascal.
My gut has me thinking that they will actually spend a lot more silicon on the CPU and caches than on the GPU cores. I suspect 512 cores organized as in GP100 (SMs of 64 cores, hence 8 SMs, with full FP16 support).
I find it interesting that Nvidia speaks of custom CPU cores without any reference to Denver. Knowing Nvidia's standard practices, I suspect the "custom ARM" refers to the SIMD units of the CPU they're going to use.
I suspect Nvidia will go with either the A72 or A73 backed by custom SIMD units (to match the proprietary API, software, etc.). I suspect they will spend lots of silicon on the L2 and L3, and on the GPU register files and caches too.
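A crude back-of-envelope of my own (treating GP104/the GTX 1080 as the Pascal reference) on why the naive per-core numbers look so lopsided, and why I think most of those transistors won't be GPU cores at all:

Code:
# Naive per-"core" transistor budget -- purely illustrative. Xavier's 7B covers
# the whole SoC (CPU, caches, CVA), so the real GPU share is much smaller,
# which is exactly why I'd expect the CPU/cache side to eat most of the silicon.
xavier_transistors, xavier_cores = 7e9, 512
gp104_transistors, gp104_cores = 7.2e9, 2560   # GTX 1080 as the Pascal reference

print(xavier_transistors / xavier_cores / 1e6)   # ~13.7M per core, if it were all GPU
print(gp104_transistors / gp104_cores / 1e6)     # ~2.8M per Pascal CUDA core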
 
These Volta cores must be hugely larger than the Pascal cores, as for 7B transistors you get 512 for Volta vs 2560 for Pascal.

And/Or the GPU proportion of the new SoC simply got smaller, like what happened with Parker. And/Or this particular version of Volta has additional hardware exclusively dedicated to INT8 operations (akin to PowerVR's FP16 units).
If the Denver cores were huge, Denver's successors are probably pretty big on transistor count, too.

News of Google's TPUs may have put a lot of pressure on getting dedicated hardware for neural networks. Repurposing ALUs that were originally made for floating point calculations may just not be competitive enough.


I suspect Nvidia will go with either the A72 or A73 backed by custom SIMD units (to match the proprietary API, software, etc.).
If that were the case, I don't think they would call them "custom cores".
 
Well, if Nvidia sticks to Denver, all the better; the CPU space is growing boring, and diversity in approach keeps geeks entertained :)
 
https://blogs.nvidia.com/blog/2016/09/28/xavier

This is a pretty crazy chip:
  • 7 billion transistors
  • 512 cores
  • 20 Tera ops
  • 16nm FF
  • 20 watts
  • Due end of 2017
NVIDIA claims this one chip packs all the power of a Drive PX 2 computer (2 x Parker SoCs + 2 x discrete Pascal GPUs).

I haven't figured out how this is possible, given that this is still on the 16nm process.

The power is a mystery. The GTX 1080 @ 7B transistors is 180 watts; Xavier is the same number of transistors at 20 watts. I assume the latter uses an LP process, but can that make so much of a difference?

As for perf: there's no sane way to get to 20 TOPS based on the existing arch. It would take 512 cores clocked at 5 GHz + INT8 to get there, but that's obviously absurd. Best guess is the computer vision accelerator has some kind of programmable, low-cost INT8 units that boost performance.
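For what it's worth, the back-of-envelope behind that 5 GHz figure looks like this (my own sketch; it assumes an FMA counts as two ops and 4-way INT8 packing per 32-bit lane):

Code:
# Hypothetical throughput math for 512 cores -- my assumptions, not NVIDIA's.
cores = 512
ops_per_fma = 2        # one multiply + one add
int8_per_lane = 4      # 4x INT8 packed into a 32-bit lane
target_ops = 20e12     # 20 TOPS

clock_needed = target_ops / (cores * ops_per_fma * int8_per_lane)
print(clock_needed / 1e9)                            # ~4.9 GHz -- "512 cores at 5 GHz + INT8"
print(target_ops / (cores * ops_per_fma) / 1e9)      # ~19.5 GHz without INT8 packing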

Thoughts?
 
Tegra is being discussed here:

https://forum.beyond3d.com/posts/1945831/

Just some quick tidbits:

- Number of transistors alone doesn't dictate power consumption. Skylake-Y is probably around 1.5B transistors and it has a 4.5W TDP (quick numbers in the sketch below).
- INT8 operations in Xavier may not be handled entirely by the GPU's "CUDA cores". In fact, there's a good chance they aren't, since the same presentation said the SoC would have a GPU with 512 cores.
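A quick watts-per-transistor comparison to put the first point in numbers (rough figures I'm plugging in myself, nothing from the presentation):

Code:
# Watts per billion transistors -- crude, ignores clocks, voltage and binning.
chips = {
    "GTX 1080 (GP104)": (180.0, 7.2),   # (TDP in W, transistors in billions)
    "Skylake-Y":        (4.5,   1.5),
    "Xavier (claimed)": (20.0,  7.0),
}
for name, (watts, billions) in chips.items():
    print(f"{name}: {watts / billions:.1f} W per billion transistors")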
 
Yeah, the majority of those TOPS likely don't come from the normal shader (or ARM) cores. Do we have any kind of TOPS or watt rating for the Google TPU? (Which obviously lacks the more general cores, but...)
 
Some indirect measurements of Google's TPU, from Google's paper on neural machine translation.
They note that because the workload mix and CPU-GPU transfers are not optimal, the GPU is not performing at its best in these measurements.
 

Attachments: two screenshots from the paper.
And/Or the GPU proportion of the new SoC simply got smaller, like what happened with Parker. And/Or this particular version of Volta has additional hardware exclusively dedicated to INT8 operations (akin to PowerVR's FP16 units).

OT, but I could imagine that Series7XT Plus (https://imgtec.com/blog/powervr-series7xt-plus-gpus-advanced-graphics-computer-vision/) has additional dedicated INT logic. Everything before it was capable only of INT32; the Plus IP cores expand that to up to 4x INT8 per INT32. That's why I asked Ryan why he thinks that 20 TOPs in a 20W power envelope would be impossible. The pipelines would just need to be wide enough to reach a high enough throughput, and yes, I'd also consider it possible that other blocks of the SoC, like the CVA mentioned above, might contribute to those 20 TOPs.

Either way, and even apart from the INT pipeline, I'd expect Volta ALUs to be significantly wider than what we've seen so far in green architectures.
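To put some hypothetical numbers (mine, not IMG's or NVIDIA's) on the "wide enough pipelines" point: at mobile-friendly clocks you'd need a few thousand INT8 MAC lanes, which is a lot of ALUs but not an outlandish amount of 16FF silicon.

Code:
# How wide an INT8 datapath has to be to reach 20 TOPS at a given clock.
# Assumes 1 MAC = 2 ops and 4x INT8 per 32-bit lane (Series7XT Plus style).
target_ops = 20e12
for clock_ghz in (1.0, 1.5, 2.0):
    int8_macs_per_clock = target_ops / 2 / (clock_ghz * 1e9)
    int32_lanes = int8_macs_per_clock / 4
    print(f"{clock_ghz} GHz: {int8_macs_per_clock:,.0f} INT8 MACs/clk, "
          f"{int32_lanes:,.0f} 32-bit lanes")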
 
The Xavier SoC manages 20 trillion operations per second, while only using 20 watts of power.

Because it’s used in cars, Xavier was designed to meet the ISO 26262 functional safety spec, which is an international standard that sets expectations for electronics used in cars designed for road use. The SoC uses a 16nm manufacturing process, and just one can replace Nvidia’s current DRIVE PX 2 in-car computer, including a configuration of said component that includes two mobile SoCs and two discrete GPUs, while also using less power.

Xavier is intended for use by carmakers, suppliers, research organizations and startups looking to field and test their own self-driving cars. You won’t see it in any cars in the near future, however — Nvidia says it will start shipping the first samples in the fourth quarter of next year.

Meanwhile, Nvidia also teamed up with TomTom in a partnership that will see the two companies combining Nvidia’s AI platform and TomTom’s mapping data to provide real-time, localized mapping data for use in highway and freeway driving situations. Nvidia also demonstrated its own AI-based self-driving research vehicle, which learned how to drive itself based entirely on observing human driving behavior.


https://techcrunch.com/2016/09/28/nvidias-new-xavier-soc-is-an-ai-supercomputer-for-cars/
 
Yeah, the majority of those TOPS likely don't come from the normal shader (or ARM) cores. Do we have any kind of TOPS or watt rating for the Google TPU? (Which obviously lacks the more general cores, but...)

I would think that the neural net computation is done with a new kind of special function block (one that can optionally be added to a core).
There is a lot of efficiency to be gained compared to doing dot products via registers: the neuron inputs and accumulated values can be kept internal to those units, reducing the amount of data moved and thus the power consumption.
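A toy sketch of that idea (pure illustration on my part, nothing to do with NVIDIA's actual block): count how many operand movements an n-input dot product costs when every partial sum bounces through the register file, versus a MAC unit that keeps the accumulator internal.

Code:
def dot_via_registers(n):
    # per step: read two inputs, read the partial sum back, write it out again
    return n * (2 + 1 + 1)

def dot_with_local_accumulator(n):
    # per step: read two inputs; the accumulator never leaves the unit,
    # and only the final result is written out once at the end
    return n * 2 + 1

n = 1024   # e.g. a neuron with 1024 inputs
print(dot_via_registers(n), dot_with_local_accumulator(n))   # 4096 vs 2049 moves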
 
Given that it's going to be in mass production only in 2018, I'm surprised that Xavier is not on 10nm. I was also expecting Nvidia to use some ARM R52 cores as well, but it looks like they have certified the Denver cores for ISO 26262.
I find that last sentence from the author particularly amusing:

Not really....but hey, whatever floats anyone's....errr autonomous boat.... *cough*

It's a huge market. The potential revenues from it could far exceed those from the GPU market.

ModEdit: Irrelevant bits removed & copied to spin-off
 
Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks


http://www.rle.mit.edu/eems/wp-content/uploads/2016/02/eyeriss_isscc_2016_slides.pdf

CVA is Eyeriss, maybe.
Bingo! That's a very good assumption. I don't see how you gain 4 times the power efficiency at the same node without a totally different uarch. Especially since we are talking about a very specific kind of mathematical problem: a GPU's generic ALUs are not the most efficient way to solve this computational need. An Eyeriss-like accelerator (or co-processor) is the only way to stay competitive against the dedicated deep learning ASICs that are under development. And it also shows how much Nvidia wants this market ...
 
I don't see how you gain 4 times the power efficiency at the same node without a totally different uarch.

How quickly you seem to have forgotten Maxwell.

Maxwell was on the same 28nm process as Kepler yet made vast uarch improvements.

He hasn't forgotten Maxwell. It sounds like you're in agreement with the second part of his sentence.
 
Is it realistic to expect Maxwell level gains again? You surely don't get to make efficiency/power optimisations on that scale twice?
 