NVIDIA Tegra Architecture

Good article at eetimes:
http://www.eetimes.com/document.asp?doc_id=1331727
In the brewing battle between Nvidia’s AI car computing platform and an Intel-Mobileye platform, Nvidia now appears to be building momentum.

According to Egil Juliussen, director of research, Infotainment & ADAS at IHS Automotive, Toyota has become the fourth major car OEM publicly committed to Nvidia's Drive PX for their highly automated vehicles. The other three OEMs are Audi, Daimler, and the VW Group.

This chart in particular is very impressive. Well done, Nvidia!
[Chart: Nvidia vs. Intel, May 2017]
 
NVIDIA Gives Xavier Status Update ....
Essentially the successor to the Tegra family, Xavier is planned to serve several markets. Chief among these is of course automotive, where NVIDIA has seen increasing success as of late. However, similar to their talk at the primary GTC 2017, at GTC China NVIDIA is pitching Xavier as an “autonomous machine processor,” identifying markets beyond automotive such as drones and industrial robots, pushing a concept in line with NVIDIA’s Jetson endeavors. As a Volta-based successor to the Pascal-based Parker, Xavier does include Volta’s Tensor cores, something that we noted earlier this year, and is thus more suitable than previous Tegras for the deep learning requirements in autonomous machines.

In the keynote, Jen-Hsun additionally revealed updated sampling dates for the new SoC, stating that Xavier sampling would begin in Q1 2018 for select early development partners and in Q3/Q4 2018 for the second wave of partners. This timeline actually represents a delay from the originally announced Q4 2017 sampling schedule, and in turn suggests that volume shipments are likely to come in 2019 rather than 2018.
https://www.anandtech.com/show/1187...orrt-3-announcement-at-gtc-china-2017-keynote
 
With a 2700 SPECint2K score, Carmel is a ~50% performance uplift over the Denver in the Nexus 9 (based on these results).
Carmel is also up to 43% wider, with a 10-wide architecture; by comparison, Denver is capable of 7+ instructions per cycle.
The claimed 30 DL TOPS are the GPU's 20 TOPS combined with the DLA's 10 int8 TOPS.
GPU frequency is 1300 MHz (derived from the FP32 CUDA FLOPS and the 512 CUDA cores).
With a 1.3 GHz clock and GV100's tensor core configuration (8 TCs per SM), that works out to ~10.6 tensor teraflops.
For int8 TOPS, tensor core throughput might be doubled to ~21.2 TOPS given GV100's register file bandwidth.
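
For anyone who wants to sanity-check those numbers, here is the arithmetic as a small Python sketch (the 512 CUDA cores, GV100-style 8 tensor cores per SM, and 1.3 GHz clock are the assumptions from this post, not confirmed Xavier specs):

```python
# Back-of-the-envelope check of the figures above. All inputs are the
# assumptions from this post, not confirmed Xavier specs.

CUDA_CORES = 512
CLOCK_GHZ = 1.3
SMS = CUDA_CORES // 64           # Volta has 64 FP32 cores per SM -> 8 SMs
TCS_PER_SM = 8                   # GV100 tensor core configuration
OPS_PER_TC_PER_CLK = 64 * 2      # 64 FMAs per tensor core per clock = 128 ops

fp32_tflops = CUDA_CORES * 2 * CLOCK_GHZ / 1000          # FMA counts as 2 ops
tensor_tflops = SMS * TCS_PER_SM * OPS_PER_TC_PER_CLK * CLOCK_GHZ / 1000
int8_tops = tensor_tflops * 2                            # doubled int8 rate

print(f"FP32:   {fp32_tflops:.2f} TFLOPS")    # ~1.33
print(f"Tensor: {tensor_tflops:.2f} TFLOPS")  # ~10.65
print(f"int8:   {int8_tops:.2f} TOPS")        # ~21.3
```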
 
Denver's way of "emulating" ARM instructions.
It emulates ARM instructions the same way every other ARM processor does: by decoding them into an internal µ-op format. There are two hardware decoders in Denver to convert ARM instructions into µops.
 

But wasn't there something wildly different about Denver compared to other ARM CPUs?
 
Instead of out-of-order µop scheduling in hardware, Denver relies on a software optimization layer (Dynamic Code Optimization) which monitors performance counters and performs instruction reordering, loop unrolling, register renaming, and so on for frequently used "hot" parts of the code, then saves the optimized µop code into RAM for reuse. More info on DCO can be found here.
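
To make that flow concrete, here is an illustrative Python sketch of the profile-then-optimize idea (the threshold and data structures are invented for illustration; this is not NVIDIA's actual DCO):

```python
# Illustrative only: count executions per code region (standing in for
# hardware perf counters); once a region is "hot", translate it once
# and cache the optimized form in memory for reuse.

HOT_THRESHOLD = 50

exec_counts = {}       # region_id -> execution count
optimized_cache = {}   # region_id -> optimized, reusable translation

def run_region(region_id, interpret, optimize):
    """Run a code region; `optimize` returns an optimized callable."""
    if region_id in optimized_cache:
        return optimized_cache[region_id]()       # fast path: cached µop code

    exec_counts[region_id] = exec_counts.get(region_id, 0) + 1
    if exec_counts[region_id] >= HOT_THRESHOLD:
        # "Optimize": reorder, unroll, rename, then save the result to RAM.
        optimized_cache[region_id] = optimize()
        return optimized_cache[region_id]()
    return interpret()                             # slow path for cold code
```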
 

So what would Xavier be for? They're pretty much out of the mobile devices market, aren't they?

Self-driving cars or auto manufacturers trying to develop SDCs?

Or maybe Nintendo Switch 2?
 
At 350 mm2 and 30 W on what TSMC calls a 12nm process, hardly.
At 7nm, it might not be impossible; at 7nm with EUV, or at 5nm, it could well be possible. However, with the Switch's sales volumes, a custom SoC may be the better, and no longer as risky, option.
 

Architecturally, Denver's behavior is externally different from other ARM cores with respect to both the hardware and the optimization software.
For example, the µops written out in the optimized code are in a VLIW format that is not 1:1 with the internal format. A decoder (simpler than the one for ARM instructions) is needed to expand the in-memory format, which multiplexes fields and, for the sake of code density, lacks signals for which unit will execute a µop.
The pipeline is also skewed so that a fetched bundle can contain a read-operation-write chain in a single bundle, which, at least at the ISA level, is not generally matched by standard cores outside of specific RMW instructions.
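
As a toy illustration of what "multiplexed fields, no unit signals" could mean in practice (the field layout below is entirely invented; Denver's real encoding is not public):

```python
# Invented 16-bit µop slot: opcode, destination, and source packed
# tightly, with no bits spent on which execution unit handles the op.
# The expand step re-derives the unit from the opcode, trading decode
# work for code density in memory.

def expand_slot(slot: int) -> dict:
    opcode = (slot >> 10) & 0x3F   # bits 15..10: 6-bit opcode
    dst    = (slot >> 5)  & 0x1F   # bits 9..5:  destination register
    src    = slot         & 0x1F   # bits 4..0:  source register
    unit = "load/store" if opcode >= 0x30 else "alu"   # derived, not stored
    return {"opcode": opcode, "dst": dst, "src": src, "unit": unit}
```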

The skewing itself is not entirely without precedent, although I'm not aware of any ARM microarchitectures that chose to skew enough to cover both source memory reads and destination writes. More significant is how the architecture commits results to the architectural state and to memory in a variable-length "transaction" at the granularity of a whole optimized subroutine, which neither ARM's ISA nor its µops permit.

That transactional nature is something that I'm curious about in light of the recent Meltdown and Spectre disclosures, since Denver's architecture has an aggressive "runahead" mode for speculating data cache loads and an undisclosed method for tracking and unwinding speculative writes in the shadow of a cache miss. Per Nvidia circa 2014, the philosophy was to load whatever it could then invalidate any speculative operations and queued writes, thus specifically relying on cache side effects to carry over from a speculative path.
Also unclear is how Denver tracked its writes, since its transactional memory method might have meant an in-core write buffer, or potentially updates to the L1 cache that could be undone. The latter case might mean additional side effects.
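
For concreteness, here is a minimal sketch of the in-core write-buffer possibility (one assumed design, not anything Nvidia has confirmed): stores are held back and either committed when the region's transaction completes or discarded on an abort, while speculative loads still perturb the cache, which is exactly the side effect at issue.

```python
# Assumed design, for illustration only: speculative stores are
# buffered in the core and only become architectural when the
# optimized region's transaction commits; an abort unwinds them
# by discarding the buffer.

class SpeculativeWriteBuffer:
    def __init__(self, memory):
        self.memory = memory    # backing store, e.g. {addr: value}
        self.pending = {}       # speculative stores, not yet visible

    def store(self, addr, value):
        self.pending[addr] = value          # held back until commit

    def load(self, addr):
        # The core's own speculative stores must be forwarded to loads.
        # (Loads still fill the cache, so those side effects survive abort.)
        return self.pending.get(addr, self.memory.get(addr))

    def commit(self):
        self.memory.update(self.pending)    # transaction completes
        self.pending.clear()

    def abort(self):
        self.pending.clear()                # unwind: writes never happened
```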

The prefetch path alone seems like it could be susceptible to a Spectre variant, and even if the optimizer were changed to add more safeguards, there's a lag before it is invoked for cold code.
Denver was also quoted as having the full set of typical direct, indirect, and call branch prediction hardware that could be leveraged for Spectre variant 2.
Meltdown might depend on what's effectively a judgement call for when permissions are checked and transactions are aborted, and whether the pipeline's speculative writes affect the L1 in a way other ARM cores wouldn't.

Unfortunately, I think the one Nexus device that used Denver aged out of security updates just shy of finding out what, if any, mitigations might be needed.
According to the following, at least some of the above apply to Xavier.
https://www.morningstar.com/news/ma...ystem-is-also-affected-by-security-flaws.html
 
Xavier seems like an even greater departure from a gaming-oriented SoC than Parker.
The GPU should be less than 100mm^2, so those CPU cores must be THICC.
Did they take away the 2*FP16 capabilities of the previous iGPUs, since they now have the tensor units for deep learning?

I also wonder why the PX Pegasus needs two Xavier SoCs. One would think the souped up CPU and I/O from Xavier would enable it to drive two dedicated GPUs.


One thing we can count on is that Xavier is never going into a Switch 2, ever.
 
I thought it was with regard to how they split/balance the functionality and hardware between the two SoCs, which also includes the two additional GPUs for the full-tier product.
 

At this point a Switch 2 will for sure include a real custom SoC. I bet Nvidia is working on it as we speak, even if only at the stage of defining requirements. The Switch's success in sales, and finally in getting third parties involved, all but assures a second iteration.

Unless, of course, Nvidia deems such an endeavour not worthwhile, and/or Nintendo is unwilling to pay as much as Nvidia wants for it. The Tegra X1 was an existing part after all, so the R&D had been done and they just needed to cover the cost. Will Nintendo pay Nvidia for a custom SoC? Remains to be seen.
 
Probably one of the coolest applications of these SoCs. It uses a Jetson with 4 GB of RAM:

https://www.skydio.com/technology/

Overpriced for what they're asking. They use a tiny sensor, so while you can probably recreate that Star Wars forest race sequence with this thing, the video won't be cinematic quality.

But they use all the AI buzzwords so some people might bite.
 

Shame it doesn't feature a 3-axis gimbal; that would really give it an edge. Although the fact that it can follow someone without something strapped to them is really nice. Feature-wise it looks great.
 