NVIDIA Tegra Architecture

Discussion in 'Mobile Graphics Architectures and IP' started by french toast, Jan 17, 2012.

  1. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    364
    Likes Received:
    290
Good article at EETimes:
http://www.eetimes.com/document.asp?doc_id=1331727
This chart in particular is very impressive. Well done, Nvidia!
[chart image]
     
    Florin and pharma like this.
  2. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,123
    Likes Received:
    907
    NVIDIA Gives Xavier Status Update ....
    https://www.anandtech.com/show/1187...orrt-3-announcement-at-gtc-china-2017-keynote
     
    #3922 pharma, Sep 27, 2017
    Last edited: Sep 27, 2017
  3. wco81

    Legend

    Joined:
    Mar 20, 2004
    Messages:
    5,979
    Likes Received:
    204
    Location:
    West Coast
    pharma likes this.
  4. Picao84

    Regular

    Joined:
    Feb 15, 2010
    Messages:
    975
    Likes Received:
    283
  5. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    308
    Likes Received:
    120
With a 2700 SPECint2K score, Carmel is a ~50% performance uplift over Denver in the Nexus 9 (based on these results).
Carmel is also up to 43% wider, with a 10-wide architecture; by comparison, Denver is capable of 7+ instructions per cycle.
The 30 claimed DL TOPS are the GPU's 20 TOPS combined with the DLA's 10 int8 TOPS.
GPU frequency is 1300 MHz (based on the FP32 CUDA flops and the 512 CUDA cores).
At 1.3 GHz with GV100's tensor core configuration (8 TCs per SM), that works out to 10.6 teraflops.
For int8, tensor core throughput might be doubled to ~21.2 TOPS given GV100's register file bandwidth.
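The arithmetic above can be sanity-checked in a few lines. This sketch assumes a Volta-style SM organization (64 CUDA cores and 8 tensor cores per SM, each tensor core doing 64 FMAs per clock, as on GV100); none of those internals are confirmed for Xavier in this thread.

```python
# Back-of-the-envelope check of the Xavier GPU figures quoted above.
# Assumed (not confirmed): Volta-style SMs, 64 CUDA cores and
# 8 tensor cores per SM, 64 FMAs (128 FLOPs) per tensor core per clock.

cuda_cores = 512
freq_ghz = 1.3

# FP32 CUDA throughput: one FMA = 2 FLOPs per core per clock.
fp32_tflops = cuda_cores * 2 * freq_ghz / 1000
print(f"FP32 CUDA: {fp32_tflops:.2f} TFLOPS")        # ~1.33 TFLOPS

sms = cuda_cores // 64           # 8 SMs
tensor_cores = sms * 8           # 64 tensor cores total
tc_fp16_tflops = tensor_cores * 128 * freq_ghz / 1000
print(f"Tensor FP16: {tc_fp16_tflops:.1f} TFLOPS")   # ~10.6 TFLOPS

# int8 at double the FP16 rate, plus the DLA's claimed 10 int8 TOPS:
tc_int8_tops = 2 * tc_fp16_tflops                    # ~21.3 TOPS
dla_int8_tops = 10
print(f"Total DL TOPS: {tc_int8_tops + dla_int8_tops:.1f}")
```

The computed ~21.3 GPU int8 TOPS lands close to the round "20" Nvidia quotes, which is consistent with 20 + 10 = 30 claimed DL TOPS.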
     
    Laurent06 and pharma like this.
  6. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    308
    Likes Received:
    120
It emulates ARM instructions the same way every other ARM processor does: by decoding them into an internal µop format. Denver has two hardware decoders that convert ARM instructions into µops.
     
  7. Picao84

    Regular

    Joined:
    Feb 15, 2010
    Messages:
    975
    Likes Received:
    283
But wasn't there something wildly different about Denver compared to other ARM CPUs?
     
  8. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    308
    Likes Received:
    120
Instead of using OoO µop scheduling, Denver relies on a software optimization layer (Dynamic Code Optimization), which monitors perf counters and performs instruction reordering, loop unrolling, register renaming and so on for frequently executed "hot" parts of the code, then saves the optimized µop code into RAM for reuse. More info on DCO can be found here.
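The profile-then-translate loop described above can be sketched as a toy model. Everything here (the threshold, the cache structure, the function names) is invented for illustration, not Denver's actual mechanism:

```python
# Toy sketch of a Denver-style Dynamic Code Optimization loop:
# count executions of a code region, and once it's "hot", store an
# optimized translation in RAM and reuse it on later executions.
# All names and thresholds are hypothetical.

HOT_THRESHOLD = 100          # executions before a region counts as hot
translation_cache = {}       # optimized µop traces kept in memory
exec_counts = {}             # stand-in for hardware perf counters

def run_region(region_id, interpret, optimize):
    """Execute a code region, promoting it to optimized form when hot."""
    if region_id in translation_cache:
        return translation_cache[region_id]()     # reuse optimized trace

    exec_counts[region_id] = exec_counts.get(region_id, 0) + 1
    if exec_counts[region_id] >= HOT_THRESHOLD:
        # Real DCO would reorder, unroll, rename registers, etc.;
        # here we just wrap the region to mark it as translated.
        translation_cache[region_id] = optimize(interpret)
        return translation_cache[region_id]()
    return interpret()       # cold path: plain hardware decode/execute
```

The key property this models is that optimization cost is paid once, on hot code only, and the result persists for reuse rather than being rediscovered every cycle as in a hardware OoO scheduler.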
     
    Grall and Picao84 like this.
  9. wco81

    Legend

    Joined:
    Mar 20, 2004
    Messages:
    5,979
    Likes Received:
    204
    Location:
    West Coast
    So what would this be for? They're pretty much out of the mobile devices market aren't they?

    Self-driving cars or auto manufacturers trying to develop SDCs?

    Or maybe Nintendo Switch 2?
     
  10. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,708
    Likes Received:
    528
At 350 mm2 and 30 W on what TSMC calls a 12nm process, hardly.
At 7nm, well, it might not be impossible. At 7nm with EUV, or at 5nm, it could well be possible. However, with the Switch's sales volumes, a custom SoC may be the better, and no longer as risky, option.
     
  11. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    678
    Likes Received:
    12
    It's a 60% uplift. Quite a nice speedup.
     
  12. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    308
    Likes Received:
    120
Yep, the CES2018 presentation deck and other materials explain this pretty clearly.
The CES2018 deck also contains a pretty cool Xavier die shot (thankfully, not an artistic render this time :-D)

    Yep, somehow miscalculated this :oops:
     
    Laurent06 likes this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,807
    Likes Received:
    2,072
    Location:
    Well within 3d
Architecturally, the processor's behavior is externally different with respect to both the hardware and the optimization software.
For example, the µops written out in the optimized code are a VLIW format that is not 1:1 with the internal format. A decoder (simpler than the one for ARM instructions) is needed to expand the in-memory format, which multiplexes fields and, for the sake of code density, lacks signals indicating which unit will execute a µop.
The pipeline is also skewed so that a fetched bundle can contain a read-operation-write chain in one bundle, which at least at the ISA level is not generally matched by standard cores outside of specific RMW instructions.

The skewing itself is not entirely without precedent, although I'm not aware whether any ARM microarchitectures chose to skew sufficiently for both source memory reads and destination writes. More significant is how the architecture commits results to the architectural state and to memory in a variable-length "transaction" at the granularity of a whole optimized subroutine, which neither ARM's ISA nor its µops permit.

That transactional nature is something I'm curious about in light of the recent Meltdown and Spectre disclosures, since Denver's architecture has an aggressive "runahead" mode for speculating data cache loads and an undisclosed method for tracking and unwinding speculative writes in the shadow of a cache miss. Per Nvidia circa 2014, the philosophy was to load whatever it could, then invalidate any speculative operations and queued writes, thus specifically letting cache side effects carry over from a speculative path.
Also unclear is how Denver tracked its writes: its transactional memory method might have meant an in-core write buffer, or potentially updates to the L1 cache that could be undone. The latter case might mean additional side effects.
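The security-relevant distinction here, between writes that get unwound and cache state that does not, can be made concrete with a minimal model. This is a hypothetical simplification (write-buffer variant), not Denver's disclosed design:

```python
# Minimal model of transactional commit of speculative writes.
# Hypothetical simplification: speculative stores sit in a write
# buffer until commit, but loads still leave cache side effects.

class SpeculativeCore:
    def __init__(self):
        self.memory = {}          # architectural memory state
        self.write_buffer = {}    # speculative stores, not yet visible
        self.cache = set()        # addresses touched: the side channel

    def load(self, addr):
        self.cache.add(addr)      # runahead fills the cache even if the
                                  # transaction later aborts
        return self.write_buffer.get(addr, self.memory.get(addr, 0))

    def store(self, addr, value):
        self.write_buffer[addr] = value   # not yet architectural

    def commit(self):
        # Whole optimized subroutine retires as one "transaction".
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        # Writes are unwound, but cache state is NOT rolled back;
        # that residue is what Meltdown/Spectre-style attacks measure.
        self.write_buffer.clear()
```

The point of the model: after `abort()`, architectural state looks untouched, yet `cache` still records which addresses the speculative path loaded.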

The prefetch path alone seems like it could be susceptible to a Spectre variant, and even if the optimizer were changed to add more safeguards, there is a lag before it is invoked for cold code.
Denver was also described as having the full set of typical direct, indirect, and call branch prediction hardware that could be leveraged for Spectre variant 2.
Meltdown might depend on what is effectively a judgement call about when permissions are checked and transactions are aborted, and on whether the pipeline's speculative writes affect the L1 in a way other ARM cores' wouldn't.

Unfortunately, I think the one Nexus device that used Denver aged out of security updates just shy of our finding out what, if any, mitigations might be needed.
    According to the following, at least some of the above apply to Xavier.
    https://www.morningstar.com/news/ma...ystem-is-also-affected-by-security-flaws.html
     
  14. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    8,414
    Likes Received:
    3,059
    Xavier seems like an even greater departure from a gaming-oriented SoC than Parker.
    The GPU should be less than 100mm^2, so those CPU cores must be THICC.
    Did they take away the 2*FP16 capabilities of the previous iGPUs, since they now have the tensor units for deep learning?

    I also wonder why the PX Pegasus needs two Xavier SoCs. One would think the souped up CPU and I/O from Xavier would enable it to drive two dedicated GPUs.


    One thing we can count on is that Xavier is never going into a Switch 2, ever.
     
    Picao84 likes this.
  15. Laurent06

    Regular

    Joined:
    Dec 14, 2007
    Messages:
    678
    Likes Received:
    12
    Perhaps to enable some lockstep mode or some variant of it?
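For what it's worth, the basic idea of lockstep is easy to sketch: run the same work on both units and cross-check. This is purely illustrative; the actual DRIVE PX Pegasus redundancy scheme isn't detailed in this thread:

```python
# Toy lockstep comparator for a dual-SoC setup (illustrative only).

def lockstep_step(soc_a, soc_b, inputs):
    """Run identical work on both SoCs and cross-check the outputs."""
    out_a = soc_a(inputs)
    out_b = soc_b(inputs)
    if out_a != out_b:
        # A mismatch means one unit has faulted; enter a safe state
        # rather than trust either result.
        raise RuntimeError("lockstep mismatch, entering safe state")
    return out_a
```

A variant of this (checking at coarser granularity, or voting among more than two units) is common in functional-safety designs, which would fit the fail-operational requirements of an autonomous vehicle.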
     
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    1,626
    Likes Received:
    662
I thought it was with regard to how they split/balance the functionality and hardware between the two SoCs, which also includes the two additional GPUs in the full-tier product.
     
  17. Picao84

    Regular

    Joined:
    Feb 15, 2010
    Messages:
    975
    Likes Received:
    283
At this point a Switch 2 will for sure include a real custom SoC. I bet Nvidia is working on it as we speak, even if only at the stage of defining requirements. The Switch's success in sales, and finally in getting third parties involved, all but assures a second iteration.

Unless, of course, Nvidia deems such an endeavour not worthwhile, and/or Nintendo is unwilling to pay as much as Nvidia wants for it. Tegra X1 was an existing part after all, so the R&D had already been done and Nvidia just needed to recoup its cost. Will Nintendo pay Nvidia for a custom SoC? Remains to be seen.
     
  18. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    878
    Likes Received:
    146
Absolutely. Would you really want to be in an autonomous vehicle where one part of the hardware could fail and leave you dead?
     
    Grall likes this.
