NVIDIA Tegra Architecture

Discussion in 'Mobile Graphics Architectures and IP' started by french toast, Jan 17, 2012.

  1. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    320
    Likes Received:
    144
    http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/
    http://www.tiriasresearch.com/downloads/nvidia-charts-its-own-path-to-armv8/ - register to get whitepaper, it's free
    So it's a code-morphing CPU with hardware ARM decoders. It searches for hot portions of code (heavy loops, etc.) and optimises them, then saves the optimized translation to memory so that later iterations can execute up to 7 native instructions per cycle. The end result is quite amazing considering it's an in-order CPU.
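    To make the "find hot code, translate once, reuse" idea concrete, here is a toy sketch. It is not how Denver is implemented: the threshold, the class name `ToyOptimizer` and the path labels are all made up for illustration, and the "optimizer" only tags a block instead of rescheduling it.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: executions before a block counts as "hot"


class ToyOptimizer:
    """Toy model: decode cold blocks every time, translate hot ones once, then reuse."""

    def __init__(self):
        self.run_counts = Counter()  # executions seen per block address
        self.cache = {}              # block address -> stored translation
        self.translations = 0        # how many times the optimizer ran

    def optimize(self, block):
        # Stand-in for a real optimizer: just tag the block as translated.
        self.translations += 1
        return ("optimized", tuple(block))

    def execute(self, addr, block):
        if addr in self.cache:
            return "native"          # reuse the translation saved in memory
        self.run_counts[addr] += 1
        if self.run_counts[addr] >= HOT_THRESHOLD:
            self.cache[addr] = self.optimize(block)
        return "decoded"             # this pass still went through the decoder path


opt = ToyOptimizer()
loop_body = ["ldr", "add", "str", "cmp", "bne"]
paths = [opt.execute(0x1000, loop_body) for _ in range(10)]
print(paths.count("decoded"), paths.count("native"), opt.translations)
```

    With these made-up numbers the first three passes take the decoder path, the remaining seven reuse the stored translation, and the optimizer is invoked only once.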
    Some results:
    [IMG]
    A7 results are amazing too
     
    #2801 OlegSH, Aug 11, 2014
    Last edited by a moderator: Aug 11, 2014
  2. Pressure

    Veteran Regular

    Joined:
    Mar 30, 2004
    Messages:
    1,259
    Likes Received:
    215
    128MB cache? o_O
     
  3. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    366
    Likes Received:
    292
    it's based on main RAM, not inside the SoC

    After reading the whitepaper, I'm glad to see some innovation on the ARM side. Now it's obvious why Denver had a very long dev time. It looks like it was first aimed at the x86 market, with opcode translation to avoid Intel issues/licence/patents. But with the ARM ecosystem taking off, Nvidia changed its mind and went for ARM.

    Curious to see how much performance improves over a number of benchmark runs thanks to the Dynamic Code Optimizer.
     
  4. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,742
    Likes Received:
    1,355
    According to the TechReport article (can't read white paper on my phone), it's 128KB. Not MB.
     
  5. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    Wow, Denver is nearly 2x faster than A7-Cyclone in Google Octane 2.0, and more than 2x faster than S800-Krait 400 in Geekbench 3 Single-Core score. Pretty impressive!
     
    #2805 ams, Aug 12, 2014
    Last edited by a moderator: Aug 12, 2014
  6. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    752
    Likes Received:
    192
    It looks like Denver's only about a third faster in Geekbench 3 to me (still impressive).
     
  7. ltcommander.data

    Regular

    Joined:
    Apr 4, 2010
    Messages:
    614
    Likes Received:
    11
    I'd be interested in seeing the Geekbench 3 Multi-Core result for Tegra K1 32 versus K1 64, which nVidia conveniently left out. When they were the first to go quad core with Tegra 3, nVidia couldn't stop promoting multi-core and multi-threading as being all the rage, and now they'd rather not talk about it.

    I wonder if ARM considers it a good thing or a bad thing that it looks like the first two shipping ARMv8 implementations are third-party custom designs rather than their own?
     
    #2807 ltcommander.data, Aug 12, 2014
    Last edited by a moderator: Aug 12, 2014
  8. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    Ok, here is roughly what the graph shows regarding performance relative to R3 Cortex A15 in Tegra K1:

    DMIPS
    Baytrail (Celeron N2910): 0.45x
    S800 (Krait 400 8974AA): 0.95x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 1.30x
    Haswell (Celeron 2955U): 1.00x
    Tegra K1 (Denver): 1.80x

    SPECInt 2K
    Baytrail (Celeron N2910): 0.70x
    S800 (Krait 400 8974AA): 0.60x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 0.90x
    Haswell (Celeron 2955U): 1.30x
    Tegra K1 (Denver): 1.45x

    SPECFP 2K
    Baytrail (Celeron N2910): 0.85x
    S800 (Krait 400 8974AA): 0.80x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): N/A
    Haswell (Celeron 2955U): 1.95x
    Tegra K1 (Denver): 1.75x

    AnTuTu 4
    Baytrail (Celeron N2910): N/A
    S800 (Krait 400 8974AA): 0.80x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 0.70x
    Haswell (Celeron 2955U): N/A
    Tegra K1 (Denver): 1.00x

    Geekbench 3 Single-Core
    Baytrail (Celeron N2910): 0.65x
    S800 (Krait 400 8974AA): 0.80x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 1.20x
    Haswell (Celeron 2955U): 1.20x
    Tegra K1 (Denver): 1.65x

    Google Octane v2.0
    Baytrail (Celeron N2910): 0.70x
    S800 (Krait 400 8974AA): 0.65x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 0.70x
    Haswell (Celeron 2955U): 1.45x
    Tegra K1 (Denver): 1.30x

    16MB Memcpy (GB/s)
    Baytrail (Celeron N2910): 0.85x
    S800 (Krait 400 8974AA): 0.80x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 1.15x
    Haswell (Celeron 2955U): 1.55x
    Tegra K1 (Denver): 1.40x

    16MB Memset (GB/s)
    Baytrail (Celeron N2910): 0.40x
    S800 (Krait 400 8974AA): 0.75x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 0.80x
    Haswell (Celeron 2955U): 0.65x
    Tegra K1 (Denver): 1.05x

    16MB Memread (GB/s)
    Baytrail (Celeron N2910): 1.25x
    S800 (Krait 400 8974AA): 1.55x
    Tegra K1 (R3 Cortex A15): 1.00x
    A7 (Cyclone): 1.85x
    Haswell (Celeron 2955U): 2.55x
    Tegra K1 (Denver): 2.60x


    So on average, TK1-Denver is ~ 1.45x faster than A7-Cyclone.

    The A8 CPU is expected to be clocked ~ 40% higher than the A7-Cyclone CPU, so the overall performance should be similar to the TK1-Denver CPU.
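    The ~1.45x average and the A8 estimate above can be checked with a few lines. This sketch only uses the eyeballed chart values listed in this post; SPECfp 2K is skipped because the A7 entry is N/A, and the 40% clock gain is the rumor quoted above, not a measurement.

```python
from statistics import geometric_mean, mean

# (Denver, A7) scores relative to Tegra K1 Cortex-A15, read off the list above.
scores = {
    "DMIPS": (1.80, 1.30),
    "SPECint 2K": (1.45, 0.90),
    "AnTuTu 4": (1.00, 0.70),
    "Geekbench 3 Single-Core": (1.65, 1.20),
    "Octane v2.0": (1.30, 0.70),
    "Memcpy": (1.40, 1.15),
    "Memset": (1.05, 0.80),
    "Memread": (2.60, 1.85),
}

# Per-benchmark advantage of Denver over A7.
ratios = {name: denver / a7 for name, (denver, a7) in scores.items()}

arith = mean(ratios.values())
geo = geometric_mean(ratios.values())

# Assumption from the post: A8 = A7 with ~40% higher clock and perfect scaling.
a8_clock_gain = 1.40

for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:.2f}x")
print(f"arithmetic mean {arith:.2f}x, geometric mean {geo:.2f}x")
print(f"Denver vs. hypothetical A8: {arith / a8_clock_gain:.2f}x")
```

    The arithmetic mean of the eight ratios comes out at about 1.45x, and the geometric mean (usually the fairer way to average ratios) is slightly lower. Either way the gap to a 1.40x-clocked A7 is only a few percent, which is within the error of reading values off a chart.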
     
  9. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    942
    Likes Received:
    229
    From the White Paper:

     
  10. ltcommander.data

    Regular

    Joined:
    Apr 4, 2010
    Messages:
    614
    Likes Received:
    11
    Assuming the A8 is a straight shrink + clock speed bump with no other architectural changes.

    Has there been a more refined estimate for Denver availability other than "later this year"? Given previous cadences and the number of leaks, the iPhone 6/A8 will almost certainly be announced and ship in volume in September. Unless nVidia can get Denver into customers' hands in the next month, the A8 will be Denver's most direct comparison target.
     
  11. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,404
    Likes Received:
    168
    Location:
    Chania
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,280
    Location:
    Helsinki, Finland
    The commenters on the Tech Report article seem to be really pessimistic about the code translation. However, most of them don't seem to understand that all modern CPUs perform some kind of code translation (especially the x86 CPUs, since x86 instructions are variable length and thus inefficient to execute directly). Denver translates the code once. Traditional CPUs translate the same hot code sequences tens of thousands of times every frame. And the same is true for the OoO machinery. Denver's software does the register renaming and reordering once (likely adjusting the results slightly all the time based on CPU feedback), while a traditional CPU does it again and again for the same code. The 80/20 rule seems to apply pretty well to code execution as well: 80% of the total CPU time is spent in 20% of the code (actually, in games it's closer to 90/10). And this 20% of the code is inside a loop (usually multiple layers of loops).

    Code that runs only once does not take noticeable time to execute (even if the executable were 100 megabytes in size), and thus it doesn't even need optimization. Denver tackles the right problem: code that runs frequently, again and again. The big question is how much feedback the code optimization process gets from the execution pipelines. If the feedback is based on real execution results, the software-based OoO should match hardware OoO quite closely in most cases. Obviously there can be algorithms that behave completely differently when the data set changes (and data set changes can be frequent). For example, you need hardware OoO to hide the cache misses of most pointer-based data structures. Obviously the preprocessing step can encode cache prefetching hints into the code flow, but this is not possible for dependency chains (ptr->ptr->ptr).
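    The 80/20 argument can be put into numbers with Amdahl's law. In this sketch the hot-code speedups (1.5x and 2x) are hypothetical values picked for illustration, not figures from NVIDIA; only the 80% and 90% time fractions come from the post.

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law: only the hot fraction of the run time gets faster."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


# hot_fraction: share of CPU time spent in hot loops (80/20 and 90/10 rules).
# hot_speedup: hypothetical gain of the optimized translation on that code.
for hot_fraction in (0.8, 0.9):
    for hot_speedup in (1.5, 2.0):
        total = overall_speedup(hot_fraction, hot_speedup)
        print(f"hot time {hot_fraction:.0%}, hot code {hot_speedup:.1f}x faster "
              f"-> overall {total:.2f}x")
```

    The cold code that runs once barely moves the result, which is the point made above: speeding up only the hot loops already captures most of the possible gain.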
     
    #2812 sebbbi, Aug 12, 2014
    Last edited by a moderator: Aug 12, 2014
  13. loekf

    Regular

    Joined:
    Jun 29, 2003
    Messages:
    613
    Likes Received:
    61
    Location:
    Nijmegen, The Netherlands
    Just wondering, how are these benchmarks computed? In general, any mobile CPU will perform some kind of (periodic) throttling, so that running at peak performance (clock speed) for too long doesn't cause overheating and the CPU (and system) stays within the thermal envelope (and doesn't drain the battery instantaneously).

    I never trust these slides, in general companies can fudge the numbers to their liking...
     
  14. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    TK1-Denver is rumored to be powering the very first 64-bit Android tablet this fall, running the Android L OS, in the form of an 8.9" Google Nexus tablet built by HTC. So an October-November timeframe would be a good guess. A7/A8/etc. are really only indirect competitors to TK1 due to the completely different OSes targeted by these SoCs.
     
  15. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    I also think the choices on their slides are a bit optimistic, and sadly I tend to be really cautious with slides that don't give much information...

    Anyway, certainly more info soon.
     
  16. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,404
    Likes Received:
    168
    Location:
    Chania
    Honest question: wasn't dual Denver supposed to break even in Antutu with the A15/K1 but with a frequency of 3.0GHz?
     
  17. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    947
    Likes Received:
    91
    Location:
    Luxembourg
    Antutu doesn't read the correct frequencies that the cores are running at. I doubt they were running at 3GHz given that the current bins are at 2.5GHz.
     
  18. Krysto

    Newcomer

    Joined:
    Aug 12, 2014
    Messages:
    1
    Likes Received:
    0
    Even if it wasn't the final chip, the main reason that benchmark came out the way it did is that both Antutu and Passmark score cores almost linearly. So a dual-core chip A whose cores are 2x as fast as those of a quad-core chip B will show up with an "equal score" to B in Antutu and Passmark.

    But in the real world, the dual-core with 2x the single-threaded performance will feel much faster.

    The same problem is seen in Passmark when comparing the dual-core Haswell 2955U Celeron with the quad-core (Atom Bay Trail) "Celeron". They show up with almost equal scores, but the lower-clocked Haswell core is 2x faster than the higher-clocked Atom core. So always look for single-threaded performance in Antutu and Passmark, because everything else will be misleading when comparing them.
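    A minimal sketch of why linear core scoring hides the difference. The two chips and their per-core numbers are the hypothetical A and B from this post, and `throughput` is a deliberately crude model (each busy thread gets one core, no other effects), not how Antutu or Passmark actually compute scores.

```python
def linear_score(cores, per_core):
    """Score model that simply sums per-core throughput."""
    return cores * per_core


def throughput(cores, per_core, runnable_threads):
    """Work done per unit time when only `runnable_threads` threads are busy."""
    return min(cores, runnable_threads) * per_core


chip_a = {"cores": 2, "per_core": 2.0}  # dual-core, 2x faster cores
chip_b = {"cores": 4, "per_core": 1.0}  # quad-core, baseline cores

print("linear score:", linear_score(**chip_a), linear_score(**chip_b))
for threads in (1, 2, 4):
    a = throughput(**chip_a, runnable_threads=threads)
    b = throughput(**chip_b, runnable_threads=threads)
    print(f"{threads} busy thread(s): A={a} B={b}")
```

    Both chips get the same linear score of 4.0, but with one or two busy threads chip A does twice the work of chip B; they only tie when all four of B's cores are kept busy.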
     
  19. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,404
    Likes Received:
    168
    Location:
    Chania
    Point accepted :)
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,405
    Likes Received:
    401
    Location:
    New York
    Well you have to give nVidia points for trying. Hope it works out for them. We can do with a shake up in CPU land.

    Not quite sure how Denver solves the problem of unpredictable cache misses though. The prefetcher can only do so much. The whole thing will still stall on a cache miss, right? Or am I missing something?
     
