NVIDIA Tegra Architecture

OlegSH · Aug 11, 2014

http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/
http://www.tiriasresearch.com/downloads/nvidia-charts-its-own-path-to-armv8/ - register to get whitepaper, it's free
So it's a code-morphing CPU with hardware ARM decoders. It looks for hot portions of code (heavy loops, etc.), optimises them, and saves the optimized translation to memory so that later iterations can run at up to 7 native instructions per cycle. The end result is quite amazing considering it's an in-order CPU
Some results:

A7 results are amazing too

Pressure · Aug 11, 2014

128MB cache?

xpea · Aug 11, 2014

Pressure said:
128MB cache?

it's based on main RAM, not inside the SoC

After reading the whitepaper, glad to see some innovation on the ARM side. Now it's obvious why Denver had a very long dev time. It looks like it was first aimed at the x86 market, with opcode translation to avoid Intel licence/patent issues. But with the ARM ecosystem taking off, Nvidia changed its mind and went for ARM.

Curious to see how much the Dynamic Code Optimizer improves performance over a number of benchmark runs.

silent_guy · Aug 12, 2014

Pressure said:
128MB cache?

According to the TechReport article (can't read white paper on my phone), it's 128KB. Not MB.

ams · Aug 12, 2014

OlegSH said:
http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/
http://www.tiriasresearch.com/downloads/nvidia-charts-its-own-path-to-armv8/ - register to get whitepaper, it's free
So it's a code-morphing CPU with hardware ARM decoders. It looks for hot portions of code (heavy loops, etc.), optimises them, and saves the optimized translation to memory so that later iterations can run at up to 7 native instructions per cycle. The end result is quite amazing considering it's an in-order CPU
Some results:

A7 results are amazing too

Wow, Denver is nearly 2x faster than A7-Cyclone in Google Octane 2.0, and more than 2x faster than S800-Krait 400 in Geekbench 3 Single-Core score. Pretty impressive!

iMacmatician · Aug 12, 2014

It looks like Denver's only about a third faster in Geekbench 3 to me (still impressive).

ltcommander.data · Aug 12, 2014

I'd be interested in seeing the Geekbench 3 Multi-Core result for Tegra K1 32 versus K1 64, which nVidia conveniently left out. When they were the first to go quad core with Tegra 3, nVidia couldn't stop promoting multi-core and multi-threading as being all the rage, and now they'd rather not talk about it.

I wonder if ARM considers it a good thing or a bad thing that it looks like the first two shipping ARMv8 implementations are third-party custom designs rather than their own?

ams · Aug 12, 2014

Ok, here is roughly what the graph shows regarding performance relative to R3 Cortex A15 in Tegra K1:

DMIPS
Baytrail (Celeron N2910): 0.45x
S800 (Krait 400 8974AA): 0.95x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.30x
Haswell (Celeron 2955U): 1.00x
Tegra K1 (Denver): 1.80x

SPECInt 2K
Baytrail (Celeron N2910): 0.70x
S800 (Krait 400 8974AA): 0.60x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.90x
Haswell (Celeron 2955U): 1.30x
Tegra K1 (Denver): 1.45x

SPECFP 2K
Baytrail (Celeron N2910): 0.85x
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): N/A
Haswell (Celeron 2955U): 1.95x
Tegra K1 (Denver): 1.75x

AnTuTu 4
Baytrail (Celeron N2910): N/A
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.70x
Haswell (Celeron 2955U): N/A
Tegra K1 (Denver): 1.00x

Geekbench 3 Single-Core
Baytrail (Celeron N2910): 0.65x
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.20x
Haswell (Celeron 2955U): 1.20x
Tegra K1 (Denver): 1.65x

Google Octane v2.0
Baytrail (Celeron N2910): 0.70x
S800 (Krait 400 8974AA): 0.65x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.70x
Haswell (Celeron 2955U): 1.45x
Tegra K1 (Denver): 1.30x

16MB Memcpy (GB/s)
Baytrail (Celeron N2910): 0.85x
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.15x
Haswell (Celeron 2955U): 1.55x
Tegra K1 (Denver): 1.40x

16MB Memset (GB/s)
Baytrail (Celeron N2910): 0.40x
S800 (Krait 400 8974AA): 0.75x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.80x
Haswell (Celeron 2955U): 0.65x
Tegra K1 (Denver): 1.05x

16MB Memread (GB/s)
Baytrail (Celeron N2910): 1.25x
S800 (Krait 400 8974AA): 1.55x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.85x
Haswell (Celeron 2955U): 2.55x
Tegra K1 (Denver): 2.60x

So on average, TK1-Denver is ~ 1.45x faster than A7-Cyclone.

The A8 CPU is expected to be clocked ~ 40% higher than the A7-Cyclone CPU, so the overall performance should be similar to the TK1-Denver CPU.
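
For anyone who wants to check the ~1.45x figure, here is a minimal sketch that averages the Denver-to-A7 ratios from the list above. The inputs are the rough chart readings from this post, SPECFP 2K is skipped because there is no A7 result, and the variable names are only illustrative.

```python
from statistics import geometric_mean, mean

# Relative scores read off the chart (R3 Cortex A15 in Tegra K1 = 1.00x).
# Each entry is (A7 Cyclone, Tegra K1 Denver). SPECFP 2K is omitted: A7 is N/A.
scores = {
    "DMIPS": (1.30, 1.80),
    "SPECInt 2K": (0.90, 1.45),
    "AnTuTu 4": (0.70, 1.00),
    "Geekbench 3 Single-Core": (1.20, 1.65),
    "Google Octane v2.0": (0.70, 1.30),
    "16MB Memcpy": (1.15, 1.40),
    "16MB Memset": (0.80, 1.05),
    "16MB Memread": (1.85, 2.60),
}

# Denver relative to A7 for each benchmark.
ratios = {name: denver / a7 for name, (a7, denver) in scores.items()}

arithmetic_avg = mean(list(ratios.values()))
geometric_avg = geometric_mean(list(ratios.values()))

for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:.2f}x")
print(f"arithmetic mean: {arithmetic_avg:.2f}x")
print(f"geometric mean:  {geometric_avg:.2f}x")
```

The plain average of the eight ratios comes out at about 1.45x, which matches the figure above. The geometric mean, often preferred for averaging ratios, is slightly lower at roughly 1.44x. Either way, a 1.4x clock bump on A7 would land in the same neighbourhood, assuming performance scales with clock.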

A1xLLcqAgt0qc2RyMz0y · Aug 12, 2014

silent_guy said:
According to the TechReport article (can't read white paper on my phone), it's 128KB. Not MB.

From the White Paper:

The optimization cache is stored in a private 128MB main memory buffer, a minor set aside in systems with 4GB (or more) of main memory.

...

The optimized code does incur some code expansion, which is why Denver has a larger 128KB L1 instruction cache.

ltcommander.data · Aug 12, 2014

ams said:
So on average, TK1-Denver is ~ 1.45x faster than A7-Cyclone.

The A8 CPU is expected to be clocked ~ 40% higher than the A7-Cyclone CPU, so the overall performance should be similar to the TK1-Denver CPU.

Assuming the A8 is a straight shrink + clock speed bump with no other architectural changes.

Has there been a more refined estimate for Denver availability other than "later this year"? Given previous cadences and the number of leaks, the iPhone 6/A8 will almost certainly be announced and shipping in volume in September. Unless nVidia can get Denver into customers' hands in the next month, the A8 will be Denver's most direct comparison target.

Ailuros · Aug 12, 2014

http://techreport.com/news/26906/nvidia-claims-haswell-class-performance-for-denver-cpu-core

Like TechReport, I'm cautious about reaching any preliminary conclusions before I can see real-time results under specific thermal conditions. So far the first impression is outstanding.

sebbbi · Aug 12, 2014

Ailuros said:
http://techreport.com/news/26906/nvidia-claims-haswell-class-performance-for-denver-cpu-core

Like TechReport, I'm cautious about reaching any preliminary conclusions before I can see real-time results under specific thermal conditions. So far the first impression is outstanding.

The commenters on the Tech Report article seem to be really pessimistic about the code translation. However, most of them don't seem to understand that all modern CPUs perform some kind of code translation (especially the x86 CPUs, since x86 instructions are variable length and thus inefficient to execute directly). Denver translates the code once. Traditional CPUs translate the same hot code sequences tens of thousands of times every frame. And the same is true for the OoO machinery. Denver's software does the register renaming and reordering once (likely adjusting the results slightly all the time based on CPU feedback), while a traditional CPU does it again and again for the same code. The 80/20 rule seems to apply pretty well to code execution as well. 80% of the total CPU time is spent in 20% of the code (actually in games it's closer to 90/10). And this 20% of the code is inside a loop (usually multiple layers of loops).

Code that runs only once doesn't take noticeable time to execute (even if the executable is 100 megabytes in size), and thus it doesn't even need optimization. Denver tackles the right problem: code that runs frequently, again and again. The big question is how much feedback the code optimization process gets from the execution pipelines. If the feedback is based on real execution results, the software-based OoO should match hardware OoO quite closely in most cases. Obviously there can be algorithms that behave completely differently when the data set changes (and the data set changes can be frequent). For example, you need hardware OoO to hide the cache misses of most pointer-based data structures. Obviously the preprocessing step can encode cache prefetching hints into the code flow, but this is not possible for dependency chains (ptr->ptr->ptr).
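
A toy model of the translate-once idea, purely for illustration. The class name, the hot threshold, and the counters are all made up, and the "optimizer" only tags a block instead of doing any real renaming or reordering. It is a sketch of the bookkeeping, not a description of how Denver's optimizer works internally.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: runs through hardware decode before a block counts as hot


class ToyTranslatingCore:
    """Toy model: decode cold code every time, translate hot code once and reuse it."""

    def __init__(self):
        self.run_counts = Counter()    # block address -> runs via the hardware decoder
        self.optimization_cache = {}   # block address -> stored "optimized" translation
        self.hw_decodes = 0            # times a block went through the hardware decoder
        self.optimizer_runs = 0        # times the software optimizer was invoked
        self.cached_runs = 0           # times a stored translation was reused

    def run_block(self, addr, instructions):
        if addr in self.optimization_cache:
            self.cached_runs += 1
            return self.optimization_cache[addr]
        self.hw_decodes += 1
        self.run_counts[addr] += 1
        if self.run_counts[addr] >= HOT_THRESHOLD:
            # Stand-in for the real work (renaming, reordering, ...): just tag the block.
            self.optimizer_runs += 1
            self.optimization_cache[addr] = ("optimized", tuple(instructions))
        return ("decoded", tuple(instructions))


core = ToyTranslatingCore()

# 100 cold blocks that each run once (startup code, rarely taken paths).
for addr in range(100):
    core.run_block(addr, ["ldr", "add", "str"])

# One hot loop body that runs 10,000 times.
for _ in range(10_000):
    core.run_block(0x1000, ["ldr", "mul", "add", "str", "b.ne"])

print(core.hw_decodes, core.optimizer_runs, core.cached_runs)
```

With these made-up numbers the script prints `103 1 9997`: the cold blocks are decoded once each and never optimized, while the hot loop pays for three decodes and a single optimizer run, after which every iteration reuses the stored translation.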

loekf · Aug 12, 2014

ltcommander.data said:
Assuming the A8 is a straight shrink + clock speed bump with no other architectural changes.

Has there been a more refined estimate for Denver availability other than "later this year"? Given previous cadences and the number of leaks, the iPhone 6/A8 will almost certainly be announced and shipping in volume in September. Unless nVidia can get Denver into customers' hands in the next month, the A8 will be Denver's most direct comparison target.

Just wondering, how are these benchmarks computed? In general, any mobile CPU will perform some kind of (periodic) throttling so that running at peak performance (clock speed) for too long doesn't cause overheating, and so that the CPU (and system) stays within the thermal envelope (and doesn't drain the battery instantaneously).

I never trust these slides, in general companies can fudge the numbers to their liking...

ams · Aug 12, 2014

ltcommander.data said:
Has there been a more refined estimate for Denver availability other than "later this year"? Given previous cadences and the number of leaks, the iPhone 6/A8 will almost certainly be announced and shipping in volume in September. Unless nVidia can get Denver into customers' hands in the next month, the A8 will be Denver's most direct comparison target.

TK1-Denver is rumored to be powering the very first 64-bit Android tablet this fall using the Android L OS in the form of an 8.9" Google Nexus tablet built by HTC. So October-November timeframe would be a good guess. A7/A8/etc. are really only indirect competitors to TK1 due to the completely different OS's targeted by these SoC's.

lanek · Aug 12, 2014

Ailuros said:
http://techreport.com/news/26906/nvidia-claims-haswell-class-performance-for-denver-cpu-core

Like TechReport, I'm cautious about reaching any preliminary conclusions before I can see real-time results under specific thermal conditions. So far the first impression is outstanding.

I also think there's a bit of optimistic selection in their slides, and sadly I tend to be really cautious with slides that come without much information...

Anyway, certainly more info soon.

Ailuros · Aug 12, 2014

lanek said:
I also think there's a bit of optimistic selection in their slides, and sadly I tend to be really cautious with slides that come without much information...

Anyway, certainly more info soon.

Honest question: wasn't dual Denver supposed to break even in Antutu with the A15/K1 but with a frequency of 3.0GHz?

Nebuchadnezzar · Aug 12, 2014

Ailuros said:
Honest question: wasn't dual Denver supposed to break even in Antutu with the A15/K1 but with a frequency of 3.0GHz?

Antutu doesn't read the correct frequencies that the cores are running at. I doubt they were running at 3GHz given that the current bins are at 2.5GHz.

Krysto · Aug 12, 2014

Nebuchadnezzar said:
Antutu doesn't read the correct frequencies that the cores are running at. I doubt they were running at 3GHz given that the current bins are at 2.5GHz.

Even if it wasn't the final chip, the main reason that benchmark came out the way it did is that both Antutu and Passmark score cores almost linearly. So a dual-core chip A whose cores are 2x as fast as those of a quad-core chip B will show up with an "equal score" to B in Antutu and Passmark.

But in the real world, the dual-core with 2x the single-threaded performance will feel much faster.

This same problem is seen in Passmark when comparing the dual-core Haswell 2955U Celeron with the quad-core (Atom Bay Trail) "Celeron". They show up with almost equal scores, but the lower-clocked Haswell core is 2x faster than the higher-clocked Atom core. So always look for single-threaded performance in Antutu and Passmark, because everything else will be misleading when comparing them.
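
A minimal sketch of the scoring argument above. The two functions and the chip numbers are hypothetical: the first models a benchmark that simply adds up per-core performance, the second models a workload where only a limited number of threads have anything to do. Neither is AnTuTu's or Passmark's actual formula.

```python
def aggregate_score(cores, per_core_perf):
    """Score model that simply sums per-core performance (the 'linear' scoring)."""
    return cores * per_core_perf


def throughput(cores, per_core_perf, busy_threads):
    """Work done when only `busy_threads` threads have something to do."""
    return min(cores, busy_threads) * per_core_perf


chip_a = {"cores": 2, "per_core_perf": 2.0}  # dual-core, fast cores
chip_b = {"cores": 4, "per_core_perf": 1.0}  # quad-core, slow cores

print("aggregate:", aggregate_score(**chip_a), aggregate_score(**chip_b))
for busy in (1, 2, 4):
    print(busy, "thread(s):",
          throughput(**chip_a, busy_threads=busy),
          throughput(**chip_b, busy_threads=busy))
```

Both chips get the same aggregate score of 4.0, but with one or two busy threads chip A does twice the work of chip B. They only draw level once four threads are kept busy, which is the case the linear score implicitly assumes.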

Ailuros · Aug 12, 2014

Nebuchadnezzar said:
Antutu doesn't read the correct frequencies that the cores are running at. I doubt they were running at 3GHz given that the current bins are at 2.5GHz.

Point accepted

trinibwoy · Aug 12, 2014

Well, you have to give nVidia points for trying. Hope it works out for them. We could do with a shake-up in CPU land.

Not quite sure how Denver solves the problem of unpredictable cache misses though. The prefetcher can only do so much. The whole thing will still stall on a cache miss, right? Or am I missing something?
