NVIDIA Tegra Architecture

Geekbench is less than useless (i.e. downright misleading) when comparing 32-bit ARM to 64-bit ARM.

Cheers

Of course, but even if one strips out the benefit in Geekbench of moving from 32-bit to 64-bit, Denver would still significantly outperform the latest and greatest R3 Cortex A15 in single core performance. And the single-threaded perf. per watt should be superior for Denver vs. R3 Cortex A15.

Don't get me wrong, clearly the A15 core has very good performance especially considering its die size (and I can totally see why the A57 core would be very good at more advanced fab process nodes), but that small area also imposes some limits on single-core perf. and single-core perf. per watt.
 
The only benchmarks in that graph you can use are Spec[int|fp], AnTuTu and Octane. The latter two depend a lot on the entire software stack.

Highly disagree on AnTuTu. The CPU benchmark part (a port of bytemark) is completely pointless. nVidia probably just put it there to look balanced in showing a benchmark that didn't improve vs K1-32, while laughing to themselves knowing no one takes AnTuTu seriously.

There are so many unknowns in this slide, and nVidia has played so many games with benchmarks before, that I'm really taking it all with a ton of salt, even the comparison between nVidia's own K1-32 and K1-64. There are some red flags already, like using a low-end 1.6GHz Bay Trail-D and a Snapdragon 800 that, while not as low end, is still a fair bit below top clocks (2.26GHz). And while it's easy to assume the K1-32 is 2.3GHz, we don't really know that for a fact.

Another thing that stands out is the Haswell part getting about the same memory performance despite having a much wider memory bus (and a uarch that's good at extracting bandwidth), which makes me wonder if they normalized the memory speeds somehow. Other big unknowns are the OSes used, compiler versions (for SpecInt), browser (for Octane), and whether the K1-64 was running AArch32 or AArch64 code.

I'm also wary of aggregate scores for Spec and Geekbench; I'd much rather see individual scores so I can tell whether they're actually doing really, really well in some sub-tests and not so well in others.

AFAIK K1-64 is pin-compatible with K1-32 which should mean the same memory interface, although it can still be using higher clocked RAM. Still, I think they put three highly synthetic memory sub-tests there because they're particularly proud of their memory streaming performance, which could support aggressive run-ahead execution and/or very good prefetchers, or maybe some particular idiom matching in the recompiler.
 
It should be smaller as it only has two Denver cores vs five Cortex A15 cores in the 32bit K1.

I wouldn't count that as certain. Each Denver core has 3x more L1 cache than each A15 core, is a lot wider, clocks higher, carries an updated instruction set and uses wider buses.

The ninja core in TK1 32bit should be area-optimized because it clocks much lower than the other four, so it probably doesn't take as much space.

My guess is nVidia will try to make the Denver TK1 pin-compatible with the current version, so that OEMs can refresh their products without a motherboard redesign, but it's possible that the Denver version takes more space.
 
Highly disagree on AnTuTu. The CPU benchmark part (a port of bytemark) is completely pointless. nVidia probably just put it there to look balanced in showing a benchmark that didn't improve vs K1-32, while laughing to themselves knowing no one takes AnTuTu seriously.

Chinese OEMs seem to love AnTuTu because the overall score scales well with more and more CPU cores. :D
 
If you believe a one-line quote with no data, so be it, but I require some actual data to back it up.

I will take Ashraf Eassa's opinion over Rys's at this point, as at least Ashraf put in some effort to come up with the die size and published his reasoning for it.

http://www.fool.com/investing/general/2014/06/04/just-how-big-is-nvidia-corporations-tegra-k1.aspx

Rys actually measured the SoC die size to be ~ 121 mm^2, so I think it is safe to say that Ashraf overestimated the SoC die size. Note that this is still ~ 50% larger than Tegra 4!

According to NVIDIA, the size of the GPU in Tegra 4 was only ~ 10.5 mm^2. So if Tegra K1 is ~ 40 mm^2 larger in overall SoC die size than Tegra 4, with CPU die size virtually the same between the two SoCs (since both have five A15 cores), then how in the world could mobile Kepler in Tegra K1 take up the 60-65 mm^2 estimated by Ashraf? It seems to me that the die size of mobile Kepler in Tegra K1 is actually less than 50 mm^2.
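
To make that arithmetic explicit, here's a rough back-of-envelope sketch using only the figures quoted in this thread. The Tegra 4 die size is inferred from the "~50% larger" remark, and the assumption that the CPU complex area is unchanged is just that, an assumption:

Code:
/* Back-of-envelope upper bound on the Kepler GPU area in Tegra K1,
 * using only the figures quoted in this thread. Not a measurement. */
#include <stdio.h>

int main(void)
{
    const double k1_die     = 121.0; /* mm^2, Rys's measured K1 die size     */
    const double tegra4_die = 80.0;  /* mm^2, implied by K1 being ~50% larger */
    const double tegra4_gpu = 10.5;  /* mm^2, NVIDIA's Tegra 4 GPU figure     */

    double growth    = k1_die - tegra4_die;   /* all new area, ~40 mm^2        */
    double gpu_bound = tegra4_gpu + growth;   /* bound if CPU area is unchanged */

    /* This bound also absorbs any other blocks that grew, so the GPU itself
     * should come in below it. */
    printf("Upper bound on K1 GPU area: ~%.0f mm^2\n", gpu_bound);
    return 0;
}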
 
It's okay if you don't trust another person's (or site's) word on the internet. But if we must debate someone's credibility, please create a thread in the proper sub-forum.
 
Highly disagree on AnTuTu. The CPU benchmark part (a port of bytemark) is completely pointless. nVidia probably just put it there to look balanced in showing a benchmark that didn't improve vs K1-32, while laughing to themselves knowing no one takes AnTuTu seriously.
Mea culpa. I don't know AnTuTu; I just discounted the synthetics and Geekbench, which I know to be flawed (beyond redemption).
I'm also wary of aggregate scores for Spec and Geekbench; I'd much rather see individual scores so I can tell whether they're actually doing really, really well in some sub-tests and not so well in others.

I would like to know absolute SpecInt scores instead of these relative comparisons. If Nvidia didn't play any games and used published scores from the competitors, then their score is quite good, considering they only have two cores and a couple of sub-benchmarks benefit from autopar.

Even if you discount the crypto hash sub-tests from Geekbench, it is little more than a collection of microbenchmarks and doesn't correlate well with real-world performance.

Still, I think they put three highly synthetic memory sub-tests there because they're particularly proud of their memory streaming performance, which could support aggressive run-ahead execution and/or very good prefetchers, or maybe some particular idiom matching in the recompiler.

My bet is their dynamic translation software unrolled the memory load/write/copy loop umpteen (or hundreds of) times, allowing for very early software prefetch. As you say, a special case.
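
Just to illustrate the kind of transformation I mean, here's a minimal sketch in plain C with the GCC/Clang __builtin_prefetch hint. This is only the general shape of an unrolled, prefetch-ahead copy loop, not anything actually emitted by nVidia's translator:

Code:
#include <stddef.h>
#include <stdint.h>

/* Unrolled copy loop with prefetches issued far ahead of the loads,
 * which is roughly what a streaming memory benchmark rewards. */
void copy_unrolled(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* prefetch ~8 cache lines (512 bytes) ahead, with a streaming hint */
        __builtin_prefetch(&src[i + 64], 0, 0);
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        dst[i + 4] = src[i + 4];
        dst[i + 5] = src[i + 5];
        dst[i + 6] = src[i + 6];
        dst[i + 7] = src[i + 7];
    }
    for (; i < n; i++)  /* tail */
        dst[i] = src[i];
}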
 
I would like to know absolute SpecInt scores instead of these relative comparisons. If Nvidia didn't play any games and used published scores from the competitors, then their score is quite good, considering they only have two cores and a couple of sub-benchmarks benefit from autopar.

Does SpecInt2k even have autopar?

That's kind of another thing: Spec2k is crazy old.

These slides really remind me a lot of Intel's Silvermont slides, which made it look ridiculously fast. I haven't gone back to try to correlate them, but my general impression from the reviews was nowhere near as impressive.

My bet is their dynamic translation software unrolled the memory load/write/copy loop umpteen (or hundreds of) times, allowing for very early software prefetch. As you say, a special case.

I'd really love to see some snippets from their code translation. I hope they do a real paper on this.
 
I'd really love to see some snippets from their code translation. I hope they do a real paper on this.

Take a look at slides 15 through 23 of this Hot Chips 2014 Nvidia Denver Slide Deck:

https://drive.google.com/file/d/0B8mBa_eA8Zf2anBYZHBzX3FueGc/edit?usp=sharing

I sure would love to hear the actual talk that went with these slides, but they show that early in the SpecInt:Crafty run about 55% of the time was Hardware Decoder Execution, 15% was Dynamic Code Optimizer and 30% was Optimized ucode Execution. This ramps pretty quickly to about 95% Optimized ucode Execution. Then another blip happens about 25% into the program, when another cycle of Hardware Decoder Execution and Dynamic Code Optimizer takes place; after that, Optimized ucode Execution runs at near 100% until the very end.
 
Then another blip happens about 25% into the program, when another cycle of Hardware Decoder Execution and Dynamic Code Optimizer takes place; after that, Optimized ucode Execution runs at near 100% until the very end.

They double IPC from 0.6 to 1.2 instructions per cycle with their dynamic translation. Simply decoding ARMv8 instructions, packing them into 7-wide VLIW words and executing them in order is probably pretty far from optimal.

Cheers
 
If you believe a one-line quote with no data, so be it, but I require some actual data to back it up.

I will take Ashraf Eassa's opinion over Rys's at this point, as at least Ashraf put in some effort to come up with the die size and published his reasoning for it.

http://www.fool.com/investing/general/2014/06/04/just-how-big-is-nvidia-corporations-tegra-k1.aspx
As pointed out, I actually measured it.

I've also pondered talking about that 'analysis' from Eassa a bunch of times, but it never seemed worth it because it was so bad. You don't have to look too hard at his reasoning to figure that out for yourself. The GPU area estimate is particularly egregious because it's the part he could most easily have based on fact. He estimates 60-65mm2 for the K1 GPU, which is effectively a 1 SMX, 1 GPC Kepler, basing that on the following:

"In fact, the 192-core Kepler configuration is more or less identical to the GeForce GT 630, which was a 192-core Kepler discrete GPU. The die size for this chip was 79 square millimeters on TSMC's 28-nanometer HKMG process."

:rolleyes: GT630 (v2) is a 384 CUDA core part and it's not 79mm2.
 
As pointed out, I actually measured it.

Thanks for clearing that up.

I've also pondered talking about that 'analysis' from Eassa a bunch of times, but it never seemed worth it because it was so bad.
The problem is that a Google search for "tegra k1 die size" turns up only his analysis. So as bad as it is, it gets taken as gospel by many because it's the only one people see. And thanks for pointing out some of his errors.

EDIT: I have posted some corrections to his article.
 
Take a look at slides 15 through 23 of this Hot Chips 2014 Nvidia Denver Slide Deck:

https://drive.google.com/file/d/0B8mBa_eA8Zf2anBYZHBzX3FueGc/edit?usp=sharing

While those figures are mildly interesting, that's nothing close to what I meant by code snippets. I mean examples of the instruction sequences generated by the compiler.

I sure would love to hear the actual talk that went with these slides, but they show that early in the SpecInt:Crafty run about 55% of the time was Hardware Decoder Execution, 15% was Dynamic Code Optimizer and 30% was Optimized ucode Execution. This ramps pretty quickly to about 95% Optimized ucode Execution. Then another blip happens about 25% into the program, when another cycle of Hardware Decoder Execution and Dynamic Code Optimizer takes place; after that, Optimized ucode Execution runs at near 100% until the very end.

What it shows is nice behavior with a benchmark whose hot code spans no more than a few thousand instructions, i.e. a small program. This is less of a general characterization of their compiler at runtime than it is a characterization of 186.crafty. I'm sure there's a reason why they picked it to demonstrate and not 176.gcc.

Speaking of 186.crafty (http://www.spec.org/cpu2000/CINT2000/186.crafty/docs/186.crafty.html) it's one of those unusual tests that gets a big benefit from using 64-bit math (perhaps in vectorization, if the compiler is smart enough), much like Geekbench giving an unnatural advantage to AArch64. Which is why I'd really like to see individual SpecInt2k scores so we can see what sorts of things it's really good at and what it's not so good at, and speculate as to why, instead of just going "oooh such big Spec score nice"
 
Speaking of 186.crafty (http://www.spec.org/cpu2000/CINT2000/186.crafty/docs/186.crafty.html) it's one of those unusual tests that gets a big benefit from using 64-bit math (perhaps in vectorization, if the compiler is smart enough), much like Geekbench giving an unnatural advantage to AArch64. Which is why I'd really like to see individual SpecInt2k scores so we can see what sorts of things it's really good at and what it's not so good at, and speculate as to why, instead of just going "oooh such big Spec score nice"
Crafty uses 64-bit integers all over the place (it's a bitboard-based chess engine).
 
They double IPC from 0.6 to 1.2 instructions per cycle with their dynamic translation. Simply decoding ARMv8 instructions, packing them into 7-wide VLIW words and executing them in order is probably pretty far from optimal.

Cheers

Just curious but what's the current benchmark for measured IPC? I have no idea if 1.2 is good, bad or somewhere in between.
 
Crafty uses 64-bit integers all over the place (it's a bitboard-based chess engine).

Exactly, it's not a good representative test. Maybe vectorization can narrow the gap since NEON has pretty good 64-bit integer support, but that's assuming it can actually trigger. And in some cases it could make things worse - for example Crafty uses a 64-bit clz variant with intrinsics using bsf on x86 and a LUT elsewhere. A compiler that's cheating on this test will turn it into a 64-bit clz if available, but that's not an option on NEON and using its fields as inputs to a LUT is not good.
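
For illustration, here are the two idioms side by side in plain C. This is not Crafty's actual source, just the shape of the contrast; __builtin_ctzll is the GCC/Clang builtin that maps to bsf/tzcnt on x86:

Code:
#include <stdint.h>

/* x86-style path: a single 64-bit bit-scan via a compiler builtin. */
static inline int first_one_intrinsic(uint64_t bb)
{
    return bb ? __builtin_ctzll(bb) : 64;
}

/* Portable fallback: walk the board a nibble at a time through a small
 * lookup table. This is the kind of code a vectorizer or recompiler
 * struggles to collapse back into a single bit-scan. */
static const int8_t first_one_4bit[16] = {
    -1, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
};

static inline int first_one_lut(uint64_t bb)
{
    for (int nib = 0; nib < 16; nib++) {
        unsigned chunk = (unsigned)((bb >> (4 * nib)) & 0xF);
        if (chunk)
            return 4 * nib + first_one_4bit[chunk];
    }
    return 64;  /* empty bitboard */
}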

186.crafty also has very low cache pressure, not just icache but dcache:

http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/new_tables/186.crafty1_64.tab
 
Exactly, it's not a good representative test. Maybe vectorization can narrow the gap since NEON has pretty good 64-bit integer support, but that's assuming it can actually trigger. And in some cases it could make things worse - for example Crafty uses a 64-bit clz variant with intrinsics using bsf on x86 and a LUT elsewhere. A compiler that's cheating on this test will turn it into a 64-bit clz if available, but that's not an option on NEON and using its fields as inputs to a LUT is not good.

186.crafty also has very low cache pressure, not just icache but dcache:

http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/new_tables/186.crafty1_64.tab
IMHO only 176.gcc can be considered representative in SPEC 2000. And that's why, as you wrote, having individual scores would have been much more interesting (though I wonder if SPEC rules wouldn't forbid that...).
 
Just curious but what's the current benchmark for measured IPC? I have no idea if 1.2 is good, bad or somewhere in between.

I would also like to know.

Since the IPC starts out at 0.6 using the ARM HW decoder and then doubles to 1.2 using the optimized ucode, I take that to mean the IPC mentioned in the slides is counted in ARMv8 instructions and not in Nvidia's native ucode instructions.
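
As a trivial sketch of what those numbers translate to if that reading is right (the 2.5GHz clock below is just an assumed figure for scale, not a confirmed Denver spec):

Code:
#include <stdio.h>

int main(void)
{
    const double clock_hz       = 2.5e9; /* assumed peak clock, for scale only */
    const double ipc_hw_decoder = 0.6;   /* from the Hot Chips Crafty trace    */
    const double ipc_optimized  = 1.2;   /* after the dynamic code optimizer   */

    /* If the slide counts retired ARMv8 instructions, throughput is IPC * clock. */
    printf("HW decoder path : %.1f billion ARMv8 instr/s\n", clock_hz * ipc_hw_decoder / 1e9);
    printf("Optimized ucode : %.1f billion ARMv8 instr/s\n", clock_hz * ipc_optimized  / 1e9);
    return 0;
}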

Architecturally, the Cortex A57 is much like a tweaked Cortex A15 with 64-bit support. The CPU is still a 3-wide/3-issue machine with a 15+ stage pipeline. ARM has increased the width of NEON execution units in the Cortex A57 (128-bits wide now?) as well as enabled support for IEEE-754 DP FP. There have been some other minor pipeline enhancements as well. The end result is up to a 20 - 30% increase in performance over the Cortex A15 while running 32-bit code. Running 64-bit code you'll see an additional performance advantage as the 64-bit register file is far simplified compared to the 32-bit RF.

http://www.anandtech.com/show/6420/arms-cortex-a57-and-cortex-a53-the-first-64bit-armv8-cpu-cores
Does anyone have the IPC for SpecInt:Crafty execution on Cortex A15s, or any preliminary IPC data for the Cortex A57?
 