Samsung Exynos 7420 architecture analysis @ Anandtech

Great work indeed!

A comment: 181.mcf takes a hit on AArch64 because of the larger pointer size, which results in more cache thrashing (the main data structure is a pointer-based graph).
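
To put a rough number on the pointer-size effect, here is a minimal sketch using Python's ctypes. The node layout and field names are hypothetical (they are not taken from the 181.mcf source), and plain 32-bit and 64-bit integers stand in for 4-byte and 8-byte pointers.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumed cache line size, for illustration only


def make_node(pointer_type):
    """Build a hypothetical graph node whose links use the given pointer stand-in."""
    class Node(ctypes.Structure):
        _fields_ = [
            ("potential", ctypes.c_int32),  # illustrative payload field
            ("pred", pointer_type),         # the four links are stand-ins for pointers
            ("child", pointer_type),
            ("sibling", pointer_type),
            ("basic_arc", pointer_type),
        ]
    return Node


def nodes_per_line(node_size, line_size=CACHE_LINE_BYTES):
    """How many whole nodes fit in one cache line."""
    return line_size // node_size


size32 = ctypes.sizeof(make_node(ctypes.c_uint32))  # 4-byte "pointers"
size64 = ctypes.sizeof(make_node(ctypes.c_uint64))  # 8-byte "pointers"

print("node size with 4-byte links:", size32, "bytes,", nodes_per_line(size32), "per line")
print("node size with 8-byte links:", size64, "bytes,", nodes_per_line(size64), "per line")
```

With this made-up layout the node roughly doubles in size, so fewer nodes fit in each cache line and a graph walk touches more lines for the same work.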

OTOH I'm surprised by Crafty's bad result. This one should benefit a lot from 64-bit integers as it's using 64-bit bitboards.
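
A small sketch of why 64-bit registers should help bitboard code: one shift of a 64-bit board becomes several operations when the board has to live in two 32-bit halves. The function names and the sample board are illustrative only and are not from Crafty.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split(bb):
    """Split a 64-bit bitboard into (low, high) 32-bit halves."""
    return bb & MASK32, (bb >> 32) & MASK32


def join(lo, hi):
    """Recombine two 32-bit halves into one 64-bit bitboard."""
    return (hi << 32) | lo


def shift_up_64(bb):
    """Move every piece up one rank: a single shift on a 64-bit register."""
    return (bb << 8) & MASK64


def shift_up_32(lo, hi):
    """Same move on 32-bit halves: two shifts plus a carry from low to high."""
    new_hi = ((hi << 8) | (lo >> 24)) & MASK32
    new_lo = (lo << 8) & MASK32
    return new_lo, new_hi


# Hypothetical board with bits 24..31 set, so the shift crosses the 32-bit boundary.
board = 0x00000000FF000000
lo, hi = split(board)
assert join(*shift_up_32(lo, hi)) == shift_up_64(board)
```

Both paths give the same board, but the 32-bit path needs the extra carry step for every shift, which is why a large slowdown on AArch64 looks suspicious.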
 
Now this is what Anandtech is about! :)
A very interesting and understandable investigation that not only is thorough and revealing, but also clearly delineates the level of confidence the author has about specific statements.
 
Very nice article, but I still have one minor complaint... in the perf/W table on page 7 I miss the values for the 5420 GPU. They would make it easier to get a picture of how much the Nvidia and AMD GPUs could gain from the new process.
 
I took a closer look at SPEC2000 results and some of them make no sense. In particular AArch64 vs AArch32 A57 speedup/slowdown for 253.perlbmk and 186.crafty are most likely wrong. How was this compiled? With what version of gcc?
 
NDK10d so GCC 4.8

-Ofast -ffast-math -flto -march=armv8-a -ftree-vectorize -fno-jump-tables -fgcse -fgcse-lm -fgcse-sm -mtune=cortex-a57
-funroll-all-loops -static -opt-mem-layout-trans=3 -opt-prefetch -Wall -fPIC -fPIE -pie -Ofast -ffast-math -flto -march=armv8-a -ftree-vectorize -fno-jump-tables -fgcse -fgcse-lm -fgcse-sm -mtune=cortex-a57

I agree that the crafty scores look wrong but I haven't been able to find out why.
 
The size difference between the big and LITTLE cores is really large!
I thought it'd be around 2x larger, but a Cortex A15 cluster is actually >4x larger than a Cortex A7 cluster.
The difference is smaller with the A53 because it's ~40% larger than the A7.

So if we assume that a 2-core cluster is ~60% the size of a 4-core cluster (the "MP core glue" probably can't be halved), then on Samsung's 20nm process an 8-core Cortex A53 would occupy 9.16mm^2 whereas a 2*A53 + 2*A57 would occupy 11.81mm^2.

The Chinese SoC makers like HiSilicon and MediaTek have been sacrificing single-threaded performance for a measly 2.5-3mm^2...
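
The arithmetic behind those totals can be sketched as follows. The two totals and the ~60% factor come from the post; the per-cluster areas are back-derived from them here, so treat them as implied values rather than measurements.

```python
PAIR_FRACTION = 0.60          # assumption from the post: 2-core cluster ~60% of a 4-core one
OCTA_A53_MM2 = 9.16           # post's figure: 8x A53 (two 4-core clusters) on 20nm
BIG_LITTLE_2_2_MM2 = 11.81    # post's figure: 2x A53 + 2x A57 on 20nm


def dual_core_area(quad_area_mm2, fraction=PAIR_FRACTION):
    """Estimated area of a 2-core cluster given the 4-core cluster area."""
    return fraction * quad_area_mm2


# Implied 4-core A53 cluster area: half of the 8-core total.
a53_quad = OCTA_A53_MM2 / 2

# Implied 4-core A57 cluster area: whatever is left after the 2-core A53 part,
# scaled back up from the 2-core estimate.
a57_quad = (BIG_LITTLE_2_2_MM2 - dual_core_area(a53_quad)) / PAIR_FRACTION

# Extra die area the 2+2 big.LITTLE layout costs over the 8x A53 layout.
extra_area = BIG_LITTLE_2_2_MM2 - OCTA_A53_MM2

print(f"implied 4x A53 cluster: {a53_quad:.2f} mm^2")
print(f"implied 4x A57 cluster: {a57_quad:.2f} mm^2")
print(f"extra area for 2+2:     {extra_area:.2f} mm^2")
```

The difference works out to 2.65mm^2, which is where the 2.5-3mm^2 range comes from.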
 
Some things I jotted down while I was reading it properly just now:
  • Samsung getting to mass production and customer delivery with complex 14nm SoCs before TSMC with 16FF or 16FF+ is noteworthy.
  • Dropping Qualcomm for Galaxy S and Galaxy Note is enough to cause Qualcomm a profit warning, which says a bit about ASPs for high-end chips.
  • The T760 core layouts on 5433 and 7420 are completely different and it is definitely 1MB L2 for the GPU (I haz very die shots, much cell counting, wow)
  • I don't really like the trend of heavily overvolting (30%!) the CPU complexes. Feels like it's aimed at high marketing numbers more than a better user experience.
  • The Cortex-M3 based live feedback loop thing is cool, and looks like the way forward for active dynamic power management.
  • I wonder how expensive the custom PMIC is, to be able to respond that quickly. I don't believe that's common.
  • For device minimum power you've got 358mW in the table, but 330mW in the text.
  • You could probably sell that undervolted kernel!
  • It's nice to have a public power figure for a high performance memory system plus DRAM (statements have always been in the ballpark but now there's something to link to).
 
You don't think people would pay for a prepackaged kernel they could install with an easy method on their rooted device, and restore the original if they don't like it? I would.
 
You're in the minority... doing such a thing would result in a witch-hunt against the developer, and a copycat would be doing it for free within the same day.

Btw regarding PMICs: basically every SoC nowadays comes with its own special-purpose design. I think Samsung was one of the last to drop Maxim in favour of their own designs. The S2MPS15 in the S6 ramps up at 12mV per µs.
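
To give a feel for what that ramp rate means, a tiny sketch: only the 12mV per µs figure comes from the post, the voltage steps are made-up examples, and a linear ramp is assumed.

```python
RAMP_MV_PER_US = 12.0  # ramp rate quoted above for the S2MPS15


def ramp_time_us(delta_mv, rate_mv_per_us=RAMP_MV_PER_US):
    """Time in microseconds to raise the rail by delta_mv, assuming a linear ramp."""
    return delta_mv / rate_mv_per_us


# Hypothetical voltage steps, for illustration only.
for step_mv in (60, 120, 300):
    print(f"+{step_mv} mV takes about {ramp_time_us(step_mv):.1f} us")
```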
 
"NDK10d so GCC 4.8"
There's no gcc 4.8 for AArch64 in r10d, it's 4.9.

"-Ofast -ffast-math -flto -march=armv8-a -ftree-vectorize -fno-jump-tables -fgcse -fgcse-lm -fgcse-sm -mtune=cortex-a57
-funroll-all-loops -static -opt-mem-layout-trans=3 -opt-prefetch -Wall -fPIC -fPIE -pie"
Aren't -opt-mem-layout-trans and -opt-prefetch ICC (Intel x86 compiler) specific optimization flags? I couldn't find them in any GCC documentation.

"I agree that the crafty scores look wrong but I haven't been able to find out why."
If you don't force the use of 64-bit types then it might explain the bad result. Try with -DHAS_LONGLONG or -DLONG_HAS_64BITS.
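
A quick way to see which C integer types are 64 bits wide is to ask ctypes. This is only a sketch: it reports the widths on the machine running the script (not on the cross-compilation target), and the mapping from widths to the two defines is an assumption based on the macro names, not something checked against the benchmark source.

```python
import ctypes

# Bit widths of the C types as seen by the host's C ABI.
widths = {
    "long": ctypes.sizeof(ctypes.c_long) * 8,
    "long long": ctypes.sizeof(ctypes.c_longlong) * 8,
}


def suggested_define(type_widths):
    """Pick one of the defines mentioned above (assumed meaning, inferred from the names)."""
    if type_widths["long"] == 64:
        return "-DLONG_HAS_64BITS"
    if type_widths["long long"] == 64:
        return "-DHAS_LONGLONG"
    return None


print(widths)
print("suggested define:", suggested_define(widths))
```

On an LP64 target such as AArch64 Linux or Android, long is 64 bits wide, so either define should end up selecting a native 64-bit type.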
 
It's pleasantly surprising. And a very thorough and methodical article.
I now have to bow my head to those who said there is use in 4 little cores. Even in browsers. (I never had any doubts about power efficiency in games).
 