Samsung SoC & ARMv8 discussions

So clock-for-clock, Denver still beats A57..but Denver's problem is power consumption and not performance. Also FWIW, rumours say Nvidia will use Denver in its Parker SoC.
That is a misleading statement. Different perf/watt, different target form factors, different core counts, unknown die sizes all make a direct comparison pointless. The two chips are built with vastly different approaches to performance characteristics. "Clock-for-clock Denver beats A57.." does not mean much, if anything at all. A clock-for-clock comparison between Tegra X1 and Exynos 5433 will be more interesting and meaningful.

I'm not aware of any intermediary resource allocation in the ARM IP beyond either having the crypto accelerators or not. It just looks like throttling on the S810; the 5433 performs better.

Is there a chance that Samsung has learned a thing or two about global task switching that Qualcomm has yet to master, and that the scores reflect it?

Edit: Never mind. You guys seem to have considered it already.
 
Qualcomm uses a totally custom scheduler for big.LITTLE, so you might be onto something in suggesting it may sit on the A53 cores in the single-thread benchmark. I didn't think of that (that's a bad thing).
They (throttling and scheduling) may not be mutually exclusive. Could it be that the schedulers call for A53 when the chip is thermally challenged?
 
That is a misleading statement. Different perf/watt, different target form factors, different core counts, unknown die sizes all make a direct comparison pointless. The two chips are built with vastly different approaches to performance characteristics. "Clock-for-clock Denver beats A57.." does not mean much, if anything at all. A clock-for-clock comparison between Tegra X1 and Exynos 5433 will be more interesting and meaningful.

I really don't know what you are trying to say here. We were only talking about performance, not perf/watt (though I did in fact mention that Denver consumes more power). Core counts also do not matter as we are comparing single-threaded performance. And who cares what the die size is..again we are only talking about performance. It is by no means misleading to say that clock-for-clock, Denver beats A57. And either way..my statement was in response to mboeller, who said that the 7420 beats K1 in single-thread clock-for-clock performance.
 
This piece was gestating for way too long but finally here it is: http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review

I've barely skimmed through it so far, but there's a lot of interesting stuff in there, nice work.

I do wonder about one thing: you compared the energy-efficiency of A53 vs. A7 and A57 vs. A15 at the same (or default?) clock speeds, which leads to different performance levels and different power levels.

I think it would be interesting to do it at iso-performance or iso-power (or both). In particular, the A53 exhibits superlinear power scaling (as you'd expect) so if you underclocked it to match the A7's performance, you'd probably get a different picture. The latter might remain more efficient, but likely not by this much.
 
I do mention that the A53 seems like an extension of the perf/W curve of the A7. In retrospect I should have done what you mention, but it would have exploded the article in size due to the complexity of matching power and performance for the various benchmarks. Maybe I'll do it on the S810 or 7420 if it's feasible. Note that the DVFS scaling mechanism didn't change, so comparing at the same clocks is still a very valid comparison for real-world use.
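
For anyone wanting to try the iso-performance comparison suggested above, here is a minimal sketch of the arithmetic. Every number in it is made up for illustration: the DVFS tables, the 1.25x performance-per-MHz figure and the helper name `power_at_perf` are all assumptions, not measurements from the article. It only shows how the efficiency gap can shrink once the faster core is clocked down to match the slower one.

```python
# Hypothetical DVFS tables as (frequency in MHz, power in mW). NOT measured data.
A7_POINTS = [(500, 60), (800, 110), (1000, 160), (1300, 250)]
A53_POINTS = [(500, 80), (800, 150), (1000, 240), (1300, 400)]

# Assumed relative performance per MHz (A7 = 1.0). The 1.25 figure is invented.
A7_PERF_PER_MHZ = 1.0
A53_PERF_PER_MHZ = 1.25


def power_at_perf(points, perf_per_mhz, target_perf):
    """Linearly interpolate the power at the clock that delivers target_perf."""
    freq = target_perf / perf_per_mhz
    for (f0, p0), (f1, p1) in zip(points, points[1:]):
        if f0 <= freq <= f1:
            return p0 + (p1 - p0) * (freq - f0) / (f1 - f0)
    raise ValueError("target performance is outside the DVFS table")


# Same-clock comparison: both cores at their top table entry.
a7_perf = A7_POINTS[-1][0] * A7_PERF_PER_MHZ
a7_eff = a7_perf / A7_POINTS[-1][1]
a53_eff_same_clock = A53_POINTS[-1][0] * A53_PERF_PER_MHZ / A53_POINTS[-1][1]

# Iso-performance comparison: clock the A53 down until it matches the A7.
a53_eff_iso_perf = a7_perf / power_at_perf(A53_POINTS, A53_PERF_PER_MHZ, a7_perf)

print(f"A7 perf/mW:                 {a7_eff:.2f}")
print(f"A53 perf/mW, same clock:    {a53_eff_same_clock:.2f}")
print(f"A53 perf/mW, iso-performance: {a53_eff_iso_perf:.2f}")
```

With real per-frequency power measurements in place of the invented tables, the same interpolation would give the iso-performance picture directly.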
 
I second Alexko, very interesting review. Thanks!

Can you comment on how SPEC is compiled? What compiler version? What flags?
 
Congratulations; Jesus I can't read and understand that in one day :oops: Call me a hair splitting nazi but:

* I can hate you for calling Mali GPU clusters = "cores" I guess... :rolleyes:
* Unless I'm reading something wrong in your text, it's NOT the first Android device to present such high driver overhead scores; that would go to Tegra SoC powered devices, and not just K1: http://gfxbench.com/result.jsp?benchmark=gfx30&test=553&order=score&base=gpu&ff-check-desktop=0 I remember noticing it first on a Tegra4 device, but looking at that database the first would be the Tegra2 powered Transformer TF101 with a 50.4 fps driver overhead offscreen score.

Link:
http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/9

I've already mentioned the Driver Overhead score in the more in-depth analysis of the T760. It's the first Android device to truly stand out from the rest of the crowd, finally making some progress into trying to catch up with Apple's excellent performance on iOS. Here's hoping more vendors concentrate on improving this metric in future driver updates.

Also: http://gfxbench.com/device.jsp?benchmark=gfx30&os=Android&api=gl&D=Onda A9 Core4
 
For the integer tests, GCC 4.8:
Code:
int=base=default=default:
notes0080= Baseline C: arm-linux-gnueabihf-gcc -O3 -static -mcpu=cortex-a15 -mfloat-abi=hard -ffast-math -funroll-loops -flto -mfpu=vfpv4
COPTIMIZE = -O3 -static -marm -march=armv7-a -mtune=cortex-a15 -ffast-math -funroll-loops -flto -mfpu=vfpv4 -mfloat-abi=hard
notes0085= Baseline C++: arm-linux-gnueabihf-g++ -O3 -static -mcpu=cortex-a15 -mfloat-abi=hard -ffast-math -funroll-loops -flto -mfpu=vfpv4
CXXOPTIMIZE = -fpermissive -O3 -static -marm -march=armv7-a -mtune=cortex-a15 -mfloat-abi=hard -ffast-math -funroll-loops -flto -mfpu=vfpv4
I know there are some higher-performing outliers and rare devices out there, but we never got to test them. Consider me corrected in that regard.
 
Well the Onda A9/Mali 450 score shows that they could do even better than that. Anyway as I said I'll study it bit by bit and most likely send any notes or questions that might come up via PM.
 
Thanks, that looks like a good set of flags. Regarding Apple A8 compilation, do you know more? I wonder if it's 64-bit code given how Crafty is much faster than on other devices (Crafty benefits a lot from 64-bit integers).
 
I currently don't have the iOS version to check, I'll ask Josh and get back to you.
 
I'm afraid that if you went for a GPU perf/mm², perf/mW or whatever other metric to compare the T760 against the competition (especially Tegra K1 and Apple A8X), things could look far worse than illustrated in that article. 6W consumption for running Manhattan? :runaway:
 
Thanks for the interesting review!

Hmm.. 20nm, 30% higher performance on average with 2x the power consumption? The more ARM advances in computing power, the more the ARM power efficiency myth gets busted, methinks. Intel is managing to reduce x86 power consumption without having to develop big.LITTLE approaches...
 
Wow, such a big difference between the peak ALU and fillrate tests (and also memory bandwidth), but very comparable performance overall in Manhattan and T-Rex. From what I know, the peak ALU test doesn't utilize the vec4 dot product unit, which decreases its number by 7 flops per cycle per ALU pipe. Still, it seems like Mali's design is a more balanced one than Qualcomm's.
 
Thanks..interesting article! The part I found particularly interesting was the real-world performance of big.LITTLE and how it can actually be detrimental to power. Perhaps this is part of the reason why Nvidia chose cluster migration for Tegra X1 instead of HMP. And I think this is one of the big reasons for Qualcomm's dominance of the high-end SoC market for the last few years. They've demonstrated high energy efficiency at both high and low speeds with Krait. The fact that they've developed their own scheduler for S810 shows that they were able to recognize the problem and mitigate it..which is very good to see. If ARM is indeed 18 months away from a proper fix..they've really dropped the ball on this one.

I have a few doubts though..let me try and list a few that I remember:-

1. Any idea why the LPDDR3 speeds dropped from 933 MHz on the 5422 to 825 MHz on the 5430/5433? The extra bandwidth surely would have been useful. Performance of the 7420 with LPDDR4 should give us a good idea if there are any bandwidth limitations.

2. Exynos 5420 vs Exynos 5430 block sizes.

Block         Exynos 5420   Exynos 5430
A7 core       0.58 mm²      0.4 mm²
A7 cluster    3.8 mm²       3.3 mm²
A15 core      2.74 mm²      1.67 mm²
A15 cluster   16.49 mm²     14.5 mm²

The numbers seemed a bit off to me so I did a bit of math. Extrapolating from those figures, for the A7s, 512 KB of L2 cache on the 5420 is 1.48 mm² but on the 5430 it is 1.7 mm². And for the A15s, the 2 MB L2 on the 5420 is 5.53 mm² but on the 5430 it is 7.82 mm². So in both cases the size of the cache increased significantly, despite being on a smaller node. Wouldn't one expect exactly the opposite? I understand that there can be optimizations for area or power but this seems a bit extreme.
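
For reference, the extrapolation is just cluster area minus four times the core area. A quick sketch, assuming four cores per cluster (which is what the figures above imply); the names `BLOCKS` and `non_core_area` are only for illustration. Strictly speaking, the remainder is everything in the cluster outside the cores, which is treated here as L2.

```python
# Block sizes in mm², copied from the table in this post: (core, cluster).
BLOCKS = {
    "Exynos 5420": {"A7": (0.58, 3.8), "A15": (2.74, 16.49)},
    "Exynos 5430": {"A7": (0.40, 3.3), "A15": (1.67, 14.5)},
}
CORES_PER_CLUSTER = 4  # assumption: quad-core clusters on both SoCs


def non_core_area(core_mm2, cluster_mm2, cores=CORES_PER_CLUSTER):
    """Cluster area left over after subtracting the cores themselves."""
    return round(cluster_mm2 - cores * core_mm2, 2)


for soc, clusters in BLOCKS.items():
    for name, (core, cluster) in clusters.items():
        rest = non_core_area(core, cluster)
        print(f"{soc} {name}: {rest} mm² outside the cores")
```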

3. A53 vs. A7 power consumption (regarding page 4, SoC Synthetic "Little" Load Power chart)

When looking at 1 core, the A53 consumes 27% more power than the A7. But with 4 cores, it consumes 87% more power. I do not see any explanation for this.

And we see the same thing with the A57 vs. A15. One core consumes 17% more power but 4 cores consume 67% more power (comparing both at 1.8 GHz). I am at a loss to understand this.


Edit: Just a minor nitpick but on page 2 you mention that Qualcomm moved from 28nm HP in previous SoCs. AFAIK they were on 28LP earlier, and 28HP is what AMD and Nvidia use for GPUs.
 
1. No idea. May have to do with power consumption.
2. It's not only cache but also other stuff within the cluster, PLLs, interfaces, etc.
3. There is a base amount of power which goes into the rest of the SoC (RoS) and memory; remember these are SoC load numbers, not merely CPU core power figures. When you take away that base amount the scaling with threads is pretty linear on the A57 cores. I was also told that power consumption on the A15s was not linear because the cluster might be fighting for resources and each additional thread would decrease the actual work done per thread, making each additional thread use less power than the previous. The deltas from n to n+1 threads were 879, 708 and 637 mW. The same thing happened on the A7 cores.

Fixed the LP mention.
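
To illustrate the base-power point in item 3, here is a rough sketch with made-up numbers. The base and per-core figures are hypothetical, picked only so that the ratios land on the 27% and 87% gaps from the question; they are not measurements. The model is also strictly linear per core, which is a simplification given the non-linear scaling mentioned above.

```python
# All figures are hypothetical, chosen only to reproduce the 27% / 87% gaps.
BASE_MW = 1000.0       # assumed fixed rest-of-SoC and memory power
A7_CORE_MW = 87.5      # assumed per-core power of the older core
A53_CORE_MW = 381.125  # assumed per-core power of the newer core


def soc_power(core_mw, active_cores, base_mw=BASE_MW):
    """Linear model: fixed base plus the same increment per active core."""
    return base_mw + active_cores * core_mw


def ratio(active_cores):
    """SoC-level power of the newer core relative to the older one."""
    return soc_power(A53_CORE_MW, active_cores) / soc_power(A7_CORE_MW, active_cores)


for n in (1, 4):
    print(f"{n} core(s): {100 * (ratio(n) - 1):.0f}% more SoC power")
```

With one core active the fixed base dominates both totals, so the gap looks small; with four cores active the per-core difference dominates and the gap widens.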
 