Samsung Exynos 5250 - production starting in Q2 2012

Has this been demonstrated in practice (I'm not disagreeing, I'm curious)? Meaning, is it really worth the extra software overhead? If there's no "practical/real-world benefit", perhaps Samsung should have first tried to implement the "easier solution".

Yes, they've shown it in some videos, and I posted about it in this thread some time ago. It's also why they can claim that the HMP solution performs about 10% better than an IK(C)S solution: it circumvents the CPUFreq governor logic. DVFS is stupidly slow; the current implementations work at a 100ms sampling rate. Edit: the scheduler side of this is covered here: http://www.youtube.com/watch?v=NoeHIjqlriI
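If you want to check that sampling period on a device, here's a minimal userspace sketch. It assumes the ondemand governor and its usual sysfs node at /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate (value in microseconds); other governors and per-policy setups expose it elsewhere.

/* Print the ondemand governor's DVFS sampling period.
 * The sysfs path and microsecond unit are assumptions based on
 * mainline kernels of this era; adjust for your governor/kernel. */
#include <stdio.h>

int main(void)
{
    const char *node =
        "/sys/devices/system/cpu/cpufreq/ondemand/sampling_rate";
    FILE *f = fopen(node, "r");
    unsigned long period_us;

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%lu", &period_us) == 1)
        printf("sampling period: %lu us (%.1f ms)\n",
               period_us, period_us / 1000.0);
    fclose(f);
    return 0;
}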

Samsung has absolutely no excuse not to AT LEAST implement a working IKS driver for the new 5420 products. If they really deem the voltage regulator overhead, the common cluster block, and the L2 power consumption of the other cluster so big that it's not worth keeping the low-performing cluster alive under high load to gain the perf/W advantage of the A7s in that situation, then I'm pretty much ready to declare big.LITTLE a failure.
 
I was feeling the same way, but thought it too hasty a conclusion and too sensationalistic a comment, considering we'd be generalizing about the potential of big.LITTLE based only on Samsung and their current implementation.
 
Who cares? As long as the migration works reasonably well, why do we need eight cores running simultaneously on a mobile device (I doubt many people can even make use of eight cores on a desktop...)? Am I missing something (power improvements?)?

I don't think max multi-threaded performance was ever the concern here.

I thought the best part of big.LITTLE was being able to use, for example, a single A15 for a demanding thread alongside one or two A7s for the rest. That would be a lot more power efficient than only being able to use either the A15s or the A7s.

Samsung's implementations have all failed in that department so far.
It'd be really ironic if, for example, Mediatek's chip turns out to be fully capable of using big.LITTLE.
 
ToTTenTranz said:
I don't think max multi-threaded performance was ever the concern here.

I thought the best part of big.LITTLE was being able to use, for example, a single A15 for a demanding thread alongside one or two A7s for the rest. That would be a lot more power efficient than only being able to use either the A15s or the A7s.

Samsung's implementations have all failed in that department so far.
It'd be really ironic if, for example, Mediatek's chip turns out to be fully capable of using big.LITTLE.

If you go by the clear statements on Mediatek's website, they've already adopted HMP / GTS / MP (whatever marketing calls it nowadays) for their big.LITTLE devices, including the recently announced MT8135 (2x A15 & 2x A7 + G6200).

http://www.mediatek.com/_en/01_products/04_pro.php?sn=1088
http://www.mediatek.com/_en/Event/201307_TrueOctaCore/biglittle.php

This is backed up by the AnTuTu benchmark scores of the device shown to journalists. It scores far higher than a Nexus 10, which also has 2x A15s; some of the difference must be due to its faster Rogue GPU, but the 2x A7s are definitely contributing.

http://www.engadget.com/2013/07/29/mediatek-mt8135-biglittle-mp-powervr-series6-g6200/
http://blog.laptopmag.com/mediatek-debuts-first-quad-core-tablet-chip-with
 
Nebuchadnezzar said:
DVFS is stupidly slow; the current implementations work at a 100ms sampling rate

ARM has a notoriously poor DVFS implementation, correct? I wonder if improvements there would nullify most of HMP's advantages.

ToTTenTranz said:
I thought the best part of big.LITTLE was being able to use, for example, a single A15 for a demanding thread alongside one or two A7s for the rest. That would be a lot more power efficient than only being able to use either the A15s or the A7s.

Yes, my mistake; I was originally under the impression they at least had IKS working.
 
ARM has a notoriously poor DVFS implementation, correct? I wonder if improvements there would nullify most of HMP's advantages.

I doubt 100ms is even close to the hardware limit. I'm sure the OS could do a better job than it is, but maybe doing this stuff at a higher sample rate would be bad for power consumption.

The people ragging on ARM for DVFS probably want them to include a hardware module for controlling it (like on Intel processors), but even if an SoC wanted this I don't think it should be something ARM actually provides as part of their licensed cores. Maybe as a separate IP. It could be useful to have an embedded Cortex-M series processor control this, like the ones on OMAP4/5.

And no, granularity or automation of DVFS has nothing to do with the advantages of HMP. Unless by poor you really mean lack of asynchronous DVFS like Qualcomm has, but that has its own costs.
 
I doubt 100ms is even close to the hardware limit. I'm sure the OS could do a better job than it is, but maybe doing this stuff at a higher sample rate would be bad for power consumption.
The whole hardware and software latency of a switch is on the order of 550-600µs, and the ideal sampling rate is a factor of 1000 of the switching latency, which puts it at 550ms. So we're already way under the ideal factor, and it accounts for a large amount of overhead: we're at a factor of less than 200, or about 0.5% overhead. The lower the sampling rate, the higher the overhead gets.
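Spelling out the arithmetic with the numbers above (≈550µs per switch, a 100ms sampling period, and the stated ideal factor of 1000):

\[ \text{ideal period} \approx 1000 \times 550\,\mu\mathrm{s} = 550\,\mathrm{ms} \]
\[ \text{current factor} = \frac{100\,\mathrm{ms}}{550\,\mu\mathrm{s}} \approx 182 \quad\Rightarrow\quad \text{overhead} \approx \frac{550\,\mu\mathrm{s}}{100\,\mathrm{ms}} \approx 0.55\% \]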
And no, granularity or automation of DVFS has nothing to do with the advantages of HMP.
A large amount of the above latency comes from sending the control data to the PMIC over the I2C bus, the PMIC actually ramping the buck converters up or down, and the software stack propagating the new frequency through the CPUFreq framework.

With HMP a given task can be switched over from the big cluster to the little cluster, regardless of P states of either, without all of this overhead.

If they went the Intel route of on-SoC PMICs and a hardware governor, they could reap large benefits in terms of perf/W from always being near the ideal P-state. But the whole thing is so convoluted and stupid in its current state that it would be a giant change. The Linux guys aren't even contemplating such a possibility.
 
The whole hardware and software latency of a switch is on the order of 550-600µs, and the ideal sampling rate is a factor of 1000 of the switching latency, which puts it at 550ms. So we're already way under the ideal factor, and it accounts for a large amount of overhead: we're at a factor of less than 200, or about 0.5% overhead. The lower the sampling rate, the higher the overhead gets.

When you say sampling rate you mean sampling period, right? Otherwise I don't understand the remark that overhead goes up as sampling rate goes down. I also don't understand how the ideal sampling period is 1000 times the latency; I've never heard anyone draw that kind of correlation between latency and sampling rate.

If you're saying that doing the switching at a 100ms period is already taking 0.5% of CPU time, then that's pretty bad. I don't know the average clock speed it's being done at, but that'd still mean hundreds of thousands of clock cycles per switch, which is crazy.

If they went the Intel route of on-SoC PMICs and a hardware governor, they could reap large benefits in terms of perf/W from always being near the ideal P-state. But the whole thing is so convoluted and stupid in its current state that it would be a giant change. The Linux guys aren't even contemplating such a possibility.

I don't think Intel has on-SoC PMICs for Silvermont, do they? Just Haswell? That didn't change anything for the kernel, did it? I've actually been kind of fuzzy on how Intel's Turbo Boost works and how it interacts with the kernel; does it perform its own clock regulation based on the percentage of time the CPU spends idle?
 
When you say sampling rate you mean sampling period, right?
Yes, sorry, 100ms period. We're just used to calling it the rate in the community.

The biggest problem is that the CPUFreq framework has notifier chains for all kinds of stuff, like "update I2C frequency", "update memory QoS CPU bandwidth", and so on, which run on every single frequency transition. That's where the massive amount of overhead comes from.
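As a rough illustration of what such a listener looks like (a generic sketch, not any specific Samsung driver; exact API details shift between kernel versions), every callback registered like this runs on every single frequency transition:

/* Generic sketch of a CPUFreq transition notifier. Each registered
 * callback of this kind is invoked on every frequency change, which is
 * where much of the per-transition software overhead comes from. */
#include <linux/cpufreq.h>
#include <linux/module.h>
#include <linux/notifier.h>

static int freq_transition_cb(struct notifier_block *nb,
                              unsigned long event, void *data)
{
        struct cpufreq_freqs *freqs = data;

        if (event == CPUFREQ_POSTCHANGE) {
                /* Placeholder: a real listener would reprogram an I2C
                 * divider, update a memory QoS request, etc. */
                pr_info("cpufreq: %u kHz -> %u kHz\n",
                        freqs->old, freqs->new);
        }
        return NOTIFY_OK;
}

static struct notifier_block freq_nb = {
        .notifier_call = freq_transition_cb,
};

static int __init freq_listener_init(void)
{
        return cpufreq_register_notifier(&freq_nb,
                                         CPUFREQ_TRANSITION_NOTIFIER);
}

static void __exit freq_listener_exit(void)
{
        cpufreq_unregister_notifier(&freq_nb,
                                    CPUFREQ_TRANSITION_NOTIFIER);
}

module_init(freq_listener_init);
module_exit(freq_listener_exit);
MODULE_LICENSE("GPL");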

That didn't change anything for the kernel, did it? I've actually been kind of fuzzy on how Intel's Turbo Boost works and how it interacts with the kernel; does it perform its own clock regulation based on the percentage of time the CPU spends idle?
I'm really fuzzy about the latest Intels too; I don't know how Intel handles these cases. Some of them need to be handled in software, as there's no hardware alternative for them.

Edit: But Samsung, for one, IS working on solutions; the Exynos 5440 has a hardware clock state controller: https://github.com/kgene/linux-samsung/blob/master/drivers/cpufreq/exynos5440-cpufreq.c

How it ties in with voltages and so on remains to be seen. The way it works there is that software is still notified of transitions via IRQs.
 
Exophase said:
And no, granularity or automation of DVFS has nothing to do with the advantages of HMP.

What I meant was: if we assume one only needs >4 threads at any given moment, I was thinking there might be a benefit in not having all 8 cores active (reduced software/kernel complexity). For instance, in the simple case where you have one "heavy" thread and three "light" threads, one could have 1 A15 and 3 A7s active (the others would be "turned off"). This would only work if the migration could be done in an efficient (and quick) manner, which does not appear to be the case. I was just speculating that, if that were the case, perhaps HMP would not be as desirable.
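For illustration only, here's a minimal userspace sketch of that "one heavy thread on a big core, light threads on LITTLE cores" placement done by hand with CPU affinity. The core numbering (CPUs 0-2 assumed to be A7s, CPU 4 assumed to be an A15) is purely an assumption for the example, and errors are ignored for brevity; how the clusters are exposed depends on whether IKS or HMP is in use, and the whole point of HMP is that the scheduler makes this placement automatically.

/* Sketch: pin a "heavy" worker to an assumed big core and "light"
 * workers to assumed LITTLE cores. Core numbers are illustrative only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Error handling omitted for brevity in this sketch. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *heavy_work(void *arg)
{
    pin_to_cpu(4);                  /* assumed A15 core */
    /* ... demanding computation ... */
    return NULL;
}

static void *light_work(void *arg)
{
    pin_to_cpu((int)(long)arg);     /* assumed A7 cores 0-2 */
    /* ... background work ... */
    return NULL;
}

int main(void)
{
    pthread_t heavy, light[3];

    pthread_create(&heavy, NULL, heavy_work, NULL);
    for (long i = 0; i < 3; i++)
        pthread_create(&light[i], NULL, light_work, (void *)i);

    pthread_join(heavy, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(light[i], NULL);
    return 0;
}

(Build with gcc -pthread; the point is only to show the static placement that HMP is supposed to handle for you.)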
 
I read something similar here http://news.yahoo.com/leader-follows-again-samsung-galaxy-s5-64-bit-150505230.html and I had been following it right from ARM's website, where the info on the Cortex-A50 series was removed earlier this year. It has since been updated: http://www.arm.com/products/processors/cortex-a50/index.php

If 1600MHz dual-channel DDR3 was in the 5410 and 1866MHz in the 5420, I'm guessing this new line may be using 2133MHz RAM. The bandwidth I've seen on my laptop with 2133MHz DDR3 is in a similar range. I've been waiting for the Cortex-A57/A53 to be available since I read of ARM's plans last year, on a diagram showing their step after the A15 cores.

64-bit and 14nm are exciting, and the possible 25GB/s of RAM bandwidth is quite exciting too, moving into higher-end laptop performance ranges.
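As a rough sanity check on that figure (assuming a 2x32-bit LPDDR3 interface like the 5410/5420 use, and reading the quoted MHz numbers as effective transfer rates):

\[ 2 \times 32\,\mathrm{bit} \times 1866\,\mathrm{MT/s} / 8 \approx 14.9\,\mathrm{GB/s} \]
\[ 2 \times 32\,\mathrm{bit} \times 2133\,\mathrm{MT/s} / 8 \approx 17.1\,\mathrm{GB/s} \]

So getting to ~25GB/s would take either faster memory or a wider interface than the 5420's.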

The 5410 is beating an Intel Core 2 Quad Q9000 in Geekbench; I can't imagine the power of all these components combined. I hope it has ARM's GPU again, as they returned to it in the 5420.

Sorry, wish I knew a better translator than that.
 
The Mali-T628 indeed matches Adreno 330 performance pretty well.
http://gfxbench.com/compare.jsp?D1=...SM-N900,+SM-N9002,+SM-N9005,+SM-N9006)&cols=2
Still, the Adreno 330 in the Note 3 and other devices performs differently (in GFXBench and 3DMark).
http://gfxbench.com/compare.jsp?D1=...a+Z+Ultra+(C6802,+C6806,+C6833,+C6843)&cols=2
Even with old drivers and 2GB of RAM, the Snapdragon 800 Galaxy S4 performs the same as the Note 3 (i.e. higher than other S800 devices) in GFXBench.
Also, according to AnandTech's article there are no particular boosts for GFXBench in the Note 3.
Perhaps they missed something?
 
The Ars article on the Note 3 boosting shows that GFXBench 2.7 is among the list of detected apps, yet it didn't exhibit the same (lack of) idling behavior as the other detected benchmark apps. They speculated about an LCD frame rate adjustment function mentioned in the boosting code as a possible indicator of Samsung's intent with graphics benchmarks, though.

http://arstechnica.com/gadgets/2013...rking-adjustments-inflate-scores-by-up-to-20/

Regardless, the Nexus 5 carries a 2.3GHz bin of the S800, is a pure Android device, presumably has 2GB of RAM, and gets quick software updates, so I'll be considering that as the reference point for S800 performance at this time. Non-Nexus S800 devices, especially larger devices, may legitimately use less conservative power profiles which aren't as quick to ramp down voltages and sleep cores, but it's hard to separate legitimate performance tuning from boosting at this point.

Then, of course, there's the whole AB variant with its over 20% GPU frequency advantage (and upclocks to other parts of the SoC) to consider when comparing certain S800 devices.
 
They speculated about an LCD frame rate adjustment function mentioned in the boosting code as a possible indicator of Samsung's intent with graphics benchmarks, though.
That LCD refresh rate adjustment has existed since the Galaxy S3 and is triggered by the stock camera app. The mechanism isn't used on this generation (as they pointed out in their own decompiled code) due to UI fluidity and 60fps recording.

I've mentioned and I'll repeat again: That Ars article is pure trash.
 
You still haven't pointed out what's pure trash in Ars's revealing of the specific whitelist Samsung is using for that particular boost on the Note 3, their revelation that only benchmarks and no other apps have access to that level of performance this time (contrary to Samsung's claims), and Ars measuring the difference in performance when the app detection has been circumvented.
 
Ars's revealing
First of all, they didn't reveal anything. They just rehashed, for the Note 3, what AnandTech and I did back in July.
their revelation that only benchmarks and no other apps have access to that level of performance this time
If that is what you understood from that article, then it just proves my point about its worthlessness. The whole "otherwise unreachable performance levels" argument is absolutely null due to the very nature of the mechanism: it doesn't expose any kind of higher performance level. Period. What Samsung claimed back in July was related to the GPU frequencies of the 5410 on the i9500. The two cases have literally nothing in common in how they work or in what their effect is.
we can confidently say that Samsung appears to be artificially boosting the US Note 3's benchmark scores with a special, high-power CPU mode that kicks in when the device runs a large number of popular benchmarking apps. Samsung did something similar with the international Galaxy S 4's GPU, but this is the first time we've seen the boost on a US device.
The above quote demonstrates that the writer fundamentally did not understand the technicalities of the two mechanisms at hand here.

Between the inclusion of that and the suspicious "frame rate adjustment" string, it's clear that Samsung is doing something to the GPU as well, though those clock speeds are more difficult to access than the CPU speeds (a method used by AnandTech on the international S 4 no longer works on the Note 3).
The bolded part is pure bullshit: there has been zero evidence of this to date, yet they're able to make such a statement. The last sentence further proves how technically inept the writer(s) are (of course a class path for PowerVR GPUs won't work on the Adreno, you have to use a different one /facepalm), and I've already stated that the "suspicious" refresh rate adjustment (which actually never happens) has been part of Samsung firmware for several generations, for a perfectly valid use-case.

and Ars measuring the difference in performance when the app detection has been circumvented.
we're seeing artificial benchmark increases across the board of about 20 percent;
This has already been proven to be bullshit by testing (see AnandTech). Secondly, here they claim "across the board" while the very title of the article says the (much more correct) "up to 20%" [and again, that's on Geekbench 3 _only_].
Linpack showed a boosted variance of about 50 percent.
They use Linpack, a 3-year-old benchmark whose run-time is nowadays <250ms, as a tool to state variance? If you don't get my point, here's a sequence of scores I just randomly benched: 569 412 443 366 581. Hey look, I got a 60% max variance just from that.

And last but not least, their whole editorial is based on a comparison to the G2, which they disastrously failed to analyse properly: it has the very same "cheats" they so proudly denounce on the Note 3. That's what actually prompted AnandTech to post the follow-up article that corrected the whole perspective on the story and laid out the proper facts.

The article is a failure in every aspect of what it tries to do: it fails on the technical parts, it fails at representing the real effect this has on benchmarking due to its asinine methodology, and it fails at the journalistic/editorial conclusion it tried to draw.

So please let's drop that ArsTechnica article as any kind of valid reference point.

And I'm sorry to have ranted off in here again about this, let's get back on topic.
 