Samsung SoC & ARMv8 discussions

[Attached image: 73787.png]


Galaxy S5 Broadband LTE-A (5.1" 1440p previous-gen AMOLED + 28 nm HPm Snapdragon 805)
posts pretty much the same result (98.9%) as the
Galaxy S6 (5.1" 1440p current-gen AMOLED + 14 nm LPE Exynos 7420).

The S5 LTE-A has a ~7.7% bigger battery (10.78 Whr vs 10.01 Whr).

So the S6 is only ~8.9% more efficient in this benchmark than the S5 LTE-A?
 
I mixed up the S6 and the S6 edge.
The S6 has a 9.81 Whr battery.

So the S5 LTE-A has a ~9.9% bigger battery, and the S6 is ~11% more efficient in this benchmark.
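
To make that arithmetic explicit, here is a minimal sketch in Python using the corrected numbers, assuming the 98.9% figure means the S5 LTE-A's result is 98.9% of the S6's:

# Battery capacities (Whr) and relative benchmark result, as quoted above.
s5_lte_a_capacity = 10.78
s6_capacity = 9.81
s5_relative_result = 0.989   # assumption: S5 LTE-A result = 98.9% of the S6's

capacity_ratio = s5_lte_a_capacity / s6_capacity        # ~1.099 -> ~9.9% bigger battery
efficiency_ratio = capacity_ratio / s5_relative_result  # ~1.111 -> S6 ~11% more efficient

print(f"S5 LTE-A battery is {(capacity_ratio - 1) * 100:.1f}% bigger")
print(f"S6 is roughly {(efficiency_ratio - 1) * 100:.1f}% more efficient")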

Still, that's a small advantage for a two-generation gap in process node and a one(?)-generation gap in display technology.
 
http://www.anandtech.com/show/9146/the-samsung-galaxy-s6-and-s6-edge-review

No real degradation testing though. That's literally the most important benchmark! I don't care about a SoC's burst performance! I want to know what performance I can expect after playing a game for 30+ minutes. WTB: looping off-screen benchmarks... :mrgreen:

It loses nearly 40% after roughly an hour in T-Rex onscreen (from 39.4 down to 24.3 fps):

[Attached image: 73792.png]


http://www.anandtech.com/show/9146/the-samsung-galaxy-s6-and-s6-edge-review/3

Typical KISS approach: under 3D it has a sustained performance advantage of just 20% over the Galaxy S5. The 6 Plus is still faster in that regard despite being one generation behind; one step forward and two steps back :p
 
Typical KISS approach: under 3D it has a sustained performance advantage of just 20% over the Galaxy S5. The 6 Plus is still faster in that regard despite being one generation behind; one step forward and two steps back :p
Neither is quite true; this is an onscreen test, hence the sustained performance is actually a lot higher (twice as fast or so) than that of the S5 (which also dropped roughly 30% compared to a single run).
Likewise, performance is actually quite a bit higher than that of the IP6+ for the same reason (the IP6+ dropped 25% compared to a single run), though given the generation advantage this is indeed not all that amazing, but I still wouldn't call it bad.
 
To expand on mczak's point with a few numbers:
  • The Galaxy S6 has a screen resolution of 2560x1440 (~3.7 million pixels), so its sustained performance in that T-Rex test (which, as the AnandTech article pointed out just below that graph, is an onscreen test), measured in pixels per second, is about 2560*1440*24.34 = 89.7 MPix/s.
  • The Galaxy S5, HTC One M9 and iPhone 6 Plus all have a screen resolution of 1920x1080 (~2 million pixels), so their sustained performances are 41.0 MPix/s, 55.4 MPix/s and 66.7 MPix/s respectively.
  • The iPhone 6 has a screen resolution of 1334x750 (~1 million pixels), giving a sustained performance of 49.8 MPix/s.
Granted, comparing performance numbers across resolutions like this isn't going to be terribly accurate (doubling the number of pixels doesn't double the geometry etc), but I don't see much of a valid basis for calling this "one step forward and two steps back".
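
A quick sketch of that pixels-per-second arithmetic (only the S6's 24.34 fps is quoted above; the other devices' figures follow from the same formula with their sustained fps read off the graph):

def mpix_per_s(width, height, fps):
    # Sustained onscreen pixel throughput in megapixels per second.
    return width * height * fps / 1e6

print(mpix_per_s(2560, 1440, 24.34))   # Galaxy S6: ~89.7 MPix/s
# Plug in the sustained onscreen fps of the 1920x1080 devices (S5, One M9,
# iPhone 6 Plus) and the 1334x750 iPhone 6 the same way to get the other figures.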
 
Neither is quite true; this is an onscreen test, hence the sustained performance is actually a lot higher (twice as fast or so) than that of the S5 (which also dropped roughly 30% compared to a single run).
Likewise, performance is actually quite a bit higher than that of the IP6+ for the same reason (the IP6+ dropped 25% compared to a single run), though given the generation advantage this is indeed not all that amazing, but I still wouldn't call it bad.


For the sake of unnecessary hairsplitting, the drop is less than 22% according to AnandTech's results (32.19 vs. 41.10); but Anand's results aside, if I set the two devices side by side in Kishonti's database it looks more like this: https://gfxbench.com/compare.jsp?benchmark=gfx31&D1=Samsung+Galaxy+S6+(SM-G920x,+SC-05G)&os1=Android&api1=gl&D2=Apple+iPhone+6+Plus&cols=2

Either way, the throttling is >38% on one side and <22% on the other, and for a neck-and-neck, quick, cold synthetic comparison the difference is 31% in 1080p T-Rex. It would be a pity if the gain from TSMC 20SoC to Samsung 14nm FinFET turned out to be smaller than that latter figure.
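
For reference, those throttling percentages follow directly from the peak and sustained onscreen T-Rex numbers quoted above; a minimal sketch:

def drop_pct(peak_fps, sustained_fps):
    # Percentage lost between a cold single run and the sustained (last) run.
    return (1 - sustained_fps / peak_fps) * 100

print(drop_pct(39.40, 24.30))   # Galaxy S6: ~38%
print(drop_pct(41.10, 32.19))   # iPhone 6 Plus: ~22%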
 
To expand on mczak's point with a few numbers:
  • The Galaxy S6 has a screen resolution of 2560x1440 (~3.7 million pixels), so its sustained performance in that T-Rex test (which, as the AnandTech article pointed out just below that graph, is an onscreen test), measured in pixels per second, is about 2560*1440*24.34 = 89.7 MPix/s.
  • The Galaxy S5, HTC One M9 and iPhone 6 Plus all have a screen resolution of 1920x1080 (~2 million pixels), so their sustained performances are 41.0 MPix/s, 55.4 MPix/s and 66.7 MPix/s respectively.
  • The iPhone 6 has a screen resolution of 1334x750 (~1 million pixels), giving a sustained performance of 49.8 MPix/s.
Granted, comparing performance numbers across resolutions like this isn't going to be terribly accurate (doubling the number of pixels doesn't double the geometry etc), but I don't see much of a valid basis for calling this "one step forward and two steps back".

See above; I was misled by the Kishonti database results, as they typically show a very small throttling percentage for the 6 Plus. As for the resolution difference, which I'm aware of:

* The difference in T-Rex offscreen isn't blowing anyone's socks off either.
* Again, 20SoC vs. 14FF.
* Peak frequencies for the GPUs would be 770+ MHz vs. 533 MHz?
* The Galaxy S6 will rather be competing with the next-generation iPhone.

The frequency difference above alone is larger than the actual performance difference in any offscreen T-Rex result out there, so the resolution point becomes rather moot sooner than later.
 
Either way, the throttling is >38% on one side and <22% on the other, and for a neck-and-neck, quick, cold synthetic comparison the difference is 31% in 1080p T-Rex. It would be a pity if the gain from TSMC 20SoC to Samsung 14nm FinFET turned out to be smaller than that latter figure.
Yes, Apple is less aggressive in pushing higher GPU clocks mostly just for benchmarks, no doubt about that. Still, the point was that the Galaxy S6 was still quite a bit faster in sustained performance.
I don't know why the Kishonti database results differ so much (for the S6), though; I guess SoC quality could influence this, but they were probably using pre-release drivers (the results are pretty incomplete there).
Frequency alone is pretty irrelevant, as you should know. I'd be interested, however, in knowing how the respective GPUs differ in terms of complexity (transistor count). Some more benchmarks would also be nice (I wouldn't be surprised if Mali generally did somewhat worse in lesser-known apps, since the shader compiler probably has a really difficult time spitting out well-optimized code).
 
Yes, Apple is less aggressive in pushing higher GPU clocks mostly just for benchmarks, no doubt about that. Still, the point was that the Galaxy S6 was still quite a bit faster in sustained performance.
I don't know why the Kishonti database results differ so much (for the S6), though; I guess SoC quality could influence this, but they were probably using pre-release drivers (the results are pretty incomplete there).
It could very well be that, when asked for a comparison, the Kishonti system takes an average of all results for a given device, which can be misleading to the observer in some cases.

Frequency alone is pretty irrelevant, as you should know, ...

Why is it irrelevant? The A8X GPU should clock below 500 MHz, while the Tegra X1 GPU runs at 1 GHz. Is it still irrelevant in that case?

To avoid misunderstandings: if Apple can clock a given SoC block at 533 MHz under 20nm, why shouldn't it be able to clock the very same block at the same frequency as in the 7420 under 14nm FinFET without negatively influencing power consumption?

I'd be interested, however, in knowing how the respective GPUs differ in terms of complexity (transistor count). Some more benchmarks would also be nice (I wouldn't be surprised if Mali generally did somewhat worse in lesser-known apps, since the shader compiler probably has a really difficult time spitting out well-optimized code).

You can guess the rough transistor count of the GX6450 in the A8 (~2b transistors for the entire SoC, with ~19 of 89 mm2 going to the GPU), yet not for the 7420 GPU unless we get more details. Under Samsung 20nm, however, if memory serves, the Mali T760MP6 was over 30 mm2; now you have 14FF but two more clusters in the 7420.
 
GFXBench's long-term performance figure uses the average over the whole battery run; our degradation metric is the last run's fps.

Sorry, I just noticed your post; now a few things are even clearer. Do you have any clue why the application reads out N frames here for long-term performance while the fps values don't correspond?: https://gfxbench.com/device.jsp?ben...103839&D=Samsung Galaxy S6 (SM-G920x, SC-05G)

It reads more like an application quirk, since the frame count also drops to less than half of the onscreen result (1818 frames onscreen vs. 852 frames long-term perf.). I doubt it has any relevance to reality, since those 852 frames would correspond to only about 15 fps, and the average framerate can hardly be lower than the last-run value AnandTech got.
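
A rough sanity check of that inference, assuming the long-term figure covers the same wall-clock window as the onscreen run (the onscreen fps value below is just a placeholder for whatever the database reports):

def implied_fps(onscreen_frames, onscreen_fps, long_term_frames):
    # If both frame counts covered the same wall-clock window, the long-term
    # count would imply this average framerate.
    duration_s = onscreen_frames / onscreen_fps
    return long_term_frames / duration_s

# 32.0 fps is an assumed placeholder for the database's onscreen figure.
print(implied_fps(1818, 32.0, 852))   # ~15 fps, well below AnandTech's ~24 fps last run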
 
Why is it irrelevant? The A8X GPU should clock below 500 MHz, while the Tegra X1 GPU runs at 1 GHz. Is it still irrelevant in that case?
Because we don't really know what the frequency/voltage (or power) curve looks like for these different architectures. Plus, in the sustained run the Mali's clock might not really have been higher anyway.
To avoid misunderstandings: if Apple can clock a given SoC block at 533 MHz under 20nm, why shouldn't it be able to clock the very same block at the same frequency as in the 7420 under 14nm FinFET without negatively influencing power consumption?
Yes, I guess that should be possible, though I don't know if it would be efficient - see above. It might be more efficient to use more clusters instead.

You can guess the rough transistor count of the GX6450 in the A8 (~2b transistors for the entire SoC, with ~19 of 89 mm2 going to the GPU), yet not for the 7420 GPU unless we get more details. Under Samsung 20nm, however, if memory serves, the Mali T760MP6 was over 30 mm2; now you have 14FF but two more clusters in the 7420.
OK, I guess Series 6GX still has some advantage in terms of perf/area there then.
 
Because we don't really know what the frequency/voltage (or power) curve looks like for these different architectures. Plus, in the sustained run the Mali's clock might not really have been higher anyway.
It's true that we don't know what the frequency/voltage curve looks like. But I think it's safe to assume that we have left the region where we can increase frequency without increasing voltage at all, i.e. the frequency/power curve will rise faster than linearly, which brings us to...
Yes, I guess that should be possible, though I don't know if it would be efficient - see above. It might be more efficient to use more clusters instead.
For graphics, it would seem that the larger gain would be had by increasing width, rather than frequency for a given power draw. As far as I can understand, this is mostly a question of cost: increasing width increases die area, which increases cost per chip, whereas increasing frequency is essentially "free". For CPUs the issue is not so clear cut: increasing width may not pay off very well, as typical code is not perfectly load-balanced over many cores, and bottlenecks and contention in other parts of the system limit scaling even in cases where the code should theoretically scale reasonably well with more cores. I suspect that once you have a couple of cores, increasing frequency generally pays off better overall than adding more cores for typical software.
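
A minimal sketch of that reasoning, using the textbook dynamic-power relation P ≈ C·f·V² and assuming, purely for illustration, that voltage has to rise roughly in step with frequency once past the flat part of the curve:

def relative_power(width, freq, volt):
    # Dynamic power scales with switched capacitance (~width), frequency and voltage squared.
    return width * freq * volt ** 2

base   = relative_power(width=1.0, freq=1.0, volt=1.0)
wider  = relative_power(width=2.0, freq=1.0, volt=1.0)   # 2x the units at the same clock/voltage
faster = relative_power(width=1.0, freq=2.0, volt=1.5)   # 2x the clock, assumed +50% voltage

print(wider / base)    # ~2x the power for ~2x the (graphics) throughput
print(faster / base)   # ~4.5x the power for the same ~2x throughput

Under a fixed power budget that is why, for well-parallelized graphics work, doubling the width tends to buy more throughput than doubling the clock.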

I hope Apple makes an iPad Pro where they allow the SoC die area to grow substantially. It would be very interesting to see what could be done with a mobile architectural base, FinFETs, and 200 mm2 or so of die area.
Not likely to happen, of course.
 
For graphics, it would seem that the larger gain would be had by increasing width, rather than frequency for a given power draw.
I don't think it's all that clear cut - I suspect mobile GPUs (at least those in the performance segment) are operating quite near their optimal point (in terms of efficiency) with respect to frequency nowadays (at least for sustained loads; the peaks might be above that). Switching the process probably changes the equation, though. (A process switch might also allow you to move to an architecture that does better in terms of efficiency but was previously considered too expensive in terms of die area.)
If you look, for instance, at Broadwell (OK, a different power envelope, but it should follow similar logic), at 15W Intel of course doesn't even try GT3. At 28W GT3 is generally a win, but still not by all that much - for twice the number of units that's quite an investment for little gain (http://www.anandtech.com/show/9166/intel-nuc5i7ryh-broadwellu-iris-nuc-review/4).
 
Because we don't really know what the frequency/voltage (or power) curve looks like for these different architectures. Plus, in the sustained run the Mali's clock might not really have been higher anyway.

In addition to what Entropy said above, it throttles more either way; most if not all the odds are against the possibility that the 7420 has an advantage in terms of perf/mW and perf/mm2 (hypothetically normalized to the same process).

Yes, I guess that should be possible, though I don't know if it would be efficient - see above. It might be more efficient to use more clusters instead.

If it's not efficient for a design like the GX6450, then why would it be efficient for something like a T760MP8? It was a purely hypothetical question about the GX6450 as is. What Apple might have done for its upcoming generation of iPhones is a totally different chapter; they might not even need to increase frequency (significantly) or add more clusters.

OK, I guess Series 6GX still has some advantage in terms of perf/area there then.

I recall =/>30 mm2 for the MP6 at Samsung 20nm; hypothetically, an MP8@20nm should be =/>40 mm2, which is about twice the GX6450's ~19 mm2. "Some"? :p The A8X GPU is over 38 mm2 at 20SoC and is now at almost 38 fps in Manhattan offscreen.
 
I don't think it's all that clear cut - I suspect mobile GPUs (at least those in the performance segment) are operating quite near their optimal point (in terms of efficiency) with respect to frequency nowadays (at least for sustained loads; the peaks might be above that). Switching the process probably changes the equation, though. (A process switch might also allow you to move to an architecture that does better in terms of efficiency but was previously considered too expensive in terms of die area.)

Processes aside, GPU IP operates at whatever level each licensee integrates it at. Any manufacturer can be either conservative or very aggressive with frequencies; the latter always has side effects, especially in as power-constrained an environment as a smartphone SoC.

If you look, for instance, at Broadwell (OK, a different power envelope, but it should follow similar logic), at 15W Intel of course doesn't even try GT3. At 28W GT3 is generally a win, but still not by all that much - for twice the number of units that's quite an investment for little gain (http://www.anandtech.com/show/9166/intel-nuc5i7ryh-broadwellu-iris-nuc-review/4).

It's far more than "just" a different power envelope. You cannot under any circumstances compare those with any smartphone SoC. GPU IP like Mali & Rogue scales as expected with the increase in GPU clusters.
 
It's far more than "just" a different power envelope. You cannot under any circumstances compare those with any smartphone SoC. GPU IP like Mali & Rogue scales as expected with the increase in GPU clusters.
I think Broadwell would scale as well, but due to the power limitation it effectively means you get either 24 EUs at clock X or 48 EUs at little more than clock X/2. I just see no reason why it would be different for these SoCs. IIRC, in the few instances where someone did clock/voltage measurements, the voltage really doesn't go down all that much below a certain point.
(FWIW, I was wrong that there are no GT3 15W parts; they exist as well, and from the few benchmarks I saw they share a similar fate.)
 
Again, it depends on how each licensee integrates them; if you take, let's say, a dual-cluster G6230 (which is easier to clock higher due to its size) and pump it up to around 800 MHz under 28nm, it'll throttle just as much as many other GPUs out there (=/>30%). At the usual 533-600 MHz at which it has been integrated under 28LP & 28HPm, it'll obviously throttle far less (close to nothing). It shouldn't be any different for, say, a dual-cluster Mali T760MP2 either. Of course you're correct that it's usually better to go wider (more units) than to rely on frequency; however, the MP8 in the Exynos 7420 is neither small in terms of die area, nor is its 772 MHz peak frequency exactly "humble" considering its complexity for a smartphone SoC.

Designs like Broadwell or anything else would have to scale down that far in frequency in order to sustain certain temperature levels in a smartphone, because they consume too much die area, because they haven't been designed for such an environment, etc.

In the given case it would be interesting to see a test that shows how low the frequencies go on each of the compared GPUs in a stress test like GFXBench's long-term performance run. The last one I had seen was for the Adreno 430/S810, where the GPU drops from its 600 MHz peak all the way down to 305 MHz. That isn't and can't be a "normal" case, just what I'd dare to call sluggish engineering, irrespective of the actual source of the problem.
 