Samsung Exynos 5250 - production starting in Q2 2012

Oh hell Samsung, shame on you!

I'm currently doing GPU overclocking and voltage control in the kernel for the 5410/i9500 and was screwing around with what was supposed to be a generic max limit only to be surprised by what it actually represents.

This GPU does not run at 532MHz; that frequency level is reserved solely for Antutu and GLBenchmark*, among other things. On non-whitelisted applications the GPU is limited to 480MHz. The old GLBenchmark apps, for example, run at 532MHz, while the new GFXBench app, which is not whitelisted, runs at 480MHz. /facepalm
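For illustration, this is roughly what such a whitelist amounts to. A hypothetical sketch only; the real logic sits somewhere in Samsung's proprietary stack, and the package names here are assumptions:

[code]
/* Hypothetical sketch of a benchmark whitelist gating the GPU's DVFS
 * ceiling -- NOT Samsung's actual code. Package names are assumed. */
#include <string.h>

static const char *boost_whitelist[] = {
    "com.antutu.ABenchMark",          /* assumed package name */
    "com.glbenchmark.glbenchmark25",  /* assumed package name */
};

#define NORMAL_MAX_KHZ 480000  /* what everything else gets  */
#define BOOST_MAX_KHZ  532000  /* benchmark-only upper level */

static int gpu_max_freq_khz(const char *pkg)
{
    size_t i;

    for (i = 0; i < sizeof(boost_whitelist) / sizeof(boost_whitelist[0]); i++)
        if (!strcmp(pkg, boost_whitelist[i]))
            return BOOST_MAX_KHZ;  /* recognised benchmark: unlock 532MHz */

    return NORMAL_MAX_KHZ;         /* everyone else is capped at 480MHz   */
}
[/code]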

For anybody interested, here are some scores at 640MHz, for comparison's sake, to show what the 544MP3 could do. I tried 700MHz but that wasn't stable within the prescribed upper voltage limit (1150mV).

GFXBench 2.7.2 (offscreen):
2.7 T-Rex: 14fps
2.5 Egypt: 48fps

Antutu 3DRating (onscreen): 8372 / 31.4fps
Antutu 3.3.1 3D benchmark: 8584

Basemark Taiji: 46.54

3DMark:
Ice Storm standard: 11357 overall, 11486 graphics, 58.1fps GT1, 43.8fps GT2
Ice Storm extreme: 7314 overall, 6680 graphics, 39.1fps GT1, 23.1fps GT2

With GFXBench, the fill rate scores are drastically different when compared to the iPhone 5, despite it having a similar GPU to the S4's. Is that limitation due to the platform or to some other factors?
 
Clock speeds. The GPU in the iPhone 5 is lower clocked.
 
325 (A6) vs. 480MHz (E5410)

By the way the lowest fillrate efficiency out of the crop of Series5XT/6 GPUs they used for that graph should be the 544MP3 in the Exynos5410:

[Image: PowerVR Series5XT/Series6 vs. competing GPUs fillrate efficiency graph]


http://withimagination.imgtec.com/i...vrs-market-leading-fillrate-efficiency-part-8
 
In GLBenchmark, the Samsung 5410 has a 13% better off-screen fill rate than the iPhone 5, but a 60%+ higher clock (assuming the 5410 is clocking at 532MHz for the test).

So either Samsung's graphics datapath is inferior to the one in the A6, or it's a driver issue.
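For a rough sanity check of that gap (assuming 6 TMUs on each GPU, i.e. 2 per core across 3 cores, one bilinear texel per TMU per clock, and the clocks discussed here):

[code]
#include <stdio.h>

int main(void)
{
    double a6   = 6 * 0.325;  /* SGX543MP3 @ 325MHz -> 1.95 GTex/s */
    double e480 = 6 * 0.480;  /* SGX544MP3 @ 480MHz -> 2.88 GTex/s */
    double e532 = 6 * 0.532;  /* SGX544MP3 @ 532MHz -> 3.19 GTex/s */

    printf("theoretical 5410 advantage @480MHz: %.0f%%\n", (e480 / a6 - 1) * 100); /* ~48% */
    printf("theoretical 5410 advantage @532MHz: %.0f%%\n", (e532 / a6 - 1) * 100); /* ~64% */
    return 0;
}
[/code]

A measured lead of only ~13% against a 48-64% theoretical advantage does suggest the 5410 is leaving a lot of its throughput on the table.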
 
By the way the lowest fillrate efficiency out of the crop of Series5XT/6 GPUs they used for that graph should be the 544MP3 in the Exynos5410:

Yes, and even that is probably being generous, as the graph might assume a 480MHz clock to work out the theoretical fill rate.
 
My understanding so far is that it actually clocks at 480MHz in the vast majority of cases and only at 532MHz in a couple of benchmarks. I might have understood Nebu wrong, but I think it clocks at 480MHz in GLB.

Besides that I'd still love to know which the nearly 100% efficiency variant is.
 
480 in GFXBench and 532 in the old GLB apps.

I'm curious about the wording in that IMG blog, as if it wants to say that the inefficiency is because of the high clocks. I'm grasping at straws here.

I did a quick bench of 350 vs. 480MHz; both of those GPU clocks force a memory lock to 800MHz, so bandwidth shouldn't be an issue:

350:
2.5 Egypt offscreen: 3534 frames
[strike]Fill-rate offscreen: 987526ktex/s[/strike]

480:
2.5 Egypt offscreen: 4517 frames
[strike]Fill-rate offscreen: 1323662ktex/s[/strike]

A 37.14% higher frequency for a 27.81% improvement in Egypt [strike]and a 34.03% improvement in fill-rate.[/strike]

I can do some more synthetic benches while locking all of the phone's frequencies and several runs if somebody would like to see that.
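To put a number on how far below linear that scaling is, a quick check using the frame counts above:

[code]
#include <stdio.h>

int main(void)
{
    double f_ratio = 480.0 / 350.0;    /* 1.3714 -> +37.14% clock  */
    double egypt   = 4517.0 / 3534.0;  /* 1.2781 -> +27.81% frames */

    /* fraction of the clock gain that actually shows up in the score */
    printf("Egypt scaling: %.1f%% of linear\n",
           (egypt - 1.0) / (f_ratio - 1.0) * 100.0);  /* ~74.9% */
    return 0;
}
[/code]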

PS: Does that ImgTech blog even take into account Exynos's cheating?
To illustrate this, the graph below shows fillrate efficiency calculated based on independent measured fillrate data from Kishonti’s GFXBench suite.

Would be funny if the efficiency is calculated based on a 532MHz clock but 480MHz results :D


PS2: I found the fill-rate results to be very bogus, so I reran them:

480:
Run1: 1902600 ktex/s
Run2: 1415437 ktex/s
Run3: 1911574 ktex/s
Run4: 1936990 ktex/s

350:
Run1: 1630870 ktex/s
Run2: 1644457 ktex/s
Run3: 1674204 ktex/s

Given the above reruns, it's even worse: only a 15.69% improvement between the best 350 and 480 scores. That's a bandwidth limitation, right?

I'll have to investigate GPU thermal throttling...
 
Would be funny if the efficiency is calculated based on a 532MHz clock but 480MHz results :D

6 TMUs * 480MHz = 2.88 GTexels/s
Kishonti results onscreen = 1.97 GTexels/s = 68%
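Running the same arithmetic against the rerun bests posted above (again assuming 6 TMUs and one texel per TMU per clock):

[code]
#include <stdio.h>

static void eff(const char *label, double mhz, double ktex)
{
    double peak = 6.0 * mhz * 1000.0;  /* theoretical peak in ktex/s */
    printf("%s: %.0f / %.0f ktex/s = %.1f%%\n",
           label, ktex, peak, ktex / peak * 100.0);
}

int main(void)
{
    eff("480MHz best", 480.0, 1936990.0);  /* ~67.3% of theoretical */
    eff("350MHz best", 350.0, 1674204.0);  /* ~79.7% of theoretical */
    return 0;
}
[/code]

So measured efficiency climbs from roughly 67% at 480MHz to roughly 80% at 350MHz.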


Alas, if it's already throttling in a simple fillrate test... Either the driver needs some serious work, or there's something else wrong, with bandwidth being one of probably many candidates.
 
And if you lower the frequency, the efficiency goes up.

I made an across-the-board sweep of some possible scenarios:

[Images: result tables from the GPU/memory frequency sweep]


I also tested lowering the internal bus but that didn't have any effect at all on the scores.

What are the actual bandwidth requirements per TMU per cycle?
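I don't have an official figure, but as a worst-case envelope (assuming uncompressed 32bpp texels, zero texture cache hits, and the 2x32-bit LPDDR3 interface at the 800MHz lock mentioned earlier):

[code]
#include <stdio.h>

int main(void)
{
    /* Assumptions, not confirmed figures. */
    double tex_demand = 6 * 4.0 * 480e6;  /* 6 TMUs * 4B/texel * clock = 11.52 GB/s */
    double peak_bw    = 800e6 * 2 * 8;    /* 1600MT/s * 8B bus width   = 12.8  GB/s */

    printf("worst-case texture traffic: %.2f of %.1f GB/s peak (%.0f%%)\n",
           tex_demand / 1e9, peak_bw / 1e9, tex_demand / peak_bw * 100);
    return 0;
}
[/code]

In that uncached worst case, texturing alone would eat ~90% of peak bandwidth before anything else touches memory; in practice a simple fillrate pattern should hit the texture cache most of the time, so this doesn't prove a bandwidth limit by itself.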

Alas if its already throttling in a simple fillrate test. Either the driver needs some serious work, or there's something else wrong with bandwidth being one of probably many candidates.
I'm still not aware of any GPU throttling mechanism, but memory has throttling in place.
 

Besides that I'd still love to know which the nearly 100% efficiency variant is.

iPhone 5 isn't far away.

Assuming it's 325MHz, that's 650 MTex/s per core; ×3 = 1950 MTex/s.
You have to allow a small reduction, as IMG have said multi-core performance scales about 95% linearly, which would work out to about 1850.

Offscreen fillrate in GFXBench is 1835.
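Putting that in the same terms as the Exynos arithmetic above (the 2 TMUs per core and 95% scaling figures are taken from the post, not measured):

[code]
#include <stdio.h>

int main(void)
{
    double peak      = 2 * 3 * 325e6;  /* 2 TMUs/core * 3 cores * 325MHz = 1.95 GTex/s */
    double mp_scaled = peak * 0.95;    /* ~95% multi-core scaling       -> ~1.85 GTex/s */
    double measured  = 1.835e9;        /* GFXBench offscreen fillrate                  */

    printf("efficiency vs raw peak:        %.0f%%\n", measured / peak * 100);      /* ~94% */
    printf("efficiency vs scaled estimate: %.0f%%\n", measured / mp_scaled * 100); /* ~99% */
    return 0;
}
[/code]

Against the scaling-adjusted estimate that's ~99%, which would make the 543MP3 a plausible candidate for the nearly-100% variant in that graph.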
 
And if you lower the frequency, the efficiency goes up.

I made an across-the-board sweep of some possible scenarios:

Interesting.


I also tested lowering the internal bus but that didn't have any effect at all on the scores.

What are the actual bandwidth requirements per TMU per cycle?

No idea to be honest.

I'm still not aware of any GPU throttling mechanism, but memory has throttling in place.

Or Series5XT cores simply aren't meant for very high frequencies, unlike Series6, of course, as far as I currently understand.
 
Can anybody theorise about the difference in Stream scores in the above results? The higher one is from the A15 at 800MHz and the other one is the A7's at 1500MHz.

This is by no means a solid analysis, but maybe the A7 is latency bound by the FPU operations while the A15 isn't (technically even the copy operation should be going through the FPU). Stream consists of very tight loops; if the compiler isn't unrolling it then you could end up with loading and storing to the same registers causing stalls due to WAW hazards (and hitting RAW with load-use latency, to some degree). The A15 would hide this due to its register renaming.

But that doesn't explain why Cortex-A9s get much better scores in Geekbench, since they should be subject to the same problem.
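To make the hazard concrete, here's the shape of such a loop; a minimal sketch rather than the actual STREAM source:

[code]
/* STREAM-style copy kernel. Compiled naively (no unrolling), each
 * iteration loads into the same register it just stored from: the
 * store waits on the load (load-use latency), and the next load's
 * writeback collides with the previous use of that register (the
 * WAW/WAR case described above). An in-order core like the A7 eats
 * those stalls; the A15's register renaming hides the register
 * reuse entirely.                                                */
void stream_copy(double *restrict c, const double *restrict a, long n)
{
    long j;

    for (j = 0; j < n; j++)
        c[j] = a[j];

    /* Unrolled by hand, the compiler can rotate through independent
     * registers and overlap the loads' latency instead:
     *   c[j] = a[j]; c[j+1] = a[j+1]; c[j+2] = a[j+2]; c[j+3] = a[j+3];
     */
}
[/code]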
 
And if you lower the frequency, the efficiency goes up.
Very interesting data, thanks! :)
I'm still not aware of any GPU throttling mechanism, but memory has throttling in place.
Is there any way to change memory CAS like on the desktop? That could be very interesting to test the impact of latency vs bandwidth (although I don't know if the memory controller itself is clocked based on memory frequency and whether it plays a noticeable part in total latency or not).

It's worth pointing out that LPDDR3-1600 has worse latency than LPDDR2-1066 (I'm not sure exactly how it compares to LPDDR2-800), so you might have a double whammy of higher latency than some competing systems and a higher GPU frequency as well. So I suspect the memory latency in cycles might be higher than on any other SGX device (except perhaps for very badly designed ones, I don't really know).
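A hedged illustration of the latency-in-cycles point; the nanosecond figures below are placeholders, not measured values:

[code]
#include <stdio.h>

int main(void)
{
    /* Assumed round-trip DRAM latencies, purely for illustration. */
    double lat_lpddr3_ns = 120.0;  /* hypothetical LPDDR3-1600 */
    double lat_lpddr2_ns = 100.0;  /* hypothetical LPDDR2      */

    /* The same wall-clock latency costs more GPU cycles at a higher clock. */
    printf("5410 @ 480MHz: ~%.0f cycles\n", lat_lpddr3_ns * 480e6 / 1e9);  /* ~58 */
    printf("A6   @ 325MHz: ~%.0f cycles\n", lat_lpddr2_ns * 325e6 / 1e9);  /* ~33 */
    return 0;
}
[/code]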
 
Is there any way to change memory CAS like on the desktop? That could be very interesting to test the impact of latency vs bandwidth (although I don't know if the memory controller itself is clocked based on memory frequency and whether it plays a noticeable part in total latency or not).
Yes, but the value fields are undocumented, so I would be changing them blindly. Line 192: https://github.com/AndreiLux/Perseus-UNIVERSAL5410/blob/perseus/drivers/devfreq/exynos5410_bus_mif.c
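To illustrate why that's blind guessing, here's the general shape such per-frequency tables take; the field layout and values below are invented for illustration, not the actual contents of that file:

[code]
/* Hypothetical sketch -- NOT the real exynos5410_bus_mif.c. The real
 * table holds raw, undocumented register words for the DRAM controller. */
struct mif_timing {
    unsigned long freq_khz;   /* MIF/DRAM operating point         */
    unsigned int  timingrow;  /* packed, undocumented timing bits */
};

static struct mif_timing mif_table[] = {
    { 800000, 0x345A96D3 },   /* made-up packed value */
    { 400000, 0x1A35538A },   /* made-up packed value */
};

/* Tightening a latency-related parameter would mean guessing which bit
 * range encodes e.g. tRP, decrementing it, and hoping the memory still
 * runs stable -- with no datasheet it's pure trial and error.          */
[/code]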
 