codedivine
Any Snapdragon S4 users (such as US GS3, HTC One S etc) care to try the benchmark? Should be interesting. :smile:
AT&T One X here, with a Snapdragon S4.
1 thread: 730 MFlops
2 threads: 1460 MFlops
4 threads: 1333 MFlops (is it normal to see a performance drop when the number of threads exceeds the number of cores?)
CPU usage is around 98% during the test.
Excellent! Thanks a lot! :smile:
Interesting to see much higher performance compared to a similarly clocked 1.5 GHz Snapdragon S3 dual-core (which gives about 1040 MFlops).
Yes, some performance degradation has been seen by other users when threads > cores. Not sure if the almost 10% drop you saw is normal, though.
But you just said you got a 1750 MFlops score with your S3. It seemed like the score was pretty low to me.
So how close to peak performance is the benchmark getting? I'm more ignorant about ARM CPU performance than I probably ought to be. Someone clue me in.
On my good old SGS1 (i9000) under JB (CM10 20/09/2012) and Mackay Kernel v0.66:
1 thread: 68 MFlops
If I am not wrong, the Cortex-A8 VFP is not pipelined, so each operation has to complete before the next one can issue. Hence the result.
Well, my phone runs JB fine, so that's what matters to me.
Nice new bench btw.
So how close to peak performance is the benchmark getting? I'm more ignorant about ARM CPU performance than I probably ought to be. Someone clue me in.
Cortex-A9 can apparently issue all FP64 operations once every other cycle. Traditional matrix multiplication algorithms for an NxN matrix require O(N^3) FLOPs, where (N^3 - N^2) of those are FMADDs and N^2 are FADDs. Both can be issued every other cycle on Cortex-A9 (according to Laurent, the TRM must be wrong about FP64 FADDs issuing every cycle), but FMACs (they're 3-op/destructive) have a really long latency that's hard to fill with independent work, especially since I don't think there's a special forwarding path between dependent FMACs like there is for integer MACs.
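To make that concrete, here's the textbook triple loop (a sketch only, not the benchmark's actual kernel): the inner statement is exactly the multiply-accumulate being counted, so the FMAC issue rate and latency dominate.

```c
#include <stddef.h>

/* Textbook NxN matrix multiply, FP64. The inner statement is one
 * multiply-accumulate per iteration, which a compiler can emit as a
 * single VFP FMAC (vmla.f64), so its issue rate and latency dominate
 * the O(N^3) cost discussed above. Sketch only, not the benchmark. */
static void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];  /* one FMAC */
            C[i * n + j] = acc;
        }
}
```

Note that acc forms a single dependent chain of FMACs, which is exactly the case where the long latency and missing forwarding path hurt; independent accumulators (see the sketch further down) are the usual way around it.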
An important question is whether loads and stores can be issued on the second issue cycle of an FP64 operation. My guess is not. The typical textbook tiling algorithm for traditional matrix multiplication (like this: http://en.wikipedia.org/wiki/Loop_tiling) will do a read-modify-write on every element of the output tile. It's good for cache locality but poor for register locality. On a CPU that can co-issue loads and stores with FLOPs at a matching rate this may not be a problem, but on many (most?) platforms it won't be. Cortex-A9 would definitely benefit from anything that improves the load/store-to-FLOP ratio; there just aren't enough registers to hide a lot.
Not really sure what the best kernel for something like this would be... someone else here probably already has experience with that.
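For what it's worth, one standard trick layered under the cache tiles is register blocking: compute a small block of C entirely in registers so every loaded element of A and B gets reused. A rough sketch (assumes n is even; illustrative, not a tuned Cortex-A9 kernel):

```c
#include <stddef.h>

/* Register blocking sketch: compute a 2x2 block of C per inner loop.
 * Each k step does 4 FMACs on 4 loads (2 from A, 2 from B), versus
 * 1 FMAC per 2 loads in the naive loop. Assumes n is even. */
static void matmul_2x2(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (size_t k = 0; k < n; k++) {
                double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
}
```

Per k step that's 4 FMACs on 4 loads instead of 1 FMAC on 2 loads, and the four independent accumulators give the long-latency FMACs something to overlap with.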
Thanks, I just read it
I wonder what kind of results you get just running the inner loop on the same data over and over again, to try to isolate the effects of cache misses (and, to a much lesser extent, branch mispredicts). If cache misses really are showing up as a large percentage, some prefetch instructions would help. If the inner-loop cost without memory stalls is close to as high as the number you reported, then I'm pretty surprised. I really expect there to be just 16 cycles for the FMACs, 6 cycles for the loads, and (if Cortex-A9 dispatches VFP/NEON instructions anything like the A8 does) zero cycles for everything else, since it'd be running in parallel.
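Something along these lines is what I have in mind; just a sketch, with inner_kernel() standing in for whatever the benchmark's real inner loop is, and the sizes made up:

```c
#include <stdio.h>
#include <time.h>

#define N    64        /* small enough to stay resident in L1 */
#define REPS 1000000

static double a[N], b[N], acc;

/* Stand-in for the benchmark's real inner loop. */
static void inner_kernel(void)
{
    for (int k = 0; k < N; k++)
        acc += a[k] * b[k];
}

/* Rerun the same inner loop over the same small buffers: every pass
 * after the first hits cache, so memory stalls drop out and what's
 * left is (mostly) the raw FMAC/load cost plus branch overhead. */
int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        inner_kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 2 FLOPs (one multiply, one add) per element per rep; printing
     * acc keeps the computation live under optimization. */
    printf("%.1f MFlops (acc=%g)\n", 2.0 * N * REPS / secs / 1e6, acc);
    return 0;
}
```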
Are you sure that the Tegra 3 is running at the clock speed you think it is, throughout the duration of the test?
On closer inspection, I think you might be suffering stall cycles from a WAW hazard, with the second vld (to d6) hitting the most recent vmacd (accessing d6). They'd appear almost back to back in the VFP pipeline, depending on how loads work. I wouldn't expect it to push you anywhere close to what you're seeing though, not by itself.
Are you compiling with VFPv3-d16? Is Tegra 2 using that? If so, I'd be curious to see a version compiled with full VFPv3 and a bigger kernel that can make use of more registers.
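For reference, with GCC that's the difference between -mfpu=vfpv3-d16 (only d0-d15 available) and -mfpu=vfpv3 (all 32 double registers). With the full register file a register-blocked kernel like the 2x2 above could grow to 4x4 with all 16 accumulators kept live; a sketch (assumes n is a multiple of 4, illustrative only):

```c
#include <stddef.h>

/* 4x4 register blocking: the 16 accumulators alone want 16 double
 * registers, plus a few for A/B values, so this only fits with full
 * VFPv3 (build with e.g. -O3 -marm -mfpu=vfpv3); under
 * -mfpu=vfpv3-d16 the compiler will spill. Assumes n % 4 == 0. */
static void matmul_4x4(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 4)
        for (size_t j = 0; j < n; j += 4) {
            double c[4][4] = { { 0.0 } };
            for (size_t k = 0; k < n; k++)
                for (int u = 0; u < 4; u++)      /* constant trip counts, */
                    for (int v = 0; v < 4; v++)  /* compiler fully unrolls */
                        c[u][v] += A[(i + u) * n + k] * B[k * n + j + v];
            for (int u = 0; u < 4; u++)
                for (int v = 0; v < 4; v++)
                    C[(i + u) * n + j + v] = c[u][v];
        }
}
```

After the compiler CSEs the repeated loads, that's roughly 16 FMACs per 8 loads in the inner loop, double the FLOP/load ratio of the 2x2 version.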