Android benchmark. Looking for testers

AT&T One X here, with a Snapdragon S4.

1 thread: 730 MFlops
2 threads: 1460 MFlops
4 threads: 1333 MFlops (is it normal to see a performance drop when the number of threads > the number of cores?)

CPU usage is around 98% during the test.
 

Excellent! Thanks a lot! :smile:
Interesting to see much higher performance compared to a similarly clocked 1.5 GHz Snapdragon S3 dual-core (which gives about 1040 MFlops).

Yes, some users did see performance degradation when threads > cores. I'm not sure whether the almost-10% drop you saw is normal, though.
 

But you just said you got a 1750 MFlops score with your S3. It seemed like the score was pretty low to me.
 
I do apologize to everyone who tested with all the versions before, since I had bugs in the code. :oops: :oops: :oops:

However, I think everything is good now. Please use the Play version from now on. I have removed the APKs from my site.

If I ever meet anyone who spent time on the previous versions, I will buy you a beer to compensate for your time. :oops:
 
Some results reported by users:

Nexus 7 (Tegra 3): 1488 MFlops
Galaxy S2X (Snapdragon S3 dual-core): 1175 MFlops
Galaxy S2 (Exynos 4 dual-core): 998 MFlops
Acer A100 (Tegra 2): 714 MFlops
 
So how close to peak performance is the benchmark achieving? I'm more ignorant about ARM CPU performance than maybe I ought to be. Someone clue me in :)
 
On my good old SGS1 (i9000) under JB (cm10 20/09/2012) and Mackay Kernel v0.66:

1 thread: ... 68 MFlops :devilish:
 
So how close to peak performance is the benchmark achieving? I'm more ignorant about ARM CPU performance than maybe I ought to be. Someone clue me in :)

If I am interpreting Laurent06 correctly, Cortex A9 should do 1 fp64 flop/cycle. Based on that, I would say around 40% of peak on some Cortex A9 systems. I don't know for sure about Snapdragons since I didn't find any relevant Qualcomm documentation.

edit: Much lower efficiency on Tegra 3 though.
 
Copying my comment from TR forums:

I think after looking at the data from the multithreaded numbers so far, we can conclude the following: Cortex A9 implementations are achieving about 0.4 flops/cycle in multithreaded mode on the Exynos and OMAP parts. Snapdragon S3 is also doing 0.4 flops/cycle.

However, Tegra 3 seems to be the exception to the rule and is only achieving about 0.32 flops/cycle on average. Tegra 2 is also stuck at about 0.36 flops/cycle. I wonder if it has something to do with Nvidia's memory controller.

About the single-threaded results: the single-threaded and multithreaded versions actually work on different problem sizes. The single-threaded one works on smaller matrices, to ensure that the single-thread case does not take very long to run, but now I'm starting to think that was a poor design decision on my part. So results from the single-threaded and multithreaded modes are not directly comparable. I think I will provide settings to let you choose the matrix sizes yourself sometime.
 
So how close to peak performance is the benchmark achieving? I'm more ignorant about ARM CPU performance than maybe I ought to be. Someone clue me in :)

Cortex-A9 can apparently issue all FP64 operations once every other cycle. Traditional matrix multiplication algorithms for an NxN matrix require O(N^3) FLOPs, where (N^3 - N^2) of those are FMADDs and N^2 of them are FADDs. Both can be issued every other cycle on Cortex-A9 (according to Laurent, the TRM must be wrong about FP64 FADDs issuing every cycle), but FMACs (they're 3-op/destructive) have a really long latency so they're hard to fill, especially since I don't think there's a special forwarding path between dependent FMACs like there are for integer MACs.

An important question is if loads and stores can be issued on the second issue cycle of the FP64 operation. My guess is not. The typical textbook tiling algorithm for traditional matrix multiplication (like this http://en.wikipedia.org/wiki/Loop_tiling) will do an RMW on every element. It's good for cache locality but poor for register locality. On a CPU with as much load and store as FLOP co-issuing capability this may not be a problem, but on many (most?) platforms this won't be the case. Cortex-A9 would definitely benefit from anything that can improve the load/store to FLOP ratio. There just aren't really enough registers to hide a lot.

Not really sure what'd be the best kernel for something like this; someone else here probably already has experience with something like that.
 

My kernel does 6 loads for 8 FMACs :smile:
 
Thanks, I just read it :D

I wonder what kind of results you get just running the inner loop on the same data over and over again, to try to isolate the effects of cache misses (and to a much lesser extent, branch mispredicts). If cache misses really are showing up as a large percentage some prefetch instructions would help. If the inner loop cost w/o memory stalls is close to as high as the number you reported then I'm pretty surprised. I really expect there to be just 16 cycles for the FMACs, 6 cycles for the loads, and - if Cortex-A9 dispatches VFP/NEON instructions anything like A8 does - zero cycles for everything else since it'd be running in parallel.

Are you sure that the Tegra 3 is running at the clock speed you think it is, throughout the duration of the test?

On closer look, I think you might be suffering stall cycles due to a WAW hazard with the second vld (to d6) hitting the most recent vmacd (accessing d6). They'd appear almost back to back in the VFP pipeline, depending on how loads work. I wouldn't expect it to push you anywhere close to what you're seeing though, not by itself.

Are you compiling with VFPv3-d16? Is Tegra 2 using that? If so, I'd be curious to see a version compiled with full VFPv3 and a bigger kernel that can make use of more registers.
 
Thanks, I just read it :D

I wonder what kind of results you get just running the inner loop on the same data over and over again, to try to isolate the effects of cache misses (and to a much lesser extent, branch mispredicts). If cache misses really are showing up as a large percentage some prefetch instructions would help. If the inner loop cost w/o memory stalls is close to as high as the number you reported then I'm pretty surprised. I really expect there to be just 16 cycles for the FMACs, 6 cycles for the loads, and - if Cortex-A9 dispatches VFP/NEON instructions anything like A8 does - zero cycles for everything else since it'd be running in parallel.

Good suggestion. I will try that, though I might report back 1-2 weeks later due to work.

Are you sure that the Tegra 3 is running at the clock speed you think it is, throughout the duration of the test?

Oh, no, I have not verified that. Those results are what a user reported using a Nexus 7. I don't have a Tegra 3 device so I cannot say for sure. I just looked up the specs of the T30L in the Nexus 7 (1.2 GHz quad-core) and assumed that frequency in my calculation of cycles taken to execute. How does one verify that on Android?

On closer look, I think you might be suffering stall cycles due to a WAW hazard with the second vld (to d6) hitting the most recent vmacd (accessing d6). They'd appear almost back to back in the VFP pipeline, depending on how loads work. I wouldn't expect it to push you anywhere close to what you're seeing though, not by itself.

Are you compiling with VFPv3-d16? Is Tegra 2 using that? If so, I'd be curious to see a version compiled with full VFPv3 and a bigger kernel that can make use of more registers.

I used the default GCC flags for armv7a target from the android NDK r8b with GCC 4.6. These are the relevant flags that the NDK was using: -march=armv7-a -mfloat-abi=softfp -mfpu=vfp

I guess I should look at using the -mfpu=vfpv3-d16 flag instead? I don't think I will be changing the Play Store kernel code now, as people seem happy enough with it, but I will do a build and post a link to an APK here soon for analysis purposes.

Thanks for your detailed remarks. This is why I love B3D (and also techreport forums). I got severely burnt by trolling at RWT forums (though partly because of my idiocy and miscommunication) :cry:
 