codedivine
Any Snapdragon S4 users (such as US GS3, HTC One S etc) care to try the benchmark? Should be interesting. :smile:
AT&T One X here, with a Snapdragon S4.
1 thread: 730 MFlops
2 threads: 1460 MFlops
4 threads: 1333 MFlops (is it normal to see a performance drop when the number of threads exceeds the number of cores?)
CPU usage is around 98% during the test.
Excellent! Thanks a lot! :smile:
Interesting to see much higher performance compared to a similarly clocked 1.5 GHz Snapdragon S3 dual-core (which gives about 1040 MFlops).
Yes, some performance degradation has been seen by other users when threads > cores. Not sure if the almost 10% drop you saw is normal, though.
But you just said you got a 1750 MFlops score with your S3. It seemed like the score was pretty low to me.
So how close to peak performance is the benchmark getting? I'm more ignorant about ARM CPU performance than I probably ought to be. Someone clue me in.
On my good old SGS1 (i9000) under JB (CM10 20/09/2012) and Mackay Kernel v0.66:
1 thread: 68 MFlops
If I am not wrong, the Cortex-A8 VFP is not pipelined, so each operation has to complete before the next one can issue. Hence the result.
Well, my phone runs JB fine, so that's what matters to me.
Nice new bench btw.
So how close to peak performance is the benchmark getting? I'm more ignorant about ARM CPU performance than I probably ought to be. Someone clue me in.
Cortex-A9 can apparently issue all FP64 operations once every other cycle. Traditional matrix multiplication algorithms for an NxN matrix require O(N^3) FLOPs, where (N^3 - N^2) of those are FMADDs and N^2 are FADDs. Both can be issued every other cycle on Cortex-A9 (according to Laurent, the TRM must be wrong about FP64 FADDs issuing every cycle), but FMACs (they're 3-op/destructive) have a really long latency that's hard to fill with independent work, especially since I don't think there's a special forwarding path between dependent FMACs like there is for integer MACs.
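To make that concrete, here's the textbook triple loop (a sketch only, not the benchmark's actual kernel): the inner statement is exactly the multiply-accumulate being counted, so the FMAC issue rate and latency dominate.

```c
#include <stddef.h>

/* Textbook NxN matrix multiply, FP64. The inner statement is one
 * multiply-accumulate per iteration, which a compiler can emit as a
 * single VFP FMAC (vmla.f64), so its issue rate and latency dominate
 * the O(N^3) cost discussed above. Sketch only, not the benchmark. */
static void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];  /* one FMAC */
            C[i * n + j] = acc;
        }
}
```

Note that acc forms a single dependent chain of FMACs, which is exactly the case where the long latency and missing forwarding path hurt; independent accumulators (see the sketch further down) are the usual way around it.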
An important question is whether loads and stores can be issued on the second issue cycle of an FP64 operation. My guess is not. The typical textbook tiling algorithm for traditional matrix multiplication (like this: http://en.wikipedia.org/wiki/Loop_tiling) will do a read-modify-write on every element of the output tile. It's good for cache locality but poor for register locality. On a CPU that can co-issue loads and stores with FLOPs at a matching rate this may not be a problem, but on many (most?) platforms it won't be. Cortex-A9 would definitely benefit from anything that improves the load/store-to-FLOP ratio; there just aren't enough registers to hide a lot.
Not really sure what the best kernel for something like this would be... someone else here probably already has experience with that.
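For what it's worth, one standard trick layered under the cache tiles is register blocking: compute a small block of C entirely in registers so every loaded element of A and B gets reused. A rough sketch (assumes n is even; illustrative, not a tuned Cortex-A9 kernel):

```c
#include <stddef.h>

/* Register blocking sketch: compute a 2x2 block of C per inner loop.
 * Each k step does 4 FMACs on 4 loads (2 from A, 2 from B), versus
 * 1 FMAC per 2 loads in the naive loop. Assumes n is even. */
static void matmul_2x2(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (size_t k = 0; k < n; k++) {
                double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
}
```

Per k step that's 4 FMACs on 4 loads instead of 1 FMAC on 2 loads, and the four independent accumulators give the long-latency FMACs something to overlap with.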
Thanks, I just read it
I wonder what kind of results you get just running the inner loop on the same data over and over again, to try to isolate the effects of cache misses (and, to a much lesser extent, branch mispredicts). If cache misses really are showing up as a large percentage, some prefetch instructions would help. If the inner-loop cost without memory stalls is close to as high as the number you reported, then I'm pretty surprised. I really expect there to be just 16 cycles for the FMACs, 6 cycles for the loads, and (if Cortex-A9 dispatches VFP/NEON instructions anything like the A8 does) zero cycles for everything else, since it'd be running in parallel.
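Something along these lines is what I have in mind; just a sketch, with inner_kernel() standing in for whatever the benchmark's real inner loop is, and the sizes made up:

```c
#include <stdio.h>
#include <time.h>

#define N    64        /* small enough to stay resident in L1 */
#define REPS 1000000

static double a[N], b[N], acc;

/* Stand-in for the benchmark's real inner loop. */
static void inner_kernel(void)
{
    for (int k = 0; k < N; k++)
        acc += a[k] * b[k];
}

/* Rerun the same inner loop over the same small buffers: every pass
 * after the first hits cache, so memory stalls drop out and what's
 * left is (mostly) the raw FMAC/load cost plus branch overhead. */
int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        inner_kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 2 FLOPs (one multiply, one add) per element per rep; printing
     * acc keeps the computation live under optimization. */
    printf("%.1f MFlops (acc=%g)\n", 2.0 * N * REPS / secs / 1e6, acc);
    return 0;
}
```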
Are you sure that the Tegra 3 is running at the clock speed you think it is, throughout the duration of the test?
On closer inspection, I think you might be suffering stall cycles from a WAW hazard, with the second vld (to d6) hitting the most recent vmacd (accessing d6). They'd appear almost back to back in the VFP pipeline, depending on how loads work. I wouldn't expect it to push you anywhere close to what you're seeing though, not by itself.
Are you compiling with VFPv3-d16? Is Tegra 2 using that? If so, I'd be curious to see a version compiled with full VFPv3 and a bigger kernel that can make use of more registers.
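For reference, with GCC that's the difference between -mfpu=vfpv3-d16 (only d0-d15 available) and -mfpu=vfpv3 (all 32 double registers). With the full register file a register-blocked kernel like the 2x2 above could grow to 4x4 with all 16 accumulators kept live; a sketch (assumes n is a multiple of 4, illustrative only):

```c
#include <stddef.h>

/* 4x4 register blocking: the 16 accumulators alone want 16 double
 * registers, plus a few for A/B values, so this only fits with full
 * VFPv3 (build with e.g. -O3 -marm -mfpu=vfpv3); under
 * -mfpu=vfpv3-d16 the compiler will spill. Assumes n % 4 == 0. */
static void matmul_4x4(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 4)
        for (size_t j = 0; j < n; j += 4) {
            double c[4][4] = { { 0.0 } };
            for (size_t k = 0; k < n; k++)
                for (int u = 0; u < 4; u++)      /* constant trip counts, */
                    for (int v = 0; v < 4; v++)  /* compiler fully unrolls */
                        c[u][v] += A[(i + u) * n + k] * B[k * n + j + v];
            for (int u = 0; u < 4; u++)
                for (int v = 0; v < 4; v++)
                    C[(i + u) * n + j + v] = c[u][v];
        }
}
```

After the compiler CSEs the repeated loads, that's roughly 16 FMACs per 8 loads in the inner loop, double the FLOP/load ratio of the 2x2 version.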