I agree with Exophase: ARM64 at least does not really need a large uop cache, because decoding ARM64 instructions is much simpler than decoding x86 instructions, and it's unlikely that a loop buffer would be able to fit 960 instructions (it's very rare for an "inner" loop to be that large).
As for the testing, I write the code in C and check its assembly output, as I'm more familiar with x86 than with ARM, but if you guys have suggestions for specific instructions, I can test those.
The assembly code for the load test looks like this:
ldr.w r3, [r4, r3, lsl #2]
add r1, r3
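For reference, a C loop along the following lines would be expected to compile to a load/add pair like the one above, where each load's address depends on the result of the previous load. This is only an illustrative sketch, not the actual test code: the names `build_chain` and `chase`, the table layout, and the stride are all hypothetical, and the exact instructions emitted will depend on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill the table so that table[i] points at the next
 * slot to visit. With a stride that is coprime with n, following the chain
 * from any slot visits every slot once before coming back to the start. */
static void build_chain(uint32_t *table, uint32_t n, uint32_t stride)
{
    assert(n > 0);
    for (uint32_t i = 0; i < n; i++)
        table[i] = (i + stride) % n;
}

/* Dependent-load loop: the index loaded in one iteration is the address
 * input of the next, so the loads cannot overlap with each other.
 * Returns the final index and stores the running sum in *sum_out. */
static uint32_t chase(const uint32_t *table, uint32_t start,
                      size_t steps, uint32_t *sum_out)
{
    uint32_t idx = start;
    uint32_t sum = 0;
    for (size_t i = 0; i < steps; i++) {
        idx = table[idx];   /* roughly: ldr.w r3, [r4, r3, lsl #2] */
        sum += idx;         /* roughly: add r1, r3 */
    }
    *sum_out = sum;
    return idx;
}
```

Accumulating into `sum` and returning it keeps the compiler from discarding the loads as dead code. Timing a large number of steps and dividing by the step count would then give an approximate per-load cost for a given table size.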
I didn't do any battery testing though, as the device was plugged in and the test is very short (it finishes in maybe 10 seconds).
One thing to remember is that my tests here are just for finding out the internal design of the CPU. In real-world workloads, mobile devices, and phones especially, tend to perform worse than their on-paper numbers because of power restrictions. The iPad Air has pretty good heat dissipation, but for better real-world testing, I think it's better to test in a prolonged, not-plugged-in environment (and with real workloads).