Next-Gen iPhone & iPhone Nano Speculation

Apple has released their LLVM compiler for AArch64. It contains a file that gives some information about Cyclone. In particular, 6 micro-ops can be dispatched per cycle and the re-order buffer can hold up to 192 instructions. I couldn't find the decode/fetch bandwidth, so I'm still unsure whether 6 instructions per cycle can be sustained (I doubt there are 6x3 (ARM + Thumb + AArch64) full decoders).
 
Interesting, thanks Laurent.

It does say 6 uops are dispatched per cycle, not 6 instructions, and it lists one case of an instruction generating two uops. So there may not actually be six decoders of any type, let alone of every type.

Different buffer sizes are given for the different execution pipe types, suggesting that there isn't a unified scheduler but that the scheduling is clustered. But the buffers seem gigantic if they really are full schedulers; the reordering capability would be enormous.

Looks like it's doing move elimination, including converting moves of imm #0 to moves from the zero register.

Load-to-use latency isn't low for a CPU with a pretty restricted clock target, but I guess it makes sense given the large 64KB L1 dcache size.
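
For anyone who wants to poke at this themselves: load-to-use latency is usually measured with a dependent pointer chase, where each load's address comes from the previous load's result. Below is a minimal C sketch of that technique; the array size, access pattern, and clock_gettime-based timing are illustrative assumptions, not anything taken from the LLVM file.

Code:
// latency_chase.c -- each load's address depends on the previous load's
// result, so the loop runs at roughly one load-to-use latency per step.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    enum { N = 4096 };                 // 4096 * 8 bytes = 32KB, fits a 64KB L1 dcache
    uint64_t *ring = malloc(N * sizeof *ring);
    for (int i = 0; i < N; i++)        // simple ring; a random permutation would
        ring[i] = (i + 1) % N;         // additionally defeat any prefetcher
    const uint64_t iters = 100000000ULL;
    uint64_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++)
        p = ring[p];                   // serialized: the next address needs this load
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.2f ns per dependent load (p=%llu)\n", ns / iters,
           (unsigned long long)p);
    free(ring);
    return 0;
}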
 
I did some tests on the iPad Air, and I'm pretty sure it's able to decode at least 4 instructions per cycle, but as you guys said, I'm not sure whether it can decode 6 "real" instructions as some have suggested.
 
Yeah, I think anything less than four would make it hard to utilize four ALUs, among other things. I also expect the capability to at least do two loads in parallel, if not two stores.

Interesting how variable the mispredict penalty is, 14-19 cycles "typical" (so maybe it could be something else); the low number could come from bypassing some stages of the frontend on a hit in a loop buffer or even a post-decode cache. Or maybe the higher number is just for branches resolved later in the pipeline, e.g. 14 cycles for direct unconditional branches, 16 cycles for direct conditional branches, and 19 cycles for indirect branches.

The mispredict penalty/pipeline depth still seems pretty high for something only clocking to 1.4GHz (has anyone been able to overclock it?); I guess you need more stages to handle all that scheduling and execution width. The branch prediction is going to have to be really good to cope with that. I expect indirect branch prediction in particular to be superior to its peers - one thing that stood out to me is how well it runs emulators with interpreters vs Cortex-A15.

On paper this really seems a lot like a recent Intel core design, except for perhaps lacking a unified scheduler and instead having clustered scheduling that's just as deep.
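
One way to probe the indirect-branch point without an emulator is to time indirect calls through a small table with a predictable versus a pseudo-random target pattern; the per-call difference roughly approximates the indirect mispredict penalty times the miss rate. A rough C sketch, where the function table, pattern length, and timing are all assumptions for illustration:

Code:
// indirect_branch.c -- predictable vs pseudo-random indirect call targets;
// the per-call time difference reflects indirect mispredictions.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint64_t g_sink;
static uint64_t f0(uint64_t x) { return x + 1; }
static uint64_t f1(uint64_t x) { return x + 3; }
static uint64_t f2(uint64_t x) { return x + 5; }
static uint64_t f3(uint64_t x) { return x + 7; }

static double run(const uint8_t *pattern, int plen, uint64_t iters) {
    uint64_t (*tbl[4])(uint64_t) = { f0, f1, f2, f3 };
    struct timespec t0, t1;
    uint64_t x = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++)
        x = tbl[pattern[i % plen]](x);      // indirect call, target set by pattern
    clock_gettime(CLOCK_MONOTONIC, &t1);
    g_sink += x;                            // keep the result live
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    enum { PLEN = 4096 };
    static uint8_t fixed[PLEN], rnd[PLEN];
    srand(1);
    for (int i = 0; i < PLEN; i++) { fixed[i] = 2; rnd[i] = rand() & 3; }
    const uint64_t iters = 50000000ULL;
    double t_fixed = run(fixed, PLEN, iters);
    double t_rnd   = run(rnd, PLEN, iters);
    printf("predictable: %.2f ns/call, random: %.2f ns/call, delta %.2f ns\n",
           t_fixed / iters, t_rnd / iters, (t_rnd - t_fixed) / iters);
    return 0;
}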
 
I decided to pick up my old test program and do some more tests on the iPad Air.
In my old test I only used integer instructions. However, according to AnandTech's test, Cyclone is able to do floating point alongside integer, so I decided to put some floating point instructions in the mix.

The test basically looks like this:

add
add
add
add
fadd
fadd

The first few tests were not very successful, as the code is much slower with the fadds in the mix. I think it could be that fadd's latency is much longer than integer add's (it looks like integer adds, including 64-bit ones, have a one-cycle use latency in the pipeline). So I used more floating point registers to avoid the latency issue.

Of course, there are only 32 registers, but fortunately the floating point registers are separate from the integer ones, so I made a new test with all 32 registers used. This way, I got ~7800 MIPS on the iPad Air. Since the iPad Air's clock rate is 1.4GHz, that's about 5.5 instructions per cycle. I couldn't manage to make it higher in the short time I had.

These new test results do support the conjecture that Cyclone is able to decode 6 instructions per cycle, and that fadd's use latency (both single and double precision) is probably about 3 cycles.

[EDIT] Forgot to mention another possibility: that it has some sort of micro-op cache like Sandy Bridge.
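
For reference, the kind of unrolled integer/FP mix described above could look roughly like the following C with inline AArch64 assembly. This is only a sketch: the register allocation and iteration count are assumptions, and pcchen's actual test used many more independent FP registers and a much larger unrolled body.

Code:
// mix_throughput.c -- illustrative only: 4 independent integer adds plus
// 2 independent fadds per block, timed over many iterations.
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    uint64_t a0 = 1, a1 = 2, a2 = 3, a3 = 4;          // integer accumulators
    uint64_t b0 = 5, b1 = 6, b2 = 7, b3 = 8;          // integer addends
    double   e0 = 1.0, e1 = 1.5, e2 = 2.0, e3 = 2.5;  // FP accumulators/addends
    const uint64_t iters = 100000000ULL;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        __asm__ volatile(               // 4 independent adds + 2 independent fadds
            "add  %0, %0, %6\n\t"
            "add  %1, %1, %7\n\t"
            "add  %2, %2, %8\n\t"
            "add  %3, %3, %9\n\t"
            "fadd %d4, %d4, %d10\n\t"
            "fadd %d5, %d5, %d11\n\t"
            : "+r"(a0), "+r"(a1), "+r"(a2), "+r"(a3), "+w"(e0), "+w"(e2)
            : "r"(b0), "r"(b1), "r"(b2), "r"(b3), "w"(e1), "w"(e3));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    // 6 payload instructions per iteration; loop overhead not counted.
    printf("~%.0f MIPS (a0=%llu, e0=%g)\n", iters * 6 / secs / 1e6,
           (unsigned long long)a0, e0);
    return 0;
}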
 
https://github.com/llvm-mirror/llvm...1169a48d2a/lib/Target/ARM/ARMScheduleSwift.td

Code:
// Swift machine model for scheduling and other instruction cost heuristics.
def SwiftModel : SchedMachineModel {
  let IssueWidth = 3; // 3 micro-ops are dispatched per cycle.
  let MicroOpBufferSize = 45; // Based on NEON renamed registers.
  let LoadLatency = 3;
  let MispredictPenalty = 14; // A branch direction mispredict.

I don't know if it helps any, but the corresponding characteristics for Swift suggest pipeline depth has gone up in Cyclone even though the clock speed is the same. Typical load latency is specified as 3 cycles and the mispredict penalty as 14, compared to 4 and 16 for Cyclone. The issue width has doubled and the uop buffer has more than quadrupled.
 
Interesting. It could have separate decoders for different kinds of operations I guess, e.g. 4 INT decoders (ARMv7+Thumb+AArch64), 2 Load/Store decoders, 2 FP decoders (VFP+NEON), etc... Still 4 full decoders for 3 ISAs is quite a lot!
 
Try to make your loop much larger so as to defeat any loop buffer (but keep it smaller than L1 Icache :smile:).
 
According to the llvm code, fadd actually has 4 and 5 cycle latency for SP and DP respectively. FMA, OTOH, is twice that, so if I wanted to name weaknesses of the chip, the float latencies would probably be one.

That may be possible, but based on the test results it might just as well be 6 "full" decoders. Since there are only 4 integer pipes, you'd never see throughput higher than 4 if you only use integer operations. I'm not sure how easy a decoder that wide would be (for x86 a wide decoder is not easy, but even if it isn't difficult here you may end up decoding quite a few instructions you never need); of course a loop buffer would be a possibility too for reaching those 6 operations (even then, I'd think the decoder would be quite wide).
The chip is actually quite impressive: based on the llvm code it should be able to handle, for instance, 4 int adds + 2 l/s in a single clock (my guess would be 1 store max per cycle, but possibly 2 loads), or 2 int adds + 1 indirect branch + 1 direct branch + 2 l/s. Overall that's actually better than what Ivy Bridge could do and more in the territory of Haswell (which can also do 2 branches, though it can do 2 loads + 1 store per clock, so it's slightly better there). The ROB size is also larger than Ivy Bridge's and incidentally matches Haswell's.
That of course does not necessarily mean it will beat IVB in practical IPC, but it is very remarkable that just about everybody else (Intel, AMD, Qualcomm, ARM not counting A15/A57) settled on relatively simple 2-wide OoO architectures for low-power (or, say, the most power-efficient) designs, while Apple went for a (much) wider but lower-clocked design.
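
A dependent-chain microbenchmark is the usual way to check latencies like the fadd figures quoted from the llvm code. Here is a small sketch in C with inline AArch64 assembly; the chain length, iteration count, and the assumed 1.4GHz clock are illustrative, not a claim about anyone's actual test.

Code:
// fadd_latency.c -- a chain of dependent fadds; with no independent work,
// the loop runs at roughly (chain length * fadd latency) cycles per iteration.
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    double acc = 1.0, one = 1.0;
    const uint64_t iters = 100000000ULL;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        __asm__ volatile(              // 4 dependent double-precision adds
            "fadd %d0, %d0, %d1\n\t"
            "fadd %d0, %d0, %d1\n\t"
            "fadd %d0, %d0, %d1\n\t"
            "fadd %d0, %d0, %d1\n\t"
            : "+w"(acc) : "w"(one));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // at 1.4GHz: cycles per fadd = (ns per iteration) * 1.4 / 4
    printf("%.2f ns/iter, ~%.2f cycles per dependent fadd (acc=%g)\n",
           ns / iters, ns / iters * 1.4 / 4, acc);
    return 0;
}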
 
I think it probably won't be able to sustain 6 instructions/cycle when running from the L1 icache, because that would need a 256-bit fetch interface. A wider fetch is less efficient due to fetch slots wasted on targets past the start of the fetch and taken branches before the end of it. Even Intel has shied away from 256-bit decode, and if it's only 128-bit there isn't much point to more than 4 decoders. That said, 6 AArch64 decoders shouldn't be that expensive; I can't think of much work done in decoding outside of constructing logical-operation immediates.

Since everyone else is using some kind of post-decode buffer or cache, at the very least a loop buffer, it's pretty much a given that Apple is too. Silvermont uses its ROB as a loop buffer; if Cyclone employs a similar approach, it could hold a pretty huge loop.

On the other side of things, renaming 6 instructions in one cycle would be expensive if it means 12 sources and 6 destinations from the scalar register file, due to having to maintain sequential ordering. If there's a separate vector register file it may be possible that there are two renamers that can run in parallel. A good test would be to see if it can sustain 4 ALU ops + load + store or something similar, or if it makes a difference if the ALU ops use immediates or not.

I think finding out whether it can do load + load or store + store is more significant than finding out whether it can really sustain 6 arbitrary ops.
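
A rough sketch of the kind of mix being suggested, as C with inline AArch64 assembly; the buffer sizes, register choices, and 6-instruction block are illustrative assumptions. Swapping the ldr/str lines for two loads or two stores gives the load + load and store + store variants.

Code:
// lsu_mix.c -- independent ALU ops plus a load and a store per block;
// compare the achieved rate for different ALU/load/store mixes.
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t buf_a[1024], buf_b[1024];   // small enough to stay in L1

int main(void) {
    uint64_t a = 1, b = 2, c = 3, d = 4, x = 0, y = 0;
    const uint64_t iters = 50000000ULL;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t k = i & 1023;
        __asm__ volatile(
            "add %0, %0, #1\n\t"            // 4 independent ALU ops
            "add %1, %1, #1\n\t"
            "add %2, %2, #1\n\t"
            "add %3, %3, #1\n\t"
            "ldr %4, [%6, %8, lsl #3]\n\t"  // load from stream A
            "str %5, [%7, %8, lsl #3]\n\t"  // store to stream B
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d), "+r"(x), "+r"(y)
            : "r"(buf_a), "r"(buf_b), "r"(k)
            : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // 6 payload instructions per iteration; compiler loop overhead not counted.
    printf("%.2f ns per 6-instruction block (a=%llu x=%llu)\n",
           ns / iters, (unsigned long long)a, (unsigned long long)x);
    return 0;
}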
 
I played with the iPad Air a bit more.

I made the inner loop 10 times larger, which brings the number of instructions to around 960. Of course, Sandy Bridge's uop cache is said to hold 1.5K uops, but I think Cyclone is unlikely to have something that big (if it has one at all). This doesn't change the test result, except that it's a bit faster since the loop overhead is now greatly reduced.

I also made a random load test, basically reading randomly from a 12KB integer array (in twelve independent "streams"). It managed ~1.6 loads per cycle. I didn't have enough time to tune the test more, but I think making the loop larger could bring the number up a bit.
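
For what it's worth, here is roughly how such a test could look in plain C, assuming a 12KB array whose elements hold random next indices and twelve chains that are dependent internally but independent of each other; none of this is pcchen's actual code.

Code:
// random_loads.c -- twelve dependent-load chains walking a 12KB array; the
// chains are independent of each other, so several loads can be in flight
// per cycle.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 3072 };                       // 3072 * 4 bytes = 12KB

int main(void) {
    uint32_t *arr = malloc(N * sizeof *arr);
    srand(1);
    for (int i = 0; i < N; i++)          // each element holds a random next index
        arr[i] = (uint32_t)(rand() % N);
    uint32_t s[12];
    for (int j = 0; j < 12; j++)
        s[j] = (uint32_t)(rand() % N);

    const uint64_t iters = 10000000ULL;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++)
        for (int j = 0; j < 12; j++)     // fixed-trip loop, unrolled at -O2; use
            s[j] = arr[s[j]];            // twelve scalars if s[] ends up in memory
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    uint32_t sink = 0;
    for (int j = 0; j < 12; j++) sink ^= s[j];
    // loads per cycle = 12 / (ns per outer iteration * 1.4 cycles/ns at 1.4GHz)
    printf("%.2f ns per 12 loads, ~%.2f loads/cycle (sink=%u)\n",
           ns / iters, 12.0 / (ns / iters * 1.4), (unsigned)sink);
    free(arr);
    return 0;
}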
 
Thanks for the tests! Very interesting.

Why didn't you use more than 960 instructions? After all, Cyclone's reorder buffer is as large as Haswell's, so one never knows how big a loop buffer they could be using.
 
A big SB-style uop cache seems like overkill; I wonder if the overhead of accessing such a cache plus maintaining its arrays is even lower than that of decoding AArch64 instructions. A 960-uop loop buffer also seems totally off the rails; not that such straight-line loop iterations without branches don't exist, but most code isn't so extreme.

pcchen, could you show the exact instructions you're using? And maybe take requests for testing different instructions? ;)
 
I agree with Exophase: ARM64, at least, does not really need a large uop cache, because decoding ARM64 instructions is much simpler than decoding x86 instructions, and it's unlikely a loop buffer would be able to fit 960 instructions (it's very rare for an "inner" loop to be that large).

As for the testing, I write the code in C and check its assembly output, as I'm more familiar with x86 than with ARM :p But if you guys have suggestions for specific instructions, I can do that.

The assembly code of the load test looks like this:

ldr.w r3, [r4, r3, lsl #2]   @ r3 = array[r4 + r3*4]; the next index comes from this load
add r1, r3                   @ accumulate into r1 to keep the loaded value live

I didn't do any battery test though, as the device was plugged in and the test is very short (it finishes in maybe 10 seconds).

One thing to remember is that my tests here are just for finding out the internal design of the CPU. In real-world workloads, mobile devices, due to power restrictions, tend to fall short of their on-paper performance, especially phones. The iPad Air has pretty good heat dissipation, but for better real-world testing, I think it's better to test in a prolonged, not-plugged-in environment (and with real workloads).
 
TAG Heuer is launching a mobile phone rechargeable with solar energy.

So don't praise smartphones or smartwatches that lack this evolution, Apple's included. Is there any chance they will be innovative enough to adopt it?

Even my Casio Pathfinder Titanium Atomic Toughsolar watch is completely autonomous.
 
It's not innovative if it doesn't work…

If a solar panel adds 15% per day in charge to a crappy $5500+ feature phone with a monochrome 320x200 screen, how much charge do you think it will add to a smartphone? My guess is 2%…

I once bought this relatively large solar panel (roughly the size of an iPad mini) to 'charge stuff when camping etc.' It works great for an e-paper Kindle, but that already has a 2 week battery life. And I never go camping for that long. For anything else, it's useless.
 