Apple A9 SoC

Definitely, this also suggests quite a few architectural changes. Maybe they added a decode cache?

A post-decode cache has less of an impact than it would for a processor with a long decode pipeline like the Core series, and it has a habit of adding variability to the penalty when the fetch of the correct target misses in the L1.

Oddly enough, there are two scenarios where variability has been measured.
Agner Fog's optimization testing has Haswell ranging from 15 to 20 cycles, a testament to the complexity of its front end and uop cache.
However, his tests showed AMD's Jaguar ranging from 9 to 19, depending on what the subsequent instructions were.

Going from 14-19 to a flat 9, assuming the same test code was used, raises the question of why the A8 was that high and variable to begin with, and why the A9 is so consistent and relatively low.
There are definite physical changes with the introduction of FinFETs that might allow a design to rebalance the pipeline and perhaps cut out stages that existed as drive stages or to allow lower voltage scaling on leaky planar; still, the degree of the reduction hints at a more significant change, or that the A8 was being very conservative for some reason.
 
A post-decode cache has less of an impact than it would for a processor with a long decode pipeline like the Core series, and it has a habit of adding variability to the penalty when the fetch of the correct target misses in the L1.
I wasn't suggesting this would be enough to bring down latencies that much (might be good for 2-3 cycles, I'd guess), just that it could contribute to it.
There are definite physical changes with the introduction of FinFETs that might allow a design to rebalance the pipeline and perhaps cut out stages that existed as drive stages or to allow lower voltage scaling on leaky planar; still, the degree of the reduction hints at a more significant change, or that the A8 was being very conservative for some reason.
I'm not sure FinFETs would really allow you to cut pipeline stages. At the same frequency, yes, but it looks to me like any such advantages were put into higher frequencies instead. So I'm still thinking we're looking at some "real" architectural changes here. I'd love to see some even more in-depth analysis of it...

[edit]: Although it's true that, as we've seen with the cache hierarchy, it can clock higher AND have lower latency (and in the case of the L2, despite a large size increase). I still can't quite imagine, though, that it's all due to FinFET...
 
I wasn't suggesting this would be enough to bring down latencies that much (might be good for 2-3 cycles, I'd guess), just that it could contribute to it.
But doing so has introduced variability in the penalty, whereas the latest architecture has become more consistent. Since the post-decode cache is within the fetch pipeline, the penalty would be measured differently based on where the next fetch came from. For it not to vary means whatever decode cache there is has done something differently.

I'm not sure FinFETs would really allow you to cut pipeline stages. At the same frequency, yes, but it looks to me like any such advantages were put into higher frequencies instead. So I'm still thinking we're looking at some "real" architectural changes here. I'd love to see some even more in-depth analysis of it...
The A8's penalties are on the order of AMD's Bulldozer, and what Haswell apparently has, so one implication is that there is more slack to go around for a design with 1/2 or 1/3 the fMax. Another possibility is that there was some other kind of penalty being tacked on to the A8 besides the number of pipe stages from fetch to whatever stage signals a mispredict.

It would not explain the full depth of the change, but it can explain some of it. For example, the lengthening of L1 cache latencies in the x86 world came about in part because of the extra time afforded to the L1 arrays at lower voltages.

Since so little has been disclosed, I'll note one other way mispredict penalties have been reduced: for Intel's Atom line, stages dedicated to data cache access were decoupled from the main pipeline when it went OoO, peeling off 3-4 stages. I don't know if that would be the case here, though, since the A8 was already OoO.
 
But doing so has introduced variability in the penalty, whereas the latest architecture has become more consistent. Since the post-decode cache is within the fetch pipeline, the penalty would be measured differently based on where the next fetch came from. For it not to vary means whatever decode cache there is has done something differently.
The AnandTech article doesn't actually say the mispredict penalty is fixed. The table just has a nine in there, but the text clearly says this is the measured average penalty. (And I'm not sure what the test used actually does exactly, that is, whether it actually catches the worst cases. The variable numbers for the A8 were all official info, which is simply missing for the A9.)
 
The AnandTech article doesn't actually say the mispredict penalty is fixed. The table just has a nine in there, but the text clearly says this is the measured average penalty. (And I'm not sure what the test used actually does exactly, that is, whether it actually catches the worst cases. The variable numbers for the A8 were all official info, which is simply missing for the A9.)

Then it seems I have misread what the A9 article was trying to say regarding those numbers.
 
mczac, those are cycles, it's just the axis label that is off.
Whoops, sorry about that. Was in a bit of a rush.

I'd love to see a micro-benchmark to see how many loads and stores the new core can sustain per cycle.
2 loads or 2 stores per clock (I actually don't have a test that tries both at once, but I suspect they're shared, just like on Cyclone).
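For anyone curious, a test of that sort can be as simple as streaming independent loads from an L1-resident buffer and timing it. A rough sketch in plain C (not the actual test; buffer size, rep count, and unroll factor are arbitrary placeholders):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Small enough to stay in L1D, so we measure port throughput rather than misses. */
#define WORDS (16 * 1024 / sizeof(uint64_t))
#define REPS  100000L

static uint64_t buf[WORDS];

int main(void)
{
    volatile uint64_t *p = buf;  /* volatile so every load really happens */
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < REPS; r++) {
        for (size_t i = 0; i < WORDS; i += 8) {
            /* 8 independent loads per iteration, spread over 4 accumulators
               so the add chains don't become the bottleneck. */
            s0 += p[i + 0];  s1 += p[i + 1];
            s2 += p[i + 2];  s3 += p[i + 3];
            s0 += p[i + 4];  s1 += p[i + 5];
            s2 += p[i + 6];  s3 += p[i + 7];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    double loads = (double)REPS * WORDS;
    /* Divide ns per load by the core's cycle time to get loads per cycle. */
    printf("%.3f ns per load (sink=%llu)\n", ns / loads,
           (unsigned long long)(s0 + s1 + s2 + s3));
    return 0;
}

Swapping the loads for plain stores (p[i + 0] = r; and so on) gives the store-side number the same way.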

The AnandTech article doesn't actually say the mispredict penalty is fixed. The table just has a nine in there, but the text clearly says this is the measured average penalty. (And I'm not sure what the test used actually does exactly, that is, whether it actually catches the worst cases. The variable numbers for the A8 were all official info, which is simply missing for the A9.)
Correct. The 14-19 for Cyclone/Typhoon came from the LLVM source code. Our branch test doesn't have the ability to properly see the whole range, and I can't be 100% sure we're seeing the worst case. The best I can give you is the average (note that the average was spot on at 16 for Cyclone/Typhoon).
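For context on how an "average" falls out of a test like this: a common construction (a rough sketch, not AnandTech's actual code) is to time the same data-dependent branch once with a trivially predictable taken/not-taken pattern and once with a random one, and attribute the per-branch time difference to the roughly 50% of branches that mispredict.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (1 << 20)
#define REPS 50

static uint8_t pattern[N];
static volatile uint64_t side;  /* volatile side effect in one arm keeps the
                                   compiler from deleting the branch outright */

static uint64_t walk(void)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < N; i++) {
        if (pattern[i])
            acc += i;
        else
            side += 1;
    }
    return acc;
}

static double ns_per_branch(void)
{
    struct timespec t0, t1;
    uint64_t sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        sink += walk();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    side += sink;  /* keep the result live */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec))
           / ((double)REPS * N);
}

int main(void)
{
    /* Baseline: alternating pattern, same 50/50 mix of work but trivially
       predicted. Test: random pattern, roughly half the branches mispredict. */
    for (size_t i = 0; i < N; i++) pattern[i] = (uint8_t)(i & 1);
    double base = ns_per_branch();

    for (size_t i = 0; i < N; i++) pattern[i] = (uint8_t)(rand() & 1);
    double rnd = ns_per_branch();

    /* (random - baseline) per branch, divided by the ~0.5 mispredict rate,
       approximates the average penalty in ns; divide by the cycle time
       to express it in cycles. */
    printf("predictable: %.3f ns/branch  random: %.3f ns/branch\n", base, rnd);
    printf("estimated average mispredict penalty: %.3f ns\n", (rnd - base) * 2.0);
    return 0;
}

By construction this only ever yields an average; one long penalty and a few short ones look the same, which is exactly why the worst case is hard to pin down this way.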
 
2 loads or 2 stores per clock (I actually don't have a test that tries both at once, but I suspect they're shared, just like on Cyclone).

In normal workloads loads outnumber stores 2:1 (or more). The capability to do two loads and a store per cycle could increase IPC for a lot of kernels significantly, especially for a six-wide design.

Or, maybe they still only have two ports to the cache array, but can look three virtual addresses up in the DTLB every cycle and prioritize loads over stores (deferring the stores in a big-ass store queue with forwarding), so that on average you get two accesses per cycle, but for significant bursts can handle two loads and a store per cycle.
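To put numbers on that with a trivial, purely illustrative kernel:

#include <stddef.h>

/* Each element needs two loads and one store. A core limited to two memory
   ops per cycle streams this at best at one element every 1.5 cycles; one
   that can do two loads plus a store per cycle could sustain one element
   per cycle, assuming address generation and the FP add keep up. */
void vadd(float *restrict dst, const float *restrict a,
          const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}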

Cheers
 