Apple A9 SoC

Erica Griffin posted a video comparing the 6s and 6s Plus in terms of animation smoothness (FPS). I didn't notice a big difference until she showed the camera peek animation at 2:12, where the 6s Plus is noticeably jankier.

Now I'm wondering: given the use of a low-level GPU API and the overall power of the A9 SoC, I find it hard to believe that it struggles to drive the extra pixels of the Plus. Then I remembered that the Plus renders at a native resolution of 1,242 by 2,208 and then downsamples to 1080p. If Apple aims for 60 FPS animations, they have ~16 ms per frame. Could the extra downsampling step in the rendering pipeline cause FPS drops, and how do the PowerVR hardware and the Metal API handle downsampling, given the tile-based deferred rendering (TBDR) microarchitecture?
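Back-of-the-envelope on the frame budget and the extra pixels (plain arithmetic, using the resolutions mentioned above):

```python
# Frame budget at 60 FPS, in milliseconds.
frame_budget_ms = 1000 / 60
print(f"Frame budget: {frame_budget_ms:.2f} ms")  # ~16.67 ms

# Pixels the 6s Plus GPU renders internally vs. what the panel shows.
internal = 1242 * 2208   # native render target of the Plus
panel = 1920 * 1080      # 1080p physical display

print(f"Internal render target: {internal:,} px")
print(f"Panel: {panel:,} px")
print(f"Overdraw factor: {internal / panel:.2f}x")
```

So the GPU pushes about 1.32x the panel's pixels every ~16.7 ms before the downsample pass even starts.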

 
I noticed the jerky home screen scrolling, and I've decided to skip this generation of phones. My 5S holds up brilliantly in pretty much every way, really, and with "app thinning", much smaller update downloads, and whatnot from iOS 9, even with only 16 GB of flash it's an even better phone now than it was at launch.

Not that I understand why the home screen jerks; surely blitting a grid of icons onto a single plane background can't be so horribly performance intensive. I don't know what the hell Apple's doing, TBH. ;)
 
Apparently iOS 9.2 fixes almost all of those stuttering issues that the 'Plus' iPhones have been getting.
 
Nice.
Got the answers I desired regarding latencies with larger caches. Combined with the architectural improvements that were found, it was a worthwhile wait for the article. (Of course the SoC is only a small part of the full review.) The SPEC2000 data was quite juicy. It might have been interesting to compare that with other (classes of) CPUs, but of course the interested reader can do that on their own. Very impressive gains and scores. (It pretty much settles the argument that Geekbench is a toy benchmark that overestimates Apple's ARM designs, when Apple's work on the memory hierarchy makes SPEC improve significantly more.) As was mentioned in this review, the iPad Pro review almost requires an x86 comparison, but I can also understand if you want to avoid the resulting controversy.

Looking at those SPEC numbers, I really wish Apple were more forthcoming with technical information. But kudos to the reviewers for everything they managed to extract and present!
 

Re: Geekbench and SPEC: these are SPEC 2000 numbers. Everyone knows that the memory hierarchy has an outsized impact in SPEC 2000; that's why it was retired and replaced with SPEC 2006.

It's unfortunate they couldn't run Spec2k6.
 
A common criticism against Geekbench is that it consists of small, largely cache-resident code snippets, has small data sets, and doesn't exercise the memory hierarchy like a real man, err... real code would. There is actually a bit of truth to this, and Geekbench addresses it by having dedicated memory benchmarks. (Of course, there are also upsides to small "core" benchmarks in this day and age, as long as you know what you're after. They execute quickly on all platforms, and are thus practical to run and largely avoid thermal throttling issues (*cough*).)

People arguing the superiority of Intel's x86 processors like to claim that the memory-light aspect of Geekbench for some reason causes Apple's SoCs to be "unfairly" favoured (for some reason typically not only over x86, but also over other ARM implementations with much weaker memory subsystems).

The SPECint2000 scores demonstrate even greater gains than Geekbench, so those who argue that the benchmark advances of the A9 are due to the small subtests of Geekbench should, in a perfect world, be silenced. Of course, in the real world in which we live, no such thing will happen.

I don't agree with you on the reasons for SPECint2006, by the way, but the politics of the SPEC suite is a waste of time here. The 2006 version is surrounded by controversies of its own. I do agree that it would be interesting to see SPECint2006 (and fp) for the iPad Pro's A9x.

Cross-architecture benchmarking is both very interesting and a terrible can of worms. It's impossible to do with any kind of accuracy unless you test a specific application, which of course renders the result pointless for general conclusions. If on top of that the processors are targeted at different workloads... The iPhone 6s article specifically refers to A9x vs x86 Skylake, which could be said to address similar markets, although the power draws of the Surface 4 and the iPad are likely quite different, and power is the limiting factor for the performance of these products. If Anandtech decides to do that comparison explicitly, small differences will be wildly overinterpreted, they will be accused of partisanship, and the more technically minded, who were the only meaningful audience, will get bogged down in what compilers and compiler switches were used. :D And at the end of the day, does anyone at this point really expect the results under similar power constraints to be anything other than pretty close?

Nevertheless, it would be interesting.
 
Thank goodness the iPad Pro has 4 GB of RAM. :)
If you want to get the best score, you'll have to compile some of the benchmarks in 32-bit mode (cf. Intel results on spec.org), and in that case it's possible all of SPEC 2006 could be run with 2 GB of RAM (as you didn't say which of the benchmarks didn't fit there, I can't guarantee this will work; if it's mcf, then it's one of the benchmarks that should be compiled for 32-bit, and its memory usage will be halved).
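A toy footprint model of why pointer-heavy benchmarks shrink so much in 32-bit mode (the node layout and field counts are made up for illustration, not mcf's actual structs):

```python
def node_bytes(pointer_size, n_pointers=3, n_ints=2, int_size=4):
    """Rough size of one graph node: a few pointers plus a few ints.
    Field counts are illustrative, not mcf's real data structures."""
    return n_pointers * pointer_size + n_ints * int_size

nodes = 10_000_000  # order of magnitude of a large working set
for ptr in (8, 4):  # 64-bit vs 32-bit pointers
    mb = nodes * node_bytes(ptr) / 1e6
    print(f"{ptr * 8}-bit pointers: ~{mb:.0f} MB")
```

The pointer fields themselves halve; the more pointer-dominated the structs, the closer the whole footprint gets to a full 2x reduction.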
 
People arguing the superiority of Intel's x86 processors like to claim that the memory-light aspect of Geekbench for some reason causes Apple's SoCs to be "unfairly" favoured (for some reason typically not only over x86, but also over other ARM implementations with much weaker memory subsystems).

The SPECint2000 scores demonstrate even greater gains than Geekbench, so those who argue that the benchmark advances of the A9 are due to the small subtests of Geekbench should, in a perfect world, be silenced. Of course, in the real world in which we live, no such thing will happen.
x86 fanboys will still complain that Geekbench favors crypto instructions too much. That's a valid complaint, but it's not enough to dismiss Geekbench.

I don't agree with you on the reasons for SPECint2006, by the way, but the politics of the SPEC suite is a waste of time here. The 2006 version is surrounded by controversies of its own. I do agree that it would be interesting to see SPECint2006 (and fp) for the iPad Pro's A9x.
Come on, SPEC 2006 has no issue, in particular when compiled with icc :D

Cross-architecture benchmarking is both very interesting and a terrible can of worms. It's impossible to do with any kind of accuracy unless you test a specific application, which of course renders the result pointless for general conclusions. If on top of that the processors are targeted at different workloads... The iPhone 6s article specifically refers to A9x vs x86 Skylake, which could be said to address similar markets, although the power draws of the Surface 4 and the iPad are likely quite different, and power is the limiting factor for the performance of these products. If Anandtech decides to do that comparison explicitly, small differences will be wildly overinterpreted, they will be accused of partisanship, and the more technically minded, who were the only meaningful audience, will get bogged down in what compilers and compiler switches were used. :D
In my experience, when you compile for x86-64 and AArch64 with close enough versions of gcc, you get very similar dynamic count of instructions, close dynamic code size (~5-10% advantage to x86 here), and close memory usage. Comparison is definitely possible.
 
Come on, SPEC 2006 has no issue, in particular when compiled with icc :D
(* cough *)
{Deleted SPEC stuff. Damn it's hard to stay away from.}

In my experience, when you compile for x86-64 and AArch64 with close enough versions of gcc, you get very similar dynamic count of instructions, close dynamic code size (~5-10% advantage to x86 here), and close memory usage. Comparison is definitely possible.
This is pretty much the way to do it - pick as level and reasonably realistic a common compiler baseline as possible, don't worry too much about absolute scores, note generational trends within the respective architectures, and note areas of clear differences between archs. That's interesting and might even have predictive value! Beyond that, though, the trouble starts. It is very difficult not to overinterpret whatever numbers you have in front of you, and to forget all the data that is absent.
Examples: the power budget is a huge factor, as are the selection and relevance of benchmarks, the target application areas of the respective products, performance over time for products such as these - the list goes on and on. So if you were to make a typical SPECint2006 run on the iPad Pro and a similar version of the Surface 4, and saw, say, a seemingly substantial 50% difference in the numbers - what conclusions could you draw, really? Other than the trivial and useless "Device A delivers 50% higher numbers than Device B on this particular benchmark, under these particular conditions"?

I'm not lecturing really, just pointing out that whatever data is produced and presented will be overinterpreted by most readers, including the old, wise and knowledgeable. It's human nature. My hope is that a responsible article author will take the opportunity to caution and educate his readership a bit in the process.

God, I ramble...
 
All of the above said, inexact science though it may be, benchmarking is interesting. The A9 is remarkable CPU-wise, and I have the feeling there is more to its improved performance than has been revealed so far.
But what about the GPU? It certainly seems to perform outstandingly; can anything be inferred beyond an increased number of functional units and higher clocks (and what are those, exactly)?
 
All of the above said, inexact science though it may be, benchmarking is interesting. The A9 is remarkable CPU-wise

It's an astonishing jump in IPC - from an already very high IPC. The faster, bigger L2 can't explain more than a few percent of this. If we look at the GCC subtest in SPECint 2000, the miss rate is less than 0.15% with a 1 MB cache - in the noise. That also implies the doubling of off-chip bandwidth has nothing to do with the gain.
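A crude average-access-time sketch of why such a tiny miss rate leaves the cache changes little room to matter (the latency numbers here are placeholders in the right ballpark, not measurements):

```python
def avg_access_cycles(hit, miss_rate, next_level):
    """Average memory access cycles for a simple two-level model:
    hit latency plus the fraction of accesses that go further out."""
    return hit + miss_rate * next_level

# With only 0.15% of accesses missing a 1 MB cache, even halving the
# next level's latency barely moves the average access time.
slow = avg_access_cycles(hit=4, miss_rate=0.0015, next_level=100)
fast = avg_access_cycles(hit=4, miss_rate=0.0015, next_level=50)
print(f"{slow:.3f} vs {fast:.3f} cycles "
      f"-> {100 * (slow - fast) / slow:.1f}% difference")
```

Under 2% difference in average access time, so almost none of a ~40% IPC jump can come from there.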

, and I have the feeling there is more to its improved performance than has been revealed so far.

Almost nothing has been revealed, but yeah, what can explain the improvement? The only things I can think of are:
1. Maybe they've added an extra cache port, supporting 2 loads and a store /cycle
2. Great improvements in memory disambiguation (possibly alleviating bugs in previous version?)
.... and lots of minor improvements.


Cheers

P.S.: Geekbench is still a Mickey Mouse benchmark
 
It's an astonishing jump in IPC - from an already very high IPC. The faster, bigger L2 can't explain more than a few percent of this. If we look at the GCC subtest in SPECint 2000, the miss rate is less than 0.15% with a 1 MB cache - in the noise. That also implies the doubling of off-chip bandwidth has nothing to do with the gain.
Unfortunately the Anandtech article didn't show the latency data in cycles. L3 latency seems to have dropped even when counting in cycles, but L1 and L2 seem to mostly keep up with the clock increases rather than substantially improve on them (difficult to tell from the graph), leaving the increased sizes of L2 and L3 as the other visible improvements. Together with bandwidth improvements this ensures better-fed cores, but as you say, that doesn't seem nearly sufficient to explain more than a small part of the IPC improvement. 42% in gcc is huge!
Almost nothing has been revealed, but yeah, what can explain the improvement? The only things I can think of are:
1. Maybe they've added an extra cache port, supporting 2 loads and a store /cycle
2. Great improvements in memory disambiguation (possibly alleviating bugs in previous version?)
.... and lots of minor improvements.
The improvements to branch mispredict penalties that were revealed are sure to be a factor.
The CPU is definitely worth further study. The other ARM licensees are sure to dissect it in minute detail, but it would be really nice if some more info could be brought into the public domain. Any and all speculation is also interesting, really.

P.S.: Geekbench is still a Mickey Mouse benchmark
And eventually Carthage actually was destroyed. :smile2:
 
Unfortunately the Anandtech article didn't show the latency data in cycles
Just for you guys:

LatencyCycles.png


Avg Latency (6 vs 6s)
L1: 4/3
L2: 19/17
L3: 108/96
DRAM: 260/334
 

He said _cycles_, not ns :). More relevant when looking at architecture...
[edit] OK, so it is cycles. Quite amazing then indeed (except the DRAM latencies).
[deleted some stuff - I only noticed the axis scaling and never even read the title ;-).]
 
Damn Ryan, my eyes are getting all misty here...
Thanks a bunch. That's quite impressive, given the frequency increase. The L1 improvement is a significant IPC boost in and of itself, and they shaved cycles off throughout the hierarchy, even with increased sizes.

mczac, those are cycles, it's just the axis label that is off.
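For reference, converting those cycle counts to wall-clock time, assuming the commonly reported core clocks of ~1.4 GHz for the 6 (A8) and ~1.85 GHz for the 6s (A9) - those clocks are my assumption, not from the graph:

```python
# Cycle latencies from the chart above: (iPhone 6, iPhone 6s).
latencies = {"L1": (4, 3), "L2": (19, 17), "L3": (108, 96), "DRAM": (260, 334)}
clocks_ghz = {"6": 1.4, "6s": 1.85}  # assumed A8 / A9 core clocks

for level, (c6, c6s) in latencies.items():
    ns6 = c6 / clocks_ghz["6"]     # cycles / GHz = nanoseconds
    ns6s = c6s / clocks_ghz["6s"]
    print(f"{level}: {ns6:.1f} ns -> {ns6s:.1f} ns")
```

Interesting detail: in nanoseconds the DRAM latency is roughly flat, so the higher cycle count there is just the faster clock, while the cache levels improved in absolute time as well.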
 
Thanks a bunch. That's quite impressive, given the frequency increase. The L1 improvement is a significant IPC boost in and of itself, and they shaved cycles off throughout the hierarchy, even with increased sizes.

In pure pointer chasing that amounts to 33% higher performance; in something like the GCC subtest it is probably worth 8-12% - still very far from the whole story here.
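Rough arithmetic behind those figures (the 30% critical-path load fraction is just an illustrative assumption):

```python
# A chain of dependent loads completes one load per L1 load-to-use
# latency, so pointer-chase throughput scales with 1/latency.
old_lat, new_lat = 4, 3  # L1 cycles, iPhone 6 vs 6s
speedup = old_lat / new_lat - 1
print(f"Pointer-chase speedup: {speedup:.0%}")

# If roughly 30% of execution time sits on load-latency-bound paths
# (an assumed figure), the whole-program effect is far smaller.
load_frac = 0.3
overall = 1 / (1 - load_frac + load_frac * new_lat / old_lat) - 1
print(f"Rough whole-program gain: {overall:.1%}")
```

That lands around 8%, consistent with the 8-12% ballpark for something gcc-like.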

I'd love to see a micro-benchmark showing how many loads and stores the new core can sustain per cycle.

Cheers
 