Apple A9X SoC

Back on the Intel vs Apple topic, I would think that Intel's designs are a lot more constrained than Apple's with respect to the variety and sheer number of requirements they must meet.
The whole memory architecture has to scale from one core to... many, for example. I also think that accommodating the huge SIMD units Intel packs in takes its toll on the design. Then there is the benchmarking environment: Intel processors are tested against a bunch of different tasks and environments.
Intel is not designing its cores (and uncore) around a proprietary phone and tablet; they aim at a much broader market with a single scalable IP. It sounds like a completely different effort. I'm not sure it is the best way to do it, but that is what they do.
The Atom line is interesting, but Intel does not really put its weight behind it; they iterate quite slowly compared to the mobile manufacturers (and Apple). Here again I suspect they are setting standards for themselves that may be a little out of place, as those are not server chips (though Intel actually does build server chips out of Atom).
Overall, an "issue" I see with Intel's approach, and with how their chips compare to Apple's or those of other mobile manufacturers, is something I could sum up with a GPU-world analogy: trying to ship a competitive IEEE-compliant FP32 GPU at a time when FP16 does the job at a quarter of the cost and any kind of IEEE compliance is a secondary concern. Overall it impacts time to market, costs, power efficiency, etc.
 

The L3 being a victim cache in A9 explains why they killed it in A9X. With the increased pixel count of the display, the GPU would be constantly flushing/thrashing a 4 MB L3: zero benefit for non-zero power/die area.

I'd expect it back in next year's iteration with the LLC being bypassable by the GPU, similar to how Intel changed the memory semantics from Crystal Well to Skylake with Iris Pro.

Cheers
 
I think I missed the documentation or discussion on how the GPU's use of the L3 was measured. Is it certain it cannot already bypass the LLC?
 
I think I missed the documentation or discussion on how the GPU's use of the L3 was measured. Is it certain it cannot already bypass the LLC?

No, but I think the L3 cache was mainly there for GPU performance. CPU benchmarks don't suggest that the iPhone 6s is much better off with it.

That's not to say that it couldn't still be used selectively for the GPU for some memory accesses only, maybe vertex data. But it could be a big SoC design and software shift to do this.

I actually wonder if the L3 cache could even keep up with the 51.2 GB/s memory bandwidth. On A9X it'd need to push a cache line every 3 CPU cycles or so, which is really fast for an LLC on a low-power SoC. If it's actually lower bandwidth than A9X's main memory, that'd be a pretty good reason not to include it.
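
To make the arithmetic behind "every 3 CPU cycles" explicit, here is a quick sketch; the 64-byte line size and the ~2.26 GHz CPU clock are my assumptions, not measured figures.

Code:
#include <stdio.h>

int main(void) {
    /* Assumptions: 64-byte cache lines, A9X CPU cores at ~2.26 GHz. */
    const double bandwidth_bytes_per_s = 51.2e9;
    const double line_size_bytes       = 64.0;
    const double cpu_clock_hz          = 2.26e9;

    double lines_per_s     = bandwidth_bytes_per_s / line_size_bytes; /* 800M lines/s */
    double cycles_per_line = cpu_clock_hz / lines_per_s;              /* ~2.8 cycles  */
    printf("%.2f CPU cycles per cache line\n", cycles_per_line);
    return 0;
}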
 
No, but I think the L3 cache was mainly there for GPU performance. CPU benchmarks don't suggest that the iPhone 6s is much better off with it.

That's what I'm guessing too. It's for increasing effective bandwidth to the GPU and lowering power.

To be effective, the L3 has to be able to cache the various buffers the GPU uses. The devices (tablets) where A9X is to be deployed all have very high resolution, so the L3 would have to be excessively big to be of any use. Since these larger devices have relaxed power consumption constraints, killing the L3 and saving a ton of die area makes sense.
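
As a rough sanity check on that sizing argument, assuming the 12.9" iPad Pro panel (2732x2048) and 4 bytes per pixel (both assumptions on my part), a single framebuffer alone dwarfs a 4 MB L3:

Code:
#include <stdio.h>

int main(void) {
    /* Assumed panel and pixel format: 2732x2048, RGBA8 (4 bytes/pixel). */
    const long width = 2732, height = 2048, bytes_per_pixel = 4;
    double fb_mib = (double)(width * height * bytes_per_pixel) / (1024.0 * 1024.0);
    printf("one framebuffer: %.1f MiB vs. a 4 MiB L3\n", fb_mib);
    return 0;
}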

Cheers
 
Well, the Anandtech article is up, and with it some SPEC2006 data.
Once libquantum is excluded, the other subtests show similar performance to Intel's offerings in this segment, with the wide spread in individual results that can be expected when testing different architectures and compilers. Graphics performance of the A9X at the same power level is higher than that of Intel's counterparts, but that has been well documented elsewhere. That Apple increased the sampling rate of its Pencil from 120 to 240 Hz was already known; it probably didn't happen without justification, so there may be some relevance there for VR and the like.

Now I wonder where the A9X will show up next.
 
Graphics performance of the A9X at the same power level is higher than that of Intel's counterparts, but that has been well documented elsewhere.

In no small part because the A9X is almost 50% larger than Skylake's 2+2 and uses a much more expensive and higher-performing memory subsystem: 128-bit LPDDR4 at 3200 MT/s vs. 128-bit DDR3L at 1600 MT/s, i.e. roughly 51.2 GB/s vs. 25.6 GB/s of peak bandwidth.

Nonetheless, it's impressive how Intel's alien technology is close to being matched in their tablet form factors.
 
If they can get A9X performance into an iPad Air form factor and price this year, it'll be pretty good progress.
 
Spreadsheet with SPECint2006 data from AnandTech and other places:
https://docs.google.com/spreadsheet...fDWXMpn231WXvrlx5zokj8m0Q/edit#gid=1255253279

For some reason I expected a higher overall per-clock improvement from Typhoon to Twister. The overall score of the A9X isn't far from the scores of the similar Core M parts in the comparison if one excludes libquantum, so I could see Apple catching up in a couple of generations.

Also, how does LLVM compare to icc in the context of SPEC performance?
 
For some reason I expected a higher overall per-clock improvement from Typhoon to Twister. The overall score of the A9X isn't far from the scores of the similar Core M parts in the comparison if one excludes libquantum, so I could see Apple catching up in a couple of generations.
That would be interesting, although is it catching up with the Core M or catching up with a Core successor?
A couple generations is a long time for the A9 and its successors to thread the needle of being performant at this target range, but not becoming threatening enough to prod Intel into a Core M architecture that eschews the high-end performance and scalability features that don't help in this range.
There are some signs that Intel is no longer as committed to the one-Core philosophy, and a generous time frame like multiple generations is enough time to see if it strays.
 
Spreadsheet with SPECint2006 data from AnandTech and other places:
https://docs.google.com/spreadsheet...fDWXMpn231WXvrlx5zokj8m0Q/edit#gid=1255253279

For some reason I expected a higher overall per-clock improvement from Typhoon to Twister. The overall score of the A9X isn't far from the scores of the similar Core M parts in the comparison if one excludes libquantum, so I could see Apple catching up in a couple of generations.

Also, how does LLVM compare to icc in the context of SPEC performance?
Libquantum, and to a lesser extent hmmer, are broken by icc. For the purpose of comparing architectures those two at least should have been excluded, or ideally LLVM should have been used for both architectures.
As far as this benchmark run goes, the CPU cores in the iPad Pro and the MacBook actually perform equivalently.
 
Note that Anandtech compiled SPEC targeting 32-bit code on x86. This makes the gcc and mcf scores significantly better than they should be. And I wonder why they didn't use LLVM or GCC for x86; everybody knows Intel has spent a large amount of time and effort tuning icc for SPEC, which makes such comparisons dubious.
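
The mcf effect is easy to illustrate: the benchmark is dominated by chasing pointers through node/arc structs, and with 32-bit pointers those structs shrink by roughly half, so much more of the working set fits in cache. A sketch (field names are illustrative, not the actual SPEC sources):

Code:
#include <stdio.h>

/* A pointer-heavy struct in the spirit of mcf's arc records. */
struct arc {
    struct node *tail;
    struct node *head;
    struct arc  *next;
    long         cost;
};

int main(void) {
    /* 16 bytes when built -m32, 32 bytes when built -m64. */
    printf("sizeof(struct arc) = %zu bytes\n", sizeof(struct arc));
    return 0;
}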
 
The takeaway I got from the Anandtech article is that the A9X is competitive with, but overall a little slower than, Intel's one-generation-old Core M base model, and quite a bit slower than Intel's one-generation-old top-end Core M model, all passively cooled.

Yes, they are catching up, but that's more likely due to there being more low-hanging fruit at Apple's previous performance levels than at Intel's. It doesn't mean Apple's performance will continue to grow at the same rate and ultimately overtake Intel's.
 
As an aside, I'm aware of the outlier that is libquantum with ICC and agree it shouldn't be admissible in a test focused on the hardware itself, but how generalizable are the optimizations Intel has made here? Do they detect the specific code signature of SPEC, or are they sufficiently general that you really might get better performance from real code?

I think Intel's compiler is genuinely better than a lot of others; I've read about a lot of the neat tricks it's able to pull off, like interchanging loops to hoist an unpredictable branch:

http://stackoverflow.com/questions/...-a-sorted-array-faster-than-an-unsorted-array

Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

If you give the Intel Compiler the branchless code, it just out-right vectorizes it... and is just as fast as with the branch (with the loop interchange).
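
For context, the code being discussed looks roughly like this; a sketch reconstructed from that Stack Overflow thread, with illustrative names, assuming input values in [0, 256) as in the original question.

Code:
#include <stddef.h>

/* Benchmark-style kernel: sum all elements >= 128, many times over.
   With random data the branch mispredicts about half the time. ICC's
   loop interchange hoists the data-dependent branch out of the hot
   repetition, defeating the benchmark's intent. */
long sum_branchy(const int *data, size_t n) {
    long sum = 0;
    for (int iter = 0; iter < 100000; ++iter)
        for (size_t c = 0; c < n; ++c)
            if (data[c] >= 128)
                sum += data[c];
    return sum;
}

/* The branchless variant mentioned in the quote: a sign-extended mask
   replaces the if, leaving nothing to mispredict and making the loop
   trivially vectorizable. Assumes arithmetic right shift of signed
   ints, which every mainstream compiler provides. */
long sum_branchless(const int *data, size_t n) {
    long sum = 0;
    for (int iter = 0; iter < 100000; ++iter)
        for (size_t c = 0; c < n; ++c) {
            int t = (data[c] - 128) >> 31; /* 0 if >= 128, else -1 */
            sum += ~t & data[c];
        }
    return sum;
}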

In the real world, as much as iOS is a part of the A9's advantage, I have to say ICC is a part of Intel's.
 
I don't think Apple necessarily needs to get better-than-Intel performance on the iPad Pro, but it does carry a high price, so I guess it needs to be able to run applications that other mobile devices can't.

So the demo they gave was loading a huge AutoCAD model. But drawing with the Pencil is what more people will be drawn to. Do you need a lot of processing power for drawing apps with minimal input lag? Or is it the 4 GB of RAM?
 
As an aside, I'm aware of the outlier that is libquantum with ICC and agree it shouldn't be admissible in a test focused on the hardware itself, but how generalizable are the optimizations Intel has made here? Do they detect the specific code signature of SPEC, or are they sufficiently general that you really might get better performance from real code?
No, not really, which is why I wrote that libquantum and hmmer should probably be excluded. Mcf is an outlier too, by the way, when you compare icc to other compilers.
Look, the SPEC suite has been analyzed by some pretty sharp people. It is well known that Intel does things in their compiler that some justifiably consider cheating, although it formally stays within the limits of the benchmark rules. You can find formal studies comparing the results of Intel's compiler with what is used for producing commercial code, and any number of discussion threads on the web.

Everyone who is dry behind his benchmarking ears knows this, and if you're not, a five-minute Google search should be enough to explain why using Intel's compiler specifically, and indeed mixing different compilers and settings at all, is a lousy idea in a case like this. Unless of course you're trying to make a point rather than investigate it. That is what people like me and laurent are saying: that the choices made here are remarkable. We are actively avoiding asking the question of why such a procedure was chosen.
 
No, it's a genuine question. A five-minute Google search shows icc can parallelize code like libquantum across multiple threads, and that it was a matter of some controversy that the same optimizations weren't showing up for non-Intel CPUs. But it wasn't clear to me whether that counts as a cheat, since the compiler seems to have a lot of tricks up its sleeve that do work in many real situations and that other compilers don't have.
 
It's a cheat because it basically works only for libquantum.
ICC is a good compiler, but the problem with ICC+SPEC is that it has too many SPEC-specific optimisations (this is not just an Intel thing, though; everyone was doing it when SPEC was still very relevant). That's why people tend to look at "difficult" benchmarks such as gcc to get a better idea of how well these CPUs perform. It's not that gcc shares performance characteristics with more applications (no benchmark can claim that), but since it's very unlikely that Intel would do an application-specific optimisation just for you, comparing with gcc is more likely to give meaningful results.
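
To make it concrete, the hot loops in libquantum's gate routines have roughly this shape (paraphrased from memory, not the actual SPEC source). The iterations are independent and the trip count is huge, so icc can auto-parallelize and vectorize it wholesale, which inflates this one subtest without telling you much about anything else:

Code:
/* Sketch of a libquantum-style CNOT gate: walk the whole quantum
   register and conditionally flip the target bit of each basis state.
   Independent iterations over a flat array: ideal for automatic
   parallelization. Struct layout is illustrative. */
struct node {
    unsigned long long state;  /* basis state bits */
    float amp_re, amp_im;      /* complex amplitude */
};

void cnot(struct node *reg, long size,
          unsigned long long control_mask,
          unsigned long long target_mask)
{
    for (long i = 0; i < size; i++)
        if (reg[i].state & control_mask)
            reg[i].state ^= target_mask;
}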
 
I've seen ICC perform optimizations on some code that disappeared when switching to 64-bit, even though they would have made the code faster there too. It was for one of the tests of AnTuTu. I don't trust ICC for anything benchmark-related, and I've never experienced any speedup beyond 5% on my own code (which isn't vectorizable); I've even experienced some slowdowns and various crashes.
 
It's a cheat because it basically works only for libquantum.
ICC is a good compiler, but the problem with ICC+SPEC is that it has too many SPEC-specific optimisations (this is not just an Intel thing, though; everyone was doing it when SPEC was still very relevant). That's why people tend to look at "difficult" benchmarks such as gcc to get a better idea of how well these CPUs perform. It's not that gcc shares performance characteristics with more applications (no benchmark can claim that), but since it's very unlikely that Intel would do an application-specific optimisation just for you, comparing with gcc is more likely to give meaningful results.
Let me reinforce what pcchen is saying here. It is pretty much spot on.

You can use the SPEC suite for a lot of things. Comparisons within an ISA, between ISAs, between compilers - all have their uses and issues.
If you are not comparing compilers (versions/options) explicitly, then you should (a) keep the compiler and settings identical across your sample (if possible), and (b) avoid using a compiler that specifically targets your benchmark, since that alone will invalidate the transferability of your results to the general case.
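
As an illustration of point (a), a SPEC CPU2006 config file can pin the toolchain and flags once for every subtest and every machine under test. A hypothetical fragment (paths and flag choices are mine, not from any of the runs discussed here):

Code:
# Hypothetical SPEC CPU2006 config fragment: one compiler, one set of
# flags, applied uniformly to every benchmark on both machines.
CC          = /usr/bin/clang
CXX         = /usr/bin/clang++
COPTIMIZE   = -O2
CXXOPTIMIZE = -O2
# No per-benchmark overrides and no benchmark-specific trickery, so
# score differences reflect the hardware, not the compiler.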

Note that even though the A9X beats the MacBook in gcc, and even though he argues that gcc may be the most reliably consistent subtest, pcchen does not draw any wide-reaching conclusions about A9X vs. x86 performance, but rather notes the limitations of benchmarking. His remarks are doubly true when comparing across architectures. Not all of us have an agenda; I was genuinely interested to see if SPEC2006 could show more about the relative strengths and weaknesses of these processors. It's a bit of a lost opportunity.
 