Apple A9X SoC

> No, it's a genuine question; a 5-minute Google search shows icc can parallelize code like libquantum across multiple threads, and that it was a matter of some controversy that the same optimizations weren't showing up for non-Intel CPUs.
libquantum uses AOS (array of structs) in its internal data structures; icc magically transforms this to SOA, which is why it is much faster than alternative compilers. It is not a generally applicable transform, which renders icc scores for libquantum pointless.
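A minimal sketch of the AOS-to-SOA difference (the struct and field names here are hypothetical, only loosely modeled on a quantum register, not libquantum's actual definitions):

```c
/* Array of structs: the fields of each element are interleaved in
 * memory, so a loop touching only `state` wastes half of every cache
 * line and needs strided loads to vectorize. */
struct node_aos {
    unsigned long long state;     /* basis state          */
    float              amplitude; /* associated amplitude */
};
struct node_aos reg_aos[1024];

/* Struct of arrays: each field is contiguous, so the same loop
 * streams through memory with unit stride and vectorizes with plain
 * vector loads. */
struct reg_soa {
    unsigned long long state[1024];
    float              amplitude[1024];
};
```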

Autopar is allowed and all compilers use it. The reason the gcc subtest score is typically quoted as a measure of single-thread performance is because it is the most resistant to any shenanigans (like autopar).
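For context, a sketch of the kind of loop autopar targets (flags such as icc's -parallel or gcc's -ftree-parallelize-loops=N apply this automatically):

```c
/* No cross-iteration dependences: an auto-parallelizing compiler can
 * prove the iterations independent and split them across threads
 * without any source changes. */
void scale(double *restrict dst, const double *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0 * src[i];
}
```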

Cheers
 
Thanks for the detailed replies re: icc. It was meant to be an aside, and I agree icc should be replaced, probably by an LLVM-based compiler with the same flags across all hardware being tested (if the point of the test is to be a hardware test). I like the idea behind the gcc subtest; presumably, if a compiler could substantially speed up gcc's compilation of arbitrary code, you could turn it toward compiling its own source code and keep going... (a joke, don't kill me).

> libquantum uses AOS (array of structs) in its internal data structures; icc magically transforms this to SOA, which is why it is much faster than alternative compilers. It is not a generally applicable transform, which renders icc scores for libquantum pointless.
>
> Autopar is allowed and all compilers use it. The reason the gcc subtest score is typically quoted as a measure of single-thread performance is because it is the most resistant to any shenanigans (like autopar).
>
> Cheers

I take it that makes the memory locality during execution much better than it otherwise would be?
 
Very interesting read, thanks.

The store-to-load forwarding optimization for hmmer mentioned early on would be taken care of by a hardware feature on many architectures, right? It could add a bit of register pressure in some cases...
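A hypothetical sketch of the pattern (not the actual hmmer kernel) and the register-carrying rewrite that avoids the reload at the cost of two more live values:

```c
/* Store followed by a reload of the same location: hardware
 * store-to-load forwarding usually hides this. */
void update(int *mc, int *dc, int n)
{
    for (int i = 1; i < n; i++) {
        mc[i] = mc[i - 1] + 1;
        dc[i] = mc[i] + dc[i - 1];   /* reloads the value just stored */
    }
}

/* Compiler (or programmer) keeps the values in registers instead:
 * no reload, but two more values live across the loop body. */
void update_reg(int *mc, int *dc, int n)
{
    int m = mc[0], d = dc[0];
    for (int i = 1; i < n; i++) {
        m += 1;
        mc[i] = m;   /* store still happens for later consumers */
        d += m;      /* but the next iteration never reloads it  */
        dc[i] = d;
    }
}
```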

The interprocedural fusing of loops seems like a reasonable idea, if you're a conscientious programmer. It's impressive they got a compiler to do it: it involves hoisting several in-method conditionals to global variables, and it's far from obvious from the code itself that it would be possible or beneficial without a lot of runtime analysis. Heroic indeed, but I can see how the applicability might be so narrow that it's considered a cheat.
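A minimal sketch of the fusion idea itself, leaving out the hoisted conditionals (these are not libquantum's actual gate functions):

```c
/* Two separate passes over the same large array, behind call
 * boundaries... */
void gate_a(float *amp, int n) { for (int i = 0; i < n; i++) amp[i] *= 0.5f; }
void gate_b(float *amp, int n) { for (int i = 0; i < n; i++) amp[i] += 1.0f; }

/* ...fused interprocedurally into one pass, so the array is streamed
 * through the cache once instead of twice. */
void gate_fused(float *amp, int n)
{
    for (int i = 0; i < n; i++)
        amp[i] = amp[i] * 0.5f + 1.0f;
}
```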
 
> I take it that makes the memory locality during execution much better than it otherwise would be?

It also makes it easier to vectorize, especially if the uarch doesn't have strided loads. AArch64 has de-interleaving loads, but they add a ton of register pressure; who knows if a compiler like LLVM would try to use them for something like this.
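A sketch of the difference, using a hypothetical two-field struct: with AOS, picking one field out of consecutive elements is an interleaved access that maps to LD2-style de-interleaving loads on AArch64, half of whose result registers are immediately discarded.

```c
struct node { float re, im; };   /* hypothetical interleaved layout */

/* AOS: stride-2 access; vectorizing needs LD2, which also loads the
 * unwanted `im` lanes and burns extra registers. */
float sum_re_aos(const struct node *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i].re;
    return s;
}

/* SOA: unit-stride access; plain vector loads, no de-interleaving. */
float sum_re_soa(const float *re, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += re[i];
    return s;
}
```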

> The interprocedural fusing of loops seems like a reasonable idea, if you're a conscientious programmer. It's impressive they got a compiler to do it: it involves hoisting several in-method conditionals to global variables, and it's far from obvious from the code itself that it would be possible or beneficial without a lot of runtime analysis. Heroic indeed, but I can see how the applicability might be so narrow that it's considered a cheat.

Is there any indicator that they've actually implemented these optimizations in a general form?

They've identified transformations that would greatly improve hmmer and libquantum, but they haven't shown that they've done more than perform these transformations by hand. If they had implemented this in the compiler, I think they would have provided measurements for how it affects something else, anything else really.

It's relatively easy to look at a benchmark with a very limited hot section that's been poorly optimized, then point out how it could be optimized. Actually getting a compiler to do it in a way that's both safe and generally not a performance degradation is another thing.
 
It looks like the original optimization is rather specific to this particular benchmark as implemented in ICC (at least the post below seems to indicate this, and it's unclear to me what exactly triggers it in ICC), but it seems to have inspired work on some rather general loop fusion optimizations in LLVM:

http://lists.llvm.org/pipermail/llvm-dev/2015-January/080809.html

and the slides also seem to mention some generic qualifiers for enabling this optimization:

  • Whole program visibility
  • Alias analysis (GlobalModRef)
  • Cost model

and

"Techniques needed for ‘heroics’ can be generalized to advance optimization technology"
 
Apple announced a 9.7" version of the iPad Pro at an event today.

Phil Schiller gave two pieces of information about the A9X that I haven't seen before:
  1. It has over 3 billion transistors.
  2. The GPU has over 0.5 TFLOPS. (I assume this is for both iPad Pros.)
(Go to 50:25 in the video for the first piece and 50:50 for the second piece.)

For 12 PowerVR Series 7 clusters, 0.5 FP16 TFLOPS implies 326 MHz and 0.5 FP32 TFLOPS implies 651 MHz.
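The arithmetic behind those clocks, assuming the Series7XT per-cluster rate of 64 FP32 FLOPS per clock with FP16 at twice that (the per-clock figures are my assumption, chosen to be consistent with the numbers above):

```c
#include <stdio.h>

int main(void)
{
    const double tflops      = 0.5e12;  /* claimed 0.5 TFLOPS             */
    const double clusters    = 12.0;
    const double fp32_per_ck = 64.0;    /* FLOPS/clock/cluster (assumed)  */
    const double fp16_per_ck = 128.0;   /* FP16 at double rate (assumed)  */

    printf("FP16 clock: %.0f MHz\n", tflops / (clusters * fp16_per_ck) / 1e6);
    printf("FP32 clock: %.0f MHz\n", tflops / (clusters * fp32_per_ck) / 1e6);
    return 0;
}
/* prints ~326 MHz and ~651 MHz, matching the figures above */
```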
 
Yeah, he also said the GPU is more powerful than the one in the Xbox 360.

Wonder how much more improvement they can make every year. Does the PowerVR roadmap show it catching up to the current generation of consoles in a battery-powered device?
 
> Yeah, he also said the GPU is more powerful than the one in the Xbox 360.
>
> Wonder how much more improvement they can make every year. Does the PowerVR roadmap show it catching up to the current generation of consoles in a battery-powered device?

I'd say this has more to do with Apple's ambitions and lithographic technique than with roadmaps. At 10nm FF with InFO, the capability to get close to the XB1 in a new 12.9" iPad would seem to be there, and it can only be helped by any architectural advances. The lithographic process will be here in a year. Whether Apple will keep pushing remains to be seen, but why should they relent?
 
> I wouldn't take for granted that the 12.9 and 9.7 iPad share the same clocks. We'll see soon enough.

Well, both of them could be > 0.5 TFLOPS even if the clocks are different.

And it looks like they are different: Apple's compare page shows different numbers for the two sizes.

iPad Pro 12.9"
  • CPU 2.5x A7
  • GPU 5x A7
iPad Pro 9.7"
  • CPU 2.4x A7 [~0.96x the 12.9", ~2.17 GHz]
  • GPU 4.3x A7 [~0.86x the 12.9"]
 
I don't know what the frequency of the 12.9" A9X GPU is, but I'll take Apple's marketing at its word that it's 360 times more powerful than the original iPad GPU.

2 GFLOPS FP16 * 360 = 720 GFLOPS FP16, which would put it at about 470 MHz. If the 9.7" GPU has a lower frequency, then it could be somewhere in the 400+ MHz ballpark, which means >310 GFLOPS FP32 or >620 GFLOPS FP16, and yes, if you turn a blind eye to the missing bandwidth among other things, it is more powerful than the original Xbox GPU.

Marketing set the bar somewhere around the GT7600 mark in the past, though obviously not at frequencies as humble as those above: http://images.anandtech.com/doci/8706/7XT_SKUs.png
 
http://arstechnica.com/apple/2016/03/a-day-with-the-9-7-inch-ipad-pro-and-its-accessories/

Geekbench and GFXBench results are up. Overall the 9.7" iPad Pro has 95% of the single core CPU, 96% of the multi core CPU, 73% of the offscreen T-Rex HD, and 64% of the offscreen Manhattan HD performance of the 12.9" iPad Pro.

Interestingly, the memory performance in Geekbench is only 81% (single core) and 79% (multi core) of the 12.9" iPad Pro's. Presumably Geekbench is not sensitive to the reduced RAM capacity, so Apple might be using slower LPDDR4.
 
> http://arstechnica.com/apple/2016/03/a-day-with-the-9-7-inch-ipad-pro-and-its-accessories/
>
> Geekbench and GFXBench results are up. Overall the 9.7" iPad Pro has 95% of the single core CPU, 96% of the multi core CPU, 73% of the offscreen T-Rex HD, and 64% of the offscreen Manhattan HD performance of the 12.9" iPad Pro.
>
> Interestingly, the memory performance in Geekbench is only 81% (single core) and 79% (multi core) of the 12.9" iPad Pro's. Presumably Geekbench is not sensitive to the reduced RAM capacity, so Apple might be using slower LPDDR4.

According to AnandTech, the iPad Pro (big boy edition) only uses 1600 MHz LPDDR4, which is pretty much entry level. Given the large difference in both the memory benchmark and the GPU benches, could we be looking at both an underclocked GPU and a drop down to a 64-bit memory bus? A slightly underclocked LPDDR4 doesn't seem drastic enough for the perf drop recorded with the 9.7 Pro.
 
> According to AnandTech, the iPad Pro (big boy edition) only uses 1600 MHz LPDDR4, which is pretty much entry level. Given the large difference in both the memory benchmark and the GPU benches, could we be looking at both an underclocked GPU and a drop down to a 64-bit memory bus? A slightly underclocked LPDDR4 doesn't seem drastic enough for the perf drop recorded with the 9.7 Pro.
Entry level? It uses LPDDR4-3200, which is anything but entry level.
 
Could the 9.7in Pro have 2GB RAM because of space considerations? I.e., 2GB can fit on the package but 4GB cannot?
 
> Entry level? It uses LPDDR4-3200, which is anything but entry level.

Isn't the 3200 in the LPDDR4 label the megatransfers per second, not the MHz it is clocked at, which I believe is 1600 MHz, as I wrote? Entry level may be a little incorrect, but in terms of LPDDR4, is anybody shipping anything lower than 1600 MHz (3200 MT/s), which was JEDEC's stated launch speed?
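For reference: DDR signals data on both clock edges, so the label counts transfers rather than clocks. A quick sketch of the arithmetic (the 64-bit channel width is just an example for the calculation, not a claim about the A9X's bus):

```c
#include <stdio.h>

int main(void)
{
    const double io_clock_mhz = 1600.0;          /* I/O clock               */
    const double mt_s = io_clock_mhz * 2.0;      /* both edges: 3200 MT/s   */
    const double bus_bytes = 8.0;                /* 64-bit channel (assumed) */

    printf("%.0f MT/s -> %.1f GB/s peak\n",
           mt_s, mt_s * 1e6 * bus_bytes / 1e9);  /* 3200 MT/s -> 25.6 GB/s */
    return 0;
}
```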
 
> Could the 9.7in Pro have 2GB RAM because of space considerations? I.e., 2GB can fit on the package but 4GB cannot?
According to iFixit, Apple used 16Gb (2GB) RAM modules in the 12.9 Pro, so they only need 2 modules per PCB to spec it to 4GB. The iPad Air 2 also had 2 RAM modules per PCB, so that should not be the limiting factor. If I were cynical, I'd say they didn't want to steal too much thunder from the flagship 12.9-inch version.
 
Or they just wanted to keep costs low for a SKU that would be priced lower but ship in much greater volumes, so the cost savings per unit are multiplied over greater unit volumes.

That is why Apple is rolling in cash.
 