Apple A9X SoC

Discussion in 'Mobile Devices and SoCs' started by tangey, Nov 8, 2015.

  1. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    libquantum uses AoS (array of structs) in its internal data structures; icc magically transforms this to SoA (struct of arrays), which is why it is much faster than alternative compilers. It is not a transform that is applicable generally, and thus it renders icc scores for libquantum pointless.
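    A rough sketch of what that transform buys you (field names are illustrative, not libquantum's actual definitions — its node really carries a complex amplitude plus a state word):

    ```c
    /* AoS: each element is a struct, so a loop that only touches
       `state` still strides over the whole struct in memory. */
    struct node_aos {
        float amp_re, amp_im;
        unsigned long long state;
    };

    /* SoA: each field gets its own array, so the same loop becomes
       a unit-stride, trivially vectorizable stream. */
    struct reg_soa {
        float *amp_re, *amp_im;
        unsigned long long *state;
    };

    /* Hot loop on AoS: strided accesses through 24-byte structs. */
    void toggle_aos(struct node_aos *n, int size, unsigned long long mask) {
        for (int i = 0; i < size; i++)
            n[i].state ^= mask;
    }

    /* Same loop on SoA: contiguous loads and stores only. */
    void toggle_soa(struct reg_soa *r, int size, unsigned long long mask) {
        for (int i = 0; i < size; i++)
            r->state[i] ^= mask;
    }
    ```

    Both functions compute the same thing; the point is purely the memory layout the inner loop sees.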

    Autopar is allowed and all compilers use it. The reason the gcc subtest score is typically quoted as a measure of single-thread performance is that it is the most resistant to shenanigans like autopar.

    Cheers
     
    Raqia likes this.
  2. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Thanks for the detailed replies re: icc. It was meant to be an aside, and I agree icc should be replaced, probably by an LLVM-targeted compiler with the same flags across all hardware being tested (if the point of the test is to be a hardware test). I like the idea behind the gcc subtest; presumably if a compiler could optimize gcc substantially on compilation of arbitrary code, you could turn it toward compiling its own source code and keep going... (a joke, don't kill me).

    I take it that makes the memory locality during execution much better than it otherwise would be?
     
    #102 Raqia, Jan 25, 2016
    Last edited: Jan 25, 2016
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Yes.

    Apparently the biggest gain in libquantum comes from interprocedural fusing of loops. An LLVM presentation on the matter, here
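    A minimal sketch of what loop fusion does (toy code, not libquantum's): two passes over the same array collapse into one, halving the memory traffic. In libquantum the two loops live in different functions, which is what makes the fusion "interprocedural" — the compiler has to see across the call boundary first.

    ```c
    #include <stddef.h>

    /* Before fusion: two separate traversals of the same array. */
    void scale_then_shift(float *a, size_t n, float s, float d) {
        for (size_t i = 0; i < n; i++) a[i] *= s;
        for (size_t i = 0; i < n; i++) a[i] += d;
    }

    /* After fusion: one traversal, each element loaded and stored once. */
    void scale_then_shift_fused(float *a, size_t n, float s, float d) {
        for (size_t i = 0; i < n; i++) a[i] = a[i] * s + d;
    }
    ```

    For an array that doesn't fit in cache, the fused version roughly halves the DRAM traffic, which is where libquantum's big win is said to come from.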

    Cheers
     
    fuboi likes this.
  4. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Very interesting read, thanks.

    The store-to-load forwarding optimization for hmmer mentioned early on would be taken care of by a hardware feature on many architectures, right? It could add a bit of register pressure in some cases...
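    The pattern in question looks something like this (a toy stand-in, not hmmer's actual kernel): each iteration reloads a value the previous iteration just stored. Hardware store-to-load forwarding makes the reload cheap-ish, but the compiler can remove it entirely by carrying the value in a register.

    ```c
    /* Naive version: the load of out[i-1] hits the store from the
       previous iteration, relying on store-to-load forwarding. */
    int running_max_naive(const int *in, int *out, int n) {
        out[0] = in[0];
        for (int i = 1; i < n; i++) {
            int prev = out[i - 1];   /* reloads what we just stored */
            out[i] = in[i] > prev ? in[i] : prev;
        }
        return out[n - 1];
    }

    /* Register-carried version: no reload at all, at the cost of one
       extra live register across the loop (the pressure you mention). */
    int running_max_reg(const int *in, int *out, int n) {
        int carry = in[0];
        out[0] = carry;
        for (int i = 1; i < n; i++) {
            if (in[i] > carry) carry = in[i];
            out[i] = carry;
        }
        return carry;
    }
    ```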

    The interprocedural fusing of loops seems like a reasonable idea, if you're a conscious programmer. It's impressive they got a compiler to do it: it involves the hoisting of several in-method conditionals to global variables, and it's far from obvious from the code itself that it would be possible or be a benefit without a lot of runtime analysis. Heroic indeed, but I can see how the applicability might be so narrow that it's considered a cheat.
     
    #104 Raqia, Jan 25, 2016
    Last edited: Jan 25, 2016
  5. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    It also makes it easier to vectorize, especially if the uarch doesn't have strided loads. AArch64 has de-interleaving loads but they add a ton of register pressure, who knows if a compiler like LLVM would try to use them for something like this.

    Is there any indicator that they've actually implemented these optimizations in a general form?

    They've identified transformations that would greatly improve hmmer and libquantum, but they haven't shown that they've done more than perform these transformations by hand. If they had implemented this in the compiler, I think they would have provided measurements for how they affect something else, anything else really.

    It's relatively easy to look at a benchmark with a very limited hot section that's been poorly optimized, then point out how it could be optimized. Actually getting a compiler to do it in a way that's both safe and generally not a performance degradation is another thing.
     
  6. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    It looks like the original optimization is rather specific to this particular benchmark as implemented in ICC (at least the post below seems to indicate this and it's unclear to me what exactly triggers it in ICC), but it seems to have inspired work on some rather general loop fusion optimizations in LLVM:

    http://lists.llvm.org/pipermail/llvm-dev/2015-January/080809.html

    and the slides also seem to mention some generic qualifiers for enabling this optimization.

     
    #106 Raqia, Jan 25, 2016
    Last edited: Jan 25, 2016
  7. wishiknew

    Regular

    Joined:
    May 19, 2004
    Messages:
    341
    Likes Received:
    9
    According to Johny Srouji's interview in Bloomberg, the A9X was moved up by half a year.
     
    Raqia likes this.
  8. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    Apple announced a 9.7" version of the iPad Pro at an event today.

    Phil Schiller gave two pieces of information about the A9X that I haven't seen before:
    1. It has over 3 billion transistors.
    2. The GPU has over 0.5 TFLOPS. (I assume this is for both iPad Pros.)
    (Go to 50:25 in the video for the first piece and 50:50 for the second piece.)

    For 12 PowerVR Series 7 clusters, 0.5 FP16 TFLOPS implies 326 MHz and 0.5 FP32 TFLOPS implies 651 MHz.
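    Back-of-envelope for those clocks, assuming the commonly cited Series7 layout of 32 FP32 lanes per cluster, each doing one fused multiply-add (2 FLOPs) per cycle, with FP16 at twice the FP32 rate:

    ```c
    /* Required clock for a given throughput target.
       flops_per_lane_cycle: 2 for FP32 (one MAD), 4 for FP16 (2x rate). */
    double required_mhz(double tflops, int clusters, int flops_per_lane_cycle) {
        double flops_per_cycle = clusters * 32.0 * flops_per_lane_cycle;
        return tflops * 1e12 / flops_per_cycle / 1e6; /* clock in MHz */
    }
    ```

    With 12 clusters that gives ~651 MHz for 0.5 FP32 TFLOPS and ~326 MHz for 0.5 FP16 TFLOPS, matching the figures above.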
     
  9. wco81

    Legend

    Joined:
    Mar 20, 2004
    Messages:
    6,920
    Likes Received:
    630
    Location:
    West Coast
    Yeah, he also said the GPU is more powerful than the one in the Xbox 360.

    Wonder how much more improvement they can make every year. Does the PowerVR roadmap show it catching up to the current generation of consoles in a battery-powered device?
     
  10. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    I wouldn't take for granted that the 12.9 and 9.7 iPad share the same clocks. We'll see soon enough.
     
  11. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    I'd say this has more to do with Apple's ambitions and lithographic technique than roadmaps. At 10nm FF with InFO, the capability to get close to the XB1 in a new 12.9 iPad would seem to be there, and it can only be helped by any architectural advances. The lithographic process will be here in a year. Whether Apple will keep pushing remains to be seen, but why should they relent?
     
  12. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    Well both of them could be > 0.5 TFLOPS even if the clocks are different.

    And it looks like they are different: Apple's compare page shows different numbers for the two sizes.

    iPad Pro 12.9"
    • CPU 2.5x A7
    • GPU 5x A7
    iPad Pro 9.7"
    • CPU 2.4x A7 [~0.96x the 12.9", ~2.17 GHz]
    • GPU 4.3x A7 [~0.86x the 12.9"]
     
  13. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    I don't know what the frequency of the 12.9" A9X GPU is, but I'll take Apple's marketing at its word that it's 360 times more powerful than the original iPad's GPU.

    2 GFLOPs FP16 * 360 = 720 GFLOPs FP16, which would put it at about 470MHz. If the 9.7" GPU has a lower frequency, then it could be somewhere in the 400+MHz ballpark, which means >310GFLOPs FP32 or >620GFLOPs FP16, and yes, if you turn a blind eye to the missing bandwidth amongst other things, it is more powerful than the original XBox GPU.
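    A quick sanity check on that chain of numbers, assuming the original iPad's SGX535 at the oft-quoted ~2 GFLOPs and a 12-cluster part doing 12 × 32 × 2 (MAD) × 2 (FP16 rate) = 1536 FP16 FLOPs per cycle:

    ```c
    /* FP16 throughput in GFLOPs for a 32-lane-per-cluster design
       with MAD issue and double-rate FP16 (assumed, as above). */
    double fp16_gflops(double mhz, int clusters) {
        return mhz * 1e6 * clusters * 32.0 * 2.0 * 2.0 / 1e9;
    }
    ```

    At 470 MHz that comes out to ~722 GFLOPs FP16, consistent with 2 GFLOPs × 360.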

    Marketing set the bar somewhere around the GT7600 mark in the past, however obviously not with as humble frequencies as above: http://images.anandtech.com/doci/8706/7XT_SKUs.png
     
    #113 Ailuros, Mar 22, 2016
    Last edited: Mar 22, 2016
  14. ltcommander.data

    Regular

    Joined:
    Apr 4, 2010
    Messages:
    616
    Likes Received:
    15
    http://arstechnica.com/apple/2016/03/a-day-with-the-9-7-inch-ipad-pro-and-its-accessories/

    Geekbench and GFXBench results are up. Overall the 9.7" iPad Pro has 95% of the single core CPU, 96% of the multi core CPU, 73% of the offscreen T-Rex HD, and 64% of the offscreen Manhattan HD performance of the 12.9" iPad Pro.

    Interestingly, the memory performance in Geekbench is only 81% (single core) and 79% (multi core) of the 12.9" iPad Pro's. Presumably Geekbench is not sensitive to the reduced RAM capacity, so Apple might be using slower LPDDR4.
     
  15. Turbotab

    Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    214
    Likes Received:
    3
    According to AnandTech, the iPad Pro (big boy edition) only uses 1600 MHz LPDDR4, which is pretty much entry level. Given the large difference in both the memory benchmarks and the GPU benches, could we be looking at both an underclocked GPU and a drop down to a 64-bit memory bus? Slightly underclocked LPDDR4 doesn't seem drastic enough for the perf drop recorded with the 9.7 Pro.
     
  16. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    Entry level? It uses LPDDR4-3200, which is anything but entry level.
     
  17. Kaarlisk

    Regular Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    Could the 9.7in Pro have 2GB RAM because of space considerations? I.e. 2GB can fit on the package, 4GB cannot?
     
  18. Turbotab

    Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    214
    Likes Received:
    3
    Isn't the 3200 in the LPDDR4 label the megatransfers per second, not the MHz it is clocked at, which I believe is 1600 MHz, as I wrote? Entry level may be a little incorrect, but is anybody shipping LPDDR4 at anything lower than 1600 MHz (3200 MT/s), which was JEDEC's stated launch speed?
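    The MT/s-vs-MHz distinction, and what a narrower bus would cost, in one line of arithmetic: DDR signalling transfers data on both clock edges, so "LPDDR4-3200" means 3200 MT/s on a 1600 MHz I/O clock, and peak bandwidth is transfers per second times bus width in bytes.

    ```c
    /* Peak theoretical bandwidth in GB/s from the data rate (MT/s)
       and bus width (bits). */
    double peak_gbs(double mts, int bus_bits) {
        return mts * 1e6 * (bus_bits / 8.0) / 1e9;
    }
    ```

    For LPDDR4-3200 that's 51.2 GB/s on a 128-bit bus versus 25.6 GB/s on a 64-bit bus, which is why a bus cut would show up so clearly in the memory benchmarks.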
     
  19. Turbotab

    Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    214
    Likes Received:
    3
    According to iFixit, Apple used 16Gb (2GB) RAM modules in the 12.9 Pro, so they only needed 2 modules per PCB to spec it to 4GB. The iPad Air 2 also had 2 RAM modules per PCB, so that should not be the limiting factor. If I were cynical, I'd say they didn't want to steal too much thunder from the flagship 12.9 inch version.
     
  20. wco81

    Legend

    Joined:
    Mar 20, 2004
    Messages:
    6,920
    Likes Received:
    630
    Location:
    West Coast
    Or they just wanted to keep costs low for a SKU which would be priced lower but have much greater volumes. So the cost savings per unit is multiplied over greater unit volumes.

    That is why Apple is rolling in cash.
     
    Grall likes this.