Nebuchadnezzar
You can just extract the images from the report, here's the native res:
http://i.imgur.com/tAdjKs3.png
Nice.
Got the answers I desired regarding latencies with larger caches. Combined with the architectural improvements that were found, it was a worthwhile wait for the article. (Of course the SoC is only a small part of the full review.) The SPEC2000 data was quite juicy. It might have been interesting to compare that with other (classes of) CPUs, but of course the interested reader can do that on their own. Very impressive gains, and scores. (Pretty much settles the argument that Geekbench is a toy benchmark that overestimates Apple's ARM designs, when Apple's work on the memory hierarchy makes SPEC improve significantly more.) As was mentioned in this review, the iPad Pro review almost requires an x86 comparison, but I can also understand if you want to avoid the resulting controversy.
Looking at those SPEC numbers, I really wish Apple were more forthcoming with technical information. But kudos to the reviewers for everything they managed to extract and present!
Thank goodness the iPad Pro has 4 GB of RAM. It's unfortunate they couldn't run Spec2k6.
A common criticism against Geekbench is that it consists of small, largely cache-resident code snippets, has small data sets, and doesn't exercise the memory hierarchy like a real man, err.. code would. There is actually a bit of truth to this, and Geekbench addresses it by having dedicated memory benchmarks. (Of course, there are also upsides to small "core" benchmarks in this day and age, as long as you know what you're after. They execute quickly on all platforms, and are thus practical to run and largely avoid thermal throttling issues (*cough*).)

Re: Geekbench and SPEC: these are SPEC 2000 numbers. Everyone knows that the memory hierarchy has an outsized impact in SPEC 2000; that's why it was retired and replaced with SPEC 2006.
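The cache-resident criticism above can be put in concrete terms with a toy model: a benchmark whose working set fits in cache never pays DRAM latency, no matter how long it runs. The sizes and latencies below are illustrative assumptions, not measured A9 figures.

```python
# Toy model: effective memory latency as a function of working-set size.
# Capacities and latencies are illustrative assumptions, not measured A9 values.

CACHES = [
    ("L1", 64 * 1024, 3),          # (level, capacity in bytes, latency in cycles)
    ("L2", 3 * 1024 * 1024, 17),
    ("DRAM", float("inf"), 300),
]

def effective_latency(working_set_bytes):
    """Latency of the smallest level that can hold the whole working set."""
    for level, capacity, latency in CACHES:
        if working_set_bytes <= capacity:
            return level, latency

# A small "core" benchmark (32 KB working set) never leaves L1:
print(effective_latency(32 * 1024))          # ('L1', 3)
# A memory-heavy workload (64 MB working set) pays DRAM latency:
print(effective_latency(64 * 1024 * 1024))   # ('DRAM', 300)
```

This is why a suite of small snippets and a suite with big working sets can rank the same memory subsystems very differently.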
It's unfortunate they couldn't run Spec2k6.
If you want to get the best score, you'll have to compile some of the benchmarks in 32-bit mode (cf. Intel results on spec.org), and in that case it's possible all of SPEC 2006 could be run with 2 GB of RAM (as you didn't say which of the benchmarks didn't fit, I can't guarantee this will work; if it's mcf, then it's one of the benchmarks that should be compiled for 32-bit, and its memory usage will be halved).
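A rough sketch of why a 32-bit build shrinks a pointer-heavy benchmark like mcf: pointer fields drop from 8 to 4 bytes. The node count and field mix below are made up for illustration; only the pointer widths are real.

```python
# Rough estimate of how pointer width affects a pointer-heavy data structure
# such as mcf's network graph. The node count and field mix are made up for
# illustration; only the 8-byte vs 4-byte pointer sizes are real.

def node_bytes(pointer_size, n_pointers=6, n_ints=4, int_size=4):
    """Approximate size of one graph node: a few pointers plus a few ints."""
    return n_pointers * pointer_size + n_ints * int_size

nodes = 10_000_000  # hypothetical node count

mem64 = nodes * node_bytes(8) / 2**20  # 64-bit build, MiB
mem32 = nodes * node_bytes(4) / 2**20  # 32-bit build, MiB
print(f"64-bit: {mem64:.0f} MiB, 32-bit: {mem32:.0f} MiB")
```

The pointer fields themselves halve exactly; the overall saving depends on what fraction of each structure is pointers, which is why an almost-all-pointers benchmark like mcf benefits so strongly.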
x86 fanboys will still complain that Geekbench favors crypto instructions too much. That's a valid complaint, but it's not enough to dismiss Geekbench.

People arguing the superiority of Intel's x86 processors like to claim that the memory-light aspect of Geekbench for some reason causes Apple's SoCs to be "unfairly" favoured (for some reason typically not only over x86, but also over other ARM implementations with much weaker memory subsystems).
The SPECint2000 scores demonstrate even greater gains than Geekbench, so those who argue that the benchmark score advances of the A9 are due to the small-sized subtests of Geekbench should, in a perfect world, be silenced. Of course, in the real world in which we live, no such thing will happen.
Come on, SPEC 2006 has no issue, in particular when compiled with icc

I don't agree with you on the reasons for SPECint2006, by the way, but the politics of the SPEC suite is a waste of time here. The 2006 version is surrounded by controversies of its own. I do agree that it would be interesting to see SPECint2006 (and fp) for the iPad Pro's A9X.
In my experience, when you compile for x86-64 and AArch64 with close enough versions of gcc, you get very similar dynamic instruction counts, close dynamic code size (~5-10% advantage to x86 here), and close memory usage. Comparison is definitely possible.

Cross-architecture benchmarking is both very interesting and a terrible can of worms. It's impossible to do with any kind of accuracy unless you test a specific application, which of course renders the result pointless for general conclusions. If on top of that the processors are targeted at different workloads.... The iPhone 6s article specifically refers to A9X vs x86 Skylake, which could be said to address similar markets, although the power draws of the Surface 4 and the iPad are likely quite different, and power is the limiting factor for the performance of these products. If Anandtech decides to do that comparison explicitly, small differences will be wildly overinterpreted, they will be accused of partisanship, and the more technically minded, who are the only meaningful audience, will get bogged down in what compilers and compiler switches were used.
(*cough*)

Come on, SPEC 2006 has no issue, in particular when compiled with icc
This is pretty much the way to do it - pick as level and reasonably realistic a common compiler baseline as possible, don't worry too much about absolute scores, note generational trends within the respective architectures, and note areas of clear differences between archs. That's interesting and might even have predictive value! Beyond that, though, the trouble starts. It is very difficult not to overinterpret whatever numbers you have in front of you, and to forget all the data that is absent.

In my experience, when you compile for x86-64 and AArch64 with close enough versions of gcc, you get very similar dynamic instruction counts, close dynamic code size (~5-10% advantage to x86 here), and close memory usage. Comparison is definitely possible.
All of the above said, inexact science though it may be, benchmarking is interesting. The A9 is remarkable CPU-wise, and I have the feeling there is more to its improved performance than has been revealed so far.
Unfortunately the Anandtech article didn't show the latency data in cycles. L3 latency seems to have dropped even when counting in cycles, but L1 and L2 seem to mostly keep up with the clock increases rather than improve substantially further (difficult to tell from the graph), leaving the increased sizes of L2 and L3 as the other visible improvements. Together with bandwidth improvements this ensures better-fed cores, but as you say that doesn't seem nearly sufficient to explain more than a small part of the IPC improvement. 42% in gcc is huge!

It's an astonishing jump in IPC - from an already very high IPC. The faster, bigger L2 can't explain more than a few percent of this. If we look at the gcc subtest in SPECint 2000, the miss rate is less than 0.15% with a 1 MB cache - in the noise. That also implies the doubling of off-chip bandwidth has nothing to do with the gain.
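A back-of-envelope check on that miss-rate argument: at the quoted <0.15% miss rate, even a generous DRAM penalty adds only a sliver of CPI. The miss penalty and the memory-references-per-instruction figure below are assumed values, not measurements.

```python
# Back-of-envelope: CPI contributed by cache misses at the ~0.15% miss rate
# quoted above for gcc with a 1 MB cache. The miss penalty and the number of
# memory references per instruction are assumptions, not measured values.

miss_rate = 0.0015        # from the post: <0.15% with a 1 MB cache
miss_penalty = 300        # cycles to DRAM (assumed)
mem_refs_per_instr = 0.4  # loads + stores per instruction (typical assumption)

cpi_from_misses = miss_rate * miss_penalty * mem_refs_per_instr
print(f"CPI added by misses: {cpi_from_misses:.2f}")
```

Under these assumptions misses contribute only ~0.18 cycles per instruction in total, so shaving an already tiny miss rate further moves IPC by at most a few percent - nowhere near a 42% jump.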
The improvements to branch mispredict penalties that were revealed are sure to be a factor.

Almost nothing has been revealed, but yeah, what can explain the improvement? The only things I can think of are:
1. Maybe they've added an extra cache port, supporting 2 loads and a store per cycle
2. Great improvements in memory disambiguation (possibly alleviating bugs in previous version?)
.... and lots of minor improvements.
And eventually Carthago actually was destroyed. :smile2:

P.S.: Geekbench is still a Mickey Mouse benchmark
Definitely, this also suggests quite some architectural changes. Maybe they added a decode cache?

The improvements to branch mispredict penalties that were revealed are sure to be a factor.
Just for you guys:Unfortunately the Anandtech article didn't show the latency data in cycles
Avg latency in cycles (iPhone 6 vs 6s)
L1: 4/3
L2: 19/17
L3: 108/96
DRAM: 260/334
Damn Ryan, my eyes are getting all misty here...
Thanks a bunch. That's quite impressive, given the frequency increase. The L1 improvement is a significant IPC boost in and of itself, and they shaved cycles off throughout the cache hierarchy, even with increased sizes.
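To put those cycle figures in wall-clock terms, here's a quick conversion using approximate core clocks (A8 ~1.4 GHz in the iPhone 6, A9 ~1.85 GHz in the 6s - assumed round numbers, not figures from this thread):

```python
# Convert the cycle latencies quoted above to nanoseconds. The core clocks
# (A8 ~1.4 GHz, A9 ~1.85 GHz) are assumed approximations, not from the thread.

def cycles_to_ns(cycles, ghz):
    return cycles / ghz

latencies = {  # level: (iPhone 6 cycles, iPhone 6s cycles)
    "L1": (4, 3),
    "L2": (19, 17),
    "L3": (108, 96),
    "DRAM": (260, 334),
}

for level, (a8, a9) in latencies.items():
    print(f"{level}: {cycles_to_ns(a8, 1.4):.1f} ns -> {cycles_to_ns(a9, 1.85):.1f} ns")
```

Note that under these assumed clocks DRAM latency is roughly flat in nanoseconds (~186 ns vs ~181 ns), so the cycle increase there mostly reflects the higher clock, while the cache levels improve in absolute time as well.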