Apple A12, A12X, and A12Z SoCs

DavidGraham · Oct 31, 2018

Is the A12X chip really equal to an i7 6700K in single threaded performance? Some even claim it's close to an i7 in multi core performance too!

Entropy · Oct 31, 2018

DavidGraham said:
Is the A12X chip really equal to an i7 6700K in single threaded performance? Some even claim it's close to an i7 in multi core performance too!

I claimed that. By initial appearances the A12X is close enough to the 6700K that what benchmark you select is going to determine what outcome you get. Pick one that suits your agenda.

DavidGraham · Oct 31, 2018

Entropy said:
I claimed that. By initial appearances the A12X is close enough to the 6700K that what benchmark you select is going to determine what outcome you get. Pick one that suits your agenda.

On average though, across a multitude of benchmarks, how close is the A12X to it?
And how is that even possible? The difference in silicon real estate is massive!

Voxilla · Nov 1, 2018

DavidGraham said:
On average though, across a multitude of benchmarks, how close is the A12X to it?
And how is that even possible? The difference in silicon real estate is massive!

Both the Skylake 6700K and the A12X have a die area of about 120 mm².
Skylake is a 14nm part with 1.75B transistors; the A12X, at 7nm, has 10B transistors.
(https://en.wikipedia.org/wiki/Transistor_count)
The clock frequency of the A12X cores is currently not known AFAIK; probably something like 2.5 GHz.
At 4 GHz the 6700K is likely faster.
In any case it shows how far Intel has fallen behind in both silicon process and the architectural efficiency of x86.
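
To put the die figures above side by side, here is a rough sketch of the implied transistor density. It uses only the numbers quoted in this post (about 120 mm² for both dies, 1.75B and 10B transistors); the function and variable names are just illustrative, and the counts themselves are the Wikipedia figures, not measurements.

```python
# Figures quoted in the post above; both dies are taken as ~120 mm^2 (approximate).
DIE_AREA_MM2 = 120.0
TRANSISTORS = {
    "Skylake 6700K (14nm)": 1.75e9,
    "A12X (7nm)": 10e9,
}


def density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Return transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


densities = {
    name: density_mtr_per_mm2(count, DIE_AREA_MM2)
    for name, count in TRANSISTORS.items()
}
ratio = densities["A12X (7nm)"] / densities["Skylake 6700K (14nm)"]

for name, value in densities.items():
    print(f"{name}: {value:.1f} MTr/mm^2")
print(f"density ratio: {ratio:.1f}x")
```

With these inputs the A12X packs roughly 5.7 times as many transistors into the same area. Keep in mind that the count covers the whole SoC (GPU, caches, other blocks), not just the CPU cores.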

Laurent06 · Nov 1, 2018

Voxilla said:
In any case it shows how far Intel has fallen behind in both silicon process and the architectural efficiency of x86.

I'd rather say that Apple has been doing wonders with its microarchitecture.

I'm still not sure it can scale frequency-wise, but it's nonetheless a great achievement.

First Geekbench results are here: https://browser.geekbench.com/v4/cpu/search?q=ipad8

About 5000/18000 single/multi-core (that's close to the score of the highest-end 13" MacBook Pro from 2018). Compared with the A12 iPhone, everything is very close except memory bandwidth, which is doubled. In particular, this means the frequency is the same, at 2.5 GHz.

Gubbi · Nov 1, 2018

Laurent06 said:
I'm still not sure it can scale frequency-wise, but it's nonetheless a great achievement.

With a 128KB L1 cache? Not a chance!

Cheers

Laurent06 · Nov 1, 2018

Gubbi said:
With a 128KB L1 cache? Not a chance!

You mean you don't know one can increase L1 latency? Or reduce cache size?

That's not the part of the design that makes me wonder about frequency scaling.

Gubbi · Nov 1, 2018

Laurent06 said:
You mean you don't know one can increase L1 latency? Or reduce cache size? That's not the part of the design that makes me wonder about frequency scaling.

You can't increase latency or decrease the size of the L1 without negatively impacting performance. So no, the CPU design as it is won't scale to significantly higher frequencies.

Which other parts make you wonder about frequency scaling?

Cheers

Pressure · Nov 1, 2018

iPad8,2 and iPad8,8 show 6GB RAM, while iPad8,3 shows only 4GB.

DavidGraham · Nov 1, 2018

Voxilla said:
Both the Skylake 6700K and the A12X have a die area of about 120 mm².
Skylake is a 14nm part with 1.75B transistors; the A12X, at 7nm, has 10B transistors.

Is there any in-depth write-up about the A12 or A12X other than AnandTech's? (Sadly, theirs is not deep enough.)

Nebuchadnezzar · Nov 1, 2018

DavidGraham said:
Is there any in-depth write-up about the A12 or A12X other than AnandTech's? (Sadly, theirs is not deep enough.)

What exactly are you expecting?....

DavidGraham · Nov 1, 2018

Nebuchadnezzar said:
What exactly are you expecting?....

Some comparisons to x86 CPUs for starters, and a deep dive into the A12's ARM heritage.

Nebuchadnezzar · Nov 1, 2018

DavidGraham said:
Some comparisons to x86 CPUs for starters, and a deep dive into the A12's ARM heritage.

Feel free to send me devices, for starters. As for "heritage", do you want RTL changelogs with that?

Entropy · Nov 1, 2018

DavidGraham said:
On average though, across a multitude of benchmarks, how close is the A12X to it?
And how is that even possible? The difference in silicon real estate is massive!

Well, that is the question everyone would like to see answered, isn't it? It is also hugely complex.
There are levels to this.
The lowest level is the minutiae of the physical architecture: branch prediction methods and implementation, prefetching details, and so on. Most of this stuff requires you to have information from the chip architects.
A higher level would be assessing the results of the above: branch prediction hit rates, effective memory latencies and bandwidths, read/write ports and whatnot. Some of this is actually addressed by Andrei in his AnandTech piece.
The next level would be benchmarking. This actually gives some lower-level understanding as well, if you know which SPEC or Geekbench subtests put pressure on which aspects of the architecture. This is where you find the bulk of review information about where work has been done between generations, and perhaps where an ISA/architecture has strengths or weaknesses versus others.

The last issue may be the thorniest: what relevance do differences in benchmark results actually have?
Let me exemplify: memory bandwidth is important once your data sets (aggregated over all active threads) are large compared to the cache sizes. But it makes for extremely poor benchmarketing. In the cases where bandwidth is the limit, you won't see much if any difference between old and new products or between CPU vendors. It only changes substantially with faster/wider memory subsystems, which means that in most reviews it gets tested once or not at all. The importance of bandwidth is downplayed by pretty much all reviews, for very understandable reasons: it's a boring story, and the slow advances don't encourage spending on new hardware.
As a contrasting example, multithreaded performance is typically emphasised by reviews and benchmarks. When single-core performance started to flatten out, vendors started adding cores (connected to the same memory subsystem). To demonstrate the advantages of this, benchmarks were sought and found that could show these advances, and today it is quite common for benchmarks to present "single thread" and "multi thread" results. But if you look at how the CPU is loaded over the course of a day, the number of times and the amount of time you will see a single core maxed out will be vastly greater than the amount of time that all cores are maxed out (unless you are doing very specialised work). There is more to it than that of course, but the take-home message is that just because two numbers are presented doesn't mean that they are equally important generally, and definitely not to a specific user.
Benchmarking is difficult in and of itself. Benchmarking for the purpose of comparing CPUs with different architectures is much harder still. Interpreting the results in order to make meaningful predictions for use cases that aren't directly covered by the tests - well, good luck chasing the moon.
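
To make the point about unequal importance concrete, here is a minimal sketch of a time-weighted (Amdahl-style) overall speedup. Every number in it is hypothetical and chosen only for illustration: a user whose CPU-bound time is 90% limited by a single thread, and a chip that is no faster per thread but 1.5x faster with all cores loaded. The function name and the dictionary keys are made up for this example as well.

```python
def overall_speedup(time_fractions, speedups):
    """Time-weighted (Amdahl-style) overall speedup.

    time_fractions: share of baseline CPU-bound time per workload class (sums to 1).
    speedups: speedup of the new chip over the baseline for each class.
    """
    if abs(sum(time_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    # New total time, expressed as a fraction of the baseline time.
    new_time = sum(time_fractions[k] / speedups[k] for k in time_fractions)
    return 1.0 / new_time


# Hypothetical user: 90% of CPU-bound time is limited by one thread.
usage = {"single_thread": 0.9, "multi_thread": 0.1}
# Hypothetical chip: same per-thread speed, 1.5x faster with all cores busy.
chip = {"single_thread": 1.0, "multi_thread": 1.5}

result = overall_speedup(usage, chip)
print(f"overall speedup: {result:.3f}x")
```

Under these assumed numbers the headline 1.5x multi-thread gain turns into roughly a 3% gain for this user; a user with the opposite usage mix would see something very different from the same two benchmark numbers.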

As a personal speculation, I was always a bit doubtful when it came to the claims that the x86 tax was some 10% or so. The claim is based on the idea that the x86 ISA is translated to a lower-level code that, once the translation is done, runs optimally, and that the conversion carries only a small cost (and perhaps a somewhat higher cost in terms of complexity). But this has never really been truly tested. Intel has always had the best lithographic process, the most resources to throw at optimising the physical design and layout, and so on. But even though they spent on the order of 10 billion dollars to break into mobile, they failed, not only in the marketplace but, more damningly, they failed to actually produce something that was notably better than the competition in spite of their advantages in process and design resources. And now, as their process advantage is dwindling, I feel that their performance/power ratio suggests that supporting the x86 ISA may carry a cost in efficiency that is far greater than the 10% often thrown around. Even though x86 is translated to code more easily handled by hardware, that hardware still has to be designed to run legacy x86 code. It is impossible for me to evaluate how much more difficult it is to build efficient hardware that supports all the accumulated 32-bit and 64-bit x86 code and the associated debris from MMX onward, as opposed to building a processor that implements, say, only AArch64 ARMv8.4 and ditches all legacy cruft. I would guess - a lot. And I would guess that the cost in efficiency is way more than 10%... But I don't expect ever to get confirmation of this from the companies whose livelihood depends on the competitiveness of x86.

PS. Note that benchmarks such as Geekbench are coded to use the regular cores alone. The subtests do not leverage the GPU, the NPU, or the various dedicated blocks, even if that would theoretically be possible. From the perspective of comparing cores from different vendors, this makes perfect sense. But if the code actually running on the devices uses these resources, either directly or through supplied APIs, then the predictive value of the benchmark drops. The A12X SoC provides supplemental computing resources that typical x86 processors do not. Also, note that Apple cannot bin these processors at all. Rather, they set clock/power limits that ensure that they don't have to discard otherwise functional dies. That limitation does not apply to the x86 products.

Laurent06 · Nov 1, 2018

Gubbi said:
You can't increase latency or decrease the size of the L1 without negatively impacting performance. So no, the CPU design as it is won't scale to significantly higher frequencies.

If the loss from going back to 32KB (or 64KB) is more than compensated by the gain in frequency, then it's a no-brainer. And that's one of the easiest parts of a design to change. Adding a cycle of latency to L1D decreases the speed of CPU-bound benchmarks (with only L1 hits) by much less than 10%.
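
The trade-off above can be sketched with a simple first-order model, assuming performance on a CPU-bound, L1-resident workload is just IPC times frequency (an assumption of this sketch, not something measured on any real core). The function names and the sample IPC losses are hypothetical.

```python
def net_speedup(ipc_loss, freq_gain):
    """Net performance change when IPC drops by ipc_loss and clock rises by freq_gain.

    Both arguments are fractions, e.g. 0.05 for 5%.
    Assumes performance = IPC * frequency (CPU-bound, all L1 hits).
    """
    return (1.0 - ipc_loss) * (1.0 + freq_gain)


def break_even_freq_gain(ipc_loss):
    """Frequency gain needed to exactly cancel a given IPC loss."""
    return 1.0 / (1.0 - ipc_loss) - 1.0


# Hypothetical IPC losses from a smaller or slower L1.
for loss in (0.02, 0.05, 0.10):
    needed = break_even_freq_gain(loss)
    print(f"IPC loss {loss:.0%}: need more than {needed:.1%} extra frequency")
```

In this model even a full 10% IPC loss is paid back by about 11% more clock, so the change is worthwhile only if the rest of the pipeline can actually deliver that extra frequency.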

Which other parts make you wonder about frequency scaling?

Scheduling engine, pipe stage latency of FP operations, forward paths of more complex operations, address translation come to mind. All of these can be more complex to fix than adding a cycle of latency to L1.

snc · Nov 1, 2018

Imagine an Apple SoC with active cooling and a ~300 mm² die.

Gubbi · Nov 2, 2018

Laurent06 said:
If the loss from going back to 32KB (or 64KB) is more than compensated by the gain in frequency, then it's a no-brainer. And that's one of the easiest parts of a design to change.

Why do you think it is easy? The cache path is a critical timing path. Accessing 128KB in four cycles indicates a rather high number of FO4 delays per pipeline stage. The rest of the core is built to the same timing constraints per pipe stage (the same number of FO4 delays). A high number of FO4 delays means more work per pipeline stage, and thus likely a shorter pipeline (and lower power).

You can't just reduce the size of the L1 and clock the core higher, you have to redo the entire core.

Laurent06 said:
Adding a cycle of latency to L1D decreases the speed of CPU-bound benchmarks (with only L1 hits) by much less than 10%.

Maybe true for 3- or 4-wide CPUs, but Vortex is 6-wide; increasing latency from four to five cycles means you have to schedule around 30 instructions instead of 24 at peak issue rate.
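
The 24 versus 30 figure is simply issue width times load latency. A small sketch of that arithmetic follows; the function name is made up, and the model assumes the core issues at its peak rate for the whole duration of the load.

```python
def instructions_to_cover(issue_width, load_to_use_cycles):
    """Instructions issued at peak rate while one L1 load is in flight."""
    return issue_width * load_to_use_cycles


# Compare a 4-wide core with the 6-wide figure used above.
for width in (4, 6):
    before = instructions_to_cover(width, 4)
    after = instructions_to_cover(width, 5)
    print(f"{width}-wide: {before} -> {after} instructions (+{after - before})")
```

The extra cycle costs one full issue group of independent work, so the wider the core, the more instructions the scheduler has to find to hide it.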

Cheers

Nebuchadnezzar · Nov 2, 2018

Gubbi said:
Maybe true for 3- or 4-wide CPUs, but Vortex is 6-wide; increasing latency from four to five cycles means you have to schedule around 30 instructions instead of 24 at peak issue rate.

It's 7-wide. If Apple wants to go to higher frequencies in the future, they will have no issue doing so.

pcchen · Nov 2, 2018

Entropy said:
As a personal speculation, I was always a bit doubtful when it came to the claims that the x86 tax was some 10% or so. The claim is based on the idea that the x86 ISA is translated to a lower level code that once the translation is done runs optimally, and that the conversion carries only a small cost (and perhaps a somewhat higher cost in terms of complexity).

IMHO, from a "highest possible performance" point of view, it's possible that the x86 tax is "only" 10%. However, from a "performance per watt" point of view, it can be much larger. Of course, "performance per watt" is not a very good measure, because you can use very little power if you only need very little performance. So back in the old days people could claim that since you need this much complexity to achieve this kind of performance anyway, the extra burden of x86 is not a big deal. As ARM approaches the performance level of a typical x86 CPU, the advantage of ARM is probably clearer than ever.

However, the A12 is right now probably the only ARM CPU that can be said to have a performance level similar to a typical x86 CPU, and the A12 does have some of its legacy stuff removed (mainly the 32-bit ARM support), so the comparison is probably not that fair. It'd be interesting to see an x86 CPU without the legacy 16-bit and 32-bit stuff.

Laurent06 · Nov 2, 2018

Gubbi said:
Why do you think it is easy?

Because I worked in teams that had to change L1 latency.

OK, that's not really easy, but other parts of the design are much more difficult to change to accommodate frequency increases.
