Discussion in 'Mobile Devices and SoCs' started by iMacmatician, May 7, 2018.
Great, thanks a lot!
I agree, although at least the extra registers of ARMv8 should help compared to ARMv7 or x86-64, right? Also, an extra load unit might be difficult to feed from the L1 data cache - I'm not sure what trade-offs they are making there in terms of bandwidth/banks/etc. given their huge 128KiB capacity.
Coming from a GPU background I have a maybe irrational dislike of register spilling, but in extreme cases I've wondered whether it'd be more energy efficient for the compiler to recompute certain results rather than store them in L1 and then reload them (if the data needed to compute them is being kept in registers anyway for another reason - not sure how common that is in CPU workloads to be honest, might be too rare to focus on). I don't think modern compilers do this? Anyway it probably wouldn't make much difference and it's a bit academic...
BTW - do we know what the L1 data cache line size is for Apple? ARM's cores have 64-byte cache lines, but in my mind, that's partly because some customers will use memory controllers with 64-byte granularity. Since Apple controls the entire SoC, they might (or might not) have decided that 32-byte granularity is still beneficial despite the cost in the memory controller, at which point it might make sense for the L1 data cache to also have 32-byte cache lines. With smaller cache lines prefetching also becomes slightly more important, but given Apple's performance levels it's obvious they must have good prefetching algorithms.
I was not thinking about having a third load, but rather being able to issue two loads and one store in the same cycle; that's useful for some computing tasks that stream their data (e.g., summing two vectors). IIRC Intel has been able to do it since Haswell, and their CPUs are narrower.
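To make the two-loads-plus-one-store point concrete, here is a rough back-of-the-envelope sketch in Python. It only models memory-issue limits for a loop like c[i] = a[i] + b[i]. The function name, the port counts, and the two example configurations are illustrative assumptions, not measured figures for any Apple or Intel core.

```python
def min_cycles_per_iteration(loads, stores, load_ports, store_ports, mem_ops_per_cycle):
    """Lower bound on cycles per loop iteration from memory-issue limits alone.

    Toy model: ignores cache bandwidth, bank conflicts, vector width, and the
    rest of the core. All parameters are hypothetical inputs.
    """
    return max(
        loads / load_ports,                    # limit from load issue slots
        stores / store_ports,                  # limit from store issue slots
        (loads + stores) / mem_ops_per_cycle,  # limit from total memory ops per cycle
    )


# c[i] = a[i] + b[i] needs two loads and one store per element.
LOADS, STORES = 2, 1

# Hypothetical core A: at most 2 memory ops per cycle in total.
two_ops = min_cycles_per_iteration(LOADS, STORES, load_ports=2, store_ports=1, mem_ops_per_cycle=2)

# Hypothetical core B: two loads and one store can issue in the same cycle.
three_ops = min_cycles_per_iteration(LOADS, STORES, load_ports=2, store_ports=1, mem_ops_per_cycle=3)

print(f"2 memory ops/cycle: {two_ops} cycles per element")
print(f"3 memory ops/cycle: {three_ops} cycles per element")
```

Under this toy model the vector sum is bound at 1.5 cycles per element in the first case and 1 cycle per element in the second, which is the kind of gain a streaming kernel would see from the extra store slot.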
9to5mac claims that the rumored iPad Pro (2018) will have an A12X SoC.
No numbers are given, but I think a reasonable guess from previous -X SoCs is the following:
3 Vortex cores clocked slightly higher than in the A12
6? Tempest cores
128-bit memory interface
8 GPU cores (2x the A12)
16 Neural Engine cores (2x the A12), ~10 TOPS.
This may be a silly question, but do the numbers of Vortex and Tempest cores have to be in a fixed (1:2) ratio?
One of the more interesting rumors (Kuo, 9to5mac) is that the iPad Pro may have a USB-C port instead of the Lightning port, which allows for 4K output. This change further differentiates the iPad Pro from the iPhone and iPad (non-Pro) and seems to push it a bit closer to laptop territory.
The iPad Pro getting a USB-C interface would make it more port-friendly than the Surface Pro 6, which would be hilarious.
Especially considering Microsoft gimped the GPU on their high-end Core i7 offering, which now only gets a remarkably old Gen9 GT2 GPU.
Interesting times ahead.
AnandTech has released SPEC2006 estimates for the small CPU cores in the A11 and A12, as well as Neural Engine benchmarks.
Is there any benefit for a future A-series SoC to have "big" cores, out-of-order "little" cores, and a third tier of in-order tiny cores?
I was referring to the Android SoCs - the middle gap is now quite big. We'll see some interesting solutions in the next gen for this.
Andrei, how did you run SPEC on little cores?
It seems iOS 11 didn't pay attention to thread affinity, at least the last time I tried.
BTW, your frequency measurements are a bit off.
All precise frequencies I measured on A7, A9, and A11 are divisible by 24MHz (so 1587 or 2083 are just not possible).
This is a CNTFRQ_EL0 timebase.
The fastest way to read a timer is
mrs x0, CNTPCT_EL0
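As a rough sketch of how such a counter reading turns into a frequency estimate (Python used purely for the arithmetic): read the counter before and after a loop whose cycle count is known, then scale by the timer rate. The 24 MHz rate is the one stated in this post; the function name and the example numbers are made up for illustration.

```python
TIMER_HZ = 24_000_000  # counter rate as stated in the post (24 MHz)


def estimate_core_mhz(cycles, ticks, timer_hz=TIMER_HZ):
    """Estimate the core clock in MHz from a known cycle count and elapsed counter ticks."""
    if ticks <= 0:
        raise ValueError("need a positive tick delta")
    # elapsed seconds = ticks / timer_hz, so core Hz = cycles * timer_hz / ticks
    return cycles * timer_hz / (ticks * 1_000_000)


# Hypothetical measurement: a 2,304,000-cycle dependent chain spanning 24,000 ticks (1 ms).
example_mhz = estimate_core_mhz(cycles=2_304_000, ticks=24_000)
print(f"{example_mhz:.1f} MHz")
```

One tick at 24 MHz is about 41.7 ns, so the timed loop has to be long enough that a one-tick error at either end is negligible.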
While 2064MHz (on Monsoon) was measured by my early freq timing code, the later versions with simultaneous measurements on N=3..6 cores got the same 2304MHz Monsoon max freq as with dual cores. But I think I need to re-check min frequency in this situation.
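For anyone who wants to check the divisibility argument against the numbers mentioned in this thread, a quick sketch follows. It takes the 24MHz step as given from the post above; whether the PLLs are really restricted to those steps is the assumption being tested, not something the code proves. The helper name is hypothetical.

```python
STEP_MHZ = 24  # step size claimed in the post


def nearest_steps(freq_mhz, step=STEP_MHZ):
    """Return (is_multiple, lower_step, upper_step) for a reported frequency in MHz."""
    lower = (freq_mhz // step) * step
    is_multiple = freq_mhz % step == 0
    upper = lower if is_multiple else lower + step
    return is_multiple, lower, upper


# Frequencies quoted in the thread.
report = {f: nearest_steps(f) for f in (1587, 2083, 2064, 2304)}
for freq, (ok, lower, upper) in report.items():
    if ok:
        print(f"{freq} MHz is a multiple of {STEP_MHZ} MHz")
    else:
        print(f"{freq} MHz is not; nearest steps are {lower} and {upper} MHz")
```

Under the 24MHz assumption, 2064 and 2304 are valid steps, while 1587 and 2083 fall between steps (1584/1608 and 2064/2088 respectively).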
I'm going to buy an iPhone XR and revisit my measurements.
That's just an assumption; you can program PLLs to any frequency.
In theory, but we have frequency range/steps defined by Apple.
BTW, do you plan to dig into the core microarchitecture?
I'm trying to code some uarch tests as well.
You did a good job with the review; a lot of Intel fanboys were seriously butthurt.