ARM announces new Cortex A72 core

Deleted member 13524 · Feb 4, 2015

Yes, but it'll be a TRUE DOUBLE-OCTA-COREZ!!!!!11111oneone

Ailuros · Feb 4, 2015

ToTTenTranz said:
Yes, but it'll be a TRUE DOUBLE-OCTA-COREZ!!!!!11111oneone

Gimme an 120k score in Antutu :runaway:

Erinyes · Feb 4, 2015

Nebuchadnezzar said:
It's still Midgard, minor refresh.

More on the new CCI: http://community.arm.com/groups/pro...ed-system-coherency--part-3--corelink-cci-500

Ahh ok. So it seems IMG Series 7 will still be quite a bit ahead on performance then.

iMacmatician said:
How would the A72 compare to Apple's Cyclone in terms of clock for clock performance?

Cyclone is significantly wider than A57, so even if they did go a bit wider with A72, Cyclone would still beat it. Geekbench single-core scores show a 1.4 GHz Cyclone beating a 2.1 GHz A57 (1640 vs. 1520). So even if A72 is 50% faster, clock-for-clock it would still fall short of a 2014 Cyclone, never mind a 2016 one.
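
A rough sketch of the per-clock arithmetic behind that comparison, using only the two Geekbench single-core scores quoted above. The 1.5x uplift is the hypothetical best case from the post, not a measured A72 result, and the helper name is made up for illustration.

```python
def score_per_ghz(score, clock_ghz):
    """Geekbench single-core points per GHz (a crude clock-for-clock proxy)."""
    return score / clock_ghz

cyclone = score_per_ghz(1640, 1.4)   # 1.4 GHz Cyclone
a57 = score_per_ghz(1520, 2.1)       # 2.1 GHz Cortex-A57
a72_hypothetical = a57 * 1.5         # assumption: A72 is 50% faster per clock

print(f"Cyclone: {cyclone:.0f} points/GHz")
print(f"A57: {a57:.0f} points/GHz")
print(f"A72 (hypothetical +50%): {a72_hypothetical:.0f} points/GHz")
print("A72 still behind Cyclone per clock:", a72_hypothetical < cyclone)
```

On these numbers Cyclone lands at roughly 1171 points per GHz, against about 1086 for an A57 with a 50% per-clock uplift.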

Mariner said:
No mention of die size for the A72, that I've seen? I'd imagine this indicates it is substantially larger than the A57, which is understandable when you consider the purported increase in performance and the size of the Cyclone and Denver cores in comparison to A57.

Perhaps ARM have decided that A53 provides good enough performance for now when die size is taken into comparison and a larger A7x version wouldn't be worth producing at present?

I don't think they gave any die size comparisons to A15 when they launched A57 either, but yeah, A72 is likely a fair bit larger than the A57 on the same process.

I feel that they think A53 is fine for big.LITTLE because a higher performance core won't be of much use when paired with A72. The job of the LITTLE core is basically to save power. But for the mid-range market I can easily see them introducing a new core with performance in between A53 and A57, like they did with Cortex A12.

mboeller said:
I wouldn't expect such a large increase in DMIPS at all. ARM talks about "sustained" performance in the comparison. The increase from the A57 @20nm to the A72 @16nm is 84% (3.5/1.9). The process change contributes around 20% to the increase according to an analyst PDF from BNP Paribas. Therefore at the most "only" 53% can be attributed to the SoC improvements. From this the CCI-500 contributes ~30%, which leaves only 18% for the improvement of the core itself.

How do you conclude that the CCI-500 contributes 30%? As per ARM, the memory performance is up 30% but the CPU performance won't increase anywhere near that much.

Gubbi said:
So did I.

This tweet is outright misleading. "Performance" here means power efficiency at some arbitrary power level.

Cheers

Well, if they mean sustained performance then the figure would be OK, but the tweet does not imply that in any way at all. Even I thought they meant absolute performance and only saw the sustained performance bit on the A72 product page later.

Ailuros said:
Gimme an 120k score in Antutu

Baah..the use of Antutu for marketing has given us a lot of useless SoC designs. At least we have Snapdragon 808 though.

mboeller · Feb 5, 2015

Erinyes said:
Ahh ok. So it seems IMG Series 7 will still be quite a bit ahead on performance then.

But don't underestimate the T880. If it is really nearly 2x the performance, then even an MP2 setup will do quite well. For comparison, here are the benchmark results of an MT6752, which contains a T760 MP2 @ up to 700MHz:

http://www.gizchina.com/2015/01/27/jiayu-s3-mt6752-64bit-benchmarks/

Erinyes said:
How do you conclude that the CCI-500 contributes 30%? As per ARM, the memory performance is up 30% but the CPU performance won't increase anywhere near that much.

The conclusion came from here:
http://community.arm.com/groups/pro...ed-system-coherency--part-3--corelink-cci-500

This reduced snoop latency can benefit processor performance, and benchmarking has shown a 30% improvement in memory intensive processor workloads. This can help make your mobile device faster, more responsive and accelerate productivity applications like video editing.

Therefore I think that for sustained performance the benefit is up to 30% but not for peak/benchmark performance.
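
The decomposition under discussion can be sketched as below. It assumes the three contributions (process, CCI-500, core) simply multiply, which is a simplification, and it takes the 30% CCI-500 figure at its upper bound for memory-intensive workloads. The variable names are illustrative only.

```python
total = 3.5 / 1.9   # ARM's sustained-performance claim, A72 @16nm vs. A57 @20nm
process = 1.20      # ~20% from the process change (analyst estimate quoted in the thread)
cci = 1.30          # up to 30% from CCI-500, memory-intensive workloads only

soc = total / process   # what is left for the SoC as a whole
core = soc / cci        # what is left for the core itself

print(f"total gain: {total - 1:.1%}")
print(f"SoC gain after removing process: {soc - 1:.1%}")
print(f"core gain after removing CCI-500: {core - 1:.1%}")
```

Unrounded, this gives about 84%, 54% and 18%; the 53% in the thread comes from rounding the total to 1.84 before dividing. If the CCI-500 contributes less than 30% in typical workloads, the share left for the core rises accordingly.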

Ailuros · Feb 5, 2015

mboeller said:
But don't underestimate the T880. If it is really nearly 2x the performance, then even an MP2 setup will do quite well. For comparison, here are the benchmark results of an MT6752, which contains a T760 MP2 @ up to 700MHz:

Huh? I don't know how ARM's marketing thinks, but it's likelier that the 80% is on a per core and per clock comparison.

A T760MP4@700MHz is slightly faster in Manhattan than a Rogue G6230@600MHz. Note that that's 4 clusters vs. 2 clusters and a 16+% higher frequency (***edit: strike that, they're roughly even both at a bit under 9 fps in Manhattan offscreen). According to IMG's own marketing you have Series6XT ending at up to 50% faster compared to Series6 (G6230 above) and Series7XT another up to 60% faster compared to 6XT. I don't know if the latter is true, but the GX6450 in Apple A8 seems close to the claim for the 6 to 6XT difference.

http://semiaccurate.com/2015/02/03/arm-outs-a57-successor-maya-core-cortex-a72/

Next up is the new Mali-T880 GPU family, successor to last fall’s Mali-T860 GPU. The T880 is a claimed 1.8x faster than last generation’s T760 and 40% more efficient. Beating a low-end part from the last generation is no big trick for performance but if you recall the T760 was not just slower, it was architected for efficiency. Because of this, beating the T760 at efficiency is a bit of a trick, beating it handily for raw performance too is a bonus.

Let me be generous in ARM Mali's favour and assume that the differences between Series6 and Series7XT and between T760 and T880 are roughly the same: you still have the fact that a recent Mali needs twice as many clusters and a slightly higher frequency to beat a Rogue. One partial explanation is that, despite both having the same FLOPs to TMU ratio, in Midgard you have that ultra-weird vector-ALU Gordian knot with always 1 TMU per cluster.

Long story short: Erinyes is right as it seems.
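
For reference, a small sketch of how the vendor claims quoted in this post compound. All three factors are "up to" marketing figures taken from the text above, not measurements, and the variable names are made up.

```python
series6_to_6xt = 1.5      # IMG: Series6XT up to 50% faster than Series6
series6xt_to_7xt = 1.6    # IMG: Series7XT another up to 60% faster than Series6XT
mali_t760_to_t880 = 1.8   # ARM: T880 claimed 1.8x faster than T760

img_compound = series6_to_6xt * series6xt_to_7xt

print(f"Series6 -> Series7XT: up to {img_compound:.1f}x")
print(f"T760 -> T880: {mali_t760_to_t880:.1f}x")
```

Taken at face value the IMG claims compound to about 2.4x against 1.8x for Mali, so treating the two jumps as roughly equal is indeed the generous reading for Mali.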

Zeross · Feb 5, 2015

Interesting information from Peter Greenhalgh on the RWT forum:

Firstly, there are micro-architectural enhancements throughout the Cortex-A72 design which improve both IPC and power. In fact, the Cortex-A72 power improvements are achieved on the same process with the same library as Cortex-A57. We aren't relying on a process shrink to achieve the power improvement or boost performance purely through frequency. Depending on the workload we're seeing anywhere between 10-50% more clock-for-clock performance than Cortex-A57 under identical system conditions while also reducing power. I'm talking about a range of decent sized, representative workloads, not micro-benchmarks.

Erinyes · Feb 5, 2015

mboeller said:
The conclusion came from here:
http://community.arm.com/groups/pro...ed-system-coherency--part-3--corelink-cci-500

This reduced snoop latency can benefit processor performance, and benchmarking has shown a 30% improvement in memory intensive processor workloads. This can help make your mobile device faster, more responsive and accelerate productivity applications like video editing.

Therefore I think that for sustained performance the benefit is up to 30% but not for peak/benchmark performance.

Yes, but there would be very few workloads where the CPU is limited by memory bandwidth alone, which is why I said that in general use cases it would not be anywhere near that much. You also have to remember that available bandwidth will double with LPDDR4 anyway. So even for sustained performance, I do not think the benefit would be anywhere near 30%.

Ailuros said:
Huh? I don't know how ARM's marketing thinks, but it's likelier that the 80% is on a per core and per clock comparison.

It's not even on a per-clock basis. As we've seen with the CPU performance claims, they take the benefit of the process into account. They have considered a clock speed of 850 MHz on 16nm. The comparison to T760 is probably a 700 MHz clock on 20nm (this is just my estimation). See the performance tab on the product page here - http://www.arm.com/products/multimedia/mali-performance-efficient-graphics/mali-t880.php

Mali-T880 (MP16)
Feature      Value                     Description
Frequency    850MHz                    in 16nm (16 FinFET)
Throughput   1700Mtri/s, 13.6Gpix/s    in 16nm (16 FinFET)

Edit: The performance tab on the product page on T760 lists the following:-

Mali-T760 (MP16)
Feature      Value                     Description
Frequency    650MHz                    in 28nm HPM
Throughput   1300Mtri/s, 10.4Gpix/s    in 28nm HPM
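
A quick sketch of how much of the gap between the two reference configurations is clock alone, using the figures from the two product pages. The 1.8x value is ARM's headline claim, and dividing it by the clock ratio assumes the claim was made at exactly these reference clocks, which ARM has not confirmed.

```python
t760 = {"clock_mhz": 650, "mtri_s": 1300, "gpix_s": 10.4}  # MP16, 28nm HPM
t880 = {"clock_mhz": 850, "mtri_s": 1700, "gpix_s": 13.6}  # MP16, 16nm FinFET

clock_ratio = t880["clock_mhz"] / t760["clock_mhz"]
tri_ratio = t880["mtri_s"] / t760["mtri_s"]
pix_ratio = t880["gpix_s"] / t760["gpix_s"]

claimed_speedup = 1.8                       # ARM's T880 vs. T760 headline figure
per_clock = claimed_speedup / clock_ratio   # assumption: claim measured at these clocks

print(f"clock {clock_ratio:.2f}x, triangles {tri_ratio:.2f}x, pixels {pix_ratio:.2f}x")
print(f"implied per-clock gain: {per_clock:.2f}x")
```

All three ratios come out at about 1.31, so the listed triangle and pixel throughput scale purely with frequency. Under the stated assumption, roughly 1.38x of the 1.8x would be left per clock.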

Ailuros · Feb 5, 2015

http://www.arm.com/products/multimedia/mali-performance-efficient-graphics/mali-t760.php
MP16@28HPM
16 TMUs * 650MHz = 10.4 GPixels/s

http://www.arm.com/products/multimedia/mali-performance-efficient-graphics/mali-t880.php

MP16@16FF
16 TMUs * 850MHz = 13.6 GPixels/s

liquidboy · Feb 5, 2015

Ailuros said:
http://www.arm.com/products/multimedia/mali-performance-efficient-graphics/mali-t880.php

MP16@16FF
16 TMUs * 850MHz = 13.6 GPixels/s

There's probably no relationship, BUT I do find it totally intriguing that the Mali-T880 is

1. 850MHz
2. 13.6Gpixels/s
3. 1700 M triangles/s (1.7 G triangles/s)

and the XB1 GPU is

1. 850MHz
2. 13.6Gpixels/s - CB/DB block
3. 1.71 G primitives/s - Geometry Block

ARM weren't kidding when they said that the "Mali-T880 delivers 'console-quality gaming'"

Ailuros · Feb 5, 2015

You're not going to see, in all likelihood, any device with 16 T880 clusters in the first place, because at that size the result will consume way too much power for a mobile device.

Besides and unless my math is wrong there's a severe difference in FLOPs between the two still. I get 870 GFLOPs for a T880MP16@850MHz unless they've pumped up the ALUs....Also "GPixels" for the T880 should actually state "GTexels/s" for which you're off by a factor of 3.0.

liquidboy · Feb 5, 2015

Ailuros said:
....Also "GPixels" for the T880 should actually state "GTexels/s" for which you're off by a factor of 3.0.

I was only going off the ARM site

under performance it shows "Throughput 1700Mtri/s, 13.6Gpix/s in 16nm (16 FinFET) "

p.s. we are talking about the "throughput" which should be render targets via the CB/DB

Ailuros · Feb 5, 2015

I'm as sure as I can be that the quoted GPixels are actually texture fillrates. Each cluster comes with a single TMU and therefore for MP16 = 16 TMUs * frequency; in the given case 16*850MHz = 13.6 GTexels/s. The data above in post #28 I provided are from their own homesite.

You have 16 TMUs on a T880MP16 and 48 TMUs on the XB1 GPU (see pink block in the diagram you quoted); the 13.6 Gigapixels are from the 16 ROPs of the XB1 GPU. Essentially the XB1 has 3x the TMU count of the T880MP16.
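
A minimal sketch of that fill rate arithmetic, assuming each TMU or ROP produces one result per clock and using the 850 MHz figure quoted in the thread for both GPUs. The helper name is hypothetical.

```python
def fill_rate_g(units, clock_mhz):
    """Peak rate in G results/s for `units` pipelines at one result per clock."""
    return units * clock_mhz / 1000

clock_mhz = 850
t880_texels = fill_rate_g(16, clock_mhz)  # T880 MP16: 1 TMU per cluster
xb1_texels = fill_rate_g(48, clock_mhz)   # XB1 GPU: 48 TMUs
xb1_pixels = fill_rate_g(16, clock_mhz)   # XB1 GPU: 16 ROPs

print(f"T880 MP16 texture rate: {t880_texels:.1f} GTexels/s")
print(f"XB1 texture rate: {xb1_texels:.1f} GTexels/s")
print(f"XB1 pixel rate: {xb1_pixels:.1f} GPixels/s")
print(f"XB1 / T880 texture ratio: {xb1_texels / t880_texels:.1f}x")
```

The matching 13.6 figures therefore compare a texel rate on the Mali side with a pixel rate on the XB1 side; like for like on texturing, the XB1 figure is about 40.8 GTexels/s.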

Realistically, the highest count we'll see in all likelihood from that GPU IP is a T880MP6 for future high-end smartphone SoCs.

Blazkowicz · Mar 3, 2015

mboeller said:
The conclusion came from here:
http://community.arm.com/groups/pro...ed-system-coherency--part-3--corelink-cci-500

Wow, there's an IOMMU (that ARM calls an SMMU) and so a provision for a VM to use the GPU or other hardware blocks. And that was on previous tech already (ARMv7); it is described in this paper from 2011:
http://www.arm.com/files/pdf/System-MMU-Whitepaper-v8.0.pdf

This is tech barely available in desktops (with software complexity and/or licensing, Intel and Nvidia mostly locking away the feature).
Some of the use cases arise from the difficulty of securing Android (lack of security updates, overbearing apps). But if this works and there's software available, you could use a secure OS or two for most of your stuff, and some Android VM with disabled or firewalled network to run some game etc.

Today if you have enough money (and perhaps time or need to care) you can run a desktop PC with dual graphics cards and both Windows + Linux, and keep browsing Beyond3D and doing stuff on Linux while the Windows VM is off or rebooting to apply updates.

fxtech · Jun 24, 2015

First ARM A72 benchmark. It is a Qualcomm Snapdragon 620, code msm8952 (LPDDR3 memory).

http://browser.primatelabs.com/geekbench3/2826004

fxtech · Jun 27, 2015

I was expecting more reaction from people.

This core could reach (and go over) 2500 points on Geekbench when it is deployed on the Snapdragon 820 at 3 GHz with LPDDR4 at 14nm.

liolio · Jun 28, 2015

Ailuros said:
You're not going to see, in all likelihood, any device with 16 T880 clusters in the first place, because at that size the result will consume way too much power for a mobile device. Besides and unless my math is wrong there's a severe difference in FLOPs between the two still. I get 870 GFLOPs for a T880MP16@850MHz unless they've pumped up the ALUs....Also "GPixels" for the T880 should actually state "GTexels/s" for which you're off by a factor of 3.0.

ARM has not given much information yet about the difference between the T7xx series and the T8xx series, but for the T880 specifically they claim a beefy 1.8x the performance of the T760. It makes me wonder if ARM went for 4 ALU pipelines per core with the T880 (they did in some of their past designs).
If I use AnandTech's data, I get 320 FLOPS per cycle for a T760 MP16, or 544 FLOPS per cycle with ARM's accounting (which takes in dot product throughput and more; ARM counts 17 FLOPS per ALU). The T860 is the same as the T760, efficiency aside.
The T880 could simply be a reworked T860 on the 14/16nm process. In that case I get 376 MFLOPS and 462 MFLOPS (accounting for the dot products). You can double that figure if the T880 were to use 4 ALUs per core; I still can't get the same number as you do.

I don't think we are going to see an MP16 configuration either. By ARM's own admission they were expecting 10 cores to be the highest configuration we would see for the T760 line; so far the highest-end implementation uses 8 cores. That Mali delivers either 160 FLOPS or 272 FLOPS per cycle depending on your accounting, and the MP8 version included in the S6 runs at 772MHz (max), which is 145 MFLOPS (or 210 MFLOPS using ARM's figures).

The PowerVR G6xx inside either a Zenfone 2 or an iPhone 5s delivers 256 FLOPS per cycle; at max speed that is 136 MFLOPS for the Zenfone 2 and ~115 MFLOPS for the iPhone 5s (going by AnandTech's clock estimate of 450MHz).
If I use the GFXBench results and compare the Galaxy S6 to the Zenfone, the results are in line with the FLOPS figures under ARM's accounting in every test but ALU 2, where both devices perform more or less the same. So in the ALU 2 test the PowerVR performs comparatively better, and in the purely graphical tests the Mali does a tad better.
From that test you can estimate that the GPU in the Apple A8 operates at ~700MHz (the same method gives an accurate estimate of 454MHz for the A7, using the Z3580's GPU at 533MHz as a reference). Now I can estimate the theoretical FLOPS throughput of the PowerVR GX6450 (in the A8).

And I get 179 MFLOPS.
Again, sticking to GFXBench, the results are pretty consistent with the FLOPS figures using ARM's accounting.
It makes me think of an article I read, I don't remember where, comparing the FLOPS throughput of the modern GCN GPU inside the XB1 with that of Xenos inside the 360. Obviously the former is a lot more powerful, but the article showed pretty well that on Xenos some quite complicated calculations were "free" (whereas they need multiple instructions on GCN). The situation seems to be the same when you compare Mali to other GPU architectures; it shows pretty well the difference between compute FLOPS and graphics FLOPS, if that makes sense.

Overall, whereas I agree that Malis are not fit to deliver the type of performance you get from low to mid-range PC GPUs or consoles, I realized that I've not been paying enough attention to the progress ARM has made with its GPUs; those Mali GPUs are damned good.
ARM promised an 80% increase in performance over the T760; the estimate might account for the difference in clock speed in the reference designs, 650MHz vs. 850MHz (the operating frequency of the T760 in Samsung's Exynos is already higher). Even if only part of that increase is delivered, it could be enough for a hypothetical Mali-T880 MP8 powered device to beat an iPad Air 2 and compete with (if not beat) the Shield Tablet. An MP16 configuration (assuming good scaling and that the GPU is fed properly) should compete with (if not beat) a Shield TV (within a comparable TDP).
Pretty sweet. If that is not too much of a PR twist, a company like Nintendo should actually consider an all-ARM design.
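
The peak throughput estimates in this post follow the pattern FLOPS per cycle times clock. A small sketch that reproduces the ones that fall straight out of that product, with the per-cycle figures and clocks taken from the post (several of the clocks are themselves estimates). Multiplying FLOPS per cycle by MHz and dividing by 1000 yields GFLOPS. The helper name is made up.

```python
def peak_gflops(flops_per_cycle, clock_mhz):
    """Theoretical peak in GFLOPS: FLOPS per cycle x clock in MHz / 1000."""
    return flops_per_cycle * clock_mhz / 1000

configs = [
    ("Mali-T760 MP8 @772MHz (ARM accounting)", 272, 772),
    ("PowerVR in Zenfone 2 @533MHz", 256, 533),
    ("PowerVR in iPhone 5s @450MHz (estimated clock)", 256, 450),
    ("PowerVR in Apple A8 @700MHz (estimated clock)", 256, 700),
]
results = {name: peak_gflops(fpc, mhz) for name, fpc, mhz in configs}
for name, gflops in results.items():
    print(f"{name}: {gflops:.0f} GFLOPS")

# Working backwards from the 870 GFLOPS quoted for a T880 MP16 @850MHz
implied_per_cycle = 870 * 1000 / 850
implied_per_core = implied_per_cycle / 16
print(f"870 GFLOPS implies ~{implied_per_core:.0f} FLOPS per cycle per core")
```

The last line suggests the 870 figure corresponds to about 64 FLOPS per cycle per core, against 34 per core (544 / 16) under ARM's accounting for the T760.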

Rys · Jun 29, 2015

s/MFLOP/GFLOP/g

Ailuros · Jun 30, 2015

A >5 months delay for an answer and a quite long post just to actually agree with the quoted content of a few sentences is quite an art I must admit

Exophase · Jun 30, 2015

fxtech said:
I was expecting more reaction from people. This core could reach (and go over) 2500 points on Geekbench when it is deployed on the Snapdragon 820 at 3 GHz with LPDDR4 at 14nm.

1) This isn't actually the first Cortex-A72 Geekbench 3 score to show up; MediaTek MT8173 scores have too, and depending on what you look at the single-threaded score is pretty similar: http://browser.primatelabs.com/geekbench3/compare/2732906?baseline=2826004 But Geekbench scores on Android are super erratic, so who knows (one reason why they're not very instructive).

2) Snapdragon 820 isn't using Cortex-A72 but a custom Qualcomm 64-bit ARM core. We have pretty much no performance details on it.

3) No one has announced a 3GHz Cortex-A72 and I doubt any SoC with A72s made on a remotely mobile-targeted 16/14nm process will allow such high clocks.

liolio · Jul 1, 2015

Last OT post with respect to Mali and FLOPS: the best response comes from ARM itself, and real-world evidence seems to corroborate their claims:
http://community.arm.com/groups/arm...ops--how-arm-measures-gpu-compute-performance
