ARM announces Cortex-A50 series

All based on the new 64-bit ARMv8 architecture.

ARM Cortex-A50 processor series:

  • Currently includes the Cortex-A57 and Cortex-A53 processors
  • Optional cryptographic acceleration that can speed up authentication software by up to 10x
  • Interoperability with ARM Mali™ graphics processor family for GPU compute applications
  • Features AMBA® system coherency to extend to many-core coherence with ARM CoreLink cache coherent fabric components, including the CCI-400 and CCN-504
ARM Cortex-A57 processor:
  • The most advanced, highest single-thread performance ARM application processor
  • Delivers the enhanced performance required for smartphones as they continue to transition from content-consumption devices to content-creation devices, with up to three times the performance of today’s superphones in the same power budget
  • Provides compute performance comparable to a legacy PC while operating in a mobile power budget, enabling cost and power efficiency benefits for both enterprise users and consumers
  • Extended reliability and scalability features for high-performance enterprise applications
ARM Cortex-A53 processor:
  • The most efficient ARM application processor ever, delivering today’s superphone experience while using a quarter of the power
  • Incorporates reliability features that enable scalable dataplane applications to maximize performance per mm2 and performance per mW
  • Optimized for throughput processing in applications with modest per-thread compute demands
  • The Cortex-A53 processor combined with the Cortex-A57 and big.LITTLE processing technology will enable platforms with an extreme performance range while radically reducing energy consumption

Entire Press Release.
 
It seems like these will be the 64-bit ARM CPUs AMD will incorporate into their Opterons given they only have a processor license and not an architecture license.
 
It seems Anand already had time to digest the news (with pictures to boot).

They are talking about big.LITTLE 2+4 (2 x Cortex-A57 and 4 x Cortex-A53) and 4+4 (4 x Cortex-A57 and 4 x Cortex-A53) configurations for smartphones, and the latter for tablets.
 
It seems Anand already had time to digest the news (with pictures to boot).

They are talking about big.LITTLE 2+4 (2 x Cortex-A57 and 4 x Cortex-A53) and 4+4 (4 x Cortex-A57 and 4 x Cortex-A53) configurations for smartphones, and the latter for tablets.
It seems like they expect a 4-core block of A53s to be standard for smartphones and tablets. Do the low-power, background-type scenarios that LITTLE is presumably there for really need 4 cores rather than 1 or 2? Other than persistent background processes, anything else would be best served waking up the A57 array to quickly finish the task so everything can power down again.
 
Some were wondering if one of the reasons ARM went with 32 registers in AArch64 was to better facilitate in-order processors.. with them announcing an in-order core first thing like this, that may have been the case. The larger register file gives the compiler more room to perform renaming and unrolling in software, and to allow better caching of irregular constants.
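To make that concrete, here's a rough sketch (nothing ARM has published - the constants and the unroll factor are purely illustrative) of the kind of loop where the extra AArch64 registers pay off: unroll by 4 and you already want four partial results, two large constants, two pointers and a counter live at the same time, which is comfortable with 31 GPRs and spill-prone with ~14.

Code:
#include <stdint.h>
#include <stddef.h>

/* Hand-unrolled mix loop: four accumulators plus two "irregular" 64-bit
 * constants stay live across the whole loop body. With a large register
 * file none of this has to be spilled and reloaded, which matters a lot
 * more on an in-order core that can't hide the reload latency. */
uint64_t mix_block(const uint64_t *p, size_t n)
{
    const uint64_t k1 = 0x9E3779B97F4A7C15ull;
    const uint64_t k2 = 0xC2B2AE3D27D4EB4Full;
    uint64_t h0 = 0, h1 = 0, h2 = 0, h3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        h0 = (h0 ^ p[i + 0]) * k1;
        h1 = (h1 ^ p[i + 1]) * k2;
        h2 = (h2 ^ p[i + 2]) * k1;
        h3 = (h3 ^ p[i + 3]) * k2;
    }
    for (; i < n; i++)          /* scalar tail */
        h0 = (h0 ^ p[i]) * k1;
    return h0 ^ h1 ^ h2 ^ h3;
}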

ARM's big.LITTLE strategy means software should ideally be optimized for the LITTLE processor.. I hope that this doesn't have any ill effects on how it runs on the big one. It should probably be alright, so long as it doesn't add a lot of extra code..

Other than persistent background processes, anything else would be best served waking up the A57 array to quickly finish the task so everything can power down again.

There's more to CPU utilization scenarios than background tasks and burst requests. There's going to be a whole sea of games that need some constant level of CPU power that's a small fraction of what Cortex-A57 will provide, at the very least all the games that ran fine on Cortex-A8 phones. This is also true for media playback; even with good hardware acceleration for video decode there's usually some amount of housekeeping needed every frame that's better performed on the CPU in conjunction with the OS.

But I agree four cores might be overkill. I guess they're just so small that it doesn't make that much of a difference to go with 4 instead of 2; I mean, we're talking 60% of the size of a Cortex-A9 on next-gen processes, that's pretty small..
 
But I agree four cores might be overkill. I guess they're just so small that it doesn't make that much of a difference to go with 4 instead of 2; I mean, we're talking 60% of the size of a Cortex-A9 on next-gen processes, that's pretty small..

It's just that performance won't scale linearly going from 2 to 4 cores, even though the die area difference for 4 cores over 2 would be small.

Unless I've missed something, there's no mention of frequencies for the A50s. It would be interesting to see if they can clock higher than the A15 and by how much under a given power target.
 
Some were wondering if one of the reasons ARM went with 32 registers in AArch64 was to better facilitate in-order processors.. with them announcing an in-order core first thing like this, that may have been the case. The larger register file gives the compiler more room to perform renaming and unrolling in software, and to allow better caching of irregular constants.
The benchmarks on the graph at http://www.arm.com/products/processors/cortex-a50/cortex-a53-processor.php are all Java/Javascript - I suppose extra registers can't hurt JITs! However they also show nice improvements in Dhrystone/Coremark (showing it's not just memory subsystem improvements) and, amazingly enough, SPEC Int 2000 (base), although that score is provisional and the clock frequency isn't even specified.

Their A53 webpage is unusually detailed and lists the following improvements:
ARM said:
  • 512-entry main TLB: Improved performance on code with complex memory access patterns, e.g. web browsing. Larger main TLB than Cortex-A7 and Cortex-A9.
  • Small, fast uTLBs: 10-entry uTLB with an extremely short miss penalty to reload from the main TLB allows excellent performance in a small area and power footprint.
  • Advanced branch predictor: 4Kbit conditional predictor, 256-entry indirect predictor increase branch hit rate.
  • 64B cache lines: Fully aligned with the Cortex-A57 microarchitecture to simplify cache management software in big.LITTLE systems. 64B line sizes are a good tradeoff for modern memory access patterns.
  • Non-blocking I-fetch with multi-line prefetch: Increased instruction throughput across more types of benchmarks, from control code to processing-intensive loops.
  • Dual identical ALU pipelines: Increased opportunity to dual-issue instructions, at a small additional area.
  • 64b store path: Balances store bandwidth with dynamic power consumption, focused on a highly efficient design tradeoff.
  • Multi-stream prefetcher: Greater data flow into the main datapath increases overall performance on a wide range of code.
  • Increased D-side throughput: 3 outstanding load misses (per core, excluding prefetches); 8 outstanding transactions (per core).
  • Extensive power-saving features: Hierarchical clock gating, power domains, advanced retention modes.

For Dhrystone/Coremark the dual ALUs being identical is likely to help a fair bit (since they're closer to being throughput benchmarks), but overall I'd expect AArch64 to be the biggest improvement on most workloads. All the other things in that list add up to a surprisingly fast core given it's still in-order though!

As for A57, it looks incremental compared to the A15 (still 8 issue slots - not sure how the 128 instructions in flight compares) but they claim "Full out-of-order scheduling on all execution paths allows more types of instructions to be re-ordered" - I wonder if that means load/store reordering is improved... And "Increased peak instruction throughput via duplication of execution resources" is potentially a big deal given how highly specialised each issue slot was on the A15.

In terms of clock frequencies, I'd be surprised if they made the A57 clock higher than the A15 given how power hungry the A15 already was/is. And for the A53, it's the same 8-stage pipeline as the A5/A7 and it's still fully compatible with ARMv7, so I don't see how they could have increased clock frequency while decreasing area. That's not necessarily a bad thing in terms of efficiency though.

I hope they release some more information at TechCon :)
 
Some were wondering if one of the reasons ARM went with 32 registers in AArch64 was to better facilitate in-order processors.. with them announcing an in-order core first thing like this, that may have been the case. The larger register file gives the compiler more room to perform renaming and unrolling in software, and to allow better caching of irregular constants.

ARM's big.LITTLE strategy means software should ideally be optimized for the LITTLE processor.. I hope that this doesn't have any ill effects on how it runs on the big one. It should probably be alright, so long as it doesn't add a lot of extra code..



There's more to CPU utilization scenarios than background tasks and burst requests. There's going to be a whole sea of games that need some constant level of CPU power that's a small fraction of what Cortex-A57 will provide, at the very least all the games that ran fine on Cortex-A8 phones. This is also true for media playback; even with good hardware acceleration for video decode there's usually some amount of housekeeping needed every frame that's better performed on the CPU in conjunction with the OS.

But I agree four cores might be overkill. I guess they're just so small that it doesn't make that much of a difference to go with 4 instead of 2; I mean, we're talking 60% of the size of a Cortex-A9 on next-gen processes, that's pretty small..

IIUC, they are claiming a simple in-order core beats the A9 by ~40%. That makes me wonder if the A9 with its branch prediction and OoO is crap. It loses to Atom...
 
The benchmarks on the graph at http://www.arm.com/products/processors/cortex-a50/cortex-a53-processor.php are all Java/Javascript - I suppose extra registers can't hurt JITs! However they also show nice improvements in Dhrystone/Coremark (showing it's not just memory subsystem improvements) and, amazingly enough, SPEC Int 2000 (base), although that score is provisional and the clock frequency isn't even specified.

I can't think of any reason off the top of my head why a JIT would want more registers than an AOT of the same language.. but JS engines are weird beasts and might benefit.

However, on a similar note (and maybe your thinking is related to this), I do think that machine code emulators could substantially benefit from extra registers. For instance, I'd much rather emulate x86-64 on AArch64 than AArch32. You have enough registers not only to avoid worrying about dynamic register allocation, which tends to take a performance hit no matter how well you do it, but you also have extra registers for intermediate values (like load-up results), keeping separate views of partial registers, caching large immediates, caching emulated flags, and caching stack locations.
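Just to put rough numbers on it (a purely hypothetical mapping, not taken from any real translator): with 31 GPRs you can statically pin all 16 guest GPRs and still have registers to spare for exactly that kind of bookkeeping.

Code:
/* Hypothetical static register map for an x86-64 -> AArch64 translator.
 * Illustrative only: every guest GPR gets a dedicated host register, and
 * the leftovers hold cached flags, a context pointer and scratch temps. */
enum host_reg_map {
    MAP_RAX = 8,  MAP_RCX = 9,  MAP_RDX = 10, MAP_RBX = 11,
    MAP_RSP = 12, MAP_RBP = 13, MAP_RSI = 14, MAP_RDI = 15,
    MAP_R8  = 16, MAP_R9  = 17, MAP_R10 = 18, MAP_R11 = 19,
    MAP_R12 = 20, MAP_R13 = 21, MAP_R14 = 22, MAP_R15 = 23,
    MAP_FLAGS   = 24,   /* lazily materialized guest EFLAGS state    */
    MAP_CONTEXT = 25,   /* pointer to the guest CPU state struct     */
    MAP_TMP0    = 26,   /* scratch for intermediates and immediates  */
    MAP_TMP1    = 27
};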

Hell, I'd probably even rather emulate ARMv7 in AArch64 than AArch32, despite the former dropping various core ISA features.

As for A57, it looks incremental compared to the A15 (still 8 issue slots - not sure how the 128 instructions in flight compares) but they claim "Full out-of-order scheduling on all execution paths allows more types of instructions to be re-ordered" - I wonder if that means load/store reordering is improved... And "Increased peak instruction throughput via duplication of execution resources" is potentially a big deal given how highly specialised each issue slot was on the A15.

In terms of clock frequencies, I'd be surprised if they made the A57 clock higher than the A15 given how power hungry the A15 already was/is. And for the A53, it's the same 8-stage pipeline as the A5/A7 and it's still fully compatible with ARMv7, so I don't see how they could have increased clock frequency while decreasing area. That's not necessarily a bad thing in terms of efficiency though.

One would think that the A57 they were comparing against the A15 had to have a higher clock; surely you wouldn't get a 30% IPC improvement from the changes they've made. I doubt AArch64 will help with it nearly as much as it may with A53.

I wonder if the instruction issue queues are 16 deep instead of 8 (hence 128 slots)?

I like this information:

"User space cache operations improve dynamic code generation efficiency, Data Cache Zero for fast clear "

Glad to see ARM realized that programs were paying an unacceptable price when generating/modifying code...
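For anyone curious what that looks like from user space, here's a rough sketch (instruction names are from the ARMv8 architecture; whether EL0 may actually execute DC ZVA, and the block size, are governed by the kernel and DCZID_EL0, so treat this as illustrative):

Code:
#include <stdint.h>
#include <stddef.h>

/* Zero a block-aligned buffer with DC ZVA. DCZID_EL0[3:0] gives log2 of
 * the zeroing block size in words, i.e. 4 << BS bytes per instruction. */
static void fast_zero(void *buf, size_t len)
{
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    size_t block = (size_t)4 << (dczid & 0xF);
    for (uintptr_t p = (uintptr_t)buf, end = p + len; p < end; p += block)
        __asm__ volatile("dc zva, %0" :: "r"(p) : "memory");
}

/* After emitting code into buf, make it visible to the instruction side.
 * GCC/Clang expand this builtin to the user-space DC CVAU / DSB / IC IVAU /
 * ISB sequence on AArch64 - no syscall needed, unlike ARMv7 Linux. */
static void finish_jit(char *buf, size_t len)
{
    __builtin___clear_cache(buf, buf + len);
}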
 
I can't think of any reason off the top of my head why a JIT would want more registers than an AOT of the same language.. but JS engines are weird beasts and might benefit.
Indeed, I'm not sure what I was thinking. I don't know about JS but here's what Mike Pall says about LuaJIT: http://lua-users.org/lists/lua-l/2010-11/msg00031.html - he makes the interesting point that JITs can optimise register pressure for hotspots more efficiently than method-at-a-time compilers that don't know what to inline.

If you look at http://luajit.org/performance_x86.html you can indeed see that the performance benefit is very small on average, and that's despite going from 8 to 16 registers instead of 16 to 32 registers (although on an OoOE architecture), and the biggest gains are supposedly due to SSE and 64-bit compare improvements. However the interpreter was hand-written in assembly so it's hard to say how much it could benefit from more registers. The vanilla Lua interpreter (written in C) shows by far the biggest performance gain on average but I doubt that's very representative of JS JITs..

One would think that the A57 they were comparing against the A15 had to have a higher clock; surely you wouldn't get a 30% IPC improvement from the changes they've made. I doubt AArch64 will help with it nearly as much as it may with A53.
The results at http://arm.com/products/processors/cortex-a50/cortex-a57-processor.php on the performance tab are actually 32-bit only. And if we exclude stream (a bandwidth benchmark), it's about 20%, which seems far from impossible for the changes they've made, given the A15 was essentially an entirely new architecture (there's always lots of performance up for grabs in lots of small things in the first revision - just look at Intel's evolution from Conroe to Ivy Bridge!)

Also that diagram has numbers on the same 28nm process, and it's still a 15+ stage pipeline. They could have optimised the stages a bit to achieve slightly higher clock speeds, but I wouldn't expect a big increase from that side.

"User space cache operations improve dynamic code generation efficiency, Data Cache Zero for fast clear "
That's a pretty nice unexpected feature! :)
 
Indeed, I'm not sure what I was thinking. I don't know about JS but here's what Mike Pall says about LuaJIT: http://lua-users.org/lists/lua-l/2010-11/msg00031.html - he makes the interesting point that JITs can optimise register pressure for hotspots more efficiently than method-at-a-time compilers that don't know what to inline.

If you look at http://luajit.org/performance_x86.html you can indeed see that the performance benefit is very small on average, and that's despite going from 8 to 16 registers instead of 16 to 32 registers (although on an OoOE architecture), and the biggest gains are supposedly due to SSE and 64-bit compare improvements. However the interpreter was hand-written in assembly so it's hard to say how much it could benefit from more registers. The vanilla Lua interpreter (written in C) shows by far the biggest performance gain on average but I doubt that's very representative of JS JITs..

Okay, what you were saying is clearer now.. when you said JIT I thought you only meant generated code, not the interpretive phase of a hot-spot VM. Interpreters typically benefit a lot from more registers since they must carry a lot of ancillary state in addition to the native state of the program they're interpreting. But if a VM is spending a significant amount of time in the interpreter it's not doing a very good job of balancing compilation.
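Something like this toy stack machine (opcodes made up on the spot) shows the kind of ancillary state that wants to live in registers - the pc, the stack pointer and the dispatch itself, before the guest program's own values even come into play:

Code:
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };   /* invented opcode set, for illustration */

int64_t run(const uint8_t *code)
{
    int64_t stack[64];
    const uint8_t *pc = code;   /* interpreter-side state that ideally */
    int64_t *sp = stack;        /* never leaves registers              */
    for (;;) {
        switch (*pc++) {
        case OP_PUSH: *sp++ = (int8_t)*pc++; break;   /* push signed imm8 */
        case OP_ADD:  sp[-2] += sp[-1]; sp--;  break;
        case OP_HALT: return sp[-1];
        }
    }
}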

The results at http://arm.com/products/processors/cortex-a50/cortex-a57-processor.php on the performance tab are actually 32-bit only. And if we exclude stream (a bandwidth benchmark), it's about 20%, which seems far from impossible for the changes they've made, given the A15 was essentially an entirely new architecture (there's always lots of performance up for grabs in lots of small things in the first revision - just look at Intel's evolution from Conroe to Ivy Bridge!)

Also that diagram has numbers on the same 28nm process, and it's still a 15+ stage pipeline. They could have optimised the stages a bit to achieve slightly higher clock speeds, but I wouldn't expect a big increase from that side.

It just doesn't fit what they've shown. The uarch differences from Conroe to IB are huge. For Cortex-A57 (and A53) ARM is showing the exact same pipeline diagram as Cortex-A15. It looks like most of the improvements are in making buffers bigger and allowing more things in flight.

The thing with the numbers is we don't actually know what clock speed the Cortex-A15 is supposed to be at. It could be 2GHz, despite ARM themselves saying that Cortex-A15 can scale at least to 2.5GHz. So a modest (let's say 10-15%) difference in clock for what they're looking at wouldn't be that strange.

That's a pretty nice unexpected feature! :)

Yeah, I was crossing my fingers on this one.
 
It just doesn't fit what they've shown. The uarch differences from Conroe to IB are huge. For Cortex-A57 (and A53) ARM is showing the exact same pipeline diagram as Cortex-A15. It looks like most of the improvements are in making buffers bigger and allowing more things in flight.
There could be things we're still missing. For example, A15 was 1 load + 1 store, stores were in-order, and loads couldn't bypass stores (although they only require the address, not the data, to be issued). All of these things could be improved significantly without changing the basic diagram (e.g. 2 loads if there are no stores).

I agree that 20-30% would be high, but given the number of small things adding up, it would also be quite unimpressive if it included a 10-15% clock speed increase IMO. We'll see :)

The thing with the numbers is we don't actually know what clock speed the Cortex-A15 is supposed to be at. It could be 2GHz, despite ARM themselves saying that Cortex-A15 can scale at least to 2.5GHz. So a modest (let's say 10-15%) difference in clock for what they're looking at wouldn't be that strange.
Being able to clock at a certain clock speed doesn't mean that anyone is going to ship that. In theory the only comparison that matters is nominal process voltage, but in practice many companies have shipped CPUs and GPUs at more than nominal process voltage for a long time. As absolute power consumption goes up in the A15 generation and beyond, there's no way around reducing the voltage IMO, so it's hard to compare clock frequencies on a fair basis.
 
So it would appear that TSMC in cooperation with ARM has already taped out the Cortex-A57.

Hsinchu, Taiwan and Cambridge, UK – April 2, 2013 – ARM and TSMC (TWSE: 2330, NYSE: TSM) today announced the first tape-out of an ARM® Cortex™-A57 processor on FinFET process technology. The Cortex-A57 processor is ARM’s highest performing processor, designed to further extend the capabilities of future mobile and enterprise computing, including compute intensive applications such as high-end computer, tablet and server products. This is the first milestone in the collaboration between ARM and TSMC to jointly optimize the 64-bit ARMv8 processor series on TSMC FinFET process technologies. The two companies cooperated in the implementation from RTL to tape-out in six months using ARM Artisan® physical IP, TSMC memory macros, and EDA technologies enabled by TSMC’s Open Innovation Platform® (OIP) design ecosystem.
ARM and TSMC’s collaboration produces optimized, power-efficient Cortex-A57 processors and libraries to support early customer implementations on 16nm FinFET for high-performance, ARM technology-based SoCs.

“This first ARM Cortex-A57 processor implementation paves the way for our mutual customers to leverage the performance and power efficiency of 16nm FinFET technology,” said Tom Cronk, executive vice president and general manager, Processor Division, ARM. “The joint effort of ARM, TSMC, and TSMC’s OIP design ecosystem partners demonstrates the strong commitment to provide industry-leading technology for customer designs to benefit from our latest 64-bit ARMv8 architecture, big.LITTLE™ processing and ARM POP™ IP across a wide variety of market segments.”

“Our multi-year, multi-node collaboration with ARM continues to deliver advanced technologies to enable market-leading SoCs across mobile, server, and enterprise infrastructure applications,” said Dr. Cliff Hou, TSMC Vice President of R&D. “This achievement demonstrates that the next-generation ARMv8 processor is FinFET-ready for TSMC’s advanced technology.”

This announcement highlights the enhanced and intensified collaboration between ARM and TSMC. The test chip was implemented using a commercially available 16nm FinFET tool chain and design services provided by the OIP ecosystem and ARM Connected Community partners. This successful collaborative milestone is confirmation of the roles that TSMC’s OIP and ARM’s Connected Community play in promoting innovation for the semiconductor design industry.
 
Just took part in a webinar about their low-power processors.

Some slides I took from it:

[Slide image: bS69OHw.jpg]


It's very interesting that they were targeting the A53 at 2GHz, which might be an indicator of the clocks that manufacturers will be using on their SoCs.

So they state roughly a 1.4x per-clock performance increase once you normalize for frequency.

[Slide image: ta1co2u.jpg]


This slide specifically mentions that in-core blocks are now finely power-gated at instruction-level granularity.

[Slide image: zIdC6Dn.jpg]


In general they were also talking about how the A53 will be a bigger improvement in performance, noting that it's not just an ARMv8 version of the A7.

I asked about per-core DVFS planes again and what their stance on it was, but the moderator garbled my question and the presenter deflected by saying that yes, big.LITTLE clusters are on different frequency planes. :rolleyes:

No mention of the A57 as the presentation was about the low-power models (A5/A7/A53).
 
Interesting how ARM didn't put much weight on the Cortex-A50 being 64-bit a year ago, but the moment Apple says their new CPU is 64-bit the popular press goes wild about it, without even knowing what it means.
 
Nice details about Cortex-A53. Sounds like it could compete decently against highly clocked Cortex-A9s like the ones in Tegra-4i. One thing we know about the processor is that its branch prediction was good enough to replace whatever the A12 originally had (which I would assume was at least as good as A9's). That's a big step up from A7 which has pretty weak branch prediction. But even then, this performance estimate is pretty astonishing for an in-order processor (should have perf/MHz well past Saltwell) and I have to wonder how they accomplished it. Interesting that Javascript has the lowest boost - I wonder if it's hindered by the JIT's inability to perform good scheduling for an in-order multi-issue processor.

Interesting how ARM didn't put much weight on the Cortex-A50 being 64-bit a year ago, but the moment Apple says their new CPU is 64-bit the popular press goes wild about it, without even knowing what it means.

Apple announcing details about a phone that'll be out in a few weeks is a little different from ARM announcing a processor that'll show up in its first device at some unknown time, but over a year at the very earliest.
 
Apple announcing details about a phone that'll be out in a few weeks is a little different from ARM announcing a processor that'll show up in its first device at some unknown time, but over a year at the very earliest.

The point is, we all knew that ARM processors would eventually have to use 64-bit registers, at least because of the 4GB RAM limit. That much was obvious, and it was always a matter of when rather than if.
ARM announced that transition as early as 2011 with the first details about ARMv8; the first ARMv8 cores were announced a year ago (as you can see in the first post), and by then no one seemed surprised about it.

Come Q3 2013, Apple says their newest SoC is 64-bit and suddenly all the "tech press" starts saying Apple made another genius move and technical marvel, and how important 64-bit ARM CPUs are for the future of mankind.

Oh well, just the "tech press" being "tech" press.
 