ARM Cortex A12

I'm still not convinced by big.LITTLE's cache hierarchy though, and

Me neither. They power two caches per core but only get the benefit of one; it's detrimental to both performance and power consumption.

I'd expect 2nd-generation big.LITTLE to have a big and a little core per L2 cache, with only one active at a time -> no migration of data when switching from big to LITTLE or vice versa. OS scheduler permitting, of course.


Cheers
 
Me neither. They power two caches per core but only get the benefit of one; it's detrimental to both performance and power consumption.

I'd expect 2nd-generation big.LITTLE to have a big and a little core per L2 cache, with only one active at a time -> no migration of data when switching from big to LITTLE or vice versa. OS scheduler permitting, of course.

Every A15 core active can benefit from having the A15 cluster L2 cache active. Every A7 core active can benefit from having the A7 cluster L2 cache active. I don't see how they only get the benefit of one cache or part of one cache. This is true regardless of the allocation/migration policy.

What you're describing about wasted power sounds no different from saying that any shared cache uses more power/core when some cores are disabled, even though it also means more cache/core. But almost everyone is moving to shared caches now. Jaguar is, and even Silvermont is sharing L2 between two cores.

If that extra cache is actually used enough then it makes perf/W better to have it on, not worse. If you can somehow determine it's not really helping your hit rate then you don't necessarily need to have separate caches to turn off part of it. Or at least that's what Steamroller is supposed to be capable of doing.
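The perf/W argument here can be put into a toy model (all numbers below are invented purely for illustration, not measured figures): keeping extra shared cache powered is a net win whenever the energy saved by avoided DRAM accesses exceeds the extra cache's leakage.

```python
# Toy model (hypothetical numbers): is powering extra shared L2 worth it?
# Extra cache pays off when the energy saved by serving accesses from cache
# instead of DRAM exceeds the extra cache's static (leakage) power.

def extra_cache_worth_it(accesses_per_sec, hit_rate_gain,
                         dram_energy_nj, cache_leakage_mw):
    """True if the added hits save more power than the extra leakage costs."""
    # Power saved by hits that would otherwise go to DRAM, in mW
    # (accesses/s * nJ/access * 1e-6 mJ/nJ = mW).
    saved_mw = accesses_per_sec * hit_rate_gain * dram_energy_nj * 1e-6
    return saved_mw > cache_leakage_mw

# Memory-heavy workload: 50M L2 accesses/s, +5% hit rate, ~20 nJ per DRAM access.
print(extra_cache_worth_it(50e6, 0.05, 20.0, 10.0))  # True: 50 mW saved vs 10 mW leaked
# Light workload that barely touches the extra capacity.
print(extra_cache_worth_it(1e6, 0.05, 20.0, 10.0))   # False: 1 mW saved vs 10 mW leaked
```

The same inequality read the other way is the case for power-gating part of the cache when the hit-rate gain is small, which is the Steamroller-style option mentioned above.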
 
The list of lead partners says it all: Marvell and MediaTek, two SoC vendors who have been anything but cutting edge :/ Being optimized for 28nm is also telling; TSMC's 28HPM in particular will be pretty old by the time this comes out.

18 months behind is a little harsh; it could be quite a nice core if it were out in the next few months. I'd say 12 months too late, which still sounds awful.
Exactly my feelings on this. Even if we ignore Intel, custom A9 designs are already going to be ahead by a wide margin.
Not really on topic: is anyone else tired of Anand proclaiming Atom (where Silvermont isn't specified, so I presume he means Saltwell) easily beats Cortex-A9? Pretty much any native code test I've seen shows Cortex-A9 with stronger perf/MHz, even that paper where they were using a quite old GCC version before some major ARM improvements. And these days Cortex-A9s are coming in at clocks nearly equivalent to Saltwell.
Well, I think the key phrase there is "these days." Last-generation parts were certainly beaten by Atom through sheer clockspeed advantage. Turbo boost is a wonderful thing. However, the lack of useful benchmarks in the tablet and smartphone space is undoubtedly obscuring things.

Current generation Krait 300 and the upcoming 400 definitely hold a lead over Atom. Apple's A7 should make mincemeat of Saltwell as well.

Now these observations I've been making have all been based on AnandTech's bench suite; I don't really follow anywhere else for mobile reviews. People have been stating that Intel's been optimizing for benchmarks, and that's correct. Intel hasn't been hiding it either -- their performance marketing slides released with Silvermont made note of this.

You can chalk up everything to optimization if you like, but Atom's performance is still impressive, despite its age.
 
There's good optimization, which I definitely credit Intel for (and I wish there was more effort on the ARM front, although it's slowly improving), but then there's optimizing in a way that only benefits some benchmark or worse, cheating at benchmarks.

When I look at AnTuTu showing a grossly higher advantage than any other benchmark, and then some for-pay report is released showing 2C4T Saltwell destroying 4C4T Cortex-A15 in this very highly threaded synthetic benchmark, of course I'm going to be skeptical.
 
Every A15 core active can benefit from having the A15 cluster L2 cache active. Every A7 core active can benefit from having the A7 cluster L2 cache active. I don't see how they only get the benefit of one cache or part of one cache. This is true regardless of the allocation/migration policy.

I must admit I was thinking in the context of the cluster migration policy of Exynos 5 series SoCs. Every time you bounce your run queues from one cluster (A) to the other (B) you're powering two L2s for a while. You either:
1. Power both L2s indefinitely, letting cluster B demand-load data from cluster A on misses.
2. Flush cluster A's L2 cache to DRAM, and let cluster B pull it back in when it misses its own L2.
3. A hybrid of the above: run the migrating processes for a few timeslices so they can pull the most frequently used data from on-die cache, then flush the rest of the old L2.

They all have a performance impact and they all have a power consumption impact.
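As a sketch, the three options above trade a one-time flush cost against ongoing power for the second L2. All constants here are invented for illustration, not real Exynos figures:

```python
# Hypothetical cost shapes of the three cluster-migration cache strategies.
# All constants are made-up illustrative numbers.

FLUSH_ENERGY_UJ = 100.0       # energy to write a dirty L2 back to DRAM
L2_STATIC_MW = 10.0           # static power of keeping the old L2 powered
SNOOP_ENERGY_UJ_PER_MS = 2.0  # cross-cluster demand-loading overhead

def migration_energy_uj(strategy, window_ms):
    """Extra energy (uJ) spent on L2 state over one migration window."""
    if strategy == "keep_both_on":     # option 1: both L2s stay powered
        # mW * ms = uJ of leakage, plus ongoing snoop traffic
        return L2_STATIC_MW * window_ms + SNOOP_ENERGY_UJ_PER_MS * window_ms
    if strategy == "flush_to_dram":    # option 2: flush the old L2 immediately
        return FLUSH_ENERGY_UJ
    if strategy == "hybrid":           # option 3: snoop briefly, then flush
        warm_ms = min(window_ms, 5.0)  # a few timeslices of demand loading
        return (L2_STATIC_MW + SNOOP_ENERGY_UJ_PER_MS) * warm_ms + FLUSH_ENERGY_UJ
    raise ValueError(strategy)

for s in ("keep_both_on", "flush_to_dram", "hybrid"):
    print(s, migration_energy_uj(s, window_ms=50.0))
# keep_both_on 600.0, flush_to_dram 100.0, hybrid 160.0
```

With these (made-up) numbers the immediate flush is cheapest for long windows, but it takes the full miss-latency hit on refill, which is exactly the performance side of the tradeoff.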

You get the same problem to a lesser degree with a core-migration policy. The whole point of big.LITTLE is to shut down your big cluster whenever you can to save power. Every time you do that you have the above issue.

Currently ARM's SCU/L2 cache supports four user agents; I don't understand why they don't pair a big and a LITTLE core for L2 arbitration (i.e. 8 cores with 4 L2 "ports").

Cheers
 
I must admit I was thinking in the context of the cluster migration policy of Exynos 5 series SoCs. Every time you bounce your run queues from one cluster (A) to the other (B) you're powering two L2s for a while. You either:
1. Power both L2s indefinitely, letting cluster B demand-load data from cluster A on misses.
2. Flush cluster A's L2 cache to DRAM, and let cluster B pull it back in when it misses its own L2.
3. A hybrid of the above: run the migrating processes for a few timeslices so they can pull the most frequently used data from on-die cache, then flush the rest of the old L2.

They all have a performance impact and they all have a power consumption impact.

But the impact is strictly dependent on how often the migrations happen. You can't just say it's there and therefore it's bad; you have to look at its contribution in useful real-world scenarios.

Normally a thread shouldn't be constantly bouncing between heavy and low utilization at time scales large enough that migrating it makes more sense than optimizing the clock of one cluster for it.

And even if you are constantly shifting CPU load, that doesn't mean you're also bringing all your data along with you. Once enough time has passed for the scheduler to realize you've substantially cooled down or heated up, chances are pretty good that your working set will have changed a lot, meaning it's not necessarily true that a huge chunk of L2 cache will need to move around.
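The amortization argument can be put in toy numbers (the flush cost is an assumption, not a measured figure): the migration penalty only matters in proportion to how often migrations actually happen.

```python
# Toy amortization model (hypothetical numbers): migration cost matters
# in proportion to how often migrations occur, not merely because it exists.

FLUSH_COST_US = 200.0  # assumed time lost to L2 flush/refill per migration

def migration_overhead(migrations_per_sec):
    """Fraction of total time spent on migration-related cache work."""
    return migrations_per_sec * FLUSH_COST_US * 1e-6

print(migration_overhead(1))    # ~0.0002 -> 0.02% of time, negligible
print(migration_overhead(500))  # ~0.1    -> 10% of time, clearly harmful
```

So the interesting question is where realistic workloads fall on that axis, not whether the cost is nonzero.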

Also consider that some tasks are pretty transient: there will be new processes showing up on a core in one of the clusters that last a few seconds and then quit. These won't be migrated, nor will they have a big cache-flush footprint.

You get the same problem to a lesser degree with a core-migration policy.

I really see both cluster-migration and core-migration as stepping stones to the full HMP solution. Software lagging to this end doesn't mean that the hardware design is fundamentally broken and needs to change.

The whole point of big.LITTLE is to shut down your big cluster whenever you can to save power. Every time you do that you have the above issue.

I don't agree with you that the whole point of big.LITTLE is simply to shut down the big cluster. The point is to run tasks with a high average CPU load on the big cluster and tasks with a lower average CPU load on the little cluster. You shut down either when there are no tasks for long enough.

But if the whole point is to keep the big cluster off as much as possible, then the cost of migration is negligible, since you only migrate occasionally, when you power the big cluster on and off.

Currently ARM's SCU/L2 cache supports four user agents; I don't understand why they don't pair a big and a LITTLE core for L2 arbitration (i.e. 8 cores with 4 L2 "ports").

There are a lot of disadvantages to this:

1) It's locked on to core-migration when normally big.LITTLE is more flexible than that
2) The big core will often benefit from more cache than the little core
3) Having 4 caches instead of 2 increases coherency requirements
4) Smaller separate caches are less flexible, especially in quad-core situations where often only one or two cores will be active
5) The big and little cores run at different clock speeds and voltages (they have to); sharing a cache between different clock/voltage domains like this has overhead
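Point (3) can be made concrete with a quick count (a simplified model; real interconnects mitigate this with snoop filters): coherency relationships grow with the number of cache *pairs*, not the number of caches.

```python
from math import comb

# Simplified coherency model: every pair of coherent caches is a
# relationship that snoop traffic must keep consistent.
def snoop_pairs(n_caches):
    return comb(n_caches, 2)

print(snoop_pairs(2))  # 1 -> today's two cluster-level L2s
print(snoop_pairs(4))  # 6 -> four per-pair L2s: 6x the pairwise relationships
```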
 