ARM Cortex-A15: the successor to the ARM Cortex-A9

The Eagle has landed?

So this is clearly next-gen stuff; we'll have to see whether it gets paired with current- or next-gen graphics, i.e. Mali or the next Mali vs Tegra 2/3 vs PowerVR Series 5 SGX XT or Series 6 (codename "the Daddy of all GPUs", which I made up, btw).
 
I'll just quote myself from the NVIDIA x86 thread:
Cortex-A15 announced (with little new information), the three lead licensees are Texas Instruments, Samsung, and ST-Ericsson. So NVIDIA isn't even a lead licensee for Eagle...

Either NV got a special deal with ARM to do what Charlie described and will wait to implement the A15 until their x86 translator is done, or more likely Charlie is just wrong from the first letter to the last and this entire thread discusses nothing more than a fabrication from Charlie's sources. The lack of x64 already made it less plausible, and now this seems to make it very improbable.

And before anyone says this means NVIDIA is giving up on Tegra - it's most likely just a financial decision. It's more expensive to be a lead licensee than to wait 9-12 months and license it then. Not being a lead licensee can also be a scheduling decision (too late for your refresh, too early to be a lead licensee for the next one). Alternatively, they could decide to stick to a quad-core A9, which would be disappointing and a competitive disadvantage in the high-end but not impossible. I think the financial justification is the most likely though.
A bit unfortunate that ARM decided not to reveal any architectural details - they mention FPU/NEON improvements (it's finally quad-MAC, like Snapdragon) but say nothing about the integer portion. I'm starting to fear it's very incremental and that they did not increase the issue width in any way.

EDIT: See my post just below.
 
Okay, forget everything I said: the integer pipeline is now 3-issue, whereas the A9 was 2-issue. Here's by far the best article I've found so far: http://www.electronicsweekly.com/Articles/2010/09/09/49414/update-arm-cortex-a15.htm
EW said:
"The goal was 50% better performance than the A9 at the same geometry with the same clock," ARM v-p of marketing Peter Hutton told EW. "We have seen 2x and 3x in certain applications."

These performance improvements come from updates including a three issue pipeline compared with the A9's dual issue, and changes to memory interfacing.
Presumably the 2x to 3x figure is partially based on NEON going from 64-bit to 128-bit ALUs, but more than 50% higher performance per clock than the A9 for integer code would be very impressive indeed.
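A quick back-of-envelope sketch of why the 64-bit to 128-bit NEON ALU change maps to "quad-MAC": with 32-bit single-precision lanes, the per-cycle MAC count simply doubles. This is purely arithmetic intuition, not a claim about real workload speedups.

```python
# Rough throughput intuition for the 64-bit -> 128-bit NEON ALU change.
# Purely illustrative; real speedups depend on the workload and the rest
# of the pipeline, not just ALU width.

LANE_BITS = 32                         # single-precision float lane

def macs_per_cycle(alu_width_bits):
    """How many FP32 multiply-accumulates fit in one ALU pass."""
    return alu_width_bits // LANE_BITS

a9_style = macs_per_cycle(64)    # dual-MAC
a15_style = macs_per_cycle(128)  # "quad-MAC"
assert a15_style == 2 * a9_style
```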

Regarding the number of cores: a single cluster is still 4 cores max (and 4MB L2 whereas the A9 supported 8MB interestingly enough, presumably for coherency reasons?), but it's now possible to put multiple clusters on the same chip. I don't know if there's a hard limit of 4 clusters of 4 cores (i.e. 16) or if that's just marketing. And never take power figures seriously unless it's very clear they are for the same process at the same clock or performance target.
 

Actually, I have no problem believing the latter part of the quote, which attributes the large improvements to the upgraded memory hierarchy:
"These performance improvements come from updates including a three issue pipeline compared with the A9's dual issue, and changes to memory interfacing."
The memory subsystems of these SoCs are quite constraining...
 
The memory subsystems of these SoCs are quite constraining...

...and with a 128-bit AMBA4 interface, that limit is lifted, it seems.

[Image: Eagle_New_Look_Chip-600.jpg]
 
The memory subsystems of these SoCs are quite constraining...
Right, but the hierarchy itself is actually completely unchanged (32+32KB L1, shared L2, no L3). The key word is 'interfacing' and that presumably refers to cache performance, load/store units, and/or AMBA 4 as mboeller said. But AMBA 3 already had a 64-bit bus, so in theory you'd be limited by external memory first - in practice, I suppose things can be very different. Alternatively maybe they're really comparing different memory controllers (since ARM licenses those too) although I doubt it.
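To illustrate the "limited by external memory first" point, here's a rough back-of-envelope comparison. All clock and DRAM figures are my own illustrative assumptions, not ARM or vendor specs:

```python
# Back-of-envelope bus vs. DRAM bandwidth comparison.
# All clock figures below are illustrative assumptions, not published specs.

def bandwidth_gbs(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak bandwidth in GB/s for a bus of the given width and clock."""
    return width_bits / 8 * clock_mhz * 1e6 * transfers_per_clock / 1e9

# Hypothetical 1 GHz core-side bus: 64-bit (AMBA3-style) vs 128-bit (AMBA4-style)
amba3 = bandwidth_gbs(64, 1000)
amba4 = bandwidth_gbs(128, 1000)

# Hypothetical single-channel 32-bit LPDDR2-800 (DDR: 2 transfers per 400 MHz clock)
dram = bandwidth_gbs(32, 400, transfers_per_clock=2)

# The external memory figure comes out well below either bus figure,
# which is the sense in which DRAM, not the internal bus, is the ceiling.
assert dram < amba3 < amba4
```

With these made-up numbers the internal bus has headroom over DRAM either way, which is why the "interfacing" gains presumably have to come from latency and coherency handling rather than raw width alone.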

On a slightly related note, it's interesting that L1/SRAM ECC is now mandatory (I believe it was an option that nobody used on the A9 but I'm not sure).
 
If this core is going to target servers, ECC would be necessary, particularly if the L1 and L2 are not inclusive.
Then again, the PAE-like address extensions seem to set ARM up to at most replace 32-bit x86 servers that haven't been replaced by x86-64 chips in the last 7 or so years, which doesn't sound like a large niche. Perhaps it's not so much servers as some other market that needs a bit more than 4GiB of memory?

Then there is the expectation that the susceptibility of SRAM to soft errors is going to get much worse at the future nodes this design targets. I have seen it alleged that the error rates at the leading edge for SRAM are worse than DRAM already.
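For anyone unfamiliar with how SRAM ECC copes with those soft errors, here's a minimal Hamming(7,4) single-error-correction sketch. Real L1/L2 ECC schemes use wider codes and different layouts; this just shows the mechanism of recovering from one flipped bit:

```python
# Minimal Hamming(7,4) sketch: a single-error-correcting code of the kind
# caches build on. Real cache ECC differs in width and layout; this only
# illustrates how a single soft-error bit flip is detected and repaired.

def encode(d):
    """Encode 4 data bits [d0..d3] into a 7-bit Hamming codeword."""
    d0, d1, d2, d3 = d
    p0 = d0 ^ d1 ^ d3        # parity over codeword positions 1,3,5,7
    p1 = d0 ^ d2 ^ d3        # parity over codeword positions 2,3,6,7
    p2 = d1 ^ d2 ^ d3        # parity over codeword positions 4,5,6,7
    return [p0, p1, d0, p2, d1, d2, d3]

def correct(c):
    """Recompute parity; a non-zero syndrome points at the flipped bit."""
    c = list(c)
    s0 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s1 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s2 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s0 | (s1 << 1) | (s2 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]  # extract the data bits

word = [1, 0, 1, 1]
cw = encode(word)
cw[5] ^= 1                    # inject a single-bit soft error
assert correct(cw) == word    # ECC recovers the original data
```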
 
Then again, the PAE-like address extensions seem to set ARM up to at most replace 32-bit x86 servers that haven't been replaced by x86-64 chips in the last 7 or so years, which doesn't sound like a large niche. Perhaps it's not so much servers as some other market that needs a bit more than 4GiB of memory?

Consider it in the context of the virtualization extensions, where each instance has 4GB of address space but the summed memory usage of all instances can exceed 4GB. In that model, only the hypervisor needs to be aware of PAE, which is actually a reasonable expectation.
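A toy model of that arrangement, just to make the arithmetic concrete (the mapping scheme here is invented for illustration, not the real LPAE stage-2 format):

```python
# Toy two-stage translation sketch: each guest works within its own
# 32-bit "intermediate physical" space, and only the hypervisor's
# stage-2 map needs entries above 4 GiB. Illustrative only; the real
# LPAE page-table format is far more involved.

GIB = 1 << 30

# Stage-2 map: guest id -> base of its 4 GiB window in machine memory.
stage2_base = {0: 0 * 4 * GIB, 1: 1 * 4 * GIB, 2: 2 * 4 * GIB}

def stage2(guest, ipa):
    """Translate a guest's 32-bit address to a machine physical
    address, which may exceed 32 bits."""
    assert 0 <= ipa < 4 * GIB, "guest addresses stay 32-bit"
    return stage2_base[guest] + ipa

# Two guests using the *same* 32-bit address land in different machine pages:
a = stage2(1, 0x1000)
b = stage2(2, 0x1000)
assert a != b
assert b >= 4 * GIB          # total machine memory in use exceeds 4 GiB
```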
 
Then again, the PAE-like address extensions seem to set ARM up to at most replace 32-bit x86 servers that haven't been replaced by x86-64 chips in the last 7 or so years, which doesn't sound like a large niche. Perhaps it's not so much servers as some other market that needs a bit more than 4GiB of memory?
I think the idea is that each virtualisation instance can have up to 4GB, so for highly virtualised servers (which is not a negligible part of the market nowadays) the lack of x64 isn't a big issue. As I said before, it's a disappointment, but these trade-offs aren't a show-stopper, as the intended market is one that must already be knowledgeable about the additional complexities of using a non-x86 solution.

Then there is the expectation that the susceptibility of SRAM to soft errors is going to get much worse at the future nodes this design targets. I have seen it alleged that the error rates at the leading edge for SRAM are worse than DRAM already.
Ah yes, good point. That might be a good reason not to bother with a non-ECC version.
 
Consider it in the context of the virtualization extensions, where each instance has 4GB of address space but the summed memory usage of all instances can exceed 4GB. In that model, only the hypervisor needs to be aware of PAE, which is actually a reasonable expectation.

Is any of that problematic if the chip skipped ahead to 64 bits?
PAE may fit that usage case, but why target that one situation at the expense of being marketable for all the other ones?
 
Is any of that problematic if the chip skipped ahead to 64 bits?
PAE may fit that usage case, but why target that one situation at the expense of being marketable for all the other ones?
The A15 uses the same ISA version as the A8 and A9. They're being surprisingly conservative about ISA changes. Why? Who knows... but obviously ARMv8 should be 64-bit in two or three years.
 
Will be interesting to see what Intel fires back with at IDF next week; roadmaps are hotting up in the SoC space, and there can be multiple winners in this game, I think.
 
It may be that ARM is taking it slow in revising the ISA, or that it has not yet validated a fully extended set.
The downside is that it is attempting to match x86 capabilities by also repeating a chapter of history in x86 that few remember fondly. It does write off a large swath of the server market that was similarly walled off from x86 until x86-64.
 
The A15 uses the same ISA version as the A8 and A9. They're being surprisingly conservative about ISA changes. Why? Who knows... but obviously ARMv8 should be 64-bit in two or three years.

I don't think it'll be exactly the same ISA, they'll need some extension set to facilitate hardware virtualization at least.

I wouldn't put much stock in ISA version number vs extension set in terms of how substantial it is. There isn't very much of a difference between ARMv4 and ARMv5, for instance, all of the big differences came in the optional extension sets.

LPAE does seem like a really incremental move towards attracting the server market, but it was probably not that hard for them to implement.

I kind of wonder if some intermediate approach to getting >32-bit virtual addresses would be appropriate. Including full 64-bit registers and ALUs seems like kind of a waste. It'd make sense to have a mode where registers (possibly just some) are extended to some larger size (40 bits? 48 bits?) where only the AGUs operate on the upper bits. It might be slightly tricky to get compilers working with, and it would be a little limited, but at least it wouldn't incur the waste of full 64-bit registers and ALUs, and it would potentially be a much smaller impact on the ISA (maybe just an instruction to move the low x bits of a register into the upper x bits of the extended register would be sufficient).
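To make the idea concrete, here's a toy model of that hypothetical scheme. Everything here (the 40-bit width, the `mov_upper` instruction, the register model) is my own invention to illustrate the proposal, not anything ARM has announced:

```python
# Toy model of the hypothetical "extended register" scheme described above:
# general registers stay 32-bit as far as the ALU is concerned, but each
# carries extra upper bits that only address generation sees. The single
# new instruction moves the low bits of one register into another's upper
# bits. All names and widths here are invented for illustration.

ADDR_BITS = 40                     # assumed extended address width
UPPER = ADDR_BITS - 32             # 8 AGU-only upper bits

class ExtReg:
    def __init__(self, low=0, upper=0):
        self.low = low & 0xFFFFFFFF            # ALU-visible 32 bits
        self.upper = upper & ((1 << UPPER) - 1)

    def mov_upper(self, src):
        """The hypothetical new instruction: low bits of src -> upper bits."""
        self.upper = src.low & ((1 << UPPER) - 1)

    def agu_address(self, offset=0):
        """Only the AGU combines upper and low bits into a 40-bit address."""
        return ((self.upper << 32) | self.low) + offset

r0 = ExtReg(low=0x8000_0000)
r1 = ExtReg(low=0x05)          # select the sixth 4 GiB region
r0.mov_upper(r1)
assert r0.agu_address() == (0x05 << 32) | 0x8000_0000
assert r0.low == 0x8000_0000   # the ALU still sees a plain 32-bit value
```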
 
But AMBA 3 already had a 64-bit bus, so in theory you'd be limited by external memory first - in practice, I suppose things can be very different.

Well, the diagram above clearly states "128-bit AMBA4 Advanced Coherent Bus Interface". Lacking any further insight, I assumed twice the width and probably better handling, which, for a number of scenarios, could in and of itself yield a 2-3x improvement.
I freely admit that I haven't dug into any meatier documents (and won't have time until Sunday at the earliest), so I could be jumping to conclusions. Probably am.
 
I wonder if SoCs with quad-core Cortex-A15 will employ their own L3 cache. That option could be a good reason for the move to the wider AMBA4 bus, because I doubt we'll be seeing 128-bit (paired or otherwise) memories on mobile devices any time soon. L3 in this arrangement probably wouldn't be an awful lot worse than if it were included as an ARM-provided, internally interfaced cell instead.

SoCs are already free to share the L2 with other things (like nVidia is doing in Tegra 2), seems like there's some more flexibility here than with typical multicore designs we're used to.

One feature that I hope Cortex-A15 will have is the ability to share one NEON core among multiple CPU cores. The designs are fairly decoupled as it is, so I hope this will be a possibility. Having a NEON unit for every core would be overkill, especially with 4-way floating point now; I hate to think how much die space it'll take up. Having only one NEON unit for 4 cores should be quite good for a number of workloads - sharing between separate cores could help hide latency, so you'd potentially get better utilization than with one unit in one core, although you'd still need the register set and some other context duplicated (and hence separated from the main NEON functional units).

I'm just concerned that we'll see the alternatives: cores with no access to NEON that turn into a big OS problem (although I suppose NEON instructions could be trapped and the thread rescheduled to a core that has it)... or worse, no NEON at all, as in Tegra 2, which IMO is going to turn into a compatibility/market segmentation problem.
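The trap-and-reschedule idea can be sketched like this. The core layout and scheduler here are a made-up toy, just to show the control flow an OS would need on such a hypothetical asymmetric SoC:

```python
# Sketch of the trap-and-migrate idea above: on a hypothetical SoC where
# only some cores have a NEON unit, a NEON instruction on a NEON-less
# core traps, and the OS re-queues the thread onto a core that has one.
# Core layout and scheduler are invented for illustration.

cores = {0: {"neon": True}, 1: {"neon": False}}

def run(thread, core):
    """Return 'done', or 'trap' if the thread hit a NEON instruction
    on a core without the unit."""
    if thread["uses_neon"] and not cores[core]["neon"]:
        return "trap"
    return "done"

def schedule(thread, core):
    """Run the thread; on a NEON trap, migrate it to a NEON-capable core."""
    if run(thread, core) == "trap":
        core = next(c for c, f in cores.items() if f["neon"])
        assert run(thread, core) == "done"
    return core

assert schedule({"uses_neon": True}, 1) == 0   # trapped, migrated to core 0
assert schedule({"uses_neon": False}, 1) == 1  # no NEON use, stays put
```

The obvious cost, as noted above, is that the trap plus migration is expensive, so threads that touch NEON frequently effectively lose access to the NEON-less cores.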
 
LPAE does seem like a really incremental move towards attracting the server market, but it was probably not that hard for them to implement.

I believe the extension to this is called VMSAv7. It provides for 64-bit page descriptors and an added level of translation, using the result of the old translation as a pointer to the new table (and of course, supported in hardware).
 