ARM Cortex-A15, the successor to the ARM Cortex-A9

Discussion in 'Mobile Devices and SoCs' started by argor, Sep 9, 2010.

  1. argor

    Newcomer

    Joined:
    Nov 25, 2008
    Messages:
    96
  2. roninja

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    264
    The Eagle has landed?

    So this is clearly next-gen stuff; we'll have to see whether it gets paired with current or next-gen graphics, i.e. Mali or Mali-next vs Tegra 2/3 vs PowerVR Series 5 SGX XT or Series 6 (codename "the Daddy of all GPUs" - I made that up, btw).
     
  3. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    I'll just quote myself from the NVIDIA x86 thread:
    A bit unfortunate that ARM decided not to reveal any architectural detail - they mention FPU/NEON improvements (it's finally quad-MAC like Snapdragon) but don't say anything about the integer portion. I'm starting to fear it's very incremental and they did not increase the issue width in any way.

    EDIT: See my post just below.
     
  4. argor

    Newcomer

    Joined:
    Nov 25, 2008
    Messages:
    96
  5. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    Okay, forget everything I said: the integer pipeline is now 3-issue, whereas the A9 was 2-issue. Here's by far the best article I've found so far: http://www.electronicsweekly.com/Articles/2010/09/09/49414/update-arm-cortex-a15.htm
    Presumably the 2x to 3x figure is partially based on NEON going from 64-bit to 128-bit ALUs, but more than 50% higher performance per clock than the A9 for integer code would be very impressive indeed.
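
    (To make the NEON point concrete, here's a minimal sketch using ordinary NEON intrinsics - nothing A15-specific that ARM has disclosed. Per the 64-bit to 128-bit ALU point above, a full-width unit can retire all four lanes of each multiply-accumulate at once, where a 64-bit datapath takes two passes.)

    ```c
    /* Minimal sketch: a SAXPY-style loop with NEON intrinsics. Each vmlaq_f32
     * carries four single-precision multiply-accumulates. */
    #include <arm_neon.h>

    void saxpy_neon(float *y, const float *x, float a, int n)
    {
        float32x4_t va = vdupq_n_f32(a);            /* broadcast the scalar */
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            float32x4_t vx = vld1q_f32(x + i);      /* load 4 floats */
            float32x4_t vy = vld1q_f32(y + i);
            vy = vmlaq_f32(vy, vx, va);             /* 4 MACs per instruction */
            vst1q_f32(y + i, vy);
        }
        for (; i < n; i++)                          /* scalar tail */
            y[i] += a * x[i];
    }
    ```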

    Regarding the number of cores: a single cluster is still 4 cores max (and 4MB L2 whereas the A9 supported 8MB interestingly enough, presumably for coherency reasons?), but it's now possible to put multiple clusters on the same chip. I don't know if there's a hard limit of 4 clusters of 4 cores (i.e. 16) or if that's just marketing. And never take power figures seriously unless it's very clear they are for the same process at the same clock or performance target.
     
  6. argor

    Newcomer

    Joined:
    Nov 25, 2008
    Messages:
    96
  7. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,375
    Actually, I have no problem believing the latter part of the quote that attributes the large improvements to the upgraded memory hierarchy.
    "These performance improvements come from updates including a three issue pipeline compared with the A9's dual issue, and changes to memory interfacing."
    The memory subsystems of these SoCs are quite constraining...
     
  8. mboeller

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    921
    Location:
    Germany
    ...and with a 128-bit AMBA4 interface, this limit is lifted, it seems

    [Image: ARM Cortex-A15 block diagram showing the 128-bit AMBA4 Advanced Coherent Bus Interface]
     
  9. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    Right, but the hierarchy itself is actually completely unchanged (32+32KB L1, shared L2, no L3). The key word is 'interfacing' and that presumably refers to cache performance, load/store units, and/or AMBA 4 as mboeller said. But AMBA 3 already had a 64-bit bus, so in theory you'd be limited by external memory first - in practice, I suppose things can be very different. Alternatively maybe they're really comparing different memory controllers (since ARM licenses those too) although I doubt it.

    On a slightly related note, it's interesting that L1/SRAM ECC is now mandatory (I believe it was an option that nobody used on the A9 but I'm not sure).
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
    If this core is going to target servers, ECC would be necessary, particularly if the L1 and L2 are not inclusive.
    Then again, the PAE-like address extensions seem to set ARM up to at most replace 32-bit x86 servers that haven't been replaced by x86-64 chips in the last 7 or so years, which doesn't sound like a large niche. Perhaps it's not so much servers as some other market that needs a bit more than 4GiB of memory?

    Then there is the expectation that the susceptibility of SRAM to soft errors is going to get much worse at the future nodes this design targets. I have seen it alleged that the error rates at the leading edge for SRAM are worse than DRAM already.
     
  11. frogblast

    Newcomer

    Joined:
    Apr 1, 2008
    Messages:
    78
    Consider it in the context of the virtualization extensions, where each instance has 4GB of address space, but the summed memory usage of all instances can exceed 4GB. In that model, only the hypervisor needs to be aware of PAE, which is actually a reasonable expectation.
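
    To put numbers on that, here's a toy sketch (the flat per-guest windows and all names are made up for illustration, not how a real hypervisor carves up memory) of how several 4GB guests can coexist in a 40-bit physical space while each guest keeps issuing plain 32-bit addresses:

    ```c
    #include <stdint.h>

    /* Toy model: each guest gets its own 4 GiB window inside a 40-bit
     * physical address space. Guests only ever produce 32-bit addresses;
     * only the hypervisor composes the wider physical address. */
    #define GUEST_WINDOW    (1ULL << 32)    /* 4 GiB per guest */
    #define PHYS_ADDR_BITS  40              /* LPAE-style physical address width */

    uint64_t guest_pa_to_host_pa(unsigned guest_id, uint32_t guest_pa)
    {
        uint64_t base = (uint64_t)guest_id * GUEST_WINDOW;  /* per-guest offset */
        /* Up to 2^(40-32) = 256 such guests fit before running out of bits. */
        return (base + guest_pa) & ((1ULL << PHYS_ADDR_BITS) - 1);
    }
    ```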
     
  12. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    I think the idea is that each virtualisation instance can have up to 4GB, so for highly virtualised servers (which is not a negligible bit of the market nowadays) the lack of x64 isn't a big issue. As I said before, it is a disappointment, but these trade-offs aren't a show-stopper, since the intended market is one that has to be aware of the additional complexities of using a non-x86 solution anyway.

    Ah yes, good point. That might be a good reason not to bother with a non-ECC version.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
    Is any of that problematic if the chip skipped ahead to 64 bits?
    PAE may fit that usage case, but why target that one situation at the expense of being marketable for all the other ones?
     
  14. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    The A15 uses the same ISA version as the A8 and A9. They're being surprisingly conservative about ISA changes. Why? Who knows... but obviously ARMv8 should be 64-bit in two or three years.
     
  15. roninja

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    264
    It will be interesting to see what Intel fires back with at IDF next week. Roadmaps are hotting up in the SoC space, and I think there can be multiple winners in this game.
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
    It may be that ARM is taking it slow in revising the ISA, or that it has not yet validated a fully extended set.
    The downside is that it is attempting to match x86 capabilities by also repeating a chapter of history in x86 that few remember fondly. It does write off a large swath of the server market that was similarly walled off from x86 until x86-64.
     
  17. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,251
    Location:
    Cleveland, OH
    I don't think it'll be exactly the same ISA; they'll need some extension set to facilitate hardware virtualization, at least.

    I wouldn't put much stock in the ISA version number vs the extension set in terms of how substantial a change is. There isn't very much of a difference between ARMv4 and ARMv5, for instance; all of the big differences came in the optional extension sets.

    LPAE does seem like a really incremental move towards attracting the server market, but it was probably not that hard for them to implement.

    I kind of wonder if some intermediate approach towards getting >32-bit virtual addresses would be appropriate. Including full 64-bit registers and ALUs seems like kind of a waste. It'd make sense to have a mode where registers (possibly just some of them) are extended to some larger size (40 bits? 48 bits?) where only the AGUs operate on the upper bits. It might be slightly tricky to get compilers working with, and it would be a little limited, but at least it wouldn't incur the waste of full 64-bit registers and ALUs, and it would potentially be a much smaller impact on the ISA (maybe just an instruction to move the low x bits of a register into the upper x bits of the extended register would be sufficient).
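
    As a toy software model of that idea (everything here - the 8-bit extension width, the struct, the "mov_high" op - is made up purely to show the shape of it, not an actual ISA proposal):

    ```c
    #include <stdint.h>

    /* Hypothetical: a register whose low 32 bits are what ordinary ALU ops see,
     * plus a small extension that only the address-generation path reads,
     * giving 32 + 8 = 40-bit virtual addresses without 64-bit ALUs. */
    typedef struct {
        uint32_t lo;    /* the normal 32-bit architectural register */
        uint8_t  hi;    /* upper bits, written only by a special "move-high" op */
    } ext_reg;

    /* The single new instruction: move the low bits of a source into the
     * upper bits of the extended register. */
    void mov_high(ext_reg *r, uint32_t src) { r->hi = (uint8_t)src; }

    /* Only the AGU ever composes the full 40-bit address; arithmetic on the
     * register itself keeps operating on r->lo alone. */
    uint64_t agu_address(const ext_reg *base, uint32_t offset)
    {
        return ((((uint64_t)base->hi << 32) | base->lo) + offset)
               & ((1ULL << 40) - 1);
    }
    ```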
     
  18. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,375
    Well, the diagram above clearly states "128-bit AMBA4 Advanced Coherent Bus Interface". Lacking any further insight, I assumed twice the width and probably better handled. Which, for a number of scenarios, in and of itself could yield a 2-3 times improvement.
    I freely admit that I haven't dug into any meatier documents (and won't have time until Sunday at the earliest), so I could be jumping to conclusions. Probably am.
     
  19. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,251
    Location:
    Cleveland, OH
    I wonder if SoCs with quad-core Cortex-A15 will employ their own L3 cache. That option could be a good reason for the move to the wider AMBA4 bus, because I doubt we'll be seeing 128-bit (paired or otherwise) memories on mobile devices any time soon. An L3 in this arrangement probably wouldn't be an awful lot worse than it'd be if it were included in an ARM-provided and internally interfaced cell instead.

    SoCs are already free to share the L2 with other things (like nVidia is doing in Tegra 2); it seems like there's more flexibility here than with the typical multicore designs we're used to.

    One feature that I hope Cortex-A15 will have is the ability to share one NEON core among multiple CPU cores. The designs are fairly decoupled as it is, so I hope this will be a possibility. Having a NEON unit for every core would be overkill, especially with 4-way floating point now; I hate to think how much die space it'd take up. Having only one NEON unit for 4 cores should be quite good for a number of workloads - sharing between separate cores could help hide latency, so you'd potentially get better utilization than with one unit in one core, although you'd still need the register set and some other context duplicated (and hence separated from the main NEON functional units).

    I'm just concerned that we'll see the alternatives... cores with no access to NEON, which turn into a big OS problem (although I suppose NEON instructions could be trapped and the thread then rescheduled onto a core that has it)... or worse, no NEON at all like in Tegra 2, which IMO is going to turn into a compatibility/market segmentation problem.
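
    (Roughly what that trap-and-migrate path could look like - every name here is hypothetical, it's just a sketch of the scheduling policy, not any real kernel's API:)

    ```c
    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical types/helpers, declared only to show the policy. */
    struct thread { uint32_t allowed_cores; };          /* affinity bitmask */
    bool     is_neon_insn(uint32_t insn);               /* decoder check */
    bool     current_core_has_neon(void);
    uint32_t neon_capable_core_mask(void);
    void     reschedule(struct thread *t);
    void     deliver_sigill(struct thread *t);

    /* A NEON-less core faults on the NEON op; shrink the thread's affinity to
     * NEON-capable cores and let the instruction re-execute after migration. */
    void undef_instruction_handler(struct thread *t, uint32_t insn)
    {
        if (is_neon_insn(insn) && !current_core_has_neon()) {
            t->allowed_cores &= neon_capable_core_mask();
            reschedule(t);
            return;
        }
        deliver_sigill(t);      /* genuinely undefined instruction */
    }
    ```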
     
    #19 Exophase, Sep 9, 2010
    Last edited by a moderator: Sep 9, 2010
  20. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    I believe the extension to this is called VMSAv7. It provides for 64-bit page descriptors and an added level of translation, using the result of the old translation as a pointer to the new table (and of course, supported in hardware).
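
    A toy sketch of that chaining (flat one-level "tables" and made-up names, just to show how the stage-1 result feeds the stage-2 lookup - real walks are multi-level with a proper 64-bit descriptor format):

    ```c
    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

    /* Toy flat tables: indexed by page number, each entry holds the output
     * page base. This only shows the chaining of the two stages. */
    extern uint64_t stage1_table[];   /* guest-managed:      VA page  -> IPA base */
    extern uint64_t stage2_table[];   /* hypervisor-managed: IPA page -> PA base  */

    uint64_t translate(uint64_t guest_va)
    {
        /* Stage 1: guest virtual address -> intermediate physical address. */
        uint64_t ipa = stage1_table[guest_va >> PAGE_SHIFT] | (guest_va & PAGE_MASK);

        /* Stage 2: the stage-1 result is itself looked up in the hypervisor's
         * tables, producing the final (possibly >32-bit) physical address. */
        return stage2_table[ipa >> PAGE_SHIFT] | (ipa & PAGE_MASK);
    }
    ```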
     
