ARM announces ARMv8 architecture

Discussion in 'Mobile Graphics Architectures and IP' started by DSC, Oct 27, 2011.

  1. DSC

    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
    ARM announces ARMv8 architecture

    http://www.arm.com/about/newsroom/a...-the-next-version-of-the-arm-architecture.php

     
    #1 DSC, Oct 27, 2011
    Last edited by a moderator: Oct 27, 2011
  2. argor

    Newcomer

    Joined:
    Nov 25, 2008
    Messages:
    96
    Likes Received:
    0
    Is there any more info on what ARMv8 adds besides 64-bit support?
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Nothing public afaics.
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Ah, the long-awaited ARMv8 (64-bit) announcement. I was thinking Cortex-A7 had to be the last major v7 chip announcement. A bit light on details; I don't think even 2014 for first products is a surprise.
     
  5. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    Thumb is out. A lot of the instruction encodings have been re-coded to be more sensible -- though since you have to support v7 anyway, I don't know what good that will do.

    40 bit pointers with translation tables similar to that of LPAE in v7. Oh, and no more LDM/STM.
     
  6. Laurent06

    Veteran

    Joined:
    Dec 14, 2007
    Messages:
    1,091
    Likes Received:
    491
    There is more information here

    Applied Micro say they'll have silicon by H2 2012.
     
  7. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    Wonder how many will implement transactional memory models with the atomic ld/st instructions in play now.
     
  8. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    As expected. On a 64-bit system, code size is the last thing to worry about =)

    That's bad. Hope they will not cripple LDM/STM in 32-bit mode.

    -Stack pointer is not a general purpose register
    What?!
    -PC is not a general purpose register
    -Additional dedicated zero register available for most instructions
    -Far fewer conditional instructions than in AArch32
     Conditional {branches, compares, selects}


    WTF! Seems they went out of their minds. ARM should have asked Wilson to design the new ISA.

    If they eliminated conditional execution, they should get rid of the integer pipe flag register too, because there is no longer any sense in it.
     
    #8 Vitaly Vidmirov, Oct 28, 2011
    Last edited by a moderator: Oct 28, 2011
  9. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    I imagine at some point in the future, they'd want this to carry over to low-profile embedded as well. But T2EE was a pain in the ass from an implementation POV and, IMO, not worth the silicon for the reduction in code size.

    Speaking of another feature that's a pain in the ass to implement and, IMO, not worth the silicon......

    Would you prefer to have a GPR banked all the time?

    They still have to support v7.
     
  10. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    Don't think the circuit is complicated, especially today with a 1 word/cycle rate, but they lost the opportunity to be energy effective on function prologs/epilogs, which became even more demanding because of the increased number of registers. Instead of 1 instruction, they have to fetch/decode/issue 10.
    Yeah, it is not easy to fit a 31-bit field in a 32-bit instruction, but at least register groups from N to M could be saved/restored.
    Intel uses special circuits to accelerate PUSH/POP, because x86 lacks the instructions that ARM has.
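
    To put rough numbers on the encoding point above, here is a minimal sketch. It assumes the ARMv7 style of LDM/STM register list (one bit per register) and compares it with a purely hypothetical "registers N to M" field. The range encoding and the function names are illustrative assumptions only, not anything ARM has announced.

```python
def bitmask_bits(num_regs: int) -> int:
    """Bits needed for a one-bit-per-register list (the ARMv7 LDM/STM style)."""
    return num_regs


def range_bits(num_regs: int) -> int:
    """Bits needed for a hypothetical 'first..last' register range field."""
    return 2 * (num_regs - 1).bit_length()


def encode_range(first: int, last: int, index_bits: int = 5) -> int:
    """Pack a hypothetical register range into a single field."""
    if not 0 <= first <= last < (1 << index_bits):
        raise ValueError("invalid register range")
    return (first << index_bits) | last


def decode_range(field: int, index_bits: int = 5) -> list:
    """Expand the packed field back into the list of register numbers."""
    first = field >> index_bits
    last = field & ((1 << index_bits) - 1)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(bitmask_bits(31), range_bits(31))
    print(decode_range(encode_range(19, 28)))
```

    With 31 general purpose registers the bitmask costs 31 bits while the range costs 10, at the price of only being able to name contiguous groups.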

    No, only in case of interrupts =)
    I prefer to be able to perform math with stack pointer.
    SP is a GPR even on x86 and all RISCs.
    Does v8 have a special mode for LDR/STR to access the stack frame?
    This sounds so ridiculous that I can't even believe it is true.
    And what about PC-relative addressing?
     
  11. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    First of all, most ISAs have flags without having conditional execution beyond branches and perhaps moves; that doesn't make flags useless. If you don't have flags you at least benefit from having a separate register set for predication so you can reduce register file contention.

    More importantly, ARMv8 64-bit isn't completely getting rid of conditional execution. Notably it has conditional selects, which are a more useful generalization of conditional moves, and conditional compares, which let you perform some boolean logic for if-clauses directly in flags.
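
    As a rough illustration of how a conditional compare chains into a conditional select, here is a minimal Python model. It assumes the semantics later documented for A64 CSEL and CCMP: CSEL picks one of two sources based on a condition, and CCMP either performs a compare or loads an immediate NZCV value. The helper names and the `Flags` class are made up for this sketch, and only four condition codes are modeled.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class Flags:
    n: bool = False
    z: bool = False
    c: bool = False
    v: bool = False


def cmp_flags(a: int, b: int) -> Flags:
    """Flags produced by a 64-bit compare, i.e. the subtraction a - b."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    sa, sb, sr = a >> 63, b >> 63, result >> 63
    return Flags(
        n=bool(sr),
        z=(result == 0),
        c=(a >= b),                    # carry set means no unsigned borrow
        v=(sa != sb) and (sr != sa),   # signed overflow of the subtraction
    )


def cond_holds(cond: str, f: Flags) -> bool:
    """Evaluate a small subset of condition codes against the flags."""
    return {"eq": f.z, "ne": not f.z, "lt": f.n != f.v, "ge": f.n == f.v}[cond]


def csel(cond: str, f: Flags, rn: int, rm: int) -> int:
    """Conditional select: rn if the condition holds, otherwise rm."""
    return rn if cond_holds(cond, f) else rm


def ccmp(cond: str, f: Flags, rn: int, rm: int, nzcv: int) -> Flags:
    """Conditional compare: compare if cond holds, else load immediate flags."""
    if cond_holds(cond, f):
        return cmp_flags(rn, rm)
    return Flags(bool(nzcv & 8), bool(nzcv & 4), bool(nzcv & 2), bool(nzcv & 1))


def both_equal(a: int, b: int, c: int, d: int) -> int:
    """Branch-free model of `(a == b && c == d) ? 1 : 0`."""
    f = cmp_flags(a, b)               # cmp  a, b
    f = ccmp("eq", f, c, d, 0b0000)   # ccmp c, d, #0, eq (Z stays clear on failure)
    return csel("eq", f, 1, 0)        # select 1 or 0 on the final Z flag
```

    The `&&` never becomes a branch: the second compare only takes effect when the first one passed, and otherwise the flags are forced to a "not equal" state.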

    Cortex-A8 and A9 are capable of ldm/stm at 64-bits per cycle. But look at it this way: Cortex-A15 can fetch 4 32-bit instructions per cycle and decode 3, but only has one load and one store unit, which I imagine can store either 64-bit or 128-bit to a contiguous location per cycle. ARMv7 already has dual-register stores and ARMv8 makes it more flexible to allow any register.

    Now in designs post-dating Cortex-A15 how often do you really think being able to load or store two 32 or 64-bit registers per instruction is going to be bottlenecked by fetch or decode?

    Intel's stack engine has nothing to do with the actual loads or stores, but with the stack pointer manipulation. By making the stack pointer no longer a general purpose register it's easier for ARM to do this now too.

    ARM would hardly be the first ISA to separate out SP to special function. No, I doubt there'll be a special "mode" for loads and stores, just SP specific loads and stores, PC specific loads, SP increment/decrement and moves from SP to GPRs. Compared to the wealth of data operations ARM has, this is a tiny number yet accounts for the vast majority of use cases. I don't know about what you do with the stack pointer, but in the real world it's almost never used for anything but a stack pointer.

    Even Thumb-1 separates out SP and PC access exactly like this.

    As for register banking on interrupts, that's only really nice to have in embedded. I doubt big servers or even apps platforms like phones and tablets need such fast or low latency interrupts. Especially with more than one core.
     
  12. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    That depends on how you implement it.

    The biggest problems are cache-line crossers, mispredicts and mid-stride faults. STMs aren't too bad, but the ARM architecture requires that you not update architected state until the entire LDM is complete; i.e. if you cross a page with your LDM and fault, you gotta roll back your register writes.

    This isn't so much of a performance problem if you're using a design with register renaming but it's still extra power and circuitry to handle these cases. The same problem occurs with a speculatively executed LDM that completes half-way before the mispredict is determined.

    Your choices are to make a big-ass queue (hardware and power hungry), halt your pipeline (bad for performance), or write a lot of rename registers unnecessarily (again, bad for power).
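
    A minimal sketch of the "queue" option, modeling the all-or-nothing requirement as this post describes it: loaded values sit in a side buffer and only become architecturally visible if every load in the LDM completes. The page size, the `mapped_pages` set and the function names are assumptions made up for illustration; real hardware would track this in a load queue or rename registers rather than a dictionary.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class PageFault(Exception):
    """Raised when a load touches an unmapped page."""


def ldm(regs: dict, memory: dict, mapped_pages: set, base: int, reg_list: list) -> None:
    """Load consecutive 32-bit words into reg_list, all-or-nothing."""
    pending = {}  # the "queue": values loaded but not yet committed
    addr = base
    for reg in reg_list:
        if addr // PAGE_SIZE not in mapped_pages:
            # Fault mid-stride: pending values are dropped, regs are untouched.
            raise PageFault(addr)
        pending[reg] = memory.get(addr, 0)
        addr += 4
    regs.update(pending)  # commit every register write at once
```

    The cost described above shows up as the `pending` buffer: it has to be as large as the longest possible register list, and everything in it is wasted work when the last load faults.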

    Cache-line crossers are fairly bad for power as well: if you happen to walk into a cacheline that isn't cached, you have a potentially huge number of instructions halted there (or written to rename reg/queue and taking up space), all waiting on that last pesky load from memory, which can be really, really bad.

    Honestly, I really doubt post-A8 microarchs are going to be fetch/decode/issue limited.

    The push-pop handling is off to the side, yes. But again, it isn't decode limited. I'm not too familiar with Intel's uarch but I have a sneaking feeling they decode it into individual load uops internally anyway. If a 4-way issue (on Nehalem) isn't a bottleneck, I doubt any ARMv8 microarch will be bottlenecked by it either.

    And exceptions.

    *Shrug*. It's a stack. I would've thought the obvious use-cases would be to use it as, well, a stack.

    And again, not having it be a GPR can make the hazard compare circuits much simpler.
     
  13. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    The flag register is an additional dependency for the OoO engine. Intel had a lot of trouble with it in the past (not sure about the present). The flag-less MIPS/SPU way is much cleaner and leads to simpler hardware, IMHO.

    Well, it is good that they still have some predication. It is just not as useful and orthogonal as before.

    I said "energy effective", not bottlenecked =) More instructions -> more cache used + more power used on F/D/E.

    They have solved the problem of sequential push/pop, so it became beneficial over mov [sp+n] (compact instructions).

    A64 has fixed length instructions, so it doesn't make sense.

    I just want a dream RISC machine, not a freaking PPC clone =)

    "Compared to the wealth of data operations ARM has this is a tiny number yet accounts for the vast majority of use cases"
    Console PPC processors are based on this logic.
    So we got 21-cycle register shifts, 50+ cycle LHS and 50+ cycle branches on VMX flags.

    I just hope that ADD Rx, SP, #offset is still possible.
     
  14. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    Well, this makes sense. On the other side, Intel is beefing up their REP XXX instructions.

    Larger code size (bad for cache) and more energy in cache+frontend.
    The register count is nearly doubling, which means many more stack spills/fills.

    Yes. It is a pure code density win, AFAIK.

    I expect SP access to suffer from extra latency.
     
  15. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    On the other side, Intel can't design sub-1W processors....

    Not arguing either. But you really have to weigh the benefits of code density against the drawbacks of implementing LDM/STM in an OoOE processor. Having designed a few that handle LDM/STM, I can tell you I would've given a limb (and quite a bit of die area and mW's) to get rid of them.

    Honestly, with cachelines being the sizes they are, code density wins aren't what they used to be.

    When you say SP access, you mean using it as a source for computation, I assume. I really don't see that being common practice to optimize for but perhaps I'm wrong.
     
  16. fuboi

    Newcomer

    Joined:
    Aug 6, 2011
    Messages:
    96
    Likes Received:
    51
    Products aiming at the server market might just drop v7 support; it's not like they have 20 years of legacy code to support (in that market). And considering everyone and their grandma tapes out ARMs for every little market...

    What virtualization features are expected?
     
  17. metafor

    Regular

    Joined:
    May 26, 2010
    Messages:
    463
    Likes Received:
    0
    Not really much change from v7 virtualization, which was fairly comprehensive.
     
  18. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    All choices to support higher performance implementations, rather than codesize/energy.

    1. Mucking about with the stack pointer messes up the return stack (also valid for x86!).
    2. PC should never have been a general purpose register. It is very much special purpose.
    3. Conditional execution poses a challenge for high performance OOO implementations: each instruction takes two entries in the ROB. With improved branch prediction the value of ubiquitous conditional execution drops quickly.

    Also load/store multiple is an abomination and needs to go; it has huge costs and next to no performance advantage.

    Cheers
     
  19. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    Thank you. Wouldn't have found it without help; my search on ARMv8 didn't turn it up.
     
  20. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    So how would you perform stack unwinding on exception handling?
     