ARM announces ARMv8 architecture

Discussion in 'Mobile Graphics Architectures and IP' started by DSC, Oct 27, 2011.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    IIRC, LL/SC isn't enough to do lock-free algorithms. You REALLY need CAS. I didn't see it in arm64, but is it there in arm32?
     
    #41 rpg.314, Nov 2, 2011
    Last edited by a moderator: Nov 2, 2011
  2. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    A quick Google tells me LDREX/STREX was introduced in ARMv6 and it seems to fit.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The weak model matters for POWER7 enough in practice that IBM included a strong(er) mode.
    Torvalds' exposure to low-level code problems may be the reason for his distaste for the weaker model, and the pernicious subtlety of the problems it exposes is apparently a major component of why he dislikes it.

    The completion and visibility aspect of the pipeline is what makes up the memory model.
    Is there some documentation that goes into depth about MPCore's model?

    This indicates for the A9 that ARM considers MPCore to have a weak model.
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0407e/CACBGEAD.html
     
  4. Arcanum

    Newcomer

    Joined:
    Sep 20, 2011
    Messages:
    4
    Likes Received:
    0
    ARM does implement a weakly ordered memory model, but amongst the three memory types - Normal, Device, Strongly-Ordered - the last two impose ordering requirements (and other restrictions) not present in Normal.

    To avoid the inevitable post, yes, there is barely any difference between Device and Strongly-Ordered. None in fact inside the processor.

    I'm not going to comment on the v8 architecture. This is the situation in v7.
     
  5. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    A weak memory ordering model will help reduce power in low power implementations.

    It will cost some performance in high performance implementations (ie. we know how to make a strong memory ordering model fast today).

    I'm slightly surprised, since ARMv8 seems aimed at higher performance.

    Cheers
     
  6. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    So I'd like to apologise for being hilariously wrong - while the single-core model for ARM is perfectly reasonable, the ARMv7 MPCore model does NOT add the restrictions I thought it did based on my previous analysis (although there are some minor changes). This blog post makes it clear it's still weakly ordered: http://wanderingcoder.net/2011/04/01/arm-memory-ordering/ (along with the associated ARM blog post)
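
    To make "weakly ordered" concrete, here is a minimal C++11 sketch of the classic message-passing pattern. `Mailbox`, `publish` and `try_consume` are made-up names for illustration and do not come from the linked blog post; the assumed setup is one producer thread and one consumer thread. If the flag were written with `memory_order_relaxed`, nothing would stop a weakly ordered core (or the compiler) from making the flag visible before the payload.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical one-shot mailbox: a single producer thread calls publish()
// once, a single consumer thread polls try_consume().
struct Mailbox {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag guarding the payload

    void publish(int value) {
        // One-shot by design: publishing twice would race with the reader.
        assert(!ready.load(std::memory_order_relaxed));
        payload = value;
        // Release store: the payload write above cannot become visible
        // after the flag does.
        ready.store(true, std::memory_order_release);
    }

    bool try_consume(int& out) {
        // Acquire load: the payload read below cannot be satisfied
        // before the flag has been observed as set.
        if (!ready.load(std::memory_order_acquire)) {
            return false;
        }
        out = payload;
        return true;
    }
};
```

    On x86 the release store and acquire load typically compile to plain moves, while an ARMv7 compiler would typically have to emit DMB barriers around them, which is the cost being weighed here.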

    So how bad is it? I think Linus Torvalds' position makes a lot of sense:
    So if we exclude the disadvantage of developers making wrong assumptions based on their x86 experience resulting in buggier code (which is indeed a big problem) then the main question is what's the cost of locking (and barriers in the less extreme cases)?

    My understanding of the ARM MPCore architecture seems to indicate it should be much much more reasonable than in those old RISCs mentioned by Linus. In the case of a single 4xA9 chip, you have a shared L2 cache and duplicated L1 Physical TAGs in the Snoop Control Unit. ARM explicitly brags about their 'high performance, low power spinlock' on Page 20 of this presentation: http://www.iet-cambridge.org.uk/arc/seminar08/slides/JohnGoodacre.pdf
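
    For reference, a bare-bones spinlock in portable C++11 looks roughly like the sketch below. This is only an illustration of where the acquire and release ordering goes; it is not ARM's implementation from the slides, and `SpinLock` and `locked_increment` are hypothetical names. As far as I know, a low-power version on ARM would also use WFE/SEV to sleep while waiting, which is not modelled here.

```cpp
#include <atomic>
#include <cassert>

// Minimal test-and-set spinlock; an illustration, not ARM's implementation.
class SpinLock {
    std::atomic<bool> held{false};

public:
    bool try_lock() {
        // exchange() is an atomic read-modify-write. Acquire ordering keeps
        // the critical section from being reordered above the lock.
        return !held.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // Wait on a plain load so spinning cores only read their own
            // cached copy until the holder releases the lock.
            while (held.load(std::memory_order_relaxed)) {
            }
        }
    }

    void unlock() {
        // Unlocking a lock nobody holds is a caller bug.
        assert(held.load(std::memory_order_relaxed));
        // Release ordering publishes the critical section's writes
        // before the lock is seen as free.
        held.store(false, std::memory_order_release);
    }
};

// Hypothetical helper: bump a shared counter under the lock.
int locked_increment(SpinLock& spin, int& counter) {
    spin.lock();
    int updated = ++counter;
    spin.unlock();
    return updated;
}
```

    The inner load-only loop is the part that benefits from a snoop unit with duplicated L1 tags: waiters spin in their own cache and only generate coherency traffic when the lock word actually changes.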

    I'm still worried about how cache coherency would scale in a multi-chip configuration. But it seems ARM is not really targeting that market - for example Calxeda's solution does not support cache coherency at all between chips despite having 5x10Gb external interfaces for chip-to-chip communication. I suppose this makes some sense in the cloud server market.

    On a semi-related note, I'm far from convinced Calxeda's single-threaded performance is high enough. Even Amazon's EC2 sells compute units in virtual blocks equivalent to a 1-1.2GHz 2007 Opteron. Calxeda's top SKU is a 1.4GHz 4xA9 which will not beat a 1GHz K10 in single-threaded performance. And above that performance level they guarantee virtual cores with a performance level of a 2GHz K10. They'd need to clock an A15 at 2.5GHz or more to reach that level - that's perfectly doable on 28HP (in fact you could probably reach more than 3.5GHz since TSMC achieved 3.1GHz on an A9) but then your TDP is going to increase a lot and Calxeda's nodes per rack are going to go down quite a bit.

    I think the best thing that could happen to ARM is if the rumours are right and Apple is indeed seriously considering Marvell's ARMADA XP for iCloud servers. That would certainly give ARM a lot of credibility but I'm rather skeptical it will happen... Even if it's very clear that they are at least testing it (please ignore all the speculative part of that Ars Technica article - the writer clearly doesn't have a clue what he's talking about, but the presence of PJ4B and not PJ4 clearly implies it's about ARMADA XP and that's a server-only chip)
     
  7. Npl

    Npl
    Veteran

    Joined:
    Dec 19, 2004
    Messages:
    1,905
    Likes Received:
    7
    LL/SC is more powerful than CAS: you can implement the latter with the former, and you can avoid a lot of ABA problems.
    CAS can be more performant if it fits the task nicely, I guess.
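
    A rough C++11 sketch of both points, with all function names (`fetch_increment`, `pack`, `versioned_store`, `aba_demo`) invented for illustration. Portable C++ has no direct LL/SC, but `compare_exchange_weak` is allowed to fail spuriously precisely so it can be built from an LL/SC pair. The second half shows ABA: a value-only CAS cannot tell that the cell went A -> B -> A, whereas an ideal store-conditional would fail after any intervening write. The version counter here is a software workaround under that assumption, not something the hardware provides.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomic increment written as a CAS loop. compare_exchange_weak is allowed
// to fail spuriously, which is what lets it map onto a bare LL/SC pair.
std::uint32_t fetch_increment(std::atomic<std::uint32_t>& cell) {
    std::uint32_t seen = cell.load(std::memory_order_relaxed);
    while (!cell.compare_exchange_weak(seen, seen + 1,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
        // On failure 'seen' now holds the current value; just retry.
    }
    return seen;  // value before the increment
}

// Software ABA guard: a 32-bit version counter packed above a 32-bit value.
std::uint64_t pack(std::uint32_t version, std::uint32_t value) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}

// Every store bumps the version, so A -> B -> A never restores the same word.
void versioned_store(std::atomic<std::uint64_t>& cell, std::uint32_t value) {
    std::uint64_t old_word = cell.load(std::memory_order_relaxed);
    std::uint64_t new_word;
    do {
        std::uint32_t next_version =
            static_cast<std::uint32_t>(old_word >> 32) + 1;
        new_word = pack(next_version, value);
    } while (!cell.compare_exchange_weak(old_word, new_word,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed));
}

// Single-threaded walk-through of ABA, with the interleaving written by hand.
void aba_demo() {
    // Value-only CAS: the cell goes 1 -> 2 -> 1 after we took our snapshot...
    std::atomic<std::uint32_t> plain{1};
    std::uint32_t snapshot = plain.load();
    plain.store(2);
    plain.store(1);
    // ...and the CAS still succeeds, because it only compares values.
    assert(plain.compare_exchange_strong(snapshot, 9));

    // Versioned CAS: same history, but the version moved from 0 to 2.
    std::atomic<std::uint64_t> tagged{pack(0, 1)};
    std::uint64_t tagged_snapshot = tagged.load();
    versioned_store(tagged, 2);
    versioned_store(tagged, 1);
    assert(!tagged.compare_exchange_strong(tagged_snapshot, pack(1, 9)));
}
```

    The packed version counter costs half the word, which is one reason a CAS-only machine tends to want a double-width CAS for this kind of thing.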
     
  8. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    Do you have an example of this?

    If you want to check something specific I can try it on my Tegra2 "smartbook".
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I have been digging around for the reference. I'll post it here when I find it.
     
  10. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    110
    Likes Received:
    11
    Location:
    Russia
    It sounds as ridiculous as saying LD + OP can't implement the functionality of a dedicated LOAD-OP instruction :lol:
    On CELL, LL/SC provides a way to process atomic structs (up to 128 bytes) in a single transaction. CAS looks like a stone-age legacy to me. Like x87 transcendental instructions.
     
  11. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I just listened to the ARMv8 presentation video and it seems they might not only have special-case instructions for the SP; some general-purpose instructions will also treat the 32nd register as the SP instead of Zero: http://www.youtube.com/watch?v=GBeEEfmJ3NI#t=12m00

    I think it's clear they haven't just ignored the problem so there shouldn't be any problem with any realistic usage of the stack pointer.
     
  12. Ari64

    Newcomer

    Joined:
    Mar 27, 2010
    Messages:
    5
    Likes Received:
    0
    Nope, I was just going by what you and Laurent said. I do have a very good idea of how branch prediction works on the A8 though. It's actually a quite simple design, and not hard to figure out.

    I think that would make the L1 cache circuitry more complicated as it would have to support writing only certain bits and leaving the others with the old values. Well, something more than just selecting between byte/word sizes anyway. Seems like it would be more useful to provide a conditional-select SIMD instruction and then write the entire result.

    It's MIPS-like in having 31 GPR plus a zero register, separate 32 and 64 bit instructions, and dropping most of the conditional execution. Having a dedicated zero register suggests they're making some changes to how immediate values are loaded, but they haven't addressed that. The dedicated stack register is something that MIPS didn't have, and I would be surprised if they got rid of the register+register addressing, so it's not exactly like MIPS.
     
  13. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    "Unique Chips" explains it in pretty great detail. I expect the prediction on A9 to be similar if not outright the same. But I was talking about predication ;p

    NEON already has a conditional select. The point is to avoid the load instruction. Could make it cache bypass only. For that matter, a cache bypass store for NEON would be good in general.

    The 31 GPRs + zero register also goes for SPARC, PPC, Alpha, and probably lots of other RISC platforms. Making it a zero or other stuff register depending on context is more like PPC and in this case seems like a pretty clever idea (admittedly I like PPC a lot more than MIPS anyway). MIPS only has separate 32-bit and 64-bit add/subtract, while it's clear ARM will have a separate set for all ALU operations - I'm sure this is motivated by flags generation (and in this regard they're closer to x86-64)

    I wouldn't be surprised at all if ARM keeps the trend of somewhat complex immediates like they have with the original ISA, Thumb-2, and NEON. They'll probably re-evaluate what's most useful. I don't expect to see flat MIPS-style 16-bit immediates. I don't know what's up with the zero register per se, but the reveal that some instructions will address SP with it instead makes sense, and it means they won't have to have an rsb or neg style instruction. A zero register also means they won't have to do compare/test instructions and may even be useful for preloads.
     
  14. Ari64

    Newcomer

    Joined:
    Mar 27, 2010
    Messages:
    5
    Likes Received:
    0
    It's a fairly good general description, although it leaves out some details, such as the 1-cycle fetch delay for a predicted taken branch, the 3-cycle delay in the branch history, and that adjacent branch instructions can cause BTB collisions because they index the same BTB entry.

    Because of the dual-issue pipeline, the branch resolution on the A8 happens too late to stop the issuance of a subsequent store instruction. It appears that conditional stores and stores following mispredicted branches are handled similarly. The store is issued, then cancelled at the very end of the pipeline. No idea what A9 does in this situation.

    Hmm.. x86 has a cache bypass store. Apparently it causes a very long delay on intel cpus if you try to read those locations after writing. This implies a large store queue. I wonder if ARM would do something like that because of power consumption and silicon area. The store queue on the A8 appears to be quite small, and it's easy to fill it and block.

    MIPS has separate add, subtract, multiply, divide, and shift. Compares are done with sign-extended results. It seems ARM wants to avoid that for power-saving reasons. It remains to be seen if we'll get an integer divide instruction in ARMv8.

    If it was really PPC-like then there would be multiple sets of flags.

    A zero register would avoid needing an rsb with an immediate zero for negation. And MOV can be replaced with add zero. For power saving and dependency resolution you don't really want those add-as-mov instructions going through the ALU, and avoiding that just makes the instruction decoder more complex. hmm...
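
    To spell out the aliasing, here is a toy C++ model. It is purely illustrative: `RegFile`, `kZeroReg` and the `op_*` functions are hypothetical names, index 31 is simply assumed to be the zero register, and none of this reflects the actual (unpublished) AArch64 encoding.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy register file: indices 0..30 are real registers, index 31 reads as
// zero and swallows writes. Purely illustrative; not the AArch64 encoding.
constexpr unsigned kZeroReg = 31;

struct RegFile {
    std::array<std::uint64_t, 31> gpr{};

    std::uint64_t read(unsigned idx) const {
        assert(idx <= kZeroReg);
        return idx == kZeroReg ? 0 : gpr[idx];
    }

    void write(unsigned idx, std::uint64_t value) {
        assert(idx <= kZeroReg);
        if (idx != kZeroReg) {
            gpr[idx] = value;
        }
    }
};

// The only "real" ALU operations in this model.
void op_add(RegFile& r, unsigned rd, unsigned rn, unsigned rm) {
    r.write(rd, r.read(rn) + r.read(rm));
}

void op_sub(RegFile& r, unsigned rd, unsigned rn, unsigned rm) {
    r.write(rd, r.read(rn) - r.read(rm));
}

// Aliases that fall out of the zero register; no extra opcodes needed.
void op_neg(RegFile& r, unsigned rd, unsigned rm) {
    op_sub(r, rd, kZeroReg, rm);  // rd = 0 - rm
}

void op_mov(RegFile& r, unsigned rd, unsigned rm) {
    op_add(r, rd, kZeroReg, rm);  // rd = 0 + rm
}
```

    In this model `op_mov` still performs an add, which is the trade-off in question: either the move occupies the ALU, or something earlier in the pipeline has to recognise the zero-register form and treat it as a plain move.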
     
  15. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Do you have the complete copy?

    "Therefore, the instruction fetch currently in the F1 stage will need to be thrown away. This means there is a one-cycle bubble in the fetch pipeline whenever a branch prediction is made for a taken branch."

    "The BTB is indexed by the fetch address and contains branch target address and information about the branch type." (note that we know that fetch is 64-bit aligned so it goes without saying that the BTB can't index two branches in the same 64-bit block)

    Not sure about the 3 cycle branch history delay, I'll have to go read your optimization notes again.

    Yes, and exception detection happens even after issue and still has to be triggered during the final stages of the pipeline.

    The NEON store queue is a lot larger and I think we can both agree that Cortex-A15 and its successors will be a lot wider than A8 in these regards.

    Yeah I remembered mul/div right after I finished posting. This makes sense for performance reasons and I'm sure they aren't the only ones to do it, hell, a lot of other platforms have a whole variety of multiply widths.

    Did forget about shifts altogether, but aren't those kind of a given if you're extending a 32-bit arch like MIPS was? Unless you only want to be able to shift them by no more than 32.

    ARM-v7a was extended to make them optional, but with every A-series ARM processor post A9 getting it I would be surprised if ARM doesn't make it mandatory, and very surprised if it's dropped entirely.

    I said that particular feature was PPC like and made it more PPC like than MIPS. Not that it was overall PPC like, certainly not "almost identical" like you claim it is vs MIPS. It's its own ISA.

    I already mentioned rsb. Good call on mov - I don't know how much the alternate form adds to the decoder, but I'd be more curious if the post-decode instructions really want to carry extra width to handle instructions that the original doesn't have encode space for.

    I wouldn't expect ARM to fully rely on that for loading immediates either, since it wastes 5 bits of encode space that could have been used on the immediate. I don't expect ARM64 to get MIPS style 16-bit immediates generally and I do expect them to keep movt/movw capability at 16-bit.
     
  16. Laurent06

    Veteran

    Joined:
    Dec 14, 2007
    Messages:
    1,091
    Likes Received:
    491
    ARM 32-bit ISA already has to take care of mov since it's a variant of LSL imm. In fact it's the renaming stage that has to know about that to optimize some cases; whether the detection is done at decode or rename or in any other place before renaming is just an implementation detail that depends on critical paths in the design.

    Without telling too much (I obviously am under NDA), AArch64 encoding is much cleaner than AArch32 so this should be a non-issue :smile:

    BTW a note about architecture naming: ARMv8 is made of both AArch32 and AArch64 (the 64-bit ISA). So v8 shouldn't be considered as the 64-bit ISA, one should talk of AArch64 when talking about 64-bit stuff :wink:
     
  17. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Just the same, I'm going to keep calling it ARM64 because I find the official AArch64 name awful ;)

    (just like how I refuse to call anything "Advanced SIMD")
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Damn right it is. Who names these things? :roll:
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Maybe it's too generic. Googling ARM 64 showed a few instances of similar terms being used by other products.
     
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Maybe we should adopt a32 and a64 to match x86 and x86-64/x64.
     