ARM announces ARMv8 architecture

ARM's sticking with weak memory consistency going forward.
Additionally, AArch64 is adding load-acquire/store-release instructions, which seems to be a trend.
AMD's FSA architecture also uses this method for controlling visibility.
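
For anyone who hasn't used acquire/release semantics, here is a minimal sketch of the usual "publish a payload behind a flag" pattern in portable C++11 atomics. The function name `message_passing_demo` and the payload value are made up for illustration. As far as I understand, AArch64 compilers generally lower these two atomic operations to the new load-acquire and store-release instructions, but treat that as an assumption rather than a guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: publish a payload with a release store and
// consume it with an acquire load.
int message_passing_demo() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag guarding the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // ordinary store
        ready.store(true, std::memory_order_release);  // store-release
    });
    std::thread consumer([&] {
        // load-acquire: spin until the flag is observed as set
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the acquire load, so it reads 42
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The point of the pairing is that only the flag accesses carry ordering; the payload itself stays an ordinary load and store, so no full barrier is needed on either side.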

That may be fine for heterogeneous compute or for situations where there are coherent memory pools separated by significant latency and different memory/cache controllers.

That's not quite the direction for servers, or desktop for that matter.
AMD's x86 side and Intel in general have much stronger ordering, plus speculative hardware to silently reorder memory accesses.
On the server side, POWER7 has a stronger ordering mode available, even though traditionally its model was weaker like many RISC architectures.

Going by commentary on realworldtech's forums, Linus Torvalds really does not like weak consistency.
IIRC, LL/SC isn't enough to do lock-free algorithms. You REALLY need CAS. I didn't see it in arm64, but is it there in arm32?
 
IIRC, LL/SC isn't enough to do lock-free algorithms. You REALLY need CAS. I didn't see it in arm64, but is it there in arm32?
A quick Google tells me LDREX/STREX was introduced in ARMv6 and it seems to fit.
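
To make the connection concrete, here is a rough sketch of a CAS-style retry loop in portable C++. The helper `atomic_fetch_max` is a name I made up, not a standard function. On ARMv6/v7 a compare-exchange like this is typically compiled into an LDREX/STREX loop, which is also why the "weak" form is allowed to fail spuriously and has to be retried.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically raise `target` to at least `value`.
// Returns the value observed before the update.
int atomic_fetch_max(std::atomic<int>& target, int value) {
    int observed = target.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously (for example after a lost
    // LL/SC reservation), so it sits in a retry loop. On failure it
    // refreshes `observed` with the current contents of `target`.
    while (observed < value &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
    }
    return observed;
}
```

For example, with a target holding 5, `atomic_fetch_max(target, 9)` returns 5 and leaves 9 behind, while a later `atomic_fetch_max(target, 3)` returns 9 and changes nothing.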
 
I understand Linus Torvalds' reaction but I do not believe it is warranted. The key point is that you shouldn't look only at the ARM ISA itself but also the ARM MPCore model. This is quite similar to how the weak PPC ISA ordering model doesn't really matter in practice for POWER7. I'm not saying the effective ordering model is as strong as x86 (I'm fairly sure it's not) but the practical differences are a lot more subtle than most people believe.
The weak model matters for POWER7 enough in practice that IBM included a strong(er) mode.
Torvalds' exposure to low-level code problems may be the reason for his distaste for the weaker model, and the pernicious subtlety of the problems it exposes is apparently a major component of why he dislikes it.

Shock, horror, this is completely identical to the original Intel P6. They don't explicitly mention completion but I don't expect any surprises there. And yes, the MPCore model should effectively guarantee that writes by a single processor are observed in the same order by all processors. So all the basics are covered really.

The completion and visibility aspect of the pipeline is what makes up the memory model.
Is there some documentation that goes into depth about MPCore's model?

This indicates for the A9 that ARM considers MPCore to have a weak model.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0407e/CACBGEAD.html
 
ARM does implement a weakly ordered memory model, but amongst the three memory types - Normal, Device, Strongly-Ordered - the last two impose ordering requirements (and other restrictions) not present in Normal.

To avoid the inevitable post, yes, there is barely any difference between Device and Strongly-Ordered. None in fact inside the processor.

I'm not going to comment on the v8 architecture. This is the situation in v7.
 
A weak memory ordering model will help reduce power in low power implementations.

It will cost some performance in high performance implementations (ie. we know how to make a strong memory ordering model fast today).

I'm slightly surprised, since ARMv8 seems aimed at higher performance.

Cheers
 
So I'd like to apologise for being hilariously wrong - while the single-core model for ARM is perfectly reasonable, the ARMv7 MPCore model does NOT add the restrictions I thought it did based on my previous analysis (although there are some minor changes). This blog post makes it clear it's still weakly ordered: http://wanderingcoder.net/2011/04/01/arm-memory-ordering/ (along with the associated ARM blog post)

So how bad is it? I think Linus Torvalds' position makes a lot of sense:
But I thought ARM64 did "weak ordering" with then special ops to do acquire/release consistency for locking.

And every single time I've seen weak ordering, the cost of locking has been astronomical. And the upside of the weak ordering hasn't been very noticeable - while the downside has been huge.
So if we exclude the disadvantage of developers making wrong assumptions based on their x86 experience resulting in buggier code (which is indeed a big problem) then the main question is what's the cost of locking (and barriers in the less extreme cases)?

My understanding of the ARM MPCore architecture seems to indicate it should be much much more reasonable than in those old RISCs mentioned by Linus. In the case of a single 4xA9 chip, you have a shared L2 cache and duplicated L1 Physical TAGs in the Snoop Control Unit. ARM explicitly brags about their 'high performance, low power spinlock' on Page 20 of this presentation: http://www.iet-cambridge.org.uk/arc/seminar08/slides/JohnGoodacre.pdf
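
For reference, this is roughly what a spinlock needs from the memory model, sketched in portable C++. It is a plain test-and-set lock, not ARM's implementation from the slides, and `SpinLock` and `count_with_spinlock` are names invented for this example. The only ordering it asks for is acquire on lock and release on unlock, which is where the cost of barriers on a weakly ordered machine shows up.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal test-and-set spinlock (illustrative only).
class SpinLock {
public:
    void lock() {
        // Acquire: later accesses cannot move above a successful lock.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: earlier accesses cannot move below the unlock.
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical stress helper: each thread bumps a plain counter under the lock.
long count_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    long counter = 0;  // deliberately non-atomic, protected by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

If the acquire/release pair is correct, `count_with_spinlock(4, 10000)` comes out at exactly 40000 even though the counter itself is an ordinary variable.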

I'm still worried about how cache coherency would scale in a multi-chip configuration. But it seems ARM is not really targeting that market - for example Calxeda's solution does not support cache coherency at all between chips despite having 5x10Gb external interfaces for chip-to-chip communication. I suppose this makes some sense in the cloud server market.

In a semi-related note, I'm far from convinced Calxeda's single-threaded performance is high enough. Even Amazon's EC2 sells compute units in virtual blocks equivalent to a 1-1.2GHz 2007 Opteron. Calxeda's top SKU is a 1.4GHz 4xA9 which will not beat a 1GHz K10 in single-threaded performance. And above that performance level they guarantee virtual cores with a performance level of a 2GHz K10. They'd need to clock an A15 at least 2.5GHz to reach that level - that's perfectly doable on 28HP (in fact you could probably reach more than 3.5GHz since TSMC achieved 3.1GHz on an A9) but then your TDP is going to increase a lot and Calxeda's nodes per rack are going to go down quite a bit.

I think the best thing that could happen to ARM is if the rumours are right and Apple is indeed seriously considering Marvell's ARMADA XP for iCloud servers. That would certainly give ARM a lot of credibility but I'm rather skeptical it will happen... Even if it's very clear that they are at least testing it (please ignore all the speculative part of that Ars Technica article - the writer clearly doesn't have a clue what he's talking about, but the presence of PJ4B and not PJ4 clearly implies it's about ARMADA XP and that's a server-only chip)
 
IIRC, LL/SC isn't enough to do lock-free algorithms. You REALLY need CAS. I didn't see it in arm64, but is it there in arm32?
LL/SC is more powerful than CAS: you can implement the latter with the former, and you can avoid a lot of ABA problems.
CAS can be more performant if it fits the task nicely I guess
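
A small single-threaded sketch of the ABA issue, with the interleaving written out by hand so it is deterministic. The version counter here is only a stand-in for what an LL/SC reservation gives you for free (any intervening write makes the SC fail); the packing layout and function names are assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Pack a 32-bit value with a 32-bit version counter (hypothetical layout).
constexpr std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}

// Plain CAS on the value alone: an A -> B -> A sequence goes unnoticed.
bool plain_cas_misses_aba() {
    std::atomic<std::uint32_t> slot{0xA};
    std::uint32_t expected = slot.load();  // "thread 1" reads A
    slot.store(0xB);                       // "thread 2" writes B ...
    slot.store(0xA);                       // ... then writes A back
    // Succeeds although the slot was modified twice in between.
    return slot.compare_exchange_strong(expected, 0xC);
}

// Versioned CAS: every write bumps the version, so the stale snapshot fails,
// much like an SC would after the reservation was lost.
bool versioned_cas_detects_aba() {
    std::atomic<std::uint64_t> slot{pack(0xA, 0)};
    std::uint64_t expected = slot.load();  // snapshot: (A, version 0)
    slot.store(pack(0xB, 1));
    slot.store(pack(0xA, 2));
    // Fails: the value matches but the version does not.
    return slot.compare_exchange_strong(expected, pack(0xC, 3));
}
```

The first function returns true (the CAS is fooled), the second returns false (the stale update is rejected), which is the behaviour you get from LL/SC without having to spend bits on a counter.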
 
I have been digging around for the reference. I'll post it here when I find it.
It sounds as ridiculous as saying LD + OP can't implement the functionality of a dedicated LOAD-OP instruction :LOL:
On CELL, LL/SC provides a way to process structs atomically (up to 128 bytes) in a single transaction. CAS looks like a stone-age legacy to me. Like x87 transcendental instructions.
 
I wasn't saying to make less commonly used stack operations slow, I'm saying that almost no one is using the stack pointer as anything but a stack pointer and it's fine to make special case instructions for things you want with a stack pointer.
I just listened to the ARMv8 presentation video and it seems they might not only have special case instructions for the SP, but some general-purpose instructions will also treat the 32nd register as referring to the SP instead of Zero: http://www.youtube.com/watch?v=GBeEEfmJ3NI#t=12m00

I think it's clear they haven't just ignored the problem so there shouldn't be any problem with any realistic usage of the stack pointer.
 
Do you know more about the nature of predication on A9?
Nope, I was just going by what you and Laurent said. I do have a very good idea of how branch prediction works on the A8 though. It's actually a quite simple design, and not hard to figure out.

I'm still hoping for masked SIMD stores, since those are even more useful than scalar conditional stores. But I don't expect to get them..
I think that would make the L1 cache circuitry more complicated as it would have to support writing only certain bits and leaving the others with the old values. Well, something more than just selecting between byte/word sizes anyway. Seems like it would be more useful to provide a conditional-select SIMD instruction and then write the entire result.

I think "almost exactly like MIPS" is grossly underselling it. There's more to ARM vs MIPS than predication and block memory instructions. Until we hear otherwise - and going off the comment that the ISA was kept as similar as possible save the mentioned exceptions - I'm going to assume that ARM64 has register + scaled register addressing, pre/post-increment, folded shifts, and a variety of bit-manipulation/multiplication/ALU instructions MIPS64 lacks. And is still flags based.
It's MIPS-like in having 31 GPR plus a zero register, separate 32 and 64 bit instructions, and dropping most of the conditional execution. Having a dedicated zero register suggests they're making some changes to how immediate values are loaded, but they haven't addressed that. The dedicated stack register is something that MIPS didn't have, and I would be surprised if they got rid of the register+register addressing, so it's not exactly like MIPS.
 
Nope, I was just going by what you and Laurent said. I do have a very good idea of how branch prediction works on the A8 though. It's actually a quite simple design, and not hard to figure out.

"Unique Chips" explains it in pretty great detail. I expect the prediction on A9 to be similar if not outright the same. But I was talking about predication ;p

I think that would make the L1 cache circuitry more complicated as it would have to support writing only certain bits and leaving the others with the old values. Well, something more than just selecting between byte/word sizes anyway. Seems like it would be more useful to provide a conditional-select SIMD instruction and then write the entire result.

NEON already has a conditional select. The point is to avoid the load instruction. Could make it cache bypass only. For that matter, a cache bypass store for NEON would be good in general.
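
To spell out the difference, here is a scalar C++ model of the select-then-store workaround. `Vec4` is just a stand-in for a 128-bit register, and `select` and `masked_store_emulated` are names made up for this sketch (NEON's bitwise select is VBSL, but this is not intrinsics code).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<std::uint32_t, 4>;  // stand-in for a 128-bit SIMD register

// Bitwise select: take bits of `a` where the mask is set, else bits of `b`.
Vec4 select(const Vec4& mask, const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (a[i] & mask[i]) | (b[i] & ~mask[i]);
    }
    return out;
}

// Emulated masked store: load old contents, select, store the full vector.
// The extra load is what a true masked store would avoid.
void masked_store_emulated(Vec4& memory, const Vec4& mask, const Vec4& value) {
    const Vec4 old = memory;            // the load we would like to skip
    memory = select(mask, value, old);  // full-width store
}
```

With memory holding {1, 2, 3, 4}, a mask enabling lanes 0 and 2, and a value of {10, 20, 30, 40}, the result is {10, 2, 30, 4}. A real masked store would produce the same memory contents without ever reading the old vector.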

It's MIPS-like in having 31 GPR plus a zero register, separate 32 and 64 bit instructions, and dropping most of the conditional execution.

The 31 GPRs + zero register also goes for SPARC, PPC, Alpha, and probably lots of other RISC platforms. Making it a zero or other stuff register depending on context is more like PPC and in this case seems like a pretty clever idea (admittedly I like PPC a lot more than MIPS anyway). MIPS only has separate 32-bit and 64-bit add/subtract, while it's clear ARM will have a separate set for all ALU operations - I'm sure this is motivated by flags generation (and in this regard they're closer to x86-64)

Having a dedicated zero register suggests they're making some changes to how immediate values are loaded, but they haven't addressed that. The dedicated stack register is something that MIPS didn't have, and I would be surprised if they got rid of the register+register addressing, so it's not exactly like MIPS.

I wouldn't be surprised at all if ARM keeps the trend of somewhat complex immediates like they have with the original ISA, Thumb-2, and NEON. They'll probably re-evaluate what's most useful. I don't expect to see flat MIPS-style 16-bit immediates. I don't know what's up with the zero register per se, but the reveal that some instructions will address SP with it instead makes sense, and it means they won't have to have an rsb or neg style instruction. A zero register also means they won't have to do compare/test instructions and may even be useful for preloads.
 
"Unique Chips" explains it in pretty great detail. I expect the prediction on A9 to be similar if not outright the same. But I was talking about predication ;p
It's a fairly good general description, although it leaves out some details, such as the 1-cycle fetch delay for a predicted taken branch, the 3-cycle delay in the branch history, and that adjacent branch instructions can cause BTB collisions because they index the same BTB entry.

Because of the dual-issue pipeline, the branch resolution on the A8 happens too late to stop the issuance of a subsequent store instruction. It appears that conditional stores and stores following mispredicted branches are handled similarly. The store is issued, then cancelled at the very end of the pipeline. No idea what A9 does in this situation.

NEON already has a conditional select. The point is to avoid the load instruction. Could make it cache bypass only. For that matter, a cache bypass store for NEON would be good in general.
Hmm.. x86 has a cache bypass store. Apparently it causes a very long delay on intel cpus if you try to read those locations after writing. This implies a large store queue. I wonder if ARM would do something like that because of power consumption and silicon area. The store queue on the A8 appears to be quite small, and it's easy to fill it and block.

The 31 GPRs + zero register also goes for SPARC, PPC, Alpha, and probably lots of other RISC platforms. Making it a zero or other stuff register depending on context is more like PPC and in this case seems like a pretty clever idea (admittedly I like PPC a lot more than MIPS anyway). MIPS only has separate 32-bit and 64-bit add/subtract, while it's clear ARM will have a separate set for all ALU operations - I'm sure this is motivated by flags generation (and in this regard they're closer to x86-64)
MIPS has separate add, subtract, multiply, divide, and shift. Compares are done with sign-extended results. It seems ARM wants to avoid that for power-saving reasons. It remains to be seen if we'll get an integer divide instruction in ARMv8.

If it was really PPC-like then there would be multiple sets of flags.

I wouldn't be surprised at all if ARM keeps the trend of somewhat complex immediates like they have with the original ISA, Thumb-2, and NEON. They'll probably re-evaluate what's most useful. I don't expect to see flat MIPS-style 16-bit immediates. I don't know what's up with the zero register per se, but the reveal that some instructions will address SP with it instead makes sense, and it means they won't have to have an rsb or neg style instruction. A zero register also means they won't have to do compare/test instructions and may even be useful for preloads.
A zero-register would avoid needing a rsb with an immediate zero for negation. And MOV can be replaced with add zero. For power saving and dependency resolution you don't really want those add-as-mov instructions going through the ALU, and avoiding that just makes the instruction decoder more complex. hmm...
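
A toy model of the idea, in C++, purely to illustrate the aliasing. This is not the actual ARM64 encoding (which hasn't been published in detail); the register numbering and the names `RegFile`, `mov` and `neg` are assumptions for the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy register file: 31 general registers plus index 31 reading as zero.
// Illustrative model only, not the real encoding rules.
struct RegFile {
    std::array<std::uint64_t, 31> gpr{};
    static constexpr unsigned kZero = 31;

    std::uint64_t read(unsigned r) const { return r == kZero ? 0 : gpr[r]; }
    void write(unsigned r, std::uint64_t v) { if (r != kZero) gpr[r] = v; }
};

// The only "real" ALU operations in this toy machine.
void add(RegFile& rf, unsigned rd, unsigned rn, unsigned rm) {
    rf.write(rd, rf.read(rn) + rf.read(rm));
}
void sub(RegFile& rf, unsigned rd, unsigned rn, unsigned rm) {
    rf.write(rd, rf.read(rn) - rf.read(rm));
}

// Aliases that need no opcode of their own once a zero register exists.
void mov(RegFile& rf, unsigned rd, unsigned rm) { add(rf, rd, RegFile::kZero, rm); }
void neg(RegFile& rf, unsigned rd, unsigned rm) { sub(rf, rd, RegFile::kZero, rm); }
```

In this model both aliases still pass through `add`/`sub`, which is exactly the ALU traffic the post worries about; skipping it would mean recognising the zero operand earlier in the pipeline.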
 
It's a fairly good general description, although it leaves out some details, such as the 1-cycle fetch delay for a predicted taken branch, the 3-cycle delay in the branch history, and that adjacent branch instructions can cause BTB collisions because they index the same BTB entry.

Do you have the complete copy?

"Therefore, the instruction fetch currently in the F1 stage will need to be thrown away. This means there is a one-cycle bubble in the fetch pipeline whenever a branch prediction is made for a taken branch."

"The BTB is indexed by the fetch address and contains branch target address and information about the branch type." (note that we know that fetch is 64-bit aligned so it goes without saying that the BTB can't index two branches in the same 64-bit block)
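
A quick C++ sketch of why that follows. The 8-byte fetch granule comes from the quote above; the entry count of 512 and the simple modulo indexing are assumptions for illustration, and the model ignores tags entirely.

```cpp
#include <cassert>
#include <cstdint>

// Assumed parameters for illustration only.
constexpr std::uint32_t kFetchBytes = 8;    // 64-bit aligned fetch, per the quote
constexpr std::uint32_t kBtbEntries = 512;  // assumption, must be a power of two

// Index the BTB by fetch address: drop the offset within the fetch block,
// then keep enough low bits to select an entry.
constexpr std::uint32_t btb_index(std::uint32_t branch_addr) {
    return (branch_addr / kFetchBytes) % kBtbEntries;
}

// Two branches collide when they map to the same entry.
constexpr bool btb_collide(std::uint32_t a, std::uint32_t b) {
    return btb_index(a) == btb_index(b);
}
```

Under these assumptions branches at 0x1000 and 0x1004 share an entry, because they sit in the same 64-bit fetch block, while 0x1000 and 0x1008 do not.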

Not sure about the 3 cycle branch history delay, I'll have to go read your optimization notes again.

Because of the dual-issue pipeline, the branch resolution on the A8 happens too late to stop the issuance of a subsequent store instruction. It appears that conditional stores and stores following mispredicted branches are handled similarly. The store is issued, then cancelled at the very end of the pipeline. No idea what A9 does in this situation.

Yes, and exception detection happens even after issue and still has to be triggered during the final stages of the pipeline.

Hmm.. x86 has a cache bypass store. Apparently it causes a very long delay on intel cpus if you try to read those locations after writing. This implies a large store queue. I wonder if ARM would do something like that because of power consumption and silicon area. The store queue on the A8 appears to be quite small, and it's easy to fill it and block.

The NEON store queue is a lot larger and I think we can both agree that Cortex-A15 and its successors will be a lot wider than A8 in these regards.

MIPS has separate add, subtract, multiply, divide, and shift.

Yeah I remembered mul/div right after I finished posting. This makes sense for performance reasons and I'm sure they aren't the only ones to do it, hell, a lot of other platforms have a whole variety of multiply widths.

Did forget about shifts altogether, but aren't those kind of a given if you're extending a 32-bit arch like MIPS was? Unless you only want to be able to shift them by no more than 32.

Compares are done with sign-extended results. It seems ARM wants to avoid that for power-saving reasons. It remains to be seen if we'll get an integer divide instruction in ARMv8.

ARMv7-A was extended to make integer divide optional, but with every A-series ARM processor after the A9 getting it, I would be surprised if ARM doesn't make it mandatory, and very surprised if it's dropped entirely.

If it was really PPC-like then there would be multiple sets of flags.

I said that particular feature was PPC like and made it more PPC like than MIPS. Not that it was overall PPC like, certainly not "almost identical" like you claim it is vs MIPS. It's its own ISA.

A zero-register would avoid needing a rsb with an immediate zero for negation. And MOV can be replaced with add zero. For power saving and dependency resolution you don't really want those add-as-mov instructions going through the ALU, and avoiding that just makes the instruction decoder more complex. hmm...

I already mentioned rsb. Good call on mov - I don't know how much the alternate form adds to the decoder, but I'd be more curious if the post-decode instructions really want to carry extra width to handle instructions that the original doesn't have encode space for.

I wouldn't expect ARM to fully rely on that for loading immediates either, since it wastes 5 bits of encode space that could have been used on the immediate. I don't expect ARM64 to get MIPS style 16-bit immediates generally and I do expect them to keep movt/movw capability at 16-bit.
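
For anyone unfamiliar with the movw/movt pair, this is how it behaves on ARMv7, modelled in C++. Whether ARM64 keeps the same scheme is my guess, not something ARM has announced. `load_imm32` is a made-up helper name.

```cpp
#include <cassert>
#include <cstdint>

// MOVW: write a 16-bit immediate into the low half and zero the high half.
constexpr std::uint32_t movw(std::uint16_t imm16) {
    return imm16;
}

// MOVT: write a 16-bit immediate into the high half, keeping the low half.
constexpr std::uint32_t movt(std::uint32_t reg, std::uint16_t imm16) {
    return (reg & 0x0000FFFFu) | (static_cast<std::uint32_t>(imm16) << 16);
}

// Hypothetical helper: build any 32-bit constant with the two-instruction pair.
constexpr std::uint32_t load_imm32(std::uint32_t value) {
    return movt(movw(static_cast<std::uint16_t>(value & 0xFFFFu)),
                static_cast<std::uint16_t>(value >> 16));
}
```

So 0x12345678 becomes movw with 0x5678 followed by movt with 0x1234: two instructions, each carrying a full 16-bit immediate with no bits spent on naming a zero register as a source.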
 
I already mentioned rsb. Good call on mov - I don't know how much the alternate form adds to the decoder, but I'd be more curious if the post-decode instructions really want to carry extra width to handle instructions that the original doesn't have encode space for.
ARM 32-bit ISA already has to take care of mov since it's a variant of LSL imm. In fact it's the renaming stage that has to know about that to optimize some cases; whether the detection is done at decode or rename or in any other place before renaming is just an implementation detail that depends on critical paths in the design.

Without telling too much (I obviously am under NDA), AArch64 encoding is much cleaner than AArch32 so this should be a non-issue :smile:

BTW a note about architecture naming: ARMv8 is made of both AArch32 and AArch64 (the 64-bit ISA). So v8 shouldn't be considered as the 64-bit ISA, one should talk of AArch64 when talking about 64-bit stuff ;)
 
Just the same, I'm going to keep calling it ARM64 because I find the official AArch64 name awful ;)

(just like how I refuse to call anything "Advanced SIMD")
 