ARM announces ARMv8 architecture

The flags register is an additional dependency for the OoO engine. Intel had a lot of trouble with it in the past (not sure about the present). The flag-less MIPS/SPU way is much cleaner and leads to simpler hardware, IMHO.

The problem on x86 is that most operations set the flags, but not all of them set all of the flags, so partial flag generation becomes a read-modify-write of the unified flags register, which creates a dependency.

ARM historically would have fewer problems because modifying flags is optional, and I assume they kept this for AArch64. There is a problem with logic operations not setting overflow and not always setting carry, and it's even worse for shift-by-register, where the carry setting isn't known until run time, always creating a partial dependency. But usually flags are set with add, sub, or cmp/cmn, and flag setting on variable register shifts would be particularly rare. ARM may well have decided to always set V and C on AArch64 logic operations, though.

Well, it is good that they still have some predication. It is just not as useful and orthogonal as before.

In the older ARM CPUs predication was nice because it was implemented such that the operation was cancelled early in the pipeline before any execution work was performed on it, hence why unexecuted instructions would take 1 cycle even in the case of multi-cycle instructions.

On more deeply pipelined and OoO processors this approach is not practical, so look at the alternatives in play:

Cortex-A8: Make the target register a load dependency and have the writeback stage perform the conditional select. This requires an additional register file port that would otherwise not be that useful, wasting area and power. Also requires special circuitry to cancel load exceptions or stores that are predicated, not to mention mask out other side-effects like flags.
Cortex-A9: Convert the instruction into a branch + op pair and lose the benefits of predication for unpredictable conditions entirely, and possibly pollute the BTB (especially if you end up with collisions from densely packed branch chains)

We don't know what A15 or A7 implement, but I doubt they're taking A8's approach; my guess is that if they don't do it like A9, they do it by cracking the operation into execute and select stages. All of the side-effects still have to be specially handled.

So to me it makes a lot more sense to simply include the select as an explicit operation and let the software crack itself. Yes, it increases fetch/decode pressure, but has the following benefits:

- Software can determine for itself how to deal with side effects: in a lot of cases it won't care if loads or flag setting happens regardless of the predicate condition, so no work will have to be done to that end.
- Software is discouraged from having long predicated chains and instead only selecting results from two separate paths, where appropriate, which is bound to use less resources.
- Conditional select is actually useful in its own right, there are a lot of cases in ARM code of if(condition) x = a; if(opposite(condition)) x = b; and on Cortex-A8 at least these operations can't pair. A conditional select replaces this.

And ARM style predication was really wasting a ton of instruction bits. Is it really worth having it over 31 registers? Especially when most new instructions didn't allow it anyway.

I said "energy effective", not bottlenecked =) More instructions -> more cache used + more power used on F/D/E.

ldm/stm doesn't save execution (vs ldr/str), just fetch and decode. And not always decode: look at Cortex-A8, where not only do the instructions block the decoders but only one transfer cycle can pair. That means that it actually uses more decode resources than N ldr/str over the N cycles the ldm/stm takes (but still loses because ldr/str can't achieve the same throughput). With register pair load/store and dynamic scheduling I'm sure ldm/stm wouldn't have a throughput advantage.

I would also expect the processor to keep fetching until it fills the fetch queue. If this isn't a bottleneck for execution then improved density may not mean less fetching, since more instructions would be fetched instead. You could arguably make the fetch queue smaller, but ldm/stm tend to be pretty bursty and not something you'd want to take a worst case hit on.

Worse code density does result in more L1 icache misses which costs more power. But this is hardly a linear increase, and it's difficult to quantify if the extra logic necessary for better density outweighs having more cache.

They have solved the problem of sequential push/pop, so it became beneficial compared to mov [sp+n] (compact instructions)

Much harder to realize an advantage like that w/fixed length instructions.

Console PPC processors are based on this logic.
So we got 21-cycle register shifts, 50+ cycle LHS (load-hit-store) stalls, and 50+ cycle branches on VMX flags.

I wasn't saying to make less commonly used stack operations slow; I'm saying that almost no one uses the stack pointer as anything but a stack pointer, and it's fine to make special-case instructions for the things you want with a stack pointer.

I just hope that ADD Rx, SP, #offset is still possible

It was in Thumb so my guess is it'll be here.
 
Cortex-A9: Convert the instruction into a branch + op pair and lose the benefits of predication for unpredictable conditions entirely, and possibly pollute the BTB (especially if you end up with collisions from densely packed branch chains)
I don't understand what you mean here: do you think A9 splits cond instructions into branch + op so early in the pipeline that it could pollute BTB? That's not the case.

FWIW regarding the removal of the conditional instructions, the new ISA was tested with an ISS taking care of L1 icache and branch pred; comparisons against ARM ISA were done (of course using the same version of the compiler). The same applies to the removal of ldm/stm. So the tradeoff was certainly carefully analysed, but heh that's no surprise, no instruction set can be designed without careful analysis ;)
 
I don't understand what you mean here: do you think A9 splits cond instructions into branch + op so early in the pipeline that it could pollute BTB? That's not the case.

That kinda caught me off guard too. There's no real reason to branch predict a conditional. That's one of the benefits of having conditionals....
 
I don't understand what you mean here: do you think A9 splits cond instructions into branch + op so early in the pipeline that it could pollute BTB? That's not the case.

ARM hasn't specified what they actually do with Cortex-A9, although it's clear that they aren't handling it like A8. All I was told is that it's "turned into a branch." Would you care to tell me what it's actually doing? I don't see how it could effectively turn it into branch + op without involving all prediction mechanisms, anything else is not a branch..

If it doesn't involve prediction and the potential for misprediction why is the advice given to not use it over blocks of more than one instruction? Surely hard to predict sequences of two instructions would still prefer whatever they're doing if it eschews prediction.
 
I don't have the right to detail the micro-architecture. My advice is that you look for patents issued by ARM about register renaming (I don't claim they describe what A9 does but that should give you some ideas).
 
So you can tell me what it doesn't do but not what it does.. what a tease ;) ARM should be way more forthright with this information, since it actually does have a bearing on writing good code..

I'll think about looking at the patents but quite frankly my reading comprehension rate for patents is incredibly low :( If it really has to do with register renaming then I would guess it's handled in the retirement stage by redirecting the renamed output away from the register file, which seems pretty clever. No idea what it'd do with the non-register based side-effects though. Or why it's bad to do this over many operations.
 
Just thought I'd chime in to say it sounds like a very solid ISA :) I don't have much to add over the (very nice) discussion going on right now but here are a few unrelated points:
  • Previous roadmap disclosures revealed the existence of the 64-bit Athena and Atlas cores. Presumably one will be an incremental improvement over the A15 and the other will be very similar to the A7 to work in an ARMv8 big.LITTLE configuration. That same old roadmap also revealed that ARM was planning to add SMT but not until the generation after Athena (i.e. 2 generations after the A15).
  • It's not clear whether it's actually beneficial to have 31 GPRs on a high-end OoOE machine but because of big.LITTLE we can be certain that ARM will be designing ARMv8 in-order cores as well (see above) where the benefits will be much more obvious.
  • Fixed-Width instructions make a lot of sense today given the huge amounts of cache involved in application processors for both performance and especially power reasons (saving external bandwidth). However a cache miss on L1 is still expensive - I wonder if high-end ARMv8 processors will start including 64KB L1i or alternatively a 3-level cache hierarchy similar to Nehalem with a much smaller/faster per-core L2.
  • I don't know if it was a commercial or a technical decision, but I think it will probably turn out to have been commercially very attractive not to have supported 64-bit in the Cortex-A15 because it will allow ARM to be much more incremental in the performance of its next-generation IPs and still have the vast majority of target customers license them (arguably unlike the Cortex-A5 where people are still making new ARM11/ARM9-based chips).
 
So you can tell me what it doesn't do but not what it does.. what a tease ;) ARM should be way more forthright with this information, since it actually does have a bearing on writing good code..
I'm not sure you get more information from AMD or Intel; all they provide is an optimization guide. The Cortex-A Series Programmer's Guide contains some useful information.

Don't get me wrong, I agree that having more detailed documentation about micro-architecture might be beneficial, but that's the way things are.

For the fun, here is a 2007 doc about A9 uarch :smile:
 
Have already read that.

The optimization guides from Intel and AMD actually are more useful than what ARM gives - the most you get is pretty incomplete timing information in a TRM, which also tends to be full of errors. I hear there are optimization guides but they aren't publicly available, which is pretty mysterious.

Plus Intel and AMD chips get more openly analyzed by third parties, like Agner Fog.. Too bad there aren't more people doing that for ARM.
 
If it doesn't involve prediction and the potential for misprediction why is the advice given to not use it over blocks of more than one instruction? Surely hard to predict sequences of two instructions would still prefer whatever they're doing if it eschews prediction.

They probably implement conditional execution the way every other OoO CPU does: for each conditional instruction, inject two instructions into the ROB, one using the condition code as input, the other using the negated condition code. Both execute, and at instruction retirement you throw away the one that doesn't satisfy the condition.

It means that each conditional instruction takes two slots in the ROB and executes twice, so it should only be used when there is a clear benefit.

Cheers
 
They probably implement conditional execution the way every other OoO CPU does: for each conditional instruction, inject two instructions into the ROB, one using the condition code as input, the other using the negated condition code. Both execute, and at instruction retirement you throw away the one that doesn't satisfy the condition.

It means that each conditional instruction takes two slots in the ROB and executes twice, so it should only be used when there is a clear benefit.

Cheers

But for predicated instructions like on ARM the "opposite" condition is just a no-op. So you'd really just issue one instruction and at retirement throw it away or not. A second slot could still possibly be needed to read the flags, I guess.

Everything I said about non-register side effects probably doesn't make sense since you'd have to handle clearing anything at that point anyway, in order to flush the pipeline due to a branch mispredict.

I guess if I actually had a Cortex-A9 I'd try some stuff out..
 
But for predicated instructions like on ARM the "opposite" condition is just a no-op. So you'd really just issue one instruction and at retirement throw it away or not. A second slot could still possibly be needed to read the flags, I guess.

You're right, mea culpa, I was thinking of a conditional move instruction.

Either they have an extra input for the condition code for each slot in the ROB and discard at retirement, or they issue a pair of instructions, one of which reads the condition code and tells the retirement stage to throw both away (if the condition is not met).

Cheers
 
In the older ARM CPUs predication was nice because it was implemented such that the operation was cancelled early in the pipeline before any execution work was performed on it, hence why unexecuted instructions would take 1 cycle even in the case of multi-cycle instructions.

On more deeply pipelined and OoO processors this approach is not practical, so look at the alternatives in play:

Cortex-A8: Make the target register a load dependency and have the writeback stage perform the conditional select. This requires an additional register file port that would otherwise not be that useful, wasting area and power. Also requires special circuitry to cancel load exceptions or stores that are predicated, not to mention mask out other side-effects like flags.
Cortex-A9: Convert the instruction into a branch + op pair and lose the benefits of predication for unpredictable conditions entirely, and possibly pollute the BTB (especially if you end up with collisions from densely packed branch chains)
Well, replacing a conditional instruction with an unconditional instruction plus a conditional select ends up having the same load dependencies, so there really isn't much difference there.

Not having to cancel load exceptions may make things a bit simpler. The x86 CMOV instructions have always worked this way - if the address is invalid, you get an exception regardless of the condition code.

Not having conditional stores does hurt pipelining somewhat. On the A8, conditional stores don't stall the pipeline and it basically just drops the store from the queue if the condition is false, which is faster than treating it like a branch misprediction and flushing the entire pipeline. I'm guessing that skipping the stores would make hazard detection (load-after-store) too complex on an out-of-order design, so that's why they don't do it that way on the A9.

Without true conditional stores, these have to get written as a read-modify-write sequence. Theoretically this means that you end up stalling a little earlier in the pipeline if the read misses the L1 cache, but in practice it's probably not going to make that much difference since stores end up allocating cache lines anyway.

And ARM style predication was really wasting a ton of instruction bits. Is it really worth having it over 31 registers? Especially when most new instructions didn't allow it anyway.
Honestly, ARM's new instruction set looks almost exactly like MIPS without delay slots.
 
Well, replacing a conditional instruction with an unconditional instruction plus a conditional select ends up having the same load dependencies, so there really isn't much difference there.

Hence why ARM mostly dropped them and added conditional selects.

The main advantage of predication is avoiding branch mispredict penalties, not avoiding dependencies. Avoiding dependencies would seem to need condition prediction, which would be no more predictable than whatever branch it's trying to replace.

Not having to cancel load exceptions may make things a bit simpler. The x86 CMOV instructions have always worked this way - if the address is invalid, you get an exception regardless of the condition code.

In retrospect I'm not sure whether it'd make a big difference, since the pipeline has to be capable of canceling anything clear up to writeback to support branch mispredict flushes anyway, including whatever exceptions a speculative load could have caused. So you'd still be safe to cancel it by the time the condition was resolved.

Not having conditional stores does hurt pipelining somewhat. On the A8, conditional stores don't stall the pipeline and it basically just drops the store from the queue if the condition is false, which is faster than treating it like a branch misprediction and flushing the entire pipeline. I'm guessing that skipping the stores would make hazard detection (load-after-store) too complex on an out-of-order design, so that's why they don't do it that way on the A9.

Do you know more about the nature of predication on A9?

I'm still hoping for masked SIMD stores, since those are even more useful than scalar conditional stores. But I don't expect to get them..

Honestly, ARM's new instruction set looks almost exactly like MIPS without delay slots.

I think "almost exactly like MIPS" is grossly underselling it. There's more to ARM vs MIPS than predication and block memory instructions. Until we hear otherwise - and going off the comment that the ISA was kept as similar as possible save the mentioned exceptions - I'm going to assume that ARM64 has register + scaled register addressing, pre/post-increment, folded shifts, and a variety of bit-manipulation/multiplication/ALU instructions MIPS64 lacks. And is still flags based.
 
ARM's sticking with weak memory consistency going forward.
Additionally, AArch64 is adding load-acquire/store-release instructions, which seems to be a trend.
AMD's FSA architecture also uses this method for controlling visibility.

That may be fine for heterogeneous compute or for situations where there are coherent memory pools separated by significant latency and different memory/cache controllers.

That's not quite the direction for servers, or desktop for that matter.
AMD's x86 side, and Intel in general have much stronger ordering and speculative hardware to silently reorder memory accesses.
On the server side, POWER7 has a stronger ordering mode available, even though traditionally its model was weaker like many RISC architectures.

Going by commentary on realworldtech's forums, Linus Torvalds really does not like weak consistency.
 
EDIT: Ignore what's below, it's fairly likely to be wrong, gah - sorry.
Going by commentary on realworldtech's forums, Linus Torvalds really does not like weak consistency.
I understand Linus Torvalds' reaction but I do not believe it is warranted. The key point is that you shouldn't look only at the ARM ISA itself but also the ARM MPCore model. This is quite similar to how the weak PPC ISA ordering model doesn't really matter in practice for POWER7. I'm not saying the effective ordering model is as strong as x86 (I'm fairly sure it's not) but the practical differences are a lot more subtle than most people believe.

For example, here's what Linus Torvalds said on RWT:
Linus Torvalds said:
Let's see how well A15 does. My guess is that it will have a fairly simplistic memory pipeline, go for a fairly simplistic model of memory ordering (probably "wildly out of order") which people will claim is really good for performance and which will then suck horribly for serialization and locks.
That is clearly wrong. Here is what the A15 presentation says about the memory pipeline for a single processor:
  • 16-entry issue queue for loads and stores
  • Common queue for ARM and NEON/memory operations
  • Loads issue out-of-order but cannot bypass stores
  • Stores issue in order, but only require address sources to issue
Shock, horror, this is completely identical to the original Intel P6. They don't explicitly mention completion but I don't expect any surprises there. And yes, the MPCore model should effectively guarantee that writes by a single processor are observed in the same order by all processors. So all the basics are covered really.
 