So how would you perform stack unwinding on exception handling?
pop?
The flag register is an additional dependency for the OoO engine. Intel had a lot of trouble with it in the past (not sure about the present). The flag-less MIPS/SPU way is much cleaner and leads to simpler hardware, IMHO.
Well, it is good that they still have some predication. It is just not as useful and orthogonal as before.
I said "energy effective", not bottlenecked =) More instructions -> more cache used + more power used on F/D/E.
They have solved the problem of sequential push/pop, so it became beneficial over mov [sp+n] (compact instructions)
Console PPC processors are based on this logic.
So we got 21-cycle register shifts, 50+ cycle LHS (load-hit-store) stalls and 50+ cycle branches on VMX flags.
I just hope that ADD Rx, SP, #offset is still possible
I don't understand what you mean here: do you think A9 splits cond instructions into branch + op so early in the pipeline that it could pollute BTB? That's not the case.
Just thought I'd chime in to say it sounds like a very solid ISA
So you can tell me what it doesn't do but not what it does... what a tease. ARM should be way more forthright with this information, since it actually does have a bearing on writing good code.
I'm not sure you get more information from AMD or Intel; all they provide is an optimization guide. The Cortex-A Series Programmer's Guide contains some useful information.
If it doesn't involve prediction and the potential for misprediction why is the advice given to not use it over blocks of more than one instruction? Surely hard to predict sequences of two instructions would still prefer whatever they're doing if it eschews prediction.
They probably implement conditional execution as every other OOO CPU does: for each conditional instruction, inject two instructions into the ROB, one using the condition code as input, the other using the negated condition code. Both execute, and at instruction retirement you throw away the one that doesn't satisfy the condition.
It means that each conditional instruction takes two slots in the ROB and executes twice, so it should only be used when there is a clear benefit.
Cheers
But for predicated instructions like on ARM the "opposite" condition is just a no-op. So you'd really just issue one instruction and at retirement throw it away or not. A second slot could still possibly be needed to read the flags, I guess.
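A minimal Python sketch of that single-slot idea: the operation always executes, and the flags only decide at retirement whether its result or the old destination value is kept. The register names, the `Z` flag dictionary and the `retire_predicated` helper are all hypothetical, made up for illustration; this is a toy model, not a description of how any real core does it.

```python
def retire_predicated(regs, flags, cond, dst, op, srcs):
    """Toy model of one predicated instruction, e.g. ADDEQ r0, r1, r2."""
    # The operation executes regardless of the condition.
    result = op(*(regs[s] for s in srcs))
    # The old destination value is an extra input: it is what survives
    # when the condition fails, so the instruction depends on it as well.
    old = regs[dst]
    # At retirement the flags pick which value is committed.
    regs[dst] = result if cond(flags) else old
    return regs


def eq(flags):
    # EQ condition: true when the (hypothetical) Z flag is set.
    return flags["Z"]


def add(a, b):
    return a + b


regs = {"r0": 7, "r1": 2, "r2": 3}

# Condition true: r0 = r1 + r2
taken = retire_predicated(dict(regs), {"Z": True}, eq, "r0", add, ["r1", "r2"])
# Condition false: r0 keeps its old value, the computed sum is discarded
skipped = retire_predicated(dict(regs), {"Z": False}, eq, "r0", add, ["r1", "r2"])
```

Note that the flags and the old value of `r0` are both inputs here, which is where the extra dependency on the destination register comes from.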
In the older ARM CPUs predication was nice because it was implemented such that the operation was cancelled early in the pipeline before any execution work was performed on it, which is why unexecuted instructions would take 1 cycle even in the case of multi-cycle instructions.
On deeper pipelined and OoO processors this approach is not too practical, so look at the alternatives in play:
Cortex-A8: Make the target register a load dependency and have the writeback stage perform the conditional select. This requires an additional register file port that would otherwise not be that useful, wasting area and power. Also requires special circuitry to cancel load exceptions or stores that are predicated, not to mention mask out other side-effects like flags.
Cortex-A9: Convert the instruction into a branch + op pair and lose the benefits of predication for unpredictable conditions entirely, and possibly pollute the BTB (especially if you end up with collisions from densely packed branch chains)
Honestly, ARM's new instruction set looks almost exactly like MIPS without delay slots.
And ARM style predication was really wasting a ton of instruction bits. Is it really worth having it over 31 registers? Especially when most new instructions didn't allow it anyway.
Well, replacing a conditional instruction with an unconditional instruction plus a conditional select ends up having the same load dependencies, so there really isn't much difference there.
Not having to cancel load exceptions may make things a bit simpler. The x86 CMOV instructions have always worked this way - if the address is invalid, you get an exception regardless of the condition code.
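To make the difference concrete, here is a rough Python sketch that contrasts a select-style conditional load (the load always happens, so it can fault) with a predicated load (a false condition suppresses the fault too). The `MemoryFault` class, the dictionary-as-memory model and both helper functions are assumptions for illustration only, not real ISA semantics in full.

```python
class MemoryFault(Exception):
    """Raised by the toy memory model for an unmapped address."""


def load(memory, addr):
    if addr not in memory:
        raise MemoryFault(hex(addr))
    return memory[addr]


def cmov_style_load(memory, regs, cond, dst, addr):
    # Select-style: the load is performed first, so an invalid address
    # faults even when the condition is false.
    value = load(memory, addr)
    if cond:
        regs[dst] = value


def predicated_load(memory, regs, cond, dst, addr):
    # Predicated: a false condition must suppress the access and its fault.
    if cond:
        regs[dst] = load(memory, addr)


memory = {0x1000: 42}
regs = {"r0": 0}

# False condition, bad address: nothing happens.
predicated_load(memory, regs, False, "r0", 0xDEAD)

# Same inputs, select-style: the fault is raised anyway.
try:
    cmov_style_load(memory, regs, False, "r0", 0xDEAD)
    faulted = False
except MemoryFault:
    faulted = True
```

The second form needs no fault-cancelling logic, which is the simplification being described.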
Not having conditional stores does hurt pipelining somewhat. On the A8, conditional stores don't stall the pipeline and it basically just drops the store from the queue if the condition is false, which is faster than treating it like a branch misprediction and flushing the entire pipeline. I'm guessing that skipping the stores would make hazard detection (load-after-store) too complex on an out-of-order design, so that's why they don't do it that way on the A9.
Going by commentary on realworldtech's forums, Linus Torvalds really does not like weak consistency.
I understand Linus Torvalds' reaction but I do not believe it is warranted. The key point is that you shouldn't look only at the ARM ISA itself but also the ARM MPCore model. This is quite similar to how the weak PPC ISA ordering model doesn't really matter in practice for POWER7. I'm not saying the effective ordering model is as strong as x86 (I'm fairly sure it's not) but the practical differences are a lot more subtle than most people believe.
Linus Torvalds said: Let's see how well A15 does. My guess is that it will have a fairly simplistic memory pipeline, go for a fairly simplistic model of memory ordering (probably "wildly out of order") which people will claim is really good for performance and which will then suck horribly for serialization and locks.
That is clearly wrong. Here is what the A15 presentation says about the memory pipeline for a single processor:
16 entry issue queue for loads and stores
Common queue for ARM and NEON/memory operations
Loads issue out-of-order but cannot bypass stores
Stores issue in order, but only require address sources to issue
Shock, horror, this is completely identical to the original Intel P6. They don't explicitly mention completion but I don't expect any surprises there. And yes, the MPCore model should effectively guarantee that writes by a single processor are observed in the same order by all processors. So all the basics are covered really.
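The load/store issue rules quoted from the A15 presentation can be sketched as a small Python model. This is only one reading of those bullet points: the `Entry` fields and the `can_issue` helper are hypothetical, and "stores issue in order" is assumed here to mean in order relative to older stores.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str                 # "load" or "store"
    addr_ready: bool          # address operands available
    data_ready: bool = True   # store data available (unused for loads)
    issued: bool = False


def can_issue(queue, i):
    """queue is in program order (oldest first); i indexes the candidate."""
    entry = queue[i]
    if entry.issued or not entry.addr_ready:
        return False
    older_store_pending = any(
        e.kind == "store" and not e.issued for e in queue[:i]
    )
    # Loads may pass older loads but never an older unissued store.
    # Stores wait for older stores, and only need their address:
    # data_ready is deliberately not checked.
    return not older_store_pending


queue = [
    Entry("load", addr_ready=False),                    # 0: waiting on its address
    Entry("load", addr_ready=True),                     # 1: may go ahead of load 0
    Entry("store", addr_ready=True, data_ready=False),  # 2: address known, data not yet
    Entry("load", addr_ready=True),                     # 3: stuck behind store 2
]

ready = [i for i in range(len(queue)) if can_issue(queue, i)]
```

In this model load 1 bypasses the stalled load 0, store 2 issues with only its address, and load 3 becomes eligible only once store 2 has issued.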