It's a fairly good general description, although it leaves out some details, such as the 1-cycle fetch delay for a predicted taken branch, the 3-cycle delay in the branch history, and that adjacent branch instructions can cause BTB collisions because they index the same BTB entry.
Do you have the complete copy?
"Therefore, the instruction fetch currently in the F1 stage will need to be thrown away. This means there is a one-cycle bubble in the fetch pipeline whenever a branch prediction is made for a taken branch."
"The BTB is indexed by the fetch address and contains branch target address and information about the branch type." (Note that fetch is 64-bit aligned, so it goes without saying that the BTB can't index two branches in the same 64-bit block.)
Not sure about the 3-cycle branch history delay; I'll have to go read your optimization notes again.
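To make the BTB collision point concrete, here's a rough C sketch of how a BTB indexed by a 64-bit-aligned fetch address maps two adjacent branches to the same entry. The table size and the exact index function are assumptions for illustration (the quoted text only says the BTB is indexed by the fetch address), and `btb_index` is a made-up name.

```c
#include <stdint.h>

/* Assumed parameters, for illustration only. */
#define FETCH_BLOCK_BYTES 8u    /* 64-bit aligned fetch, per the quote above */
#define BTB_ENTRIES       512u  /* hypothetical table size */

/* Hypothetical index function: drop the byte offset within the
   fetch block, then wrap to the table size. */
static uint32_t btb_index(uint32_t fetch_addr)
{
    return (fetch_addr / FETCH_BLOCK_BYTES) % BTB_ENTRIES;
}
```

Under these assumptions, ARM-mode branches at 0x8000 and 0x8004 both land on entry 0, so only one of them can be tracked at a time, while a branch at 0x8008 lands on entry 1.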
Because of the dual-issue pipeline, the branch resolution on the A8 happens too late to stop the issuance of a subsequent store instruction. It appears that conditional stores and stores following mispredicted branches are handled similarly. The store is issued, then cancelled at the very end of the pipeline. No idea what A9 does in this situation.
Yes, and exception detection happens even after issue and still has to be triggered during the final stages of the pipeline.
Hmm... x86 has a cache-bypass store. Apparently it causes a very long delay on Intel CPUs if you try to read those locations after writing. This implies a large store queue. I wonder whether ARM would do something like that, given the cost in power consumption and silicon area. The store queue on the A8 appears to be quite small, and it's easy to fill it and block.
The NEON store queue is a lot larger and I think we can both agree that Cortex-A15 and its successors will be a lot wider than A8 in these regards.
MIPS has separate add, subtract, multiply, divide, and shift.
Yeah I remembered mul/div right after I finished posting. This makes sense for performance reasons and I'm sure they aren't the only ones to do it, hell, a lot of other platforms have a whole variety of multiply widths.
Did forget about shifts altogether, but aren't those kind of a given if you're extending a 32-bit arch like MIPS was? Unless you only want to be able to shift them by no more than 32.
Compares are done with sign-extended results. It seems ARM wants to avoid that for power-saving reasons. It remains to be seen if we'll get an integer divide instruction in ARMv8.
ARMv7-A was extended to make integer divide optional, but with every A-series ARM processor after the A9 getting it, I would be surprised if ARM doesn't make it mandatory, and very surprised if it's dropped entirely.
If it was really PPC-like then there would be multiple sets of flags.
I said that particular feature was PPC-like and made it more PPC-like than MIPS-like. I didn't say it was PPC-like overall, and certainly not "almost identical" the way you claim it is to MIPS. It's its own ISA.
A zero register would avoid needing an rsb with an immediate zero for negation. And MOV can be replaced with add zero. For power saving and dependency resolution you don't really want those add-as-mov instructions going through the ALU, and avoiding that just makes the instruction decoder more complex. hmm...
I already mentioned rsb. Good call on mov - I don't know how much the alternate form adds to the decoder, but I'd be more curious if the post-decode instructions really want to carry extra width to handle instructions that the original doesn't have encode space for.
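For anyone following along, here's a small C model of the two idioms being compared: negation via `rsb` with an immediate zero, versus negation and move built from a hardwired zero register. The function names and the `ZERO` constant are purely illustrative; this only models the arithmetic, not any decoder or pipeline behavior.

```c
#include <stdint.h>

/* ARM RSB (reverse subtract): rd = operand2 - rn. */
static uint32_t rsb(uint32_t rn, uint32_t operand2)
{
    return operand2 - rn;
}

/* Negation on ARM without a zero register: rsb rd, rn, #0 */
static uint32_t neg_arm(uint32_t rn)
{
    return rsb(rn, 0u);
}

/* Hypothetical zero-register ISA: this register always reads as 0. */
static const uint32_t ZERO = 0u;

/* sub rd, zero, rn */
static uint32_t neg_zero_reg(uint32_t rn)
{
    return ZERO - rn;
}

/* add rd, zero, rn (the "add-as-mov" form) */
static uint32_t mov_zero_reg(uint32_t rn)
{
    return ZERO + rn;
}
```

Both negation forms give the same two's-complement result; the open question in the thread is only what each costs in the decoder and the ALU.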
I wouldn't expect ARM to fully rely on that for loading immediates either, since it wastes 5 bits of encode space that could have been used on the immediate. I don't expect ARM64 to get MIPS style 16-bit immediates generally and I do expect them to keep movt/movw capability at 16-bit.
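As a reminder of what the movw/movt pair buys you, here's a minimal C model of the 32-bit ARMv7 semantics: movw writes a 16-bit immediate and zeroes the top half, movt writes the top half and leaves the bottom alone. The function names (including `load_imm32`) are just for illustration, and nothing here says anything about how ARM64 will encode it.

```c
#include <stdint.h>

/* MOVW: write a 16-bit immediate to the low half, zero the high half. */
static uint32_t movw(uint16_t imm16)
{
    return (uint32_t)imm16;
}

/* MOVT: write a 16-bit immediate to the high half, keep the low half. */
static uint32_t movt(uint32_t reg, uint16_t imm16)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}

/* Build any 32-bit constant with the two-instruction sequence. */
static uint32_t load_imm32(uint32_t value)
{
    uint32_t r = movw((uint16_t)(value & 0xFFFFu));
    return movt(r, (uint16_t)(value >> 16));
}
```

So any 32-bit constant takes two instructions with no literal-pool load, and each instruction carries a full 16 bits of immediate.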