ARM announces ARMv8 architecture

Maybe we should adopt a32 and a64 to match x86 and x86-64/x64.

ARM is actually calling the instruction sets exactly that (AArch32/AArch64 are the architectures, which encompass more details). AArch32 can run either "a32" or "t32" code, the latter being the new name for Thumb-2. AArch64 can of course only run a64, but I wouldn't be THAT surprised if a t64 or similar came into existence at some point.
 
It seems about as useful as including Thumb-2 support in ARMv7-A. I can't really quantify how useful that was, but they did do it.
 
I really do question the usefulness of a t64. But I suppose if you have to support t32 anyway....
I doubt all ARMv8 implementations are going to have anything close to full-speed t32. Is there any benefit whatsoever to full-speed T32 on AMCC's X-Gene ARMv8 core? I don't see it. If they're smart they won't bother with more than a single decoder for it. Also T64 would be slightly less useful than T32 since it could only access half the registers (this is unlikely to make a big difference on an OoOE core but it could be noticeable on the big.LITTLE companion core).
 
I doubt all ARMv8 implementations are going to have anything close to full-speed t32.
Why not just duct-tape an energy efficient Cortex-A7 core to a full-blown AArch64-only core?
Or a more bizarre bigMediumLITTLE combination of fast 64-bit, energy-saving 64-bit, and energy-saving 32-bit cores :D
 
Not sure about the 3 cycle branch history delay, I'll have to go read your optimization notes again.
The branch history buffer is indexed by a 10-bit history of prior branches (taken/not taken) and the two least significant bits of the instruction address. I'm not sure what happens in Thumb mode, since there's an additional address bit.

The three-cycle delay isn't all that relevant to optimization; it's just a quirk of the design. To figure out which instructions were actually branches, it has to finish decoding them, and that takes 3 cycles. So basically the last 6-7 instructions aren't included in the branch history.
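If it helps to visualize, here's a toy C model of the indexing scheme described above: a global taken/not-taken history register concatenated with a couple of low instruction-address bits, indexing a table of 2-bit counters. Which address bits are actually used (and what happens in Thumb mode) isn't stated, so the bit choices here are my assumption, purely for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 10
#define ADDR_BITS    2
#define BHB_ENTRIES  (1u << (HISTORY_BITS + ADDR_BITS)) /* 4096 entries */

/* Global history register plus a table of 2-bit saturating counters. */
typedef struct {
    uint16_t history;              /* last 10 taken/not-taken outcomes */
    uint8_t counters[BHB_ENTRIES]; /* 2-bit counters, 0..3 */
} bhb_t;

/* Index = 10 history bits concatenated with two low address bits.
   Bits 3:2 are a guess; in ARM mode bits 1:0 are always zero. */
static unsigned bhb_index(const bhb_t *b, uint32_t pc) {
    unsigned hist = b->history & ((1u << HISTORY_BITS) - 1);
    unsigned addr = (pc >> 2) & ((1u << ADDR_BITS) - 1);
    return (hist << ADDR_BITS) | addr;
}

static bool bhb_predict(const bhb_t *b, uint32_t pc) {
    return b->counters[bhb_index(b, pc)] >= 2; /* weakly/strongly taken */
}

static void bhb_update(bhb_t *b, uint32_t pc, bool taken) {
    uint8_t *c = &b->counters[bhb_index(b, pc)];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
    b->history = (uint16_t)((b->history << 1) | taken);
}
```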

Yes, and exception detection happens even after issue and still has to be triggered during the final stages of the pipeline.
On the A8, if a TLB miss occurs then there will either be an exception or a pipeline stall, and presumably this happens in the first cycle after address calculation. With an out-of-order design, things can be a little more complex if you have multiple loads outstanding.

I did forget about shifts altogether, but aren't those kind of a given if you're extending a 32-bit arch, the way MIPS was? Unless you only want to be able to shift by no more than 32.

The 32-bit shifts do sign/zero-extension from bit 31.

I already mentioned rsb. Good call on mov - I don't know how much the alternate form adds to the decoder, but I'd be more curious whether the post-decode instructions really want to carry extra width to handle instructions the original doesn't have encoding space for.
mov is handled specially, so that shifts can pair with other instructions. This is mostly useful for thumb mode, I suppose.
 
There's an ARMv8 document out describing AArch64 in detail. You need an ARM account to download it:

https://silver.arm.com/download/download.tm?pv=1199137

Some highlights are:

- Load/store immediate offsets can have implicit scaling by the operand size in order to extend effective range (like in Thumb)
- There's still a special second operand: you can apply an immediate shift, with ROR available for the non-add/sub instructions. add/sub/cmp also have a form allowing sign extension plus a left shift of 1 to 4 bits.
- Immediates are no longer the classic 8-bit + ROR but vary per instruction: arithmetic gets 12 bits with an optional shift by 12 (so you can double up to get 24 bits easily), logical gets bit-mask forms, and there are the 16-bit movt/movw-style instructions, extended to provide the other parts for 64-bit immediate creation (see the sketch after this list)
- No more integer SIMD instructions (i.e. ARMv6 SIMD); these are instead rightfully delegated to NEON
- CPSR access is broken up per-flag instead of one register
- MIPS-style compare + branch if zero/non-zero instructions, and test and branch instructions
- Conditional select can apply increment, negate, or invert to one of its operands
- Loads/stores have register scaling by the access size, which is a reasonable limitation vs an arbitrary shift (and finally gives us proper halfword indexes!)
- +/- 1MB PC-relative loads, PC-relative address generation, and conditional branch range. +/- 128MB unconditional branch range.
- 32- and 64-bit integer division, but no direct remainder provided (see the sketch after this list)
- Flags setting is still optional
- Pre-indexed and post-indexed writeback on addressing is still provided, but with immediates only, and they take away offset bits (as well as optional scaling)
- Address offsets are additive only (can use negative immediates though)
- Non-temporal hints for load/store pair instructions
- add/sub/cmp/cmn/mov can access SP for both read and write, but only read it when setting flags; the logical instructions can write to SP
- Only logical AND can set condition codes
- Variable shifts can no longer exceed the word size as they could in classical ARM implementations; I'm sure this will introduce a subtle bug somewhere..
- Conditional compare is just for cmn and cmp
- Multiply-negate instructions
- Floating point multiply-add with four operands; I think it's implied that it's fused and that there's no chained version available
- Floating point gets conditional selects too
- Lane insert/extract instructions have been added to facilitate the new NEON registers no longer being natively packed (I was hoping at least this much would be true), and instructions can target the top 64 bits
- SIMD multiply-add is definitely fused only now, including reciprocal approximation steps
- Vector normalization acceleration instructions
- Unsigned to signed saturated arithmetic (I actually hoped this would be there!)
- NEON now has support for scalar operations, and full horizontal min, max, and sum
- NEON table lookup extended to 4 128-bit registers (instead of 4 64-bit), meaning up to 64-wide 8-bit shuffles
- Vector floating point division
- Vector element to element move
- Vector reciprocal exponent approximation
- Vector bit-reverse
- Explicit instructions for cache and TLB management
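To make a couple of those bullets concrete (the immediate creation and the missing remainder), here's a sketch in GCC/Clang inline assembly for an AArch64 target. The movz/movk and udiv/msub mnemonics are the ones that appear in the final A64 spec; this draft document may name them differently:

```c
#include <stdint.h>

/* 64-bit immediate creation with 16-bit chunks: movz sets one chunk
   and zeroes the rest, each movk patches in another. */
static uint64_t build_imm64(void) {
    uint64_t v;
    __asm__("movz %0, #0x0123, lsl #48 \n\t"
            "movk %0, #0x4567, lsl #32 \n\t"
            "movk %0, #0x89ab, lsl #16 \n\t"
            "movk %0, #0xcdef"
            : "=r"(v));
    return v; /* 0x0123456789abcdef */
}

/* No direct remainder instruction: the idiom is divide followed by
   multiply-subtract, r = n - (n / d) * d. (udiv returns 0 on division
   by zero rather than trapping, so r would just come out as n.) */
static uint64_t urem64(uint64_t n, uint64_t d) {
    uint64_t q, r;
    __asm__("udiv %0, %2, %3 \n\t"
            "msub %1, %0, %3, %2"
            : "=&r"(q), "=&r"(r)
            : "r"(n), "r"(d));
    return r;
}
```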

All in all I'm quite impressed with the ISA, and would definitely not consider it a MIPS clone.
 
I'm surprised this document went public.

A few comments about what you wrote:

- FP mul-add are fused (see section 3.7)
- compare + branch if zero/non-zero instructions already existed in Thumb
- among logical operations, BIC can also set condition codes and all logical ops set the four cond bits (of course ADD and SUB still can set cond codes).
 
- compare + branch if zero/non-zero instructions already existed in Thumb

Yes, but only as a 16-bit instruction with a 0 to 126 byte displacement, which is IMO much, much less useful. AArch64's +/- 1MB displacement is far more reasonable. The +/- 32KB displacement for the bit-test branches is less so, but still nothing like the joke that 126 bytes is.

- among logical operations, BIC can also set condition codes and all logical ops set the four cond bits (of course ADD and SUB still can set cond codes).

Ah right, my mistake. What I meant to say is that among logical instructions with immediate operands, only AND can set flags:

"Note: Apart from ANDS, logical immediate instructions do not set the condition flags, but “interesting” results can
usually directly control a CBZ, CBNZ, TBZ or TBNZ conditional branch."

This time around there actually is no BIC immediate, nor is there an ORN immediate, but since the bitmasks have an invert option those would be redundant anyway. There's also a reg/reg EOR NOT (EON) instruction.

The immediate bitmask format is still a little unclear to me. Here's what the manual says:

"The logical immediate instructions accept a bitmask immediate bimm32 or bimm64. Such an immediate consists
EITHER of a single consecutive sequence with at least one non-zero bit, and at least one zero bit, within an
element of 2, 4, 8, 16, 32 or 64 bits; the element then being replicated across the register width, or the bitwise
inverse of such a value."

ADD/SUB seem to use 13 bits for immediates, so I'm assuming that's the case here as well. The smaller the bit replication used, the fewer bits are needed to describe the mask section, so I'm assuming a form of Huffman coding on the element width selection.

I guess what I'm confused about is this: is it saying it encodes X zeroes, Y ones, and Z zeroes? Then it'd need a bit offset and a width, and for 64 bits that would take 6 + 6 bits, and you'd still need one bit to select 64-bit mode and another for inversion. The actual sum of bit offset and width wouldn't need to exceed 11 bits, since if one is > 31 the other couldn't be, but you'd need a bit to specify which is in use. I must be missing something here, or maybe there's a limitation the document isn't describing. Encodings for the other element widths would be straightforward, since you lose the need for width/offset bits at least twice as fast as you gain element-size encoding bits.

It's worth noting that the ability to increment, invert, or negate a value on conditional select does mean that there's still predication for all of these unary operations. That's a bit more than we expected. The increment and invert functions are cool because, when used against the zero register, they give you a conditional set (either to 1 or to an all-ones bitmask).
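To make the conditional-set trick concrete, here's a toy C model of the select family's semantics (csinc/csinv/csneg are the names from the final A64 spec, where cset/csetm ended up as official aliases built exactly this way) - purely illustrative, not how hardware does it:

```c
#include <stdbool.h>
#include <stdint.h>

/* Conditional select family: pick the first operand if the condition
   holds, else the (possibly modified) second operand. */
static uint64_t csel (bool c, uint64_t a, uint64_t b) { return c ? a : b;     }
static uint64_t csinc(bool c, uint64_t a, uint64_t b) { return c ? a : b + 1; }
static uint64_t csinv(bool c, uint64_t a, uint64_t b) { return c ? a : ~b;    }
static uint64_t csneg(bool c, uint64_t a, uint64_t b) { return c ? a : 0 - b; }

/* Against the zero register, increment/invert become conditional set:
   cset  -> 1 if the condition holds, else 0
   csetm -> all-ones if the condition holds, else 0 */
static uint64_t cset (bool c) { return csinc(!c, 0, 0); }
static uint64_t csetm(bool c) { return csinv(!c, 0, 0); }
```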

Good call on flag setting in general. As far as I can tell, everything that sets flags sets all four of them. ANDS and BICS clear C and V. I can't find any indication that shifts are capable of setting the carry flag. Nor does there appear to be a MOVS instruction, but w/o carry-out that functionality is reproduced in ANDS. These cleanups to flags setting should help implementation without hurting flexibility too much. Sucks for people doing large-precision variable shifts, though.
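For illustration, here's the kind of code that gets more awkward without oversized shift amounts: a double-word variable shift. With classic ARM semantics a shift by more than 31 conveniently produced zero, so the cross-word case fell out for free; once shift amounts are taken modulo the word size you need an explicit branch or select. A portable C sketch, assuming the count is below 128:

```c
#include <stdint.h>

/* 128-bit left shift of hi:lo by n, 0 <= n < 128. The n >= 64 case
   must be handled explicitly when shifts can't exceed the word size. */
static void shl128(uint64_t *hi, uint64_t *lo, unsigned n) {
    if (n == 0)
        return;                 /* also avoids the UB of lo >> 64 below */
    if (n >= 64) {
        *hi = *lo << (n - 64);
        *lo = 0;
        return;
    }
    *hi = (*hi << n) | (*lo >> (64 - n));
    *lo <<= n;
}
```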
 
Lots of goodies for game development: fsel, fma + all the new vector instructions. I like that.
Explicit instructions for cache and TLB management
Perfect! Could you copy/paste the list of instructions? Do we just get cache prefetch and evict... or a more complete set of goodies (like streaming stores that don't pollute the cache, cache line zeroing to save reads, etc.)?

I did read somewhere that the new ARM chips already support write-combined memory access (writes through the cache). Very useful when you are writing big chunks of memory and do not intend to read the data back (for example when writing data to dynamic vertex buffers).
 
Okay. The instructions are a bit complex, in that they take parameter fields that specify certain tasks. I'll try to give an overview.

ic: can invalidate all of instruction cache or individual lines by virtual address
dc: can invalidate/clean/zero individual data cache lines either by virtual address or by set/way
at: performs address translation. You can specify which TLB level/type to go to, and there's something about specifying exception levels..
tlbi: TLB invalidate. There's a bunch of addressing options here going by VA, ASID, or entire TLBs, and the exception levels are here again.

ARM chips have supported caches that aren't write-allocate for a long time now, so you don't need write-through if you aren't going to read (although they tend to support that too). They've had coalescing write buffers for a while too.
 
Are these instructions usable in user mode?

Unfortunately I don't think the document says whether they are or not. Hopefully they will be.

The ability to load the data cache with zeroes is kind of interesting. I wonder if this will be faster than normal on any processors.
 
The ability to load the data cache with zeroes is kind of interesting. I wonder if this will be faster than normal on any processors.
It's interesting, but that seems to expose the cache line size to the program. I am not sure if that is an acceptable tradeoff for user mode code.
 
I wish the regular fp ops had the same explicit rounding mode options that the float to int conversion instructions have. I guess the use-cases are just too thin.

I'm a complete ARM newbie - is setting the FPCR expensive?
 
The ability to load the data cache with zeroes is kind of interesting. I wonder if this will be faster than normal on any processors.
With current (high speed) processors, it can take a very long time to fetch a cache line from memory. Cache line zeroing acts similarly to cache prefetch, but it's much faster (a prefetch can take up to 500 cycles before the data is ready). Instead of getting the actual data from memory into the cache, you get cache lines filled with zeroes (without needing to load anything from memory into those lines).

It's very useful, for example, if you need a big buffer of temporary storage. Instead of prefetching the memory area (or accessing it and letting the cache load it automatically), you cache-line-zero the area. You save hundreds of cycles, because you can start using the memory immediately (no need to wait for cache lines to load), and you also save a lot of memory bandwidth. It's also a good operation to do when you are initializing new (cache line aligned) structures. In this case you are going to overwrite the old "uninitialized" memory contents anyway, so loading them first from memory to cache is a waste of time. It's basically a way to tell the CPU "I am going to fully overwrite this cache line, so don't bother loading it". Of course when the cache line gets evicted, it's written back to memory as usual (there's no safe way to do this in the opposite direction).
It's interesting, but that seems to expose the cache line size to the program. I am not sure if that is an acceptable tradeoff for user mode code.
That's also the likely reason why x86 doesn't have cache line zero (while the console PPC cores do). Cache line prefetch and evict, however, are safe to use. These instructions are just hints to optimize automated cache logic performance. If the cache line size is different from what the programmer anticipated, only the performance gain is reduced.

PPC has an instruction (http://publib.boulder.ibm.com/infoc...c=/com.ibm.aix.aixassem/doc/alangref/clcs.htm) to get the cache line size. Is there anything like that in the new ARM documentation?
 
That's also the likely reason why x86 doesn't have cache line zero (while the console PPC cores do). Cache line prefetch and evict, however, are safe to use. These instructions are just hints to optimize automated cache logic performance. If the cache line size is different from what the programmer anticipated, only the performance gain is reduced.

PPC has an instruction (http://publib.boulder.ibm.com/infoc...c=/com.ibm.aix.aixassem/doc/alangref/clcs.htm) to get the cache line size. Is there anything like that in the new ARM documentation?
With a way to determine the cache line size, it seems fairly reasonable and indeed quite useful. Though you will need to parametrize your code.
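For what it's worth, the final ARMv8 spec does end up providing exactly that parametrization: the DCZID_EL0 register reports the DC ZVA block size (and whether the operation is permitted at all), and it's readable from user mode. Whether this early document guarantees any of that is unclear, so treat this as a sketch of how things ended up, assuming GCC/Clang inline assembly on an AArch64 target:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* DCZID_EL0 bits [3:0] give log2 of the DC ZVA block size in 4-byte
   words (so bytes = 4 << BS); bit 4 set means DC ZVA is prohibited. */
static size_t zva_block_size(void) {
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & (1u << 4))
        return 0; /* DC ZVA not permitted here */
    return (size_t)4 << (dczid & 0xf);
}

/* Zero [buf, buf+len) with DC ZVA; assumes buf and len are aligned to
   the block size. Falls back to memset if DC ZVA is unavailable. */
static void zero_with_dc_zva(void *buf, size_t len) {
    size_t bs = zva_block_size();
    if (bs == 0) {
        memset(buf, 0, len);
        return;
    }
    for (size_t off = 0; off < len; off += bs)
        __asm__ volatile("dc zva, %0" : : "r"((char *)buf + off) : "memory");
}
```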
 
The immediate bitmask format is still a little unclear to me. Here's what the manual says:

"The logical immediate instructions accept a bitmask immediate bimm32 or bimm64. Such an immediate consists
EITHER of a single consecutive sequence with at least one non-zero bit, and at least one zero bit, within an
element of 2, 4, 8, 16, 32 or 64 bits; the element then being replicated across the register width, or the bitwise
inverse of such a value."
This means you can build a bit pattern made of repeated elements of 2, 4, ... bits (up to the 32- or 64-bit register width); each element must have at least one bit set and one bit clear, and if multiple bits are set they must be consecutive (modulo a rotation).
For instance : 1011_1011_1011_1011_1011_1011_1011_1011.

ADD/SUB don't use the same mechanism for immediates.
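For what it's worth, in the final spec the 13 bits end up as a 6-bit rotate amount plus 7 bits that jointly encode the element size and run length - essentially the Huffman-style sharing guessed at earlier. Here's a brute-force C sketch of the validity rule that mirrors the manual's description rather than the real decode algorithm: the value must be one e-bit element replicated across the register, with the element a rotated run of contiguous ones:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if x (within an e-bit element) is some rotation of a contiguous
   run of ones, excluding all-zero and all-one elements. */
static bool is_rotated_run(uint64_t x, unsigned e) {
    uint64_t mask = (e == 64) ? ~0ULL : ((1ULL << e) - 1);
    x &= mask;
    if (x == 0 || x == mask)
        return false;
    for (unsigned r = 0; r < e; r++) {
        uint64_t y = r ? ((x >> r) | (x << (e - r))) & mask : x;
        if ((y & (y + 1)) == 0) /* run of ones starting at bit 0? */
            return true;
    }
    return false;
}

/* Brute-force validity check for a 64-bit logical bitmask immediate:
   some element size e in {2,4,8,16,32,64} must replicate across the
   register, with the element being a rotated run of ones. */
static bool is_bitmask_imm64(uint64_t v) {
    for (unsigned e = 2; e <= 64; e *= 2) {
        uint64_t elem = (e == 64) ? v : v & ((1ULL << e) - 1);
        uint64_t rep = elem;
        for (unsigned i = e; i < 64; i *= 2)
            rep |= rep << i;
        if (rep == v && is_rotated_run(elem, e))
            return true;
    }
    return false;
}
/* e.g. is_bitmask_imm64(0xBBBBBBBBBBBBBBBB) is true (the 1011 pattern
   above); 0 and all-ones are rejected, as the manual requires. */
```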

Nor does there appear to be a MOVS instruction, but w/o carry-out that functionality is reproduced in ANDS.
Or ADDS with the zero register :)
 