ARM announces ARMv8 architecture

ARM announces ARMv8 architecture

http://www.arm.com/about/newsroom/a...-the-next-version-of-the-arm-architecture.php

ARM today disclosed technical details of its new ARMv8 architecture, the first ARM architecture to include a 64-bit instruction set. ARMv8 broadens the ARM architecture to embrace 64-bit processing and extends virtual addressing, building on the rich heritage of the 32-bit ARMv7 architecture upon which market leading cores such as the Cortex™-A9 and Cortex-A15 processors are built.

The ARMv8 architecture consists of two main execution states, AArch64 and AArch32. The AArch64 execution state introduces a new instruction set, A64 for 64-bit processing. The AArch32 state supports the existing ARM instruction set. The key features of the current ARMv7 architecture, including TrustZone®, virtualization and NEON™ advanced SIMD, are maintained or extended in the ARMv8 architecture.

The ARMv8 architecture specifications describing all aspects of the ARMv8 architecture are available now to partners under license. ARM will disclose processors based on ARMv8 during 2012, with consumer and enterprise prototype systems expected in 2014.
 
Ah, the long-awaited ARMv8 (64-bit) announcement - I was thinking Cortex-A7 had to be the last major v7 chip announcement. A bit light on details; I don't think even 2014 for first products is a surprise.
 
Is there any more info on what ARMv8 adds besides 64-bit support?

Thumb is out. A lot of the instruction encodings have been re-coded to be more sensible -- though since you have to support v7 anyway, I don't know what good that will do.

40-bit pointers with translation tables similar to those of LPAE in v7. Oh, and no more LDM/STM.
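
To make the LDM/STM change concrete, here's a rough sketch of a callee-save prologue/epilogue in each state - the A32 half is plain ARMv7, the A64 half is my guess at the equivalent using the paired load/store forms that replace LDM/STM (register choices, frame size and syntax are illustrative only):

    @ A32 (ARMv7): one instruction each way
    push    {r4-r11, lr}            @ single STM stores 9 registers
    @ ... function body ...
    pop     {r4-r11, pc}            @ single LDM restores them and returns

    // A64: pairs only, so several instructions each way
    stp     x29, x30, [sp, #-96]!   // frame pointer + link register, pre-indexed
    stp     x19, x20, [sp, #16]
    stp     x21, x22, [sp, #32]
    stp     x23, x24, [sp, #48]
    // ... function body ...
    ldp     x23, x24, [sp, #48]
    ldp     x21, x22, [sp, #32]
    ldp     x19, x20, [sp, #16]
    ldp     x29, x30, [sp], #96     // restore FP/LR and pop the frame
    ret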
 
Wonder how many will implement transactional memory models with the atomic ld/st instructions in play now.
 
Thumb is out
As expected. On a 64-bit system, code size is the last thing to worry about =)

Oh, and no more LDM/STM.
That's bad. I hope they won't cripple LDM/STM in 32-bit mode.

-Stack pointer is not a general purpose register
What?!
-PC is not a general purpose register
-Additional dedicated zero register available for most instructions
-Far fewer conditional instructions than in AArch32: conditional {branches, compares, selects}


WTF! Seems they've gone out of their minds. ARM should have asked Wilson to design the new ISA.

If they eliminated conditional execution they should get rid of the integer-pipe flags register as well, because there's no sense in keeping it.
 
As expected. On a 64-bit system, code size is the last thing to worry about =)

I imagine at some point in the future, they'd want this to carry over to low-profile embedded as well. But T2EE was a pain in the ass from an implementation POV and, IMO, not worth the silicon for the reduction in code size.

That's bad. I hope they won't cripple LDM/STM in 32-bit mode.

Speaking of another feature that's a pain in the ass to implement and, IMO, not worth the silicon......

-Stack pointer is not a general purpose register
What?!

Did you prefer to have a GPR banked all the time?

-PC is not a general purpose register
-Additional dedicated zero register available for most instructions
-Far fewer conditional instructions than in AArch32: conditional {branches, compares, selects}


WTF! Seems they've gone out of their minds. ARM should have asked Wilson to design the new ISA.

If they eliminated conditional execution they should get rid of the integer-pipe flags register as well, because there's no sense in keeping it.

They still have to support v7.
 
Speaking of another feature that's a pain in the ass to implement and, IMO, not worth the silicon......
I don't think the circuit is complicated, especially today with a 1 word/cycle rate, but they lost the opportunity to be energy efficient on function prologues/epilogues, which have become even more demanding because of the increased number of registers. Instead of 1 instruction, they have to fetch/decode/issue 10.
Yeah, it's not easy to fit a 31-bit register-list field into a 32-bit instruction, but at least register ranges from N to M could have been saved/restored.
Intel uses special circuitry to accelerate PUSH/POP, because x86 lacks the instructions that ARM has.

Did you prefer to have a GPR banked all the time?
No, only in the case of interrupts =)
I prefer to be able to do math with the stack pointer.
SP is a GPR even on x86 and on all RISCs.
Does v8 have a special mode for LDR/STR to access the stack frame?
This sounds so ridiculous that I can't even believe it's true.
And what about PC-relative addressing?
 
Vitaly Vidmirov said:
If they eliminated conditional execution they should get rid of the integer-pipe flags register as well, because there's no sense in keeping it.

First of all, most ISAs have flags without having conditional execution beyond branches and perhaps moves; that doesn't make flags useless. If you don't have flags you at least benefit from having a separate register set for predication, so you can reduce register file contention.

More importantly, ARMv8 64-bit isn't completely getting rid of conditional execution. Notably it has conditional selects, which are a more useful generalization of conditional moves, and conditional compares, which lets you perform some boolean logic for if-clauses directly in flags.
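
A quick sketch of how those two combine, for something like x = (a > b && c != 0) ? y : z; - register numbers are arbitrary and the syntax is guessed from the disclosed details:

    cmp     x0, x1              // flags from a - b
    ccmp    x2, #0, #4, gt      // if gt held, compare c against 0; otherwise force Z=1 ("false")
    csel    x3, x4, x5, ne      // x = condition ? y : z -- no branch anywhere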

I don't think the circuit is complicated, especially today with a 1 word/cycle rate, but they lost the opportunity to be energy efficient on function prologues/epilogues, which have become even more demanding because of the increased number of registers. Instead of 1 instruction, they have to fetch/decode/issue 10.
Yeah, it's not easy to fit a 31-bit register-list field into a 32-bit instruction, but at least register ranges from N to M could have been saved/restored.
Intel uses special circuitry to accelerate PUSH/POP, because x86 lacks the instructions that ARM has.

Cortex-A8 and A9 are capable of ldm/stm at 64 bits per cycle. But look at it this way: Cortex-A15 can fetch 4 32-bit instructions per cycle and decode 3, but only has one load and one store unit, which I imagine can store either 64 or 128 bits to a contiguous location per cycle. ARMv7 already has dual-register stores and ARMv8 makes it more flexible to allow any register.

Now in designs post-dating Cortex-A15 how often do you really think being able to load or store two 32 or 64-bit registers per instruction is going to be bottlenecked by fetch or decode?

Intel's stack engine has nothing to do with the actual loads or stores, but with the stack pointer manipulation. By making the stack pointer no longer a general purpose register it's easier for ARM to do this now too.

No, only in the case of interrupts =)
I prefer to be able to do math with the stack pointer.
SP is a GPR even on x86 and on all RISCs.

Does v8 have a special mode for LDR/STR to access the stack frame?
This sounds so ridiculous that I can't even believe it's true.
And what about PC-relative addressing?

ARM would hardly be the first ISA to separate the SP out into a special-function register. No, I doubt there'll be a special "mode" for loads and stores - just SP-specific loads and stores, PC-specific loads, SP increment/decrement, and moves from SP to GPRs. Compared to the wealth of data operations ARM has, this is a tiny number, yet it accounts for the vast majority of use cases. I don't know what you do with the stack pointer, but in the real world it's almost never used for anything but a stack pointer.
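
Something along these lines is presumably what that boils down to (the syntax is my guess and the msg label is made up):

    sub     sp, sp, #32         // growing the frame is ordinary arithmetic on SP
    add     x0, sp, #16         // taking the address of a stack slot: also ordinary arithmetic
    ldr     x1, [sp, #8]        // plain SP-relative load; no special "mode" involved
    adr     x2, msg             // PC-relative address of nearby data
    add     sp, sp, #32         // shrink the frame again
    ret
msg:
    .asciz  "hello"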

Even Thumb-1 separates out SP and PC access exactly like this.

As for register banking on interrupts, that's only really nice to have in embedded. I doubt big servers or even apps platforms like phones and tablets need such fast or low latency interrupts. Especially with more than one core.
 
I don't think the circuit is complicated, especially today with a 1 word/cycle rate,

That depends on how you implement it.

The biggest problems are cache-line crossers, mispredicts and mid-stride faults. STMs aren't too bad, but the ARM architecture requires that you not update architected state until the entire LDM is complete; i.e. if you cross a page with your LDM and fault, you gotta roll back your register writes.

This isn't so much of a performance problem if you're using a design with register renaming but it's still extra power and circuitry to handle these cases. The same problem occurs with a speculatively executed LDM that completes half-way before the mispredict is determined.

Your choices are either to build a big-ass queue (hardware- and power-hungry), halt your pipeline (bad for performance), or write a lot of rename registers unnecessarily (again, bad for power).

Cache-line crossers are fairly bad for power as well. And if you happen to walk into a cache line that isn't cached, you have a potentially huge number of instructions halted there (or written to a rename reg/queue and taking up space) waiting on that last pesky load from memory -- which can be really, really bad.
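
A concrete picture of that rollback case (address and registers made up):

    @ Suppose r12 points 8 bytes below a page boundary.
    ldmia   r12, {r0-r7}    @ r0-r1 come from the resident page; the access for r2
                            @ crosses into an unmapped page and takes a data abort.
                            @ Per the requirement above, r0-r1 must look untouched when
                            @ the abort is delivered, so any early writes get rolled back.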

but they lost the opportunity to be energy efficient on function prologues/epilogues, which have become even more demanding because of the increased number of registers. Instead of 1 instruction, they have to fetch/decode/issue 10.

Honestly, I really doubt post-A8 microarchs are going to be fetch/decode/issue limited.

Yeah, it's not easy to fit a 31-bit register-list field into a 32-bit instruction, but at least register ranges from N to M could have been saved/restored.
Intel uses special circuitry to accelerate PUSH/POP, because x86 lacks the instructions that ARM has.

The push-pop handling is off to the side, yes. But again, it isn't decode limited. I'm not too familiar with Intel's uarch but I have a stinking feeling they decode it into individual load uops internally anyway. If a 4-way issue (on Nehalem) isn't a bottleneck, I doubt any ARMv8 microarch will be bottlenecked by it either.

No, only in the case of interrupts =)

And exceptions.

I prefer to be able to do math with the stack pointer.
SP is a GPR even on x86 and on all RISCs.
Does v8 have a special mode for LDR/STR to access the stack frame?
This sounds so ridiculous that I can't even believe it's true.
And what about PC-relative addressing?

*Shrug*. It's a stack. I would've thought the obvious use-cases would be to use it as, well, a stack.

And again, not having it be a GPR can make the hazard compare circuits much simpler.
 
First of all, most ISAs have flags without having conditional execution beyond branches and perhaps moves; that doesn't make flags useless.
The flags register is an additional dependency for the OoO engine. Intel had a lot of trouble with it in the past (not sure about the present). The flag-less MIPS/SPU approach is much cleaner and leads to simpler hardware, IMHO.

Notably it has conditional selects..
Well, it's good that they still have some predication. It's just not as useful and orthogonal as before.

Now in designs post-dating Cortex-A15 how often do you really think being able to load or store two 32 or 64-bit registers per instruction is going to be bottlenecked by fetch or decode?
I said "energy efficient", not bottlenecked =) More instructions -> more cache used + more power used on F/D/E.

Intel's stack engine has nothing to do with the actual loads or stores, but with the stack pointer manipulation
They solved the problem of sequential push/pop, so it became beneficial over mov [sp+n] (the push/pop encodings are more compact).

By making the stack pointer no longer a general purpose register it's easier for ARM to do this now too.
A64 has fixed-length instructions, so it doesn't make sense there.

No I doubt there'll be a special "mode" for loads and stores, just SP specific loads and stores, PC specific loads, SP increment/decrement and moves from SP to GPRs.
I just want a dream RISC machine, not a freaking PPC clone =)

Compared to the wealth of data operations ARM has this is a tiny number yet accounts for the vast majority of use cases
Console PPC processors are based on this logic.
So we got 21-cycle register shifts, 50+ cycle LHS (load-hit-store) stalls, and 50+ cycle branches on VMX flags.

I don't know about what you do with the stack pointer, but in the real world it's almost never used for anything but a stack pointer.
I just hope that ADD Rx, SP, #offset is still possible
 
That depends on how you implement it.
The same problem occurs with a speculatively executed LDM that completes half-way before the mispredict is determined.
Well, this makes sense. On the other hand, Intel is beefing up their REP XXX instructions.

Honestly, I really doubt post-A8 microarchs are going to be fetch/decode/issue limited.
Larger code size (bad for the caches) and more energy spent in the caches + frontend.
The register count is nearly doubling - that means many more stack spills/fills.

they decode it into individual load uops internally anyway
Yes. It's a pure code-density win, AFAIK.

And again, not having it be a GPR can make the hazard compare circuits much simpler.
I expect SP access to suffer from extra latency.
 
Well, this makes sense. On the other hand, Intel is beefing up their REP XXX instructions.

On the other hand, Intel can't design sub-1W processors...

Larger code size (bad for the caches) and more energy spent in the caches + frontend.
The register count is nearly doubling - that means many more stack spills/fills.

Not arguing either. But you really have to weigh the benefits of code density against the drawbacks of implementing LDM/STM in an OoOE processor. Having designed a few that handle LDM/STM, I can tell you I would've given a limb (and quite a bit of die area and mWs) to get rid of them.

Yes. It's a pure code-density win, AFAIK.

Honestly, with cachelines being the sizes they are, code density wins aren't what they used to be.

I expect SP access to suffer from extra latency.

When you say SP access, you mean using it as a source for computation, I assume. I really don't see that being common practice to optimize for, but perhaps I'm wrong.
 
Thumb is out. A lot of the instruction encodings have been re-coded to be more sensible -- though since you have to support v7 anyway, I don't know what good that will do.

Products aiming at the server market might just drop v7 support; it's not like they have 20 years of legacy code to support (in that market). And considering everyone and their grandma tapes out ARMs for every little market...

What virtualization features are expected?
 
Products aiming at the server market might just drop v7 support; it's not like they have 20 years of legacy code to support (in that market). And considering everyone and their grandma tapes out ARMs for every little market...

What virtualization features are expected?

Not really much change from the v7 virtualization extensions, which were already fairly comprehensive.
 
-Stack pointer is not a general purpose register
What?!
-PC is not a general purpose register
-Additional dedicated zero register available for most instructions
-Far fewer conditional instructions than in AArch32: conditional {branches, compares, selects}

All of these are choices made to support higher-performance implementations, rather than code size/energy.

1. Mucking about with the stack pointer messes up the return stack (also valid for x86!).
2. PC should never have been a general purpose register. It is very much special purpose.
3. Conditional execution poses a challenge for high-performance OoO implementations; each predicated instruction takes two entries in the ROB (see the sketch below). With improved branch prediction the value of ubiquitous conditional execution drops quickly.
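
Sketch for point 3 (registers arbitrary): both predicated instructions below have to be fetched, decoded and tracked - and occupy ROB entries in an OoO core - no matter which way the condition goes, while a well-predicted branch skips the untaken side entirely.

    cmp     r0, #0
    addne   r1, r1, #1      @ behaves as a NOP when r0 == 0, but is still tracked
    subeq   r2, r2, #1      @ behaves as a NOP when r0 != 0, but is still tracked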

Also, load/store multiple is an abomination and needs to go; it has huge costs and next to no performance advantage.

Cheers
 