ARM VFP vs NEON

rpg.314 · Feb 21, 2009

This got me thinking. ARM's vfp is effectively just simd with much higher latency and the same throughput as normal alu instructions. Now my question is, are ARM's neon and vfp different execution units, or are they the same thing. If they are the same, then to me it is the greatest marketing con job that I know of.

Arun · Feb 21, 2009

No, they definitely are very different things and VFP can't run NEON instructions. Also obviously in the real world, the VFP is mostly used in non-vector mode so it has to be fast for that even if it obviously can't be as fast as with 4-wide instructions. On the other hand, the A8 Neon is a 2-wide FP32 engine which can, like MMX, also run things like 4-wide INT16, 8-wide INT8, etc. - the A9 Neon is the same thing but twice as wide (similar to the Qualcomm 'Scorpion' CPU's FPU core, codenamed 'VeNum').

What is possibly the most interesting part about VFP is what's happening to it in the Cortex-A9 generation, where they claim full IEEE is now full-speed and it's twice as fast (it's not clear if it can be both at the same time; i.e. maybe it's 1/clock for full IEEE vs less than 1, and an extra 1 for the non-IEEE case?). However the issue width to the VFP still seems to be 1, so if this is true and not just marketing you'd expect the only way to achieve that to be via those vector instructions as Farrar described in there.

Since Farrar apparently examined the ISA for the VFP even more than I did, it would be interesting if he had any idea of what they could have done there... I've tried to poke ARM about it and see if I could get some info, but no luck so far.

darkblu · Feb 22, 2009

rpg.314 said:
This got me thinking. ARM's vfp is effectively just simd with much higher latency and the same throughput as normal alu instructions. Now my question is, are ARM's neon and vfp different execution units, or are they the same thing. If they are the same, then to me it is the greatest marketing con job that I know of.

arm's vfp has never been simd, and never claimed to be. it does vector ops by sequencing scalar ones*. neon, on the other hand, is a simd unit.

* not only vector-width-wise, but also composite-ops-wise - a madd in vfp is actually a discrete mul and an add.
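A toy model of that distinction, as a sketch only. The function names and the "issued operations" counter are made up for illustration and are not ARM's documented timings; `lanes` is a hypothetical SIMD width (Arun's post above suggests 2 FP32 lanes for the A8).

```python
def vfp_vector_madd(acc, a, b):
    """Model a VFP vector multiply-accumulate: the unit sequences one scalar
    multiply and one scalar add for every element."""
    out, issued = [], 0
    for x, y, z in zip(acc, a, b):
        product = y * z          # discrete multiply
        issued += 1
        out.append(x + product)  # discrete add
        issued += 1
    return out, issued


def simd_madd(acc, a, b, lanes=4):
    """Model a SIMD multiply-accumulate: each operation covers `lanes`
    elements at once."""
    out, issued = [], 0
    for i in range(0, len(a), lanes):
        chunk = zip(acc[i:i + lanes], a[i:i + lanes], b[i:i + lanes])
        out.extend(x + y * z for x, y, z in chunk)
        issued += 1
    return out, issued
```

Both functions return the same values; only the number of operations the unit has to step through differs.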

Laurent06 · Feb 23, 2009

Arun said:
On the other hand, the A8 Neon is a 2-wide FP32 engine which can, like MMX, also run things like 4-wide INT16, 8-wide INT8, etc. - the A9 Neon is the same thing but twice as wide (similar to the Qualcomm 'Scorpion' CPU's FPU core, codenamed 'VeNum').

I wonder where this twice wide NEON Cortex-A9 information comes from.

What is possibly the most interesting part about VFP is what's happening to it in the Cortex-A9 generation, where they claim full IEEE is now full-speed and it's twice as fast (it's not clear if it can be both at the same time; i.e. maybe it's 1/clock for full IEEE vs less than 1, and an extra 1 for the non-IEEE case?). However the issue width to the VFP still seems to be 1, so if this is true and not just marketing you'd expect the only way to achieve that to be via those vector instructions as Farrar described in there.

The claim is comparing Cortex-A9 VFP performance against Cortex-A8 VFPlite performance I guess. Cortex-A8 VFP is not pipelined whereas A9 VFP is.

BTW VFP vector operations have been deprecated.

rpg.314 · Feb 23, 2009

deprecated? I thought you couldn't deprecate ISA w/o breaking backward compatibility. Or is it the case that their use has been slowed down a lot by reducing the die space allocated to them?

Arun · Feb 23, 2009

Laurent06: My source is http://www.arm.com/pdfs/ARMCortexA-9Processors.pdf
I think this refers rather to ARM11's FPU, which already had a single-cycle MAC:

Providing an average of more than double the Floating-Point performance of previous generation ARM Floating-Point coprocessors

While this was not true of Cortex-A8's NEON, which only had a dual-MAC:

The MPE extends the Cortex-A9 processor’s floating-point unit (FPU) to provide a quad-MAC

I'd be interested in a link wrt deprecation, that'd be a nice confirmation

darkblu · Feb 23, 2009

Arun said:
Laurent06: My source is http://www.arm.com/pdfs/ARMCortexA-9Processors.pdf
I think this refers rather to ARM11's FPU, which already had a single-cycle MAC <snip>

vfp11 does not have a single-cycle mac - most vfp11's ops have an issue rate (throughput in arm documentation's terms) of 1 cycle, but their latency is never one cycle. don't forget vfp11 is fully pipelined, whereas a8's vfplite is not.
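The issue-rate versus latency point can be put into a rough cycle-count model. This is a minimal sketch under stated assumptions: the latency of 8 cycles used in the usage lines is a placeholder, not a figure from ARM's timing tables, and the function names are invented here.

```python
def cycles_for_independent_ops(n_ops, latency, issue_interval=1, pipelined=True):
    """Cycles to finish n_ops operations that do not depend on each other."""
    if n_ops == 0:
        return 0
    if pipelined:
        # A new op enters the pipe every issue_interval cycles;
        # the last one still needs the full latency to complete.
        return latency + (n_ops - 1) * issue_interval
    # Non-pipelined: the next op cannot start until the previous one is done.
    return n_ops * latency


def cycles_for_dependent_chain(n_ops, latency):
    """Each op needs the previous result, so pipelining cannot hide latency."""
    return n_ops * latency


# Placeholder numbers: 100 ops with a hypothetical 8-cycle latency.
pipelined = cycles_for_independent_ops(100, latency=8)                      # 107
non_pipelined = cycles_for_independent_ops(100, latency=8, pipelined=False)  # 800
```

With independent operations the pipelined unit approaches one result per cycle even though no single operation completes in one cycle.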

Laurent06 · Feb 23, 2009

Arun said:
I'd be interested in a link wrt deprecation, that'd be a nice confirmation

For a public statement of vector mode deprecation look here:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204i/Chdehgeh.html
or here:
http://forums.arm.com/index.php?showtopic=13053&pid=31161&st=0&#entry31161

Arun · Feb 23, 2009

darkblu said:
vfp11 does not have a single-cycle mac - most vfp11's ops have an issue rate (throughput in arm documentation's terms) of 1 cycle, but their latency is never one cycle. don't forget vfp11 is fully pipelined, whereas a8's vfplite is not.

Isn't throughput what matters to compare the ARM11 FPU and the Cortex-A9 FPU though?

Laurent06: Cheers!

silent_guy · Feb 23, 2009

rpg.314 said:
deprecated? I thought you couldn't deprecate ISA w/o breaking backward compatibility. Or is it the case that their use has been slowed down a lot by reducing the die space allocated to them?

You can, sort of, if you provide a software trap to emulate them. My ARM memory is a bit rusty, but I believe these kinds of traps are already there for cases where you're running a program with floating-point opcodes on low-end ARM cores that don't have them.

darkblu · Feb 23, 2009

Arun said:
Isn't throughput what matters to compare the ARM11 FPU and the Cortex-A9 FPU though?

well, yes. i guess i read your original post the wrong way.

TimothyFarrar · Feb 24, 2009

rpg.314 said:
This got me thinking. ARM's vfp is effectively just simd with much higher latency and the same throughput as normal alu instructions. Now my question is, are ARM's neon and vfp different execution units, or are they the same thing. If they are the same, then to me it is the greatest marketing con job that I know of.

According to the Assembler Guide, "If your processor has both NEON and VFP, all the NEON registers overlap with the VFP registers." And likewise it seems as if the load/store/transfer(to/from coprocessor reg/cpu reg) opcodes overlap between the VFP and NEON. So at least a shared register file.
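The overlap quoted above can be modelled as one block of bytes viewed three ways. This is a sketch, not hardware documentation: the class name is invented, and the layout assumed is the ARMv7 one where D<n> holds S<2n> and S<2n+1>, and Q<n> holds D<2n> and D<2n+1>.

```python
import struct


class VfpNeonRegisterFile:
    """One shared 256-byte register file viewed as S, D or Q registers."""

    def __init__(self):
        self.raw = bytearray(32 * 8)  # 32 x 64-bit D registers

    def write_s(self, n, value):
        # S0-S31 only cover the low half of the file (D0-D15).
        if not 0 <= n < 32:
            raise IndexError("S registers only alias D0-D15")
        struct.pack_into("<f", self.raw, 4 * n, value)

    def read_d(self, n):
        """Return D<n> as two FP32 lanes."""
        return struct.unpack_from("<2f", self.raw, 8 * n)

    def read_q(self, n):
        """Return Q<n> as four FP32 lanes."""
        return struct.unpack_from("<4f", self.raw, 16 * n)


regs = VfpNeonRegisterFile()
for i, v in enumerate([1.0, 2.0, 3.0, 4.0]):
    regs.write_s(i, v)   # "VFP" writes to S0..S3

print(regs.read_q(0))    # the same data seen as NEON Q0
```

Values written through the scalar view show up unchanged in the D and Q views, which is all "shared register file" means here; it says nothing about whether the execution units are shared.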

Arun · Feb 24, 2009

TimothyFarrar said:
According to the Assembler Guide, "If your processor has both NEON and VFP, all the NEON registers overlap with the VFP registers." And likewise it seems as if the load/store/transfer(to/from coprocessor reg/cpu reg) opcodes overlap between the VFP and NEON. So at least a shared register file.

Ah yeah, that's another change that's worth pointing out: on the A8, if you've got Neon, then you don't have VFP. In A9, it seems that if you have Neon, you necessarily have VFP (but it shares the RF). At least that's what I could gather from the docs...
Fun stuff!

Laurent06 · Feb 25, 2009

Arun said:
on the A8, if you've got Neon, then you don't have VFP. In A9, it seems that if you have Neon, you necessarily have VFP (but it shares the RF). At least that's what I could gather from the docs...

There's a small difference between A8 and A9: you can have A9 with VFP, or with VFP + NEON, whereas A8 always comes with VFP(lite) and NEON.

davefb · Jun 10, 2009

right, so it seems this is now suddenly a bit more interesting/relevant.. w.r.t. cortex A8 vs vfp

basically, hand coded vfp code for arm11 might actually run slower on the A8?

is there any good info about this? the arm docs, whilst good, make it a bit difficult to compare like with like..

arjan de lumens · Jun 11, 2009

Instruction timings for both the ARM11 VFP and the Cortex-A8 VFP are publicly available from ARM:

The ARM11 VFP implementation is pipelined in the general case, which means that you can get reasonable performance for hand-scheduled or vectorized code; the Cortex-A8 VFP implementation is non-pipelined unless a whole bunch of specific conditions are met (FP32 only, "RunFast" mode enabled, no vector instructions!), and its instruction latencies are quite large.
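As a sketch, the three conditions listed above can be written as a check on the FPSCR value. Assumptions to verify against the TRM before relying on this: RunFast is taken to mean flush-to-zero on, default NaN on and all exception traps disabled, and the bit positions (FZ bit 24, DN bit 25, LEN bits 18:16, trap enables bits 8-12 and 15) are the ARMv7 FPSCR layout as I recall it. The function names are invented.

```python
FZ = 1 << 24                               # flush-to-zero
DN = 1 << 25                               # default NaN
TRAP_ENABLES = (1 << 15) | (0b11111 << 8)  # IDE plus IXE, UFE, OFE, DZE, IOE
LEN_SHIFT = 16
LEN_MASK = 0b111 << LEN_SHIFT              # vector length minus one


def is_runfast(fpscr):
    """True if FZ and DN are set and no exception traps are enabled."""
    return bool(fpscr & FZ) and bool(fpscr & DN) and not (fpscr & TRAP_ENABLES)


def may_use_pipelined_path(fpscr, is_single_precision):
    """Apply the conditions from the post: FP32 only, RunFast enabled,
    no vector instructions (vector length of 1)."""
    vector_len = ((fpscr & LEN_MASK) >> LEN_SHIFT) + 1
    return is_single_precision and is_runfast(fpscr) and vector_len == 1
```

For example, `may_use_pipelined_path(FZ | DN, True)` is true, while the same FPSCR with a double-precision operation, or with LEN set for a 4-element vector, is not.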

Laurent06 · Jun 12, 2009

arjan de lumens said:
unless a whole bunch of specific conditions are met (FP32 only, "RunFast" mode enabled, no vector instructions!), and its instruction latencies are quite large.

That RunFast mode is severely limited (ref)

A restriction that applies to VFP instructions executing in the NFP pipeline is that instruction results cannot be forwarded early to subsequent instructions.

No matter what you do if you want to have good FP performance out of A8 (where good means at least VFP11 level), you will have to use single precision NEON instructions and forget that VFP exists in A8.

warmi · Jun 12, 2009

Laurent06 said:
That RunFast mode is severely limited (ref)

No matter what you do if you want to have good FP performance out of A8 (where good means at least VFP11 level), you will have to use single precision NEON instructions and forget that VFP exists in A8.

I guess, this is not good news for legacy apps running on the latest iPhone (3GS)...

iwod · Jun 16, 2009

So why not just get rid of VFP and use NEON only instead?

Laurent06 · Jun 16, 2009

iwod said:
So why not just get rid of VFP and use NEON only instead?

NEON instruction set doesn't have double precision instructions and its single precision is not fully IEEE754 compliant. NEON isn't meant to be a VFP replacement.
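One concrete compliance gap is denormal handling: NEON floating-point arithmetic on ARMv7 always flushes denormals to zero. The sketch below emulates FP32 on the host with `struct`; the helper names are invented, and the sign handling of flushed values is a simplification.

```python
import math
import struct

MIN_NORMAL_F32 = 2.0 ** -126  # smallest normal single-precision value


def to_f32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush(x):
    """Replace a denormal FP32 value with a zero of the same sign."""
    if x != 0.0 and abs(x) < MIN_NORMAL_F32:
        return math.copysign(0.0, x)
    return x


def mul_ieee(a, b):
    """FP32 multiply with IEEE 754 gradual underflow."""
    return to_f32(to_f32(a) * to_f32(b))


def mul_flush_to_zero(a, b):
    """FP32 multiply with denormal inputs and results flushed to zero."""
    return flush(to_f32(flush(to_f32(a)) * flush(to_f32(b))))


a, b = 2.0 ** -100, 2.0 ** -30
print(mul_ieee(a, b) == 2.0 ** -130)   # True: a denormal result survives
print(mul_flush_to_zero(a, b))         # 0.0
```

Code that depends on gradual underflow, or on double precision, therefore cannot simply be moved from VFP to NEON.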
