ARM VFP vs NEON

rpg.314

Veteran
This got me thinking. ARM's vfp is effectively just simd with much higher latency and the same throughput as normal alu instructions. Now my question is, are ARM's neon and vfp different execution units, or are they the same thing. If they are the same, then to me it is the greatest marketing con job that I know of.
 
No, they definitely are very different things and VFP can't run NEON instructions. Also obviously in the real world, the VFP is mostly used in non-vector mode so it has to be fast for that even if it obviously can't be as fast as with 4-wide instructions. On the other hand, the A8 Neon is a 2-wide FP32 engine which can, like MMX, also run things like 4-wide INT16, 8-wide INT8, etc. - the A9 Neon is the same thing but twice as wide (similar to the Qualcomm 'Scorpion' CPU's FPU core, codenamed 'VeNum').
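For what it's worth, the lane counts above are plain division of register width by element width. A tiny sketch in C, assuming a 64-bit path for the A8 case (my reading of "2-wide FP32"; the post doesn't state the width) and 128 bits for the "twice as wide" case. The function name `simd_lanes` is made up for illustration.

```c
#include <assert.h>

/* Number of lanes a SIMD register of reg_bits holds for elements of
 * elem_bits. Plain arithmetic, not tied to any particular core. */
static unsigned simd_lanes(unsigned reg_bits, unsigned elem_bits)
{
    assert(elem_bits > 0 && reg_bits % elem_bits == 0);
    return reg_bits / elem_bits;
}
```

With 64 bits this gives 2 lanes of FP32, 4 of INT16 and 8 of INT8; doubling the width to 128 bits gives 4 lanes of FP32.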

What is possibly the most interesting part about VFP is what's happening to it in the Cortex-A9 generation, where they claim full IEEE is now full-speed and it's twice as fast (it's not clear if it can be both at the same time; i.e. maybe it's 1/clock for full IEEE vs less than 1, and an extra 1 for the non-IEEE case?). However the issue width to the VFP still seems to be 1, so if this is true and not just marketing you'd expect the only way to achieve that to be via those vector instructions as Farrar described in there.

Since Farrar apparently examined the ISA for the VFP even more than I did, it would be interesting if he had any idea of what they could have done there... I've tried to poke ARM about it and see if I could get some info, but no luck so far.
 
This got me thinking. ARM's vfp is effectively just simd with much higher latency and the same throughput as normal alu instructions. Now my question is, are ARM's neon and vfp different execution units, or are they the same thing. If they are the same, then to me it is the greatest marketing con job that I know of.
arm's vfp has never been simd, and never claimed to be. it does vector ops by sequencing scalar ones*. neon, on the other hand, is a simd unit.

* not only vector-width-wise, but also composite-ops-wise - a madd in vfp is actually a discrete mul and an add.
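A rough software model of that behaviour, in C, in case it helps picture it. This is only a sketch of the idea (one element at a time, multiply and add as two separate steps), not a description of the actual hardware sequencing; `vfp_style_vmla` is a hypothetical name, and rounding the product to float before the add is my assumption about what "discrete" implies.

```c
#include <assert.h>

/* Model of a VFP-style short-vector multiply-accumulate:
 * the elements are walked one after another (no lanes in parallel),
 * and each element is a separate multiply followed by a separate add. */
static void vfp_style_vmla(float *acc, const float *a, const float *b,
                           unsigned len)
{
    unsigned i;
    assert(acc && a && b);
    for (i = 0; i < len; ++i) {
        /* volatile forces the product to be stored as a float, so the
         * compiler cannot merge the two steps into one fused operation */
        volatile float product = a[i] * b[i];  /* discrete multiply */
        acc[i] = acc[i] + product;             /* discrete add */
    }
}
```

A SIMD unit would instead handle several of those elements in the same operation.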
 
On the other hand, the A8 Neon is a 2-wide FP32 engine which can, like MMX, also run things like 4-wide INT16, 8-wide INT8, etc. - the A9 Neon is the same thing but twice as wide (similar to the Qualcomm 'Scorpion' CPU's FPU core, codenamed 'VeNum').
I wonder where this twice wide NEON Cortex-A9 information comes from.
What is possibly the most interesting part about VFP is what's happening to it in the Cortex-A9 generation, where they claim full IEEE is now full-speed and it's twice as fast (it's not clear if it can be both at the same time; i.e. maybe it's 1/clock for full IEEE vs less than 1, and an extra 1 for the non-IEEE case?). However the issue width to the VFP still seems to be 1, so if this is true and not just marketing you'd expect the only way to achieve that to be via those vector instructions as Farrar described in there.
The claim is comparing Cortex-A9 VFP performance against Cortex-A8 VFPlite performance I guess. Cortex-A8 VFP is not pipelined whereas A9 VFP is.

BTW VFP vector operations have been deprecated.
 
deprecated? I thought you couldn't deprecate ISA w/o breaking backward compatibility. Or is it the case that their use has been slowed down a lot by reducing the die space allocated to them?
 
Laurent06: My source is http://www.arm.com/pdfs/ARMCortexA-9Processors.pdf
I think this refers rather to ARM11's FPU, which already had a single-cycle MAC:
"Providing an average of more than double the Floating-Point performance of previous generation ARM Floating-Point coprocessors"
While this was not true of Cortex-A8's NEON, which only had a dual-MAC:
"The MPE extends the Cortex-A9 processor’s floating-point unit (FPU) to provide a quad-MAC"
I'd be interested in a link wrt deprecation, that'd be a nice confirmation :)
 
vfp11 does not have a single-cycle mac - most vfp11's ops have an issue rate (throughput in arm documentation's terms) of 1 cycle, but their latency is never one cycle. don't forget vfp11 is fully pipelined, whereas a8's vfplite is not.
Isn't throughput what matters to compare the ARM11 FPU and the Cortex-A9 FPU though? :)
Laurent06: Cheers!
 
deprecated? I thought you couldn't deprecate ISA w/o breaking backward compatibility. Or is it the case that their use has been slowed down a lot by reducing the die space allocated to them?
You can, sort of, if you provide a software trap to emulate them. My ARM memory is a bit rusty, but I believe these kinds of traps are already there for cases where you're running a program with floating-point opcodes on low-end ARM cores that don't have them.
 
This got me thinking. ARM's vfp is effectively just simd with much higher latency and the same throughput as normal alu instructions. Now my question is, are ARM's neon and vfp different execution units, or are they the same thing. If they are the same, then to me it is the greatest marketing con job that I know of.

According to the Assembler Guide, "If your processor has both NEON and VFP, all the NEON registers overlap with the VFP registers." And likewise it seems as if the load/store/transfer(to/from coprocessor reg/cpu reg) opcodes overlap between the VFP and NEON. So at least a shared register file.
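One way to picture the overlap is a C union where the same 16 bytes can be viewed as four singles, two doubles, or one quad register (S0..S3, D0..D1 and Q0 in ARMv7 terms). This is only an illustration of aliasing views onto one storage area; the type `overlapped_reg` and the helper are hypothetical names, and it assumes 4-byte float and 8-byte double.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Three views of the same 128 bits of storage. */
typedef union {
    float   s[4];   /* four single-precision views, e.g. S0..S3 */
    double  d[2];   /* two double-precision views, e.g. D0..D1 */
    uint8_t q[16];  /* one 128-bit quad view, e.g. Q0 */
} overlapped_reg;

/* Writes value through the S view, then reads the same bytes back
 * through the Q view. Returns 1 if the value is found there. */
static int s_write_lands_in_q(unsigned idx, float value)
{
    overlapped_reg r;
    float back;
    assert(idx < 4);
    memset(&r, 0, sizeof r);
    r.s[idx] = value;
    memcpy(&back, &r.q[idx * sizeof(float)], sizeof back);
    return back == value;
}
```

So a write to S2 shows up in the low half of D1 and in the third word of Q0, which is why code mixing the two instruction sets has to treat them as one register file.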
 
According to the Assembler Guide, "If your processor has both NEON and VFP, all the NEON registers overlap with the VFP registers." And likewise it seems as if the load/store/transfer(to/from coprocessor reg/cpu reg) opcodes overlap between the VFP and NEON. So at least a shared register file.
Ah yeah, that's another change that's worth pointing out: on the A8, if you've got Neon, then you don't have VFP. In A9, it seems that if you have Neon, you necessarily have VFP (but it shares the RF). At least that's what I could gather from the docs...
Fun stuff! :)
 
on the A8, if you've got Neon, then you don't have VFP. In A9, it seems that if you have Neon, you necessarily have VFP (but it shares the RF). At least that's what I could gather from the docs...
There's a small difference between A8 and A9: you can have A9 with VFP, or with VFP + NEON, whereas A8 always comes with VFP(lite) and NEON.
 
right, so it seems this is suddenly a bit more interesting/relevant w.r.t. cortex A8 vs vfp ;)

basically, hand-coded vfp code for arm11 might actually run slower on the A8?

is there any good info about this? the arm docs, whilst good, are a bit difficult to compare like for like.
 
Instruction timings for both the ARM11 VFP and the Cortex-A8 VFP are publicly available from ARM:

The ARM11 VFP implementation is pipelined in the general case, which means that you can get reasonable performance for hand-scheduled or vectorized code; the Cortex-A8 VFP implementation is non-pipelined unless a whole bunch of specific conditions are met (FP32 only, "RunFast" mode enabled, no vector instructions!), and its instruction latencies are quite large.
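To put the pipelined vs non-pipelined difference in numbers, here is a back-of-the-envelope model in C. It assumes a stream of independent operations with a single fixed latency and an issue rate of one per cycle when pipelined; real cores have stalls and dependencies, and the function names are made up. Plug in the latencies from ARM's timing tables rather than trusting any example figure.

```c
#include <assert.h>

/* Cycles to complete n independent ops on a fully pipelined unit:
 * the first result appears after `latency` cycles, then one more
 * result completes every cycle. */
static unsigned long pipelined_cycles(unsigned long n, unsigned long latency)
{
    assert(latency > 0);
    return n == 0 ? 0 : latency + (n - 1);
}

/* Cycles on a non-pipelined unit: each op occupies the unit for its
 * whole latency before the next one can start. */
static unsigned long unpipelined_cycles(unsigned long n, unsigned long latency)
{
    assert(latency > 0);
    return n * latency;
}
```

With a hypothetical latency of 8 cycles, 100 independent ops cost 107 cycles pipelined and 800 non-pipelined, which is why throughput rather than latency dominates for well-scheduled code.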
 
unless a whole bunch of specific conditions are met (FP32 only, "RunFast" mode enabled, no vector instructions!), and its instruction latencies are quite large.
That RunFast mode is severely limited (ref)
"A restriction that applies to VFP instructions executing in the NFP pipeline is that instruction results cannot be forwarded early to subsequent instructions."

No matter what you do if you want to have good FP performance out of A8 (where good means at least VFP11 level), you will have to use single precision NEON instructions and forget that VFP exists in A8.
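As a sketch of what "use single precision NEON" can look like from C, here is a multiply-accumulate loop written with the `arm_neon.h` intrinsics (`vdupq_n_f32`, `vld1q_f32`, `vmlaq_f32`, `vst1q_f32`). The function name `saxpy` and the scalar fallback for non-NEON builds are my additions for illustration, and this says nothing about how fast it actually runs on an A8.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* y[i] += a * x[i] for n single-precision floats. */
static void saxpy(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
    assert(x && y);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    {
        float32x4_t va = vdupq_n_f32(a);        /* a in all four lanes */
        for (; i + 4 <= n; i += 4) {
            float32x4_t vx = vld1q_f32(x + i);  /* load 4 floats of x */
            float32x4_t vy = vld1q_f32(y + i);  /* load 4 floats of y */
            vy = vmlaq_f32(vy, vx, va);         /* vy + vx * va per lane */
            vst1q_f32(y + i, vy);               /* store 4 results */
        }
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

On GCC for ARMv7 this needs something like `-mfpu=neon` to take the NEON path; otherwise the scalar loop does all the work.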
 
That RunFast mode is severely limited (ref)


No matter what you do if you want to have good FP performance out of A8 (where good means at least VFP11 level), you will have to use single precision NEON instructions and forget that VFP exists in A8.

I guess this is not good news for legacy apps running on the latest iPhone (3GS)...
 