The problem with NEON on Cortex-A9 is that the A9's ROB can't track data dependencies for NEON instructions and has to stall on potential RAW hazards from them. Using NEON instructions effectively turns an A9 into an in-order processor.
Yes, NEON is in-order. And for an in-order design it doesn't really have enough registers to cover its latency, which can be substantial (and most operations aren't single-cycle). It takes aggressive hand scheduling to get good utilization out of it, but for the kind of data-parallel, linear-access algorithms you'd typically use it for, it's fairly viable to get good utilization without OoOE, given generous prefetching. Just painful.
NEON is actually further handicapped on Cortex-A9 since it isn't staggered against the integer pipeline the way it was on A8, so while you don't get the same penalty going from NEON registers to ARM ones, you also don't get the same latency hiding for loads. You also can't dual-issue loads/stores/permutes anymore, so you'll probably see quite a bit lower per-clock performance for very highly optimized code.
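To make the hand-scheduling point concrete, here's a minimal sketch in plain C rather than real NEON code. The idea it illustrates is keeping several independent dependency chains in flight so an in-order pipeline always has something to issue while an earlier result is still in the pipe. The function name `dot4` and the unroll factor of 4 are my own assumptions for illustration; the right number of chains depends on the actual instruction latencies and on how many registers you can spare.

```c
#include <stddef.h>

/* Dot product using four independent partial sums.
 * A single accumulator would make every multiply-add wait on the
 * previous one. With four separate chains, consecutive operations
 * don't depend on each other, which is what hand scheduling for an
 * in-order unit tries to achieve. The factor 4 is illustrative only. */
float dot4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent accumulations per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle the leftover elements. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that splitting the sum like this changes the floating-point summation order, so results can differ slightly from a straight sequential loop. The register pressure is also visible here: every extra chain costs another live accumulator, which is exactly where a limited register file starts to hurt.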
But I'd still want to have it on all Cortex-A9s.
The omission of NEON from Tegra 2 makes perfect sense from a performance point of view. From a software developer's point of view it is a PITA having to support multiple code paths: more work for the developer and app bloat for everybody. I agree with you that overall it was a bad call by Nvidia.
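For anyone wondering what "multiple code paths" looks like in practice, here's a rough sketch of the usual pattern: write the routine twice and pick one at startup through a function pointer. All the names (`add_fn`, `add_scalar`, `add_neon_placeholder`, `select_add`) are hypothetical, and the "NEON" path is plain C standing in for what would really be intrinsics from `<arm_neon.h>`, so that the sketch compiles anywhere. How `has_neon` gets set is left to the caller; on 32-bit ARM Linux one option, as far as I know, is checking `getauxval(AT_HWCAP) & HWCAP_NEON`.

```c
#include <stddef.h>

/* Common signature shared by every implementation of the routine. */
typedef void (*add_fn)(float *dst, const float *a, const float *b, size_t n);

/* Baseline path: runs on any Cortex-A9, including parts without NEON. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}

/* Placeholder for the NEON path. A real build would use NEON
 * intrinsics here; this stand-in just processes four elements per
 * iteration in plain C so the sketch stays portable. */
static void add_neon_placeholder(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = a[i + 0] + b[i + 0];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}

/* Pick an implementation once. has_neon is supplied by the caller
 * (hypothetical; detection itself is platform-specific). */
add_fn select_add(int has_neon)
{
    return has_neon ? add_neon_placeholder : add_scalar;
}
```

Typical use would be something like `add_fn add = select_add(has_neon);` at init, then calling `add(dst, a, b, n)` everywhere. The bloat is plain to see: every hot routine ships twice, and both versions need testing.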
The thing is, those NEON units took up very little die space, although maybe nVidia was more concerned with capping max TDP. Or maybe NEON constrained their clock potential more. I still say that for the right code NEON, even on A9, can improve performance by a substantial amount, but maybe nVidia needed to see more such code already out there to be convinced of that. If you use it naively it won't give you much at all.
As you say, Cortex-A15 solves all these problems. NEON is mandatory and is integrated with the out-of-order scheduling machinery. Performance-wise the A15 should be equivalent to an Intel P-III Coppermine.
Yeah that'd be pretty nice. It's not quite as wide, though (afaik 2 ALUs vs 3), but then again neither is Bulldozer.
metafor said:
Interesting. Does this mean that a single uncommitted NEON instruction can stall younger non-NEON instructions as well or simply that NEON instructions must be in-order with respect to each other?
I can't imagine ARM breaking their design so badly as to make it the latter.
On A8 NEON was decoupled via an instruction queue that'd only stall dispatch if it filled up (which it wouldn't if you interspersed non-NEON instructions). The only stall I was aware of in the other direction came from transferring from NEON to ARM registers, at which point any instruction touching the ARM register file at all would stall.
A9 is probably still using a queue like this, and the integer core probably does even less work with NEON instructions than A8 did in order to keep that functionality in the optional module.
Looking here on page 12, you can see there's an instruction FIFO on the compute side:
http://www.arm.com/files/downloads/Cortex-A9_Devcon_2007_Microarchitecture.pdf
It's just that dispatch into that FIFO occurs a lot earlier in the pipeline than it did on A8.