There's no quad SP MAC, just dual FPMUL and FPADD pipelines which can be chained. Double precision is chained as well, of course.
It's a MAC in that it's a single multiply-accumulate operation. It's not fused, but fused MAC isn't the only definition of MAC.
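To make the chained-versus-fused distinction concrete, here is a minimal sketch. It uses Python doubles rather than single precision, and the function names `chained_mac` and `fused_mac` are made up for illustration; nothing here models the A9 pipelines themselves. The only point is that a chained MAC rounds twice (after the multiply and after the add), while a fused one rounds once.

```python
from fractions import Fraction


def chained_mac(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again."""
    product = a * b          # first rounding happens here
    return product + c       # second rounding happens here


def fused_mac(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # single, correctly rounded conversion


# Inputs chosen so the exact product is 1 - 2**-60, which rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

chained = chained_mac(a, b, c)   # the low bits are lost before the add
fused = fused_mac(a, b, c)       # the low bits survive to the final rounding
```

With these inputs the chained result is 0.0 and the fused result is -2**-60, so the two are both "a MAC" but not bit-identical.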
For integer there are eight 8x16-bit MACs, and I don't think they're shared with the floating-point multipliers. At least, I can't think of a very good way to do it, since you'd need six of them just to build the one requisite 24x32 multiply.
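The count of six can be checked with a small sketch. This assumes the simplest tiling, with the 24-bit operand cut into three 8-bit slices and the 32-bit operand into two 16-bit slices; `mul8x16` and `mul24x32_tiled` are hypothetical names, not anything from the hardware.

```python
def mul8x16(a8: int, b16: int) -> int:
    """Stand-in for one 8x16-bit unsigned multiplier."""
    assert 0 <= a8 < (1 << 8) and 0 <= b16 < (1 << 16)
    return a8 * b16


def mul24x32_tiled(a: int, b: int) -> tuple[int, int]:
    """Build a 24x32 unsigned product from 8x16 tiles.

    Returns (product, number_of_tiles_used).
    """
    assert 0 <= a < (1 << 24) and 0 <= b < (1 << 32)
    total = 0
    tiles = 0
    for i in range(3):                       # three 8-bit slices of a
        a_slice = (a >> (8 * i)) & 0xFF
        for j in range(2):                   # two 16-bit slices of b
            b_slice = (b >> (16 * j)) & 0xFFFF
            # Each partial product is shifted to its weight and summed.
            total += mul8x16(a_slice, b_slice) << (8 * i + 16 * j)
            tiles += 1
    return total, tiles


product, tiles_used = mul24x32_tiled(0xABCDEF, 0x12345678)
```

Tiling the other way around (16-bit slices of the 24-bit operand) would waste half a tile and need eight multipliers, so six is the minimum for this shape.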
It can be done. Parts of the Wallace tree and the Booth encoders can be re-used. The M3 multiplier as well, if it's radix-8.
Double-precision VMUL actually has a throughput of one result every two cycles on Cortex-A9, so I'd imagine the multiply part is split into two trips through two single-precision multipliers which have been extended from 24x24 to 27x27. It seemed to me that the modular approach would discourage this as well. At the same time, though, NEON always comes with VFP, while VFP itself is not required in any form for A9. You'd think that if they were 100% separate, they'd be offered separately, with maybe an A8-like VFP-lite option as well. Instead, my guess is that if you get NEON you get VFP practically for free, while the opposite isn't nearly as true because of all the integer hardware NEON needs.
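As a sanity check on that guess, here is a sketch of how four 27x27 partial products cover a 53-bit double-precision significand multiply, arranged as two passes through two multipliers. The pass schedule is an assumption made for illustration, not a description of the real A9 datapath, and the names `mul27x27` and `mul53x53_two_passes` are hypothetical.

```python
MASK27 = (1 << 27) - 1


def mul27x27(x: int, y: int) -> int:
    """Stand-in for one widened 27x27-bit unsigned multiplier."""
    assert 0 <= x < (1 << 27) and 0 <= y < (1 << 27)
    return x * y


def mul53x53_two_passes(a: int, b: int) -> int:
    """Multiply two 53-bit significands using two multipliers, twice."""
    assert 0 <= a < (1 << 53) and 0 <= b < (1 << 53)
    a_lo, a_hi = a & MASK27, a >> 27     # 27-bit low half, 26-bit high half
    b_lo, b_hi = b & MASK27, b >> 27

    # Pass 1: both multipliers work on the low half of b.
    pass1 = mul27x27(a_lo, b_lo) + (mul27x27(a_hi, b_lo) << 27)
    # Pass 2: both multipliers work on the high half of b.
    pass2 = mul27x27(a_lo, b_hi) + (mul27x27(a_hi, b_hi) << 27)

    # The second pass carries an extra weight of 2**27.
    return pass1 + (pass2 << 27)


a = (1 << 53) - 1
b = (1 << 52) + 12345
full_product = mul53x53_two_passes(a, b)
```

Two 27-bit halves give 54 bits of coverage, which is why 27x27 is enough for the 53-bit significand while 24x24 would not be.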
Architecturally, NEON cannot come without VFP. That decision wasn't made, IMO, because of how the A9 is implemented. I believe A15 has entirely separate VFP and NEON pipelines.
As for how "free" you can get VFP after having a NEON implementation, that depends on the NEON implementation. Without having to support the rounding and denormal handling of VFP as well as DP, a NEON implementation can be made to be very small and efficient. I'd wager that's why A15 separates its VFP from its NEON pipes.
It's true that much of the NEON and VFP multiply pipeline can be shared. But from a power/perf standpoint, having them separate -- and having separate fused and chained/mul/add pipelines -- is the best implementation. Of course, there's an area trade-off for that.