I think his point was that Intel really handles 256-bit AVX natively, whereas AMD needs to split it up into two 128-bit operations. Yes, I meant marketing cores.
The point was that most of Intel's offering is still quads, so your point about half the FP performance does not hold.
That said, AMD's SSE unit is definitely beefier otherwise (though I think latency is quite a bit worse). Sandy Bridge, for instance, has only one SSE unit handling muls, so doing nothing but muls is twice as fast on BD (and if you're doing 256-bit muls, it is still just as fast). It only gets better if you use FMAD (though not by much, since with as many muls as adds you can do the adds "for free" on SB). I'm not sure, though, how they'd behave with real code, where the most frequent instructions tend to be the shuffle and pack/unpack ones.
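To put rough numbers on that, here is a toy cycle-count model in Python. It is only a sketch under the assumptions stated in this post: SB is modelled as one mul unit plus one separate add unit, each taking a 128-bit or 256-bit op per cycle, and BD as two 128-bit units that can each do a mul, an add or an FMAD, with 256-bit ops split into two halves. The function names are made up for illustration, and latency, shuffles and scheduling are ignored entirely.

```python
# Toy throughput model. Unit counts and widths are assumptions taken from
# the discussion above, not measured figures. Latency is ignored.

def sb_cycles(muls, adds, width_bits=128):
    # Assumed Sandy Bridge: one mul unit and one separate add unit, each
    # accepting one 128-bit or 256-bit op per cycle, so width_bits does not
    # change the count. The units run in parallel; the busier one sets the time.
    return max(muls, adds)

def bd_cycles(muls, adds, width_bits=128, use_fma=False):
    # Assumed Bulldozer: two 128-bit units, each able to issue a mul, an add,
    # or a fused multiply-add per cycle. A 256-bit op costs two 128-bit halves.
    halves = width_bits // 128
    if use_fma:
        # Each mul/add pair fuses into one op; leftovers issue separately.
        ops = max(muls, adds)
    else:
        ops = muls + adds
    return ops * halves / 2

if __name__ == "__main__":
    print("128-bit muls only:", sb_cycles(100, 0), "vs", bd_cycles(100, 0))
    print("256-bit muls only:", sb_cycles(100, 0, 256), "vs", bd_cycles(100, 0, 256))
    print("128-bit mul+add, FMAD on BD:", sb_cycles(100, 100),
          "vs", bd_cycles(100, 100, use_fma=True))
```

Under these assumptions, mul-only 128-bit code takes half as many cycles on BD, 256-bit mul-only code ties, and an even mul/add mix only favours BD when the pairs can actually be fused. Shuffle-heavy real code is outside what this model covers.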