It's likely that the op is split into two 128-bit uops.
That would be the more traditional approach (the same way SSE2 was handled on pre-Barcelona A64). dkanter's article actually suggested that either method might be a possibility.
One negative is that the registers are not 256 bits wide, so the rename capacity and latency hiding of the OoO engine are reduced, because it can only rename half as many registers.
Oh yes, I forgot about that. Though if you had "naturally 8-wide" code and just executed it as AVX-128, you'd really end up needing the same number of registers, as you'd just need more XMM regs instead.
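To put rough numbers on the register argument, here is a minimal sketch of the accounting. The pool size `PHYS_REGS_128` is a made-up figure for illustration, not something taken from AMD documentation, and both helper functions are hypothetical names.

```python
# Rough register accounting for 256-bit ops that are split into 128-bit halves.
# PHYS_REGS_128 is a made-up pool size used only for illustration.
PHYS_REGS_128 = 160


def renameable_vectors(phys_regs_128: int, vector_bits: int) -> int:
    """Number of vectors of the given width that fit in a pool of 128-bit registers."""
    regs_per_vector = vector_bits // 128
    return phys_regs_128 // regs_per_vector


def regs_needed(num_vectors: int, vector_bits: int) -> int:
    """128-bit physical registers consumed by num_vectors vectors of the given width."""
    return num_vectors * (vector_bits // 128)


# With 128-bit physical registers, only half as many YMM values can be in flight.
print(renameable_vectors(PHYS_REGS_128, 256))  # 80
print(renameable_vectors(PHYS_REGS_128, 128))  # 160

# "Naturally 8-wide" float data (256 bits) costs the same either way:
# one YMM register, or two XMM registers when written as AVX-128.
print(regs_needed(1, 256), regs_needed(2, 128))  # 2 2
```

So under this assumption the rename pressure for 8-wide data comes out the same whether it is written as AVX-256 or as twice as many AVX-128 ops.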
There can be more contention for issue ports as well, since shuffle instructions have to fight with arithmetic ops for one of the FMAC ports. Since an AVX-256 op occupies two ports, the arithmetic op takes longer to process. Various instructions that shuffle or insert values within and between YMM registers may take longer, reducing throughput further.
So all in all AVX-128 probably will always be a (slight) win over AVX-256 on this chip. The increased decoder bandwidth this needs probably is of little consequence.
btw one area where BD excels (at least!) is the SIMD int scores. SNB (and AVX) cheaped out on these, so they are only 128 bits wide, and BD has improved on the SIMD int units. Despite having to make do with only 4 FP units, it still manages to turn in a score 3 times higher than the 6-core Phenom II (well, in SiSoft Sandra that is). I actually do not quite understand why, since Barcelona (and up) should have been able to execute 2 SIMD int instructions per clock just as well. I guess being able to do two muls or adds simultaneously helps, but the increase over Phenom II is much larger even taking this into account. I don't think the test was actually using the IMAC (as the big increase showed up both with and without AVX code), though that could definitely lead to a big increase (I was looking at the techreport numbers, http://techreport.com/articles.x/21813/17). Though maybe it was using the IMAC, in which case the score isn't all that impressive. Interestingly, AIDA64, despite specifically mentioning that it uses XOP, does not show any such gain at all over Phenom II, though the score is good regardless.
edit: actually I think I've got that wrong; it's confusing in the BD optimization manual (that there are obvious copy/paste bugs around the important areas doesn't help matters either). The chip has a mysterious "MMA" execution unit (mapped to pipe 0) which is completely missing from the overview (mislabeled as another FMA I think, though in reality it's probably the same unit anyway) and seems to handle all SIMD int mul related operations (muls and macs). If I interpret that right, this means you could actually issue 3 SIMD int ops per cycle: 2 adds (or ands, ors, masks, other simple stuff) and 1 mul/mac. Not 2 SIMD int muls per clock, however, but in any case it would be quite an improvement with the right mix of instructions (there seems to be no way to issue a 3rd int add to that MMA unit, though, despite it obviously having an adder). Still does not explain the Sandra Multimedia Int score though.
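A quick back-of-the-envelope model of that issue mix, as a sketch only. It assumes my reading of the manual is right (two pipes take simple SIMD int ops, one MMA pipe takes mul/mac and nothing else), that every op is independent, and that each pipe accepts one op per cycle. The function names are made up for this example.

```python
from math import ceil


def min_cycles(simple_ops: int, mul_ops: int) -> int:
    """Lower bound on issue cycles for a batch of independent SIMD int ops.

    Assumed model: 2 pipes accept simple ops (add/and/or/...), 1 MMA pipe
    accepts mul/mac only, and each pipe takes one op per cycle.
    """
    return max(ceil(simple_ops / 2), mul_ops)


def ops_per_cycle(simple_ops: int, mul_ops: int) -> float:
    """Average SIMD int ops issued per cycle under the same assumed model."""
    cycles = min_cycles(simple_ops, mul_ops)
    if cycles == 0:
        return 0.0
    return (simple_ops + mul_ops) / cycles


print(ops_per_cycle(200, 100))  # 3.0  (ideal 2:1 mix of adds to muls)
print(ops_per_cycle(300, 0))    # 2.0  (adds only, MMA pipe sits idle)
print(ops_per_cycle(0, 300))    # 1.0  (muls only, limited by the single MMA pipe)
```

In this model only a 2:1 mix of simple ops to muls reaches 3 ops per cycle; anything else falls back towards 2 (add-heavy) or 1 (mul-heavy), which is why the instruction mix matters so much here.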