An FMUL and an FADD sharing an operand is certainly possible, but you'd need a 4x128-bit output datapath to the retirement buffer. Mind you, in most cases that's 2x128-bit that'll never be used.
Which is why the paper referred to earlier says a bridged FMA makes for a 40% bigger FPU than a pure FMA.
The FPU being shared between two INT cores should help make that less of a problem. Same for the 1x256-bit / 2x128-bit capability.
You get FMA that is a bit slower & a bit more power hungry than a pure FMA unit, but you keep similar speed and similar power to a classic MAD unit for separate MUL & ADD (where a pure FMA unit is comparatively slow & power hungry).
It's not the fastest or lowest-power option, but it seems like a good compromise for a general-purpose processor that is likely to be running mostly non-FMA code (legacy SSEn) yet needs to be capable of FMA for the new instructions.
Presumably later architectures may migrate to a pure FMA unit once old software is less common or its performance is deemed less critical.
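To make the code-shape difference concrete, here's a minimal C sketch of my own (the SSE/FMA3 intrinsics are real, but the example and the choice of FMA3 flavour are just for illustration; the FMA3-vs-FMA4 encoding question is a separate argument):

```c
#include <immintrin.h>  /* SSE and FMA3 intrinsics */

/* Legacy SSEn code shape: separate MUL and ADD instructions,
 * two roundings. On a bridged-FMA design these keep classic
 * MAD-like speed/power; on a pure-FMA design each one occupies
 * the full FMA pipeline. */
__m128 madd_legacy(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* FMA code shape: one fused instruction, one rounding.
 * Requires FMA hardware support (compile with e.g. -mfma). */
__m128 madd_fused(__m128 a, __m128 b, __m128 c)
{
    return _mm_fmadd_ps(a, b, c);  /* a*b + c in one op */
}
```

The point of bridging is that the first shape stays fast while the second one becomes possible.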
Edit: or refer back to itsmydamnation's Nov 09 post:
http://forum.beyond3d.com/showpost.php?p=1362342&postcount=94
Also, while re-reading the thread, I'd like to modify my response to this:

"Great for future workloads, but not so hot for today's software, if I'm not mistaken."
BD will do 1x128-bit FP per core, i.e. the same as the Phenom architecture, so existing software should see the same or better FP performance.
For future software, 256-bit will be there, but shared between the two cores of a module.
Given an equal number of cores, 256-bit-FP-heavy code may suffer vs an architecture with 1x256-bit per core.
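To be clear about what "existing" vs "future" software means here, a minimal sketch of my own (a hypothetical a[i] = a[i]*s + b[i] loop; the SSE/AVX intrinsics are real, the example is not from the thread):

```c
#include <immintrin.h>

/* Existing software: 128-bit SSE, 4 floats per op.
 * Each BD core gets a 128-bit slice of the shared FPU, so
 * per-core throughput should match Phenom. n is assumed to be
 * a multiple of 8 to keep the sketch short. */
void saxpy_sse(float *a, const float *b, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
    }
}

/* Future software: 256-bit AVX, 8 floats per op.
 * Both cores of a BD module contend for the single 256-bit FPU. */
void saxpy_avx(float *a, const float *b, float s, int n)
{
    __m256 vs = _mm256_set1_ps(s);
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(_mm256_mul_ps(va, vs), vb));
    }
}
```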
I saw a JF-AMD post somewhere pointing out that an 8-module BD vs an 8-core SB works out to 16 cores vs 8 cores, 16 threads vs 16 threads, and 8x256-bit FP vs 8x256-bit FP, so that shouldn't be too much of an issue assuming similar clocks.
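Back-of-the-envelope, assuming a made-up 3 GHz common clock (my number, not JF-AMD's) and counting one 256-bit FP unit as 16 single-precision flops/cycle (FMA on BD, co-issued MUL+ADD on SB):

```c
#include <stdio.h>

/* Hypothetical peak SP FLOPs comparison. Clock and per-cycle
 * figures are illustrative assumptions, not from the thread. */
int main(void)
{
    double clock_ghz = 3.0;  /* assumed equal clocks */
    int fp_units    = 8;     /* 8-module BD or 8-core SB: 8 x 256-bit FP */
    /* 256-bit SP = 8 lanes; FMA (BD) or co-issued MUL+ADD (SB)
     * = 2 flops per lane per cycle. */
    int flops_per_unit_per_cycle = 8 * 2;

    double peak_gflops = clock_ghz * fp_units * flops_per_unit_per_cycle;
    printf("Peak: %.0f GFLOPS for either chip\n", peak_gflops);
    return 0;
}
```

Both land on the same peak (384 GFLOPS under these assumed numbers), which is the parity JF-AMD was getting at.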