Face recognition, gesture recognition, speech recognition, speech encoding/decoding, compression, encryption, ray tracing, packet inspection, data sorting, financial computing, path finding, artificial intelligence, ...
You also need to realize that pretty much every loop with independent iterations and some integer data types could benefit from AVX2's support for 256-bit integer operations (and gather). Such loops are present (and often a hotspot) in any respectably sized code base.
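To make that concrete, here is a minimal sketch in plain C of the kind of loop I mean. Each iteration is independent, the data is 32-bit integer, and the table lookup is an indexed load, which is the access pattern a gather instruction covers. The function name and arguments are made up for illustration, and whether a given compiler actually vectorizes this depends on the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: remap each index through a lookup table and add a bias.
   No iteration depends on another, so eight 32-bit lanes could in principle be
   handled per 256-bit operation, with the table[idx[i]] load mapping to a gather. */
void remap_add(int32_t *out, const int32_t *table, const int32_t *idx,
               int32_t bias, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = table[idx[i]] + bias;
    }
}
```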
Also, integer processing is critical for implementing transcendental floating-point functions. So even heavily floating-point oriented applications can benefit significantly. And last but not least, Bulldozer appears to have twice the integer vector processing capability, so Intel can't risk falling behind.
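As a small illustration of the integer work inside floating-point math routines: an exp() or pow() style function commonly splits its argument into an integer part n and a fraction, evaluates a polynomial on the fraction, and then scales by 2^n by writing n into the exponent field. That last step is an integer add and shift. The sketch below shows only that step, assumes float is IEEE-754 binary32, and uses a function name I made up.

```c
#include <stdint.h>
#include <string.h>

/* Build the float 2^n directly from its bit pattern using integer operations.
   Assumption: float is IEEE-754 binary32. Only valid for normal results,
   i.e. -126 <= n <= 127; no range checking is done here. */
float pow2_int(int n)
{
    uint32_t bits = (uint32_t)(n + 127) << 23;  /* biased exponent, zero mantissa */
    float result;
    memcpy(&result, &bits, sizeof result);      /* reinterpret the bits as a float */
    return result;
}
```

A vectorized math library would do the same add and shift on all lanes at once, which is where 256-bit integer operations come in.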
BD has half the per-core FP throughput of Sandy Bridge.
8-core Bulldozer will be up against 4-core Sandy Bridge and Ivy Bridge. A few ALUs dedicated to a specific thread don't really count as a separate core. Nearly everything else is shared, making a Bulldozer module more like an alternative to a single Intel core with Hyper-Threading.
As software starts to adopt AVX-256, and Intel offers CPUs with more cores, AMD will have no other choice but to equip FlexFP with a pair of 256-bit FMA units. Intel already has 256-bit units, so it goes the other direction: adding FMA support to each unit. Neither company is going to allow the other to take a significant lead in floating-point performance, since that would cost them big deals in the HPC market.
How will that give enough operands for 2 FMAs per clock?
4 operands in the first cycle, 2 in the next. Of course this means in the worst case it can't sustain 2 FMAs per clock, but like I said the bypass network reduces the pressure on the register file. In fact, when you have a high density of FMA instructions they're bound to have close dependencies, so the result of the previous instruction can be fed into the top of the pipeline again without the need to occupy a register file port.
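Here is a toy model of that argument in C, just to put numbers on it. It is not a description of any real core: the struct, the function name, and the rule that only the immediately preceding instruction's result can be bypassed are all simplifying assumptions of mine. It counts how many FMA source operands still need a register file read.

```c
#include <stddef.h>

/* Toy model (assumption, not real hardware): each FMA reads three source
   registers and writes one destination register. */
typedef struct {
    int dst;
    int src[3];
} Fma;

/* Count source operands that must be read from the register file.
   A source equal to the destination of the immediately preceding
   instruction is assumed to arrive over the bypass network instead. */
size_t count_regfile_reads(const Fma *code, size_t n)
{
    size_t reads = 0;
    for (size_t i = 0; i < n; ++i) {
        for (int s = 0; s < 3; ++s) {
            int bypassed = (i > 0) && (code[i].src[s] == code[i - 1].dst);
            if (!bypassed) {
                ++reads;
            }
        }
    }
    return reads;
}
```

For a dot-product style chain where every FMA accumulates into the same register, this model counts 3 reads for the first instruction and 2 for each one after it, rather than 3 across the board.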
Note that for many years Intel had only 3 register file read ports, and hardly anyone ever ran into this bottleneck, precisely because the bypass network provides a lot of the operands.