I wonder if replacing the FPMUL unit with an FMA unit would help, not necessarily by increasing the flop count, but by improving efficiency. I would expect multiply-accumulate to be the type of operation done in most workloads. It would seem that an FMA unit could help with the register pressure and latency that currently come from scheduling separate multiply and add instructions. In theory it could also help with instruction decode, since it eliminates an instruction from the stream.
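To put rough numbers on the decode and register-pressure point, here is a minimal sketch that counts arithmetic instructions for a dot product with and without a fused multiply-add. The function name `dot_product_ops` and the instruction mix are my own illustrative assumptions, not a measured profile of any real workload or of this hardware.

```python
# Hypothetical model: arithmetic instructions issued for an n-element dot product.
# Loads, stores and loop overhead are deliberately ignored.

def dot_product_ops(n, use_fma):
    """Return (arithmetic instruction count, temporaries needed per element)."""
    if use_fma:
        # acc = fma(a[i], b[i], acc): one instruction, product never leaves the unit
        return n, 0
    # tmp = a[i] * b[i]; acc = acc + tmp: two instructions, one temporary register
    return 2 * n, 1

without_fma = dot_product_ops(1024, use_fma=False)
with_fma = dot_product_ops(1024, use_fma=True)
print("separate mul/add:", without_fma)
print("fused:           ", with_fma)
```

Under those assumptions the fused version issues half as many arithmetic instructions and needs no register for the intermediate product, which is the efficiency gain described above, independent of peak flops.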
Surely that would increase the flops by only 50%, though, so both the ADD and the MUL units would have to be upgraded to FMADD. But as itsmydamnation says above, that may require additional pipeline changes to reap the full benefits. Or maybe Durango is just using 4 Steamroller modules.
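The 50% figure can be checked with some quick arithmetic. This is only a sketch: it assumes each unit retires one operation per lane per cycle, counts an FMA as 2 flops, and ignores vector width and clock speed, so the numbers are relative rather than real peak figures. The helper `peak_flops_per_cycle` is a made-up name for illustration.

```python
# Assumed flops per operation for each unit type (per lane, per cycle).
FLOPS_PER_OP = {"ADD": 1, "MUL": 1, "FMA": 2}

def peak_flops_per_cycle(units):
    """Sum the peak flops per cycle over a list of FP unit types."""
    return sum(FLOPS_PER_OP[u] for u in units)

baseline = peak_flops_per_cycle(["MUL", "ADD"])   # current MUL + ADD pair
one_fma = peak_flops_per_cycle(["FMA", "ADD"])    # only the MUL unit replaced
two_fma = peak_flops_per_cycle(["FMA", "FMA"])    # both units upgraded

print(one_fma / baseline)  # 1.5, i.e. a 50% increase
print(two_fma / baseline)  # 2.0, i.e. double
```

So swapping only the multiplier gets 3 flops per cycle against a baseline of 2, and it takes both units to reach 4.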