Throughput is 1 cycle for everything except double precision.
Latencies: float ops are 6 cycles, conversions to/from float 7, complex integer ops (madds etc.) 7, simple integer and logic ops 2, shifts and shuffles 4, load/store 6, and a branch mispredict 18. Doubles are 13 cycles, non-pipelined for the first 6.
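To make the latency vs. throughput difference concrete, here is a rough sketch in C that adds up the figures quoted above. It is a deliberately simplified model and not a simulator: it ignores dual issue, doubles and branches, and all names in it (`op_class`, `dependent_chain_cycles`, `independent_ops_cycles`) are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Operation classes and latencies (in cycles) as quoted in the text.
   The numbers are taken from the post, not re-measured. */
enum op_class {
    OP_FLOAT,
    OP_CONVERT,
    OP_INT_COMPLEX,
    OP_INT_SIMPLE,
    OP_SHIFT_SHUFFLE,
    OP_LOAD_STORE,
    OP_CLASS_COUNT
};

static const int latency_cycles[OP_CLASS_COUNT] = { 6, 7, 7, 2, 4, 6 };

/* Every op consumes the result of the previous one, so nothing overlaps
   and the latencies simply add up. */
int dependent_chain_cycles(const enum op_class *ops, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        assert((int)ops[i] >= 0 && (int)ops[i] < OP_CLASS_COUNT);
        total += latency_cycles[ops[i]];
    }
    return total;
}

/* n independent ops of the same class: one issues per cycle
   (throughput of 1), and the last result is ready one latency
   after the last issue. */
int independent_ops_cycles(enum op_class op, size_t n)
{
    assert((int)op >= 0 && (int)op < OP_CLASS_COUNT);
    if (n == 0)
        return 0;
    return (int)(n - 1) + latency_cycles[op];
}
```

Under this model, four float ops that depend on each other take 24 cycles, while four independent ones finish in 9, which is why interleaving independent work matters so much here.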
Floats and integers go in pipe 0; loads/stores, branches and shuffles go in pipe 1.
All in all, the pipes seem quite well balanced. Select bits in 2 cycles is quite nice: every branch that can be converted to a conditional move should be.
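As an illustration of the branch-to-select conversion, here is a minimal sketch in portable C. It only emulates a bitwise select with ordinary AND/OR operations. The function names and the operand order are assumptions for this example, not the hardware's actual intrinsic, and whether a given compiler really emits a select instruction for it is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise select: where a mask bit is 1 take the bit from b,
   where it is 0 take the bit from a. */
static inline uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* All-ones mask if x > y, all-zeros otherwise. */
static inline uint32_t mask_gt(uint32_t x, uint32_t y)
{
    return (uint32_t)0 - (uint32_t)(x > y);
}

/* max without an if: pick x where x > y, else y. */
uint32_t max_branchless(uint32_t x, uint32_t y)
{
    return select_bits(y, x, mask_gt(x, y));
}

/* min without an if: pick y where x > y, else x. */
uint32_t min_branchless(uint32_t x, uint32_t y)
{
    return select_bits(x, y, mask_gt(x, y));
}

/* Clamp v into [lo, hi] using two compare + select pairs. */
uint32_t clamp_branchless(uint32_t v, uint32_t lo, uint32_t hi)
{
    assert(lo <= hi);
    return min_branchless(max_branchless(v, lo), hi);
}
```

With the figures quoted above, a compare followed by a select costs a handful of cycles every time, whereas a branch costs 18 cycles whenever it is mispredicted, so the select version wins unless the branch is very predictable.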
A branch hint must be issued quite a few cycles in advance (13, I believe), even for unconditional jumps. It is possible to stall until the hint arrives rather than filling the gap with tons of nops.