I'm reading the diagram as two 16-wide FP units and one 16-wide INT unit, so the scheduler can either alternate between the two FP units every cycle (which Turing can't do) or alternate between the INT and FP units (as Turing does). Yes, you wouldn't be able to keep all three 16-wide units busy concurrently, so at least one INT or FP unit would have a bubble every cycle. But it seems like a pretty non-invasive way to increase peak FP throughput without having to scale other parts of the SM. If the power spent on instruction execution is relatively small compared to the cost of fetching and moving operands, then this design doubles peak FP throughput without increasing peak SM power consumption very much. So it all seems pretty plausible to me...
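
To make the issue-rate argument concrete, here's a toy model in Python. It's only a sketch under my own assumptions, not a description of the real hardware: the scheduler issues at most one warp instruction per cycle, a 32-thread warp instruction keeps a 16-wide unit busy for two cycles, and there are no dependencies or operand-bandwidth limits. The function name `cycles_to_issue` and the unit lists are made up for illustration.

```python
WARP_SIZE = 32                    # threads per warp
LANES = 16                        # lanes per execution unit (my reading of the diagram)
OCCUPANCY = WARP_SIZE // LANES    # ASSUMPTION: cycles a unit is busy per warp instruction


def cycles_to_issue(stream, units):
    """Return the number of cycles needed to issue every instruction in `stream`.

    stream: sequence of "FP" / "INT" warp instructions, issued in order.
    units:  list of unit kinds available to one scheduler, e.g. ["FP", "INT"].
    ASSUMPTION: at most one instruction is issued per cycle, and there are
    no data dependencies or operand-fetch stalls.
    """
    free_at = [0] * len(units)    # cycle at which each unit can accept new work
    cycle = 0
    for op in stream:
        # Pick the matching unit that becomes free soonest.
        candidates = [i for i, kind in enumerate(units) if kind == op]
        unit = min(candidates, key=lambda i: free_at[i])
        cycle = max(cycle, free_at[unit])   # stall until that unit is free
        free_at[unit] = cycle + OCCUPANCY
        cycle += 1                          # one issue slot consumed
    return cycle


turing_like = ["FP", "INT"]          # one FP unit, one INT unit
ampere_like = ["FP", "FP", "INT"]    # two FP units, one INT unit

pure_fp = ["FP"] * 64
mixed = ["FP", "INT"] * 32

fp_turing = cycles_to_issue(pure_fp, turing_like)
fp_ampere = cycles_to_issue(pure_fp, ampere_like)
mixed_turing = cycles_to_issue(mixed, turing_like)
mixed_ampere = cycles_to_issue(mixed, ampere_like)
```

In this model a pure FP stream issues every other cycle on the Turing-like layout and every cycle on the Ampere-like one, so peak FP throughput roughly doubles. An evenly mixed FP/INT stream takes the same number of cycles on both, since a single issue slot per cycle can only keep two of the three units fed.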