Beyond the doubled scheduler throughput, issuing from two wavefronts at once means keeping twice as many wavefronts in flight in the CU in order to sustain throughput. That also means doubling the register file size, and you'll want to increase cache in line with that, at which point you've effectively just doubled the CU count.
Also, when it can't dual issue from the same wavefront, why can't they make it fill the other ALU from another wavefront that is ready to go?
Seems like a no-brainer at a high level, so there must be some difficult technical hurdle to overcome to make it work.
Simply doubling up the ALUs certainly isn't going to bring anywhere near double the performance, but it also comes with a much smaller increase in die size and power draw. And those are the variables you're trying to optimise your performance against; metrics like realised IPC / peak IPC are at best a means rather than an end.
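To put rough shape on both arguments, here's a back-of-envelope sketch in Python. Every number in it is a made-up placeholder, not a measurement of any real GPU, and the function names are purely illustrative. The first part uses Little's law (items in flight = throughput x latency) to show why issuing from two wavefronts per cycle needs twice the wavefronts in flight; the second compares performance per unit of die area for two hypothetical design options.

```python
# Back-of-envelope sketch. All figures are hypothetical placeholders,
# chosen only to illustrate the shape of the trade-off.

def wavefronts_needed(issue_per_cycle: int, avg_latency_cycles: int) -> int:
    """Little's law: wavefronts in flight = issue rate * average latency
    before a wavefront is ready to issue again."""
    return issue_per_cycle * avg_latency_cycles


def perf_per_area(perf_scale: float, area_scale: float) -> float:
    """Relative performance divided by relative die area (baseline = 1.0)."""
    return perf_scale / area_scale


# Assumed: a wavefront waits 8 cycles on average between issues.
ASSUMED_LATENCY = 8
single_issue = wavefronts_needed(1, ASSUMED_LATENCY)
two_wavefront_issue = wavefronts_needed(2, ASSUMED_LATENCY)
print(f"wavefronts in flight, 1 issue/cycle: {single_issue}")
print(f"wavefronts in flight, 2 issues/cycle: {two_wavefront_issue}")

# Assumed (perf scale, area scale) pairs relative to a baseline CU.
designs = {
    "baseline": (1.0, 1.0),
    "doubled_alus_same_wavefront": (1.3, 1.1),   # modest gain, small area cost
    "issue_from_two_wavefronts": (1.9, 1.9),     # near-2x gain, near-2x area
}
for name, (perf, area) in designs.items():
    print(f"{name}: perf/area = {perf_per_area(perf, area):.2f}")
```

With these assumed inputs, the cheaper option wins on perf per area even though its utilisation of peak ALU throughput is much worse, which is the point being made above. Plug in different guesses and the conclusion can flip, so treat it as a way to frame the question rather than an answer.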