They specifically talked about the throughput of one NCU compared to one traditional CU with 64 SPs, and doubling the number of SPs is the simplest way to get there. How they would feed that many SPs and how the SPs would be organized, I have no idea. It could be dual issue to two separate vector ALUs each clock. Or it could be something more out of the ordinary, like dual issue to the same vALU with the computation spread over 8 cycles to match a round-robin scheme over 8 vALUs (I don't think this will happen), or something else entirely. There is also the possibility that the scheduler can somehow fuse certain combinations of ops and issue the fused ops, which would mean the higher throughput is only usable in relatively specific cases. All of this is still unknown right now.
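To put rough numbers on the cycle counts mentioned above, here is a quick back-of-the-envelope sketch. It assumes the classic GCN layout (64-wide wavefronts, 16-lane vALUs, four vALUs per CU) carries over, which is purely my assumption and not anything confirmed for the NCU. The function names are made up for illustration.

```python
# Rough arithmetic for the speculated issue schemes.
# ASSUMPTION: classic GCN numbers (wave64, 16-lane vALUs); nothing here is confirmed for the NCU.
WAVE_SIZE = 64    # work-items per wavefront
VALU_LANES = 16   # lanes per vector ALU


def busy_cycles(instrs_per_issue: int) -> int:
    """Cycles one vALU stays busy after receiving this many wavefront instructions at once."""
    return instrs_per_issue * WAVE_SIZE // VALU_LANES


def lane_ops_per_clock(num_valus: int) -> int:
    """Peak lane operations per clock if every vALU is kept fully busy."""
    return num_valus * VALU_LANES


if __name__ == "__main__":
    # Single issue: a vALU is busy 4 cycles, matching a round robin over 4 vALUs.
    print("single issue, busy cycles:", busy_cycles(1))
    # Dual issue to the same vALU: busy 8 cycles, matching a round robin over 8 vALUs.
    print("dual issue, busy cycles:  ", busy_cycles(2))
    print("4 vALUs, lane ops/clock:  ", lane_ops_per_clock(4))
    print("8 vALUs, lane ops/clock:  ", lane_ops_per_clock(8))
```

Under these assumptions, eight vALUs give twice the per-clock lane throughput of the traditional 64-SP CU, and dual issue to one vALU is what stretches its busy time to the 8 cycles a round robin over eight vALUs would need. The op-fusion idea is not modeled here, since its gain would depend on which op combinations can be fused.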