That would quadruple the number of thread contexts that the SIMD sequencers would have to pick through in order to set up the SIMD's execution schedule.
Similarly, there would be four times as many instruction queues.
Whatever storage holds the instructions for an ALU clause would be accessed every cycle, as opposed to twice every 8 cycles.
It would require four times as many branch units to resolve branches, and a complex operation like a transcendental or integer multiply would require four times as many transcendental units to keep throughput equivalent.
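To put rough numbers on the claim (illustrative only; I'm assuming the 4x1D packet gets split into four scalar issues at the same peak rate):

    # Hypothetical scaling if the 4-wide packet goes scalar.
    SCALAR_FACTOR = 4                  # 4x1D packet -> four 1D instructions

    contexts_old = 2                   # two batches in flight per macrobatch
    clause_reads_old = 2 / 8           # instruction-store accesses per cycle

    contexts_new = contexts_old * SCALAR_FACTOR   # 8 contexts to pick through
    clause_reads_new = 1.0             # one instruction fetched every cycle

    print(contexts_new)                            # 8
    print(clause_reads_new / clause_reads_old)     # 4.0, a 4x access rate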
I don't agree with most of your assertions here. You don't have to be able to branch on every scalar instruction; branching on every fourth would match the old design's branching performance. As for transcendentals, I said let's ignore them for simplicity, but if you want to go there, then I will.
Okay, so let's compare the old design (A) with the new one (B). A has 16x(4x1D + 1D) SIMD units, and B has 64x1D + 16x1D SIMD units (MAD + transcendental). A has a "macrobatch" of two 64-thread batches, and B's consists of eight 64-thread batches. A's macrobatches can be switched every 8 cycles, B's every 32 cycles. Both have instruction packets of (4x1D + 1D), but B has the additional flexibility of allowing dependencies between the MAD instructions.
Every 8 cycles for A:
Load two batches, execute an instruction packet on each, branch up to twice.
Every 32 cycles in B:
Load eight batches, execute an instruction packet on each, branch up to eight times. Note that the 16 trans. units can operate on all eight batches in this time.
You can see that instruction packet throughput and branch throughput are the same in both systems, so you don't really need more resources there for decoding/fetching/whatever. You just need a little more pipelining for the same scheduling system in A to handle B. The register file may need to be a bit smarter about dependencies, but I don't see much of a problem there, particularly with the use of a temp register. The only big change is the same one I was asking about earlier: switching instructions every clock instead of every 4 clocks within the SIMD arithmetic logic.
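Here's a quick sanity check of that arithmetic (Python, using only the unit counts and batch sizes stated above):

    # Design A: 16 x (4x1D + 1D) units, macrobatch = two 64-thread batches,
    # switched every 8 cycles, one instruction packet per batch.
    BATCH = 64
    a_window = 8                                # cycles per macrobatch
    a_packets = 2                               # one packet per batch
    a_branches = 2                              # up to one branch per batch

    # Design B: 64x1D MAD + 16x1D transcendental, macrobatch = eight batches.
    mad_cycles_per_batch = 4 * (BATCH // 64)    # 4 MAD instrs at 1 cycle each
    trans_cycles_per_batch = 1 * (BATCH // 16)  # 1 trans instr over 4 cycles
    b_window = 8 * mad_cycles_per_batch         # 32-cycle macrobatch window
    assert 8 * trans_cycles_per_batch == b_window   # 16 trans units keep up
    b_packets = 8
    b_branches = 8

    # Packet and branch throughput per cycle come out identical:
    print(a_packets / a_window, b_packets / b_window)     # 0.25 0.25
    print(a_branches / a_window, b_branches / b_window)   # 0.25 0.25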
It also appears that the single-lane layout in G80 was a stumbling block to getting higher DP FLOPs, or AMD just lucked out in that its scheme allowed a quicker path to DP math.
I guess that makes sense, but part of the problem is that NVidia is working with a smaller batch size, making it harder to go SIMD with the DP.
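To illustrate the point (purely a sketch: I'm assuming DP gets built by ganging SP lanes together, which may not be how either vendor actually does it, and using 8 lanes with 32-thread batches for the G80 style versus 16 units with 64-thread batches for AMD's):

    # Cycles to run one batch when 'gang' SP lanes cooperate per DP result.
    def cycles_per_batch(batch, lanes, gang):
        return batch // (lanes // gang)

    # AMD style: 16 wide units, 64-thread batch; fusing a unit's 4 MAD
    # slots into one DP op leaves the 4-cycle cadence intact.
    print(cycles_per_batch(64, 16, 1))   # 4 cycles, SP or fused DP

    # G80 style: 8 scalar lanes, 32-thread batch.
    print(cycles_per_batch(32, 8, 1))    # 4 cycles SP
    print(cycles_per_batch(32, 8, 2))    # 8 cycles DP; the cadence breaks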