Maybe I misinterpreted what design you are proposing.
I don't agree with most of your assertions here. You don't have to be able to branch every scalar instruction. Every fourth would match branching performance with the old design. As for transcendentals, I said let's ignore it for simplicity, but if you want to go there then I will.
You are saying that design B has 64x1D + 16x1D in a SIMD.
Okay, so let's compare the old design (A) with the new one (B). A has 16x(4x1D + 1D) SIMD units, and B has 64x1D + 16x1D SIMD units (MAD + transcendental).
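To make the unit counts concrete, here is a quick sketch of the arithmetic. The function and parameter names are just illustrative (they are not from any real tool); the numbers are the ones given above.

```python
# Illustrative only: hypothetical helper names, unit counts taken from the post.

def alu_count_a(groups=16, mad_lanes_per_group=4, trans_lanes_per_group=1):
    """Design A: 16 x (4x1D + 1D)."""
    return groups * (mad_lanes_per_group + trans_lanes_per_group)

def alu_count_b(mad_lanes=64, trans_lanes=16):
    """Design B: 64x1D (MAD) + 16x1D (transcendental)."""
    return mad_lanes + trans_lanes

print("A:", alu_count_a(), "units per SIMD")
print("B:", alu_count_b(), "units per SIMD")
```

Both come out to 80 units per SIMD, so the comparison is about how the same hardware is arranged, not how much of it there is.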
How many individual elements are processed per clock?
Design A has 16 elements being processed in a given clock cycle, hence why the 80 units per SIMD are divided up into 5 ALU processor groups.
What has design B changed, exactly, other than distributing the 16 over the terms in the parentheses?
Since you did so, I take it to mean that all 64 elements have one component evaluated per clock.
If not, why did you change the 16-unit division of ALUs?
I don't see how it's related to the thread-switching scheme that follows.
Design A has to have enough registers to handle two 64-thread batches.
A has a "macrobatch" of two 64-thread batches, and B's consists of eight 64-thread batches. A's macrobatches can be switched every 8 cycles, B's every 32 cycles. Both have instruction packets of (4x1D + 1D), but B has the additional flexibility of dependency in the MAD parts.
Design B needs enough to handle eight.
Either that, or each clause is 1/4 the size of those found in A, and clause setup overhead is quadruple that of A. The absolute amount of overhead is not something I'm aware of.
Each thread gets a sequencer in the SIMD's control logic.
You can see that instruction packet throughput and branch throughput are the same in both systems, so you don't really need more resources there for decoding/fetching/whatever.
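A rough sketch of where the 8-cycle and 32-cycle figures and the equal packet throughput come from. This assumes every thread in a macrobatch issues one (4x1D + 1D) packet per pass and that all lanes are fully used; the class and field names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Design:
    name: str
    mad_lanes: int               # scalar MAD lanes per SIMD
    trans_lanes: int             # transcendental lanes per SIMD
    batches_per_macrobatch: int
    threads_per_batch: int = 64
    mad_ops_per_packet: int = 4  # the 4x1D part of a packet
    trans_ops_per_packet: int = 1

    def threads(self) -> int:
        return self.batches_per_macrobatch * self.threads_per_batch

    def cycles_per_macrobatch(self) -> int:
        # Assumption: one packet per thread, lanes perfectly packed.
        mad = self.threads() * self.mad_ops_per_packet // self.mad_lanes
        trans = self.threads() * self.trans_ops_per_packet // self.trans_lanes
        return max(mad, trans)

    def packets_per_clock(self) -> float:
        return self.threads() / self.cycles_per_macrobatch()

# A's 16 x 4x1D is counted here as 64 MAD lanes in total.
design_a = Design("A", mad_lanes=64, trans_lanes=16, batches_per_macrobatch=2)
design_b = Design("B", mad_lanes=64, trans_lanes=16, batches_per_macrobatch=8)

for d in (design_a, design_b):
    print(d.name, d.threads(), "threads,",
          d.cycles_per_macrobatch(), "cycles per macrobatch,",
          d.packets_per_clock(), "packets/clock")
```

Under those assumptions A switches every 8 cycles and B every 32, and both retire 16 instruction packets per clock, which is the sense in which packet (and per-packet branch) throughput is the same.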
Design A has two.
Why wouldn't B have eight?
The live set of registers is also 8 times as large, over most of the 32 clocks of execution.
The most design A will have to worry about is 2 clauses' worth.