I guess. I sort of thought of it the other way, too, i.e. if you're going to do 16 element batches, isn't it cheaper to make one 16-wide SIMD instead of two 8-wide SIMDs?
A single wide SIMD amortizes a fair amount of control and scheduling hardware over more ALUs, yes.
Though if the batch size is 16 we're back at a SIMD that runs through a batch instruction in one cycle.
Sure, but the same would happen if each instruction took 1 cycle and the SIMD cycled sequentially through 8 batches.
That would quadruple the number of thread contexts that the SIMD sequencers would have to pick through in order to set up the SIMD's execution schedule.
Similarly, there would be four times as many instruction queues.
Whatever storage that holds the instructions for an ALU clause would be accessed every cycle, as opposed to twice every 8 cycles.
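A toy way to put numbers on that trade-off (the batch counts and cadences below are assumed for illustration, not taken from any real part): two designs with the same ALU throughput, one switching batches every clock and one every 4 clocks.

```python
# Toy model of sequencer pressure, not real hardware: compare how many
# thread contexts the sequencer tracks and how often instruction storage
# is read, for two designs with equal ALU throughput.
def scheduler_pressure(cycles_per_instruction, resident_batches):
    # A new batch instruction issues every `cycles_per_instruction` clocks,
    # so that is also how often instruction storage must be read.
    return {
        "contexts": resident_batches,
        "instruction_fetches_per_clock": 1 / cycles_per_instruction,
    }

# Single-clock switch: 8 batches in flight, instruction storage read every clock.
single_clock = scheduler_pressure(cycles_per_instruction=1, resident_batches=8)

# 4-clock switch: 2 batches in flight, storage read twice every 8 clocks.
four_clock = scheduler_pressure(cycles_per_instruction=4, resident_batches=2)
```

With these assumed numbers the fast-switching design tracks four times the contexts and touches instruction storage four times as often, which is the cost being argued here.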
One thing I wondered is why R600 (and later) didn't have a 64x1D single-clock-switch SIMD instead of 16x5x1D 4-clock-switch SIMD, if you know what I mean. Let's ignore the 5th channel for now. This would give ATI the same dependency-free scalar performance that NVidia has.
It would require four times as many branch units to resolve branches, and a complex operation like a transcendental or an integer multiply would need four times as many transcendental units to keep throughput equivalent.
Both have pretty much the same instruction bandwidth (1D per clock vs. 4x1D every 4 clocks) and work on the same batch size. There's a little difference in scheduling cost for the same reason.
There's more queueing going on with the more finely divided scheduling.
The act of decoding instructions also happens much more frequently.
4x1D only decodes once every four clocks, the other option decodes every clock.
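A back-of-the-envelope sketch of that decode-rate difference (the 4-clock window and lane counts are assumed for illustration, and the 5th channel is ignored as above):

```python
CLOCKS = 4  # window over which both designs do the same ALU work

# 64x1D, single-clock switch: 64 lanes x 1 scalar op, one decode every clock.
ops_64x1d     = 64 * 1 * CLOCKS   # scalar ops per window
decodes_64x1d = CLOCKS            # decode events per window

# 16x5x1D with the 5th channel ignored (-> 16x4): one wide instruction word
# covers 4 scalar slots, decoded once per 4 clocks.
ops_16x4      = 16 * 4 * CLOCKS   # scalar ops per window
decodes_16x4  = CLOCKS // 4       # decode events per window
```

Same ALU work per window, but the 1D-per-clock option decodes four times as often.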
It also appears that the single-lane layout in G80 was a stumbling block to reaching higher DP FLOPs, or AMD just lucked out that its scheme allowed a quicker path to DP math.
I'm not sure which scheme is better.
The wider AMD model takes a given number of instructions and runs them over a much larger number of elements, so the per-element cost of the program itself and its setup is lower.
On the other hand, it is more wasteful when the workload isn't as broad.
The narrower model takes more hardware to schedule, and its per-batch overheads are spread over fewer elements. There's much less waste at the margins, though.
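One way to see the margin effect is a toy utilization model (the batch widths of 64 and 16 are assumed here purely for illustration):

```python
import math

def utilization(elements, batch_size):
    # Work is rounded up to whole batches; lanes in the last, partially
    # filled batch execute but do no useful work.
    batches = math.ceil(elements / batch_size)
    return elements / (batches * batch_size)

# A 40-element workload:
wide_util   = utilization(40, 64)   # 0.625 -- over a third of the lanes idle
narrow_util = utilization(40, 16)   # ~0.833 -- much less waste at the margin
```

Both models converge to full utilization on big, regular workloads; the gap only opens up on small draws and divergent regions, which is exactly the "margins" point above.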