That's clearly not true - a cursory comparison of G80 and R600 reveals radically different architectures with quite different emphases on SIMD width.
I've said in another post I was limiting this discussion to a GPU in the vein of R600 and RV770.
I don't consider G80 to be an R600-style GPU.
I think you're trying to generalise too much and ignoring how architectures scale. At the same time, super-wide SIMDs are always going to look bad against narrower SIMDs, solely because of the dynamic branching problem. But it's the batch size that's the crunch point, not the SIMD width per se.
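To put a toy number on that crunch point (my own sketch, nothing vendor-specific - the taken-probability and path lengths are made up):

```python
# Toy model: the expected cost of a 2-way dynamic branch tracks the
# batch size, regardless of how the batch is spread over SIMD lanes
# and clocks. Assumes independent per-thread branch outcomes.

def branch_cost(batch_size, p_taken, taken_len, not_taken_len):
    """Expected instruction slots per thread for a 2-way branch.

    If any thread in the batch diverges from the rest, the whole
    batch must execute both paths under predication.
    """
    all_taken = p_taken ** batch_size
    none_taken = (1 - p_taken) ** batch_size
    divergent = 1 - all_taken - none_taken
    return (all_taken * taken_len
            + none_taken * not_taken_len
            + divergent * (taken_len + not_taken_len))

for batch in (16, 32, 64, 128):
    cost = branch_cost(batch, p_taken=0.05, taken_len=20, not_taken_len=20)
    print(f"batch {batch:3d}: {cost:.1f} slots per thread")
```

With a 5% taken-rate and equal 20-instruction paths, a 16-thread batch averages ~31 slots per thread while a 128-thread batch is pinned at ~40. Note the model doesn't care whether those 128 threads sit on one 128-wide SIMD or a 32-wide SIMD over 4 clocks - only the batch size enters into it.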
Non-coherent execution can exist outside of the individual batches.
What happens if AMD decides to add another kind of shader thread (just sayin')?
What happens if it's more than 4 such types?
More likely, what happens if AMD adds the ability to run multiple independent or loosely coupled programs, or the ability to make some special kind of procedure call?
I've been talking about the way the TU is constructed, hypothesising that it's a monolithic unit in RV670, with each TEX instruction running for 4 clocks. If RV770 is the same, then this enforces a batch size of 128 on the SIMDs (since a TU batch is assumed to be 32 wide * 4 clocks). So the basic design choices restrict the options for SIMD width. Only 5 SIMDs, each 32 wide, fit.
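Spelling out the arithmetic in a quick Python sketch (the 800-ALU/160-unit total for RV770 is my assumption here; the 32-wide, 4-clock TU is the hypothesis above):

```python
# Arithmetic behind the monolithic-TU hypothesis.
# Assumed: RV770 has 800 scalar ALUs organised as 160 VLIW-5 units,
# and a TEX instruction occupies a 32-wide TU for 4 clocks, as
# hypothesised for RV670.

tu_width = 32        # lanes in the (hypothesised) monolithic TU
tex_clocks = 4       # clocks per TEX instruction
batch_size = tu_width * tex_clocks        # 128 threads per batch

alu_clocks = 4       # a SIMD also runs a batch over 4 clocks
simd_width = batch_size // alu_clocks     # forces 32-wide SIMDs

total_units = 800 // 5                    # 160 VLIW-5 units (assumed)
simd_count = total_units // simd_width    # leaves room for only 5 SIMDs

print(batch_size, simd_width, simd_count)  # 128 32 5
```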
Clearly ATI can fiddle with the count of SIMDs. But if the batch size is 128 due to basic architectural factors, no variation in SIMD count is going to alter the divergence penalty.
All the R6xx GPUs thus far have kept TU width on a 1:1 basis with SIMD width.
If we assume that remains the case, then fiddling with the TU width maps 1:1 to fiddling with SIMD width, which, if we keep the ALU count and organization otherwise the same, means we haven't really moved away from fiddling with SIMD count.
We could even treat the TU as a SIMD, albeit more specialized, if we go by what a poster or two have said at other times...
So, the question is, am I right about the monolithic TU? If not, then there's more flexibility in both SIMD width and count.
I like the idea of a non-monolithic TU better in the abstract, though what I like obviously doesn't matter to AMD one bit.
A wider TU will suffer from more utilization issues, and on reflection, I think the price of underutilizing the equivalent of memory or cache ports is worse than the price of underutilizing the ALUs.
Apart from the start-up/shut-down costs of a program and branch divergence penalties, the SIMD count/width makes no difference, if the total ALU capability is constant.
It would make a difference if the GPU did what I said I hoped would happen in the future: generate clauses from independent programs and apportion them to a SIMD.
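As a purely hypothetical sketch of what I mean (the program names and the round-robin policy are invented for illustration):

```python
from collections import deque

# Hypothetical clause scheduler: independent programs each expose a
# queue of ready clauses; free SIMDs pull the next clause from the
# programs in round-robin order. Purely illustrative.

programs = {
    "shadow_pass": deque(["alu0", "tex0", "alu1"]),
    "physics":     deque(["alu0", "alu1"]),
    "post_fx":     deque(["tex0", "alu0"]),
}

num_simds = 4
schedule = [[] for _ in range(num_simds)]
order = deque(programs)

simd = 0
while any(programs.values()):
    name = order[0]
    order.rotate(-1)              # visit programs round-robin
    if programs[name]:
        schedule[simd].append((name, programs[name].popleft()))
        simd = (simd + 1) % num_simds

for i, clauses in enumerate(schedule):
    print(f"SIMD {i}: {clauses}")
```

The point being that in a scheme like this the SIMD, not the lane, is the unit the scheduler reasons about - which is why count matters independently of width.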
With programmability growing in many areas of functionality, the SIMD itself is increasingly a useful unit of granularity.
My posts have lacked clarity, so I failed to make it clear that when I speak of utilizing a GPU's resources, I don't mean only ALU utilization, but utilization everywhere.
As programmability increases, the number of ties to the shader arrays is bound to increase.
Future units that get tied into the execution pipeline are likely to have a way to interface with the SIMDs, but of course we'd only get garbage if two of them hit the same SIMD simultaneously.
The SIMD count becomes the upper bound on the number of independent programs that can run simultaneously, or at least the upper bound on programs that use any ALUs.
At its simplest you can see this problem if you consider the implementation of a 1MB register file. A 4 SIMD GPU will perform better than a 16 SIMD GPU - the latter will have fewer batches per SIMD available to hide latency. And as the register allocation increases, the total number of available batches across the entire GPU will fall faster in the 16 SIMD case because of fragmentation.
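Here's a back-of-the-envelope version of that comparison (assuming the batch size stays pinned at 128 per my TU argument, and vec4 fp32 registers; the per-thread register counts are illustrative):

```python
# Toy occupancy model for a 1MB register file split across SIMDs.
# Assumed: batch size is fixed at 128 threads (the monolithic-TU
# argument), and a batch's registers must fit entirely within its
# own SIMD's slice of the file - whole batches only.

REG_FILE_BYTES = 1 << 20    # 1 MB total register file
BATCH_THREADS = 128         # fixed by the hypothesised monolithic TU
REG_BYTES = 16              # one fp32 vec4 register per thread

def batches_per_simd(simd_count, regs_per_thread):
    slice_bytes = REG_FILE_BYTES // simd_count
    batch_bytes = BATCH_THREADS * regs_per_thread * REG_BYTES
    return slice_bytes // batch_bytes   # rounds down: fragmentation

for regs in (4, 8, 16, 24):
    print(f"{regs:2d} regs/thread: "
          f"4 SIMDs -> {batches_per_simd(4, regs)} batches/SIMD, "
          f"16 SIMDs -> {batches_per_simd(16, regs)} batches/SIMD")
```

At 24 registers per thread the 16 SIMD case is down to a single batch per SIMD - no latency hiding at all - while the 4 SIMD case still has 5, because the rounding-down happens 16 times instead of 4.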
Won't batch sizes scale with SIMD length?
I'm thinking a SIMD 1/4 the size of the first example would have 1/4-sized batches to match. The fragmentation and batching overhead issues would increase, true.
Does the increased performance of dynamic branching in the 16 SIMD GPU compensate for the lower latency-hiding in complex shaders?
How reduced is the latency hiding capability?
Besides, that question can't be answered for all workloads.
It could be done either way.
Streaming processors hide fetch latencies by oversubscribing the SIMDs. I can't work out the meaning of "issue-port restriction" to be honest.
Merely a comment on the fact that sending an instruction to be executed is analogous to a CPU's issue port, just fanned out by a factor of 16.
If it has 5 instructions in flight, each of which exercises a different area of the chip, it's only going to be able to issue 4 of them and must leave part of the chip idle.
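A toy illustration (the 5 unit types are invented; the 4 issue ports echo the "only going to be able to issue 4" above):

```python
# Toy issue model: 5 functional-unit types but only 4 issue ports
# per clock (numbers invented for illustration). With one ready
# instruction per unit type every clock, some unit always idles.

UNITS = ["alu", "tex", "interp", "export", "branch"]
ISSUE_PORTS = 4

idle_counts = {u: 0 for u in UNITS}
for clock in range(10):
    # round-robin pick of which 4 of the 5 ready instructions issue
    issued = [UNITS[(clock + i) % len(UNITS)] for i in range(ISSUE_PORTS)]
    for u in UNITS:
        if u not in issued:
            idle_counts[u] += 1

print(idle_counts)   # each unit idles ~1 clock in 5
```

Over any stretch of clocks, each unit sits idle about 1 clock in 5, purely from the port restriction.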
I'm also not sold on the idea that hiding fetch latency is the absolute highest priority in all cases, not if speculation or divergent work causes the chip to hit the TDP barrier.
Is latency tolerance massively important? Yes, but perhaps sometimes going to the ends of the earth to avoid that last cycle of unhidden latency isn't worth the price.
This doesn't tell us anything interesting, though, because of the freedom of architectural choices (and the learning curve). G71 vs R580 is another good example, with respect to batch size and dynamic branching, where both designs could perform 48 MADs per clock.
I'm basing my argument on the idea that R7xx will bear some resemblance to R6xx, and that AMD intends to extend the architecture into the future when it runs into Larrabee in the GPGPU space.
This could be wildly wrong, or AMD may not exist by that point, but that is where I'm coming from.