The unit of shading fragments is naturally a quad, so that's taken as the current minimum capability of a "pipeline". You can choose to dedicate a shader state to that single pipeline (e.g. R420), or you can gang a number of those pipelines together (NV40) so that they all have the same shader state (i.e. shader program, program counter and constants are all shared).
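To make the sharing concrete, here's a toy sketch of the two arrangements — all the class names and the example program are my own invention, not anything from the actual hardware:

```python
from dataclasses import dataclass

@dataclass
class ShaderState:
    # One copy per gang: shader program, program counter and constants
    program: tuple
    pc: int = 0
    constants: tuple = ()

@dataclass
class QuadPipeline:
    # Each quad pipeline shades a 2x2 block of fragments
    state: ShaderState

def build_gang(n_quads, state):
    # NV40-style: n_quads pipelines all referencing ONE shader state.
    # R420-style is the degenerate case, build_gang(1, state) per pipeline.
    return [QuadPipeline(state) for _ in range(n_quads)]

gang = build_gang(4, ShaderState(program=("mad", "tex", "mul")))
# Every pipeline in the gang points at the same state object,
# so program, PC and constants exist once, not four times
assert all(p.state is gang[0].state for p in gang)
```

The identity check at the end is the whole point: with a gang there's one `ShaderState` for four pipelines, whereas one-state-per-pipeline replicates it four times.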
As you increase the number of quads sharing a shader state, you theoretically cut the total amount of instruction decode logic in the GPU, and you can multiplex the register fetch and store pathways for all the pipelines - saving transistors. To be frank, though, I don't know what percentage of a pipeline is consumed by decode, fetch and store, and I dunno how practical the multiplexing is given the fairly extensive register file sizes of GPUs.
If texturing is likewise performed by multiple pipelines ganged together, then you can theoretically gain coherency in memory accesses, as well as savings in common decode and control hardware.
In Xenos and R5xx we see very small threads of 4, 12 or 16 quads of fragments being processed by a pipeline (all of them over four phases), rather than the 64 or 256 quads that seem typical of older GPUs. Effectively it's a "short" (in time: dozens of cycles) and "wide" (4 quads in Xenos, 3 in R580) pipeline architecture - the polar opposite of the old-fashioned fixed-function pipeline, which was a single quad (or half a quad, or a single fragment) wide and spent hundreds of cycles processing each instruction.
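The arithmetic behind those thread sizes is simple enough to sketch. Width in quads times four phases gives quads per thread; times four fragments per quad gives fragments per thread. The R520-style width of 1 quad is my assumption, inferred from the 4-quad thread size:

```python
def thread_size(quads_wide, phases=4, frags_per_quad=4):
    """Quads and fragments per thread for a 'wide and short' pipeline."""
    quads = quads_wide * phases
    return quads, quads * frags_per_quad

# Assumed widths: 1 quad (R520-style, my guess), 3 (R580), 4 (Xenos),
# all running a thread over four phases
for name, width in [("R520?", 1), ("R580", 3), ("Xenos", 4)]:
    quads, frags = thread_size(width)
    print(f"{name}: {quads} quads = {frags} fragments per thread")
```

That reproduces the 4-, 12- and 16-quad threads mentioned above (16, 48 and 64 fragments respectively).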
Xenos and R5xx can be built like this because of decoupled texturing and the ability to schedule very frequent shader state changes. And both arithmetic and texturing in these GPUs gain from lowered overheads in decode, fetch, store and control transistors (well, I presume they do!).
The scheduling complexity and its inherent transistor overhead seem to be a given if you want to build an architecture like Xenos or R5xx that decouples arithmetic and texturing and also provides for small thread sizes. So their wide arrays, with the lowered overheads for decode etc., are a way to tackle the cost of having this crazy scheduler (oh, and the enlarged register file that comes with it).
Well, that's the way I see it for ATI, at least. Whether R600 has such wide shader arrays is still something I can't decide on. Right now I'm leaning towards wide arrays for R600... Why not even wider?...
---
In your proposal for 48, 44, 40 etc. variants you do create a problem: certain parts of the GPU end up "oversized". E.g. if you have 40 arithmetic pipelines but only 16 texturing pipelines, the GPU is "out of proportion". Sure, no-one will complain about the performance. Beggars can't be choosers, you might say - at least the die is still useful.
Well, I don't know. As silent_guy remarked recently, DRAMs come out with 98%+ yield rates, because the extraordinarily high parallelism of a DRAM enables "just the right amount" of redundancy to be hidden within each die. Sure, that makes each die bigger, but clearly there's a point on the yield curve that makes a lot of sense to aim for. These fine-grained redundancy patents seem to be saying the same thing.
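A back-of-envelope model shows why a little redundancy buys so much yield. Assuming a Poisson defect model and treating units as failing independently, a die is good if at most the spare count of its units are bad. The 99% per-unit yield below is purely illustrative:

```python
import math

def unit_yield(defect_density, unit_area):
    # Poisson defect model: probability a single unit is defect-free
    return math.exp(-defect_density * unit_area)

def die_yield(units_needed, spares, y):
    # Die is good if at most `spares` of the fabricated units are bad
    # (binomial over units_needed + spares independent units)
    n = units_needed + spares
    bad = 1.0 - y
    return sum(math.comb(n, k) * bad**k * y**(n - k)
               for k in range(spares + 1))

y = 0.99  # assumed 99% per-unit yield, purely illustrative
print(die_yield(128, 0, y))  # no redundancy: under 30% of dies are good
print(die_yield(128, 2, y))  # two spare units: most dies are good
```

With 128 units and no spares, yield collapses to 0.99^128; adding just two spares (about 1.6% extra area in this toy model) pushes die yield above 80%, which is the DRAM trick in miniature.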
You could have a pool of processors, where each and every one has a distinct shader state and where there's a 1:8 or 1:16 redundancy - that's entirely feasible.
But I think the overriding problem with the "pool of 128 processors" view is that if you want to give each processor a distinct shader state, you get a massive explosion in shader program storage, shader state storage, decode logic and fetch/store pathway control logic. The overheads multiply really fast - though I don't have a decent idea of the quantities of transistors we're talking about.
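Just to put a (very rough) shape on that explosion, here's a sketch where the per-state storage sizes are entirely made-up placeholders, not real figures for any GPU:

```python
def state_storage_bits(num_states,
                       program_bits=512 * 96,    # assumed: 512 instructions x 96 bits
                       constant_bits=256 * 128,  # assumed: 256 vec4 fp32 constants
                       misc_bits=64):            # PC, predicates, flags etc.
    # Total shader-state storage replicated across the chip:
    # one full copy per independently-schedulable state
    return num_states * (program_bits + constant_bits + misc_bits)

pool   = state_storage_bits(128)  # every processor has its own state
ganged = state_storage_bits(8)    # e.g. 8 arrays of 16 sharing state
print(pool / ganged)  # → 16.0
```

Whatever the real per-state numbers are, the replication factor scales linearly with the number of distinct states, and the decode and fetch/store control logic presumably scales at least as badly.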
Jawed