Consider the F-Buffer, which is an implementation of pixel processing state management and storage that lets a long shader be split across multiple passes without re-processing the geometry for each pass (sort of strikes me as making an IMR more closely suited to pixel/vertex processing re-tasking, as a TBDR is... hmm). There's a sketch of what I mean below.
Consider the V-Buffer, which someone from ATI mentioned in passing as being something like "F-Buffers for vertex processing".
Consider effort being expended to add transistors focused specifically on the performance enhancement, management, and scheduling of these buffers, and what that might allow something awfully close to the existing basic processing pipelines of the R3xx to accomplish (for example, PS/VS 3.0).
Consider how long we've had a description of the R420 that corresponds to that last bit.
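(Back to that first point: here's roughly how I picture the F-Buffer mechanism, as a minimal CPU-side sketch. The names, the two-pass split, and the whole structure are mine for illustration, not anything ATI has published, and the real thing of course lives in hardware between the rasterizer and the shader units.)

#include <cstdio>
#include <queue>
#include <vector>

struct Fragment {
    int x, y;
    float temp;   // intermediate shader result carried across passes
};

std::queue<Fragment> fbuffer;   // the F-Buffer: a rasterization-order FIFO

// Pass 1: geometry is transformed and rasterized ONCE; each fragment runs
// the first half of a too-long shader, then is pushed into the FIFO
// instead of being blended into the framebuffer.
void pass1(std::vector<Fragment> rasterized) {
    for (Fragment f : rasterized) {
        f.temp = f.x * 0.5f;    // stand-in for "first half of the shader"
        fbuffer.push(f);
    }
}

// Pass 2: the second half of the shader consumes the FIFO directly --
// no vertex work, no rasterization, fragments arrive in the same order.
void pass2(std::vector<float>& framebuffer, int width) {
    while (!fbuffer.empty()) {
        Fragment f = fbuffer.front();
        fbuffer.pop();
        framebuffer[f.y * width + f.x] = f.temp + 1.0f;  // "second half"
    }
}

int main() {
    std::vector<Fragment> rasterized = {{0, 0, 0.0f}, {1, 0, 0.0f}};
    std::vector<float> framebuffer(2, 0.0f);
    pass1(rasterized);
    pass2(framebuffer, 2);
    std::printf("%.1f %.1f\n", framebuffer[0], framebuffer[1]);
}

The point being: the geometry cost is paid once, and everything after that is pure per-fragment work with its state preserved in the FIFO, which is what makes the re-tasking idea above tempting.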
...
As for it not being easy: that's why, even though I think 8 vertex or "uber" pipelines (or maybe some more exciting marketingspeak term) in addition to 8 pixel pipelines seem feasible within the expected transistor budget, I don't think a peak of 16 pipelines applicable to all types of processing is guaranteed even if that expectation holds.
It seems likely that not all pipelines are created equal in the R420, because of the transistor budget: I don't see ATI pulling off both the extra 8 units and full PS 3.0 functionality for the 8 basic pixel pipelines that seem a given. It also seems to me that the vertex processing units of the R300 come quite close to the full PS 3.0 spec, if coupled with the R300's centroid-sampling-capable TMUs and pixel processing frontend.
Among the possibilities, the one that seems likely to me is no peak processing rate under full PS 3.0/VS 3.0 utilization. Specifically, what seems likely to me is peak parallelism of 16 for base PS 2.0, 8 for VS 3.0/VS 2.0, and some sort of choke for PS 3.0. Why I say a "choke", and not "8", for PS 3.0 parallelism is that the buffer management and preservation of state offered by the buffer systems I mentioned seem to have some possibilities for avoiding stalls, and, in combination with shaders containing lots of instructions besides flow control, for allowing a return toward the theoretical 16-pixel parallel peak (pardon... *deep breath* ...the spittle).
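To put a rough number on that "choke" (purely a back-of-the-envelope model of my own, with made-up widths, not anything ATI has described): if only the flow control instructions have to run down the narrower path, and everything else issues at full width, the effective rate is a weighted harmonic mean of the two:

#include <cstdio>

// Toy throughput model (my invention): a fraction f of a shader's
// instructions are flow control that can only issue at `narrow`
// pixels/clock; the rest issue at `wide` pixels/clock. The effective
// rate is the weighted harmonic mean of the two widths.
double effective_parallelism(double f, double narrow, double wide) {
    return 1.0 / (f / narrow + (1.0 - f) / wide);
}

int main() {
    // Assumed: a 16-wide peak for base work, a hypothetical 8-wide
    // path for flow control.
    const double fracs[] = {0.05, 0.10, 0.25};
    for (double f : fracs) {
        std::printf("flow control fraction %.2f -> %.1f pixels/clock\n",
                    f, effective_parallelism(f, 8.0, 16.0));
    }
}

With flow control being a small fraction of the instruction mix, the result sits much closer to 16 than to 8, which is all I mean by "choke" rather than a flat halving.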
I don't see the hypothetical "uber" pipelines being used to add processing depth, because that seems to complicate the problem of maintaining pipelining and managing buffers. I do see them doing something akin to that for managing branching situations, however, simply because of the transistor budget and the existing VS 2.0 featureset and implementation in the R3xx. I'm not sure about the TMU speculation... I don't see multiple-texture-units-per-pipe functionality as a useful focus when the emphasis is on processing, but OTOH, the directions shading might be taking, and the way an "uber" buffer implementation might fit into this, could still make it necessary for hiding latency.
What I'm wondering, while out here on this hypothetical branch of thought, is: what's the worst branching case that might happen in shading, what's the best, and what kind of solutions would be best suited to handling each acceptably? Would any such solutions lend themselves to having only 4 "uber" pipelines instead of 8, because the transistors would be better spent on implementing the solution usefully? The bandwidth relationship we've been led to expect seems to fit this more than the case for 16, really, but maybe ATI reads something else into the future.
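To make the worst and best branching cases concrete, this is the kind of accounting I have in mind (again a toy model of my own, assuming pixels get shaded in groups that stay in lockstep; I don't know that the R420 batches this way):

#include <cstdio>

// Toy model (mine, not ATI's): pixels are shaded in lockstep groups.
// If any pixel in a group takes a branch path, the whole group pays
// that path's cost; a fully coherent group pays for only one path.
int group_cost(bool any_taken, bool any_not_taken,
               int taken_cost, int not_taken_cost) {
    int cost = 0;
    if (any_taken)     cost += taken_cost;     // run the "then" side
    if (any_not_taken) cost += not_taken_cost; // run the "else" side
    return cost;
}

int main() {
    const int then_cost = 20, else_cost = 20;
    // Best case: every pixel in the group agrees -> one path's cost.
    std::printf("coherent group:  %d instructions\n",
                group_cost(true, false, then_cost, else_cost));
    // Worst case: the group diverges -> both paths get paid for.
    std::printf("divergent group: %d instructions\n",
                group_cost(true, true, then_cost, else_cost));
}

Best case, branching is nearly free; worst case, you pay for every path taken anywhere in the group, and the question becomes whether transistors spent on, say, smaller groups or smarter buffering of diverged fragments would beat transistors spent on 4 more "uber" pipelines.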