I don't see why alpha blending need be handled any differently from texture fetches, in terms of latency hiding. That is, I would expect that an architecture could simply use the latency hiding that is used for texture fetches to also hide the latency for alpha blends.
Before you start throwing around the latest buzzwords like "dynamically allocated shader pipes": this only allows you to push shader performance where it's required; you still need pretty much the same buffering to cover all cases.
Not really. If you design the pipelines in an optimal fashion, then there's no need for much of any vertex->pixel cache. The main problem is that you'd want the pipelines to be able to switch quickly between vertex and pixel processing. For example: have a pipeline do all vertex processing for a single triangle, then do all pixel shading for that triangle, then start on the next triangle. If an architecture like this could be designed to handle two states at once efficiently, there would be no problem with load balancing.
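To make the idea concrete, here's a toy sketch of such a dual-state pipe: it does the vertex work for one triangle, switches state to shade that triangle's pixels, then moves on to the next triangle. All names (`unified_pipe`, `rasterize`, the `xform`/`shade` labels) are hypothetical stand-ins, not any real hardware's terminology.

```python
# Toy model of a dual-state pipeline: per triangle, run vertex work first,
# then switch state and shade that triangle's pixels, then start the next
# triangle. Purely illustrative.

def rasterize(tri):
    # Stand-in for triangle setup: pretend each triangle covers three pixels.
    return [f"{tri}:px{i}" for i in range(3)]

def unified_pipe(triangles):
    results = []
    for tri in triangles:
        vertex_out = f"xform({tri})"       # state 1: vertex processing
        for px in rasterize(vertex_out):   # state 2: pixel shading
            results.append(f"shade({px})")
    return results
```

The point of the sketch is just the control flow: one pipe, two states, no vertex->pixel cache beyond the triangle currently in flight.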
There are a couple of issues, of course. The first is branching: you'd have to handle branches efficiently with very long pipelines designed for texture access latency hiding. One way around this might be to have only one "idle" stage that is designed to hide most of the latency, with data that doesn't need texture accesses being promoted by some amount within the "idle" stage (which would largely act like a FIFO buffer). With such an architecture, you'd obviously need tags in the data in the buffer to indicate what to do with that data, as there'd be no realistic way to keep it all straight in other ways. Since we're talking about a dual-state system, that tag may just be one bit.
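The "idle" stage above can be modeled as a tagged FIFO. In this toy version (all names hypothetical), each entry carries the one-bit vertex/pixel tag plus a flag saying whether it's waiting on a texture fetch; entries that don't need a fetch get promoted past the ones that are still waiting.

```python
from collections import deque

# Toy model of the single "idle" stage: a FIFO whose entries carry a one-bit
# tag (0 = vertex, 1 = pixel) and a needs_fetch flag. Entries not waiting on
# a texture fetch are promoted ahead of fetch-bound entries. Illustrative only.

def drain_idle_stage(entries):
    """entries: list of (tag_bit, needs_fetch, payload) in FIFO order."""
    fifo = deque(entries)
    ready, waiting = [], []
    while fifo:
        tag, needs_fetch, payload = fifo.popleft()
        if needs_fetch:
            waiting.append((tag, payload))   # sits in the idle stage
        else:
            ready.append((tag, payload))     # promoted past waiting work
    # Promoted entries retire first; fetch-bound ones retire as data arrives.
    return ready + waiting
```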
Anyway, I'm not going to go any further into the problems. I'm sure you can think of a number of other problems with this solution, but the questions remain: Can this become more efficient than dedicated vertex and pixel pipelines? Is the additional transistor cost worth it? What additional programming possibilities could this add for developers, and would they make the change more worthwhile?
There are, of course, obvious benefits such an architecture could attain: absolutely zero stalling between vertex and pixel data with only a tiny buffer (you'd essentially only need to store a few incoming pieces of vertex/triangle data, a couple of triangles for the triangle setup engine to work on, and a couple of pixels output from the triangle setup engine...pixels would get priority over vertex data, with vertex data executing only when the triangle setup engine has no more pixel data to give to the pixel pipelines), and perfect load balancing between vertex and pixel work.
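That buffering and priority scheme can be sketched as a toy scheduler: a small vertex-input queue, a small pixel queue fed by triangle setup, and pixels always winning over vertices. Queue sizes, names, and the two-pixels-per-triangle setup output are all made up for illustration.

```python
from collections import deque

# Toy scheduler for the tiny-buffer scheme: pixels get priority; vertex work
# runs only when triangle setup has no more pixels to hand out. Each scheduled
# vertex (triangle) feeds new pixels into the setup queue. Illustrative only.

def schedule(vertex_queue, setup):
    """vertex_queue: deque of triangles; setup: deque of pending pixels."""
    trace = []
    while setup or vertex_queue:
        if setup:                        # pixels first, always
            trace.append(("pixel", setup.popleft()))
        else:                            # no pixels left -> do vertex work
            tri = vertex_queue.popleft()
            trace.append(("vertex", tri))
            setup.extend(f"{tri}:px{i}" for i in range(2))  # setup emits pixels
    return trace
```

Note how the trace interleaves: a vertex is only scheduled at the moments the pixel queue runs dry, which is exactly the load-balancing behavior described above.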