DaveBaumann said:
With VS3.0, vertex shaders already need to cope with texture accesses, and as shader lengths go up, texture accesses shrink in proportion to other ALU operations - the reduced importance of pure texturing is already being shown in architectural design.
Vertex shader lengths were already long, even in VS2.0 (256 instruction slots). The introduction of texturing into vertex shaders has the *opposite* architectural effect: it replaces long chains of procedural ALU ops with texture lookups. Texture lookups still have much higher throughput than procedurally generated data.
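A rough CPU-side analogy (this is not shader code; the displacement function and table size are made up for illustration): precompute an expensive procedural function into a table once, then replace the per-vertex ALU chain with a single lookup. That trade is essentially what vertex texture fetch enables.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical displacement function: a long chain of ALU work
// (several transcendentals), standing in for procedural evaluation.
float displace_alu(float u) {
    return 0.5f * std::sin(12.0f * u) + 0.25f * std::cos(29.0f * u)
         + 0.125f * std::sin(53.0f * u + 1.3f);
}

int main() {
    // "Texture": bake the function into a table once...
    const int N = 256;
    float table[N];
    for (int i = 0; i < N; ++i)
        table[i] = displace_alu(i / float(N - 1));

    // ...then each "vertex" does a single fetch instead of
    // re-running the whole ALU chain.
    float u = 0.37f;
    int idx = int(u * (N - 1));  // nearest-neighbour lookup
    std::printf("ALU chain: %f, table lookup: %f\n",
                displace_alu(u), table[idx]);
}
```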
And, again, we also have the case where you are likely to be able to dedicate more die area to a branching unit in a unified shader than to two discrete ones.
But now you have the much more complex problem of allocating shared units between an incoming vertex stream and an outgoing fragment stream. Say you have a pool of 32 unified units: how many do you allocate to vertices, and for how long? What if branches and texture fetches are happening in both the vertex and fragment programs? Because the two pipelines have vastly different I/O frequencies, efficiently allocating that pool of units to the current pipeline state, so as to maximize ALU throughput and bandwidth utilization, is a huge problem.
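To make the allocation problem concrete, here is a toy C++ sketch. The proportional-split policy and all the numbers are my own assumptions, not any actual hardware's scheme; the point is only how a naive policy behaves when one stream's workload spikes.

```cpp
#include <algorithm>
#include <cstdio>

// Toy model: split a pool of unified units between the vertex and
// fragment queues in proportion to pending work. Real hardware would
// have to make a decision like this every few cycles, with far less
// information than we have here.
struct Alloc { int vertexUnits; int fragmentUnits; };

Alloc allocate(int poolSize, int pendingVerts, int pendingFrags) {
    int total = pendingVerts + pendingFrags;
    if (total == 0) return {0, 0};
    int v = poolSize * pendingVerts / total;
    // Never fully starve a non-empty queue, or the other stage
    // eventually stalls waiting on it.
    v = std::clamp(v, pendingVerts > 0 ? 1 : 0,
                      poolSize - (pendingFrags > 0 ? 1 : 0));
    return {v, poolSize - v};
}

int main() {
    // Fragments usually outnumber vertices by orders of magnitude,
    // but a vertex-heavy burst flips the split and idles fragment units.
    Alloc a = allocate(32, 500, 20000);
    std::printf("verts=%d frags=%d\n", a.vertexUnits, a.fragmentUnits);
    Alloc b = allocate(32, 20000, 500);  // vertex-heavy burst
    std::printf("verts=%d frags=%d\n", b.vertexUnits, b.fragmentUnits);
}
```

Even this trivial policy has to special-case starvation, and it says nothing about how long an allocation should persist while in-flight branches and texture fetches change the queue depths underneath it.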
Highly parallel CPU machines, whose OSs and compilers spend a great deal of runtime scheduling identical units across the subprograms of a given datastream, still have not solved this problem satisfactorily (their average throughput remains well below the theoretical peak). So I have no confidence that a silicon scheduler, bound by gate limits and real-time constraints, is going to do better.
In SM4, the unified shader units look set to become generalized stream processors: they take input, can do ALU ops and texture fetches, and write output to a stream, which can then be read as a stream by the next stage or looped back. I simply do not see an easy way for the HW to be reconfigured to handle all these generalized cases efficiently, especially since a stall in an earlier unified shader can block the rest of the pipeline, which is waiting for its output. Unified shaders add more pathological cases that can hurt performance; they do not remove them.
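A minimal sketch of that stall-propagation problem, again with assumed numbers (the queue depth, stall cadence, and fetch latency are invented for illustration): when the producing stage stalls, the consuming stage starves no matter how many units it holds.

```cpp
#include <cstdio>
#include <queue>

// Toy two-stage stream pipeline. Stage A feeds stage B through a
// bounded queue; if A stalls (say, on a texture fetch miss), B runs
// dry and the whole chain idles.
int main() {
    std::queue<int> aToB;      // stream between the two stages
    const int kQueueCap = 4;   // on-chip buffering is small
    int aStallCycles = 0;

    for (int cycle = 0; cycle < 12; ++cycle) {
        // Stage A: every 4th cycle it stalls for 3 cycles
        // (assumed fetch latency), producing nothing.
        if (aStallCycles > 0) {
            --aStallCycles;
        } else if ((int)aToB.size() < kQueueCap) {
            aToB.push(cycle);
            if (cycle % 4 == 3) aStallCycles = 3;
        }

        // Stage B: can only work if A has produced something.
        if (!aToB.empty()) {
            aToB.pop();
            std::printf("cycle %2d: B busy\n", cycle);
        } else {
            std::printf("cycle %2d: B idle (starved by A)\n", cycle);
        }
    }
}
```

Running it shows B idle for three of every seven cycles: the downstream stage's utilization is capped by the upstream stage's worst case, which is exactly the kind of pathology a unified pool inherits from every program it runs.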
Not that I am against unified shaders, but more restrictive programming models offer more opportunities for optimization. The more general-purpose the model, the less can be determined statically by IHV designers, compilers, the APIs, and the drivers.