Gubbi said:
What you describe is VLIW, not a superscalar shader, which is what I thought you meant originally. And as I stated, the more I think about it, the more sense VLIW makes as the low-level instruction scheme for a shader.
No, I did not explicitly mean VLIW (although I agree with you that it is probably the most efficient implementation). I mean that the compiler schedules and routes operations to functional units in the same way that compilers do register allocation today.
Let's say you have K functional units from a set F. Consider the following program:
X = T * C
Y = N * L
Z = R * V
MC = Ka + X
DC = Kd * Y
SC = Ks * Z
FC = MC + DC + SC
X, Y, and Z are data-independent, so I explicitly schedule these operations (via some extra bits; a hypothetical encoding is sketched after the listing) onto functional units F_1, F_2, F_3. MC, DC, and SC are likewise mutually data-independent, so the same applies there.
(F_1) dp4 X, T, C
(F_2) dp4 Y, N, L
(F_3) dp4 Z, R, V
(F_1) add MC, Ka, X
(F_2) mul DC, Kd, Y
(F_3) mul SC, Ks, Z
(F_1) add FC, MC, DC
(F_2) nop
(F_3) nop
(F_1) add FC, FC, SC
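To make those "extra bits" concrete, here is a minimal sketch in C of what such an encoding could look like. The field names and widths are entirely hypothetical, not taken from any real GPU:

/* Hypothetical instruction word: the compiler's "extra bits" are
   just a unit tag recording which functional unit the scheduler
   statically assigned the operation to. */
typedef struct {
    unsigned unit   : 2;  /* F_1, F_2, or F_3 (one encoding spare) */
    unsigned opcode : 6;  /* dp4, add, mul, nop, ... */
    unsigned dst    : 8;  /* destination register */
    unsigned src0   : 8;  /* first source register */
    unsigned src1   : 8;  /* second source register */
} ShaderOp;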
Now, if the GPU is VLIW, then you'd use tree covering in the compiler's code-generation phase to pack this schedule into the proper VLIW words. On the other hand, you could just as easily have a GPU that handles these instructions via pipelining and prefetch as normal, except that it doesn't have to use a scoreboard to figure out which units are busy and which are ready. The GPU can blindly execute these instructions without worrying about conflicts or hazards.
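(For the packing step: tree covering is the general instruction-selection machinery, but once the stream already carries unit tags, a trivial greedy packer suffices to show the word layout. Below is a minimal sketch under my own assumptions, three slots with nop fill, not any real compiler back end.)

#include <stdio.h>

#define K 3  /* one VLIW slot per functional unit F_1..F_3 */

typedef struct {
    int unit;          /* unit chosen by the compiler's scheduler */
    const char *text;  /* the operation, as written in the listing */
} Op;

int main(void) {
    /* The scheduled stream from the example, in issue order. */
    Op ops[] = {
        {0, "dp4 X, T, C"},   {1, "dp4 Y, N, L"},   {2, "dp4 Z, R, V"},
        {0, "add MC, Ka, X"}, {1, "mul DC, Kd, Y"}, {2, "mul SC, Ks, Z"},
        {0, "add FC, MC, DC"},
        {0, "add FC, FC, SC"},
    };
    int n = (int)(sizeof ops / sizeof ops[0]);
    int i = 0;
    while (i < n) {
        const char *word[K];
        int used[K] = {0};
        /* Greedily fill slots; an op whose slot is already taken
           starts the next VLIW word. Empty slots become nops. */
        while (i < n && !used[ops[i].unit]) {
            word[ops[i].unit] = ops[i].text;
            used[ops[i].unit] = 1;
            i++;
        }
        for (int s = 0; s < K; s++)
            printf("%-18s%s", used[s] ? word[s] : "nop",
                   s < K - 1 ? "| " : "\n");
    }
    return 0;
}

Compiled and run, this prints four three-slot words that match the listing above, nops included.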
Yes, you are always going to have situations where some units go idle, but you have the same problem of idle silicon when adding additional pixel pipelines. I would argue that increasing parallelism at a finer granularity is better than increasing it at a coarser one.
If you increase it at a very fine granularity (like the P10), then you have opportunities for allocating resources where they are best needed. If you increase it coarsely by including more monolithic pixel pipelines, then when one of them goes idle you are wasting *far more silicon*. You would of course argue for allowing these pipelines to be more independent. I would argue that you should take this further and allow resources within a pipeline to be more independent. I think the Trident XP4 takes this approach.
I would advocate a more general approach: include lots of reusable units and spend logic on a sophisticated routing architecture, so that the compiler can hook the units together as they are needed and route data between them (a la MAJC).
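To sketch what "the compiler hooks up the units" might mean concretely: the compiler could emit, per cycle, a crossbar configuration saying which producer feeds each unit's input ports. Everything below is hypothetical illustration; none of these names come from the P10, MAJC, or any shipping part.

#include <stdio.h>

/* Hypothetical operand sources for a unit's input port. */
enum Source { FROM_REGFILE, FROM_F1, FROM_F2, FROM_F3 };

/* Hypothetical per-cycle routing word: for each of F_1..F_3,
   name the producer arriving on each input port. The compiler
   would emit one of these alongside each instruction word. */
typedef struct {
    enum Source in0[3];  /* input port 0 of F_1..F_3 */
    enum Source in1[3];  /* input port 1 of F_1..F_3 */
} RouteWord;

int main(void) {
    /* Routing for the cycle executing "add FC, MC, DC" on F_1:
       MC is F_1's own previous result, DC arrives from F_2, both
       bypassing the register file. F_2 and F_3 read registers. */
    RouteWord w = {
        .in0 = { FROM_F1, FROM_REGFILE, FROM_REGFILE },
        .in1 = { FROM_F2, FROM_REGFILE, FROM_REGFILE },
    };
    printf("F_1 operand sources: (%d, %d)\n", w.in0[0], w.in1[0]);
    return 0;
}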
Adding more pipelines might be the brute-force way of doing it, but you could achieve the same thing by going multichip and just using extra GPUs to boost performance. It just doesn't seem as elegant to me.