Actually, it depends on the pipelining, doesn't it?
Pardon my making up words as I go along:
In a 3 macrostage pipeline, if one component could be done per pseudoclock, scalar dependencies could be resolved as long as the scalar operator was able to do one macrostage per pseudoclock for its previous instruction.
For instance: if a subunit could output one component of a mul in one clock cycle, and a pipeline had 3 such subunits replicated for one macrostage of the "vec3" part of the pipeline, it could have each subunit cascade for dependency and process 3 different pixels simultaneously. Each pixel would take 3 clocks, but the outputs would be staggered so that, once the pipeline is full, one pixel still finishes per clock.
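To make the staggering concrete, here is a rough Python sketch of that cascade. It is a toy model and not a description of any real hardware: the function names (`simulate_staggered`, `completion_clock`) and the assumption that pixel p enters the first subunit at clock p are made up purely for illustration.

```python
def simulate_staggered(num_pixels, num_subunits=3):
    """Toy model of the cascaded "vec3" macrostage.

    Assumption (hypothetical): subunit s handles component s, and pixel p
    enters subunit 0 at clock p, then moves to the next subunit each clock.
    Returns schedule[clock] = list of (subunit, pixel, component) tuples.
    """
    total_clocks = num_pixels + num_subunits - 1  # fill + drain included
    schedule = []
    for clock in range(total_clocks):
        active = []
        for subunit in range(num_subunits):
            pixel = clock - subunit  # pixel currently sitting in this subunit
            if 0 <= pixel < num_pixels:
                active.append((subunit, pixel, subunit))
        schedule.append(active)
    return schedule


def completion_clock(pixel, num_subunits=3):
    """Clock (0-based) during which the last component of `pixel` is computed."""
    return pixel + num_subunits - 1


if __name__ == "__main__":
    for clock, active in enumerate(simulate_staggered(5)):
        print(clock, active)
```

In this model each pixel spends 3 clocks in the macrostage, but after the first two clocks of fill there are 3 pixels in flight and one pixel completes on every clock, which is the "per clock output" I mean by pseudoclock.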
However, if the scheduler could analyze dependency and manage another, more flexible, pipeline (like the scalar one in the R300), it could have the choice of using one macrostage in that pipeline for the 4th component, or getting a head start on stage propagation delay for a dependent scalar operation (if it was told to do the dependent component calculation first).
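A small sketch of the "head start" idea, under the same toy assumptions as before: one component finishes per clock, in whatever order the scheduler picks, and a dependent scalar op can begin on the clock after the component it needs is done. The helper `scalar_start_clock` and the component names are hypothetical, and this ignores every real cost of scheduling this way.

```python
def scalar_start_clock(component_order, dependent_component, issue_clock=0):
    """Earliest clock at which a dependent scalar op could begin.

    Assumption (hypothetical): components are computed one per clock in
    `component_order`, starting at `issue_clock`, and the scalar op may
    start on the clock right after its input component is finished.
    """
    position = component_order.index(dependent_component)
    return issue_clock + position + 1


# The scalar op depends on "z". Computing "z" last vs. first:
default_order = ["x", "y", "z"]
reordered = ["z", "x", "y"]

# Number of clocks gained by doing the dependent component first.
head_start = (scalar_start_clock(default_order, "z")
              - scalar_start_clock(reordered, "z"))
```

With these assumptions, moving the dependent component to the front of the order lets the scalar pipeline start two clocks earlier, which is the kind of choice I'm imagining the scheduler could make.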
...
What this doesn't analyze is the design cost of pipelining in this way and of being able to schedule for it, have register replication for it, etc., but I thought it was interesting to mention regarding considerations for future design and analysis of current hardware in the context of this statement about optimization opportunities.
Hopefully, I wasn't too sloppy in my wording and didn't make any silly errors or oversights.
I have a feeling this discussion reflects some things mentioned in prior pipeline discussions (and may have been pointed out to be fallacious in them), but I can't begin to guess right now which word I'd use to search for it (efficiently), though it should have been the latter half of last year some time.
...
macrostage: What I mean is that the actual number of discrete pipeline stages might differ...the "macrostage" is a convenience of representation for this particular case to maintain per clock output.
pseudoclock: An operation in a pipeline can take more than one clock to execute, but still output one per clock due to the number of simultaneous operations conducted in the pipeline. The "pseudoclock" is just a term for being able to implement that per clock output concept while in the middle of the "simple" pipeline concept.
subunit: "Unit" usually refers to something providing useful outputs, and can mean drastically different things. I'll use "subunit" just to try to avoid confusion with statements like "the R300 has 8 flexible 4 component 24 bit per component floating point processing units" and other variations that might be valid depending on how you look at things.