Jawed, I'm not sure where you got that impression, but it looks horribly wrong to me.
As a reference, I'll take slide 33 of this presentation:
http://rnc7.loria.fr/oberman_invited.pdf (which is a really nice presentation btw, with a good overall summary of the architecture and algorithms.)
Slide 31 is also of some relevance. And if you actually want the original paper, it's all there still:
http://66.102.9.104/search?q=cache:...ction+arith17&hl=en&ct=clnk&cd=1&client=opera
And there's also the original Stuart Oberman paper on enhanced minimax quadratic approximation, if the problem is that you aren't sure how the algorithm works internally. It's basically the same for the multifunction unit, just with some smart sharing.
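For anyone who wants the gist without reading the paper, here is a minimal sketch of table-driven quadratic interpolation: the upper bits of the input pick a table entry, and the lower bits go through C0 + C1*t + C2*t^2. Everything specific here is an assumption for illustration: the 64-entry table, the function names, the choice of 1/x on [1, 2), and above all the coefficients, which are fit through three points per segment rather than by the minimax method the paper uses.

```python
SEGMENTS = 64          # assumed table size (6 index bits), not taken from the paper
LO, HI = 1.0, 2.0      # assumed input range after range reduction
H = (HI - LO) / SEGMENTS


def build_table(f):
    """Per-segment (c0, c1, c2) from a quadratic through 3 points.

    NOTE: this is plain three-point interpolation, NOT a minimax fit.
    """
    table = []
    for i in range(SEGMENTS):
        x0 = LO + i * H
        y0, y1, y2 = f(x0), f(x0 + H / 2), f(x0 + H)
        c0 = y0
        c2 = 2.0 * (y0 - 2.0 * y1 + y2) / (H * H)
        c1 = (y2 - y0) / H - c2 * H
        table.append((c0, c1, c2))
    return table


def approx(table, x):
    """Evaluate c0 + c1*t + c2*t^2, where t is the offset into the segment."""
    i = min(int((x - LO) / H), SEGMENTS - 1)   # "upper bits": table index
    t = x - (LO + i * H)                       # "lower bits": offset
    c0, c1, c2 = table[i]
    return c0 + t * (c1 + t * c2)


recip_table = build_table(lambda x: 1.0 / x)
max_err = max(
    abs(approx(recip_table, 1.0 + k / 10000.0) - 1.0 / (1.0 + k / 10000.0))
    for k in range(10000)
)
print(max_err)
```

The real unit works in fixed point and shares the squarer and multipliers across functions; none of that is modelled here.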
What you don't seem to be realizing is that what you call the "auxiliary" paths are actually (partially) used for SF. SF needs 3 iterations, so it makes use of two of the "auxiliary" paths for this. (They're not strictly three iterations: the second one is very specific, as are some other things, which is why the paper calls it "three hybrid passes". The details are on page 3 of the paper if you feel like wasting some time...) Considering there is at least one other "auxiliary" path available, it's easy to see that the algorithm could be extended to 4 iterations if higher precision is ever needed, although the only use for that would be GPGPU, imo. None of the papers even allude to that possibility.
Rys' diagram, just like mine, assumes 16 units that need 4 clocks for SF and 1 for interpolation, but clearly it's 4 units (or one bigger unit, from the scheduler's POV, as I said above) that do one pixel quad of interpolation or one SF per clock. At least as far as I can see, of course.
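To see why the two pictures are easy to confuse, here is the per-clock arithmetic as a quick sketch. It assumes each of the 4 units handles one quad (4 pixels) of interpolation or one SF per clock; the function name and parameters are made up for illustration.

```python
def per_clock_rates(units, interp_per_unit_per_clock, clocks_per_sf):
    """Return (interpolations per clock, SF results per clock) for a group of units."""
    interp = units * interp_per_unit_per_clock
    sf = units / clocks_per_sf
    return interp, sf


# Diagram view: 16 units, 1 interpolation per clock each, 4 clocks per SF.
view_16 = per_clock_rates(16, 1, 4)

# 4-unit view: each unit does a quad (4 interpolations) or 1 SF per clock.
# The "4 interpolations per unit" figure is an assumption about the post's meaning.
view_4 = per_clock_rates(4, 4, 1)

print(view_16, view_4)
```

Both come out at the same aggregate rate, so throughput numbers alone can't tell the two layouts apart; the difference is in how the scheduler sees the units.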
Uttar
P.S.: I'm now 99% sure that the MUL is doing Special Function setup (to put the values in range). The patents also clearly hint at the MUL functionality of the multipurpose ALU being put to use for that. Finally, I think that except for CUDA, it makes sense to only expose the MUL when you're doing SF, as the MUL would be idling 3/4 of the time when it has to set up a sincos etc., since the SF couldn't keep up. It'd be interesting if they could expose it more generally in the future though, especially in the VS - hmm.
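As a rough illustration of what "putting the values in range" with a single multiply could look like for sin, here is a sketch. It is purely an assumption: neither the patents nor the paper are quoted here on the exact reduction, and the names are hypothetical. The idea is to scale the angle by 1/(2*pi) with one MUL and keep only the fractional part, so the SF unit only ever sees one period.

```python
import math

INV_TWO_PI = 1.0 / (2.0 * math.pi)   # constant operand for the setup MUL


def reduce_with_mul(x):
    """Hypothetical SF setup: one multiply, then keep the fractional turn."""
    turns = x * INV_TWO_PI            # the single MUL
    return turns - math.floor(turns)  # fractional part of the turn count


def sin_via_reduced(x):
    """Stand-in for the SF unit: evaluates sin on the reduced argument only."""
    return math.sin(2.0 * math.pi * reduce_with_mul(x))


for angle in (0.5, 7.0, -3.0, 100.0):
    print(angle, sin_via_reduced(angle), math.sin(angle))
```

With one such setup MUL per SF and the SF side delivering results at a quarter of the MUL's rate, the MUL only has work on one clock in four, which is where the 3/4 idle figure comes from.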