T already does this and then follows that with 5 multiplications and a 4-input adder. All in one logical clock cycle (8 physical cycles).
But those are structured differently from the multiplies in the ALU units. As you mentioned to me, you can do two multiplies in series in the other ALUs, so the first multiplication starts very soon. In the T units, you have to fetch the LUT value and do a partial square and cube before starting the multiply. The multiplies by the LUT coefficients don't need full IEEE precision and aren't followed by another multiply, so they sit further down the pipeline than in the other ALUs. Note how the other ALUs don't let you do an add after two serial muls, so that's not a viable path to accommodate the square/cube.
The structure is totally different. Look at Fig. 7 in the patent you linked to.
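To make the ordering argument concrete, here is a rough software sketch of a table-driven cubic evaluation. It is not the actual T-unit datapath or the one in the patent figure. The table size, the use of Taylor coefficients for sin, and the names `SEGMENTS`, `LUT` and `t_unit_sin` are all assumptions made up for illustration. The only point is the dependency chain: lookup first, then square and cube, then the coefficient multiplies, then one wide add.

```python
import math

SEGMENTS = 64                      # hypothetical table size (assumption)
STEP = (math.pi / 2) / SEGMENTS    # width of one table segment

# One row of four coefficients per segment: Taylor terms of sin
# about the start of that segment (an assumed choice of coefficients).
LUT = []
for i in range(SEGMENTS):
    x0 = i * STEP
    LUT.append((math.sin(x0), math.cos(x0),
                -math.sin(x0) / 2.0, -math.cos(x0) / 6.0))

def t_unit_sin(x):
    """Approximate sin(x) for 0 <= x < pi/2 with a lookup plus a cubic."""
    i = min(int(x / STEP), SEGMENTS - 1)   # step 1: table lookup
    c0, c1, c2, c3 = LUT[i]
    dx = x - i * STEP                      # offset within the segment
    dx2 = dx * dx                          # step 2: squarer
    dx3 = dx2 * dx                         # step 2: cuber
    # step 3: coefficient multiplies feeding a single four-term add
    return c0 + c1 * dx + c2 * dx2 + c3 * dx3
```

Nothing in step 3 can start until both the lookup and the square/cube have finished, which is the sense in which these multiplies sit later in the pipeline than the first multiply of a regular ALU op.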
Getting the data from the LUT to the multiplier is a non-issue, the LUT's right there in the single pipeline it's providing its data to.
It still needs a cycle. The squaring and cubing probably need two.
With ALU utilisation typically peaking at 75-80%, deleting this unit isn't generally going to cost any performance.
What makes you think the T-unit idling is responsible for the last 20-25%? Or that the other ALUs are idling while the T-unit is working? It absolutely will cost performance.
~20% area saving for the ALUs is not to be sniffed at.
Your figure is arrived at with fairytale accounting. Much of what is eliminated is simply transplanted: the LUTs, squarer, cuber, and exponent processing (exp, log, div). To squeeze all this into the framework of the main ALUs you're going to lengthen the longest path, which means slower clock speeds. I don't even think it's possible at all. Maybe it could be done with the modified pipeline I suggested earlier, as that is more accommodating to dependent math, but as it stands, a separate T unit just makes sense.
When you take these costs into account instead of just looking at the T-unit savings, and also look at the total size of the SIMD engine, you'll only save a few percent of the latter's cost. That's unlikely to be worth the loss of throughput.
Finally, of course, there's no absolute reason why a transcendental has to be done in a single cycle. e.g. as two cycles, with the squarer in the first cycle.
Well now you're reducing throughput even more. With the current architecture, in two cycles you can do 1 or 2 transcendentals and 9 or 8 regular ops. With your modification, you can only do one transcendental (and maybe one regular op alongside the square/cube cycle).
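The slot counting for the current architecture can be sketched as below. It assumes five issue slots per cycle (x, y, z, w and T), that only the T slot can take a transcendental, and that T can run a regular op when it isn't doing one. The function name `current_arch_ops` is made up for illustration. The modified design isn't modelled, since its details are left open here.

```python
def current_arch_ops(cycles, transcendentals):
    """Count ops per thread with x, y, z, w plus a T unit (5 slots per cycle).

    Assumes at most one transcendental per cycle, issued in the T slot,
    with every remaining slot filled by a regular op.
    """
    if transcendentals > cycles:
        raise ValueError("only the T slot takes a transcendental, one per cycle")
    slots = 5 * cycles
    return {"transcendental": transcendentals,
            "regular": slots - transcendentals}
```

With `cycles=2` this gives 9 regular ops alongside one transcendental, or 8 alongside two, which are the figures quoted above.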
I can't remember where I saw this, but the whole architecture appears to have a nice 4-cycle stagger across it.
Whole SIMD staggering is actually isomorphic to the quasi-scalar architecture I was proposing. It has the same requirement of needing more active batches. Four cycles of stagger between x and y, y and z, and z and w, for example, would have 20 total cycles of latency before proceeding to the next instruction group, so you'd need 5 active wavefronts instead of 2.
I'm pretty sure that ATI isn't doing this, though. The restrictions on dependent math are a big hint, IMO.
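For anyone who wants to check the stagger arithmetic, here is a small sketch. It assumes an 8-cycle base pipeline (the 8 physical cycles per logical clock mentioned earlier) and that each wavefront occupies the ALUs for 4 cycles. That second figure is my assumption, picked because it reproduces the 2 wavefronts needed today. The function names are made up for illustration.

```python
import math

def stagger_latency(base_cycles, stagger_cycles, lanes=4):
    """Cycles until the last lane finishes when each lane starts
    stagger_cycles after the previous one (x, then y, then z, then w)."""
    return base_cycles + stagger_cycles * (lanes - 1)

def wavefronts_needed(latency_cycles, cycles_per_wavefront=4):
    """Wavefronts that must be in flight to cover the latency,
    assuming each one keeps the ALUs busy for cycles_per_wavefront."""
    return math.ceil(latency_cycles / cycles_per_wavefront)

no_stagger = stagger_latency(8, 0)    # 8 cycles
with_stagger = stagger_latency(8, 4)  # 8 + 3 * 4 = 20 cycles
```

Under those assumptions `wavefronts_needed(no_stagger)` is 2 and `wavefronts_needed(with_stagger)` is 5, matching the numbers above.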