Cayman: So in Cayman, 64 transcendentals would be executed in groups of 16, taking four cycles total (full pipelining?), while in Tahiti, they would be executed all together, but would have to go through at least three loops? Is that right?
An instruction containing a transcendental is issued over 4 cycles, the same as every other instruction (instructions always operate on a full wavefront of 64 elements). A transcendental operation just takes 3 slots of the VLIW instruction, leaving the 4th free for another operation.
The transcendental itself is computed as something like 3 dependent multiply-adds (using lookup tables and the inter-slot communication channels that are also used for double precision, the DOT instruction, and co-issue of dependent operations [add_prev, mul_prev, madd_prev, sad_prev and so on]). So the latency is the same as for all other operations (8 cycles in all VLIW GPUs); only the throughput is lower.
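To put numbers on the VLIW case, here is a rough sketch of the arithmetic. The constants come from this thread (64 elements, groups of 16, 4 slots, 3 of them used by a transcendental); the function names are made up for illustration and do not correspond to anything in AMD's tools.

```python
# Back-of-the-envelope model of VLIW4 issue, using the figures quoted in this thread.
WAVEFRONT_SIZE = 64        # elements per instruction
SIMD_LANES = 16            # elements processed per cycle ("groups of 16")
VLIW_SLOTS = 4             # slots per VLIW4 instruction
TRANSCENDENTAL_SLOTS = 3   # slots one transcendental occupies

def issue_cycles(wavefront_size=WAVEFRONT_SIZE, lanes=SIMD_LANES):
    """Cycles to issue one instruction for a whole wavefront (hypothetical helper)."""
    return wavefront_size // lanes

def free_slots(used_slots=TRANSCENDENTAL_SLOTS, slots=VLIW_SLOTS):
    """Slots left over for other operations in the same VLIW instruction."""
    return slots - used_slots

print(issue_cycles())  # 4
print(free_slots())    # 1
```

So the issue time stays at 4 cycles whether or not the instruction contains a transcendental; what changes is how many other operations fit next to it.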
With GCN, the same algorithm would probably map to a single instruction that computes and combines the results of the 3 dependent (internal) operations. This is done by looping over the MUL and ADD stages of the pipeline to compute the intermediate results in series (instead of in parallel in 3 slots of a VLIW unit). The wide adder circuit at the end of the pipeline (there for FMA) combines the intermediate results into the final one. The throughput would drop to 1/3 and the latency would triple (12 cycles). AMD may save some hardware somewhere, so it may be 1/4 or 1/6 instead (1/3 would be quite fast for transcendentals).
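The GCN guess can be written down the same way. This is only a sketch of the speculation above, not a description of the real hardware: the 4-cycle base latency is an assumption inferred from "triple (12 cycles)", and `looped_op` is a hypothetical helper. It also naively assumes that latency scales with the number of loops, which need not hold if the lower rate comes from saved hardware rather than extra passes.

```python
from fractions import Fraction

# Assumption: base latency implied by "the latency would triple (12 cycles)".
GCN_BASE_LATENCY = 4  # cycles

def looped_op(loops, base_latency=GCN_BASE_LATENCY):
    """Return (relative throughput, latency in cycles) for an operation that
    passes through the MUL/ADD stages `loops` times in series."""
    return Fraction(1, loops), base_latency * loops

for loops in (3, 4, 6):
    rate, latency = looped_op(loops)
    print(f"{loops} loops: throughput {rate} of full rate, latency {latency} cycles")
```

With 3 loops this reproduces the 1/3 rate and 12 cycles from the post; the 4- and 6-loop rows only show what the slower guesses would look like under the same naive model.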