But there is no word on whether these instructions can be executed without losing execution cycles for the SIMDs.
IIRC, you would only lose SIMD cycles if you had multiple scalar instructions in a row. So one scalar op followed by a vector op is fine. But I never understood the technical reasons, and I assume that even in this case the SIMDs can work on other waves if there are some in flight, as always.
A wave scheduler can issue just 1 wave per clk. Launching a scalar op requires selecting and issuing a wave.
Maybe this is just about scheduling the ALU SIMDs, and other units like scalar or memory have their own schedulers?
I guess the 16-wide SIMDs get the same instruction 4 times for a single wave, each having a latency of 4 cycles, while the scalar unit tries to execute 4 scalar ops from 4 other waves.
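To put numbers on that guess, here is a minimal sketch of the cadence being described. The 16-lane SIMD width and the 4-cycle figure come from the post above; the 64-wide wave and the 4 SIMDs per compute unit are my assumptions, and all names are hypothetical. It is only arithmetic, not a statement about how the real issue logic is built.

```python
# Toy arithmetic only; WAVE_SIZE and NUM_SIMDS are assumptions, not confirmed specs.
WAVE_SIZE = 64    # assumed work-items per wave
SIMD_WIDTH = 16   # lanes per SIMD, as stated in the post
NUM_SIMDS = 4     # assumed number of SIMDs per compute unit


def cycles_per_vector_op(wave_size=WAVE_SIZE, simd_width=SIMD_WIDTH):
    """Cycles one SIMD needs to push a whole wave through one vector op."""
    return -(-wave_size // simd_width)  # ceiling division


def issue_slots(num_cycles, num_simds=NUM_SIMDS):
    """SIMD index a round-robin issuer would visit on each cycle."""
    return [cycle % num_simds for cycle in range(num_cycles)]


if __name__ == "__main__":
    print(cycles_per_vector_op())  # cycles per vector op for one wave
    print(issue_slots(8))          # which SIMD gets an issue slot each cycle
```

Under these assumptions each SIMD gets a new issue slot exactly when it finishes the previous vector op, which is why an issuer that handles one SIMD per clock would not by itself starve the vector units.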
That's really the point where I'm unsure, and I can't find a proper document to clarify it.
But I think I have discussed such things quite often with other devs, some of them professionals, and the assumption that scalar and vector operate concurrently seemed common to everyone, IIRC. I never doubted this until now, but I may be wrong.
I agree: if we look at it from a single wavefront, the vector cycle is lost with a scalar op, but the vector units stay saturated processing other wavefronts, so there is no real loss when looking at the entire workload?
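A toy model of that argument, for one SIMD. It assumes (and this is exactly the unconfirmed part of the discussion) that in each issue slot the SIMD can start at most one vector op and one scalar op, taken from different waves, and it ignores latencies and dependencies. The function name and the "S"/"V" stream encoding are made up for illustration.

```python
def vector_utilization(streams, slots):
    """Fraction of issue slots in which a vector op was issued.

    streams: one string per wave, 'S' = scalar op, 'V' = vector op.
    Assumption: per slot, at most one 'V' and one 'S' can issue,
    and each wave issues at most one instruction.
    """
    pcs = [0] * len(streams)  # next instruction index per wave
    busy = 0
    n = len(streams)
    for slot in range(slots):
        vector_issued = False
        scalar_issued = False
        # Rotate the starting wave so no wave is permanently favoured.
        for k in range(n):
            w = (slot + k) % n
            if pcs[w] >= len(streams[w]):
                continue  # this wave has finished
            op = streams[w][pcs[w]]
            if op == "V" and not vector_issued:
                vector_issued = True
                pcs[w] += 1
            elif op == "S" and not scalar_issued:
                scalar_issued = True
                pcs[w] += 1
        busy += vector_issued
    return busy / slots


if __name__ == "__main__":
    one_wave = ["SV" * 4]
    two_waves = ["SV" * 4, "VS" * 4]
    print(vector_utilization(one_wave, 8))   # 0.5
    print(vector_utilization(two_waves, 8))  # 1.0
```

With a single wave alternating scalar and vector ops, the vector unit is idle half the time; with a second wave in flight the scalar ops hide behind the other wave's vector ops. Whether the real hardware behaves like this depends on the concurrent-issue assumption above, which is the open question here.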