That's a pretty positive way to say that the instruction will cause a processor to stall.
I think gating the decoder when there's a hit in the uop cache is a good thing.
I think gating the decoder because the front end needs to stall due to an ROB or scheduling blockage is not as good.
I think that an instruction that can mostly take the place of 4 smaller instructions is a good thing.
There is tension between gating the power too aggressively and exploiting the space savings: what could be up to three times as much opportunity to continue reordering, improvements to the hit rate of the uop cache, and reduced instruction decode bandwidth.
Thinking back on the duties of the scheduler, I think the gating opportunities may be more modest.
The scheduler has functionality to receive new uops from the front end, monitor the operand status of the ones it already has, and dispatch those it deems ready (and somewhere in here is the secret sauce of measurements and heuristics in how it makes the determination).
When the 4-cycle instruction begins to issue, the scheduler's buffer still has 3 entries to spare (if the instruction is not cracked) where there would otherwise have been 3 additional 256-bit instructions. I'd rather not gate it off, which would leave that space underutilized.
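The entry accounting above can be sketched with made-up numbers. This is a toy model, not any real core: I'm assuming a 32-entry scheduler buffer and that a cracked wide op costs four entries where an uncracked one costs one.

```python
# Toy scheduler-entry accounting (all numbers are assumptions for
# illustration, not taken from a real microarchitecture).

SCHED_ENTRIES = 32  # assumed scheduler buffer size

def entries_used(n_wide_ops, cracked):
    """Entries consumed by n_wide_ops wide instructions.
    Cracked: each wide op becomes four 256-bit uops (4 entries).
    Uncracked: each wide op occupies a single entry."""
    per_op = 4 if cracked else 1
    return n_wide_ops * per_op

def spare_entries(n_wide_ops, cracked):
    """Entries left over for other instructions to keep reordering."""
    return SCHED_ENTRIES - entries_used(n_wide_ops, cracked)

# Eight wide ops in flight: cracked, they fill the whole buffer;
# uncracked, 24 entries remain available for independent work.
print(spare_entries(8, cracked=True))   # 0
print(spare_entries(8, cracked=False))  # 24
```

The point is just that the freed entries are only useful if the scheduler stays awake to fill them, which is why gating it off wastes the space.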
At the same time, until the 4-cycle instruction's results become available some number of cycles later (possibly 5-7 before the first completion for some arithmetic ops), the operand readiness logic would need to stay active to detect results returning from instructions still in the pipeline.
When the operand results start to return, the dispatch and pick logic would have already moved on to new things.
Maybe the result-monitoring part for the port would not care until the final cycle, although perhaps the exception-monitoring part would need to be awake unless the delivery of exception information is deferred to the end, while the register-write part is done per 256-bit chunk.
However, this may leave performance on the table.
Much like how the P4 fast ALU forwarded a 16-bit chunk to the first half of a dependent op, these 4-cycle ops could forward the first chunks of the result to a dependent 4-cycle op, that is, if the scheduler were paying attention to the port.
This would save latency, since a single 4-cycle instruction could take ten cycles to hit the wakeup period at the end.
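The latency math can be sketched roughly. Everything here is assumed for illustration: a wide op executed as 4 sequential 256-bit chunks, one chunk issued per cycle, and a made-up per-chunk execution latency; none of these numbers describe a real core.

```python
# Hedged sketch of chunk forwarding between dependent wide ops.
# Assumed parameters (not from any real microarchitecture):
LAT = 4     # per-chunk execution latency, in cycles
CHUNKS = 4  # the wide op is split into 4 chunks, one issued per cycle

def chain_latency(n_ops, chunk_forwarding):
    """Cycles until the last chunk of a chain of n_ops dependent wide
    ops completes.
    Without forwarding: each op waits for the producer's final chunk,
    so the ops fully serialize.
    With forwarding: chunk i of the consumer starts as soon as chunk i
    of the producer completes, so only LAT is added per chained op."""
    if chunk_forwarding:
        # chunk i of op k completes at cycle i + (k + 1) * LAT
        return (CHUNKS - 1) + n_ops * LAT
    return n_ops * ((CHUNKS - 1) + LAT)

print(chain_latency(3, chunk_forwarding=False))  # 21
print(chain_latency(3, chunk_forwarding=True))   # 15
```

Under these assumptions a chain of three dependent wide ops saves (CHUNKS - 1) cycles per chained op, which is the benefit the scheduler would give up by not watching the port.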
I think there are tradeoffs to be made, and there could be some measurable, if not revolutionary, benefits.
edit:
In addition, Intel has worked on reducing the latency of clock gating units. I think the earliest chips with clock gating had a small cycle penalty because of the wakeup period. I think this has since been improved, but I haven't seen it mentioned.