What you describe, i.e. gating at base frequency, would be the best case and I agree that it seems highly unlikely, albeit it could save LOADS of power. But from what I read in the patent, it seems to apply to wavefronts with threads that are predicated off for good at some point, in which case you could switch off the remaining SIMD lanes, maybe even with a slight delay (8, 16 or more cycles), and still save some power.
Yes, agreed, that is the fundamental idea of the patent: that one or more lanes within the SIMD are turned off for more than a single cycle to save power. The classical example of this is where a loop is bounded by a test whose result varies for each work item:
while (pixel.x < 1)
.... stuff that lasts for lots of cycles
And each time you start an iteration of the loop, the set of work items that passes the test changes. The size of the set could grow or shrink or stay the same. So the execution mask (which specifies which work items are required to compute a result) varies each time a new iteration of the loop starts.
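To make the shrinking execution mask concrete, here's a toy Python sketch (all names and numbers are mine, purely illustrative): each work item in a 16-lane group iterates until its own value passes the exit test, so the set of active lanes changes at the top of every iteration.

```python
# Toy model (all names hypothetical): a divergent loop over 16 work items.
# Each work item iterates until its own value passes the exit test, so the
# execution mask -- the set of lanes that still need to compute -- changes
# at the start of every iteration.

def run_divergent_loop(values, limit):
    """Return the execution mask observed at the start of each iteration."""
    masks = []
    mask = [v < limit for v in values]          # lanes still inside the loop
    while any(mask):
        masks.append(mask.copy())
        # Loop body: each active lane does work that may push it past the
        # exit test on a different iteration than its neighbours.
        values = [v + 1 if m else v for v, m in zip(values, mask)]
        mask = [v < limit for v in values]
    return masks

masks = run_divergent_loop(list(range(16)), limit=16)
# Iteration 0 has all 16 lanes active; each later iteration has one fewer,
# so idle lanes burn power unless the hardware can gate them off.
```

The point of the sketch is just that the mask is only known at iteration boundaries, which is the granularity at which lane gating could plausibly be applied.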
But GCN runs an instruction over multiple cycles, which means an ALU lane would toggle between on and off at a much higher frequency than either the loop iteration (which could be 20 cycles long, say, corresponding to 5 instructions) or the instruction cadence (every 4 cycles).
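Here's a toy Python model of that cadence (the lane-to-work-item mapping is assumed from public GCN descriptions and is illustrative only): a 64-work-item wavefront runs on 16 physical lanes over 4 cycles, so physical lane L touches work items L, L+16, L+32 and L+48 on successive cycles, and its on/off state can flip every single cycle as the execution-mask slice changes.

```python
# Toy model of GCN's 4-cycle instruction cadence: a 64-work-item wavefront
# runs on 16 physical ALU lanes, one quarter of the wavefront per cycle.
# (Mapping assumed from public GCN descriptions; details are illustrative.)

WAVE_SIZE, PHYS_LANES = 64, 16

def lane_activity(exec_mask):
    """For each physical lane, the per-cycle on/off pattern over 4 cycles."""
    assert len(exec_mask) == WAVE_SIZE
    return [
        [exec_mask[cycle * PHYS_LANES + lane] for cycle in range(4)]
        for lane in range(PHYS_LANES)
    ]

# Example: only work items 0..15 are active (a quarter-full wavefront).
mask = [i < 16 for i in range(WAVE_SIZE)]
activity = lane_activity(mask)
# Every physical lane is on in cycle 0 and off in cycles 1-3, i.e. it
# toggles within a single instruction -- too fast for coarse power gating.
```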
The patent mentions the problem I described in paragraph 35. That requires varying-width SIMDs, though, and it requires that operand collection/resultant scatter is performed as well. Which is where we get into the subject of moving RFs from one SIMD to another.
Alternatively, all the SIMDs share a RF. But then each SIMD requires at least an operand collector.
It's interesting that the patent doesn't explicitly talk about a single cycle of execution for a SIMD.
Of course, you still need to be able to decide whether it's feasible to just continue running this on the original, 16-wide configured SIMD (for example, if all registers are readily populated and you just need to execute one more ADD on all of them), or whether the remaining instructions are numerous enough (and maybe have a few hazards in them as well, so that they would be put back in the queue anyway) to warrant a reconfigured SIMD for execution.
Now I am not sure about this, but I also think it could work slightly differently: instead of switching parts of the SIMD off for fully populated wavefronts, you have a variable wavefront width from the start.
Yes, it's entirely possible that a wavefront spends its entire life only having 4, say, work items. So a wavefront that was defined as 4-wide would run on the scalar ALU over four cycles for all of its instructions (could be hundreds of instructions).
I've noticed that paragraph 38 indicates a setup that is very close to current GCN which has five ALUs (four 16-lane SIMDs and one scalar):
three 16-wide SIMDs + four 4-wide SIMDs + four scalar ALUs = 68 lanes
Here the 16-wide SIMDs are sort of the same as three of the four SIMDs in GCN. Then the fourth GCN SIMD is split into four 4-wide SIMDs instead. So far, that's 64 lanes, the same as GCN.
Then the four scalar ALUs compare to the single GCN scalar ALU. So that's three more scalar ALUs.
So this design is 3 ALU lanes wider than GCN. But, of course, there's much more control flow and data routing to worry about. Instead of GCN's 5 ALUs, we now have 11 ALUs.
And instead of GCN's register file being dedicated to each SIMD, with a bi-directional data path from each ALU to the compute unit's scalar ALU (and its dedicated register file), we conceptually have 11 ALUs all sharing a single register file.
So the ALU cost is only a small increment over GCN (more decoders, more narrow ALUs), but the operand gather (collection) and resultant scatter units are way more costly, since the data paths are generally very wide and you have a many-to-many routing problem to solve (a crossbar of some kind).
Running all instructions on all ALUs over 4 cycles does help with collection/scatter complexity/cost/timing, though. And if a change in execution mask needs to be evaluated to determine how to route work items to SIMDs, that evaluation latency makes the routing latency less painful: if you're waiting for one latency anyway (execution mask), adding more latency (routing) isn't a dramatic change in the amount of latency that needs to be hidden.
Edit:
Paragraphs 39, 40 and 41 are very interesting as well: configuration updates every cycle, a fully-configured scalar unit (you mentioned that!), and scheduling to narrower (configured) SIMDs in case of memory bottlenecking. I think this is a very interesting patent with respect to Polaris and Vega!
I can't help thinking it would be easier to re-configure GCN with smaller native wavefronts.
Currently four compute units share shader code (instruction cache) and each compute unit has four 16-wide SIMDs and a scalar ALU. So that's sixteen 16-wide SIMDs and 4 scalar ALUs that are grouped together.
So keep the SIMD count the same, sixteen, and make them 4-wide instead. With a native wavefront size of 16. Done.
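The appeal of that reconfiguration can be sanity-checked with a tiny sketch (numbers and function are mine, illustrative only): the per-instruction cadence stays 4 cycles either way, but a wavefront with only a few live work items wastes far fewer lane-cycles on the narrow configuration.

```python
# Quick sanity check (my own assumptions, illustrative only): both
# configurations issue one instruction per wavefront every 4 cycles --
# wave64 on a 16-wide SIMD, or wave16 on a 4-wide SIMD. What changes is
# how many lane-cycles are wasted when only a few work items are live.

def wasted_lane_cycles(live_items, wave_size, simd_width):
    cycles = wave_size // simd_width            # 4 cycles in both cases
    issued = simd_width * cycles                # lane-cycles spent
    return issued - live_items                  # lane-cycles doing nothing

# A wavefront with only 4 live work items:
waste_wave64 = wasted_lane_cycles(4, wave_size=64, simd_width=16)  # 60
waste_wave16 = wasted_lane_cycles(4, wave_size=16, simd_width=4)   # 12
```

Same issue rate, a fifth of the wasted lane-cycles for a sparsely populated wavefront, and no crossbar needed.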
If you really wanted to apply this patent, then you could make the four scalar ALUs shared and general purpose. In current GCN programs, scalar ALU code and scalar registers are rarely, if ever, a bottleneck. In other words, GCN's current scalar ALU is over-specified.
Giving the scalar ALUs full math capability, and making a compute unit consist of two types of ALU (4-wide and scalar), would bring all the advantages of the patent, but would constrain the data routing (collection and scatter) problems to the scalar ALUs. In this configuration, the scalar ALUs simply can't get through data fast enough to require a big fat crossbar. And each scalar ALU would be constrained to a subset of four SIMDs, further reducing the chip area over which the crossbar has to operate.