One thing I don't really get: you want the CPU and the GPU running at 100% when they're working. You want full utilization for better performance or better graphics. In the mobile world you also want to do your work at 100% so the CPU and GPU can race to a lower power state. I'll have to read through what Cerny said in detail again later, but I'd like to see more of his thoughts on low-power programming, because it sounds at odds with what is typically the ideal, which is to push the CPU to 100% for as short a time as possible to race to sleep.
It's because with portable and productivity devices (or IoT), the dominant power draw is the quiescent current (the power consumed when everything is powered up but idling), and the power management has extremely low-power idle modes to counter this: the device does something for a few ms, then shuts down almost completely for a second until the user clicks something, an email is received, etc. Also, with low-power devices the relationship between power and clock speed isn't nearly as exponential.
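A rough sketch of that trade-off, with made-up numbers (the sleep power, active powers, and task lengths are all assumptions), just to show why racing to a deep sleep state wins when active power scales weakly with clock:

```python
# Toy comparison: why "race to sleep" wins on a low-power device.
# All numbers are invented for illustration; the key assumption is that
# active power scales weakly with clock (far from cubic) and that the
# deep-sleep state draws almost nothing.

SLEEP_POWER = 0.001      # W, deep sleep
INTERVAL = 1.0           # s, time between wake-ups (e.g. user events)

def interval_energy(active_power, active_time):
    """Energy over one wake/sleep interval: a burst of work, then deep sleep."""
    return active_power * active_time + SLEEP_POWER * (INTERVAL - active_time)

race = interval_energy(active_power=0.5, active_time=0.010)  # full clock, 10 ms
slow = interval_energy(active_power=0.3, active_time=0.020)  # half clock, 20 ms

print(f"race to sleep: {race * 1000:.2f} mJ per interval")   # ~5.99 mJ
print(f"slow & steady: {slow * 1000:.2f} mJ per interval")   # ~6.98 mJ
# With weak power-vs-clock scaling, finishing fast and sleeping deeply wins.
```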
At the power levels involved here, any reduction of quiescent current is dwarfed by the exponential (cubic!) power draw at the top of the frequency curve. And there aren't many opportunities to power down, if any.
So if you have a task for the frame that runs at the max clock of 3.5 GHz, filled with 256-bit AVX instructions, but finishes within 80% of the time allowed before the next frame, you end up wasting a massive amount of power compared to running it at 3.0 GHz for 100% of the allowed time. There is zero difference in CPU performance between the two, but the former consumes a LOT more energy, since power is roughly cubic in frequency while the time slice is only linear.
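A quick back-of-the-envelope check of that claim, assuming power scales as f³ (voltage tracking frequency) and runtime as 1/f; for the comparison the sketch drops to 2.8 GHz, the lowest clock that still makes the frame, rather than 3.0 GHz:

```python
# Back-of-the-envelope energy comparison for one frame's CPU task.
# Assumptions (not from Cerny's talk, just the cubic model above):
#   - power scales roughly as f^3 (voltage tracks frequency),
#   - the task's runtime scales as 1/f (fixed amount of work).

F_MAX = 3.5          # GHz, max CPU clock
BUSY_FRACTION = 0.8  # task finishes in 80% of the frame at F_MAX

def relative_power(f):
    """Power relative to running at F_MAX, using the cubic approximation."""
    return (f / F_MAX) ** 3

def frame_energy(f):
    """Energy over one frame, relative to (max power * frame time).

    At clock f the task occupies BUSY_FRACTION * F_MAX / f of the frame;
    idle power is ignored to keep the comparison simple.
    """
    busy_time = BUSY_FRACTION * F_MAX / f   # fraction of the frame spent busy
    assert busy_time <= 1.0, "clock too low: task no longer fits in the frame"
    return relative_power(f) * busy_time

# Lowest clock that still finishes exactly at the frame boundary:
f_min = BUSY_FRACTION * F_MAX             # 2.8 GHz in this example

print(f"energy at {F_MAX:.1f} GHz: {frame_energy(F_MAX):.2f}")  # ~0.80
print(f"energy at {f_min:.1f} GHz: {frame_energy(f_min):.2f}")  # ~0.51
# Roughly a third less energy for exactly the same frame deadline.
```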
So that means an additional chunk of watts saved "for free", which the GPU might have used at that particular point in the pipeline. Hence "squeezing every last drop". It minimizes the chance of the GPU clocking down from 2.23 GHz, while the CPU still delivers the same effective performance as if it were always locked at 3.5 GHz. The power-hungry component here is much more the GPU than the CPU; if any frequency is at risk of dropping, it's the GPU's clock that is worth protecting the most. It's only logical that the CPU never drops unless it can do so for free.
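To make the "free watts" point concrete, here's a toy model of a shared power budget; the wattages, the budget, and the cubic power model are pure assumptions, not figures from the talk:

```python
# Toy model of a shared power budget: watts the CPU gives up are watts
# the GPU can keep. All numbers are invented for illustration.

TOTAL_BUDGET = 220.0        # W, combined CPU+GPU budget (made up)
CPU_MAX_POWER = 60.0        # W at 3.5 GHz (made up)
GPU_MAX_POWER = 180.0       # W at 2.23 GHz (made up)

def cpu_power(f_ghz, f_max=3.5, p_max=CPU_MAX_POWER):
    """CPU power at clock f_ghz, cubic approximation."""
    return p_max * (f_ghz / f_max) ** 3

def gpu_clock_for_budget(watts, f_max=2.23, p_max=GPU_MAX_POWER):
    """Highest GPU clock whose (cubic-model) power fits in `watts`."""
    return f_max * min(1.0, watts / p_max) ** (1 / 3)

for cpu_clock in (3.5, 3.0):
    leftover = TOTAL_BUDGET - cpu_power(cpu_clock)
    print(f"CPU at {cpu_clock} GHz -> {leftover:.0f} W left -> "
          f"GPU can hold ~{gpu_clock_for_budget(leftover):.2f} GHz")
# With the CPU pinned at 3.5 GHz the GPU has to shed some clock (~2.14 GHz);
# with the CPU at 3.0 GHz the GPU can hold the full 2.23 GHz.
```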
The required real-time profiling is no different from trying to predict a resolution downscale or LoD change to keep the frame rate consistent, it's just doing the reverse: estimate whether the operation will finish much earlier than the next frame cycle, and lower the clock proportionately, with some safety margin?
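Something like this, maybe (the clock range, the 10% margin, and the work-scales-as-1/f assumption are all illustrative, not anything Cerny described):

```python
# Sketch of the "reverse" of resolution scaling: after profiling how early
# the CPU task finished this frame, lower the clock so next frame it lands
# just inside the deadline, with a safety margin. The clock range, the 10%
# margin, and the 1/f work-scaling model are assumptions for illustration.

F_MIN, F_MAX = 3.0, 3.5     # GHz, allowed CPU clock range (assumed)
SAFETY_MARGIN = 0.10        # leave 10% of the frame as slack

def next_frame_clock(current_clock, busy_fraction):
    """Pick the clock for the next frame from this frame's measurement.

    busy_fraction: fraction of the frame the task took at `current_clock`.
    Work is assumed to scale as 1/f, so the clock can drop in proportion
    to how early the task finished, keeping SAFETY_MARGIN as slack.
    """
    target = current_clock * busy_fraction / (1.0 - SAFETY_MARGIN)
    return max(F_MIN, min(F_MAX, target))

# The task only needed 80% of the frame at 3.5 GHz, so drop the clock:
print(next_frame_clock(3.5, 0.80))   # ~3.11 GHz
# The task nearly blew the frame at 3.2 GHz, so go back up:
print(next_frame_clock(3.2, 0.99))   # clamped to 3.5 GHz
```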