Yes, but despite that, the graphics pipeline has become ever more programmable, and there's no end in sight. That's because consumers aren't interested in running last year's game at 120 FPS.
They are interested in running last year's game at over 12 FPS with non-embarrassing settings without needing a quad-socket board to do so.
They want the latest game to run at 60 FPS (plus they want to run GPGPU applications). More diversity means more flexibility is required, so dedicated hardware, power-efficient as it may be, is useless if it's going to sit underutilized.
It is useless only if it is never used, and this is only a problem if there is something really compelling that could be put in the area it takes up.
Power consumption is absolutely a limiter. No argument there. But I still think that cost is at least as big a limiter. You can't justify having logic that is going to be idle much of the time.
Cost is not a straightforward thing to calculate, and the cost of logic that is idle much of the time has been cut in half every 18 months or so.
At what frequency and voltage? It seems to me that you still get higher overall performance by using all of those transistors at less than their maximum performance than by leaving a large part idle and running the rest flat out.
Frequency will depend on the design and process. Voltage in the near to mid-term is going to stay rather close to what it is now. I'm not sure if voltage scaling has been declared dead, but it has dramatically leveled off, and I'm not aware of anything on a process roadmap that promises to change that. Since voltage scaling is a critical (quadratic) component of power improvement, losing it means the improvement in power per transistor now lags far behind the growth in transistor count.
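To make the quadratic point concrete, here is a back-of-the-envelope sketch in Python of the classic dynamic-power relation P ≈ α·C·V²·f. The scaling factors are the textbook Dennard numbers and the rest are illustrative values I picked, not measurements:

```python
# Back-of-the-envelope: dynamic power P ~ alpha * C * V^2 * f.
# All numbers are illustrative, not from any datasheet.

def dynamic_power(alpha, c, v, f):
    """Classic switching-power approximation: activity * capacitance * V^2 * frequency."""
    return alpha * c * v ** 2 * f

# Baseline transistor on an older node (normalized units).
p_old = dynamic_power(alpha=0.1, c=1.0, v=1.0, f=1.0)

# Classic Dennard scaling: a 0.7x linear shrink cuts C and V by ~0.7x
# and allows ~1.4x frequency, so power per transistor roughly halves.
p_dennard = dynamic_power(alpha=0.1, c=0.7, v=0.7, f=1.4)

# Post-Dennard: C still shrinks and f still rises, but V barely moves,
# so the quadratic V^2 term stops helping.
p_flat_v = dynamic_power(alpha=0.1, c=0.7, v=0.98, f=1.4)

print(f"old node:          {p_old:.3f}")
print(f"with V scaling:    {p_dennard:.3f} ({p_dennard / p_old:.0%} of old)")
print(f"without V scaling: {p_flat_v:.3f} ({p_flat_v / p_old:.0%} of old)")
```

Since each shrink roughly doubles the transistor count, cutting per-transistor power in half keeps total power flat, while cutting it by only a few percent nearly doubles it. That is the gap I mean.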
As far as using transistors at less than their max performance goes, I don't quite follow.
A transistor is either switching and burning dynamic power, sitting stable and leaking, or, in some designs, part of a block that has been gated off entirely.
I'm not sure what metric of performance you mean. In terms of switching activity, specialized logic can get away with fewer clock cycles, fewer transistors, and less area for a given task.
It is not always better to spread the work out, particularly on chips with power gating. Since leakage can account for up to a third of total power consumption, there are situations where it is better to concentrate work in one area and power down the rest. Because gating circuits costs some power to switch the large power gates, and carries a latency penalty, it helps if the idle periods are long and predictable.
In that situation, spreading a task around and running multiple blocks at half speed means less of the chip can be turned off.
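Here's a toy model of that trade-off in Python (the leakage fraction and gating overhead are made-up constants for illustration): finish the same work in the same wall-clock time, either concentrated in one block with the other power-gated, or spread across two blocks at half rate with both leaking.

```python
# Toy model: same work, same wall-clock time, two strategies.
# Constants are illustrative only.
LEAK = 0.33           # leakage as a fraction of a block's full-load power
GATE_OVERHEAD = 0.02  # amortized cost of driving the power gates

def concentrated(dyn_full=1.0):
    """One block runs flat out; the idle block is power-gated."""
    active = dyn_full + LEAK  # working block: dynamic power plus its own leakage
    gated = GATE_OVERHEAD     # gated block leaks ~nothing, but gating isn't free
    return active + gated

def spread(dyn_full=1.0):
    """Both blocks run at half rate; neither idles long enough to gate."""
    per_block = dyn_full / 2 + LEAK  # half the switching, full leakage, in each
    return 2 * per_block

print(f"concentrated + gated: {concentrated():.2f}")
print(f"spread, both leaking: {spread():.2f}")
```

In this toy the concentrated case wins (1.35 vs 1.66). Note, though, that if halving the frequency also let you lower the voltage, the spread option would improve quadratically, which is exactly why the loss of voltage scaling matters to this argument.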
Intel desktop chips since Nehalem have devoted an entire on-die microcontroller (the PCU) just to managing the clocks and power gating of the cores.
And yet even if I stress all of my CPU's threads, Turbo Mode still clocks it higher.
I'm not sure which version of Turbo it is, but the answer is that there are loads that could conceivably heat the core to the point that it would have to clock back down below the turbo bins.
The idea is to cut as close to TDP as possible if performance is needed. There are workloads that would constrain turbo functionality, or introduce corner cases depending on the case environment and cooling solution.
Turbo is not the standard rated clock because the manufacturer cannot guarantee that speed at all times for all of its customers.
Even without that, there are workloads and loops that can utilize your CPU heavily enough to force throttling. Intel and AMD have internal apps that do exactly that for testing, in case something out in the wild happens to do the same.
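As a rough illustration of "cut as close to TDP as possible", here is a hypothetical governor loop in Python. The TDP figure, bin size, and power model are all invented for the example; the real logic lives in firmware on the PCU and is far more involved.

```python
# Hypothetical turbo governor: start at the top turbo bin and step down
# until the modeled package power fits under TDP. All numbers are invented.
TDP_W = 95.0
BASE_GHZ, BIN_GHZ, MAX_TURBO_GHZ = 3.0, 0.1, 3.8

def package_power(freq_ghz, load):
    """Crude model: an idle floor plus a load- and frequency-dependent term."""
    return 20.0 + 60.0 * load * (freq_ghz / BASE_GHZ) ** 2

def pick_turbo_bin(load):
    freq = MAX_TURBO_GHZ
    while freq > BASE_GHZ and package_power(freq, load) > TDP_W:
        freq -= BIN_GHZ  # drop one turbo bin and re-check the budget
    return freq

for load in (0.5, 0.8, 1.0):
    f = pick_turbo_bin(load)
    print(f"load {load:.0%}: {f:.1f} GHz at ~{package_power(f, load):.0f} W")
```

Note that even the 100% load case lands above the base clock in this toy model, which matches the observation above: all threads busy does not necessarily mean power-virus levels of switching activity.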