IMO GPUs get refreshed in much shorter cycles than CPUs (with the GPU architecture being almost completely changed within 1~2 years or less), so they can't devote as many resources to power-saving techniques as Intel/AMD can.
You know, that wasn't true back then and it's even less true today. There's a clear lineage from the TNT to the NV30, just like there's a clear lineage from the K7 (arguably the K6?) to the K10. The same is true for 3dfx, Intel, and many others.
The difference between CPUs and GPUs is more that GPUs go through more iterations of the same architecture. CPUs tend to just get minor derivatives based on cache size and, more recently, the number of cores. GPUs need to evolve to add new features, and also to hit the TSMC/UMC half-nodes, including 150nm, 110nm and 80nm. Intel and AMD, on the other hand, focus only on the full nodes, and there are many reasons why that makes sense.
Also, if you exclude code morphing techniques, it's not like it's easy to come up with a new CPU architecture that's really better than what you could come up with by evolving your previous architecture. Unless, of course, your previous architecture just couldn't possibly be evolved because its design goals are incredibly different from what you'd want to do today. The Pentium 4 is a fairly obvious example there.
Of course, but hold on, do current GPUs use such power-saving techniques? (look at the bolded part above). I thought that was why they employed 2D/3D clocks, because [...]
2D/3D clocks were introduced by NVIDIA for the GeForce FX series. Back then, GPUs were deep, but nowhere near as wide as they are today. The NV3x only had one 'pipeline', or more precisely one quad-pipeline for NV30/NV35 and one half-quad-pipeline for the lower-end derivatives. Now, you tell me how to disable *part* of one pipeline!
(without disabling specific ALUs in it, which wouldn't do miracles and might be much more complex!)
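For what it's worth, the whole 2D/3D clock scheme boils down to something like this - a minimal sketch in C, with made-up helper names, clocks and voltages (none of this is NVIDIA's actual driver code):

    #include <stdio.h>

    /* Hypothetical helpers standing in for PLL/VRM register writes. */
    static void set_core_clock(unsigned mhz)   { printf("core  -> %u MHz\n", mhz); }
    static void set_memory_clock(unsigned mhz) { printf("mem   -> %u MHz\n", mhz); }
    static void set_core_voltage(float v)      { printf("vcore -> %.2f V\n", v); }

    struct clock_profile { unsigned core_mhz, mem_mhz; float vcore; };

    /* Illustrative numbers only, not real NV3x clocks. */
    static const struct clock_profile profile_2d = { 250, 400, 1.10f };
    static const struct clock_profile profile_3d = { 500, 850, 1.40f };

    /* All-or-nothing: the entire chip hops between two fixed profiles,
       because with a single pipeline there's nothing finer to turn off. */
    static void on_workload_change(int has_3d_work)
    {
        const struct clock_profile *p = has_3d_work ? &profile_3d : &profile_2d;
        set_core_clock(p->core_mhz);
        set_memory_clock(p->mem_mhz);
        set_core_voltage(p->vcore);
    }

    int main(void)
    {
        on_workload_change(0);  /* back to the desktop: 2D clocks */
        on_workload_change(1);  /* a game starts: 3D clocks */
        return 0;
    }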
Now, look at G8x. You've got 8 clusters, and each cluster has two 'multiprocessors' (aka ALU blocks) and one quad-pipeline TMU. You've also got 6 quad-ROPs, and each of those is directly associated with one memory channel. Now, think about what might happen under Vista Aero. Disable 7 clusters, and maybe even one of the two multiprocessors in the remaining cluster along with its interpolation/SFU unit. Disable several ROPs based on how much video memory you need. And since Aero never needs to run at 250FPS, much of the time you could power down everything else on the GPU too.
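To make that concrete, here's a toy sketch in C of what such a gating policy could look like. All the helper names, the per-channel memory figure and the thresholds are mine for illustration; this is just the idea, not how any actual driver does it:

    #include <stdio.h>

    #define NUM_CLUSTERS        8    /* G8x: 8 clusters, 2 multiprocessors each   */
    #define SMS_PER_CLUSTER     2
    #define NUM_ROP_PARTITIONS  6    /* each quad-ROP pairs with a memory channel */
    #define MEM_PER_CHANNEL_MB  128  /* hypothetical memory per channel           */

    /* Hypothetical gating helpers; real hardware would clock- or
       power-gate the blocks. */
    static void enable_cluster(int i, int sms) { printf("cluster %d: %d SM(s) on\n", i, sms); }
    static void disable_cluster(int i)         { printf("cluster %d: off\n", i); }
    static void enable_rop(int i)              { printf("ROP/channel %d: on\n", i); }
    static void disable_rop(int i)             { printf("ROP/channel %d: off\n", i); }

    /* Under a light load like Aero, keep one cluster alive with a single
       multiprocessor, and only as many ROP/memory-channel pairs as the
       framebuffer footprint needs. */
    static void set_power_state(int light_load, int vidmem_in_use_mb)
    {
        int clusters_needed = light_load ? 1 : NUM_CLUSTERS;
        int sms_needed      = light_load ? 1 : SMS_PER_CLUSTER;
        int channels_needed = (vidmem_in_use_mb + MEM_PER_CHANNEL_MB - 1)
                              / MEM_PER_CHANNEL_MB;
        if (channels_needed > NUM_ROP_PARTITIONS)
            channels_needed = NUM_ROP_PARTITIONS;

        for (int i = 0; i < NUM_CLUSTERS; i++) {
            if (i < clusters_needed) enable_cluster(i, sms_needed);
            else                     disable_cluster(i);
        }
        for (int i = 0; i < NUM_ROP_PARTITIONS; i++) {
            if (i < channels_needed) enable_rop(i);
            else                     disable_rop(i);
        }
    }

    int main(void)
    {
        set_power_state(1, 200);  /* Aero with ~200MB of video memory in use */
        return 0;
    }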
That's not complicated. It's easy as hell. As Jen-Hsun would say, it's even easier than walking two miles!
(bonus points to whoever remembers where that quote is from...) - and honestly, I don't know if it actually works like that, but you'd seriously expect it to. You can apply roughly the same principles to G7x, R5xx and R6xx, of course, although being wider obviously helps.
And I'll admit I don't really understand people who say GPUs are bad in terms of power consumption. AMD and NVIDIA GPUs are just fine in terms of power consumption given their die size and the amount of high-voltage GDDR memory that must be counted in the TDP calculation. Is there some room for improvement? Sure. But there is with CPUs too, and I don't think it's obvious where the most room for improvement is, either. Custom logic helps CPUs for a given level of performance, but there's no magic there. GPUs certainly have an image problem in terms of power consumption in the mainstream, though.