Recognizing the potential for a new market with low power demands in no way implies they're going to sacrifice performance for every other market. It only means they should be even more focused on the performance/Watt metric. And that's what AVX eventually offers, so it fits the plan perfectly. The primary design target had its power budget cut in half. One of their biggest customers threatened to cut them out, and their executives acknowledged it as a real wake-up call.
You should also realize that the Haswell design must have been near completion by the time Apple urged them to create lower-power processors. Besides, they've already got fast 17 Watt CPUs based on Sandy Bridge today. Since Tri-Gate offers substantial advantages for decreasing power consumption, and AVX can be heavily clock gated, investing in 2 x 256-bit FMA shouldn't be much of a problem while still hitting the 15 Watt design goal for ultrabooks.
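To give an idea of what 2 x 256-bit FMA means in practice, here's a minimal sketch of the kind of loop that would keep two FMA ports busy. To be clear, this is just my own illustration, not anything from Intel: the function name is made up, and it assumes the FMA3 intrinsics exposed through immintrin.h (compile with something like gcc -O2 -mavx -mfma):

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative dot product, assuming a core with two 256-bit FMA ports. */
float dot_fma(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    size_t i;
    for (i = 0; i + 16 <= n; i += 16) {
        /* Two independent accumulator chains: each FMA port
         * can work on its own stream every cycle. */
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8), acc1);
    }
    /* Horizontal sum of the 8 lanes. */
    __m256 acc = _mm256_add_ps(acc0, acc1);
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    v = _mm_hadd_ps(v, v);
    v = _mm_hadd_ps(v, v);
    float sum = _mm_cvtss_f32(v);
    for (; i < n; i++)  /* scalar remainder */
        sum += a[i] * b[i];
    return sum;
}
```

The two independent accumulator chains are what would let a core with two FMA ports issue both every cycle; the rest of the time the unit can sit clock gated, which is the whole point of the power argument.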
"What changes have happened on the scheduling side in the last ~5 years?"

First of all, 5 years is a relatively short time with only a couple of major architectural changes, and GPU manufacturers are pretty secretive about these things. Still, over a period of merely 10 years GPUs have evolved from fixed-function to highly generic computing devices! That should tell you something, and it's pointless to discuss the individual changes that got us here. The relevant bit is that contemporary GPUs have complex scheduling, not entirely unlike CPU schedulers. Sandy Bridge has a 54-entry scheduler, while Fermi's schedulers pick from up to 48 resident warps each cycle. Also worth noting is that GF104 feeds a total of 7 execution ports and uses superscalar scheduling. AMD will take a big leap from mostly static scheduling to dynamic scheduling with GCN. The incentive for adding such complexity is to avoid running out of thread context storage (registers) and to improve cache hit rates to reduce bandwidth.
And I sincerely doubt that will be the last change to instruction scheduling for GPUs this decade. So it really doesn't make sense to say DLP is an afterthought for CPUs just because the scheduling focuses on ILP. That's where GPUs are heading too, and it doesn't make them any less DLP-focused. From a complexity and power consumption perspective it matters little where the instructions come from. But as things continue to scale you do want to minimize the thread count, and for that you need a healthy amount of ILP, as the toy example below illustrates.
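Here's that toy comparison (again, my own illustration, nothing more): the same reduction written with one dependency chain versus four. The four-chain version exposes ILP that an out-of-order core, or for that matter a superscalar warp scheduler, can exploit directly, so less latency has to be hidden by piling on extra threads:

```c
#include <stddef.h>

/* One dependency chain: every add waits on the previous one,
 * so latency can only be covered by running more threads. */
double sum_1chain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: the hardware can keep four adds in
 * flight within a single thread, so fewer threads are needed
 * to cover the same latency. */
double sum_4chain(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)  /* remainder */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```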
Turning the CPU into a high-throughput architecture is within reach. Any remaining advantage GPUs have over CPUs would be addressed by AVX-1024.