Exactly. Even though future DDR4 will eventually catch up with today's low-end GDDR5, GPUs will still end up with much higher-clocked memory interfaces unless they change strategies.
A very wide bus, a very fast bus, or some combination of the two is quite possible with upcoming levels of package/interposer integration.
The GPU makers seem to be opting for a very wide bus with interposer integration. Intel's upcoming Xeon Phi is opting for HMC, but it may be using some kind of proprietary wide bus for its on-package HMC.
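To see why both routes can land in the same place, here's some back-of-envelope arithmetic. The bus widths and per-pin rates below are illustrative assumptions, not vendor specs:

```python
# Peak bandwidth is just (bus width) x (per-pin data rate).
# All numbers here are illustrative, not taken from any datasheet.

def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s from bus width (bits) and per-pin rate (Gbps)."""
    return bus_width_bits * data_rate_gbps / 8

# Narrow-but-fast: e.g. a 256-bit GDDR5-style interface at 7 Gbps per pin.
narrow_fast = bandwidth_gb_s(256, 7.0)    # 224.0 GB/s

# Wide-but-slow: e.g. a 1024-bit interposer bus at 2 Gbps per pin.
wide_slow = bandwidth_gb_s(1024, 2.0)     # 256.0 GB/s

print(narrow_fast, wide_slow)
```

The wide-and-slow option gets comparable bandwidth at much lower per-pin clocks, which is exactly the trade the interposer route is betting on.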
Even though HMC can hit these ridiculously high speeds, you need to stick with more modest clock speeds to keep it power-efficient.
Or you can keep the wires very short.
The HMC specification provides guidance on what the interface can draw, so it's not unbounded, and so far it doesn't seem to be high. The 30 Gbps rate is a claimed part of the second generation, so I'm not sure what the power ceiling would be, but it'd have to be pretty high to make the raw link throughput a net negative.
That said, Intel has apparently already implemented a lower AVX clock, and AVX-512 will probably have additional measures to optimize clock speed versus power consumption.
Sure: a core hits an AVX instruction, and it automatically downclocks and deactivates the upper range of its turbo functionality. In heavy AVX use, the clock can hover around 2.0 GHz, significantly lower than what the Haswell core can do in other implementations.
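The downclock still pays off on throughput, though. Rough arithmetic, with a hypothetical 3.0 GHz non-AVX clock assumed purely for illustration:

```python
# Why the AVX downclock is still a net win for vector-heavy code.
# Clock figures are assumptions for the sake of arithmetic, not specs.
scalar_clock_ghz = 3.0   # hypothetical non-AVX turbo clock
avx_clock_ghz = 2.0      # the throttled AVX clock mentioned above
floats_per_avx_op = 8    # 256-bit AVX = 8 single-precision lanes

# Single-precision elements processed per second, per execution port:
scalar_throughput = scalar_clock_ghz * 1
avx_throughput = avx_clock_ghz * floats_per_avx_op

print(avx_throughput / scalar_throughput)  # ~5.3x despite the lower clock
```

So even a one-third clock penalty is dwarfed by the 8x vector width, which is why vendors accept the throttle rather than blowing the power budget.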
One would think the core could handle this on a case-by-case basis instead of the static "one AVX instruction and you're throttled for the next millisecond" rule, but I suspect that with so many cores and such a high upper clock range, the design can't reliably catch a sudden wave of AVX instructions in time before things start misbehaving.
The overall chip's 145W TDP is actually not that much lower than the chip-only TDP of some high-end GPUs, and it is specced to draw more than that TDP for brief periods of time.