I don't see the claimed movement towards fixed function hardware. In fact, quite the opposite seems to be true when you compare DX9 hardware to DX11 hardware. Programmability has increased dramatically, and many fixed function units have been replaced by more general purpose units.
- We now have unified shaders instead of separate vertex and pixel shaders.
- Registers and ALU (math) now use IEEE-compatible 32 bit floats (special-cased 16/24 bit float processing is a thing of the past). ALU is cheap compared to bandwidth.
- Custom texture caches and constant caches have been replaced with general purpose L1 caches (backed up with general purpose L2 caches).
- Lots of custom internal data buffers have been replaced with a single general purpose on-chip shared memory ("LDS"). As data caches/buffers tend to be quite big (lots of transistors), it's a waste to have a separate custom one for each fixed function unit (things like geometry shaders and tessellation are not always active, not all vertex shaders output the maximum number of interpolants, etc).
- All the listed improvements have allowed a cost-effective way to introduce the new compute shader functionality. The general purpose on-chip shared memory allows efficient data sharing and synchronization between multiple threads of the same compute unit (data can be kept close to the execution units to reduce latency and memory traffic = reduced power usage). Unified shader cores have flexible memory input/output paths (previously vertex shaders were scatter only, and pixel shaders were gather only), fulfilling the other requirement for general purpose processing on shader cores.
- Geometry amplification is possible with geometry shaders and tessellation. DX11 compatible tessellation uses the unified programmable shader cores (just like all other shader types). Previous fixed function tessellators never became popular because of their limited flexibility. Geometry amplification allows many new algorithms to be implemented, and it allows data to be kept in on-chip shared memory (instead of multipassing), reducing lots of data movement = reducing energy usage.
- GPUs can now spawn threads and control draw call primitive counts themselves (via indirect draw/dispatch calls). This creates lots of new possibilities for programmers. Kepler can also spawn new kernels from the GPU (without CPU assistance).
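To make the shared memory point concrete, here is a minimal CPU-side sketch of the reduction pattern a compute shader would run against on-chip shared memory. The names (`GROUP_SIZE`, `lds`, `group_reduce`) are mine, purely illustrative; this shows the data flow, not real shader code:

```python
GROUP_SIZE = 8  # threads per compute group (hypothetical size)

def group_reduce(values):
    # Each "thread" first copies its value into shared memory (LDS).
    lds = list(values)                     # stand-in for on-chip shared memory
    stride = GROUP_SIZE // 2
    while stride > 0:
        # In a real shader a barrier would synchronize the threads here;
        # the sequential CPU loop makes that implicit.
        for tid in range(stride):
            lds[tid] += lds[tid + stride]  # partial sums stay on-chip
        stride //= 2
    return lds[0]                          # thread 0 holds the group total

print(group_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

The point of the pattern is that every partial sum lives in the (fast, close) shared memory until the single final result is written out, which is exactly the "keep data close to the execution units" benefit described above.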
We have also seen failures of fixed function hardware. Free 4xMSAA required lots of transistors on Xbox 360 (and lots of internal EDRAM bandwidth). A few years after the console launch, deferred rendering took over, and pretty much nobody uses the MSAA hardware in their console games anymore. Those transistors are just idling there doing nothing... That's always the risk with fixed function hardware: if it doesn't suit the task, it's just dead silicon.
Texture fetch is an interesting example because dedicated hardware can potentially reduce data movement across the chip.
Agreed, fixed function texture filtering reduces traffic from the L1 cache to the registers. However, it doesn't reduce traffic from memory to L2 to L1 (all texels need to be in L1 for filtering either way), and that's the longest distance the data needs to move (and thus consumes most of the data traffic energy).
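To illustrate why every source texel has to reach the cache no matter where the filtering math runs, here's a hedged software sketch of bilinear filtering over a scalar texture (the function and names are mine, not any real API):

```python
def bilinear(tex, u, v):
    # 'tex' is a 2D array of scalar texels, (u, v) an interior sample point.
    # All four source texels below must be resident in cache whether this
    # weighting runs on fixed function hardware or on general purpose ALUs.
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0           # fractional position inside the quad
    t00, t10 = tex[y0][x0], tex[y0][x0 + 1]
    t01, t11 = tex[y0 + 1][x0], tex[y0 + 1][x0 + 1]
    top = t00 * (1 - fx) + t10 * fx   # horizontal lerp, top row
    bot = t01 * (1 - fx) + t11 * fx   # horizontal lerp, bottom row
    return top * (1 - fy) + bot * fy  # vertical lerp

print(bilinear([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5))  # 1.5
```

Moving this math into a fixed function unit only changes who reads those four cached texels; it doesn't change how they got into the cache.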
For compressed (or 8888/11f-11f-10f) data, the savings in register bandwidth are not always that clear. After filtering and sRGB conversion, the fixed function unit must send a 4x32f value through the internal link (one texel might be 234 and the next one 235, and the texture might be zoomed so that one texel covers the whole screen, and we still need a smooth gradient = we need lots of bits of precision). In this case the fixed function unit loses the point sampled 8888 case by 4x, and ties the 8888 bilinear filtered case. The fixed function unit wins the trilinear case by 2x (and the anisotropic case by a larger margin). The BC7 compressed case is harder to analyze. It favors the fixed function hardware, except in cases where the CPU implementation is allowed to branch (a 3x3 area of each 4x4 block needs just one 128 bit register load). However, CPU performance completely craps out if you add those incoherent branches to the code, so this task should favor fixed function hardware if we only consider the L1->register data movement (and even more so for trilinear and anisotropic filtering).
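The back-of-the-envelope numbers above can be written out for the uncompressed 8888 case. This only counts bits crossing the L1->register (or internal link) boundary, under the stated assumption that the fixed function unit always emits a filtered 4x32f result:

```python
FILTERED_OUTPUT_BITS = 4 * 32  # fixed function always sends a 4x32f result

def shader_path_bits(texel_count, bits_per_texel=32):
    # A shader doing its own filtering must move every source 8888 texel
    # (32 bits each) into registers: 1 for point, 4 for bilinear, 8 for
    # trilinear sampling.
    return texel_count * bits_per_texel

for name, texels in [("point", 1), ("bilinear", 4), ("trilinear", 8)]:
    ratio = shader_path_bits(texels) / FILTERED_OUTPUT_BITS
    print(f"{name}: shader path moves {ratio:g}x the fixed function bits")
```

The ratios come out as 0.25x (fixed function loses the point sampled case by 4x), 1x (bilinear tie), and 2x (trilinear win), matching the comparison above.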
However, the fixed function unit too needs to load its texels from the same general purpose L1 cache (no current GPU has special purpose texture caches anymore). To gain any data movement energy efficiency, the fixed function unit needs to sit closer to the L1 cache than the general purpose register files do. If the L1 is large, this might be problematic (unless you bank it and replicate the fixed function units for each bank... adding even more dark silicon).
The anisotropic filtering case is quite hard for a CPU to solve efficiently. It must adapt the texel count (and the access pattern) rapidly based on the surface slope, but branch mispredicts are in general around 20 cycles, and these kinds of scenarios are very hard to predict properly. The new Haswell gather instruction has a read mask (to skip some lane loads without branching), but I don't know how efficient it is compared to standard register loads.
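I haven't measured the Haswell instruction either, but the semantics of a masked gather can be sketched like this (plain Python emulating the per-lane behavior; the real hardware resolves the lane mask without any branch mispredict cost):

```python
def masked_gather(base, indices, mask, passthrough):
    # Lanes with a set mask bit load base[index]; cleared lanes keep their
    # passthrough value and perform no load at all. The per-lane 'if' here
    # only stands in for what the hardware decides without branching.
    return [base[i] if m else p
            for i, m, p in zip(indices, mask, passthrough)]

print(masked_gather([10, 20, 30, 40], [3, 0, 2, 1], [1, 0, 1, 0], [0, 0, 0, 0]))
# [40, 0, 30, 0]
```

That "skip the load without a branch" property is exactly what a variable texel count per pixel would want, which is why the efficiency question matters here.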