Err, sure. But the ALU load does go up. The more TMUs you have to feed, the more you need texture compression. But I'd argue that the TMU/bandwidth ratio has been pretty constant for years.
If it's not there, it's not there. Any Fetch4 "emulation" will be far less efficient than the game simply taking 4 point samples itself.
However, since there are lots of older ATI cards (and some XGI and Intel graphics) around, devs have to write a fallback "standard" shadowmap path anyway.
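To make the "4 point samples" fallback concrete, here is a rough CPU-side sketch of what such a path typically boils down to: four point taps, a depth compare on each, then a bilinear blend of the compare results. This is only an illustration under my own assumptions, not any particular game's or driver's code. The function name `pcf_2x2`, the `bias` parameter, the clamp-to-edge addressing and the "smaller depth is closer" convention are all made up for the example.

```python
import math

def pcf_2x2(depth_map, u, v, receiver_depth, bias=0.0):
    """Hypothetical 2x2 percentage-closer filter built from four point samples.

    depth_map is a list of rows of occluder depths (smaller = closer to the light).
    u, v are in texel units, with texel centers at integer + 0.5.
    Returns the lit fraction in [0, 1].
    """
    height = len(depth_map)
    width = len(depth_map[0])

    # Shift so that texel centers fall on integer coordinates.
    x = u - 0.5
    y = v - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    def tap(ix, iy):
        # One point sample with clamp-to-edge addressing (an assumption),
        # followed by the depth compare: 1.0 = lit, 0.0 = in shadow.
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return 1.0 if receiver_depth - bias <= depth_map[iy][ix] else 0.0

    # Blend the four compare results, not the raw depths.
    top = tap(x0, y0) * (1.0 - fx) + tap(x0 + 1, y0) * fx
    bottom = tap(x0, y0 + 1) * (1.0 - fx) + tap(x0 + 1, y0 + 1) * fx
    return top * (1.0 - fy) + bottom * fy


# Left column is far from the light (no occluder), right column has an occluder at 0.2.
shadow_map = [[1.0, 0.2],
              [1.0, 0.2]]
edge = pcf_2x2(shadow_map, 1.0, 1.0, receiver_depth=0.5)  # halfway between the columns
```

The point of the sketch is that the compare happens per tap and the blend happens afterwards, which is the work a Fetch4 path would do with a single fetch returning all four depths.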