If there's one thing that Mintmaster's banged into my head, it's that AF isn't "extremely bandwidth intensive," as the Wikipedia author states, but clock intensive--at least on consumer GPUs with a finite number of texture samplers. A Radeon X1900, for instance, can only give you 16 bilinear filtered samples per clock. 16x AF can require up to 16x as many samples, but you're not getting them in the same clock (which would indeed be bandwidth intensive: 16x more so), but rather over 16x more clocks. So AF doesn't require more bandwidth per clock, just more clocks to gather the desired samples. The time spent waiting for the extra AF samples can be offset by increasing pixel shader complexity, so the rest of the GPU is kept usefully busy in the meantime. Or, looked at another way, more shader work makes crunching math the bottleneck, making AF close to "free" on otherwise idle texture units.
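To put rough numbers on that, here's a toy cost model in Python (just a sketch: the 16-samples-per-clock figure comes from the X1900 example above, while the batch size, the simple max(texture, ALU) overlap rule, and the function names are my own illustrative assumptions, not real hardware behavior):

```python
BILINEAR_SAMPLES_PER_CLOCK = 16   # whole-chip bilinear throughput (X1900-ish)
PIXELS_PER_CLOCK = 16             # pixels issued per clock, assumed for simplicity

def clocks_per_batch(texture_fetches, af_degree, alu_clocks):
    """Clocks to finish a batch of PIXELS_PER_CLOCK pixels.

    texture_fetches : bilinear fetches per pixel at 1x filtering
    af_degree       : anisotropy level (1, 2, 4, 8, 16); worst case assumes
                      every fetch needs af_degree bilinear samples
    alu_clocks      : clocks of shader math per pixel that can overlap fetches
    """
    samples_needed = PIXELS_PER_CLOCK * texture_fetches * af_degree
    tex_clocks = samples_needed / BILINEAR_SAMPLES_PER_CLOCK
    # Texture sampling and ALU work proceed in parallel; the slower one wins.
    return max(tex_clocks, alu_clocks)

# Simple shader: 1 fetch, little math -> 16x AF costs ~16x the clocks.
print(clocks_per_batch(1, 1, alu_clocks=1))    # 1.0
print(clocks_per_batch(1, 16, alu_clocks=1))   # 16.0

# Math-heavy shader: the ALU is the bottleneck either way, so AF is ~"free".
print(clocks_per_batch(1, 1, alu_clocks=20))   # 20.0
print(clocks_per_batch(1, 16, alu_clocks=20))  # 20.0
```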
And, yeah, realizing that 16x AF can cost up to 16x more clocks helps explain why ATI and NV are so big on "adaptive" AF implementations: they speed things up by not applying full AF to every single texture when it's forced via the drivers (rather than specified per-texture by the game).
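Extending the same toy model (again a sketch, with a made-up texture mix rather than anything ATI or NV actually ship): forcing 16x on every bound texture multiplies the fetch clocks across the board, while keeping full anisotropy only where it visibly matters gets most of the quality back for a fraction of the clocks.

```python
PIXELS_PER_CLOCK = 16
BILINEAR_SAMPLES_PER_CLOCK = 16

def batch_clocks(fetch_mix, alu_clocks):
    """fetch_mix: (fetches_per_pixel, af_degree) pairs, one per bound texture."""
    samples = PIXELS_PER_CLOCK * sum(f * af for f, af in fetch_mix)
    return max(samples / BILINEAR_SAMPLES_PER_CLOCK, alu_clocks)

# Driver-forced 16x AF on all four textures vs. adaptively keeping 16x only
# on the one texture that benefits (the mix below is purely illustrative):
print(batch_clocks([(1, 16)] * 4, alu_clocks=8))                        # 64.0
print(batch_clocks([(1, 16), (1, 2), (1, 2), (1, 1)], alu_clocks=8))    # 21.0
```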
Well, that's an interesting perspective, though I'd still wonder if we've just pushed the "bandwidth limitation" up a level of abstraction, giving only a false appearance of taking it out of the equation by baking it into the hardware design in the first place.