A small texture cache is extremely efficient for doing bilinear filtering. The exact figure varies with the angle and position of the surface (and any LOD bias), but the basic mipmap selection algorithm works out such that each additional bilinear fragment requires around one texel fetch from memory. In other words, with a texture cache, bilinear's bandwidth needs are just about the same as for point sampling.
This may seem a bit counterintuitive at first. Here's a quick, simplified (and not entirely correct) thought experiment that might help out. Imagine you want to texture a 100*100 pixel square parallel to the screen (i.e. "2d"). Using point sampling, it would be perfect if the texture you were using were 100*100. What about if you were using bilinear filtering? What size texture would you want then?
Well, if it were a 1*1 square you'd want a 4-texel (2*2) texture, obviously. If it were a 2*2 square you'd want 9 texels--a 3*3 texture. And so on. For the 100*100 pixel square you want a 101*101 texel texture. In the limit, you fetch 1 new texel for each rasterized pixel. And, in the limit, each texel gets sampled 4 times--that's 4 samples for the bandwidth price of 1, as long as it doesn't get evicted from the cache first. And how big would the cache have to be to prevent that? Well, if you think about it, not very big--just big enough to hold two texture scanlines' worth of texels.
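If you want to see the numbers fall out, here's a quick Python sketch of that thought experiment (it just counts texel addresses, it has nothing to do with how real hardware works): it walks a 100*100 screen-aligned quad in scanline order, does the four bilinear taps per pixel, and counts how many taps miss a tiny LRU cache sized to hold two texture scanlines.

```python
from collections import OrderedDict

def simulate(n, cache_texels=None):
    if cache_texels is None:
        cache_texels = 2 * (n + 1)           # roughly two texture scanlines
    cache = OrderedDict()                    # LRU set of texel addresses
    taps = misses = 0
    for y in range(n):                       # scanline order, like a rasterizer
        for x in range(n):
            # Bilinear footprint for a 1:1 mapping: the 2*2 texel block
            # at (x, y)..(x+1, y+1).
            for ty in (y, y + 1):
                for tx in (x, x + 1):
                    taps += 1
                    key = (tx, ty)
                    if key in cache:
                        cache.move_to_end(key)         # cache hit, refresh LRU
                    else:
                        misses += 1                    # texel fetched from memory
                        cache[key] = True
                        if len(cache) > cache_texels:
                            cache.popitem(last=False)  # evict least recently used
    return taps, misses

taps, misses = simulate(100)
print(f"taps per pixel:    {taps / 100**2:.2f}")     # 4.00
print(f"fetches per pixel: {misses / 100**2:.2f}")   # ~1.02
```

Run it and you get 4 taps per pixel but only about 1.02 memory fetches per pixel--the 4-samples-for-the-price-of-1 result above.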
Of course under real-world conditions the mipmap selection algorithm is much more complicated, since it needs to take into account the distance from the viewer and the angle of the surface in view space. Plus there's the fundamental difference that mipmaps only come in certain sizes--there isn't a "perfect size" texture just lying around for you to sample. But the overall point is that it averages out to around one texel fetch per pixel, and the above analysis gives a hint of why this is correct, or at least plausible.
Ok, so we've established that bilinear isn't really a bandwidth hit over point sampling. What about trilinear? Trilinear uses the same mipmap selection algorithm as bilinear, except that it samples from the two nearest mipmap levels--one on either side of the ideal size--instead of just the nearest one. In theory you'd think this would lead to exactly twice the required memory bandwidth. In practice the actual amount is a bit less than twice, because there exists a largest mipmap (namely the base texture): once the ideal size is larger than the base texture, both samples come from that same level.
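To make the two-taps-plus-a-blend structure concrete, here's a minimal Python model of a trilinear lookup. It's a software sketch, not how any particular GPU wires it up; mip_levels, bilinear_sample and the toy texture at the end are all made up for illustration.

```python
import math

# Toy trilinear lookup: mip_levels is a hypothetical list of square greyscale
# images (level 0 = base texture), each a list of rows; u, v are in [0, 1].

def bilinear_sample(level, u, v):
    size = len(level)
    x, y = u * (size - 1), v * (size - 1)            # texel-space position
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, size - 1), min(y0 + 1, size - 1)
    fx, fy = x - x0, y - y0
    top = level[y0][x0] * (1 - fx) + level[y0][x1] * fx
    bot = level[y1][x0] * (1 - fx) + level[y1][x1] * fx
    return top * (1 - fy) + bot * fy                 # 4 texel reads

def trilinear_sample(mip_levels, u, v, lod):
    lod = max(0.0, min(lod, len(mip_levels) - 1))    # clamp to existing levels
    lo = math.floor(lod)
    frac = lod - lo
    if frac == 0.0:
        # e.g. magnification, where lod clamps to 0: one bilinear tap is
        # enough -- part of why trilinear costs "a bit less than twice".
        return bilinear_sample(mip_levels[lo], u, v)
    a = bilinear_sample(mip_levels[lo], u, v)        # bilinear tap #1
    b = bilinear_sample(mip_levels[lo + 1], u, v)    # bilinear tap #2
    return a * (1 - frac) + b * frac                 # blend between the levels

# Toy 2*2 base texture plus its 1*1 mip, sampled at the centre with lod 0.3.
mips = [[[0, 255], [255, 0]], [[128]]]
print(trilinear_sample(mips, 0.5, 0.5, 0.3))         # 127.65
```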
But it's important to realize that every GPU out there (except for the original GeForce 256, and that was due to a bug) is capable of one bilinear sample per TMU per clock. In other words, you need two TMU-clocks per trilinear fragment. And it will probably always be this way, even though no one in their right mind (um, except apparently ATI and Nvidia) would dream of using bilinear over trilinear in this day and age. The reason is that there are many ways textures are used other than just as colors to slop onto surfaces; and some of these ways have a use for bilinear filtering but no use for trilinear (e.g. light maps). (Similarly, some can't be used with any linear filtering, e.g. normal maps.) The obvious design compromise, and the one taken by every GPU, is to make bilinear the one-per-TMU-clock operation.
So while trilinear almost doubles the texture bandwidth requirements...it actually does double the fillrate requirements. And it does nothing to the other per-fragment bandwidth costs, like color writes and z read/writes. In other words, although trilinear significantly raises the required bandwidth per pixel, it actually lowers the required bandwidth per clock. Trilinear samples are well-behaved with respect to the texture cache, as well. (Although you might need a cache almost twice the size.)
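A toy calculation makes the per-pixel vs. per-clock distinction explicit. The byte counts below are illustrative assumptions only--32-bit texels, one 32-bit color write plus a z read/write per pixel, and the roughly-one-fetch-per-bilinear-tap-set result from the cache argument above:

```python
# Back-of-the-envelope numbers only: 32-bit texels, 32-bit color write,
# 32-bit z read + write per pixel, and ~1 texel fetched per bilinear tap-set.

cases = {
    # name: (texel fetches per pixel, clocks per pixel per TMU)
    "point":     (1.0, 1),
    "bilinear":  (1.0, 1),
    "trilinear": (2.0, 2),   # really "a bit less than 2" fetches
}

BYTES_PER_TEXEL = 4
COLOR_WRITE = 4
Z_TRAFFIC = 8                # z read + z write

for name, (texels, clocks) in cases.items():
    per_pixel = texels * BYTES_PER_TEXEL + COLOR_WRITE + Z_TRAFFIC
    per_clock = per_pixel / clocks
    print(f"{name:9s}  {per_pixel:5.1f} bytes/pixel  {per_clock:5.1f} bytes/clock")
```

With these (made-up but plausible) numbers, going from bilinear to trilinear raises the per-pixel bandwidth from 16 to 20 bytes, but drops the per-clock bandwidth from 16 to 10.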
If you analyze aniso, it comes out much the same. More samples and thus more bandwidth, but at the cost of more fillrate resources. Anisotropic samples may be less well behaved with respect to texture caches, depending on the sample distribution. But that's probably not enough to offset the much lower bandwidth-per-clock cost of spending so many clocks on each pixel. (Remember, AF is only applied to those pixels that need it, and only to the degree that they need it.)
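The per-pixel nature of that cost is easy to see if you sketch how a tap count might be chosen per pixel. This is a generic scheme based on the ratio of the screen-space footprint axes, not any particular GPU's algorithm, and the derivative values in the example are made up:

```python
import math

def aniso_taps(dudx, dvdx, dudy, dvdy, max_aniso=16):
    # Lengths of the texel footprint along the two screen axes.
    len_x = math.hypot(dudx, dvdx)
    len_y = math.hypot(dudy, dvdy)
    longer = max(len_x, len_y)
    shorter = max(min(len_x, len_y), 1e-8)
    ratio = min(longer / shorter, max_aniso)   # degree of anisotropy
    # Number of bilinear/trilinear probes along the major axis;
    # each probe costs roughly one (or two) TMU-clocks.
    return max(1, math.ceil(ratio))

# A surface seen head-on needs 1 tap; one seen at a steep angle needs many.
print(aniso_taps(1/256, 0, 0, 1/256))   # 1 tap
print(aniso_taps(1/256, 0, 0, 8/256))   # 8 taps
```

Screen-aligned pixels get one tap and pay nothing extra; only the stretched pixels pay for more, and they pay in clocks as well as in fetches.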
To sum up: although at first glance it would appear that better texture filtering would require greater memory bandwidth resources, in reality the opposite is true. The biggest cost is the on-chip logic, buses and cache to support sampling 4 texels per TMU per clock. And as you scale up to better filtering--trilinear and then anisotropic--the bandwidth costs rise more slowly than the fillrate requirements, at least with any sensible design.
End result--better filtering does not require more bandwidth (per clock)!