First, Tom's calculation doesn't take texture caching into account, which reduces texture bandwidth immensely. Today's good GPUs hardly ever have to read a texture sample from memory twice when drawing a polygon, except when tiling a texture. Still, locally speaking, that holds roughly true.
Consider single-texturing. When minification is happening, bilinear filtering requires at most 32 bits of texture bandwidth per pixel. Trilinear requires about 40 bits per pixel max, because one mipmap is always 1/4 the resolution of the other; however, since trilinear filtering takes 2 clocks (assuming 2 mipmaps are used instead of 1), that works out to only 20 bits per pixel per clock.
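Here's a quick sketch of that arithmetic, just to make the figures explicit. The 32-bit texel size and the "caching means roughly one new texel fetch per pixel" premise are the post's assumptions, not measured numbers:

```python
# Back-of-envelope texture bandwidth under minification.
# Assumes 32-bit texels and a texture cache good enough that each
# texel is fetched from memory only once (~1 new texel per pixel).
TEXEL_BITS = 32

def bilinear_bits_per_pixel():
    # Bilinear samples 4 texels, but with caching only ~1 of them
    # is a new memory fetch, so the max cost is one texel per pixel.
    return TEXEL_BITS  # 32 bits/pixel max

def trilinear_bits_per_pixel_per_clock():
    # The second mip level is 1/4 the resolution, so it adds
    # 1/4 the traffic: 32 + 8 = 40 bits/pixel max.
    total = TEXEL_BITS + TEXEL_BITS / 4
    # Trilinear takes 2 clocks when two mipmaps are sampled,
    # halving the per-clock cost.
    return total / 2  # 20 bits/pixel/clock

print(bilinear_bits_per_pixel())             # 32
print(trilinear_bits_per_pixel_per_clock())  # 20.0
```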
Remember, these are maximum figures, too. Increasing the LOD bias lowers this, as does viewing surfaces at oblique angles. When textures are closer to the camera, magnification spreads them over more pixels, reducing this much further (3DMark2001 has almost negligible texture bandwidth requirements for this reason - I'm talking only a few bits per pixel).
Most GPUs, including the GF2, GF3, GF4, Radeon 8500, and R300, have about 64 bits of bandwidth per pixel per clock (give or take). You need 32 bits for the colour buffer write, and both Z reads and writes are necessary. With Z-compression, that Z traffic is 16-64 bits per pixel, depending on how well it compresses (an average of 32, maybe?). This leaves only a little for texture bandwidth, but again, texture bandwidth is nowhere near as bad as Tom says it is. From here, the greater the texture bandwidth, the lower the efficiency. Alpha-blended pixels are a bit different, needing a Z read and both a colour read and write (~80 bits/pixel, plus texture bandwidth).
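To make the budget concrete, here's the same accounting in code. The 64 bits/pixel/clock budget, the ~32-bit average for compressed Z traffic, and the split I've assumed for the alpha-blend case (16-bit compressed Z read, 32-bit colour read, 32-bit colour write) are all the post's rough figures, not hardware specs:

```python
# Rough per-pixel bandwidth budget for a GF2/GF3/GF4/R200/R300-class GPU,
# assuming ~64 bits of memory bandwidth per pixel per clock.
BUDGET_BITS = 64

def spare_for_textures(colour_write=32, z_traffic=32):
    # Opaque pixel: colour write plus Z read+write (averaged with
    # compression). Whatever is left over is free for texture fetches.
    return BUDGET_BITS - (colour_write + z_traffic)

def alpha_blend_cost(z_read=16, colour_read=32, colour_write=32):
    # Alpha-blended pixel: needs a Z read plus BOTH a colour read
    # and a colour write (assumed split adding up to ~80 bits/pixel).
    return z_read + colour_read + colour_write

print(spare_for_textures())  # 0 -> texture traffic eats into efficiency
print(alpha_blend_cost())    # 80
```

With the average figures the spare budget comes out to zero, which is exactly why extra texture bandwidth lowers efficiency rather than being free.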
Generally, a second texture unit helps a lot in multitexturing, because texture bandwidth is usually quite low. Some parts of the screen are bandwidth-limited, so the performance gain isn't 100%, but it's still significant. Just look at the RV250 vs. the R200 in Quake 3 or Jedi Knight - the difference is quite noticeable.