fanATIVdiot said:
Not that I would typically use THG as a definitive source of information, however they do seem to have a well-written rationale for ATI's possible decision to use 1 TU per pipeline:
It might look as if one texture unit per pipeline is very little, but if you calculate the memory bandwidth requirement of eight parallel pipes with one texture unit doing a trilinear 32-bit color texture lookup, you will understand why two texture units wouldn't have made an awful lot of sense: 32 bit * 8 (trilinear filtering requires 8 texels to be read) * 8 (eight pipelines) = 2048 bit. 2048 bit would have to be read per clock, but 'only' 512 bit per clock are provided by the 256 bit-wide DDR memory interface of Radeon 9700. Bilinear filtering mode would still require 1024 bit per clock. Two texture units per pipe could never be fed by the memory interface.
http://www.tomshardware.com/graphic/02q3/020718/radeon9700-07.html
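For what it's worth, the raw arithmetic in the quote does check out; here's a quick sketch (the 8-texel trilinear fetch and the 256-bit DDR interface figures are taken straight from the quote):

```python
# Tom's worst-case figure: 8 pipes each fetching 8 trilinear texels
# of 32 bits per clock, versus what the memory interface delivers.
texel_bits = 32
texels_per_trilinear = 8      # trilinear filtering samples 8 texels
pipes = 8

demand = texel_bits * texels_per_trilinear * pipes   # bits needed per clock
supply = 256 * 2                                     # 256-bit DDR = 512 bits/clock
print(demand, supply)  # 2048 512
```

The problem, as I explain below, is treating every one of those 8 texel reads as a trip to memory.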
First, Tom's calculation doesn't take into account texture caching, which reduces texture bandwidth IMMENSELY. Today's good GPUs hardly ever have to read a texture sample from memory twice when drawing a polygon, except when tiling a texture. Still, locally speaking, Tom's figure holds roughly true for the cache-to-filter path, just not for memory traffic.
Consider single-texturing. When minification is happening, bilinear filtering has a texture bandwidth requirement of 32 bits per pixel maximum, since with caching each texel is effectively fetched from memory about once per pixel drawn. Trilinear requires about 40 bits per pixel max, because the second mip map is always 1/4 the resolution of the first. However, since trilinear filtering takes 2 clocks (assuming 2 mipmaps are used instead of 1), that's only 20 bits per pixel per clock.
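The per-pixel figures above work out like this (a sketch, assuming 32-bit texels and near-perfect texture caching so each texel comes from memory roughly once per pixel drawn):

```python
# Per-pixel texture bandwidth under minification with good caching.
TEXEL_BITS = 32

bilinear_max = TEXEL_BITS                      # ~1 new texel per pixel
trilinear_max = TEXEL_BITS + TEXEL_BITS / 4    # second mip is 1/4 resolution
per_clock = trilinear_max / 2                  # trilinear takes 2 clocks
print(bilinear_max, trilinear_max, per_clock)  # 32 40.0 20.0
```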
Remember, these are max figures, too. Increasing the LOD bias lowers them, as does viewing surfaces at oblique angles. When textures are close to the camera, magnification spreads them over more pixels, reducing the requirement much further (3DMark2K1 has almost negligible texture bandwidth requirements for this reason - I'm talking only a few bits per pixel).
Most GPUs, including GF2, GF3, GF4, Radeon 8500, and R300, have about 64 bits of bandwidth per pixel per clock (give or take). You need 32 bits for the colour buffer write, and both Z reads and writes are necessary. With Z-compression, the Z traffic is 16-64 bits per pixel, depending on compression (an average of 32, maybe?). This leaves only a bit for texture bandwidth, but again, texture bandwidth is not nearly as bad as Tom says it is. From here, the greater the texture bandwidth, the lower the efficiency. Alpha textures are a bit different, needing a Z-read and both a colour read and write (~80 bits/pix + texture bandwidth).
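As a back-of-envelope summary of that budget (all numbers are the rough averages above, not measured values):

```python
# Per-pixel bandwidth budget for a GF2/GF3/GF4/8500/R300-class part.
budget = 64            # bits available per pixel per clock (give or take)
colour_write = 32
z_traffic = 32         # compressed Z read+write: 16-64 bits, call it 32 avg

texture_headroom = budget - colour_write - z_traffic
print(texture_headroom)  # 0 -- not much left for textures at full fill rate

# Alpha-blended pixels cost more: a Z read plus a colour read AND write.
alpha_cost = 16 + 32 + 32   # ~80 bits/pixel before any texture fetches
print(alpha_cost)  # 80
```

Which is exactly why low real-world texture traffic (thanks to caching) matters so much more than Tom's worst-case fetch count.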
Generally, a second texture unit helps a lot in multitexturing, because texture bandwidth is usually quite low. Some parts of the screen are bandwidth limited, so the performance gain isn't 100%, but it's still significant. Just look at RV250 vs. R200 in Quake 3 or Jedi Knight - the difference is quite noticeable.