Thanks everyone for some very informative replies! I'm learning a lot!
Mintmaster said:
One is texture compression. This reduces texture size by a factor of 4 or 6, and suddenly texture bandwidth is not very important.
Well I didn't forget texture compression so much as purposely ignore it. But now I wonder if I was right to do that. I'm actually rather ignorant on this question: how often is texture compression used these days? Does it tend to be turned on by default in most new games/current drivers?
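To put rough numbers on the "factor of 4 or 6" figure, here's a quick back-of-envelope sketch assuming S3TC/DXTC block layouts (each DXT block covers a 4x4 tile of texels):

```python
# Back-of-envelope S3TC/DXTC compression ratios.
# Each DXT block encodes a 4x4 tile of texels.
def ratio(uncompressed_bytes_per_texel, block_bits):
    texels_per_block = 4 * 4
    compressed_bytes_per_texel = (block_bits / 8) / texels_per_block
    return uncompressed_bytes_per_texel / compressed_bytes_per_texel

print(ratio(3, 64))   # RGB888   vs DXT1 (64-bit blocks)  -> 6.0
print(ratio(4, 128))  # RGBA8888 vs DXT5 (128-bit blocks) -> 4.0
```

So the 6:1 case is opaque RGB under DXT1 and the 4:1 case is RGBA under DXT3/5, which lines up with the factors quoted above.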
Another is the fact that when there's magnification rather than minification, you only need to sample the top level mipmap. ... This is especially true at high resolutions. In these cases texture bandwidth is nearly negligible, and in multitexture situations you really want another texture unit so that you can use extra bandwidth.
Doh! Should have realized that for myself. But it's a very interesting point, especially the observation about high resolutions. (Bumping up the resolution therefore makes you less bandwidth-limited...)
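Working the magnification point through for myself, with a toy LOD estimate (a square texture covering some number of screen pixels; the formula is a simplification, not any particular chip's):

```python
import math

# Toy LOD estimate: a texture texture_size texels across covering
# screen_coverage pixels on screen. lod > 0 means minification (lower
# mips needed); lod <= 0 means magnification, where only the top-level
# mipmap is sampled and neighboring pixels largely reuse the same texels.
def lod(texture_size, screen_coverage):
    return math.log2(texture_size / screen_coverage)

# The same 256-texel-wide texture at two screen resolutions:
print(lod(256, 128))  # covers 128 px -> lod  1.0 (minification)
print(lod(256, 512))  # covers 512 px -> lod -1.0 (magnification: mip 0 only)
```

Doubling the resolution shifts every surface one LOD toward magnification, which is exactly why high resolutions push you away from being texture-bandwidth-limited.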
Finally, even when you are bandwidth limited, only when you need over 2x the bandwidth than what is provided will single cycle trilinear be useless.
...
All these things make for a tough decision regarding TMU's. It seems the best thing to do is just simulate both scenarios with various scenes, and make a decision based on that.
There's always something a little silly about Internet discussions second-guessing a mid-level design decision like this: obviously the engineers making the decision considered all the possibilities, ran simulations far beyond anything a layman on the Internet could manage in revealing the ramifications of each choice, and went with whatever came out best. So, obviously, limiting each TMU to one bilinear filtered pixel per clock was indeed the best decision for the R300.
The question, given your well-taken point that single-cycle trilinear should increase effective fillrate even when bandwidth limitations keep you from running trilinear at full speed, is: why not? My guess is that it comes down to a low-level implementation issue. I would assume that the pixel pipelines are very highly, um...pipelined. That when we talk about "doing bilinear in one clock", we are not talking about calculating 4 texel positions, fetching those 4 texels, filtering them, fetching the current z-value, comparing, writing the new z-value, and writing to the framebuffer (am I missing anything?) all in one clock cycle. Instead, I'm pretty sure we're talking about doing all those things over several clock cycles, but with a throughput of one pixel per pipe per cycle.
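The latency-vs-throughput distinction can be made concrete with a toy model (the stage names are purely illustrative, not R300's actual pipeline):

```python
# Latency vs. throughput in a deep pixel pipeline (toy model).
# A pixel passes through every stage, one stage per clock, but a new
# pixel can enter every clock, so steady-state throughput is one pixel
# per clock regardless of how deep the pipeline is.
stages = ["addr_calc", "texel_fetch", "bilinear_filter",
          "z_fetch", "z_compare", "fb_write"]  # illustrative stage names

def clocks_for(n_pixels, depth=len(stages)):
    # The first pixel takes `depth` clocks; each later pixel adds one.
    return depth + (n_pixels - 1)

print(clocks_for(1))     # 6 clocks of latency for a lone pixel
print(clocks_for(1000))  # 1005 clocks total -> ~1 pixel per clock
```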
Even so, there are some obvious reasons why moving to (a throughput of) one trilinear pixel per cycle might be more trouble than it's worth. One is that adding the extra pipeline stages (in each of 8 different pixel pipes) means a fair amount of extra transistors. (It's not just the extra texel sampling units; you have extra registers to save state between pipeline stages, plus extra control.) Given R300's ambitious task of putting a fast, complete DX9 GPU on a .15u process and selling it as low as the $150 bracket, it's safe to assume transistors were pretty tight. A second reason is that even if all 8 texel samples aren't for the same pixel, with a fully pipelined design you're still fetching 8 texels from the texture cache every clock. Of course there are going to be some clever ways around it, but...it is *tough* to design a cache that'll do that at a high clock rate.
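Just to put numbers on that cache-port problem (toy figures: 8 pipes is R300-like, but the per-clock fetch behavior is my assumption, not confirmed):

```python
# Texel fetches per clock demanded of the texture cache (toy numbers).
def texels_per_clock(pipes, samples_per_pixel, pixels_per_clock=1):
    return pipes * samples_per_pixel * pixels_per_clock

print(texels_per_clock(8, 4))  # bilinear:  32 cache reads per clock
print(texels_per_clock(8, 8))  # trilinear: 64 cache reads per clock
```

Doubling the read ports (or banking the cache cleverly enough to fake it) at high clock rates is exactly the kind of thing that eats transistors and design time.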
So...those are my new guesses as to why they decided against trilinear in one clock. (Obviously weighed against the simulated performance benefit of having it.)
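For concreteness, here's Mintmaster's 2x point as a toy model (all numbers hypothetical): effective fillrate is the lesser of the pipeline's peak rate and the bandwidth-limited rate, so doubling the peak trilinear rate keeps paying off until bandwidth supplies less than half of what full-speed trilinear would consume.

```python
# Toy fillrate model: limited by both pipeline peak and memory bandwidth.
def fillrate(peak, bw_available, bw_per_pixel):
    return min(peak, bw_available / bw_per_pixel)

peak = 100  # hypothetical peak single-cycle-trilinear rate
half_speed = fillrate(peak // 2, 160, 2)  # two-clock trilinear
full_speed = fillrate(peak,      160, 2)  # single-cycle trilinear
print(half_speed, full_speed)  # 50 vs. 80: still a win while bandwidth-limited
```

Only once the bandwidth-limited rate drops below half of peak (here, below 50) do the two designs converge and single-cycle trilinear stops helping.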
Chalnoth said:
After all, it only seems natural that many textures will remain plain-old-8888 format, possibly compressed, for some time to come.
Certainly. It's always been my impression that the FP formats are only useful for increased precision in various calculations, and as a way to store textures used in those calculations. Err, that's not very specific: I mean that normal color textures (like, you know--the ones that actually get stuck onto polygons) will stay 8888 for the foreseeable future.
It appears that current architectures do this integer processing in parallel with the PS unit.
So, for a given number of cycles in the PS, how many textures are going to need accessing? How long will it take those texels to get through the filtering stage? At what stage in processing are the colors of the sampled textures going to be needed?
Clearly shading performance is going to be a huge part of overall GPU performance in the future. I do wonder, though, to what degree shaders will be running in parallel with the fixed-function pixel pipeline. Not because it can't be done, but because it seems like (for near-future games at least) PS will only be run on a small fraction of all rendered pixels if we're to get playable framerates. If most of the screen is still covered by "normal" pixels, I doubt you'll be able to get much use out of the PS simultaneously with rendering those pixels. (I would assume the PS and the fixed-function pipeline have to be working on the same triangle at any one time, correct?)
Or maybe in the not-too-distant future we'll get games which run shaders on all pixels just as Doom3 puts bump maps on every surface.
As for DOOM3's z-only rendering, it seems to me that what should be done instead is to put the multiple z-checks per pixel pipeline per clock to use for more than just multisampling, basically adding a significantly faster z-only rendering path.
It might take explicit hardware support for such a feature (to be able to calculate the z-values of any 4 pixels in one pipe, instead of just 4 subpixels at various offsets from one pixel), but it does seem like an easy and very useful modification. I wonder if this is implemented in the R300/NV30 generation...? (And if it is implemented on the NV30, why can't it do PJMS??)
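The arithmetic behind that speculation (the per-pipe figures are R300-like guesses, and the retargeting capability is entirely hypothetical):

```python
# Speculative z-only fill: if each of 8 pipes can test 4 z-samples per
# clock (normally the 4 MSAA subsamples of one pixel), pointing them at
# 4 distinct pixels instead would quadruple z-only throughput.
pipes, z_units_per_pipe = 8, 4
normal_z_fill = pipes                    # 1 pixel per pipe per clock
z_only_fill = pipes * z_units_per_pipe   # 4 pixels per pipe per clock
print(normal_z_fill, z_only_fill)        # 8 -> 32 z-pixels per clock
```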
arjan de lumens said:
The 'correct' mipmap selection algorithm when aniso is enabled is different from when it is disabled - enabling aniso will, for pixels of polygons viewed at an angle, perform multiple bi/trilinear samples at a high-resolution mipmap level rather than a single sample at a much lower resolution level. The result is that the blurring effect of plain trilinear is removed, and the texel reuse is reduced substantially compared to plain trilinear.
Very interesting. It's kind of funny that I'd assumed the effects on texel reuse of isotropic filtered angled polys on the one hand and anisotropic on the other to be exactly the opposite of what you've said they are. But that's because I wasn't thinking of how to adjust the mip-map selection to make everything look correct, but rather of trying to make the closest analogy to the forward-facing poly case, because I actually understood that one!
But I think I understand everything now.
A couple questions:
First, doesn't aniso therefore run into the "problem" of hitting the largest (base-level) mipmap (and therefore becoming blurry but fast) more quickly?
And second, even with the smaller amortized z/framebuffer costs, it seems like if aniso incurs worse texel reuse it might still have a bandwidth/fillrate ratio not considerably different from isotropic filtered pixels. Is AF really primarily a fillrate hit? (Put another way: are there any common settings--other than bumping up the resolution, duh--which will tend to cost relatively less on a GFFX-style card than a 9700?)
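For my own benefit, here's arjan's mipmap-selection point as a toy sketch (the formulas are textbook simplifications, not any particular chip's LOD logic): the pixel's footprint in texel space has a major and a minor axis; isotropic filtering picks the LOD from the longer axis, while aniso picks it from the shorter one and takes multiple samples along the longer one.

```python
import math

# Toy mipmap selection for a pixel footprint of `major` x `minor` texels.
def isotropic_lod(major, minor):
    # One (tri)linear sample at a level sized by the longer axis: blurry.
    return math.log2(max(major, minor))

def aniso(major, minor, max_aniso=16):
    # Several samples along the major axis at a sharper level.
    n = min(math.ceil(major / minor), max_aniso)
    return math.log2(major / n), n

# A surface viewed at a steep angle: footprint 8 texels by 1 texel.
print(isotropic_lod(8, 1))  # lod 3.0: a single sample from a low-res mip
print(aniso(8, 1))          # (0.0, 8): 8 samples at mip 0
```

This also shows both halves of my question above: aniso reaches mip 0 at much lower LODs (so it hits the base level sooner), and it fetches several times the texels per pixel over multiple cycles, which reads like primarily a fillrate cost with a bandwidth cost riding along.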