Chalnoth said:
No, that's not true. You see, fillrate comes from the calculation of the color from the textures. Multisampling hardware only does this once for each pixel (except at edges, of course), and copies the color value to all covered sub-pixels within that pixel.
Thus, one pixel pipeline that has four z-checks can output up to four AA samples per clock.
That's not quite right, Chalnoth.
This is all spelled out in Nvidia's OpenGL reference material.
The GeForce3/4 stores color, depth, alpha, and stencil values for each sub-sample.
While the GeForce family will generate a single common color value per pixel from the texturing, lighting, and shading stages of the rendering pipeline, each sub-sample will have its own stencil operations, depth tests, and framebuffer blends.
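To make the split concrete, here's a minimal Python sketch of the behavior described above: one shaded color per pixel, but a separate depth value, depth test, and framebuffer write per sub-sample. The function name, data layout, and closer-wins depth convention are my own illustration, not anything from Nvidia's documentation.

```python
def process_pixel(shaded_color, sample_depths, coverage_mask, framebuffer_samples):
    """Broadcast one shaded color to every covered, depth-passing sub-sample.

    framebuffer_samples: list of dicts with 'color' and 'depth' per sub-sample.
    coverage_mask: bit i set means sub-sample i is covered by this primitive.
    """
    for i, sample in enumerate(framebuffer_samples):
        if not (coverage_mask >> i) & 1:
            continue                        # sub-sample not covered by this triangle
        if sample_depths[i] >= sample['depth']:
            continue                        # per-sub-sample depth test fails
        sample['depth'] = sample_depths[i]  # each sub-sample keeps its own depth
        sample['color'] = shaded_color      # but the color is shared per pixel
    return framebuffer_samples
```

Note that the shader runs zero times inside the loop: the expensive color calculation happened once, before any of the per-sample work.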
To sustain full fillrate (ultimately a measure of rasterization speed: how fast pixels can be written to the framebuffer) while multi-sample anti-aliasing, the GeForce family needs more than just additional depth-testing hardware. It needs enough extra hardware to perform all of the sub-sample-unique processing in parallel, which in Nvidia's case is all framebuffer-specific interaction (other multi-sampling implementations could have very different sub-sample-unique operations).
Granted, this is an area open to many optimizations. Nvidia could decide that some operations aren't common enough to dedicate additional hardware for per-sub-sample processing... for instance, depth-tests and blending could be accommodated in parallel for all sub-samples, whereas stencil tests could require additional clocks for each sub-sample. In fact, there doesn't need to be any extra hardware dedicated to doing sub-sample operations in parallel... it could all be done through additional clock cycles instead, and could still result in performance improvements if shared processing earlier in the pipeline is sufficiently complex (complex pixel shaders could bottleneck the pipeline early on, for example, leaving the framebuffer-sampling hardware many idle clocks to play with).
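The clock-budget argument in that last sentence can be put in toy numbers. This is a hypothetical model, not measured hardware behavior: the slower pipeline stage sets per-pixel throughput, so even fully serialized sub-sample operations can hide under a slow enough shader.

```python
def pixel_clocks(shader_clocks, samples, parallel_sample_hw):
    """Toy per-pixel clock cost: max of the shader stage and the sample stage."""
    # Parallel hardware retires all sub-samples in one clock; serialized
    # hardware spends one clock per sub-sample.
    sample_clocks = 1 if parallel_sample_hw else samples
    return max(shader_clocks, sample_clocks)

# Simple shader, serialized samples: the sample loop is the bottleneck.
print(pixel_clocks(shader_clocks=1, samples=4, parallel_sample_hw=False))  # 4
# Complex shader: the serialized sample loop hides completely.
print(pixel_clocks(shader_clocks=8, samples=4, parallel_sample_hw=False))  # 8
```

In the second case, dedicating parallel per-sample hardware would buy nothing: the shader stage is already the limiter.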
Anyways, Nvidia's implementation doesn't do texture sampling any differently at edge pixels vs. non-edge pixels. It's still done once per pixel location. The texture sampling can be done long before the hardware has any idea that the pixel is only partially covered.
I'm also under the impression that Nvidia's implementation maintains all the requested samples all of the time... i.e., if you request 2x multisampling, every pixel is sampled in two locations, with the corresponding increase in framebuffer bandwidth cost. Whether it's an edge or not really doesn't make a difference...
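Here's a back-of-the-envelope estimate of that bandwidth cost. The 32-bit color and 32-bit depth/stencil formats are assumptions for illustration, not documented GeForce3 internals.

```python
def framebuffer_bytes_per_pixel(samples, color_bytes=4, depth_stencil_bytes=4):
    """Framebuffer storage touched per pixel if every sub-sample keeps its
    own color and depth/stencil value, regardless of edge coverage."""
    return samples * (color_bytes + depth_stencil_bytes)

print(framebuffer_bytes_per_pixel(1))  # 8 bytes: no AA
print(framebuffer_bytes_per_pixel(2))  # 16 bytes: 2x multisampling doubles it
```

Since every pixel pays this, not just edge pixels, the extra load/store traffic scales with the sample count across the whole frame.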
Now, the GeForce3 and 4 do apparently have additional hardware dedicated to multisampling, such that they can generate a multi-sampled pixel in the same amount of clock cycles as a single-sampled pixel (or, at least, with the same per-clock throughput). The only thing holding back performance is the need to load and store the additional framebuffer data.
Whether you consider a limitation imposed by memory bandwidth to be a hindrance to fillrate will depend a lot on what you're trying to do. If memory bandwidth is the only thing holding you back from peak theoretical performance, the distinction may not seem significant. If you're already tying up the hardware with complex shader operations, however, the fact that sub-sampling doesn't take additional clock cycles can be very significant, allowing you to introduce anti-aliasing with little to no performance impact.