Multisample AA takes a fill rate hit with and without aniso

There is no fillrate hit. What you're seeing is the result of a memory bandwidth hit.

As we've seen in the past, insufficient memory bandwidth results in less effective fillrate than the video card is theoretically capable of. For example, a GeForce2 Ultra with a 1GPix/sec theoretical fillrate can't get anywhere near that in the 3DMark2001 fillrate tests (I checked on Madonion... it gets closer to 400MPix/sec, less than half the theoretical figure).
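
To put rough numbers on that (a back-of-envelope sketch; the bandwidth figure and the 12 bytes/pixel cost are assumptions for illustration, not measured values):

```python
# Back-of-envelope: how memory bandwidth caps effective fillrate.
# Assumed figures for a GeForce2 Ultra-class card (illustrative only):
CORE_CLOCK_HZ = 250e6     # 250MHz core
PIXEL_PIPES = 4           # 4 pixel pipelines
MEM_BANDWIDTH = 7.36e9    # ~7.4GB/s (128-bit bus, 460MHz effective DDR)

# Per-pixel framebuffer traffic at 32-bit: assume one color write plus
# a Z read and a Z write, 4 bytes each = 12 bytes per pixel.
BYTES_PER_PIXEL = 12

theoretical_fill = CORE_CLOCK_HZ * PIXEL_PIPES        # 1.0 GPix/sec
bandwidth_capped = MEM_BANDWIDTH / BYTES_PER_PIXEL    # ~0.61 GPix/sec

print(f"theoretical:      {theoretical_fill / 1e6:.0f} MPix/sec")
print(f"bandwidth-capped: {bandwidth_capped / 1e6:.0f} MPix/sec")
```

And that cap is before texture fetches and display refresh start competing for the same bus, which is how you end up nearer 400MPix/sec in practice.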

If you saw no hit in the fillrate tests when enabling FSAA, you'd probably also see no performance hit from enabling FSAA.
 
DaveBaumann said:
Technically there is a small fillrate hit at edges, but generally it's so small that it's not noticeable over the bandwidth hit.

Yes, that is true... and it will make a difference as geometry gets extremely complex (i.e. when sub-pixel polygons become common), but that's still a long way off.
 
Of course, the current fillrate hit from edges is only somewhere in the 3-5% range... much smaller than the other performance limitations...
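
For a feel of where a figure like that comes from, here's a minimal sketch (the triangle count and average edge length are invented inputs, not measurements):

```python
# Rough estimate of the extra shading work MSAA does at triangle edges:
# a pixel straddling a shared edge gets shaded once per triangle that
# covers it, so the overhead scales with total on-screen edge length.
SCREEN_PIXELS = 1024 * 768
VISIBLE_TRIANGLES = 3_000    # assumed scene complexity
AVG_EDGE_LEN_PIXELS = 10     # assumed on-screen length of a shared edge

# Roughly one extra shaded pixel per pixel of shared-edge length
# (the neighbouring triangle re-shades that pixel).
extra_pixels = VISIBLE_TRIANGLES * AVG_EDGE_LEN_PIXELS
print(f"extra shading work: ~{extra_pixels / SCREEN_PIXELS:.1%}")  # ~3.8%
```

As polycounts climb, that fraction climbs with them, which is exactly the sub-pixel-polygon scenario above.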
 
DaveBaumann said:
There is no fillrate hit.

Technically there is a small fillrate hit at edges, but generally it's so small that it's not noticeable over the bandwidth hit.

Depends on your implementation. There can be, but doesn't have to be.
 
The only time there might not be a fillrate hit would be with 4x or lower FSAA on a deferred renderer. Otherwise there has to be a fillrate hit for MSAA. Since it seems pretty apparent that deferred renderers have gone the way of the dodo (a move I support, given that they are bound to run into some pretty major problems with very high polycounts), and since we're certainly going to expect higher than 4x FSAA in future hardware, it seems obvious that every MSAA video card, now and in the future, will take a tiny fillrate hit at poly edges. Of course, the actual hit is currently so minuscule it can be ignored.
 
Why would there NOT be a fillrate hit with MSAA? After all, if doing 4xAA, it is FILLING 4x as many pixels as it would otherwise do.

-FaaR-
 
There's nothing special about a tiler that keeps it from having to generate extra samples when using MSAA.
Also, this high-polygon thing is a red herring. It's like a mathematical argument:

as polygon count approaches infinity, TBRs start encountering difficulties...

but IMRs will never complete the scene either, so who cares. Besides, polygon limits still mostly have to do with getting the data to the card.
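
To put a rough number on that last point (the bus bandwidth and vertex size here are assumptions, not specs):

```python
# Back-of-envelope: how many triangles/sec can the bus alone deliver?
BUS_BANDWIDTH = 1.0e9     # ~1GB/s, roughly AGP 4x (assumed)
BYTES_PER_VERTEX = 32     # position + normal + one UV set (assumed)
VERTS_PER_TRI = 1.0       # ~1 effective vertex/tri with good index reuse

tris_per_sec = BUS_BANDWIDTH / (BYTES_PER_VERTEX * VERTS_PER_TRI)
print(f"bus-limited: ~{tris_per_sec / 1e6:.0f} Mtris/sec")  # ~31 Mtris/sec
```

That ceiling applies to IMR and TBR alike, long before any binning limit bites.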
 
Grall said:
Why would there NOT be a fillrate hit with MSAA? After all, if doing 4xAA, it is FILLING 4x as many pixels as it would otherwise do.

-FaaR-

Because all four sub-pixels use the same color data. The only processing that is multiplied by four in multisampling is the z-checks, and so the GF3/4 cards are capable of at least four z-checks per pixel pipeline.

There's nothing special about a tiler that keeps it from having to generate extra samples when using MSAA.
Also, this high-polygon thing is a red herring. It's like a mathematical argument:

Well, granted, it wouldn't be easy for a tiler to avoid a fillrate hit at poly edges with MSAA involved, since the rasterizer would still need to access multiple textures per pixel, but it could be done.

And there certainly is a problem with tilers and high polycounts, because tilers attempt to cache the entire scene before starting to render. As polycounts increase exponentially, as I'm sure we all know they will, that's an additional performance hit that just isn't needed.
 
Chalnoth said:
And there certainly is a problem with tilers and high polycounts, because tilers attempt to cache the entire scene before starting to render.

This is probably the forum most technically aware of the nuances of the tiling architecture, so most people here are aware of the 'issues'. The fact of the matter is that, in most cases, these appear to be overstated by some out there.

To state that tiling is extinct is also far from the truth.
 
Chalnoth said:
The only time there might not be a fillrate hit would be with 4x or lower FSAA on a deferred renderer. Otherwise there has to be a fillrate hit for MSAA. Since it seems pretty apparent that deferred renderers have gone the way of the dodo (a move I support, given that they are bound to run into some pretty major problems with very high polycounts), and since we're certainly going to expect higher than 4x FSAA in future hardware, it seems obvious that every MSAA video card, now and in the future, will take a tiny fillrate hit at poly edges. Of course, the actual hit is currently so minuscule it can be ignored.

You appear to have a very specific implementation of multi-sampling in mind.

With dedicated multi-sampling hardware, you can generate sub-samples in parallel, as many per clock as you'd like to spend transistors on. There doesn't need to be any performance impact from generating these samples, aside from the need to load and store extra data (which is where the tiler advantage comes in... a tiler can be built with all the bandwidth it needs for any intermediate data storage). Going this route, edge pixels are no more expensive to render than any other pixel.
 
Oh, and that problem can be put off by reducing tile size and increasing the poly buffer size. :p

Yes, PCX and Neon-250 will hit their limits soon - probably with DOOM3 or UT2003, whichever uses more polys.

Kyro/II/SE still has a good amount of time before hitting bin limits... even longer for K2SE cos it occludes geometry and has a larger bin space. :)

*note: Occluded Geometry as seen in K2SE will probably raise the PowerVR Poly Bar at least, say, 3x? Assuming an average overdraw of 3 ;)
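
For a sense of what bin space actually costs, a minimal sketch (the pointer size, vertex size, and tile overlap are all assumed values):

```python
# Rough scene-buffer (bin) memory estimate for a tile-based renderer.
BYTES_PER_TRI_REF = 4     # assumed size of a per-tile triangle pointer
AVG_TILES_PER_TRI = 1.5   # assumed; small triangles mostly touch one tile
VERTEX_BYTES = 24         # assumed per-vertex data, no vertex sharing

def scene_buffer_bytes(triangles):
    # Vertex data is stored once per triangle, plus one reference for
    # every tile that triangle overlaps.
    return triangles * (3 * VERTEX_BYTES + AVG_TILES_PER_TRI * BYTES_PER_TRI_REF)

for tris in (50_000, 200_000, 1_000_000):
    print(f"{tris:>9} tris -> ~{scene_buffer_bytes(tris) / 2**20:.1f} MB buffered")
```

Occluding geometry before it ever hits the bins, as described for the K2SE above, cuts the effective triangle count and stretches the same buffer further.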
 
Chalnoth,

Even though all subsamples use the same color, there's still a framebuffer that is 2x or 4x larger than when not using MSAA, and that buffer needs to be filled, and that means a fillrate hit.
 
Grall said:
Chalnoth,

Even though all subsamples use the same color, there's still a framebuffer that is 2x or 4x larger than when not using MSAA, and that buffer needs to be filled, and that means a fillrate hit.

No, that's not true. You see, fillrate comes from the calculation of the color from the textures. Multisampling hardware only does this once for each pixel (except at edges, of course), and copies the color value to all covered sub-pixels within that pixel.

Thus, one pixel pipeline that has four z-checks can output up to four AA samples per clock.
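
In pseudocode, the idea looks something like this (a conceptual sketch of 4x multisampling; shade, covers, depth_at, and sample_position are hypothetical helpers, not any real API):

```python
# Conceptual 4x multisampling: shade once per pixel, test/write per sample.
def shade_pixel_msaa(triangle, pixel, framebuffer, zbuffer, samples=4):
    # Texturing/shading runs ONCE per pixel location -- this is the
    # expensive fillrate work, and it is not multiplied by four.
    color = shade(triangle, pixel.center)

    for s in range(samples):
        pos = pixel.sample_position(s)           # sub-pixel sample point
        if not triangle.covers(pos):             # coverage test per sample
            continue
        z = triangle.depth_at(pos)
        if z < zbuffer[pixel.index][s]:          # z-check per sample (x4)
            zbuffer[pixel.index][s] = z
            framebuffer[pixel.index][s] = color  # same color, copied
```

With four z-check units, the loop collapses into a single clock per pixel, leaving only the extra framebuffer traffic as a cost.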
 
Chalnoth said:
No, that's not true. You see, fillrate comes from the calculation of the color from the textures. Multisampling hardware only does this once for each pixel (except at edges, of course), and copies the color value to all covered sub-pixels within that pixel.

Thus, one pixel pipeline that has four z-checks can output up to four AA samples per clock.

That's not quite right, Chalnoth.

This is all spelled out in Nvidia's OpenGL reference material.

The GeForce3/4 stores color, depth, alpha, and stencil values for each sub-sample.

While the GeForce family will generate a single common color value per pixel from the texturing, lighting, and shading stages of the rendering pipeline, each sub-sample will have its own stencil operations, depth tests, and framebuffer blends.

To sustain full fillrate (ultimately a measure of rasterization speed: how fast pixels can be written to the framebuffer) when multi-sample anti-aliasing is enabled, the GeForce family needs more than just additional depth-testing hardware. It needs enough extra hardware to perform all of the sub-pixel-unique processing in parallel, which in Nvidia's case is all framebuffer-specific interaction (other multi-sampling implementations could have very different sub-sample-unique operations).

Granted, this is an area open to many optimizations. Nvidia could decide that some operations aren't common enough to dedicate additional hardware for per-sub-sample processing... for instance, depth tests and blending could be accommodated in parallel for all sub-samples, whereas stencil tests could require additional clocks for each sub-sample.

In fact, there doesn't need to be any extra hardware dedicated to doing sub-sample operations in parallel... it could all be done through additional clock cycles instead, and could still result in performance improvements if the shared processing earlier in the pipeline is sufficiently complex (complex pixel shaders could bottleneck the pipeline early on, for example, leaving the framebuffer-sampling hardware many idle clocks to play with).
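
A toy model of that trade-off (every cycle count here is invented purely for illustration):

```python
# Toy throughput model: per-pixel cycle cost when sub-sample framebuffer
# ops run in parallel ROP units vs. serialized over extra clocks.
def cycles_per_pixel(shader_cycles, samples, rop_units):
    rop_cycles = -(-samples // rop_units)  # ceil: one fb op per sample
    # Shading and framebuffer ops are pipelined; the slower stage wins.
    return max(shader_cycles, rop_cycles)

print(cycles_per_pixel(shader_cycles=1, samples=4, rop_units=1))  # 4: ROP-bound
print(cycles_per_pixel(shader_cycles=1, samples=4, rop_units=4))  # 1: "free" AA
print(cycles_per_pixel(shader_cycles=8, samples=4, rop_units=1))  # 8: shader hides it
```

The third case is the interesting one: with a sufficiently complex shader, even fully serialized sub-sample operations add no clocks at all.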

Anyways, Nvidia's implementation doesn't do texture sampling any differently at edge pixels vs. non-edge pixels. It's still done once per pixel location. The texture sampling can be done long before the hardware has any idea that the pixel is only partially covered.

I'm also under the impression that Nvidia's implementation maintains all the requested samples all of the time... i.e. if you request 2x multisampling, all pixels will be sampled in two locations, with the corresponding overall increase in framebuffer bandwidth cost. Whether it's an edge or not really doesn't make a difference...

Now, the GeForce3 and 4 do apparently have additional hardware dedicated to multisampling, such that they can generate a multi-sampled pixel in the same number of clock cycles as a single-sampled pixel (or, at least, with the same per-clock throughput). The only thing holding back performance is the need to load and store the additional framebuffer data.

Whether you consider a limitation imposed by memory bandwidth to be a hindrance to fillrate will depend a lot on what you're trying to do. If memory bandwidth is the only thing holding you back from peak theoretical performance, the distinction may not seem significant. If you're already tying up the hardware with complex shader operations, however, the fact that sub-sampling doesn't take additional clock cycles can be very significant, allowing you to introduce anti-aliasing with no performance impact.
 
Dan G said:
Anyways, Nvidia's implementation doesn't do texture sampling any differently at edge pixels vs. non-edge pixels. It's still done once per pixel location. The texture sampling can be done long before the hardware has any idea that the pixel is only partially covered.

Of course it doesn't... but edge pixels are covered twice (or more, depending on how many triangles come together at that particular edge). This is particularly apparent with triangle edges within a mesh, i.e. an edge where two triangles 'come together,' as opposed to overdraw, where one triangle simply sits in front of another.

What could possibly help would be a way of simplifying the multisampling calculations for triangle fans/strips, where the hardware treats all triangles in a fan/strip that share an edge as a single triangle for texture-filtering purposes...
 
Hmm...

I don't entirely follow.

You're pointing out that resolving edge coverage to greater accuracy through multi-sampling can result in rendering operations being done for some pixels that would have been ignored entirely if the pixel were single-sampled? Such as a pixel covered 20% by one triangle and 80% by another, meeting at what is intended to be a seamless edge, requiring texture sampling for both triangles at that pixel location with multisampling, rather than only for the 80%-coverage triangle with single-sampling?

If so, then you're right... there will be a slight increase in the total number of pixels rendered, once the definition of what counts as a visible pixel changes. This can slightly alter performance even if the sub-sample processing had no performance impact of its own.
 