Question: free MSAA? Edit: and also free AF?

Simon F said:
LeGreg said:
and it was not done if I remember correctly on the Kyro II
Please define "not done correctly".

"and it was not done (scaling other parts accordingly), if I remember correctly, on the Kyro II."

I'm simply referring to reviews from that time, since I didn't own the real thing: at worst, enabling 4xFSAA cut performance to a quarter, so it wasn't really free.
 
The Kyro architecture had no peer in 32-bit FSAA in its time. The GeForce 2 (GTS/Ultra) could match it in 16-bit FSAA, due to its vastly superior fillrate and bandwidth, but struggled to come anywhere near it in 32-bit mode.

I think the Kyro 2 was one of the few cards of its time to offer playable 32-bit 4xFSAA. It took next to no performance hit when moving from 16-bit to 32-bit rendering, except at higher resolutions, which it probably was never really designed to handle well anyway (being a mainstream card positioned against the GeForce 2 MX).

The FSAA was not free, but it had the smallest performance hit of any card in its generation.
 
I recall OGSS being, in relative terms, bandwidth-"free" on the KYRO II. Of course, that meant 2x2 or 2x vertical up to 1024x768x32, or 2x horizontal up to 1280x1024; the driver wouldn't allow any higher resolutions with SSAA enabled.
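
For concreteness, here's the arithmetic on those mode caps; the inference about why the driver stopped there is my own guess, not anything documented:

```cpp
#include <cstdio>

// Internal buffer sizes implied by the SSAA mode caps recalled above.
// Note the 2x2 and 2x-horizontal caps both land in the same rough
// ~3M-sample ballpark -- plausibly the buffer limit the driver was
// protecting, though that inference is a guess.
int main() {
    printf("2x2 @ 1024x768:  %dx%d = %d samples\n",
           1024 * 2, 768 * 2, (1024 * 2) * (768 * 2)); // 2048x1536
    printf("2xV @ 1024x768:  %dx%d = %d samples\n",
           1024, 768 * 2, 1024 * (768 * 2));           // 1024x1536
    printf("2xH @ 1280x1024: %dx%d = %d samples\n",
           1280 * 2, 1024, (1280 * 2) * 1024);         // 2560x1024
    return 0;
}
```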

IMHO, for its time it was a pure GF2 MX competitor. Compare the two boards' raw bandwidth and the relative supersampling performance of each, and it's easier to see where the "bandwidth free" part fits in. I'm not aware of any architecture that could use supersampling without any fill-rate penalty; multisampling is of course an entire chapter of its own.
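
To put a number on the "bandwidth free" part, here's a back-of-the-envelope model of external framebuffer traffic under 4x OGSS. The 32-bit colour/z sizes, overdraw of exactly 1, and omission of texture traffic are all simplifying assumptions; it only illustrates why a tiler dodges the bandwidth cost while still paying the full fill-rate cost:

```cpp
#include <cstdio>

// Rough model of external framebuffer traffic per frame under 4x OGSS,
// assuming 32-bit colour + 32-bit z per sample, overdraw of exactly 1,
// and no texture traffic. Illustrative only -- real numbers vary wildly.
int main() {
    const double w = 1024, h = 768;
    const int    samples = 4;          // 2x2 ordered grid
    const double colorZ  = 4.0 + 4.0;  // bytes per sample

    // Immediate-mode renderer: every sample is z-tested and written in
    // external memory, then the 4x buffer is read back for the downfilter.
    double imr = w * h * samples * colorZ  // render the 4x buffer
               + w * h * samples * 4.0     // read 4x colour for filtering
               + w * h * 4.0;              // write the final frame
    // Tiler: the 4x samples live in on-chip tile memory; only the filtered
    // final-resolution pixels ever cross the external bus.
    double tbr = w * h * 4.0;

    printf("IMR:  ~%.1f MB/frame\n", imr / (1024.0 * 1024.0));
    printf("TBDR: ~%.1f MB/frame\n", tbr / (1024.0 * 1024.0));
    return 0;
}
```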

There have been more than a few relevant discussions on these boards about the topic. One proposition was coverage mask AA:

Coverage mask AA techniques can process samples in a pixel with the same precision as individual point samples, if the information is kept around to do so. This includes the correct processing of implicit edges. Z3 does this by using z slopes (hence Z3: the z, dzdx, and dzdy); there are other techniques, but this is the most effective. FAA did not bother to do this.

When using z slopes with coverage mask AA, the result is actually a form of tile-based renderer with pixel-sized tiles. The rendering is done in two passes, much like a typical TBR. The first pass renders to "tiles", each a rectangular collection of higher-resolution screen samples. The second pass then renders each tile to the final frame buffer.

The major differences compared to typical TBRs are the format of the data in the tile and the size of the tile. In the case of coverage mask AA, the data is kept in the tile as a set of geometry information (z slopes and a z value), bit masks, and colors. This means that colors are rendered on the first pass, which is different from a deferred-rendering tiler, which keeps the geometry as triangles and unrendered color information. However, both render at a lower resolution than the sample resolution on the first pass and collect up fragments in the "tile" buffer to be rendered on the second pass. On the second pass, coverage mask AA processes each pixel-sized "tile" by processing each of the fragments found on the first pass, typically front to back, to produce the final color for the pixel.
A typical TBR, of course, processes each tile by processing each of the triangles in the tile (not necessarily front to back).

Coverage mask AA derives many of the same benefits as typical TBRs; that is, it substantially reduces external memory bandwidth for a given sample resolution. This is because the processing on the first pass is done at a lower resolution than the sample resolution, and the results are kept in a compressed format for the second pass, which processes each tile.

There is one notable difference between coverage mask techniques and typical TBRs when it comes to AA, though. Typical TBRs are capable of rendering super-sampled AA at high sample densities (say 16x), while coverage mask techniques must use multi-sampling. This is because TBRs do not render the color on the first pass, but only when the tile is rendered. Since coverage mask AA renders colors on the first pass, it can only afford to render one color per fragment, since rendering a color for each sample would lose most of the benefits.

This may not be much of an advantage in actual practice though: a TBR would be limited by pixel shader performance even where it is not restricted by external memory bandwidth. Running a sophisticated pixel shader for every sample is simply impractical. The computational resources could be put to much better use, and multi-sampled AA is good enough.

http://www.beyond3d.com/forum/viewtopic.php?p=129946&highlight=coverage#129946
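
To make the quoted description a bit more tangible, here's a minimal sketch of the per-pixel "tile" record and the second-pass resolve it describes. The names, the 16-sample mask, and the omission of the per-sample z test (which the stored slopes would enable for implicit edges) are my simplifications, not the actual Z3 layout:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the data a Z3-style coverage-mask AA scheme might keep per
// fragment, as the quoted post describes: a z value plus slopes, a coverage
// bit mask, and a single colour. 16 samples per pixel assumed.
struct Fragment {
    float    z, dzdx, dzdy; // plane equation: per-sample z via the slopes
    uint16_t coverage;      // one bit per sample covered by this fragment
    uint32_t color;         // ONE colour per fragment -> multisampling only
};

// Second pass: each pixel-sized "tile" resolves its collected fragments,
// front to back. A fragment claims whichever of its samples are still
// unowned; the final colour is the coverage-weighted average. (A real
// implementation would use the z slopes here to resolve implicit edges
// per sample; this sketch skips that.)
uint32_t resolvePixel(const std::vector<Fragment>& frontToBack) {
    uint16_t remaining = 0xFFFF;
    float r = 0, g = 0, b = 0;
    for (const Fragment& f : frontToBack) {
        uint16_t claimed = f.coverage & remaining;
        if (!claimed) continue;
        int n = 0;
        for (uint16_t m = claimed; m; m &= m - 1) ++n; // popcount
        r += n * float((f.color >> 16) & 0xFF);
        g += n * float((f.color >>  8) & 0xFF);
        b += n * float( f.color        & 0xFF);
        remaining &= uint16_t(~claimed);
        if (!remaining) break;
    }
    return (uint32_t(r / 16) << 16) | (uint32_t(g / 16) << 8) | uint32_t(b / 16);
}
```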
 
When doing quad texturing, Rampage had (almost) free 4xRGMS. Dual texturing gave it (almost) free 2xRGMS with 4xAF.

Sorry to bring up the "R" word again. :oops:
 
rashly said:
When doing quad texturing, Rampage had (almost) free 4xRGMS. Dual texturing gave it (almost) free 2xRGMS with 4xAF.

Sorry to bring up the "R" word again. :oops:
It's "free" in the same way it is "free" on any other multisampling capable chip, only that Rampage had less ROPs per pipe.
 
Xmas said:
rashly said:
When doing quad texturing, Rampage had (almost) free 4xRGMS. Dual texturing gave it (almost) free 2xRGMS with 4xAF.

Sorry to bring up the "R" word again. :oops:
It's "free" in the same way it is "free" on any other multisampling capable chip, only that Rampage had less ROPs per pipe.

Not only that, but the number of samples is inevitably limited to a specific amount, due mainly to bandwidth constraints. An alternative multisampling method based on a coverage mask technique (not necessarily the ones known so far) could theoretically allow much higher sample densities, with bandwidth penalties as low, or almost as low, as a TBDR's.

As for getting anisotropic filtering fill-rate free, we're almost there with the countless optimisations being added by the day. The challenge would be to get full trilinear, non-angle-dependent AF with minimal fill-rate penalties for the foreseeable future.

I'm not entirely sure how exactly Spectre was counting taps, but if my memory serves me well it wasn't independent of multisampling, and I haven't yet seen anything that proves the true final output was equivalent to 16xAF (as we know it today).
 
Ailuros said:
Xmas said:
rashly said:
When doing quad texturing, Rampage had (almost) free 4xRGMS. Dual texturing gave it (almost) free 2xRGMS with 4xAF.

Sorry to bring up the "R" word again. :oops:
It's "free" in the same way it is "free" on any other multisampling capable chip, only that Rampage had less ROPs per pipe.

As for getting anisotropic filtering fill-rate free, we're almost there with the countless optimisations being added by the day. The challenge would be to get full trilinear, non-angle-dependent AF with minimal fill-rate penalties for the foreseeable future.
You sure it's free? If doing less work means free, I guess so :?:
 
I'm not going to re-quote someone's trademarked (LOL) comment about "free-ness" in 3D again. ;)

It was obviously a sarcastic comment; in that regard, AF on Spectre was even more "for free". The second sentence that you quoted above clarifies it even more; I thought it was obvious enough.
 
Ailuros said:
I'm not entirely sure how exactly Spectre was counting taps, but if my memory serves me well it wasn't independent of multisampling, and I haven't yet seen anything that proves the true final output was equivalent to 16xAF (as we know it today).
OT: I wonder where those rumours about the AF of Spectre/Rampage not being "real" AF came from. AFAICS they're unfounded.
 
Ailuros said:
As for getting anisotropic filtering fill-rate free, we're almost there with the countless optimisations being added by the day. The challenge would be to get full trilinear, non-angle-dependent AF with minimal fill-rate penalties for the foreseeable future.
Well, you could do this by programming an appropriate shader.

That is, if you write a shader that has many more math operations than it has texture operations, texture ops can take a long time without slowing performance.
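
A toy model of that argument; the one-cycle ALU rate and the per-fetch costs are invented purely for illustration. If the math and texture streams overlap, the shader costs roughly the longer of the two, so extra AF cycles stay hidden as long as the ALU side dominates:

```cpp
#include <algorithm>
#include <cstdio>

// Cost of a shader if math and texture units run in parallel and latency
// can be fully hidden: roughly the slower of the two streams. One cycle
// per ALU op and a variable per-fetch cost are assumptions, not specs.
double shaderCycles(int aluOps, int texOps, double cyclesPerTex) {
    return std::max(double(aluOps) * 1.0, double(texOps) * cyclesPerTex);
}

int main() {
    // 20 math ops, 2 fetches: even 8 cycles/fetch under heavy AF hides
    // completely behind the ALU work.
    printf("bilinear:          %.0f cycles\n", shaderCycles(20, 2, 1.0)); // 20
    printf("heavy AF:          %.0f cycles\n", shaderCycles(20, 2, 8.0)); // 20
    // 4 math ops, 2 fetches: now the AF cost shows through.
    printf("heavy AF, tex-bound: %.0f cycles\n", shaderCycles(4, 2, 8.0)); // 16
    return 0;
}
```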
 
That actually just made me curious: do NVIDIA and ATI shader compilers take AF into account?
I'd guess it wouldn't be an easy thing to check. Another interesting factor, if that were the case, is games that precache shaders, since the driver would have to check for changing AF levels. What about games activating AF without restarting, or using the same shader for things that require a different level or type of filtering? (That last case should be quite rare, but eh.)

It seems to me the on-chip scheduling system is unlikely to be able to do that kind of stuff by itself, although who knows. My guess is that current compilers don't take AF into account. If so, that leaves a bit of performance to be gained in games whose shaders use arithmetic and textures in roughly equal measure. The gains might not be huge, but assuming higher latency can't be a Bad Thing; it can only help, if it is indeed accurate that the fetch latency is higher :)
(And, of course, the NV40 is well advantaged here since it can keep filtered texels in its texture caches.)

Uttar
 
It seems to me that it'd be very challenging for the compiler to take anisotropic filtering into account. After all, with anisotropic filtering enabled, the time a single texture read takes becomes variable, which makes optimization rather difficult.
 
Chalnoth said:
It seems to me that it'd be very challenging for the compiler to take anisotropic filtering into account. After all, with anisotropic filtering enabled, the time a single texture read takes becomes variable, which makes optimization rather difficult.
That's probably right, and even more so now that the caches can hold filtered data.
However, it'd already be fair enough to guess that if you enable 16x AF, 2x AF will be used on much of the scene (in some cases it might not be much more than 50% of it for all I know; I never bothered much with those numbers, but that's still worthwhile), so you could assume texture operations take 2-3 cycles minimum. The NV35 compiler, according to the docs IIRC, assumed it had to group everything in a "2 TEX together" fashion, which might be more of a disadvantage than an advantage with 8x Quality AF on; you'd be consuming the data sooner than you should in order to save registers, thus stalling the pipeline.
Of course, in the NV3x, registers would be the bottleneck anyway *grins*
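
To illustrate that stall-versus-registers trade-off, here's a toy single-issue pipeline model; the latency figure and instruction mix are invented, not NV3x specifics:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy single-issue pipeline: a TEX result arrives 'texLatency' cycles after
// issue, and an instruction that reads it earlier stalls. Consuming the
// result early frees its register sooner (the register-pressure concern),
// while spacing it out hides the latency. All numbers are assumptions.
struct Instr { bool isTex; int usesTexResult; }; // index of TEX read, -1 if none

int runCycles(const std::vector<Instr>& prog, int texLatency) {
    int cycle = 0;
    std::vector<int> texReady; // cycle at which each TEX result is available
    for (const Instr& in : prog) {
        if (in.usesTexResult >= 0)
            cycle = std::max(cycle, texReady[in.usesTexResult]); // stall?
        if (in.isTex) texReady.push_back(cycle + texLatency);
        ++cycle; // one issue per cycle
    }
    return cycle;
}

int main() {
    // Use the TEX result immediately, then six independent math ops:
    std::vector<Instr> eager  = {{true, -1}, {false, 0}, {false, -1}, {false, -1},
                                 {false, -1}, {false, -1}, {false, -1}, {false, -1}};
    // Same instructions, but the use of the TEX result is scheduled last:
    std::vector<Instr> spaced = {{true, -1}, {false, -1}, {false, -1}, {false, -1},
                                 {false, -1}, {false, -1}, {false, -1}, {false, 0}};
    printf("use early: %d cycles\n", runCycles(eager, 6));  // 13: stalls on fetch
    printf("use late:  %d cycles\n", runCycles(spaced, 6)); // 8: latency hidden
    return 0;
}
```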

Uttar
 
The degree of anisotropy selected only affects the maximum anisotropy. So, if 16-degree anisotropy is selected, just as much 2-degree anisotropy will be used as if 4-degree or 8-degree anisotropy is selected.
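
That follows from how the per-pixel degree is derived from the texel footprint: roughly, the ratio of the footprint's major to minor axis picks the degree, and the user setting only clamps it. A simplified sketch (the function and its exact math are illustrative, not any particular chip's implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Pick a per-pixel anisotropy degree from the texture-coordinate gradients.
// The selected level acts purely as a cap, which is why a 16x setting still
// uses 1x or 2x on most of the scene. Simplified axis-aligned footprint.
int anisoDegree(float dudx, float dvdx, float dudy, float dvdy, int maxAniso) {
    float lenX = std::sqrt(dudx * dudx + dvdx * dvdx); // footprint along x
    float lenY = std::sqrt(dudy * dudy + dvdy * dvdy); // footprint along y
    float major = std::max(lenX, lenY);
    float minor = std::max(std::min(lenX, lenY), 1e-6f);
    int degree = (int)std::ceil(major / minor);
    return std::min(degree, maxAniso);
}

int main() {
    // A screen-facing surface needs only 1 tap, whatever the cap is:
    printf("%d\n", anisoDegree(1, 0, 0, 1, 16));  // 1
    // A steeply angled surface wants 12:1; 16x allows it, 8x clamps it:
    printf("%d\n", anisoDegree(12, 0, 0, 1, 16)); // 12
    printf("%d\n", anisoDegree(12, 0, 0, 1, 8));  // 8
    return 0;
}
```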

But yes, I suppose you could do some tests and examine what the average degree is, and optimize for that...
 
But yes, I suppose you could do some tests and examine what the average degree is, and optimize for that...

I think it's currently around 6x; just don't quote me on that one ;)
 
Chalnoth said:
Ailuros said:
As for anisotropic filtering getting it fill-rate free, we're almost there with the countless of optimisations being added to the day. The challenge would be to get full trilinear, non angle dependent AF with minimal fill-rate penalties for the foreseeable future.
Well, you could do this by programming an appropriate shader.

That is, if you write a shader that has many more math operations than it has texture operations, texture ops can take a long time without slowing performance.

I wouldn't leave this in the hands of ISVs personally for obvious reasons.

Just another sidenote I forgot to add:

It is my understanding that recent GeForces use both odd and even sample counts with adaptive AF.
 
Ailuros said:
But yes, I suppose you could do some tests and examine what the average degree is, and optimize for that...
I think it's currently around 6x; just don't quote me on that one ;)
That doesn't seem to make much sense to me. I doubt texture fillrate drops to 1/6th when anisotropic is enabled.
 
Chalnoth said:
Ailuros said:
But yes, I suppose you could do some tests and examine what the average degree is, and optimize for that...
I think it's currently around 6x; just don't quote me on that one ;)
That doesn't seem to make much sense to me. I doubt texture fillrate drops to 1/6th when anisotropic is enabled.

I don't think you understood what I meant, and it was admittedly phrased a tad vaguely. It's my understanding that adaptiveness on GeForces can range from 1x, 2x, 3x, etc. up to 16x samples with AF, depending on how many samples each texture really requires. The average should, or might, be around 6x samples, and the cases where more samples are really needed should be quite rare.
 
And I'm suggesting that if the average really were 6x, then performance with anisotropic filtering enabled, when limited by texture fillrate, would drop to 1/6th of its previous value.
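
The arithmetic behind that objection, under the strong assumptions that every extra tap costs a full bilinear cycle and nothing hides behind caches or latency:

```cpp
#include <cstdio>

// If a texture-fillrate-limited case paid one full cycle per anisotropic
// tap with nothing hidden, a scene-average of N taps would cut throughput
// to 1/N -- the crux of the objection above. Assumptions, not measurements.
int main() {
    const double avgTaps[] = {1.0, 2.0, 6.0};
    for (double t : avgTaps)
        printf("average %.0fx taps -> %.0f%% of base fillrate\n", t, 100.0 / t);
    return 0;
}
```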
 