Different filtering methods

Hyp-X said:
Let's take a function f(x)=x^2.
f(0) = 0
f(1) = 1
Both cases are equally likely, averaging 0.5
Do you agree?
I hope not. :!:
You're right. The function that determines the LOD value is nonlinear. However, this means the average skews even further toward trilinear taking more bandwidth relative to bilinear.


Code:
[Bilinear]
Mip level a:   n .... 2n .... 4n
Mip level b:                   n .... 2n .... 4n
Mip level c:                                   n .... 2n .... 4n
Mip level d:                                                   n .... 2n .... 4n

Only one mip level is sampled at any point.

[Trilinear]
Mip level a:   n .... 2n .... 4n .... 8n
Mip level b:         0.5n .... n .... 2n .... 4n .... 8n
Mip level c:                         0.5n .... n .... 2n .... 4n .... 8n
Mip level d:                                         0.5n .... n .... 2n .... 4n

Two mip levels are sampled at every point.
Sorry, I don't understand that table.
 
bloodbob said:
Does anyone think there is hope for a cubic magnification filter in the next 10 years (other than on the Matrox cards)?
Are you saying Matrox is currently using cubic filtering instead of, or in addition to, bilinear? If so, where does this info come from?
 
Dave H, I never thought of the scenario that you brought up. I always assumed that for bilinear filtering, the LOD was rounded to the smaller mipmap, but obviously that would create massive aliasing. If it were rounded to the larger mipmap, then texture bandwidth would increase 5x from bilinear to trilinear.

So, the question is how the rounding is done. After checking Quake 3, it seems you guys are right, and Hyp-X's chart is right (at least up to a constant factor).

However, it turns out that I was quite wrong, and Hyp-X was almost right. Going back to his chart...

Code:
[Bilinear]
Mip level a:   n .... 2n .... 4n
Mip level b:                   n .... 2n .... 4n
Mip level c:                                   n .... 2n .... 4n
Mip level d:                                                   n .... 2n .... 4n

[Trilinear]
Mip level a:   n .... 2n .... 4n .... 8n
Mip level b:         0.5n .... n .... 2n .... 4n .... 8n
Mip level c:                         0.5n .... n .... 2n .... 4n .... 8n
Mip level d:                                         0.5n .... n .... 2n .... 4n

Region A:              |-------|
Region B:                      |-------|

In region A, bi goes from 2 to 4, whereas tri goes from 2+0.5=2.5 to 4+1=5, so tri requires 1.25 times the bandwidth of bi in this entire region.

In region B, bi goes from 1 to 2, whereas tri goes from 4+1=5 to 8+2=10, so tri requires 5 times the bandwidth of bi in this entire region.

Function non-linearity is not an issue, because the multiples of 1.25 and 5 are constant in the regions.

Consider looking at an infinite flat plane along its normal from some arbitrary distance away. On average and under minification, 50% of the time situation A arises, and 50% of the time situation B arises (although the regions will be shifted). So you get 0.5*1.25+0.5*5 = 3.125. This weighting depends on the rounding method, but looking at Quake 3 mipmaps it looks like the bilinear transitions occur right in the middle of a trilinear blend, so 50/50 seems right. Thus trilinear filtering appears to require 3.125 times the bandwidth of bilinear filtering.

However, this is assuming all mipmap levels appear in the same quantity throughout the scene (i.e. the randomly distant infinite plane), which is clearly false. For any angled surface like a floor (or even a bunch of non-angled polygons viewed head on, like road signs), each mipmap will appear in objects covering 1/4 the area of the previous mipmap (on average). So that will be a weighted mean of the form ((1)*1.25 + (1/4)*5 + (1/16)*1.25 + (1/64)*5 + ...) / (4/3) = 2. Note that the 1.25 region comes first.
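Both averages are easy to check numerically. A quick Python sketch (assuming the 1.25x/5x region multiples and the 1/4^n area weighting described above):

```python
# Tri/bi bandwidth multiple: 1.25x in the "A" regions, 5x in the "B" regions.

# Equal-area weighting (the randomly distant infinite plane): 50/50 split.
equal_weight = 0.5 * 1.25 + 0.5 * 5
print(equal_weight)  # 3.125

# Geometric area weighting: each successive region covers 1/4 the screen
# area of the previous one, alternating 1.25x, 5x, 1.25x, 5x, ...
N = 50  # enough terms for the geometric series to converge
weights = [(1 / 4) ** n for n in range(N)]
multiples = [1.25 if n % 2 == 0 else 5.0 for n in range(N)]
weighted_mean = sum(w * m for w, m in zip(weights, multiples)) / sum(weights)
print(round(weighted_mean, 6))  # 2.0
```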

Good insight, Hyp-X, but it looks like in the end Dave H was right. Trilinear filtering requires 2 times the bandwidth of bilinear filtering. There are, however, a bunch of assumptions, and the area weighting I assumed doesn't always hold true. A far away building may cover a large part of the scene and use a fairly small mipmap, but when close it won't multiply in area like a road sign because half of it will be off the screen.

I guess we can only say it requires between 2 and 3.125 times the bandwidth on average, but once you include magnification and cases where the 5x case happens more often, even this range isn't correct.

Xmas: The table tells you what bandwidth per pixel is required from each mipmap. The x-axis lets you specify the LOD, and the numbers tell you the bandwidth requirements per pixel. n is a constant dependent on LOD.
 
3dcgi said:
bloodbob said:
Does anyone think there is hope for a cubic magnification filter in the next 10 years (other than on the Matrox cards)?
Are you saying Matrox is currently using cubic filtering instead of, or in addition to, bilinear? If so, where does this info come from?

Sorry, I meant 3Dlabs.

Their documentation said their latest VPU supported programmable filtering.
 
Mintmaster said:
In region A, bi goes from 2 to 4, whereas tri goes from 2+0.5=2.5 to 4+1=5, so tri requires 1.25 times the bandwidth of bi in this entire region.

In region B, bi goes from 1 to 2, whereas tri goes from 4+1=5 to 8+2=10, so tri requires 5 times the bandwidth of bi in this entire region.

Function non-linearity is not an issue, because the multiples of 1.25 and 5 are constant in the regions.

Consider looking at an infinite flat plane along its normal from some arbitrary distance away. On average and under minification, 50% of the time situation A arises, and 50% of the time situation B arises (although the regions will be shifted). So you get 0.5*1.25+0.5*5 = 3.125.

Your math is flawed.
The values in region A are twice the values in region B, so you have to weight the multipliers 2:1 (even with equal probability).

That's (1.25*2 + 5*1)/3 = 7.5/3 = 2.5
That's 2.5x :)

I'll answer your claim of scene based probability in the next post.
 
Mintmaster said:
However, this is assuming all mipmap levels appear in the same quantity throughout the scene (i.e. the randomly distant infinite plane), which is clearly false. For any angled surface like a floor (or even a bunch of non-angled polygons viewed head on, like road signs), each mipmap will appear in objects covering 1/4 the area of the previous mipmap (on average).

Let's take road signs.

The road signs using the next mipmap have 1/4 the area, but there are twice as many of them, assuming they are distributed evenly along the road.

So we are talking about more like 1/2 in this case.

So that will be a weighted mean of the form ((1)*1.25 + (1/4)*5 + (1/16)*1.25 + (1/64)*5 + ...) / (4/3) = 2. Note that the 1.25 region comes first.

Says who? :)
 
Hyp-X said:
Your math is flawed.
The values in region A are twice the values in region B, so you have to weight the multipliers 2:1 (even with equal probability).

That's (1.25*2 + 5*1)/3 = 7.5/3 = 2.5
That's 2.5x :)

First of all, that math was intentionally left flawed, which is why I carried on the discussion. Second, this depends on your definition of the average bandwidth multiple. Are you weighting the multiple per pixel or per byte of bandwidth? Since we are talking about triangle areas and per-pixel bandwidth, I assumed the former. If we were completely texture-bandwidth bound in performance, then you'd be right, but that rarely happens. If you were talking about total bandwidth over all pixels, you'd be right, but this assumes equal coverage of each mipmap.

Let's say total texture bandwidth divided by total pixels is what we're after. In retrospect, this seems like the logical thing to calculate, so I apologize. Taking into account the 1/4^n weighting (further justified below), the series is now:

((1)*2*1.25 + (1/4)*5 + (1/16)*2*1.25 + (1/64)*5 + ...) / (2.4) = 1.667

note: 2.4 = 2 * (1 + 1/16 + 1/256 + ...) + (1/4 + 1/64 + 1/1024 + ...)
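This corrected series can be verified numerically as well. A sketch, taking the double weight on the 1.25x regions and the 1/4^n area decay as given above:

```python
# Weighted mean with the 1.25x regions (which span twice the LOD range)
# given double weight, and screen areas decaying as 1/4^n.
N = 50  # enough terms for convergence
num = 0.0
den = 0.0
for n in range(N):
    area = (1 / 4) ** n
    if n % 2 == 0:       # 1.25x region, double weight
        num += 2 * area * 1.25
        den += 2 * area
    else:                # 5x region
        num += area * 5
        den += area
print(round(den, 6))        # 2.4
print(round(num / den, 6))  # 1.666667
```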

Hyp-X said:
Let's take road signs.

The road signs using the next mipmap have 1/4 the area, but there are twice as many of them, assuming they are distributed evenly along the road.

So we are talking about more like 1/2 in this case.

You're half right. I was including overlap for a very dense population of road signs, but then it becomes like an angled plane, like the floor or a wall. However, consider looking down a corridor with perpendicular corridors evenly spaced. Their walls, which you look at head on, increase in visibility as you go down the hallway, just like a dense population of road signs. The area ratios will work out to be like I said previously (if you average everything out) and half of the corridor looks something like this:
Code:
--------\
|      |  \
|      |  |_
|      |  | |\
|      |  | ||>
|      |  | |/
|      |  |-
|      |  /
--------/

The reason I said half right is because if there were no overlap from the road signs, then the series would be weighted (2^n)/(2n+1)^2 (note: the numerator depends on sign spacing) instead of 1/(4^n) as I illustrated in my previous post. Obviously the first series doesn't converge when summed, so having no overlap is impossible. You can't really say "more like 1/2" as a general statement. OTOH, if we neglect HSR then overlap is irrelevant. All in all, this specific case is not very good for generalization.
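The divergence claim is straightforward to check numerically. A sketch under the stated assumptions (2^n signs at mip level n, each covering 1/(2n+1)^2 the screen area of the nearest sign):

```python
# Partial sums of the no-overlap road-sign weighting 2^n / (2n+1)^2.
# They grow without bound, so these weights cannot add up to a finite
# screen area -- some overlap is unavoidable.
def partial_sum(terms):
    return sum(2 ** n / (2 * n + 1) ** 2 for n in range(terms))

for terms in (5, 10, 20, 40):
    print(terms, partial_sum(terms))  # grows rapidly with more terms

# The 1/4^n weighting, by contrast, converges (to 4/3):
print(round(sum((1 / 4) ** n for n in range(50)), 6))  # 1.333333
```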

However, for the angled plane I am pretty sure that I am right in my weighting. See below.


Hyp-X said:
So that will be a weighted mean of the form ((1)*1.25 + (1/4)*5 + (1/16)*1.25 + (1/64)*5 + ...) / (4/3) = 2. Note that the 1.25 region comes first.

Says who? :)

Says you. Look at your chart. Consider the floor of an FPS. Up to region A both bi and tri are equal. Next comes the 1.25x. Then the 5x region, and so on.

Sure it's possible that we'll start with the 5x region because that's the first region on the screen or that's where the object starts, but now we're getting into specific scenarios. It's also possible for there not to be any 5x region. On average, it should work out the way I said, with the 1.25x region coming first in the series, as in the following situation. Consider a floor (with walls bounding it):

Code:
            /\
           /5 \
          /----\
         / 1.25 \
        /--------\
       /    5     \
      /------------\
     /     1.25     \
    /----------------\
   /                  \
  /        1           \
  ----------------------

If I drew it to scale, each trapezoid would be 1/2 the dimensions (hence 1/4 area) of the one below it, except for the "1" trapezoid.


In any case, even if my weighting is off, you can easily see that higher mipmap numbers cover smaller areas on the screen. Also, in the nearest region where tri differs from bi, tri needs 1.25x the bandwidth. Furthermore, as you correctly pointed out, the 1.25 multiple must have double the weighting.

In the end, the answer must be significantly less than 2.5 due to the area weighting. My guess is a bit over 1.667 (due to the case you described), but most likely less than 2. If you average in the magnification issue, then it's even lower.

Thanks, Hyp-X, for this enlightening conversation about filtering. I think I'll step out of this thread unless you have yet another important piece of the puzzle to mention.
 
Honestly, I completely missed the crux of this whole bandwidth conversation. Where do all those myriad factors come from? Bilinear in one LOD plus bilinear in the next LOD is 8 texels altogether (at minification, that is), no matter how you look at it. So what did I miss?
 
The effects of caching. There is a 'resonance' effect as the fractional D level changes across the texture because you aren't at exactly 1:1 mapping.

I'm not 100% convinced by some of the numbers I've seen here though (if sireric posted that could be taken as gospel, of course...).
 
Dio said:
The effects of caching. There is a 'resonance' effect as the fractional D level changes across the texture because you aren't at exactly 1:1 mapping.

Aha. But shouldn't the 'resonance' effect be effectively the same for both LODs?
 
Mintmaster said:
Sure it's possible that we'll start with the 5x region because that's the first region on the screen or that's where the object starts, but now we're getting into specific scenarios.

That was my point.

I think you are right about most of the other things.

Of course in a specific scene it will never be 2.5x.
 
Dio said:
The effects of caching. There is a 'resonance' effect as the fractional D level changes across the texture because you aren't at exactly 1:1 mapping.

I'm not 100% convinced by some of the numbers I've seen here though (if sireric posted that could be taken as gospel, of course...).

The calculations assumed that the cache is 100% effective (nothing has to be read twice), which is pretty much impossible to achieve.

Also, I don't know the width and granularity of the cache->texture filter datapath.
Can it feed 512-bit data in the worst case (a DXT5 texture where the bilinear footprint is across the tile corner), or does the filter get starved in some of those extreme cases even when the data is in the cache?

Also calculating bandwidth increase doesn't tell much of the real-world performance as some of the scene can be fillrate limited, some of it is bandwidth limited, and the buffers to even that out are surely limited.
 
The traditional reason that trilinear was a big hit was not because of the absolute bandwidth requirements, but because the two textures tended to be in different DRAM pages and so a lot of the theoretical bandwidth got thrown away on page closures. This isn't so much a problem nowadays.

As you say, generally there is 'a bottleneck' but exactly where that is can vary greatly during the rendering of the scene.
 
Dio said:
The traditional reason that trilinear was a big hit was not because of the absolute bandwidth requirements, but because the two textures tended to be in different DRAM pages and so a lot of the theoretical bandwidth got thrown away on page closures.
But you could put them in alternate banks which would make the page swap between levels free. For example, even lowly PCX1 did this.
 
It's all a question of priorities. For the PCX1 you could afford to make trilinear a top priority in terms of bank arrangements. Those of us who were using off-chip Z and colour had to make those travel first-class...
 
The below is flawed. With trilinear filtering you always sample from one texture that is at least as detailed as the screen LOD (n >= 1 below) and from one texture that is less detailed than the screen LOD (n < 1).

Code:
[Bilinear]
Mip level a:   n .... 2n .... 4n
Mip level b:                   n .... 2n .... 4n
Mip level c:                                   n .... 2n .... 4n
Mip level d:                                                   n .... 2n .... 4n

[Trilinear]
Mip level a:   n .... 2n .... 4n .... 8n
Mip level b:         0.5n .... n .... 2n .... 4n .... 8n
Mip level c:                         0.5n .... n .... 2n .... 4n .... 8n
Mip level d:                                         0.5n .... n .... 2n .... 4n

Region A:              |-------|
Region B:                      |-------|

Region B is too detailed. It should be:
Code:
[Trilinear]
Mip level a:     n ....   2n ....  4n
Mip level b: 0.25n .... 0.5n ....   n .... 2n .... 4n
Mip level c:                    0.25n ....0.5n .... n .... 2n .... 4n
Mip level d:                                    0.25n ....0.5n .... n .... 2n .... 4n

Region B1:       |--------|
Region A1:                |--------|
Region B2:                         |--------|

Note that the units are the size of the mip level in AREA, not length. Hence, each mip is 1/4 the size of its neighbor.

The LOD to mipmap area mapping is NOT linear, so we can't just average this. In addition, the screen area to LOD mapping isn't linear, which complicates things further. Thus the "obvious" average of 3.125 texels per pixel does not apply. If we factor out the LOD to texel area mapping of n^2 we get (2.25^2) texels per pixel, but this is again flawed because the screen to LOD mapping is not linear, and is scene dependent (looking down a cylindrical tunnel? Or two infinite planes?)

However, these nonlinearities can be factored out, since we are only looking for Trilinear texture samples / Bilinear texture samples ratio.

In this case, Trilinear covers EXACTLY 5x the texel area per filtered pixel. At every point, it samples the bilinear filter texel area plus the next more detailed mip level, which is always 4x the detail in texel count. Bilinear filtering is not defined as sampling the "closest" mip level, but rather the mip level that is less than or equal to the screen LOD.

Trilinear does a weighted average of this and the next most detailed mip level, which is by definition always at least as detailed as the screen LOD, up to a max of four texels/pixel for this higher-detail mipmap.
(OK, so it's clamped at just under 4 texels/pixel.)

However, what does the texel area covered per pixel have to do with the texel bandwidth required?

Almost nothing, due to texture caches, yet a lot, due to texture caches.

The texture cache works equally well for both mip levels as long as it is designed intelligently (favors the larger mipmap if there is a collision or is always large enough to fit both).

How large does this cache have to be per texel pipe? Well, let's imagine a 32-pixel-wide tile (all recent graphics hardware renders in tiles to get the best texel reuse from a cache).

The first line of rendering fetches at most (32+1)*2 low-res samples + 32*4 high-res samples.

But each following line only requires (32+1)*1 low-res samples (the other possibly required texels were fetched for the first rendered line) and 32*4 high-res samples, which in the worst case are never reused because each pixel corresponds to exactly 4 texels.

Then why cache these? Because that is the worst case; most of the time the higher-res mip level does reuse texels, at a rate of at least 50%.

Overall, the total space needed to cache efficiently for trilinear filtering is slightly more than 5 * 2 * tile width, or 320 texels in the example above.
This is the per-concurrent-texture, per-pipeline number. That is a little more than 1K for a single-texturing, single-pipeline hardware renderer.
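The tile arithmetic above can be sketched as follows (assumptions taken from the post: a 32-pixel-wide tile, and a worst case of 2 low-res texels and 4 high-res texels fetched per pixel):

```python
# Worst-case texel fetches per rendered line of a 32-pixel-wide tile
# under trilinear filtering.
TILE_W = 32

# First line: (width+1)*2 low-res samples plus width*4 high-res samples.
first_line = (TILE_W + 1) * 2 + TILE_W * 4
# Later lines share one low-res row with the line above.
later_lines = (TILE_W + 1) * 1 + TILE_W * 4
print(first_line, later_lines)  # 194 161

# Cache sized for roughly 5 texels/pixel across two rows of the tile:
cache_texels = 5 * 2 * TILE_W
print(cache_texels)  # 320
```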

Now, what people were asking about was bandwidth, but this can't really be calculated, because the worst case is 5x the bandwidth but is very unlikely. It depends on the scene, and especially on texture reuse. If the cache is large enough, and many triangles and/or tiles worth of geometry are rendered using the same textures, then reuse will be high. With maximal reuse, only a few border cases call for texture reloads, and the 5x calculation depends on a full reload, not just a few missing samples that may be in one mip level or another, and not "tied" together in the 4:1 ratio that brings about the 5x ratio.

We are also not factoring in one very important effect: the high-detail mip of one triangle can be the low-detail mip of another, and vice versa. This will cause the apparent 5x bandwidth ratio to break, because this benefit of the texture cache helps trilinear filtering but not bilinear filtering, which when shifting mip levels has to fetch new texels rather than retrieve older ones already in the cache.

The best-case scenario is when an entire mip pyramid is required to render, say, an infinite plane stretching to the horizon. In this case, during rendering, bilinear must fetch the whole texture, and all mip levels, at some point.

The same is true of trilinear. If cached well enough, both use exactly the same bandwidth: the whole texture mip pyramid, once.
 
Scott C said:
The best-case scenario is when an entire mip pyramid is required to render, say, an infinite plane stretching to the horizon. In this case, during rendering, bilinear must fetch the whole texture, and all mip levels, at some point.

The same is true of trilinear. If cached well enough, both use exactly the same bandwidth: the whole texture mip pyramid, once.
To do this, wouldn't you need the triangle setup engine to break up the plane into pixels that use the same portion of the texture?

That is, imagine a repeating texture:
12345123451234512345

Let's assume that the texture cache cannot hold the entire texture in video memory. Therefore, rendering straight across the screen would result in lots and lots of texture loads.

But, if the triangle setup engine passed all pixels marked "1" above, then all pixels marked "2," and so on, then the cache would be completely efficient.

At the same time, breaking up the scene in such a way would require either a significant output cache, or there would be a significant granularity problem in writing to/reading from the framebuffer.
 
Chalnoth said:
To do this, wouldn't you need the triangle setup engine to break up the plane into pixels that use the same portion of the texture?

That is, imagine a repeating texture:
12345123451234512345

Let's assume that the texture cache cannot hold the entire texture in video memory. Therefore, rendering straight across the screen would result in lots and lots of texture loads.

But, if the triangle setup engine passed all pixels marked "1" above, then all pixels marked "2," and so on, then the cache would be completely efficient.

At the same time, breaking up the scene in such a way would require either a significant output cache, or there would be a significant granularity problem in writing to/reading from the framebuffer.
First, the triangle setup engine has no concept of pixels, I believe you are thinking of the scan converter. Second, the scan converter has no idea what texels are going to be accessed... the pixel shader is what dictates what texels will be loaded (based on texture coordinates, dependent texture reads, etc).

As has been mentioned before, one technique used to improve cache locality (particularly with bilinear filtering) is memory tiling. There have been many discussions on this topic here. And, yes, a texture with a high repeat count can hurt cache performance, even with tiling, but if you have mipmaps you'll still get good performance.
 
Scott C said:
The LOD to mipmap area mapping is NOT linear, so we can't just average this. In addition, the screen area to LOD mapping isn't linear, which complicates things further. Thus the "obvious" average of 3.125 texels per pixel does not apply. If we factor out the LOD to texel area mapping of n^2 we get (2.25^2) texels per pixel, but this is again flawed because the screen to LOD mapping is not linear, and is scene dependent (looking down a cylindrical tunnel? Or two infinite planes?)

Have you been reading the above posts? Of course none of this is linear, but the multiples are constant throughout Region A or Region B, which is what we are after. From there we just need a screen area weighting, which is justified in my previous posts.

Scott C said:
In this case, Trilinear covers EXACTLY 5x the texel area per filtered pixel. At every point, it samples the bilinear filter texel area plus the next more detailed mip level, which is always 4x the detail in texel count. Bilinear filtering is not defined as sampling the "closest" mip level, but rather the mip level that is less than or equal to the screen LOD.

Check some games, and you'll find that isn't the case. Compare bilinear and trilinear here. This was also the topic of discussion above.

Thus Hyp-X's diagrams are correct. BTW, n is 2 times the LOD. In your redrawn diagram for trilinear, you just divided everything by 2.


Scott C said:
However, what does the texel area covered per pixel have to do with the texel bandwidth required?

Instead of getting into the rest of your discussion, I should reiterate our assumptions (again, mentioned above). Assume the tile does not fit into the cache, but memory tiling and caching are ideal in that no texel is loaded twice over a local region. Between tiles, yes, the texture is reloaded (since a larger texture would likely not fit entirely in the cache). During magnification, no, a cached texel does not need to be loaded twice when it is used in weighting over a group of neighbouring pixels.

Why this assumption? If the texture and its mipmap pyramid fit entirely in the cache, there is no change in bandwidth requirements, and this rarely happens with the larger textures of newer games. Furthermore, bilinear filtering has no impact on performance nowadays, so the cache is very good at holding texels for neighbouring pixels. Obviously the above assumption is the situation we are interested in.

Scott C said:
We are also not factoring in one very important effect: the high-detail mip of one triangle can be the low-detail mip of another, and vice versa.

This is included in our assumptions. That's why we are looking at texel area per pixel for each mipmap. Again, remember that fitting a whole tile into the cache is not what we are interested in.

Read through our discussion again and you'll see that it was well thought out and carefully concluded by Hyp-X (who gave very good insight into these calculations) and myself.
 