Implementing trilinear/anisotropic in the pixel shader

Anteru

Newcomer
Just twiddling around with filtering inside the pixel shader, and I wonder how this is supposed to work. Got no mip-maps (texture too large), so I can't use HW filtering, and I'm off implementing my own.

Trilinear in the pixel shader seems easy enough. I snap the coordinates, compute four bilinear samples plus the one for the current mip-map level, and blend them together using 4 lerps (3 for the lookup at the non-existing higher mip map level, and one for the final blend). So far, this looks 1:1 like the HW filtering. Not sure if there is a more efficient way, as the lerps and such look a bit wasteful.
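The blend described above can be written down as a scalar reference (this is not a shader, just a sketch of the lerp structure; the four bilinear samples are assumed to come from hardware bilinear fetches at the base level):

```cpp
#include <cassert>

// Reference for the trilinear blend described above: four bilinear fetches
// at the base level reconstruct the missing next-coarser texel (a 2x2 box
// filter, done with 3 lerps), then one final lerp blends it with the base
// sample. b00..b11 are the four bilinear samples, base is the sample at the
// current level, lodFrac is the fractional part of the LOD.
static float lerp(float a, float b, float t) { return a + (b - a) * t; }

float trilinearFromBilinear(float base,
                            float b00, float b10, float b01, float b11,
                            float lodFrac)
{
    // 3 lerps with weight 0.5 == 2x2 box filter (the emulated coarser texel)
    float top    = lerp(b00, b10, 0.5f);
    float bottom = lerp(b01, b11, 0.5f);
    float coarse = lerp(top, bottom, 0.5f);
    // final blend between the real level and the emulated one
    return lerp(base, coarse, lodFrac);
}
```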

However, I'm stuck with anisotropic. I couldn't find a good reference on the web except for a couple of links. For some reason or another (maybe I'm just too stupid), I can't get the first one working. Didn't get around to implementing the second one yet. Well, nevertheless, I'm wondering how exactly it is implemented in hardware. Or at least, how can I get reasonably close to a hardware implementation in the pixel shader? Does anyone know a good source, or better yet, a working GLSL/HLSL implementation that is reasonably fast?
 
Hmm... I'm curious, why are you using a larger texture than supported by the filtering unit? MegaTexture?

Yep. For previewing, I want to load it, and to reduce memory usage, I don't want the mip maps. Thinking further, I probably also don't want to load the mip maps of the tiles either, to save 25-33% bandwidth per frame during upload, at the expense of doing the filtering on my own.

Moreover, I'm really curious how to do it, having spent around one week full-time on it now and still not being anywhere close to good anisotropic filtering. For trilinear, I'm also wondering what an efficient implementation looks like, because I currently already have loads of arithmetic going on, and I think I'll need trilinear lookups for the anisotropic lookup, so this is only going to get worse. DX10 also complains about some gradient instructions in it, so I wonder how one is supposed to do it.
 
Yep. For previewing, I want to load it, and to reduce memory usage, I don't want the mip maps. Thinking further, I probably also don't want to load the mip maps of the tiles either, to save 25-33% bandwidth per frame during upload, at the expense of doing the filtering on my own.
This is just a gut feeling, but that sounds like false economy to me.
 
This is just a gut feeling, but that sounds like false economy to me.

Well, anyway, at this point I just want to implement it for the sake of doing it, as it really bugs me that this should not be too complicated/slow, yet I can't find any good reference on the web for it (with reasonable performance). Especially for trilinear filtering, I would have expected loads of references on how to do it, but besides a quote in Advanced Virtual Texture Topics which says trilinear is easy (and for anisotropic, they didn't have time), I couldn't find even hints at how to do it properly.

Lately I've been looking at ATTILA, but they do angle-dependent filtering, which is no longer really used by current GPUs. The same goes for the approach in the GL extension spec, which steps in x or y and therefore only filters properly for two angles.
 
You need mipmaps for anisotropic and trilinear filtering. That 33% of extra memory usage is well worth the image quality. And you can actually save a lot of memory by only loading the lower mipmap levels when you don't need the high detail ones.
 
You need mipmaps for anisotropic and trilinear filtering. That 33% of extra memory usage is well worth the image quality. And you can actually save a lot of memory by only loading the lower mipmap levels when you don't need the high detail ones.

Yep, but with 4 bilinear lookups, I can compute the values of the next mip map without ever having it, so I basically trade performance for memory usage (yeah, box filter only). I'm only interested in this limited case, where I can compute the next mip-map level, which rules out higher levels of anisotropy (where I guess more mip-map levels would be required to efficiently filter wider areas).

For a simple shader, doing the 4 additional bilinear lookups to get the trilinear filtering is just as fast as the HW filtering (read: the shader is so freakin' simple that the 4 additional bilinear lookups cost nothing), and it also looks like the HW implementation. I'm not sure whether the lerps can be saved; so far, I haven't had much luck optimising it.

Now I wonder how other people do this, and how an anisotropic filter would be implemented in the pixel shader.

[EDIT]: Just tossing around some numbers. For a 2^16 square texture, it'll use 2^32 bytes of memory (4 GiB, DXTC 4:1 compressed); with mip-maps this results in at least 5 GiB -- too much for a DVD. The alternative is to generate the mip maps at runtime before uploading them to the GPU, but this requires the 33% more bandwidth plus the generation of the mip maps. In this light, 4 additional bilinear lookups look really cheap.
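The arithmetic above can be sanity-checked with a few lines (assuming 1 byte per texel for DXT2/3/5-class 4:1 compression of RGBA8, as in the numbers quoted):

```cpp
#include <cassert>
#include <cstdint>

// Storage for a 2^log2Size square texture at DXTC 4:1 (1 byte per texel),
// and the cost of adding a full mip chain. Each level is 1/4 the size of
// the previous one, so the chain adds just under 1/3 on top of the base:
// 2^32 bytes (4 GiB) grows to roughly 5.33 GiB.
uint64_t baseBytes(unsigned log2Size)
{
    return uint64_t(1) << (2 * log2Size);   // texels * 1 byte
}

uint64_t withMipsBytes(unsigned log2Size)
{
    uint64_t total = 0;
    for (unsigned l = log2Size; ; --l)
    {
        total += uint64_t(1) << (2 * l);    // add level l
        if (l == 0) break;
    }
    return total;
}
```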
 
can I ask a noob question?
Since gfx cards have a max texture size of 8192x8192, how does megatexture get round this? You could cut it into smaller pieces, but that sort of defeats the object...
 
can I ask a noob question?
Since gfx cards have a max texture size of 8192x8192, how does megatexture get round this? You could cut it into smaller pieces, but that sort of defeats the object...

Yes, you chop the texture up into smaller pieces, but you still need to filter them, either using mip maps or using shaders. If you use mip-maps, you either use up a whole lot more memory on disk, or you create them at runtime. If you don't use them, you gotta do it in your shaders.

What you can also do is chop the thing up into 8k squares to have a simple preview, without generating the tiles etc., which is what I want to do :) But as I'm running on my notebook, I'd like to upload the 8k texture directly without mipmaps to the GPU, so I can save some memory and don't have to precompute the mip-maps (easier run-time editing).
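Not from the thread, but as a sketch of what chopping the texture up means for addressing: a virtual UV decomposes into a tile index plus a coordinate inside that tile (all names here are hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch: splitting a large virtual texture into fixed-size
// tiles means every virtual UV in [0,1) decomposes into (tile index, UV
// within that tile). tilesPerSide = virtualSize / tileSize.
struct TileCoord { int tileX, tileY; float localU, localV; };

TileCoord virtualToTile(float u, float v, int tilesPerSide)
{
    float su = u * tilesPerSide;
    float sv = v * tilesPerSide;
    TileCoord t;
    t.tileX  = (int)std::floor(su);
    t.tileY  = (int)std::floor(sv);
    t.localU = su - t.tileX;   // fractional part = position inside the tile
    t.localV = sv - t.tileY;
    return t;
}
```

Filtering across the seams is the part that makes this non-trivial: at a tile edge, the neighbouring texels live in another tile, which is exactly why border texels or shader filtering come up.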
 
[EDIT]: Just tossing around some numbers. For a 2^16 square texture, it'll use 2^32 bytes of memory (4 GiB, DXTC 4:1 compressed); with mip-maps this results in at least 5 GiB -- too much for a DVD. The alternative is to generate the mip maps at runtime before uploading them to the GPU, but this requires the 33% more bandwidth plus the generation of the mip maps. In this light, 4 additional bilinear lookups look really cheap.
Silly question, but does it have to have an alpha channel? Would DXT1 do for at least some of the data?

If not, assuming you do need DXT2 or greater, and if your primary concern is shipping (the equivalent of) 5 GiB of precompressed DXTC data, might I suggest an alternate approach?

Let's say we want to target the equivalent of, say, 2bpp instead of the 8bpp of DXTn (n>1).

Let's break the data into manageable chunks, say 2^11 * 2^11 pixels, then group these into aligned 4x4 pixel blocks, and then include all the MIP-map blocks as well. We'd have approximately 350K 4x4 blocks which, if stored in DXTn, would be ~5.3 MiB.

If we were to VQ-compress these 350K blocks down to, say, 64K unique representatives, that would then require 350K * 2 bytes for indices + 64K * 16 bytes for a table of DXTn blocks, giving a total of ~1.7 MiB.

Would that be worth investigating?
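A quick check of the sizes in the suggestion above (chunk size, block counts, and codebook parameters taken from the post):

```cpp
#include <cassert>

// Checking the VQ numbers above: a 2^11 x 2^11 chunk in 4x4 blocks, plus
// its mip chain down to a single 4x4 block, VQ-coded as 16-bit indices
// into a 64K-entry codebook of 16-byte DXTn blocks.
unsigned blocksWithMips(unsigned log2Chunk)
{
    unsigned total = 0;
    for (unsigned l = log2Chunk; l >= 2; --l)   // stop at the 4x4 level
        total += 1u << (2 * (l - 2));           // (2^l / 4)^2 blocks
    return total;
}

unsigned vqBytes(unsigned blocks, unsigned codebookEntries)
{
    return blocks * 2 + codebookEntries * 16;   // indices + codebook
}
```

This reproduces the figures in the post: ~350K blocks, ~5.3 MiB stored as raw DXTn (16 bytes per block), ~1.7 MiB VQ-compressed.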
 
Perhaps a silly question, but why aren't you just generating mipmaps on the GPU when the texture changes? There's direct API functionality for this, and it's simple to do even if there isn't.

Saving GPU memory is a moot point here... I guarantee that emulating anisotropic filtering (or arguably even trilinear) in the shader is going to be so much slower that it won't be worth it, especially if this is a 32-bit RGBA texture, and ESPECIALLY if it's compressed. There's no competing with dedicated hardware for that type of lookup.

If you want to do it all in software, you might as well bite the bullet and implement full EWA filtering which is well-documented in a number of places. It will be hella-slow, but that's kind of the point here...

Now with respect to mipmapping, I'm a little confused on your point. You seem to be saying that you can do trilinear without the "lower" mipmap level by generating it on the fly... this is of course true, but only if you know your LOD beforehand (so you know which mipmap level to upload).

For anisotropic lookups, you have to modify how you're computing that LOD beforehand, and I think you're going to realize that it'll now be a lot more variable which LODs you're going to need over a full screen, and indeed you'll need *higher* level LODs, so the memory cost of just storing the mipmap chain is pretty moot IMHO. Again, hardware is just going to bury software here for the lookups that you're talking about.
 
Another noob question, if I may:
if you have to cut the texture up, what exactly is the point of megatexture?
I thought the point was that you have 1 texture.
 
Yep, but with 4 bilinear lookups, I can compute the values of the next mip map without ever having it...
Sure, but for the next level you need 16 lookups, then 64, 256, 1024, etc. The magic of mipmaps is that you can have all that precomputed at a very modest memory cost.
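The growth mentioned here follows directly from the footprint: emulating mip level k above the stored base needs a 2^k x 2^k box of base texels. A one-liner to make it concrete:

```cpp
#include <cassert>

// Cost of faking mip level k above the stored base on the fly: the
// footprint is a 2^k x 2^k box of base texels, i.e. 4^k bilinear lookups
// (4, 16, 64, 256, ...). Only k = 1 is cheap, which is the magic of having
// the whole chain precomputed at ~1/3 extra memory.
unsigned lookupsForLevel(unsigned k)
{
    return 1u << (2 * k);
}
```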
Just tossing around some numbers. For a 2^16 square texture, it'll use 2^32 byte of memory (4 GiB, DXTC 4:1 compressed), with mip-maps this results in at least 5 GiB -- too much for a DVD.
For a texture that huge I think your best bet is to come up with your own compression algorithm and implement it in the shader. I'm thinking wavelets, which I believe also removes the need for mipmapping...
 
Ok, to clear up some confusion here: this would be for the per-tile lookup in any kind of virtualized texture memory solution (not the original point, as I'm not that far with the virtualized part ;) ). I just wanted to get things working before turning to the virtualization part. Eventually, I'm aiming for a YAVT (Yet Another Virtual Texturing) implementation.

Perhaps a silly question, but why aren't you just generating mipmaps on the GPU when the texture changes? There's direct API functionality for this, and it's simple to do even if there isn't.
Didn't try yet. The tile cache in the virtual texture is probably too large for per-frame mip-map recomputation (it's going to be something like 4k), but it might be viable per tile. Haven't investigated yet.

Now with respect to mipmapping, I'm a little confused on your point. You seem to be saying that you can do trilinear without the "lower" mipmap level by generating it on the fly... this is of course true, but only if you know your LOD beforehand (so you know which mipmap level to upload).
In that case, you know the LOD beforehand, and you can easily compute the trilinear blend factor.
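For reference, the blend factor falls out of the standard isotropic LOD rule (a sketch of the usual OpenGL/D3D definition, not necessarily bit-exact to any particular GPU): take the screen-space derivatives in texel units, the log2 of the larger footprint axis is the LOD, and its fractional part is the trilinear weight.

```cpp
#include <cassert>
#include <cmath>
#include <algorithm>

// Isotropic LOD selection from screen-space derivatives (already scaled to
// texel units): lambda = log2 of the larger footprint axis length. The
// integer part picks the mip pair, the fractional part is the trilinear
// blend factor mentioned above.
void lodFromDerivatives(float dudx, float dvdx, float dudy, float dvdy,
                        int &level, float &blend)
{
    float lenX   = std::sqrt(dudx * dudx + dvdx * dvdx);
    float lenY   = std::sqrt(dudy * dudy + dvdy * dvdy);
    float lambda = std::log2(std::max(lenX, lenY));
    level = (int)std::floor(lambda);
    blend = lambda - (float)level;
}
```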

For anisotropic lookups, you have to modify how you're computing that LOD beforehand, and I think you're going to realize that it'll now be a lot more variable which LODs you're going to need over a full screen, and indeed you'll need *higher* level LODs, so the memory cost of just storing the mipmap chain is pretty moot IMHO. Again, hardware is just going to bury software here for the lookups that you're talking about.
Well, I just wanted to hack it into the shader to have some filtering now and reuse it later per-tile. With a clamped anisotropic maximum (let's say 2-4:1), I don't think it's actually going to get that bad. To check how bad the performance really is, I would need anisotropic filtering in the shader first, which I don't have :) I simply don't want to dismiss this solution until I see that even for modest, approximated anisotropy it's x% slower than the hardware, with x > 20% or so.
 
Another noob question, if I may:
if you have to cut the texture up, what exactly is the point of megatexture?
I thought the point was that you have 1 texture.

Yes, the idea is to assemble the tiles in such a way that you can't see that the megatexture has been split up. By analyzing the scene, you select the parts of the texture that would be visible and display them, so there is no difference from displaying the full texture.
 
ATTILA actually implements an angle-independent anisotropic algorithm, though not in the unsupported one-time drop that the open-source version is. In fact it implements three: two botched attempts to discover what the hell ATI/AMD and NVidia are using, and EWA (as Beyond3D readers of the G80 article should know ;)).

And I don't mind posting the code for EWA (it may not be the best implementation of Heckbert's algorithm, but at least it seems to work).

In any case, it's a real pain to implement in a shader.

Code:
//  Computes anisotropy based on a Heckbert's EWA algorithm
f32bit TextureEmulator::anisoEWA(f32bit dudx, f32bit dudy, f32bit dvdx, f32bit dvdy,
    u32bit maxAniso, u32bit &samples, f32bit &dsOffset, f32bit &dtOffset)
{
    f32bit A;
    f32bit B;
    f32bit C;
    f32bit F;
    f32bit p;
    f32bit t;
    f32bit q;
    f32bit axis1[2];
    f32bit axis2[2];
    f32bit l1;
    f32bit l2;
    f32bit scale;
    f32bit N;

    //  Calculate ellipse equation coefficients from the derivatives
    A = dvdx * dvdx + dvdy * dvdy;
    B = -2.0f * (dudx * dvdx + dudy * dvdy);
    C = dudx * dudx + dudy * dudy;
    F = (dudx * dvdy - dudy * dvdx) * (dudx * dvdy - dudy * dvdx);

    //  What is the purpose of this?
    A = A/F;
    B = B/F;
    C = C/F;

#ifndef GPU_SIGN
    #define GPU_SIGN(x) (((x) >= 0)?1:-1)
#endif

    //  Calculate additional factors required to compute the minor and major axis
    //  of the ellipse
    p = A - C;
    q = A + C;
    t = GPU_SIGN(p) * (f32bit) GPU_SQRT(p * p + B * B);

    //  Check if t is 0
    if (t == 0.0f)
    {
        axis1[0] = 1.0f / (f32bit) GPU_SQRT(A);
        axis1[1] = 0.0f;
        axis2[0] = 0.0f;
        axis2[1] = 1.0f / (f32bit) GPU_SQRT(A);
    }
    else
    {
        //  Calculate the major and minor axis of the ellipse
        axis1[0] = (f32bit) GPU_SQRT((t + p) / (t * (q + t)));
        axis1[1] = GPU_SIGN(B * p) * (f32bit) GPU_SQRT((t - p) / (t * (q + t)));

        axis2[0] = -1.0f * GPU_SIGN(B * p) * (f32bit) GPU_SQRT((t - p) / (t * (q - t)));
        axis2[1] = (f32bit) GPU_SQRT((t + p) / (t * (q - t)));
    }


    //  Compute the length of both vectors
    l1 = (f32bit) GPU_SQRT(axis1[0] * axis1[0] + axis1[1] * axis1[1]);
    l2 = (f32bit) GPU_SQRT(axis2[0] * axis2[0] + axis2[1] * axis2[1]);


    //  Check the major axis
    if (l1 > l2)
    {
        //  Calculate anisotropy ratio
        N = l1 / l2;

        //  Clamp aniso ratio to max aniso
        N = GPU_MIN(N, f32bit(maxAniso));

        /*  Calculate the number of samples required.  */
        samples = u32bit(GPU_CEIL(N));

        //  Calculate the texture scale for each sample.
        scale = l1 / N;

        //  Calculate the per anisotropic sample offsets in s,t space.
        dsOffset = axis1[0] / f32bit(samples + 1);
        dtOffset = axis1[1] / f32bit(samples + 1);
    }
    else
    {
        //  Calculate anisotropy ratio
        N = l2 / l1;

        //  Clamp aniso ratio to max aniso
        N = GPU_MIN(N, f32bit(maxAniso));

        /*  Calculate the number of samples required.  */
        samples = u32bit(GPU_CEIL(N));

        //  Calculate the texture scale for each sample.
        scale = l2 / N;

        //  Calculate the per anisotropic sample offsets in s,t space.
        dsOffset = axis2[0] / f32bit(samples + 1);
        dtOffset = axis2[1] / f32bit(samples + 1);
    }

    //  Check for finite results of the anisotropy algorithm
    if (!(finite(scale) && finite(dsOffset) && finite(dtOffset)))
    {
        samples = 1;

        scale = static_cast<f32bit>(GPU_MAX(GPU_SQRT(dudx * dudx + dvdx * dvdx),
            GPU_SQRT(dudy * dudy + dvdy * dvdy)));

        dsOffset = 0.0f;
        dtOffset = 0.0f;
    }

    return scale;
}
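For what it's worth, here's my guess at how those outputs would be consumed; this loop is not from ATTILA, and the centring convention is an assumption (ATTILA divides the axis by samples + 1, which suggests a slightly different spacing):

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

// Hypothetical consumer of anisoEWA's outputs (NOT ATTILA code): take
// 'samples' probes stepped along the major ellipse axis by (dsOffset,
// dtOffset), centred on (s, t), and average them. 'fetch' stands in for a
// (tri)bilinear lookup at the LOD implied by the returned scale.
template <typename Fetch>
float filterAniso(float s, float t,
                  uint32_t samples, float dsOffset, float dtOffset,
                  Fetch fetch)
{
    // centre the run of samples on (s, t)
    float s0 = s - dsOffset * 0.5f * float(samples - 1);
    float t0 = t - dtOffset * 0.5f * float(samples - 1);

    float sum = 0.0f;
    for (uint32_t i = 0; i < samples; ++i)
        sum += fetch(s0 + dsOffset * float(i), t0 + dtOffset * float(i));
    return sum / float(samples);
}
```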
 
ATTILA actually implements an angle-independent anisotropic algorithm, though not in the unsupported one-time drop that the open-source version is. In fact it implements three: two botched attempts to discover what the hell ATI/AMD and NVidia are using, and EWA (as Beyond3D readers of the G80 article should know ;)).

Yep, I have looked at the ATTILA code ;) Thanks for the code, I really gotta give EWA a chance, just for the laughs :) At the moment, I've buried the shader-only approaches, as the filtering performance is really not that great, but I still wonder whether someone else has had more luck with this.
 