Optimizing texture bandwidth and memory usage

sebbbi

Veteran
We have rather complex per-pixel material shading in our engine. In addition to the basic color texture and normal map, every material parameter can be adjusted with per-pixel precision. All this data needs to be sampled for each rendered pixel, and this has to happen in the deferred rendering geometry pass, when the bandwidth is already eaten up by the multiple g-buffer writes.

Our artists author the following texture layers for each material:
- Color (rgb)
- Opacity (greyscale)
- Ambient multiplier (greyscale)
- Diffuse multiplier (greyscale)
- Specular multiplier (greyscale)
- Glossiness multiplier (greyscale)
- Self illumination (greyscale)
- Heightmap (greyscale)
- Normalmap (rgb)

Normalmap (tangent space) is usually automatically generated from the heightmap. The ambient multiplier map is usually automatically generated by an ambient occlusion generation tool from the object geometry and heightmap. The other maps are usually hand drawn by the artist.

Currently our texture management toolchain packs these 9 maps into 3 textures like this:

color.r, color.g, color.b, opacity (DXT5)
normal.x, normal.y, height, ambient (8888)
diffuse, specular, glossiness, illumination (8888)

In the pixel shader, the normal vector z component is calculated as z = sqrt(1 - x^2 - y^2). The sign bit is not relevant as the map is in tangent space. All material property values in the texture maps are multipliers for the material value and are normalized to the [0,1] range by our texture management tools.
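For reference, the per-pixel material fetch in the geometry pass looks roughly like the sketch below (placeholder sampler and variable names, assuming the normal xy components are stored biased into [0,1]):

// Minimal sketch of the current material fetch (placeholder names only).
float4 t0 = tex2D(texColorOpacity,    uv); // DXT5: color.rgb, opacity
float4 t1 = tex2D(texNormalHeightAmb, uv); // 8888: normal.xy, height, ambient
float4 t2 = tex2D(texMaterialMuls,    uv); // 8888: diffuse, specular, glossiness, illumination

// Expand the stored xy back to [-1,1] and reconstruct z. The sign does not
// matter because the normal is in tangent space.
float2 nxy    = t1.rg * 2.0 - 1.0;
float3 normal = float3(nxy, sqrt(saturate(1.0 - dot(nxy, nxy))));

float3 color        = t0.rgb;
float  opacity      = t0.a;
float  height       = t1.b;
float  ambient      = t1.a;
float  diffuse      = t2.r;
float  specular     = t2.g;
float  glossiness   = t2.b;
float  illumination = t2.a;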

With this setup the quality is excellent, but each pixel takes 9 bytes of memory.


Current and planned optimizations

Our current texture setup stores most of the data at a much higher precision than we need. Most of the data can be compressed (at least slightly) without any loss of image quality.

By going through our material content I noticed the following things:
- RGB color is optimally stored in the DXT5 RGB channels. No change is needed.
- Opacity is 1.0 for all pixels in over 95% of our textures.
- Opacity does not need the high precision of the separate DXT5 alpha channel, as we only use it for alpha testing. We need more than one bit for a smooth, curvy clipping result, but 2-3 bits should give a result identical to the full 8 bits.
- Self illumination is 0.0 for all pixels in over 95% of our textures.
- In textures with self illumination, the self illumination affects only limited areas, and in those areas the illumination is so strong that it hides other material properties.
- Diffuse and specular are usually connected (both are often the same texture with different brightness and contrast and sometimes smoothing). Often glossiness is also connected to the specular.
- Material value multipliers (specular, diffuse, glossiness) do not need full 8 bit precision. These maps rarely contain any smooth gradients.
- Ambient needs less precision than the other material channels, because our ambient cubemap lighting is rather dim. A value of 1.0 contributes at most 0.15 to the final pixel value (without tone mapping).
- Height is important for parallax mapping to look good, and artifacts in it will look pretty bad.

First stage of optimization:

color.r, color.g, color.b, normal.x (DXT5)
opacity, height, illumination, normal.y (DXT5)
diffuse, specular, glossiness, ambient (8888)

The first step was to decide that I wanted at least two DXT5 textures. This way I could store the color nicely in the first one, and the normal vector (xy) in the two DXT5 alpha channels. This gives me normal map quality identical to the 3Dc 'ATI2' format (as that is simply two DXT5 alpha channels glued together). As opacity is usually 1 and illumination is usually 0, I can put them in the DXT5 r and b channels and the height in the DXT5 g channel (g is 6 bits while r and b are 5). Because either opacity or illumination (or both) are almost always constant for the whole block, all the interpolation values can often be used for the height. This gives me enough quality for the height. For materials with much varying opacity and/or self illumination, the artist can choose the full 8888 format for this combination texture. Less than 5% of materials should need this.
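On the shader side only the channel swizzle changes; roughly like this (again placeholder names, same [0,1] bias assumption for the normal xy):

// Channel assignments for the reorganized layout (placeholder names only).
float4 t0 = tex2D(texColorNx,      uv); // DXT5: color.rgb, normal.x
float4 t1 = tex2D(texOpHtIllNy,    uv); // DXT5 (or 8888): opacity, height, illumination, normal.y
float4 t2 = tex2D(texMaterialMuls, uv); // 8888: diffuse, specular, glossiness, ambient

// normal.xy now come from the two DXT5 alpha blocks (same idea as 3Dc/ATI2).
float2 nxy    = float2(t0.a, t1.a) * 2.0 - 1.0;
float3 normal = float3(nxy, sqrt(saturate(1.0 - dot(nxy, nxy))));

float opacity      = t1.r; // 5-bit endpoints, usually a constant 1.0
float height       = t1.g; // 6-bit endpoints
float illumination = t1.b; // 5-bit endpoints, usually a constant 0.0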

So far the memory and bandwidth usage per pixel has dropped from 9 to 6 bytes without any visible sacrifices in image quality. Less than 5% of materials (by the artist's choice) still use the 8888 format to store (opacity, height, illumination, normal.y) instead of the DXT5 format. The good thing is that this does not affect our code or shaders at all, as both formats have the same channels and the same [0,1] range.


How to get rid of the last uncompressed 8888 texture?

This is why I actually wrote this post. I'd like to have some feedback from game developers who have written games for high end console platforms with proper texture compression support, as my own experience with packing textures comes from platforms like the NDS and PS2/PSP. Palette texture tricks are not something that cuts it on the high end.

ATI1/ATI2:
One way to store these four channels is to use two ATI2 textures. This results in quality almost identical to the uncompressed 8888 texture and saves 50% of the data size and bandwidth. However I need to sample 2 textures instead of one, so there is a performance hit. With the Radeon X1 series ATI added support for the one channel ATI1 format as well, so I was kind of hoping that we would get a four channel ATI4 with the HD series (or an official four DXT5 alpha channel format in DX10.1). That would have been a perfect format for my needs.

DXT5:
This would actually work for some textures and provide a nice 4x compression ratio. I'd store diffuse, specular and glossiness in the rgb components, as they are usually connected (have similar gradients). The ambient data is completely separate, so the unconnected DXT5 alpha channel would suit it well. This only works for rather simple materials, but is surely worth testing.

4444:
16 levels each for the diffuse, specular, glossiness and ambient multipliers (2x compression ratio). For ambient the precision is enough (as it cannot affect more than 0.15 of the final pixel value). This should also be ok for diffuse, but I fear it will cause banding in the specular highlights (a smooth glossiness ramp is the worst case). Normalizing the channels to the [0,1] range (and using the stored value to interpolate between low and high values in the pixel shader) would improve the quality. This should be enough for many materials, but not enough for the most demanding ones.
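The low/high interpolation trick would look something like this in the shader (just a sketch; matMin/matMax are hypothetical per-material constants written by our texture tools, holding the real low and high value of each channel):

// Hypothetical per-material constants: the real [low, high] range of each of
// the four multiplier channels (diffuse, specular, glossiness, ambient).
float4 matMin;
float4 matMax;

float4 SampleMultipliers(sampler2D combineTex, float2 uv)
{
    // 4444: each channel stored normalized to [0,1], 16 levels per channel.
    float4 t = tex2D(combineTex, uv);
    // Re-expand so the 16 levels only cover the range the material actually uses.
    return lerp(matMin, matMax, t);
}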

Currently I am leaning towards a mix of 8888 and 4444 for the last combine texture depending on how demanding the material is. DXT5 should also be used for the simplest materials (and channels sorted so that the DXT5 quality loss is manageable). Any advice on improvements is highly welcome!
 
I'd suggest ATI2 x2, unless you're very texture sampling bound. The least desirable option would be 4444, which is seldom a very useful format. It usually looks crappy even filtered at higher precision because the storage precision is just too low. DXT5 might work as long as the first components are relatively connected. This would save one texture lookup and half the memory, at the expense of noticeably worse quality in many cases. An ATI4 format would be ideal, but I suppose the gains of implementing that in hardware might not be worth the cost.
 
I'd suggest ATI2 x2, unless you're very texture sampling bound. The least desirable option would be 4444, which is seldom a very useful format. It usually looks crappy even filtered at higher precision because the storage precision is just too low. DXT5 might work as long as the first components are relatively connected. This would save one texture lookup and half the memory, at the expense of noticeably worse quality in many cases.

We are actually not sampling bound at all in the geometry pass, so ATI2 x2 should be a pretty good option. However if I go that route, we need to store the second combine texture in 2 formats, and that increases the data size on disk. And I still have to find a good alternative format for Radeon 9000 series and Geforces.

We have already experimented with DXT5 quite a bit, and I very much agree with you that storing unconnected values in the rgb channels completely ruins the image quality, as there is just one color gradient for all 3 channels. For simple materials it's the best choice and offers good image quality, but for complex materials the rgb channels become a blocky mess.

An ATI4 format would be ideal, but I suppose the gains of implementing that in hardware might not be worth the cost.

I don't actually know how the ATI2 support is done on the hardware side. If we assume ATI has implemented it by adding one additional alpha channel sampler to their DXT5 decoder, then ATI4 support would need lots of extra transistors. However if the color and alpha channel decoders for the DXT formats are separate (and can be flexibly assigned to the samplers), then it should not be a big issue (and the extra decoders would help ATI1 and ATI2 decoding as well). In the future a format like ATI4 would be even more useful, as more and more developers are starting to use more and more per pixel material properties (that are not connected and cannot be compressed well with the DXT color formats). Better compressed formats for storing unconnected data are very much needed.
 
sebbbi,
Might I suggest a possible alternative? If all the texture data is the same X&Y size, then what you effectively have is a single texture where every pixel is an N-dimensional value (13 dimensions by the look of it).

You could try compressing all of them in one go by doing principal component analysis of the 13 dimensional space and getting, say, the largest 4 (or 5/6 (YMMV)) representative 13-D vector directions.

You then map all of your data into that 4 dimensional sub-space and encode everything in a single 8:8:8:8 texture plus 4 13-D constant vectors.

In your shader you could then reverse the mapping by taking the accessed texel value and multiplying it by the 4 principal vectors, etc., which you would store as constants in the shader program.
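Something like this on the decode side is what I have in mind (just a sketch; the constant and function names are made up, and each 13-D vector is padded out to four float4 shader constants):

// Mean of the 13-D material data plus the four principal directions, each
// padded to 4 x float4. coeffScale/coeffBias map the stored [0,1] texel back
// to signed projection coefficients. All of these come from the offline PCA.
float4 matMean[4];
float4 matPC0[4], matPC1[4], matPC2[4], matPC3[4];
float4 coeffScale, coeffBias;

void DecodeMaterial(sampler2D coeffTex, float2 uv,
                    out float4 m0, out float4 m1, out float4 m2, out float4 m3)
{
    // One 8888 fetch gives the four projection coefficients for this pixel.
    float4 c = tex2D(coeffTex, uv) * coeffScale + coeffBias;

    // Reverse the projection: mean + sum over k of (coefficient_k * principal_vector_k).
    m0 = matMean[0] + c.x * matPC0[0] + c.y * matPC1[0] + c.z * matPC2[0] + c.w * matPC3[0];
    m1 = matMean[1] + c.x * matPC0[1] + c.y * matPC1[1] + c.z * matPC2[1] + c.w * matPC3[1];
    m2 = matMean[2] + c.x * matPC0[2] + c.y * matPC1[2] + c.z * matPC2[2] + c.w * matPC3[2];
    m3 = matMean[3] + c.x * matPC0[3] + c.y * matPC1[3] + c.z * matPC2[3] + c.w * matPC3[3];
}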
 
Well... that killed the conversation.

;) Great suggestion! You might want to include a human perception factor (for example, people are much less sensitive to relatively consistent errors than to localized high contrast errors), and other manually generated weighting factors (e.g. manually reduce the weight of the opacity) in your component analysis, so that values are weighted not just by standard error metrics.
 
Well... that killed the conversation.

I was actually analyzing the method. However I am still skeptical about the image quality, as this method would only have four completely independent channels. The current biggest problem with DXT5 is the dependency between the rgb channels. With the method you suggest, the dependency is whole image wide instead of limited to 4x4 blocks. The problem is that you cannot afford any quality degradation in the normal vector and height, and incorrect changes in specularity look just awful. So there are already 5 channels that need almost perfect reproduction, and cannot be affected by the other channels.

Also emissive (self illumination) and opacity seem almost impossible to compress properly with this method, as any false values in either one are instantly visible. And at the same time, emissive is zero and opacity is one for almost all pixels in the image. But you still need the values to be correct in the (very limited) areas where either one is used.

And this takes 4 bytes per pixel. For 3 bytes you get 3 x DXT5 textures, and almost always better quality. However it's a very interesting idea for compressing multichannel images. With 3 x ATI2 textures you can reach the same 3 bytes per pixel and have 6 multipliers per pixel.
 
Perhaps compressing 13D to 4D won't work for you, but maybe 13D to 8D will.

Given that G-Buffer creation is probably limited by the ROPs rather than by texture reads, if you could get your writes down to, say, 1-2 MRTs (instead of 3-4 MRTs) by writing 4-8 principal component coefficients, perhaps you would get a good improvement (on both the ROP writes and the G-Buffer read back during lighting).
 
I was actually analyzing the method. However I am still skeptical about the image quality, as this method would only have four completely independent channels.
Well, YMMV, but you may be lucky. I used VQ to compress 16-D data (to 256 representatives) and it often did a very good job (do a search for Dreamcast's texture compressor). Obviously, you have more degrees of freedom (by encoding directly rather than using a lookup) so may do better.

One other thing you could try is to encode the channels with, say, 4 bits of precision, or maybe use that on the less significant ones.

Another option would be to take the more significant vectors and encode those with the "alpha" encoding schemes of DXT, which would get an 8-bit channel down to 4-bits with probably negligible loss of quality.

The current biggest problem with DXT5 is the dependency between the rgb channels. With the method you suggest, the dependency is whole image wide instead of limited to 4x4 blocks.
Yes, but you are also allowing the system to exploit the (often frequent) correlation in separate areas of the image data.

Also emissive (self illumination) and opacity seem almost impossible to compress properly with this method, as any false values in either one are instantly visible. And at the same time, emissive is zero and opacity is one for almost all pixels in the image. But you still need the values to be correct in the (very limited) areas where either one is used.
As I said, just try it. You have nothing to lose, apart from a little time downloading an eigenvector calculation function and coding up the (rather trivial) covariance matrix generation. <shrug>. If you find that you only have a few significant eigenvalues, you have a winner.

Alternatively, you could try using PVRTC :) That typically does a much better job on normal vectors than DXTC and it's only 4bpp. Perfect if you want to run it on one of the leading mobile devices. ;)
 
Well, YMMV, but you may be lucky. I used VQ to compress 16-D data (to 256 representatives) and it often did a very good job (do a search for Dreamcast's texture compressor).
Those 256 representatives can be anywhere in the 16D space (each of them still contains 16 components). Your proposal limits the selection of possible locations to a tesseract in 13D space. That's (somewhat) like limiting the color selection in an RGB texture to a line in the 3D color space. That's one line for the entire picture, not just for each block like DXT1.
Obviously, you have more degrees of freedom (by encoding directly rather than using a lookup) so may do better.
Wouldn't that be fewer degrees of freedom?
 
Those 256 representatives can be anywhere in the 16D space (each of them still contains 16 components).
Well, yes... and no. The "shape" of the space is generally going to be like some sort of hyper-ellipse, and so I would expect that being able to place any pixel at any position in the sub-space governed by the N (=4?) principal vectors is probably going to be more flexible in general than just having a subset of 256 locations taken from anywhere in the space. The latter compresses more, but is somewhat lossy.

Your proposal limits the selection of possible locations to a tesseract in 13D space. That's (somewhat) like limiting the color selection in an RGB texture to a line in the 3D color space. That's one line for the entire picture, not just for each block like DXT1.
Wouldn't that be fewer degrees of freedom?
I understand the principle completely - I was just saying that there may be a very good chance that only a small set of the eigenvectors are significant, and so Sebbbi may be able to significantly compress his data with little error.
 