Texture Compression and Quality Metrics (OpenGL 4.3 thread spinoff)

You're "hiding your light under a bushel" :) A great feature that needs to be advertised!

:) There is so much text missing ... It'll take a while to become really neat.

But now I am really confused. How are you measuring the RMSE? For example, take image #30. Your table says the (R)MSE for S3TC with the AMD Compressonator is 4.52, but when I tried it (with version 1.30.1084) I got 7.80, which I double-checked against our own differencing tool. Something is definitely wrong.

Here are the files I used, TGA as input and DDS as output, with the explicit diff-results:
http://squish.paradice-insight.us/fileadmin/materials/RGB_OR_1200x1200_030-dxt1-ati-s3tc.7z

The commandline is:
TheCompressonator.exe -convert in/${file}.tga ${basename}-weighted.dds -format .dds -codec ATICompressor.dll +fourCC DXT1 +alpha_threshold 128 +red 0.3086 +green 0.6094 +blue 0.082
(without the weights for linear: "+red 0.3333 +green 0.3334 +blue 0.3333")

ETC I'd understand, because it's probably using the ETC1 reference encoder source; but for PVRTC, what version of the encoder and what sort of system are you running it on?

PVRTexToolCL version 3.40
Uses: PVRTexLib version 4.2

The commandline:
PVRTexToolCL.exe -i inp.png -o result.pvr -d result.png -l -m 1 -f PVRTC2_4,UB,lRGB -q pvrtcbest
or:
PVRTexToolCL.exe -i inp.png -o result.pvr -d result.png -l -m 1 -f ETC2_RGB,UB,lRGB -q etcslow

So "pvrtcbest" + PVRTC2 is one that takes so long I guess. "pvrtchigh" takes around 5 to 15 minutes for PVRTC2 4bpp, the 2bpp takes 20 to 60 minutes, the PVRTC1 4bpp takes a few tens of seconds, the 2bpp a minute or two. "etcslow" + ETC2 takes around 20 to 40 minutes. But that's on the 1200x1200 RGB ones. The 3000x2000 RGBA images are the ones reaching 3 hours.
I have a Phenom II 3.6GHz, 4 cores. I let the program multi-thread.

I know the PVRTC research compressor still needs work but, for example, with image 030 of the 1200x1200 colour set, it took around 154 CPU-seconds (i.e. ~20 s wall-clock on a shared, multi-core Linux machine) on a "higher than PVRTexTool's high" quality setting. Mind you, this version is slightly newer than the public SDK.

Sounds like PVRTC1?
 
The main effect of using larger pictures is that you, in general, get less detail per pixel - which makes things easier for every encoder and reduces the differences between them (the range between the highest-quality and lowest-quality in your RGB tests - excluding obviously-broken stuff - seems to be about 7 dB as measured by PSNR, while for the old Kodak images the difference is on the order of about 15 dB).

The main reason for us proponents of larger images is rather that we get more convergent responses from the encoders. When the images are small, it becomes a battle of sparse information encoding (variable-length integer coding for the image header, anyone? :p ), whereas what you actually want to understand is, say, the adaptive information-extraction capabilities of your coder.
We all know the pigeonhole principle; consequently, the more data you shuffle into your algorithm, the smaller the chance that you biased it towards a specific set of data, which in turn makes it generalize to other data (not necessarily images). That happens more often than one would think. It's a kind of poor man's Kolmogorov-complexity convergence approach: find an algorithm that compresses essentially all of the images well, and pretty much only the non-image data badly.
Another reason is that higher resolution makes outliers less influential - convergence to a stable statistic.
Well, I come from lossless entropy-coding image compression, not lossy fixed-rate hardware compression.

If you want something representative of 2000+, then I would suggest the Ryzom Asset Repository (http://media.ryzom.com/) which contains a large collection of actual (uncompressed) textures developed for an actual commercial game - which I would expect to be a lot more representative of textures than random photography collections. (Ryzom is an MMORPG that was originally released as a commercial game in 2004; unlike, say, World of Warcraft, it never really took off, and as a result, its art assets - including several thousand textures - eventually ended up being released under a creative-commons licence).

Nice, very nice. I am facing this lack-of-testset problem for the normal-map evaluation; currently I picked textures I use for raytracing, but I don't have object-space normal maps, for example. I do have access to huge amounts of raw artist material at work, but obviously I can't use it.
I thought about using Vampire's textures - they were TGAs - but they are much too small and too smooth. I tend to prefer problem cases. :LOL:
I tried to get originals from the Elder Scrolls community, but not much response ...

As for the blue stuff that you and SimonF discussed, I note it as an example of psychovisuals in action: in the case of the blue color channel, the human eye is very sensitive to hue but quite insensitive to detail. As a result, getting blue right for any individual pixel is rather unimportant, especially in high-detail regions, but getting the regional average right for any region (of a sufficiently large size) of the image is very important. The so-called "perceptual" 0.21/0.7/0.09 weighting captures the per-pixel unimportance well but completely fails to capture the importance of regional average, with occasional bad results.
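
To make that concrete, here is a minimal sketch (my own helper functions, assuming HxWx3 float arrays in [0, 255]) of the two kinds of error being contrasted: the luma-weighted per-pixel error, under which blue is almost free, and the error of regional (block) averages, which is exactly where a blue drift shows up:

Code:
import numpy as np

# Illustrative helpers only; the weights and block size are arbitrary choices.
PERCEPTUAL_WEIGHTS = np.array([0.21, 0.70, 0.09])

def weighted_per_pixel_rmse(img, ref):
    """Luma-weighted per-pixel RMSE: blue contributes only 9%, so per-pixel blue
    errors barely register under this measure."""
    diff2 = (img.astype(np.float64) - ref.astype(np.float64)) ** 2
    return np.sqrt(np.mean(diff2 @ PERCEPTUAL_WEIGHTS))

def regional_average_error(img, ref, block=16):
    """Mean absolute difference of per-block channel averages; a regional blue
    shift that the weighted RMSE barely notices shows up clearly here."""
    h = (img.shape[0] // block) * block
    w = (img.shape[1] // block) * block
    def block_means(x):
        x = x[:h, :w].astype(np.float64)
        return x.reshape(h // block, block, w // block, block, 3).mean(axis=(1, 3))
    return np.abs(block_means(img) - block_means(ref)).mean(axis=(0, 1))  # per channel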

Yeah, blue test images have their upsides and downsides. Luckily they are not all that blue.
 
Very cool Ryzom assets. Continuing, I'd like to point out the http://exquires.ca sampling test suite and the corresponding image bank from ImageMagick.

http://www.imagemagick.org/download/image-bank/16bit840x840images/
Described in http://www.imagemagick.org/discourse-server/viewtopic.php?f=22&t=20691

Besides the metrics and correlations present in the suite, one thought that occurred to me is the importance of sampling for textures.
In that sense, Exquires can be used to check up-/downsampling for both the final texture size and the mipmap levels, because the mentioned image bank was constructed with specific sampling ratios in mind.

840 was chosen as width and height because it is the least common multiple (lcm) of 2, 3, 4, 5, 6, 7 and 8. Consequently, all images can be exactly box downsampled by these ratios, and enlarging back the reduced versions with the "align corners" image geometry convention using these integer ratios produces images with the original 840x840 dimensions.
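
Just to illustrate the choice of 840 (a tiny sketch of my own, not part of the Exquires suite; math.lcm needs Python 3.9+):

Code:
import math
from functools import reduce

ratios = [2, 3, 4, 5, 6, 7, 8]
print(reduce(math.lcm, ratios))     # 840 - the least common multiple
print([840 // r for r in ratios])   # every box-downsampled size is exact:
                                    # [420, 280, 210, 168, 140, 120, 105]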
 
Oh, nice. 16bit images are always very much welcome.

I thought about providing results for filtered images, sampled axis-aligned exactly in the middle between pixels, at (+0.0,+0.5) and (+0.5,+0.0); it represents a 45° rotation and a bit of zoom-out. It's actually the worst case for 3Dc. It may be a very favourable case for PVRTC.
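
Under plain bilinear filtering such a half-pixel shift is just the average of two neighbouring texels along one axis; a minimal sketch (illustrative only, not the exact tool I would use):

Code:
import numpy as np

def half_texel_shift(img, axis):
    """Bilinear sample at a +0.5 offset along one axis: each output texel is the
    average of two neighbouring input texels, so the result is one texel shorter
    along `axis`. `img` is a float array, e.g. HxWx3."""
    a = np.asarray(img, dtype=np.float64)
    lo = [slice(None)] * a.ndim
    hi = [slice(None)] * a.ndim
    lo[axis], hi[axis] = slice(None, -1), slice(1, None)
    return 0.5 * (a[tuple(lo)] + a[tuple(hi)])

# half_texel_shift(tex, axis=1)   # the (+0.5, +0.0) case
# half_texel_shift(tex, axis=0)   # the (+0.0, +0.5) case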
 
The main reason for us proponents of larger images is rather that we get more convergent responses from the encoders. When the images are small, it becomes a battle of sparse information encoding (variable-length integer coding for the image header, anyone? :p ), whereas what you actually want to understand is, say, the adaptive information-extraction capabilities of your coder.
OK, but is that really a battle you would want to avoid? In particular, it would seem to me that no matter how large input images you use, this particular battle will just start up again and re-diverge the encoders as soon as you start constructing a mipmap pyramid?
Another reason is that higher resolution makes outliers less influential - convergence to a stable statistic.
This seems like an argument for having a much larger total number of textures in the overall corpus than for large textures as such; large textures + mipmapping leaves you vulnerable to outliers at the higher miplevels again.
Well, I come from lossless entropy-coding image compression, not lossy fixed-rate hardware compression.
The world of lossless coding is in many ways a rather simpler and cleaner one. On this side (lossy), the "pigeon holes" would basically map to just-perceptibly-different encodings rather than bit-strings that differ in one or more bits - which gets complicated further by "just-perceptibly-different" behaving rather differently between different use cases. Welcome to the jungle, I guess.
Nice, very nice. I am facing this lack-of-testset problem for the normal-map evaluation; currently I picked textures I use for raytracing, but I don't have object-space normal maps, for example. I do have access to huge amounts of raw artist material at work, but obviously I can't use it.
I thought about using Vampire's textures - they were TGAs - but they are much too small and too smooth. I tend to prefer problem cases. :LOL:
I tried to get originals from the Elder Scrolls community, but not much response ...
I had another look at the Ryzom data set; while it has lots of RGB and RGBA textures, it doesn't appear to have any normal maps at all. As such, I looked around a bit for normal maps elsewhere. Here is what I found:
I would have liked to suggest metrics for use when evaluating normal map compression, but other than just flat-PSNR and angular-error-PSNR, I couldn't really find much of anything (well, there is http://www.graphicshardware.org/previous/www_2007/presentations/munkberg-tightframenormal-gh07.pdf who used their normal maps to perform a reflection effect and then compute SSIM on the resulting rendered frames, however this seems like a rather nonstandard and hard-to-use metric). Which is a bit of a shame; the impression I've gotten is that PSNR-type metrics are considerably more flawed for normal maps than for RGB (as if PSNR wasn't bad enough on RGB already).



If you want 16-bit content, there is a handful of grayscale and RGB content at http://www.imagecompression.info/.

I guess you will be looking at HDR textures at some point as well, so here are a few starting points (albeit no really big corpora like Ryzom or Red Eclipse):
 
well, there is http://www.graphicshardware.org/previous/www_2007/presentations/munkberg-tightframenormal-gh07.pdf who used their normal maps to perform a reflection effect and then compute SSIM on the resulting rendered frames, however this seems like a rather nonstandard and hard-to-use metric.

One common mistake with SSIM is to assume it's a linear percentage, like:

0.98 == 98% or 0.96 == 96% (just 2% apart).

In fact it's a log metric in this form, and 0.98 is twice the quality of 0.96; for that reason, use DSSIM (a linear distance metric). Or, in x264 there is a dB version, but SSIM in x264 has a particular partition scheme, so take care of that when comparing different SSIM implementations.

http://git.videolan.org/gitweb.cgi/x264.git/?a=commit;h=2d2abd8286f3744d79349a162e506f6502c52c56

EDIT: A bunch of metrics in a Matlab package; a few of them I've never heard of, but there's potential.
http://foulard.ece.cornell.edu/gaubatz/metrix_mux/
 
:)
Here are the files I used, TGA as input and DDS as output, with the explicit diff-results:
http://squish.paradice-insight.us/fileadmin/materials/RGB_OR_1200x1200_030-dxt1-ati-s3tc.7z

The commandline is:
TheCompressonator.exe -convert in/${file}.tga ${basename}-weighted.dds -format .dds -codec ATICompressor.dll +fourCC DXT1 +alpha_threshold 128 +red 0.3086 +green 0.6094 +blue 0.082
(without the weights for linear: "+red 0.3333 +green 0.3334 +blue 0.3333")
May I suggest that you compute RMSE simply as sqrt( Sum (Ri^2 + Gi^2 + Bi^2) / N )? I believe that is more correct, and it has the added benefit that you can use it to compute PSNR if so desired.
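
Something like this, for reference (a minimal sketch, assuming 8-bit RGB images loaded as numpy arrays; the function names are just for illustration):

Code:
import numpy as np

def rmse_rgb(img, ref):
    """sqrt( Sum(Ri^2 + Gi^2 + Bi^2) / N ), with N the number of pixels."""
    diff = img.astype(np.float64) - ref.astype(np.float64)   # HxWx3
    n_pixels = diff.shape[0] * diff.shape[1]
    return np.sqrt(np.sum(diff ** 2) / n_pixels)

def psnr_rgb(img, ref, peak=255.0):
    """Per-sample PSNR from the same differences. Note: dividing the sum of squares
    by 3*N (samples) instead of N (pixels) is the usual per-sample MSE; the two
    conventions differ by a constant 10*log10(3), about 4.77 dB."""
    diff = img.astype(np.float64) - ref.astype(np.float64)
    mse = np.mean(diff ** 2)                                  # sum / (3*N)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)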

I have a Phenom II 3.6GHz, 4 cores. I let the program multi-thread.
Windows or Linux? I've just found that Windows multithreading might not be quite :rolleyes: as efficient as Linux so there'll have to be some changes... sigh.

Sounds like PVRTC1?
Yes. There are a lot more modes to try in TC2.
 
In fact it's a log metric in this form, and 0.98 is twice the quality of 0.96; for that reason, use DSSIM (a linear distance metric).

Yes, I give linear M-SSIM values in the results: 0.0 is indistinguishable, and it goes up from there.

EDIT: A bunch of metrics in a Matlab package; a few of them I've never heard of, but there's potential.
http://foulard.ece.cornell.edu/gaubatz/metrix_mux/

Nothing really good for normal-maps.

OK, but is that really a battle you would want to avoid? In particular, it would seem to me that no matter how large input images you use, this particular battle will just start up again and re-diverge the encoders as soon as you start constructing a mipmap pyramid?

My area of interest is lossless multi-resolution image compression based on recursive adaptive predictors, i.e. progressive reconstruction of "missing" pixels on a recursive quincunx lattice. In simple words, it's point-filtering, which means that for any 2x2 block below the seed, 1 of the 4 pixels is already correct. In addition, because the lattice is quincunx-rotated, you get a resolution level for every doubling of the pixel count, instead of every quadrupling:

lvl 1: (2 pixels)
x - x -
- - - -

lvl 1|2: (4 pixels)
x - x -
- x - x

lvl 2: (8 pixels)
x x x x
x x x x

and you can code Bayer patterns almost as-is. It's quite a bit more efficient than wavelets or other multi-resolution filters for lossless coding. It also maximizes the information available in the context, much more than plain scan-line traversal (as in JPEG-LS).
The seed resolution is relatively tiny in comparison to the higher resolution levels if the image is large. So it's quite the opposite: bigger last levels hide the inefficiency of the seed, which means you want larger images to cover the warm-up. And because it's context-coded, you want to fill up the contexts - nothing is worse than sparse contexts.
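
In case the pattern above isn't obvious, here is a tiny sketch (my own illustration) that reproduces those diagrams: a pixel exists `halvings` doubling steps below full resolution if it survives that many quincunx coarsening steps.

Code:
def present_at(r, c, halvings):
    """True if pixel (r, c) already exists `halvings` resolution-doubling steps below
    the full image (0 = full resolution). Each step drops the pixels with (r + c) odd
    and rotates the surviving quincunx lattice back onto a square grid."""
    for _ in range(halvings):
        if (r + c) % 2:
            return False
        r, c = (r + c) // 2, (r - c) // 2
    return True

for halvings, name in [(2, "lvl 1"), (1, "lvl 1|2"), (0, "lvl 2")]:
    print(name)
    for r in range(2):
        print(" ".join("x" if present_at(r, c, halvings) else "-" for c in range(4)))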

This seems like an argument for having a much larger total number of textures in the overall corpus than for large textures as such; large textures + mipmapping leaves you vulnerable to outliers at the higher miplevels again.

Both; there are outliers within pictures as well. I agree that for block coding it is entirely irrelevant whether you compress images separately or merged. For fun, I made mosaics of some corpora long ago:
http://cdb.paradice-insight.us/?corpus=51&class=7

It's another matter for lossless coders. Ideally it should be the same there as well, but local texture separation/classification is quite a challenge for most image compressors. Some people say compression, learning and A.I. are all the same. :)

The world of lossless coding is in many ways a rather simpler and cleaner one. On this side (lossy), the "pigeon holes" would basically map to just-perceptibly-different encodings rather than bit-strings that differ in one or more bits - which gets complicated further by "just-perceptibly-different" behaving rather differently between different use cases. Welcome to the jungle, I guess.

Indeed! :LOL:

I had another look at the Ryzom data set; while it has lots of RGB and RGBA textures, it doesn't appear to have any normal maps at all. As such, I looked around a bit for normal maps elsewhere. Here is what I found:

Thanks. Did you see object-space normals in there?

I would have liked to suggest metrics for use when evaluating normal map compression, but other than just flat-PSNR and angular-error-PSNR, I couldn't really find much of anything (well, there is http://www.graphicshardware.org/previous/www_2007/presentations/munkberg-tightframenormal-gh07.pdf who used their normal maps to perform a reflection effect and then compute SSIM on the resulting rendered frames, however this seems like a rather nonstandard and hard-to-use metric). Which is a bit of a shame; the impression I've gotten is that PSNR-type metrics are considerably more flawed for normal maps than for RGB (as if PSNR wasn't bad enough on RGB already).

Oh yes, I've been breaking my head over that for months. You can do PSNR on normals, you just need to define variance in angle-space. Although you then get squared angle deviation, and I can't decide whether absolute or squared angle deviation is the correct one from a math perspective. I'll probably go for the squared angle.
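
For concreteness, a minimal sketch of the squared-angle variant (my own code; normals are assumed to be already decoded back to unit vectors, and 180° is taken as the peak, which is itself an arbitrary choice):

Code:
import numpy as np

def angular_psnr(n_test, n_ref, peak_deg=180.0):
    """PSNR over per-pixel angular deviation between two normal maps.
    Inputs are HxWx3 arrays of unit normals; the variance is taken in angle-space
    (squared angular deviation), with 180 degrees as the signal peak."""
    dots = np.clip(np.sum(n_test * n_ref, axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(dots))        # per-pixel deviation in degrees
    mse = np.mean(angles ** 2)                  # squared angle deviation
    return float("inf") if mse == 0 else 10.0 * np.log10(peak_deg ** 2 / mse)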

If you want 16-bit content, there is a handful of grayscale and RGB content at http://www.imagecompression.info/.

I've known Sachin for a decade or so, and I actually helped to select some of those images. :)
I just wish there were a hundred or so in perfect state: same size, same range, etc.
 
PING! This topic seems to have gone to sleep.
One common mistake with SSIM is to assume it's a linear percentage, like:

0.98 == 98% or 0.96 == 96% (just 2% apart).

In fact it's a log metric in this form, and 0.98 is twice the quality of 0.96; for that reason, use DSSIM (a linear distance metric).
According to Wikipedia, the mapping from SSIM to DSSIM is just
DSSIM(image, ref) = (1 - SSIM(image, ref)) / 2

Excuse my ignorance, but in what sense is that then linear?
EDIT: A bunch of metrics in a Matlab package; a few of them I've never heard of, but there's potential.
http://foulard.ece.cornell.edu/gaubatz/metrix_mux/
A colleague who's been working on metrics said that (AFAIU) there are some possible problems with MeTriX in that it implements the "old" SSIM and, for MSSIM, it uses a different set of wavelets (actually a better type) than those of the published paper <shrug>.
 
IIRC, any distance metric is linear or at least "linear" comparable.

Code:
1/(1-0.98)==50  1/(1-0.96)==25  *** Expected two times lower quality***
The multiple implementations of DSSIM are a problem. The one I've been using in research papers since 2010 was: DSSIM = 1/(1 - SSIM).

I used the version from Wikipedia at the time, but I checked both Doom9 and x264dev and everyone advised that this one was correct. I haven't tested it, but one implementation I found comes from floating-point texture compression:

http://pholia.tdi.informatik.uni-frankfurt.de/~philipp/publications/ftc.shtml

Edit: found another DSSIM implementation: https://github.com/pornel/dssim
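
To keep the definitions straight, a tiny sketch (my own) of the two conventions mentioned in this thread:

Code:
def dssim_half(ssim):
    """(1 - SSIM) / 2 -- the Wikipedia definition quoted earlier."""
    return (1.0 - ssim) / 2.0

def dssim_reciprocal(ssim):
    """1 / (1 - SSIM) -- the convention used above; it diverges as SSIM -> 1."""
    return 1.0 / (1.0 - ssim)

for s in (0.90, 0.96, 0.98):
    print(s, dssim_half(s), dssim_reciprocal(s))
# Going from 0.96 to 0.98 halves (1 - SSIM): the first value halves, the second
# doubles. Both preserve the ordering but differ wildly in scale, which is why
# mixing implementations is dangerous.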

About comparisons over enormous amounts of data: I've run deep statistical tests with PSNR and DSSIM using http://sjeng.org/bootstrap.html
Bootstrap gives you the confidence that a difference in means between two techniques (e.g. faster speed, lower energy) is not due to random chance. It also corrects the confidence if you want to compare many such techniques at the same time, i.e. it avoids this problem:

http://xkcd.com/882/

If you are still tuning your method, you probably have many parameter settings you want to try. You can try to find the optimal ones on one set of cases (but relaxing the confidence to accept an improvement), and at the very end verify the 2-3 best candidates with bootstrap on another set of cases.

If you have many parameter settings to try and the differences between them are small (low bootstrap confidence), you will probably need to increase the number of test cases (maybe by an order of magnitude) in order to make meaningful progress. Bootstrap.py was successfully used for audio codec comparisons at HydrogenAudio and in A.I. research.
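
For anyone who doesn't want to pull in the linked script, the core idea is a paired bootstrap over the same image set; a minimal sketch (my own, without the multiple-comparison correction that bootstrap.py applies):

Code:
import numpy as np

def paired_bootstrap_confidence(scores_a, scores_b, n_resamples=10000, seed=0):
    """Fraction of bootstrap resamples (drawn over the same images for both codecs)
    in which codec A has a lower mean score than codec B. Assumes a lower-is-better
    metric such as DSSIM; no correction for comparing many techniques at once."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=np.float64)
    b = np.asarray(scores_b, dtype=np.float64)
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))   # resampled image indices
    return float((a[idx].mean(axis=1) < b[idx].mean(axis=1)).mean())
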
Another package for ANOVA with awesome charts is TIBCO Spotfire; the attached chart (3770_K.png) shows a sample from a Linux kernel CPU scheduler comparison.
 
At least on the TID2008 data, using 1/(1 - SSIM) has a lower linear correlation coefficient with the subjective data than simply using SSIM directly.

Regarding the Wikipedia entry, one can wonder, but the definition is very similar to Pearson's distance:

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

The accepted method for linearization is a 5-parameter curve which includes a linear term, a constant term and a logistic function:

http://atc.umh.es/gatcom/bin/oqam/Referencias/Sheikh_Sabir_Bovik_2006.pdf

The VQEG recommendation didn't include the linear term and can be seen at:

ftp://vqeg.its.bldrdoc.gov/Documents/Meetings/Hillsboro_VQEG_Mar_03/VQEGIIDraftReportv2a.pdf (page 18)
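
If anyone wants to reproduce such a fit, here is a short sketch (my own, using SciPy; the starting guess p0 is just a rough heuristic) of the 5-parameter logistic-plus-linear mapping described in the Sheikh/Sabir/Bovik paper:

Code:
import numpy as np
from scipy.optimize import curve_fit

def logistic5(x, b1, b2, b3, b4, b5):
    """Objective-score -> MOS mapping: logistic term + linear term + constant.
    The VQEG recommendation mentioned above drops the linear b4*x term."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

# objective = np.array([...])   # e.g. SSIM per distorted image
# mos       = np.array([...])   # subjective scores for the same images
# p0 = [np.ptp(mos), 1.0, np.mean(objective), 0.0, np.mean(mos)]
# params, _ = curve_fit(logistic5, objective, mos, p0=p0, maxfev=20000)
# fitted = logistic5(objective, *params)   # then correlate `fitted` with `mos`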
 
IIRC, any distance metric is linear or at least "linear" comparable.

Code:
1/(1-0.98)==50  1/(1-0.96)==25  *** Expected two times lower quality***
The multiple implementations of DSSIM are a problem. The one I've been using in research papers since 2010 was: DSSIM = 1/(1 - SSIM).

I haven't seen any published result showing that DSSIM = 1/(1 - SSIM) has a better linear correlation with subjective data than using SSIM itself.

If one looks at the SSIM data for the TID2008 database, this version of DSSIM has a linear correlation with the subjective scores (Pearson's linear correlation coefficient) of 0.41, while SSIM itself has a linear correlation of 0.75. So that version of DSSIM is much worse at linearizing subjective results than SSIM.

The VQEG advised method of linearization is a fit to the data using a (linear +) logistic equation (check ftp://vqeg.its.bldrdoc.gov/Documents/Meetings/Hillsboro_VQEG_Mar_03/VQEGIIDraftReportv2a.pdf, page 18).

Another study which looks at this is:
http://atc.umh.es/gatcom/bin/oqam/Referencias/Sheikh_Sabir_Bovik_2006.pdf
 
Thanks for the information.

I used DSSIM to make objective comparisons at the same bitrate, using ANOVA to identify statistically significant DSSIM differences (p < 0.05), mainly to avoid falling into the 2% fallacy shown above. The RD curves did not show any problems. What are your recommendations on this?
 
Thanks for the information.

I used DSSIM to make objective comparisons at the same bitrate, using ANOVA to identify statistically significant DSSIM differences (p < 0.05), mainly to avoid falling into the 2% fallacy shown above. The RD curves did not show any problems. What are your recommendations on this?

Not sure what you mean by identifying DSSIM significant results. I'm not an expert in ANOVA tests, but if you have only two datasets, you should consider running a t-test.

Also, I don't know which statistic you used to characterize your samples: usually ANOVA or t-tests are designed to compare means for two or more datasets, not to check linearity.

Regression studies and correlation indices (Pearson, Spearman, Kendall, ...) are quite useful for this, and are often quoted as a measure of model quality. This forum is probably not the place to discuss statistics, though! ;)
 
These metrics are relatively uninteresting for texture compression. A metric would have to factor in texture filtering. And nowadays half of the textures contain parameters, not visual information - in theory one would have to measure and optimize a lossy compressor against rendered output quality, and that's rather NP-superhard.

I think quality-metric research is drifting in the A.I. direction: they try to build adequate human-simulating receptors, while thinking less and less about the overhead of actually calculating the proposed metric at runtime in a producer (say, metric feedback on a noisy line with sample-adaptive error codes). As such, their use is limited to compressor bashing; they are not suitable for improving algorithms in actual online machine learning.
 