Realtime texture compression formats are specifically designed to be fast and usable directly by the texture units, so they have to sacrifice some opportunities for tighter packing.
And one of the obvious, necessary requirements is usually a constant compression factor, so that addressing a compressed image works the same as addressing an uncompressed one. That restricts you to lossy, fixed-rate codecs, which on top of that can't take advantage of features that would otherwise compress extremely well.
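Just to spell out the addressing argument: with a fixed-rate format like BC1 the block offset of any texel is plain arithmetic, no side table needed. A minimal sketch (names are mine, not from any particular API):

```cpp
#include <cstdint>
#include <cstddef>

// With a fixed-rate format like BC1 (8 bytes per 4x4 block), the texture unit
// can compute the address of any texel's block with plain arithmetic.
size_t bc1_block_offset(uint32_t x, uint32_t y, uint32_t width_texels)
{
    const uint32_t blocks_per_row = (width_texels + 3) / 4;
    const uint32_t block_x = x / 4;
    const uint32_t block_y = y / 4;
    return (size_t(block_y) * blocks_per_row + block_x) * 8; // 8 bytes per BC1 block
}
// A variable-rate codec breaks this property: the offset of a block would
// depend on how well everything before it happened to compress.
```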
E.g. take a 3-channel image of something resembling a typical company logo, just two colors and few features, at 1k x 1k resolution. That's 4MB raw (stored as RGBA8), and still 1MB with the S3TC family, while a decent general-purpose compression algorithm can usually bring it down to a few kB.
Of course it can't get as small as e.g. PNG, but that format simply isn't made for random access at all.
But imagine e.g. a layer on top of good old BC1, except you declare that a single BC1 block may be up-scaled to represent an entire macro block of 16x16 or 128x128 texels, rather than just the usual 4x4 block. Still good enough for flat colors, and maybe even most gradients. The same lower-level block may also be re-used by several upper-level macro blocks. The lookup table is still fairly easy to construct yourself: reduce the original picture resolution by 4x4 and store a 32-bit index to the representative block per tile. If you feel like it, spend 1 or 2 bits on flagging scaled use of macro blocks.
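A minimal sketch of how such an indirection table could be built offline, assuming the image has already been encoded into regular 4x4 BC1 blocks. The block pool is deduplicated by exact byte comparison, and each tile gets a plain 32-bit index (the optional scale-flag bits are omitted for brevity). All names are hypothetical:

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <vector>

using Bc1Block = std::array<uint8_t, 8>; // one 4x4 BC1 block, 8 bytes

struct IndirectionTexture
{
    std::vector<Bc1Block> block_pool; // unique BC1 blocks
    std::vector<uint32_t> lut;        // one 32-bit index per 4x4 tile
    uint32_t tiles_x = 0, tiles_y = 0;
};

// 'blocks' holds the regular BC1 encoding of the image, one entry per 4x4 tile,
// in row-major tile order. Identical blocks collapse to a single pool entry.
IndirectionTexture build_indirection(const std::vector<Bc1Block>& blocks,
                                     uint32_t tiles_x, uint32_t tiles_y)
{
    IndirectionTexture out;
    out.tiles_x = tiles_x;
    out.tiles_y = tiles_y;
    out.lut.reserve(blocks.size());

    std::map<Bc1Block, uint32_t> seen; // block bytes -> index into pool
    for (const Bc1Block& b : blocks)
    {
        auto [it, inserted] = seen.try_emplace(b, uint32_t(out.block_pool.size()));
        if (inserted)
            out.block_pool.push_back(b);
        out.lut.push_back(it->second); // 1 or 2 spare index bits could flag a
                                       // macro block covering more than 4x4 texels
    }
    return out;
}
```

For the two-color logo example above, the pool collapses to a handful of unique blocks, so almost all of the footprint is the LUT itself.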
So far, this is something you can implement yourself, in pure software: unconditionally trade an extra texture lookup into a LUT that is 16x smaller than the original image (32x if you can live with a 16-bit index) for the corresponding savings in memory footprint. You still benefit from HW-accelerated S3TC decompression in the second stage, and will likely get a cache hit on both the first and the second stage. Up to this point you already have a decent variable-rate texture compression scheme. Applied virtual texturing, without the deferred feedback path.
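And a matching sketch of the two-stage lookup in software, reusing the types from the previous snippet. The BC1 decode is the plain textbook variant (opaque 4-color mode only), just to show that the second stage stays an ordinary fixed-rate block fetch:

```cpp
#include <cstdint>

struct Rgb { uint8_t r, g, b; };

static Rgb expand565(uint16_t c)
{
    return { uint8_t((c >> 11 & 31) * 255 / 31),
             uint8_t((c >> 5  & 63) * 255 / 63),
             uint8_t((c       & 31) * 255 / 31) };
}

// Decode a single texel (tx, ty in 0..3) out of one 8-byte BC1 block.
// Simplified: always uses the 4-color (opaque) palette.
static Rgb decode_bc1_texel(const Bc1Block& blk, uint32_t tx, uint32_t ty)
{
    const uint16_t c0 = uint16_t(blk[0] | blk[1] << 8);
    const uint16_t c1 = uint16_t(blk[2] | blk[3] << 8);
    const Rgb e0 = expand565(c0), e1 = expand565(c1);
    const Rgb palette[4] = {
        e0, e1,
        { uint8_t((2 * e0.r + e1.r) / 3), uint8_t((2 * e0.g + e1.g) / 3), uint8_t((2 * e0.b + e1.b) / 3) },
        { uint8_t((e0.r + 2 * e1.r) / 3), uint8_t((e0.g + 2 * e1.g) / 3), uint8_t((e0.b + 2 * e1.b) / 3) },
    };
    const uint32_t selectors = uint32_t(blk[4]) | blk[5] << 8 | blk[6] << 16 | uint32_t(blk[7]) << 24;
    return palette[(selectors >> 2 * (ty * 4 + tx)) & 3];
}

// Stage 1: index the 16x smaller LUT. Stage 2: decode the referenced block.
Rgb sample(const IndirectionTexture& tex, uint32_t x, uint32_t y)
{
    const uint32_t block_index = tex.lut[(y / 4) * tex.tiles_x + (x / 4)];
    return decode_bc1_texel(tex.block_pool[block_index], x % 4, y % 4);
}
```

On a GPU the second stage would of course be the hardware BC1 path, and the LUT just another (tiny) texture fetch.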
You can get a similar kind of compression today from the existing tiled resources feature, with a pre-compiled (feature-density-aware) heap: aggressively deduplicate tiles, and whenever possible just omit the high-LOD tiles outside of the regions of interest.
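For reference, a rough sketch of what the deduplication part looks like with D3D12 reserved resources; it assumes the reserved texture and the heap already exist, and that you already know which tiles carry identical content. Error handling and residency feedback are omitted:

```cpp
#include <d3d12.h>
#include <vector>

// Map a list of resource tiles (e.g. all tiles found to contain the same flat
// color) onto one single heap tile: deduplication through the tile mappings.
// High-LOD tiles outside the region of interest can instead be left unmapped
// and covered by a lower mip.
void map_duplicate_tiles_to_one_heap_tile(
    ID3D12CommandQueue* queue,
    ID3D12Resource* reserved_texture,
    ID3D12Heap* heap,
    const std::vector<D3D12_TILED_RESOURCE_COORDINATE>& duplicate_tiles,
    UINT shared_heap_tile_offset)
{
    const UINT num_regions = UINT(duplicate_tiles.size());

    // Each resource region is a single tile.
    std::vector<D3D12_TILE_REGION_SIZE> region_sizes(num_regions);
    for (auto& rs : region_sizes) { rs.NumTiles = 1; rs.UseBox = FALSE; }

    // Every region points at the same single heap tile.
    std::vector<D3D12_TILE_RANGE_FLAGS> range_flags(num_regions, D3D12_TILE_RANGE_FLAG_NONE);
    std::vector<UINT> heap_offsets(num_regions, shared_heap_tile_offset);
    std::vector<UINT> range_tile_counts(num_regions, 1);

    queue->UpdateTileMappings(
        reserved_texture,
        num_regions, duplicate_tiles.data(), region_sizes.data(),
        heap,
        num_regions, range_flags.data(), heap_offsets.data(), range_tile_counts.data(),
        D3D12_TILE_MAPPING_FLAG_NONE);
}
```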
... of course this isn't what the patent describes.
The patent goes one step further and claims that there are still huge savings to be made by caching the decoded output of the second stage. E.g. you could go for much bigger macro blocks (e.g. 16x16) and use a higher-compression algorithm (akin to JPEG). But the hardware no longer provides hard-wired decompression logic; instead it supports invoking a custom decompression shader on demand, triggered by a cache miss in a dedicated cache for decompressed macro blocks. That enables compression schemes which are significantly more costly to decode than the S3TC family.
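Purely as a mental model (this is my reading, not the patent text), the behaviour could be approximated in software like this: a texel fetch checks a cache of decompressed macro blocks, and on a miss invokes a user-supplied "decompression shader" to fill the slot. All names are hypothetical:

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct Rgba8 { uint8_t r, g, b, a; };

constexpr uint32_t kMacroBlockDim = 16; // e.g. 16x16 texel macro blocks

// The "decompression shader": given a macro block id, it must produce the
// 16x16 decompressed texels (e.g. by decoding a JPEG-like bitstream).
using DecompressionCallback =
    std::function<void(uint32_t macro_block_id, Rgba8* out_texels /*16x16*/)>;

class MacroBlockCache
{
public:
    MacroBlockCache(uint32_t blocks_per_row, DecompressionCallback cb)
        : blocks_per_row_(blocks_per_row), decompress_(std::move(cb)) {}

    Rgba8 fetch(uint32_t x, uint32_t y)
    {
        const uint32_t id = (y / kMacroBlockDim) * blocks_per_row_ + (x / kMacroBlockDim);
        auto it = cache_.find(id);
        if (it == cache_.end()) // cache miss: invoke the decompression callback
        {
            std::vector<Rgba8> texels(kMacroBlockDim * kMacroBlockDim);
            decompress_(id, texels.data());
            it = cache_.emplace(id, std::move(texels)).first;
        }
        const std::vector<Rgba8>& block = it->second;
        return block[(y % kMacroBlockDim) * kMacroBlockDim + (x % kMacroBlockDim)];
    }

private:
    uint32_t blocks_per_row_;
    DecompressionCallback decompress_;
    std::unordered_map<uint32_t, std::vector<Rgba8>> cache_; // eviction omitted
};
```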
So I suppose this means there will be support for conditional decompression callbacks made from within texture lookup calls in the near future. From the programming interface side it's going to look like a dedicated decompression shader bound to the pipeline. Note that this can then also be abused to simply run a generative shader instead of a "decompression" one, essentially providing forward texture space shading. That assumes AMD had enough foresight to provide a data channel into the decompression shader, and to explicitly provide enough memory to hold an entire frame's worth of decompressed blocks.
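In terms of the toy model above, the "abuse" amounts to handing in a generator instead of a decoder, which is exactly texture space shading once the callback has a data channel (here just a captured pointer to hypothetical per-frame shading inputs):

```cpp
// Hypothetical per-frame shading inputs, passed through the "data channel".
struct ShadingInputs { float time; /* lights, material params, ... */ };

MacroBlockCache make_generative_texture(uint32_t blocks_per_row,
                                        const ShadingInputs* inputs)
{
    return MacroBlockCache(blocks_per_row,
        [inputs](uint32_t macro_block_id, Rgba8* out_texels)
        {
            // Instead of decoding a bitstream, shade the macro block on demand:
            // only blocks that are actually sampled this frame get evaluated.
            for (uint32_t i = 0; i < kMacroBlockDim * kMacroBlockDim; ++i)
                out_texels[i] = { uint8_t(macro_block_id % 256),
                                  uint8_t(int(inputs->time * 255.0f) % 256),
                                  0, 255 }; // placeholder "shading"
        });
}
```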
In terms of existing API usage: rather than polling CheckAccessFullyMapped() after the fetch, you get a shader invoked at the moment the access would have failed.
What the patent doesn't cover is whether this is also applicable to memory accesses outside of texturing, e.g. on-demand block decompression not only for texture fetches, but also for arbitrary buffer accesses.