Decoding of entropy-coded data can AFAIK be done in parallel in about O((log N)^2) time for N bits with a variant of the technique called 'parallel prefix computation'. I think the hardware cost is about O(N*S + N^2*log M), where S is the size of the symbol table and M is the maximum number of symbols that can be decoded in one go (if you use different symbol tables for different symbols, you will additionally need to multiply this hardware cost by the number of symbol tables that you use).
The idea: First, for every bit position, assume that a symbol starts there and compute how long that symbol would be. Then, for each bit position, look up the length of the 'next' symbol, so that each bit position now holds the combined length of the next 2 symbols from that position. Repeat the procedure (adding two sums each time) to get the combined length of the next 4, then 8, then 16 etc symbols. Now, at bit position 0 you will have generated pointers to symbols 0, 1, 2, 4, 8, 16 etc; from each of these pointers, you can collect pointers to symbols n+1, n+2, n+4, n+8 etc, recursively, until you have covered all symbols; this way you can collect pointers to ALL symbols in log(M) stages.
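Here's a rough software sketch of the pointer-doubling part in Python. All the names (CODE, symbol_length, symbol_start, parallel_decode) and the toy prefix code are made up for illustration; in real hardware each list comprehension below would be one combinational stage evaluating all positions at once.

    # Sketch under assumed names; the prefix code here is hypothetical.
    CODE = {"0": "a", "10": "b", "110": "c", "111": "d"}
    MAXLEN = max(len(cw) for cw in CODE)

    def symbol_length(bits, i):
        """Length of the codeword that would start at bit position i."""
        for l in range(1, MAXLEN + 1):
            if bits[i:i + l] in CODE:
                return l
        return MAXLEN  # truncated tail; only hit at positions that
                       # aren't real symbol boundaries anyway

    def symbol_start(jumps, m):
        """Bit position of symbol m: follow one jump table per set bit
        of m, i.e. log(M) lookups instead of m sequential decodes."""
        pos = 0
        for k, table in enumerate(jumps):
            if (m >> k) & 1:
                pos = table[pos]
        return pos

    def parallel_decode(bits):
        n = len(bits)
        # Round 0: for EVERY bit position, where the next symbol would
        # start if a symbol began here (index n is an end-of-stream sink).
        jumps = [[min(i + symbol_length(bits, i), n) for i in range(n)] + [n]]
        # Pointer doubling: after round k, jumps[k][i] points 2**k
        # symbols past position i.
        while jumps[-1][0] < n:
            prev = jumps[-1]
            jumps.append([prev[prev[i]] for i in range(n + 1)])
        # Gather stage (sequential here; a log(M)-deep tree of
        # independent lookups in hardware).
        out, m = [], 0
        while True:
            pos = symbol_start(jumps, m)
            if pos >= n:
                return out
            out.append(CODE[bits[pos:pos + symbol_length(bits, pos)]])
            m += 1

    print(parallel_decode("0101100111"))  # ['a', 'b', 'c', 'a', 'd']

Note that the speculative per-position lengths computed at non-boundary positions are simply never read once the jump tables are followed from bit 0, which is why the scheme tolerates them.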
For texture compression, this approach is WAAY too heavy to allow data to be stored compressed in an L1 texture cache; it may be usable for an L2 cache.
I think some x86 CPUs use algorithms similar to this to enable fast parallel decoding of the instruction set's obnoxious variable-length instruction format.