As you said, the TMUs (GCN3/4) can already read (sample) variable-width DCC targets, meaning that the DCC metadata needs to be loaded into the L1 and L2 caches. I imagine Vega's ROPs (tiny L1 ROP caches) and TMUs (CU L1 caches) load the DCC metadata from the shared L2 cache. I would assume that you need to flush both the ROP L1 caches and the CU L1 caches when transitioning a render target to readable. But since these caches are inclusive (L2 also holds the L1 lines), this flush never causes any memory traffic. This should be very fast compared to the current ROP cache + L2 cache flush.
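The "no memory traffic" claim can be sketched with a toy model of an inclusive two-level hierarchy. This is a hypothetical simplification (the class and its fields are made up for illustration, not any real driver or hardware interface): because inclusion guarantees that L2 holds a copy of every L1 line, flushing L1 is just an invalidate and produces zero writebacks.

```python
# Toy model of an inclusive L1/L2 hierarchy (hypothetical sketch).
# Flushing L1 is invalidate-only: inclusion means L2 already has
# every L1 line, so no writeback to memory is required.

class InclusiveHierarchy:
    def __init__(self):
        self.l1 = {}          # addr -> data (small per-unit cache)
        self.l2 = {}          # addr -> data (shared, inclusive of L1)
        self.mem_writes = 0   # count of writebacks to memory

    def fill(self, addr, data):
        # A fill populates L2 first, then L1, enforcing inclusion.
        self.l2[addr] = data
        self.l1[addr] = data

    def flush_l1(self):
        # Every L1 line must already be present in L2.
        for addr in self.l1:
            assert addr in self.l2   # inclusion property
        self.l1.clear()              # invalidate; mem_writes unchanged

h = InclusiveHierarchy()
h.fill(0x100, "dcc-meta-A")
h.fill(0x140, "dcc-meta-B")
h.flush_l1()
assert h.mem_writes == 0 and not h.l1 and 0x100 in h.l2
```

Contrast this with a writeback flush of the L2 itself, where dirty lines do have to travel to memory, which is why the full ROP cache + L2 flush is the slow path.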
I see that there would be an effective capacity benefit to having the ROP L1s participate in the compression scheme, but it would be a change, since the current method seems to keep the compression in the writeback path of the ROP caches back to memory.
I could see it being more difficult to perform operations that write data across the parallel units of the RBE when the alignment and content at a ROP's specific position can require varying amounts of work to read+decompress, modify, and then write+recompress (rinse and repeat).
Structuring hardware to somehow work natively on compressed data would be an interesting exercise, but generally I'd expect it would just decompress before working on the data--which means there's storage on the order of the uncompressed line somewhere.
Some special cases might be easier to manage, like an all-zero or all-one color, but even then, after one operation there's a decent chance that the output will require storage on the order of the original L1 lines to hold it while the compression is reapplied.
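The round trip above can be illustrated with a deliberately trivial compression scheme (entirely made up for this sketch, not the real DCC format): a uniform block compresses to a single value, but one blended pixel breaks uniformity, so the intermediate buffer has to hold the full uncompressed footprint while recompression runs.

```python
# Hypothetical sketch of read+decompress, modify, write+recompress.
# The toy scheme stores uniform blocks as one value and everything
# else raw; it stands in for a real lossless color compressor.

BLOCK_PIXELS = 16

def compress(pixels):
    # Uniform fast path vs. raw fallback.
    if len(set(pixels)) == 1:
        return ("uniform", pixels[0])
    return ("raw", list(pixels))

def decompress(block):
    kind, payload = block
    if kind == "uniform":
        return [payload] * BLOCK_PIXELS
    return list(payload)

def blend_one(block, index, color):
    # The intermediate 'pixels' buffer is the full uncompressed line:
    # this is the "storage on the order of the original" in the text.
    pixels = decompress(block)
    pixels[index] = color
    return compress(pixels)

cleared = compress([0] * BLOCK_PIXELS)   # easy all-zero special case
assert cleared[0] == "uniform"
touched = blend_one(cleared, 3, 255)     # a single write to the block
assert touched[0] == "raw"               # uniformity lost, full storage
```

Even starting from the cheapest possible state (a cleared block), a single modification is enough to push the block off the fast path.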
The variability in compression would present an interesting wrinkle for the L2 and for cache flushes, since behavior would vary with the policies of the RBE caches and with when a line's compressed footprint becomes known.
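One way to picture that wrinkle, under the assumption (mine, not the source's) that compression happens at eviction: the bytes a dirty line contributes to a flush are only known once it is compressed, so total writeback traffic is data-dependent rather than a fixed lines-times-size product.

```python
# Toy illustration (hypothetical policy): a dirty RBE line's writeback
# footprint is determined at eviction time, when it gets compressed.

LINE_BYTES = 64

def compressed_footprint(pixels):
    # Same toy scheme as before: uniform lines shrink to a small
    # header + value, mixed lines write back at full size.
    return 8 if len(set(pixels)) == 1 else LINE_BYTES

dirty_lines = [
    [0] * 16,            # cleared line: compresses well
    [0] * 15 + [255],    # one touched pixel: full footprint
    [7] * 16,            # uniform non-zero: compresses well
]
writeback = sum(compressed_footprint(p) for p in dirty_lines)
assert writeback == 8 + LINE_BYTES + 8   # 80, not 3 * LINE_BYTES
```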