Sorry to be late to the party, please treat this as a conclusion.
The approach is always the same for this stuff (depth buffer, MSAA, etc.): when you overflow, you just drop back to the uncompressed baseline.
If the delta values fit into 2 bits you go, say, 1:8; if they need 3 bits, 1:4; if 4 bits, 1:2; otherwise you dump the block uncompressed at 1:1.
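For what it's worth, the selection logic boils down to something like the sketch below. The bit-width thresholds, the mode names and the function name are my own guesses for illustration, not a documented GCN3 encoding.

```c
/* Minimal sketch of the "pick a ratio from the widest delta" idea above.
 * The thresholds and enum values are assumptions, not hardware-documented. */
#include <stdio.h>

enum comp_ratio { COMP_8TO1, COMP_4TO1, COMP_2TO1, COMP_NONE };

static enum comp_ratio pick_ratio(unsigned max_delta_bits)
{
    if (max_delta_bits <= 2) return COMP_8TO1;  /* deltas fit in 2 bits -> 1:8 */
    if (max_delta_bits <= 3) return COMP_4TO1;  /* 3 bits -> 1:4 */
    if (max_delta_bits <= 4) return COMP_2TO1;  /* 4 bits -> 1:2 */
    return COMP_NONE;                           /* overflow -> store uncompressed */
}

int main(void)
{
    for (unsigned bits = 1; bits <= 6; ++bits)
        printf("max delta of %u bits -> mode %d\n", bits, (int)pick_ratio(bits));
    return 0;
}
```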
Sounds reasonable, but this would only work with the back buffer using an integer 8-bit per component RGB/BGR format, and not with HDR formats like R10G10B10A2 and 16-bit floating point R16G16B16A16.
Same for multiple render target formats, including one- and two-component 8-bit integer formats.
How do you know how many bytes a compressed tile occupies when reading it back?
There is an array describing the size of each block with a small bit field (probably 2 bits, as explained below).
An 8x8 block of 4-byte (32-bit) pixels takes a minimum of 32 bytes at 8:1 compression, 64 bytes at 4:1, and 128 bytes at 2:1. Uncompressed it takes the full 256 bytes, so each block is aligned to a 256-byte boundary.
Assuming a 128-bit (16-byte) or 256-bit (32-byte) memory bus, reading a block back takes anywhere from 1 to 16 reads depending on the compression ratio and bus width, and a peek into the above array (either 4 bytes, or most likely a full line readout to improve cache hits for the next reads) is negligible on top of that.
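To put numbers on that, here is a rough sketch of the readback cost. The 2-bit code values (0 = uncompressed through 3 = 8:1) and the helper name tile_bytes are assumptions for illustration only.

```c
/* Estimate bytes and bus transfers per 8x8 tile of 32-bit pixels.
 * The 2-bit code assignment is an assumed encoding, not GCN3's actual one. */
#include <stdio.h>

static unsigned tile_bytes(unsigned code2)
{
    switch (code2 & 3u) {
    case 0:  return 256u;  /* uncompressed, 1:1 */
    case 1:  return 128u;  /* 2:1 */
    case 2:  return  64u;  /* 4:1 */
    default: return  32u;  /* 8:1 */
    }
}

int main(void)
{
    const unsigned bus_bytes = 32u;  /* 256-bit bus; use 16 for a 128-bit bus */
    for (unsigned code = 0; code < 4; ++code) {
        unsigned bytes = tile_bytes(code);
        unsigned reads = (bytes + bus_bytes - 1) / bus_bytes;  /* ceil division */
        printf("code %u: %3u bytes -> %u reads\n", code, bytes, reads);
    }
    return 0;
}
```

On the 32-byte bus this works out to 1, 2, 4 and 8 reads for the 8:1, 4:1, 2:1 and uncompressed cases respectively; on a 16-byte bus, double each figure.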
Not exactly on-chip... You need at least 2 bits per tile (cleared, 1:1, 1:2, 1:4; one bit more if you want to support higher ratios), so with 8x8 tiles it can add up to quite a lot: for a 16k x 16k buffer at 2 bits per tile, that's 1 MB per render target.
Who needs a 16K framebuffer? More realistically, with 2 bits per tile and 8x8 pixel blocks, 1080p requires 8100 bytes, 2560x1600 requires 16000 bytes, and 4K (2160p) requires 32400 bytes.
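A quick back-of-the-envelope check of those figures, assuming 2 bits per 8x8 tile (the code below just reproduces the arithmetic, nothing hardware-specific; meta_bytes is a made-up helper name):

```c
/* Metadata size for a render target: 2 bits per 8x8 tile, rounded to bytes. */
#include <stdio.h>

static unsigned long meta_bytes(unsigned w, unsigned h)
{
    unsigned long tiles = (unsigned long)((w + 7) / 8) * ((h + 7) / 8);
    return (tiles * 2 + 7) / 8;   /* 2 bits per tile */
}

int main(void)
{
    printf("1080p:     %lu bytes\n", meta_bytes(1920, 1080));    /* 8100    */
    printf("2560x1600: %lu bytes\n", meta_bytes(2560, 1600));    /* 16000   */
    printf("4K:        %lu bytes\n", meta_bytes(3840, 2160));    /* 32400   */
    printf("16K x 16K: %lu bytes\n", meta_bytes(16384, 16384));  /* 1048576 */
    return 0;
}
```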
I agree that any of the above is too large to have a dedicated cache though.
One cache line, no matter where it lives, can store information about quite a few tiles, of course; unless you hit a really terrible case that achieves no compression at all, it's not a big deal.
The memory controller probably doesn't support reading individual bytes, so it's 32 bits (4 bytes) at minimum - though most likely that is not very efficient and would take as many cycles as just reading a full line (128 or 256 bits).
Could it be done in Linux with a modified open source driver, adding a function call to DMA out raw video RAM data?
This would involve a thorough analysis of the shader compiler in the OpenGL driver.
AFAIK AMD's open source kernel driver is currently limited to kernel mode-setting, i.e. basic configuration: framebuffer, display mode and refresh rate stuff. The heavyweight stuff like OpenGL rendering is maintained by Mesa/Gallium3D, and I'm not really sure it knows anything about the compression hardware in GCN3... it most likely doesn't, considering how far its performance lags behind the proprietary AMD driver.
Wow, there is a blast from the past...
Textured models were first researched at LucasFilm/Pixar some 14 years before they appeared in mainstream PC graphics cards.