http://worldwide.espacenet.com/publ...20813&DB=worldwide.espacenet.com&locale=en_EP
Dual fragment-cache pixel processing circuit and method therefore
Multiple graphics primitives may be processed in quick succession where each of the multiple graphics primitives produces fragments that correspond to the same pixel location. As such, rather than forcing the render backend block to handle multiple fragments corresponding to the same pixel location, a cache structure can be used to buffer the received fragments prior to providing them to the render backend block. Including a cache structure in the data path for the pixel fragments enables multiple fragments that apply to the same pixel location to be combined prior to presentation to the render backend block. Offloading some of the blending operations from the render backend block can improve overall system performance.
This is an ancient patent (1999). The fragment cache is simply operating as a fragment selector (according to Z) or as an MSAA fragment selector (according to the coverage mask, with some blending).
(The patent is about using two caches in a ping-pong arrangement, to enable full-speed processing across the temporal boundary caused by a rendering state change. That isn't the reason I'm linking it.)
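As a rough sketch of that selector behaviour (my interpretation, not the patent's circuit; all names and the entry layout below are invented): the cache keeps one entry per pixel location, and an arriving fragment either replaces the cached one or is discarded according to Z.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fragment-cache entry; names and layout are invented. */
typedef struct {
    uint32_t colour; /* packed RGBA8 */
    float    z;      /* fragment depth */
    uint8_t  mask;   /* MSAA coverage mask (one bit per sample) */
    bool     valid;
} FragEntry;

/* Z-selector behaviour: for fragments hitting the same pixel location,
 * keep only the nearest one, so the render backend sees a single
 * fragment per pixel. An MSAA variant would select per sample via the
 * mask (and blend where samples disagree), which is omitted here. */
static void frag_cache_insert(FragEntry *e,
                              uint32_t colour, float z, uint8_t mask)
{
    if (!e->valid || z < e->z) {
        e->colour = colour;
        e->z      = z;
        e->mask   = mask;
        e->valid  = true;
    }
}
```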
The render backend (RBE) pulls from the fragment cache to perform a read-modify-write (blend) against the render target (or just a write). To do so the RBE will want to work on small blocks of the render target, which entails a colour buffer cache (CBC).
So the RBE wants to pull from the fragment cache, but only when the CBC is ready. Readiness is determined by whether the CBC line has been populated from memory (if modifying) or merely allocated (if write-only). The whole process needs to be pipelined, though I'm unclear on how the RBE prioritises which fragment cache lines to pull from. (Perhaps it's driven by the rasteriser's interaction with the hierarchical-Z cache: as the rasteriser touches the hierarchical-Z cache, it can feed forward to the RBEs as a predictor of which render target coordinates are "live" and which are "dead".)
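A minimal sketch of that gating, assuming a simple two-flag state per CBC line (the bookkeeping here is entirely my invention):

```c
#include <stdbool.h>

/* Hypothetical CBC line state; the real hardware's bookkeeping is unknown. */
typedef struct {
    bool allocated; /* a cache line has been assigned to this RT block */
    bool populated; /* the line's contents have arrived from memory */
} CbcLine;

/* A fragment-cache line may be drained to the RBE only once the CBC line
 * covering the same render-target block is usable: write-only traffic just
 * needs an allocated line, while a blend (read-modify-write) needs the
 * existing colour data to have landed. */
static bool can_drain(const CbcLine *l, bool write_only)
{
    return write_only ? l->allocated
                      : (l->allocated && l->populated);
}
```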
What we don't know is whether the CBC is held in delta-colour-compressed format or in native format.
A look at the PNG lossless compression techniques:
http://optipng.sourceforge.net/pngtech/optipng.html
http://www.w3.org/TR/PNG-Filters.html
specifically the delta filters, suggests that decompression would be pretty simple, enabling the CBC to be held in compressed format with a low-latency, low-area read path.
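For instance, the Sub and Up filters from the W3C spec decode with nothing more than a running byte-wise add (sketch below; the function names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* PNG "Sub" filter decode, per the W3C spec: each byte stores the delta
 * from the corresponding byte one pixel to the left, so decoding is a
 * single running add along the scanline. bpp = bytes per pixel. */
static void unfilter_sub(uint8_t *line, size_t len, size_t bpp)
{
    for (size_t i = bpp; i < len; i++)
        line[i] = (uint8_t)(line[i] + line[i - bpp]);
}

/* PNG "Up" filter decode: each byte stores the delta from the same byte
 * in the previous scanline. */
static void unfilter_up(uint8_t *line, const uint8_t *prev, size_t len)
{
    for (size_t i = 0; i < len; i++)
        line[i] = (uint8_t)(line[i] + prev[i]);
}
```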
I'm assuming that delta colour compression operates on large blocks of pixels, so that they align with the memory channel's native burst length. Some care is needed to ensure that compression is still possible with 16-bit-per-channel and 32-bit-per-channel pixels.
Translating this to PNG-style delta-colour filters, a scanline might be 8 or 16 pixels long, with a fixed count of 4 or 8 scanlines per block (e.g. a 16×4 block of RGBA8 pixels is 256 bytes, a whole number of 32-byte bursts).
The problem then becomes the time spent compressing the final pixels produced by the ROPs. If suitably pipelined (using a fork-and-join to evaluate the scanline filter choices in parallel and pick the best-compressed scanline), this should run at the native ROP rate.
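A sketch of that fork-and-join, using the standard PNG heuristic (minimise the sum of absolute values of the filtered bytes, treated as signed) to pick between the two filters above; buffer sizes assume the 16-pixel scanlines mentioned earlier:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { FILT_SUB = 0, FILT_UP = 1 };

/* Standard PNG heuristic: cheaper scanline = smaller sum of the
 * absolute values of the filtered bytes interpreted as signed. */
static unsigned filtered_cost(const uint8_t *f, size_t len)
{
    unsigned cost = 0;
    for (size_t i = 0; i < len; i++)
        cost += (unsigned)abs((int8_t)f[i]);
    return cost;
}

/* Evaluate Sub and Up in parallel (the fork), keep the cheaper result
 * (the join). Caller guarantees len <= 64 bytes (16 px * 4 B/px). */
static int pick_filter(const uint8_t *line, const uint8_t *prev,
                       size_t len, size_t bpp, uint8_t *out)
{
    uint8_t sub[64], up[64];
    for (size_t i = 0; i < len; i++) {
        sub[i] = (uint8_t)(line[i] - (i >= bpp ? line[i - bpp] : 0));
        up[i]  = (uint8_t)(line[i] - prev[i]);
    }
    if (filtered_cost(sub, len) <= filtered_cost(up, len)) {
        memcpy(out, sub, len);
        return FILT_SUB;
    }
    memcpy(out, up, len);
    return FILT_UP;
}
```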
The next problem to solve is dealing with blocks that flip back and forth between compressed and uncompressed over the time spent rendering the frame.
Ultimately, you want to deliver coherent blocks (compressed or uncompressed) to the MCs, so that they match the memory system's transaction granularity precisely.
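One plausible policy for both problems (my guess, with illustrative sizes): round each compressed block up to whole bursts, fall back to the raw encoding whenever compression doesn't save at least one burst, and keep a per-block metadata bit recording which state the block is currently in.

```c
/* Illustrative sizes: 32-byte bursts, 256-byte raw blocks (16x4 RGBA8). */
#define BURST_BYTES     32u
#define RAW_BLOCK_BYTES 256u

/* Hypothetical per-block metadata: one bit tracks the block's current
 * state as it flips between compressed and raw across the frame. */
typedef struct {
    unsigned compressed : 1;
} BlockMeta;

/* Round a compressed block up to whole bursts so every transfer handed
 * to the MC is a coherent, burst-aligned unit; store raw if compression
 * saves nothing. Returns the footprint actually written to memory. */
static unsigned store_block(BlockMeta *m, unsigned compressed_bytes)
{
    unsigned rounded = (compressed_bytes + BURST_BYTES - 1)
                       / BURST_BYTES * BURST_BYTES;
    m->compressed = (rounded < RAW_BLOCK_BYTES);
    return m->compressed ? rounded : RAW_BLOCK_BYTES;
}
```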