Modern texture caches ... how do they work?

MfA

My understanding of how texture caches work is hopelessly out of date ... anyone know of any papers which explore it for modern architectures?

The optimal design for a cache optimized for 4x bilinear sampling per clock (like Evergreen/Fermi) would, I'd guess, be something like: fully associative, even/odd line ordered, 128-byte cache lines with 8x 128-bit banked ports, and a couple of coalescing stages in the pipeline to accumulate multiple cycles' worth of texture accesses into accesses with minimal bank conflicts. You could get away without banks, I guess (which lets you avoid putting 8 ports on the tag part of the cache), but they would be really nice to increase the chances of getting hits for less neat accesses.
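To make the coalescing idea concrete, here is a hypothetical sketch: given a batch of texel byte addresses, greedily pack them into "cycles" so that no two accesses in one cycle hit the same bank. All the constants (8 banks, 128-byte lines, 16-byte bank ports) are my assumptions from the description above, not any real GPU's parameters.

```python
NUM_BANKS = 8
LINE_BYTES = 128
PORT_BYTES = 16  # 128-bit ports

def bank_of(addr):
    # Which of the 8 16-byte-wide banks a given byte address falls into.
    return (addr % LINE_BYTES) // PORT_BYTES

def coalesce(addresses):
    """Greedy first-fit: place each access in the earliest cycle whose
    bank is still free. Returns a list of cycles (bank -> address maps)."""
    cycles = []
    for addr in addresses:
        b = bank_of(addr)
        for cycle in cycles:
            if b not in cycle:
                cycle[b] = addr
                break
        else:
            cycles.append({b: addr})
    return cycles

# Eight accesses in eight different banks fit in one cycle;
# eight accesses to the same bank need eight cycles.
assert len(coalesce([i * PORT_BYTES for i in range(8)])) == 1
assert len(coalesce([0] * 8)) == 8
```

A real pipeline would of course do this over a fixed window of in-flight requests rather than an unbounded batch.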

What do you do with hits? Do you maintain a buffer of sample instructions and hits and just add the misses afterwards?
 
I'm afraid you are unlikely to find any relevant answer, as such information is one of the many jealously guarded secrets the IHVs keep.

Some pointer-following tests on RV770 and RV870 give (take this with a big grain of salt) an L1 capacity of 8 KB (yes, on RV770 too) with over a hundred cycles average latency for hits.
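For reference, here is a rough sketch of the pointer-chasing construction behind such measurements: build a buffer whose entries form a single random permutation cycle, then walk it so every load depends on the previous one, which exposes raw hit latency and lets you find the capacity by growing the buffer. On a GPU this would run in a shader over texture fetches; plain Python is shown only to illustrate the idea.

```python
import random

def make_chase_buffer(n, seed=0):
    # Build a random single-cycle permutation: buf[i] holds the index
    # of the next element to fetch, and the cycle visits all n entries.
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    buf = [0] * n
    for i in range(n):
        buf[order[i]] = order[(i + 1) % n]
    return buf

def chase(buf, steps):
    idx = 0
    for _ in range(steps):
        idx = buf[idx]  # each access depends on the previous load
    return idx

buf = make_chase_buffer(2048)  # 8 KB worth of 4-byte entries
assert chase(buf, len(buf)) == 0  # a full walk returns to the start
```

Timing `steps` dependent fetches and dividing by `steps` gives the average per-access latency; the knee as the buffer outgrows the cache marks its capacity.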

Regarding banking, well, I think ATI texture caches can at least be banked 4 times (one bank per pixel in a quad) 'for free', as a bilinear fetch will never be able to create bank conflicts (it'll always be some permutation of a quad), and that's the only kind of access the hardware handles.

I don't have much more to say than the fact it works pretty damn well in my experience (which actually makes me all the less curious about how it works... there are so very few things that just work)
 
Regarding banking, well, I think ATI texture caches can at least be banked 4 times (one bank per pixel in a quad) 'for free', as a bilinear fetch will never be able to create bank conflicts.
To guarantee completely conflict-free access it would have to be 4-banked and 4-ported (quads can be guaranteed conflict free with just banks, but the multiple parallel samples can only be handled with ports if you don't want to rely on coalescing). That means 16 ports on the tag CAM, and that's an expensive CAM.
 
There's a little more data in this presentation:

http://developer.amd.com/gpu_assets... ACML-GPU SGEMM Optimization Illustration.ppt

Though this thread:

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=123138&messageid=1062985

mentions an error or two.

Also, of course, there are the patents, which are pretty detailed when all is said and done. This one is a deliciously comprehensive overview; it really is the mother of recent ATI patent docs:

http://v3.espacenet.com/publication...=B1&FT=D&date=20091215&DB=EPODOC&locale=en_V3

Decoding the damn things and putting together a comprehensive model, gulp...

NVidia's L1 has historically held texels in their DXT-compressed form; ATI's has historically held them decompressed.
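A back-of-the-envelope consequence of that design difference: a DXT1 block encodes a 4x4 texel tile in 8 bytes versus 64 bytes decompressed to RGBA8, so a cache holding compressed texels covers 8x the texel footprint per byte, at the cost of decompressing on every hit. The block sizes are the DXT1 format's; the 8 KB cache size below just echoes the figure mentioned earlier in the thread.

```python
DXT1_BLOCK_BYTES = 8      # one 4x4 texel tile, compressed
RGBA8_TEXEL_BYTES = 4
TILE_TEXELS = 4 * 4

decompressed_tile = TILE_TEXELS * RGBA8_TEXEL_BYTES  # 64 bytes
ratio = decompressed_tile // DXT1_BLOCK_BYTES

cache_bytes = 8 * 1024  # e.g. the 8 KB L1 figure from earlier
texels_compressed = cache_bytes // DXT1_BLOCK_BYTES * TILE_TEXELS
texels_decompressed = cache_bytes // RGBA8_TEXEL_BYTES

assert ratio == 8
assert texels_compressed == 16384   # 1024 blocks * 16 texels
assert texels_decompressed == 2048
```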

You also need to bear in mind L1s versus L2s, and how hardware threads are allocated across SIMDs, as that will affect the cache-thrashing patterns caused by disparate screen tiles or by the phases of particularly long shaders.

Jawed
 
The patent applications 20090309896 and 20090315909 give some more recent overviews of the pipelines, but no real hint of the exact nature of the texture cache. Also, 20080273033 gives some clues about rasterization, and 20090164726 hopefully offers a glimpse of the future (native irregular Z-buffers!).
 
In the end I was probably thinking too complex ... it probably simply is 4-banked with 4 ports on each bank (32-bit ports). Power-of-two tiling (potentially hierarchical). Banks for odd-row/odd-column, odd/even, even/odd and even/even. The 16-ported CAM really is a bit of a headache though (all the accesses can be on different cache lines).
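Here's a sketch of what the "power-of-two tiling" plus parity-banking guess above might look like: texel (x, y) gets a Morton (Z-order) offset within its tile, and the bank is chosen by row/column parity. The tile size, the Morton layout, and the bank mapping are all assumptions for illustration.

```python
def morton_offset(x, y, bits=3):
    # Interleave the low `bits` bits of x and y, giving the offset
    # within an 8x8 tile (for bits=3). Neighbours stay close in memory.
    off = 0
    for i in range(bits):
        off |= ((x >> i) & 1) << (2 * i)
        off |= ((y >> i) & 1) << (2 * i + 1)
    return off

def bank(x, y):
    # even/even, odd/even, even/odd, odd/odd row-column banks
    return (y & 1) * 2 + (x & 1)

# Z-order walks the tile quad by quad...
assert morton_offset(0, 0) == 0
assert morton_offset(1, 0) == 1
assert morton_offset(0, 1) == 2
assert morton_offset(1, 1) == 3
# ...and any 2x2 quad spans all four parity banks.
assert {bank(x, y) for x in (4, 5) for y in (6, 7)} == {0, 1, 2, 3}
```

Hierarchical tiling would just repeat the same interleave at a coarser granularity (tiles of tiles).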

PS. theoretically the APU from the last patent application could already be in there to create wavefronts ... but that seems an awfully limited use of such a powerful programmable mechanism.
 
From my investigations years ago, fully associative gains you very little in hit rate over set associative caches, but would cost a great deal more in gates.
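A toy LRU simulation makes that point concrete. The cache sizes and the synthetic access stream below are made up, and how close the hit rates land obviously depends on the workload; this just shows the typical shape of the result.

```python
import random
from collections import OrderedDict

def hit_rate(addresses, num_lines, ways, line_bytes=128):
    """Simulate an LRU cache with `num_lines` lines split into
    `num_lines // ways` sets; ways == num_lines is fully associative."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        tag = addr // line_bytes
        s = sets[tag % num_sets]
        if tag in s:
            hits += 1
            s.move_to_end(tag)          # refresh LRU position
        else:
            if len(s) >= ways:
                s.popitem(last=False)   # evict least recently used
            s[tag] = True
    return hits / len(addresses)

rng = random.Random(1)
# 90% of accesses hit a 4 KB hot set, the rest scatter over 1 MB.
stream = [rng.randrange(4 * 1024) if rng.random() < 0.9
          else rng.randrange(1 << 20) for _ in range(20000)]

full = hit_rate(stream, num_lines=64, ways=64)  # fully associative 8 KB
four = hit_rate(stream, num_lines=64, ways=4)   # 4-way set associative 8 KB
assert full > 0.5
assert abs(full - four) < 0.05  # within a few percent on this stream
```

The fully associative tag CAM compares every access against all 64 tags; the 4-way version compares against 4, which is the gate-count argument above.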
 