Modern texture caches ... how do they work?

Discussion in 'Architecture and Products' started by MfA, Jan 21, 2010.

  1. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,910
    Likes Received:
    501
    My understanding of how texture caches work is hopelessly out of date ... anyone know of any papers which explore it for modern architectures?

    The optimal design for a cache optimized for 4x bilinear sampling per clock (like Evergreen/Fermi) would, I guess, be something like: fully associative, even/odd line ordered, 128-byte cache lines with 8x 128-bit banked ports, plus a couple of coalescing stages in the pipeline that accumulate multiple cycles' worth of texture accesses to form accesses with minimal bank conflicts. You could get away without banks, I guess (which avoids having to put 8 ports on the tag part of the cache), but they would be really nice for increasing the chances of getting hits on less neat accesses.

    What do you do with hits? Do you maintain a buffer of sample instructions and hits and just add the misses afterwards?
     
    #1 MfA, Jan 21, 2010
    Last edited by a moderator: Jan 21, 2010
  2. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    I'm afraid you are unlikely to find any relevant answer, as such information is one of the many jealously guarded secrets IHVs keep.

    Some pointer-following tests on RV770 and RV870 give (take this with a big grain of salt) an L1 capacity of 8 KB (yes, on RV770 too) with over a hundred cycles average latency for hits.
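
    To make the idea behind such a capacity test concrete, here is a minimal simulation sketch, not a GPU benchmark. It walks a fixed cycle of lines through a simulated cache and shows the hit rate collapsing once the working set exceeds the capacity. The 8 KB figure is the estimate quoted above; the 64-byte line size, the fully associative organisation and the LRU policy are assumptions made purely for illustration, and a real test would chase dependent pointers in shuffled order and time them.

```python
from collections import OrderedDict

LINE_BYTES = 64          # assumed line size, not taken from the thread
CACHE_BYTES = 8 * 1024   # the 8 KB L1 estimate quoted in the post

def chase_hit_rate(working_set_bytes, passes=4):
    """Walk a fixed cycle of lines through a fully associative LRU cache.

    Returns the hit rate over every pass after the first (warm-up) pass.
    """
    capacity = CACHE_BYTES // LINE_BYTES
    lines = working_set_bytes // LINE_BYTES
    cache = OrderedDict()            # keys are line numbers, oldest first
    hits = total = 0
    for p in range(passes):
        for line in range(lines):
            hit = line in cache
            if hit:
                cache.move_to_end(line)        # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
            if p > 0:                          # skip the warm-up pass
                hits += hit
                total += 1
    return hits / total

for kb in (2, 4, 8, 16):
    print(kb, "KB ->", chase_hit_rate(kb * 1024))
```

    Under these assumptions the hit rate stays at 1.0 up to 8 KB and drops to 0.0 at 16 KB; on real hardware the same cliff would show up as a jump in measured latency rather than a counted hit rate.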

    Regarding banking, well, I think ATI texture caches can at least be banked 4 times (for each pixel in a quad) `for free' as a bilinear fetch will never be able to create bank conflicts (it'll always be some permutation of a quad) and that's the only kind of access the hardware does handle.
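
    The "for free" claim can be checked with a short sketch. It assumes the bank is selected by the parity of the texel column and row, which is one plausible mapping and not a confirmed hardware detail; the function names are hypothetical. Texture-edge wrap and clamp modes are ignored.

```python
def bank(x, y):
    """Bank index (0..3) from the parity of the texel column and row."""
    return (x & 1) | ((y & 1) << 1)

def bilinear_footprint(x, y):
    """The 2x2 texels read by one bilinear sample whose top-left texel is (x, y)."""
    return [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]

def conflict_free(x, y):
    """True if the four texels of the footprint land in four different banks."""
    banks = [bank(tx, ty) for tx, ty in bilinear_footprint(x, y)]
    return len(set(banks)) == 4

# Every footprint position hits each bank exactly once.
assert all(conflict_free(x, y) for x in range(64) for y in range(64))
print(sorted(bank(tx, ty) for tx, ty in bilinear_footprint(5, 2)))
```

    Whatever the position, the footprint contains one even and one odd column and one even and one odd row, so the four texels always map to some permutation of banks 0 to 3.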

    I don't have much more to say than the fact it works pretty damn well in my experience (which actually makes me all the less curious about how it works... there are so very few things that just work)
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,910
    Likes Received:
    501
    To guarantee completely conflict-free access it would have to be 4-banked and 4-ported (quads can be guaranteed conflict-free with just banks, but the multiple parallel samples can only be handled with ports if you don't want to rely on coalescing). That means 16 ports on the tag CAM, which is an expensive CAM.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    There's a little more data in this presentation:

    http://developer.amd.com/gpu_assets... ACML-GPU SGEMM Optimization Illustration.ppt

    Though this thread:

    http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=123138&messageid=1062985

    mentions an error or two.

    Also, of course, there are the patents, which are pretty detailed when all is said and done. This is a deliciously comprehensive overview - it really is the mother of recent ATI patent docs:

    http://v3.espacenet.com/publication...=B1&FT=D&date=20091215&DB=EPODOC&locale=en_V3

    Decoding the damn things and putting together a comprehensive model, gulp...

    NVidia's L1 has historically held texels in their DXT-compressed form. ATI has historically held them decompressed.

    You also need to bear in mind L1s versus L2s, and how hardware threads are allocated across SIMDs, as that will affect the cache thrashing patterns caused by disparate screen tiles or by the phases of particularly long shaders.

    Jawed
     
  5. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,910
    Likes Received:
    501
    The patent applications 20090309896 and 20090315909 give some more recent overviews of the pipelines, but no real hint of the exact nature of the texture cache. Also, 20080273033 gives some clues about rasterization, and 20090164726 hopefully gives a glimpse of the future (native irregular z-buffers!).
     
  6. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,910
    Likes Received:
    501
    In the end I was probably making it too complex ... it probably simply is 4-banked with 4 ports on each bank (with 32-bit ports). Power-of-two tiling (potentially hierarchical). Banks for odd-line/odd-row, odd/even, even/odd and even/even. The 16-ported CAM really is a bit of a headache though (all the accesses can be on different cache lines).
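
    A small sketch of why the tag lookup is the headache: it counts how many distinct cache lines four bilinear samples can touch. The 4x4 texel tile per cache line is a hypothetical choice for illustration only, as are the function names and sample positions.

```python
TILE = 4  # assumed: each cache line holds a 4x4 texel tile (hypothetical)

def line_of(x, y):
    """Identify the cache line (tile) that holds texel (x, y)."""
    return (x // TILE, y // TILE)

def lines_touched(samples):
    """Distinct cache lines (tag lookups) needed by a list of bilinear samples.

    Each sample is given as the (x, y) of its top-left texel.
    """
    lines = set()
    for x, y in samples:
        for dx in (0, 1):
            for dy in (0, 1):
                lines.add(line_of(x + dx, y + dy))
    return len(lines)

neat = [(0, 0), (1, 0), (0, 1), (1, 1)]           # a tight pixel quad
scattered = [(3, 3), (19, 3), (3, 19), (19, 19)]  # each sample straddles a tile corner
print(lines_touched(neat), lines_touched(scattered))  # prints: 1 16
```

    The neat quad needs a single tag lookup, while four scattered samples that each straddle a tile corner need 16 different lines in the same cycle, which is the worst case the 16-ported CAM would have to cover.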

    PS. theoretically the APU from the last patent application could already be in there to create wavefronts ... but that seems an awfully limited use of such a powerful programmable mechanism.
     
    #6 MfA, Jan 21, 2010
    Last edited by a moderator: Jan 21, 2010
  7. Gorgonzola

    Newcomer

    Joined:
    Feb 28, 2005
    Messages:
    5
    Likes Received:
    3
    If the cache isn't fully associative, that simplifies the CAM design.
     
  8. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    From my investigations years ago, a fully associative cache gains you very little in hit rate over a set-associative one, but costs a great deal more in gates.
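
    For anyone who wants to try this kind of comparison on their own traces, here is a minimal simulator sketch. It assumes per-set LRU replacement and a simple modulo set index, neither of which is claimed to match any real GPU; the trace is a deliberately pathological toy, not measured texture traffic.

```python
from collections import OrderedDict

def hit_rate(trace, num_sets, ways):
    """Hit rate of a set-associative cache with per-set LRU replacement.

    trace is a sequence of cache line addresses; num_sets=1 means fully
    associative, ways=1 means direct mapped.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for line in trace:
        s = sets[line % num_sets]      # assumed modulo set index
        if line in s:
            hits += 1
            s.move_to_end(line)        # mark as most recently used
        else:
            s[line] = True
            if len(s) > ways:
                s.popitem(last=False)  # evict least recently used
    return hits / len(trace)

# All three configurations hold 8 lines in total.
trace = [0, 8] * 4  # two lines that collide in the direct-mapped case
print(hit_rate(trace, 8, 1), hit_rate(trace, 4, 2), hit_rate(trace, 1, 8))
# prints: 0.0 0.75 0.75
```

    On this toy trace the direct-mapped cache thrashes, while 2-way set associativity already matches the fully associative result. That is only one constructed case, so it says nothing general about real workloads.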
     