GPU cache sizes and architectures

Discussion in 'Architecture and Products' started by Infinisearch, Apr 7, 2015.

  1. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Is there a resource somewhere with a comprehensive list of GPUs, their L1 and L2 cache sizes, and their architectures? I've recently started looking into the L2 cache sizes of recent GPUs, and it seems Maxwell has 2MB of L2 no matter what sub-model, while GCN cards seem to have 64KB-128KB per memory controller, implying the total varies by sub-model. I've also tracked down a little info on a couple of Kepler-based GPUs. Has anyone been maintaining a list of at least the total L2 size per GPU?

    I'm considering a test renderer that does some tiling into "super tiles" to try to keep rendering in the L2 cache, to test the effect on fillrate and on subsections of frame time of various render target configs and geometry source variations. So basically I want to know how far back, GPU-wise, I can go while maintaining at least 256KB of L2.
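
    A rough way to size such a super tile is sketched below in Python. It only does the arithmetic: given an L2 capacity and the bytes written per pixel across all bound targets, it returns the largest square tile whose footprint fits in a fraction of the cache. The function name, the 50% budget and the 8-pixel alignment are illustrative assumptions, not values from any vendor documentation, and the sketch ignores compression and any other traffic competing for the cache.

```python
import math

KB = 1024


def max_super_tile(cache_bytes, bytes_per_pixel, budget=0.5, align=8):
    """Side length (pixels) of the largest square tile whose render-target
    footprint fits in `budget` * cache_bytes.

    bytes_per_pixel is the sum over all bound targets (colour + depth).
    budget and align are assumptions made for this sketch only.
    """
    usable_bytes = int(cache_bytes * budget)
    side = math.isqrt(usable_bytes // bytes_per_pixel)
    return (side // align) * align  # round down to the alignment


if __name__ == "__main__":
    # One RGBA8 colour target (4 B) plus a 32-bit depth buffer (4 B) = 8 B/pixel.
    for cache_kb in (256, 512, 2048):
        side = max_super_tile(cache_kb * KB, bytes_per_pixel=8)
        print(f"{cache_kb} KB L2 -> {side}x{side} super tile")
```

    With 256KB of L2 and 8 bytes per pixel this gives a 128x128 tile; fatter G-buffer layouts shrink the tile accordingly.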
     
  2. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,153
    Likes Received:
    928
    Location:
    still camping with a mauler
    L2 size is not fixed for all Maxwell chips. GM107 and GM204 both have 2MB, but GM206 has 1MB and GM200 has 3MB.
     
  3. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Thanks for the correction. Know any other GPU L2 sizes off hand?
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    GM108 should be the same size per partition as GM107 (so 1MB total).
    (Though for Maxwell GM2xx the cache is actually per enabled ROP partition, hence the GTX 970 having less than 2MB.)
    GK208 was 512kB per 64-bit partition (and total). GK104-GK107 was 128kB per 64-bit partition, whereas GK110 (plus GK210) was 256kB per 64-bit partition.
    I don't know the Fermi numbers off-hand...
    There's a table for GCN chips here:
    http://www.hardware.fr/articles/926-2/tonga-gcn-1-2-256-bit-5-milliards-transistors.html
    Though no one seems to know the L2 size of Tonga...
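
    The per-partition figures above turn into totals by multiplying by the number of 64-bit partitions. A minimal Python sketch follows, which also filters for the 256KB floor asked about earlier. The L2 sizes are the ones quoted in this thread; the bus widths are an assumption filled in from the public specs of the full chips and are not stated in the thread. Cut-down boards (the GTX 970, for example) will have less than the full-chip total.

```python
# Per-64-bit-partition L2 sizes in KB, as quoted in this thread for Kepler.
KEPLER_L2_PER_PARTITION_KB = {
    "GK104": 128, "GK106": 128, "GK107": 128,
    "GK110": 256, "GK208": 512,
}

# Memory bus widths in bits for the full chips.
# Assumption: taken from public specs, NOT from this thread.
BUS_WIDTH_BITS = {
    "GK104": 256, "GK106": 192, "GK107": 128,
    "GK110": 384, "GK208": 64,
}

# Maxwell totals in KB, as quoted in this thread
# (GM108 is a "should be", not a confirmed figure).
MAXWELL_L2_TOTAL_KB = {
    "GM107": 2048, "GM108": 1024, "GM204": 2048,
    "GM206": 1024, "GM200": 3072,
}


def total_l2_kb(chip):
    """Total L2 in KB for a fully enabled chip."""
    if chip in MAXWELL_L2_TOTAL_KB:
        return MAXWELL_L2_TOTAL_KB[chip]
    partitions = BUS_WIDTH_BITS[chip] // 64
    return partitions * KEPLER_L2_PER_PARTITION_KB[chip]


def chips_with_at_least(min_kb):
    """Sorted list of chips whose full-chip L2 is at least min_kb."""
    chips = list(KEPLER_L2_PER_PARTITION_KB) + list(MAXWELL_L2_TOTAL_KB)
    return sorted(c for c in chips if total_l2_kb(c) >= min_kb)


if __name__ == "__main__":
    for chip in chips_with_at_least(256):
        print(f"{chip}: {total_l2_kb(chip)} KB")
```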
     
  5. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    For AMD the L2 cache size isn't relevant for the test you propose. You care about the CB and DB caches. Sebbbi did an experiment like this a while back.
     
  6. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Right, that's something I always thought AMD might unify some day - I thought Tonga (GCN 1.2) might do it, but afaik it didn't happen (I don't think there was actually enough technical information published to rule it out completely, not sure if anyone actually tested it).
     
  7. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,153
    Likes Received:
    928
    Location:
    still camping with a mauler
    wut? Lol why did they do it that way in the first place?
     
  8. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    It's the same on NV. The L2 cache is pretty much a read-only thing; you can't output pixels to it and expect not to get bandwidth bound.

    Basically, you need to know that the ROPs won't always output just plain RGBA values, since they might compress some blocks. That would complicate the contents of what you want in L2 quite a bit.
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    If you write data to UAVs, the L2 cache will hold that data. If you are doing an L2 test application, I would recommend doing reads/writes to/from a UAV in a compute shader.

    Pixel shaders can also output to UAVs in addition to render targets. This way you can utilize both the L2 cache and the ROP cache of AMD GPUs while writing data from a pixel shader. Use [earlydepthstencil] attribute for the pixel shader if you are using depth buffering. This forces depth test before the UAV write (reducing overdraw bandwidth cost).

    Writing to render targets uses slightly less BW on Tonga/Maxwell GPUs because of the color compression (as MDolenc said).
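
    What such an L2 test would look for is the working-set size at which repeated passes over a buffer stop hitting in cache. The Python sketch below is a toy model of that, not GPU code: it simulates a fully associative LRU cache and sweeps the buffer repeatedly. The function name and the 64-byte line size are assumptions for illustration only, and real GPU L2s are set-associative with undocumented replacement policies, so measured curves will show a softer knee than this model.

```python
from collections import OrderedDict


def lru_hit_rate(working_set_lines, cache_lines, passes=4):
    """Hit rate of repeated linear sweeps over a buffer in a fully
    associative LRU cache. The first (cold) pass is not counted."""
    cache = OrderedDict()
    hits = accesses = 0
    for p in range(passes):
        for line in range(working_set_lines):
            hit = line in cache
            if hit:
                cache.move_to_end(line)        # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
            if p > 0:
                accesses += 1
                hits += hit
    return hits / accesses


if __name__ == "__main__":
    LINE_BYTES = 64                            # assumed line size
    cache_lines = 256 * 1024 // LINE_BYTES     # model a 256 KB cache
    for kb in (128, 256, 384, 512):
        lines = kb * 1024 // LINE_BYTES
        print(f"{kb} KB working set: hit rate {lru_hit_rate(lines, cache_lines):.2f}")
```

    In this model the hit rate is 1.0 while the buffer fits and drops straight to 0.0 once it exceeds the cache by even one line, which is the classic LRU thrashing case for cyclic access.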
     
  10. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Thanks for the info.
    Foiled again... yeah, equipped with the vocab of CB and DB caches, the Southern Islands programming guide clearly states that said caches are disjoint from (don't evict into) the shared cache. Sebbbi... prior art you say... is there a thread/post for this? Oh, and is there anywhere a non-NDA person such as myself can find out about CB and DB sizes?
     
  11. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    My guess would be too much contention on the L2 and potentially layout constraints.
     
  12. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Might I inquire as to your source for it being the same on NV?
     
  13. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    Found it.

    http://forum.beyond3d.com/showpost.php?p=1755674&postcount=541
     
  14. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    It shouldn't be true. Most talk about the "unified L2 cache" only covers compute, but there are some versions of the Fermi whitepaper (v9) which explicitly say that the L2 replaces the separate ROP cache of prior GPUs (though not the official one you can still download, but it appears to be newer). You can read it here for instance: http://www.bjorn3d.com/2010/01/nvidia-gf100-fermi-gpu/
    ROPs and L2 cache slices are VERY closely connected anyway (on Maxwell GM2xx, ROPs and L2 are disabled together, but independently of the MC). It may be possible, though, that Nvidia is doing some tricks there, like not using all associativity sets for ROPs. That whitepaper also talks about sophisticated replacement priority, which I take to mean it's not just LRU but maybe takes into account where the request came from, though that is pure speculation.
     
  15. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Well, this was just going to be in regards to speeding up various stages of a renderer... and I don't want to get sidetracked... but an L2/cache testing codebase/app does sound like it might come in handy. Oh, and in regards to the thread mentioned below, thanks to @sebbbi and @Gipsel for taking the time. (But why would you guys bury such important info on GCN in a Haswell vs. Kaveri thread of all places? Have you guys considered a blog that just points to posts in B3D threads?)
    Thanks a lot for going to the trouble.
     
  16. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Does anywhere besides that whitepaper talk about such things? Thanks for contributing.
     