GPU cache sizes and architectures

Infinisearch · Apr 7, 2015

Is there a resource somewhere with a comprehensive list of GPUs, their L1 and L2 cache sizes, and their architectures? I've recently started looking into the L2 cache sizes of recent GPUs, and it seems Maxwell has 2MB of L2 no matter the sub-model, while GCN cards seem to have 64kB-128kB per memory controller, implying the size varies with the sub-model. I've also tracked down a little info on a couple of Kepler-based GPUs. Has anyone been maintaining a list of at least the total L2 size per GPU?

I'm considering a test renderer that tiles the frame into "super tiles" to try to keep rendering within the L2 cache. The goal is to test the effect on fillrate, and on subsections of frame time, of various render target configs and geometry source variations. So basically I want to know how far back, GPU-wise, I can go while still having at least 256kB of L2.
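As a rough sketch of the sizing arithmetic behind the "super tile" idea: given an L2 budget and a render target config, the largest square tile whose footprint fits in that budget could be estimated as below. The function names, the 8-pixel alignment, and the example formats (RGBA8 color plus 32-bit depth) are assumptions for illustration only. The estimate ignores compression, texture and geometry reads, and anything else sharing the cache.

```python
import math

def max_square_tile(l2_bytes, bytes_per_pixel, align=8):
    """Largest square tile edge in pixels (a multiple of `align`) whose
    footprint at `bytes_per_pixel` fits within `l2_bytes`.

    bytes_per_pixel is the sum over all bound targets (color + depth).
    """
    if l2_bytes <= 0 or bytes_per_pixel <= 0 or align <= 0:
        raise ValueError("sizes must be positive")
    max_pixels = l2_bytes // bytes_per_pixel
    edge = math.isqrt(max_pixels)      # largest integer edge that fits
    return (edge // align) * align     # round down to the alignment

def tiles_for_frame(width, height, edge):
    """Number of super tiles needed to cover a width x height frame."""
    if edge <= 0:
        raise ValueError("tile edge must be positive")
    return math.ceil(width / edge) * math.ceil(height / edge)

L2_BUDGET = 256 * 1024   # the 256kB floor from the post above
BYTES_PER_PIXEL = 4 + 4  # assumed config: RGBA8 color + 32-bit depth

edge = max_square_tile(L2_BUDGET, BYTES_PER_PIXEL)
tiles = tiles_for_frame(1920, 1080, edge)
print(f"tile edge: {edge} px, tiles per 1920x1080 frame: {tiles}")
```

Fatter configs shrink the tile quickly, so in practice the usable budget would likely need to be set well below the nominal L2 size.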

homerdog · Apr 7, 2015

L2 size is not fixed for all Maxwell chips. GM107 and GM204 both have 2MB, but GM206 has 1MB and GM200 has 3MB.

Infinisearch · Apr 7, 2015

homerdog said:
L2 size is not fixed for all Maxwell chips. GM107 and GM204 both have 2MB, but GM206 has 1MB and GM200 has 3MB.

Thanks for the correction. Do you know any other GPU L2 sizes offhand?

mczak · Apr 7, 2015

GM108 should be the same size per partition as GM107 (so 1MB total).
(Though for Maxwell GM2xx the cache is actually per enabled ROP partition, hence the GTX 970 having less than 2MB.)
GK208 was 512kB per 64-bit partition (and in total). GK104-GK107 were 128kB per 64-bit partition, whereas GK110 (plus GK210) was 256kB per 64-bit partition.
I don't know the Fermi numbers offhand...
There's a table for GCN chips here:
http://www.hardware.fr/articles/926-2/tonga-gcn-1-2-256-bit-5-milliards-transistors.html
Though no one seems to know the L2 size of Tonga...
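The per-partition Kepler figures above become totals once multiplied by the number of 64-bit partitions. Here is a small sketch of that arithmetic, checked against the 256kB floor from the first post. The memory bus widths in the table are assumptions for the fully enabled chips (they are not stated in this thread), so the totals are estimates and cut-down SKUs may differ.

```python
PARTITION_BITS = 64   # per-partition width used in the post above
L2_FLOOR_KB = 256     # minimum L2 asked for in the first post

# name: (L2 per 64-bit partition in kB, ASSUMED full bus width in bits)
CHIPS = {
    "GK104": (128, 256),
    "GK106": (128, 192),
    "GK107": (128, 128),
    "GK110": (256, 384),
    "GK208": (512, 64),
}

def total_l2_kb(per_partition_kb, bus_width_bits):
    """Total L2 = per-partition size x number of 64-bit partitions."""
    if bus_width_bits <= 0 or bus_width_bits % PARTITION_BITS:
        raise ValueError("bus width must be a positive multiple of 64")
    return per_partition_kb * (bus_width_bits // PARTITION_BITS)

totals = {name: total_l2_kb(*spec) for name, spec in CHIPS.items()}
meets_floor = sorted(n for n, kb in totals.items() if kb >= L2_FLOOR_KB)

for name, kb in sorted(totals.items()):
    verdict = "ok" if kb >= L2_FLOOR_KB else "too small"
    print(f"{name}: {kb} kB total L2 ({verdict})")
```

Under these assumed bus widths, every listed Kepler chip reaches the 256kB floor, with GK107 sitting exactly on it.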

3dcgi · Apr 7, 2015

For AMD the L2 cache size isn't relevant for the test you propose. You care about the CB and DB caches. Sebbbi did an experiment like this a while back.

mczak · Apr 7, 2015

3dcgi said:
For AMD the L2 cache size isn't relevant for the test you propose. You care about the CB and DB caches. Sebbbi did an experiment like this a while back.

Right, that's something I always thought AMD might unify some day - I thought Tonga (GCN 1.2) might do it, but afaik it didn't happen (I don't think there was actually enough technical information published to rule it out completely, not sure if anyone actually tested it).

homerdog · Apr 7, 2015

3dcgi said:
For AMD the L2 cache size isn't relevant for the test you propose. You care about the CB and DB caches. Sebbbi did an experiment like this a while back.

wut? Lol why did they do it that way in the first place?

MDolenc · Apr 8, 2015

3dcgi said:
For AMD the L2 cache size isn't relevant for the test you propose. You care about the CB and DB caches. Sebbbi did an experiment like this a while back.

It's the same on NV. The L2 cache is pretty much a read-only thing; you can't output pixels to it and expect not to get bandwidth bound.

Basically, you need to know that the ROPs won't always output just plain RGBA values, since they might compress some blocks. That would complicate the contents of what you want in L2 quite a bit.

sebbbi · Apr 8, 2015

If you write data to UAVs, the L2 cache will hold that data. If you are doing an L2 test application, I would recommend doing reads/writes to/from UAVs in a compute shader.

Pixel shaders can also output to UAVs in addition to render targets. This way you can utilize both the L2 cache and the ROP cache of AMD GPUs while writing data from a pixel shader. Use the [earlydepthstencil] attribute for the pixel shader if you are using depth buffering. This forces the depth test to happen before the UAV write (reducing the overdraw bandwidth cost).

Writing to render targets uses slightly less BW on Tonga/Maxwell GPUs because of the color compression (as MDolenc said).

Infinisearch · Apr 9, 2015

mczak said:
GM108 should be the same size per partition as GM107 (so 1MB total).
(Though for Maxwell GM2xx the cache is actually per enabled ROP partition, hence the GTX 970 having less than 2MB.)
GK208 was 512kB per 64-bit partition (and in total). GK104-GK107 were 128kB per 64-bit partition, whereas GK110 (plus GK210) was 256kB per 64-bit partition.
I don't know the Fermi numbers offhand...
There's a table for GCN chips here:
http://www.hardware.fr/articles/926-2/tonga-gcn-1-2-256-bit-5-milliards-transistors.html
Though no one seems to know the L2 size of Tonga...

Thanks for the info.

3dcgi said:
For AMD the L2 cache size isn't relevant for the test you propose. You care about the CB and DB caches. Sebbbi did an experiment like this a while back.

Foiled again... yeah, equipped with the vocab of CB and DB caches, I can see the Southern Islands programming guide clearly states that said caches are disjoint from the shared cache (they don't evict into it). Sebbbi... prior art, you say... is there a thread/post for this? Oh, and is there anywhere a non-NDA person such as myself can find out about CB and DB sizes?

Infinisearch · Apr 9, 2015

mczak said:
Right, that's something I always thought AMD might unify some day - I thought Tonga (GCN 1.2) might do it, but afaik it didn't happen (I don't think there was actually enough technical information published to rule it out completely, not sure if anyone actually tested it).

homerdog said:
wut? Lol why did they do it that way in the first place?

My guess would be too much contention on the L2 and potentially layout constraints.

Infinisearch · Apr 9, 2015

MDolenc said:
It's the same on NV. The L2 cache is pretty much a read-only thing; you can't output pixels to it and expect not to get bandwidth bound.

Might I inquire as to your source for it being the same on NV?

3dcgi · Apr 10, 2015

Infinisearch said:
Thanks for the info.
Foiled again... yeah, equipped with the vocab of CB and DB caches, I can see the Southern Islands programming guide clearly states that said caches are disjoint from the shared cache (they don't evict into it). Sebbbi... prior art, you say... is there a thread/post for this? Oh, and is there anywhere a non-NDA person such as myself can find out about CB and DB sizes?

Found it.

http://forum.beyond3d.com/showpost.php?p=1755674&postcount=541

mczak · Apr 10, 2015

Infinisearch said:
Might I inquire as to your source for it being the same on NV?

It shouldn't be true. Most talk about the "unified L2 cache" only covers compute, but there are some versions of the Fermi whitepaper (v9) which explicitly say that the L2 replaces the separate ROP cache of prior GPUs (though not the official one you can still download, which appears to be newer). You can read it here, for instance: http://www.bjorn3d.com/2010/01/nvidia-gf100-fermi-gpu/
ROPs and L2 cache slices are VERY closely connected anyway (on Maxwell GM2xx, ROPs and L2 are disabled together, but independently of the MC). It may be possible, though, that Nvidia is doing some tricks there, like not using all associativity sets for ROPs. That whitepaper also talks about a sophisticated replacement priority, which I take to mean it's not just LRU but maybe takes into account where the request came from, though that is pure speculation.

Infinisearch · Apr 10, 2015

sebbbi said:
If you write data to UAVs, the L2 cache will hold that data. If you are doing an L2 test application, I would recommend doing reads/writes to/from UAVs in a compute shader.

Pixel shaders can also output to UAVs in addition to render targets. This way you can utilize both the L2 cache and the ROP cache of AMD GPUs while writing data from a pixel shader. Use the [earlydepthstencil] attribute for the pixel shader if you are using depth buffering. This forces the depth test to happen before the UAV write (reducing the overdraw bandwidth cost).

Writing to render targets uses slightly less BW on Tonga/Maxwell GPUs because of the color compression (as MDolenc said).

Well, this was just going to be about speeding up various stages of a renderer, and I don't want to get sidetracked... but an L2/cache testing codebase/app does sound like it might come in handy. Oh, and regarding the thread mentioned below, thanks to @sebbbi and @Gipsel for taking the time. (But why would you guys bury such important info on GCN in a Haswell vs. Kaveri thread, of all places? Have you considered a blog that just points to posts in B3D threads?)

3dcgi said:
Found it.

http://forum.beyond3d.com/showpost.php?p=1755674&postcount=541

Thanks a lot for going to the trouble.

Infinisearch · Apr 10, 2015

mczak said:
ROPs and L2 cache slices are VERY closely connected anyway (on Maxwell GM2xx, ROPs and L2 are disabled together, but independently of the MC). It may be possible, though, that Nvidia is doing some tricks there, like not using all associativity sets for ROPs. That whitepaper also talks about a sophisticated replacement priority, which I take to mean it's not just LRU but maybe takes into account where the request came from, though that is pure speculation.

Does anywhere besides that whitepaper talk about such things? Thanks for contributing.
