I know I'm late, but I want to chime in on the latency vs. bandwidth debate. I believe a large LLC would not only provide more bandwidth at lower latency, but could also let AMD relax the amount of latency hiding needed per CU. For example, Nvidia runs 32 threads per shader to hide the very long latency of GDDR6X; with a large LLC, AMD could play the odds, assume most memory accesses hit the cache, and lower the number of threads per shader to something like 16* or 8*, freeing up registers and thus transistors!
*Note: I don't know for sure how many threads of latency hiding RDNA 1 has, since I haven't worked with AMD GPUs at a low level in a very long time.
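To make the "play the odds" argument concrete, here is a rough toy model in Python. Everything in it is an assumption for illustration: the function names, the latency figures, the hit rate, and the idea that each resident thread covers a fixed number of cycles are all made up, not measurements of any real GPU.

```python
import math


def expected_latency(hit_rate: float, cache_latency: float, dram_latency: float) -> float:
    """Average memory latency when a fraction `hit_rate` of accesses hit the LLC."""
    return hit_rate * cache_latency + (1.0 - hit_rate) * dram_latency


def threads_needed(latency_cycles: float, cycles_covered_per_thread: float) -> int:
    """Threads required to cover a latency, if each one hides a fixed number of cycles."""
    return math.ceil(latency_cycles / cycles_covered_per_thread)


# Hypothetical placeholder values, chosen only to keep the arithmetic easy.
DRAM_LATENCY = 400.0      # cycles, assumed
CACHE_LATENCY = 100.0     # cycles, assumed
HIT_RATE = 0.75           # assumed LLC hit rate
CYCLES_PER_THREAD = 25.0  # cycles of cover per resident thread, assumed

without_llc = threads_needed(DRAM_LATENCY, CYCLES_PER_THREAD)
with_llc = threads_needed(
    expected_latency(HIT_RATE, CACHE_LATENCY, DRAM_LATENCY), CYCLES_PER_THREAD
)
print(f"threads to cover DRAM latency:    {without_llc}")
print(f"threads to cover average latency: {with_llc}")
```

Note that this only sizes for the average case. Any access that misses the cache still sees the full DRAM latency, so a design tuned this way would stall more on misses.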
GCN has 10 wavefronts per SIMD, RDNA1 has 20, at least some of the RDNA2-based consoles have 20, and Sienna Cichlid appears to have 16.
The different SIMD widths and cadences make comparisons more complex. When a single GCN wavefront stalls on an instruction, up to 9 other wavefronts might each have one non-stalling instruction to issue on a 4-cycle cadence, giving 36 cycles if they each issued one instruction. An RDNA SIMD would have 19 other wavefronts, but they would be single-cycle, giving only 19 cycles. AMD provided a wave64 mode, which among other things emulates a 2-cycle cadence and brings the number of lanes per wave into a more equivalent relationship with a GCN wavefront. The apparent recommendation is to use that mode for workloads like pixel shading where latency hiding is a major need, and the 2-cycle cadence gives roughly the same number of cycles as GCN (38 versus 36).
Sienna Cichlid reduces the amount somewhat in this scenario, although the majority of the hiding is still possible.
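The cycle counts above can be sketched in a few lines of Python. This is only back-of-the-envelope arithmetic under the same simplification used above (every other wavefront issues exactly one non-stalling instruction). The function name is made up, and the Sienna Cichlid row assumes the same wave64 2-cycle cadence applies to its 16 wavefronts.

```python
def hiding_cycles(wavefronts_per_simd: int, issue_cadence: int) -> int:
    """Cycles covered by the other wavefronts while one wavefront is stalled.

    Assumes each of the other wavefronts issues exactly one non-stalling
    instruction, taking `issue_cadence` cycles.
    """
    return (wavefronts_per_simd - 1) * issue_cadence


# (wavefronts per SIMD, issue cadence in cycles)
configs = {
    "GCN": (10, 4),
    "RDNA1 wave32": (20, 1),
    "RDNA1 wave64": (20, 2),
    "Sienna Cichlid wave64": (16, 2),  # assumed: same 2-cycle cadence
}

for name, (waves, cadence) in configs.items():
    print(f"{name}: {hiding_cycles(waves, cadence)} cycles")
```

Under these assumptions GCN gets 36 cycles, RDNA1 gets 19 in wave32 and 38 in wave64, and a 16-wavefront SIMD in wave64 gets 30.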
Not evaluated here is how many memory accesses a wavefront can issue back to back before it stalls. Vega doubled the amount over earlier GCN, and RDNA further split reads and writes to allow for more flexible batching of accesses and to avoid stalls on generally fire-and-forget writes.
Wave32 would be more latency-sensitive, but it may make more sense for less parallel workloads or ones better contained in the LDS or register file.
Guys, why are you still mentioning MALL in reference to the cache subsystem?
It's part of DCN (the display controller) and it's about framebuffer timings etc.
Or am I reading that code incorrectly?
It depends on whether this MALL functionality is interpreted as applying only to the display controller, or as a larger functionality change with a subset relating to the display controller.
Changes relating to cache allocation masks and a change to page table entries seem to imply that things like depth tiling, SDMA, and other functions could be related.
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055005.html
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055006.html
Values introduced in the first show up in the second, which deals with MALL.