I was referencing them to give some perspective on the area occupied by the caches in recent designs.
The reason I had questions was the claim about GPU L2 area. The area saved by moving the GPU L2 storage into the memory stack is not that major, and for various reasons it may not be a significant win if AMD's concept is taken as evidence.
In regards to the ESRAM, or some form of scratchpad, that's not necessarily about the memory itself, but about energy consumption at an exascale level. Memory stacked on the logic die could conceivably appear as L2/L3 depending on the function of the processor.
TOP-PIM and the GPU chiplets in the paper already have an HBM or similar stack right on top of them.
The capacities traditionally are designed with planar area in mind; in exascale, with 3D designs, energy is the larger concern. Going off-package is simply bad. The design focus will always be keeping cache close to logic, and going vertical with the cache should yield shorter paths, which consume less energy.
The concept in question already stacks the DRAM on top of the logic, so the off-package issue is resolved.
What this does to the current data flow between DRAM, L2, and on-die clients is add another two vertical movements: DRAM to L2, then L2 to clients. With GCN's write-through L1 to L2, everything coherent is going off-die all the time. The cost of going through TSVs and bumps is slightly higher than a similar traversal through standard metal layers, but that is an incremental amount on top of the extra traversal(s).
Also, the horizontal traversal is still there, given that the long row of CU clients and control processors laid out horizontally dwarfs the dimensions of the L2 blocks. Making the L2 big enough to take up horizontal space under the die just makes it very likely that the distance traversed approaches twice what a small on-die L2 would require (and this is, again, with a write-through L1).
Since it sits below the logic, there are very likely density penalties (larger horizontal distances) from all the power and IO drilled through the L2 layer.
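A back-of-the-envelope way to picture it--the per-hop energy numbers below are made-up placeholders, only the hop counts reflect what I described above:

```python
# Rough tally of per-access movement, a sketch with assumed energy costs.
# The absolute pJ/bit figures are placeholders; the point is only that the
# stacked L2 adds vertical TSV/micro-bump crossings while the horizontal
# on-die wiring is still needed either way.

WIRE = 0.10   # assumed pJ/bit for the horizontal on-die traversal
TSV  = 0.15   # assumed pJ/bit for one TSV + micro-bump crossing

# On-die L2: DRAM comes down into the die once, then horizontal wire to L2
# and on to the CU clients.
on_die = 1 * TSV + 1 * WIRE

# L2 in a layer below the logic: the same traffic picks up roughly two more
# vertical movements (down to the L2 layer, then back up to the clients),
# and the horizontal distance does not go away.
stacked = 3 * TSV + 1 * WIRE

print(f"on-die L2 : ~{on_die:.2f} pJ/bit (assumed figures)")
print(f"stacked L2: ~{stacked:.2f} pJ/bit (assumed figures)")
```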
The L2 serves as a concentration point, so by design it is not local to everything that uses it. In theory, a more local change would be bigger L1/LDS/register files--but those would make the TSV/interface problem in my next concern even worse.
The other concern is the space taken up by doing this, which AMD hasn't indicated would be fully addressed. The interface area for the 1024 IOs and power for an HBM stack looks like it might be 1/5 (or more?) of the stack's footprint. An L2 that services that many channels with 64-byte lines would, just for read service, need to be 4 times wider, if it's not bidirectional like the ESRAM. The GPU chiplets are likely to lose a fair chunk of area to the DRAM stack's data lines and power/ground, equivalent to what each stack layer loses to vertical connectivity, before the L2 adds its 4x burden on top.
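Rough wiring arithmetic behind the "4 times wider" remark, assuming a first-gen-style 8-channel x 128-bit HBM stack; the line-wide read port per channel is my own framing of it:

```python
# Compare an HBM stack's data IO count against the wires an L2 would need
# to return a full 64B line per channel for reads alone.
HBM_CHANNELS      = 8      # assumed HBM-style stack organization
HBM_BITS_PER_CHAN = 128
CACHE_LINE_BYTES  = 64

hbm_data_ios  = HBM_CHANNELS * HBM_BITS_PER_CHAN        # 1024 data IOs
l2_read_wires = HBM_CHANNELS * CACHE_LINE_BYTES * 8     # 4096, read path only

print(f"HBM data IOs       : {hbm_data_ios}")
print(f"L2 read-only wires : {l2_read_wires}")
print(f"ratio              : {l2_read_wires // hbm_data_ios}x")
```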
Worth noting that even Nvidia implemented operand caches to reduce access energy to the L1. That's a structure that is already quite close, and the energy was still significant. It seems increasingly likely that Vega did something similar, and Zen has op caches as well.
The operand cache is physically small because it stays on-die. The current 2.5D and 3D integration methods use vias and pads that measure in tens of microns, with 40-55um pitch, in place of wires and features that measure somewhat over 10 nanometers. That still makes sense versus the big wires and pads for going off-package, which measure in millimeters.
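To put those scales side by side, here's a quick footprint comparison; the on-die metal pitch and the package ball pitch are assumptions for illustration, only the micro-bump pitch comes from the figures above:

```python
# Area claimed per connection on a square grid, for the three pitch scales
# being compared (numbers other than the micro-bump pitch are assumptions).

MICROBUMP_PITCH_UM = 45.0    # mid-range of the 40-55um pitch cited above
ON_DIE_PITCH_UM    = 0.08    # assumed ~80nm metal pitch for fine on-die wires
PACKAGE_PITCH_UM   = 500.0   # assumed ~0.5mm pitch for off-package balls

def per_connection_area(pitch_um):
    """Footprint of one connection on a square grid with the given pitch."""
    return pitch_um * pitch_um

print(f"on-die wire : {per_connection_area(ON_DIE_PITCH_UM):12.4f} um^2")
print(f"micro-bump  : {per_connection_area(MICROBUMP_PITCH_UM):12.1f} um^2")
print(f"package ball: {per_connection_area(PACKAGE_PITCH_UM):12.1f} um^2")
```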
The problem with a multi-gigabyte LLC in HBM is that the size of the tag arrays gets unwieldy. If you have 4GB of HBM and a cache line size of 64 bytes, you end up with 64M cache lines. Each of the tags for these lines needs 42-44 bits for the address plus a few bits for state (MOESI, etc.).
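Completing that arithmetic--the only thing assumed here is a 3-bit state field:

```python
# Tag storage needed to track 4GB of HBM as a cache at 64B line granularity.
capacity_bytes = 4 * 1024**3     # 4GB HBM used as an LLC
line_bytes     = 64
tag_bits       = 44              # upper end of the 42-44 bit range above
state_bits     = 3               # assumed MOESI-style state field

lines     = capacity_bytes // line_bytes                # 64M lines
tag_bytes = lines * (tag_bits + state_bits) // 8

print(f"cache lines : {lines // 2**20}M")               # 64M
print(f"tag storage : {tag_bytes / 2**20:.0f} MB")      # ~376 MB of tags
```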
Since HBM is DRAM, the natural alignment is the 1 or 2KB page, at least if there is to be any power efficiency and bandwidth utilization. Atomics, false sharing, or other cache operations would incur more complexity if the 64B granularity is kludged into the DRAM arrays, worsened by the long latencies and other device restrictions of the DRAM.
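Just to put a number on the mismatch:

```python
# Ratio of a DRAM page activation to a single 64B line request.
LINE_BYTES = 64
for page_bytes in (1024, 2048):   # the 1KB and 2KB pages mentioned above
    print(f"{page_bytes}B page / {LINE_BYTES}B line = "
          f"{page_bytes // LINE_BYTES}x activation granularity per request")
```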
When moving data in clusters of this size, perhaps it might start looking like an in-memory physical disk system? It might start doing things like clustering data, compressing tags, or doing some kind of search over a region-based or segmented storage system.