Well, trademarking it before would have given more clues about it being real, right?
The same happens with the PS5 and the absence of any mention of VRS or mesh shaders. I always thought the real cause was AMD, with Sony and the devs waiting for AMD to make its own presentation so that they could be more open about the PS5's tech afterwards.
It's generally the other way around with semicustom. The clients have often controlled what AMD can disclose, to the point that AMD hid that Bonaire was the same family as the current gen consoles until after they were fully disclosed/launched. I think it's more a question of how open the vendor wants to be, and that can be inconsistent.
1. 128 MB eSRAM pool (prior art: XB1) managed by “HBCC” (prior art: Vega)
I would wonder if the miss rate for 128 MB is too high relative to the fault rate for a 4-16 GB card. Miss handling would go from a hardware pipeline to what is likely a microcontroller monitoring page faults and DMA requests.
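To put rough numbers on why that shift matters, here's a minimal sketch; the hit and miss latencies below are purely illustrative guesses, not measured figures for any product:

```python
# Rough model of average access latency for a 128 MB pool, comparing a
# hardware-pipelined miss path against an HBCC-style fault/DMA path.
# All latency values are illustrative assumptions.

def effective_latency(hit_rate, hit_ns, miss_ns):
    """Average latency for a simple hit/miss model."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

HIT_NS = 100           # assumed hit latency of the on-die pool
HW_MISS_NS = 400       # assumed miss serviced by a hardware cache pipeline
FAULT_MISS_NS = 20000  # assumed miss serviced via page fault + microcontroller + DMA

for hit_rate in (0.90, 0.99, 0.999):
    hw = effective_latency(hit_rate, HIT_NS, HW_MISS_NS)
    sw = effective_latency(hit_rate, HIT_NS, FAULT_MISS_NS)
    print(f"hit rate {hit_rate:.3f}: hw-managed ~{hw:.0f} ns, fault-managed ~{sw:.0f} ns")
```

Even at a 99% hit rate the fault-driven path dominates the average, which is the concern with treating 128 MB as HBCC-managed memory rather than a hardware-managed cache.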
2. 128 MB memory-side “L3” cache
3. 1 and 2 combined + dynamic partitioning
Memory-side would impact the ISA and programming model less, and might make sense if the Infinity Cache name means there's a closer link to the data fabric than the anonymous L2 of the GPU. It could possibly make the concept something transferable to other uses or products, and memory-side placement may make coherent memory between GPU and CPU spaces less painful.
Another possibility along the lines of what Intel did with its external memory is using much larger sectors than what the L2 is likely to use, to keep down the volume of cache tags.
At 128MB, I'm curious if sectors could be a significant fraction of a DRAM page in size, which could smooth over more of the read/write scheduling headaches if the data loaded is not under-utilized. For ROP or other forms of shader export, there's a good chance for using and modifying a significant portion of a DRAM page. DCC might amplify this, if small changes can propagate throughout the deltas for a ROP tile.
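For a sense of why the tag volume matters here, a quick back-of-envelope; the ~32 tag+state bits per entry is an assumption, since the real figure depends on address width, associativity, and state encoding:

```python
# Tag storage for a 128 MB cache at different line/sector granularities.
CACHE_BYTES = 128 * 1024 * 1024
TAG_BITS = 32  # assumed tag + state bits per entry

for granularity in (64, 128, 1024, 2048):  # bytes; 1-2 KB approaches a DRAM page
    entries = CACHE_BYTES // granularity
    tag_kib = entries * TAG_BITS / 8 / 1024
    print(f"{granularity:4d} B entries -> {entries:7d} tags, ~{tag_kib:6.0f} KiB of tag storage")
```

At 64 B lines that's on the order of 8 MiB of tag storage alone, whereas DRAM-page-sized sectors bring it down to a few hundred KiB.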
Something like the RBEs or shader writeback to contiguous memory could allow for more aggressive write coalescing before memory. If not explicitly managed by software, perhaps a modification to the transactions over the fabric could indicate to the cache that there's an intent to make a significant number of writes or many reads to a given sector. The cache could flag those as being recently used, with other traffic like less localized texturing potentially being relegated to a subset of ways with more turnover. I think Nvidia's L1 has something like this, where streaming data is typically serviced by a subset of the cache to avoid thrashing. ROPs and exports would have a clear hard-wired preference that could be communicated to the cache, and the aggressive coalescing in the L1/L2 pipeline might autonomously generate similar hints to another cache.
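A toy sketch of what hint-biased allocation could look like at the set level; the way counts, hint names, and the policy itself are hypothetical, roughly in the spirit of how Nvidia's L1 reportedly confines streaming data:

```python
# Victim selection where "bulk" traffic (ROP/export-style, with a declared intent
# to read/modify much of a sector) can use any way, while low-locality streaming
# traffic is confined to a small subset so it can't thrash the whole set.

ALL_WAYS = set(range(16))   # assumed 16-way set
STREAM_WAYS = {0, 1}        # assumed ways open to streaming fills

def pick_victim_way(lru_order, hint):
    """lru_order lists ways from least- to most-recently used; hint is 'bulk' or 'stream'."""
    allowed = ALL_WAYS if hint == "bulk" else STREAM_WAYS
    for way in lru_order:   # evict the least-recently-used way the hint may use
        if way in allowed:
            return way
    return lru_order[0]     # fallback; not reached with the sets above
```

The interesting part is where the hint originates: hard-wired for RBE/export clients, or generated autonomously by the L1/L2 coalescing pipeline for other traffic.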
Explicit locking might be another possibility to manage on-die storage, although it may be more disruptive to the programming model.
4. 128 MB L2 Cache
5. 4 but with L2 cache line/range locking (e.g. lock some tiles of the render target in L2 during PS?)
The L2 may be in an awkward place due to its physical proximity to the key areas of the GPU, and 128MB at existing cache line lengths is a lot of lines+tags.
Power-wise, I'm curious if the level of activity within the GPU's internal caches is high enough to keep a lot of that 128MB at a more active state than otherwise desired.
It would be the most transparent implementation to the rest of the system, although it's still so much cache that I wonder if AMD wouldn't want it to be a more generally applicable cache block.
I agree that it's almost certainly going to be on-chip. The general rule of thumb for bandwidth of a cache like this is 2x memory bandwidth, so if we assume roughly 1 TiB/s and work backward from there using AMD's published figure for 2nd-gen Infinity Fabric on package (~9 pJ/bit), that gives you ~79 W consumed by the interconnect between a hypothetical off-die L4 cache and the GPU compute die. This isn't even counting the actual active power of the SRAM or eDRAM in the cache itself!
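Just to show where that number comes from, using the same figures as above:

```python
# ~1 TiB/s of cache traffic crossing an on-package link at ~9 pJ/bit.
bandwidth = 1 * 1024**4        # bytes/s, the assumed ~1 TiB/s cache bandwidth
energy_per_bit = 9e-12         # joules/bit, AMD's 2nd-gen IF on-package figure
watts = bandwidth * 8 * energy_per_bit
print(f"~{watts:.0f} W")       # -> ~79 W just to move the data across the link
```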
Power considerations aside, if this cache is another tier in the GPU's hierarchy, that assumed bandwidth figure reduces the amplification ratio that the L2 would provide over the layers below it.
With Navi 10, the L2's amplification was probably in the 3-4x range over VRAM. If the L2 doesn't scale significantly, that's only 1-2x over this proposed cache, and if the cache and VRAM are accessed in parallel by the GPU, the L2's amplification drops below 2x unless RDNA2 does something to scale it.
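Working from the ratios above rather than exact L2 specs; the ~3.5x and ~1 TiB/s values are just the assumptions already in play, not official numbers:

```python
# Rough amplification ratios implied by the figures discussed above.
vram = 448e9                 # Navi 10: 14 Gbps GDDR6 on a 256-bit bus
l2 = 3.5 * vram              # middle of the assumed "3-4x over VRAM" range
big_cache = 1.1e12           # the ~1 TiB/s assumed for the 128 MB cache
combined = big_cache + vram  # if the GPU can hit cache and VRAM in parallel

print(f"L2 over the proposed cache: ~{l2 / big_cache:.1f}x")  # ~1.4x
print(f"L2 over cache + VRAM:       ~{l2 / combined:.1f}x")   # ~1.0x
```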