So the diagrams in this document are interesting because the outputs of the Hull Shader and Tessellator create synchronisation/data-transfer workloads between the two APDs (accelerated processing devices). This is the "gnarly" part of a multi-chiplet architecture, because downstream function blocks determine which chiplet will consume which chunks of the output produced by HS or TS. The routing of the work is "late": screen space is used to determine the workload apportionment.
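Just to make the "late routing" idea concrete, here's a toy sketch. The 64x64 tile size, the checkerboard pattern and the two-APD count are my own assumptions for illustration, not details from the patent:

```cpp
// Toy sketch of "late", screen-space routing: which chiplet (APD) owns which
// screen tile. Tile size, checkerboard pattern and APD count are assumed.
#include <cstdio>

constexpr int kTileSize = 64;  // pixels per screen tile (assumed)
constexpr int kNumAPDs  = 2;   // two chiplets

// Which APD owns the screen tile containing pixel (x, y)?
int OwningAPD(int x, int y) {
    return ((x / kTileSize) + (y / kTileSize)) % kNumAPDs;  // checkerboard
}

int main() {
    // A tessellated patch whose screen-space bounding box spans several tiles
    // needs its HS/TS output forwarded to every APD that owns one of those tiles.
    int minX = 100, minY = 40, maxX = 300, maxY = 200;
    bool needed[kNumAPDs] = {};
    for (int ty = minY / kTileSize; ty <= maxY / kTileSize; ++ty)
        for (int tx = minX / kTileSize; tx <= maxX / kTileSize; ++tx)
            needed[OwningAPD(tx * kTileSize, ty * kTileSize)] = true;
    printf("forward to APD0: %s, APD1: %s\n",
           needed[0] ? "yes" : "no", needed[1] ? "yes" : "no");
}
```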
The synchronisation/data-transfer tasks require queues ("FIFO" in the document), which is where an L3 (Infinity Cache) would come in. The document locates the FIFOs within each APD, but if there's a monster chunk of L3, let's say 512MB, shared by the two chiplets, that would seem to be a preferable place. Buffers carved out of a shared L3 wouldn't waste die space the way dedicated memory blocks within each chiplet would whenever tessellation is not being used.
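To get a feel for the sizes involved, some back-of-envelope arithmetic. Every number below is an assumption I've picked for illustration, not a figure from the document:

```cpp
// Back-of-envelope sizing of the HS/TS output FIFOs. All figures are assumed.
#include <cstdio>

int main() {
    const double patches_in_flight = 10000.0;  // assumed
    const double verts_per_patch   = 64.0;     // assumed average after tessellation
    const double bytes_per_vertex  = 64.0;     // assumed: ~16 float attributes
    double total = patches_in_flight * verts_per_patch * bytes_per_vertex;
    printf("~%.0f MB of post-tessellation vertex data in flight\n",
           total / (1024.0 * 1024.0));
    // ~39 MB: a small slice of a 512MB shared L3, but a painful amount of SRAM
    // to dedicate per chiplet just in case tessellation is being used.
}
```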
AMD has always struggled with the stream-out functionality of the geometry shader compared with NVidia. NVidia handled SO better with on-chip buffers (cache), whereas AMD always chose off-chip memory (over time AMD's drivers messed about with GS parameters to try to avoid the worst problems caused by the volume of data GS produces). Similarly, tessellation has always caused AMD problems because on-die buffering and work distribution were very limited. In the end, SO buffering is effectively no different from the FIFOs required to support HS and TS work distribution. So whether we're talking about a single die or chiplets, fast, close memory is a key part of the solution.
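To put a number on the GS data-volume problem (again, all of the figures here are my own illustrative assumptions):

```cpp
// Why GS stream-out volume hurts: the GS can amplify geometry, so the data it
// emits can dwarf its input. All figures here are illustrative assumptions.
#include <cstdio>

int main() {
    const double input_triangles  = 1.0e6;  // assumed triangles per frame
    const double amplification    = 8.0;    // assumed: GS emits 8 triangles per input
    const double bytes_per_vertex = 48.0;   // assumed: position plus a few attributes
    double bytes = input_triangles * amplification * 3.0 * bytes_per_vertex;
    printf("~%.0f MB of stream-out data per frame\n", bytes / (1024.0 * 1024.0));
    // ~1100 MB per frame: hopeless to hold entirely on chip, which is why the
    // choice of buffering strategy (on-chip cache vs off-chip memory) matters.
}
```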
So if AMD is to solve the FIFO problem properly, it will need to use a decent chunk of on-package memory.
Similarly the "tile-binned" rasterisation in:
https://www.freepatentsonline.com/10957094.html
would seem to depend upon an L3. We've seen from NVidia's tile-binned rasterisation that the number of triangles/vertices that can be cached on die varies according to the number of attributes associated with each vertex (and the per-pixel byte count defined by the render target format). I don't think we've ever really seen a performance-degradation analysis for NVidia in games according to the per-vertex/per-pixel data load, but as time has gone by NVidia appears to have substantially increased the size of the on-die buffers that support tile-binned rasterisation.
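Roughly, here's how the per-vertex attribute load eats into an on-die binning buffer. The 2MB buffer size and the attribute counts are assumptions I've made up for illustration, not figures from NVidia:

```cpp
// Rough look at bin capacity vs per-vertex attribute load. Buffer size and
// attribute counts are assumed for illustration.
#include <cstdio>

int main() {
    const double bin_buffer_bytes = 2.0 * 1024.0 * 1024.0;  // assumed on-die binning storage
    const int attrib_counts[] = {4, 8, 16, 32};              // vec4 attributes per vertex
    for (int attribs : attrib_counts) {
        double bytes_per_vertex = attribs * 16.0;            // 16 bytes per vec4
        double verts = bin_buffer_bytes / bytes_per_vertex;
        printf("%2d attributes -> ~%5.1fK vertices (~%4.1fK triangles) per binning pass\n",
               attribs, verts / 1000.0, verts / 3000.0);
    }
    // Fatter vertices mean fewer primitives per pass, so bins flush earlier and
    // more often -- hence the incentive to keep growing the on-die buffers.
}
```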
It seems to me that a monster Infinity Cache lies at the heart of these algorithms for AMD. Well, I imagine that comes across as "stating the obvious", but NVidia has been using a reasonably large on-die cache for quite a long time, so it's time AMD caught up.
In theory, with RDNA 2, Infinity Cache is already making tessellation work better. But I don't remember seeing any analysis.
Stupid question time: I can't find a speculation thread for NVidia GPUs after Ampere. Is there one?