It doesn't read as full replication at all to me: there are repeated descriptions of, and references to, fine-grained address interleaving across the participating caches in a cluster to dynamically deliver a larger effective cache capacity. Paragraph 62, for example, specifically says that the number of CU clusters is decreased if lines are experiencing a high level of sharing (i.e., a high replication level), and fewer CU clusters means a larger pool of caches to interleave across within each cluster.
Thanks, should have read the full patent text. My bad.
For a given number of CUs (e.g., N number of CUs 112 of FIG. 1), the number of CU clusters 120 determines a maximum number of cache line replicas at the GPU 104. Generally, increasing the number of CU clusters 120 (such as from two CU clusters in the embodiment of FIG. 1 to three or more) results in a smaller effective L1 cache capacity within each CU cluster and an increase in the number of cache line replicas at the GPU 104. Further, increasing the number of CU clusters 120 increases miss rates to the L1 caches (due to the smaller effective L1 cache capacity at each cluster) but decreases access latency to the L1 caches (due to fewer number of L1 caches at each CU cluster to traverse through for searching to locate a requested cache line). Similarly, decreasing the number of CU clusters 120 results in a decrease in the number of cache line replicas at the GPU 104 and a larger effective L1 cache capacity within each CU cluster that decreases miss rates to the L1 cache at the computational expense of longer L1 access latency. By increasing the effective L1 cache capacity, some applications may increase the L1 cache 116 hit rate, and therefore decrease L2 cache 118 pressure. Further, in some embodiments, the processing system 100 balances competing factors of the L1 cache 116 miss rate and L1 cache 116 access latency to fit a target application profile by dynamically changing the number of clusters.
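The tradeoff the paragraph describes can be sketched as a toy model. Everything here (CU count, L1 size, line size, the striping function) is an illustrative assumption, not taken from the patent:

```python
# Toy model of fine-grained address interleaving: within a cluster,
# consecutive cache lines are striped across the member CUs' L1 caches.
# All parameters are illustrative assumptions.

LINE_SIZE = 128   # bytes per cache line (assumed)
NUM_CUS = 16      # total CUs on the GPU (assumed)

def owner_cu(address: int, num_clusters: int, my_cluster: int) -> int:
    """Index of the CU whose L1 holds `address` within `my_cluster`."""
    cus_per_cluster = NUM_CUS // num_clusters
    line = address // LINE_SIZE
    slot = line % cus_per_cluster            # stripe successive lines
    return my_cluster * cus_per_cluster + slot

def replicas_and_capacity(num_clusters: int, l1_kib: int = 128):
    """Max line replicas on the GPU and pooled L1 capacity per cluster."""
    cus_per_cluster = NUM_CUS // num_clusters
    return num_clusters, cus_per_cluster * l1_kib
```

With two clusters, consecutive lines land in different CUs of the same cluster and each cluster pools 8 × 128 KiB = 1 MiB of L1; going to four clusters doubles the possible replicas but halves the pool per cluster, which is exactly the miss-rate vs. latency knob the paragraph describes.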
So trading access latency (when clustered and interleaved) vs small L1 (when not clustered / interleaved).
Though I'm having trouble seeing where this is applicable? It appears to be aimed primarily at reducing load on the (shared) L2 cache?
Is the L2 cache on a GPU actually still that much slower than the L1, such that even three trips over the crossbar can outweigh an L2 hit?
Rather than utilizing ring interconnects for CU-to-CU communication (such as previously discussed with respect to FIGS. 4-5), the GPU 600 of FIG. 6 positions one or more dummy communication nodes (e.g., dummy communication nodes 604, 606) on a side opposite that of the CUs 112 to receive requests from a CU and forward the requests to other CUs 112. As used herein, the term “dummy communication node” refers to a module or other structure that receives a request/reply from a CU, buffers the request/reply, and then forwards the request/reply to a destination CU.
Keep in mind that, in the context of the patent, even a store-and-forward approach was suggested for CU-to-CU communication, so this potentially wastes quite a few cycles?
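To put rough numbers on that worry, here's a toy latency model of a store-and-forward path; the cycle counts per hop are my assumptions, nothing from the patent:

```python
def request_latency(hops: int, link_cycles: int = 2, buffer_cycles: int = 1) -> int:
    """Cycles for a message crossing `hops` store-and-forward stages."""
    # Each hop: traverse the link, then sit in the node's buffer
    # before being forwarded on.
    return hops * (link_cycles + buffer_cycles)

# CU -> dummy node -> remote CU for the request, same path back for the
# reply: four hops, so 4 * (2 + 1) = 12 cycles under these made-up numbers.
round_trip = request_latency(4)
```

Even with these optimistic per-hop costs, the buffering at the dummy node adds up on every remote-L1 access, which is presumably why the latency-vs-capacity tradeoff matters.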
Further, GPUs often experience performance limitations due to LLC bandwidth in some workloads. Accordingly, the CU clustering discussed herein reduces pressure on the LLC and increases compute performance by improving L1 hit rates.
By the looks of it, yes. (3.90TB/s L1 bandwidth, 1.95TB/s L2 bandwidth on the Radeon RX 5700XT.)
For what it's worth, the author also often spoke only about a small number of CUs in each cluster, and about locality on the chip. So possibly this isn't even aiming at a global crossbar, but only at 4-8 CUs maximum in a single cluster?
I suppose 8 CUs with 128 kB of L1 each still yield a 1 MB memory pool. And keep in mind the L0 cache still sits above that, so at least some additional latency isn't going to ruin performance.
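Back-of-the-envelope check of that pool size (the 128 kB per CU is the figure from this thread, not a measured number):

```python
# Pooled L1 capacity for one hypothetical 8-CU cluster.
cus_in_cluster = 8
l1_per_cu_kib = 128
pool_kib = cus_in_cluster * l1_per_cu_kib   # 8 * 128 KiB
pool_mib = pool_kib / 1024                  # = 1.0 MiB effective pool
```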
An interesting question, can this reliably reduce the bandwidth requirements on the L2 so far, that distribution of L2 cache slices across dies becomes viable yet?
A clear "no" to that. Even when increasing the effective L1 size 8x this way, L1 cache misses are unavoidable in the bad cases. Maybe a 30-50% reduction in L2 hits on average, but not even proportional to the number of CUs participating in each cluster. And nothing changes about write-back bandwidth. Rough estimates, not properly founded. Still not even remotely enough to allow splitting the LLC.
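For a sense of what a reduction like that corresponds to: if L2 read traffic is proportional to the L1 miss rate, the cut in L2 reads follows from the before/after hit rates. The hit rates below are purely illustrative, not measurements:

```python
def l2_read_reduction(l1_hit_before: float, l1_hit_after: float) -> float:
    """Fractional cut in L2 read traffic when the L1 hit rate improves,
    assuming L2 reads scale with the L1 miss rate."""
    return 1 - (1 - l1_hit_after) / (1 - l1_hit_before)

# e.g. if pooling lifts the L1 hit rate from 70% to 85%:
# (1 - 0.85) / (1 - 0.70) -> half the L2 reads gone, write-backs untouched.
reduction = l2_read_reduction(0.70, 0.85)
```

Note this only touches the read side; write-back traffic to the L2 is unaffected, which is part of why this alone can't justify splitting the LLC across dies.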
Should still amount to a reasonable performance uplift in bad cases which had been suffering from an excessive L1 miss rate but a good L2 hit rate before. I could see this pushing the viable size limit for lookup tables by a fair amount.