AMD says, at 4k across [a variety] of top gaming titles, they are getting a hit rate of about 58%.
Thanks! I don't suppose they gave any figures for (hypothetical) 32, 64, 256, 512…MB caches, did they?
Nope, only for the 128MB in N21. N22 and N23 are rumoured to be 96 and 32MB respectively so we should get figures for those when the cards launch.
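For a rough sense of why that hit rate matters, here's a back-of-the-envelope blend of cache and DRAM bandwidth by hit/miss ratio. The bandwidth numbers are placeholders I picked for illustration, not AMD figures.

```python
# Back-of-the-envelope model of effective bandwidth from a cache hit rate.
# The bandwidth figures below are illustrative placeholders, not AMD specs.

def effective_bandwidth(hit_rate, cache_bw_gbs, dram_bw_gbs):
    """Blend cache and DRAM bandwidth by hit/miss ratio."""
    return hit_rate * cache_bw_gbs + (1.0 - hit_rate) * dram_bw_gbs

if __name__ == "__main__":
    dram_bw = 512.0    # e.g. a 256-bit GDDR6 setup at 16 Gbps
    cache_bw = 2000.0  # assumed on-die cache bandwidth, purely illustrative
    for hit_rate in (0.0, 0.3, 0.58, 0.8):
        bw = effective_bandwidth(hit_rate, cache_bw, dram_bw)
        print(f"hit rate {hit_rate:.0%}: ~{bw:.0f} GB/s effective")
```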
For APUs, there's nothing technically preventing them from having the GPU share the CPU L3, is there? I believe Intel iGPUs already do this.
MCMs generally only make sense if you want to build as powerful an enterprise-class system as possible inside one package.
If you want to make a consumer-priced chip that has 10x-100x the performance of a 3090 or 6090, splitting the logic into multiple dies and taking on lots of communication overhead and extra latency will not help you. When you can afford the die area needed for that performance, you can also afford that area inside a single logic die and have none of these problems.
CPU L3 in Zen is architecturally a private victim cache of a CCX. The whole CCX is one complete black box with a coherent-master role on the data fabric, only being snooped by the home memory controller for coherence protocol traffic. Otherwise, we wouldn't even have this big Zen 3 selling point of two CCXs merging into one.
“Infinity Cache” has also been said to amplify bandwidth by having many slices tied to fabric nodes that apparently scale with the memory channels, whereas a CCX has only one single port out to the SDF (the IF data plane). So there aren't many possibilities left for how “Infinity Cache” works. It is either:
1. a memory-side cache, meaning it is part of the memory controllers; the GPU IP does not see it directly, and uses it implicitly whenever it accesses memory via the SDF; or
2. a GPU-private cache, meaning L2 misses go to this GPU-internal LLC, and only those that still miss in the LLC get turned into memory requests on the SDF.
Either way, GPUs generally have non-coherent requests going into the SDF; those are routed to the home MC directly and generate no further coherence traffic. So not even a snoop will hit the CCX and, in turn, its private L3 cache controllers.
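To make the two options concrete, here's a toy sketch of the two request paths; the class and function names are my own abstraction, not anything from AMD.

```python
# Toy sketch of the two request paths discussed above (my own abstraction,
# not AMD's implementation). Option 1: the cache sits with the memory
# controller behind the fabric; option 2: the cache is a GPU-private LLC
# checked before a request is ever put on the fabric.

class Cache:
    def __init__(self, lines):
        self.lines = set(lines)      # addresses currently resident

    def lookup(self, addr):
        return addr in self.lines

def option1_memory_side(addr, llc, log):
    log.append("L2 miss -> SDF request to home memory controller")
    if llc.lookup(addr):
        log.append("MC-side LLC hit -> no DRAM access")
    else:
        log.append("MC-side LLC miss -> DRAM access")

def option2_gpu_private(addr, llc, log):
    log.append("L2 miss -> GPU-internal LLC lookup")
    if llc.lookup(addr):
        log.append("LLC hit -> request never reaches the SDF")
    else:
        log.append("LLC miss -> SDF request to home memory controller -> DRAM")

if __name__ == "__main__":
    llc = Cache(lines={0x1000})
    for fn in (option1_memory_side, option2_gpu_private):
        log = []
        fn(0x2000, llc, log)         # an address that misses in the LLC
        print(fn.__name__, "=>", *log, sep="\n  ")
```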
If we contemplate whether a GPU is likely to be able to "dynamically lock" certain surfaces as regions, even if for only a portion of a render pass(?), which of 1 or 2 is more likely? It would seem that 2 fits the bill.
As the Linux driver patches seem to suggest, 64KB GPUVM pages can be individually marked as LLC No-Alloc, which is somewhat the inverse of pinning things.
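A minimal sketch of how a per-page no-alloc attribute could steer allocation decisions: the 64KB granularity and the No-Alloc idea come from the driver-patch discussion above, everything else (the names and the page-table dict) is hypothetical.

```python
# Minimal sketch of a per-page "LLC no-alloc" attribute. The 64 KiB page
# granularity and the no-alloc flag come from the discussion above; the
# data structures and names are my own illustration.

PAGE_SIZE = 64 * 1024   # 64 KiB GPUVM pages

# Hypothetical page attribute table: page base -> allow LLC allocation?
page_llc_alloc = {
    0x0000_0000: True,    # e.g. a frequently reused render target
    0x0001_0000: False,   # e.g. a streaming buffer marked LLC No-Alloc
}

def on_llc_miss(addr, llc_lines):
    """On an LLC miss, only allocate a line if the page permits it."""
    page = addr & ~(PAGE_SIZE - 1)
    if page_llc_alloc.get(page, True):
        llc_lines.add(addr & ~63)          # allocate the 64 B line
        return "allocated in LLC"
    return "bypassed LLC (no-alloc page)"

if __name__ == "__main__":
    lines = set()
    print(hex(0x0000_0040), on_llc_miss(0x0000_0040, lines))
    print(hex(0x0001_0040), on_llc_miss(0x0001_0040, lines))
```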
It could be a potential bolt-on for EPYC and Instinct GPUs.
... and APUs.
What about Samsung's next-gen Exynos?
IC could be a candidate there too, while we're at it. Super Resolution tech would be awesome for mobiles too.
16 CUs @ 2+ GHz with IC would be insane: roughly 4 TF, Xbox Series S level.
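The arithmetic behind that estimate, assuming RDNA-style CUs with 64 FP32 lanes and an FMA counted as two ops per clock:

```python
# Quick arithmetic behind the "~4 TF with 16 CUs at 2+ GHz" claim, assuming
# RDNA-style CUs with 64 FP32 lanes and FMA counted as 2 ops per cycle.
cus, lanes_per_cu, ops_per_lane, clock_ghz = 16, 64, 2, 2.0
tflops = cus * lanes_per_cu * ops_per_lane * clock_ghz / 1000
print(f"{tflops:.3f} TFLOPS FP32")   # ~4.1 TFLOPS, roughly Xbox Series S class
```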
A chiplet system includes a central processing unit (CPU) communicably coupled to a first GPU chiplet of a GPU chiplet array. The GPU chiplet array includes the first GPU chiplet communicably coupled to the CPU via a bus and a second GPU chiplet communicably coupled to the first GPU chiplet via a passive crosslink. The passive crosslink is a passive interposer die dedicated for inter-chiplet communications and partitions systems-on-a-chip (SoC) functionality into smaller functional chiplet groupings.
GPU CHIPLETS USING HIGH BANDWIDTH CROSSLINKS - ADVANCED MICRO DEVICES, INC. (freepatentsonline.com)
The primary chiplet designation appears to be relevant only in the context of host communication (presumably the PCIe controller and a "lead" SMU for DVFS coordination and the like). This isn't a new semantic for Infinity Fabric, considering that we've seen similar situations with working solutions since Zen 1. More specifically, we had multiple Zeppelin chips in a package/system, each of which owns a replica of resources (incl. PSP and SMU) that have roles requiring one exclusive actor in the system (e.g., parts of the secure boot sequence).
Seems to be a design without TSVs and uses a single chiplet as a master with the others as slaves. Last level cache coherency with dedicated routes through chiplet PHYs and passive interposer connections appears to be the technique by which communication amongst the chiplets is achieved. This would appear to imply that Infinity Cache is crucial to this architecture.
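A toy routing sketch of the master/secondary arrangement the abstract and this post describe, as I read it: the CPU talks only to the primary chiplet over the bus, and the primary reaches its peers over the passive crosslink. Purely illustrative, not the patent's own description.

```python
# Toy sketch of the topology described in the abstract (my reading): the CPU
# is coupled only to chiplet 0 via a bus, and chiplet 0 reaches the other
# chiplets over the passive crosslink.

def route_cpu_request(target_chiplet):
    """Return the hops for a host request addressed to a given GPU chiplet."""
    hops = ["CPU --bus--> chiplet 0 (primary)"]
    if target_chiplet != 0:
        hops.append(f"chiplet 0 --passive crosslink--> chiplet {target_chiplet}")
    return hops

if __name__ == "__main__":
    for tgt in range(3):
        print(f"request for chiplet {tgt}: " + " ; ".join(route_cpu_request(tgt)))
```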
If anything, Figure 2 seems to depict a 2.5D interposer with TSVs passing through to pins on the substrate. At least, this is the only configuration that works in such a way and exists commercially.
5. The system of claim 4, wherein the first PHY region of the first GPU chiplet comprises a first passive crosslink PHY including conductor traces solely for communications between the passive crosslink and a last level cache of the first GPU chiplet.
8. The system of claim 1, further comprising: a first cache memory hierarchy at the first GPU chiplet, wherein a first level of the first cache memory hierarchy is coherent within the first GPU chiplet; and a second cache memory hierarchy at the second GPU chiplet, wherein a first level of the second cache memory hierarchy is coherent within the second GPU chiplet.
9. The system of claim 8, further comprising: a unified cache memory including both a last level of the first cache memory hierarchy and a last level of the second cache memory hierarchy, wherein the unified cache memory is coherent across all chiplets of the GPU chiplet array.
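One toy way to read claims 8-9: caches below the last level are only coherent within their own chiplet, so a request that needs device-wide visibility has to be resolved at the unified LLC. The scope names and lookup order below are my own framing, not the patent's.

```python
# A toy reading of claims 8-9 (my own framing, not the patent's wording):
# caches below the last level are coherent only within their chiplet, so a
# request needing device-wide visibility must be resolved at the unified LLC,
# while a chiplet-scope request may be satisfied by the local L2.

def lookup(addr, scope, local_l2, unified_llc):
    if scope == "chiplet" and addr in local_l2:
        return "hit in local L2 (coherent within this chiplet only)"
    if addr in unified_llc:
        return "hit in unified LLC (coherent across all chiplets)"
    return "miss -> memory controller"

if __name__ == "__main__":
    l2 = {0x100}
    llc = {0x100, 0x200}
    print(lookup(0x100, "chiplet", l2, llc))
    print(lookup(0x100, "device", l2, llc))   # must bypass the chiplet-local L2
    print(lookup(0x300, "device", l2, llc))
```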
IMO this is arguing semantics. A patent describing things metaphorically is to be expected, while the public-domain knowledge on packaging technologies indicates that likely only 2.5D interposers or EMIB/LSI can deliver the bump and wire density required. That is, unless you assume a SerDes that significantly outclasses existing on-package variants is used. Such detail is left vague by the patent, as expected.
As to the way that a GPU is constructed from chiplets, there is no need to make the chiplets uniform. This patent is merely about the communication method amongst chiplets and how it is formed using passive interlinks and a last-level cache coherency protocol.
Not disagreeing. I was trying to put this in perspective with the public-domain information on the Scalable Data Fabric. My interpretation is that this HBX crosslink is no different from existing blocks like CAKE/IFIS (inter-socket) or IFOP (on-package), which are network-layer constructs designed for a particular signalling medium, with configurable routing in the grand data fabric scheme.
HBX is peer-to-peer, dedicated to the last level cache; in other words, the last level cache is formed of units that are not in a mesh, it's fully connected. What the patent describes as the last level cache is required for a GPU constructed of chiplets to function; there is no optionality here.
It is hard for me to argue if we strictly go by the patent text only. But let's say the (presumably) memory-side last-level cache (paragraph 33) always misses or has zero capacity: does the system functionally fall apart? It doesn't. Life on the fully connected HBX interconnect still goes on, just that requests now always hit the memory controller.
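A small sketch of the fully connected, peer-to-peer HBX reading above (my interpretation only): every chiplet gets a dedicated link to every other chiplet's LLC slice, so any LLC access is at most one hop and the link count grows as n*(n-1)/2.

```python
# Sketch of a fully connected, peer-to-peer crosslink between chiplet LLC
# slices (my interpretation of "HBX", not anything confirmed): every pair of
# chiplets has a dedicated link, so any remote LLC access is a single hop.

def hbx_links(n_chiplets):
    """All unordered chiplet pairs: n*(n-1)/2 dedicated links."""
    return [(a, b) for a in range(n_chiplets) for b in range(a + 1, n_chiplets)]

def llc_access(src, home, links):
    if src == home:
        return "local LLC slice"
    assert (min(src, home), max(src, home)) in links
    return f"one hop over HBX link {src}<->{home}"

if __name__ == "__main__":
    links = hbx_links(4)
    print(len(links), "links for 4 chiplets:", links)
    print(llc_access(0, 3, links))
```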
Levels of the cache hierarchy other than the last level are specifically coherent within the chiplet.
The open question is that device atomics require... device-level coherence, i.e., coherence across all chiplets in the setup described by the patent. So the GCN/RDNA tradition of GPU atomics being processed at L2 can no longer continue, because L2 is only coherent within the chiplet, as you quoted.
So in my opinion we're likely to see a GPU comprised of a master chiplet that handles CPU and other IO (PCI Express, DisplayPort, HDMI, etc.) plus graphics chiplets. For exascale computing I can imagine that it is constructed solely from graphics chiplets, as there is no need for "other IO".
Good point.
There aren't TSVs in this design, because those pillars (212) you've identified are through a moulding (220).
[0018] As previously noted, the GPU chiplets 106 are communicably coupled by way of the passive crosslink 118. In various embodiments, the passive crosslink 118 is an interconnect chip constructed of silicon, germanium or other semiconductor materials and may be bulk semiconductor, semiconductor on insulator or other designs. The passive crosslink 118 includes a plurality of internal conductor traces, which may be on a single level or multiple levels as desired. Three of the traces are illustrated in FIG. 2 and labeled collectively as traces 206. The traces 206 interface electrically with conductor structures of the PHY regions 202 of the GPU chiplets 106 by way of conducting pathways. It is noted that the passive crosslink 118 does not contain any through silicon vias (TSVs). In this manner, the passive crosslink 118 is a passive interposer die that communicably couples and routes communications between the GPU chiplets 106, thereby forming a passive routing network.
It's cost too: silicon with TSVs is more costly. Though I'm not saying the packaging cost overhead is zero for this design.
I think it's reasonable to assume that what we know (well "you know", since I know effectively zero) about SDF is out of date with respect to an Infinity Cache based architecture. Assuming that this design is based upon Infinity Cache.
This especially takes into account how SDF is (allegedly) used in GPUs since Vega 10, and in SoCs like the Xbox Series X. Most of the SDF network switches are bound to a pair of an L2 slice (SDF master) and an MC/LLC slice (SDF slave), and traffic between the pair for their designated memory-address partition can simply be routed straight through by the switch. Meanwhile, all these SDF switches are interconnected, perhaps in a cost-effective ring topology, which enables full VRAM access for the rest of the SoC (multimedia blocks, display IP, etc.).
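A toy model of that arrangement (my own simplification, with made-up interleave parameters): each switch pairs an L2 slice with an MC/LLC slice, addresses are interleaved across slices, the paired slice is reached straight through, and other clients hop around a ring.

```python
# Toy model of the SDF arrangement described above (my own simplification):
# each fabric switch pairs one GPU L2 slice with one MC/LLC slice, addresses
# are interleaved across the slices, a request whose home slice sits on the
# same switch goes straight through, and other clients reach it over a ring.

N_SWITCHES = 8
STRIDE = 256                 # illustrative address-interleave granularity

def home_switch(addr):
    return (addr // STRIDE) % N_SWITCHES

def ring_hops(src_switch, dst_switch):
    d = abs(src_switch - dst_switch)
    return min(d, N_SWITCHES - d)    # take the shorter way around the ring

if __name__ == "__main__":
    addr = 0x1234
    home = home_switch(addr)
    print(f"addr {addr:#x} homes on switch {home}")
    print("paired L2 slice -> same switch:", ring_hops(home, home), "ring hops")
    print("display IP on switch 0 ->", ring_hops(0, home), "ring hops")
```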
I can't say I understand the point you're making, since a cache system by definition is always backed by non-cache RAM. I think there's a subtlety in your use of the term "memory-side" that I'm missing.
I think this comes back to what RDNA 2 does with Infinity Cache and where and how global atomics are implemented - remembering that ROPs implement a type of global atomics, but in their case the memory space is partitioned such that L2 would suffice (so not truly global, only global in the programmer's model). In other words, I don't know. I've always thought of ROPs as directly being how AMD implements global atomics, but since the ROPs have moved away from the memory controllers I honestly don't know.
Two of all possible outcomes are: (1) SDF extends its protocol to support "memory-side atomics", and they are now processed at the L3/LLC; or (2) L2 continues to process atomics, and SDF maintains cache coherence between L2s across all chiplets (for lines touched by device-coherent atomics/requests).
I would suggest the latter, since atomics in the compute API are only valid on writeable random-access buffers. So the GPU can promote affected cache lines to global coherence as required, and will be forewarned.
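A toy sketch of outcome (2), the one suggested here (illustrative only): the issuing chiplet's L2 performs the atomic, but only after the line is promoted to device scope, i.e. any copy in another chiplet's L2 is invalidated first. The class and function names are mine, not AMD's.

```python
# Toy sketch of outcome (2) above (my own illustration, not AMD's protocol):
# the L2 of the issuing chiplet performs the atomic, but only after the line
# is promoted to device scope, meaning any copy in another chiplet's L2 must
# be invalidated first.

class ChipletL2:
    def __init__(self, cid):
        self.cid = cid
        self.lines = {}          # addr -> value

def promote_and_atomic_add(addr, value, issuer, all_l2s, memory):
    # Promote the line to device scope: invalidate it in every other L2.
    for l2 in all_l2s:
        if l2 is not issuer:
            l2.lines.pop(addr, None)
    # Pull the current value into the issuer's L2 and do the read-modify-write.
    old = issuer.lines.get(addr, memory.get(addr, 0))
    issuer.lines[addr] = old + value
    memory[addr] = issuer.lines[addr]     # write through, for simplicity
    return old

if __name__ == "__main__":
    l2s = [ChipletL2(0), ChipletL2(1)]
    mem = {0x40: 10}
    l2s[1].lines[0x40] = 10                       # chiplet 1 holds a clean copy
    old = promote_and_atomic_add(0x40, 5, l2s[0], l2s, mem)
    print("old:", old, "| chiplet 1 copy:", l2s[1].lines, "| memory:", mem)
```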
AMD multi chiplet GPU Patent: https://www.freepatentsonline.com/20200409859.pdf
Didn't go through the patent on my phone, but is it suggesting adding another cache level? Since L2 used to be the LLC for AMD GPUs; with RDNA 2 the new L3 is the LLC.
Haven't gone through it all, though glancing at it, it seems a bit generic. Which is hardly surprising. There is a mention of a "caching chiplet", which... the big LLC on its own chiplet makes sense, but going by the other figures it seems the L3 is shared. Though I wonder how necessary that is with the big LLC.