AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
Thanks! I don't suppose they gave any figures for (hypothetical) 32, 64, 256, 512…MB caches, did they?

Nope, only for the 128MB in N21. N22 and N23 are rumoured to be 96 and 32MB respectively so we should get figures for those when the cards launch.

For APUs, there's nothing technically preventing them from having the GPU share the CPU L3, is there? I believe Intel iGPUs already do this.
 
Thanks. In principle, no, but that would depend on how the Infinity Cache works exactly.
 
MCMs generally only make sense if you want to make as powerful an enterprise-class system as possible inside one package

If you want to make a consumer-priced chip that has 10x-100x the performance of a 3090 or 6900, splitting the logic into multiple dies and taking on lots of communication overhead and extra latency will not help you. When you can afford the die area needed for that performance, you can then also afford that area inside a single logic die and have none of these problems.

OK, I hope I'm not coming across as being argumentative;
I'm learning a lot here, and appreciate your responses.

So let's assume we're NOT talking about consumer GPUs, just how to get the most performance,
OR: the path to multi-GPU, and how we can make it work?

So my thoughts...
Does it start with something like multiple GPU chiplets, each having its own large L3 cache (similar to Infinity Cache) but sharing a single memory controller?

How hard is it to tell one chiplet to render the top half of the screen and one to render the bottom half?
How much do we lose in efficiency? Is 2 x 6800 on a single card, but with a single memory controller sharing 16 GB of memory, 200%, or is it only 110%?
My random guess is that it ends up around 180%.
I know this is a sort of brute-force way to improve GPU performance, but at least you're not duplicating the DDR memory, à la previous dual-GPU solutions.

Basically, does the Infinity Cache concept make it possible to create dual GPUs that retain a single shared pool of DDR?
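The naive version of this split is easy to sketch; here's a toy Python illustration (every name here is invented, this is not a real driver mechanism):

```python
# Toy split-frame rendering: chiplet 0 takes the top half, chiplet 1
# the bottom, both writing into one shared pool. Purely illustrative.

HEIGHT, WIDTH = 8, 8
framebuffer = [[None] * WIDTH for _ in range(HEIGHT)]  # the shared pool

def render_rows(chiplet_id, y0, y1):
    """Stand-in for one chiplet shading its assigned scanlines."""
    for y in range(y0, y1):
        for x in range(WIDTH):
            framebuffer[y][x] = chiplet_id

split = HEIGHT // 2
render_rows(0, 0, split)         # chiplet 0: top half
render_rows(1, split, HEIGHT)    # chiplet 1: bottom half
```

The split itself is trivial; the efficiency loss comes from load imbalance (the two halves rarely cost the same to shade) and from both chiplets contending for the single memory controller's bandwidth.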

Cheers,
 
For APUs, there's nothing technically preventing them from having the GPU share the CPU L3, is there? I believe Intel iGPUs already do this.
The CPU L3 in Zen is architecturally a private victim cache of a CCX. The whole CCX is one complete black box with a coherent-master role on the data fabric, only being snooped by the home memory controller for coherence-protocol traffic. Otherwise, we wouldn't even have the big Zen 3 selling point of two CCXs merging into one.

This is the kind of SoC modularity and separation that AMD has touted since maybe 2012: they had long been after a versatile SoC fabric that focuses only on connecting IPs to memory, and avoids being tied to the nitty-gritty of specific IPs (cache hierarchies included; also those good old Onion and Garlic buses).

(EPYC probe filters no longer requiring "stealing" L3 cache ways is also an illustration of the social distancing between memory controllers and the L3 cache, now under the Infinity Fabric regime.)

“Infinity Cache” has also been said to amplify bandwidth by having many slices tied to fabric nodes that apparently scale with memory channels, whereas a CCX has only one single port out to the SDF (the IF data plane). So there aren't many possibilities left for how “Infinity Cache” works; it is either:

1. a memory-side cache, i.e., part of the memory controllers; the GPU IP does not see it directly, and uses it implicitly when it accesses memory via the SDF; or

2. a GPU-private cache, i.e., L2 misses go to this GPU-internal LLC, and only those that still miss in the LLC get turned into memory requests on the SDF.

Either way, GPUs generally have non-coherent requests going into the SDF; those are routed to the home MC directly, and generate no further coherence traffic. So not even a snoop will hit the CCX and, in turn, its private L3 cache controllers.
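A toy sketch of the two options, just to make the distinction concrete (class and function names are invented; this is not AMD's implementation):

```python
# Toy model contrasting the two possible "Infinity Cache" placements.
# All names here are hypothetical illustrations.

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}                      # address -> data

    def lookup(self, addr):
        return self.lines.get(addr)

    def fill(self, addr, data):
        if len(self.lines) >= self.capacity: # trivial eviction: drop an arbitrary line
            self.lines.pop(next(iter(self.lines)))
        self.lines[addr] = data

DRAM = {a: f"data@{a}" for a in range(16)}   # stand-in backing store

def memory_side_read(addr, mc_cache):
    """Option 1: the LLC lives with the memory controller. The GPU just
    issues a memory request on the SDF; the cache is consulted
    implicitly at the MC."""
    hit = mc_cache.lookup(addr)
    if hit is not None:
        return hit, "LLC hit (at MC)"
    data = DRAM[addr]
    mc_cache.fill(addr, data)
    return data, "LLC miss -> DRAM"

def gpu_private_read(addr, gpu_llc):
    """Option 2: the LLC is a GPU-internal cache behind L2. Only an LLC
    miss becomes an SDF memory request at all."""
    hit = gpu_llc.lookup(addr)
    if hit is not None:
        return hit, "LLC hit (no SDF request)"
    data = DRAM[addr]                        # this is the SDF request to the home MC
    gpu_llc.fill(addr, data)
    return data, "LLC miss -> SDF -> DRAM"
```

Both paths return the same data; the difference is only where the lookup sits relative to the fabric, which is exactly why it's hard to tell the two apart from the outside.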
 
If we contemplate whether a GPU is likely to be able to "dynamically lock" certain surfaces as regions, even if only for a portion of a render pass, which of (1) or (2) is more likely? It would seem that (2) fits the bill.
 
As Linux driver patches seem to suggest, 64KB GPUVM pages can be individually marked as LLC No-Alloc, which is somewhat the inverse of pinning things.

Both (1) and (2) could still fit anyway. It's just that if (1) is true, it would imply that the SDF protocol is extended to support the nitty-gritty of an optional memory-side LLC, which would not come as a surprise. It could be a potential bolt-on for EPYC and Instinct GPUs.
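To make the No-Alloc idea concrete, here's a hypothetical sketch of a per-page attribute gating LLC allocation. Only the 64KB page size comes from the patches; the page-table layout and flag names are invented:

```python
# Hypothetical per-page "LLC No-Alloc" attribute, as hinted at by the
# Linux driver patches mentioned above. All details invented.

PAGE_SHIFT = 16                      # 64 KiB GPUVM pages
LLC_NOALLOC = 1 << 0                 # hypothetical per-PTE attribute bit

page_flags = {}                      # page number -> attribute bits

def set_noalloc(vaddr, size):
    """Mark every 64 KiB page in [vaddr, vaddr+size) as LLC No-Alloc."""
    first = vaddr >> PAGE_SHIFT
    last = (vaddr + size - 1) >> PAGE_SHIFT
    for pn in range(first, last + 1):
        page_flags[pn] = page_flags.get(pn, 0) | LLC_NOALLOC

def llc_may_allocate(vaddr):
    """On a miss, the LLC fills the line only if the page allows it."""
    return not (page_flags.get(vaddr >> PAGE_SHIFT, 0) & LLC_NOALLOC)
```

So streaming surfaces that would only pollute the cache get flagged, and everything else allocates by default, which is indeed "somewhat the inverse of pinning".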
 
It could be a potential bolt-on for EPYC and Instinct GPUs.
... and APUs.

A memory-side cache controller has to track the request origin anyway. So in theory, one could have the SMU controlling the LLC allocation policy based on IP-block activity levels, alongside DVFS and the power budget. For example, LLC allocation could be made exclusive to GPU-originated requests during high GPU activity, while the rest of the time it is relaxed to serve as a CPU/SoC L4 cache.
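A minimal sketch of such a policy, assuming a hypothetical activity threshold and origin tags (none of this is confirmed AMD behaviour):

```python
# Sketch of the speculated SMU-driven LLC allocation policy: during
# high GPU activity the LLC serves only GPU-originated requests, and
# otherwise relaxes into a CPU/SoC L4. Thresholds, origin tags and the
# policy itself are all hypothetical.

GPU, CPU, MULTIMEDIA = "gpu", "cpu", "mm"   # request-origin tags
HIGH_GPU_ACTIVITY = 0.75                    # hypothetical threshold

def llc_alloc_allowed(origin, gpu_activity):
    """Would the memory-side LLC allocate a line for this request?

    gpu_activity: utilisation in [0, 1], as the SMU might track it
    alongside DVFS state.
    """
    if gpu_activity >= HIGH_GPU_ACTIVITY:
        return origin == GPU                # exclusive to the GPU under load
    return True                             # relaxed: acts as a CPU/SoC L4
```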
 
What about Samsung's next-gen Exynos?
IC could be a candidate there too, while we're at it :). Super Resolution tech would be awesome for mobiles too.
16 CUs @ 2+ GHz with IC would be insane: 4 TF, XSS level
 

Could easily be.

Meanwhile I'd suspect that they could launch a refresh of RDNA 2 next year, assuming the rumour of a chiplet arch with the I/O separated out is true, without Infinity Cache at all! Hey, you're replacing the memory bus anyway, and getting yields up on it. So ditch the giant chips, attach a 512-bit bus chiplet to the big one, and potentially watch 4K performance and yields soar.

The hit to IPC and clock speeds would be interesting to see. But if you doubled the yields (it might actually be better) and dropped the price of everything by 30% or so, well, that probably sounds better to consumers.
 
Question: wasn't the HBCC (theoretically) able to connect to two types of memory (DDR & HBM) simultaneously?
 
GPU CHIPLETS USING HIGH BANDWIDTH CROSSLINKS - ADVANCED MICRO DEVICES, INC. (freepatentsonline.com)

A chiplet system includes a central processing unit (CPU) communicably coupled to a first GPU chiplet of a GPU chiplet array. The GPU chiplet array includes the first GPU chiplet communicably coupled to the CPU via a bus and a second GPU chiplet communicably coupled to the first GPU chiplet via a passive crosslink. The passive crosslink is a passive interposer die dedicated for inter-chiplet communications and partitions systems-on-a-chip (SoC) functionality into smaller functional chiplet groupings.

Seems to be a design without TSVs that uses a single chiplet as a master with the others as slaves. Last-level cache coherency, with dedicated routes through chiplet PHYs and passive interposer connections, appears to be the technique by which communication amongst the chiplets is achieved. This would appear to imply that Infinity Cache is crucial to this architecture.
 
The primary-chiplet designation appears to be relevant only in the context of host communication (presumably the PCIe controller and a "lead" SMU for DVFS coordination and such). This isn't a new semantic for Infinity Fabric, considering that we've seen similar situations with working solutions since Zen 1. More specifically, we have had multiple Zeppelin chips in a package/system, each owning a replication of resources (incl. PSP and SMU) that have some roles requiring one exclusive actor in the system (e.g., parts of the secure boot sequence).

In the context of the Scalable Data Fabric setup, it appears that every GPU chiplet is both an SDF master (that funnels all memory accesses from the local compute/graphics functions) and an SDF slave (that owns a memory controller, optionally with an LLC, bound to a fixed portion of the interleaved VRAM address space).

Then through the "HBX crosslink", seemingly part of the SDF network layer, a 4-way full interconnect [#] can be formed (given 4 HBX PHYs per chiplet). Memory interleaving seems to continue to happen at L1 -> L2. A new level of interleaving seems to be necessary for L2 -> SDF, presumably through configurable routing in SDF, assuming single-, duo- and quad-chiplet setups are all meant to be supported by the same die.

This poses an open question of whether the "point of coherency" (incl. device-scope atomics) is now moved to the L3/LLC, and what the implications are for the SDF protocol. This is because the L2 in this setup would see only accesses from the local chiplet, unless L2s across all the chiplets are cache coherent [*].

In any case, another open question would be about multimedia blocks and display controllers. Replicate them in all chiplets? Have an extra small chiplet? Ehm, active interposer?

* By chance, AMD did claim to bring "coherent connectivity" incl. "CPU caching GPU memory" with its 3rd generation Infinity Architecture. Coincidence?

# Imagine each chiplet contains 2 SerDes PHYs. Four chiplets give 8 in total, which is coincidentally the number you need for an "8-way GPU interconnect" (1 to host, 7 to peers). Is this a glimpse of, ehm, an upcoming CDNA product too?
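The counting above is just fully-connected-graph arithmetic; a quick sketch to check it:

```python
# Fully-connected (all-to-all) topology arithmetic behind the PHY
# counting above.

def phys_per_chiplet(n):
    """Direct links each chiplet needs to reach every peer."""
    return n - 1

def total_links(n):
    """Crosslink connections through the passive interposer."""
    return n * (n - 1) // 2

# 4-way full interconnect: 3 of the 4 HBX PHYs per chiplet go to peers,
# 6 crosslink connections in total.
# Off-package: 2 SerDes PHYs x 4 chiplets = 8 external links, matching
# the "1 to host, 7 to peers" of an 8-way GPU interconnect.
```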


Seems to be a design without TSVs
If anything, Figure 2 seems to depict a 2.5D interposer with TSVs passing through to pins on the substrate. At least this is the only configuration that works in such a way and commercially exists. :p
 
There aren't TSVs in this design, because those pillars (212) you've identified are through a moulding (220).

As to the way that a GPU is constructed from chiplets, there is no need to make the chiplets uniform. This patent is merely about the communication method amongst chiplets and how it is formed using passive interlinks and a last level cache coherency protocol.

HBX is peer-to-peer, dedicated to the last-level cache; in other words, the last-level cache is formed of units that are not in a mesh but fully connected. What the patent describes as the last-level cache is required for a GPU constructed of chiplets to function; there is no optionality here:

5. The system of claim 4, wherein the first PHY region of the first GPU chiplet comprises a first passive crosslink PHY including conductor traces solely for communications between the passive crosslink and a last level cache of the first GPU chiplet.

Levels of the cache hierarchy other than the last level are specifically coherent within the chiplet:

8. The system of claim 1, further comprising: a first cache memory hierarchy at the first GPU chiplet, wherein a first level of the first cache memory hierarchy is coherent within the first GPU chiplet; and a second cache memory hierarchy at the second GPU chiplet, wherein a first level of the second cache memory hierarchy is coherent within the second GPU chiplet.

9. The system of claim 8, further comprising: a unified cache memory including both a last level of the first cache memory hierarchy and a last level of the second cache memory hierarchy, wherein the unified cache memory is coherent across all chiplets of the GPU chiplet array.

So in my opinion we're likely to see a GPU comprising a master chiplet that handles CPU and other I/O (PCI Express, DisplayPort, HDMI, etc.) and graphics chiplets. For exascale computing I can imagine that it is solely constructed from graphics chiplets, as there is no need for "other I/O".
 
There aren't TSVs in this design, because those pillars (212) you've identified are through a moulding (220).
IMO this is arguing semantics. A patent describing things metaphorically is well expected, while public-domain knowledge of packaging technologies indicates that likely only 2.5D interposers or EMIB/LSI can deliver the bump and wire density required. That is, unless you assume a SerDes significantly outclassing existing on-package variants is used. Such detail is left vague by the patent, as expected.

As to the way that a GPU is constructed from chiplets, there is no need to make the chiplets uniform. This patent is merely about the communication method amongst chiplets and how it is formed using passive interlinks and a last level cache coherency protocol.
Not disagreeing. I was trying to put this in perspective with the public domain information on the Scalable Data Fabric. My interpretation is that this HBX crosslink is no different from existing blocks like CAKE/IFIS (inter-socket) or IFOP (on-package), which are network layer constructs, designed for a particular signalling medium, with configurable routing in the grand data fabric scheme.

This especially takes into account (allegedly) how the SDF has been used in GPUs since Vega 10, and in SoCs like the Xbox Series X. Most of the SDF network switches are bound to a pair of an L2 slice (SDF master) and an MC/LLC slice (SDF slave), and traffic between the pair for their designated memory-address partition can simply be routed straight through by the switch. Meanwhile, all these SDF switches are interconnected, perhaps in a cost-effective ring topology, which enables full VRAM access for the rest of the SoC (multimedia blocks, display IP, etc.).

HBX is peer-to-peer dedicated to the last level cache, in other words the last level cache is formed of units that are not in a mesh, it's fully connected. What the patent describes as the last level cache is required for a GPU constructed of chiplets to function, there is no optionality here:
It is hard for me to argue if we strictly go by the patent text only. But let's say the (presumably) memory-side last-level cache (paragraph 33) always misses, or has zero capacity: does the system functionally fall apart? It doesn't. Life on the fully connected HBX interconnect still goes on; it's just that requests now always hit the memory controller.

Levels of the cache hierarchy other than the last level are specifically coherent within the chiplet:
The open question is that device atomics require... device-level coherence, i.e., across all chiplets in the setup described by the patent. So the GCN/RDNA tradition of GPU atomics being processed at L2 can no longer continue, because L2 is only coherent within the chiplet, as you quoted.

Two of all possible outcomes are: (1) SDF extends its protocol to support "memory-side atomics", and they are now processed at L3/LLC; and (2) L2 continues to process atomics, and SDF maintains cache coherence between L2s across all chiplets (for lines touched by device-coherent atomics/requests).
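Both outcomes can be sketched in a few lines of toy Python, just to show where the coherence work happens (all names invented, pure speculation):

```python
# Sketch of the two possible atomics outcomes listed above.

dram = {}   # stand-in backing store

def memory_side_atomic_add(llc, addr, val):
    """Outcome (1): the atomic travels over SDF and is processed at the
    memory-side L3/LLC, so no inter-chiplet L2 coherence is needed; the
    LLC slice is the single point of coherency for its partition."""
    old = llc.get(addr, dram.get(addr, 0))
    llc[addr] = old + val
    return old

def l2_atomic_add(l2s, owner, addr, val):
    """Outcome (2): the local L2 processes the atomic, but SDF must
    first write back and invalidate the line in every peer L2."""
    for i, l2 in enumerate(l2s):
        if i != owner and addr in l2:
            dram[addr] = l2.pop(addr)    # coherence traffic to peers
    old = l2s[owner].get(addr, dram.get(addr, 0))
    l2s[owner][addr] = old + val
    return old
```

In (1) the cost is a fabric round trip per atomic; in (2) the cost is the invalidation traffic, but lines that stay local to one chiplet remain fast.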

So in my opinion we're likely to see a GPU comprised of a master chiplet that handles CPU and other IO (PCI Express, display port, HDMI etc.) and graphics chiplets. For exascale computing I can imagine that it is solely constructed from graphics chiplets as there is no need for "other IO".
Good point.
 
There aren't TSVs in this design, because those pillars (212) you've identified are through a moulding (220).

From
20200409859 GPU CHIPLETS USING HIGH BANDWIDTH CROSSLINKS

[0018] As previously noted, the GPU chiplets 106 are communicably coupled by way of the passive crosslink 118. In various embodiments, the passive crosslink 118 is an interconnect chip constructed of silicon, germanium or other semiconductor materials and may be bulk semiconductor, semiconductor on insulator or other designs. The passive crosslink 118 includes a plurality of internal conductor traces, which may be on a single level or multiple levels as desired. Three of the traces are illustrated in FIG. 2 and labeled collectively as traces 206. The traces 206 interface electrically with conductor structures of the PHY regions 202 of the GPU chiplets 106 by way of conducting pathways. It is noted that the passive crosslink 118 does not contain any through silicon vias (TSVs). In this manner, the passive crosslink 118 is a passive interposer die that communicably couples and routes communications between the GPU chiplets 106, thereby forming a passive routing network.

There is the actual patent application related to the manufacture of the Crosslink die on which the chiplets are mounted.

20200411443 HIGH DENSITY CROSS LINK DIE WITH POLYMER ROUTING LAYER
Various multi-die arrangements and methods of manufacturing the same are disclosed. In one aspect, a semiconductor chip device is provided that includes a first molding layer and an interconnect chip at least partially encased in the first molding layer. The interconnect chip has a first side and a second side opposite the first side and a polymer layer on the first side. The polymer layer includes plural conductor traces. A redistribution layer (RDL) structure is positioned on the first molding layer and has plural conductor structures electrically connected to the plural conductor traces. The plural conductor traces provide lateral routing.
 
IMO this is arguing semantics.
It's cost too: silicon with TSVs is more costly. Though I'm not saying the packaging cost overhead is zero for this design.

Not disagreeing. I was trying to put this in perspective with the public domain information on the Scalable Data Fabric. My interpretation is that this HBX crosslink is no different from existing blocks like CAKE/IFIS (inter-socket) or IFOP (on-package), which are network layer constructs, designed for a particular signalling medium, with configurable routing in the grand data fabric scheme.

This especially takes into account (allegedly) how the SDF has been used in GPUs since Vega 10, and in SoCs like the Xbox Series X. Most of the SDF network switches are bound to a pair of an L2 slice (SDF master) and an MC/LLC slice (SDF slave), and traffic between the pair for their designated memory-address partition can simply be routed straight through by the switch. Meanwhile, all these SDF switches are interconnected, perhaps in a cost-effective ring topology, which enables full VRAM access for the rest of the SoC (multimedia blocks, display IP, etc.).
I think it's reasonable to assume that what we know (well "you know", since I know effectively zero) about SDF is out of date with respect to an Infinity Cache based architecture. Assuming that this design is based upon Infinity Cache.

It is hard for me to argue if we strictly go by the patent text only. But let's say the (presumably) memory-side last-level cache (paragraph 33) always misses, or has zero capacity: does the system functionally fall apart? It doesn't. Life on the fully connected HBX interconnect still goes on; it's just that requests now always hit the memory controller.
I can't say I understand the point you're making, since a cache system by definition is always backed by non-cache RAM. I think there's a subtlety in your use of the term "memory-side" that I'm missing.

The open question is that device atomics require... device-level coherence, i.e., across all chiplets in the setup described by the patent. So the GCN/RDNA tradition of GPU atomics being processed at L2 can no longer continue, because L2 is only coherent within the chiplet, as you quoted.
I think this comes back to what RDNA 2 does with Infinity Cache and where and how global atomics are implemented - remembering that ROPs implement a type of global atomics, but in their case the memory space is partitioned such that L2 would suffice (so not truly global, only global in the programmer's model). In other words, I don't know. I've always thought of ROPs as directly being how AMD implements global atomics, but since the ROPs have moved away from the memory controllers I honestly don't know.

Two of all possible outcomes are: (1) SDF extends its protocol to support "memory-side atomics", and they are now processed at L3/LLC; and (2) L2 continues to process atomics, and SDF maintains cache coherence between L2s across all chiplets (for lines touched by device-coherent atomics/requests).
I would suggest the latter, since atomics in the compute API are only valid on writeable random access buffers. So the GPU can promote affected cache lines to global coherence as required and will be forewarned.
 
AMD multi chiplet GPU Patent: https://www.freepatentsonline.com/20200409859.pdf

Haven't gone through it all, though glancing at it, it seems a bit generic. Which is hardly surprising. There is a mention of a "caching chiplet", which... the big LLC on its own chiplet makes sense, but going by the other figures, it seems the L3 is shared. Though I wonder how necessary that is with the big LLC.
 
Didn't go through the patent on my phone, but is it suggesting adding another cache level? L2 used to be the LLC for AMD GPUs; with RDNA 2 the new L3 is the LLC.
 

It's suggesting a separate L3 on each compute chiplet, accessible from the other chiplets. Which makes me think this was done before the big LLC was decided on, as that way you can have a separate giant LLC while making each compute chiplet smaller.

It's also got a memory bus on each compute chiplet, like RDNA 1 has, whereas I'd assume they'd stick closer to RDNA 2 and have a more unified memory bus like Zen.
 