AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,528
    Likes Received:
    953
    Thanks! I don't suppose they gave any figures for (hypothetical) 32, 64, 256, 512…MB caches, did they?
     
    Frenetic Pony likes this.
  2. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    Nope, only for the 128MB in N21. N22 and N23 are rumoured to be 96 and 32MB respectively so we should get figures for those when the cards launch.

For APUs, there's nothing technically preventing them from having the GPU share the CPU L3, is there? I believe Intel iGPUs already do this.
     
    Alexko likes this.
  3. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,528
    Likes Received:
    953
    Thanks. In principle, no, but that would depend on how the Infinity Cache works exactly.
     
  4. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    107
    Likes Received:
    61
    Location:
    Melbourne Aus.
OK, I hope I'm not coming across as argumentative;
I'm learning a lot here, and appreciate your responses.

So let's assume we're NOT talking about consumer GPUs, just how to get the most performance,
OR the path to multi-GPU, and how we can make it work.

So my thoughts...
Does it start with something like multiple GPU chiplets, each having their own large L3 cache (similar to Infinity Cache), but sharing a single memory controller?

How hard is it to tell one chiplet to render the top half of the screen and one to render the bottom half?
How much do we lose in efficiency? Is 2 x 6800, on a single card, but with a single memory controller sharing 1 x 16GB of memory, 200%, or is it only 110%?
My random guess is that it ends up around 180%.
I know this is a sort of brute-force way to improve GPU perf, but at least you're not duplicating the DDR memory, à la previous dual-GPU solutions.

Basically, does the Infinity Cache concept allow the ability to create dual GPUs but retain a single shared pool of DDR?

    Cheers,
     
  5. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
    CPU L3 in Zen is architecturally a private victim cache of a CCX. The whole CCX is one complete blackbox with a coherent master role on the data fabric, only being snooped by the home memory controller for coherence protocol traffic. Otherwise, we wouldn’t even have this big Zen 3 selling point of two CCXs merging into one.

This is kind of the SoC modularity and separation that AMD has touted since maybe 2012 — they had been longing for a versatile SoC fabric that focuses only on connecting IPs to memory, and avoids being tied to the nitty-gritty of specific IPs (cache hierarchies included; also those good old Onion and Garlic buses).

(EPYC probe filters no longer requiring “stealing” L3 cache ways is also an illustration of the social distancing between memory controllers and L3 cache, now under the Infinity Fabric regime.)

“Infinity Cache” has also been said to amplify bandwidth by having many slices tied to fabric nodes that apparently scale with memory channels, whereas a CCX has only a single port out to the SDF (IF data plane). So there aren't many possibilities left for how “Infinity Cache” works — it is either:

1. a memory-side cache, meaning it is part of the memory controllers; the GPU IP does not see it directly, and uses it implicitly when it accesses memory via the SDF; or

2. a GPU private cache, meaning L2 misses go to this GPU-internal LLC, and only those that still miss in the LLC get turned into a memory request on the SDF.

Either way, GPUs generally have non-coherent requests going into the SDF — those are routed to the home MC directly, and generate no further coherence traffic. So not even a snoop will hit the CCX and in turn its private L3 cache controllers.
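
As a toy illustration of options (1) and (2) above (everything here is made up for clarity, not the actual SDF protocol), the difference is simply where the LLC lookup sits when an L2 miss happens:

Code:
# Toy model of the two candidate placements for "Infinity Cache".
# All names and policies are invented; the point is only where the
# LLC lookup sits relative to the SDF on an L2 miss.

class Cache:
    def __init__(self):
        self.lines = set()                # addresses currently resident

    def lookup(self, addr):
        return addr in self.lines

    def fill(self, addr):
        self.lines.add(addr)

def dram_read(addr):
    return f"DRAM read @ {addr:#x}"

# Option 1: memory-side cache. An L2 miss is an ordinary SDF request to
# the home memory controller; the LLC is consulted implicitly on that side.
def option1_request(addr, l2, mc_llc):
    if l2.lookup(addr):
        return "L2 hit"
    if mc_llc.lookup(addr):               # happens inside the MC's domain
        return "memory-side LLC hit"
    mc_llc.fill(addr)
    return dram_read(addr)

# Option 2: GPU-private LLC. L2 misses go to a GPU-internal last level,
# and only LLC misses ever become SDF memory requests at all.
def option2_request(addr, l2, gpu_llc):
    if l2.lookup(addr):
        return "L2 hit"
    if gpu_llc.lookup(addr):
        return "GPU LLC hit, no SDF traffic"
    gpu_llc.fill(addr)
    return dram_read(addr)                # the only SDF-visible access

if __name__ == "__main__":
    l2, llc = Cache(), Cache()
    print(option1_request(0x1000, l2, llc))   # cold: goes to DRAM
    print(option1_request(0x1000, l2, llc))   # now a memory-side LLC hit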
     
    #45 pTmdfx, Nov 6, 2020
    Last edited: Nov 6, 2020
    T2098, Alexko, Erinyes and 2 others like this.
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
If we contemplate whether a GPU is likely to be able to "dynamically lock" certain surfaces as regions, even if only for a portion of a render pass (?), which of 1 or 2 is more likely? It would seem that 2 fits the bill.
     
  7. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
    As Linux driver patches seem to suggest, 64KB GPUVM pages can be individually marked as LLC No-Alloc, which is somewhat the inverse of pinning things.

Both (1) and (2) could still fit anyway. Just that if (1) is true, it would imply that the SDF protocol is extended to support the nitty-gritty of an optional memory-side LLC, which would not come as a surprise. It could be a potential bolt-on for EPYC and Instinct GPUs.
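
Going back to the per-page attribute: here's a rough sketch of how such an "LLC No-Alloc" bit could steer fills. The flag name, bit position and structures are invented; only the idea of a 64KB GPUVM page opting out of LLC allocation is from the patches.

Code:
# Sketch of a hypothetical per-page "LLC no-alloc" attribute.
# Bit position and names are made up; the behaviour shown is simply
# "miss data is not allocated into the LLC for pages carrying the flag".

LLC_NOALLOC = 1 << 0          # hypothetical GPUVM PTE attribute bit

def service_l2_miss(addr, pte_flags, llc, dram):
    if addr in llc:
        return llc[addr]                       # LLC hit
    data = dram.get(addr, 0)
    if not (pte_flags & LLC_NOALLOC):
        llc[addr] = data                       # normal allocate-on-miss
    # no-alloc pages stream straight through, leaving the LLC untouched
    return data

# Example: a read-once streaming buffer marked no-alloc never pollutes the LLC.
llc, dram = {}, {0x2000: 42}
service_l2_miss(0x2000, LLC_NOALLOC, llc, dram)
assert 0x2000 not in llc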
     
    Jawed, Lightman and BRiT like this.
  8. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
    ... and APUs.

A memory-side cache controller has to track the request origin anyway. So in theory, one could have the SMU controlling the LLC allocation policy based on IP block activity levels, alongside DVFS and power budget. Say, for example, the LLC allocation could be exclusive to GPU-originated requests during high GPU activity, while the rest of the time it is relaxed to serve as a CPU/SoC L4 cache.
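
A minimal sketch of that idea, with names and thresholds entirely made up (only the combination of origin tracking and activity-driven policy comes from the above):

Code:
# Hypothetical SMU-driven LLC allocation policy keyed on request origin.
# The enum, threshold and function names are invented for illustration only.

from enum import Enum

class Origin(Enum):
    GPU = 1
    CPU = 2
    SOC = 3            # multimedia, display, etc.

def llc_may_allocate(origin: Origin, gpu_busy_percent: float) -> bool:
    """Decide whether a miss from this requester may allocate in the LLC."""
    if gpu_busy_percent > 80.0:
        # heavy GPU load: keep the LLC exclusive to GPU-originated requests
        return origin is Origin.GPU
    # otherwise relax the policy so the LLC behaves like a CPU/SoC L4
    return True

# During a game the CPU streams through without polluting the LLC;
# at the desktop everything is allowed to allocate.
assert llc_may_allocate(Origin.CPU, 95.0) is False
assert llc_may_allocate(Origin.CPU, 10.0) is True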
     
  9. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    79
    Likes Received:
    191
What about Samsung's next-gen Exynos?
IC could be a candidate there too, while we're at it :). Super Resolution tech would be awesome for mobiles too.
16 CUs @ 2+ GHz with IC would be insane: 4 TF, XSS level.
     
  10. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    631
    Likes Received:
    323
    Could easily be.

Meanwhile I'd suspect that they could launch a refresh of RDNA2 next year, assuming the rumor of a chiplet arch with the I/O separated out pans out, without Infinity Cache at all! Hey, you're replacing the memory bus anyway, and getting yields up on it. So ditch the giant chips, attach a 512-bit bus chiplet to the big one, and potentially watch 4K performance and yields soar.

The hit to IPC and clockspeeds would be interesting to see. But if you doubled the yields (it might actually be better) and dropped the price of everything by 30% or so, well, that probably sounds better to consumers.
     
  11. w0lfram

    Regular Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    250
    Likes Received:
    48
Questioningly... Wasn't the HBCC (theoretically) able to connect with two types of memory at once (DDR & HBM)?
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    GPU CHIPLETS USING HIGH BANDWIDTH CROSSLINKS - ADVANCED MICRO DEVICES, INC. (freepatentsonline.com)

    Seems to be a design without TSVs and uses a single chiplet as a master with the others as slaves. Last level cache coherency with dedicated routes through chiplet PHYs and passive interposer connections appears to be the technique by which communication amongst the chiplets is achieved. This would appear to imply that Infinity Cache is crucial to this architecture.
     
    Kej, Lightman, Krteq and 4 others like this.
  13. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
The primary chiplet designation appears to be relevant only in the context of host communication (presumably the PCIe controller and a "lead" SMU for DVFS coordination and stuff). This isn't a new semantic for Infinity Fabric, considering that we've seen similar situations with working solutions since Zen 1. More specifically, we have multiple Zeppelin chips in the package/system, each of which owns a replicated set of resources (incl. PSP and SMU), some of which have roles requiring one exclusive actor in the system (e.g., parts of the secure boot sequence).

In the context of the Scalable Data Fabric setup, it appears that every GPU chiplet is both an SDF master (that funnels all memory accesses from the local compute/graphics functions) and an SDF slave (that owns a memory controller, optionally with an LLC, bound to a fixed portion of the interleaved VRAM address space).

Then through the "HBX crosslink", seemingly part of the SDF network layer, a 4-way full interconnect [#] can be formed (given 4 HBX PHYs per chiplet). Memory interleaving seems to continue to happen at L1 -> L2. A new level of interleaving seems to be necessary for L2 -> SDF, presumably through configurable routing in the SDF, assuming single-, dual- and quad-chiplet setups are all meant to be supported by the same die.
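
As a toy illustration of that two-level interleaving (numbers and names are invented; the only point taken from above is that the L2 -> SDF level would need to be reprogrammable per configuration):

Code:
# Toy two-level interleaving for a hypothetical multi-chiplet GPU.
# Granule size, MC count and the hashing are made up for illustration.

CACHE_LINE = 128           # bytes, hypothetical interleaving granule

def home_slice(addr, num_chiplets, mcs_per_chiplet=4):
    """Return (chiplet, MC slice) owning this physical address."""
    granule = addr // CACHE_LINE
    chiplet = granule % num_chiplets                   # L2 -> SDF level
    mc = (granule // num_chiplets) % mcs_per_chiplet   # within-chiplet level
    return chiplet, mc

# The same die could serve 1-, 2- and 4-chiplet products just by
# reprogramming the SDF routing that implements the first modulo:
for n in (1, 2, 4):
    print(n, [home_slice(a * CACHE_LINE, n) for a in range(8)])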

That setup poses an open question on whether the "point of coherency" (incl. device-scope atomics) is now moved to the L3/LLC, and what the implications are for the SDF protocol. This is because L2 in this setup would see only accesses from the local chiplet, unless L2s across all the chiplets are cache coherent [*].

    In any case, another open question would be about multimedia blocks and display controllers. Replicate them in all chiplets? Have an extra small chiplet? Ehm, active interposer?

    * By chance, AMD did claim to bring "coherent connectivity" incl. "CPU caching GPU memory" with its 3rd generation Infinity Architecture. Coincidence?

# Imagine each chiplet contains 2 SerDes PHYs. Four chiplets give 8 in total, which is coincidentally the number you need for an "8-way GPU interconnect" (1 to host, 7 to peers). Is this a glimpse of, ehm, an upcoming CDNA product too?


If anything, Figure 2 seems to depict a 2.5D interposer with TSVs passing through pins to the substrate. At least this is the only configuration that works in such a way and commercially exists. :razz:
     
    #53 pTmdfx, Jan 2, 2021
    Last edited: Jan 2, 2021
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    There aren't TSVs in this design, because those pillars (212) you've identified are through a moulding (220).

    As to the way that a GPU is constructed from chiplets, there is no need to make the chiplets uniform. This patent is merely about the communication method amongst chiplets and how it is formed using passive interlinks and a last level cache coherency protocol.

HBX is peer-to-peer, dedicated to the last level cache; in other words, the last level cache is formed of units that are not in a mesh but are fully connected. What the patent describes as the last level cache is required for a GPU constructed of chiplets to function; there is no optionality here:

    Levels of the cache hierarchy other than the last level are specifically coherent within the chiplet:

So in my opinion we're likely to see a GPU composed of a master chiplet that handles CPU and other IO (PCI Express, DisplayPort, HDMI etc.) and graphics chiplets. For exascale computing I can imagine that it is solely constructed from graphics chiplets, as there is no need for "other IO".
     
  15. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
IMO this is arguing semantics. A patent describing things metaphorically is to be expected, while public domain knowledge of packaging technologies indicates that likely only 2.5D interposers or EMIB/LSI can deliver the bump and wire density required. That is, unless you assume a SerDes significantly outclassing existing on-package variants is used. Such detail is left vague by the patent, as expected.

    Not disagreeing. I was trying to put this in perspective with the public domain information on the Scalable Data Fabric. My interpretation is that this HBX crosslink is no different from existing blocks like CAKE/IFIS (inter-socket) or IFOP (on-package), which are network layer constructs, designed for a particular signalling medium, with configurable routing in the grand data fabric scheme.

This especially takes into account (allegedly) how the SDF has been used in GPUs since Vega 10, and in SoCs like the Xbox Series X. Most of the SDF network switches are bound to a pair consisting of an L2 slice (SDF master) and an MC/LLC slice (SDF slave), and traffic between the pair for their designated memory address partition can simply be routed straight through by the switch. Meanwhile, all these SDF switches are interconnected, perhaps in a cost-effective ring topology, which enables full VRAM access for the rest of the SoC (multimedia blocks, display IP, etc.).
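
A toy routing decision at one such switch could look like the following; everything here is invented and only contrasts the straight-through path with the ring hop:

Code:
# Toy routing decision at one SDF switch in the (alleged) Vega/XSX-style
# setup above. Names, granule and pair count are made up for illustration.

NUM_PAIRS = 4          # hypothetical number of L2-slice / MC-slice pairs

def route(request_addr, origin, switch_id):
    home = (request_addr // 128) % NUM_PAIRS
    if origin == "L2" and home == switch_id:
        # the paired L2 slice only ever asks for its own partition
        return "straight through to the local MC/LLC slice"
    return f"onto the ring, towards switch {home}"

print(route(0x0000, "L2", 0))         # local partition: short path
print(route(0x0080, "display", 3))    # SoC client: travels the ring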

It is hard for me to argue if we strictly go by the patent text only. But let's say the (presumably) memory-side last-level cache (paragraph 33) always misses or has zero capacity: does the system functionally fall apart? It doesn't. Life for the fully connected HBX interconnect still goes on; it's just that requests now always hit the memory controller.

    The open question is that device atomics require... device-level coherence, i.e., across all chiplets in the setup described by the patent. So the GCN/RDNA tradition of GPU atomics being processed at L2 can no longer continue, because L2 is only coherent within the chiplet, as you quoted.

    Two of all possible outcomes are: (1) SDF extends its protocol to support "memory-side atomics", and they are now processed at L3/LLC; and (2) L2 continues to process atomics, and SDF maintains cache coherence between L2s across all chiplets (for lines touched by device-coherent atomics/requests).
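
Here's a crude sketch of outcome (1); the structure is entirely invented, and it only shows why a single home LLC slice per address partition is enough to serialise device-scope atomics without any L2-to-L2 coherence:

Code:
# Crude model of "memory-side atomics": device-scope atomics bypass L2 and
# are routed by address to the one home LLC slice that owns that line,
# which serialises them for every chiplet. Names and structure are invented.

from threading import Lock

class HomeLLCSlice:
    """Owns a fixed address partition, so it is the single ordering point
    for atomics on lines in that partition."""
    def __init__(self):
        self.mem = {}
        self.lock = Lock()        # stands in for the slice's pipeline

    def atomic_add(self, addr, value):
        with self.lock:
            old = self.mem.get(addr, 0)
            self.mem[addr] = old + value
            return old

NUM_SLICES = 4
slices = [HomeLLCSlice() for _ in range(NUM_SLICES)]

def device_atomic_add(addr, value):
    # The issuing chiplet is irrelevant: the request goes straight to the
    # home slice for this address, with no L2 involvement or snooping.
    return slices[(addr // 128) % NUM_SLICES].atomic_add(addr, value)

print(device_atomic_add(0x40, 1))   # returns 0
print(device_atomic_add(0x40, 1))   # returns 1, serialised at the same slice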

    Good point.
     
    #55 pTmdfx, Jan 2, 2021
    Last edited: Jan 2, 2021
  16. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    79
    Likes Received:
    191
    From
    20200409859 GPU CHIPLETS USING HIGH BANDWIDTH CROSSLINKS

There is also the actual patent application related to the manufacture of the crosslink die on which the chiplets are mounted:

    20200411443 HIGH DENSITY CROSS LINK DIE WITH POLYMER ROUTING LAYER
    Various multi-die arrangements and methods of manufacturing the same are disclosed. In one aspect, a semiconductor chip device is provided that includes a first molding layer and an interconnect chip at least partially encased in the first molding layer. The interconnect chip has a first side and a second side opposite the first side and a polymer layer on the first side. The polymer layer includes plural conductor traces. A redistribution layer (RDL) structure is positioned on the first molding layer and has plural conductor structures electrically connected to the plural conductor traces. The plural conductor traces provide lateral routing.
     
    Lightman, Krteq, Jawed and 1 other person like this.
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    It's cost too: silicon with TSVs is more costly. Though I'm not saying the packaging cost overhead is zero for this design.

    I think it's reasonable to assume that what we know (well "you know", since I know effectively zero) about SDF is out of date with respect to an Infinity Cache based architecture. Assuming that this design is based upon Infinity Cache.

    I can't say I understand the point you're making, since a cache system by definition is always backed by non-cache RAM. I think there's a subtlety in your use of the term "memory-side" that I'm missing.

    I think this comes back to what RDNA 2 does with Infinity Cache and where and how global atomics are implemented - remembering that ROPs implement a type of global atomics, but in their case the memory space is partitioned such that L2 would suffice (so not truly global, only global in the programmer's model). In other words, I don't know. I've always thought of ROPs as directly being how AMD implements global atomics, but since the ROPs have moved away from the memory controllers I honestly don't know.

    I would suggest the latter, since atomics in the compute API are only valid on writeable random access buffers. So the GPU can promote affected cache lines to global coherence as required and will be forewarned.
     
  18. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    631
    Likes Received:
    323
    AMD multi chiplet GPU Patent: https://www.freepatentsonline.com/20200409859.pdf

Haven't gone through it all, though glancing at it, it seems a bit generic. Which is hardly surprising. There is a mention of a "caching chiplet", which... the big LLC on its own chiplet makes sense, but going by the other figures, it seems the L3 is shared. Though I wonder how necessary that is with the big LLC.
     
  19. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,596
    Likes Received:
    3,712
    Location:
    Finland
Didn't go through the patent on my phone, but is it suggesting adding another cache level? Since L2 used to be the LLC for AMD GPUs, while with RDNA2 the new L3 is the LLC.
     
  20. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    631
    Likes Received:
    323
It's suggesting a separate L3 on each compute chiplet, but accessible from other chiplets. Which makes me think this was done before the big LLC was decided on, as that way you can have a separate giant LLC while making each compute chiplet smaller.

    It's also got a memory bus on each compute chiplet like RDNA1 has, whereas I'd assume they'd stick closer to RDNA2 and have a more unified memory bus like Zen.
     