AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. CarstenS

    CarstenS Legend Subscriber

    I was under the impression that it actually saved power.
     
  2. neckthrough

    neckthrough Newcomer

    Yeah, caches reduce DRAM energy consumption by filtering accesses, but they consume energy themselves. A crude estimate is that a 1 MB cache access would cost around two orders of magnitude less energy than a DRAM access (on recent process nodes). As the cache size grows, the energy per access grows roughly linearly (again a crude estimate), but with a very low slope.
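As a sanity check on the estimate above, here is a minimal model treating the "two orders of magnitude" and the shallow linear growth as parameters. All numeric values are invented for illustration, not measured figures:

```python
# Back-of-envelope model of cache vs. DRAM access energy.
# Every number here is an assumed, illustrative value.

DRAM_PJ_PER_ACCESS = 1000.0                  # assumed energy per DRAM access (pJ)
BASE_CACHE_PJ = DRAM_PJ_PER_ACCESS / 100.0   # "~2 orders of magnitude lower"
SLOPE_PJ_PER_MB = 0.5                        # assumed shallow linear growth

def cache_pj_per_access(size_mb: float) -> float:
    """Energy per access for a cache of the given size (crude linear model)."""
    return BASE_CACHE_PJ + SLOPE_PJ_PER_MB * (size_mb - 1.0)

def effective_pj(hit_rate: float, size_mb: float) -> float:
    """Average energy per request: every request probes the cache,
    and misses additionally pay the full DRAM access energy."""
    return cache_pj_per_access(size_mb) + (1.0 - hit_rate) * DRAM_PJ_PER_ACCESS

print(effective_pj(0.0, 1.0))    # no reuse: every request also pays for DRAM
print(effective_pj(0.9, 1.0))    # 90% hit rate cuts average energy ~10x
print(effective_pj(0.9, 128.0))  # a large cache adds a little per-access energy
```

The point of the low slope is visible in the last line: going from 1 MB to 128 MB adds tens of pJ per access while still filtering out 1000 pJ DRAM trips.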
     
    Jawed and Alexko like this.
  3. T2098

    T2098 Newcomer

    One thing I've been curious about is the fact that GDDR6X seems disproportionately power hungry compared to GDDR6, even when you adjust for the higher frequencies. I'm wondering if Micron was somewhat coyly talking up the power dissipation at the DRAM package while intentionally leaving out how much more power is dissipated by the memory controller on the GPU die (in comparison to GDDR6).

    If I were a betting person, I'd bet on most of the extra power being dissipated on-die, and not being reflected in the datasheets for the GDDR6X modules themselves.
     
    Lightman likes this.
  4. Scott_Arm

    Scott_Arm Legend

    Buildzoid just did a video on GDDR6X power draw. I'll admit I haven't watched it yet, but he's a good information source, so it might be relevant to the discussion.

     
    Lightman, Krteq and T2098 like this.
  5. Frenetic Pony

    Frenetic Pony Regular

    Hmmm, the video is saying it's 7.25 pJ per bit, while I was going off AnandTech, which had it per byte: https://www.anandtech.com/show/1597...g-for-higher-rates-coming-to-nvidias-rtx-3090

    Taking a look around, it does appear to be per bit, though a ton of sites screw it up, swapping little b for big B and whatnot, so be careful of that. Checklist for the time machine: tell whoever came up with bytes to just not, since it's only annoying later (also switch negative and positive charge, etc.). Still, an eighth of the total power for memory isn't huge, and that's going off-chip. Staying on package with a substrate for Infinity Fabric 1.0 is only 2 pJ per bit, and by the time we get to chiplet GPUs we'll be on 3.0, however much more efficient that turns out to be. So two terabytes a second (or more) is easily doable for a bigger card.
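The arithmetic behind these figures is just energy-per-bit times bit rate. A quick sketch — the 7.25 and 2 pJ/bit numbers are the ones quoted in this thread, while the bandwidth values are illustrative assumptions:

```python
def interface_power_w(pj_per_bit: float, bytes_per_s: float) -> float:
    """Power needed to move data across an interface, from its energy per bit."""
    bits_per_s = bytes_per_s * 8
    return pj_per_bit * 1e-12 * bits_per_s  # pJ = 1e-12 J

TB = 1e12  # one terabyte per second, decimal

# GDDR6X at the figure from the video, for an assumed 1 TB/s of bandwidth:
print(interface_power_w(7.25, 1 * TB))   # ~58 W just for data movement

# On-package Infinity Fabric 1.0 at 2 pJ/bit, pushing 2 TB/s:
print(interface_power_w(2.0, 2 * TB))    # ~32 W
```

This is why the on-package link at 2 pJ/bit can carry twice the bandwidth for roughly half the data-movement power of the off-chip GDDR6X interface.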
     
    Jawed and Lightman like this.
  7. CarstenS

    CarstenS Legend Subscriber

    It's per bit, alright. The table on page 2 spells it out very clearly.
    https://media-www.micron.com/-/medi...rief/ultra_bandwidth_solutions_tech_brief.pdf
     
    Jawed, T2098, Lightman and 1 other person like this.
  8. tunafish

    tunafish Regular

    Speaking of chiplet GPUs, this AMD patent is just making the rounds.

    It describes an active interposer holding the GPU's L3 cache, connected via 2.5D integration methods to multiple GPU chiplets. The detailed description makes clear that the system will be exposed to programmers as a single GPU.


     
    Last edited: Apr 5, 2021
    Jawed, Pete, Cuthalu and 6 others like this.
  9. Bondrewd

    Bondrewd Veteran

    Let's amp it a bit and show the packaging flow too.
    https://www.freepatentsonline.com/y2021/0098419.html
    (yes, that's what GCD stands for)
     
    Jawed, Pete, Krteq and 2 others like this.
  10. tunafish

    tunafish Regular

    An interesting detail: the patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales bandwidth with compute. However, any access that misses a chiplet's own L2 requires looking up the L3 on the interposer, so a memory access always requires a roundtrip to the interposer, even if the request will be served by an interface on the same chiplet where it originated.
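A toy sketch of that access path. The hop counts and names are invented for illustration; the only behavior taken from the discussion above is that an L2 miss always visits the interposer L3 before touching any memory PHY:

```python
# Toy model of the patent's described access path: every L2 miss goes to
# the L3 on the active interposer first, even when the data lives behind
# the requesting chiplet's own memory PHY. Costs are made-up hop counts.

INTERPOSER_HOP = 1  # assumed cost of one die-to-die crossing

def access_cost(l2_hit: bool, l3_hit: bool, home_chiplet: int, requester: int) -> int:
    """Count die-to-die hops for one memory request (illustrative only)."""
    if l2_hit:
        return 0                  # served entirely within the chiplet
    hops = INTERPOSER_HOP         # L2 miss: always cross to the interposer L3
    if l3_hit:
        return hops               # data found in the shared L3
    hops += INTERPOSER_HOP        # hop back out to a memory PHY; note the cost
    return hops                   # is the same even if home_chiplet == requester

# A "local" memory access still pays the interposer roundtrip:
print(access_cost(l2_hit=False, l3_hit=False, home_chiplet=0, requester=0))  # 2 hops
```

The unused `home_chiplet`/`requester` parameters are the point: locality of the memory PHY doesn't change the hop count once the L2 misses.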
     
    Jawed, Pete, BRiT and 3 others like this.
  11. Bondrewd

    Bondrewd Veteran

    Yep, it's a bit like Zen 2 parts, where you always walk into the IOD to check the directories even if your target is the second CCX on the same die.
     
  12. Komachi somehow got wind of this almost 9 months ago; it seems he has his sources. On Twitter, he and Kopite seem to be connected to people in the know.

     
    Krteq and BRiT like this.
  13. Does this imply that accessing the lower-level cache of another chiplet would have latency similar to accessing it in another shader array?
    This could mean adding additional SAs is like adding another chiplet.
     
  14. Bondrewd

    Bondrewd Veteran

    No, it means they could move the LLC off-die (well, they did) with little to no loss versus on-die.
    Without any additional d2d links à la GMI or QPI/UPI or whatever else.
    It's basically one chungus GPU from your POV.
    Imagine if you sawed N21 in half and moved its LLC into an active bridge.
    N-n-nope, it's merely for the sake of heat and not dealing with a fuckton more TSVs wherever possible.
    Sooner or later the analog bs will be ghetto'd into its own set of chiplets (okay sorry, Intel sorta did that already with the PVC Xe-Link (well, CXL with fancy) tile), but for now this will do.
     
    Last edited: Apr 5, 2021
    Jawed and Pete like this.
  15. tunafish

    tunafish Regular

    Yeah, the way I parsed it was that it was just a complicated way of saying: "Our LLC on the interposer is almost as fast as the LLC was on-die."

    By scaling, I just mean that if you double the number of GCDs, you also double the number of memory PHYs, so they keep up.

    Another interesting point I noticed: it appears that they will not have a separate "interface die", that is, every GCD will have what it needs to connect to PCIe and displays.

    The advantage is that they can sell the GCD alone as their lowest-end part and save a few bucks there; the disadvantage, of course, is that they lose a few % of silicon on every higher-end part, because only the interfaces on the first die are in use.
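That "few %" can be made concrete with a one-line cost model. The interface-area share below is a made-up placeholder; AMD has published no such figure:

```python
# Rough model of the duplicated-interface trade-off described above.
# IFACE_FRACTION is an invented, illustrative number.

IFACE_FRACTION = 0.05  # assumed share of each GCD spent on PCIe/display blocks

def wasted_silicon_fraction(n_gcds: int) -> float:
    """Fraction of total silicon sitting idle because only die 0's
    interface blocks are active in an n-die part."""
    return (n_gcds - 1) * IFACE_FRACTION / n_gcds

for n in (1, 2, 4):
    print(n, f"{wasted_silicon_fraction(n):.1%}")
```

The waste is zero for the single-die part (which is the whole appeal of not having a separate interface die) and asymptotically approaches the full interface fraction as the die count grows.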
     
    Malo likes this.
  16. Bondrewd

    Bondrewd Veteran

    It's a patent, stuff has to be worded in a very obtuse manner there.
    Well you see, AMD sources most of their client GPU bandwidth from their chungus SRAM array.
    For now, yes.
    Later, all gets ghetto'd like Intel did in PVC.
    Remember, this is AMD trying to tinker with expensive, barely-proven packaging tech in client of all things.
    Gotta do it in increments.
    Nope, only the upper two N3x parts are chiplet, rest are N-1 single dies.
    Technically it also lets them salvage more, but TSMC yields are excellent as-is, so it's not entirely necessary.
     
    Jawed and Lightman like this.
  17. yuri

    yuri Regular

    True, this stuff is very advanced technically. Targeting client (the patent mentions "acting as one GPU" several times) is bold.

    Intel's tiles are way ahead of AMD's CPU SerDes. However, they seem to be targeting servers with PVC and SPR(?).

    I kinda can't believe this form arrives straight away as Navi 3x with no intermediate generation. Does the MI200 MCM act as one?
     
  18. Bondrewd

    Bondrewd Veteran

    It's also very fucking ghetto; ridiculously advanced parts that look like they've been made in a shack.
    Being fair, AMD/ATi have usually targeted new stuff at client first, see GDDR5 or HBM with 2.5D.
    Granted they're not using organic carriers.
    The parts that do (haha CL-AP) are unarguably worse than what AMD offers in Rome/Milan.
    Yep, the former of which is more of a moonshot than many would like to admit.
    AMD is doing it way more gradually (i.e. N31 is 3 tiles total while PVC is 40-something).
    The tile count will ramp with each RDNA/CDNA gen until it looks very held-by-spit-and-tape.

    Also Intel is just schizo when it comes to leveraging their very cool advanced packaging roadmap.
    You know it when the most relevant EMIB part shipped to date is Kaby Lake-G.
    YEP.
     
    Last edited: Apr 5, 2021
  19. manux

    manux Veteran

    It should be doable. Nvidia's A100 does that already via NVLink. Nvidia had that whole blurb about the world's biggest GPU via NVLink (was it 8 or 16 GPUs in one box?). NVLink offers unified memory, and memory and caches are coherent. It's more a question of performance: over NVLink, GPU-to-GPU traffic is definitely slower than traffic within a GPU, so the user still has to be aware of the individual GPUs to avoid memory-related performance bottlenecks.

    AMD's solution looks like it could potentially offer a lot more bandwidth than NVLink. If AMD can provide latency so low and bandwidth so high that the chiplets just act as one giant GPU without any specific optimizations needed, that would be really awesome.
     
  20. Bondrewd

    Bondrewd Veteran

    That's an understatement if I've ever seen one.
     