AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,362
    Likes Received:
    3,101
    Location:
    Germany
    I was under the impression that it actually saved power.
     
  2. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    67
    Likes Received:
    169
    Yeah, caches reduce DRAM energy consumption by filtering accesses, but they do consume energy themselves. A crude estimate: in recent technology nodes, an access to a 1 MB cache costs around two orders of magnitude less energy than a DRAM access. As the cache size grows, the energy per access grows roughly linearly (again a crude estimate), but with a very low slope.
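A rough model of that filtering effect, with illustrative energy numbers in the ballpark described above (all of the constants are assumptions, not measured values):

```python
# Back-of-envelope model of how a cache filters DRAM energy.
# All energy figures below are assumed, illustrative values:
# DRAM taken as ~20 pJ/bit, a 1 MB SRAM as ~100x less,
# growing slowly (linearly) with capacity.

DRAM_PJ_PER_BIT = 20.0        # assumed DRAM access energy
CACHE_PJ_PER_BIT_1MB = 0.2    # assumed, ~2 orders of magnitude lower
SLOPE_PJ_PER_MB = 0.02        # assumed shallow linear growth with size

def cache_energy_pj_per_bit(size_mb: float) -> float:
    """Energy per bit for one cache access, roughly linear in size."""
    return CACHE_PJ_PER_BIT_1MB + SLOPE_PJ_PER_MB * (size_mb - 1)

def effective_energy_pj_per_bit(size_mb: float, hit_rate: float) -> float:
    """Every access probes the cache; only misses also pay for DRAM."""
    return cache_energy_pj_per_bit(size_mb) + (1 - hit_rate) * DRAM_PJ_PER_BIT

# A big cache with a decent hit rate still wins despite its own cost:
no_cache = DRAM_PJ_PER_BIT
with_cache = effective_energy_pj_per_bit(size_mb=128, hit_rate=0.6)
print(f"no cache: {no_cache:.1f} pJ/bit, "
      f"128 MB cache @ 60% hits: {with_cache:.1f} pJ/bit")
```

Even with the cache's own access energy growing with size, the filtered total stays well below the uncached case as long as the hit rate holds up.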
     
    Alexko likes this.
  3. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    48
    Likes Received:
    100
    One thing I've been curious about is that GDDR6X seems disproportionately power hungry compared to GDDR6, even after you adjust for the higher frequencies. I'm wondering if Micron was somewhat coyly quoting the power dissipated at the DRAM package while intentionally leaving out how much more power is dissipated by the memory controller on the GPU die (compared to GDDR6).

    If I were a betting person, I'd be betting on most of the extra power being dissipated on die, and not being reflected in the datasheets for the GDDR6X modules themselves.
     
    Lightman likes this.
  4. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,761
    Likes Received:
    6,896
    Buildzoid just did a vid on GDDR6X power draw. I'll admit I haven't watched it yet, but he's a good information source, so this might be relevant to the discussion.

     
    Lightman, Krteq and T2098 like this.
  5. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    682
    Likes Received:
    363
    Hmmm, the video is saying it's 7.25 pJ per bit, while I was going off AnandTech, which had it per byte: https://www.anandtech.com/show/1597...g-for-higher-rates-coming-to-nvidias-rtx-3090

    Taking a look around, it does appear to be per bit, though a ton of sites screw it up, replacing little b with big B and whatnot, so be careful of that. Checklist for the time machine is to tell whoever came up with bytes to just not, since it's just annoying later (also switch negative and positive charge, etc.). Still, an eighth of the total power for memory isn't huge, and that's going off-chip. Staying on package with a substrate for Infinity Fabric 1.0 is only 2 pJ per bit, and by the time we get to chiplet GPUs we'll be on 3.0, however much more efficient that'll be. So two terabytes a second (or more) is easily doable for a bigger card.
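Converting per-bit figures into watts at a given bandwidth is just energy times rate; a quick sketch using the 7.25 pJ/bit and 2 pJ/bit numbers quoted above (the bandwidth points are arbitrary examples):

```python
def interface_power_watts(bandwidth_tbps: float, pj_per_bit: float) -> float:
    """Power = bits/s * joules/bit; 1 TB/s = 8e12 bits/s."""
    bits_per_s = bandwidth_tbps * 8e12
    return bits_per_s * pj_per_bit * 1e-12

# GDDR6X at ~1 TB/s, using the 7.25 pJ/bit figure:
print(f"GDDR6X @ 1 TB/s: {interface_power_watts(1.0, 7.25):.0f} W")
# On-package Infinity Fabric 1.0 at 2 pJ/bit, pushed to 2 TB/s:
print(f"IF 1.0 @ 2 TB/s: {interface_power_watts(2.0, 2.0):.0f} W")
```

Which is also why the little-b/big-B confusion matters: reading the same figure as per byte would make the interface look eight times cheaper than it is.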
     
  6. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    88
    Likes Received:
    207
    Lightman likes this.
  7. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,362
    Likes Received:
    3,101
    Location:
    Germany
    It's per bit, alright. The table on page 2 spells it out very clearly.
    https://media-www.micron.com/-/medi...rief/ultra_bandwidth_solutions_tech_brief.pdf
     
    T2098, Lightman and BRiT like this.
  8. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    619
    Likes Received:
    397
    Speaking of chiplet GPUs, this AMD patent is just making the rounds.

    It describes an active interposer holding the L3 cache of a GPU, connected via 2.5D integration methods to multiple GPU chiplets. The detailed description makes clear that the system will be exposed to programmers as a single GPU.

    (edit, added quote: )

     
    #208 tunafish, Apr 5, 2021
    Last edited: Apr 5, 2021
    Pete, Cuthalu, Krteq and 5 others like this.
  9. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,129
    Likes Received:
    510
    Let's amp it a bit and show the packaging flow too.
    https://www.freepatentsonline.com/y2021/0098419.html
    (yes, that's what GCD stands for)
     
    Pete, Krteq, ethernity and 1 other person like this.
  10. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    619
    Likes Received:
    397
    An interesting detail: the patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales bandwidth with compute. However, any access that misses a chiplet's own L2 has to check the L3 on the interposer, so every memory access requires a roundtrip to the interposer, even if the request will ultimately be served by an interface on the same chiplet where it originated.
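One reading of that access path can be sketched as follows; all names, classes, and the interleaving policy are illustrative assumptions, not taken from the patent text:

```python
# Hypothetical sketch of the access path described above:
# private caches on the chiplet, a shared L3 on the active
# interposer, and memory PHYs distributed across the chiplets.

class Chiplet:
    """One GPU chiplet: private caches plus its own slice of memory PHYs."""
    def __init__(self, name, dram):
        self.name = name
        self.l2 = {}      # stands in for the whole private L0/L1/L2 stack
        self.dram = dram  # memory behind this chiplet's PHYs

    def dram_read(self, addr):
        return self.dram[addr]

class InterposerL3:
    """Shared last-level cache sitting on the active interposer."""
    def __init__(self, chiplets):
        self.data = {}
        self.chiplets = chiplets

    def phy_owner(self, addr):
        # Address-interleaved across chiplets (an assumed policy).
        return self.chiplets[addr % len(self.chiplets)]

def service_request(addr, origin, l3):
    if addr in origin.l2:        # hit in the private hierarchy: stays on-die
        return origin.l2[addr]
    if addr in l3.data:          # any miss makes the die-to-die roundtrip
        return l3.data[addr]
    # L3 miss: the interposer routes the request to whichever chiplet
    # owns the PHY for this address -- possibly `origin` itself.
    return l3.phy_owner(addr).dram_read(addr)

c0 = Chiplet("GCD0", {0: "a", 2: "c"})
c1 = Chiplet("GCD1", {1: "b", 3: "d"})
l3 = InterposerL3([c0, c1])
print(service_request(2, c0, l3))  # served by c0's own PHY, but still via the interposer
```

The point of the sketch is the control flow: even the last line, where the data lives behind the originating chiplet's own PHY, passes through the interposer L3 first.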
     
    Pete, BRiT, Lightman and 2 others like this.
  11. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,129
    Likes Received:
    510
    Yep, it's a bit like Zen 2 parts, where you always walked into the IOD to check the directories even if your target was the 2nd CCX on the same die.
     
  12. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    88
    Likes Received:
    207
    Komachi somehow got wind of this almost 9 months ago; it seems he has his sources. In the Twitter world, he and Kopite seem connected to people in the know.

     
    Krteq and BRiT like this.
  13. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    88
    Likes Received:
    207
    Does this imply that accessing the lower-level cache of another chiplet would have similar latency to accessing it in another shader array?
    This could mean adding additional SAs is like adding another chiplet.
     
  14. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,129
    Likes Received:
    510
    No, it means they could move the LLC off-die (well, they did) with little to no loss versus on-die.
    Without any additional d2d links a la GMI or QPI/UPI or whatever else.
    It's basically one chungus GPU from your POV.
    Imagine if you sawed N21 in half and moved its LLC into an active bridge.
    N-n-nope, it's merely for the sake of heat and not dealing with a fuckton more TSVs wherever possible.
    Sooner or later the analog bs would be ghetto'd into its own set of chiplets (okay, sorry, Intel sorta did that already with the PVC Xe-Link tile (well, CXL with fancy)), but for now this will do.
     
    #214 Bondrewd, Apr 5, 2021
    Last edited: Apr 5, 2021
    Pete likes this.
  15. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    619
    Likes Received:
    397
    Yeah, the way I parsed it was that it was just a complicated way of saying: "Our LLC on the interposer is almost as fast as the LLC was on-die."

    By scaling, I just mean that if you double the number of GCDs, you also double the number of memory PHYs, so they keep up.

    Another interesting point I noticed: It appears that they will not have a separate "interface die", that is, every GCD will have what it needs to connect to PCI-e and displays.

    The advantage of this is that they can sell the GCD alone as their lowest-end part and save a few bucks there; the disadvantage, of course, is that they lose a few percent of silicon on every higher-end part, because only the interfaces on the first die are in use.
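The trade-off is easy to put in numbers; the 5% interface share below is a made-up figure (the patent gives no area breakdown):

```python
def dead_interface_fraction(num_gcds: int, iface_frac: float) -> float:
    """Fraction of total GCD silicon sitting idle when every GCD carries
    PCIe/display interfaces but only the first die's are in use.
    iface_frac is the (assumed) share of one GCD spent on interfaces."""
    return iface_frac * (num_gcds - 1) / num_gcds

# Assuming interfaces were, say, 5% of a GCD (illustrative, not a real figure):
for n in (1, 2, 4):
    print(f"{n} GCD(s): {dead_interface_fraction(n, 0.05):.2%} of silicon unused")
```

Under that assumption the waste is zero on the single-GCD part and asymptotes toward the full interface share as the GCD count grows, which matches the "a few percent on higher-end parts" framing above.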
     
    Malo likes this.
  16. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,129
    Likes Received:
    510
    It's a patent, stuff has to be worded in a very obtuse manner there.
    Well you see, AMD sources most of their client GPU bandwidth from their chungus SRAM array.
    For now, yes.
    Later, it all gets ghetto'd like Intel did in PVC.
    Remember, this is AMD trying to tinker with expensive, barely proven packaging tech in client of all things.
    Gotta do it in increments.
    Nope, only the upper two N3x parts are chiplet; the rest are N-1 single dies.
    Technically it also allows them to salvage more, but TSMC yields are excellent as is, thus, well, not entirely necessary.
     
    Lightman likes this.
  17. yuri

    Regular Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    272
    Likes Received:
    280
    True, this stuff is very advanced technically. Targeting client (patent mentions "acting as one GPU" several times) is bold.

    Intel's tiles are way ahead of AMD's CPU SerDes. However, they seem to target servers with PVC and SPR(?).

    Kinda can't believe this form arrives straight away as Navi 3x with no intermediate generation. Does the MI200 MCM act as one?
     
  18. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,129
    Likes Received:
    510
    It's also very fucking ghetto; ridiculously advanced parts that look like they've been made in a shack.
    To be fair, AMD/ATi has usually targeted new stuff at client first; see GDDR5, or HBM with 2.5D.
    Granted, they're not using organic carriers.
    The parts that do (haha CL-AP) are unarguably worse than what AMD offers in Rome/Milan.
    Yep, the former of which is more of a moonshot than many would like to admit.
    AMD is doing it way more gradually (i.e. N31 is 3 tiles total while PVC is 40-something).
    The tile count will ramp with each RDNA/CDNA gen until it looks very held-together-by-spit-and-tape.

    Also Intel is just schizo when it comes to leveraging their very cool advanced packaging roadmap.
    You know it when the most relevant EMIB part shipped to date is Kaby Lake-G.
    YEP.
     
    #218 Bondrewd, Apr 5, 2021
    Last edited: Apr 5, 2021
  19. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,798
    Likes Received:
    1,981
    Location:
    Earth
    It should be doable. Nvidia's A100 does that already via NVLink. Nvidia had that whole blurb about the world's biggest GPU via NVLink (was it 8 or 16 GPUs in one box?). NVLink offers unified memory, and memory and caches are coherent. It's more a question of performance: with NVLink, GPU-to-GPU traffic is definitely slower than traffic within a GPU. The user still has to be aware of the individual GPUs to avoid memory-related performance bottlenecks.

    AMD's solution looks like it could potentially offer a lot more bandwidth than NVLink. If AMD were able to provide latency so low and bandwidth so high that the chiplets just act as one giant GPU without any specific optimizations needed, that would be really awesome.
     
  20. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,129
    Likes Received:
    510
    That's an understatement if I've ever seen one.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.