> I was under the impression that it actually saved power.

Yeah, caches reduce DRAM energy consumption by filtering accesses, but they do consume energy themselves. A crude estimate is that a 1 MB cache access consumes around two orders of magnitude less energy than a DRAM access (in recent technology nodes). As the cache size grows, the energy per access grows roughly linearly (again, a crude estimate), but with a very low slope.
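A toy model of those crude numbers, just to make the shape of the claim concrete. Every constant here is an illustrative assumption, not a measurement:

```python
# Crude energy-per-access model from the estimates above:
# a DRAM access costs on the order of a nanojoule (illustrative),
# a 1 MB SRAM cache ~two orders of magnitude less, and cache energy
# grows roughly linearly with capacity, with a shallow slope.
DRAM_ACCESS_PJ = 1000.0                 # ~1 nJ per DRAM access (assumed)
CACHE_1MB_PJ = DRAM_ACCESS_PJ / 100     # "two orders of magnitude lower"
SLOPE_PJ_PER_MB = 2.0                   # shallow linear growth (assumed)

def cache_access_pj(size_mb: float) -> float:
    """Crude linear model of SRAM energy per access vs. capacity."""
    return CACHE_1MB_PJ + SLOPE_PJ_PER_MB * (size_mb - 1)

for mb in (1, 8, 32):
    # even a 32 MB cache stays far below the DRAM access cost
    print(f"{mb} MB cache: {cache_access_pj(mb):.1f} pJ/access")
```

Even with these made-up constants, a large cache access stays an order of magnitude cheaper than going to DRAM, which is the whole point of filtering.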
OK, here we go again.
A reminder: bandwidth for desktop GPUs is insanely cheap compared to their compute. The entire GDDR6X bus on a 3090 uses up maybe a handful of watts at most: 7.25 picojoules per byte. A picojoule is 10^-12 joules, and 1 watt is 1 joule per second. So even a terabyte per second isn't that much (around 7 W). Bandwidth is only dear relative to mobile power budgets. Compared to desktop/HPC stuff, where compute power hit the exponential growth curve hard while bandwidth costs stayed constant, bandwidth power usage is negligible. Remember, when citing math, to actually do the math.
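Doing that math explicitly, taking the quoted 7.25 pJ-per-byte figure at face value (the unit gets revisited later in the thread):

```python
# Power = (bytes per second) x (energy per byte).
# 1 pJ = 1e-12 J, and 1 W = 1 J/s, so the units work out directly.
PJ_PER_BYTE = 7.25e-12        # 7.25 picojoules per byte, as quoted above
bytes_per_second = 1e12       # one terabyte per second

watts = bytes_per_second * PJ_PER_BYTE
print(f"{watts:.2f} W")       # 1 TB/s at 7.25 pJ/byte -> 7.25 W
```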
FeatureGFX90AInsts,
Feature64BitDPP,
FeaturePackedFP32Ops,
FeaturePackedTID,
FullRate64Ops
> Hmmm, video is saying it's 7.25 pJ per bit, while I was going off AnandTech, which had it per byte: https://www.anandtech.com/show/1597...g-for-higher-rates-coming-to-nvidias-rtx-3090

It's per bit, all right. The table on page 2 spells it out very clearly.
Taking a look around, it does appear to be per bit, though a ton of sites screw it up, replacing little b with big B and whatnot, so be careful of that. Checklist for the time machine is to tell whoever came up with bytes to just not, since it's just annoying later (also switch negative and positive charge, etc.). Still, an eighth of the total power for memory isn't huge, and that's going off-chip. Staying on package with a substrate for Infinity Fabric 1.0 is only 2 pJ per bit, and by the time we get to chiplet GPUs we'll be on 3.0, however much more efficient that'll be. So two terabytes a second (or more) is easily doable for a bigger card.
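Redoing the arithmetic with the corrected per-bit unit. The bandwidth figure here is an assumption for illustration (roughly 3090-class, ~936 GB/s of GDDR6X):

```python
# Per-bit I/O energy costs: note bits, not bytes, hence the factor of 8.
PJ_PER_BIT_GDDR6X = 7.25e-12   # off-package GDDR6X, per bit
PJ_PER_BIT_IF1 = 2e-12         # on-package Infinity Fabric 1.0, per bit

def io_watts(bytes_per_second: float, pj_per_bit: float) -> float:
    """I/O power for a given bandwidth at a given energy-per-bit cost."""
    return bytes_per_second * 8 * pj_per_bit

print(io_watts(936e9, PJ_PER_BIT_GDDR6X))  # ~54 W for a full GDDR6X bus
print(io_watts(2e12, PJ_PER_BIT_IF1))      # 2 TB/s on-package: 32 W
```

Which is why the per-bit vs. per-byte mix-up matters: it's the difference between ~7 W and ~54 W for the same bandwidth, and why staying on package at 2 pJ/bit makes 2 TB/s look very reachable.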
patent said:
> Accordingly, as discussed herein, an active bridge chiplet deploys monolithic GPU functionality using a set of interconnected GPU chiplets in a manner that makes the GPU chiplet implementation appear as a traditional monolithic GPU from a programmer model/developer perspective. The scalable data fabric of one GPU chiplet is able to access the lower level cache(s) on the active bridge chiplet in nearly the same time as to access the lower level cache on its same chiplet, and thus allows the GPU chiplets to maintain cache coherency without requiring additional inter-chiplet coherency protocols. This low-latency, inter-chiplet cache coherency in turn enables the chiplet-based system to operate as a monolithic GPU from the software developer's perspective, and thus avoids chiplet-specific considerations on the part of a programmer or developer.
Speaking of chiplet GPUs, this AMD patent is just making the rounds.
It describes an active interposer which holds the L3 cache of a GPU, connected via 2.5D integration methods to multiple GPU chiplets. The detailed description makes clear that the system will be exposed as a single GPU to programmers.
> An interesting detail: The patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales the bandwidth with the compute. However, any access outside of a chiplet's own L2 requires looking at the L3 on the interposer, so any memory access always requires a roundtrip to the interposer, even if the memory request will be served by an interface on the same chiplet where it originated.

Yep, it's a bit like Zen 2 parts, where you always walked out to the IOD to check the directories even if your target was the 2nd CCX on the same die.
Let's amp it a bit and show the packaging flow too.
https://www.freepatentsonline.com/y2021/0098419.html
(yes, that's what GCD stands for)
> Does this imply it would have similar latency as accessing the lower level cache of another chiplet, as if accessing it in another shader array? This could mean adding additional SAs is like adding another chiplet.

No, it means they could move the LLC off-die (well, they did) with little to no losses versus on-die.

> An interesting detail: The patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales the bandwidth with the compute.

N-n-nope, it's merely for the sake of heat and not dealing with a fuckton more TSVs wherever possible.
> Yeah, the way I parsed it was that it was just a complicated way of saying: "Our LLC on the interposer is almost as fast as the LLC was on-die."

It's a patent; stuff has to be worded in a very obtuse manner there.
> By scaling, I just mean that if you double the number of GCDs, you also double the number of memory PHYs, so they keep up.

Well you see, AMD sources most of their client GPU bandwidth from their chungus SRAM array.
> Another interesting point I noticed: It appears that they will not have a separate "interface die"; that is, every GCD will have what it needs to connect to PCIe and displays.

For now, yes.
> The advantage of this is that they can sell the GCD alone for their lowest-end part.

Nope, only the upper two N3x parts are chiplet; the rest are N-1 single dies.
> The disadvantage, of course, is that they lose a few % of silicon on every higher-end part because only the interfaces on the first die are in use.

Technically it also allows them to salvage more, but TSMC yields are excellent as is, so it's not entirely necessary.
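To put a rough number on that overhead. The interface-area fraction here is purely an illustrative assumption (say PCIe/display interfaces take ~5% of a GCD), not anything from the patent:

```python
def wasted_area_fraction(n_dies: int, interface_fraction: float = 0.05) -> float:
    """Fraction of total silicon sitting idle when only die 0's
    interfaces are used across n identical GCDs.

    interface_fraction (~5% here) is an assumed, not measured, value.
    """
    return interface_fraction * (n_dies - 1) / n_dies

print(wasted_area_fraction(2))  # two-die part: 2.5% of total silicon idle
print(wasted_area_fraction(4))  # hypothetical four-die part: 3.75% idle
```

So the "few %" claim checks out: the waste grows with die count but is bounded by the interface fraction itself.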
> For now, yes. Later, all gets ghetto'd like Intel did in PVC. Remember, this is AMD trying to tinker with expensive, hardly ever proven packaging tech in client of all things. Gotta do it in increments.

True, this stuff is very advanced technically. Targeting client (patent mentions "acting as one GPU" several times) is bold.
> True, this stuff is very advanced technically.

It's also very fucking ghetto; ridiculously advanced parts that look like they've been made in a shack.
> Targeting client (patent mentions "acting as one GPU" several times) is bold.

Being fair, AMD/ATi have always targeted new stuff at client first; see GDDR5, or HBM with 2.5D.
> Intel's tiles are way ahead of AMD's CPU SerDes.

Granted, they're not using organic carriers.
> However, they seem to target servers with PVC and SPR(?).

Yep, the former of which is more of a moonshot than many would like to admit.
> Is the MI200 MCM acting as one?

YEP.
> AMD's solution looks like it could potentially offer a lot more bandwidth than NVLink.

That's an understatement if I've ever seen one.