AMD: RDNA 3 Speculation, Rumours and Discussion

I was under the impression that it actually saved power.
Yeah, caches reduce DRAM energy consumption by filtering accesses, but they themselves do consume energy. A crude estimate: an access to a 1 MB cache costs around two orders of magnitude less energy than a DRAM access (on recent process nodes). As the cache size grows, the energy per access grows roughly linearly (again, a crude estimate), but with a very low slope.
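A quick sketch of that filtering argument; the per-access energies below are illustrative placeholders in the spirit of the crude estimate above, not measured figures:

```python
# Rough model: energy saved by a cache that filters DRAM accesses.
# Both energy numbers are illustrative placeholders, not vendor data.
DRAM_ACCESS_PJ = 1000.0   # assumed energy per DRAM access
CACHE_ACCESS_PJ = 10.0    # assumed ~2 orders of magnitude cheaper (1 MB cache)

def avg_energy_pj(hit_rate: float) -> float:
    """Average energy per memory request with a filtering cache."""
    # Every request probes the cache; only misses go out to DRAM.
    return CACHE_ACCESS_PJ + (1.0 - hit_rate) * DRAM_ACCESS_PJ

for hr in (0.0, 0.5, 0.8, 0.95):
    print(f"hit rate {hr:.0%}: {avg_energy_pj(hr):7.1f} pJ per request")
# Even a modest hit rate cuts average energy well below the raw DRAM cost.
```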
 
Ok, here we go, again.

A reminder: bandwidth for desktop GPUs is insanely cheap compared to their compute. The entire GDDR6X bus on a 3090 uses maybe a handful of watts at most: 7.25 picojoules per byte. A picojoule is 10^-12 joules, and 1 watt is 1 joule per second. So even a terabyte per second isn't that much (about 7 watts). Bandwidth is only dear relative to mobile power budgets. Compared to desktop/HPC stuff, where compute frequencies hit the exponential part of the power curve hard while bandwidth costs stay constant, bandwidth power usage is negligible. Remember, when citing math, to actually do the math.
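Doing that math explicitly (using the per-byte figure as quoted here; the per-bit correction comes a couple of posts down):

```python
# Power = energy per transferred unit x units transferred per second.
# 7.25 pJ/byte is the figure as quoted in this post.
energy_per_byte_j = 7.25e-12          # 7.25 picojoules
bandwidth_bytes_per_s = 1e12          # 1 TB/s
power_w = energy_per_byte_j * bandwidth_bytes_per_s
print(f"{power_w:.2f} W")             # ~7.25 W, matching the "about 7 watts" claim
```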

One thing I've been curious about is the fact that GDDR6X seems disproportionately power-hungry compared to GDDR6 even when you adjust for the higher frequencies. I'm wondering if Micron was somewhat coyly talking up the power dissipation at the DRAM package while intentionally leaving out how much more power is dissipated by the memory controller on the GPU die (in comparison to GDDR6).

If I were a betting person, I'd bet on most of the extra power being dissipated on-die and not being reflected in the datasheets for the GDDR6X modules themselves.
 
Hmmm, the video is saying it's 7.25 pJ per bit, while I was going off AnandTech, which had it per byte: https://www.anandtech.com/show/1597...g-for-higher-rates-coming-to-nvidias-rtx-3090

Taking a look around, it does appear to be per bit, though a ton of sites screw it up, swapping little b for big B and whatnot, so be careful of that. Checklist for the time machine: tell whoever came up with bytes to just not, since it's just annoying later (also switch negative and positive charge, etc.). Still, an eighth of the total power for memory isn't huge, and that's going off-chip. Staying on-package with a substrate, Infinity Fabric 1.0 is only 2 pJ per bit, and by the time we get to chiplet GPUs we'll be on 3.0, however much more efficient that'll be. So two terabytes a second (or more) is easily doable for a bigger card.
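Redoing the arithmetic with the corrected per-bit figure (a rough sanity check; the 3090's ~936 GB/s bandwidth and 350 W board power are the usual published specs, and the 2 pJ/bit Infinity Fabric number is as quoted above):

```python
# Per-bit version of the energy math.
PJ = 1e-12

def link_power_w(pj_per_bit: float, bytes_per_s: float) -> float:
    """Interface power for a given energy-per-bit and bandwidth."""
    return pj_per_bit * PJ * bytes_per_s * 8  # 8 bits per byte

# GDDR6X on a 3090: 7.25 pJ/bit at roughly 936 GB/s.
g6x = link_power_w(7.25, 936e9)
print(f"GDDR6X @ 936 GB/s: {g6x:.0f} W")   # ~54 W, a sizeable slice of 350 W

# On-package Infinity Fabric 1.0 at 2 pJ/bit, pushed to 2 TB/s:
if1 = link_power_w(2.0, 2e12)
print(f"IF 1.0 @ 2 TB/s:  {if1:.0f} W")    # ~32 W, hence "easily doable"
```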
 
It's per bit, all right. The table on page 2 spells it out very clearly.
https://media-www.micron.com/-/medi...rief/ultra_bandwidth_solutions_tech_brief.pdf
 
Speaking of chiplet GPUs, this AMD patent is just making the rounds.

It describes an active interposer which holds the L3 cache of a GPU and is connected via 2.5D integration methods to multiple GPU chiplets. The detailed description makes clear that the system will be exposed as a single GPU to programmers.

(edit: added quote)

patent said:
Accordingly, as discussed herein, an active bridge chiplet deploys monolithic GPU functionality using a set of interconnected GPU chiplets in a manner that makes the GPU chiplet implementation appear as a traditional monolithic GPU from a programmer model/developer perspective. The scalable data fabric of one GPU chiplet is able to access the lower level cache(s) on the active bridge chiplet in nearly the same time as to access the lower level cache on its same chiplet, and thus allows the GPU chiplets to maintain cache coherency without requiring additional inter-chiplet coherency protocols. This low-latency, inter-chiplet cache coherency in turn enables the chiplet-based system to operate as a monolithic GPU from the software developer's perspective, and thus avoids chiplet-specific considerations on the part of a programmer or developer.
 
Let's amp it up a bit and show the packaging flow too.
https://www.freepatentsonline.com/y2021/0098419.html
(yes, that's what GCD stands for)
 
An interesting detail: The patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales bandwidth with compute. However, any access outside a chiplet's own L2 requires checking the L3 on the interposer, so every memory access requires a round trip to the interposer, even if the request will be served by an interface on the same chiplet where it originated.
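A minimal sketch of that access path as described; the latency numbers are entirely made up and only the ordering of the hops matters:

```python
# Hypothetical model of the patent's cache hierarchy.
# Latencies are invented cycle counts, purely illustrative.
L2_HIT, BRIDGE_HOP, L3_HIT, DRAM = 30, 20, 60, 300

def access_latency(l2_hit: bool, l3_hit: bool) -> int:
    """Latency for one request issued by a GPU chiplet."""
    if l2_hit:
        return L2_HIT                       # served inside the chiplet
    # An L2 miss always crosses to the L3 on the active bridge...
    lat = L2_HIT + BRIDGE_HOP + L3_HIT
    if l3_hit:
        return lat
    # ...and on an L3 miss the request goes out to a memory PHY.  Even
    # when that PHY sits on the originating chiplet, the bridge round
    # trip has already been paid.
    return lat + BRIDGE_HOP + DRAM

print(access_latency(l2_hit=False, l3_hit=False))  # local PHY, hop still paid
```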
 
Yep, it's a bit like Zen 2 parts, where you always walked out to the IOD to check the directories even if your target was the second CCX on the same die.
 
patent said:
The scalable data fabric of one GPU chiplet is able to access the lower level cache(s) on the active bridge chiplet in nearly the same time as to access the lower level cache on its same chiplet, and thus allows the GPU chiplets to maintain cache coherency without requiring additional inter-chiplet coherency protocols.

Does this imply that accessing the lower-level cache of another chiplet would have similar latency to accessing it from another shader array?
This could mean adding additional SAs is like adding another chiplet.
 
Does this imply that accessing the lower-level cache of another chiplet would have similar latency to accessing it from another shader array?
No, it means they can move the LLC off-die (well, they did) with little to no loss versus on-die.
Without any additional die-to-die links à la GMI or QPI/UPI or whatever else.
It's basically one chungus GPU from your POV.
Imagine if you sawed N21 in half and moved its LLC into an active bridge.
An interesting detail: The patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales bandwidth with compute.
N-n-nope, it's merely for the sake of heat and not dealing with a fuckton more TSVs wherever possible.
Sooner or later the analog bs will be ghetto'd into its own set of chiplets (okay, sorry, Intel sorta did that already with the PVC Xe-Link tile (well, CXL with fancy extras)), but for now this will do.
 
Does this imply that accessing the lower-level cache of another chiplet would have similar latency to accessing it from another shader array?
This could mean adding additional SAs is like adding another chiplet.

No, it means they can move the LLC off-die (well, they did) with little to no loss versus on-die.

Yeah, the way I parsed it was that it was just a complicated way of saying: "Our LLC on the interposer is almost as fast as the LLC was on-die."

N-n-nope, it's merely for the sake of heat and not dealing with a fuckton more TSVs wherever possible.

By scaling, I just mean that if you double the number of GCDs, you also double the number of memory PHYs, so they keep up.

Another interesting point I noticed: It appears that they will not have a separate "interface die", that is, every GCD will have what it needs to connect to PCI-e and displays.

The advantage of this is that they can sell the GCD alone as their lowest-end part and save a few bucks there; the disadvantage, of course, is that they lose a few percent of silicon on every higher-end part because only the interfaces on the first die are in use.
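As a toy model of that trade-off (the interface area fraction is invented purely for illustration):

```python
# Toy model: duplicated interface silicon on multi-GCD parts.
# Assumed: PCI-e/display blocks take ~5% of one GCD (made-up figure).
IFACE_FRACTION = 0.05

def wasted_fraction(n_gcds: int) -> float:
    """Share of total silicon that is dark, unused interface logic."""
    # Only the first die's interfaces are in use; the other n-1 copies idle.
    return (n_gcds - 1) * IFACE_FRACTION / n_gcds

for n in (1, 2, 4):
    print(f"{n} GCD(s): {wasted_fraction(n):.2%} wasted")
# 1 GCD: 0%, 2 GCDs: 2.5%, 4 GCDs: 3.75% -- "a few percent", as said above.
```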
 
Yeah, the way I parsed it was that it was just a complicated way of saying: "Our LLC on the interposer is almost as fast as the LLC was on-die."
It's a patent, stuff has to be worded in a very obtuse manner there.
By scaling, I just mean that if you double the number of GCDs, you also double the number of memory PHYs, so they keep up.
Well, you see, AMD sources most of their client GPU bandwidth from their chungus SRAM array (rough math in the sketch at the end of this post).
Another interesting point I noticed: It appears that they will not have a separate "interface die", that is, every GCD will have what it needs to connect to PCI-e and displays.
For now, yes.
Later, it all gets ghetto'd like Intel did in PVC.
Remember, this is AMD trying to tinker with expensive, hardly-proven packaging tech in client of all things.
Gotta do it in increments.
The advantage of this is that they can sell the GCD alone as their lowest-end part
Nope, only the upper two N3x parts are chiplets; the rest are N-1 single dies.
the disadvantage, of course, is that they lose a few percent of silicon on every higher-end part because only the interfaces on the first die are in use.
Technically it also allows them to salvage more, but TSMC yields are excellent as-is, so it's not entirely necessary.
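Rough math for that SRAM-bandwidth point, a simplistic blend using commonly cited N21 ballpark figures (~2 TB/s Infinity Cache bandwidth, 512 GB/s GDDR6, AMD's ~58% hit rate at 4K); treat every number as approximate:

```python
# Effective bandwidth with a big LLC fronting DRAM (ballpark N21 numbers).
CACHE_BW_GBS = 2000.0   # Infinity Cache, commonly cited ~2 TB/s peak
DRAM_BW_GBS  = 512.0    # 256-bit GDDR6 @ 16 Gbps

def effective_bw(hit_rate: float) -> float:
    """Blended bandwidth seen by the shader array (simplistic model)."""
    return hit_rate * CACHE_BW_GBS + (1.0 - hit_rate) * DRAM_BW_GBS

print(f"{effective_bw(0.58):.0f} GB/s")  # ~58% hit rate at 4K -> ~1375 GB/s
```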
 
For now, yes.
Later, it all gets ghetto'd like Intel did in PVC.
Remember, this is AMD trying to tinker with expensive, hardly-proven packaging tech in client of all things.
Gotta do it in increments.
True, this stuff is very advanced technically. Targeting client (the patent mentions "acting as one GPU" several times) is bold.

Intel's tiles are way ahead of AMD's CPU SerDes. However, they seem to be targeting servers with PVC and SPR(?).

Kinda can't believe this form arrives straight as Navi 3x with no intermediate generation. Is the MI200 MCM acting as one?
 
True, this stuff is very advanced technically
It's also very fucking ghetto; ridiculously advanced parts that look like they've been made in a shack.
Targeting client (the patent mentions "acting as one GPU" several times) is bold.
Being fair, AMD/ATi have always targeted new stuff at client first; see GDDR5, or HBM with 2.5D.
Intel's tiles are way ahead of AMD's CPU SerDes
Granted, they're not using organic carriers.
The parts that do (haha, CL-AP) are unarguably worse than what AMD offers in Rome/Milan.
However, they seem to be targeting servers with PVC and SPR(?).
Yep, the former of which is more of a moonshot than many would like to admit.
AMD is doing it far more gradually (i.e. N31 is 3 tiles total, while PVC is 40-something).
The tile count will ramp with each RDNA/CDNA generation until it looks very held-together-with-spit-and-tape.

Also, Intel is just schizo when it comes to leveraging their very cool advanced-packaging roadmap.
You know it when the most relevant EMIB part shipped to date is Kaby Lake-G.
Is the MI200 MCM acting as one?
YEP.
 
True, this stuff is very advanced technically. Targeting client (the patent mentions "acting as one GPU" several times) is bold.

It should be doable. Nvidia's A100 does that already via NVLink. Nvidia had that whole blurb about the world's biggest GPU via NVLink (was it 8 or 16 GPUs in one box?). NVLink offers unified memory, and memory and caches are coherent. It's more a question of performance. In the NVLink case, GPU-to-GPU traffic is definitely slower than intra-GPU traffic. The user still has to be aware of the individual GPUs to avoid memory-related performance bottlenecks.

AMD's solution looks like it could potentially offer a lot more bandwidth than NVLink. If AMD could get latency low enough and bandwidth high enough that the chiplets just act as one giant GPU without any specific optimizations needed, that would be really awesome.
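To put numbers on that "slower over the link" point, a crude blended-bandwidth model using A100's published specs (1555 GB/s HBM2e, 600 GB/s total NVLink); the harmonic blend is my own simplification:

```python
# Why "one big GPU" over a link still needs NUMA awareness (illustrative).
LOCAL_BW_GBS = 1555.0    # A100 HBM2e bandwidth (published spec)
NVLINK_BW_GBS = 600.0    # A100 total NVLink bandwidth (published spec)

def achieved_bw(remote_fraction: float) -> float:
    """Crude blended bandwidth when part of the working set is remote."""
    # Remote accesses are limited by the link, local ones by HBM.
    return 1.0 / (remote_fraction / NVLINK_BW_GBS +
                  (1.0 - remote_fraction) / LOCAL_BW_GBS)

for f in (0.0, 0.1, 0.5):
    print(f"{f:.0%} remote: {achieved_bw(f):.0f} GB/s")
# 0% -> 1555 GB/s, 10% -> ~1342 GB/s, 50% -> ~866 GB/s: placement matters.
```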
 