> I was under the impression that it actually saved power.

Yeah, caches reduce DRAM energy consumption by filtering accesses, but they do consume energy themselves. A crude estimate is that a 1 MB cache access consumes around two orders of magnitude less energy than a DRAM access (in recent technology nodes). As the cache size grows, the energy per access grows roughly linearly (again, a crude estimate), but with a very low slope.
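A toy model of those crude numbers, just to make the shape of the claim concrete. Every constant here is an illustrative assumption, not a measurement:

```python
# Crude energy-per-access model from the estimates above:
# a DRAM access costs on the order of a nanojoule (illustrative),
# a 1 MB SRAM cache ~two orders of magnitude less, and cache energy
# grows roughly linearly with capacity, with a shallow slope.
DRAM_ACCESS_PJ = 1000.0                 # ~1 nJ per DRAM access (assumed)
CACHE_1MB_PJ = DRAM_ACCESS_PJ / 100     # "two orders of magnitude lower"
SLOPE_PJ_PER_MB = 2.0                   # shallow linear growth (assumed)

def cache_access_pj(size_mb: float) -> float:
    """Crude linear model of SRAM energy per access vs. capacity."""
    return CACHE_1MB_PJ + SLOPE_PJ_PER_MB * (size_mb - 1)

for mb in (1, 8, 32):
    # even a 32 MB cache stays far below the DRAM access cost
    print(f"{mb} MB cache: {cache_access_pj(mb):.1f} pJ/access")
```

Even with these made-up constants, a large cache access stays an order of magnitude cheaper than going to DRAM, which is the whole point of filtering.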
OK, here we go again.
A reminder: bandwidth for desktop GPUs is insanely cheap compared to their compute. The entire GDDR6X bus on a 3090 uses up maybe a handful of watts at most: 7.25 picojoules per byte. A picojoule is 10^-12 joules, and 1 watt is 1 joule per second. So even a terabyte per second isn't that much (around 7 W). Bandwidth is only dear relative to mobile power budgets. Compared to desktop/HPC stuff, where compute power hit the exponential growth curve hard while bandwidth costs stayed constant, bandwidth power usage is negligible. Remember, when citing math, to actually do the math.
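Doing that math explicitly, taking the quoted 7.25 pJ-per-byte figure at face value (the unit gets revisited later in the thread):

```python
# Power = (bytes per second) x (energy per byte).
# 1 pJ = 1e-12 J, and 1 W = 1 J/s, so the units work out directly.
PJ_PER_BYTE = 7.25e-12        # 7.25 picojoules per byte, as quoted above
bytes_per_second = 1e12       # one terabyte per second

watts = bytes_per_second * PJ_PER_BYTE
print(f"{watts:.2f} W")       # 1 TB/s at 7.25 pJ/byte -> 7.25 W
```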
FeatureGFX90AInsts,
Feature64BitDPP,
FeaturePackedFP32Ops,
FeaturePackedTID,
FullRate64Ops
> Hmmm, video is saying it's 7.25 pJ per bit, while I was going off AnandTech, which had it per byte: https://www.anandtech.com/show/1597...g-for-higher-rates-coming-to-nvidias-rtx-3090

It's per bit, all right. The table on page 2 spells it out very clearly.
Taking a look around, it does appear to be per bit, though a ton of sites screw it up, replacing little b with big B and whatnot, so be careful of that. Checklist for the time machine is to tell whoever came up with bytes to just not, since it's just annoying later (also switch negative and positive charge, etc.). Still, an eighth of the total power for memory isn't huge, and that's going off-chip. Staying on package with a substrate for Infinity Fabric 1.0 is only 2 pJ per bit, and by the time we get to chiplet GPUs we'll be on 3.0, however much more efficient that'll be. So two terabytes a second (or more) is easily doable for a bigger card.
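Redoing the arithmetic with the corrected per-bit unit. The bandwidth figure here is an assumption for illustration (roughly 3090-class, ~936 GB/s of GDDR6X):

```python
# Per-bit I/O energy costs: note bits, not bytes, hence the factor of 8.
PJ_PER_BIT_GDDR6X = 7.25e-12   # off-package GDDR6X, per bit
PJ_PER_BIT_IF1 = 2e-12         # on-package Infinity Fabric 1.0, per bit

def io_watts(bytes_per_second: float, pj_per_bit: float) -> float:
    """I/O power for a given bandwidth at a given energy-per-bit cost."""
    return bytes_per_second * 8 * pj_per_bit

print(io_watts(936e9, PJ_PER_BIT_GDDR6X))  # ~54 W for a full GDDR6X bus
print(io_watts(2e12, PJ_PER_BIT_IF1))      # 2 TB/s on-package: 32 W
```

Which is why the per-bit vs. per-byte mix-up matters: it's the difference between ~7 W and ~54 W for the same bandwidth, and why staying on package at 2 pJ/bit makes 2 TB/s look very reachable.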
patent said:
> Accordingly, as discussed herein, an active bridge chiplet deploys monolithic GPU functionality using a set of interconnected GPU chiplets in a manner that makes the GPU chiplet implementation appear as a traditional monolithic GPU from a programmer model/developer perspective. The scalable data fabric of one GPU chiplet is able to access the lower level cache(s) on the active bridge chiplet in nearly the same time as to access the lower level cache on its same chiplet, and thus allows the GPU chiplets to maintain cache coherency without requiring additional inter-chiplet coherency protocols. This low-latency, inter-chiplet cache coherency in turn enables the chiplet-based system to operate as a monolithic GPU from the software developer's perspective, and thus avoids chiplet-specific considerations on the part of a programmer or developer.
Speaking of chiplet GPUs, this AMD patent is just making the rounds.
It describes an active interposer which holds the L3 cache of a GPU, connected via 2.5D integration methods to multiple GPU chiplets. The detailed description makes clear that the system will be exposed as a single GPU to programmers.
> An interesting detail: The patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales the bandwidth with the compute. However, any access outside of a chiplet's own L2 requires looking at the L3 on the interposer, so any memory access always requires a roundtrip to the interposer, even if the memory request will be served by an interface on the same chiplet where it originated.

Yep, it's a bit like Zen 2 parts, where you always walked out to the IOD to check the directories even if your target was the 2nd CCX on the same die.
Let's amp it a bit and show the packaging flow too.
https://www.freepatentsonline.com/y2021/0098419.html
(yes, that's what GCD stands for)
> Does this imply it would have similar latency as accessing the lower level cache of another chiplet, as if accessing it in another shader array? This could mean adding additional SAs is like adding another chiplet.

No, it means they could move the LLC off-die (well, they did) with little to no losses versus on-die.

> An interesting detail: The patent describes the memory interfaces being placed on the GPU chiplets, which makes sense in that it naturally scales the bandwidth with the compute.

N-n-nope, it's merely for the sake of heat and not dealing with a fuckton more TSVs wherever possible.
> Yeah, the way I parsed it was that it was just a complicated way of saying: "Our LLC on the interposer is almost as fast as the LLC was on-die."

It's a patent; stuff has to be worded in a very obtuse manner there.
> By scaling, I just mean that if you double the number of GCDs, you also double the number of memory PHYs, so they keep up.

Well you see, AMD sources most of their client GPU bandwidth from their chungus SRAM array.
> Another interesting point I noticed: It appears that they will not have a separate "interface die"; that is, every GCD will have what it needs to connect to PCIe and displays.

For now, yes.
> The advantage of this is that they can sell the GCD alone for their lowest-end part.

Nope, only the upper two N3x parts are chiplet; the rest are N-1 single dies.
> The disadvantage, of course, is that they lose a few % of silicon on every higher-end part because only the interfaces on the first die are in use.

Technically it also allows them to salvage more, but TSMC yields are excellent as is, so it's not entirely necessary.
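To put a rough number on that overhead. The interface-area fraction here is purely an illustrative assumption (say PCIe/display interfaces take ~5% of a GCD), not anything from the patent:

```python
def wasted_area_fraction(n_dies: int, interface_fraction: float = 0.05) -> float:
    """Fraction of total silicon sitting idle when only die 0's
    interfaces are used across n identical GCDs.

    interface_fraction (~5% here) is an assumed, not measured, value.
    """
    return interface_fraction * (n_dies - 1) / n_dies

print(wasted_area_fraction(2))  # two-die part: 2.5% of total silicon idle
print(wasted_area_fraction(4))  # hypothetical four-die part: 3.75% idle
```

So the "few %" claim checks out: the waste grows with die count but is bounded by the interface fraction itself.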
> For now, yes. Later, all gets ghetto'd like Intel did in PVC. Remember, this is AMD trying to tinker with expensive, hardly ever proven packaging tech in client of all things. Gotta do it in increments.

True, this stuff is very advanced technically. Targeting client (patent mentions "acting as one GPU" several times) is bold.
> True, this stuff is very advanced technically.

It's also very fucking ghetto; ridiculously advanced parts that look like they've been made in a shack.
> Targeting client (patent mentions "acting as one GPU" several times) is bold.

Being fair, AMD/ATi have always targeted new stuff at client first; see GDDR5, or HBM with 2.5D.
> Intel's tiles are way ahead of AMD's CPU SerDes.

Granted, they're not using organic carriers.
> However, they seem to target servers with PVC and SPR(?).

Yep, the former of which is more of a moonshot than many would like to admit.
> Is the MI200 MCM acting as one?

YEP.
> AMD's solution looks like it could potentially offer a lot more bandwidth than NVLink.

That's an understatement if I've ever seen one.