If someone wants more background on the electromigration described above, see here:
One point that came up in the video was the 95C operating temperature for the 290 line leading to a higher rate of thermal cycling failures. I don't have the data to know, although I recall that around the time Hawaii launched I theorized that the constant 95C could reduce the impact of cycling. I think there was an article or blurb from AMD that insinuated the same, but it's been so long that I can't find it.
Part of the impact of thermal cycling is the cycling itself, which a constant 95C wouldn't be doing much of. The power-up and power-down cycles are some of the most extreme transitions, but they are relatively infrequent. Spiky utilization of a high-power chip and the back-and-forth trips up and down the temp/fan curve can happen many times between system power-on events, which is something vendors keep an eye on as a more persistent threat to the mechanical reliability of the package.
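(For what it's worth, the usual engineering shorthand for that wear mechanism is a Coffin-Manson-style fatigue relation, roughly N_f ∝ (ΔT)^-n, where N_f is the number of cycles to failure and n is a material-dependent exponent often quoted around 2 or higher for solder joints. Those are generic textbook values, not anything specific to Hawaii, but they illustrate the point: damage per cycle grows quickly with the size of the temperature swing, so a chip pinned at a steady 95C is racking up far fewer, and far smaller, cycles than one bouncing up and down its fan curve.)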
There's a wide array of optimization points for the choice of materials, their arrangement, and the power behavior of the chip. There's the coefficient of thermal expansion that vendors can try to match, or try to compensate for when it doesn't match. On top of that, there are the properties of the connections and layers like the underfill. Over the whole operating range, the physical properties can shift. Layers can expand or contract, and they can be stiff or soft, brittle or flexible as well. Selecting a target range or operating limit can influence what materials are chosen, and mistakes can lead to materials that weaken too much at a high temp, or remain too stiff at some temperatures and transmit excessive force up or down the stack. In theory, a chip package with materials that matched well at 95C and didn't have overly stiff adhesive or support layers could sit at a comfortable balance at a fixed operating temperature, with only the rarer power-up or power-down ramps being the place where stresses rise. Taking the same stack and running it outside that range, or not keeping it consistent, could actually increase the rate of wear, even if it ran cooler.
That was the theory at the time, although I don't have the long-term data to know if that turned out to be the case. It's possible it wasn't that helpful, or there could have been other reasons AMD moved away from that operating point. At the time, AMD indicated it was a design advantage: their DVFS could react quickly enough at 95C to maintain a constant temperature and not allow utilization spikes to push hotspot temps into dangerous territory. Competing GPUs needed much more safety margin in order to catch temperature ramps and give their slower driver-controlled loops time to react.
HBM memory, user fears of overheating, cooler variability, iffy leakage and efficiency effects, and possibly concerns about other temperature-driven effects besides cycling may have made it a solution appropriate only for that specific set of circumstances.
A stupid question
We are all talking about how cool BCPack and Kraken are, and how we would like to compress and decompress stuff all day.
From what I've read, with hardware the overhead is really small. I don't know small compared to what, but small.
So why does nobody use them in a memory controller, even in bespoke products like consoles?
Maybe just on a pair of memory controllers, reserved for data that isn't latency-sensitive, like textures, to store double the data.
There are two items that come to mind.
First is that IBM has Active Memory Expansion, which works by setting aside part of RAM and treating it more like a storage device. There's the regular set of pages, and then a pool of compressed pages. Less-active pages are moved to the compressed pool, and compressed pages get decompressed and moved to the active pool when they are accessed.
https://www.ibm.com/support/pages/aix-active-memory-expansion-ame
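For anyone who hasn't run into the scheme before, here's a toy sketch in Python of the general shape of a compressed page pool. Everything in it (the 4 KB page size, zlib standing in for the compressor, the LRU eviction policy) is my own stand-in for illustration; IBM's implementation obviously doesn't work at this level.

```python
# Toy sketch of an Active-Memory-Expansion-style compressed page pool.
# Hot pages live uncompressed; cold pages are compressed in a side pool
# and are decompressed back into the active set when touched.

import zlib
from collections import OrderedDict

PAGE_SIZE = 4096

class CompressedPagePool:
    def __init__(self, max_active_pages):
        self.max_active = max_active_pages
        self.active = OrderedDict()   # page_id -> raw bytes (hot, uncompressed)
        self.compressed = {}          # page_id -> zlib-compressed bytes (cold)

    def read(self, page_id):
        # Hot path: page is already uncompressed.
        if page_id in self.active:
            self.active.move_to_end(page_id)
            return self.active[page_id]
        # Cold path: decompress back into the active pool (this is the latency hit).
        data = zlib.decompress(self.compressed.pop(page_id))
        self._insert_active(page_id, data)
        return data

    def write(self, page_id, data):
        assert len(data) == PAGE_SIZE
        self.compressed.pop(page_id, None)
        self._insert_active(page_id, data)

    def _insert_active(self, page_id, data):
        self.active[page_id] = data
        self.active.move_to_end(page_id)
        # Evict the least-recently-used page into the compressed pool.
        while len(self.active) > self.max_active:
            victim_id, victim = self.active.popitem(last=False)
            self.compressed[victim_id] = zlib.compress(victim)
```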
Chips like the Power9 aren't cheap, and while the decompression block's bandwidth is theoretically quite high at 32 GB/s or so, this is far below the normal memory bandwidth of a major SOC (source: Power9 processor manual).
(edit: Correction, the block handles up to 16 GB/s into the compressor and 16 GB/s out of the decompressor. There are other accelerator blocks in the engine, and the total bandwidth they share is 32 GB/s in each direction.)
Given that this is paging memory blocks back and forth in a similar fashion to a disk access, the latency of the operation is significant. Real performance-sensitive operations depend on data remaining in the active pool. The motivation isn't bandwidth savings or outright performance, but is focused on workloads like keeping more VM instances active in memory than would be possible if they weren't compressed. Some workloads like a database might benefit from having more data in DRAM in a big server system because the latency hit for the compressed memory is still smaller than a trip to a storage node or network access to get data.
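As a rough back-of-envelope (my numbers, not IBM's): pushing a 4 KB page through a 16 GB/s decompressor takes on the order of 4096 B / 16 GB/s ≈ 0.25 µs of raw transfer time, plus whatever queuing and software overhead sits on top, whereas an NVMe read is typically tens of microseconds and a trip to a network node or spinning disk is hundreds of microseconds to milliseconds. A compressed-pool hit loses badly to a plain DRAM access (~100 ns), but it still wins comfortably against the storage tier it displaces.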
An alternative form is the in-line compression done by Qualcomm's cancelled server chip.
https://www.qualcomm.com/media/documents/files/qualcomm-centriq-2400-processor.pdf
It's low-latency and can work on data in memory that is actively being accessed, but it's described as allowing 128B lines to sometimes compress down to 64B, so its compression is more limited, and the compressed lines leave gaps in RAM that cannot be used, meaning overall RAM consumption is unchanged. What it does save is power for data transfers over the DRAM bus.
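To make the contrast with the IBM approach concrete, here's a toy sketch in Python of that kind of inline line compression. Again, zlib and the exact sizes are just my stand-ins, not Qualcomm's algorithm; the point is that a line compressing to 64B or less crosses the bus in one burst instead of two, while its slot in RAM stays 128B either way.

```python
# Toy illustration of inline cache-line compression: bus traffic can halve,
# but each line still owns a full 128 B slot, so capacity doesn't improve.

import zlib

LINE_SIZE = 128   # bytes per cache line
BURST_SIZE = 64   # bytes per DRAM burst in this sketch

def store_line(line: bytes):
    """Return (payload, bursts_on_bus); the DRAM slot is 128 B regardless."""
    assert len(line) == LINE_SIZE
    candidate = zlib.compress(line)      # stand-in for a hardware compressor
    if len(candidate) <= BURST_SIZE:
        return candidate, 1              # fits in a single 64 B burst
    return line, 2                       # stored uncompressed, two bursts

def load_line(payload: bytes) -> bytes:
    # A compressed payload is at most 64 B here, so length alone distinguishes
    # the two cases (real hardware would track this with out-of-band metadata).
    return zlib.decompress(payload) if len(payload) <= BURST_SIZE else payload

# A highly regular line compresses; high-entropy data falls back to two bursts.
zeros = bytes(LINE_SIZE)
payload, bursts = store_line(zeros)      # bursts == 1 for this line
assert load_line(payload) == zeros
```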