Samsung HBM-PIM - Processing in Memory

Jawed

Wow, literally in the memory:

Samsung's New HBM2 Memory Has 1.2 TFLOPS of Embedded Processing Power | Tom's Hardware

Today, Samsung announced that its new HBM2-based memory has an integrated AI processor that can push out (up to) 1.2 TFLOPS of embedded computing power, allowing the memory chip itself to perform operations that are usually reserved for CPUs, GPUs, ASICs, or FPGAs.
Now it can be argued that this much compute is barely worth bothering with (it's only FP16 FLOPS being counted).
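As a rough sanity check on that headline number, here's a back-of-envelope sketch in Python. The PCU count, clock, and SIMD width below are assumed for illustration only, not figures taken from the article:

```python
# Back-of-envelope check of the headline 1.2 TFLOPS figure.
# The PCU count, clock, and FP16 SIMD width are assumptions for
# illustration, not confirmed Samsung specifications.

pcu_count = 128      # assumed: programmable compute units across the PIM dies in a stack
clock_ghz = 0.30     # assumed: PCU clock in GHz
fp16_lanes = 16      # assumed: FP16 SIMD lanes per PCU
flops_per_fma = 2    # one fused multiply-add counts as two FLOPs

peak_tflops = pcu_count * clock_ghz * fp16_lanes * flops_per_fma / 1000
print(f"Peak FP16 throughput: {peak_tflops:.2f} TFLOPS")  # ~1.23 TFLOPS
```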

The cost/packaging/thermal constraints of HBM and the 20nm node being used here seem to indicate this is a prototyping sample for partners and I suppose there'll be a couple of years of testing...
 
By a quarter.

"Naturally, making room for the PCU units reduces memory capacity — each PCU-equipped memory die has half the capacity (4Gb) per die compared to a standard 8Gb HBM2 die. To help defray that issue, Samsung employs 6GB stacks by combining four 4Gb die with PCUs with four 8Gb dies without PCUs (as opposed to an 8GB stack with normal HBM2)."

If you'd said die, I wouldn't have posted :) You said chip though.
 
FP16 calculations in the HBM don't seem to be very useful for GPUs, and if the target is AI applications I wonder whether it would be better served by Samsung's own NPUs, which consist of CPU cores and MAC engines.
At first glance this looks more like an academic exercise, but I can't figure out whether Samsung intends to sell these half-RAM/half-FP16-ALU dies as they are right now.

For GPUs, I wonder if they could put e.g. ROPs in there instead of the FP16 ALUs. It would be akin to Xenos' eDRAM die.
It would allow a more modular approach to memory channels on a GPU, since total bandwidth often scales with the number of ROPs enabled (and vice versa).
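A toy sketch of that coupling, with invented channel counts, ROP ratios, and data rates purely for illustration:

```python
# Illustrative only: bandwidth commonly scales with the number of memory
# channels, and ROP count is typically tied to those channels/partitions.
# The bus width, data rate, and ROPs-per-channel figures are made up.

def aggregate_bandwidth_gbs(channels, bus_width_bits=32, data_rate_gbps=2.0):
    """Aggregate bandwidth in GB/s for `channels` memory channels."""
    return channels * bus_width_bits / 8 * data_rate_gbps

for channels, rops_per_channel in [(4, 8), (8, 8), (12, 8)]:
    bw = aggregate_bandwidth_gbs(channels)
    print(f"{channels} channels -> {channels * rops_per_channel} ROPs, {bw:.0f} GB/s")
```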
 